Search Results - Inner Mongolia University Library

41st International Conference on Machine Learning, ICML 2024

Authors: Ma, Yuexiao; Li, Huixia; Zheng, Xiawu; Ling, Feng; Xiao, Xuefeng; Wang, Rui; Wen, Shilei; Chao, Fei; Ji, Rongrong. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, 361005, China; ByteDance Inc., China; Peng Cheng Laboratory, Shenzhen, China; Institute of Artificial Intelligence, Xiamen University, China

Post-Training Quantization (PTQ) is a vital technique for network compression and acceleration, gaining prominence as model sizes increase. This paper addresses a critical challenge in PTQ: the severe impact of outliers on the accuracy of quantized transformer architectures. Specifically, we introduce the concept of 'reconstruction granularity' as a novel solution to this issue, which has been overlooked in previous works. Our work provides theoretical insights into the role of reconstruction granularity in mitigating the outlier problem in transformer models. This theoretical framework is supported by empirical analysis, demonstrating that varying reconstruction granularities significantly influence quantization performance. Our findings indicate that different architectural designs necessitate distinct optimal reconstruction granularities. For instance, the multi-stage Swin Transformer architecture benefits from finer granularity, a deviation from the trends observed in ViT and DeiT models. We further develop an algorithm for determining the optimal reconstruction granularity for various ViT models, achieving state-of-the-art (SOTA) performance in PTQ. For example, applying our method to 4-bit quantization, the Swin-Base model achieves a Top-1 accuracy of 82.24% on the ImageNet classification task. This result surpasses RepQ-ViT by 3.92% (82.24% vs. 78.32%). Similarly, our approach elevates ViT-Small to a Top-1 accuracy of 80.50%, outperforming NoisyQuant by 3.64% (80.50% vs. 76.86%). Copyright 2024 by the author(s).
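The outlier problem that this abstract targets can be illustrated with a minimal sketch. It is not the authors' method. It assumes plain symmetric uniform quantization with a single per-tensor scale, and the function names (`quantize_dequantize`, `mse`) and the toy values are hypothetical. It shows how one large value stretches the quantization scale so that the remaining values collapse onto very few levels.

```python
def quantize_dequantize(values, bits=4):
    """Symmetric uniform quantization with one shared scale, then dequantization."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one scale for the whole tensor
    restored = []
    for v in values:
        q = round(v / scale)                        # nearest integer level
        q = max(-qmax - 1, min(qmax, q))            # clamp to the signed range
        restored.append(q * scale)
    return restored


def mse(reference, approximation):
    """Mean squared error between two equally long sequences."""
    return sum((x - y) ** 2 for x, y in zip(reference, approximation)) / len(reference)


# Toy "activations": 21 values spread evenly over [-1, 1] (hypothetical data).
inliers = [i / 10 - 1.0 for i in range(21)]
with_outlier = inliers + [20.0]                     # the same values plus one outlier

err_clean = mse(inliers, quantize_dequantize(inliers))

# Quantize with the outlier present, then measure the error on the inliers only.
deq = quantize_dequantize(with_outlier)
err_with_outlier = mse(inliers, deq[:-1])

print(err_clean, err_with_outlier)
```

In this toy setting the outlier raises the scale so far that every inlier rounds to zero, so the error on the inliers is far larger than in the outlier-free case.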

AffineQuant: Affine Transformation Quantization for Large Language Models

12th International Conference on Learning Representations, ICLR 2024

The significant resource requirements associated with Large-scale Language Models (LLMs) have generated considerable interest in the development of techniques aimed at compressing and accelerating neural networks. Among these techniques, Post-Training Quantization (PTQ) has emerged as a subject of considerable interest due to its noteworthy compression efficiency and cost-effectiveness in the context of training. Existing PTQ methods for LLMs limit the optimization scope to scaling transformations between pre- and post-quantization weights. This constraint results in significant errors after quantization, particularly in low-bit configurations. In this paper, we advocate direct optimization using equivalent affine transformations in PTQ (AffineQuant). This approach extends the optimization scope and thus significantly reduces quantization errors. Additionally, by employing the corresponding inverse matrix, we can ensure equivalence between the pre- and post-quantization outputs of PTQ, thereby maintaining its efficiency and generalization capabilities. To ensure the invertibility of the transformation during optimization, we further introduce a gradual mask optimization method. This method initially focuses on optimizing the diagonal elements and gradually extends to the other elements. Such an approach aligns with the Levy-Desplanques theorem, theoretically ensuring invertibility of the transformation. As a result, significant performance improvements are evident across different LLMs on diverse datasets. Notably, these improvements are most pronounced when using very low-bit quantization, enabling the deployment of large models on edge devices. To illustrate, we attain a C4 perplexity of 15.76 (2.26↓ vs. 18.02 in OmniQuant) on the LLaMA2-7B model under W4A4 quantization without overhead. On zero-shot tasks, AffineQuant achieves an average of 58.61% accuracy (1.98%↑ vs. 56.63% in OmniQuant) when using 4/4-bit quantization for LLaMA-30B, which setting a new st
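The invertibility argument in this abstract rests on the Levy-Desplanques theorem: a square matrix is invertible if every diagonal entry is strictly larger in magnitude than the sum of the magnitudes of the other entries in its row. The following is a minimal sketch of that check, not the authors' implementation. The abstract only says that optimization starts from the diagonal and gradually extends to other elements, so the band-shaped mask schedule, the function names, and the example matrices here are all assumptions made for illustration.

```python
def is_strictly_diagonally_dominant(matrix):
    """Return True if |a_ii| > sum over j != i of |a_ij| holds for every row."""
    n = len(matrix)
    for i in range(n):
        off_diagonal = sum(abs(matrix[i][j]) for j in range(n) if j != i)
        if abs(matrix[i][i]) <= off_diagonal:
            return False
    return True


def band_mask(n, bandwidth):
    """Hypothetical mask: 1 where |i - j| <= bandwidth, 0 elsewhere."""
    return [[1 if abs(i - j) <= bandwidth else 0 for j in range(n)]
            for i in range(n)]


def apply_mask(matrix, mask):
    """Zero out every entry that the mask does not cover."""
    return [[a * m for a, m in zip(row, mask_row)]
            for row, mask_row in zip(matrix, mask)]


# Hypothetical transformation matrix with a dominant diagonal.
full = [[4.0, 1.0, 0.5],
        [0.5, 3.0, 1.0],
        [1.0, 0.5, 5.0]]

# Widen the band step by step: diagonal only, then tridiagonal, then full.
stages = [apply_mask(full, band_mask(3, b)) for b in range(3)]
dominant = [is_strictly_diagonally_dominant(s) for s in stages]

# A matrix that fails the test. The theorem is a sufficient condition only,
# so failing it does not by itself prove that the matrix is singular.
not_dominant = [[1.0, 2.0],
                [3.0, 1.0]]

print(dominant, is_strictly_diagonally_dominant(not_dominant))
```

Each masked stage in this example stays strictly diagonally dominant, which is the property the theorem needs in order to guarantee an inverse at every step.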

Keywords: Cost effectiveness

Outlier-aware slicing for post-training quantization in vision transformer

Proceedings of the 41st International Conference on Machine Learning

Authors: Yuexiao Ma; Huixia Li; Xiawu Zheng; Feng Ling; Xuefeng Xiao; Rui Wang; Shilei Wen; Fei Chao; Rongrong Ji. Affiliations: ByteDance Inc. and Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, P.R. China; ByteDance Inc.; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, P.R. China, and Peng Cheng Laboratory, Shenzhen, China, and Institute of Artificial Intelligence, Xiamen University; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, P.R. China; Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, P.R. China, and Institute of Artificial Intelligence, Xiamen University

Clover: Towards A Unified Video-Language Alignment and Fusion Model

Conference on Computer Vision and Pattern Recognition (CVPR)

Authors: Jingjia Huang; Yinan Li; Jiashi Feng; Xinglong Wu; Xiaoshuai Sun; Rongrong Ji. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China; ByteDance Inc, China

Building a universal Video-Language model for solving various video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge to the machine learning field. Towards this goal, most recent works build the model by stacking uni-modal and cross-modal feature encoders and train it with pair-wise contrastive pre-text tasks. Though offering attractive generality, the resulting models have to compromise between efficiency and performance. They mostly adopt different architectures to deal with different downstream tasks. We find this is because the pair-wise training cannot well align and fuse features from different modalities. We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal Video-Language model for solving multiple video understanding tasks with neither performance nor efficiency compromise. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment via incorporating learning from semantic masked samples and a new pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks for both zero-shot and fine-tuning settings, and eight video question answering tasks. Codes and pre-trained models will be released at https://***/LeeYN-43/Clover.

Drag Your Gaussian: Effective Drag-Based Editing with Score Distillation for 3D Gaussian Splatting

arXiv, 2025

Authors: Qu, Yansong; Chen, Dian; Li, Xinyang; Li, Xiaofan; Zhang, Shengchuan; Cao, Liujuan; Ji, Rongrong. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, China; Baidu Inc, China

Recent advancements in 3D scene editing have been propelled by the rapid development of generative models. Existing methods typically utilize generative models to perform text-guided editing on 3D representations, such as 3D Gaussian Splatting (3DGS). However, these methods are often limited to texture modifications and fail when addressing geometric changes, such as editing a character’s head to turn around. Moreover, such methods lack accurate control over the spatial position of editing results, as language struggles to precisely describe the extent of edits. To overcome these limitations, we introduce DYG, an effective 3D drag-based editing method for 3D Gaussian Splatting. It enables users to conveniently specify the desired editing region and the desired dragging direction through the input of 3D masks and pairs of control points, thereby enabling precise control over the extent of editing. DYG integrates the strengths of the implicit triplane representation to establish the geometric scaffold of the editing results, effectively overcoming suboptimal editing outcomes caused by the sparsity of 3DGS in the desired editing regions. Additionally, we incorporate a drag-based Latent Diffusion Model into our method through the proposed Drag-SDS loss function, enabling flexible, multi-view consistent, and fine-grained editing. Extensive experiments demonstrate that DYG conducts effective drag-based editing guided by control point prompts, surpassing other baselines in terms of editing effect and quality, both qualitatively and quantitatively. Visit our project page at https://***/Drag-Your-Gaussian/. Copyright © 2025, The Authors. All rights reserved.

Keywords: Gaussian distribution

ComfyGPT: A Self-Optimizing Multi-Agent System for Comprehensive ComfyUI Workflow Generation

arXiv, 2025

Authors: Huang, Oucheng; Ma, Yuhang; Zhao, Zeng; Wu, Mingrui; Ji, Jiayi; Zhang, Rongsheng; Hu, Zhipeng; Sun, Xiaoshuai; Ji, Rongrong. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education, Xiamen University, China; Fuxi AI Lab, Netease Inc, China

ComfyUI provides a widely adopted, workflow-based interface that enables users to customize various image generation tasks through an intuitive node-based architecture. However, the intricate connections between nodes and diverse modules often present a steep learning curve for users. In this paper, we introduce ComfyGPT, the first self-optimizing multi-agent system designed to automatically generate ComfyUI workflows from task descriptions. ComfyGPT comprises four specialized agents: ReformatAgent, FlowAgent, RefineAgent, and ExecuteAgent. The core innovation of ComfyGPT lies in two key aspects. First, it focuses on generating individual node links rather than entire workflows, significantly improving generation precision. Second, we propose FlowAgent, an LLM-based workflow generation agent that uses both supervised fine-tuning (SFT) and reinforcement learning (RL) to improve workflow generation accuracy. Moreover, we introduce FlowDataset, a large-scale dataset containing 13,571 workflow-description pairs, and FlowBench, a comprehensive benchmark for evaluating workflow generation systems. We also propose four novel evaluation metrics: Format Validation (FV), Pass Accuracy (PA), Pass Instruct Alignment (PIA), and Pass Node Diversity (PND). Experimental results demonstrate that ComfyGPT significantly outperforms existing LLM-based methods in workflow generation. Copyright © 2025, The Authors. All rights reserved.

Keywords: Reinforcement learning

NeRF-DetS: Enhanced Adaptive Spatial-wise Sampling and View-wise Fusion Strategies for NeRF-based Indoor Multi-view 3D Object Detection

arXiv, 2024

Authors: Huang, Chi; Li, Xinyang; Qu, Yansong; Wu, Changli; Li, Xiaofan; Zhang, Shengchuan; Cao, Liujuan. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Baidu Inc, China

In indoor scenes, the diverse distribution of object locations and scales makes the visual 3D perception task a big challenge. Previous works (e.g., NeRF-Det) have demonstrated that implicit representation has the capacity to benefit the visual 3D perception task in indoor scenes with a high amount of overlap between input images. However, previous works cannot fully utilize the advancement of implicit representation because of fixed sampling and simple multi-view feature fusion. In this paper, inspired by sparse-fashion methods (e.g., DETR3D), we propose a simple yet effective method, NeRF-DetS, to address the above issues. NeRF-DetS includes two modules: Progressive Adaptive Sampling Strategy (PASS) and Depth-Guided Simplified Multi-Head Attention Fusion (DS-MHA). Specifically, (1) PASS can automatically sample features of each layer within a dense 3D detector, using offsets predicted by the previous layer. (2) DS-MHA can not only efficiently fuse multi-view features with strong occlusion awareness but also reduce computational cost. Extensive experiments on the ScanNetV2 dataset demonstrate that NeRF-DetS outperforms NeRF-Det, achieving +5.02% and +5.92% improvement in mAP under IoU25 and IoU50, respectively. Also, NeRF-DetS shows consistent improvements on ARKITScenes. Copyright © 2024, The Authors. All rights reserved.
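The IoU25 and IoU50 settings in this abstract can be made concrete with a minimal sketch. It assumes that they denote intersection-over-union thresholds of 0.25 and 0.5 and that boxes are axis-aligned and stored as `(xmin, ymin, zmin, xmax, ymax, zmax)`. The function names and the example boxes are hypothetical, and the full mAP computation (ranking predictions by confidence and averaging precision over classes) is not shown.

```python
def box_volume(box):
    """Volume of an axis-aligned box (xmin, ymin, zmin, xmax, ymax, zmax)."""
    return (box[3] - box[0]) * (box[4] - box[1]) * (box[5] - box[2])


def iou_3d(a, b):
    """Intersection over union of two axis-aligned 3D boxes."""
    overlap = 1.0
    for axis in range(3):
        low = max(a[axis], b[axis])            # start of the shared extent
        high = min(a[axis + 3], b[axis + 3])   # end of the shared extent
        if high <= low:
            return 0.0                         # no overlap along this axis
        overlap *= high - low
    union = box_volume(a) + box_volume(b) - overlap
    return overlap / union


def is_match(pred, gt, threshold):
    """A prediction counts as a match when its IoU reaches the threshold."""
    return iou_3d(pred, gt) >= threshold


gt = (0.0, 0.0, 0.0, 2.0, 2.0, 2.0)            # ground-truth box
close = (0.0, 0.0, 0.5, 2.0, 2.0, 2.5)         # shifted by 0.5 along z
loose = (0.0, 0.0, 1.0, 2.0, 2.0, 3.0)         # shifted by 1.0 along z

for name, pred in (("close", close), ("loose", loose)):
    print(name, is_match(pred, gt, 0.25), is_match(pred, gt, 0.5))
```

The `loose` box passes the 0.25 threshold but not the 0.5 threshold, which is why results under the two settings are reported separately.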

Keywords: Object detection

Towards efficient Diffusion-Based Image Editing with Instant Attention Masks

arXiv, 2024

Authors: Zou, Siyu; Tang, Jiji; Zhou, Yiyi; He, Jing; Zhao, Chaoyi; Zhang, Rongsheng; Hu, Zhipeng; Sun, Xiaoshuai. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Fuxi AI Lab, NetEase Inc., Hangzhou, China

Diffusion-based Image Editing (DIE) is an emerging research hotspot, which often applies a semantic mask to control the target area for diffusion-based editing. However, most existing solutions obtain these masks via manual operations or off-line processing, greatly reducing their efficiency. In this paper, we propose a novel and efficient image editing method for Text-to-Image (T2I) diffusion models, termed Instant Diffusion Editing (InstDiffEdit). In particular, InstDiffEdit aims to employ the cross-modal attention ability of existing diffusion models to achieve instant mask guidance during the diffusion steps. To reduce the noise of attention maps and achieve full automation, we equip InstDiffEdit with a training-free refinement scheme that adaptively aggregates the attention distributions for automatic yet accurate mask generation. Meanwhile, to supplement the existing evaluations of DIE, we propose a new benchmark called Editing-Mask to examine the mask accuracy and local editing ability of existing methods. To validate InstDiffEdit, we also conduct extensive experiments on ImageNet and Imagen, and compare it with a range of SOTA methods. The experimental results show that InstDiffEdit not only outperforms the SOTA methods in both image quality and editing results, but also has a much faster inference speed, i.e., 5 to 6 times faster. Our code is available at https://***/r/InstDiffEdit-C306/ Copyright © 2024, The Authors. All rights reserved.

Keywords: Diffusion

MLLM-Selector: Necessity and Diversity-driven High-Value Data Selection for Enhanced Visual Instruction Tuning

arXiv, 2025

Authors: Ma, Yiwei; Xu, Guohai; Sun, Xiaoshuai; Ji, Jiayi; Lou, Jie; Zhang, Debing; Ji, Rongrong. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, Fujian 361005, China; Xiaohongshu Inc, Shanghai 200025, China

Visual instruction tuning (VIT) has emerged as a crucial technique for enabling multi-modal large language models (MLLMs) to follow user instructions adeptly. Yet, a significant gap persists in understanding the attributes of high-quality instruction tuning data and frameworks for its automated selection. To address this, we introduce MLLM-Selector, an automated approach that identifies valuable data for VIT by weighing necessity and diversity. Our process starts by randomly sampling a subset from the VIT data pool to fine-tune a pretrained model, thus creating a seed model with an initial ability to follow instructions. Then, leveraging the seed model, we calculate necessity scores for each sample in the VIT data pool to identify samples pivotal for enhancing model performance. Our findings underscore the importance of mixing necessity and diversity in data choice, leading to the creation of MLLM-Selector, our methodology that fuses necessity scoring with strategic sampling for superior data refinement. Empirical results indicate that, under identical experimental conditions, MLLM-Selector surpasses LLaVA-1.5 on some benchmarks with less than 1% of the data and consistently exceeds its performance across all validated benchmarks when using less than 50% of the data. Copyright © 2025, The Authors. All rights reserved.

Keywords: Visual languages

StoryWeaver: A Unified World Model for Knowledge-Enhanced Story Character Customization

arXiv, 2024

Authors: Zhang, Jinlu; Tang, Jiji; Zhang, Rongsheng; Lv, Tangjie; Sun, Xiaoshuai. Affiliations: Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, Xiamen University, 361005, China; Fuxi AI Lab, Netease Inc

Story visualization has gained increasing attention in artificial intelligence. However, existing methods still struggle with maintaining a balance between character identity preservation and text-semantics alignment, largely due to a lack of detailed semantic modeling of the story scene. To tackle Copyright © 2024, The Authors. All rights reserved.

Keywords: Knowledge graph
