Search Results - Inner Mongolia University Library

EVASION: Efficient KV CAche CompreSsion vIa PrOduct QuaNtization

Design, Automation and Test in Europe Conference and Exhibition

Authors: Zongwu Wang, Fangxin Liu, Peng Xu, Qingxiao Sun, Junping Zhao, Li Jiang

Affiliations: Department of Computer Science and Engineering, Shanghai Jiao Tong University; Shanghai Qi Zhi Institute; Dept. of CST, SSSLab, China University of Petroleum-Beijing, China; Ant Group; MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University

ISBN (digital): 9783982674100

ISBN (print): 9798331534646

Large language models (LLMs) are increasingly utilized for complex tasks requiring longer context lengths, with some models supporting up to 128K or 1M tokens. This trend, however, presents significant challenges in inference speed and memory management. The primary bottleneck in long-context LLM inference is the quadratic computational complexity of attention mechanisms, causing substantial slowdowns as sequence length increases. The KV cache mechanism alleviates this issue by storing pre-computed data, but introduces memory requirements that scale linearly with context length, hindering efficient LLM deployment. Quantization emerges as a promising approach to address the widening gap between LLM size and memory capacity. However, traditional quantization schemes often yield suboptimal compression results for KV caches due to two key factors: i) on-the-fly quantization and de-quantization, causing significant performance overhead; ii) prevalence of outliers in KV values, challenging low-bitwidth uniform quantization. To this end, we propose EVASION, a novel quantization framework achieving a low-bitwidth KV cache through product quantization. First, we conduct a thorough analysis of KV cache distribution, revealing the limitations of existing quantization schemes. Second, we introduce a non-uniform quantization algorithm based on product quantization, which efficiently compresses data while preserving accuracy. Third, we develop a high-performance GPU inference framework for EVASION that leverages sparse computation and asynchronous quantization, significantly enhancing inference speed. Comprehensive evaluation results demonstrate that EVASION can achieve 4-bit quantization with trivial perplexity and accuracy loss.

Keywords: Quantization (signal); Accuracy; Large language models; Memory management; Graphics processing units; Europe; Market research; Inference algorithms; Low latency communication; Context modeling
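
The abstract above names product quantization as the basis of its non-uniform scheme. As a rough illustration of that general technique only (not the EVASION algorithm, whose codebook design, bit allocation, and GPU kernels are not described in this record), the following sketch splits each vector into sub-vectors, learns a small k-means codebook per sub-space, and stores only codebook indices. The function names, the toy Gaussian "keys", and the sizes `dim`, `m`, and `k` are all assumptions made for the example.

```python
import random

def nearest(p, centroids):
    """Index of the centroid closest to p in squared Euclidean distance."""
    return min(range(len(centroids)),
               key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))

def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means on a list of equal-length float lists."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            buckets[nearest(p, centroids)].append(p)
        for j, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster goes empty
                centroids[j] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids

def split(vec, m):
    """Cut a vector into m contiguous sub-vectors of equal length."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]

def pq_train(vectors, m, k):
    """Learn one k-entry codebook per sub-space."""
    return [kmeans([split(v, m)[i] for v in vectors], k, seed=i) for i in range(m)]

def pq_encode(vec, codebooks):
    """Replace each sub-vector by the index of its nearest centroid."""
    return [nearest(sub, cb) for sub, cb in zip(split(vec, len(codebooks)), codebooks)]

def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the chosen centroids."""
    out = []
    for c, cb in zip(codes, codebooks):
        out.extend(cb[c])
    return out

# Toy data standing in for cached key vectors (hypothetical sizes).
rng = random.Random(42)
dim, m, k = 8, 4, 16          # k = 16 means each index fits in 4 bits
keys = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(256)]
codebooks = pq_train(keys, m, k)
codes = pq_encode(keys[0], codebooks)    # m small integers instead of dim floats
approx = pq_decode(codes, codebooks)     # lossy reconstruction of keys[0]
```

Because the centroids are learned from the data rather than placed on a uniform grid, the codebook can adapt to skewed value distributions, which is the usual motivation for non-uniform schemes of this kind.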

Is Your Image a Good Storyteller?

arXiv, 2024

Authors: Song, Xiujie; Pang, Xiaoyi; Tang, Haifeng; Wu, Mengyue; Zhu, Kenny Q.

Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; China Merchants Bank Credit Card Center, Shanghai, China; University of Texas at Arlington, Arlington, TX, United States

Quantifying image complexity at the entity level is straightforward, but the assessment of semantic complexity has been largely overlooked. In fact, there are differences in semantic complexity across images. Images with richer semantics can tell vivid and engaging stories and offer a wide range of application scenarios. For example, the Cookie Theft picture is one such image and is widely used to assess human language and cognitive abilities due to its higher semantic complexity. Additionally, semantically rich images can benefit the development of vision models, as images with limited semantics are becoming less challenging for them. However, such images are scarce, highlighting the need for a greater number of them. For instance, there is a need for more images like Cookie Theft to cater to people from different cultural backgrounds and eras. Assessing semantic complexity requires human experts and empirical evidence. Automatic evaluation of how semantically rich an image is will be the first step toward mining or generating more images with rich semantics, and will benefit human cognitive assessment, Artificial Intelligence, and various other applications. In response, we propose the Image Semantic Assessment (ISA) task to address this problem. We introduce the first ISA dataset and a novel method that leverages language to solve this vision problem. Experiments on our dataset demonstrate the effectiveness of our approach. Copyright © 2024, The Authors. All rights reserved.

LONGFNT: LONG-FORM SPEECH RECOGNITION WITH FACTORIZED NEURAL TRANSDUCER

arXiv, 2022

Authors: Gong, Xun; Wu, Yu; Li, Jinyu; Liu, Shujie; Zhao, Rui; Chen, Xie; Qian, Yanmin

Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, China; Microsoft, China

Traditional automatic speech recognition (ASR) systems usually focus on individual utterances, without considering long-form speech with useful historical information, which is more practical in real scenarios. Simply attending to a longer transcription history with a vanilla neural transducer model shows little gain in our preliminary experiments, since the prediction network is not a pure language model. This motivates us to leverage the factorized neural transducer structure, which contains a real language model, the vocabulary predictor. We propose the LongFNT-Text architecture, which fuses the sentence-level long-form features directly with the output of the vocabulary predictor and then embeds token-level long-form features inside the vocabulary predictor, with a pre-trained contextual encoder, RoBERTa, to further boost the performance. Moreover, we propose the LongFNT architecture by extending the long-form speech to the original speech input, which achieves the best performance. The effectiveness of our LongFNT approach is validated on the LibriSpeech and GigaSpeech corpora with 19% and 12% relative word error rate (WER) reduction, respectively. Copyright © 2022, The Authors. All rights reserved.

Keywords: Speech recognition

LEVERAGING SPEECH PTM, TEXT LLM, AND EMOTIONAL TTS FOR SPEECH EMOTION RECOGNITION

arXiv, 2023

Authors: Ma, Ziyang; Wu, Wen; Zheng, Zhisheng; Guo, Yiwei; Chen, Qian; Zhang, Shiliang; Chen, Xie

Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China; Department of Engineering, University of Cambridge, Cambridge, United Kingdom; Speech Lab of DAMO Academy, Alibaba Group, China

In this paper, we explored how to boost speech emotion recognition (SER) with the state-of-the-art speech pre-trained model (PTM), data2vec, text generation technique, GPT-4, and speech synthesis technique, Azure TTS. First, we investigated the representation ability of different speech self-supervised pre-trained models, and we found that data2vec has a good representation ability on the SER task. Second, we employed a powerful large language model (LLM), GPT-4, and emotional text-to-speech (TTS) model, Azure TTS, to generate emotionally congruent text and speech. We carefully designed the text prompt and dataset construction, to obtain the synthetic emotional speech data with high quality. Third, we studied different ways of data augmentation to promote the SER task with synthetic speech, including random mixing, adversarial training, transfer learning, and curriculum learning. Experiments and ablation studies on the IEMOCAP dataset demonstrate the effectiveness of our method, compared with other data augmentation methods, and data augmentation with other synthetic data. Copyright © 2023, The Authors. All rights reserved.

Keywords: Speech synthesis

Making Offline RL Online: Collaborative World Models for Offline Visual Reinforcement Learning

arXiv, 2023

Authors: Wang, Qi; Yang, Junming; Wang, Yunbo; Jin, Xin; Zeng, Wenjun; Yang, Xiaokang

Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, China; Ningbo Institute of Digital Twin, Eastern Institute of Technology, China; School of Computer Science and Engineering, Southeast University, China

Training offline RL models using visual inputs poses two significant challenges, i.e., the overfitting problem in representation learning and the overestimation bias for expected future rewards. Recent work has attempted to alleviate the overestimation bias by encouraging conservative behaviors. This paper, in contrast, tries to build more flexible constraints for value estimation without impeding the exploration of potential advantages. The key idea is to leverage off-the-shelf RL simulators, which can be easily interacted with in an online manner, as the "test bed" for offline policies. To enable effective online-to-offline knowledge transfer, we introduce CoWorld, a model-based RL approach that mitigates cross-domain discrepancies in state and reward spaces. Experimental results demonstrate the effectiveness of CoWorld, outperforming existing RL approaches by large margins. Copyright © 2023, The Authors. All rights reserved.

Keywords: Knowledge transfer

SkiM: Skipping Memory LSTM for Low-Latency Real-Time Continuous Speech Separation

arXiv, 2022

Authors: Li, Chenda; Yang, Lei; Wang, Weiqin; Qian, Yanmin

Affiliations: MoE Key Lab of Artificial Intelligence, AI Institute, X-LANCE Lab, Department of Computer Science and Engineering, Shanghai Jiao Tong University, Shanghai, China

Continuous speech separation (CSS) for meeting pre-processing has recently become a focused research topic. Compared to the data in utterance-level speech separation, the meeting-style audio stream lasts longer and has an uncertain number of speakers. We adopt the time-domain speech separation method and the recently proposed Graph-PIT to build a super low-latency online speech separation model, which is very important for real applications. The low-latency time-domain encoder with a small stride leads to an extremely long feature sequence. We propose a simple yet efficient model named Skipping Memory (SkiM) for long sequence modeling. Experimental results show that SkiM achieves on par or even better separation performance than DPRNN. Meanwhile, the computational cost of SkiM is reduced by 75% compared to DPRNN. The strong long sequence modeling capability and low computational cost make SkiM a suitable model for online CSS applications. Our fastest real-time model gets 17.1 dB signal-to-distortion ratio (SDR) improvement with less than 1-millisecond latency in the simulated meeting-style evaluation. Copyright © 2022, The Authors. All rights reserved.

Keywords: Long short-term memory

SP3: Enhancing Structured Pruning via PCA Projection

arXiv, 2023

Authors: Hu, Yuxuan; Zhang, Jing; Zhao, Zhe; Zhao, Chen; Chen, Xiaodong; Li, Cuiping; Chen, Hong

Affiliations: School of Information, Renmin University of China, Beijing, China; Key Laboratory of Data Engineering and Knowledge Engineering, MOE, China; Engineering Research Center of Database and Business Intelligence, MOE, China; Tencent AI Lab, Tencent, Beijing, China; School of Computer Science and Technology, Xi'an Jiaotong University, Xi'an, China

Structured pruning is a widely used technique for reducing the size of pre-trained language models (PLMs), but current methods often overlook the potential of compressing the hidden dimension (d) in PLMs, a dimension critical to model size and efficiency. This paper introduces a novel structured pruning approach, Structured Pruning with PCA Projection (SP3), targeting the effective reduction of d by projecting features into a space defined by principal components before masking. Extensive experiments on benchmarks (GLUE and SQuAD) show that SP3 can reduce d by 70%, compress 94% of the BERTbase model, and maintain over 96% accuracy, outperforming other methods that compress d by 6% in accuracy at the same compression ratio. SP3 has also proven effective with other models, including OPT and Llama. Our data and code are available at our repo. © 2023, CC BY.

Flow-TSVAD: Target-Speaker Voice Activity Detection via Latent Flow Matching for Speaker Diarization

International Conference on Acoustics, Speech, and Signal Processing (ICASSP)

Authors: Zhengyang Chen, Bing Han, Shuai Wang, Yidi Jiang, Yanmin Qian

Affiliations: Department of Computer Science and Engineering, Auditory Cognition and Computational Acoustics Lab, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; Shenzhen Research Institute of Big Data, Shenzhen, China; School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; National University of Singapore, Singapore

ISBN (digital): 9798350368741

ISBN (print): 9798350368758

Speaker diarization is typically considered a discriminative task, using discriminative approaches to produce fixed diarization results. In this paper, we explore for the first time the use of neural network-based generative methods for speaker diarization. We implement a Flow-Matching (FM) based generative algorithm within the sequence-to-sequence target speaker voice activity detection (Seq2Seq-TSVAD) diarization system. Our experiments reveal that applying the generative method directly to the original binary label sequence space of the TS-VAD output is ineffective. To address this issue, we propose mapping the binary label sequence into a dense latent space before applying the generative algorithm, and our proposed Flow-TSVAD method can significantly outperform the traditional Seq2Seq-TSVAD system. Additionally, we observe that the FM algorithm converges rapidly during the inference stage, requiring only two inference steps to achieve promising results. Moreover, as a generative model, Flow-TSVAD allows for sampling different diarization results by running the model multiple times, so an ensemble system combining the results from various sampling instances can further boost the diarization performance.

Keywords: Voice activity detection; Frequency modulation; Computational modeling; Signal processing algorithms; Signal processing; Inference algorithms; Acoustics; Computational efficiency
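
The abstract above reports that flow-matching inference needs only two steps. The toy sketch below shows the generic mechanics of such sampling: fixed-step Euler integration of dx/dt = v(x, t) from noise at t = 0 to a sample at t = 1. It is not the Flow-TSVAD model. The hand-written `toy_velocity` stands in for a trained network, and `target`, `noise`, and the final sign threshold are hypothetical choices made so the example runs. The velocity used here is the one induced by straight-line paths toward a single target point, a case in which even two Euler steps land on the target.

```python
def euler_sample(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed-step Euler."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        t = i * dt                      # t stays strictly below 1
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Hypothetical dense latent that the flow should reach (assumption for the demo).
target = [0.9, -0.8, 0.7]

def toy_velocity(x, t):
    """Stand-in for a learned network: velocity of straight paths toward target."""
    return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]

noise = [0.3, 0.1, -0.5]                # starting point, normally drawn at random
latent = euler_sample(toy_velocity, noise, steps=2)

# Hypothetical decoding of the dense latent back to binary activity labels.
labels = [1 if z > 0 else 0 for z in latent]
```

With a real model the velocity field is only approximately straight, so the number of steps trades accuracy against inference cost; starting from different noise draws yields the different samples that an ensemble could combine.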

End-to-End Streaming Customizable Keyword Spotting Based on Text-Adaptive Neural Search

18th National Conference on Man-Machine Speech Communication, NCMMSC 2023

Authors: Yang, Baochen; Guo, Jiaqi; Li, Haoyu; Xi, Yu; Zhuo, Qing; Yu, Kai

Affiliations: X-LANCE Lab, Department of Computer Science and Engineering, MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China; State Key Laboratory of Media Convergence Production Technology and Systems, Beijing, China; AISpeech Ltd., Suzhou, China; Department of Automation, Tsinghua University, Beijing, China

ISBN (digital): 9789819706013

ISBN (print): 9789819706006

Streaming keyword spotting (KWS) is an important technique for voice assistant wake-up. While KWS with a preset fixed keyword has been well studied, test-time customizable keyword spotting in streaming mode remains a great challenge due to the lack of pre-collected keyword-specific training data and the requirement of streaming detection output. In this paper, we propose a novel end-to-end text-adaptive neural search architecture with a multi-label trigger mechanism to allow any pre-trained ASR acoustic model to be effectively used for fast streaming customizable keyword spotting. Evaluation results on various datasets show that our approach significantly outperforms both the traditional post-processing baseline and the neural search baseline, while achieving a 44x search speedup compared to the traditional post-processing method. © The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.

Keywords: Customizable keyword; Keyword spotting; Streaming; Text-adaptive search