With the advancement of technologies like cloud computing and the Internet of Things, there are enormous areas in which these can be utilized. One such domain is the Smart City concept utilizing or based on the architecture o...
Autism Spectrum Disorder (ASD) is a brain-based condition characterized by social difficulties and repetitive behaviors. Traditional diagnostic methods can be subjective and time-consuming. Current diagnostic techniqu...
Lead-based perovskite solar cells (PSCs) are popular in the photovoltaic industry for their remarkable properties, but issues like toxicity and instability limit their use. To address these problems, eco-friendly, lea...
Right now, a lot of subjective human judgment goes into classifying mangoes and other fruits, especially when it comes to poor productivity, which results in less than ideal classification accuracy. Our work suggests ...
Satellite imagery offers extensive information that can be used for a variety of societal applications, from the number of buildings in a metropolis to the land cover types of a specific area. However, extracting such...
The integration of blockchain technology into the grocery purchasing process offers a transformative approach to enhancing transparency, security, and efficiency. This paper presents a comprehensive framework for a bl...
Fog networking is an aspect of the IoT (Internet of Things) idea, which sees most of the products used by humans on a daily basis connected to one another. Smartphones, smart health monitoring equipment, as...
The explosion of online information necessitates efficient and accurate methods for retrieving and analyzing relevant data. This research proposes a novel framework that leverages Retrieval-Augmented Generation (RAG) ...
Machine learning engineering relies on MLOps, a collection of best practises for commercializing models. As fresh data is added, machine learning models can become less effective. ML models can't use all data. To ...
Efficiently serving large language models (LLMs) requires batching many requests together to reduce the cost per request. However, the key-value (KV) cache, which stores attention keys and values to avoid recomputation, significantly increases memory demands and becomes the new bottleneck in both speed and memory usage. This memory demand grows with larger batch sizes and longer context lengths. Inference speed is also limited by the size of the KV cache, because the GPU's SRAM must load the entire KV cache from main GPU memory for each generated token, leaving the compute cores idle during the transfer. A straightforward and effective way to reduce KV cache size is quantization, which decreases the total number of bytes the KV cache occupies. However, in-depth studies of the element distribution of the KV cache, which are needed to understand the difficulty and limitations of KV cache quantization, are lacking. To fill this gap, we conducted a comprehensive study of the element distribution in the KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., elements are grouped along the channel dimension and quantized together, whereas the value cache should be quantized per-token. Based on this analysis, we developed a tuning-free 2-bit KV cache quantization algorithm named KIVI. With a hardware-friendly implementation, KIVI enables Llama (Llama-2), Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory (including the model weights). This reduction in memory usage enables up to 4× larger batch sizes, bringing 2.35× to 3.47× throughput on real LLM inference workloads. The source code is available at https://***/jy-yuan/KIVI. Copyright 2024 by the author(s).
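The per-channel versus per-token distinction can be illustrated with a small NumPy sketch. This is not the KIVI implementation: the function names `quantize_groupwise` and `dequantize`, the plain asymmetric min/max quantizer, and the synthetic key matrix with one large-magnitude channel are all assumptions made for illustration. The sketch reads "per-channel" as one scale and zero-point per channel (statistics taken across tokens) and "per-token" as one scale and zero-point per token (statistics taken across channels).

```python
import numpy as np


def quantize_groupwise(x, axis, bits=2):
    """Asymmetric uniform quantization (hypothetical helper, not KIVI's API).

    The min and max are computed along `axis`, so every slice that shares
    the remaining index gets its own scale and zero-point.
    """
    levels = 2 ** bits - 1
    x_min = x.min(axis=axis, keepdims=True)
    x_max = x.max(axis=axis, keepdims=True)
    scale = (x_max - x_min) / levels
    scale = np.where(scale == 0, 1.0, scale)  # guard against constant slices
    q = np.clip(np.round((x - x_min) / scale), 0, levels).astype(np.uint8)
    return q, scale, x_min


def dequantize(q, scale, zero):
    """Map integer codes back to approximate floating-point values."""
    return q.astype(np.float32) * scale + zero


# Synthetic cache laid out as (tokens, channels). The single shifted channel
# is an assumption standing in for a large-magnitude channel.
rng = np.random.default_rng(0)
keys = rng.normal(size=(64, 16)).astype(np.float32)
keys[:, 3] += 40.0

# Per-channel: one scale per channel, statistics taken over tokens (axis 0).
q_c, s_c, z_c = quantize_groupwise(keys, axis=0)
# Per-token: one scale per token, statistics taken over channels (axis 1).
q_t, s_t, z_t = quantize_groupwise(keys, axis=1)

err_channel = np.abs(keys - dequantize(q_c, s_c, z_c)).mean()
err_token = np.abs(keys - dequantize(q_t, s_t, z_t)).mean()
print(f"per-channel error: {err_channel:.3f}, per-token error: {err_token:.3f}")
```

On this synthetic input, per-token grouping lets the shifted channel stretch every token's quantization range, so the remaining entries in each row collapse onto a single 2-bit level. Per-channel grouping confines that effect to the one affected channel. For the value cache, the same helper would be called with `axis=1`.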