With global EV sales projected to reach 3.5 million units by 2023 and public charging stations increasing by 40%, effective management and optimization of charging infrastructure have become critical. While current re...
The problem of converting images of text into plain text is a widely researched topic in both academia and industry. Arabic Handwritten Text Recognition (AHTR) poses additional challenges due to diverse handwriting st...
As stated by the United Arab Emirates' (UAE) Community Development Authority (CDA), there are around 3,065 individuals with hearing disabilities in the country. These individuals often struggle to communicate wit...
Millions of developers share their code on open-source platforms like GitHub, which offer social coding opportunities such as distributed collaboration and popularity-based ranking. Software engineering researchers ha...
Efficiently serving large language models (LLMs) requires batching many requests together to reduce the cost per request. Yet the key-value (KV) cache, which stores attention keys and values to avoid re-computation, significantly increases memory demands and becomes the new bottleneck in speed and memory usage. This memory demand grows with larger batch sizes and longer context lengths. Inference speed is also limited by the size of the KV cache, since the GPU must load the entire KV cache from main GPU memory into SRAM for each generated token, leaving the computational cores idle during this process. A straightforward and effective way to reduce KV cache size is quantization, which decreases the total number of bytes occupied by the cache. However, there is a lack of in-depth studies exploring the element distribution of the KV cache to understand the difficulty and limitations of KV cache quantization. To fill this gap, we conducted a comprehensive study of the element distribution in the KV cache of popular LLMs. Our findings indicate that the key cache should be quantized per-channel, i.e., elements along the channel dimension are grouped and quantized together, whereas the value cache should be quantized per-token. From this analysis, we developed a tuning-free 2-bit KV cache quantization algorithm, named KIVI. With a hardware-friendly implementation, KIVI enables Llama (Llama-2), Falcon, and Mistral models to maintain almost the same quality while using 2.6× less peak memory (including the model weights). This reduction in memory usage enables up to a 4× larger batch size, bringing 2.35× to 3.47× higher throughput on real LLM inference workloads. The source code is available at https://***/jy-yuan/KIVI.
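For intuition, a minimal PyTorch sketch of the grouping scheme this abstract motivates is shown below: asymmetric 2-bit quantization with per-channel statistics for the key cache and per-token statistics for the value cache. This is an illustration only, not KIVI's actual implementation (which adds hardware-friendly bit packing, grouped quantization, and a small full-precision residual window); the shapes and names are invented for the example.

```python
import torch

def asym_quantize_2bit(x: torch.Tensor, dim: int):
    """Asymmetric 2-bit quantization: elements that share all indices
    except `dim` form one group with its own scale and zero-point."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-6) / 3.0  # 2 bits -> 4 levels (0..3)
    q = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
    return q, scale, xmin

def dequantize(q, scale, zero_point):
    return q.to(scale.dtype) * scale + zero_point

# Toy KV cache: (seq_len, num_heads, head_dim); shapes are illustrative.
K = torch.randn(128, 8, 64)
V = torch.randn(128, 8, 64)

# Key cache: per-channel, i.e. min/max statistics are taken over the
# token axis, so each channel keeps its own scale and zero-point.
qK, sK, zK = asym_quantize_2bit(K, dim=0)

# Value cache: per-token, statistics taken over the channel axis.
qV, sV, zV = asym_quantize_2bit(V, dim=-1)

err_K = (dequantize(qK, sK, zK) - K).abs().mean().item()
err_V = (dequantize(qV, sV, zV) - V).abs().mean().item()
print(f"mean abs error  K: {err_K:.4f}  V: {err_V:.4f}")
```

Flipping the `dim` arguments (per-token keys, per-channel values) typically gives a noticeably larger reconstruction error on real caches, which is the asymmetry the study's distribution analysis points to.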
In autonomous driving, safety assessment is becoming an essential component, particularly in the perception and analysis of the vehicle's surrounding environment. To make safe driving judgments, autonomous cars mostly rely on t...
In numerous real-world healthcare applications, handling incomplete medical data poses significant challenges for missing value imputation and subsequent clustering or classification tasks. Existing approaches often rely on statistical methods for imputation, which may yield suboptimal results and be computationally expensive. This paper aims to integrate imputation and clustering techniques to enhance the classification of incomplete medical data with improved efficiency. Traditional classification methods are ill-suited for incomplete medical data. To enhance efficiency without compromising accuracy, this paper introduces a novel approach that combines imputation and clustering for the classification of incomplete data. First, the linear interpolation imputation method alongside an iterative fuzzy c-means clustering method is applied, followed by a classification step. The effectiveness of the proposed approach is evaluated using multiple performance metrics, including accuracy, precision, specificity, and sensitivity. The encouraging results demonstrate that our proposed method surpasses classical approaches across various performance criteria.
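The abstract gives no code, but a rough Python sketch of the pipeline it describes (linear interpolation imputation, an iterative fuzzy c-means step, then a classifier) might look like the following. The dataset, the random-forest classifier, the cluster count, and all parameter choices are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Load a complete dataset and knock out 10% of values to simulate
# incomplete medical records (stand-in data, not the paper's).
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
rng = np.random.default_rng(0)
X_missing = X.mask(rng.random(X.shape) < 0.10)

# Step 1: linear interpolation imputation along each feature column.
X_imputed = X_missing.interpolate(method="linear", limit_direction="both")

# Step 2: a few iterations of fuzzy c-means; the soft memberships are
# appended as extra features for the classifier.
def fuzzy_cmeans(data, c=2, m=2.0, n_iter=20, seed=0):
    rng = np.random.default_rng(seed)
    u = rng.random((len(data), c))
    u /= u.sum(axis=1, keepdims=True)            # memberships sum to 1
    for _ in range(n_iter):
        w = u ** m
        centers = (w.T @ data) / w.sum(axis=0)[:, None]
        d = np.linalg.norm(data[:, None, :] - centers[None], axis=2) + 1e-9
        u = 1.0 / (d ** (2.0 / (m - 1.0)))       # standard FCM update
        u /= u.sum(axis=1, keepdims=True)
    return u

memberships = fuzzy_cmeans(X_imputed.to_numpy())
X_aug = np.hstack([X_imputed.to_numpy(), memberships])

# Step 3: classification on the imputed, cluster-augmented data.
X_tr, X_te, y_tr, y_te = train_test_split(X_aug, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr)
print("accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```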
Unmanned aerial vehicles (UAVs) are being utilized for damage assessment in natural disasters and for search and rescue operations. Currently, the search for victims primarily relies on analyzing images captured by ca...
Skin cancer is the most prevalent type of cancer worldwide, and detecting it early is crucial to a successful course of treatment. In recent years, machine learning methods have demonstrated great potential for making...
Smart meters are an important component of the smart grid, and their large-scale deployment on the user side generates vast amounts of data, imposing substantial costs on the smart grid. In addition, attackers...