ISBN:
(Print) 9798400717932
Nowadays, data centers consolidate latency-critical (LC) tenants and best-effort (BE) tenants on the same cloud platform to increase resource utilization and reduce costs. In such a scenario, the underlying distributed storage systems are responsible for guaranteeing SLOs for LC tenants while maximizing bandwidth for BE tenants. As high-performance NVMe SSDs are widely deployed, making full use of their performance capabilities while guaranteeing SLOs has become an urgent problem. However, current methods restrict the performance capabilities of NVMe SSDs based on a conservative offline model, and also ignore runtime changes in tenant loads and device states, both of which affect those capabilities. In this paper, we present zQoS, an efficient technique that unleashes the full performance capabilities of NVMe SSDs, increasing the bandwidth of BE tenants while guaranteeing the SLOs of LC tenants. First, zQoS builds a more accurate offline performance model for NVMe SSDs that faithfully reflects their performance characteristics. Second, a fine-grained online adjustment mechanism is proposed to dynamically adjust the performance capabilities of NVMe SSDs at runtime. Finally, to cope with abrupt load changes, an adaptive per-tenant adjustment method is designed to guarantee SLOs and increase utilization. We evaluate zQoS in a wide variety of mixed workload scenarios. Results show that zQoS significantly outperforms state-of-the-art approaches, achieving up to a 17x increase in BE tenant bandwidth without violating LC tenant SLOs.
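The abstract gives no implementation details; as a purely illustrative sketch of the kind of fine-grained online adjustment it describes, the Python snippet below implements a simple feedback controller that shrinks the best-effort bandwidth cap when LC tail latency nears the SLO and grows it when there is slack. All names and thresholds (adjust_be_limit, headroom, step_mbps) are hypothetical, not from the paper.

```python
# Hypothetical sketch of an online SLO-feedback controller in the spirit of
# zQoS's fine-grained adjustment; not the paper's actual algorithm.

def adjust_be_limit(lc_p99_latency_us: float,
                    slo_us: float,
                    be_limit_mbps: float,
                    step_mbps: float = 50.0,
                    headroom: float = 0.9) -> float:
    """Return a new bandwidth cap for best-effort (BE) tenants.

    If the latency-critical (LC) tenants' observed tail latency nears the
    SLO, shrink the BE cap; if there is slack, grow it to raise utilization.
    """
    if lc_p99_latency_us > slo_us * headroom:
        # LC tenants are at risk of an SLO violation: back off BE traffic.
        return max(0.0, be_limit_mbps - step_mbps)
    # SLO has slack: hand spare device bandwidth to BE tenants.
    return be_limit_mbps + step_mbps


# Example control loop over synthetic latency samples.
limit = 500.0
for p99 in [800.0, 950.0, 1200.0, 700.0]:   # observed LC p99 latency (us)
    limit = adjust_be_limit(p99, slo_us=1000.0, be_limit_mbps=limit)
    print(f"p99={p99:.0f}us -> BE cap {limit:.0f} MB/s")
```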
ISBN:
(Print) 9783031695827; 9783031695834
Sparse Matrix-Vector Multiplication (SpMV) plays a pivotal role in a wide range of scientific computations. However, SpMV operations on graph matrices often encounter challenges such as inefficient cache utilization and imbalanced workloads. This paper presents a novel solution, named VeCa, that accelerates SpMV for sparse graph matrices by integrating selective vectorization with hierarchical blocking. First, the matrix is divided into small blocks that fit in the cache, with multi-level partitioning performed according to the estimated workload per block. Then, the rows within each block are selectively vectorized with distinct instruction sets. Experimental results show that VeCa considerably outperforms state-of-the-art SpMV methods and graph processing systems, achieving a speedup of 1.29x over the second-fastest approach. Moreover, a comprehensive evaluation is conducted, analyzing performance factors, branch prediction, cache efficiency, and parameter tuning, to promote a thorough understanding of VeCa's efficacy.
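As a rough illustration of the row-blocking idea (not VeCa's actual kernel, which applies architecture-specific vector instructions per row), the following NumPy/SciPy sketch processes a CSR matrix one cache-sized row block at a time; the block size and the helper name blocked_spmv are assumptions.

```python
# Illustrative row-blocked CSR SpMV, loosely following the hierarchical
# blocking idea in the abstract; block size and names are assumptions.
import numpy as np
from scipy.sparse import random as sparse_random

def blocked_spmv(A_csr, x, row_block=1024):
    """Multiply CSR matrix A by vector x one row block at a time, so each
    block's rows (and the slice of x they touch) stay cache-resident."""
    y = np.zeros(A_csr.shape[0])
    for start in range(0, A_csr.shape[0], row_block):
        end = min(start + row_block, A_csr.shape[0])
        # Each block is an independent unit of work, which also makes
        # per-block workload balancing (as VeCa estimates it) possible.
        y[start:end] = A_csr[start:end, :] @ x
    return y

A = sparse_random(5000, 5000, density=0.001, format="csr", random_state=0)
x = np.ones(5000)
assert np.allclose(blocked_spmv(A, x), A @ x)
```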
Artificial intelligence has shown great potential in a variety of applications, from natural language models to audio-visual recognition, classification, and manipulation. AI researchers have to work with massive amou...
Efficient processing of extensive datasets is crucial in data-driven applications, particularly for anomaly detection. This article explores the application of parallel and distributed machine learning techniques to e...
ISBN:
(Print) 9798350364613; 9798350364606
The matching problem formulated as Maximum Cardinality Matching in General Graphs (MCMGG) finds the largest matching on graphs without restrictions. The Micali-Vazirani algorithm has the best asymptotic complexity for solving MCMGG when the graphs are sparse. Parallelizing matching in general graphs on the GPU is difficult for multiple reasons. First, the augmenting path procedure is highly recursive, and NVIDIA GPUs use registers to store kernel arguments, which eventually spill into cached device memory, with a performance penalty. Second, extracting parallelism from the matching process requires partitioning the graph to avoid any overlapping augmenting paths. We propose an implementation of the Micali-Vazirani algorithm which identifies bridge edges using thread-parallel breadth-first search, followed by block-parallel path augmentation and blossom contraction. The augmenting-path and union-find methods were implemented as stack-based iterative methods, with the stack allocated in shared memory. Our experiments show that, compared to the serial implementation, our approach results in up to a 15-fold speed-up for very sparse regular graphs, up to a 5-fold slowdown for denser regular graphs, and finally a 50-fold slowdown for power-law distributed Kronecker graphs. This implementation has been open-sourced for further research on developing combinatorial graph algorithms on GPUs.
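The abstract notes that recursion is avoided by making the augmenting-path and union-find routines stack-based and iterative. The Python sketch below shows an iterative union-find with path compression in that spirit; it is a CPU-side illustration, not the paper's CUDA implementation, and the class name UnionFind is our own.

```python
# Hedged sketch: an iterative (recursion-free) union-find with path
# compression, mirroring the abstract's note that deep recursion must be
# avoided on GPUs; plain Python here, not the paper's CUDA code.

class UnionFind:
    def __init__(self, n: int):
        self.parent = list(range(n))

    def find(self, x: int) -> int:
        # First pass: walk up to the root without recursion.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Second pass: path compression, again iterative.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False          # already in the same blossom/component
        self.parent[rb] = ra
        return True

uf = UnionFind(6)
uf.union(0, 1); uf.union(1, 2)
assert uf.find(2) == uf.find(0)
```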
Learning effective image representation and constructing a suitable metric space are two main challenges in few-shot image classification. Existing methods normally consider the joint characteristic distribution of th...
ISBN:
(Print) 9783030967727; 9783030967710
Images are an important information-bearing medium with many important attributes. If image data is released directly, personal privacy will be compromised. This paper addresses how to use differential privacy to protect the privacy of image data while keeping the data highly usable. We propose a WIP method based on the wavelet transform. First, the wavelet transform is used to compress the image. Then, noise is added to the main features after transformation to obtain a published image satisfying differential privacy. This solves the problem of low usability for large images and the problem that the Fourier transform cannot handle abrupt signals. Experimental results show that, compared with similar methods in the frequency domain, the denoised image obtained by the proposed WIP method is more distinguishable and its information entropy is closer to that of the original image. The accuracy is 10% higher than that of other methods. Compared with other frequency-domain methods for image differential privacy protection, the proposed WIP method has higher usability and robustness.
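As a hedged sketch of this kind of mechanism (assuming a standard Laplace mechanism on wavelet coefficients; the paper's exact sensitivity analysis is not given in the abstract), the following Python snippet uses PyWavelets to compress an image with a one-level wavelet transform, perturbs the approximation coefficients, and reconstructs the image. The function name wavelet_dp_publish and the parameter defaults are illustrative.

```python
# Rough sketch of the WIP idea: compress an image with a wavelet transform,
# add Laplace noise to the retained coefficients, and reconstruct. The
# epsilon/sensitivity handling is illustrative, not the paper's mechanism.
import numpy as np
import pywt

def wavelet_dp_publish(img: np.ndarray, epsilon: float,
                       sensitivity: float = 1.0, wavelet: str = "haar"):
    # One-level 2-D wavelet decomposition: approximation + detail bands.
    cA, (cH, cV, cD) = pywt.dwt2(img.astype(float), wavelet)
    # Perturb only the main (approximation) coefficients, as the paper
    # does for its "main features"; Laplace scale = sensitivity / epsilon.
    noisy_cA = cA + np.random.laplace(0.0, sensitivity / epsilon, cA.shape)
    # Drop the detail bands (compression), then invert the transform.
    zeros = np.zeros_like(cH)
    return pywt.idwt2((noisy_cA, (zeros, zeros, zeros)), wavelet)

img = np.random.randint(0, 256, (64, 64))
private_img = wavelet_dp_publish(img, epsilon=0.5)
print(private_img.shape)
```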
ISBN:
(Print) 9789819708109; 9789819708116
One-class defect detection has proven to be an effective technique. However, the performance of complex models is often limited by existing data augmentation methods. To address this issue, this paper proposes a novel data augmentation method based on a denoising diffusion probabilistic model. This approach generates high-quality image samples using partial noise diffusion, eliminating the need for extensive training on large-scale datasets. Experimental results demonstrate that the proposed method outperforms current methods in one-class defect detection tasks. The proposed method offers a new perspective on data augmentation and demonstrates its potential to tackle challenging computer vision problems.
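Although the abstract does not define "partial noise diffusion" precisely, in the standard DDPM formulation it plausibly means running the closed-form forward process only to an intermediate timestep t, so that a denoiser can regenerate a nearby (augmented) sample. The PyTorch sketch below implements that forward step under those assumptions; partial_noise and the schedule constants are ours, not the paper's.

```python
# Illustrative partial forward diffusion in the DDPM formulation: noise a
# clean sample only up to an intermediate step t, so a pretrained denoiser
# can regenerate a nearby (augmented) sample. A sketch, not the paper's code.
import torch

def partial_noise(x0: torch.Tensor, t: int, T: int = 1000,
                  beta_start=1e-4, beta_end=0.02) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) for an intermediate timestep t < T."""
    betas = torch.linspace(beta_start, beta_end, T)
    alpha_bar = torch.cumprod(1.0 - betas, dim=0)[t]
    eps = torch.randn_like(x0)
    # Closed-form forward process: preserves structure when t is small.
    return alpha_bar.sqrt() * x0 + (1.0 - alpha_bar).sqrt() * eps

x0 = torch.rand(1, 3, 64, 64)           # a clean, defect-free sample
x_t = partial_noise(x0, t=300)          # partially noised augmentation seed
print(x_t.shape)
```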
ISBN:
(Print) 9798350383225
Deep Learning (DL) model sizes are increasing at a rapid pace, as larger models typically offer better statistical performance. Modern Large Language Models (LLMs) and image-processing models contain billions of trainable parameters. Training such massive neural networks incurs significant memory requirements and financial cost. Hybrid-parallel training approaches have emerged that combine pipelining with data and tensor parallelism to facilitate the training of large DL models on distributed hardware setups. However, existing approaches to designing a hybrid-parallel partitioning and parallelization plan for DL models focus on achieving high throughput, not on minimizing memory usage and financial cost. We introduce CAPTURE, a partitioning and parallelization approach for hybrid parallelism that minimizes peak memory usage. CAPTURE combines a profiling-based approach with statistical modeling to recommend a partitioning and parallelization plan that minimizes the peak memory usage across all the Graphics Processing Units (GPUs) in the hardware setup. Our results show a reduction in memory usage of up to 43.9% compared to partitioners in state-of-the-art hybrid-parallel training systems. The reduced memory footprint enables the training of larger DL models on the same hardware resources and training with larger batch sizes. CAPTURE can also train a given model on a smaller hardware setup than other approaches, reducing the financial cost of training massive DL models.
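The core optimization CAPTURE targets can be stated simply: split a sequence of profiled layers into contiguous pipeline stages so that the peak per-GPU memory is minimized. The Python sketch below solves this toy version with a binary search over the achievable peak; the memory profile and the function name min_peak_partition are made-up illustrations, not CAPTURE's actual planner.

```python
# Toy version of peak-memory-minimizing pipeline partitioning: binary-search
# the smallest per-stage memory cap for which a greedy split into at most
# n_stages contiguous stages succeeds. Numbers and names are illustrative.

def min_peak_partition(layer_mem, n_stages):
    def fits(cap):
        # Greedily pack layers left to right; count stages needed.
        stages, cur = 1, 0.0
        for m in layer_mem:
            if m > cap:
                return False          # a single layer exceeds the cap
            if cur + m > cap:
                stages, cur = stages + 1, m
            else:
                cur += m
        return stages <= n_stages

    lo, hi = max(layer_mem), sum(layer_mem)
    while hi - lo > 1e-3:             # ~1 MB resolution if units are GB
        mid = (lo + hi) / 2
        if fits(mid):
            hi = mid
        else:
            lo = mid
    return hi

# Per-layer weight+activation memory in GB (made-up profile).
profile = [4.0, 2.5, 3.0, 6.0, 1.5, 2.0]
print(f"best achievable peak: {min_peak_partition(profile, 3):.2f} GB")
```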
ISBN:
(Print) 9781728198354
Image inpainting has made significant progress benefiting from the advantages of convolutional neural networks (CNNs). Deep learning-based methods have shown extraordinary performance in this field. In this paper, we propose a novel image inpainting architecture built on a pure CNN that jointly reconstructs the structure and texture of an image. Our generative network architecture (TSFC) consists of two parallel stages: structure generation and texture generation. In the structure generation stage, we use large convolution kernels, which are largely neglected in modern networks, exploiting their large effective receptive field to enhance the perception of overall structural features. In the texture generation stage, we use small convolution kernels to extract local texture features. Qualitative and quantitative experimental results on the CelebA-HQ and Paris Street View datasets demonstrate the effectiveness and superiority of our method.
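To make the two-branch design concrete, here is a minimal PyTorch sketch of a block with a large-kernel structure branch and a small-kernel texture branch, fused by a 1x1 convolution. The kernel sizes and channel counts are assumptions for illustration; this is not the TSFC network itself.

```python
# Minimal two-branch block in the spirit of the abstract: a large-kernel
# branch for global structure plus a small-kernel branch for local texture,
# fused by concatenation and a 1x1 conv. Layer sizes are assumptions.
import torch
import torch.nn as nn

class TwoBranchBlock(nn.Module):
    def __init__(self, ch: int = 32):
        super().__init__()
        # Large receptive field -> overall structural features.
        self.structure = nn.Conv2d(3, ch, kernel_size=13, padding=6)
        # Small receptive field -> local texture features.
        self.texture = nn.Conv2d(3, ch, kernel_size=3, padding=1)
        self.fuse = nn.Conv2d(2 * ch, 3, kernel_size=1)

    def forward(self, x):
        s = torch.relu(self.structure(x))   # structure generation branch
        t = torch.relu(self.texture(x))     # texture generation branch
        return self.fuse(torch.cat([s, t], dim=1))

x = torch.rand(1, 3, 256, 256)               # masked input image
print(TwoBranchBlock()(x).shape)             # -> torch.Size([1, 3, 256, 256])
```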