All-reduce is a widely used communication technique for distributed and parallel applications, typically implemented using either a tree-based or a ring-based scheme. Each approach has its own limitations: tree-based schemes struggle to exchange large messages efficiently, while ring-based solutions assume constant communication throughput, an unrealistic expectation in modern network communication infrastructures. We present FMCC-RT, an all-reduce approach that combines the advantages of tree- and ring-based implementations while mitigating their drawbacks. FMCC-RT dynamically switches between tree- and ring-based implementations depending on the size of the message being processed. It uses an analytical model to assess the impact of message size on the achieved throughput, enabling the derivation of optimal work partitioning parameters. Furthermore, FMCC-RT is designed with an Open MPI-compatible API, requiring no modification to user code. We evaluated FMCC-RT through micro-benchmarks and real-world application tests. Experimental results show that FMCC-RT outperforms state-of-the-art tree- and ring-based methods, achieving speedups of up to 5.6×.
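The idea of switching between tree and ring schemes by message size can be illustrated with a minimal sketch. This is not FMCC-RT's analytical model, which the abstract does not specify; it assumes the textbook latency-bandwidth (alpha-beta) cost model, and the function names and the default `alpha` and `beta` values are hypothetical.

```python
import math

# Assumed textbook alpha-beta model, NOT the FMCC-RT analytical model:
#   alpha = per-message latency in seconds, beta = transfer time per byte.

def tree_cost(n_bytes, p, alpha, beta):
    """Estimated time of a binomial-tree reduce followed by a broadcast."""
    steps = 2 * math.ceil(math.log2(p))
    # Every step moves the full message.
    return steps * (alpha + n_bytes * beta)

def ring_cost(n_bytes, p, alpha, beta):
    """Estimated time of a ring reduce-scatter followed by an all-gather."""
    steps = 2 * (p - 1)
    # Every step moves one chunk of size n_bytes / p.
    return steps * (alpha + (n_bytes / p) * beta)

def choose_allreduce(n_bytes, p, alpha=5e-6, beta=1e-9):
    """Pick the scheme with the lower estimated cost (hypothetical helper)."""
    if p < 2:
        raise ValueError("all-reduce needs at least two processes")
    tree = tree_cost(n_bytes, p, alpha, beta)
    ring = ring_cost(n_bytes, p, alpha, beta)
    return "tree" if tree <= ring else "ring"

if __name__ == "__main__":
    for size in (8, 64 * 1024, 64 * 1024 * 1024):
        print(size, choose_allreduce(size, p=16))
```

Under these assumed parameters, small messages are latency-bound and favor the tree, while large messages are bandwidth-bound and favor the ring. A model that lets throughput vary with message size, as the abstract describes, would replace the constant `beta`.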
Graphics Processing Units (GPUs) are widely used as powerful hardware accelerators for data-intensive tasks. However, their efficacy can be hindered by constraints in device memory and data transfer speeds via the PCI...
This paper examines how co-locating multiple VMs on a single physical server with shared storage impacts I/O performance, specifically focusing on latency of I/O operations, and the overall throughput. We introduce a ...
Distributed deep neural network training necessitates efficient GPU collective communications, which are inherently susceptible to deadlocks. GPU collective deadlocks arise easily in distributed deep learning applicat...
The increasing demand for real-time data analysis in Internet of Things (IoT) ecosystems has created several challenges, particularly in environments where resources are limited, and minimizing data processing latency...
The burgeoning complexity of communication necessitates a high demand for security. Access control encryption is a promising primitive to meet the security demand but the bulk of its constructions rely on formulating ...
The e-vote is regarded as a way to express the opinion that the voters ask for. Actually, the e-vote could be applied widely, like questionnaire, ***, the coexistences of efficiency and security as well as transparency and pr...
Transaction processing systems are the crux for modern data-center applications, yet current multi-node systems are slow due to network overheads. This paper advocates for Compute Express Link (CXL) as a network alter...
Industrial part surface defect detection aims to precisely locate defects in images, which is crucial for quality control in manufacturing. The traditional method needs to be designed in advance, but it has shortcomin...
Authors: Wang, Hongfei; Wan, Caixue; Jin, Hai
Huazhong University of Science and Technology, National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Wuhan 430074, China
Huazhong University of Science and Technology, National Engineering Research Center for Big Data Technology and System, Services Computing Technology and System Lab, Cluster and Grid Computing Lab, School of Computer Science and Technology, Wuhan 430074, China
The Physical Unclonable Function (PUF) is valued for its lightweight nature and unique functionality, making it a common choice for securing hardware products requiring authentication and key generation mechanisms. In...