ISBN: (print) 9783031061561; 9783031061554
Overlapping communications with computations in distributed applications should improve their performance and scalability. By construction, this means that communications execute in parallel with computations. In this work, we explore the impact of computations on communication performance and vice versa, with a focus on the role of memory contention. One main observation is that highly memory-bound computations can severely degrade network bandwidth.
This paper introduces a game-theoretic framework aimed at enhancing offloading decisions within a network of multiple tethered aerial vehicles (TAVs) employed as communication base stations and relay platforms. Given ...
By using the cloud computing platform that has achieved excellent commercial results to perform parallel classification processing of massive remote sensing data, it can meet the requirements of improving the parallel...
ISBN: (digital) 9789819708017
ISBN: (print) 9789819708000; 9789819708017
Sparse matrix-vector multiplication (SpMV) is extensively used in scientific computing and often accounts for a significant portion of the overall computational overhead. Therefore, improving the performance of SpMV is crucial. However, the non-zero elements of sparse matrices are distributed sporadically and irregularly, which causes workload imbalance among threads and makes vectorization difficult. To address these issues, numerous efforts have focused on optimizing SpMV based on the hardware characteristics of computing platforms. In this paper, we present an optimization of CSR-based SpMV on Pezy-SC3s, a novel MIMD computing platform. We target the CSR format because it is the most widely used and is supported by various high-performance sparse computing libraries. Based on the hardware characteristics of Pezy-SC3s, we tackle poor data locality, workload imbalance, and vectorization challenges in CSR-based SpMV by employing matrix chunking, applying Atomic Cache for workload scheduling, and using SIMD instructions while performing SpMV. As the first study to investigate SpMV optimization on Pezy-SC3s, we evaluate our work by comparing it with CSR-based SpMV and with the SpMV provided by NVIDIA's cuSPARSE. Through experiments on 2092 matrices obtained from SuiteSparse, we demonstrate that our optimization achieves a maximum speedup of 17.63x and an average speedup of 1.56x over CSR-based SpMV, and an average bandwidth utilization of 35.22% for large-scale matrices (nnz >= 10^6), compared with 36.17% obtained using cuSPARSE. These results demonstrate that our optimization effectively harnesses the hardware resources of Pezy-SC3s, leading to improved performance of CSR-based SpMV.
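For readers unfamiliar with the CSR layout the abstract refers to, the following is a minimal sketch in plain Python of a CSR-based SpMV together with one simple way to split rows into chunks of roughly equal non-zero count. It is not the authors' Pezy-SC3s implementation: the function names `csr_spmv` and `chunk_rows_by_nnz`, the array names `indptr`, `indices`, and `data`, and the chunking rule itself are assumptions chosen for illustration, and the sketch does not model Atomic Cache scheduling or SIMD.

```python
def csr_spmv(indptr, indices, data, x, row_start=0, row_end=None):
    """Compute rows [row_start, row_end) of y = A @ x for A stored in CSR.

    indptr[r]..indptr[r+1] delimits the non-zeros of row r;
    indices[k] is the column and data[k] the value of non-zero k.
    """
    if row_end is None:
        row_end = len(indptr) - 1
    y = []
    for row in range(row_start, row_end):
        acc = 0.0
        for k in range(indptr[row], indptr[row + 1]):
            acc += data[k] * x[indices[k]]
        y.append(acc)
    return y


def chunk_rows_by_nnz(indptr, n_chunks):
    """Split rows into contiguous chunks with roughly equal non-zero counts.

    Hypothetical balancing rule for illustration only: chunk c ends at the
    first row whose starting offset reaches c/n_chunks of the total nnz.
    """
    n_rows = len(indptr) - 1
    total_nnz = indptr[-1]
    bounds = [0]
    for c in range(1, n_chunks):
        target = total_nnz * c / n_chunks
        row = bounds[-1]
        while row < n_rows and indptr[row] < target:
            row += 1
        bounds.append(row)
    bounds.append(n_rows)
    return list(zip(bounds[:-1], bounds[1:]))


# Example matrix:
# [[1, 0, 2],
#  [0, 3, 0],
#  [4, 0, 5]]
indptr = [0, 2, 3, 5]
indices = [0, 2, 1, 0, 2]
data = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0]

# Process the matrix chunk by chunk and concatenate the partial results.
y = []
for start, end in chunk_rows_by_nnz(indptr, 2):
    y.extend(csr_spmv(indptr, indices, data, x, start, end))
```

Each chunk touches a disjoint range of output rows, so in a parallel setting the chunks could be handed to different workers without write conflicts; how work is actually scheduled on Pezy-SC3s is described only at the level of the abstract above.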
Large-scale computer systems provide a high-performance platform for engineering applications. Mesh generation is the basis of numerical simulation in computing science, and it relies heavily on the user's experie...
The emerging class of high-velocity, high-volume data analytic workflows comprises interwoven data ingestion, organization, and processing stages, with ingestion and organization steps often contributing comparable ...
ISBN: (print) 9798350341515
The growing demands placed upon modern compute and network resources far exceed the capabilities of traditional computer architectures. It is now customary for accelerators to perform the bulk of compute, and this compute is being pushed ever closer to the network. FPGA vendors have brought powerful datacenter cards to the market, combining reprogrammable FPGA fabrics with high-bandwidth networking capability. However, the supporting infrastructure has yet to reach maturity, so modern and diverse workloads are not yet able to fully leverage these architectural advances. In this paper we present DiAD, a novel framework providing FPGA firmware and driver support for fully unified, distributed compute and network acceleration across a commodity Ethernet network. We present a far richer feature set and greater flexibility than existing solutions. We show host networking performance comparable to Xilinx's OpenNIC solution. As well as allowing host networking through the FPGA fabric, we demonstrate reliable data transfer directly between FPGA fabrics without host intervention, and show native memory transactions sent over the network in support of pointer-chasing workloads. We achieve line rate for dataflow-type communication and approach line rate for larger native memory transfers (87G).
We study the maximum set coverage problem in the massively parallel model. In this setting, m sets that are subsets of a universe of n elements are distributed among m machines. In each round, these machines can commu...
We present an empirical approach to identify the key factors affecting the execution performance of task-based workflows on a High Performance Computing (HPC) infrastructure composed of heterogeneous CPU-GPU clusters....
Deep Learning (DL), especially with Large Language Models (LLMs), brings benefits to various areas. However, DL training systems usually leave substantial GPU resources idle due to many factors, such as resource alloc...