ISBN (Digital): 9789819708017
ISBN (Print): 9789819708000; 9789819708017
Sparse matrix-vector multiplication (SpMV) is extensively used in scientific computing and often accounts for a significant portion of the overall computational overhead, so improving the performance of SpMV is crucial. However, sparse matrices exhibit a sporadic and irregular distribution of non-zero elements, resulting in workload imbalance among threads and challenges in vectorization. To address these issues, numerous efforts have focused on optimizing SpMV based on the hardware characteristics of computing platforms. In this paper, we present an optimization of CSR-based SpMV on Pezy-SC3s, a novel MIMD computing platform, since the CSR format is the most widely used and is supported by various high-performance sparse computing libraries. Based on the hardware characteristics of Pezy-SC3s, we tackle poor data locality, workload imbalance, and vectorization challenges in CSR-based SpMV by employing matrix chunking, applying the Atomic Cache for workload scheduling, and utilizing SIMD instructions while performing SpMV. As the first study to investigate SpMV optimization on Pezy-SC3s, we evaluate the performance of our work by comparing it with CSR-based SpMV and the SpMV provided by Nvidia's CuSparse. Through experiments conducted on 2092 matrices obtained from SuiteSparse, we demonstrate that our optimization achieves a maximum speedup of 17.63x and an average speedup of 1.56x over CSR-based SpMV, and an average bandwidth utilization of 35.22% for large-scale matrices (nnz >= 10^6) compared with the 36.17% obtained using CuSparse. These results demonstrate that our optimization effectively harnesses the hardware resources of Pezy-SC3s, leading to improved performance of CSR-based SpMV.
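For readers unfamiliar with the baseline kernel being optimized, the following is a minimal C++ sketch of CSR-based SpMV with a simple row-chunking loop, suggesting how rows can be grouped into independently schedulable blocks. The `CsrMatrix` type, the `spmv_csr_chunked` name, and the chunk size are illustrative assumptions; the Pezy-SC3s Atomic Cache scheduling and SIMD intrinsics described in the abstract are platform-specific and not shown.

```cpp
// Minimal CSR SpMV sketch: y = A * x.
// The chunked outer loop illustrates splitting rows into fixed-size blocks so
// that a scheduler could balance chunks across threads; the paper's
// Pezy-SC3s-specific scheduling and SIMD code is not reproduced here.
#include <algorithm>
#include <cstddef>
#include <vector>

struct CsrMatrix {
    std::size_t rows = 0;
    std::vector<std::size_t> row_ptr;  // size rows + 1
    std::vector<std::size_t> col_idx;  // size nnz
    std::vector<double> values;        // size nnz
};

// Process rows in chunks of `chunk_rows` (hypothetical parameter); each chunk
// is an independent unit of work.
void spmv_csr_chunked(const CsrMatrix& A,
                      const std::vector<double>& x,
                      std::vector<double>& y,
                      std::size_t chunk_rows = 256) {
    y.assign(A.rows, 0.0);
    for (std::size_t start = 0; start < A.rows; start += chunk_rows) {
        std::size_t end = std::min(start + chunk_rows, A.rows);
        for (std::size_t r = start; r < end; ++r) {
            double sum = 0.0;
            for (std::size_t k = A.row_ptr[r]; k < A.row_ptr[r + 1]; ++k) {
                sum += A.values[k] * x[A.col_idx[k]];
            }
            y[r] = sum;
        }
    }
}
```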
ISBN (Digital): 9789819707980
ISBN (Print): 9789819707973; 9789819707980
The utilization of large-scale datasets in various fields is increasing due to the advancement of big data technology. Because computing resources are limited, traditional serial frameworks are no longer efficient at processing such massive data. Furthermore, as Moore's Law gradually loses its effect, improving program performance at the hardware level becomes increasingly challenging. Consequently, numerous parallel frameworks with distinct features and architectures have emerged, and selecting an appropriate one can enhance researchers' performance across various tasks. This paper evaluates three prominent parallel frameworks, Spark, Ray, and MPI, and employs minimap2, a third-generation CPU-based sequence alignment tool, as the benchmark program. The experimental results are discussed comprehensively. To evaluate the three frameworks, we devised a parallel algorithm for minimap2 and implemented its parallel versions using Ray and MPI, respectively. Furthermore, we selected IMOS as the Spark version of minimap2. The experiments involved six real datasets and one simulated dataset to evaluate and compare speedup, efficiency, throughput, scalability, peak memory, latency, and load balance. The findings show that MPI outperforms Apache Spark and Ray, achieving a maximum speedup of 104.019, 81.3% efficiency, 33.510 MB/s throughput, the lowest latency, and better load balance; however, MPI exhibits poor fault tolerance. Apache Spark demonstrated the second-best performance, with a speedup of 88.937, efficiency of 69.5%, throughput of 29.546 MB/s, low latency, and the best load balance; it also exhibited good fault tolerance and benefited from a mature ecosystem. Ray achieves a speedup of 76.828, efficiency of 60.0%, and throughput of 25.009 MB/s, but it experiences high latency fluctuations and worse load balance than the other two frameworks, while maintaining good fault tolerance. The source code and a comprehensive user manual ...
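To make the data-parallel pattern behind such a comparison concrete, here is a generic C++/MPI sketch of statically partitioning a read set across ranks, in the spirit of parallelizing a sequence aligner such as minimap2. This is not the paper's implementation; `align_chunk` and `total_reads` are hypothetical placeholders standing in for invoking the aligner on one block of reads.

```cpp
// Generic MPI sketch: each rank processes a contiguous block of reads and the
// root rank aggregates a simple count. Hypothetical placeholder logic only.
#include <mpi.h>
#include <cstdio>

// Hypothetical placeholder: "align" reads [begin, end) and return how many
// records were processed.
static long align_chunk(long begin, long end) {
    return end - begin;
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank = 0, size = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const long total_reads = 1000000;  // assumed size of the input read set
    long per_rank = (total_reads + size - 1) / size;
    long begin = static_cast<long>(rank) * per_rank;
    if (begin > total_reads) begin = total_reads;
    long end = begin + per_rank;
    if (end > total_reads) end = total_reads;

    long local = align_chunk(begin, end);
    long global = 0;
    MPI_Reduce(&local, &global, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) {
        std::printf("aligned %ld reads across %d ranks\n", global, size);
    }

    MPI_Finalize();
    return 0;
}
```

A static partition like this keeps communication minimal, which is consistent with the low latency reported for the MPI version, at the cost of the weaker fault tolerance the abstract notes.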
ISBN (Digital): 9789819708628
ISBN (Print): 9789819708611; 9789819708628
As blockchain technology garners increased adoption, permissioned blockchains like Hyperledger Fabric emerge as a popular blockchain system for developing scalable decentralized applications. Nonetheless, parallel execution in Fabric leads to concurrent conflicting transactions attempting to read and write the same key in the ledger simultaneously. Such conflicts force transactions to be aborted, thereby impacting performance. The mainstream solution constructs a conflict graph to reorder the transactions and thereby reduce the abort rate. However, it incurs considerable overhead in scenarios with a large volume of transactions or high data contention, because dependencies must be captured between every pair of transactions. Therefore, one critical problem is how to efficiently order conflicting transactions during the ordering phase. In this paper, we introduce an optimized reordering algorithm designed for efficient concurrency control. Initially, we leverage key dependency instead of transaction dependency to build a conflict graph that treats read/write units as vertices and intra-transaction dependencies as edges. Subsequently, a key sorting algorithm generates a serializable transaction order for validation. Our empirical results indicate that the proposed key-based reordering method reduces transaction latency by 36.3% and considerably lowers system memory costs while maintaining a low abort rate compared to benchmark methods.
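The following C++ sketch illustrates the general scheme the abstract describes under stated assumptions: keys become vertices, each transaction adds edges from the keys it reads to the keys it writes, a topological order of keys is computed, and transactions are then ordered by the ranks of their written keys. The `Txn` structure, the tie-breaking rule, and the cycle handling are illustrative choices; the paper's actual key sorting algorithm may differ.

```cpp
// Rough sketch of a key-based conflict graph and a derived transaction order.
#include <algorithm>
#include <map>
#include <queue>
#include <set>
#include <string>
#include <vector>

struct Txn {
    int id;
    std::vector<std::string> reads;
    std::vector<std::string> writes;
};

std::vector<int> order_by_key_graph(const std::vector<Txn>& txns) {
    std::map<std::string, std::set<std::string>> adj;  // key -> successor keys
    std::map<std::string, int> indeg;

    // Build the key graph: intra-transaction edges from read keys to write keys.
    for (const Txn& t : txns) {
        for (const auto& r : t.reads) indeg.emplace(r, 0);
        for (const auto& w : t.writes) indeg.emplace(w, 0);
        for (const auto& r : t.reads)
            for (const auto& w : t.writes)
                if (r != w && adj[r].insert(w).second) ++indeg[w];
    }

    // Kahn's algorithm assigns each key a topological rank.
    std::queue<std::string> q;
    for (const auto& kv : indeg)
        if (kv.second == 0) q.push(kv.first);
    std::map<std::string, int> rank;
    int next = 0;
    while (!q.empty()) {
        std::string k = q.front(); q.pop();
        rank[k] = next++;
        for (const auto& s : adj[k])
            if (--indeg[s] == 0) q.push(s);
    }
    // Keys without a rank sit on a cycle; give them late ranks so the
    // corresponding transactions sort last (a real system would instead
    // abort or break the cycle).
    for (const auto& kv : indeg)
        if (!rank.count(kv.first)) rank[kv.first] = next++;

    // Order transactions by the earliest rank among the keys they write.
    std::vector<int> order(txns.size());
    for (std::size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
    auto txn_rank = [&](const Txn& t) {
        int best = next;
        for (const auto& w : t.writes) best = std::min(best, rank[w]);
        return best;
    };
    std::stable_sort(order.begin(), order.end(), [&](int a, int b) {
        return txn_rank(txns[a]) < txn_rank(txns[b]);
    });
    return order;
}
```

Working over keys rather than whole transactions keeps the graph bounded by the number of distinct read/write units, which is the intuition behind the memory savings the abstract reports.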