ISBN: (Print) 9781728162515
Some graph analyses, such as those of social and biological networks, need large-scale graph construction and maintenance over distributed memory space. Distributed data-streaming tools, including MapReduce and Spark, restrict some computational freedom of incremental graph modification and run-time graph visualization. Instead, we take an agent-based approach. We construct a graph from a scientific dataset in CSV, tab, or XML format; dispatch many reactive agents onto it; and analyze the graph in the form of their collective group behavior: propagation, flocking, and collision. The key to success is how to automate the run-time construction and visualization of agent-navigable graphs mapped over distributed memory. We implemented this distributed graph-computing support in the multi-agent spatial simulation (MASS) library, coupled with the Cytoscape graph visualization software. This paper presents the MASS implementation techniques and demonstrates its execution performance in comparison to MapReduce and Spark, using two benchmark programs: (1) an incremental construction of a complete graph and (2) a KD-tree construction.
ISBN: (Print) 9781450368827
Read-copy update (RCU) can provide ideal scalability for read-mostly workloads, but some believe that it provides only poor performance for updates. This belief is due to the lack of RCU-centric update synchronization mechanisms. RCU instead works with a range of update-side mechanisms, such as locking. In fact, many developers embrace simplicity by using global locking. Logging, hardware transactional memory, or fine-grained locking can provide better scalability, but each of these approaches has limitations, such as imposing overhead on readers or poor scalability on nonuniform memory access (NUMA) systems, mainly due to their lack of NUMA-aware design principles. This paper introduces an RCU extension (RCX) that provides highly scalable RCU updates on NUMA systems while retaining RCU's read-side benefits. RCX is a software-based synchronization mechanism combining hardware transactional memory (HTM) and traditional locking based on our NUMA-aware design principles for RCU. Microbenchmarks on a NUMA system having 144 hardware threads show RCX has up to 22.6 times better performance and up to 145 times lower HTM abort rates compared to a state-of-the-art RCU/HTM combination. To demonstrate the effectiveness and applicability of RCX, we have applied RCX to parallelize parts of the Linux kernel memory management system and an in-memory database system. The optimized kernel and the database show up to 24 and 17 times better performance compared to the original versions, respectively.
ISBN: (Print) 9783030410056; 9783030410049
Performance Portability frameworks allow developers to write code for a familiar High-Performance Computing (HPC) architecture and minimize the development effort needed over time to port it to other HPC architectures with little to no loss of performance. In our research, we conducted experiments with the same codebase in Serial, OpenMP, and CUDA execution and memory spaces and compared it to the Kokkos Performance Portability framework. We assessed how well these approaches meet the goals of Performance Portability by solving a thermal conduction model on a 2D plate on multiple architectures (NVIDIA (K20, P100, V100, XAVIER), Intel Xeon, IBM Power 9, ARM64) and collected execution times (wall-clock) and performance counters with perf and nvprof for analysis. We used the Serial model to determine a baseline and to confirm that the model converges in both the native and Kokkos code. The OpenMP and CUDA models were used to analyze the parallelization strategy as compared to the Kokkos framework for the same execution and memory spaces.
ISBN: (Print) 9781728166957
The cloud parallel programming system CPPS, under development at the Institute of Informatics Systems, aims to be an interactive visual environment of functional and parallel programming to support computer science teaching and learning. The system will support the development, verification, and debugging of architecture-independent parallel Cloud Sisal programs and their correct conversion into efficient code for parallel computing systems for execution in clouds. In the paper, methods and tools of the CPPS system intended for formal verification of Cloud Sisal programs are described.
The development of autonomous Unmanned Aerial Vehicles (UAVs) is a priority to many civilian and military organizations. Real time optimal trajectory planning is an essential element for the autonomy of UAVs. The use ...
ISBN: (Print) 9781728146614
Recently, OpenCL has been emerging as a programming model for energy-efficient FPGA accelerators. However, the state-of-the-art OpenCL frameworks for FPGAs suffer from poor performance and usability. This paper proposes a high-level synthesis framework of OpenCL for FPGAs, called SOFF. It automatically synthesizes a datapath to execute many OpenCL kernel threads in a pipelined manner. It also synthesizes an efficient memory subsystem for the datapath based on the characteristics of OpenCL kernels. Unlike previous high-level synthesis techniques, we propose a formal way to handle variable-latency instructions, complex control flows, OpenCL barriers, and atomic operations that appear in real-world OpenCL kernels. SOFF is the first OpenCL framework that correctly compiles and executes all applications in the SPEC ACCEL benchmark suite, except three applications that require more FPGA resources than are available. In addition, SOFF achieves a speedup of 1.33 over the Intel FPGA SDK for OpenCL without any explicit user annotation or source code modification.
ISBN: (Print) 9781728192192
We present the implementation of two sparse linear algebra kernels on a migratory memory-side processing architecture. The first is the Sparse Matrix-Vector (SpMV) multiplication, and the second is the Symmetric Gauss-Seidel (SymGS) method. Both were chosen as they account for the largest share of the run time of the HPCG benchmark. We introduce the system used for the experiments, as well as its programming model and the key aspects of getting the most performance from it. We describe the data distribution used to allow an efficient parallelization of the algorithms, and their actual implementations. We then present hardware results and simulator traces to explain their behavior. We show almost linear strong scaling of the code, and discuss future work and improvements.
ISBN: (Print) 9781728146614
GPUs have emerged as a key computing platform for an ever-growing range of applications. Unlike traditional bulk-synchronous GPU programs, many emerging GPU-accelerated applications, such as graph processing, have irregular interaction among the concurrent threads. Consequently, they need complex synchronization. To enable both high performance and adequate synchronization, GPU vendors have introduced scoped synchronization operations that allow a programmer to synchronize within a subset of concurrent threads (a.k.a. a scope) that she deems adequate. Scoped synchronization avoids the performance overhead of synchronization across thousands of GPU threads while ensuring correctness when used appropriately. This flexibility, however, could be a new source of incorrect synchronization, where a race can occur due to insufficient scope of the synchronization operation, and not due to missing synchronization as in a typical race. We introduce ScoRD, a race detector that enables hardware support for efficiently detecting global memory races in a GPU program, including those that arise due to insufficient scopes of synchronization operations. We show that ScoRD can detect a variety of races with a modest performance overhead (on average, 35%). In the process of this study, we also created a benchmark suite consisting of seven applications and three categories of microbenchmarks that use scoped synchronization operations.
ISBN: (Print) 9781665422840
As HPC progresses toward exascale, writing applications that are highly efficient, portable, and supportive of programmer productivity is becoming more challenging than ever. The growing scale, diversity, and heterogeneity of compute platforms increase the burden on software to use available distributed parallel resources efficiently. This burden has fallen on developers who, increasingly, are experts in application domains rather than traditional computer scientists and engineers. We propose CASPER (Compiler Abstractions Supporting high Performance on Extreme-scale Resources), a novel domain-specific compiler and runtime framework to enable domain scientists to achieve high performance and scalability on complex HPC systems. CASPER extends domain-specific languages with machine learning to map software tasks to distributed, heterogeneous resources, and provides a runtime framework to support a variety of adaptive optimizations in dynamic environments. This paper presents an initial design and analysis of CASPER for the synthetic aperture radar and computational fluid dynamics domains.
ISBN: (Print) 9781728180991
AI-powered edge devices currently lack the ability to adapt their embedded inference models to the ever-changing environment. To tackle this issue, Continual Learning (CL) strategies aim at incrementally improving the decision capabilities based on newly acquired data. In this work, after quantifying the memory and computational requirements of CL algorithms, we define a novel HW/SW extreme-edge platform featuring a low-power RISC-V octa-core cluster tailored for on-demand incremental learning over locally sensed data. The presented multi-core HW/SW architecture achieves a peak performance of 2.21 and 1.70 MAC/cycle, respectively, when running the forward and backward steps of the gradient descent. We report the trade-off between memory footprint, latency, and accuracy for learning a new class with Latent Replay CL when targeting an image classification task on the CORe50 dataset. For a CL setting that retrains all the layers, taking 5 hours to learn a new class and achieving up to 77.3% precision, a more efficient solution retrains only part of the network, reaching an accuracy of 72.5% with a memory requirement of 300 MB and a computation latency of 1.5 hours. On the other hand, retraining only the last layer results in the fastest (867 ms) and least memory-hungry (20 MB) solution, but it scores only 58% on the CORe50 dataset. Thanks to the parallelism of the low-power cluster engine, our HW/SW platform is 25x faster than a typical MCU device, on which CL is still impractical, and demonstrates an 11x gain in energy consumption with respect to mobile-class solutions.