ISBN:
(Print) 9798350326598; 9798350326581
Existing tiled manycore architectures propose to convert abundant silicon resources into general-purpose parallel processors with unmatched computational density and programmability. However, as we approach 100K cores in one chip, conventional manycore architectures struggle to navigate three key axes: scalability, programmability, and density. Many manycores sacrifice programmability for density, or scalability for programmability. In this paper, we explore HammerBlade, which simultaneously achieves scalability, programmability, and density. HammerBlade is a fully open-source RISC-V manycore architecture, which has been silicon-validated with a 2048-core ASIC implementation using a 14/16nm process. We evaluate the system using a suite of parallel benchmarks that captures a broad spectrum of computation and communication patterns.
ISBN:
(Print) 9798350364613; 9798350364606
New algorithms for embedding graphs have reduced the asymptotic complexity of finding low-dimensional representations. One-Hot Graph Encoder Embedding (GEE) uses a single, linear pass over edges and produces an embedding that converges asymptotically to the spectral embedding. The scaling and performance benefits of this approach have been limited by a serial implementation in an interpreted language. We refactor GEE into a parallel program in the Ligra graph engine that maps functions over the edges of the graph and uses lock-free atomic instructions to prevent data races. On a graph with 1.86 billion edges, this results in a 500 times speedup over the original implementation and a 17 times speedup over a just-in-time compiled version.
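The core GEE computation described above — a single linear pass over edges that accumulates class-normalized neighbor counts — can be sketched in a few lines. This is an illustrative NumPy sketch, not the paper's Ligra code; the function name and normalization details are assumptions based on the abstract's description.

```python
import numpy as np

def gee_embed(edges, labels, n_classes):
    """One-pass graph encoder embedding (illustrative sketch):
    each vertex's embedding row counts its neighbors per class,
    normalized by class size."""
    n = len(labels)
    class_size = np.bincount(labels, minlength=n_classes)
    Z = np.zeros((n, n_classes))
    for u, v in edges:  # single linear pass over the edge list
        Z[u, labels[v]] += 1.0 / class_size[labels[v]]
        Z[v, labels[u]] += 1.0 / class_size[labels[u]]
    return Z
```

In the parallel refactoring, this per-edge loop becomes a function mapped over edges by the graph engine, with the two accumulations performed via lock-free atomic adds so concurrent updates to the same row do not race.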
ISBN:
(Print) 9798350364613; 9798350364606
Modern Python programs in high-performance computing call into compiled libraries and kernels for performance-critical tasks. However, effectively parallelizing these finer-grained, and often dynamic, kernels across modern heterogeneous platforms remains a challenge. First, we perform an experimental study to examine the impact of Python's Global Interpreter Lock (GIL), and potential speedups under a GIL-less PEP 703 future, to guide runtime design. Using our optimized runtime, we explore scheduling tasks with constraints that require resources across multiple, potentially diverse, devices through the introduction of new programming abstractions and runtime mechanisms. We extend an existing Python tasking library, Parla, to augment its performance and add support for such multi-device tasks. Our experimental analysis, using task graphs from synthetic and real applications, shows at least a 3x (and up to 6x) performance improvement over its predecessor in scenarios with high GIL contention. When scheduling multi-GPU tasks, we observe a 4x reduction in per-task launching overhead compared to a multi-process system.
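The GIL contention the study measures arises when tasks are launched on threads: thread-based launching is cheap, but any pure-Python portion of a task serializes on the interpreter lock, while a compiled kernel can release it. A minimal sketch of thread-based task launching (illustrative only; Parla's actual runtime is far richer) looks like:

```python
from concurrent.futures import ThreadPoolExecutor

def kernel(n):
    # Stand-in for a task body. Pure-Python work like this holds the
    # GIL; a real compiled kernel would release it while computing.
    return sum(i * i for i in range(n))

def run_tasks(sizes, workers=4):
    # Thread-based launch: low per-task overhead, but CPU-bound Python
    # code contends on the GIL -- the effect the study quantifies.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(kernel, sizes))
```

Under a GIL-less (PEP 703) interpreter, the same thread-based launcher could run the pure-Python portions in parallel as well, which is why the authors study it to guide runtime design.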
ISBN:
(Print) 9783031695827; 9783031695834
Stream processing systems must often cope with workloads varying in content, format, size, and input rate. The high variability and unpredictability make statically fine-tuning them very challenging. Our work addresses this limitation by providing a new framework and run-time system to simplify implementing and assessing new self-adaptive algorithms and optimizations. We implement a prototype on top of MPI called MPR and show its functionality. We focus on horizontal scaling by supporting the addition and removal of processes during execution time. Experiments reveal that MPR can achieve performance similar to that of a handwritten static MPI application. We also assess MPR's adaptation capabilities, showing that it can readily re-configure itself, with the help of a self-adaptive algorithm, in response to workload variations.
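The horizontal-scaling behavior described above hinges on a self-adaptive policy that decides when to add or remove processes as the input rate varies. The abstract does not give MPR's algorithm, so the following is only a hypothetical sketch of such a feedback rule, with invented parameter names:

```python
import math

def rescale(input_rate, service_rate_per_proc,
            min_procs=1, max_procs=64, headroom=1.2):
    """Return the process count needed to keep up with the observed
    input rate, with headroom to absorb bursts (illustrative policy)."""
    needed = math.ceil(headroom * input_rate / service_rate_per_proc)
    return max(min_procs, min(max_procs, needed))
```

A runtime like the one described would evaluate such a rule periodically and then add or remove MPI processes to match, which is the re-configuration capability the experiments assess.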
ISBN:
(Print) 9798331528492; 9798331528485
Gantt charts are frequently used to explore execution traces of large-scale parallel programs. In these visualizations, each parallel processor is assigned a row showing the computation state of a processor at a particular time. Lines are drawn between rows to show communication between these processors. When drawn to align equivalent calls across rows, visual patterns can emerge reflecting communication behavior of the executing code. However, though these patterns have the same definition at any scale, they can be obscured by the density of rendered lines when displaying more than a few hundred processors. We seek to understand the effectiveness of various strategies for recognizing these patterns in Gantt charts. Specifically, we conduct a pre-registered user study comparing recognition of patterns when viewing all processors, a subset of processors, or a set of abstracted glyphs overlaid on the chart. We find that all strategies have limitations when scaling, motivating further designs. Our results further indicate that for simple patterns, the glyphs are more effective in general pattern recognition while the zoomed subsets provide nuance to specific characteristics, such as offsets, in patterns. These results suggest the development of a combined approach may be appropriate to enable pattern comprehension in large-scale Gantt charts.
ISBN:
(Print) 9798350356045; 9798350356038
Some of the fastest CUDA codes contain "benign" data races to boost their performance. However, such races can lead to unpredictable behavior and incorrect results on other hardware and compilers, making their elimination crucial for producing reliable and portable programs. This paper investigates the performance impact of removing data races from six high-end graph analytics codes. We identify and eliminate the races from these GPU programs by adding necessary synchronization and validating their correctness. We present our race-free codes and their original versions as an open-source suite. Comparing the performance of our new codes with their baseline counterparts on multiple inputs and GPUs, we observe that race-free implementations do not always incur a performance penalty. In fact, some race-free versions are faster, with our validated maximal independent set implementation achieving a 5-11% speedup. Our findings indicate that race-free code can reach comparable or even superior performance, supporting the adoption of best practices for parallel programming.
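The fix the paper applies — adding the synchronization a racy update is missing — is the same idea at any level of the stack. In the GPU codes this means atomics or barriers; here is a thread-level analogue in Python (illustrative only, not the paper's CUDA code), where a lock turns an unsynchronized read-modify-write into a race-free one:

```python
import threading

class SafeCounter:
    """Race-free shared counter: the lock plays the role of the
    synchronization added to eliminate a 'benign' data race."""
    def __init__(self):
        self.value = 0
        self._lock = threading.Lock()

    def add(self, k):
        with self._lock:  # synchronized read-modify-write
            self.value += k

def count_parallel(n_threads=8, increments=10_000):
    c = SafeCounter()
    threads = [threading.Thread(
                   target=lambda: [c.add(1) for _ in range(increments)])
               for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return c.value  # always n_threads * increments with the lock held
```

Without the lock, the final count can silently come up short on some platforms — exactly the unpredictable behavior that motivates eliminating such races even when they appear benign on one hardware/compiler combination.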
ISBN:
(Print) 9798350364613; 9798350364606
We present two new assignments in the Peachy Parallel Assignments series of assignments for teaching parallel and distributed computing. Submitted assignments must have been successfully used previously and are selected for being easy for other instructors to adopt and for being "cool and inspirational" so that students spend time on them and talk about them with others. The first assignment in this paper familiarizes students with the RAFT library for performing GPU-accelerated computation, part of the RAPIDS AI ecosystem. Students use this library to accelerate a Radius Nearest Neighbor computation, finding all points within a given distance from a query point. In the second assignment, students parallelize a bird flocking simulation using OpenMP or OpenACC. It is a visual assignment which allows students to readily see the performance improvement.
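The Radius Nearest Neighbor computation in the first assignment has a simple serial baseline: compute the distance from every point to the query and keep the indices within the radius. A brute-force NumPy sketch (illustrative; the assignment's accelerated version uses RAFT on the GPU) might look like:

```python
import numpy as np

def radius_neighbors(points, query, radius):
    """Return indices of all points within `radius` of `query`
    (brute-force baseline for the radius nearest neighbor task)."""
    d = np.linalg.norm(points - query, axis=1)  # Euclidean distances
    return np.nonzero(d <= radius)[0]
```

This O(n) scan over all points is exactly the kind of data-parallel kernel that maps naturally onto a GPU, which is what makes it a good vehicle for introducing a library like RAFT.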
ISBN:
(Print) 9783031521850; 9783031521867
In this paper, the Numba, JAX, CuPy, PyTorch, and TensorFlow Python GPU-accelerated libraries were benchmarked using scientific numerical kernels on an NVIDIA V100 GPU. The benchmarks consisted of a simple Monte Carlo estimation, a particle interaction kernel, a stencil evolution of an array, and tensor operations. The benchmarking procedure included general memory consumption measurements, a statistical analysis of scalability with problem size to determine the best libraries for the benchmarks, and a productivity measurement using source lines of code (SLOC) as a metric. It was statistically determined that the Numba library outperforms the rest on the Monte Carlo, particle interaction, and stencil benchmarks. The deep learning libraries show better performance on tensor operations. The SLOC count was similar for all the libraries except for Numba, which presented a higher SLOC count, implying more time is needed for code development.
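The simplest of the benchmarked kernels, a Monte Carlo estimation, can be written in a few lines with any of these libraries. A NumPy version estimating pi (illustrative; the paper's exact kernels and problem sizes are not given in the abstract) is:

```python
import numpy as np

def monte_carlo_pi(n, seed=0):
    """Estimate pi by sampling n points in the unit square and
    counting the fraction inside the quarter circle."""
    rng = np.random.default_rng(seed)
    x, y = rng.random(n), rng.random(n)
    inside = x * x + y * y <= 1.0
    return 4.0 * np.mean(inside)
```

Kernels of this shape port almost line-for-line across Numba, JAX, CuPy, PyTorch, and TensorFlow, which is what makes them a useful basis for comparing both performance and SLOC-measured productivity.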
ISBN:
(Print) 9789819772315; 9789819772322
Hybrid tabular-textual question answering (HTQA) involves tapping into a mosaic of data sources, traditionally managed through LSTM-based step-by-step reasoning, which has been vulnerable to exposure bias and subsequent error accumulation. This paper introduces an innovative parallel program generation method, ConcurGen, aiming to transform this paradigm by simultaneously formulating comprehensive program constructs that seamlessly blend operations and values. This approach not only rectifies the inherent pitfalls of sequential methodologies but also infuses efficiency into the process. When subjected to rigorous evaluation on benchmarks like the ConvFinQA and MultiHiertt datasets, our methodology showcased significant superiority over prevalent models such as FinQANet and MT2Net. This was evidenced by enhancements in various performance metrics, effectively raising the bar for what's deemed state-of-the-art. Notably, beyond setting these commendable benchmarks, our method facilitates a striking acceleration in program creation, achieving speeds nearly 21 times faster. Additionally, a salient feature of our approach becomes evident when numerical reasoning steps escalate: unlike traditional models, our system sustains its robust performance, emphasizing its adaptability and resilience in complex scenarios.
ISBN:
(Print) 9798350381993; 9798350382006
The main objective of this work is to bring supercomputing and parallel processing closer to non-specialized audiences by building a Raspberry Pi cluster, called Clupiter, which emulates the operation of a supercomputer. It consists of eight Raspberry Pi devices interconnected so that they can run jobs in parallel. To make it easier to show how it works, a web application has been developed. It allows launching parallel applications and accessing a monitoring system to see the resource usage while these applications are running. The NAS Parallel Benchmarks (NPB) are used as demonstration applications. From this web application, a couple of educational videos can also be accessed. They deal, in a very informative way, with the concepts of supercomputing and parallel programming.