ISBN: 9798400717901 (print)
The proceedings contain 22 papers. The topics discussed include: towards efficient OpenCL pipe specification for hardware accelerators; SimSYCL: a SYCL implementation targeting development, debugging, simulation and conformance; experiences with implementing Kokkos’ SYCL backend; optimization and evaluation of breadth first search with oneAPI/SYCL on Intel FPGAs: from describing algorithms to describing architectures; improving performance portability of the procedurally generated high energy physics event generator MadGraph using SYCL; unlocking performance portability on LUMI-G supercomputer: a virtual screening case study; evaluation of SYCL’s different data parallel kernels; smoothing the migration from CUDA to SYCL: SYCLomatic utility features; and optimization of fast Fourier transform (FFT) for Qualcomm Adreno graphics processing unit.
ISBN: 9781450399333 (print)
TCP throughput and RTT prediction are essential to model TCP behavior and optimize network configurations. Flows adapt their sending rate to network parameters like link capacity or buffer size and interact with parallel flows. The elastic behavior of TCP congestion control, in particular, can vary even when the network changes only slightly. Thus, existing analytical models for TCP behavior reach their limits due to the number and complexity of different algorithms. Machine learning approaches, in contrast, are often fixed to specific network topologies. This paper presents a TCP bandwidth and RTT prediction approach that can handle different algorithms and topologies. For this, we utilize Gated Graph Neural Networks and simulated network traffic. We evaluate different encodings of the input data into graphs and how network size, number of flows, and TCP algorithms influence prediction accuracy. Additionally, we quantify the impact of different input features on our models. We show that Graph Neural Networks can be used to model TCP behavior. The resulting models can predict RTT with a median relative error of 2.29% and throughput with an error of 13.31%.
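The accuracy figures above are relative errors. As a minimal sketch of how a median relative error could be computed, assuming it is defined as the median of |predicted - actual| / |actual| over paired samples (the abstract does not spell out the definition), consider the following. The function name and the RTT values are hypothetical and invented purely for illustration.

```python
from statistics import median


def median_relative_error(predicted, actual):
    """Median of |predicted - actual| / |actual| over paired samples.

    Assumed definition; the abstract does not state the exact formula.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    errors = [abs(p - a) / abs(a) for p, a in zip(predicted, actual)]
    return median(errors)


# Hypothetical RTT samples in milliseconds (not data from the paper).
actual_rtt = [20.0, 50.0, 100.0, 200.0]
predicted_rtt = [21.0, 49.0, 100.0, 190.0]

if __name__ == "__main__":
    print(f"{median_relative_error(predicted_rtt, actual_rtt):.2%}")
```

The median is less sensitive to a few badly predicted flows than the mean, which may be one reason to report it for skewed error distributions.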
ISBN: 9781450384414 (print)
In this paper we present our experience implementing domain decomposition preconditioners on vector architectures. In particular, we will focus on the solution of unstructured network equations arising from electrical power systems by preconditioning iterative algorithms with the Additive Schwarz Method (ASM). The implementation will be carried out using the Julia programming language, which allows for easy prototyping and interfacing with GPU architectures thanks to its multiple dispatch features. In our experiments, we will show the trade-off between device throughput and convergence of the iterative algorithm as the size of the domain varies, and determine optimal fronts of computational performance.
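To make the Additive Schwarz idea concrete, here is a minimal sketch in plain Python rather than the Julia/GPU setting the abstract describes. It assumes a small 1D Laplacian as a stand-in for the power-system network equations and two hand-picked overlapping index sets as subdomains; all function names are hypothetical. The preconditioner applies z = sum_i R_i^T (R_i A R_i^T)^{-1} R_i r, and is used inside a preconditioned conjugate gradient loop.

```python
import math


def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def matvec(a, x):
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def asm_apply(a, subdomains, r):
    """Additive Schwarz: sum the local solves over overlapping index sets."""
    z = [0.0] * len(r)
    for idx in subdomains:
        a_loc = [[a[i][j] for j in idx] for i in idx]  # R_i A R_i^T
        z_loc = solve_dense(a_loc, [r[i] for i in idx])
        for i, zi in zip(idx, z_loc):
            z[i] += zi  # R_i^T scatters the local correction back
    return z


def pcg(a, b, subdomains, tol=1e-10, max_iter=200):
    """Conjugate gradient preconditioned with asm_apply."""
    x = [0.0] * len(b)
    r = b[:]
    z = asm_apply(a, subdomains, r)
    p = z[:]
    rz = dot(r, z)
    for it in range(1, max_iter + 1):
        ap = matvec(a, p)
        alpha = rz / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        if math.sqrt(dot(r, r)) < tol:
            return x, it
        z = asm_apply(a, subdomains, r)
        rz_new = dot(r, z)
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, max_iter


def laplacian_1d(n):
    """Tridiagonal (-1, 2, -1) matrix, used here only as a toy SPD system."""
    return [[2.0 if i == j else -1.0 if abs(i - j) == 1 else 0.0
             for j in range(n)] for i in range(n)]


if __name__ == "__main__":
    a = laplacian_1d(8)
    b = matvec(a, [1.0] * 8)                 # exact solution is all ones
    subdomains = [list(range(0, 5)), list(range(3, 8))]  # overlap on 3, 4
    x, iterations = pcg(a, b, subdomains)
    print(iterations, [round(v, 6) for v in x])
```

The local solves are independent of one another, which is what makes the method attractive for vector and GPU hardware. Larger subdomains generally reduce the iteration count but make each local solve more expensive, which is the trade-off the abstract refers to.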
ISBN: 9781665411400 (print)
We present a scalable algorithm that computes the transitive closure of a graph on shared memory architectures using the OpenMP API in C++. We describe two parallelization strategies and compare their performance on several datasets of varying sizes. We demonstrate the scalability of the best parallel implementation up to 176 threads on a shared memory architecture, by producing a graph with more than 3.82 trillion edges. To the best of our knowledge, this is the first implementation that has computed the transitive closure of such a large graph on a shared memory system. We discuss optimization strategies for better cache utilization on large datasets. We also analyze load imbalance and discuss in detail how choosing the OpenMP scheduling clause mitigates it.
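For readers unfamiliar with the problem, the following is a minimal sequential sketch of transitive closure using Warshall's algorithm over bitmask rows. It is written in Python for brevity and is not the paper's OpenMP/C++ algorithm; the function names and the example graph are hypothetical. For a fixed `k`, the updates to different rows `i` are independent, which is the kind of loop a shared-memory parallelization could distribute across threads.

```python
def transitive_closure(n, edges):
    """Return reach, where bit j of reach[i] is set if j is reachable from i."""
    reach = [0] * n
    for u, v in edges:
        reach[u] |= 1 << v
    for k in range(n):
        bit_k = 1 << k
        row_k = reach[k]
        # Row updates for a fixed k do not depend on each other.
        for i in range(n):
            if reach[i] & bit_k:
                reach[i] |= row_k
    return reach


def closure_edges(n, edges):
    """List the closure as sorted (source, target) pairs."""
    reach = transitive_closure(n, edges)
    return [(i, j) for i in range(n) for j in range(n) if reach[i] >> j & 1]


if __name__ == "__main__":
    # Hypothetical chain graph 0 -> 1 -> 2 -> 3.
    print(closure_edges(4, [(0, 1), (1, 2), (2, 3)]))
```

Rows can differ widely in how many bits they hold, so the work per row is uneven; this is one way load imbalance of the kind the abstract mentions can arise.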
ISBN: 9781665451857 (print)
The proceedings contain 14 papers. The topics discussed include: ML-based performance portability for time-dependent density functional theory in HPC environments; a comprehensive evaluation of novel AI accelerators for deep learning workloads; frontier vs the exascale report: why so long? and are we really there yet?; evaluating ISO C++ parallel algorithms on heterogeneous HPC systems; going green: optimizing GPUs for energy efficiency through model-steered auto-tuning; performance analysis with unified hardware counter metrics; a methodology for evaluating tightly-integrated and disaggregated accelerated architectures; WfBench: automated generation of scientific workflow benchmarks; high-performance GMRES multi-precision benchmark: design, performance, and challenges; and an initial evaluation of Arm’s scalable matrix extension.
ISBN: 9781450384414 (print)
The Parallel Random Access Machine (PRAM) abstraction is the simplest and most elegant algorithmic model for the design and analysis of parallel algorithms. It consists of different models categorized based on the underlying memory access mode used, the most powerful of which is the Concurrent Read Concurrent Write (CRCW) model. A PRAM algorithm describes a series of rounds, each of which consists of a collection of operations that can be executed concurrently within the same time step. However, the lack of support for concurrent memory accesses and the prevalence of asynchronous programming models led to the belief that implementing CRCW PRAM algorithms is unattainable and prompted many to avoid this model except for theoretical studies of optimal performance. In this work, we study the arbitrary and common concurrent writes in the CRCW PRAM model and explore implementation challenges on general-purpose systems. Moreover, we examine current practices for implementing common/arbitrary concurrent writes and propose a new efficient, lightweight, and thread-safe method to implement concurrent writes by leveraging atomic instructions. To demonstrate the efficacy of our method, we developed OpenMP kernels for classical CRCW PRAM algorithms and provide experimental results and comparisons based on run time performance measured on the x86 multicore architecture. Our results show a speedup of up to 4.5x over current practices across all our benchmarks.
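The "arbitrary" concurrent-write rule says that when several processors write to the same cell in one round, any single one of them may succeed. The sketch below illustrates that rule with a generic first-writer-wins gate; it is not the method proposed in the paper, whose details the abstract does not give. Python exposes no user-level atomic instructions, so a `threading.Lock` stands in here for the atomic compare-and-swap that native code would use; the class and function names are hypothetical.

```python
import threading


class WriteOnceCell:
    """Memory cell that accepts exactly one write per round.

    The lock emulates an atomic compare-and-swap on a 'written' flag;
    this emulation is an assumption made for the sake of a Python example.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._written = False
        self.value = None

    def try_write(self, value):
        """Return True if this call performed the write, False otherwise."""
        with self._lock:
            if self._written:
                return False
            self._written = True
            self.value = value
            return True


def arbitrary_write_round(candidates):
    """Run one round in which every candidate value competes for one cell."""
    cell = WriteOnceCell()
    winners = []

    def worker(value):
        if cell.try_write(value):
            winners.append(value)

    threads = [threading.Thread(target=worker, args=(v,)) for v in candidates]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return cell.value, winners


if __name__ == "__main__":
    value, winners = arbitrary_write_round([10, 20, 30, 40])
    print(value, winners)  # which value wins depends on thread scheduling
```

Under the "common" rule, all writers are required to write the same value, so the gate is only an optimization there: it lets one thread write and the rest skip the store.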
Recent additions to the C++ standard and ongoing standardization efforts aim to add data-parallel types to the C++ standard library. This enables the use of vectorization techniques in existing C++ codes without havin...
ISBN: 9780738110905 (print)
The proceedings contain 8 papers. The topics discussed include: accelerating domain propagation: an efficient GPU-parallel algorithm over sparse matrices; parallelizing irregular computations for molecular docking; reducing queuing impact in irregular data streaming applications; supporting irregularity in throughput-oriented computing by SIMT-SIMD integration; DistDGL: distributed graph neural network training for billion-scale graphs; labeled triangle indexing for efficiency gains in distributed interactive subgraph search; distributed memory graph coloring algorithms for multiple GPU; and performance evaluation of the vectorizable binary search algorithms on an FPGA platform.
ISBN: 9781450375450 (print)
The proceedings contain 5 papers. The topics discussed include: sparse matrix-dense matrix multiplication on heterogeneous CPU+FPGA embedded system; run-time power modeling in embedded GPUs with dynamic voltage and frequency scaling; fault-tolerant online scheduling algorithms for CubeSats; an OpenMP parallel genetic algorithm for design space exploration of heterogeneous multi-processor embedded systems; and automated precision tuning in activity classification systems: a case study.