ISBN (digital): 9781665451550
ISBN (print): 9781665451550
Version control systems (VCS) are tools used to track and manage the changes made to a set of files over time. Among the VCS tools available today, Git has become the most popular for software development. Because it is used in everything from small personal projects of a few megabytes to massive corporate repositories with more than 300 GB and 3.5 million files, speed and scalability are among the tool's top priorities. However, its performance sometimes falls short of what is desired on networked file systems (e.g. NFS), where input and output (I/O) operations tend to be more costly. In particular, that is the case for the checkout command, which is responsible for restoring files from specific versions of a project. Despite the optimizations implemented over the years, the sequential processing of files still carried a large time penalty on NFS, and was suboptimal even for local file systems on SSDs. In this project, we worked to parallelize the Git checkout machinery, resulting in speedups of up to 4.5x on NFS and 3.6x on SSDs. We also studied how parallelism affects the I/O requests performed by checkout on different storage systems. The optimization was submitted upstream and made available to all Git users starting with version 2.32.0, released in June 2021.
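The actual parallel checkout lives in Git's C code base; the core idea of the abstract, fanning file-restoration work out across workers so I/O requests overlap, can nevertheless be sketched. The following Python snippet is purely illustrative (the function names and the thread-pool strategy are assumptions, not Git's implementation):

```python
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def write_entry(root, relpath, contents):
    """Restore one file: create parent directories, then write the contents."""
    path = os.path.join(root, relpath)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w") as f:
        f.write(contents)
    return relpath

def parallel_checkout(root, entries, workers=4):
    """Write all (path, contents) entries concurrently, checkout-style."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda e: write_entry(root, *e), entries))

root = tempfile.mkdtemp()
entries = [(f"src/file{i}.txt", f"blob {i}\n") for i in range(16)]
done = parallel_checkout(root, entries)
sample = open(os.path.join(root, "src/file3.txt")).read()
```

Overlapping many small writes is precisely what pays off on NFS, where each operation carries network latency.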
ISBN (digital): 9781665466189
ISBN (print): 9781665466189
In this paper, we present the use of a parallel-in-space simulation approach to accelerate the dynamic simulation of power systems with high penetration of distributed generation, i.e., a high number of power electronic devices. The approach is implemented using the OpenCL framework and executed on a graphics processing unit (GPU). We benchmark our prototype implementation using a distribution network with an increasing number of distributed generators, each modeled as a voltage source inverter. Results show that the computation time for the distributed-generator solution stays nearly constant as the number of distributed generators in the network increases.
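The reason the per-step cost can stay nearly constant is that, under parallel-in-space decomposition, each device model depends only on its own state and its local node voltage, so all devices advance in one batched operation. The sketch below uses a trivial first-order stand-in model (not the paper's inverter model, and numpy vectorization rather than OpenCL) just to show that shape of computation:

```python
import numpy as np

def step_devices(v_nodes, device_state, dt):
    """Advance all distributed generators one time step in a single batch.

    Parallel-in-space: no device depends on another device's state, so the
    update is one vectorized (GPU-friendly) operation, not a per-device loop.
    """
    tau = 0.05  # assumed time constant of the stand-in first-order model
    return device_state + (v_nodes - device_state) * (dt / tau)

n_devices = 10_000
v_nodes = np.ones(n_devices)   # local node voltages seen by each device
state = np.zeros(n_devices)
for _ in range(100):           # time-stepping loop; work per step is batched
    state = step_devices(v_nodes, state, dt=1e-3)
```

On a GPU, the batch maps to one kernel launch per step, which is why device count barely affects wall-clock time until the hardware saturates.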
ISBN (print): 9781665485005
Contemporary microprocessors are among the more complex digital systems designed by humans. Today's microprocessor chips contain several different building blocks, including processor cores, memory, and interface logic. Thanks to Gordon Moore's prediction [4] that transistor density would double every 18 months, and to Dennard scaling [8], which observed that transistor power density remains constant as the physical dimensions of MOSFETs shrink, microprocessor clock speed increased steadily until 2003. However, CPU clock frequency cannot be raised beyond certain limits, due to the significant increase in power dissipation and, consequently, in heat. With this fact in mind, microprocessor manufacturers have designed new types of microprocessors, referred to as multi- and many-core processors. Multicore technology has become ubiquitous today, present in most personal and embedded computers. The hardware structure of computers based on the multi-/many-core concept requires parallel programming. This approach brings two evident benefits: the first is scalable computer performance, and the second is the ability to process data at large scale. However, creating efficient parallel programs requires of programmers a solid understanding of new computational principles, algorithms, and programming tools. This article may help a reader aiming to understand both the architecture of modern microprocessor systems and their programming, from single-core to multicore.
ISBN (digital): 9789811982347
ISBN (print): 9789811982330; 9789811982347
The implementation of parallel applications is always a challenge. It embraces many distinctive design decisions that must be taken. The paper presents issues of parallel processing with the use of .NET applications and popular database management systems. In the paper, three design dilemmas are addressed: how efficient is the auto-parallelism implemented in the .NET TPL library, how do popular DBMSes differ in serving parallel requests, and what is the optimal size of data chunks in the data-parallelism scheme. All of them are analyzed in the context of a typical, practical business case originating from IT solutions dedicated to energy-market participants. The paper presents the results of experiments conducted in a controlled, on-premises environment. The experiments allowed us to compare the performance of TPL auto-parallelism with a wide range of manually set numbers of worker threads. They also helped to evaluate four DBMSes (Oracle, MySQL, PostgreSQL, and MSSQL) in the scenario of serving parallel queries. Finally, they showed the impact of data chunk sizes on the overall performance.
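The chunk-size dilemma the paper studies is easy to state concretely: the result is independent of how the data is partitioned, but throughput is not, because each chunk carries fixed scheduling and round-trip overhead. A minimal Python sketch of the data-parallelism scheme (the workload here is a stand-in for a batched database request, not the paper's actual .NET/TPL code):

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Stand-in for one batched request (e.g. a DB upsert of `chunk` rows).
    return sum(x * x for x in chunk)

def chunked(data, size):
    """Split data into chunks of at most `size` items: the unit of parallelism."""
    return [data[i:i + size] for i in range(0, len(data), size)]

def parallel_sum_squares(data, chunk_size, workers=4):
    chunks = chunked(data, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(process_chunk, chunks))

data = list(range(1000))
# Any chunk size gives the same answer; only per-chunk overhead differs.
small = parallel_sum_squares(data, chunk_size=10)    # 100 chunks
large = parallel_sum_squares(data, chunk_size=250)   # 4 chunks
```

Too-small chunks amplify per-request overhead; too-large chunks starve some workers. The paper's experiments locate the sweet spot for its DBMS workload empirically.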
ISBN (digital): 9781665453462
ISBN (print): 9781665453462
Real-time gang task scheduling has received much recent attention due to the emerging trend of applying highly parallel accelerators (e.g., GPUs) and parallel programming models (e.g., OpenMP) in many real-time computing domains. However, existing work on gang task scheduling mainly focuses on the preemptive case, which is somewhat at odds with the non-preemptive manner in which gang scheduling techniques are applied in practice. In this paper, we present a set of non-trivial techniques for analyzing the schedulability of a hard real-time sporadic gang task system under non-preemptive GEDF on multiprocessors. A first-of-its-kind utilization-based schedulability test is derived and shown to be rather effective via experiments. Interestingly, for the special case where each gang task degenerates into an ordinary sporadic task, experiments show that our test improves schedulability by 75% on average over a state-of-the-art utilization-based test designed for non-preemptive scheduling of ordinary sporadic tasks on multiprocessors.
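The abstract does not reproduce the paper's test itself, but the setting can be made concrete. A gang task occupies m_i processors simultaneously for up to C time units every period T, so its processor demand is m_i * C / T, and the total demand must fit on the m processors. The check below is only the standard *necessary* feasibility condition, not the paper's utilization-based test:

```python
def gang_utilization(tasks):
    """Total processor demand of a gang task set.

    Each task is (C, T, m_i): worst-case execution time C, period T,
    and gang size m_i (processors occupied simultaneously while running).
    """
    return sum(m_i * c / t for (c, t, m_i) in tasks)

def necessary_feasibility(tasks, m):
    """Necessary (not sufficient) condition: total demand fits on m processors."""
    return gang_utilization(tasks) <= m

tasks = [(2.0, 10.0, 4), (3.0, 15.0, 2), (5.0, 20.0, 1)]
ok_on_8 = necessary_feasibility(tasks, m=8)  # demand 0.8 + 0.4 + 0.25 = 1.45
ok_on_1 = necessary_feasibility(tasks, m=1)  # 1.45 > 1: infeasible
```

A sufficient test like the paper's must additionally bound the blocking that non-preemptivity introduces, which is where the non-trivial analysis lies.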
ISBN (print): 9781665454445
The rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science. The HPL-AI benchmark is designed to measure the performance of mixed-precision arithmetic, as opposed to the HPL benchmark, which measures double-precision performance. Pushing the limits of systems at extreme scale is nontrivial; little public literature explores optimization of mixed-precision computations at this scale. In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and a guideline of optimization strategies employed to deliver portable performance on both AMD and NVIDIA GPUs at extreme scale.
ISBN (print): 9781450394451
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP and MPI), languages (e.g., C++ and CUDA), and specialized runtimes (e.g., Charm++ and Legion). On the other hand, task parallelism has been shown to be an efficient and seamless programming model for clusters. This paper introduces OpenMP Cluster (OMPC), a task-parallel model that extends OpenMP for cluster programming. OMPC leverages OpenMP's offloading standard to distribute annotated regions of code across the nodes of a distributed system. To achieve this, it hides MPI-based data distribution and load-balancing mechanisms behind OpenMP task dependencies. Given its compliance with OpenMP, OMPC allows applications to use the same programming model to exploit intra- and inter-node parallelism, thus simplifying development and maintenance. We evaluated OMPC using Task Bench, a synthetic benchmark focused on task parallelism, comparing its performance against other distributed runtimes. Experimental results show that OMPC can deliver up to 1.53x and 2.43x better performance than Charm++ on CCR and scalability experiments, respectively. Experiments also show that OMPC performance weakly scales for both Task Bench and a real-world seismic imaging application.
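The load-bearing idea in OMPC is that declared task dependencies, not explicit messages, drive when and where work runs. OMPC itself is C/C++ OpenMP with `task depend` clauses; as a language-neutral analogue, the Python sketch below builds a small dependency-driven task graph in which each task starts only after the tasks it depends on have completed (all names here are illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

def run_graph():
    """Execute a 4-node task DAG; dependencies gate each task's start."""
    results = {}
    with ThreadPoolExecutor(max_workers=4) as pool:
        def task(name, deps, fn):
            def body():
                # Block until every dependency has produced its value,
                # analogous to OpenMP's depend(in: ...) clauses.
                inputs = [results[d].result() for d in deps]
                return fn(*inputs)
            results[name] = pool.submit(body)

        task("a", [], lambda: 2)
        task("b", [], lambda: 3)
        task("c", ["a", "b"], lambda x, y: x * y)  # joins a and b
        task("d", ["c"], lambda z: z + 1)
        return results["d"].result()

answer = run_graph()
```

In OMPC the same dependency information additionally tells the runtime which data must move between nodes, which is how the MPI traffic stays hidden from the programmer.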
ISBN (digital): 9781665497862
ISBN (print): 9781665497862
We propose an efficient, distributed, out-of-memory implementation of the truncated singular value decomposition (t-SVD) for heterogeneous high performance computing (HPC) systems. Various implementations of SVD have been proposed, most of which estimate only the singular values, since estimating the singular vectors can significantly increase the time and memory complexity of the algorithm. In this work, we propose an implementation of SVD based on the power method, which estimates truncated singular values and singular vectors. Memory-utilization bottlenecks in the power method used to decompose a matrix A are typically associated with the computation of the Gram matrix A^T A, which can be significant when A is large and dense, or when A is extremely large and sparse. The proposed implementation is optimized for out-of-memory problems, where the memory required to factorize a given matrix is greater than the available GPU memory. We reduce the memory complexity of A^T A by using a batching strategy in which the intermediate factors are computed block by block, and we hide the I/O latency associated with both host-to-device (H2D) and device-to-host (D2H) batch copies by overlapping each batch copy with compute using CUDA streams. Furthermore, we use optimized NCCL-based communicators to reduce the latency associated with collective communications (both intra-node and inter-node). In addition, sparse and dense matrix multiplications are significantly accelerated with GPU cores (or tensor cores, when available), resulting in an implementation with good scaling. We demonstrate the scalability of our distributed out-of-memory SVD algorithm by successfully decomposing a dense matrix of size 1 TB and a sparse matrix of 1e-6 sparsity whose dense representation would occupy 128 PB.
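The batching strategy rests on a simple identity: if A is split into row blocks A_1, ..., A_k, then A^T A = Σ_i A_i^T A_i, so only one block need be resident at a time. A minimal numpy sketch of that accumulation (the CUDA-stream overlap and NCCL collectives of the actual implementation are only indicated in comments):

```python
import numpy as np

def gram_batched(A, batch_rows):
    """Compute A^T A block by block: A^T A = sum_i A_i^T A_i over row blocks.

    Only one batch of rows is needed in (GPU) memory at a time, which is
    what makes the out-of-memory variant possible for matrices larger
    than device memory.
    """
    n = A.shape[1]
    G = np.zeros((n, n), dtype=A.dtype)
    for start in range(0, A.shape[0], batch_rows):
        Ai = A[start:start + batch_rows]  # real code: one H2D batch copy
        G += Ai.T @ Ai                    # accumulate the partial Gram matrix
    return G

rng = np.random.default_rng(1)
A = rng.standard_normal((1000, 32))
G = gram_batched(A, batch_rows=128)
```

Because each partial product is independent, the H2D copy of batch i+1 can overlap the multiply of batch i, which is exactly what the CUDA-stream pipelining in the paper exploits.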
ISBN (print): 9781665451857
Fortran DO CONCURRENT has emerged as a new way to achieve parallel execution of loops on CPUs and GPUs. This paper studies the performance portability of this construct on a range of processors and compares it with the incumbent models: OpenMP, OpenACC and CUDA. To do this study fairly, we implemented the BabelStream memory bandwidth benchmark from scratch, entirely in modern Fortran, for all of the models considered: Fortran DO CONCURRENT, two variants of OpenACC, four variants of OpenMP (two CPU and two GPU), CUDA Fortran, and both loop- and array-based references. BabelStream Fortran matches the C++ implementation as closely as possible and can be used to make language-based comparisons. This paper represents one of the first detailed studies of the performance of Fortran support on heterogeneous architectures; we include results for AArch64 and x86_64 CPUs as well as AMD, Intel and NVIDIA GPU platforms.
Usage of multiprocessor and multicore computers implies parallel programming. Tools for preparing parallel programs include parallel languages and libraries, as well as parallelizing compilers and converters that can perform automatic parallelization. The basic approach to parallelism detection is analysis of data dependencies and of the properties of program components, including data use and predicates. In this article, a suite of used-data and predicate sets for program components is proposed, and an algorithm for computing these sets is suggested. The algorithm is based on wave propagation on graphs with cycles and labelling. This method allows analysing complex program components, improving data localization, and thus providing enhanced data-parallelism detection.
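The abstract does not spell out the wave-propagation algorithm, but its general shape is the classic fixed-point propagation of sets over a graph with cycles: each node's set is seeded with local information and repeatedly merged with its neighbors' sets until nothing changes. The Python sketch below is a hypothetical illustration of that scheme (propagating used-variable sets backward over a control-flow graph), not the paper's algorithm:

```python
def propagate_used_sets(nodes, edges, local_use):
    """Propagate used-variable sets over a CFG until a fixed point.

    nodes: node ids; edges: (src, dst) pairs; local_use: node -> set of
    variables used directly in that node. A node's set is its local uses
    plus everything used by its successors; the information 'waves'
    through the graph, cycles included, and terminates because sets
    only grow and are bounded.
    """
    succ = {n: [] for n in nodes}
    for s, d in edges:
        succ[s].append(d)
    used = {n: set(local_use.get(n, ())) for n in nodes}
    changed = True
    while changed:
        changed = False
        for n in nodes:
            if succ[n]:
                new = used[n].union(*(used[m] for m in succ[n]))
                if new != used[n]:
                    used[n] = new
                    changed = True
    return used

# A small CFG with a loop: 1 -> 2 -> 3 -> 2, loop exit 2 -> 4.
nodes = [1, 2, 3, 4]
edges = [(1, 2), (2, 3), (3, 2), (2, 4)]
used = propagate_used_sets(nodes, edges, {1: {"a"}, 2: {"i"}, 3: {"x"}, 4: {"y"}})
```

Knowing, for each component, the full set of data it touches is what lets a parallelizer prove two components independent and schedule them concurrently.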