ISBN (print): 9798350387117; 9798350387124
Future exascale systems will feature massive parallelism, many-core processors, and heterogeneous architectures. In this scenario, it is increasingly difficult for HPC applications to fully and efficiently utilize the resources in system nodes. Moreover, the increased parallelism exacerbates the effects of existing inefficiencies in current applications. Research has shown that co-scheduling applications to share system nodes, instead of executing each application exclusively, can increase resource utilization and efficiency. Nevertheless, current oversubscription and co-location techniques for sharing nodes have several drawbacks that limit their applicability and make them highly application-dependent. This paper presents co-execution through system-wide scheduling: a novel fine-grained technique to execute multiple HPC applications simultaneously on the same node, outperforming current state-of-the-art approaches. We implement this technique in nOS-V, a lightweight tasking library that supports co-execution through system-wide task scheduling. Moreover, nOS-V can be easily integrated with existing programming models, requiring no changes to user applications. We show how co-execution with nOS-V significantly reduces schedule makespan for several applications in different scenarios, outperforming prior node-sharing techniques.
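The fine-grained node sharing this abstract describes can be illustrated with a toy scheduler. The sketch below is not nOS-V's actual API: it assumes each application is just a list of task durations, and it places every ready task, regardless of which application owns it, on the earliest-free core of a shared node instead of giving each application a static core partition.

```python
import heapq

def co_schedule(apps, num_cores):
    """Greedy system-wide scheduler sketch: tasks from all applications
    feed one shared pool, and each task runs on the core that becomes
    free earliest (fine-grained node sharing)."""
    cores = [0.0] * num_cores          # time at which each core frees up
    heapq.heapify(cores)
    finish = {name: 0.0 for name, _ in apps}
    queues = [(name, list(tasks)) for name, tasks in apps]
    while any(q for _, q in queues):
        # Round-robin over applications so neither one starves.
        for name, q in queues:
            if not q:
                continue
            duration = q.pop(0)
            start = heapq.heappop(cores)   # earliest-free core
            end = start + duration
            heapq.heappush(cores, end)
            finish[name] = max(finish[name], end)
    return finish

# Two hypothetical applications with uneven task durations on a 4-core node.
apps = [("A", [4.0, 4.0, 1.0, 1.0]), ("B", [2.0, 2.0, 2.0, 2.0])]
print(co_schedule(apps, num_cores=4))
```

Because cores are handed to whichever task is ready next, short tasks from one application fill the gaps left by long tasks from the other, which is the intuition behind co-execution improving makespan over exclusive execution.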
Current programming models face challenges in dealing with modern supercomputers' growing parallelism and heterogeneity. Emerging programming models, like the task-based programming model found in the asynchronous...
Programming parallel applications for heterogeneous High Performance Computing platforms is easier when using the task-based programming paradigm, where a Directed Acyclic Graph (DAG) of tasks models the application behavior. The simplicity exists because a runtime, like StarPU, takes care of many activities usually carried out by the application developer, such as task scheduling, load balancing, and memory management. Memory management here refers to the runtime's responsibility for handling memory operations, like copying the necessary data to the location where a given task is scheduled to execute. Poor scheduling or a lack of appropriate information may lead to inadequate memory management by the runtime. Discovering whether an application suffers from memory-related performance problems is complex. Programmers of task-based applications and runtimes would benefit from specialized performance analysis methodologies that check for possible memory management problems. To that end, this work proposes methods and tools to investigate the heterogeneous CPU-GPU-disk memory management of the StarPU runtime, a popular task-based middleware for HPC applications. These methods are based on execution traces collected by the runtime. Such traces provide information about runtime decisions and system performance; however, even a simple application can produce huge amounts of trace data that must be analyzed and converted into useful metrics or visualizations. A methodology specific to task-based applications can lead to a better understanding of memory behavior and possible performance optimizations. The proposed strategies are applied to three different problems: a dense Cholesky solver, a CFD simulation, and a sparse QR factorization. On the dense Cholesky solver, the strategies found a problem in StarPU whose correction led to a 66% performance improvement. On the CFD simulation, the strategies guided the insertion of extra information on the DAG and d...
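As a rough illustration of the kind of trace-based memory analysis this abstract describes, the sketch below aggregates copy events into per-link busy time and achieved bandwidth, so links dominated by data movement stand out. The record format is invented for the example; StarPU's real traces (FXT/Paje) are far richer.

```python
# Hypothetical trace records: (event, link-or-device, start_s, end_s, bytes).
trace = [
    ("copy", "cpu->gpu0", 0.0, 0.4, 32 << 20),
    ("task", "gpu0",      0.4, 1.0, 0),
    ("copy", "gpu0->cpu", 1.0, 1.3, 32 << 20),
    ("task", "cpu",       0.2, 0.9, 0),
]

def memory_pressure(trace):
    """Sum transfer time and bytes per link, then report
    (busy seconds, achieved MiB/s) for each link."""
    stats = {}
    for event, where, start, end, nbytes in trace:
        if event != "copy":
            continue
        t, b = stats.get(where, (0.0, 0))
        stats[where] = (t + (end - start), b + nbytes)
    return {where: (t, b / t / 2**20) for where, (t, b) in stats.items()}

for link, (seconds, mibps) in memory_pressure(trace).items():
    print(f"{link}: {seconds:.2f}s busy, {mibps:.0f} MiB/s")
```

A real methodology layers many such aggregations (residency of data handles, eviction counts, idle time waiting on fetches) on top of the raw events, but the reduction step looks like this.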
The recent version of the Parallel Linear Algebra Software for Multicore Architectures (PLASMA) library is based on tasks with dependencies from the OpenMP standard. The main functionality of the library is presented. Extensive benchmarks target three recent multicore and manycore architectures: Intel Xeon, Intel Xeon Phi, and IBM POWER8 processors.
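The dataflow execution that tasks with dependencies enable can be sketched by deriving a DAG from OpenMP-style depend(in:)/depend(out:) clauses. The task list below models a tile Cholesky factorization on a 3x3 tile grid, the kind of kernel PLASMA provides; the level computation is a simplified, writer-only dependency analysis, not the actual OpenMP runtime logic.

```python
def dependency_levels(tasks):
    """Derive a task DAG from depend(in:)/depend(out:) clauses and return
    each task's earliest parallel step (ASAP level). Simplified analysis:
    a task waits for the last writer of everything it reads or writes."""
    last_writer = {}
    level = {}
    for name, ins, outs in tasks:
        preds = {last_writer[d] for d in ins + outs if d in last_writer}
        level[name] = max((level[p] + 1 for p in preds), default=0)
        for d in outs:
            last_writer[d] = name
    return level

# Tile Cholesky on tiles Aij, expressed as (name, in-deps, out-deps).
tasks = [
    ("potrf0",  ["A00"],               ["A00"]),
    ("trsm10",  ["A00", "A10"],        ["A10"]),
    ("trsm20",  ["A00", "A20"],        ["A20"]),
    ("syrk11",  ["A10", "A11"],        ["A11"]),
    ("gemm21",  ["A20", "A10", "A21"], ["A21"]),
    ("syrk22",  ["A20", "A22"],        ["A22"]),
    ("potrf1",  ["A11"],               ["A11"]),
    ("trsm21",  ["A11", "A21"],        ["A21"]),
    ("syrk22b", ["A21", "A22"],        ["A22"]),
    ("potrf2",  ["A22"],               ["A22"]),
]
print(dependency_levels(tasks))
```

Both TRSM tasks land on the same level, as do the three trailing updates after them: the runtime is free to run each level's tasks concurrently, which is exactly what programming the library over OpenMP task dependencies buys.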
We present a C++ header-only parallel sparse matrix library based on a sparse quadtree representation of matrices using the Chunks and Tasks programming model. The library implements a number of sparse matrix algorithms for distributed-memory parallelization that are able to dynamically exploit data locality to avoid data movement. This is demonstrated for the example of block-sparse matrix-matrix multiplication applied to three sequences of matrices with different nonzero structure, using the CHT-MPI 2.0 runtime library implementation of the Chunks and Tasks model. The runtime library succeeds in dynamically load-balancing the calculation regardless of the sparsity structure. (C) 2022 The Authors. Published by Elsevier B.V.
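The quadtree idea can be illustrated with a minimal sketch, not the library's C++ interface: None encodes an all-zero subtree, a number is a 1x1 leaf, and an internal node is a (nw, ne, sw, se) tuple. Because multiplication recurses only into nonzero subtrees, zero branches prune entire subproblems, which is how the representation exploits sparsity structure dynamically.

```python
def qt_add(a, b):
    """Add two quadtree matrices; None is the all-zero subtree."""
    if a is None:
        return b
    if b is None:
        return a
    if isinstance(a, (int, float)):
        return a + b
    return tuple(qt_add(x, y) for x, y in zip(a, b))

def qt_mult(a, b):
    """Multiply quadtree matrices via 2x2 block recursion,
    skipping any product where either operand subtree is zero."""
    if a is None or b is None:
        return None                      # zero branch: prune recursion
    if isinstance(a, (int, float)):
        return a * b                     # 1x1 leaf
    anw, ane, asw, ase = a
    bnw, bne, bsw, bse = b
    return (qt_add(qt_mult(anw, bnw), qt_mult(ane, bsw)),
            qt_add(qt_mult(anw, bne), qt_mult(ane, bse)),
            qt_add(qt_mult(asw, bnw), qt_mult(ase, bsw)),
            qt_add(qt_mult(asw, bne), qt_mult(ase, bse)))

# 2x2 example: A = [[1, 0], [0, 2]], B = [[0, 3], [4, 0]].
A = (1, None, None, 2)
B = (None, 3, 4, None)
print(qt_mult(A, B))
```

In the real library the leaves are dense chunks rather than scalars and the recursive products become tasks distributed by the Chunks and Tasks runtime, but the pruning logic is the same.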
The constant increase in parallelism available on large-scale distributed computers poses major scalability challenges to many scientific applications. A common strategy to improve scalability is to express algorithms in terms of independent tasks that can be executed concurrently on a runtime system. In this manuscript, we consider a generalization of this approach where task-level speculation is allowed. In this context, a probability is attached to each task which corresponds to the likelihood that the output of the speculative task will be consumed as part of the larger calculation. We consider the problem of optimal resource allocation to each of the possible tasks so as to maximize the total expected computational throughput. The power of this approach is demonstrated by analyzing its application to Parallel Trajectory Splicing, a massively-parallel long-time-dynamics method for atomistic simulations.
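The resource-allocation problem described above can be sketched with a greedy marginal-gain heuristic. Assuming each task i has a consumption probability p_i and a concave parallel speedup curve s(n) (here sqrt, purely illustrative), assigning each core to the task with the largest remaining marginal gain p_i * (s(n+1) - s(n)) maximizes the expected throughput sum_i p_i * s(n_i), since concavity makes the marginal gains decreasing.

```python
import heapq, math

def allocate(probs, cores, speedup=math.sqrt):
    """Greedily assign cores to speculative tasks to maximize
    expected throughput sum_i p_i * speedup(n_i)."""
    alloc = [0] * len(probs)
    # Max-heap (negated) of each task's next marginal gain.
    heap = [(-p * (speedup(1) - speedup(0)), i) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    for _ in range(cores):
        _, i = heapq.heappop(heap)
        alloc[i] += 1
        gain = probs[i] * (speedup(alloc[i] + 1) - speedup(alloc[i]))
        heapq.heappush(heap, (-gain, i))
    return alloc

# Three speculative tasks: one almost certain to be consumed, two unlikely.
print(allocate([0.9, 0.3, 0.3], cores=8))
```

The high-probability task absorbs most of the cores, but the unlikely tasks still receive some: hedging on speculation beats both ignoring it and splitting resources evenly.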
Many applications, ranging from big data analytics to nanostructure design, require the solution of large dense singular value decomposition (SVD) or eigenvalue problems. A first step in the solution methodology for these problems is the reduction of the matrix at hand to condensed form by two-sided orthogonal transformations. This step is standard practice and significantly accelerates the solution process. We present a performance analysis of the main two-sided factorizations used in these reductions (the bidiagonalization, tridiagonalization, and upper Hessenberg factorizations) on heterogeneous systems of multicore CPUs and Xeon Phi coprocessors. We derive a performance model and use it to guide the analysis and to evaluate performance. We develop optimized implementations of these methods that reach up to 80% of the optimal performance bounds. Finally, we describe the heterogeneous multicore and coprocessor development considerations and the techniques that enable us to achieve these high-performance results. This work presents the first highly optimized implementation of these main factorizations for Xeon Phi coprocessors. Compared to the LAPACK versions optimized by Intel for Xeon Phi (in MKL), we achieve up to 50% speedup.
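The performance model the abstract mentions can be sketched in roofline style. This is a generic illustration with rough, assumed flop and byte counts, not the paper's actual model: tridiagonalization costs about 4/3 n^3 flops, roughly half of which are Level-2 BLAS symmetric matrix-vector products that stream the trailing matrix from memory (bandwidth-bound), while the other half are Level-3 BLAS updates (compute-bound).

```python
def tridiag_bound(n, peak_gflops, bw_gbs):
    """Roofline-style attainable-performance bound (sketch) for
    Householder tridiagonalization in double precision."""
    flops = 4.0 / 3.0 * n**3
    level3 = flops / 2                 # compute-bound half (BLAS-3)
    # BLAS-2 half: the SYMV at step k reads ~k^2/2 entries * 8 bytes
    # = 4 k^2 bytes; summed over k = 1..n this is ~ (4/3) n^3 bytes.
    bytes_moved = 4.0 / 3.0 * n**3
    time = level3 / (peak_gflops * 1e9) + bytes_moved / (bw_gbs * 1e9)
    return flops / time / 1e9          # attainable GFLOP/s

# Illustrative, hypothetical machine: 1000 GFLOP/s peak, 200 GB/s memory.
print(f"{tridiag_bound(n=10000, peak_gflops=1000, bw_gbs=200):.0f} GFLOP/s")
```

Even on this hypothetical machine the bound sits far below peak because the memory-bound half dominates the runtime, which is why such reductions are evaluated against model-derived bounds (the 80% figure above) rather than against machine peak.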