ISBN (Print): 9780769546759
Heterogeneous systems with CPUs and computational accelerators such as GPUs, FPGAs or the upcoming Intel MIC are becoming mainstream. In these systems, peak performance includes the performance of not just the CPUs but also all available accelerators. In spite of this fact, the majority of programming models for heterogeneous computing focus on only one of these. With the development of Accelerated OpenMP for GPUs, both from PGI and Cray, we have a clear path to extend traditional OpenMP applications incrementally to use GPUs. The extensions are geared toward switching from CPU parallelism to GPU parallelism. However, they do not preserve the former while adding the latter. Thus computational potential is wasted since either the CPU cores or the GPU cores are left idle. Our goal is to create a runtime system that can intelligently divide an accelerated OpenMP region across all available resources automatically. This paper presents our proof-of-concept runtime system for dynamic task scheduling across CPUs and GPUs. Further, we motivate the addition of this system into the proposed OpenMP for Accelerators standard. Finally, we show that this option can produce as much as a two-fold performance improvement over using either the CPU or GPU alone.
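The core idea of dividing one parallel region across CPU and GPU can be sketched abstractly. The following is a minimal, hypothetical illustration (not the authors' runtime, and the throughput numbers are stand-ins for measured timings): after each execution of the region, the fraction of iterations assigned to the GPU is nudged toward the ratio that makes both devices finish at the same time.

```python
# Hypothetical sketch of dynamic CPU/GPU work splitting: re-balance the GPU's
# share of loop iterations after each region execution so that both devices
# finish at (roughly) the same time. No real hardware is involved; device
# speeds are simulated constants.

def rebalance(gpu_fraction, cpu_time, gpu_time, damping=0.5):
    """Move the GPU's share toward the ratio that equalizes finish times."""
    cpu_rate = (1.0 - gpu_fraction) / cpu_time  # iterations per second, CPU side
    gpu_rate = gpu_fraction / gpu_time          # iterations per second, GPU side
    ideal = gpu_rate / (cpu_rate + gpu_rate)    # share that equalizes times
    return gpu_fraction + damping * (ideal - gpu_fraction)

def simulate(total_iters, cpu_iters_per_s, gpu_iters_per_s, rounds=10):
    frac = 0.5  # start by splitting the region evenly
    for _ in range(rounds):
        cpu_time = (1 - frac) * total_iters / cpu_iters_per_s
        gpu_time = frac * total_iters / gpu_iters_per_s
        frac = rebalance(frac, cpu_time, gpu_time)
    return frac

# With a GPU 3x faster than the CPU, the split converges toward 0.75.
```

In this toy model the split converges to the ratio of device throughputs; a real scheduler would additionally account for data-transfer cost and workload irregularity.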
ISBN (Print): 9781728186740
This work presents NchooseK, a unified programming model for constraint satisfaction problems that can be mapped to both quantum circuit and annealing devices through Quadratic Unconstrained Binary Optimization (QUBO) problems. Our mapping provides an approachable and effective way to program both types of quantum computers. We provide examples of NchooseK being used.
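The canonical building block behind such a model can be shown concretely. The following is an illustrative sketch (not taken from the paper): the constraint "exactly k of these n binary variables are true" becomes the QUBO penalty (sum_i x_i - k)^2, whose expansion gives linear coefficients (1 - 2k) on each x_i and +2 on each pair x_i x_j, dropping the constant k^2. A brute-force check confirms the minima are exactly the satisfying assignments.

```python
# Illustrative QUBO encoding of an NchooseK-style constraint: "exactly k of n
# binary variables are true", penalized as (sum_i x_i - k)^2. The function
# names are hypothetical, chosen for this sketch.
from itertools import combinations, product

def nchoosek_qubo(n, k):
    """Linear and quadratic QUBO coefficients for the 'exactly k of n' penalty."""
    linear = {i: 1 - 2 * k for i in range(n)}                     # (1 - 2k) x_i
    quadratic = {(i, j): 2 for i, j in combinations(range(n), 2)}  # +2 x_i x_j
    return linear, quadratic

def energy(x, linear, quadratic):
    e = sum(c * x[i] for i, c in linear.items())
    e += sum(c * x[i] * x[j] for (i, j), c in quadratic.items())
    return e

def minima(n, k):
    """Brute-force the 2^n assignments and return the set of minimum-energy ones."""
    linear, quadratic = nchoosek_qubo(n, k)
    energies = {x: energy(x, linear, quadratic)
                for x in product((0, 1), repeat=n)}
    best = min(energies.values())
    return {x for x, e in energies.items() if e == best}
```

Because the energy reduces to (s - k)^2 - k^2 for s = sum of set bits, the ground states are precisely the assignments with k variables set, which is what an annealer or QAOA-style circuit would search for.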
ISBN (Print): 9781595936028
This paper addresses the problem of orchestrating and scheduling parallelism at multiple levels of granularity on heterogeneous multicore processors. We present mechanisms and policies for adaptive exploitation and scheduling of layered parallelism on the Cell Broadband Engine. Our policies combine event-driven task scheduling with malleable loop-level parallelism, which is exploited from the runtime system whenever task-level parallelism leaves idle cores. We present a scheduler for applications with layered parallelism on Cell and investigate its performance with RAxML, an application which infers large phylogenetic trees using the Maximum Likelihood (ML) method. Our experiments show that the Cell benefits significantly from dynamic methods that selectively exploit the layers of parallelism in the system, in response to workload fluctuation. Our scheduler outperforms the MPI version of RAxML, scheduled by the Linux kernel, by up to a factor of 2.6. We are able to execute RAxML on one Cell four times faster than on a dual-processor system with Hyperthreaded Xeon processors, and 5-10% faster than on a single-processor system with a dual-core, quad-thread IBM Power5 processor.
ISBN (Print): 9781728196664
With the forthcoming age of exascale computing, the efficient support of different programming models has become a crucial performance factor for high-performance computing systems. When adopting novel programming paradigms, the performance assessment through standardized and comparable benchmarks plays an essential role in both the effective use of the heterogeneous system hardware and the application performance tuning. Alternatives to MPI such as the partitioned global address space (PGAS) model have become increasingly popular. One such PGAS API is the Global Address Space programming Interface (GASPI). This paper introduces the GASPI Benchmark Suite (GBS), which combines a comprehensive, standardized set of microbenchmarks with application kernels. The microbenchmarks target common GASPI communication patterns, including one-sided, collective, passive, and global atomics, while the application kernels stress communication schemes commonly found in real HPC applications. The effectiveness of GBS is demonstrated by evaluating the GASPI communication performance for the two networking communication standards Ethernet and InfiniBand.
ISBN (Print): 9781728145358
Shared memory programming models usually provide worksharing and task constructs. The former relies on the efficient fork-join execution model to exploit structured parallelism, while the latter relies on fine-grained synchronization among tasks and a flexible data-flow execution model to exploit dynamic, irregular, and nested parallelism. On applications that show both structured and unstructured parallelism, both worksharing and task constructs can be combined. However, it is difficult to mix both execution models without penalizing the data-flow execution model. Hence, on many applications structured parallelism is also exploited using tasks to leverage the full benefits of a pure data-flow execution model. However, task creation and management might introduce a non-negligible overhead that prevents the efficient exploitation of fine-grained structured parallelism, especially on many-core processors. In this work, we propose worksharing tasks. These are tasks that internally leverage worksharing techniques to exploit fine-grained structured loop-based parallelism. The evaluation shows promising results on several benchmarks and platforms.
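The overhead argument can be made concrete with a simple chunking sketch. This is not the authors' runtime, only a conceptual illustration in Python: instead of spawning one fine-grained task per loop iteration, each task executes a whole chunk of iterations, so task-creation cost is amortized over the chunk.

```python
# Conceptual sketch of a "worksharing task": a loop of n iterations is split
# into chunks, and each chunk becomes a single task. The function name and
# signature are hypothetical, invented for this illustration.
from concurrent.futures import ThreadPoolExecutor

def taskloop(body, n, chunk, workers=4):
    """Run body(i) for i in [0, n), one task per chunk of iterations.

    Returns (results in iteration order, number of tasks created).
    """
    def run_chunk(start):
        return [body(i) for i in range(start, min(start + chunk, n))]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(run_chunk, s) for s in range(0, n, chunk)]
        out = []
        for f in futures:          # futures were submitted in chunk order,
            out.extend(f.result()) # so results come back in iteration order
    return out, (n + chunk - 1) // chunk
```

With chunk = 1 this degenerates into one task per iteration (the overhead-heavy case the paper describes); larger chunks trade scheduling flexibility for lower task-management cost, which is the trade-off worksharing tasks target.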
ISBN (Print): 9780769549712
Computational scientists and engineers commonly rely on established software libraries to achieve high performance and reliability in their numerical applications. Unfortunately, this approach does not work well if the desired functionality is absent in existing libraries or if the integration is difficult. In such scenarios, one is often forced to explore alternative algorithms and in-house implementations. Such exploration can be a challenging task for computational scientists and engineers without sufficient computer science background. To address this issue, we design and build an automated rapid prototyping tool for regular grid-based numerical applications. This new tool allows programmers to specify algorithms as compositions of familiar computation patterns, such as those easily found in open literature, expressed as generalized elemental subroutines. The tool then automatically transforms such subroutines into code which adapts to the prescribed data structures and delivers performance expected from the underlying algorithms. We demonstrate the tool in use cases including a production-grade computational fluid dynamics application.
ISBN (Print): 9781479941162
The first exascale supercomputers are expected by the end of this decade and will presumably feature an increase in core count, but a decrease in the amount of memory available per core. As of now, it is still unclear if the current programming models will provide high performance on exascale systems. One programming model considered to be an alternative to MPI is the so-called partitioned global address space (PGAS) model. Within this paper we evaluate a relatively new PGAS API: the Global Address Space programming Interface (GASPI) and compare it to MPI on the basis of microbenchmarks. These benchmarks show that GASPI provides about the same level of performance for single-threaded communication, but is up to an order of magnitude faster than both Intel and IBM MPI for multi-threaded communication. Hereafter, we discuss the different features of GASPI in comparison to two main PGAS languages, namely UPC and CAF. In addition, we present a basic numerical algorithm, a dense matrix-matrix multiplication, as an example on how an implementation can make efficient use of GASPI's features, especially the asynchronous and one-sided communication mechanisms.
ISBN (Print): 9798331541378
This work presents a novel approach to synthesize approximate circuits for the ansatze of variational quantum algorithms (VQA) and demonstrates its effectiveness in the context of solving integer linear programming (ILP) problems. Synthesis is generalized to produce parametric circuits in close approximation of the original circuit and to do so offline. This removes synthesis from the (online) critical path between repeated quantum circuit executions of VQA. We hypothesize that this approach will yield novel high fidelity results beyond those discovered by the baseline without synthesis. Simulation and real device experiments complement the baseline in finding correct results in many cases where the baseline fails to find any and do so with on average 32% fewer CNOTs in circuits.
ISBN (Print): 9781450351331
Large-scale parallel applications with complex global data dependencies beyond those of reductions pose significant scalability challenges in an asynchronous runtime system. Internodal challenges include identifying the all-to-all communication of data dependencies among the nodes. Intranodal challenges include gathering together these data dependencies into usable data objects while avoiding data duplication. This paper addresses these challenges within the context of a large-scale, industrial coal boiler simulation using the Uintah asynchronous many-task runtime system on GPU architectures. We show a significant reduction in time spent analyzing data dependencies through refinements in our dependency search algorithm. Multiple task graphs are used to eliminate subsequent analysis when task graphs change in predictable and repeatable ways. A combined data store and task scheduler redesign reduces data-dependency duplication, ensuring that problems fit within host and GPU memory. These modifications did not require any changes to application code or sweeping changes to the Uintah runtime system. We report results running on the DOE Titan system on 119K CPU cores and 7.5K GPUs simultaneously. Our solutions can be generalized to other task dependency problems with global dependencies among thousands of nodes which must be processed efficiently at large scale.
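The deduplication idea described in the abstract can be illustrated in miniature. This sketch is hypothetical (not Uintah's actual data store); it shows the principle that keying a shared store on the identity of a dependency, here a (variable, patch) pair, ensures the data is fetched and stored once even when many tasks require it.

```python
# Hypothetical miniature of a deduplicating data store: many tasks may need the
# same (variable, patch) data dependency; only the first request triggers a
# remote read, later requests reuse the stored object. Names are invented for
# this illustration.

class DependencyStore:
    def __init__(self):
        self._store = {}   # (variable, patch) -> data object
        self.fetches = 0   # how many remote reads were actually issued

    def fetch(self, var, patch, remote_read):
        """Return the data for (var, patch), reading remotely only once."""
        key = (var, patch)
        if key not in self._store:
            self._store[key] = remote_read(var, patch)
            self.fetches += 1  # only the first requester communicates
        return self._store[key]
```

In a real many-task runtime the same keying discipline also lets the scheduler detect which dependencies a new task graph already has resident, which is one way duplicated storage and redundant communication are avoided.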