ISBN:
(Print) 9783319322438; 9783319322421
Programmers need to combine different programming models and fully optimize their codes to take advantage of the various levels of parallelism available in heterogeneous clusters. To reduce the complexity of this process, we propose a task-based approach to crowd simulation using OmpSs, CUDA and MPI, which allows taking full advantage of the computational resources available in heterogeneous clusters. We also present a performance analysis of the algorithm under different workloads executed on a GPU cluster.
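As a rough illustration of the task-based approach described in the abstract, the sketch below combines an OmpSs-annotated CUDA kernel with an MPI communication task. It assumes OmpSs pragma syntax (as compiled by the Mercurium compiler) and hypothetical names (simulate_step, halo_exchange); it is not the paper's code.

#include <mpi.h>

/* CUDA kernel wrapped as an OmpSs task; copy_deps lets the runtime
 * manage host<->device transfers for the declared dependences. */
#pragma omp target device(cuda) copy_deps ndrange(1, n, 128)
#pragma omp task in([n]pos) out([n]vel)
__global__ void simulate_step(const float *pos, float *vel, int n);

/* MPI communication wrapped as a host task so it can overlap with
 * computation on other data blocks. */
#pragma omp task inout([n]pos)
void halo_exchange(float *pos, int n, int rank, int nranks)
{
    MPI_Sendrecv_replace(pos, n, MPI_FLOAT,
                         (rank + 1) % nranks, 0,
                         (rank + nranks - 1) % nranks, 0,
                         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}

void step(float *pos, float *vel, int n, int rank, int nranks)
{
    simulate_step(pos, vel, n);          /* spawns a GPU task          */
    halo_exchange(pos, n, rank, nranks); /* spawns a host (MPI) task   */
    #pragma omp taskwait                 /* dependences order the two  */
}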
ISBN:
(Print) 9781509036820
Implementations of parallel programming models are provided either as language extensions, as completely new languages, or as libraries. The first two options often provide high productivity, but require porting of codes. In contrast, calls to new libraries can be added more easily; however, the abstractions in such programming model implementations can incur high runtime overhead. In both cases, these drawbacks often hinder the adoption of novel programming models in large existing codes. To combine the advantages of compiler analysis with the composability of pure libraries towards more efficient programming model implementations, in this paper we propose a low-level API for programmer-controlled binary rewriting at runtime. It can be used by programming models provided as libraries to integrate their abstractions efficiently with application code. It enables incremental adoption in existing codes and favors input-dependent optimization strategies, while providing performance similar to that of language-extension approaches. We report first promising experiences.
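The paper's actual API is not reproduced here, but the underlying mechanism (runtime binary rewriting) can be shown in a minimal, self-contained x86-64 Linux sketch: a code region is made writable with mprotect and a jump to a replacement function is patched in. All names are hypothetical; build without aggressive optimization, e.g. gcc -O0.

#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

__attribute__((noinline)) static int generic_add(int a, int b)
{
    puts("generic path");
    return a + b;
}

__attribute__((noinline)) static int specialized_add(int a, int b)
{
    puts("specialized path");
    return a + b;
}

/* Overwrite the first bytes of `from` with an absolute jump to `to`
 * (x86-64: movabs rax, imm64; jmp rax -- 12 bytes in total). */
static void patch_jump(void *from, void *to)
{
    uint8_t code[12] = { 0x48, 0xB8 };          /* movabs rax, imm64 */
    memcpy(code + 2, &to, 8);
    code[10] = 0xFF;                            /* jmp rax           */
    code[11] = 0xE0;

    long pg = sysconf(_SC_PAGESIZE);
    void *page = (void *)((uintptr_t)from & ~(uintptr_t)(pg - 1));
    mprotect(page, 2 * pg, PROT_READ | PROT_WRITE | PROT_EXEC);
    memcpy(from, code, sizeof code);
    mprotect(page, 2 * pg, PROT_READ | PROT_EXEC);
}

int main(void)
{
    printf("%d\n", generic_add(2, 3));   /* "generic path", 5      */
    patch_jump((void *)generic_add, (void *)specialized_add);
    printf("%d\n", generic_add(2, 3));   /* now "specialized path" */
    return 0;
}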
We present TProf, an energy profiling tool for OpenMP-like task-parallel programs. To compute the energy consumed by each task in a parallel application, TProf dynamically traces the parallel execution and uses a novel technique to estimate the per-task energy consumption. To achieve this estimation, TProf apportions the total processor energy among cores, overcoming a limitation of current works that would otherwise make parallel accounting impossible. We demonstrate the value of TProf by characterizing a set of task-parallel programs, where we find that data locality, memory access patterns and task working sets are responsible for significant variance in energy consumption between seemingly homogeneous tasks. In addition, we identify opportunities for fine-grained energy optimization by applying per-task Dynamic Voltage and Frequency Scaling (DVFS). (C) 2014 Published by Elsevier Inc.
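The per-task accounting idea can be illustrated with a simplified sketch: sample a package-level energy counter around each task and split the consumed energy among the tasks that were active. The sketch uses the Linux RAPL powercap file as the counter, and the even split among active tasks is a stand-in for TProf's actual apportioning technique, not a reproduction of it.

#include <stdio.h>
#include <stdatomic.h>

static atomic_int active_tasks;   /* tasks currently running */

/* Package energy in microjoules, via the Linux RAPL powercap interface. */
static long long read_energy_uj(void)
{
    long long uj = 0;
    FILE *f = fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (f) { fscanf(f, "%lld", &uj); fclose(f); }
    return uj;
}

/* Run a task and return its estimated energy share in microjoules. */
static long long run_task_measured(void (*task)(void *), void *arg)
{
    atomic_fetch_add(&active_tasks, 1);
    long long e0 = read_energy_uj();
    task(arg);
    long long e1 = read_energy_uj();
    int n = atomic_fetch_sub(&active_tasks, 1); /* tasks sharing the package */
    return (e1 - e0) / n;
}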
The current generation of computing platforms embraces multi-core and many-core processors to improve overall system performance while meeting the stringent energy budgets demanded by the market. Parallel programming languages are nowadays paramount to extracting the tremendous potential offered by these platforms: parallel computing is no longer a niche within high-performance computing (HPC), but an essential ingredient in all domains of computer science. The advent of next-generation many-core embedded platforms offers the chance to intercept a converging need for predictable high performance coming from both the High-Performance Computing (HPC) and Embedded Computing (EC) domains. On one side, new kinds of HPC applications are being demanded by markets that need huge amounts of information to be processed within a bounded amount of time. On the other side, EC systems are increasingly concerned with providing higher performance in real time, challenging the capabilities of current architectures. This converging demand raises the question of how to guarantee timing requirements in the presence of parallel execution. The paper presents how the time-criticality and parallelisation challenges are addressed by merging techniques coming from both the HPC and EC domains, and provides an overview of the proposed framework to achieve these objectives. (c) 2015 Elsevier B.V. All rights reserved.
ISBN:
(Print) 9781479986705
Current high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a general-purpose host processor is coupled to a programmable manycore accelerator (PMCA). Such PMCAs typically leverage a hierarchical interconnect and distributed memory with non-uniform access (NUMA). Nested parallelism is a convenient programming abstraction for large-scale cc-NUMA systems, as it allows multiple levels of fine-grained parallelism to be created hierarchically (and dynamically) wherever parallelism is available. Existing implementations for cc-NUMA systems introduce large overheads for nested parallelism management, which cannot be tolerated given the extremely fine-grained nature of embedded parallel workloads. In particular, creating a team of parallel threads has a cost that increases linearly with the number of threads, which is inherently non-scalable. This work presents a software cache of frequently-used parallel team configurations to reduce the cost of parallel thread creation on PMCA systems. When a configuration is found in the cache, the cost of parallel team creation is constant, providing a scalable mechanism. We evaluated our support on the STMicroelectronics STHORM many-core. Compared to the state of the art, our solution shows that: i) the cost of parallel team creation is reduced by up to 67%; ii) the tangible effect on real ultra-fine-grained parallel kernels is a speedup of up to 80%.
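The effect of the proposed cache can be sketched in a few lines of C: a team is keyed by its thread configuration (here, a bitmask of participating cores), so a cache hit replaces the linear-cost team setup with a constant-time lookup. Structures and names below are illustrative, not the actual runtime's.

#include <stdint.h>

#define CACHE_SLOTS 16

typedef struct {
    uint32_t thread_mask;   /* which PMCA cores join the team */
    void    *team_desc;     /* ready-to-run team descriptor   */
} team_entry_t;

static team_entry_t cache[CACHE_SLOTS];

extern void *build_team(uint32_t mask);  /* linear-cost slow path (assumed) */

void *open_team(uint32_t thread_mask)
{
    unsigned slot = (thread_mask * 2654435761u) >> 28;  /* 16-slot hash */
    if (cache[slot].thread_mask == thread_mask && cache[slot].team_desc)
        return cache[slot].team_desc;           /* hit: constant time */

    void *t = build_team(thread_mask);          /* miss: linear cost  */
    cache[slot].thread_mask = thread_mask;
    cache[slot].team_desc  = t;
    return t;
}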
ISBN:
(Print) 9781467373111
The AXIOM project (Agile, eXtensible, fast I/O Module) aims at researching new software/hardware architectures for future Cyber-Physical Systems (CPSs). These systems are expected to react in real time, provide enough computational power for the assigned tasks, consume the least possible energy for each task (energy efficiency), scale up through modularity, allow easy programmability across performance scaling, and make the best use of existing standards at minimal cost. Current solutions for providing enough computational power are mainly based on multi- or many-core architectures. For example, some current research projects (such as ADEPT or P-SOCRATES) are already investigating how to join efforts from the High-Performance Computing (HPC) and Embedded Computing domains, which are both focused on high power efficiency, while GPUs and new dataflow platforms such as Maxeler, or FPGAs in general, are claimed to be the most energy-efficient. We present the project's initial approach, ideas and key concepts, and describe the preliminary AXIOM architecture. Our starting point uses power-efficient multi-core nodes, such as ARM cores and FPGA accelerators on the same die, as in the Xilinx Zynq. We will work to provide an integrated environment that supports programmability of the parallel, interconnected nodes that form a CPS, and evaluate our ideas using demanding test application scenarios.
ISBN:
(Print) 9781467391603
This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high-throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly scalable network stack for next-generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs satisfying the networking needs of many programming models, such as the Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms, and I/O-bound applications. To evaluate the design, we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental to implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 µs, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. To the best of our knowledge, this is the highest publicly known bandwidth and message rate achieved by any network stack on this hardware.
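A minimal bootstrap against UCX's public UCP layer (ucp/api/ucp.h) looks as follows; it creates only the context and worker that any of the programming models listed above would start from, with error handling abbreviated.

#include <ucp/api/ucp.h>
#include <stdlib.h>

int main(void)
{
    ucp_config_t  *config;
    ucp_context_h  context;
    ucp_worker_h   worker;

    ucp_config_read(NULL, NULL, &config);

    ucp_params_t params = {
        .field_mask = UCP_PARAM_FIELD_FEATURES,
        .features   = UCP_FEATURE_TAG,     /* tag matching, e.g. for MPI */
    };
    if (ucp_init(&params, config, &context) != UCS_OK)
        return EXIT_FAILURE;
    ucp_config_release(config);

    ucp_worker_params_t wparams = {
        .field_mask  = UCP_WORKER_PARAM_FIELD_THREAD_MODE,
        .thread_mode = UCS_THREAD_MODE_SINGLE,
    };
    if (ucp_worker_create(context, &wparams, &worker) != UCS_OK)
        return EXIT_FAILURE;

    /* ... create endpoints and exchange messages here ... */

    ucp_worker_destroy(worker);
    ucp_cleanup(context);
    return EXIT_SUCCESS;
}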
This paper introduces the JStar parallel programming language, a Java-based declarative language aimed at discouraging sequential programming, encouraging massively parallel programming, and giving the compiler and runtime maximum freedom to try alternative parallelisation strategies. We describe the execution semantics and runtime support of the language, several optimisations and parallelism strategies, and some benchmark results. (C) 2013 Elsevier B.V. All rights reserved.
The fast multipole method (FMM) is a complex, multi-stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities. The parallel tasks comprising the FMM may be expressed in X10 using a scalable pattern of activities. This paper demonstrates the use of X10 to implement the FMM for simulation of electrostatic interactions between ions in a cyclotron resonance mass spectrometer. X10's task-parallel model expresses parallelism through a pattern of activities that maps directly onto the tree. X10's work-stealing runtime handles load balancing of fine-grained parallel activities, avoiding the need for explicit work sharing. The use of global references and active messages to create and synchronize parallel activities over a distributed tree structure is also demonstrated. In contrast to previous simulations of ion trajectories in cyclotron resonance mass spectrometers, our code enables both simulation of realistic particle numbers and guaranteed error bounds. Single-node performance is comparable with the fastest published FMM implementations, and critical expansion operators are faster for high-accuracy calculations. A comparison of parallel and sequential codes shows that the overhead of activity management and work stealing in this application is low. Scalability is evaluated on 8k cores of a Blue Gene/Q system and 512 cores of a Nehalem/InfiniBand cluster. Copyright (c) 2013 John Wiley & Sons, Ltd.
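X10 code is not shown here, but the async/finish pattern the paper describes has a close analogue in C with OpenMP tasks: one task per subtree, joined before the parent's expansion is computed. The octree node type and compute_expansion are assumed placeholders, not the paper's implementation.

#include <omp.h>

typedef struct node {
    struct node *child[8];      /* octree children, NULL if absent */
    /* ... multipole/local expansions would live here ... */
} node_t;

extern void compute_expansion(node_t *n);   /* assumed per-node work */

static void traverse(node_t *n)
{
    if (!n) return;
    for (int i = 0; i < 8; i++) {
        #pragma omp task firstprivate(i)    /* one task per subtree, */
        traverse(n->child[i]);              /* like X10's async      */
    }
    #pragma omp taskwait                    /* like X10's finish     */
    compute_expansion(n);                   /* post-order (upward) pass */
}

void upward_pass(node_t *root)
{
    #pragma omp parallel
    #pragma omp single
    traverse(root);
}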
This paper introduces hybrid address spaces as a fundamental design methodology for implementing scalable runtime systems on many-core architectures without hardware support for cache coherence. We use hybrid address spaces for an implementation of MapReduce, a programming model for large-scale data processing, and for an implementation of a remote memory access (RMA) model. Both implementations are available on the Intel SCC and are portable to similar architectures. We present the design and implementation of HyMR, a MapReduce runtime system in which the different stages and the synchronization operations between them alternate between a distributed-memory address space and a shared-memory address space to improve performance and scalability. We compare HyMR to a reference implementation and find that HyMR improves performance by a factor of 1.71x over a set of representative MapReduce benchmarks. We also compare HyMR with Phoenix++, a state-of-the-art implementation for systems with hardware-managed cache coherence, in terms of scalability and sustained-to-peak data processing bandwidth, where HyMR demonstrates improvements by factors of 3.1x and 3.2x, respectively. We further evaluate our hybrid remote memory access (HyRMA) programming model and find its performance to be superior to that of message passing. (C) 2014 Elsevier Inc. All rights reserved.
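As a toy illustration of the hybrid-address-space idea (not HyMR itself): worker processes execute the map stage in private memory, much like the non-coherent SCC cores, and publish partial results through an explicitly shared mmap region where the synchronization and reduce stage run. All names are illustrative.

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

#define NWORKERS 4

int main(void)
{
    const char data[] = "abbaabababbbaaab";
    size_t len = strlen(data), chunk = len / NWORKERS;

    /* Shared region: one partial count per worker. */
    long *partial = mmap(NULL, NWORKERS * sizeof(long),
                         PROT_READ | PROT_WRITE,
                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (partial == MAP_FAILED) return 1;

    for (int w = 0; w < NWORKERS; w++) {
        if (fork() == 0) {
            long count = 0;                  /* map stage: private memory */
            for (size_t i = w * chunk; i < (w + 1) * chunk; i++)
                count += (data[i] == 'a');
            partial[w] = count;              /* publish via shared region */
            _exit(0);
        }
    }
    while (wait(NULL) > 0) ;                 /* barrier between stages */

    long total = 0;                          /* reduce stage: shared memory */
    for (int w = 0; w < NWORKERS; w++) total += partial[w];
    printf("a-count = %ld\n", total);
    return 0;
}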