ISBN: 9781605583976 (print)
This paper proposes a new parallel execution model in which programmers augment a sequential program with pieces of code called serializers that dynamically map computational operations into serialization sets of dependent operations. A runtime system executes operations in the same serialization set in program order, and may concurrently execute operations in different sets. Because serialization sets establish a logical ordering on all operations, the resulting parallel execution is predictable and deterministic. We describe the API and design of Prometheus, a C++ library that implements the serialization set abstraction through compile-time template instantiation and a runtime support library. We evaluate a set of parallel programs running on the x86_64 and SPARC-V9 instruction sets and study their performance on multi-core, symmetric multiprocessor, and ccNUMA parallel machines. In contrast with conventional parallel execution models, we find that Prometheus programs are significantly easier to write, test, and debug, and their parallel execution achieves comparable performance.
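The serialization-set idea can be illustrated with a small sketch. Prometheus itself is a C++ library and its API is not shown in this abstract, so the Python below is only an illustration under assumptions: the function `run_with_serialization_sets`, the `serializer` callback, and the bank-account example are all hypothetical, and the sketch partitions a fixed batch of operations rather than mapping them dynamically as a runtime system would.

```python
from concurrent.futures import ThreadPoolExecutor


def run_with_serialization_sets(operations, serializer, max_workers=4):
    """Run operations so that those mapped to the same set keep program order.

    `operations` is a list of (function, argument) pairs in program order.
    `serializer` is a hypothetical callback mapping an argument to a set key.
    """
    # Partition operations into serialization sets, preserving program order.
    sets = {}
    for func, arg in operations:
        sets.setdefault(serializer(arg), []).append((func, arg))

    def drain(ops):
        # Operations within one set run sequentially, in program order.
        for func, arg in ops:
            func(arg)

    # Different sets are treated as independent, so they may run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(drain, ops) for ops in sets.values()]
        for future in futures:
            future.result()  # re-raise any exception from a worker


# Hypothetical example: deposits are serialized per account.
balances = {"alice": 0, "bob": 0}
history = {"alice": [], "bob": []}


def deposit(op):
    account, amount = op
    balances[account] += amount
    history[account].append(amount)


ops = [(deposit, ("alice", 1)), (deposit, ("bob", 10)),
       (deposit, ("alice", 2)), (deposit, ("bob", 20))]
run_with_serialization_sets(ops, serializer=lambda op: op[0])
```

Because each account's deposits land in one set, the per-account history is the same on every run, regardless of how the two sets are interleaved.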
ISBN: 9781605583976 (print)
On multiprocessors with explicitly managed memory hierarchies (EMM), software has the responsibility of moving data in and out of fast local memories. This task can be complex and error-prone even for expert programmers. Before we can allow compilers to handle the complexity for us, we must identify the abstractions that are general enough to allow us to write applications with reasonable effort, yet specific enough to exploit the vast on-chip memory bandwidth of EMM multiprocessors. To this end, we compare two programming models against hand-tuned codes on the STI Cell, paying attention to programmability and performance. The first programming model, Sequoia, abstracts the memory hierarchy as private address spaces, each corresponding to a parallel task. The second, Cellgen, is a new framework which provides OpenMP-like semantics and the abstraction of a shared address space divided into private and shared data. We compare three applications programmed using these models against their hand-optimized counterparts in terms of abstractions, programming complexity, and performance.
ISBN: 9781605583976 (print)
A trend is developing in high performance computing in which commodity processors are coupled to various types of computational accelerators. Such systems are commonly called hybrid systems. In this paper, we describe our experience developing an implementation of the Linpack benchmark for a petascale hybrid system, the LANL Roadrunner cluster built by IBM for Los Alamos National Laboratory. This system combines traditional x86-64 host processors with IBM PowerXCell (TM) 8i accelerator processors. The implementation of Linpack we developed was the first to achieve a performance result in excess of 1.0 PFLOPS, and made Roadrunner the #1 system on the Top500 list in June 2008. We describe the design and implementation of hybrid Linpack, including the special optimizations we developed for this hybrid architecture. We then present actual results for single-node and multi-node executions. From this work, we conclude that it is possible to achieve high performance for certain applications on hybrid architectures when careful attention is given to efficient use of memory bandwidth, scheduling of data movement between the host and accelerator memories, and proper distribution of work between the host and accelerator processors.
ISBN: 9781605583976 (print)
High-productivity languages for parallel computing become more important as parallel environments including multicores become more common. Cilk is such a language. It provides good load balancing for many applications including irregular ones; that is, it keeps all workers busy by creating plenty of "logical" threads and adopting the oldest-first work stealing strategy. This paper proposes a "logical thread"-free framework called Tascell, which achieves higher performance and supports a wider range of parallel environments including clusters without loss of productivity. A Tascell worker spawns a "real" task only when requested by another idle worker. The worker performs the spawning by temporarily "backtracking" and restoring its oldest task-spawnable state. Our approach eliminates the cost of spawning/managing logical threads. It also promotes the reuse of workspaces and improves the locality of reference since it does not need to prepare a workspace for each concurrently runnable logical thread. Furthermore, Tascell enables elegant and highly efficient backtrack search algorithms with delayed workspace copying. For instance, our 16-queens problem solver is 1.86 times faster than Cilk on a system with two dual-core processors. Our approach also enables a single program to run in both shared and distributed memory environments with reasonable efficiency and scalability.
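The workspace reuse mentioned above can be illustrated with a sequential n-queens counter. This is not Tascell code and shows no task spawning; it is a minimal sketch, with the hypothetical name `count_queens`, of a backtrack search that mutates a single shared workspace and undoes each change on backtracking instead of copying the board for every candidate move.

```python
def count_queens(n):
    """Count n-queens solutions using one workspace that is reused in place."""
    # The single workspace: occupied columns and the two diagonal directions.
    cols = [False] * n
    diag_down = [False] * (2 * n - 1)
    diag_up = [False] * (2 * n - 1)

    def place(row):
        if row == n:
            return 1  # every row holds a queen: one complete solution
        found = 0
        for col in range(n):
            d1 = row + col
            d2 = row - col + n - 1
            if cols[col] or diag_down[d1] or diag_up[d2]:
                continue
            # Do: mark the squares attacked by the new queen.
            cols[col] = diag_down[d1] = diag_up[d2] = True
            found += place(row + 1)
            # Undo: restore the workspace so the next candidate can reuse it.
            cols[col] = diag_down[d1] = diag_up[d2] = False
        return found

    return place(0)


solutions_8 = count_queens(8)
```

In a scheme that creates a logical thread per branch, each branch would typically need its own copy of this workspace; here a copy would only be needed if a branch were actually handed to another worker.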
ISBN: 9781605583976 (print)
Understanding why the performance of a multithreaded program does not improve linearly with the number of cores in a shared-memory node populated with one or more multicore processors is a problem of growing practical importance. This paper makes three contributions to performance analysis of multithreaded programs. First, we describe how to measure and attribute parallel idleness, namely, where threads are stalled and unable to work. This technique applies broadly to programming models ranging from explicit threading (e.g., Pthreads) to higher-level models such as Cilk and OpenMP. Second, we describe how to measure and attribute parallel overhead, which occurs when a thread is performing miscellaneous work other than executing the user's computation. By employing a combination of compiler support and post-mortem analysis, we incur no measurement cost beyond normal profiling to glean this information. Using idleness and overhead metrics enables one to pinpoint areas of an application where concurrency should be increased (to reduce idleness), decreased (to reduce overhead), or where the present parallelization is hopeless (where idleness and overhead are both high). Third, we describe how to measure and attribute arbitrary performance metrics for high-level multithreaded programming models, such as Cilk. This requires bridging the gap between the expression of logical concurrency in programs and its realization at run-time as it is adaptively partitioned and scheduled onto a pool of threads. We have prototyped these ideas in the context of Rice University's HPCTOOLKIT performance tools. We describe our approach, implementation, and experiences applying this approach to measure and attribute work, idleness, and overhead in executions of Cilk programs.
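The decision rule implied by the idleness and overhead metrics can be written down as a short sketch. This is not HPCTOOLKIT's interface: the function `diagnose`, its inputs (time attributed to work, idleness, and overhead for one code region), and the 20% threshold are all assumptions chosen for illustration.

```python
def diagnose(work, idleness, overhead, threshold=0.2):
    """Suggest a tuning direction for one code region.

    The three inputs are times (in any common unit) attributed to the region.
    `threshold` is a hypothetical cutoff on the fraction of total time.
    """
    total = work + idleness + overhead
    if total <= 0:
        raise ValueError("region has no attributed time")
    high_idleness = idleness / total > threshold
    high_overhead = overhead / total > threshold
    if high_idleness and high_overhead:
        return "reconsider parallelization"
    if high_idleness:
        return "increase concurrency"
    if high_overhead:
        return "decrease concurrency"
    return "ok"


# Hypothetical region: 50 units of work, 40 idle, 10 overhead.
advice = diagnose(work=50, idleness=40, overhead=10)
```

The three non-trivial outcomes correspond to the three cases named in the abstract: too little concurrency, too much, or a parallelization where both costs are high.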
ISBN: 9781605583976 (print)
This paper considers the problem of formal verification of MPI programs operating under a fixed test harness for safety properties without building verification models. In our approach, we directly model-check the MPI/C source code, executing its interleavings with the help of a verification scheduler. Unfortunately, the total feasible number of interleavings is exponential, and impractical to examine even for our modest goals. Our earlier publications formalized and implemented a partial order reduction approach that avoided exploring equivalent interleavings, and presented a verification tool called ISP. This paper presents algorithmic and engineering innovations to ISP, including the use of OpenMP parallelization, that now enable it to handle practical MPI programs, including: (i) ParMETIS - a widely used hypergraph partitioner, and (ii) MADRE - a Memory Aware Data Re-distribution Engine, both developed outside our group. Over these benchmarks, ISP has automatically verified up to 14K lines of MPI/C code, producing error traces of deadlocks and assertion violations within seconds.
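To give a sense of why exhaustive exploration is impractical, the sketch below counts the interleavings of fully independent processes, assuming each process is a fixed sequence of atomic steps with no synchronization. That assumption is a simplification made here, not ISP's model; real MPI programs have fewer feasible interleavings because communication constrains the order. The function name `interleavings` is hypothetical.

```python
from math import factorial


def interleavings(steps_per_process):
    """Count the orderings of independent processes' steps.

    Each process keeps its own internal order, so the count is the multinomial
    coefficient (sum of steps)! / product of (steps_i)!.
    """
    total = factorial(sum(steps_per_process))
    for steps in steps_per_process:
        total //= factorial(steps)  # exact: each prefix is itself a multinomial
    return total


two_by_two = interleavings([2, 2])        # 2 processes, 2 steps each
four_by_five = interleavings([5, 5, 5, 5])  # 4 processes, 5 steps each
```

Even four processes with five steps each give more than eleven billion orderings, which is why a reduction that skips equivalent interleavings matters.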
ISBN: 9781595939609 (print)
The proceedings contain 42 papers. The topics discussed include: automatic data movement and computation mapping for multi-level parallel architectures with explicitly managed memories; type inference for locality analysis of distributed data structures; quasi-static scheduling for safe futures; scalable packet classification using interpreting: a cross-platform multi-core solution; FastForward for efficient pipeline parallelism: a cache-optimized concurrent lock-free queue; matrix product on heterogeneous master-worker platforms; high performance dense linear algebra on a spatially distributed processor; optimization principles and application performance evaluation of a multithreaded GPU using CUDA; a case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding; programming with tiles; design and implementation of a high-performance MPI for C# and the common language infrastructure; and a portable runtime interface for multi-level memory hierarchies.
ISBN: 9781605585987 (print)
The rising interest in Java for High Performance Computing (HPC) is based on the appealing features of this language for programming multi-core cluster architectures, particularly the built-in networking and multithreading support, and the continuous increase in Java Virtual Machine (JVM) performance. However, its adoption in this area is being delayed by the lack of analysis of the existing programming options in Java for HPC and evaluations of their performance, as well as limited awareness of the current research projects in this field, whose solutions are needed in order to boost the adoption of Java in HPC. This paper analyzes the current state of Java for HPC, both for shared and distributed memory programming, presents related research projects, and finally, evaluates the performance of current Java HPC solutions and research developments on a multi-core cluster with a high-speed network, InfiniBand, and a 24-core shared memory machine. The main conclusions are that: (1) the significant interest in Java for HPC has led to the development of numerous projects, although usually quite modest, which may have prevented further development of Java in this field; and (2) Java can achieve performance close to that of native languages, both for sequential and parallel applications, making it an alternative for HPC programming. Thus, the good prospects of Java in this area are attracting the attention of both industry and academia, which can take significant advantage of Java adoption in HPC. Copyright 2009 ACM.
ISBN: 9781605583976 (print)
The arrival of multi-core chips has heightened interest in the discipline of parallel programming, a topic that has received much attention for many years. Computer architects have much to learn from sound principles for structuring software and expressing parallel computation. This talk will cover principles for the design of computer systems to support composable parallel software - the idea that any parallel program is usable, without change, as a component of larger parallel programs. By following these principles, a revolution in the ease of building robust and high-performance parallel software can be achieved. The principles suggest interesting directions for computer architecture; the tools to experiment with new architecture concepts are ready and waiting for the savvy and ambitious researcher.