We have modified the C language to support a programming model based on a shared address space with physically distributed memory. With this model, users can write programs in which the nodes of a massively parallel processor can access remote memory without message passing. AC provides support for distributed arrays as well as pointers to distributed data. Simple array references and pointer dereferencing are sufficient to generate low-overhead remote reads and writes. We have implemented these ideas in a compiler based on the GNU C compiler and targeted at Cray Research's T3D. Initial performance measurements show that AC generates code for remote accesses which is considerably faster than that of the native compiler for structures up to about 16 words in size and virtually equivalent for larger transfers.
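To make the access model concrete, here is a hedged sketch in plain C rather than AC's own syntax of how dereferencing a pointer to distributed data can lower to a one-word remote read or write. The `global_ptr` type and the `remote_read`/`remote_write` helpers are hypothetical stand-ins for what such a compiler would emit, and the per-node memories are simulated locally so the example is self-contained.

```c
/* Illustrative sketch only: not AC syntax. Shows how a "global pointer"
 * (node id + offset) dereference could lower to a one-word remote access,
 * here simulated with ordinary local memory. All names are hypothetical. */
#include <stdio.h>

#define NODES 4
#define WORDS 8

/* Simulated per-node memories standing in for physically distributed memory. */
static long node_mem[NODES][WORDS];

typedef struct {
    int node;    /* owning node              */
    int offset;  /* word offset on that node */
} global_ptr;

/* What the compiler might emit for "x = *p" on a remote global pointer. */
static long remote_read(global_ptr p)
{
    return node_mem[p.node][p.offset];   /* real code would issue a network get */
}

/* What the compiler might emit for "*p = v". */
static void remote_write(global_ptr p, long v)
{
    node_mem[p.node][p.offset] = v;      /* real code would issue a network put */
}

int main(void)
{
    global_ptr p = { .node = 2, .offset = 5 };
    remote_write(p, 42);                 /* "*p = 42" */
    printf("%ld\n", remote_read(p));     /* "x = *p"  */
    return 0;
}
```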
ISBN (print): 9781450301190
We introduce the major ideas of a wait-free, linearizable, and disjoint-access-parallel NCAS library called RTNCAS. It focuses on the construction of wait-free data structure operations (DSOs) in real-time settings. RTNCAS is able to conditionally swap multiple independent words (NCAS) in an atomic manner. Furthermore, it allows arbitrary DSOs to be implemented by means of their sequential specifications.
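As an illustration of what an NCAS interface might look like, the sketch below defines a hypothetical `ncas` function and uses it to update two independent words atomically. The lock-based body only models NCAS's atomic semantics; it is neither wait-free nor disjoint-access parallel, so it should not be read as RTNCAS's algorithm.

```c
/* Sketch of a possible NCAS interface (names hypothetical). The lock-based
 * body models only the atomic semantics of NCAS; RTNCAS itself is wait-free
 * and disjoint-access parallel, which this simulation is not. */
#include <pthread.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static pthread_mutex_t big_lock = PTHREAD_MUTEX_INITIALIZER;

/* Atomically: if *addr[i] == expected[i] for all i, store new_val[i] everywhere. */
static bool ncas(int n, intptr_t *addr[], const intptr_t expected[],
                 const intptr_t new_val[])
{
    bool ok = true;
    pthread_mutex_lock(&big_lock);
    for (int i = 0; i < n; i++)
        if (*addr[i] != expected[i]) { ok = false; break; }
    if (ok)
        for (int i = 0; i < n; i++)
            *addr[i] = new_val[i];
    pthread_mutex_unlock(&big_lock);
    return ok;
}

int main(void)
{
    /* Example: move a value between two independent slots in one atomic step. */
    intptr_t head = 7, tail = 0;
    intptr_t *addrs[2]   = { &head, &tail };
    intptr_t expected[2] = { 7, 0 };
    intptr_t desired[2]  = { 0, 7 };
    printf("ncas %s: head=%ld tail=%ld\n",
           ncas(2, addrs, expected, desired) ? "succeeded" : "failed",
           (long)head, (long)tail);
    return 0;
}
```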
Scalable busy-wait synchronization algorithms are essential for achieving good parallel program performance on large scale multiprocessors. Such algorithms include mutual exclusion locks, reader-writer locks, and barrier synchronization. Unfortunately, scalable synchronization algorithms are particularly sensitive to the effects of multiprogramming: their performance degrades sharply when processors are shared among different applications, or even among processes of the same application. In this paper we describe the design and evaluation of scalable scheduler-conscious mutual exclusion locks, reader-writer locks, and barriers, and show that by sharing information across the kernel/application interface we can improve the performance of scheduler-oblivious implementations by more than an order of magnitude.
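A minimal sketch of the kernel/application information-sharing idea, not the paper's actual algorithms: a test-and-set lock whose waiters consult a hypothetical `holder_preempted` flag (standing in for state the kernel would publish to user space) and yield the processor instead of spinning when the lock holder has been descheduled.

```c
/* Hedged sketch of "scheduler-conscious" spinning (not the paper's algorithms).
 * `holder_preempted` stands in for state the kernel would publish to user
 * space; here it is an ordinary flag so the example is self-contained. */
#include <stdatomic.h>
#include <stdbool.h>
#include <sched.h>

typedef struct {
    atomic_int  locked;            /* 0 = free, 1 = held       */
    atomic_bool holder_preempted;  /* hypothetically kernel-set */
} sc_lock;

static void sc_lock_acquire(sc_lock *l)
{
    for (;;) {
        int expected = 0;
        if (atomic_compare_exchange_weak(&l->locked, &expected, 1))
            return;
        /* Scheduler-conscious part: if the holder lost its processor,
         * spinning is wasted work, so give the CPU back instead. */
        if (atomic_load(&l->holder_preempted))
            sched_yield();
    }
}

static void sc_lock_release(sc_lock *l)
{
    atomic_store(&l->locked, 0);
}

int main(void)
{
    static sc_lock l;   /* zero-initialized: free, holder not preempted */
    sc_lock_acquire(&l);
    sc_lock_release(&l);
    return 0;
}
```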
Effective memory hierarchy utilization is critical to the performance of modern multiprocessor architectures. We have developed the first compiler system that fully automatically parallelizes sequential programs and changes the original array layouts to improve memory system performance. Our optimization algorithm consists of two steps. The first step chooses the parallelization and computation assignment such that synchronization and data sharing are minimized. The second step then restructures the layout of the data in the shared address space with an algorithm that is based on a new data transformation framework. We ran our compiler on a set of application programs and measured their performance on the Stanford DASH multiprocessor. Our results show that the compiler can effectively optimize parallelism in conjunction with memory subsystem performance.
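The following hand-written C example illustrates the kind of data layout transformation such a compiler might apply automatically (it is not the paper's algorithm): when each processor works on one column of a row-major array, storing the array transposed turns that processor's strided accesses into unit-stride, cache-friendly ones.

```c
/* Hand-written illustration of a data layout transformation of the kind such
 * a compiler might apply automatically (not its actual algorithm). If each
 * processor p works on column p of A, the row-major layout puts that data at
 * stride N; storing A transposed makes it contiguous. */
#include <stdio.h>

#define N 4

/* Original layout: processor p touches A[0][p] .. A[N-1][p] (stride N). */
static double A[N][N];

/* Transformed layout: processor p touches At[p][0] .. At[p][N-1]
 * (contiguous, so few cache lines and no false sharing between processors). */
static double At[N][N];

static double column_sum_original(int p)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += A[i][p];   /* strided accesses */
    return s;
}

static double column_sum_transformed(int p)
{
    double s = 0.0;
    for (int i = 0; i < N; i++) s += At[p][i];  /* unit-stride accesses */
    return s;
}

int main(void)
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            A[i][j] = At[j][i] = i * N + j;
    printf("%g %g\n", column_sum_original(1), column_sum_transformed(1));
    return 0;
}
```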
A major challenge in fine-grained computing is achieving locality without excessive scheduling overhead. We built two J-Machine implementations of a fine-grained programming model, the Berkeley Threaded Abstract Machine (TAM). One implementation takes an Active Messages approach, maintaining a scheduling hierarchy in software in order to improve data cache performance. The other relies on the J-Machine's message queues and fast task switch, lowering the control costs at the expense of data locality. Our analysis measures the costs and benefits of each approach for a variety of programs and cache configurations. The Active Messages implementation is strongest when miss penalties are high and for the finest-grained programs. The hardware-buffered implementation is strongest in direct-mapped caches, where it achieves substantially better instruction cache performance.
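A minimal, single-node sketch of the Active Messages style of dispatch assumed here, not the J-Machine or TAM implementations themselves: each message carries a handler pointer and an argument, and a small software queue lets the runtime batch and drain related work, which is the structure that can be scheduled for data cache locality.

```c
/* Minimal single-node sketch of Active Messages style dispatch (not the
 * J-Machine/TAM implementations). Each message names its handler; a software
 * queue lets the runtime batch related work for cache locality. */
#include <stdio.h>

typedef void (*am_handler)(int arg);

typedef struct { am_handler handler; int arg; } am_msg;

#define QLEN 64
static am_msg queue[QLEN];
static int head, tail;

static void am_send(am_handler h, int arg)   /* enqueue instead of a real network send */
{
    queue[tail % QLEN] = (am_msg){ h, arg };
    tail++;
}

static void am_poll(void)                    /* software scheduler: drain the queue */
{
    while (head != tail) {
        am_msg m = queue[head % QLEN];
        head++;
        m.handler(m.arg);                    /* run the handler named by the message */
    }
}

static void add_one(int x) { printf("add_one(%d) -> %d\n", x, x + 1); }

int main(void)
{
    am_send(add_one, 41);
    am_send(add_one, 1);
    am_poll();
    return 0;
}
```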
ISBN (print): 9781450319225
We present a simple yet effective technique for improving performance of lock-based code using the hardware lock elision (HLE) feature in Intel's upcoming Haswell processor. We also describe how to extend Haswell's HLE mechanism to achieve a similar effect to our lock elision scheme entirely in hardware.
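For context, GCC exposes HLE through extra flags on its atomic builtins; the sketch below shows a generic elided test-and-set lock built that way (x86 target, GCC 4.8 or later, compiled with -mhle). It illustrates how HLE is typically driven from C and is not necessarily the specific elision scheme the paper proposes.

```c
/* Generic HLE usage via GCC's atomic builtins (x86, GCC >= 4.8, -mhle);
 * not necessarily the elision scheme the paper proposes. */
#include <immintrin.h>   /* _mm_pause */

static int lockvar;          /* 0 = free, 1 = held */
static int shared_counter;

static void hle_lock(void)
{
    /* __ATOMIC_HLE_ACQUIRE makes GCC emit an XACQUIRE-prefixed exchange, so
     * the hardware may elide the lock and run the critical section
     * speculatively, falling back to real locking on conflict. */
    while (__atomic_exchange_n(&lockvar, 1,
                               __ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE))
        _mm_pause();         /* polite spin while the lock appears held */
}

static void hle_unlock(void)
{
    /* __ATOMIC_HLE_RELEASE emits an XRELEASE-prefixed store, ending elision. */
    __atomic_store_n(&lockvar, 0, __ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE);
}

int main(void)
{
    hle_lock();
    shared_counter++;        /* critical section; runs transactionally when elided */
    hle_unlock();
    return shared_counter == 1 ? 0 : 1;
}
```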
ISBN (print): 9781450362252
In this tutorial participants learn how to build their own parallelprogramming language features by developing them as language extensions in the ableC [4] extensible C compiler framework. By implementing new parallelprogramming abstractions as language extensions one can build on an existing host language and thus avoid re-implementing common language features such as the type checking and code generation of arithmetic expressions and control flow statements. Using ableC, one can build expressive language features that fit seamlessly into the C11 host language.
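To suggest how an extension can lean on the host language, the sketch below shows, in ordinary C11 with pthreads, the kind of code a hypothetical parallel-for extension might translate into; the extension author would specify the new syntax and this translation, while type checking and code generation of the loop body remain the host compiler's job. All names here are hypothetical, and this is not ableC's actual output or API.

```c
/* Hypothetical illustration: plain C11 code that a made-up "parallel for"
 * extension might generate, so the extension only specifies syntax and this
 * translation rather than re-implementing C's type checking or codegen.
 * Not ableC's actual output or API. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000

static double a[N], b[N];

typedef struct { int begin, end; } range;

static void *body(void *arg)               /* loop body, outlined per thread */
{
    range r = *(range *)arg;
    for (int i = r.begin; i < r.end; i++)
        a[i] = 2.0 * b[i];
    return NULL;
}

int main(void)
{
    for (int i = 0; i < N; i++) b[i] = i;

    pthread_t tid[NTHREADS];
    range     part[NTHREADS];
    for (int t = 0; t < NTHREADS; t++) {   /* static partition of [0, N) */
        part[t] = (range){ t * N / NTHREADS, (t + 1) * N / NTHREADS };
        pthread_create(&tid[t], NULL, body, &part[t]);
    }
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);

    printf("a[999] = %g\n", a[999]);
    return 0;
}
```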
ISBN (print): 9781605587080
The Pilot library is a new method for programming MPI-enabled clusters in C, targeted at novice parallel programmers. Formal elements from Communicating Sequential Processes (CSP) are used to realize a process/channel model of parallel computation that reduces opportunities for deadlock and other communication errors. This simple model, plus an application programming interface (API) styled after C's formatted I/O, is designed to make the library easy to learn. The Pilot library exists as a thin layer on top of any standard Message Passing Interface (MPI) implementation, preserving MPI's portability and efficiency, with little performance overhead arising as a result of Pilot's additional features.
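A hedged sketch of the process/channel, formatted-I/O style the abstract describes follows; the PI_* names and signatures are an approximation reconstructed from memory of Pilot's documentation and may not match the library's current API exactly.

```c
/* Sketch of Pilot-style process/channel code. The PI_* names and signatures
 * are an approximation of the library's API (reconstructed from memory),
 * shown only to illustrate the printf/scanf-style channel calls. */
#include <pilot.h>
#include <stdio.h>

static PI_PROCESS *worker;
static PI_CHANNEL *to_worker, *from_worker;

static int work(int index, void *arg)
{
    int x;
    PI_Read(to_worker, "%d", &x);          /* scanf-style receive */
    PI_Write(from_worker, "%d", x * x);    /* printf-style send   */
    return 0;
}

int main(int argc, char *argv[])
{
    PI_Configure(&argc, &argv);            /* wraps MPI startup; begins configuration phase */

    worker      = PI_CreateProcess(work, 0, NULL);
    to_worker   = PI_CreateChannel(PI_MAIN, worker);
    from_worker = PI_CreateChannel(worker, PI_MAIN);

    PI_StartAll();                         /* configuration ends; processes start running */

    int result;
    PI_Write(to_worker, "%d", 7);
    PI_Read(from_worker, "%d", &result);
    printf("7 squared is %d\n", result);

    PI_StopMain(0);                        /* shuts down Pilot and MPI */
    return 0;
}
```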
ISBN (print): 9781450301190
We describe two novel constructs for programming parallel machines with multi-level memory hierarchies: call-up, which allows a child task to invoke computation on its parent, and spawn, which spawns a dynamically determined number of parallel children until some termination condition in the parent is met. We show that together these constructs allow applications with irregular parallelism to be programmed in a straightforward manner, and furthermore that they complement and can be combined with constructs for expressing regular parallelism. We have implemented spawn and call-up in Sequoia, and we present an experimental evaluation on a number of irregular applications.
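A plain-C, single-threaded simulation of the control flow of the two constructs, shown only to fix intuition (this is not Sequoia syntax or its runtime): the parent keeps spawning children until a termination condition it owns is satisfied, and each child can call up into a function the parent provides.

```c
/* Plain-C, single-threaded simulation of the control flow of spawn and
 * call-up (not Sequoia syntax or its runtime). The parent spawns children
 * until its own termination test says to stop; each child may "call up" to
 * run a function supplied by the parent. */
#include <stdio.h>

typedef struct parent {
    int  work_found;                        /* state only the parent owns     */
    void (*call_up)(struct parent *, int);  /* what children invoke on parent */
} parent;

static void report_result(parent *p, int result)   /* the call-up target */
{
    p->work_found += result;
}

static void child(parent *p, int id)
{
    int result = id % 3;            /* stand-in for irregular work discovery */
    p->call_up(p, result);          /* call-up: child invokes parent code    */
}

int main(void)
{
    parent p = { .work_found = 0, .call_up = report_result };

    /* spawn: launch a dynamically determined number of children until the
     * parent's termination condition is met. */
    int id = 0;
    while (p.work_found < 10)
        child(&p, id++);

    printf("spawned %d children, accumulated %d\n", id, p.work_found);
    return 0;
}
```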
ISBN (print): 9781450392044
We present KUMQUAT, a system for automatically generating data-parallel implementations of UNIX shell commands and pipelines. The generated parallel versions split input streams, execute multiple instantiations of the original pipeline commands to process the splits in parallel, then combine the resulting parallel outputs to produce the final output stream. KUMQUAT automatically synthesizes the combine operators, with a domain-specific combiner language acting as a strong regularizer that promotes efficient inference of correct combiners. We present experimental results that show that these combiners enable the effective parallelization of our benchmark scripts.
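The hand-written C example below illustrates the split / parallel-execute / combine shape that the system automates for shell pipelines, not its implementation: the "command" counts newlines in each chunk, so the correct combiner is simply addition over the partial counts.

```c
/* Hand-written illustration of the split / parallel-execute / combine shape
 * that KUMQUAT automates for shell pipelines (not its implementation). The
 * "command" counts newlines, so the correct combiner is addition. */
#include <pthread.h>
#include <stdio.h>

typedef struct { const char *chunk; long count; } task;

static void *count_lines(void *arg)          /* one instantiation of the "command" */
{
    task *t = arg;
    t->count = 0;
    for (const char *p = t->chunk; *p; p++)
        if (*p == '\n') t->count++;
    return NULL;
}

int main(void)
{
    /* Split the input stream into chunks (here, split by hand at a line boundary). */
    task parts[2] = { { .chunk = "a\nb\nc\n" }, { .chunk = "d\ne\n" } };

    pthread_t tid[2];
    for (int i = 0; i < 2; i++)              /* execute the splits in parallel */
        pthread_create(&tid[i], NULL, count_lines, &parts[i]);
    for (int i = 0; i < 2; i++)
        pthread_join(tid[i], NULL);

    long total = parts[0].count + parts[1].count;   /* combiner: addition */
    printf("wc -l equivalent: %ld\n", total);
    return 0;
}
```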