Many of today's high level parallel languages support dynamic, fine-grained parallelism. these languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than...
详细信息
Many of today's high level parallel languages support dynamic, fine-grained parallelism. these languages allow the user to expose all the parallelism in the program, which is typically of a much higher degree than the number of processors. Hence an efficient scheduling algorithm is required to assign computations to processors at runtime. Besides having low overheads and good load balancing, it is important for the scheduling algorithm to minimize the space usage of the parallel program. this paper presents a scheduling algorithm that is provably space-efficient and time-efficient for nested parallel languages. In addition to proving the space and time bounds of the parallel schedule generated by the algorithm, we demonstrate that it is efficient in practice. We have implemented a runtime system that uses our algorithm to schedule parallelthreads. the results of executing parallel programs on this system show that our scheduling algorithm significantly reduces memory usage compared to previous techniques, without compromising performance.
In programming high performance applications, shared address-space platforms are preferable for fine-grained computation, while distributed address-space platforms are more suitable for coarse-grained computation. How...
详细信息
ISBN:
(纸本)9781581135886
In programming high performance applications, shared address-space platforms are preferable for fine-grained computation, while distributed address-space platforms are more suitable for coarse-grained computation. However, currently only distributed address-space systems scale beyond the low hundreds of processors. In this paper we introduce a hybrid architecture that allows users to trade off local memory usage for coherence communication, making possible larger-scale shared memory architectures. We introduce a programming model and examine possible implementations of hardware mechanisms, evaluating some of the trade-offs inherent in each. Preliminary experiments on an application with particularly fine-grained communication requirements indicate that effective placement of directives can reduce coherence communication by more than a factor of 10 for 64 processors.
We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. the key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately e...
详细信息
ISBN:
(纸本)9781450368186
We introduce a structure-aware parallel technique for context-bounded analysis of concurrent programs. the key intuition consists in decomposing the set of concurrent traces into symbolic subsets that are separately explored by multiple instances of the same decision procedure running in parallel. the decision procedures work on different partitions of the search space without cooperating, whence distribution follows effortlessly. Our experiments on a selection of complex multi-threaded programs show significant analysis speedups and scalability, and greater performance gains than with general-purpose parallel solvers.
Multi-core and many-core were already major trends for the past six years, and are expected to continue for the next decades. Withthese trends of parallel computing, it becomes increasingly difficult to decide on whi...
详细信息
ISBN:
(纸本)9781450311601
Multi-core and many-core were already major trends for the past six years, and are expected to continue for the next decades. Withthese trends of parallel computing, it becomes increasingly difficult to decide on which architecture to run a given application. In this work, we use an algorithm classification to predict performance prior to algorithm implementation. For this purpose, we modify the roofline model to include class information. In this way, we enable architectural choice through performance prediction prior to the development of architecture specific code. the new model, the boat hull model, is demonstrated using a GPU as a target architecture. We show for 6 example algorithms that performance is predicted accurately without requiring code to be available.
this poster proposes an efficient runtime scheduler that provides provable performance guarantees to parallel programs that use data structures through the use of implicit batching.
ISBN:
(纸本)9781450326568
this poster proposes an efficient runtime scheduler that provides provable performance guarantees to parallel programs that use data structures through the use of implicit batching.
Boosted transactions offer an attractive method that enables programmers to create larger transactions that scale well and offer deadlock-free guarantees. However, as boosted transactions get larger, they become more ...
详细信息
ISBN:
(纸本)9781605583976
Boosted transactions offer an attractive method that enables programmers to create larger transactions that scale well and offer deadlock-free guarantees. However, as boosted transactions get larger, they become more susceptible to conflicts and aborts. We describe a linear-time algorithm to detect transactions that cannot make progress, which transactions need to be aborted, and when. the algorithm guarantees zero false positives with minimal aborts. Our proposals, as implemented in DSTM2, increase the transactional throughput of the system, often by more than 30%.
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallelprogramming on may-core architectures. We show that the NB-FEB primitive is universal, s...
详细信息
ISBN:
(纸本)9781605583976
We introduce a non-blocking full/empty bit primitive, or NB-FEB for short, as a promising synchronization primitive for parallelprogramming on may-core architectures. We show that the NB-FEB primitive is universal, scalable and feasible. NB-FEB, together with registers, can solve the consensus problem for an arbitrary number of processes (universality). NB-FEB is combinable, namely its memory requests to the same memory location can be combined into only one memory request, which consequently mitigates performance degradation due to synchronization "hot spots" (scalability). Since NB-FEB is a variant of the original full/empty bit that always returns a value instead of waiting for a conditional flag, it is as feasible as the original full/empty bit, which has been implemented in many computer systems (feasibility).
Loop tiling and communication optimizations, such as message pipelining and aggregation, can achieve optimized and robust memory performance by proactively managing storage and data movement. In this paper, we general...
详细信息
Loop tiling and communication optimizations, such as message pipelining and aggregation, can achieve optimized and robust memory performance by proactively managing storage and data movement. In this paper, we generalize these techniques to pointer-based data structures (PBDSs). Our approach, dynamic pointer alignment (DPA), has two components. the compiler decomposes a program into non-blocking threads that operate on specific pointers and labels thread creation sites withtheir corresponding pointers. At runtime, an explicit mapping from pointers to dependent threads is updated at thread creation and is used to dynamically schedule boththreads and communication, such that threads using the same objects execute together, communication overlaps with local work, and messages are aggregated. We have implemented DPA to optimize remote reads to global PBDSs on parallel machines. Our empirical results on the force computation phases of two applications that use sophisticated PBDSs, Barnes-Hut and FMM, show that DPA achieves good absolute performance and speedups by enabling tiling and communication optimizations on the CRAY T3D.
Live sequence charts (LSCs) have been proposed as an inter-object scenario-based specification and visual programming language for reactive systems. In this paper, we introduce a logic-based framework to check the con...
详细信息
ISBN:
(纸本)9781605585680
Live sequence charts (LSCs) have been proposed as an inter-object scenario-based specification and visual programming language for reactive systems. In this paper, we introduce a logic-based framework to check the consistency of an LSC specification. An LSC simulator has been implemented in logic programming, utilizing a memoized depth-first search strategy, to show how a reactive system in LSCs would response to a set of external event sequences. A formal notation is defined to specify external event sequences, extending the regular expression with a parallel operator and a testing control. the parallel operator allows interleaved parallel external events to be tested in LSCs simultaneously;while the testing control provides users to a new approach to specify and test certain temporal properties (e.g., CTL formula) in a form of LSC. Our framework further provides either a state transition graph or a failure trace to justify the consistency checking results.
暂无评论