Networks of workstations (NOWs), which are generally composed of autonomous compute elements networked together, axe an attractive parallel computing platform since they offer high performance at low cost. The autonom...
详细信息
ISBN:
(纸本)9781581133462
Networks of workstations (NOWs), which are generally composed of autonomous compute elements networked together, axe an attractive parallel computing platform since they offer high performance at low cost. The autonomous nature of the environment, however, often results in inefficient utilization due to load imbalances caused by three primary factors: 1) unequal load (compute or communication) assignment to equally-powerful compute nodes, 2) unequal resources at compute nodes, and 3) multiprogramming. These load imbalances result in idle waiting time on cooperating processes that need to synchronize or communicate data. Additional waiting time may result due to local scheduling decisions in a multiprogrammed environment. In this paper, we present a combined approach of compile-time analysis, run-time load distribution, and operating system scheduler cooperation for improved utilization of available resources in an autonomous NOW. The techniques we propose allow efficient resource utilization by taking into consideration all three causes of load imbalance in addition to locality of access in the process of load distribution. The resulting adaptive load distribution and cooperative scheduling system allows applications to take advantage of parallel resources when available by providing better performance than when the loaded resources axe not used at all.
In shared memory programs contention often occurs at the transition between a sequential and a parallel section of the code. As all threads start executing the parallel section, they often access data just modified by...
详细信息
ISBN:
(纸本)9781581133462
In shared memory programs contention often occurs at the transition between a sequential and a parallel section of the code. As all threads start executing the parallel section, they often access data just modified by the thread that executed the sequential section, causing a flurry of data requests to converge on that processor. We address this problem in a software distributed shared memory system by replicating the execution of the sequential sections on all processors. Communication during this replicated sequential execution is reduced by using multicast. We have implemented replicated sequential execution with multicast support in OpenMP/NOW, a version of of OpenMP that runs on networks of workstations. We do not rely on compile-time data analysis, and therefore we can handle irregular and pointer-based applications. We show significant improvement for two pointer-based applications that suffer from severe contention without replicated sequential execution.
Distributing data is one of the key problems in implementing efficient distributed-memory parallel programs. The problem becomes more difficult in programs where data redistribution between computational phases is con...
详细信息
ISBN:
(纸本)9781581133462
Distributing data is one of the key problems in implementing efficient distributed-memory parallel programs. The problem becomes more difficult in programs where data redistribution between computational phases is considered. The global data distribution problem is to find the optimal distribution in multi-phase parallel programs. Solving this problem requires accurate knowledge of data redistribution cost. We are investigating this problem in the context of a software distributed shared memory (SDSM) system, in which obtaining accurate redistribution cost estimates is difficult. This is because SDSM communication is implicit: It depends on access patterns, page locations, and the SDSM consistency protocol. We have developed integrated compile- and run-time analysis for SDSM systems to determine accurate redistribution cost estimates with low overhead. Our resulting system, SUIF-Adapt, can efficiently and accurately estimate execution time, including redistribution, to within 5% of the actual time in all of our test cases and is often much closer. These precise costs enable SUIF-Adapt to find efficient global data distributions in multiple-phase programs.
Recent proposals for multithreaded architectures allow threads with unknown dependences to execute speculatively in parallel. These architectures use hardware speculative storage to buffer uncertain data, track data d...
详细信息
Recent proposals for multithreaded architectures allow threads with unknown dependences to execute speculatively in parallel. These architectures use hardware speculative storage to buffer uncertain data, track data dependences and roll back incorrect executions. Because all memory references access the speculative storage, current proposals implement this storage using small memory structures for fast access. The limited capacity of the speculative storage causes considerable performance loss due to speculative storage overflow whenever a thread's speculative state exceeds the storage capacity. Larger threads exacerbate the over-flow problem but are preferable to smaller threads, as larger threads uncover more parallelism. In this paper, we discover a new program property called memory reference idempotency. Idempotent references need not be tracked in the speculative storage, and instead can directly access non-speculative storage (i.e., the conventional memory hierarchy). Thus, we reduce the demand fo r speculative storage space. We define a formal framework for reference idempotency and present a novel compiler-assisted speculative execution model. We prove the necessary and sufficient conditions for reference idempotency using our model. We present a compiler algorithm to label idempotent memory references for the hardware. Experimental results show that for our benchmarks, over 60% of the references in non-parallelizable program sections are idempotent.
The efficiency of HPF with respect to irregular applications is still largely unproven. While recent work has shown that a highly irregular hierarchical n-body force calculation method can be implemented in HPF, we ha...
详细信息
The efficiency of HPF with respect to irregular applications is still largely unproven. While recent work has shown that a highly irregular hierarchical n-body force calculation method can be implemented in HPF, we have found that the implementation contains inefficiencies which cause it to run up to a factor of three times slower than our hand-coded, explicitly parallel implementation. Our work examines these inefficiencies, determines that most of the extra overhead is due to a single aspect of the communication strategy, and demonstrates that fixing the communication strategy can bring the overheads of the HPF application to within 25% of those of the hand-coded version.
The design of non-trace based performance instrumentation techniques for threaded programs is investigated to provide detailed performance data while maintaining control of instrumentation costs. The design is based o...
详细信息
The design of non-trace based performance instrumentation techniques for threaded programs is investigated to provide detailed performance data while maintaining control of instrumentation costs. The design is based on low contention data structures. The Paradyn's dynamic instrumentation is extended to handle threaded programs. To associate data with individual threads, all threads must share the same instrumentation code and assign each thread with its own private copy of performance counters or timers. The asynchrony in a threaded program poses a major challenge to dynamic instrumentation.
In comparison to automatic parallelization, which is thoroughly studied in the literature, classical analyses and optimizations of explicitly parallel programs were more or less neglected. This may be due to the fact ...
详细信息
In comparison to automatic parallelization, which is thoroughly studied in the literature, classical analyses and optimizations of explicitly parallel programs were more or less neglected. This may be due to the fact that naive adaptations of the sequential techniques fail, and their straightforward correct ones have unacceptable costs caused by the interleavings, which manifest the possible executions of a parallel program. Recently, however, we showed that unidirectional bitvector analyses can be performed for parallel programs as easily and as efficiently as for sequential ones, a necessary condition for the successful transfer of the classical optimizations to the parallel setting. In this article we focus on possible subsequent code motion transformations, which turn out to require much more care than originally conjectured. Essentially, this is due to the fact that interleaving semantics, although being adequate for correctness considerations, fails when it comes to reasoning about efficiency of parallel programs. This deficiency, however, can be overcome by strengthening the specific treatment of synchronization points.
Accurate simulation of large parallel applications can be facilitated with the use of direct execution and parallel discrete event simulation. This paper describes the use of COMPASS, a direct execution-driven, parall...
详细信息
Accurate simulation of large parallel applications can be facilitated with the use of direct execution and parallel discrete event simulation. This paper describes the use of COMPASS, a direct execution-driven, parallel simulator for performance prediction of programs that include both communication and I/O intensive applications. The simulator has been used to predict the performance of such applications on both distributed memory machines like the IBM SP and shared-memory machines like the SGI Origin 2000. The paper illustrates the usefulness of COMPASS as a versatile performance prediction tool. We use both real-world applications and synthetic benchmarks to study application scalability, sensitivity to communication latency, and the interplay between factors like communication pattern and parallel file system caching on application performance. We also show that the simulator is accurate in its predictions and that it is also efficient in its ability to use parallel simulation to reduce its own execution time which, in some cases, has yielded a near-linear speedup.
Traditional compiler techniques developed for sequential programs do not guarantee the correctness (sequential consistency) of compiler transformations when applied to parallel programs. This is because traditional co...
详细信息
Traditional compiler techniques developed for sequential programs do not guarantee the correctness (sequential consistency) of compiler transformations when applied to parallel programs. This is because traditional compilers for sequential programs do not account for the updates to a shared variable by different threads. We present a concurrent static single assignment (CSSA) form for parallel programs containing cobegin/coend and parallel do constructs and post/wait synchronization primitives. Based on the CSSA form, we present copy propagation and dead code elimination techniques. Also, a global value numbering technique that detects equivalent variables in parallel programs is presented. By using global value numbering and the CSSA form, we extend classical common subexpression elimination, redundant load/store elimination, and loop invariant detection to parallel programs without violating sequential consistency. These optimization techniques are the most commonly used techniques for sequential programs. By extending these techniques to parallel programs, we can guarantee the correctness of the optimized program and maintain single processor performance in a multiprocessor environment.
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses i...
详细信息
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system is successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiting attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides the programmers in the parallelization process using a set of sophisticated visualization technique. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of the program and privatizing a few variables.
暂无评论