Recent proposals for multithreaded architectures allow threads with unknown dependences to execute speculatively in parallel. These architectures use hardware speculative storage to buffer uncertain data, track data dependences, and roll back incorrect executions. Because all memory references access the speculative storage, current proposals implement this storage using small memory structures for fast access. The limited capacity of the speculative storage causes considerable performance loss due to speculative storage overflow whenever a thread's speculative state exceeds the storage capacity. Larger threads exacerbate the overflow problem but are preferable to smaller threads, as larger threads uncover more parallelism. In this paper, we identify a new program property called memory reference idempotency. Idempotent references need not be tracked in the speculative storage and can instead directly access non-speculative storage (i.e., the conventional memory hierarchy), reducing the demand for speculative storage space. We define a formal framework for reference idempotency and present a novel compiler-assisted speculative execution model. We prove the necessary and sufficient conditions for reference idempotency under our model. We present a compiler algorithm to label idempotent memory references for the hardware. Experimental results show that, for our benchmarks, over 60% of the references in non-parallelizable program sections are idempotent.
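A minimal sketch of the idea behind idempotency labeling, under a simplifying assumption that is ours, not the paper's: treat a read as idempotent when no speculative thread may write its address, so only the remaining reads consume speculative-buffer capacity. The function and address names are hypothetical.

```python
# Hypothetical sketch: classify read addresses so that idempotent ones
# bypass the speculative buffer, assuming (simplification) a reference
# is idempotent when no speculative thread may write its address.

def classify_references(reads, spec_writes):
    """Split reads into idempotent (go straight to non-speculative
    memory) and tracked (must occupy speculative storage)."""
    idempotent = [a for a in reads if a not in spec_writes]
    tracked = [a for a in reads if a in spec_writes]
    return idempotent, tracked

reads = ["a", "b", "c", "d"]
spec_writes = {"b"}          # addresses some speculative thread may write
idem, tracked = classify_references(reads, spec_writes)
print(idem)     # addresses that bypass speculative storage
print(tracked)  # addresses that consume buffer capacity
```

The paper's actual conditions are proved within its formal execution model; this sketch only illustrates why labeling a large idempotent fraction shrinks buffer demand.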
The design of non-trace-based performance instrumentation techniques for threaded programs is investigated to provide detailed performance data while keeping instrumentation costs under control. The design is based on low-contention data structures. Paradyn's dynamic instrumentation is extended to handle threaded programs. To associate data with individual threads, all threads share the same instrumentation code, and each thread is assigned its own private copy of performance counters or timers. The asynchrony in a threaded program poses a major challenge to dynamic instrumentation.
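The shared-code/private-counters pattern can be sketched as follows; this is a generic illustration of the low-contention idea, not Paradyn's implementation, and all names are hypothetical. Each thread increments its own counter dictionary, so the hot path takes no lock.

```python
import threading

# Sketch: shared instrumentation code, per-thread private counters.
# Only thread registration (once per thread) takes a lock; the
# per-event increment touches thread-local state only.

class PerThreadCounters:
    def __init__(self):
        self._local = threading.local()
        self._all = []
        self._reg_lock = threading.Lock()

    def incr(self, key):
        counters = getattr(self._local, "counters", None)
        if counters is None:                 # first event on this thread
            counters = self._local.counters = {}
            with self._reg_lock:
                self._all.append(counters)
        counters[key] = counters.get(key, 0) + 1   # contention-free

    def totals(self):
        out = {}
        with self._reg_lock:
            for c in self._all:
                for k, v in c.items():
                    out[k] = out.get(k, 0) + v
        return out

pc = PerThreadCounters()
threads = [threading.Thread(target=lambda: [pc.incr("calls") for _ in range(1000)])
           for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
print(pc.totals())   # {'calls': 4000}
```

Aggregation across threads happens only when the data is read, which is how the design keeps instrumentation cost off the measured code path.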
The efficiency of HPF with respect to irregular applications is still largely unproven. While recent work has shown that a highly irregular hierarchical n-body force calculation method can be implemented in HPF, we have found that the implementation contains inefficiencies which cause it to run up to a factor of three slower than our hand-coded, explicitly parallel implementation. Our work examines these inefficiencies, determines that most of the extra overhead is due to a single aspect of the communication strategy, and demonstrates that fixing the communication strategy brings the overheads of the HPF application to within 25% of those of the hand-coded version.
In comparison to automatic parallelization, which is thoroughly studied in the literature, classical analyses and optimizations of explicitly parallel programs have been more or less neglected. This may be due to the fact that naive adaptations of the sequential techniques fail, while their straightforward correct counterparts have unacceptable costs caused by the interleavings that manifest the possible executions of a parallel program. Recently, however, we showed that unidirectional bitvector analyses can be performed for parallel programs as easily and as efficiently as for sequential ones, a necessary condition for the successful transfer of the classical optimizations to the parallel setting. In this article we focus on possible subsequent code motion transformations, which turn out to require much more care than originally conjectured. Essentially, this is because interleaving semantics, although adequate for correctness considerations, fails when it comes to reasoning about the efficiency of parallel programs. This deficiency, however, can be overcome by strengthening the specific treatment of synchronization points.
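For readers unfamiliar with the sequential baseline the article builds on, a unidirectional bitvector analysis can be sketched as a fixed-point iteration over a control-flow graph; the example below is classical liveness on a toy CFG of our own making, not the article's parallel formulation.

```python
# Toy unidirectional bitvector analysis (liveness) over a small CFG.
# live_in[b] = gen[b] | (live_out[b] - kill[b]); live_out is the union
# of the successors' live_in sets. Iterate to a fixed point.

def liveness(cfg, gen, kill):
    live_in = {b: frozenset() for b in cfg}
    changed = True
    while changed:
        changed = False
        for b in cfg:
            out = (frozenset().union(*(live_in[s] for s in cfg[b]))
                   if cfg[b] else frozenset())
            new_in = gen[b] | (out - kill[b])
            if new_in != live_in[b]:
                live_in[b] = new_in
                changed = True
    return live_in

cfg  = {"entry": ["loop"], "loop": ["loop", "exit"], "exit": []}
gen  = {"entry": set(), "loop": {"x"}, "exit": {"y"}}   # uses before defs
kill = {"entry": {"x"}, "loop": set(), "exit": set()}   # definitions
result = liveness(cfg, gen, kill)
print(result)
```

The article's contribution is precisely that this style of analysis carries over to explicitly parallel programs without enumerating interleavings, which the naive adaptation would require.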
Accurate simulation of large parallel applications can be facilitated with the use of direct execution and parallel discrete event simulation. This paper describes the use of COMPASS, a direct-execution-driven, parallel simulator for performance prediction of programs, including communication- and I/O-intensive applications. The simulator has been used to predict the performance of such applications on both distributed-memory machines like the IBM SP and shared-memory machines like the SGI Origin 2000. The paper illustrates the usefulness of COMPASS as a versatile performance prediction tool. We use both real-world applications and synthetic benchmarks to study application scalability, sensitivity to communication latency, and the interplay between factors like communication pattern and parallel file system caching on application performance. We also show that the simulator is accurate in its predictions and efficient in its use of parallel simulation to reduce its own execution time, which in some cases has yielded near-linear speedup.
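The discrete-event core that a simulator of this kind rests on can be sketched in a few lines; this is a generic event loop, not COMPASS's actual API, and the latency values are illustrative.

```python
import heapq

# Minimal discrete-event simulation core: a priority queue of
# (time, seq, action) events, popped in simulated-time order.

class Simulator:
    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0      # tie-breaker so actions are never compared

    def schedule(self, delay, action):
        heapq.heappush(self._queue, (self.now + delay, self._seq, action))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, action = heapq.heappop(self._queue)
            action()

sim = Simulator()
log = []
# Model a message arrival with 5 time-unit latency and a local
# compute step finishing at time 1; order follows simulated time.
sim.schedule(5.0, lambda: log.append(("recv", sim.now)))
sim.schedule(1.0, lambda: log.append(("compute", sim.now)))
sim.run()
print(log)
```

In a direct-execution simulator, the "actions" are segments of the real application run natively, with only the communication and I/O events routed through a queue like this one.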
Traditional compiler techniques developed for sequential programs do not guarantee the correctness (sequential consistency) of compiler transformations when applied to parallel programs, because traditional compilers for sequential programs do not account for updates to a shared variable by different threads. We present a concurrent static single assignment (CSSA) form for parallel programs containing cobegin/coend and parallel do constructs and post/wait synchronization primitives. Based on the CSSA form, we present copy propagation and dead code elimination techniques, as well as a global value numbering technique that detects equivalent variables in parallel programs. Using global value numbering and the CSSA form, we extend classical common subexpression elimination, redundant load/store elimination, and loop invariant detection to parallel programs without violating sequential consistency. These are among the most commonly used optimization techniques for sequential programs. By extending them to parallel programs, we can guarantee the correctness of the optimized program and maintain single-processor performance in a multiprocessor environment.
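As background for the extension described above, the sequential value-numbering technique can be sketched on straight-line three-address code; the CSSA-specific handling of concurrent definitions (π-functions) is omitted, and the instruction format here is our own.

```python
# Toy value numbering over straight-line three-address code: two
# expressions get the same value number iff they apply the same
# operator to operands with equal value numbers.

def value_number(instrs):
    """instrs: list of (dest, op, arg1, arg2). Returns dest -> value number."""
    numbers = {}      # variable -> value number
    table = {}        # (op, vn1, vn2) -> value number
    next_vn = [0]

    def vn_of(x):
        if x not in numbers:
            numbers[x] = next_vn[0]; next_vn[0] += 1
        return numbers[x]

    for dest, op, a, b in instrs:
        key = (op,) + tuple(sorted((vn_of(a), vn_of(b))))  # commutative ops
        if key not in table:
            table[key] = next_vn[0]; next_vn[0] += 1
        numbers[dest] = table[key]
    return numbers

code = [("t1", "+", "x", "y"),
        ("t2", "+", "y", "x"),   # same value as t1: common subexpression
        ("t3", "+", "x", "t1")]
vn = value_number(code)
print(vn["t1"] == vn["t2"])  # True: t2 is redundant
print(vn["t1"] == vn["t3"])  # False
```

The paper's point is that such equivalences remain valid in a parallel program only when the analysis also accounts for interleaved updates to shared variables, which is what the CSSA form makes explicit.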
The SUIF Explorer is an interactive parallelization tool that is more effective than previous systems in minimizing the number of lines of code that require programmer assistance. First, the interprocedural analyses in the SUIF system are successful in parallelizing many coarse-grain loops, thus minimizing the number of spurious dependences requiring attention. Second, the system uses dynamic execution analyzers to identify those important loops that are likely to be parallelizable. Third, the SUIF Explorer is the first to apply program slicing to aid programmers in interactive parallelization. The system guides programmers through the parallelization process using a set of sophisticated visualization techniques. This paper demonstrates the effectiveness of the SUIF Explorer with three case studies. The programmer was able to speed up all three programs by examining only a small fraction of each program and privatizing a few variables.
An implementation scheme for fine-grain multithreading that requires no changes to current calling standards for sequential languages and only modest extensions to sequential compilers is described. Like previous similar systems, it performs an asynchronous call as if it were an ordinary procedure call, and detaches the callee from the caller when the callee suspends or either of them migrates to another processor. Unlike previous similar systems, it detaches and connects arbitrary frames generated by off-the-shelf sequential compilers obeying calling standards. As a consequence, it requires neither a front-end preprocessor nor a native code generator with a built-in notion of parallelism. The system works in practice with the unmodified GNU C compiler (GCC). Desirable extensions to sequential compilers for guaranteeing the portability and correctness of the scheme are clarified and argued to be modest. Experiments indicate that sequential performance is not sacrificed for practical applications and that both sequential and parallel performance are comparable to Cilk, whose current implementation requires a fairly sophisticated preprocessor for C. These results show that efficient asynchronous calls (a.k.a. future calls) can be integrated into current calling standards with very small impact on both sequential performance and compiler engineering.
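The future-call style the scheme implements at the stack-frame level can be illustrated with a high-level stand-in; Python's executor here replaces the compiler/runtime mechanism the paper describes, so this shows only the programming model, not the implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of a future call: the asynchronous call proceeds like an
# ordinary one, and the caller blocks only if it touches the result
# before the callee has finished.

def fib(n):
    return n if n < 2 else fib(n - 1) + fib(n - 2)

with ThreadPoolExecutor() as pool:
    fut = pool.submit(fib, 20)     # asynchronous call
    local = fib(10)                # caller keeps working meanwhile
    total = local + fut.result()   # synchronize only at first use
print(total)                       # 6820
```

In the paper's scheme the common case pays only an ordinary call; the detach machinery runs only when the callee actually suspends or a frame migrates.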
A central problem in executing performance-critical parallel and distributed applications on shared networks is the selection of computation nodes and communication paths for execution. Automatic selection of nodes is complex, as the best choice depends on the application structure as well as the expected availability of computation and communication resources. This paper presents a solution to this problem for realistic application and network scenarios. A new algorithm to jointly analyze computation and communication resources for different application demands is introduced, and a framework for automatic node selection is developed on top of Remos, a query interface to network information. The paper reports results from a set of applications, including Airshed pollution modeling and magnetic resonance imaging, executing on a high-speed network testbed. The results demonstrate that node selection is effective in enhancing application performance in the presence of computation load as well as network traffic. Under the network conditions used for the experiments, the increase in execution time due to compute loads and network congestion was reduced by half with node selection. The node selection algorithms developed in this research are also applicable to dynamic migration of long-running jobs.
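A hypothetical sketch of the joint computation/communication reasoning, not the paper's algorithm: rank candidate nodes by a predicted time combining compute demand over available CPU with data volume over available bandwidth. All node names, numbers, and the cost formula are illustrative.

```python
# Hypothetical node-selection sketch: score each candidate by
# predicted time = work / available-CPU + data / available-bandwidth,
# then pick the k best. Resource numbers would come from a query
# interface such as Remos in the paper's framework.

def predicted_time(node, work, data, cpu_avail, bw_avail):
    return work / cpu_avail[node] + data / bw_avail[node]

def select_nodes(candidates, k, work, data, cpu_avail, bw_avail):
    ranked = sorted(candidates,
                    key=lambda n: predicted_time(n, work, data,
                                                 cpu_avail, bw_avail))
    return ranked[:k]

cpu_avail = {"n1": 1.0, "n2": 0.5, "n3": 0.9}     # fraction of CPU free
bw_avail  = {"n1": 10.0, "n2": 100.0, "n3": 2.0}  # MB/s to the cluster
chosen = select_nodes(["n1", "n2", "n3"], 2, work=4.0, data=20.0,
                      cpu_avail=cpu_avail, bw_avail=bw_avail)
print(chosen)   # ['n1', 'n2']: n3's thin link outweighs its free CPU
```

Even this crude model captures the paper's observation that the best node depends jointly on compute load and network traffic: a lightly loaded node behind a congested link can lose to a busier but well-connected one.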
This paper develops a highly accurate LogGP model of a complex wavefront application that uses MPI communication on the IBM SP/2. Key features of the model include: (1) elucidation of the principal wavefront synchronization structure, and (2) explicit high-fidelity models of the MPI-send and MPI-receive primitives. The MPI-send/receive models are used to derive L, o, and G from simple two-node micro-benchmarks. Other model parameters are obtained by measuring small application problem sizes on four SP nodes. Results show that the LogGP model predicts, in seconds and with a high degree of accuracy, measured application execution time for large problems running on 128 nodes. Detailed performance projections are provided for very large future processor configurations that are expected to be available to the application developers. These results indicate that scaling beyond one or two thousand nodes yields greatly diminished returns in execution time, and that synchronization delays are a principal factor limiting the scalability of the application.
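As a worked sketch of the standard LogGP cost of a single message, with L (latency), o (per-end overhead), and G (per-byte gap): an m-byte transfer costs roughly o + L + (m-1)G + o. The parameter values below are illustrative, not the paper's measured SP/2 values.

```python
# LogGP cost of one m-byte message: sender overhead, wire latency,
# per-byte gap for the remaining bytes, receiver overhead.

def loggp_msg_time(m, L, o, G):
    return o + L + (m - 1) * G + o

L, o, G = 10e-6, 5e-6, 0.01e-6   # seconds: latency, overhead, gap/byte
t = loggp_msg_time(1024, L, o, G)
print(round(t * 1e6, 2))          # message time in microseconds
```

The paper's contribution is fitting such parameters from two-node micro-benchmarks of the actual MPI-send/receive primitives, then composing them with the wavefront synchronization structure to predict whole-application run time.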