The cost of communication in message-passing systems can only be computed based on a large number of low-level details. Consequently, the only architectural measure they naturally suggest is a frrst-order one, latency...
详细信息
The cost of communication in message-passing systems can only be computed based on a large number of low-level details. Consequently, the only architectural measure they naturally suggest is a frrst-order one, latency. We show that a second-order property, the standard deviation of the delivery times is also of interest. Most importantly, the average performance of a large communication system depends not only on the average performance of its components, but also on the standard deviation of these performances. In other words, building a high-performance system requires components that are themselves performing high-performance, but their performance must also have small variance. We illustrate this effect using distributions of the BSP g parameter. Lower bounds in the time per unit transfer of communication in large systems can be derived from data measured over single links. (C) 1999 Elsevier Science B.V. All rights reserved.
A hybrid method for performance modeling of parallel programs is considered where the runtime of large sequential segments is estimated statically and the parallel program structure is evaluated by simulation. The pre...
详细信息
A hybrid method for performance modeling of parallel programs is considered where the runtime of large sequential segments is estimated statically and the parallel program structure is evaluated by simulation. The present paper describes a way to generate a model of a given program automatically from the source code where the user has to provide only values for a small number of variables, This model contains the control structure of the original program and timing information for generalized basic blocks. We consider Fortran programs which are parallelized using the message passing paradigm. A prototype of a tool for automatic model generation has been developed which is able to treat examples of moderate size. (C) 1999 Elsevier Science B.V. All rights reserved.
Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify p...
详细信息
Fine tuning the performance of large parallel programs is a very difficult task. A profiling tool can provide detailed insight into the utilization and communication of the different processors, which helps identify performance bottlenecks, In this paper we present two profiling techniques for the fine-grained parallel programming language Split-C, which provides a simple global address space memory model. One profiler provides a detailed analysis of a program's execution. The other profiler collects cumulative information. As our experience shows, it is quite challenging to profile programs that make use of efficient, low-overhead communication. We incorporated techniques which minimize profiling effects on the running program, and quantified the profiling overhead. We present several Split-C applications showing that the profiler is useful in determining performance bottlenecks. Copyright (C) 1999 John Whey & Sons, Ltd.
High Performance Fortran (HPF) is a data-parallel language that provides a high-level interface for programming scientific applications, while delegating to the compiler the task of generating explicitly parallel mess...
详细信息
High Performance Fortran (HPF) is a data-parallel language that provides a high-level interface for programming scientific applications, while delegating to the compiler the task of generating explicitly parallel message-passing programs. This paper provides an overview of HPF compilation and runtime technology for distributed-memory architectures, and deals with a number of topics in some detail. In particular, we discuss distribution and alignment processing, the basic compilation scheme and methods for the optimization of regular computations. A separate section is devoted to the transformation and optimization of independent loops with irregular data accesses. The paper concludes with a discussion of research issues and outlines potential future development paths of the language. (C) 1999 Elsevier Science B.V. All rights reserved.
Simulated annealing is an effective method for solving large combinatorial optimisation problems. Because of its iterative nature the annealing process requires a substantial amount of computation time. A new parallel...
详细信息
Simulated annealing is an effective method for solving large combinatorial optimisation problems. Because of its iterative nature the annealing process requires a substantial amount of computation time. A new parallel implementation based on the concurrency control theory of database systems is presented;the parallelised annealing process is serialisable. Concurrent updates to the base solution are allowed provided that they do not have data conflict. Using the travelling salesman problem as the example application, the parallel simulated annealing algorithm is implemented on a Motorola Delta 3000 shared-memory multiprocessor system with eight processors. With a moderate problem size of 400 cities, a speedup efficiency of over 90% is achieved at high annealing temperature and close to 100% at a low annealing temperature.
CAP, a computer-aided parallelization tool, generates highly pipelined applications that run communication and I/O operations in parallel with processing operations. One of CAP's successes is the Visible Human Sli...
详细信息
CAP, a computer-aided parallelization tool, generates highly pipelined applications that run communication and I/O operations in parallel with processing operations. One of CAP's successes is the Visible Human Slice Server (http://visible ***), a 3D tomographic image server that allows clients to choose and view any cross section of the human body.
The shared-memory concept makes it easier to write parallel programs, but tuning the application to reduce the impact of frequent long-latency memory accesses still requires substantial programmer effort. Researchers ...
详细信息
The shared-memory concept makes it easier to write parallel programs, but tuning the application to reduce the impact of frequent long-latency memory accesses still requires substantial programmer effort. Researchers have proposed using compilers, operating systems, or architectures to improve performance by allocating data close to the processors that use it, The Cache-Only Memory Architecture (CO,MA) increases the chances of data being available locally because the hardware transparently replicates the data and migrates it to the memory module of the node that is currently accessing it. Each memory module acts as a huge cache memory in which each block has a tag with the address and the state. The authors explain the functionality, architecture, performance, and complexity of COMA systems. They also outline different COMA designs, compare COMA to traditional nonuniform memory access (NUMA) systems, and describe proposed improvements in NUMA systems that target the same performance obstacles as COMA.
We model a deterministic parallel program by a directed acyclic graph of tasks, where a task can execute as soon as all tasks preceding it have been executed. Each task can allocate or release an arbitrary amount of m...
详细信息
We model a deterministic parallel program by a directed acyclic graph of tasks, where a task can execute as soon as all tasks preceding it have been executed. Each task can allocate or release an arbitrary amount of memory (i.e., heap memory allocation can be modeled). We call a parallel schedule "space efficient" if the amount of memory required is at mast equal to the number of processors times the amount of memory required for some depth-first execution of the program by a single processor. We will describe a simple, locally depth-first, scheduling algorithm and shaw that it is always space efficient. Since the scheduling algorithm is greedy, it will be within a factor of two of being optimal with respect to time. For the special case of a program having a series-parallel structure, we show how to efficiently compute the worst case memory requirements over all possible depth-first executions of a program. Finally, we show how scheduling can be decentralized, making the approach scalable to a large number of processors when there is sufficient parallelism.
Graphical visualisation plays an important role in parallel program development. Researchers have proposed and developed many visualisation tools that assist the development of parallel programs. A number of graph for...
详细信息
Graphical visualisation plays an important role in parallel program development. Researchers have proposed and developed many visualisation tools that assist the development of parallel programs. A number of graph formalisms or notations have been used to visualise various aspects of parallel programs and their executions. This paper attempts to classify and compare these graph formalisms and notations which provide different information at different stages of parallel program development. (C) 1999 Academic Press.
暂无评论