The distributed data service library allows developers to control the distribution of data in a parallel program simply by setting attributes on a distributed data object. This interface provides the power of a data p...
详细信息
The asynchronous nature of many MPI/PVM programs does not fit the BSP model. The barrier synchronization imposed by the model restricts the range of available algorithms and their performance;Through the suppression o...
详细信息
ISBN:
(纸本)0818685913
The asynchronous nature of many MPI/PVM programs does not fit the BSP model. The barrier synchronization imposed by the model restricts the range of available algorithms and their performance;Through the suppression of barriers and the generalization of the concept of superstep we propose two Mew models, the BSP-like and the BSP Without Barriers (BSPWB) models. While the BSP-like extends the BSP* model to programs written using collective operations, the more general BSPWB model admits the MPI/PVM parallel asynchronous programming: style. As LogP, the model encourages locality bat it is simpler to use. The parameters of the models and their quality are evaluated on a distributed-shared memory machine, the Origin 2000 and on a distributed memory machine, the CRAY T3E. The dependence of the time spent in an h-relation is stronger in the communication pattern than in the number of processors. The total variation of the h-relation time in both the patterns and processor numbers is smaller than sixty nanoseconds. To illustrate the proposed models, two different applications are considered: a parallel Sort using Regular Sampling (PSRS) and a parallel Dynamic programming Algorithm solving the Single Resource Allocation Problem (SRAP). The PSRS is a synchronous algorithm with a rich set of collective communication patterns and coarse grain communications. On the opposite extreme, the SRAP is a fine grain communication algorithm using permutation patterns The computational results prove the accuracy of the models. The prediction of the communication times is robust even for the SRAP, where communication is dominated by small messages.
The Image Processing applications require both computing and communication power. The object of the GFLOPS project was to study all aspects concerning the design of such computers. The project's aim was to develop...
详细信息
The Image Processing applications require both computing and communication power. The object of the GFLOPS project was to study all aspects concerning the design of such computers. The project's aim was to develop a parallel architecture as well as its software environment to implement these applications efficiently. A development environment, especially a C data-parallel language, has been built for this purpose. The C// parallel language presented here, simplifies the use of such architectures by providing the programmer with a global name space and a control mechanism to exploit fine and medium grain parallelism of its applications. The main advantage of our paradigm is that it allows a unique framework to express both data and control parallelism. We have implemented this programming environment on the GFLOPS machine which supports up to 512 processor nodes, which are PC mother boards, connected over a scaleable and cost-effective network, via the PCI-bus, at a constant cost per node. The aim is to obtain at low cost a scaleable virtually shared memory machine. In this paper we discuss the design of the GFLOPS machine and its C// parallel language, and evaluate the effectiveness of the mechanisms incorporated. The analysis of the architecture's behaviour was conducted with microbenchmarks and image processing algorithms, written in C.
We present a unified approach for building high-performance numerical linear algebra routines for large classes of dense and sparse matrices. As with the Standard Template Library [1], we separate algo- rithms from da...
详细信息
There has been relatively little analytical work on processor optimizations for multimedia applications. With the introduction of MMX by Intel, it is clear that this is an area of increasing importance. Building on pr...
详细信息
ISBN:
(纸本)0818684038
There has been relatively little analytical work on processor optimizations for multimedia applications. With the introduction of MMX by Intel, it is clear that this is an area of increasing importance. Building on previous work [4, 5, 6, 7, 13, 14], we propose optimizations for multimedia architectures that support independent parallel execution of instructions within dynamically assembled traces, resulting in dramatic performance improvements. Specifically, we propose simplified instruction scheduling and register renaming algorithms due to constraints on trace formation. In addition, we suggest specific instruction pool and trace cache parameters. We constructed a simulator in order to measure the benefits of these processor optimizations for multimedia applications. The simulated machine, which could fetch/decode 2 instructions per cycle, performed better than a superscalar machine that could fetch/decode 8 instructions per cycle. Execution rates as high as 7.3 instructions per cycle were achieved for the benchmarks simulated, assuming 16 instructions per trace.
This paper discusses the main achievements of the EPIC project, whose aim was to design a high level programming environment with an associated implementation for portable parallel image processing. The project was fu...
详细信息
ISBN:
(纸本)3540649522
This paper discusses the main achievements of the EPIC project, whose aim was to design a high level programming environment with an associated implementation for portable parallel image processing. The project was funded as part of the EPSRC Portable Software Tools for parallelarchitectures (PSTPA) programme. The paper summarises new portable programming abstractions for image processing, and outlines the automatically optimising implementation which achieves portability of application code and efficiency of implementation on a closely coupled distributed memory parallel system. The paper includes timings for optimised and unoptimised versions of typical image processing algorithms;it draws the main conclusion that it is possible to achieve portability with efficiency, for a specific application, by adopting a high level algebraic programming model, together with a transformation-based optimiser which reclaims the loss of efficiency which an algebraic approach traditionally entails.
An approach for designing a hybrid parallel system that can perform different levels of parallelism adaptively is presented. An adaptive parallel computer vision system (APVIS) is proposed to attain this goal. The APV...
详细信息
An approach for designing a hybrid parallel system that can perform different levels of parallelism adaptively is presented. An adaptive parallel computer vision system (APVIS) is proposed to attain this goal. The APVIS is constructed by integrating two different types of parallelarchitectures, i.e, a multiprocessor based system (MBS) and a memory based processor array (MPA);tightly into a single machine. One important feature in the APVIS is that the programming interface to execute data parallel code onto the MPA is the same as the usual subroutine calling mechanism. Thus the existence of the MPA is transparent to the programmers. This research is to design an underlying base architecture that can be optimally executed for a broad range of vision tasks. A performance model is provided to show the effectiveness of the APVTS. It turns out that the proposed APVIS can provide significant performance improvement and cost effectiveness for highly parallel applications having a mixed set of parallelisms. Also an example application composed of a series of vision algorithms, from low-level and medium-level processing steps, is mapped onto the MPA. Consequently, the APVIS with a few or tens of MPA modules can perform the chosen example application in real time when multiple images are incoming successively with a few seconds inter-arrival time.
There is a constant tension between the designers of algorithms and architectures. Each designs for the others previous generation. Several factors are making this an increasingly inadequate approach. On the hardware ...
详细信息
ISBN:
(纸本)0818684240
There is a constant tension between the designers of algorithms and architectures. Each designs for the others previous generation. Several factors are making this an increasingly inadequate approach. On the hardware side, the growing disparity between the performance of memory and CPU has pushed many architectures to hierarchical memories that reward significant data reuse (and punish algorithms that use data items a small number of times). On the algorithmic side, at least for a large class of algorithms in scientific computing, the emphasis on increasingly efficient algorithms, measured by the amount of work (floating point operations) per solution value, has led to algorithms that touch data only a few times. Further, algorithmic techniques that lead to adaptive methods (computing only with as much data as is required to accurately represent the solution) often lead to irregular or unpredictable accesses to memory. While these trends in architectures and algorithms are away from each other, there are other trends in algorithms that provide both a degree of greater memory locality without sacrificing algorithmic optimality. These are hierarchical methods, such as multigrid and general domain decomposition. These hierarchical methods place some requirements on architectures;primarily that there be no distinguished collection of processes/threads in a computation. Another opportunity in algorithms is the potential for exploiting split-phase or two-step operations;most algorithms are designed under the assumption that parallel operations, such as a scan or reduction, are single step (blocking in MPI terms), but this is often not required. We conclude by noting that both algorithmic and architecture research can benefit from a more continuous dialog.
We present a parallel implementation of Schönhage’s integer GCD algorithm on distributed memory architectures. Results are generalized for the extended GCD algorithm. Experiments on sequential architectures show...
详细信息
When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a 'data race' occurs, which is usually a bug. This paper describes th...
详细信息
ISBN:
(纸本)9780897919890
When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a 'data race' occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which checks for data races in programs coded in the Cilk multithreaded language. Like its predecessor, the Nondeterminator, which checks for simple 'determinacy' races, the Nondeterminator-2 is a debugging tool, not a verifier, since it checks for data races only in the computation generated by a serial execution of the program on a given input. We give an algorithm, ALL-SETS, that determines whether the computation generated by a serial execution of a Cilk program on a given input contains a race. For a program that runs serially in time T, accesses V shared memory locations, uses a total of n locks, and holds at most k n locks simultaneously, ALL-SETS runs in. O(nkT α(V, V)) time and O(nkV) space, where α is Tarjan's functional inverse of Ackermann's function. Since ALL-SETS may be too inefficient in the worst case, we propose a much more efficient algorithm which can be used to detect races in programs that obey the 'umbrella' locking discipline, a programming methodology that is more flexible than similar disciplines proposed in the literature. We present an algorithm, BRELLY, which detects violations of the umbrella discipline in O(kT α(V, V)) time using O(kV) space. We also prove that any 'abelian' Cilk program, one whose critical sections commute, produces a determinate final state if it is deadlock free and if it generates any computation which is data-race free. Thus, the Nondeterminator-2's two algorithms can verify the determinacy of a deadlock-free abelian program running on a given input.
暂无评论