The article presents TOPAS, a programming environment, implemented in Java and accessible on the WWW, for visualizing, animating, and investigating algorithms that map graphs of parallel programs onto graphs of parallel computing systems.
This paper presents the BaLinda model, based on last-in/first-out threads that interact via a shared tuplespace, and discusses the idea of using function-based objects as the basic unit of parallel execution, with a hierarchical structure to partition tuplespaces. It is argued that two-level parallel execution, both within and between objects, is well suited to scalable parallel platforms with shared memory nodes connected by high-speed networks.
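The shared-tuplespace interaction mentioned above can be illustrated with a small sketch. It is not BaLinda itself: the `TupleSpace` class, its `out` and `take` methods, and the `parallel_sum_of_squares` helper are hypothetical names chosen for this example, and the sketch models neither last-in/first-out threads nor hierarchical tuplespace partitioning. It only shows threads communicating by depositing tuples and withdrawing them by pattern.

```python
import threading


class TupleSpace:
    """Toy Linda-style tuplespace (illustrative assumption, not BaLinda).

    A pattern is a tuple in which None matches any value.
    """

    def __init__(self):
        self._tuples = []
        self._cond = threading.Condition()

    def out(self, tup):
        """Deposit a tuple and wake any thread waiting for a match."""
        with self._cond:
            self._tuples.append(tuple(tup))
            self._cond.notify_all()

    @staticmethod
    def _matches(pattern, tup):
        return len(pattern) == len(tup) and all(
            p is None or p == t for p, t in zip(pattern, tup)
        )

    def take(self, pattern):
        """Blocking, destructive read of the first matching tuple."""
        with self._cond:
            while True:
                for tup in self._tuples:
                    if self._matches(pattern, tup):
                        self._tuples.remove(tup)
                        return tup
                self._cond.wait()


def parallel_sum_of_squares(values):
    """Each worker thread deposits one result; the caller collects them."""
    space = TupleSpace()

    def worker(x):
        space.out(("square", x * x))

    threads = [threading.Thread(target=worker, args=(v,)) for v in values]
    for t in threads:
        t.start()
    total = sum(space.take(("square", None))[1] for _ in values)
    for t in threads:
        t.join()
    return total
```

Because workers and the collector share nothing except the tuplespace, the collector does not need to know which thread produced which tuple.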
Many problems of distributed object-oriented applications can be resolved uniformly within an approach based on the concept of a cover. The cover is defined as an environment that transparently controls all aspects of the life of an object community: creation, interaction, etc. To enable transparency, an object-oriented application must obey the principle of late binding: the client obtains a reference to a server object at run time from the system environment. To implement cover services, the technique of metaobject control is applied, which extends the program's semantics without changing the program code by attaching additional method calls to each application object invocation. A special language (TL), in which the user can incrementally define new metaservices, is described and illustrated by numerous examples.
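The idea of attaching additional calls to every object invocation without touching the application code can be sketched as follows. This is an illustration only, not the TL language or the authors' cover implementation: the `Cover` wrapper, the `Counter` class, and the logging format are all assumptions made for the example.

```python
class Cover:
    """Hypothetical wrapper that intercepts every method call on a target.

    It records a marker before and after each invocation, standing in for
    the extra method calls a metaobject would attach.
    """

    def __init__(self, target, log):
        self._target = target
        self._log = log

    def __getattr__(self, name):
        # Only reached for names not defined on Cover itself.
        attr = getattr(self._target, name)
        if not callable(attr):
            return attr

        def invoke(*args, **kwargs):
            self._log.append(("call", name))
            result = attr(*args, **kwargs)
            self._log.append(("return", name))
            return result

        return invoke


class Counter:
    """Unmodified application class used to demonstrate the wrapper."""

    def __init__(self):
        self.value = 0

    def increment(self, step=1):
        self.value += step
        return self.value
```

A client that receives `Cover(Counter(), log)` instead of a bare `Counter` uses it the same way, which is the sense in which the added service is transparent.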
To compute the Fast Fourier Transform (FFT) efficiently, various parallel algorithms and their implementations on multiprocessors and multicomputers have been developed. In general, a local interconnection network is faster than a global one, but its capability depends on the network architecture. A global interconnection network, on the other hand, is slower but does not depend on the network architecture, and it provides a flexible communication interface to the programmer. In this paper, we discuss parallel radix-R FFT algorithms on a multiprocessor or multicomputer system with a global interconnection network. We propose two algorithms: a stage-by-stage method and a multi-stage method. We also estimate the communication time and show that it is very sensitive to the data exchange strategy. Finally, we implement these algorithms on two commercial massively parallel computers (nCUBE/2 and CM5) and measure the communication time.
With increasing on-chip hardware, concurrency is a way to bridge the gap between the computational power demanded by applications and that afforded by computer platforms. Although parallel systems are increasingly popular, they remain very difficult to program. In fact, most compilers require the programmer to specify how to partition data or map program code to the system's processors. To ensure an effective program, cache locality is important because of the large speed gap between microprocessors and memory systems. It is also important to use local communication whenever possible, since it is cheaper, faster, and less power hungry than global communication. To exploit these locality properties, we present a systematic operation placement and scheduling scheme for fine-grain parallel architectures. The key advantages are twofold: (1) This multiprojection method, which deals with multidimensional parallelism systematically, can alleviate the programmer's burden in coding and data partitioning. (2) It addresses the memory/communication bandwidth bottleneck and can lead to faster program execution. On a design example of the motion estimation block-matching algorithm, which requires the most intensive computation and memory accesses in video coding, our method leads to a reduction of external memory accesses by two to three orders of magnitude.
We propose an "Asymmetric Distributed Shared Memory" (ADSM), which provides users with an efficient shared memory model. The ADSM is a hybrid system that needs not only operating system support but also compiler support. The ADSM executes a load instruction as the shared read with the assistance of the virtual memory mechanism. For the shared write, the ADSM executes a sequence of instructions for consistency management after the corresponding store instruction. We describe an algorithm that reduces the overhead of consistency management by coalescing these instruction sequences using information about affine memory accesses. The coalescing algorithm is evaluated using the SPLASH-2 benchmark. The performance evaluation shows that the coalescing algorithm achieves an execution time improvement compared to the non-optimized result, ranging from 76% to 85%.
ISBN: 0818678704 (print)
Data parallel languages were proposed to address, at the programming language level, the difficulty of programming distributed memory machines. Among them, HPF is a standard data parallel language across a variety of high-performance architectures. Most HPF compilers are source-to-source translators because these are easy to implement. However, such source-to-source compilers produce a significant amount of inefficient code. In particular, the FORALL construct is converted into several DO loops, which increases loop overhead. We therefore propose techniques for converting a FORALL construct into an optimized DO loop. For this, we define and use a relation distance vector, which can represent both data dependence information and flow information. We then evaluate and analyze the execution time of the code converted by our method and by the PARADIGM method.
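A small example may clarify why a FORALL is often translated into several loops and when a single loop can suffice. In FORALL semantics, every right-hand side is evaluated before any element is stored. The sketch below is written in Python rather than Fortran and does not implement the relation distance vector analysis from the abstract; the function names and the chosen statement, `a(i) = a(i-1) + b(i)`, are assumptions picked for illustration.

```python
def forall_reference(a, b):
    """FORALL semantics via a temporary: two loops, one extra array."""
    tmp = [a[i - 1] + b[i] for i in range(1, len(a))]  # evaluate all RHS first
    out = list(a)
    for i in range(1, len(a)):                         # then store
        out[i] = tmp[i - 1]
    return out


def forall_single_loop(a, b):
    """Same result with one loop and no temporary.

    Iterating downward reads element i-1 before it is overwritten, so the
    old values are still available when they are needed.
    """
    out = list(a)
    for i in range(len(out) - 1, 0, -1):
        out[i] = out[i - 1] + b[i]
    return out


def naive_forward_loop(a, b):
    """Incorrect translation: an upward loop reads already-updated values."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + b[i]
    return out
```

For `a = [1, 2, 3, 4]` and `b = [10, 20, 30, 40]`, the first two functions return `[1, 21, 32, 43]`, while the naive forward loop returns `[1, 21, 51, 91]`. Choosing the loop direction from the dependence distance is what removes the need for the temporary in this particular case.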
ISBN: 0818682596 (print)
We propose an adaptive processor allocation strategy based on shape manipulations of the requested submesh for large mesh-connected systems. When an incoming job requests a rectangular submesh, our strategy first tries to allocate a conventional rectangular submesh, using 90-degree rotation and folding techniques. If this fails, our strategy then tries to allocate a more flexible and robust L-shaped submesh instead of signaling an allocation failure. Thus, our strategy accommodates incoming jobs earlier than conventional strategies. Simulation results indicate that our strategy performs more efficiently than other strategies in terms of external fragmentation, job response time, and system utilization. Our strategy is transparent to application programmers and does not require additional hardware support. Moreover, with our L-shaped submesh allocation strategy, application programmers using the mesh-connected system need no longer limit their requests to rectangular submeshes. They can request an L-shaped submesh whose number of processors is much closer to the number actually needed to execute their job.
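The first step of the strategy, trying the requested rectangle and then its 90-degree rotation, can be sketched as a first-fit search over an occupancy grid. This is only an illustration under stated assumptions: the grid representation, the first-fit scan order, and the function names `find_free_submesh` and `allocate` are hypothetical, and neither folding nor the L-shaped fallback described in the abstract is implemented.

```python
def find_free_submesh(busy, width, height):
    """Return (row, col, w, h) of the first free block, or None.

    busy is a list of rows of booleans (True = processor allocated).
    The requested orientation is tried first, then its 90-degree rotation.
    """
    rows, cols = len(busy), len(busy[0])
    orientations = [(width, height)]
    if width != height:
        orientations.append((height, width))  # rotated request
    for w, h in orientations:
        for r in range(rows - h + 1):
            for c in range(cols - w + 1):
                if all(
                    not busy[r + dr][c + dc]
                    for dr in range(h)
                    for dc in range(w)
                ):
                    return (r, c, w, h)
    return None


def allocate(busy, width, height):
    """Mark the found block as busy and return its placement, or None."""
    found = find_free_submesh(busy, width, height)
    if found is None:
        return None
    r, c, w, h = found
    for dr in range(h):
        for dc in range(w):
            busy[r + dr][c + dc] = True
    return found
```

On a free mesh with 4 rows and 2 columns, a request for a 4-wide, 1-high submesh cannot be placed as given, but its rotation fits, so `allocate` returns `(0, 0, 1, 4)`.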
In this paper, we propose a new family of interconnection networks, called cyclic networks (CNs), in which an intercluster connection is defined on a set of nodes whose addresses are cyclic shifts of one another. The node degrees of basic CNs are independent of system size, but can vary from a small constant (e.g., 3) to as large as required, thus providing flexibility and an effective tradeoff between cost and performance. The diameters of suitably constructed CNs can be asymptotically optimal within their lower bounds, given the degrees. We show that packet routing and ascend/descend algorithms can be performed in Θ(log_d N) communication steps on some CNs with N nodes of degree Θ(d). Moreover, CNs can also efficiently emulate homogeneous product networks (e.g., hypercubes and high-dimensional meshes). As a consequence, we obtain a variety of efficient algorithms on such networks, thus proving the versatility of CNs.
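To make "addresses that are cyclic shifts of one another" concrete, the sketch below enumerates that set for a given address. The string representation of addresses and the function name `cyclic_shifts` are assumptions for illustration; the abstract does not specify the address format or how links are wired among the nodes in such a set.

```python
def cyclic_shifts(address):
    """Return the set of distinct rotations of an address string."""
    return {address[k:] + address[:k] for k in range(len(address))}
```

For example, `cyclic_shifts("0112")` contains the four addresses `0112`, `1120`, `1201`, and `2011`, while a periodic address such as `0101` has only two distinct shifts, `0101` and `1010`.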
ISBN: 0818675829 (print)
The rise of explicit parallel programming involves new problems: lack of structure for parallel algorithms and the ad hoc development of parallel algorithms. We use skeletons to characterize and design parallel algorithms and define a process to refine the designs step-by-step into programs. This paper introduces a high-level library on top of MPI which is derived from the skeleton concept to achieve better programmability and obtain portability. We conclude with a CFD application to demonstrate our idea.
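A skeleton can be thought of as a reusable parallel pattern that the programmer fills in with sequential functions. The sketch below shows one such pattern, a map followed by a reduction. It is not the library described in the abstract and does not use MPI: it substitutes Python's standard `concurrent.futures` thread pool so that it stays self-contained, and the name `map_reduce_skeleton` and its parameters are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_reduce_skeleton(worker, combine, items, initial, max_workers=4):
    """Apply worker to every item in parallel, then fold the results.

    The caller supplies only sequential code (worker and combine); the
    skeleton owns the parallel structure.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partials = list(pool.map(worker, items))  # results keep input order
    return reduce(combine, partials, initial)
```

A call such as `map_reduce_skeleton(lambda x: x * x, lambda a, b: a + b, range(5), 0)` sums the squares of 0 through 4. Swapping the thread pool for a message-passing backend would change the skeleton's body but not the caller's code, which is the portability argument the abstract makes.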