While data and workload distribution can be tailored to fit a particular problem to a particular distributed-memory architecture, it is often difficult to do so because of various practical issues. This paper presents our study on multithreading for distributed-memory multiprocessors. Specifically, we investigate the effects of multithreading on data distribution and workload distribution with variable thread granularity. Various types of workload distribution strategies are defined along thread granularity. Three types of data distribution strategies are investigated: row-wise cyclic, k-way partial-row cyclic, and blocked distribution. We have implemented all of these on the 80-processor EM-4 distributed-memory multiprocessor using highly sequential Gaussian elimination with partial pivoting and highly parallel matrix multiplication. Experimental results indicate that multithreading can offset the loss caused by a mismatch between data distribution and workload distribution, even for sequential and irregular problems, while giving high absolute performance.
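As a rough illustration of two of the distribution strategies named above, the following sketch maps matrix rows to processors under a row-wise cyclic and a blocked scheme. The function names and the ceiling-sized block are assumptions made for illustration; the abstract does not give the exact mappings used on the EM-4, and the k-way partial-row cyclic scheme is not shown because the abstract does not define it.

```python
def cyclic_owner(row, num_procs):
    """Row-wise cyclic: rows are dealt to processors round-robin."""
    return row % num_procs


def blocked_owner(row, num_rows, num_procs):
    """Blocked: each processor holds one contiguous band of rows.

    The block size is the ceiling of num_rows / num_procs (an assumption).
    """
    block = -(-num_rows // num_procs)  # ceiling division
    return row // block


def rows_by_processor(owner, num_rows, num_procs):
    """Group row indices by the processor that owns them."""
    layout = {p: [] for p in range(num_procs)}
    for row in range(num_rows):
        layout[owner(row)].append(row)
    return layout


if __name__ == "__main__":
    n, p = 8, 4
    print(rows_by_processor(lambda r: cyclic_owner(r, p), n, p))
    print(rows_by_processor(lambda r: blocked_owner(r, n, p), n, p))
```

In Gaussian elimination the active submatrix shrinks as elimination proceeds, so a blocked layout tends to leave the processors holding early rows idle, whereas a cyclic layout keeps work spread out. This is one way a data distribution can mismatch a workload distribution.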
This paper discusses the present state of the art of components, systems, and application technology related to parallel optical data links (ODL) as demonstrated by the OptoElectronic technology Consortium (OETC). Parallel ODL technology is poised for large-volume commercialization despite some uncertainties in industrial standards and system applications. This is fueled by the demand for high bandwidth to support the upcoming information age. To meet the need for low-cost, broadband digital multimedia services, parallel ODL technology faces the challenge of providing reasonable cost/performance ratios when compared with other established technologies. Responding to this challenge has required the integration of a number of state-of-the-art component technologies (e.g., VCSEL, monolithic integrated photoreceiver, MCM, GaAs IC, optical array connector and cable) with system designs and applications.
Harder, new requirements are appearing in the area of database systems. The popularity reached by parallel database systems during the past decade, due to their high performance and scalability characteristics, should...
Proper distribution of operations among parallel processors in a large scientific computation executed on a distributed-memory machine can significantly reduce the total computation time. In this paper, we propose an operation called simultaneous parallel reduction (SPR) that is amenable to such optimization. SPR performs reduction operations in parallel, each operation reducing a one-dimensional consecutive section of a distributed array. Each element of the distributed array is used as an operand to many reductions executed concurrently over overlapping sections of the array. SPR is distinct from the more commonly considered parallel reduction, which concurrently evaluates a single reduction. In this paper we consider SPR on Single Instruction Multiple Data (SIMD) machines with different interconnection networks. We focus on SPR over sections whose size is not a power of 2, with the result shifted relative to the arguments. Several algorithms achieving some of the lower bounds on SPR complexity are presented under various assumptions about the properties of the binary operator of the reduction and about the communication cost of the target architectures.
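To make the operation concrete, here is a minimal sequential reference for what SPR computes, assuming every section has the same width and that sections start at consecutive positions. The name `spr_reference` and its parameters are hypothetical; this is not one of the paper's parallel algorithms, and it does not model the result shift or the interconnection network.

```python
from functools import reduce
import operator


def spr_reference(values, width, op=operator.add):
    """Reduce every consecutive section of `width` elements with `op`.

    Each element takes part in up to `width` overlapping reductions,
    which is what distinguishes SPR from a single global reduction.
    """
    if width < 1 or width > len(values):
        raise ValueError("width must be between 1 and len(values)")
    return [
        reduce(op, values[start:start + width])
        for start in range(len(values) - width + 1)
    ]


if __name__ == "__main__":
    data = [1, 2, 3, 4, 5]
    print(spr_reference(data, 3))       # sums over sections of width 3
    print(spr_reference(data, 3, max))  # maxima over the same sections
```

This reference performs width - 1 operator applications per section. A parallel algorithm can share partial results between overlapping sections, and how much sharing is possible depends on properties of the operator such as associativity or commutativity.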
High-level data-parallel languages such as Vienna Fortran and High Performance Fortran (HPF) have been introduced to allow the programming of massively parallel distributed-memory machines at a relatively high level of abstraction, based on the single program multiple data (SPMD) paradigm. Their main features include mechanisms for expressing the distribution of data across the processors of a machine. The paper introduces additional language functionality to allow the efficient processing of sparse matrix codes. It introduces methods for the representation and distribution of sparse matrices, which form a powerful mechanism for storing and manipulating sparse matrices that can be efficiently implemented on massively parallel machines.
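The abstract does not say which sparse representation the language extensions use. As one common possibility, the sketch below uses compressed sparse row (CSR) storage, an assumption chosen for illustration, together with a matrix-vector product over it. The function names are hypothetical.

```python
def dense_to_csr(dense):
    """Convert a dense list-of-lists matrix to CSR arrays.

    Returns (values, col_idx, row_ptr): the nonzeros in row order, their
    column indices, and the offset in `values` where each row starts.
    """
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, x in enumerate(row):
            if x != 0:
                values.append(x)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def csr_matvec(values, col_idx, row_ptr, x):
    """Compute y = A @ x, touching only the stored nonzeros of A."""
    y = []
    for i in range(len(row_ptr) - 1):
        total = 0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            total += values[k] * x[col_idx[k]]
        y.append(total)
    return y


if __name__ == "__main__":
    a = [[1, 0, 2],
         [0, 0, 3],
         [4, 0, 0]]
    vals, cols, ptr = dense_to_csr(a)
    print(vals, cols, ptr)
    print(csr_matvec(vals, cols, ptr, [1, 1, 1]))
```

Because each row occupies a contiguous slice of `values` and `col_idx`, a row-oriented distribution can assign whole slices to processors. Balancing the number of nonzeros per processor, rather than the number of rows, is a separate concern.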
Transport-triggered architectures are a new class of architectures that provide more scheduling freedom and have unique compiler optimizations. This paper reports experiments that quantify the advantages of transport-...
ALBA, A parallel Language Based on Actors is a new programming language designed to take advantage of highly parallel architectures that use message passing as the fundamental low-level interaction primitive. In this ...