ISBN: 0818691948 (print)
The proceedings contain 61 papers. The topics discussed include: new number representation and conversion techniques on reconfigurable mesh; precise control of instruction caches; more on arbitrary boundary packed arithmetic; PERL - a registerless architecture; design alternatives for shared memory multiprocessors; a simple optimal list ranking algorithm; a parallel skeletonization algorithm and its VLSI architecture; improving error bounds for multipole-based treecodes; computation of penetration measures for convex polygons and polyhedra for graphics applications; extrapolation in distributed adaptive integration; and Java data parallel extensions with runtime system support.
A main consideration when implementing a parallel simulation application is the choice of the parallel simulation protocol (conservative vs. optimistic). Given a particular protocol, the application programmer then ha...
The communication cost plays a key role in the performance of many parallel algorithms. In the particular case of the one-sided Jacobi method for symmetric eigenvalue and eigenvector computation, the communication cost of previously proposed algorithms is mainly determined by the particular ordering being used. We propose two novel Jacobi orderings, the permuted-BR ordering and the degree-4 ordering, aimed at efficiently exploiting the multi-port capability of a hypercube. It is shown that the former is nearly optimal for some scenarios and the latter outperforms previously known orderings by a factor of two.
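To make the role of the ordering concrete, the following is a minimal sequential sketch of the one-sided Jacobi method in pure Python. It uses a plain row-cyclic ordering of the column pairs `(p, q)`, which is a stand-in chosen for illustration and is not the permuted-BR or degree-4 ordering from the abstract. The function name `one_sided_jacobi` and its parameters are hypothetical. The eigenvalue recovery step assumes that no two eigenvalues have equal magnitude and opposite sign (for example, a positive definite matrix).

```python
import math

def one_sided_jacobi(a, max_sweeps=30, tol=1e-12):
    """Eigen-decomposition of a symmetric matrix by one-sided Jacobi rotations.

    Returns (eigenvalues, v), where column j of v approximates the
    eigenvector belonging to eigenvalues[j].
    """
    n = len(a)
    w = [[float(x) for x in row] for row in a]   # W = A * V, starts as A
    v = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for _ in range(max_sweeps):
        rotated = False
        # Row-cyclic ordering (illustrative assumption): visit every column
        # pair once per sweep. A parallel ordering would instead group
        # disjoint pairs into steps that can run concurrently.
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = sum(w[k][p] * w[k][p] for k in range(n))
                beta = sum(w[k][q] * w[k][q] for k in range(n))
                gamma = sum(w[k][p] * w[k][q] for k in range(n))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q are already orthogonal
                rotated = True
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t
                # Apply the same plane rotation to columns p, q of W and V.
                for m in (w, v):
                    for k in range(n):
                        x, y = m[k][p], m[k][q]
                        m[k][p] = c * x - s * y
                        m[k][q] = s * x + c * y
        if not rotated:
            break
    # With orthogonal columns, W = V * diag(lambda), so lambda_j = v_j . w_j.
    eigenvalues = [sum(v[k][j] * w[k][j] for k in range(n)) for j in range(n)]
    return eigenvalues, v
```

In a distributed version each node would hold a few columns of `w` and `v`, and the ordering decides which columns must be exchanged between nodes after each step, which is where the communication cost arises.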
Software distributed-shared-memory (DSM) systems provide a desirable target for parallelizing compilers due to their flexibility. However, studies show synchronization and load imbalance are significant sources of overhead. The authors investigate the impact of compilation techniques for eliminating synchronization overhead in software DSMs, developing new algorithms to handle situations found in practice. They evaluate the contributions of synchronization elimination algorithms based on 1) dependence analysis, 2) communication analysis, 3) exploiting coherence protocols in software DSMs, and 4) aggressive expansion of parallel SPMD regions. They also found suppressing expensive parallelism to be useful for one application. Experiments indicate these techniques eliminate almost all parallel task invocations, and reduce the number of barriers executed by 66% on average. On a 16-processor IBM SP-2, speedups are improved on average by 35%, and are tripled for some applications.
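As a rough illustration of the idea behind communication analysis, the toy check below decides whether a barrier between two consecutive parallel loops can be dropped. It is a simplified sketch and not the authors' algorithm: the function `barrier_needed` and the representation of each loop as a map from processor id to `(reads, writes)` sets of array elements are assumptions made for this example.

```python
def barrier_needed(first, second):
    """Return True if a barrier must separate two parallel loops.

    first, second: dict mapping processor id -> (reads, writes), where
    reads and writes are sets of data items touched by that processor.
    A barrier is needed only if two different processors touch the same
    item across the two loops and at least one of the accesses is a write.
    """
    for p, (reads1, writes1) in first.items():
        for q, (reads2, writes2) in second.items():
            if p == q:
                continue  # same processor: program order already holds
            if writes1 & (reads2 | writes2) or reads1 & writes2:
                return True
    return False


# Loop 1: each processor writes its own element of array "a".
loop1 = {0: (set(), {("a", 0)}), 1: (set(), {("a", 1)})}
# Loop 2 (local): each processor reads only what it wrote itself.
loop2_local = {0: ({("a", 0)}, {("b", 0)}), 1: ({("a", 1)}, {("b", 1)})}
# Loop 2 (stencil): each processor reads its neighbour's element.
loop2_stencil = {0: ({("a", 1)}, {("b", 0)}), 1: ({("a", 0)}, {("b", 1)})}

print(barrier_needed(loop1, loop2_local))
print(barrier_needed(loop1, loop2_stencil))
```

A real compiler derives these access sets symbolically from loop bounds and data distributions rather than enumerating elements.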
In this paper we present an adaptive version of our previously proposed quality equalizing (QE) load balancing strategy that attempts to maximize the performance of parallel branch-and-bound (B&B) by adapting to application and target computing system characteristics. Adaptive QE (AQE) incorporates the following salient adaptive features: (1) Anticipatory quantitative and qualitative load balancing mechanisms. (2) Regulation of load information exchange overhead. (3) Deterministic load balancing in extended neighborhoods instead of just immediate neighborhoods as in non-adaptive QE. (4) Randomized global load balancing to fetch work from outside the extended neighborhood. AQE yields speedup improvements of up to 80%, and 15% on average, compared to those provided by QE for several real-world mixed-integer programming (MIP) problems, and near-ideal speedups for two of the largest problems in the MIPLIB benchmark suite on an IBM SP2 system.
MPEG-4 is currently being developed by MPEG to specify the technologies for supporting current and emerging multimedia applications. Because of its object-based features and flexible toolbox approach, it is much more complex than previous video coding standards. We believe that software-based implementation on parallel and distributed computing systems is a natural and viable option. In this paper, we describe such an approach on the MPEG-4 video encoder using a cluster of workstations. We propose to use hierarchical Petri nets as a modeling tool to describe the temporal relations and time constraints among various video objects at different levels. This would allow us to perform scheduling with a guarantee of synchronization among multiple objects. A dynamic shape-adaptive data parallel approach is used in the spatial domain for further speed-up gain. Our preliminary results indicate that real-time MPEG-4 encoding using distributed and parallel computing is achievable.
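The way a Petri net can enforce synchronization among video objects can be sketched with a small, flat net (not the hierarchical model of the paper). All place and transition names below, such as `vo1_raw` and `compose`, are hypothetical and chosen only for illustration. The `compose` transition is enabled only once both objects have been encoded.

```python
def enabled(marking, transition):
    """A transition is enabled when every input place holds enough tokens."""
    return all(marking.get(place, 0) >= n for place, n in transition["in"].items())


def fire(marking, transition):
    """Return the new marking after firing an enabled transition."""
    if not enabled(marking, transition):
        raise ValueError("transition is not enabled")
    new_marking = dict(marking)
    for place, n in transition["in"].items():
        new_marking[place] = new_marking.get(place, 0) - n   # consume tokens
    for place, n in transition["out"].items():
        new_marking[place] = new_marking.get(place, 0) + n   # produce tokens
    return new_marking


# Two video objects are encoded independently, then composed together.
encode_vo1 = {"in": {"vo1_raw": 1}, "out": {"vo1_done": 1}}
encode_vo2 = {"in": {"vo2_raw": 1}, "out": {"vo2_done": 1}}
compose = {"in": {"vo1_done": 1, "vo2_done": 1}, "out": {"frame_ready": 1}}

m0 = {"vo1_raw": 1, "vo2_raw": 1}
m1 = fire(m0, encode_vo1)
compose_ready_early = enabled(m1, compose)   # only one object is encoded
m2 = fire(m1, encode_vo2)
m3 = fire(m2, compose)
```

The two encode transitions share no places, so a scheduler may run them on different workstations in either order; the shared output transition is what guarantees the objects are synchronized before composition.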
In a hypercube multiprocessor with distributed memory, each data element has a street address and an apartment number (i.e. a hypercube node address and a local memory address). We describe an optimal algorithm for performing the all-to-some personalized communication (ASPC) on Boolean n-cubes, defined as $(i \mid j) \rightarrow (i \pm 2^j \mid j)$, $i \in [0, 2^n - 1]$, $j \in [0, n-1]$, where $(i \mid j)$ denotes the data element on node $i$ and location $j$. The algorithm also gives an optimal schedule for emulating PM2I networks on hypercubes under the binary-reflected Gray code encoding. We also study an important class of parallel algorithms, called $\pm 2^b$-descend, which perform $\log M$ iterations on an $M$-element input $a[0:M-1]$. For $b = \log M - 1, \ldots, 0$, iteration $b$ computes new values of each $a[i]$ as a function of $a[i]$, $a[i + 2^b]$, $a[i - 2^b]$. For large applications, the problem size $M$ is typically much larger than the number of nodes $N$. We show that on hypercubes, the optimal ASPC algorithm devised in this paper can be used in combination with pipelining communication and computation in $\pm 2^b$-descend computations to reduce the communication steps from *** N.M/N to $4(\log M + M/N - 1)$. In one communication step, a hypercube node can send $n$ elements along its $n$ links, one per link.
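The definitions in this abstract can be written down as a short sequential reference sketch. The function names are hypothetical, and the code assumes that indices wrap around modulo the array or machine size, which the abstract does not state explicitly. It shows the data movement pattern only and does not reproduce the paper's optimal communication schedule.

```python
def gray(i):
    """Binary-reflected Gray code of i."""
    return i ^ (i >> 1)


def aspc_destinations(i, j, n):
    """Target nodes of element (i|j) under ASPC on an n-cube.

    Returns the pair (i + 2^j, i - 2^j); wrap-around modulo 2^n is an
    assumption of this sketch.
    """
    size = 1 << n
    return (i + (1 << j)) % size, (i - (1 << j)) % size


def pm2b_descend(a, f):
    """Sequential +/-2^b-descend over a list whose length M is a power of two.

    For b = log M - 1 down to 0, every a[i] is replaced by
    f(a[i], a[i + 2^b], a[i - 2^b]), with indices taken modulo M
    (assumed). All updates within one iteration read the old values.
    """
    m = len(a)
    log_m = m.bit_length() - 1
    if m == 0 or (1 << log_m) != m:
        raise ValueError("length must be a power of two")
    for b in range(log_m - 1, -1, -1):
        step = 1 << b
        a = [f(a[i], a[(i + step) % m], a[(i - step) % m]) for i in range(m)]
    return a


# Example: with f = max, every position ends up holding the global maximum.
result = pm2b_descend([3, 1, 4, 1, 5, 9, 2, 6], max)
```

When the array is distributed over the nodes, the iterations whose stride reaches across node boundaries are the ones that require communication, and those are the accesses an ASPC step supplies.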
MPEG-4 is a new standard for multimedia applications. Due to the flexible and extensible features of MPEG-4, a software-based implementation seems to be a natural and viable option. While such approaches usually require huge computing power, this problem can be overcome by using parallel and distributed processing. Because the behaviour of MPEG-4 objects may vary with time and such variation cannot be predicted in advance, the issues of data partitioning and load balancing in multiprocessor systems need to be addressed carefully in order to achieve real-time performance. In this paper, we propose a shape-adaptive data partitioning method to guarantee load balancing among the processors. The effectiveness of this method has been demonstrated by experimental results.
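One simple way to picture shape-adaptive partitioning is to split a frame into horizontal strips so that each worker receives roughly the same number of pixels inside the object's shape, rather than the same number of rows. The sketch below is an illustrative assumption and not the method proposed in the paper: the function `partition_rows_by_shape` and the use of a binary mask with row-wise strips are hypothetical choices.

```python
def partition_rows_by_shape(mask, workers):
    """Split rows into contiguous strips with balanced in-shape pixel counts.

    mask: list of rows, each a list of 0/1 flags (1 = pixel inside the object).
    Returns a list of (start_row, end_row) half-open ranges, one per worker.
    A strip may be empty when a single row carries more than one share.
    """
    weights = [sum(row) for row in mask]
    total = sum(weights)
    strips = []
    start = 0
    accumulated = 0
    for r, weight in enumerate(weights):
        accumulated += weight
        # Close a strip each time the running load reaches the next share.
        while (len(strips) < workers - 1
               and accumulated * workers >= total * (len(strips) + 1)):
            strips.append((start, r + 1))
            start = r + 1
    strips.append((start, len(mask)))
    return strips


# Object occupies only the top half of an 8-row frame.
mask = [[1, 1, 1, 1]] * 4 + [[0, 0, 0, 0]] * 4
strips = partition_rows_by_shape(mask, 2)
loads = [sum(sum(row) for row in mask[a:b]) for a, b in strips]
```

A uniform split of this frame into rows 0-3 and 4-7 would give one worker all 16 object pixels and the other none, whereas the shape-aware split gives each worker 8. Because the shape changes over time, such a partition would need to be recomputed per frame.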