The problem of sorting on a one-dimensional sub-bus array of processors is addressed. A one-dimensional sub-bus array has a bus connecting the processors which can be segmented into sub-buses on which an active proces...
详细信息
ISBN:
(纸本)0898713498
The problem of sorting on a one-dimensional sub-bus array of processors is addressed. A one-dimensional sub-bus array has a bus connecting the processors which can be segmented into sub-buses on which an active processor can broadcast. The sub-bus broadcast capability is implemented on the MasPar MP-1 and MP-2 parallel computers. The sub-bus broadcast operation makes possible a new class of parallel sorting algorithms which can be characterized by parallel insertion steps. We restrict our attention to the class of sorting methods where parallel insertion steps are in the same direction, say left. For this class of one-way sorting strategies we prove a lower bound and give two strategies, the left greedy sort and the left adaptive insertion sort, both of which achieve the lower bound. Because our parallel insertion model is quite general, it is not necessarily the case that a sorting strategy in the model can be efficiently implemented on a real sub-bus array of processors. The left greedy sort cannot be efficiently implemented, but the left adaptive insertion sort can be efficiently implemented. The two sorting strategies have different properties and each is interesting in its own right.
The proceedings contain 70 papers. The topics discussed include: selecting the median;algorithms for graphic polymatroids and parametric s-sets;design of practical and provably good random number generators;a combinat...
ISBN:
(纸本)0898713498
The proceedings contain 70 papers. The topics discussed include: selecting the median;algorithms for graphic polymatroids and parametric s-sets;design of practical and provably good random number generators;a combinatorial algorithm for minimizing symmetric submodular functions;on the entropy of DNA: algorithms and measurements based on memory and rapid convergence;improved algorithms for protein motif recognition;from valid inequalities to heuristics: a unified view of primal-dual approximation algorithms in covering problems;doing two-level logic minimization 100 times faster;on the statistical dependencies of coalesced hashing and their implications for both full and limited independence;efficient parallel computations for singular band matrices;chaining multiple-alignment fragments in sub-quadratic time;finding optimal edge-rankings of trees;dihedral bounds for mesh generation in high dimensions;approximating discrete collections via local improvements;randomized rounding without solving the linear program;finding subsets maximizing minimum structures;average-case analysis of off-line and on-line knapsack problems;Voronoi diagrams of lines in 3-space under polyhedral convex distance functions;and path optimization and near-greedy analysis for graph partitioning: an empirical study.
We provide optimal parallel solutions to several shortest path and visibility problems set in triangulated simple polygons. Let P be a triangulated simple polygon with n vertices, preprocessed to support shortest path...
详细信息
We provide optimal parallel solutions to several shortest path and visibility problems set in triangulated simple polygons. Let P be a triangulated simple polygon with n vertices, preprocessed to support shortest path queries. We can find the shortest path tree from any point inside P in O(log n) time using O(n/log n) processors. In the same bounds, we can preprocess P for shooting queries (a query can be answered in O(log n) time by a uniprocessor). Given a set S of m points inside P, we can find an implicit representation of the relative convex hull of S in O(log(nm)) time with O(m) processors. If the relative convex hull has k edges, we can explicitly produce these edges in O(log(nm)) time with O(k/log(nm)) processors. All of these algorithms are deterministic and use the CREW PRAM model.
We present algorithms for sorting and routing on two-dimensional mesh-connected parallelarchitectures that are optimal on average. If one processor has many packets then we asymptotically halve the up to now best run...
详细信息
This paper presents a new parallel volume rendering algorithm that can render 2563 voxel medical data sets at over 10 Hz and 1283 voxel data sets at over 30 Hz on a 16-processor Silicon Graphics Challenge. The algorit...
详细信息
ISBN:
(纸本)9780897917742
This paper presents a new parallel volume rendering algorithm that can render 2563 voxel medical data sets at over 10 Hz and 1283 voxel data sets at over 30 Hz on a 16-processor Silicon Graphics Challenge. The algorithm achieves these results by minimizing each of the three components of execution time: computation time, synchronization time, and data communication time. Computation time is low because the parallel algorithm is based on the recently-reported shear-warp serial volume rendering algorithm which is over five times faster than previous serial algorithms. Synchronization time is minimized by using dynamic load balancing and a task partition that minimizes synchronization events. Data communication costs are low because the algorithm is implemented for shared-memory multiprocessors, a class of machines with hardware support for low-latency fine-grain communication and hardware caching to hide latency. We draw two conclusions from our implementation. First, we find that on shared-memory architectures data redistribution and communication costs do not dominate rendering time. Second, we find that cache locality requirements impose a limit on parallelism in volume rendering algorithms. Specifically, our results indicate that shared-memory machines with hundreds of processors would be useful only for rendering very large data sets.
One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the des...
详细信息
ISBN:
(纸本)0780330005
One can effectively utilize predicated execution to improve branch handling in instruction-level parallel processors. Although the potential benefits of predicated execution are high, the tradeoffs involved in the design of an instruction set to support predicated execution can be difficult. On one end of the design spectrum, architectural support for full predicated execution requires increasing the number of source operands for all instructions. Full predicate support provides for the most flexibility and the largest potential performance improvements. On the other end, partial predicated execution support, such as conditional moves, requires very little change to existing architectures. This paper presents a preliminary study to qualitatively and quantitatively address the benefit of full and partial predicated execution support. With our current compiler technology, we show that the compiler can use both partial and full predication to achieve speedup in large control-intensive programs. Some details of the code generation techniques are shown to provide insight into the benefit of going from partial to full predication. Preliminary experimental results are very encouraging: partial predication provides an average of 33% performance improvement for an 8-issue processor with no predicate support while full predication provides an additional 33% improvement.
暂无评论