Most high-level parallel programming languages allow for fine-grained parallelism. Programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to ...
详细信息
ISBN:
(纸本)9780897917179
Most high-level parallel programming languages allow for fine-grained parallelism. Programs written in such languages can express the full parallelism in the program without specifying the mapping of program tasks to processors. When executing such programs, the major concern is to dynamically schedule tasks to processors in order to minimize execution time and the amount of memory needed. In this paper, a class of parallel schedules that are provably efficient in both time and space, even for programs whose task structure is revealed only during execution are identified. Following this, an efficient dynamic scheduling algorithm that generates schedules in this class, for languages with nested fine-grained parallelism is described.
It is well known that simple randomized load balancing schemes can balance load effectively while incurring only a small overhead, making such schemes appealing for practical systems. In this paper, we provide new ana...
详细信息
ISBN:
(纸本)9780897918909
It is well known that simple randomized load balancing schemes can balance load effectively while incurring only a small overhead, making such schemes appealing for practical systems. In this paper, we provide new analyses for several such dynamic randomized load balancing schemes. Unlike previous analyses, we do not assume that in equilibrium each server is stochastically independent from other servers. Our work extends a previous analysis of the supermarket model, a model that abstracts a simple, efficient load balancing scheme in the setting where jobs arrive at a large system of parallel processors. In this model, customers arrive at a system of n servers as a Poisson stream of rate λn, λ
Cache-oblivious algorithms have the advantage of achieving good sequential cache complexity across all levels of a multi-level cache hierarchy, regardless of the specifies (cache size and cache line size) of each leve...
详细信息
ISBN:
(纸本)9781605586069
Cache-oblivious algorithms have the advantage of achieving good sequential cache complexity across all levels of a multi-level cache hierarchy, regardless of the specifies (cache size and cache line size) of each level. In this paper, we describe cache-oblivious sorting algorithms with optimal work, optimal cache complexity and polylogarithmic depth. Using known mappings, these lead to low cache complexities on shared-memory multiprocessors with a single level of private caches or a single shared cache. Moreover, the low cache complexities extend to shared-memory multiprocessors with common configurations of multi-level caches. The key factor in the low cache complexity on multiprocessors is the low depth of the algorithms we propose.
Mining association rules from large databases is an important problem in data mining. There is a need to develop parallel algorithm for this problem because it is a very costly computation process. However, all propos...
详细信息
ISBN:
(纸本)9780897919890
Mining association rules from large databases is an important problem in data mining. There is a need to develop parallel algorithm for this problem because it is a very costly computation process. However, all proposed parallelalgorithms for mining association rules follow the conventional level-wise approach. On a shared-memory multi-processors, they will impose a synchronization in every iteration which degrades greatly their performance. The deficiency comes from the contention on the shared I/O channel when all processors are accessing their database partitions in the shared storage synchronously. An asynchronous algorithm APM has been proposed for mining association rules on shared-memory multiprocessors. All participating processors in APM generate candidates and count their supports independently without synchronization. Furthermore, it can finish the computation with less I/O than required in the level-wise approach. The algorithm has been implemented on a Sun Enterprise 4000 multi-processors with 12 nodes. The experiments show that APM has super performance than other proposed synchronous algorithms.
NEighborhood MOdeling (NEMO) provides transparent parallelism to develop parallel spatial data processing. It addresses the following parallel issues: architecture and machine independence;communication bottlenecks;da...
详细信息
NEighborhood MOdeling (NEMO) provides transparent parallelism to develop parallel spatial data processing. It addresses the following parallel issues: architecture and machine independence;communication bottlenecks;data visualization;casualty errors;load balancing;and data coherence. NEMO is capable of processing three types of time consuming raster neighborhood models: cellular automata;propagation;and neighborhood analysis. NEMO achieves this flexibility by including five components to its design: the three application drivers such as the cellular automata driver, propagation driver, and neighborhood analysis automata driver;and the display manager and raster database manager.
We present a randomized parallel algorithm for constructing the 3D convex hull on a generic p-processor coarse grained multicomputer with arbitrary interconnection network and n/p local memory per processor, where n/p...
详细信息
We present a randomized parallel algorithm for constructing the 3D convex hull on a generic p-processor coarse grained multicomputer with arbitrary interconnection network and n/p local memory per processor, where n/p ≥ p2+Ε (for some arbitrarily small Ε > 0). For any given set of n points in 3-space, the algorithm computes the 3D convex hull, with high probability, in O(n log n÷p) local computation time and O(1) communication phases with at most O(n÷p) data sent/received by each processor. That is, with high probability, the algorithm computes the 3D convex hull of an arbitrary point set in time O(n log n÷p + Γn,p), where Γn,p denotes the time complexity of one communication phase. In the terminology of the BSP model, our algorithm requires, with high probability, O(1) supersteps and a synchronization period Θ(n log n÷p). In the LogP model, the execution time of our algorithm is asymptotically optimal for several architectures.
The original LogP model for parallel computation use four parameters;the communication latency (L), overhead (n), bandwidth (g), and the number of processors (P). This basic model is extended with a linear model for l...
详细信息
The original LogP model for parallel computation use four parameters;the communication latency (L), overhead (n), bandwidth (g), and the number of processors (P). This basic model is extended with a linear model for long messages. The resulting combination which is called the LogGP model for parallel computation has one additional parameter, G, which captures the bandwidth obtained for long messages. In this paper, the algorithm design and analysis under the new model are discussed, and the all-to-all remap, FFT, and radix sort are examined. In addition, the single node scatter problem is, examined. Finally, solutions are derived for the problem, and the their optimality under the LogGP model is proven.
We study the parallel query complexity of reconstructing binary trees from simple queries involving their nodes. We show that a querier can efficiently reconstruct a binary tree with a logarithmic number of rounds and...
详细信息
Padded sorting requires n input keys to be output in sorted order in an array with slightly more than n locations, unused locations being filled with a special null value. We show that a deterministic CRCW PRAM with k...
详细信息
As parallel machines scale to one million nodes and beyond, it becomes increasingly difficult to build a reliable network that is able to guarantee packet delivery. Eventually large systems will need to employ fault-t...
详细信息
ISBN:
(纸本)9781581135299
As parallel machines scale to one million nodes and beyond, it becomes increasingly difficult to build a reliable network that is able to guarantee packet delivery. Eventually large systems will need to employ fault-tolerant messaging protocols that afford correct execution in the presence of a lossy network. In this paper we present a lightweight protocol that preserves message idempotence and is easy to implement in hardware. We identify the requirements for a correct implementation of the protocol. Experiments are performed in simulation to determine implementation parameters that optimize performance. We find that an aggressive implementation on a fat tree network results in a slowdown of less than 2x compared to buffered wormhole routing on a fault-free network.
暂无评论