We study the relatively old problem of asymptotically reducing the runtime of serial computations with polynomial size Boolean circuits. To the best of our knowledge, no progress on this problem has been formally repo...
详细信息
ISBN:
(纸本)9781581139860
We study the relatively old problem of asymptotically reducing the runtime of serial computations with polynomial size Boolean circuits. To the best of our knowledge, no progress on this problem has been formally reported in the literature for general computational models, although we observe that early work of Chandra, Stockmeyer, and Vishkin implies the existence of non-uniform unbounded fan-in circuits of tO(1) size and O(t/log log n) depth, for time t Turing machines. We give an algorithmic size-depth tradeoff for parallelizing time t random access Turing machines, a model at least as powerful as logarithmic cost RAMs. Our parallel simulation yields logspace-uniform tO(1) size, O(t/log t) depth Boolean circuits having semi-unbounded fan-in gates. In fact, for appropriate d, uniform tO(1)2 O(t/d) size circuits of depth O(d) can simulate time t. One corollary is that any log-cost time t RAM can be simulated by a log-cost CRCW PRAM using tO(1) processors and O(t/ log t) time. This is a major improvement over previous parallel speedups, which could only guarantee an Ω(log t) speedup with an exponential number of processors. Copyright 2005 acm.
A quantitative comparison of the BSP and LogP models of parallel computation is developed. We concentrate on a variant of LogP that disallows the so-called stalling behavior, although issues surrounding the stalling p...
详细信息
A quantitative comparison of the BSP and LogP models of parallel computation is developed. We concentrate on a variant of LogP that disallows the so-called stalling behavior, although issues surrounding the stalling phenomenon are also explored. Very efficient cross simulations between the two models are derived, showing their substantial equivalence for algorithmic design guided by asymptotic analysis. it is also shown that the two models can be implemented with similar performance on most point-to-point networks. In conclusion, within the limits of our analysis that is mainly of an asymptotic nature. BSP and (stall-free) LogP can be viewed as closely related variants within the bandwidth-latency framework for modeling parallel computation. BSP seems somewhat preferable due to its greater simplicity and portability, and slightly greater power. LogP lends itself more naturally to multiuser mode.
We present a number of efficient parallelalgorithms for constructing 2- and 3-dimensional convex hulls on a randomized CRCW PRAM. Specifically, we show how to build the convex hull of n pre-sorted points in the plane...
详细信息
An O(log 2 n) time, n2/logn processor as well as an O(log n) time, n3/log n processor CREW deterministic parallelalgorithms are presented for constructing Huffman codes from a given list of frequences. The time can b...
详细信息
A n × m (0,1)-matrix is said to satisfy the consecutive-ones property if there is a permutation of the rows of the matrix such that in each column all non-zero entries are adjacent. The problem of determining suc...
详细信息
ISBN:
(纸本)9780897917179
A n × m (0,1)-matrix is said to satisfy the consecutive-ones property if there is a permutation of the rows of the matrix such that in each column all non-zero entries are adjacent. The problem of determining such a permutation, if one exists, is the consecutive-ones property problem. Previously, Klein and Reif [13] gave a parallel solution for the consecutive-ones property problem with an algorithm based on complicated parallel PQ-tree manipulations. The work complexity of this algorithm was improved in [14] to run in time O(log2 n) with a linear number of CRCW processors. We present a new algorithm for this problem, based on a less sophisticated data structure, that improves upon the processor bounds of the previous algorithms by a factor of log n/log log n is general, and by a factor of log n for sufficiently dense problem instances. Our algorithm uses a novel divide-and-conquer approach, and uses for a fundamental data structure the decomposition of graphs into tri-connected components. Solutions to the consecutive-ones problem have important applications to a variety of problems in computational molecular biology, databases, distributed computing, VLSI placement and routing, and graph and network theory.
A malleable parallel task is one whose execution time is a function of the number of (identical) processors alloted to it. We study the problem of scheduling a set of n independent malleable tasks on a fixed number of...
详细信息
ISBN:
(纸本)0898714346
A malleable parallel task is one whose execution time is a function of the number of (identical) processors alloted to it. We study the problem of scheduling a set of n independent malleable tasks on a fixed number of parallel processors, and propose an approximation scheme that for any fixed epsilon > 0 computes in O(n) time a non-preemptive schedule of length at most (1 + epsilon) times the optimum.
A metric tree embedding of expected stretch α maps a weighted n-node graph G = (V, E, ω) to a weighted tree T = (VT, ET, ωT) with V C VT, and dist(v, w, G) ≤ dist(v, w, T) and E[dist(v, w, T)] ≤ αdist(v, w, G) f...
详细信息
ISBN:
(纸本)9781450342100
A metric tree embedding of expected stretch α maps a weighted n-node graph G = (V, E, ω) to a weighted tree T = (VT, ET, ωT) with V C VT, and dist(v, w, G) ≤ dist(v, w, T) and E[dist(v, w, T)] ≤ αdist(v, w, G) for all v, w ∈ V. Such embeddings are highly useful for designing fast approximation algorithms, as many hard problems are easy to solve on tree instances. However, to date the best parallel polylogn depth algorithm that achieves an asymptotically optimal expected stretch of α ∈ O(logn) uses Ω(n2) work and requires a metric as input. In this paper, we show how to achieve the same guarantees using Õ(m1/ϵ) work, where m is the number of edges of G and ϵ > 0 is an arbitrarily small constant. Moreover, one may reduce the work further to Õ(m + n1-ϵ), at the expense of increasing the expected stretch α to O(ϵ-1 log n) using the spanner construction of Baswana and Sen as preprocessing step. Our main tool in deriving these parallelalgorithms is an algebraic characterization of a generalization of the classic Moore-Bellman-Ford algorithm. We consider this framework, which subsumes a large variety of previous "Moore-Bellman-Ford-flavored" algorithms, to be of independent interest.
We present the first parallel architecture for a dynamic approximate membership data-structure (i.e., a filter) that supports insertions, deletions, and approximate membership queries. Our architecture borrows techniq...
详细信息
ISBN:
(纸本)9781450395458
We present the first parallel architecture for a dynamic approximate membership data-structure (i.e., a filter) that supports insertions, deletions, and approximate membership queries. Our architecture borrows techniques from PRAM emulation to obtain a parallel filter based on two levels of fingerprint-dictionaries. A key component in the architecture is a special-purpose wide-word processor we designed to support operations over small dictionaries. We implemented this architecture on an FPGA running at 100MHz. The implementation stores up to 1.44 million keys, has a false-positive rate less than 0.3%, receives batches 16 of operations per cycle, preserves sequential order, and runs with a stable throughput of over a billion operations per second with respect to several benchmarks.
We present several algorithms for sorting efficiently with parallel two-level and multilevel memories. Our main result is an elegant, easy-to-implement, optimal, deterministic algorithm for external sorting with P dis...
详细信息
Work stealing is a well-established technique in multi-core systems that aims to improve load balancing and task scheduling efficiency. Each processing unit maintains its own task queue, and when idle, it steals tasks...
详细信息
ISBN:
(纸本)9798400704161
Work stealing is a well-established technique in multi-core systems that aims to improve load balancing and task scheduling efficiency. Each processing unit maintains its own task queue, and when idle, it steals tasks from other units. Traditional work-stealing approaches face performance bottlenecks due to costly synchronization primitives and contention arising from concurrent access by both the queue owner and thieves. The state-of-the-art solution addresses these issues through coarse-grained synchronization;however, it restricts stealing in specific scenarios, thereby limiting parallelism. We introduce PadWS, a partial and asynchronous delegated work-stealing algorithm. PadWS employs a block-based design in which, under common cases, the queue owner and thieves work on separate blocks, reducing metadata contention. Delegation is partially enabled for the block in which the owner is located, allowing thieves to steal from it-an approach that deviates from the current block-based approach. Additionally, our delegation strategy is asynchronous, which removes the need for thieves to spin-wait after sending a request.
暂无评论