External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Block-wise access to data is a central the...
详细信息
ISBN:
(纸本)9780897919890
External memory (EM) algorithms are designed for computational problems in which the size of the internal memory of the computer is only a small fraction of the problem size. Block-wise access to data is a central theme in the design of efficient EM algorithms. A similar requirement arises in the transmission of data between processors in certain parallel computation models such as BSP* and CGM, for which block-wise communication is a crucial issue. We consider multisearch problems, where a large number of queries are to be simultaneously processed and satisfied by navigating through large data structures on parallel computers. The examples used originate as BSP* algorithms and we adapt them to the EM situation where the queries and data structure are considered to be much larger than the size of the available internal memory. This paper presents techniques to achieve blocking for I/O as well as for communication in multisearch on the BSP* and EM-BSP* models. In the area of EM algorithms new algorithms for multisearch in balanced trees are described. For search trees up to size O(n log n) where n is the number of queries, we obtain work-optimal, parallel, EM multisearch algorithms whose I/O and communication time are the same, asymptotically, as the computation time. These algorithms are obtained via the simulation technique of [15]. For larger trees we describe a parallel, EM algorithm which is simultaneously c-optimal and I/O-optimal. We give a lower bound to the number of I/O operations required for filtering n queries through a binary or multiway search tree of size m when m ≥ n2+Ε, constant Ε > 0. In the domain of parallel, non-EM algorithms we describe a new 1-optimal algorithm for BSP* multisearch in optimal trees which communicates in a block-wise fashion.
We give an alphabet-independent deterministic parallel algorithm for finding all occurrences of a pattern array of size m(h) x m(w) in a text array of size n(h) x n(w) in the concurrent-read-concurrent-write-parallel-...
详细信息
We give an alphabet-independent deterministic parallel algorithm for finding all occurrences of a pattern array of size m(h) x m(w) in a text array of size n(h) x n(w) in the concurrent-read-concurrent-write-parallel-random-access-machine (CRCW-PRAM) model. Our algorithm runs in O(1) time performing optimal, that is, O(n(h) x n(w)) work, following preprocessing of the pattern. This improves the previous best bound of O(loglogm) time with optimal work [A. Amir, G. Benson, and M. Farach, proceedings 5th annualacmsymposium on parallelalgorithms and architectures, acm, New York, 1993, pp. 79-85], following preprocessing of the pattern, where m = max {m(h),m(w)}. The preprocessing required by our algorithm (and that due to Amir, Benson, and Farach) can be accomplished in O(loglogm) time and O(m(h) x m(w)) work [M. Crochemore et al., manuscript, 1993], [R. Cole et al., manuscript, 1993].
We consider the following scheduling problem: Our goal is to execute a given amount of arbitrarily decomposable work on a distributed machine as quickly as possible. The work is maintained by a central scheduler that ...
详细信息
ISBN:
(纸本)9780897919890
We consider the following scheduling problem: Our goal is to execute a given amount of arbitrarily decomposable work on a distributed machine as quickly as possible. The work is maintained by a central scheduler that can assign chunks of work of an arbitrary size to idle processors. The difficulty is that the processing time required for a chunk is not exactly predictable - usually the less, the larger the chunk - and that processors suffer a delay for each assignment. Our objective is to minimize the total wasted time of the schedule, that is, the sum of all delays plus the idle times of processors waiting for the last processor to finish. We introduce a new deterministic model for this setting, based on estimated ranges [α(w), β(w)] for processing times of chunks of size w. Depending on α, β, and a measure for the overall deviation from these estimates, we can prove matching upper and lower bounds on the wasted time, the former being achieved by our new balancing strategy. This is in sharp contrast with previous work that, even under the strong assumption of independent, approximately normally distributed chunk processing times, proposed only heuristic scheduling schemes supported merely by empirical evidence. Our model naturally subsumes this stochastic setting, and our generic analysis is valid for most of the existing schemes too, proving them to be non-optimal.
When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a 'data race' occurs, which is usually a bug. This paper describes th...
详细信息
ISBN:
(纸本)9780897919890
When two parallel threads holding no locks in common access the same memory location and at least one of the threads modifies the location, a 'data race' occurs, which is usually a bug. This paper describes the algorithms and strategies used by a debugging tool, called the Nondeterminator-2, which checks for data races in programs coded in the Cilk multithreaded language. Like its predecessor, the Nondeterminator, which checks for simple 'determinacy' races, the Nondeterminator-2 is a debugging tool, not a verifier, since it checks for data races only in the computation generated by a serial execution of the program on a given input. We give an algorithm, ALL-SETS, that determines whether the computation generated by a serial execution of a Cilk program on a given input contains a race. For a program that runs serially in time T, accesses V shared memory locations, uses a total of n locks, and holds at most k n locks simultaneously, ALL-SETS runs in. O(nkT α(V, V)) time and O(nkV) space, where α is Tarjan's functional inverse of Ackermann's function. Since ALL-SETS may be too inefficient in the worst case, we propose a much more efficient algorithm which can be used to detect races in programs that obey the 'umbrella' locking discipline, a programming methodology that is more flexible than similar disciplines proposed in the literature. We present an algorithm, BRELLY, which detects violations of the umbrella discipline in O(kT α(V, V)) time using O(kV) space. We also prove that any 'abelian' Cilk program, one whose critical sections commute, produces a determinate final state if it is deadlock free and if it generates any computation which is data-race free. Thus, the Nondeterminator-2's two algorithms can verify the determinacy of a deadlock-free abelian program running on a given input.
In this paper we consider the problem of deadlock-free routing in arbitrary parallel and distributed computers. We focus on asynchronous routing algorithms which continuously receive new packets to route and which do ...
详细信息
ISBN:
(纸本)9780897919890
In this paper we consider the problem of deadlock-free routing in arbitrary parallel and distributed computers. We focus on asynchronous routing algorithms which continuously receive new packets to route and which do not discard packets that encounter congestion. Specifically, we examine what we call the deadlock-free routing (DFR) problem. The input to the DFR problem consists of an arbitrary network and an arbitrary set of paths in the network. The output consists of a routing algorithm, which is a list of the buffers used along each of the paths. The routing algorithm is required to be free from deadlock and the goal is to minimize the number of buffers required in any one node. We study the DFR problem by converting it into an equivalent problem which we call the flattest common supersequence (FCS) problem. The input to the FCS problem consists of a set of sequences and the output consists of a single sequence that contains all of the input sequences as (possibly non-contiguous) subsequences. The goal of the FCS problem is to minimize the maximum frequency of any symbol in the output sequence. We present three main results. First, we prove that the decision version of the FCS problem is NP-complete, and has no polynomial-time approximation scheme unless P = NP. Next, we propose and experimentally evaluate a range of heuristics for FCS. Our experimental results show that one of these heuristics performs very well over a wide range of inputs. Lastly, we prove that this heuristic is in fact optimal for certain restricted classes of inputs.
A fundamental issue affecting the performance of a parallel program is the assignment of tasks to processors in order to get the minimum completion rime. In this paper, we present a compilation-time two-stage mapping ...
详细信息
ISBN:
(纸本)0818684038
A fundamental issue affecting the performance of a parallel program is the assignment of tasks to processors in order to get the minimum completion rime. In this paper, we present a compilation-time two-stage mapping strategy (denoted as CREMA) used for mapping arbitrary programs (modeled as TIG graphs) onto message-passing parallel systems with any architecture. In contrast to most of the other approaches found in the literature, CREMA is not tied to any particular architecture or any specific algorithm. The first stage is based on task clustering and task reassignment algorithms that contract the original task graph. The second stage takes the contracted graph and tries to successfully match the physical properties of the target system. It has been evaluated for a wide range of bath regular and irregular graphs that correspond to some well-known real applications. The results show that CREMA provides a good trade-off between mapping quality and computational complexity.
We consider the following general problem modeling load balancing in a variety of distributed settings. Given an arbitrary undirected connected graph G = (V, E) and a weight distribution w(0) on the nodes, determine a...
详细信息
We consider the following general problem modeling load balancing in a variety of distributed settings. Given an arbitrary undirected connected graph G = (V, E) and a weight distribution w(0) on the nodes, determine a schedule to move weights across edges in each step so as to (approximately) balance the weights on the nodes. We focus on diffusive schedules for this problem. All previously studied diffusive schedules can be modeled as w(t+l) = Mw(t) where w(t) is the weight distribution after t steps and M is a doubly stochastic matrix. We call these the first-order schedules. First-order schedules, although widely used in practice, are often slow. In this paper we introduce a new direction in diffusive schedules by considering schedules that are modeled as: w(1) = Mw(0);w(t+1) = beta Mw(t) + (1 - beta)w(t-1) for some appropriate beta;we call these the second-order schedules. In the idealized setting of weights being real numbers, we adopt known results to show that beta can be chosen so that the second-order schedule involves significantly fewer steps than the first-order method for approximate load balancing. In the realistic setting when the weights are positive integers, we simulate the idealized schedules by maintaining I Owe You units on the edges. Extensive experiments with simulated data and real-life data from JOSTLE, a mesh-partitioning software, show that the resultant realistic schedule is close to the idealized schedule, and it again involves fewer steps than the first-order schedules for approximate load balancing. Our main result is therefore a fast algorithm for coarse load balancing that can be used in a variety of applications.
暂无评论