Exascale machines have a small mean time between failures, necessitating fault tolerance. Out-of-the-box fault-tolerant solutions, such as checkpoint-restart and replication, apply to any algorithm but incur significa...
详细信息
ISBN:
(纸本)9798400704161
Exascale machines have a small mean time between failures, necessitating fault tolerance. Out-of-the-box fault-tolerant solutions, such as checkpoint-restart and replication, apply to any algorithm but incur significant overhead costs. Long integer multiplication is a fundamental kernel in numerical linear algebra and cryptography. The naive, schoolbook multiplication algorithm runs in Theta(n(2)) while Toom-Cook algorithms runs in Theta(n(logk (2k -1)) for 2 <= k. We obtain the first efficient fault-tolerant parallel Toom-Cook algorithm. While asymptotically faster FFT-based algorithms exist, Toom-Cook algorithms are often favored in practice on small scale and on supercomputers. Our algorithm enables fault tolerance with negligible overhead costs. Compared to existing, general-purpose, fault-tolerant solutions, our algorithm reduces the arithmetic and communication (bandwidth) overhead costs by a factor of Theta(P/(2k -1)) (where P is the number of processors). To this end, we adapt the fault-tolerant BFS-DFS method of Birnbaum et al. (2020) for fast matrix multiplication and combine it with a coding strategy tailored for Toom-Cook. This eliminates the need for recomputations, resulting in a much faster algorithm.
In this paper, we address the question how efficiently a single constant-degree processor network can simulate the computation of any constant-degree processor network. We show the following lower bound trade-off: If ...
详细信息
In this paper, we address the question how efficiently a single constant-degree processor network can simulate the computation of any constant-degree processor network. We show the following lower bound trade-off: If M is an arbitrary constant-degree processor network of size m that can simulate all constant-degree processor networks of size n with slowdown s, then m·s = Ω(n log m). Our trade-off holds for a very general model of simulations. It covers all previously considered models and all known techniques for simulations among networks. For m ≥ n, this improves a previous lower bound by a factor of log log n, proved for a weaker simulation model. For m < n, this is the first non-trivial lower bound for this problem. In this case, this lower bound is asymptotically tight.
We show that the several problems, whose complexity with respect P and NC was open, are equivalent under NC-reductions. These include: (1) Finding the optimal solution of a two-variable integer program;(2) Determining...
详细信息
Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell cannot be computed until all th...
详细信息
ISBN:
(纸本)9781581134094
Discrete ordinates methods are commonly used to simulate radiation transport for fire or weapons modeling. The computation proceeds by sweeping the flux across a grid. A particular cell cannot be computed until all the cells immediately upwind of it are finished. If the directed dependence graph for the grid cells contains a cycle then sweeping methods will deadlock. This can happen in unstructured grids and time stepped problems where the grid is allowed to deform. In this paper we present a parallel algorithm to detect cycles in the dependence graphs present in these grids as well as an implementation and experimental results on shared and distributed memory machines.
We present parallelalgorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps [26]). For two sets of size n and m (m ≤ n) the algorithms run in expected O(m lg(n/m))...
详细信息
ISBN:
(纸本)9780897919890
We present parallelalgorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps [26]). For two sets of size n and m (m ≤ n) the algorithms run in expected O(m lg(n/m)) work and O(lg n) depth (parallel time) on an EREW PRAM with scan operations (implying O(lg2 n) depth on a plain EREW PRAM). As with the sequential algorithms on treaps for insertion and deletion the main advantage of our algorithms are their simplicity. In fact our algorithms for set operations seem simpler than previous sequential algorithms with the same work bounds, and might therefore also be useful in a sequential context. To analyze the effectiveness of the algorithms we implemented both sequential and parallel versions of the algorithms and ran several experiments on them. Our parallel implementation uses the Cilk [5] shared memory runtime system on a 16 processor SGI Power Challenge and a 6 processor Sun Ultra Enterprise 3000. It shows reasonable speedup: 6.3 to 6.8 speedup on 8 processors of the SGI, and 4.1 to 4.4 speedup on 5 processors of the Sun.
The time required to compute any function of a collection of circular doubly linked lists on a CROW PRAM is shown to be at most a constant factor more than on a CREW PRAM, but this is not true for singly linked lists....
详细信息
ISBN:
(纸本)0897913701
The time required to compute any function of a collection of circular doubly linked lists on a CROW PRAM is shown to be at most a constant factor more than on a CREW PRAM, but this is not true for singly linked lists. A tight lower bound of Ω(log log n) for colouring an n node doubly linked list on a CROW PRAM using a constant number of colours is also obtained.
Switching cells in parallel is a common approach to build switches with very high external line rate and a large number of ports. A prime example is the parallel packet switch (in short, PPS) in which a demultiplexing...
详细信息
ISBN:
(纸本)9781581139860
Switching cells in parallel is a common approach to build switches with very high external line rate and a large number of ports. A prime example is the parallel packet switch (in short, PPS) in which a demultiplexing algorithm sends cells, arriving at rate R on N input-ports, through one of K intermediate slower switches, operating at rate τ < R. This paper presents lower bounds on the average queuing delay introduced by the PPS relative to an optimal workconserving FCFS switch, for demultiplexing algorithms that does not have full and immediate information about the switch status. The bounds hold even if the algorithm is randomized. These lower bounds are shown to be asymptotically optimal through a new methodology for analyzing the maximal relative queuing delay;this clearly upper bounds their average relative queuing delay. The methodology is used to devise a new algorithm that relies on slightly out-dated global information on the switch status. It is also used to provide, for the first time, a complete proof of the maximum relative queuing delay provided by the fractional traffic dispatch [19, 22] algorithm. Copyright 2005 acm.
Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. We present a novel method to dynamically balance the processor workloads ...
详细信息
ISBN:
(纸本)9780897918909
Mesh adaption is a powerful tool for efficient unstructured-grid computations but causes load imbalance among processors on a parallel machine. We present a novel method to dynamically balance the processor workloads with a global view. This paper presents, for the first time, the implementation and integration of all major components within our dynamic load balancing strategy for adaptive grid calculations. Mesh adaption, repartitioning, processor assignment, and remapping are critical components of the framework that must be accomplished rapidly and efficiently so as not to cause a significant overhead to the numerical simulation. Previous results indicated that mesh repartitioning and data remapping are potential bottlenecks for performing large-scale scientific calculations. We resolve these issues and demonstrate that our framework remains viable on a large number of processors.
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache ...
详细信息
ISBN:
(纸本)9781450312134
We present the parallel buffer tree, a parallel external memory (PEM) data structure for batched search problems. This data structure is a non-trivial extension of Arge's sequential buffer tree to a private-cache multiprocessor environment and reduces the number of I/O operations by the number of available processor cores compared to its sequential counterpart, thereby taking full advantage of multicore parallelism. The parallel buffer tree is a search tree data structure that supports the batched parallel processing of a sequence of N insertions, deletions, membership queries, and range queries in the optimal O(sortP (N) + K/PB) parallel I/O complexity, where K is the size of the output reported in the process and sortP (N) is the parallel I/O complexity of sorting N elements using P processors. Copyright 2012 acm.
A basic problem in hypergraphs is that of finding a large independent set-one of guaranteed size-in a given hypergraph. Understanding the parallel complexity of this and related independent set problems on hypergraphs...
详细信息
ISBN:
(纸本)9781581134094
A basic problem in hypergraphs is that of finding a large independent set-one of guaranteed size-in a given hypergraph. Understanding the parallel complexity of this and related independent set problems on hypergraphs is a fundamental open issue in parallel computation. Caro and Tuza (J. Graph Theory, Vol. 15, pp. 99-107, 1991) have shown a certain lower bound αk(H) on the size of a maximum independent set in a given k-uniform hypergraph H, and have also presented an efficient sequential algorithm to find an independent set of size αk(H). They also show that αk(H) is the size of the maximum independent set for various hypergraph families. Here, we develop the first RNC algorithm to find an independent set of size αk (H), and also derandomize it for various special cases. We also present lower bounds on independent set size and corresponding RNC algorithms for non-uniform hypergraphs.
暂无评论