Currently, the best known tradeoff between approximation ratio and complexity for the Sparsest Cut problem is achieved by the algorithm in [Sherman, FOCS 2009]: it computes O(root(log n)/epsilon)-approximation using O...
详细信息
ISBN:
(纸本)9798400704161
Currently, the best known tradeoff between approximation ratio and complexity for the Sparsest Cut problem is achieved by the algorithm in [Sherman, FOCS 2009]: it computes O(root(log n)/epsilon)-approximation using O(n(epsilon) log(O(1))n) maxflows for any epsilon is an element of[Theta(1/log n), Theta(1)]. It works by solving the SDP relaxation of [Arora-Rao-Vazirani, STOC 2004] using the Multiplicative Weights Update algorithm (MW) of [Arora-Kale, Jacm 2016]. To implement one MW step, Sherman approximately solves a multicommodity flow problem using another application of MW. Nested MW steps are solved via a certain "chaining" algorithm that combines results of multiple calls to the maxflow algorithm. We present an alternative approach that avoids solving the multicommodity flow problem and instead computes "violating paths". This simplifies Sherman's algorithm by removing a need for a nested application of MW, and also allows parallelization: we show how to compute O(root(log n)/epsilon)-approximation via O(log(O(1)) n) maxflows using O(n(epsilon)) processors. We also revisit Sherman's chaining algorithm, and present a simpler version together with a new analysis.
Transactional memory (TM) is a promising approach for designing concurrent data structures, and it is essential to develop better understanding of the formal properties that can be achieved by TM implementations. Two ...
详细信息
ISBN:
(纸本)9781605586069
Transactional memory (TM) is a promising approach for designing concurrent data structures, and it is essential to develop better understanding of the formal properties that can be achieved by TM implementations. Two fundamental properties of TM implementations are disjoint-access parallelism, which is critical for their scalability, and the invisibility of read operations, which reduces memory contention. This paper proves an inherent tradeoff for implementations of transactional memories: they cannot be both disjoint-access parallel and have read-only transactions that are invisible and always terminate successfully. In fact, a lower bound of Omega(t) is proved on the number of writes needed in order to implement a read-only transaction of t items, which successfully terminates in a disjoint-access parallel TM implementation. The results assume strict serializability and thus hold under the assumption of opacity. It is shown how to extend the results to hold also for weaker consistency conditions, serializability and snapshot isolation.
In this paper we present the Segment router, a novel router design for the interconnection networks of massively parallel computers. The design decisions of the Segment router are motivated by the need to improve the ...
详细信息
ISBN:
(纸本)9780897916714
In this paper we present the Segment router, a novel router design for the interconnection networks of massively parallel computers. The design decisions of the Segment router are motivated by the need to improve the network performance when the traffic consists of messages with widely different *** key novelty of the Segment router is that it provides different queueing policies for the two classes of messages it services. Short messages are stored in a centralized, dynamically allocated queue that guarantees storage availability for the whole packet of the short message. Long messages on the other hand are stored in small FIFO buffers associated with the input channels of the router. Thus when a long message becomes blocked, it is stored in segments in the FIFO buffers of a number of routing elements in the network. Furthermore, the physical channels of the Segment router are fairly multiplexed between the two supported classes of messages without the potential for *** simulations we compare the performance of our technique to the performance of multipacket messages and networks implementing lanes. The results clearly demonstrate the performance advantages of our technique.
We present a mechanism for partially aborting transactions through the use of data structure checkpoints and control-flow continuations. In particular, we show that boosted transactions [9] already have built-in resto...
详细信息
ISBN:
(纸本)9781595939739
We present a mechanism for partially aborting transactions through the use of data structure checkpoints and control-flow continuations. In particular, we show that boosted transactions [9] already have built-in restoration points and afford a simple, efficient implementation. Our mechanism is far simpler than previous work, which relied on complex nesting schemes to establish checkpoints. We demonstrate syntactic advantages and we quantify the overhead of checkpoints and explore several examples, illustrating the utility of partially aborting transactions. We additionally present a novel queue-based spin lock which allows threads to timeout and differ in priority. Unlike the known lock due to Craig [5], our lock is more efficient for priority schemes of few levels.
Exascale machines have a small mean time between failures, necessitating fault tolerance. Out-of-the-box fault-tolerant solutions, such as checkpoint-restart and replication, apply to any algorithm but incur significa...
详细信息
ISBN:
(纸本)9798400704161
Exascale machines have a small mean time between failures, necessitating fault tolerance. Out-of-the-box fault-tolerant solutions, such as checkpoint-restart and replication, apply to any algorithm but incur significant overhead costs. Long integer multiplication is a fundamental kernel in numerical linear algebra and cryptography. The naive, schoolbook multiplication algorithm runs in Theta(n(2)) while Toom-Cook algorithms runs in Theta(n(logk (2k -1)) for 2 <= k. We obtain the first efficient fault-tolerant parallel Toom-Cook algorithm. While asymptotically faster FFT-based algorithms exist, Toom-Cook algorithms are often favored in practice on small scale and on supercomputers. Our algorithm enables fault tolerance with negligible overhead costs. Compared to existing, general-purpose, fault-tolerant solutions, our algorithm reduces the arithmetic and communication (bandwidth) overhead costs by a factor of Theta(P/(2k -1)) (where P is the number of processors). To this end, we adapt the fault-tolerant BFS-DFS method of Birnbaum et al. (2020) for fast matrix multiplication and combine it with a coding strategy tailored for Toom-Cook. This eliminates the need for recomputations, resulting in a much faster algorithm.
Chips (or chip sets) which include one or more CPUs, some local memory, and rudimentary communications and routing hardware are becoming common (eg. transputers, SRCs HNet, the nodes of most MIMD machines). These chip...
详细信息
We present a parallel randomized algorithm for finding if two planar graphs are isomorphic. Assuming that we have a tree of separators for each planar graph, our algorithm takes O(log(n)) time with P = (n1.5&middo...
详细信息
ISBN:
(纸本)0897913701
We present a parallel randomized algorithm for finding if two planar graphs are isomorphic. Assuming that we have a tree of separators for each planar graph, our algorithm takes O(log(n)) time with P = (n1.5·√log(n)) processors with probability to fail of 1/n or less, where n is the number of vertices. The algorithms needs 2·log(m)·log(n) + O(log(n)) random bits. The number of random bits can be decreased to O(log(n)) by increasing the processors number to n3/2+Ε. This algorithm significantly improves the previous results of n4 processors.
The primary purpose of Cray Research computer systems is the timely solution of complex problems in science and engineering. A few examples illustrate that the CRAY C90 is currently the world's most powerful tool ...
详细信息
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B con...
详细信息
ISBN:
(纸本)9780897918091
We consider the problem of sorting a file of N records on the D-disk model of parallel I/O [VS94] in which there are two sources of parallelism. Records are transferred to and from disk concurrently in blocks of B contiguous records. In each I/O operation, up to one block can be transferred to or from each of the D disks in parallel. We propose a simple, efficient, randomized mergesort algorithm called SRM that uses a forecast-and-flush approach to overcome the inherent difficulties of simple merging on parallel disks. SRM exhibits a limited use of randomization and also has a useful deterministic version. Generalizing the forecasting technique of [Knu73], our algorithm is able to read in, at any time, the `right' block from any disk, and using the technique of flushing, our algorithm evicts, without any I/O overhead, just the `right' blocks from memory to make space for new ones to be read in. The disk layout of SRM is such that it enjoys perfect write parallelism, avoiding fundamental inefficiencies of previous mergesort algorithms. Our analysis technique involves a novel reduction to various maximum occupancy problems. We prove that the expected I/O performance of SRM is efficient under varying sizes of memory and that it compares favorably in practice to disk-striped mergesort (DSM). Our studies indicate that SRM outperforms DSM even when the number D of parallel disks is fairly small.
The proceedings contain 39 papers. The topics discussed include: Fast greedy algorithms in MapReduce and streaming;reduced hardware transactions: a new approach to hybrid transactional memory;recursive design of hardw...
The proceedings contain 39 papers. The topics discussed include: Fast greedy algorithms in MapReduce and streaming;reduced hardware transactions: a new approach to hybrid transactional memory;recursive design of hardware priority queues;drop the anchor: lightweight memory management for non-blocking data structures;scalable statistics counters;storage and search in dynamic peer-to-peer networks;expected sum and maximum of displacement of random sensors for coverage of a domain;on dynamics in selfish network creation;brief announcement: truly parallel burrows-wheeler compression and decompression;brief announcement: locality in wireless scheduling;brief announcement: universally truthful secondary spectrum auctions;and brief announcement: online batch scheduling for flow objectives.
暂无评论