Tree search algorithms play an important role in many applications in the field of artificial intelligence. When playing board games like chess etc., computers use game tree search algorithms to evaluate a position. I...
详细信息
ISBN:
(纸本)9781581134094
Tree search algorithms play an important role in many applications in the field of artificial intelligence. When playing board games like chess etc., computers use game tree search algorithms to evaluate a position. In this paper, we present a procedure that we call parallel Controlled Conspiracy Number Search (parallel CCNS). Shortly, we describe the principles of the sequential CCNS algorithm, which bases its approximation-results on irregular subtrees of the entire game tree. We have parallelized CCNS and implemented it in our chess program ***, which now is the first in the world, that could win a high-ranked Grandmaster chess-tournament. We add experiments that show a speedup of about 50 on 159 processors running on an SCI-workstation-cluster.
We study the problem of storing an ordered E;et On an asynchronous shared memory parallel computer. We examine the case where we want to perform successor (least upper bound) queries efficiently on the set members tha...
详细信息
We study the problem of storing an ordered E;et On an asynchronous shared memory parallel computer. We examine the case where we want to perform successor (least upper bound) queries efficiently on the set members that are stored. We also examine the case where processors insert and delete members of the set. Due to asynchrony, we require processors to perform queries and to maintain the structure independently. Although several such structures have been proposed, the analysis of these structures has been very limited. We here ut;e the recently proposed QRQW PRAM model to provide upper and lower bounds on the performance of such data structures. In the asynchronous QRQW PRAM, the problem of processors concurrently and independently searching a shared data structure is very similar to the problem of routing packets through a network. Using this as a guide, we introduce the Search-Butterfly, a search structure that combines the efficient packet routing properties of the butterfly graph withthe efficient search structure properties of the B-Tree. We analyze the behavior of the Search-Butterfly when the following operations are performed: arbitrary searches, random searches,and random searches, insertions, and deletions. We also provide lower bounds that show that the results are within a factor of O (log n) of optimal where n is the number of keys;in the structure. When the searches are random, the results are within a constant factor of optimal. Many of the proofs are derived from closely related results for packet routing. Others are of independent interest, most notably a method of adding queues to any network belonging to a large class of queuing networks with non-Markovian routing in a manner that allows us to bound the delay experienced by packets in the augmented network.
We describe an efficient parallel implementation of Goldberg's maximum flow algorithm for a shared-memory multiprocessor. Our main technical innovation is a method that allows a 'global relabeling' heurist...
详细信息
ISBN:
(纸本)089791483X
We describe an efficient parallel implementation of Goldberg's maximum flow algorithm for a shared-memory multiprocessor. Our main technical innovation is a method that allows a 'global relabeling' heuristic to be executed concurrently withthe main algorithm. this heuristic is essential for good performance in practice. We present performance results from a Sequent Symmetry for a variety of input distributions. We achieve speed-ups of up to 8.8 with 16 processors, relative to the parallel program with 1 processor (5.8 when compared to our best sequential program). We consider these speed-ups very good and we provide evidence that hardware effects and insufficient parallelism in certain inputs are the main obstacles to achieving better performance.
We show an improved parallel algorithm for decomposing an undirected unweighted graph into small diameter pieces with a small fraction of the edges in between. these decompositions form critical subroutines in a numbe...
详细信息
ISBN:
(纸本)9781450315722
We show an improved parallel algorithm for decomposing an undirected unweighted graph into small diameter pieces with a small fraction of the edges in between. these decompositions form critical subroutines in a number of graph algorithms. Our algorithm builds upon the shifted shortest path approach introduced in [Blelloch, Gupta, Koutis, Miller, Peng, Tangwongsan. SPAA 2011]. By combining various stages of the previous algorithm, we obtain a significantly simpler algorithm withthe same asymptotic guarantees as the best sequential algorithm.
We present lower bounds for time needed to solve basic problems on three general-purpose models of parallel computation: the shared-memory models QSM and s-QSW, and the distributed-memory model, the BSP. For each of t...
详细信息
ISBN:
(纸本)9780897919890
We present lower bounds for time needed to solve basic problems on three general-purpose models of parallel computation: the shared-memory models QSM and s-QSW, and the distributed-memory model, the BSP. For each of these models, we also obtain lower bounds for the number of rounds needed to solve these problems using a randomized algorithm on a p-processor machine. Our results on 'rounds' is of special interest in the context of designing work-efficient algorithms on a machine where latency and synchronization costs are high. Many of our lower bound results are complemented by upper bounds that match the lower bound or are close to it.
We present two asymptotically optimal Θ(n) time algorithms for labeling the connected components of a gray-scale image on a mesh-connected computer. We assume that the input is an n × n gray-scale image mapped o...
详细信息
ISBN:
(纸本)089791483X
We present two asymptotically optimal Θ(n) time algorithms for labeling the connected components of a gray-scale image on a mesh-connected computer. We assume that the input is an n × n gray-scale image mapped one pixel per processor onto an n × n mesh-connected computer. Our algorithms label the components so that every component is connected, the maximum difference in the gray-scale values of the pixels within any component does not exceed a given value, and no component can be merged with a neighboring component. the first algorithm is based on a divide-and-conquer approach. Although it is simple, this algorithm has the potential drawback of possibly assigning two adjacent pixels withthe same gray-scale value to different components. the second algorithm avoids this potential drawback, and exploits the ability of a mesh-connected computer to efficiently determine a maximal independent set of a planar graph.
this paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP);that is, Explicit Multi-threading (X...
详细信息
ISBN:
(纸本)9780897919890
this paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP);that is, Explicit Multi-threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithmsthrough architecture to implementation is introduced;new elements are added where needed.
this paper presents a work-optimal CGM algorithm that solves the Longest Increasing Subsequence Problem. It can be implemented in the CGM with P processors in O(N2/P) time and O(P) communication steps. It is the first...
详细信息
this paper presents a work-optimal CGM algorithm that solves the Longest Increasing Subsequence Problem. It can be implemented in the CGM with P processors in O(N2/P) time and O(P) communication steps. It is the first CGM algorithm for this problem and it is work-optimal since the sequential algorithm has a complexity of O(N2).
the amount of Task Level parallelism (TLP) in runtime workload is useful information to determine the efficient us age of multiprocessors. this paper presents mechanisms to dynamically estimate the amount of TLP in ru...
详细信息
ISBN:
(纸本)0769525091
the amount of Task Level parallelism (TLP) in runtime workload is useful information to determine the efficient us age of multiprocessors. this paper presents mechanisms to dynamically estimate the amount of TLP in runtime work loads. Modifications are added to the operating system (OS) to collect information about processor utilization, task activities, from which TLP can be calculated. By effectively utilizing the Time Stamp Counter (TSC) hardware, the task activities can be monitored at fine time resolution, result ing in capability of estimation of TLP at fine granularity. We implement the mechanisms on a recent version of Linux OS. Evaluation results indicate that the mechanisms can estimate TLP accurately for various kinds of workloads with small overheads.
A global barrier synchronizes all processors in a parallel system. this paper investigates algorithmsthat allow disjoint subsets of processors to synchronize independently and in parallel. the user model of a subset ...
详细信息
ISBN:
(纸本)089791483X
A global barrier synchronizes all processors in a parallel system. this paper investigates algorithmsthat allow disjoint subsets of processors to synchronize independently and in parallel. the user model of a subset barrier is straight forward;a processor that participates in a subset barrier needs to know only the name of the barrier and the number of participating processors. this paper identifies two general communication models for private-memory parallel systems: the bounded buffer broadcast model and the anonymous destination message passing model and presents algorithms for barrier synchronization in the terms of these models. the models are detailed enough to allow meaningful cost estimates for their primitives, yet independent of a specific architecture and can be supported efficiently by a modern private memory parallel system. the anonymous destination message passing model is the most attractive. the time complexity to synchronize over a uni-directional ring of N processors is O(log N) for common cases, and O(√N) in the worst case. the algorithms have been implemented on iWarp, a private-memory parallel system and are now in daily use. the paper concludes with timing measurements obtained on a 64-node system.
暂无评论