This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP);that is, Explicit Multi-Threading (X...
详细信息
ISBN:
(纸本)9780897919890
This paper envisions an extension to a standard instruction set which efficiently implements PRAM-style algorithms using explicit multi-threaded instruction-level parallelism (ILP);that is, Explicit Multi-Threading (XMT), a fine-grained computational paradigm covering the spectrum from algorithms through architecture to implementation is introduced;new elements are added where needed.
We present lower bounds for time needed to solve basic problems on three general-purpose models of parallel computation: the shared-memory models QSM and s-QSW, and the distributed-memory model, the BSP. For each of t...
详细信息
ISBN:
(纸本)9780897919890
We present lower bounds for time needed to solve basic problems on three general-purpose models of parallel computation: the shared-memory models QSM and s-QSW, and the distributed-memory model, the BSP. For each of these models, we also obtain lower bounds for the number of rounds needed to solve these problems using a randomized algorithm on a p-processor machine. Our results on 'rounds' is of special interest in the context of designing work-efficient algorithms on a machine where latency and synchronization costs are high. Many of our lower bound results are complemented by upper bounds that match the lower bound or are close to it.
This paper analyzes job scheduling for parallel computers by using theoretical and experimental means. Based on existing architectures we first present a machine and a job model. Then, we propose a simple on-line algo...
详细信息
This paper analyzes job scheduling for parallel computers by using theoretical and experimental means. Based on existing architectures we first present a machine and a job model. Then, we propose a simple on-line algorithm employing job preemption without migration and derive theoretical bounds for the performance of the algorithm. The algorithm is experimentally evaluated with trace data from a large computing facility. These experiments show that the algorithm is highly sensitive on parameter selection and that substantial performance improvements over existing (non-preemptive) scheduling methods are possible.
Recently, the subject of allocating tasks to servers has attracted much attention. There are several ways of distinguishing load balancing problems. There are sequential and parallel strategies, that is, placing the t...
详细信息
Recently, the subject of allocating tasks to servers has attracted much attention. There are several ways of distinguishing load balancing problems. There are sequential and parallel strategies, that is, placing the tasks one after the other or all of them in parallel. Another approach divides load balancing problems into continuous and static ones. In the continuous case new tasks are generated and consumed as time proceeds, in the second case the number of tasks is fixed. We present and analyze a parallel randomized continuous load balancing algorithm in a scenario where n processors continuously generate and consume tasks according to some given probability distribution. Each processor initiates load balancing actions only if its load exceeds a certain threshold t, and then tries to find a balancing partner for moving some of its tasks to this partner. We consider several randomized load generation and consumption schemes which yield an expected load of O(n) in the whole system. We show that the maximum load of any processor is upper bounded by (log log n)2, with high probability. The expected number of load requests required for finding a balancing partner is constant.
As databases increasingly integrate non-textual information it is becoming necessary to support efficient similarity searching in addition to range searching. Recently, declustering techniques have been proposed for i...
详细信息
ISBN:
(纸本)9780897919890
As databases increasingly integrate non-textual information it is becoming necessary to support efficient similarity searching in addition to range searching. Recently, declustering techniques have been proposed for improving the performance of similarity searches through parallel I/O. In this paper, we propose a new scheme which provides good declustering for similarity searching. In particular, it does global declustering as opposed to local declustering, exploits the availability of extra disks and does not limit the partitioning of the data space. Our technique is based upon the Cyclic declustering schemes which were developed for range and partial match queries. We establish, in general, that Cyclic declustering techniques outperform previously proposed techniques.
Mining association rules from large databases is an important problem in data mining. There is a need to develop parallel algorithm for this problem because it is a very costly computation process. However, all propos...
详细信息
ISBN:
(纸本)9780897919890
Mining association rules from large databases is an important problem in data mining. There is a need to develop parallel algorithm for this problem because it is a very costly computation process. However, all proposed parallelalgorithms for mining association rules follow the conventional level-wise approach. On a shared-memory multi-processors, they will impose a synchronization in every iteration which degrades greatly their performance. The deficiency comes from the contention on the shared I/O channel when all processors are accessing their database partitions in the shared storage synchronously. An asynchronous algorithm APM has been proposed for mining association rules on shared-memory multiprocessors. All participating processors in APM generate candidates and count their supports independently without synchronization. Furthermore, it can finish the computation with less I/O than required in the level-wise approach. The algorithm has been implemented on a Sun Enterprise 4000 multi-processors with 12 nodes. The experiments show that APM has super performance than other proposed synchronous algorithms.
We present parallelalgorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps [26]). For two sets of size n and m (m ≤ n) the algorithms run in expected O(m lg(n/m))...
详细信息
ISBN:
(纸本)9780897919890
We present parallelalgorithms for union, intersection and difference on ordered sets using random balanced binary trees (treaps [26]). For two sets of size n and m (m ≤ n) the algorithms run in expected O(m lg(n/m)) work and O(lg n) depth (parallel time) on an EREW PRAM with scan operations (implying O(lg2 n) depth on a plain EREW PRAM). As with the sequential algorithms on treaps for insertion and deletion the main advantage of our algorithms are their simplicity. In fact our algorithms for set operations seem simpler than previous sequential algorithms with the same work bounds, and might therefore also be useful in a sequential context. To analyze the effectiveness of the algorithms we implemented both sequential and parallel versions of the algorithms and ran several experiments on them. Our parallel implementation uses the Cilk [5] shared memory runtime system on a 16 processor SGI Power Challenge and a 6 processor Sun Ultra Enterprise 3000. It shows reasonable speedup: 6.3 to 6.8 speedup on 8 processors of the SGI, and 4.1 to 4.4 speedup on 5 processors of the Sun.
Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first non-trivial lower bounds on the communication volume required to sol...
详细信息
Lower and upper bounds for finding a minimum spanning tree (MST) in a weighted undirected graph on the BSP model are presented. We provide the first non-trivial lower bounds on the communication volume required to solve the MST problem. Let p denote the number of processors, n the number of nodes of the input graph, and m the number of edges of the input graph. We show that in the worst case a total of Ω(k·min(m, pn)) bits need to be transmitted in order to solve the MST problem, where k is the number of bits required to represent a single edge weight. This implies that if each message contains k bits, any BSP algorithm for finding an MST requires communication time Ω(g · min(m/p, n)), where g is the gap parameter of the BSP model. In addition, we present two algorithms whose running times match the lower bounds in different situations. Both algorithms perform linear work and use a number of super-steps independent of the input size. The first algorithm is simple but can employ at most m/n processors efficiently. Hence, it should be applied in situations where the input graph is relatively dense. The second algorithm is a randomized algorithm that performs linear work with high probability, provided that m ≥ n · log p. This is the first linear work BSP algorithm for finding an MST in sparse graphs.
parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability and usability goals of appl...
详细信息
ISBN:
(纸本)9780897919890
parallel I/O systems typically consist of individual processors, communication networks, and a large number of disks. Managing and utilizing these resources to meet performance, portability and usability goals of applications has become a significant challenge. We believe that a parallel I/O system that automatically selects efficient I/O plans for user applications is a solution to this problem. In this paper, we present such an automatic performance optimization approach for scientific applications performing collective I/O requests on multidimensional arrays. Under our approach, an optimization engine in a parallel I/O system selects optimal I/O plans automatically without human intervention based on a description of the application I/O requests and the system configuration. To validate our hypothesis, we have built an optimizer that uses a rule-based and randomized search-based algorithms to select optimal parameter settings in Panda, a parallel I/O library for multidimensional arrays. Our performance results obtained from two IBM SPs with significantly different configurations show that the Panda optimizer is able to select high-quality I/O plans and deliver high performance under a variety of system configurations.
Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed...
详细信息
ISBN:
(纸本)9780897919890
Sparse LU factorization with partial pivoting is important for many scientific applications and delivering high performance for this problem is difficult on distributed memory machines. Our previous work has developed an approach called S* that incorporates static symbolic factorization, supernode partitioning and graph scheduling. This paper studies the properties of elimination forests and uses them to guide supernode partitioning/amalgamation and execution scheduling. The new design with 2D mapping effectively identifies dense structures without introducing too many zeros in the BLAS computation and exploits asynchronous parallelism with low buffer space cost. The implementation of this code, called S+, uses supernodal matrix multiplication which retains the BLAS-3 level efficiency and avoids unnecessary arithmetic operations. The experiments show that S+ improves our previous code substantially and can achieve up to 11.04GFLOPS on 128 Cray T3E 450 MHz nodes, which is the highest performance reported in the literature.
暂无评论