Matching is an important part of a model-based object recognition system. Matching is a difficult task, for a number of reasons. First, in a number of recognition systems matching is formulated as a combinatorial prob...
ISBN: 9798400704161 (print)
Computing a Single-Linkage Dendrogram (SLD) is a key step in the classic single-linkage hierarchical clustering algorithm. Given an input edge-weighted tree T, the SLD of T is a binary dendrogram that summarizes the n − 1 clusterings obtained by contracting the edges of T in order of weight. Existing algorithms for computing the SLD all require Ω(n log n) work, where n = |T|. Furthermore, to the best of our knowledge, no prior work provides a parallel algorithm obtaining non-trivial speedup for this problem. In this paper, we design faster parallel algorithms for computing SLDs, both in theory and in practice, based on new structural results about SLDs. In particular, we obtain a deterministic output-sensitive parallel algorithm based on parallel tree contraction that requires O(n log h) work and O(log^2 n log^2 h) depth, where h is the height of the output SLD. We also give a deterministic bottom-up algorithm for the problem inspired by the nearest-neighbor chain algorithm for hierarchical agglomerative clustering, and show that it achieves O(n log h) work and O(h log n) depth. Our results are based on a novel divide-and-conquer framework for building SLDs, inspired by divide-and-conquer algorithms for Cartesian trees. Our new algorithms can quickly compute the SLD on billion-scale trees, and obtain up to 150x speedup over the highly efficient Union-Find algorithm typically used to compute SLDs in practice.
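For orientation, the sequential Union-Find baseline mentioned in this abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's parallel algorithm: the function name, the (u, v, weight) edge format, and the convention of numbering internal dendrogram nodes n, n+1, ... in merge order are all assumptions made here. The initial sort is where the Ω(n log n) cost of this approach comes from.

```python
def single_linkage_dendrogram(n, edges):
    """Sequential union-find sketch (hypothetical helper, not from the paper).

    n     -- number of vertices, labelled 0..n-1
    edges -- list of (u, v, weight) tuples forming a tree on those vertices
    Returns a list of merges (left_node, right_node, weight); the i-th merge
    creates internal dendrogram node n + i.
    """
    uf_parent = list(range(n))   # union-find forest over the tree vertices
    top = list(range(n))         # dendrogram node representing each UF root's cluster

    def find(x):
        # Path halving keeps the union-find trees shallow.
        while uf_parent[x] != x:
            uf_parent[x] = uf_parent[uf_parent[x]]
            x = uf_parent[x]
        return x

    merges = []
    next_id = n
    # Contract edges in increasing order of weight.
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        merges.append((top[ru], top[rv], w))
        uf_parent[ru] = rv       # union the two clusters
        top[rv] = next_id        # the merged cluster is the new internal node
        next_id += 1
    return merges


# Path 0-1-2-3 with weights 1, 3, 2.
example = single_linkage_dendrogram(4, [(0, 1, 1.0), (1, 2, 3.0), (2, 3, 2.0)])
```

In this example the two light edges are contracted first, and the heaviest edge then joins the two resulting clusters at the root of the dendrogram.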
ISBN: 1892512459 (print)
The paper presents performance results of parallel matrix LU decomposition algorithms with row-wise cyclic striping and partial pivoting, implemented within several message-passing distributed environments. Two environments have been chosen: the Message Passing Interface (MPI), with point-to-point communication routines, an environment that is portable but not interoperable; and the Parallel Virtual Machine (PVM), a dynamic collection of potentially heterogeneous computing resources, an environment that is both portable and interoperable. The following performance measures are studied and reported: standard speedup, scaled speedup, and efficiency. The conclusions are supported by experiments conducted for matrix sizes up to 2000 x 2000 with 2, 4, 6, and 8 processors. The experiments indicate that if an application is going to be developed on a massively parallel processor or on a homogeneous network, then MPI may be the better choice because of its good communication performance. If an application is going to be developed on a heterogeneous network, then PVM appears to be the preferred choice. The interoperability of PVM, compared with MPI, comes at a price: PVM parallel programs level off with sequential programs for matrix sizes of 600 x 600 and higher, whereas MPI parallel programs achieve the same performance result for matrices of 400 x 400.
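As a rough illustration of the ingredients named in this abstract, the sketch below shows a sequential LU decomposition with partial pivoting, the row-to-process mapping implied by row-wise cyclic striping, and the usual definitions of speedup and efficiency. All function names are hypothetical, and no MPI or PVM calls are shown; in a message-passing version, each process would update only the rows it owns after the pivot row is broadcast.

```python
def rows_owned(rank, n, p):
    """Row-wise cyclic striping: process `rank` of `p` owns rows rank, rank+p, ..."""
    return list(range(rank, n, p))


def lu_partial_pivot(matrix):
    """In-place style LU with partial pivoting on a copy of `matrix`.

    Returns (lu, perm) where the strict lower triangle of `lu` holds L
    (unit diagonal implied), the upper triangle holds U, and row i of
    L*U equals row perm[i] of the input.
    """
    n = len(matrix)
    a = [list(map(float, row)) for row in matrix]
    perm = list(range(n))
    for k in range(n):
        # Partial pivoting: pick the largest entry in column k at or below row k.
        piv = max(range(k, n), key=lambda i: abs(a[i][k]))
        if a[piv][k] == 0.0:
            raise ValueError("matrix is singular")
        if piv != k:
            a[k], a[piv] = a[piv], a[k]
            perm[k], perm[piv] = perm[piv], perm[k]
        # Eliminate below the pivot; in the striped parallel version each
        # process would run this loop only over its own rows.
        for i in range(k + 1, n):
            a[i][k] /= a[k][k]
            for j in range(k + 1, n):
                a[i][j] -= a[i][k] * a[k][j]
    return a, perm


def speedup(t_sequential, t_parallel):
    """Standard speedup: sequential time divided by parallel time."""
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, p):
    """Efficiency: speedup divided by the number of processors."""
    return speedup(t_sequential, t_parallel) / p


lu, perm = lu_partial_pivot([[2.0, 1.0], [4.0, 3.0]])
```

Cyclic striping spreads the shrinking active submatrix evenly across processes, which is the usual reason it is preferred over contiguous row blocks for LU.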
ISBN: 9780889866386 (print)
The memory consistency model is crucial to the performance of shared-memory multiprocessors, and current architectures adopt several different models. In this paper, using graph algorithms for illustrative purposes, we consider the impact of the memory model on the implementation and performance of parallel algorithms on shared-memory multiprocessors. We show that the implementation of PRAM algorithms is largely "oblivious" to the underlying memory model and performs well on relaxed models. More importantly, we show that different memory models can favor drastically different algorithm designs.
ISBN: 0769516777 (print)
Expressed sequence tags, abbreviated ESTs, are DNA molecules experimentally derived from expressed portions of genes. Clustering of ESTs is essential for gene recognition and understanding important genetic variations such as those resulting in diseases. In this paper, we present the design and development of a parallel software system for EST clustering. To our knowledge, this is the first such effort to address the problem of EST clustering in parallel. The novel features of our approach include 1) design of space-efficient algorithms to keep the space requirement linear in the size of the input data set, 2) a combination of algorithmic techniques to reduce the total work without sacrificing the quality of EST clustering, and 3) use of parallel processing to reduce the run-time and facilitate the clustering of large data sets. Using a combination of these techniques, we report the clustering of 81,414 Arabidopsis ESTs in under 2.5 minutes on a 64-processor IBM SP, a problem that is estimated to take 9 hours of run-time with state-of-the-art software, provided the memory required to run the software can be made available.
ISBN: 9783319670355; 9783319670348 (print)
Boundary value and initial boundary value problems form the basis of numerous mathematical models. Ultimately, the discrete (linearized) boundary value and initial boundary value problems are reduced to systems of linear algebraic equations with a sparse and ill-conditioned coefficient matrix. In modern applications (such as computational fluid dynamics), the number of equations in the system can reach about 10^12 or more. The numerical solution of such systems alone requires significant computational effort, so a pressing problem of modern computational mathematics is the development, theoretical analysis, and testing of high-performance parallel algorithms. The article discusses algebraic, geometric, and combined approaches to constructing parallel algorithms. We present the advantages and disadvantages of each approach, estimates of parallel speedup and efficiency, a comparison of the volume of computational work with that of the optimal sequential algorithm, and the results of computational experiments. We also discuss the peculiarities of implementing parallel algorithms using software and hardware facilities for parallel programming.
ISBN: 9780769546759 (print)
The logical structure of a forest of octrees can be used to create scalable algorithms for parallel adaptive mesh refinement (AMR), which has recently been demonstrated for several petascale applications. Among various frequently used octree-based mesh operations, including refinement, coarsening, partitioning, and enumerating nodes, ensuring a 2:1 size balance between neighboring elements has historically been the most expensive in terms of CPU time and communication volume. The 2:1 balance operation is thus a primary target to optimize. One important component of a parallel balance algorithm is the ability to determine whether any two given octants have a consistent distance/size relation. Based on new logical concepts, we propose fast algorithms for making this decision for all types of 2:1 balance conditions in 2D and 3D. Since we are able to achieve this without constructing any parent nodes in the tree that would otherwise need to be sorted and communicated, we can significantly reduce the required memory and communication volume. In addition, we propose a lightweight collective algorithm for reversing the asymmetric communication pattern induced by non-local octant interactions. We have implemented our improvements as part of the open-source "p4est" software. Benchmarking this code with both synthetic and simulation-driven adapted meshes, we are able to demonstrate much reduced runtime and excellent weak and strong scalability. On our largest benchmark problem with 5.13 × 10^11 octants, the new 2:1 balance algorithm executes in less than 8 seconds on 112,128 CPU cores of the Jaguar Cray XT5 supercomputer.
ISBN: 9781450323789 (print)
Recently, there has been substantial interest in the study of various random networks as mathematical models of complex systems. As these complex systems grow larger, the ability to generate progressively large random networks becomes all the more important. This motivates the need for efficient parallel algorithms for generating such networks. Naive parallelization of the sequential algorithms for generating random networks may not work due to the dependencies among the edges and the possibility of creating duplicate (parallel) edges. In this paper, we present MPI-based distributed memory parallel algorithms for generating random scale-free networks using the preferential-attachment model. Our algorithms scale very well to a large number of processors and provide almost linear speedups. The algorithms can generate scale-free networks with 50 billion edges in 123 seconds using 768 processors.
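To make the edge dependencies mentioned in this abstract concrete, here is a minimal sequential sketch of preferential attachment in which each new node adds a single edge. It is a textbook-style baseline under assumptions chosen here (one edge per node, a seeded generator, hypothetical function name), not the MPI-based distributed algorithm of the paper. Each step samples from the endpoints of all earlier edges, which is exactly the dependency that makes naive parallelization difficult.

```python
import random


def preferential_attachment(n, seed=0):
    """Generate a scale-free tree on nodes 0..n-1 (n >= 2), one edge per new node.

    Each new node v attaches to an existing node chosen with probability
    proportional to its current degree.
    """
    rng = random.Random(seed)
    edges = [(1, 0)]
    # Every node appears in this list once per incident edge, so a uniform
    # draw from it is a degree-proportional draw over nodes.
    endpoints = [0, 1]
    for v in range(2, n):
        u = rng.choice(endpoints)
        edges.append((v, u))
        endpoints.extend((v, u))
    return edges


sample_edges = preferential_attachment(1000, seed=42)
```

With one edge per new node, duplicate edges cannot occur; with several edges per node, a generator would additionally have to reject repeated targets.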
ISBN: 9781611974782 (print)
Many randomized algorithms can be derandomized efficiently using either the method of conditional expectations or probability spaces with low (almost-) independence. A series of papers, beginning with work by Luby (1988) and continuing with Berger & Rompel (1991) and Chari et al. (1994), showed that these techniques can be combined to give deterministic parallel algorithms for combinatorial optimization problems involving sums of w-juntas. We improve these algorithms through derandomized variable partitioning. This reduces the processor complexity to essentially independent of w, while the running time is reduced from exponential in w to linear in w. For example, we improve the time complexity of an algorithm of Berger & Rompel (1991) for rainbow hypergraph coloring by a factor of approximately log^2 n and the processor complexity by a factor of approximately m^(ln 2). As a major application of this, we give an NC algorithm for the Lovász Local Lemma. Previous NC algorithms, including the seminal algorithm of Moser & Tardos (2010) and the work of Chandrasekaran et al. (2013), required that (essentially) the bad-events could span only O(log n) variables; we relax this to allow polylog(n) variables. As two applications of our new algorithm, we give algorithms for defective vertex coloring and domatic graph partition. One main sub-problem encountered in these algorithms is to generate a probability space which can "fool" a given list of GF(2) Fourier characters. Schulman (1992) gave an NC algorithm for this; we dramatically improve its efficiency to near-optimal time and processor complexity and code dimension. This leads to a new algorithm to solve the heavy-codeword problem, introduced by Naor & Naor (1993), with a near-linear processor complexity (mn)^(1+o(1)).
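As background for the method of conditional expectations named in this abstract, the following sketch applies it to the standard textbook case of finding a cut containing at least half of a graph's edges. This is a sequential illustration chosen here, not one of the paper's parallel algorithms; the function name is hypothetical and the graph is assumed to have no self-loops. Each edge's cut indicator depends on only two variables, so the objective is a sum of 2-juntas.

```python
def derandomized_cut(n, edges):
    """Fix vertex sides one at a time so the conditional expected cut never drops.

    A uniformly random assignment cuts each edge with probability 1/2, so the
    expected cut is m/2. Placing each vertex opposite the majority of its
    already-placed neighbours keeps the conditional expectation at least m/2.
    Returns (side, cut_size) where side[v] is 0 or 1.
    """
    adj = [[] for _ in range(n)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)

    side = [None] * n
    for v in range(n):
        on_zero = sum(1 for u in adj[v] if side[u] == 0)
        on_one = sum(1 for u in adj[v] if side[u] == 1)
        # Choosing side 1 cuts the edges to neighbours already on side 0.
        side[v] = 1 if on_zero >= on_one else 0

    cut_size = sum(1 for u, v in edges if side[u] != side[v])
    return side, cut_size


triangle_sides, triangle_cut = derandomized_cut(3, [(0, 1), (1, 2), (0, 2)])
```

The sequential dependence of each choice on all earlier ones is what makes this technique hard to parallelize directly, which is the gap that low-independence probability spaces are used to close.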
ISBN: 9783642401633 (print)
β-skeletons, prominent members of the neighborhood graph family, have interesting geometric properties and various applications ranging from geographic networks to archeology. This paper focuses on computing the β-spectrum, a labeling of the edges of the Delaunay triangulation, DT(V), which makes it possible to quickly find the lune-based β-skeleton of V for any query value β ∈ [1, 2]. We consider planar n-point sets V with the L_p metric, 1 < p < ∞. We present an O(n log^2 n) time sequential, and an O(log^4 n) time parallel, β-spectrum labeling. We also show a parallel algorithm which, for a given β ∈ [1, 2], finds the lune-based β-skeleton in O(log^2 n) time. The parallel algorithms use O(n) processors in the CREW-PRAM model. © 2015 Elsevier B.V. All rights reserved.