this paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks. It is targeted to situations where the size of message data is very small, like i...
详细信息
this paper presents an efficient routing and flow control mechanism to implement multidestination message passing in wormhole networks. It is targeted to situations where the size of message data is very small, like in invalidation and update messages in distributed shared-memory multiprocessors (DSMs) with hardware cache coherence. the mechanism is a variation of tree-based multicast with pruning to avoid deadlocks. the new scheme does not require that the destination addresses in a given multicast message be ordered, thereby avoiding any ordering overhead. It allows messages to use any deadlock-free routing function and only requires one startup for each multicast message. the new scheme has been evaluated on several k-ary n-cube networks under synthetic loads. the results show that the proposed scheme is faster than other multicast mechanisms when the multicast traffic is composed of short messages.
the paper describes a randomized distributed enumeration algorithm which (in contrast to deterministic solutions) works for all network topologies and with fully asynchronous communication. the algorithm correctness a...
详细信息
Next Generation Sequencing (NGS) is gaining interests due to the increased requirements and the decreased sequencing cost. the important and prerequisite step of most NGS applications is the mapping of short sequences...
详细信息
ISBN:
(纸本)9780769546766
Next Generation Sequencing (NGS) is gaining interests due to the increased requirements and the decreased sequencing cost. the important and prerequisite step of most NGS applications is the mapping of short sequences, called reads, to the template reference sequences. Boththe explosion of NGS data with over billions of reads generated each day and the data-intensive computations pose great challenges to the capability of existing computing systems. In this paper, we take a hash-index based algorithm (PerM) as an example to investigate the optimization approaches for accelerating NGS reads mapping on multi-core architectures. First, we propose a new parallel algorithm that reorders bucket access in hash index among multiple threads so that data locality in shared cache is improved. Second, in order to reduce the number of empty hash bucket, we propose a serialized hash index compression algorithm, which coincides withthe sequential access nature of our new parallel algorithm. With reduced hash index size, it also becomes possible for us to use longer hash keys, which alleviates the hash conflicts and improves the query performance. Our experiment on an 8-socket 8-cores Intel Xeon X7550 SMP with 128 GB memory shows that the new parallel algorithm reduces LLC miss ratio to be 8% similar to 15% of the original algorithm and the overall performance is improved by 4 similar to 11 times (6 times avg.).
Given a graph, finding the distance-2 maximal independent set (MIS-2) of the vertices is a problem that is useful in several contexts such as algebraic multigrid coarsening or multilevel graph partitioning. Such multi...
详细信息
ISBN:
(纸本)9781665481069
Given a graph, finding the distance-2 maximal independent set (MIS-2) of the vertices is a problem that is useful in several contexts such as algebraic multigrid coarsening or multilevel graph partitioning. Such multilevel methods rely on finding the independent vertices so they can be used as seeds for aggregation in a multilevel scheme. We present a parallel MIS-2 algorithm to improve performance on modern accelerator hardware. this algorithm is implemented using the Kokkos programming model to enable performance portability. We demonstrate the portability of the algorithm and the performance on a variety of architectures (x86/ARM CPUs and NVIDIA/AMD GPUs). the resulting algorithm is also deterministic, producing an identical result for a given input across all of these platforms. the new MIS-2 implementation outperforms implementations in state of the art libraries like CUSP and ViennaCL by 3-8x while producing similar quality results. We further demonstrate the benefits of this approach by developing parallel graph coarsening scheme for two different use cases. First, we develop an algebraic multigrid (AMG) aggregation scheme using parallel MIS-2 and demonstrate the benefits as opposed to previous approaches used in the MueLu multigrid package in Trilinos. We also describe an approach for implementing a parallel multicolor "cluster" Gauss-Seidel preconditioner using this MIS-2 coarsening, and demonstrate better performance with an efficient, parallel, multicolor Gauss-Seidel algorithm.
Petaflop/s computing platforms will combine deep memory hierarchies in both latency and bandwidth with a need for many-thousand-fold parallelism. Initial users of these systems will most likely be asked to meet these ...
详细信息
Petaflop/s computing platforms will combine deep memory hierarchies in both latency and bandwidth with a need for many-thousand-fold parallelism. Initial users of these systems will most likely be asked to meet these challenges to efficient parallel program design armed only with minimal system software. A candidate for a portable petaflop/s programming model that can enable these important early application programs to be developed while at the same time permitting these same applications to run efficiently on the most capable computing systems now available, is presented.
Artificial immune systems (AIS) are algorithms that are based on the structure and mechanisms of the vertebrate immune system. Clonal selection is a process that allows lymphocytes to launch a quick response to known ...
详细信息
ISBN:
(纸本)9781424416936
Artificial immune systems (AIS) are algorithms that are based on the structure and mechanisms of the vertebrate immune system. Clonal selection is a process that allows lymphocytes to launch a quick response to known pathogens and to adapt to new, previously unencountered ones. this paper presents a parallel island model algorithm based on the clonal selection principles for solving the Graph Coloring Problem. the performance of the algorithm over a set of well-established benchmark graphs and random graphs is compared with a parallel Tabu Search algorithm.
Available copy protocols provide the highest data availability and data reliability of all replication protocols that do not regenerate failed replicas. Unfortunately, all existing implementations of available copy pr...
详细信息
Available copy protocols provide the highest data availability and data reliability of all replication protocols that do not regenerate failed replicas. Unfortunately, all existing implementations of available copy protocols either rely on complex procedures for ascertaining which replicas are up to date after a total failure or have to wait for the recovery of all failed sites. We present a simple technique for efficiently implementing the available copy protocol. Our protocol does not require version numbers and maintains only n+log(n) bits of state per replica. We also show under standard Markovian assumptions that our new protocol provides the same data availability as the best feasible implementations of the available copy protocol.
Recently, the study of I/O-efficient algorithms has moved beyond fundamental problems of sorting and permuting and into wider areas such as computational geometry and graph algorithms. Withthis expansion has come a n...
详细信息
Recently, the study of I/O-efficient algorithms has moved beyond fundamental problems of sorting and permuting and into wider areas such as computational geometry and graph algorithms. Withthis expansion has come a need for new algorithmic techniques and data structures. In this paper, we present I/O-efficient analogues of well-known data structures that we show to be useful for obtaining simpler and improved algorithms for several graph problems. Our results include improved algorithms for minimum spanning trees, breadth-first search, and single-source shortest paths. the descriptions of these algorithms are greatly simplified by their use of well-defined I/O-efficient data structures with good amortized performance bounds. We expect that I/O-efficient data structures such as these will be a useful tool for the design of I/O-efficient algorithms.
this paper presents a scalable design and implementation of the molecular docking application DOCK for a large-scale high performance computing system, the Sunway TaihuLight supercomputer, which provisions a heterogen...
详细信息
ISBN:
(纸本)9781538637906
this paper presents a scalable design and implementation of the molecular docking application DOCK for a large-scale high performance computing system, the Sunway TaihuLight supercomputer, which provisions a heterogeneous, manycore processor architecture that consists of management processing elements (MPEs) and clusters of computing processing elements (CPEs). the key innovation is a novel refactoring of DOCK on the CPEs. Optimization techniques for data redundancy minimization to fit data in cache, software-controlled prefetching into scratchpads, memory access coalescing, software caches, vectorization and loop unrolling are employed to improve the exploitation of the computational resources. For a single docking process, the refactored version using boththe MPE and CPE cluster achieved 260x to 402x speedup compared against the original ported version using MPE only. To scale the DOCK to the full Sunway Taihulight system with 10,649,600 cores (including all MPE and CPE cores), we present an MPI communication domain partition scheme as well. For docking 9 million small compounds to a Zika virus target protein, we manage to scale to 131,072 MPEs, and 8,388,608 CPEs, with a total of 8,519,680 cores.
In this paper, we mainly study the parallelization aspects of the accelerated waveform relaxation algorithms for the transient simulation of semiconductor devices on paralleldistributed memory computers since these m...
详细信息
In this paper, we mainly study the parallelization aspects of the accelerated waveform relaxation algorithms for the transient simulation of semiconductor devices on paralleldistributed memory computers since these method are competitive with standard pointwise methods on serial machines, but are significantly faster on parallel computers. Here we propose an efficient variant of GMRES method, for solving the resulting sequence of time-varying sparse linear differential-algebraic initial-value problems (IVP) arising at each linearization step with waveform Newton. We will describe a more efficient alternative which avoids the global communication of inner products and only requires local communications on for massively distributed memory computers since the number of inner products represents the bottleneck of parallel performance. Experimental results carried out on Parsytec GC/PowerPlus with regards to the comparison with other accelerated approaches such as convolution SOR and waveform GMRES techniques on waveform relaxation algorithm and pointwise methods are described.
暂无评论