ISBN (print): 9781538668733
A key component of most large-scale rendering systems is a parallel image compositing algorithm, and the most commonly used compositing algorithms are binary swap and its variants. Although binary swap has been shown to be very efficient, one of its classic limitations is that it only works when the number of processes is a perfect power of 2. Multiple variations of binary swap have been independently introduced to overcome this limitation and handle process counts with factors other than 2. To date, few of these approaches have been directly compared against each other, making it unclear which approach is best. This paper presents a fresh implementation of each of these methods using a common software framework to make them directly comparable, and compares these non-power-of-2 variants head to head. The results show that some simple compositing approaches work as well as or better than more complex algorithms that are more difficult to implement.
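One of the simplest ways to handle a non-power-of-2 process count is a preliminary "fold" step that composites the extra images into partners, after which standard binary swap runs on the remaining power-of-2 group. The abstract does not say which variants were compared, so the sketch below only illustrates that generic idea; the single-process simulation, the use of elementwise summation in place of a real depth-ordered "over" operator, and all function names are assumptions.

```python
import numpy as np

def fold_to_power_of_two(images):
    """Fold step: composite each 'extra' image into a partner so that a
    power-of-two number of images remains."""
    p = len(images)
    k = 1
    while k * 2 <= p:
        k *= 2                                    # largest power of two <= p
    folded = [img.copy() for img in images[:k]]
    for i in range(p - k):
        folded[i] += images[k + i]                # composite extra image i
    return folded

def binary_swap(images):
    """Simulated binary swap over a power-of-two 'process' count.  After
    log2(P) rounds each process owns a fully composited 1/P slice."""
    p, n = len(images), len(images[0])
    pieces = [(0, img.copy()) for img in images]  # (offset, owned slice)
    stride = 1
    while stride < p:
        nxt = [None] * p
        for rank in range(p):
            partner = rank ^ stride
            off, mine = pieces[rank]
            _, theirs = pieces[partner]           # partner owns the same segment
            half = len(mine) // 2
            if rank < partner:                    # keep the front half
                nxt[rank] = (off, mine[:half] + theirs[:half])
            else:                                 # keep the back half
                nxt[rank] = (off + half, mine[half:] + theirs[half:])
        pieces = nxt
        stride *= 2
    out = np.zeros(n)                             # final gather
    for off, data in pieces:
        out[off:off + len(data)] = data
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    imgs = [rng.random(16) for _ in range(6)]     # 6 processes: not a power of 2
    composite = binary_swap(fold_to_power_of_two(imgs))
    assert np.allclose(composite, sum(imgs))
```

The fold step is cheap but leaves the folded-away processes idle during the swap rounds, which is the kind of trade-off a head-to-head comparison of the variants would measure.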
ISBN (digital): 9781728108582
ISBN (print): 9781728108599
A network (graph) is a powerful abstraction for representing structures in large, complex socio-technological systems. Community detection reveals important patterns and structural organization in a network, with numerous applications in social networking, computer security, bioinformatics, business, marketing, and other fields. An information-theoretic approach to community detection, known as the Infomap method, is a sequential algorithm capable of providing high-quality solutions. However, the emergence of massive networks, often with millions of edges and beyond, makes the problem of discovering communities technically challenging, and the sequential algorithm takes a very long time to process such networks. Therefore, a scalable parallel execution model of the Infomap algorithm is needed. In this paper, we design a distributed-memory parallel algorithm for community detection based on the Infomap method. We carefully balance the workload among the processing units and keep communication among processes minimal. We empirically show that our distributed solution produces communities of excellent quality, similar to those of sequential Infomap. In terms of minimum description length (MDL), our distributed results are over 99.5% in alignment with those of the sequential version. Our algorithm also demonstrates good scalability to hundreds of processors while processing large-scale social and information networks.
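The MDL figure the abstract compares is the value of the map equation, the objective Infomap minimizes. As a point of reference, here is a minimal sketch of the expanded map equation for an undirected, unweighted graph (Rosvall and Bergstrom's formulation); it is not the paper's distributed algorithm, and the example graph and function names are illustrative.

```python
import math
from collections import defaultdict

def map_equation(edges, membership):
    """Expanded map equation (code length in bits) of a partition of an
    undirected, unweighted graph -- the MDL objective Infomap minimizes."""
    m = len(edges)
    deg = defaultdict(int)
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    two_m = 2.0 * m
    p = {a: d / two_m for a, d in deg.items()}   # node visit rates
    q = defaultdict(float)                       # module exit rates
    for u, v in edges:
        if membership[u] != membership[v]:       # each crossing edge is exited
            q[membership[u]] += 1.0 / two_m      # from both of its modules
            q[membership[v]] += 1.0 / two_m
    p_mod = defaultdict(float)                   # module visit rates
    for a, pa in p.items():
        p_mod[membership[a]] += pa
    plogp = lambda x: x * math.log2(x) if x > 0 else 0.0
    L = plogp(sum(q.values()))
    L -= 2 * sum(plogp(qi) for qi in q.values())
    L -= sum(plogp(pa) for pa in p.values())
    L += sum(plogp(q.get(i, 0.0) + pm) for i, pm in p_mod.items())
    return L

if __name__ == "__main__":
    # two triangles joined by a single bridge edge
    edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
    one_module = {v: 0 for v in range(6)}
    two_modules = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
    # splitting the triangles gives the shorter (better) description length
    print(map_equation(edges, one_module), map_equation(edges, two_modules))
```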
ISBN (print): 9783959770620
Consider a deterministic algorithm that tries to find a string in an unknown set S ⊆ {0,1}^n, under the promise that S has large density. The only information that the algorithm can obtain about S is estimates of the density of S in adaptively chosen subsets of {0,1}^n, up to an additive error of μ > 0. This problem is appealing as a derandomization problem, when S is the set of satisfying inputs for a circuit C: {0,1}^n → {0,1} that accepts many inputs: in this context, an algorithm as above constitutes a deterministic black-box reduction of the problem of hitting C (i.e., finding a satisfying input for C) to the problem of approximately counting the number of satisfying inputs for C on subsets of {0,1}^n. We prove tight lower bounds for this problem, demonstrating that naive approaches to solve the problem cannot be improved upon, in general. First, we show a tight trade-off between the estimation error μ and the required number of queries to solve the problem: when μ = O(log(n)/n) a polynomial number of queries suffices, and when μ ≥ 4·(log(n)/n) the required number of queries is 2^Θ(μ·n). Secondly, we show that the problem "resists" parallelization: any algorithm that works in iterations and can obtain p = p(n) density estimates "in parallel" in each iteration still requires Ω(n/(log(p) + log(1/μ))) iterations to solve the problem. This work extends the well-known work of Karp, Upfal, and Wigderson (1988), who studied the setting in which S is only guaranteed to be non-empty (rather than dense), and the algorithm can only probe subsets for the existence of a solution in them. In addition, our lower bound on parallel algorithms affirms a weak version of a conjecture of Motwani, Naor, and Naor (1994); we also make progress on a stronger version of their conjecture.
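To make the query model concrete, here is a small simulation of the most naive approach: fix the string one bit at a time, always extending the current prefix toward the subcube whose estimated density is larger. Under this greedy strategy each step can lose at most 2μ of true density, so an error of μ < d/(2n) (for initial density d) already suffices; the Θ(log(n)/n) threshold in the abstract refers to more refined algorithms. Everything below (the exhaustive density oracle, the parameters, the small n) is purely illustrative.

```python
import random

def density(S, prefix, n):
    """Exact density of S inside the subcube of strings starting with `prefix`."""
    return sum(x.startswith(prefix) for x in S) / 2 ** (n - len(prefix))

def noisy_density(S, prefix, n, mu):
    """The oracle of the problem: a density estimate with additive error <= mu."""
    return density(S, prefix, n) + random.uniform(-mu, mu)

def greedy_bit_fixing(S, n, mu):
    """Fix one bit per step toward the (estimated) denser half: 2n queries total."""
    prefix = ""
    for _ in range(n):
        d0 = noisy_density(S, prefix + "0", n, mu)
        d1 = noisy_density(S, prefix + "1", n, mu)
        prefix += "0" if d0 >= d1 else "1"
    return prefix

if __name__ == "__main__":
    random.seed(1)
    n = 12
    S = {format(x, "012b") for x in range(2 ** n) if random.random() < 0.5}
    found = greedy_bit_fixing(S, n, mu=0.01)     # 0.01 < density / (2n) here
    assert found in S
```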
Using MD simulation, the energy surface of a pit-patterned Si substrate is obtained. The physical model is based on the empirical Tersoff potential, which governs the dynamics of an atomistic crystal system including Si and Ge atoms. The pits at the surface of the Si substrate have the shape of an inverted truncated pyramid. The analysis of the energy surface mapped by the MD calculations shows that a dense, spatially arranged array of nanoislands, with up to 4 nanoislands per pit, may be grown by depositing Ge on the pit-patterned Si substrate. The calculations were carried out using a parallel algorithm, and a scheme for the parallel formation of the Verlet neighbor list is suggested. The results of the calculations using the parallel algorithm are presented as the speedup as a function of the number of cores.
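The abstract proposes a scheme for forming the Verlet neighbor list in parallel but does not spell it out, so the sketch below shows only the standard serial cell-list construction on which such schemes are usually built; the per-atom loop is independent work that can be divided among cores. The cubic periodic box, the r_cut and skin parameters, and the function name are assumptions.

```python
import numpy as np

def build_verlet_list(pos, box, r_cut, skin):
    """Verlet neighbor list via cell lists in a cubic periodic box of edge `box`.
    Atoms are binned into cells of edge >= r_cut + skin, so each atom's
    neighbors can only lie in the 27 surrounding cells."""
    r_list = r_cut + skin
    ncell = max(1, int(box / r_list))              # cells per box edge
    cell_idx = np.floor(pos / (box / ncell)).astype(int) % ncell
    cells = {}
    for i, c in enumerate(map(tuple, cell_idx)):
        cells.setdefault(c, []).append(i)
    offsets = [(dx, dy, dz) for dx in (-1, 0, 1)
                            for dy in (-1, 0, 1)
                            for dz in (-1, 0, 1)]
    neighbors = [set() for _ in range(len(pos))]   # sets guard against double
    for i, (cx, cy, cz) in enumerate(cell_idx):    # counting when ncell < 3
        for dx, dy, dz in offsets:
            key = ((cx + dx) % ncell, (cy + dy) % ncell, (cz + dz) % ncell)
            for j in cells.get(key, []):
                if j <= i:
                    continue                       # store each pair once
                d = pos[j] - pos[i]
                d -= box * np.round(d / box)       # minimum-image convention
                if d @ d < r_list * r_list:
                    neighbors[i].add(j)
    return [sorted(s) for s in neighbors]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    nlist = build_verlet_list(rng.random((200, 3)) * 10.0, box=10.0,
                              r_cut=2.5, skin=0.3)
```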
ISBN (print): 9781450349826
Interactive information retrieval services, such as enterprise search and document search, must provide relevant results with consistent, low response times in the face of rapidly growing data sets and query loads. These growing demands have led researchers to consider a wide range of optimizations to reduce response latency, including query processing parallelization and acceleration with co-processors such as GPUs. However, previous work runs queries either on the GPU or on the CPU, ignoring the fact that the best processor for a given query depends on the query's characteristics, which may change as processing proceeds. We present Griffin, an IR system that dynamically combines GPU- and CPU-based algorithms to process individual queries according to their characteristics. Griffin uses state-of-the-art CPU-based query processing techniques and incorporates a novel approach to GPU-based query evaluation. Our GPU-based approach, as far as we know, achieves the best available GPU search performance by leveraging a new compression scheme and exploiting an advanced merge-based intersection algorithm. We evaluate Griffin with real-world queries and datasets, and show that it improves query performance by 10x compared to a highly optimized CPU-only implementation, and by 1.5x compared to our GPU approach running alone. We also find that Griffin reduces the 95th-, 99th-, and 99.9th-percentile query response times by 10.4x, 16.1x, and 26.8x, respectively.
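The core operation behind conjunctive query evaluation here is intersecting sorted posting lists. The sketch below is a CPU-style galloping (exponential plus binary search) intersection, shown only to illustrate the kind of merge-based intersection the abstract refers to; it is not Griffin's GPU kernel, and the function name is an assumption.

```python
from bisect import bisect_left

def intersect_postings(a, b):
    """Intersect two sorted posting lists (ascending, unique doc ids): walk the
    shorter list and gallop through the longer one."""
    if len(a) > len(b):
        a, b = b, a
    out, lo = [], 0
    for doc in a:
        step, hi = 1, lo
        while hi < len(b) and b[hi] < doc:         # gallop ahead
            hi = lo + step
            step *= 2
        hi = min(hi, len(b))
        i = bisect_left(b, doc, lo, hi)            # binary search in the window
        if i < len(b) and b[i] == doc:
            out.append(doc)
        lo = i                                     # never look back
    return out

if __name__ == "__main__":
    print(intersect_postings([3, 7, 9, 40, 120],
                             [1, 3, 9, 40, 41, 80, 120, 200]))  # [3, 9, 40, 120]
```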
This paper presents an accurate density computation approach for large dark matter simulations, based on a recently introduced phase-space tessellation technique and designed for massively parallel, heterogeneous cluster architectures. We discuss a memory-efficient construction of an oct-tree structure to sample the mass densities with locally adaptive resolution, according to the features of the underlying tetrahedral tessellation. We propose an efficient GPU implementation for the computationally intensive operation of intersecting the tetrahedra with the cubical cells of the deposit grid, which achieves a speedup of almost an order of magnitude compared to an optimized CPU version. We discuss two dynamic load balancing schemes: the first exchanges particle data between cluster nodes and deposits the tetrahedra for each block of the grid structure on single nodes, whereas the second uses global reduction operations to obtain the total masses. We demonstrate the scalability of our algorithms with up to 256 GPUs and TB-sized simulation snapshots, resulting in tessellations with more than 400 billion tetrahedra.
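The expensive kernel here is depositing each tetrahedron's mass onto the cells of a cubical grid in proportion to the overlapped volume. Exact tetrahedron/cell clipping is involved, so the sketch below substitutes a Monte Carlo deposit (uniform sampling inside the tetrahedron) purely to illustrate what the operation has to produce; the sampling approach, grid layout, and names are assumptions and not the paper's GPU method.

```python
import numpy as np

def deposit_tetrahedron(verts, mass, grid, cell_size, n_samples=20000, rng=None):
    """Monte Carlo stand-in for the exact tetrahedron/cell intersection:
    scatter `mass` onto `grid` by sampling points uniformly inside the
    tetrahedron and binning them into cubical cells of edge `cell_size`."""
    if rng is None:
        rng = np.random.default_rng()
    v = np.asarray(verts, dtype=float)               # (4, 3) vertex coordinates
    u = np.sort(rng.random((n_samples, 3)), axis=1)  # sorted uniforms whose
    bary = np.column_stack([u[:, 0],                 # spacings are uniform
                            u[:, 1] - u[:, 0],       # barycentric coordinates
                            u[:, 2] - u[:, 1],       # on the 3-simplex
                            1.0 - u[:, 2]])
    pts = bary @ v                                   # sample points in the tet
    idx = np.floor(pts / cell_size).astype(int)
    w = mass / n_samples
    for i, j, k in idx:
        if (0 <= i < grid.shape[0] and 0 <= j < grid.shape[1]
                and 0 <= k < grid.shape[2]):
            grid[i, j, k] += w
    return grid

if __name__ == "__main__":
    grid = np.zeros((8, 8, 8))
    tet = [(0.5, 0.5, 0.5), (3.0, 0.5, 0.5), (0.5, 3.0, 0.5), (0.5, 0.5, 3.0)]
    deposit_tetrahedron(tet, mass=1.0, grid=grid, cell_size=1.0)
    print(grid.sum())    # ~1.0: the tetrahedron lies inside the grid
```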
K-mer indices and de Bruijn graphs are important data structures in bioinformatics, with multiple applications ranging from foundational tasks such as error correction, alignment, and genome assembly, to knowledge discovery tasks including repeat detection and SNP identification. While advances in next-generation sequencing technologies have dramatically reduced the cost and improved latency and throughput, few bioinformatics tools can efficiently process data sets at the current generation rate of 1.8 terabases every 3 days. The volume and velocity with which sequencing data is generated necessitate efficient algorithms and implementations of k-mer indices and de Bruijn graphs, two central components in bioinformatics applications. Existing applications that utilize k-mer counting and de Bruijn graphs, however, tend to provide embedded, specialized implementations. The research presented here represents efforts toward the creation of the first reusable, flexible, and extensible distributed-memory parallel libraries for k-mer indexing and de Bruijn graphs. These libraries are intended to simplify the development of bioinformatics applications for distributed-memory environments. For each library, our goals are to create a set of APIs that is simple to use and to provide optimized implementations based on efficient parallel algorithms. We designed algorithms that minimize communication volume and latency, and developed implementations with better cache utilization and SIMD vectorization. We developed Kmerind, a k-mer counting and indexing library based on a distributed-memory hash table and distributed sorted arrays, which provides efficient insert, find, count, and erase operations. For de Bruijn graphs, we developed Bruno by leveraging Kmerind functionalities to support parallel de Bruijn graph construction, chain compaction, error removal, and graph traversal and element query. Our performance evaluations showed that Kmerind is scalable and delivers high performance.
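A distributed k-mer index rests on one invariant: every occurrence of a k-mer must be routed to the same owner, so that a local hash table per owner yields global counts. The sketch below shows that pattern serially, with lists standing in for MPI ranks and a deterministic CRC32 hash as the partitioner; canonical k-mers, the partitioning choice, and all names are illustrative assumptions, not Kmerind's API.

```python
import zlib
from collections import Counter

COMP = str.maketrans("ACGT", "TGCA")

def canonical_kmers(seq, k):
    """Yield canonical k-mers: the lexicographic minimum of each k-mer and its
    reverse complement, so both strands map to the same key."""
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        yield min(kmer, kmer.translate(COMP)[::-1])

def distributed_kmer_count(reads, k, n_parts):
    """Hash-partition k-mers into n_parts buckets (stand-ins for MPI ranks),
    then count locally.  The deterministic hash guarantees every occurrence
    of a k-mer lands in the same bucket."""
    buckets = [[] for _ in range(n_parts)]
    for read in reads:
        for km in canonical_kmers(read, k):
            buckets[zlib.crc32(km.encode()) % n_parts].append(km)  # "all-to-all"
    return [Counter(b) for b in buckets]

if __name__ == "__main__":
    counts = distributed_kmer_count(["ACGTACGTGG", "CCACGTACGT"], k=5, n_parts=4)
    print(sum(counts, Counter()).most_common(3))
```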
This paper focuses on parallel hash functions based on tree modes of operation for an inner variable-input-length (VIL) function. This inner function can be either a single-block-length (SBL), prefix-free MD hash function or a sponge-based hash function. We discuss the various forms of optimality that can be obtained when designing parallel hash functions based on trees in which all leaves have the same depth. The first result is a scheme which optimizes the tree topology in order to decrease the running time. Then, without affecting the optimal running time, we show that the corresponding tree topology can be slightly changed so as to also minimize the number of required processors. Consequently, the resulting scheme minimizes first the running time and then the number of required processors.
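For concreteness, here is a minimal sketch of a tree mode with all leaves at the same depth, using SHA-256 from hashlib as a stand-in for the inner VIL function; the leaf size, arity, and function name are arbitrary illustrative choices, and real tree modes also add domain separation between leaf and node hashes, which is omitted here.

```python
import hashlib

def tree_hash(message: bytes, leaf_size: int = 1024, arity: int = 4) -> bytes:
    """Tree mode with all leaves at the same depth: hash fixed-size leaves,
    then repeatedly hash groups of `arity` child digests until a single root
    digest remains.  Every level is embarrassingly parallel."""
    h = lambda data: hashlib.sha256(data).digest()   # inner VIL function
    level = [h(message[i:i + leaf_size])
             for i in range(0, max(len(message), 1), leaf_size)]
    while len(level) > 1:
        level = [h(b"".join(level[i:i + arity]))
                 for i in range(0, len(level), arity)]
    return level[0]

if __name__ == "__main__":
    print(tree_hash(b"x" * 10_000).hex())
```

Each level can be hashed in parallel; the topology (leaf size and arity) determines both the parallel running time and how many processors are needed, which is the trade-off the paper studies.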
A long-standing conjecture states that a k-connected graph G admits k independent spanning trees (ISTs for short) rooted at an arbitrary node of G. An n-dimensional twisted cube, denoted by TQ(n), is a variation of the hypercube with connectivity n and has many features superior to those of the hypercube. Yang (2010) first proposed an algorithm to construct n edge-disjoint spanning trees in TQ(n) for any odd integer n ≥ 3 and showed that half of them are ISTs. At a later stage, Wang et al. (2012) settled the above conjecture in the affirmative for TQ(n) by providing an O(N log N) time algorithm to construct n ISTs, where N = 2^n is the number of nodes in TQ(n). However, this algorithm is executed in a recursive fashion and is thus hard to parallelize. In this paper, we revisit the problem of constructing ISTs in twisted cubes and present a non-recursive algorithm. Our approach can be fully parallelized to make use of all nodes of TQ(n) as processors for computation, in such a way that each node can determine its parent in all spanning trees directly from its own address and the tree index in O(log N) time.
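The construction itself depends on the structure of TQ(n) and is not reproduced here; as a companion, the sketch below only checks the defining property of ISTs, namely that for every node the paths to the root in the different trees are internally node-disjoint. Trees are given as parent maps assumed to be valid spanning trees, and the small example graph is K4, not a twisted cube.

```python
def path_to_root(parent, v, root):
    """Node sequence from v up to the root, following one tree's parent pointers."""
    path = [v]
    while path[-1] != root:
        path.append(parent[path[-1]])
    return path

def are_independent(trees, root, nodes):
    """True iff for every node v the v-to-root paths of the given spanning
    trees pairwise share no internal node (the IST property)."""
    for v in nodes:
        if v == root:
            continue
        paths = [path_to_root(t, v, root) for t in trees]
        for a in range(len(paths)):
            for b in range(a + 1, len(paths)):
                if set(paths[a][1:-1]) & set(paths[b][1:-1]):
                    return False
    return True

if __name__ == "__main__":
    # two ISTs of K4 rooted at node 0, written as parent maps
    t1 = {0: 0, 1: 0, 2: 0, 3: 0}
    t2 = {0: 0, 1: 2, 2: 3, 3: 0}
    print(are_independent([t1, t2], root=0, nodes=range(4)))   # True
```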
As wireless networks have limited bandwidth and insecure shared media, data compression and encryption are very useful for the broadcast transmission of big data in the IoT (Internet of Things). However, traditional compression and encryption techniques are neither adequate nor efficient for this purpose. To solve this problem, this paper presents a combined parallel algorithm named the "CZ algorithm," which can compress and encrypt big data efficiently. The CZ algorithm uses a parallel pipeline, mixes the coding of compression and encryption, and supports a data window of up to 1 TB (or larger). Moreover, the CZ algorithm can encrypt the big data as a chaotic cryptosystem without decreasing the compression speed. Meanwhile, a shareware tool named "ComZip" has been developed based on the CZ algorithm. The experimental results show that ComZip on a 64-bit system can achieve a better compression ratio than WinRAR and 7-zip, and it can be faster than 7-zip for big data compression. In addition, ComZip encrypts the big data without extra consumption of computing resources.
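To illustrate why the two stages are combined in this order, here is a minimal compress-then-encrypt sketch: compression must run first because ciphertext is essentially incompressible. The zlib codec and the logistic-map keystream are stand-ins chosen for brevity; the keystream is a toy and not secure, and none of this reflects CZ's actual coding, window size, or cipher.

```python
import zlib

def logistic_keystream(seed: float, length: int, r: float = 3.99) -> bytes:
    """Toy chaotic keystream from the logistic map x <- r*x*(1-x).
    Illustrative only: NOT cryptographically secure."""
    x, out = seed, bytearray()
    for _ in range(length):
        x = r * x * (1.0 - x)
        out.append(int(x * 256) % 256)
    return bytes(out)

def compress_then_encrypt(data: bytes, seed: float = 0.618) -> bytes:
    """Compress first (ciphertext would not compress), then XOR-encrypt."""
    packed = zlib.compress(data, 6)
    ks = logistic_keystream(seed, len(packed))
    return bytes(a ^ b for a, b in zip(packed, ks))

def decrypt_then_decompress(blob: bytes, seed: float = 0.618) -> bytes:
    ks = logistic_keystream(seed, len(blob))
    return zlib.decompress(bytes(a ^ b for a, b in zip(blob, ks)))

if __name__ == "__main__":
    data = b"sensor reading 42; " * 1000
    blob = compress_then_encrypt(data)
    assert decrypt_then_decompress(blob) == data
    print(len(data), "->", len(blob), "bytes")
```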