ISBN: 9781450392600 (print)
Various classic reasoning problems with natural hypergraph representations are known to be tractable when a hypertree decomposition (HD) of low width exists. The resulting algorithms are attractive for practical use in fields like databases and constraint satisfaction. However, algorithmic use of HDs relies on the difficult task of first computing a decomposition of the hypergraph underlying a given problem instance, which is then used to guide the algorithm for this particular instance. The performance of purely sequential methods for computing HDs is inherently limited, yet the problem is, theoretically, amenable to parallelisation. In this paper we propose the first algorithm for computing hypertree decompositions that is well-suited for parallelisation. The newly proposed algorithm log-k-decomp requires only a logarithmic number of recursion levels and additionally allows for highly parallelised pruning of the search space by restriction to so-called balanced separators. We provide a detailed experimental evaluation over the HyperBench benchmark and demonstrate that log-k-decomp outperforms the current state-of-the-art significantly.
ISBN: 9781665405409 (print)
Finite-state transducers (FSTs) are frequently used in speech recognition. Transducer composition is an essential operation for combining different sources of information at different granularities. However, composition is also one of the more computationally expensive operations. Due to the heterogeneous structure of FSTs, parallel algorithms for composition are suboptimal in efficiency, generality, or both. We propose an algorithm for parallel composition and implement it on graphics processing units. We benchmark our parallel algorithm on the composition of random graphs and the composition of graphs commonly used in speech recognition. The parallel composition scales better with the size of the input graphs and for large graphs can be as much as 10 to 30 times faster than a sequential CPU algorithm.
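To make the composition operation concrete, here is a minimal sequential sketch in Python. It is not the GPU algorithm from the abstract: it assumes epsilon-free transducers, tropical-semiring weights (weights add along a path), and a hypothetical dictionary layout (`start`, `finals`, `arcs`) chosen only for illustration.

```python
from collections import deque

def compose(t1, t2):
    """Compose two epsilon-free weighted transducers (illustrative layout).

    Each transducer is a dict with keys:
      "start":  start state
      "finals": set of final states
      "arcs":   {state: [(in_label, out_label, weight, next_state), ...]}
    """
    start = (t1["start"], t2["start"])
    arcs, finals = {}, set()
    seen, queue = {start}, deque([start])
    while queue:
        state = queue.popleft()
        q1, q2 = state
        # A pair state is final only if both components are final.
        if q1 in t1["finals"] and q2 in t2["finals"]:
            finals.add(state)
        out = arcs.setdefault(state, [])
        for i1, o1, w1, n1 in t1["arcs"].get(q1, []):
            for i2, o2, w2, n2 in t2["arcs"].get(q2, []):
                # Match the output label of t1 with the input label of t2.
                if o1 == i2:
                    nxt = (n1, n2)
                    out.append((i1, o2, w1 + w2, nxt))
                    if nxt not in seen:
                        seen.add(nxt)
                        queue.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}
```

The nested loop over arc pairs is the expensive part, and the amount of work per state pair varies with the number of outgoing arcs. This irregularity is one reason the operation is hard to parallelise evenly.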
The paper considers the problem of exact network reliability calculation. We assume that a network has unreliable communication links and perfectly reliable nodes. The reliability of such a network is defined as the probability that every pair of nodes of the network is connected by an operational path. The problem of computing this characteristic is known to be NP-hard. For supercomputers with distributed memory, we study ways of parallelizing the well-known recursive factoring method. The best parallel algorithm among the approaches considered is based on a Master-Slave scheme that uses a threshold on the minimal size of a graph for sending it to a new process without recursion backtracking. This algorithm has a linear or even superlinear speedup up to 768 cores. The numerical results show that the scalability depends on the chosen threshold, which, in turn, depends on the density of the graph.
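The recursive factoring method mentioned above can be sketched sequentially as follows. This is a minimal illustration, not the parallel Master-Slave scheme of the paper: it picks an edge e with reliability p and applies R(G) = p * R(G with e contracted) + (1 - p) * R(G with e deleted). The function names and the edge-list representation are assumptions made for this example, and the input is assumed to contain no self-loops.

```python
def is_connected(nodes, edges):
    """Depth-first search connectivity check on an undirected multigraph."""
    if len(nodes) <= 1:
        return True
    adj = {n: set() for n in nodes}
    for u, v, _ in edges:
        adj[u].add(v)
        adj[v].add(u)
    start = next(iter(nodes))
    seen, stack = {start}, [start]
    while stack:
        for y in adj[stack.pop()]:
            if y not in seen:
                seen.add(y)
                stack.append(y)
    return len(seen) == len(nodes)

def reliability(nodes, edges):
    """All-terminal reliability; edges are (u, v, p) with p = P(link works)."""
    nodes = frozenset(nodes)
    if len(nodes) == 1:
        return 1.0
    if not is_connected(nodes, edges):
        return 0.0
    (u, v, p), rest = edges[0], edges[1:]
    # Branch 1: the chosen edge fails, so delete it.
    r_deleted = reliability(nodes, rest)
    # Branch 2: the chosen edge works, so merge v into u and drop self-loops.
    merged = []
    for a, b, q in rest:
        a = u if a == v else a
        b = u if b == v else b
        if a != b:
            merged.append((a, b, q))
    r_contracted = reliability(nodes - {v}, merged)
    return p * r_contracted + (1 - p) * r_deleted
```

The two recursive branches are independent of each other, which is what makes the method a natural candidate for distributing subgraphs to separate processes.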
ISBN: 9781665497473 (print)
Querying the existence of an edge in a given graph or hypergraph is a building block in several algorithms. Hashing-based methods can be used for this purpose, where the given edges are stored in a hash table in a preprocessing step, and then the queries are answered using the lookup operations. While the general hashing methods have fast lookup times in the average case, the worst case run time is much higher. Perfect hashing methods take advantage of the fact that the items to be stored are all available and construct a collision free hash function for the given input, resulting in an optimal lookup time even in the worst case. We investigate an efficient shared-memory parallel implementation of a recently proposed perfect hashing method for hypergraphs. We experimentally compare the resulting parallel algorithms with the state-of-the-art and demonstrate better run time and scalability on a set of hypergraphs corresponding to real-life sparse tensors.
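As an illustration of collision-free lookup, the sketch below builds a two-level table in the style of the classic FKS scheme. It is not the perfect hashing method evaluated in the abstract, and it is sequential. It relies on Python's built-in `hash` for tuples of integers, and the names `build` and `has_edge` are hypothetical.

```python
def build(hyperedges, max_tries=1000):
    """Build a two-level, collision-free table for a fixed set of hyperedges."""
    # Normalise each hyperedge to a sorted tuple and drop duplicates.
    edges = list(dict.fromkeys(tuple(sorted(e)) for e in hyperedges))
    n = max(1, len(edges))
    buckets = [[] for _ in range(n)]
    for e in edges:
        buckets[hash(e) % n].append(e)
    tables = []
    for bucket in buckets:
        size = len(bucket) ** 2  # quadratic space makes collisions unlikely
        for seed in range(max_tries):
            slots = [None] * size
            for e in bucket:
                i = hash((seed, e)) % size
                if slots[i] is not None:
                    break  # collision: try the next seed
                slots[i] = e
            else:
                tables.append((seed, slots))
                break
        else:
            raise RuntimeError("no collision-free seed found")
    return tables

def has_edge(tables, hyperedge):
    """Answer a membership query with one probe per level."""
    e = tuple(sorted(hyperedge))
    seed, slots = tables[hash(e) % len(tables)]
    if not slots:
        return False
    return slots[hash((seed, e)) % len(slots)] == e
```

Each second-level table is built independently of the others, so in principle the per-bucket seed search can be distributed across threads.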
ISBN: 9798400704161 (print)
Partitioning a graph into blocks of roughly equal weight while cutting only few edges is a fundamental problem in computer science with numerous practical applications. While shared-memory parallel partitioners have recently matured to achieve the same quality as widely used sequential partitioners, there is still a pronounced quality gap between distributed partitioners and their sequential counterparts. In this work, we shrink this gap considerably by describing the engineering of an unconstrained local search algorithm suitable for distributed partitioners. We integrate the proposed algorithm in a distributed multilevel partitioner. Our extensive experiments show that the resulting algorithm scales to thousands of PEs while computing cuts that are, on average, only 3.5% larger than those of a state-of-the-art high-quality shared-memory partitioner. Compared to previous distributed partitioners, we obtain on average 6.8% smaller cuts than the best-performing competitor while being more than 9 times faster.
Parallel algorithm design is generally hard. Parallel program verification is even harder. We take an example from the maximum subarray problem and show those two problems of design and verification. The best kno...
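For readers unfamiliar with the maximum subarray problem, the following sketch shows one standard way to phrase it as a reduction with an associative combine step, which is the form usually needed for a parallel solution. The abstract is truncated, so this is not claimed to be the algorithm the paper studies. The helper names are hypothetical, and the sketch considers non-empty subarrays only.

```python
from functools import reduce

def leaf(x):
    # Summary of a one-element segment:
    # (total sum, best prefix sum, best suffix sum, best subarray sum)
    return (x, x, x, x)

def combine(a, b):
    """Merge summaries of two adjacent segments; this operation is associative."""
    total_a, pre_a, suf_a, best_a = a
    total_b, pre_b, suf_b, best_b = b
    return (
        total_a + total_b,
        max(pre_a, total_a + pre_b),      # prefix may extend into b
        max(suf_b, total_b + suf_a),      # suffix may extend into a
        max(best_a, best_b, suf_a + pre_b),  # best may straddle the boundary
    )

def max_subarray(xs):
    if not xs:
        raise ValueError("expected a non-empty sequence")
    return reduce(combine, map(leaf, xs))[3]
```

Because `combine` is associative, the left-to-right `reduce` used here could be replaced by a tree-shaped reduction over chunks of the input. Proving that such a replacement preserves the result is the kind of verification obligation the abstract alludes to.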
ISBN: 9798400705977 (print)
Edge Computing (EC) has emerged as a solution to reduce energy demand and greenhouse gas emissions from digital technologies. EC supports low latency, mobility, and location awareness for delay-sensitive applications by bridging the gap between cloud computing services and end-users. Machine learning (ML) methods have been applied in EC for data classification and information processing. Ensemble learners have often proven to yield high predictive performance on data stream classification problems. Mini-batching is a technique proposed for improving cache reuse in multi-core architectures of bagging ensembles for the classification of online data streams, which benefits application speedup and reduces energy consumption. However, the original mini-batching presents limited benefits in terms of cache reuse and it hinders the accuracy of the ensembles (i.e., their capacity to detect behavior changes in data streams). In this paper, we improve mini-batching by fusing continuous training and test loops for the classification of data streams. We evaluated the new strategy by comparing its performance and energy efficiency with the original mini-batching for data stream classification using six ensemble algorithms and four benchmark datasets. We also compare mini-batching strategies with two hardware-based strategies supported by commodity multi-core processors commonly used in EC. Results show that mini-batching strategies can significantly reduce energy consumption in 95% of the experiments. Mini-batching improved energy efficiency by 96% on average and 169% in the best case. Likewise, our new mini-batching strategy improved energy efficiency by 136% on average and 456% in the best case. These strategies also support better control of the balance between performance, energy efficiency, and accuracy.
Probe machine (PM) is a recently reported mathematical model with massive parallelism. Herein, we present a search for the maximum clique of an undirected graph with six vertices. We constructed a data library containing n sublibraries, each corresponding to a vertex in the given graph. Then, a probe library was designed according to the induced subgraph in order to search for and generate all maximal cliques. Subsequently, we performed the probe operation, and all maximal cliques were generated in parallel. The advantages of the proposed model lie in two aspects. On one hand, the solution to an NP-complete problem is generated in just one step of the probe operation rather than found in a vast solution *** the other hand, the proposed model is highly *** work demonstrates that PM is superior to TM in terms of searching capacity when tackling NP-complete problems.
The Hilbert curve describes a one-to-one mapping between multidimensional space and 1D *** traditional 3D Hilbert encoding and decoding algorithms work in an order-wise manner, are not aware of the differences between input data, and spend equivalent computing costs on all of them, resulting in low efficiency. To solve this problem, in this paper we design efficient 3D state views for fast encoding and decoding. Based on these state views, a new encoding algorithm (JFK-3HE) and a new decoding algorithm (JFK-3HD) are proposed. JFK-3HE and JFK-3HD avoid iteratively encoding or decoding each order by skipping the first 0s in the input data, thus decreasing the complexity and improving the efficiency. Experimental results show that JFK-3HE and JFK-3HD outperform the state-of-the-art algorithms for both uniform and skew-distributed data.
ISBN: 9781665469586 (print)
Multiplication is a fundamental step in many algorithms. If the multiplication of two integers of n words has a complexity of M(n), divisions and squares can be computed in O(M(n)) as well, and the greatest common divisor can be computed in O(M(n) log n). A small value of M(n) is therefore extremely important. To this day, the best known algorithm for operand sizes reachable in practice is the Schonhage-Strassen algorithm, which is implemented by a few arithmetic libraries. Asymptotically faster algorithms exist, but no computer is able to hold numbers big enough for those algorithms to outrun Schonhage-Strassen. The GNU Multiple Precision (GMP) library has a sequential-only implementation of Schonhage-Strassen. However, some algorithms contain a step that is a single big multiplication, so parallelizing such an algorithm requires a parallel algorithm for multiplication. An example of such an algorithm is the batch factorization for the Number Field Sieve. People implementing a parallel version of such algorithms therefore need an arithmetic library that implements parallel integer multiplication. An example of such a library is the Flint (Fast Library for Number Theory) library, which contains a parallel implementation of Schonhage-Strassen. In this article we present an implementation of Schonhage-Strassen that reaches a speedup of 20 for the multiplication of two integers of 10^7 words of 64 bits using a Xeon Gold with 32 cores.