Computing distance fields is fundamental to many scientific and engineering applications. Distance fields can be used to direct analysis and reduce data. In this paper, we present a highly scalable method for computing 3D distance fields on massively parallel distributed-memory machines. A new distributed spatial data structure, named parallel distance tree, is introduced to manage the level sets of data and facilitate surface tracking over time, resulting in significantly reduced computation and communication costs for calculating the distance to the surface of interest from any spatial location. Our method supports several data types and distance metrics from real-world applications. We demonstrate its efficiency and scalability on state-of-the-art supercomputers using both large-scale volume datasets and surface models. We also demonstrate in-situ distance field computation on dynamic turbulent flame surfaces for a petascale combustion simulation. Our work greatly extends the usability of distance fields for demanding applications.
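To make concrete what such a distance field is, the sketch below computes, for every point of a regular grid, the distance to the nearest sample of a surface. It is a serial, brute-force illustration using a k-d tree; it is not the paper's parallel distance tree, and the grid resolution and the sphere example are assumptions made only for this sketch.

```python
# Minimal sketch: a 3D distance field from a point-sampled surface (serial illustration).
import numpy as np
from scipy.spatial import cKDTree

def distance_field(surface_points, grid_res=64, bounds=(0.0, 1.0)):
    """Return a grid_res^3 array of Euclidean distances to the surface samples."""
    lo, hi = bounds
    axis = np.linspace(lo, hi, grid_res)
    # All grid points as an (N, 3) array.
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1).reshape(-1, 3)
    tree = cKDTree(surface_points)       # spatial index over the surface samples
    dist, _ = tree.query(grid, k=1)      # nearest-surface distance for every grid point
    return dist.reshape(grid_res, grid_res, grid_res)

# Example: distance field around a sphere of radius 0.3 centred at (0.5, 0.5, 0.5).
theta, phi = np.random.rand(2, 10000) * np.array([[np.pi], [2 * np.pi]])
sphere = 0.5 + 0.3 * np.stack([np.sin(theta) * np.cos(phi),
                               np.sin(theta) * np.sin(phi),
                               np.cos(theta)], axis=-1)
field = distance_field(sphere)
```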
For a given algorithm, the energy consumed in executing the algorithm has a nonlinear relationship with performance. In the case of parallel algorithms, energy use and performance are functions of the structure of the algorithm. We define the asymptotic energy complexity of algorithms, which models the minimum energy required to execute a parallel algorithm within a given execution time as a function of input size. Our methodology provides a way of comparing the orders of (minimal) energy required by different algorithms and can be used to define energy complexity classes of parallel algorithms.
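One way to phrase the kind of definition the abstract describes is as a constrained minimization over schedules; the notation below is my own illustration, not necessarily the authors' formulation.

```latex
% Illustrative notation only; the paper's exact definition may differ.
% S_A(n): valid parallel schedules of algorithm A on an input of size n
% T(s):   execution time of schedule s;  t: the given time budget
% P_k(s), \tau_k(s): power drawn and duration of the k-th phase of schedule s
% The asymptotic energy complexity is the order of growth of E_A(n, t) in n for fixed t.
E_A(n, t) \;=\; \min_{\substack{s \in S_A(n) \\ T(s) \le t}} \;\sum_{k} P_k(s)\,\tau_k(s)
```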
ISBN (Print): 9781509042982
We live in an era of big data, and the analysis of these data is becoming a bottleneck in many domains, including biology and the internet. To make these analyses feasible in practice, we need efficient data reduction algorithms. The Singular Value Decomposition (SVD) is a data reduction technique that has been used in many different applications; for example, SVDs have been extensively used in text analysis. The best known sequential algorithms for computing the SVD take cubic time, which may not be acceptable in practice. As a result, many parallel algorithms have been proposed in the literature. There are two main families of algorithms for the SVD, namely QR decomposition and Jacobi iterations. Researchers have found that even though QR is sequentially faster than Jacobi iterations, QR is difficult to parallelize. As a result, most of the parallel algorithms in the literature are based on Jacobi iterations. For example, the Jacobi Relaxation Scheme (JRS) variant of the classical Jacobi algorithm has been shown to be very effective in parallel. In this paper we propose a novel variant of the classical Jacobi algorithm that is more efficient than the JRS algorithm, and our experimental results confirm this assertion. The key idea behind our algorithm is to select the pivot elements for each sweep appropriately. We also show how to efficiently implement our algorithm on such parallel models as the PRAM and the mesh.
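For orientation, the sketch below shows the classical one-sided (Hestenes) Jacobi iteration that underlies such SVD algorithms: column pairs are orthogonalized by plane rotations until all pairs are numerically orthogonal. It is a plain serial cyclic sweep, not the JRS scheme or the pivot-selection strategy proposed in the paper.

```python
# One-sided (Hestenes) Jacobi SVD sketch: serial, cyclic pivot order.
import numpy as np

def jacobi_svd(A, tol=1e-12, max_sweeps=30):
    A = A.astype(float).copy()          # columns are rotated in place
    m, n = A.shape
    V = np.eye(n)
    for _ in range(max_sweeps):
        off = 0.0
        for p in range(n - 1):
            for q in range(p + 1, n):
                alpha = A[:, p] @ A[:, p]
                beta = A[:, q] @ A[:, q]
                gamma = A[:, p] @ A[:, q]
                off = max(off, abs(gamma) / np.sqrt(alpha * beta))
                if abs(gamma) < tol * np.sqrt(alpha * beta):
                    continue                      # columns already orthogonal
                zeta = (beta - alpha) / (2.0 * gamma)
                sign = 1.0 if zeta >= 0 else -1.0
                t = sign / (abs(zeta) + np.sqrt(1.0 + zeta * zeta))
                c = 1.0 / np.sqrt(1.0 + t * t)
                s = c * t
                # Rotate the column pair (p, q) of A, and accumulate the same rotation in V.
                R = np.array([[c, s], [-s, c]])
                A[:, [p, q]] = A[:, [p, q]] @ R
                V[:, [p, q]] = V[:, [p, q]] @ R
        if off < tol:
            break
    sigma = np.linalg.norm(A, axis=0)             # singular values
    U = A / sigma                                  # left singular vectors
    return U, sigma, V

A0 = np.random.rand(50, 8)
U, s, V = jacobi_svd(A0)
print(np.allclose(U * s @ V.T, A0, atol=1e-8))     # reconstruct A = U diag(s) V^T
```

Because every rotation touches only two columns, disjoint column pairs within a sweep are independent; this is exactly what parallel Jacobi variants exploit by choosing which pairs (pivots) to process concurrently.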
We discuss the design and implementation of new highly-scalable distributed-memory parallel algorithms for two prototypical graph problems, edge-weighted matching and distance-1 vertex coloring. Graph algorithms in general have low concurrency, poor data locality, and high ratio of data access to computation costs, making it challenging to achieve scalability on massively parallel machines. We overcome this challenge by employing a variety of techniques, including speculation and iteration, optimized communication, and randomization. We present preliminary results on weak and strong scalability studies conducted on an IBM Blue Gene/P machine employing up to tens of thousands of processors. The results show that the algorithms hold strong potential for computing at petascale.
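To make the "speculation and iteration" idea concrete, here is a hedged serial simulation of the scheme commonly used for parallel distance-1 coloring: all pending vertices are colored greedily against a possibly stale view of their neighbors' colors, conflicting edge endpoints are detected, and only the conflicting vertices are recolored in the next round. It sketches the general technique, not the authors' distributed-memory implementation.

```python
# Speculative, iterative distance-1 coloring (serial simulation of the parallel scheme).
def speculative_coloring(adj):
    """adj: dict vertex -> set of neighbors. Returns dict vertex -> color."""
    color = {}
    to_color = set(adj)
    while to_color:
        snapshot = dict(color)                     # stale view, as a parallel round would see
        # Speculation: color every pending vertex with the smallest color
        # not used by neighbors in the snapshot.
        for v in to_color:
            forbidden = {snapshot[u] for u in adj[v] if u in snapshot}
            c = 0
            while c in forbidden:
                c += 1
            color[v] = c
        # Conflict detection: adjacent vertices that ended up with the same color.
        conflicts = set()
        for v in to_color:
            for u in adj[v]:
                if u != v and color.get(u) == color[v]:
                    conflicts.add(max(u, v))       # deterministically keep one endpoint
        to_color = conflicts                        # iterate until no conflicts remain
    return color

# Tiny example: a 4-cycle.
graph = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(speculative_coloring(graph))
```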
The article presents an algorithmic model of sound propagation in rooms to run on parallel and distributed computer systems. This algorithm is used by the authors in an implementation of an adaptable high-performance computer system simulating various fields and providing scalability on an arbitrary number of parallel central and graphics processors as well as distributed computer clusters. Many general-purpose computer simulation systems have limited usability when it comes to high-precision simulation associated with large numbers of elementary computations, due to their lack of scalability on various parallel and distributed platforms. The higher the required adequacy of the model, the larger the number of steps of the simulation algorithms. Scalability permits the use of hybrid parallel computer systems and improves the efficiency of the simulation with respect to adequacy, time consumption, and total cost of simulation. The report covers such an algorithm, which is based on an approximate superposition of acoustical fields and provides adequate results as long as the underlying equations of acoustics are linear. The algorithm represents reflecting surfaces as sets of vibrating pistons and uses the Rayleigh integral to calculate their scattering properties. The article also provides a parallel form of the algorithm and an analysis of its properties in parallel and sequential forms.
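For reference, the Rayleigh integral mentioned above gives the pressure radiated by a baffled vibrating surface; the standard first form is shown below (sign conventions vary with the assumed time dependence). Discretizing the surface into pistons turns the integral into a sum with one term per piston, which is what makes the evaluation embarrassingly parallel.

```latex
% First Rayleigh integral: pressure at field point r radiated by a planar
% surface S vibrating with normal velocity v_n in an infinite rigid baffle.
p(\mathbf{r}, \omega) \;=\; \frac{j \omega \rho_0}{2\pi}
  \int_{S} v_n(\mathbf{r}_s)\,
  \frac{e^{-j k \lVert \mathbf{r}-\mathbf{r}_s \rVert}}
       {\lVert \mathbf{r}-\mathbf{r}_s \rVert}\, \mathrm{d}S,
\qquad k = \frac{\omega}{c}.
```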
Community detection has become an important operation in numerous graph-based applications. It is used to reveal groups that exist within real-world networks without imposing prior size or cardinality constraints on the set of communities. Despite its potential, support for parallel computers is rather limited, largely because the algorithm is irregular and the underlying heuristics imply a sequential nature. In this paper I present parallelization heuristics for fast community detection using the Louvain method as it is applied on GPUs. The Louvain method is a multi-phase, iterative heuristic for modularity optimization. Originally developed by Blondel et al. (2008), the method has become increasingly popular owing to its ability to detect high-modularity community partitions in a fast and memory-efficient manner. The parallel heuristics used were first introduced by Hao Lu et al. (2015). As the Louvain method is inherently sequential, it limits the possibility of scalable usage; the proposed parallel heuristics let me observe how this method behaves on GPUs. For evaluation I implemented the heuristics using CUDA on a GeForce GTX 980M GPU, and for testing I used organization landscapes from the CERN-developed Collaboration Spotting project, which uses patents and publications to visualize the connections in technologies among its collaborators. Compared to the parallel Louvain implementation running on 8 threads on the same machine that hosts the GPU, the CUDA implementation is able to produce community outputs comparable to the CPU-generated results while providing absolute speedups of up to 12 using the GeForce GTX 980M mobile GPU.
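The per-vertex quantity that such GPU heuristics evaluate concurrently is the modularity gain of moving a vertex into a neighboring community. Below is a hedged sketch of that inner step using the standard gain expression from Blondel et al. (2008); the function names and data layout are invented for the example and are not the CUDA kernels described above.

```python
# Modularity gain of moving an isolated vertex i into community C (Blondel et al., 2008).
# sigma_in:  sum of weights of edges inside C
# sigma_tot: sum of weights of edges incident to vertices of C
# k_i:       weighted degree of vertex i
# k_i_in:    sum of weights of edges from i to vertices of C
# m:         total edge weight of the graph
def modularity_gain(sigma_in, sigma_tot, k_i, k_i_in, m):
    after = (sigma_in + 2.0 * k_i_in) / (2.0 * m) - ((sigma_tot + k_i) / (2.0 * m)) ** 2
    before = sigma_in / (2.0 * m) - (sigma_tot / (2.0 * m)) ** 2 - (k_i / (2.0 * m)) ** 2
    return after - before

def best_move(i, neighbors, weights, community, sigma_in, sigma_tot, k, m):
    """Pick the neighboring community with the largest positive gain for vertex i.
    In the GPU heuristics, one such evaluation runs per vertex in parallel."""
    k_i_in = {}
    for j, w in zip(neighbors[i], weights[i]):
        k_i_in[community[j]] = k_i_in.get(community[j], 0.0) + w
    best, best_gain = community[i], 0.0
    for c, kin in k_i_in.items():
        gain = modularity_gain(sigma_in[c], sigma_tot[c], k[i], kin, m)
        if gain > best_gain:
            best, best_gain = c, gain
    return best
```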
One of the most important constraints of today's architectures for data-intensive applications is the limited bandwidth due to the memory-processor communication bottleneck. This significantly impacts performance and energy; for instance, the energy consumption share of communication and memory access may exceed 80%. Recently, the concept of Computation-in-Memory (CIM) was proposed, which is based on the integration of storage and computation in the same physical location using a crossbar topology and non-volatile resistive-switching memristor technology. To illustrate the tremendous potential of the CIM architecture in exploiting massively parallel computation while reducing the communication overhead, we present a communication-efficient mapping of a large-scale matrix multiplication algorithm onto the CIM architecture. The experimental results show that, depending on the matrix size, the CIM architecture achieves several orders of magnitude higher performance in terms of total execution time, and two orders of magnitude lower total energy consumption, than a multicore system based on a shared-memory architecture.
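The reason a memristive crossbar suits matrix multiplication is that Ohm's and Kirchhoff's laws evaluate a full matrix-vector product in one step: the matrix entries are programmed as conductances, the input vector is applied as row voltages, and the column currents are the dot products. The following is an idealized numerical sketch of that mapping, ignoring device non-idealities and the paper's actual mapping scheme.

```python
# Idealized memristor-crossbar matrix multiply: every column wire computes a dot product.
import numpy as np

def crossbar_matvec(G, v):
    """G[i, j]: conductance of the device at row wire i / column wire j.
    v[i]: voltage applied to row wire i.
    Returns the currents collected on the column wires (Kirchhoff's current law)."""
    return G.T @ v            # I_j = sum_i G[i, j] * v[i]

def crossbar_matmul(A, B):
    """C = A @ B, column by column: each column of B is applied as a voltage vector
    to a crossbar programmed with A^T; all dot products of one matvec happen
    'in memory' simultaneously."""
    G = A.T                                          # program A^T as conductances
    return np.stack([crossbar_matvec(G, b) for b in B.T], axis=1)

A, B = np.random.rand(4, 5), np.random.rand(5, 3)
assert np.allclose(crossbar_matmul(A, B), A @ B)
```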
ISBN (Print): 9781509028153
Hashing algorithms are used widely in the information security area. Having studied the characteristics of traditional cryptographic hash functions and considered the features of multi-core cryptographic processors, this paper proposes a parallel algorithm for hash computation well suited to a multi-core cryptographic processor. The algorithm breaks the chain dependencies of the standard hash function by implementing a recursive hash, yielding a faster hash implementation. We discuss the theoretical foundation for our mapping framework, including security and performance measures. The experiments are performed on a PC with a PCIe card containing a multi-core cryptographic processor as the cipher processing engine. The results show a performance gain by an approximate factor of 7.8 when running on the 8-core cryptographic processor.
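As an illustration of how a recursive hash removes the chaining dependency, the sketch below splits the message into chunks, hashes the chunks independently (the part that can run on separate cores), and then hashes the concatenated digests. It uses SHA-256 from Python's hashlib for concreteness; the chunk size, the two-level structure, and the domain-separation prefixes are assumptions of this example, not the construction or the cryptographic processor from the paper.

```python
# Two-level recursive (tree) hash: leaf hashes are independent, so they can be
# computed in parallel; only the short root hash is sequential.
import hashlib
from concurrent.futures import ProcessPoolExecutor

CHUNK = 1 << 20   # 1 MiB leaves (an arbitrary choice for this sketch)

def _leaf(chunk: bytes) -> bytes:
    return hashlib.sha256(b"\x00" + chunk).digest()      # 0x00 = leaf prefix

def recursive_hash(message: bytes) -> bytes:
    chunks = [message[i:i + CHUNK] for i in range(0, len(message), CHUNK)] or [b""]
    with ProcessPoolExecutor() as pool:                   # leaves hashed concurrently
        digests = list(pool.map(_leaf, chunks))
    root = hashlib.sha256(b"\x01" + b"".join(digests))    # 0x01 = root prefix
    return root.digest()

if __name__ == "__main__":
    print(recursive_hash(b"x" * (10 * CHUNK)).hex())
```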
Computing problems that handle large amounts of data necessitate the use of lossless data compression for efficient storage and transmission. We present a novel lossless universal data compression algorithm that uses parallel computational units to increase the throughput. The length-N input sequence is partitioned into B blocks. Processing each block independently of the other blocks can accelerate the computation by a factor of B but degrades the compression quality. Instead, our approach is to first estimate the minimum description length (MDL) context tree source underlying the entire input, and then encode each of the B blocks in parallel based on the MDL source. With this two-pass approach, the compression loss incurred by using more parallel units is insignificant. Our algorithm is work-efficient, i.e., its computational complexity is O(N/B). Its redundancy is approximately B log(N/B) bits above Rissanen's lower bound on universal compression performance, with respect to any context tree source whose maximal depth is at most log(N/B). We improve the compression by using different quantizers for states of the context tree based on the number of symbols corresponding to those states. Numerical results from a prototype implementation suggest that our algorithm offers a better trade-off between compression and throughput than competing universal data compression algorithms.
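A much-simplified sketch of the two-pass structure described above: the first pass builds one shared statistical model from the whole input, and the second pass processes the B blocks independently against that frozen model. To stay short, the sketch uses a fixed-order context model and reports ideal code lengths (-log2 of the model probabilities) per block instead of performing actual arithmetic coding, and it does not implement the MDL context-tree estimation of the paper.

```python
# Two-pass parallel compression skeleton: shared model, independent blocks.
import math
from collections import Counter

ORDER = 2  # fixed context length; the paper instead estimates an MDL context tree

def build_model(data: bytes):
    """Pass 1 (over the whole input): counts of each symbol given its ORDER-byte context."""
    counts, ctx_totals = Counter(), Counter()
    for i in range(ORDER, len(data)):
        ctx = data[i - ORDER:i]
        counts[(ctx, data[i])] += 1
        ctx_totals[ctx] += 1
    return counts, ctx_totals

def block_code_length(block: bytes, prefix: bytes, model) -> float:
    """Pass 2 (independent per block): ideal code length in bits under the shared model.
    `prefix` supplies the context for the block's first symbols."""
    counts, ctx_totals = model
    data = prefix[-ORDER:] + block
    bits = 0.0
    for i in range(ORDER, len(data)):
        ctx = data[i - ORDER:i]
        p = (counts[(ctx, data[i])] + 1) / (ctx_totals[ctx] + 256)  # Laplace smoothing
        bits += -math.log2(p)
    return bits

text = b"abracadabra " * 4000
B = 4
size = len(text) // B
blocks = [text[i * size:(i + 1) * size] for i in range(B)]
model = build_model(text)
# Each call below depends only on the shared model and its own block,
# so the B blocks could be encoded by B workers in parallel.
lengths = [block_code_length(blk, text[max(0, i * size - ORDER):i * size], model)
           for i, blk in enumerate(blocks)]
print(sum(lengths) / 8, "bytes (ideal, excluding the model description)")
```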
ISBN (Print): 9781479953424
Many high-dimensional data mining applications involve nearest neighbor search (NNS) on a KD-tree. A randomized KD-tree forest enables fast medium- and large-scale NNS among high-dimensional data points. In this paper, we present massively parallel algorithms for the construction of a KD-tree forest and for NNS on a cluster equipped with massively parallel architecture (MPA) devices, namely graphics processing units (GPUs). This design accelerates KD-tree forest construction and NNS significantly for the signature of histograms of orientations (SHOT) 3D local descriptors, by factors of up to 5.27 and 20.44, respectively. Our implementations will potentially benefit real-time high-dimensional descriptor matching.
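The following CPU sketch illustrates the randomized KD-tree forest idea: several trees are built over randomly rotated copies of the data, each is queried with a small candidate budget, and the union of candidates is re-ranked exactly. It is a simplified stand-in (random rotations instead of FLANN-style randomized split dimensions, SciPy's cKDTree instead of a GPU kernel) meant only to show the structure of the search; the 352-dimensional random data merely stands in for SHOT descriptors.

```python
# Randomized KD-tree "forest" for approximate nearest-neighbour search (CPU sketch).
import numpy as np
from scipy.spatial import cKDTree

class KDForest:
    def __init__(self, points, n_trees=4, seed=0):
        rng = np.random.default_rng(seed)
        d = points.shape[1]
        # One random rotation per tree decorrelates the axis-aligned splits.
        self.rotations = [np.linalg.qr(rng.standard_normal((d, d)))[0] for _ in range(n_trees)]
        self.trees = [cKDTree(points @ R) for R in self.rotations]
        self.points = points

    def query(self, q, k=1, leaf_budget=8):
        """Approximate k-NN: each tree proposes a few candidates, their union is re-ranked."""
        candidates = set()
        for R, tree in zip(self.rotations, self.trees):
            _, idx = tree.query(q @ R, k=leaf_budget)
            candidates.update(np.atleast_1d(idx).tolist())
        cand = np.fromiter(candidates, dtype=int)
        dists = np.linalg.norm(self.points[cand] - q, axis=1)
        order = np.argsort(dists)[:k]
        return cand[order], dists[order]

data = np.random.rand(20000, 352).astype(np.float32)
forest = KDForest(data, n_trees=4)
idx, dist = forest.query(data[123])
print(idx, dist)
```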