The ChainMail algorithm is a physically based deformation algorithm that has been successfully used in virtual surgery simulators, where computation time is a critical factor. In this paper, we present a parallel algorithm based on ChainMail, together with an efficient implementation that reduces the time required to compute deformations over large 3D medical datasets by means of modern GPU capabilities. We also present a 3D blocking scheme that reduces the number of unnecessary processing threads. For this purpose, the paper describes a new parallel Boolean reduction scheme used to efficiently decide which blocks are computed. Finally, through an extensive analysis, we show the performance improvement achieved by our implementation of the proposed algorithm and by the proposed blocking scheme, owing to the high spatial and temporal locality of our approach.
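As an illustration of the block-level Boolean reduction idea described in this abstract, the following is a minimal sketch, not the authors' CUDA implementation: the deformation volume is split into fixed-size 3D blocks, a per-voxel flag marks voxels that still violate the ChainMail constraints, and an OR-reduction over each block selects the blocks that need processing threads in the next pass. The names violation, block, and active_blocks are illustrative.

import numpy as np

def active_blocks(violation, block=(8, 8, 8)):
    """Reduce a per-voxel violation mask to a per-block 'needs work' mask.

    violation: boolean array (nx, ny, nz), True where a voxel still
               violates the ChainMail min/max distance constraints.
    block:     block edge lengths; the grid is assumed to be an exact
               multiple of the block size for simplicity.
    """
    nx, ny, nz = violation.shape
    bx, by, bz = block
    # Reshape so each block becomes its own axis group, then OR-reduce it.
    v = violation.reshape(nx // bx, bx, ny // by, by, nz // bz, bz)
    return v.any(axis=(1, 3, 5))          # shape: (nx/bx, ny/by, nz/bz)

# Only blocks whose flag is True would be assigned processing threads
# in the next relaxation pass, which is what avoids idle threads.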
Next Generation Sequencing (NGS) assemblers are challenged with the problem of handling a massive number of reads. The bi-directed de Bruijn graph is the most fundamental data structure on which numerous NGS assemblers have been built (e.g., Velvet, ABySS). Most of these assemblers differ only in the heuristics they employ to operate on this de Bruijn graph. These heuristics are composed of several fundamental operations such as construction, compaction, and pruning of the underlying bi-directed de Bruijn graph. Unfortunately, the current algorithms for these fundamental operations on the de Bruijn graph are computationally inefficient and have become a bottleneck to scaling NGS assemblers. In this talk, recent results that provide computationally efficient algorithms for these fundamental bi-directed de Bruijn graph operations are discussed. The algorithms are based on sorting and are efficient in sequential, out-of-core, and parallel settings.
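To make the sorting-based construction concrete, here is a minimal sketch for a plain (uni-directed) de Bruijn graph; the bi-directed, out-of-core, and parallel variants discussed in the talk additionally handle reverse complements and external-memory sorting, which this sketch omits. The function name de_bruijn_edges and the in-memory list sort are illustrative.

def de_bruijn_edges(reads, k):
    """Sorting-based construction of de Bruijn graph edges.

    Each length-(k+1) substring of a read contributes one edge between
    its k-prefix and k-suffix.  Sorting the (k+1)-mers groups identical
    edges so duplicates can be merged in one linear scan; the same
    pattern carries over to external-memory and parallel sorts.
    """
    kmers = []
    for r in reads:
        for i in range(len(r) - k):
            kmers.append(r[i:i + k + 1])
    kmers.sort()                      # external or parallel sort in practice
    edges = []                        # (u, v, multiplicity)
    i = 0
    while i < len(kmers):
        j = i
        while j < len(kmers) and kmers[j] == kmers[i]:
            j += 1
        edges.append((kmers[i][:-1], kmers[i][1:], j - i))
        i = j
    return edges

# Example: de_bruijn_edges(["ACGTACG"], 3) yields the edges
# (ACG->CGT), (CGT->GTA), (GTA->TAC), (TAC->ACG), each with multiplicity 1.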
We discuss the design and implementation of new, highly scalable distributed-memory parallel algorithms for two prototypical graph problems, edge-weighted matching and distance-1 vertex coloring. Graph algorithms in general have low concurrency, poor data locality, and a high ratio of data-access to computation costs, making it challenging to achieve scalability on massively parallel machines. We overcome this challenge by employing a variety of techniques, including speculation and iteration, optimized communication, and randomization. We present preliminary results of weak and strong scalability studies conducted on an IBM Blue Gene/P machine employing up to tens of thousands of processors. The results show that the algorithms hold strong potential for computing at petascale.
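The "speculation and iteration" technique mentioned here can be illustrated with a minimal shared-memory sketch of distance-1 coloring; the paper's distributed-memory version with optimized communication and randomization is not reproduced. Each round colors all pending vertices optimistically against a snapshot of the previous colors, then recolors only one endpoint of every conflicting edge. Vertex ids are assumed comparable and the adjacency symmetric.

def speculative_coloring(adj):
    """Iterative speculative distance-1 coloring.

    adj: dict mapping each vertex id to the set of its neighbours.
    Returns a dict vertex -> colour (non-negative int).
    """
    color = {v: -1 for v in adj}
    pending = set(adj)
    while pending:
        # Phase 1 (speculation): every pending vertex picks, "simultaneously",
        # the smallest colour unused by its neighbours' previous colours.
        snapshot = dict(color)
        for v in pending:
            used = {snapshot[u] for u in adj[v]}
            c = 0
            while c in used:
                c += 1
            color[v] = c
        # Phase 2 (conflict resolution): for each improperly coloured edge,
        # the endpoint with the smaller id is recoloured in the next round.
        conflicts = set()
        for v in pending:
            for u in adj[v]:
                if color[u] == color[v] and v < u:
                    conflicts.add(v)
        pending = conflicts
    return color

# Example: speculative_coloring({0: {1, 2}, 1: {0, 2}, 2: {0, 1}})
# converges to a proper 3-colouring of the triangle after a few rounds.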
The article presents an algorithmic model of sound propagation in rooms designed to run on parallel and distributed computer systems. This algorithm is used by the authors in the implementation of an adaptable high-performance computer system that simulates various fields and scales over an arbitrary number of parallel central and graphics processors as well as distributed computer clusters. Many general-purpose computer simulation systems have limited usability for high-precision simulation involving large numbers of elementary computations because they lack scalability on various parallel and distributed platforms. The higher the required adequacy of the model, the larger the number of steps in the simulation algorithms. Scalability permits the use of hybrid parallel computer systems and improves the efficiency of the simulation with respect to adequacy, time consumption, and total simulation cost. The report covers such an algorithm, which is based on an approximate superposition of acoustic fields and provides adequate results as long as the acoustic equations used are linear. The algorithm represents reflecting surfaces as sets of vibrating pistons and uses the Rayleigh integral to calculate their scattering properties. The article also provides a parallel form of the algorithm and an analysis of its properties in parallel and sequential forms.
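A minimal sketch of the piston-superposition step, assuming a monochromatic field and the standard Rayleigh first integral: each surface patch acts as a vibrating piston with a complex normal velocity, and the pressure at a receiver is the sum of the patch contributions. The discretization, reflection handling, and parallel decomposition of the article are not reproduced; rho and c default to rough values for air.

import numpy as np

def rayleigh_pressure(receiver, piston_centers, piston_areas, piston_velocities,
                      k, rho=1.21, c=343.0):
    """Pressure at 'receiver' as a superposition of vibrating pistons
    (discretized Rayleigh first integral, single frequency).

    receiver:          (3,) point in space
    piston_centers:    (N, 3) patch centres on the reflecting surfaces
    piston_areas:      (N,)   patch areas
    piston_velocities: (N,)   complex normal velocities of the patches
    k:                 wavenumber omega / c
    """
    omega = k * c
    r = np.linalg.norm(piston_centers - receiver, axis=1)        # distances
    contrib = piston_velocities * piston_areas * np.exp(-1j * k * r) / r
    return 1j * omega * rho / (2.0 * np.pi) * np.sum(contrib)

# Because the acoustic equations are linear, the contributions of different
# pistons can be summed independently, which is what makes the field easy
# to split across parallel and distributed processors.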
ISBN (print): 9781509028153
Hashing algorithms are widely used in the information security area. Having studied the characteristics of traditional cryptographic hash functions and considered the features of a multi-core cryptographic processor, this paper proposes a parallel algorithm for hash computation well suited to multi-core cryptographic processors. The algorithm breaks the chain dependencies of the standard hash function by implementing a recursive hash, yielding a faster hash implementation. We discuss the theoretical foundation for our mapping framework, including security and performance measures. The experiments are performed on a PC with a PCIe card containing a multi-core cryptographic processor as the cipher processing engine. The results show a performance gain of approximately 7.8x when running on the 8-core cryptographic processor.
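A minimal sketch of the chain-breaking idea, under the assumption that the recursive hash works like a two-level tree hash: the message is split into chunks, the chunks are hashed independently (one per core), and the concatenated digests are hashed once more. SHA-256 from hashlib stands in for the cryptographic processor's primitive; the paper's actual construction, padding, and security analysis are not reproduced, and the result differs from a plain SHA-256 of the message.

import hashlib
from concurrent.futures import ProcessPoolExecutor

def _leaf_hash(chunk: bytes) -> bytes:
    return hashlib.sha256(chunk).digest()

def recursive_hash(message: bytes, cores: int = 8, chunk_size: int = 1 << 20) -> bytes:
    """Two-level tree hash: independent leaf hashes, then one root hash.

    Splitting the message removes the chaining dependency of a standard
    Merkle-Damgard hash, so the leaves can be computed on separate cores.
    """
    chunks = [message[i:i + chunk_size] for i in range(0, len(message), chunk_size)]
    if not chunks:
        chunks = [b""]
    with ProcessPoolExecutor(max_workers=cores) as pool:
        leaves = list(pool.map(_leaf_hash, chunks))
    return hashlib.sha256(b"".join(leaves)).digest()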
We present a parallel time-domain simulator to solve the acoustic wave equation for large acoustic spaces on a distributed memory architecture. Our formulation is based on the adaptive rectangular decomposition (ARD) algorithm, which performs acoustic wave propagation in three dimensions for homogeneous media. We propose an efficient parallelization of the different stages of the ARD pipeline; using a novel load balancing scheme and overlapping communication with computation, we achieve scalable performance on distributed memory architectures. Our solver can handle the full frequency range of human hearing (20 Hz-20 kHz) and scenes with volumes of thousands of cubic meters. We highlight the performance of our parallel simulator on a CPU cluster with up to a thousand cores and terabytes of memory. To the best of our knowledge, this is the fastest time-domain simulator for acoustic wave propagation in large, complex 3D scenes such as outdoor or architectural environments. (C) 2015 Published by Elsevier Ltd.
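As a rough illustration of one possible load-balancing step of the kind mentioned above (not the paper's actual scheme), assume the cost of each rectangular ARD partition is proportional to its number of grid cells and assign partitions to ranks greedily, largest first, always to the currently least-loaded rank. The overlap of communication with computation is not shown.

import heapq

def assign_partitions(partition_cells, num_ranks):
    """Greedy longest-processing-time assignment of ARD partitions to ranks.

    partition_cells: number of grid cells of each rectangular partition,
                     used as a proxy for its per-timestep update cost.
    Returns a list mapping partition index -> rank.
    """
    loads = [(0, r) for r in range(num_ranks)]   # min-heap of (load, rank)
    heapq.heapify(loads)
    owner = [0] * len(partition_cells)
    # Place expensive partitions first so small ones can fill the gaps.
    for idx in sorted(range(len(partition_cells)),
                      key=lambda i: partition_cells[i], reverse=True):
        load, rank = heapq.heappop(loads)
        owner[idx] = rank
        heapq.heappush(loads, (load + partition_cells[idx], rank))
    return owner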
ISBN (print): 9781509021413
In 2011 we published a practical algorithm for short division (division of a multiple-precision dividend by a single-precision divisor) on a parallel processor (HiPC 2011) with a run time of O(n/p + log p). Our algorithm, based on parallel computation of remainder sequences, improves on Takahashi's earlier work (LSSC 2007), which has a run time of O((n/p) log p). Here we prove that Omega(n/p + log p) is a tight lower bound for short division (using a conventional fixed-radix number system) on EREW and CREW PRAMs when the divisor d is not simply a power of two. The proof is based on an application of Cook, Dwork, and Reischuk's work on Boolean function complexity. The result is especially significant because it establishes a novel tight lower bound for two fundamental arithmetic operations, short division and division by a fixed constant, on an important class of parallel machines.
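To show why remainder sequences parallelize, here is a sketch (sequential Python, just exposing the data dependencies, not the PRAM algorithm or the lower-bound argument): with dividend digits in radix B, most significant first, the prefix remainders r[i] = (digits 0..i as a number) mod d can be computed with a parallel prefix/scan over precomputed powers of B mod d, and each quotient digit then depends only on the preceding prefix remainder.

def short_division(digits, d, B=2**32):
    """Short division via prefix remainders.

    digits: dividend digits in radix B, most significant first.
    Returns (quotient_digits, remainder).
    """
    n = len(digits)
    r = [0] * n
    acc = 0
    for i in range(n):
        # In the parallel setting this loop becomes a prefix scan,
        # which is where the O(n/p + log p) run time comes from.
        acc = (acc * B + digits[i]) % d
        r[i] = acc
    # Once r[] is known, every quotient digit can be produced independently.
    q = [0] * n
    for i in range(n):
        prev = r[i - 1] if i > 0 else 0
        q[i] = (prev * B + digits[i]) // d
    return q, r[n - 1]

# Example: short_division([7, 3], 5, B=10) -> ([1, 4], 3), since 73 = 5*14 + 3.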
Computing problems that handle large amounts of data necessitate the use of lossless data compression for efficient storage and transmission. We present a novel lossless universal data compression algorithm that uses parallel computational units to increase the throughput. The length-N input sequence is partitioned into B blocks. Processing each block independently of the other blocks can accelerate the computation by a factor of B but degrades the compression quality. Instead, our approach is to first estimate the minimum description length (MDL) context tree source underlying the entire input, and then encode each of the B blocks in parallel based on the MDL source. With this two-pass approach, the compression loss incurred by using more parallel units is insignificant. Our algorithm is work-efficient, i.e., its computational complexity is O(N/B). Its redundancy is approximately B log(N/B) bits above Rissanen's lower bound on universal compression performance, with respect to any context tree source whose maximal depth is at most log(N/B). We improve the compression by using different quantizers for states of the context tree based on the number of symbols corresponding to those states. Numerical results from a prototype implementation suggest that our algorithm offers a better trade-off between compression and throughput than competing universal data compression algorithms.
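A minimal sketch of the two-pass control flow only, with heavy simplification: the MDL context-tree estimation and arithmetic coding are replaced here by a toy order-0 model and ideal code lengths, purely to show why fixing one shared model in pass 1 makes the B blocks independent in pass 2. All names are illustrative.

from collections import Counter
from concurrent.futures import ProcessPoolExecutor
import math

def _encode_block(args):
    block, probs = args
    # Toy "encoder": ideal code length (in bits) under the shared model,
    # standing in for arithmetic coding driven by the MDL context-tree source.
    return sum(-math.log2(probs[s]) for s in block)

def two_pass_parallel_compress(data: bytes, num_blocks: int):
    """Pass 1: fit one model on the whole input.
       Pass 2: encode every block independently under that fixed model."""
    if not data:
        return 0.0
    counts = Counter(data)                              # pass 1 (model fit)
    total = len(data)
    probs = {s: c / total for s, c in counts.items()}
    size = -(-total // num_blocks)                      # ceiling division
    blocks = [data[i:i + size] for i in range(0, total, size)]
    with ProcessPoolExecutor() as pool:                 # pass 2 (parallel)
        lengths = list(pool.map(_encode_block, [(b, probs) for b in blocks]))
    return sum(lengths)                                 # total code length in bits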
We present the development of a scalable parallel algorithm and solver for computational electromagnetics based on a double higher order method of moments in the surface integral equation formulation in conjunction with a direct hierarchically semiseparable structures solver. Multiscale modeling using the new method, for electrically very large structures that also include electrically very small details, is discussed, with several advancement strategies.
ISBN (print): 9781479999897
With the emergence of GPU computing, deep neural networks have become a widely used technique for advancing research in the field of image and speech processing. In the context of object and event detection, sliding-window classifiers require choosing the best among all positively discriminated candidate windows. In this paper, we introduce the first GPU-based non-maximum suppression (NMS) algorithm for embedded GPU architectures. The obtained results show that the proposed parallel algorithm reduces the NMS latency by a wide margin when compared to CPUs, even when clocking the GPU at 50% of its maximum frequency on an NVIDIA Tegra K1. In this paper, we show results for object detection in images. The proposed technique is directly applicable to speech segmentation tasks such as speaker diarization.
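For reference, the operation being parallelized is standard greedy NMS; a minimal CPU/NumPy sketch follows, with boxes given as (x1, y1, x2, y2) and a score per box. The embedded-GPU kernel and its parallel reduction structure from the paper are not reproduced.

import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression (reference CPU version).

    boxes:  (N, 4) array of (x1, y1, x2, y2)
    scores: (N,)   detection scores
    Returns indices of the kept boxes, highest score first.
    """
    order = np.argsort(scores)[::-1]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        # Intersection of the current best box with all remaining candidates.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[rest] - inter)
        order = rest[iou <= iou_threshold]   # suppress heavily overlapping boxes
    return keep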