This paper presents two approaches to parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors, and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by message passing, and the algorithm is therefore well suited to distributed-memory architectures. The second approach is designed for shared-memory machines. It parallelizes the perfusion process, during which individual processing units perform calculations on different vascular trees. Experiments performed on a computing cluster and on multicore machines show that both algorithms provide a significant speedup.
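To fix ideas for the first approach, here is a minimal Python sketch in which tissue parts are scattered among MPI ranks and each rank perfuses its share against every tree. It assumes mpi4py; perfuse and the tree list are hypothetical stand-ins for the actual vascular model, not the paper's implementation.

    from mpi4py import MPI

    def perfuse(part, tree):
        # hypothetical placeholder for attaching a tissue part to a vascular tree
        return (part, tree)

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    trees = ["arterial", "venous"]                    # all vascular trees
    parts = None
    if rank == 0:
        new_parts = list(range(100))                  # new tissue parts this growth step
        parts = [new_parts[i::size] for i in range(size)]
    chunk = comm.scatter(parts, root=0)               # message passing: distribute parts

    local = [perfuse(p, t) for p in chunk for t in trees]
    results = comm.gather(local, root=0)              # message passing: collect results
    if rank == 0:
        print(sum(len(r) for r in results), "perfusion results")

Run with, e.g., mpiexec -n 4 python script.py; each rank touches all trees but only its own tissue parts, matching the data distribution described above.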
ISBN:
(print) 9781538683842
High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed-length substrings of DNA sequences called k-mers. Counting k-mers is often accomplished via hashing, and distributed-memory k-mer counting algorithms for large datasets are bound by memory access and network communication. In this work, we present two optimized distributed parallel hash table techniques that use cache-friendly algorithms for local hashing, overlap communication with computation to hide communication costs, and employ vectorized hash functions specialized for k-mer and other short-key indices. On 4096 cores of the NERSC Cori supercomputer, our implementation completed index construction and query on an approximately 1 TB human genome dataset in just 11.8 seconds and 5.8 seconds, demonstrating speedups of 2.06x and 3.7x, respectively, over the previous state-of-the-art distributed-memory k-mer counter.
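As a rough illustration of the core operation (not the paper's optimized implementation), single-node k-mer counting via hashing can be sketched in a few lines of Python; in a distributed setting, each k-mer would first be routed to a rank chosen by a deterministic hash before local counting.

    from collections import Counter

    def count_kmers(seq, k):
        # slide a window of length k over the sequence and hash-count substrings
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    print(count_kmers("ACGTACGTGACG", k=4).most_common(3))
    # distributed variant (sketch): dest = stable_hash(kmer) % nranks, then count locally;
    # Python's built-in hash() is salted per process, so a real code needs a stable hash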
We introduce Aquila-LCS, GPU- and CPU-optimized, object-oriented in-house codes for volumetric particle advection and 3D Finite-Time Lyapunov Exponent (FTLE) and Finite-Size Lyapunov Exponent (FSLE) computations. The purpose is to analyze 3D Lagrangian Coherent Structures (LCS) in large Direct Numerical Simulation (DNS) data. Our technique uses advanced search strategies for quick cell identification and efficient storage schemes. The solver scales effectively on both GPUs (up to 62 NVIDIA V100 GPUs) and multi-core CPUs (up to 32,768 CPU cores), tracking up to 8 billion particles. We apply our approach to turbulent boundary layers at different flow regimes and Reynolds numbers.
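The particle-advection kernel at the heart of such FTLE/FSLE computations can be sketched as follows; this is a toy single-node version, and the analytic velocity field is an assumption standing in for DNS data.

    import numpy as np

    def velocity(x, t):
        # toy 2D gyre-like field standing in for interpolated DNS velocities
        u = -np.pi * np.sin(np.pi * x[..., 0]) * np.cos(np.pi * x[..., 1])
        v =  np.pi * np.cos(np.pi * x[..., 0]) * np.sin(np.pi * x[..., 1])
        return np.stack([u, v], axis=-1)

    def rk4_step(x, t, dt):
        # classical 4th-order Runge-Kutta advection of all particles at once
        k1 = velocity(x, t)
        k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = velocity(x + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = velocity(x + dt * k3, t + dt)
        return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

    xs, ys = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
    pts = np.stack([xs, ys], axis=-1)          # seed a regular grid of particles
    for step in range(100):
        pts = rk4_step(pts, t=step * 0.01, dt=0.01)
    print(pts.shape)                           # final advected positions

The FTLE is then obtained from the gradient of the final positions with respect to the initial seed grid (the largest eigenvalue of the Cauchy-Green tensor), which is where the flow-map storage and cell-search costs arise at scale.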
We consider the problem of how to design and implement communication-efficient versions of parallel kernel support vector machines, a widely used classifier in statistical machine learning, for distributed-memory clusters and supercomputers. The main computational bottleneck is the training phase, in which a statistical model is built from an input data set. Prior to our study, the parallel isoefficiency of a state-of-the-art implementation scaled as W = Ω(P^3), where W is the problem size and P the number of processors; this scaling is worse than even a one-dimensional block-row dense matrix-vector multiplication, which has W = Ω(P^2). This study considers a series of algorithmic refinements, leading ultimately to a Communication-Avoiding SVM method that improves the isoefficiency to nearly W = Ω(P). We evaluate these methods on 96 to 1,536 processors and show average speedups of 3-16x (7x on average) over Dis-SMO and a 95 percent weak-scaling efficiency on six real-world datasets, with only modest losses in overall classification accuracy. The source code can be downloaded at [1].
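For readers unfamiliar with the isoefficiency metric used above, a brief sketch of the standard definition (textbook notation, not taken from the paper):

    % Isoefficiency: how fast the problem size W must grow with the
    % processor count P to keep parallel efficiency E constant.
    \[
      E \;=\; \frac{T_1}{P\,T_P} \;=\; \frac{W}{W + T_o(W,P)},
    \]
    % where T_o is the total overhead (communication, idling). Holding E
    % fixed requires W = \Omega(T_o(W,P)); the baseline SVM above needs
    % W = \Omega(P^3), a 1-D block-row dense matvec needs W = \Omega(P^2),
    % and the communication-avoiding method needs only about W = \Omega(P).

In other words, a lower isoefficiency exponent means the per-processor workload can stay fixed (or nearly so) as processors are added, which is what makes the weak-scaling result above possible.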
We consider the direct solution of general sparse linear systems based on a multifrontal method. The approach combines partial static scheduling of the task dependency graph during the symbolic factorization with distributed dynamic scheduling during the numerical factorization to balance the work among the processes of a distributed-memory computer. We show that to address clusters of Symmetric Multi-Processor (SMP) architectures, and more generally non-uniform memory access multiprocessors, our algorithms for both static and dynamic scheduling need to be revisited to account for the non-uniform cost of communication. Performance analysis on an IBM SP3 with 16 processors per SMP node and up to 128 processors shows that we can significantly reduce both the amount of inter-node communication and the solution time. (C) 2003 Elsevier Ltd. All rights reserved.
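A minimal sketch of the static half of such a scheme, assuming a small task dependency graph with estimated costs: ready tasks are assigned to the currently least-loaded process. The dynamic half, and the communication-cost awareness the paper argues for on SMP clusters, are omitted.

    import heapq

    deps = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}  # task -> prerequisites
    cost = {"A": 4.0, "B": 2.0, "C": 5.0, "D": 1.0}         # estimated work per task

    def static_schedule(deps, cost, nprocs):
        load = [(0.0, p) for p in range(nprocs)]            # min-heap of (load, process)
        heapq.heapify(load)
        placement, done = {}, set()
        while len(done) < len(deps):
            for t in deps:
                if t not in done and all(d in done for d in deps[t]):
                    w, p = heapq.heappop(load)              # least-loaded process
                    placement[t] = p
                    heapq.heappush(load, (w + cost[t], p))
                    done.add(t)
        return placement

    print(static_schedule(deps, cost, nprocs=2))

On a NUMA machine, the same greedy choice would additionally weight candidate processes by whether assigning a child to a different node than its parent incurs the more expensive inter-node communication, which is the adjustment the paper describes.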
ISBN:
(print) 9781479986484
We consider the problem of how to design and implement communication-efficient versions of parallel support vector machines, a widely used classifier in statistical machine learning, for distributed-memory clusters and supercomputers. The main computational bottleneck is the training phase, in which a statistical model is built from an input data set. Prior to our study, the parallel isoefficiency of a state-of-the-art implementation scaled as W = Ω(P^3), where W is the problem size and P the number of processors; this scaling is worse than even a one-dimensional block-row dense matrix-vector multiplication, which has W = Ω(P^2). This study considers a series of algorithmic refinements, leading ultimately to a Communication-Avoiding SVM (CA-SVM) method that improves the isoefficiency to nearly W = Ω(P). We evaluate these methods on 96 to 1536 processors and show average speedups of 3-16x (7x on average) over Dis-SMO and a 95% weak-scaling efficiency on six real-world datasets, with only modest losses in overall classification accuracy. The source code can be downloaded at [1].
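The one-dimensional block-row matrix-vector multiply used above as an isoefficiency baseline has a simple distributed form; a minimal mpi4py sketch (illustrative, not from the paper): each rank owns n/P rows of A, and the full input vector is assembled with an allgather.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, P = comm.Get_rank(), comm.Get_size()
    n = 8 * P                                   # global size, divisible by P
    rows = n // P

    A_local = np.random.rand(rows, n)           # this rank's block of rows
    x_local = np.random.rand(rows)              # this rank's slice of x
    x = np.empty(n)
    comm.Allgather(x_local, x)                  # O(n) communication per matvec
    y_local = A_local @ x                       # O(n^2 / P) local flops
    print(rank, y_local.shape)

Because each matvec moves O(n) words regardless of P while local work shrinks as n^2/P, keeping efficiency constant forces W = Ω(P^2), which is the comparison point in the abstract.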
ISBN:
(print) 9798350337662
A random projection tree, which partitions data points by projecting them onto random vectors, is widely used for approximate nearest neighbor search in high-dimensional space. We consider a particular case of random projection trees for constructing a k-nearest neighbor graph (KNNG) from high-dimensional data. We develop a distributed-memory Random Projection Tree (DRPT) algorithm that constructs sparse random projection trees and then runs a query on the forest to create the KNN graph. DRPT uses sparse matrix operations and a communication reduction scheme to scale KNN graph construction to thousands of processes on a supercomputer. The accuracy of DRPT is comparable to state-of-the-art methods for approximate nearest neighbor search, while it runs two orders of magnitude faster than its peers. DRPT is available at https://***/HipGraph/DRPT.
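The core random projection tree idea can be sketched on a single node as follows (illustrative only; DRPT's sparse projections and communication-reduction scheme are not shown): each internal node splits its points at the median of their projections onto a random direction.

    import numpy as np

    def build_rptree(points, idx, leaf_size=16, rng=np.random.default_rng(0)):
        if len(idx) <= leaf_size:
            return idx                                   # leaf: candidate neighbors
        r = rng.standard_normal(points.shape[1])         # random projection direction
        proj = points[idx] @ r
        cut = np.median(proj)
        left, right = idx[proj <= cut], idx[proj > cut]
        if len(left) == 0 or len(right) == 0:            # degenerate split: stop here
            return idx
        return (r, cut,
                build_rptree(points, left, leaf_size, rng),
                build_rptree(points, right, leaf_size, rng))

    X = np.random.default_rng(1).standard_normal((1000, 32))
    tree = build_rptree(X, np.arange(1000))

Points that land in the same leaf are candidate nearest neighbors; building a forest of such trees and merging the leaf-level candidates yields the approximate KNN graph.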
Matrix factorization is an efficient technique for disclosing latent features of real-world data. It finds application in areas such as text mining, image analysis, social networks and, more recently and popularly, recommendation systems. Alternating Least Squares (ALS), Stochastic Gradient Descent (SGD), and Coordinate Descent (CD) are among the methods commonly used to factorize large matrices. SGD-based factorization has proven to be the most successful of these methods since the Netflix and KDD Cup competitions, in which the winners' algorithms relied on SGD-based methods. Parallelization of SGD then became a hot topic and has been studied extensively in recent years. We focus on parallel SGD algorithms developed for shared-memory and distributed-memory systems. Shared-memory parallelizations include works such as HogWild, FPSGD, and MLGF-MF, and distributed-memory parallelizations include works such as DSGD, GASGD, and NOMAD. We present a survey containing an exhaustive analysis of these studies, and then focus particularly on DSGD, implementing it through the message-passing paradigm and testing its performance in terms of convergence and speedup. In contrast to existing works, our experiments use many real-world datasets that we produce from published raw data. We show that DSGD is a robust algorithm for large-scale datasets and achieves near-linear speedup with fast convergence rates.
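At the heart of DSGD and the other SGD parallelizations listed is the per-rating SGD update; a minimal single-node sketch with hypothetical dimensions and synthetic ratings (not the survey's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, k = 100, 80, 8
    U = rng.standard_normal((n_users, k)) * 0.1   # user latent factors
    V = rng.standard_normal((n_items, k)) * 0.1   # item latent factors
    ratings = [(u, i, 4.0) for u, i in zip(rng.integers(0, n_users, 500),
                                           rng.integers(0, n_items, 500))]

    lr, reg = 0.02, 0.05
    for epoch in range(10):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]                 # prediction error for this rating
            U[u], V[i] = (U[u] + lr * (err * V[i] - reg * U[u]),
                          V[i] + lr * (err * U[u] - reg * V[i]))

DSGD's contribution is the stratification around this update: the rating matrix is partitioned into blocks so that, within a sub-epoch, processes update disjoint row and column ranges concurrently and exchange factor blocks between sub-epochs, which is what makes a message-passing implementation natural.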
In this work, we introduce a scalable and efficient GPU-accelerated methodology for volumetric particle advection and finite-time Lyapunov exponent (FTLE) calculation, focusing on the analysis of Lagrangian coherent structures (LCS) in large-scale direct numerical simulation (DNS) datasets across incompressible, supersonic, and hypersonic flow regimes. LCS play a significant role in turbulent boundary layer analysis, and our proposed methodology offers valuable insights into their behavior under various flow conditions. Our novel owning-cell locator method enables efficient constant-time cell search, and the algorithm draws inspiration from classical search algorithms and modern multi-level approaches in numerical linear algebra. The proposed method is implemented for both multi-core CPUs and Nvidia GPUs, demonstrating strong scaling up to 32,768 CPU cores and up to 62 Nvidia V100 GPUs. By decoupling particle advection from other problems, we achieve modularity and extensibility, resulting in consistent parallel efficiency across different architectures. We applied our methodology to calculate and visualize the FTLE on four turbulent boundary layers at different Reynolds and Mach numbers, revealing that coherent structures grow more isotropic in proportion to the Mach number and that their inclination angle varies along the streamwise direction. We also observed increased anisotropy and FTLE organization at lower Reynolds numbers, with structures retaining coherency along both the spanwise and streamwise directions. Additionally, we demonstrated the impact of lower temporal-frequency sampling by upscaling with an efficient linear upsampler, preserving general trends with only 10% of the required storage. In summary, we present a particle search scheme for particle advection workloads, in the context of visualizing LCS via FTLE, that exhibits strong scaling performance and efficiency at scale. Our proposed algorithm is applicable across various domains requiring efficient search.
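On a uniform grid, a constant-time owning-cell lookup reduces to integer arithmetic; a minimal sketch under that assumption (the paper's locator also handles the general, non-uniform DNS grids that this sketch does not cover):

    import numpy as np

    x0 = np.array([0.0, 0.0, 0.0])               # grid origin
    dx = np.array([0.1, 0.1, 0.1])               # uniform cell spacing
    ncells = np.array([64, 64, 64])              # cells per direction

    def owning_cell(p):
        ijk = np.floor((p - x0) / dx).astype(int)  # O(1): no search required
        return np.clip(ijk, 0, ncells - 1)         # clamp particles at the boundary

    print(owning_cell(np.array([0.55, 1.23, 3.99])))

On stretched or curvilinear grids the division no longer applies directly, which is why the abstract's locator combines this idea with classical search and multi-level techniques.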