This paper presents two approaches to parallel modeling of vascular system development in internal organs. In the first approach, new parts of tissue are distributed among processors, and each processor is responsible for perfusing its assigned parts of tissue to all vascular trees. Communication between processors is accomplished by message passing, and the algorithm is therefore well suited to distributed-memory architectures. The second approach is designed for shared-memory machines. It parallelizes the perfusion process, during which individual processing units perform calculations on different vascular trees. Experiments performed on a computing cluster and on multicore machines show that both algorithms provide a significant speedup.
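To fix ideas for the first approach, here is a minimal Python sketch in which tissue parts are scattered among MPI ranks and each rank perfuses its share against every tree. It assumes mpi4py; perfuse and the tree list are hypothetical stand-ins for the actual vascular model, not the paper's implementation.

    from mpi4py import MPI

    def perfuse(part, tree):
        # hypothetical placeholder for attaching a tissue part to a vascular tree
        return (part, tree)

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    trees = ["arterial", "venous"]                    # all vascular trees
    parts = None
    if rank == 0:
        new_parts = list(range(100))                  # new tissue parts this growth step
        parts = [new_parts[i::size] for i in range(size)]
    chunk = comm.scatter(parts, root=0)               # message passing: distribute parts

    local = [perfuse(p, t) for p in chunk for t in trees]
    results = comm.gather(local, root=0)              # message passing: collect results
    if rank == 0:
        print(sum(len(r) for r in results), "perfusion results")

Run with, e.g., mpiexec -n 4 python script.py; each rank touches all trees but only its own tissue parts, matching the data distribution described above.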
ISBN:
(print) 9781538683842
High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed-length substrings of DNA sequences called k-mers. Counting k-mers is often accomplished via hashing, and distributed-memory k-mer counting algorithms for large datasets are bound by memory access and network communication. In this work, we present two optimized distributed parallel hash table techniques that use cache-friendly algorithms for local hashing, overlap communication with computation to hide communication costs, and employ vectorized hash functions specialized for k-mer and other short-key indices. On 4096 cores of the NERSC Cori supercomputer, our implementation completed index construction and query on an approximately 1 TB human genome dataset in just 11.8 seconds and 5.8 seconds, demonstrating speedups of 2.06x and 3.7x, respectively, over the previous state-of-the-art distributed-memory k-mer counter.
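As a rough illustration of the core operation (not the paper's optimized implementation), single-node k-mer counting via hashing can be sketched in a few lines of Python; in a distributed setting, each k-mer would first be routed to a rank chosen by a deterministic hash before local counting.

    from collections import Counter

    def count_kmers(seq, k):
        # slide a window of length k over the sequence and hash-count substrings
        return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

    print(count_kmers("ACGTACGTGACG", k=4).most_common(3))
    # distributed variant (sketch): dest = stable_hash(kmer) % nranks, then count locally;
    # Python's built-in hash() is salted per process, so a real code needs a stable hash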
We introduce Aquila-LCS, GPU- and CPU-optimized, object-oriented in-house codes for volumetric particle advection and 3D Finite-Time Lyapunov Exponent (FTLE) and Finite-Size Lyapunov Exponent (FSLE) computations. The purpose is to analyze 3D Lagrangian Coherent Structures (LCS) in large Direct Numerical Simulation (DNS) data. Our technique uses advanced search strategies for quick cell identification and efficient storage schemes. The solver scales effectively on both GPUs (up to 62 NVIDIA V100 GPUs) and multi-core CPUs (up to 32,768 CPU cores), tracking up to 8 billion particles. We apply our approach to turbulent boundary layers at different flow regimes and Reynolds numbers.
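The particle-advection kernel at the heart of such FTLE/FSLE computations can be sketched as follows; this is a toy single-node version, and the analytic velocity field is an assumption standing in for DNS data.

    import numpy as np

    def velocity(x, t):
        # toy 2D gyre-like field standing in for interpolated DNS velocities
        u = -np.pi * np.sin(np.pi * x[..., 0]) * np.cos(np.pi * x[..., 1])
        v =  np.pi * np.cos(np.pi * x[..., 0]) * np.sin(np.pi * x[..., 1])
        return np.stack([u, v], axis=-1)

    def rk4_step(x, t, dt):
        # classical 4th-order Runge-Kutta advection of all particles at once
        k1 = velocity(x, t)
        k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)
        k3 = velocity(x + 0.5 * dt * k2, t + 0.5 * dt)
        k4 = velocity(x + dt * k3, t + dt)
        return x + dt / 6.0 * (k1 + 2 * k2 + 2 * k3 + k4)

    xs, ys = np.meshgrid(np.linspace(0, 1, 64), np.linspace(0, 1, 64))
    pts = np.stack([xs, ys], axis=-1)          # seed a regular grid of particles
    for step in range(100):
        pts = rk4_step(pts, t=step * 0.01, dt=0.01)
    print(pts.shape)                           # final advected positions

The FTLE is then obtained from the gradient of the final positions with respect to the initial seed grid (the largest eigenvalue of the Cauchy-Green tensor), which is where the flow-map storage and cell-search costs arise at scale.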
We consider the problem of how to design and implement communication-efficient versions of parallel kernel support vector machines, a widely used classifier in statistical machine learning, for distributed-memory clusters and supercomputers. The main computational bottleneck is the training phase, in which a statistical model is built from an input data set. Prior to our study, the parallel isoefficiency of a state-of-the-art implementation scaled as W = Ω(P^3), where W is the problem size and P the number of processors; this scaling is worse than even a one-dimensional block-row dense matrix-vector multiplication, which has W = Ω(P^2). This study considers a series of algorithmic refinements, leading ultimately to a Communication-Avoiding SVM method that improves the isoefficiency to nearly W = Ω(P). We evaluate these methods on 96 to 1,536 processors and show average speedups of 3-16x (7x on average) over Dis-SMO and a 95 percent weak-scaling efficiency on six real-world datasets, with only modest losses in overall classification accuracy. The source code can be downloaded at [1].
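For readers unfamiliar with the isoefficiency metric used above, a brief sketch of the standard definition (textbook notation, not taken from the paper):

    % Isoefficiency: how fast the problem size W must grow with the
    % processor count P to keep parallel efficiency E constant.
    \[
      E \;=\; \frac{T_1}{P\,T_P} \;=\; \frac{W}{W + T_o(W,P)},
    \]
    % where T_o is the total overhead (communication, idling). Holding E
    % fixed requires W = \Omega(T_o(W,P)); the baseline SVM above needs
    % W = \Omega(P^3), a 1-D block-row dense matvec needs W = \Omega(P^2),
    % and the communication-avoiding method needs only about W = \Omega(P).

In other words, a lower isoefficiency exponent means the per-processor workload can stay fixed (or nearly so) as processors are added, which is what makes the weak-scaling result above possible.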
We consider the direct solution of general sparse linear systems based on a multifrontal method. The approach combines partial static scheduling of the task dependency graph during the symbolic factorization with distributed dynamic scheduling during the numerical factorization to balance the work among the processes of a distributed-memory computer. We show that to address clusters of Symmetric Multi-Processor (SMP) architectures, and more generally non-uniform memory access multiprocessors, our algorithms for both static and dynamic scheduling need to be revisited to account for the non-uniform cost of communication. Performance analysis on an IBM SP3 with 16 processors per SMP node and up to 128 processors shows that we can significantly reduce both the amount of inter-node communication and the solution time. (C) 2003 Elsevier Ltd. All rights reserved.
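A minimal sketch of the static half of such a scheme, assuming a small task dependency graph with estimated costs: ready tasks are assigned to the currently least-loaded process. The dynamic half, and the communication-cost awareness the paper argues for on SMP clusters, are omitted.

    import heapq

    deps = {"A": [], "B": [], "C": ["A", "B"], "D": ["C"]}  # task -> prerequisites
    cost = {"A": 4.0, "B": 2.0, "C": 5.0, "D": 1.0}         # estimated work per task

    def static_schedule(deps, cost, nprocs):
        load = [(0.0, p) for p in range(nprocs)]            # min-heap of (load, process)
        heapq.heapify(load)
        placement, done = {}, set()
        while len(done) < len(deps):
            for t in deps:
                if t not in done and all(d in done for d in deps[t]):
                    w, p = heapq.heappop(load)              # least-loaded process
                    placement[t] = p
                    heapq.heappush(load, (w + cost[t], p))
                    done.add(t)
        return placement

    print(static_schedule(deps, cost, nprocs=2))

On a NUMA machine, the same greedy choice would additionally weight candidate processes by whether assigning a child to a different node than its parent incurs the more expensive inter-node communication, which is the adjustment the paper describes.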
ISBN:
(print) 9781479986484
We consider the problem of how to design and implement communication-efficient versions of parallel support vector machines, a widely used classifier in statistical machine learning, for distributed-memory clusters and supercomputers. The main computational bottleneck is the training phase, in which a statistical model is built from an input data set. Prior to our study, the parallel isoefficiency of a state-of-the-art implementation scaled as W = Ω(P^3), where W is the problem size and P the number of processors; this scaling is worse than even a one-dimensional block-row dense matrix-vector multiplication, which has W = Ω(P^2). This study considers a series of algorithmic refinements, leading ultimately to a Communication-Avoiding SVM (CA-SVM) method that improves the isoefficiency to nearly W = Ω(P). We evaluate these methods on 96 to 1536 processors and show average speedups of 3-16x (7x on average) over Dis-SMO and a 95% weak-scaling efficiency on six real-world datasets, with only modest losses in overall classification accuracy. The source code can be downloaded at [1].
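The one-dimensional block-row matrix-vector multiply used above as an isoefficiency baseline has a simple distributed form; a minimal mpi4py sketch (illustrative, not from the paper): each rank owns n/P rows of A, and the full input vector is assembled with an allgather.

    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, P = comm.Get_rank(), comm.Get_size()
    n = 8 * P                                   # global size, divisible by P
    rows = n // P

    A_local = np.random.rand(rows, n)           # this rank's block of rows
    x_local = np.random.rand(rows)              # this rank's slice of x
    x = np.empty(n)
    comm.Allgather(x_local, x)                  # O(n) communication per matvec
    y_local = A_local @ x                       # O(n^2 / P) local flops
    print(rank, y_local.shape)

Because each matvec moves O(n) words regardless of P while local work shrinks as n^2/P, keeping efficiency constant forces W = Ω(P^2), which is the comparison point in the abstract.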
ISBN:
(print) 9798350337662
A random projection tree, which partitions data points by projecting them onto random vectors, is widely used for approximate nearest neighbor search in high-dimensional space. We consider a particular case of random projection trees for constructing a k-nearest neighbor graph (KNNG) from high-dimensional data. We develop a distributed-memory Random Projection Tree (DRPT) algorithm that constructs sparse random projection trees and then runs a query on the forest to create the KNN graph. DRPT uses sparse matrix operations and a communication reduction scheme to scale KNN graph construction to thousands of processes on a supercomputer. The accuracy of DRPT is comparable to state-of-the-art methods for approximate nearest neighbor search, while it runs two orders of magnitude faster than its peers. DRPT is available at https://***/HipGraph/DRPT.
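The core random projection tree idea can be sketched on a single node as follows (illustrative only; DRPT's sparse projections and communication-reduction scheme are not shown): each internal node splits its points at the median of their projections onto a random direction.

    import numpy as np

    def build_rptree(points, idx, leaf_size=16, rng=np.random.default_rng(0)):
        if len(idx) <= leaf_size:
            return idx                                   # leaf: candidate neighbors
        r = rng.standard_normal(points.shape[1])         # random projection direction
        proj = points[idx] @ r
        cut = np.median(proj)
        left, right = idx[proj <= cut], idx[proj > cut]
        if len(left) == 0 or len(right) == 0:            # degenerate split: stop here
            return idx
        return (r, cut,
                build_rptree(points, left, leaf_size, rng),
                build_rptree(points, right, leaf_size, rng))

    X = np.random.default_rng(1).standard_normal((1000, 32))
    tree = build_rptree(X, np.arange(1000))

Points that land in the same leaf are candidate nearest neighbors; building a forest of such trees and merging the leaf-level candidates yields the approximate KNN graph.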
Matrix factorization is an efficient technique for disclosing latent features of real-world data. It finds application in areas such as text mining, image analysis, social networks and, more recently and popularly, recommendation systems. Alternating Least Squares (ALS), Stochastic Gradient Descent (SGD), and Coordinate Descent (CD) are among the methods commonly used to factorize large matrices. SGD-based factorization has proven to be the most successful of these methods since the Netflix and KDD Cup competitions, in which the winners' algorithms relied on SGD-based methods. Parallelization of SGD then became a hot topic and has been studied extensively in recent years. We focus on parallel SGD algorithms developed for shared-memory and distributed-memory systems. Shared-memory parallelizations include works such as HogWild, FPSGD, and MLGF-MF, and distributed-memory parallelizations include works such as DSGD, GASGD, and NOMAD. We present a survey containing an exhaustive analysis of these studies, and then focus particularly on DSGD, implementing it through the message-passing paradigm and testing its performance in terms of convergence and speedup. In contrast to existing works, our experiments use many real-world datasets that we produce from published raw data. We show that DSGD is a robust algorithm for large-scale datasets and achieves near-linear speedup with fast convergence rates.
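At the heart of DSGD and the other SGD parallelizations listed is the per-rating SGD update; a minimal single-node sketch with hypothetical dimensions and synthetic ratings (not the survey's implementation):

    import numpy as np

    rng = np.random.default_rng(0)
    n_users, n_items, k = 100, 80, 8
    U = rng.standard_normal((n_users, k)) * 0.1   # user latent factors
    V = rng.standard_normal((n_items, k)) * 0.1   # item latent factors
    ratings = [(u, i, 4.0) for u, i in zip(rng.integers(0, n_users, 500),
                                           rng.integers(0, n_items, 500))]

    lr, reg = 0.02, 0.05
    for epoch in range(10):
        for u, i, r in ratings:
            err = r - U[u] @ V[i]                 # prediction error for this rating
            U[u], V[i] = (U[u] + lr * (err * V[i] - reg * U[u]),
                          V[i] + lr * (err * U[u] - reg * V[i]))

DSGD's contribution is the stratification around this update: the rating matrix is partitioned into blocks so that, within a sub-epoch, processes update disjoint row and column ranges concurrently and exchange factor blocks between sub-epochs, which is what makes a message-passing implementation natural.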
In this work, we introduce a scalable and efficient GPU-accelerated methodology for volumetric particle advection and finite-time Lyapunov exponent (FTLE) calculation, focusing on the analysis of Lagrangian coherent structures (LCS) in large-scale direct numerical simulation (DNS) datasets across incompressible, supersonic, and hypersonic flow regimes. LCS play a significant role in turbulent boundary layer analysis, and our proposed methodology offers valuable insights into their behavior under various flow conditions. Our novel owning-cell locator method enables efficient constant-time cell search, and the algorithm draws inspiration from classical search algorithms and modern multi-level approaches in numerical linear algebra. The proposed method is implemented for both multi-core CPUs and Nvidia GPUs, demonstrating strong scaling up to 32,768 CPU cores and up to 62 Nvidia V100 GPUs. By decoupling particle advection from other problems, we achieve modularity and extensibility, resulting in consistent parallel efficiency across different architectures. We applied our methodology to calculate and visualize the FTLE on four turbulent boundary layers at different Reynolds and Mach numbers, revealing that coherent structures grow more isotropic in proportion to the Mach number and that their inclination angle varies along the streamwise direction. We also observed increased anisotropy and FTLE organization at lower Reynolds numbers, with structures retaining coherency along both the spanwise and streamwise directions. Additionally, we demonstrated the impact of lower temporal-frequency sampling by upscaling with an efficient linear upsampler, preserving general trends with only 10% of the required storage. In summary, we present a particle search scheme for particle advection workloads, in the context of visualizing LCS via FTLE, that exhibits strong scaling performance and efficiency at scale. Our proposed algorithm is applicable across various domains requiring efficient search.
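On a uniform grid, a constant-time owning-cell lookup reduces to integer arithmetic; a minimal sketch under that assumption (the paper's locator also handles the general, non-uniform DNS grids that this sketch does not cover):

    import numpy as np

    x0 = np.array([0.0, 0.0, 0.0])               # grid origin
    dx = np.array([0.1, 0.1, 0.1])               # uniform cell spacing
    ncells = np.array([64, 64, 64])              # cells per direction

    def owning_cell(p):
        ijk = np.floor((p - x0) / dx).astype(int)  # O(1): no search required
        return np.clip(ijk, 0, ncells - 1)         # clamp particles at the boundary

    print(owning_cell(np.array([0.55, 1.23, 3.99])))

On stretched or curvilinear grids the division no longer applies directly, which is why the abstract's locator combines this idea with classical search and multi-level techniques.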