ISBN (digital): 9781665497862
ISBN (print): 9781665497862
Radar signal processing is a computationally intensive task, especially for high-bandwidth systems. Traditionally, such systems have relied on interleaving processing across multiple nodes of large compute clusters to achieve the necessary throughput. Developments in general-purpose GPU computing have led to a massive increase in the computational power available to highly parallel tasks, and most parts of the radar signal processing pipeline are well suited to such parallelism. This paper describes an algorithm for centroiding, a key part of the search radar pipeline that has not yet been demonstrated on a GPU. With this centroiding algorithm, the entire high-data-rate portion of the processing pipeline can be run on the GPU, yielding a speedup factor of approximately 40. The primary benefit of this approach is a massive reduction in data copying from the GPU to the CPU (a factor of over 1200 in this case), alleviating the main barrier to GPU-based radar processing systems.
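The paper's GPU implementation is not reproduced here, but centroiding itself reduces to grouping contiguous above-threshold cells and reporting the amplitude-weighted center of each group. A minimal sequential Python sketch (the function name and the 1-D, single-threshold setup are our assumptions, not the paper's):

```python
def centroid_detections(power, threshold):
    """Group contiguous above-threshold cells and return the
    amplitude-weighted centroid (in cell-index units) of each group."""
    centroids = []
    run = []  # (index, power) pairs of the current contiguous run
    for i, p in enumerate(power):
        if p > threshold:
            run.append((i, p))
        elif run:
            total = sum(p for _, p in run)
            centroids.append(sum(i * p for i, p in run) / total)
            run = []
    if run:  # flush a run that reaches the end of the profile
        total = sum(p for _, p in run)
        centroids.append(sum(i * p for i, p in run) / total)
    return centroids
```

On a GPU this per-run reduction is what must be parallelized without serial dependencies, which is why centroiding has historically been left on the CPU.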
ISBN (print): 9781665481069
Large shared-memory many-core nodes have become the norm in scientific computing, and the sparse linear solver stack must therefore adapt to the multilevel structure that exists on these nodes. One adaptation is the development of hybrid solvers at the node level. We present HTS, a hybrid threaded solver that provides a finer-grained algorithm to keep an increased number of threads actively working in these larger shared-memory environments without the overheads of message-passing implementations. Additionally, HTS aims to utilize the additional shared memory that may be available to improve performance, i.e., reducing iteration counts when used as a preconditioner and speeding up calculations. HTS is built around the Schur complement framework that many other hybrid solver packages already use. However, HTS uses a multilevel structure in dealing with the Schur complement and permits fill-in in certain off-diagonal submatrices to enable a faster and more accurate solve phase. These modifications allow a tasking thread library, namely Cilk, to be used to speed up performance while still reducing peak memory by more than 20% on average compared to an optimized direct factorization method. We show that HTS can outperform the MPI-based hybrid solver ShyLU on a suite of sparse matrices by as much as 2x, and that HTS scales well on three-dimensional finite difference problems.
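The Schur complement framework the solver is built around can be sketched in a few lines: eliminate a leading block of unknowns, solve the (smaller) Schur complement system, then back-substitute. A dense NumPy illustration of the algebra only (HTS itself is sparse, multilevel, and threaded):

```python
import numpy as np

def schur_solve(A, b, k):
    """Solve A x = b by eliminating the first k unknowns and solving
    the Schur complement system S x2 = b2 - A21 A11^{-1} b1."""
    A11, A12 = A[:k, :k], A[:k, k:]
    A21, A22 = A[k:, :k], A[k:, k:]
    b1, b2 = b[:k], b[k:]
    Y = np.linalg.solve(A11, A12)        # A11^{-1} A12
    y1 = np.linalg.solve(A11, b1)        # A11^{-1} b1
    S = A22 - A21 @ Y                    # Schur complement
    x2 = np.linalg.solve(S, b2 - A21 @ y1)
    x1 = y1 - Y @ x2                     # back-substitute for the interior
    return np.concatenate([x1, x2])
```

In a hybrid solver the A11 block is itself block-diagonal over subdomains, so the `A11` solves run independently per thread, and the fill-in policy mentioned above governs which entries of S are kept.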
Recent trends in high-performance computing and deep learning have led to the proliferation of studies on large-scale deep neural network training. However, the frequent communication requirements among computation nodes drastically slow the overall training speeds, which causes bottlenecks in distributed training, particularly in clusters with limited network bandwidths. To mitigate the drawbacks of distributed communications, researchers have proposed various optimization strategies. In this paper, we provide a comprehensive survey of communication strategies from both an algorithm viewpoint and a computer network perspective. Algorithm optimizations focus on reducing the communication volumes used in distributed training, while network optimizations focus on accelerating the communications between distributed devices. At the algorithm level, we describe how to reduce the number of communication rounds and transmitted bits per round. In addition, we elucidate how to overlap computation and communication. At the network level, we discuss the effects caused by network infrastructures, including logical communication schemes and network protocols. Finally, we extrapolate the potential future challenges and new research directions to accelerate communications for distributed deep neural network training.
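As a concrete example of reducing the transmitted bits per round, top-k gradient sparsification sends only the largest-magnitude gradient entries and keeps the rest as a local residual for error feedback on the next round. A minimal sketch (the function name and the (index, value) wire format are illustrative):

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude gradient entries for transmission;
    return (index, value) pairs to send plus the residual that stays
    local and is added back into the next round's gradient."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    sparse = [(i, grad[i]) for i in sorted(keep)]
    residual = [0.0 if i in keep else g for i, g in enumerate(grad)]
    return sparse, residual
```

Sending k pairs instead of the full dense vector trades a small accuracy perturbation (bounded by the error-feedback residual) for a large reduction in per-round traffic.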
3-D magnetotelluric (MT) forward modeling has long faced the twin problems of high memory requirements and long computing times. In this article, we design a scalable parallel algorithm for 3-D MT finite element modeling in anisotropic media. The parallel algorithm is based on distributed mesh storage, includes multiple parallel granularities, and is implemented with multiple tools. The Message Passing Interface (MPI) is used to exploit process parallelism over subdomains, frequencies, and equation solves. Thread parallelism for merge sorting, element analysis, matrix assembly, and imposing Dirichlet boundary conditions is developed with Open Multi-Processing (OpenMP). We validate the algorithm through several model simulations and study the effects of topography and conductivity anisotropy on apparent resistivities and phase responses. Scalability tests are performed on the Tianhe-2 supercomputer to analyze the parallel performance of the different parallel granularities. Three parallel direct solvers, SuperLU (Supernodal LU), MUMPS (MUltifrontal Massively parallel sparse direct Solver), and PaStiX (parallel Sparse matriX package), are compared for solving the sparse systems of equations. Based on these results, reasonable parallel parameters are suggested for practical applications. The developed parallel algorithm is shown to be efficient and scalable.
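The two-level decomposition (processes over frequencies or subdomains, threads over elements) can be mimicked in a toy Python sketch; here a scalar stands in for the assembled finite element matrix, Python threads stand in for OpenMP, and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, n):
    """Round-robin split of work items (frequencies, subdomains, or
    elements) into n bins, one per process or thread."""
    return [items[i::n] for i in range(n)]

def assemble_frequency(freq, elements, n_threads=4):
    """Toy per-frequency assembly: each thread 'analyzes' its share of
    the elements and the per-element contributions are summed."""
    def contrib(chunk):
        return sum(freq * e for e in chunk)  # stand-in for element analysis
    with ThreadPoolExecutor(n_threads) as pool:
        return sum(pool.map(contrib, partition(elements, n_threads)))
```

In the real code the outer `partition` over frequencies maps to MPI ranks, and the summation inside a frequency maps to OpenMP-threaded matrix assembly.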
ISBN (print): 9781510651098; 9781510651081
The objective of this paper is to present an efficient parallel implementation of an iterative compact high-order numerical solver for the 3D Helmholtz equation on multicore computers. The high-order parallel iterative algorithm combines a Krylov subspace-type method with a direct parallel fast Fourier transform (FFT)-type preconditioner from the authors' previous work (Ref. 7). We present results of the algorithm on simulated data with realistic parameter ranges for soil and mine-like targets. The algorithm incorporates second-, fourth-, and sixth-order compact finite difference schemes. The accuracy of the fourth- and sixth-order compact approximations is reported alongside the scalability of our implementation in the parallel programming environment.
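To illustrate what "fourth-order compact" buys: the compact relation (u_{i-1} - 2u_i + u_{i+1})/h^2 = (f_{i-1} + 10 f_i + f_{i+1})/12, with f = u'', has O(h^4) truncation error while using only a three-point stencil. A quick numerical check on u = sin(x) (our toy example, not the paper's solver):

```python
import math

def compact_residual(h, x=0.7):
    """Residual of the fourth-order compact relation
        (u_{i-1} - 2u_i + u_{i+1})/h^2 = (f_{i-1} + 10 f_i + f_{i+1})/12
    for u = sin (so f = u'' = -sin); it shrinks like O(h^4)."""
    u = [math.sin(x + k * h) for k in (-1, 0, 1)]
    f = [-math.sin(x + k * h) for k in (-1, 0, 1)]
    lhs = (u[0] - 2 * u[1] + u[2]) / h ** 2
    rhs = (f[0] + 10 * f[1] + f[2]) / 12
    return abs(lhs - rhs)
```

Halving h should shrink the residual by roughly 2^4 = 16, confirming fourth-order accuracy; the plain central difference (rhs replaced by f_i) would only give a factor of 4.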
The parallel alternating direction method of multipliers (ADMM) algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solv...
Sorting is a fundamental task in computing and plays a central role in information technology. The advent of rack-scale and warehouse-size data processing has shaped the architecture of data analysis platforms towards supercomputing. In turn, established techniques on supercomputers have become relevant to a wider range of application domains. This work is concerned with multi-way mergesort with exact splitting on distributed memory architectures. At its core, our approach leverages a novel parallel algorithm for multi-way selection problems. Remarkably concise, the algorithm relies on MPI_Allgather and MPI_Reduce_scatter_block, two collective communication schemes that find hardware support in most high-end networks. A software implementation of our approach is used to process the terabyte-size Data Challenge 2 signal released by the SKA radio telescope organization. On the supercomputer considered herein, our approach outperforms the state of the art by up to 2.6x using 9,216 cores. Our implementation is released as a compact open-source library compliant with the MPI programming model. By supporting the most popular elementary key types, and arbitrary fixed-size value types, the library can be straightforwardly integrated into third-party MPI-based software.
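Exact splitting hinges on multi-way selection: finding the key of a given global rank across several sorted runs without merging them. A sequential Python sketch of the idea, bisecting on the key value (integer keys assumed; the paper's parallel version replaces the per-run counting with collective reductions across ranks):

```python
from bisect import bisect_right

def multiway_select(runs, r):
    """Return the r-th smallest key (1-based) across several sorted runs
    of integers, by binary search on the key value: count how many
    elements are <= a candidate, and narrow until the rank matches."""
    lo = min(run[0] for run in runs if run)
    hi = max(run[-1] for run in runs if run)
    while lo < hi:
        mid = (lo + hi) // 2
        count = sum(bisect_right(run, mid) for run in runs)
        if count >= r:
            hi = mid       # the r-th smallest is <= mid
        else:
            lo = mid + 1   # it is strictly larger than mid
    return lo
```

Each selection costs O(k log n log V) comparisons over k runs, and no data moves until the split points are known, which is what makes the splitting "exact".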
The parallel alternating direction method of multipliers (ADMM) algorithms have gained popularity in statistics and machine learning due to their efficient handling of large sample data problems. However, the parallel...
In probabilistic state inference, we seek to estimate the state of an (autonomous) agent from noisy observations. It can be shown that, under certain assumptions, finding the estimate is equivalent to solving a linear least squares problem. Such a problem is solved by calculating the upper triangular matrix R from the coefficient matrix A, using the QR or Cholesky factorization; this matrix is commonly referred to as the "square root matrix". In sequential estimation problems, we are often interested in periodic optimization of the state variable order, e.g., to reduce fill-in or to apply a predictive variable ordering tactic; however, changing the variable order implies expensive re-factorization of the system. Thus, we address the problem of modifying an existing square root matrix R to convey reordering of the variables. To this end, we identify several conclusions regarding the effect of column permutation on the factorization, allowing efficient modification of R without accessing A at all, or with minimal re-factorization. The proposed parallelizable algorithm achieves a significant improvement in performance over the state-of-the-art incremental Smoothing And Mapping (iSAM2) algorithm, which utilizes incremental factorization to update R.
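A key identity behind such updates is A^T A = R^T R, so for a column permutation P the new factor of AP can be obtained by re-factoring the small triangular matrix RP instead of the tall matrix AP. A NumPy sketch of this identity (not the paper's algorithm, which further exploits the permutation's structure to avoid even this full QR):

```python
import numpy as np

def permute_R(R, perm):
    """Given R from A = QR and a column permutation, return the upper
    triangular factor of A[:, perm] without touching A: since
    (AP)^T (AP) = (RP)^T (RP), it suffices to re-factor R[:, perm]."""
    Rp = np.linalg.qr(R[:, perm], mode='r')
    # normalize signs so the diagonal is positive (QR is unique up to
    # row signs, and sign-fixing makes the factor comparable)
    s = np.sign(np.diag(Rp))
    s[s == 0] = 1.0
    return s[:, None] * Rp
```

Because R is n x n while A is m x n with m >> n in estimation problems, working on RP alone is already a large saving even before any incremental tricks.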
Multinomial logistic regression is a well-studied tool for classification and has been widely used in fields like image processing, computer vision, and bioinformatics, to name a few. Under a supervised classification scenario, a multinomial logistic regression model learns a weight vector to differentiate between any two classes by optimizing over the likelihood objective. With the advent of big data, the inundation of data has resulted in large-dimensional weight vectors and a huge number of classes, which makes the classical methods for model estimation computationally unviable. To handle this issue, we propose a parallel iterative algorithm, the Parallel Iterative Algorithm for MultiNomial LOgistic Regression (PIANO), which is based on the majorization-minimization procedure and can update each element of the weight vectors in parallel. Further, we show that PIANO can be easily extended to solve the sparse multinomial logistic regression problem, an extensively studied problem because of its attractive feature-selection property. In particular, we work out the extension of PIANO to the sparse multinomial logistic regression problem with ℓ1 and ℓ0 regularizations. We also prove that PIANO converges to a stationary point of the multinomial and the sparse multinomial logistic regression problems. Simulations were conducted to compare PIANO with existing methods, and the proposed algorithm was found to perform better in terms of speed of convergence.
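PIANO's MM surrogate is not reproduced here, but the objective it minimizes is the multinomial logistic negative log-likelihood. A plain-gradient-descent sketch on that objective (our illustration of the likelihood and its gradient, not the authors' update rule):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def nll(W, X, Y):
    """Multinomial logistic negative log-likelihood; rows of X are
    samples, columns of W are per-class weight vectors, Y is one-hot."""
    return -np.sum(Y * np.log(softmax(X @ W) + 1e-12))

def gradient_step(W, X, Y, lr=0.1):
    """One plain gradient step; PIANO instead minimizes an MM surrogate
    whose coordinates decouple, so every weight entry updates in parallel."""
    return W - lr * X.T @ (softmax(X @ W) - Y)
```

The separability PIANO exploits is exactly what this monolithic gradient step lacks: each entry of the surrogate can be minimized independently, which is what makes element-wise parallel updates possible.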