ISBN (digital): 9781665497862
ISBN (print): 9781665497862
Radar signal processing is a computationally intensive task, especially for high-bandwidth systems. Traditionally, such systems have relied on interleaving processing across multiple nodes of large compute clusters to achieve the necessary throughput. Developments in general-purpose GPU computing have led to a massive increase in the computational power available to highly parallel tasks, and most parts of the radar signal processing pipeline are well suited to such parallelism. This paper describes an algorithm for centroiding, a key part of the search radar pipeline that has not yet been demonstrated on a GPU. With this centroiding algorithm, the entire high-data-rate portion of the processing pipeline can be run on the GPU, yielding a speedup factor of approximately 40. The primary benefit of this approach is a massive reduction in data copying from the GPU to the CPU (a factor of over 1200 in this case), alleviating the main barrier to GPU-based radar processing systems.
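The paper's GPU implementation is not reproduced here, but centroiding itself reduces to grouping contiguous above-threshold cells and reporting the amplitude-weighted center of each group. A minimal sequential Python sketch (the function name and the 1-D, single-threshold setup are our assumptions, not the paper's):

```python
def centroid_detections(power, threshold):
    """Group contiguous above-threshold cells and return the
    amplitude-weighted centroid (in cell-index units) of each group."""
    centroids = []
    run = []  # (index, power) pairs of the current contiguous run
    for i, p in enumerate(power):
        if p > threshold:
            run.append((i, p))
        elif run:
            total = sum(p for _, p in run)
            centroids.append(sum(i * p for i, p in run) / total)
            run = []
    if run:  # flush a run that reaches the end of the profile
        total = sum(p for _, p in run)
        centroids.append(sum(i * p for i, p in run) / total)
    return centroids
```

On a GPU this per-run reduction is what must be parallelized without serial dependencies, which is why centroiding has historically been left on the CPU.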
ISBN (print): 9781665481069
Large shared-memory many-core nodes have become the norm in scientific computing, and the sparse linear solver stack must therefore adapt to the multilevel structure that exists on these nodes. One adaptation is the development of hybrid solvers at the node level. We present HTS, a hybrid threaded solver that provides a finer-grained algorithm to keep an increased number of threads actively working in these larger shared-memory environments without the overheads of message-passing implementations. Additionally, HTS aims to utilize the additional shared memory that may be available to improve performance, i.e., reducing iteration counts when used as a preconditioner and speeding up calculations. HTS is built around the Schur complement framework that many other hybrid solver packages already use. However, HTS uses a multilevel structure in dealing with the Schur complement and permits fill-in in certain off-diagonal submatrices to enable a faster and more accurate solve phase. These modifications allow a tasking thread library, namely Cilk, to be used to speed up performance while still reducing peak memory by more than 20% on average compared to an optimized direct factorization method. We show that HTS can outperform the MPI-based hybrid solver ShyLU on a suite of sparse matrices by as much as 2x, and that HTS scales well on three-dimensional finite difference problems.
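The Schur complement framework the solver is built around can be sketched in a few lines: eliminate a leading block of unknowns, solve the (smaller) Schur complement system, then back-substitute. A dense NumPy illustration of the algebra only (HTS itself is sparse, multilevel, and threaded):

```python
import numpy as np

def schur_solve(A, b, k):
    """Solve A x = b by eliminating the first k unknowns and solving
    the Schur complement system S x2 = b2 - A21 A11^{-1} b1."""
    A11, A12 = A[:k, :k], A[:k, k:]
    A21, A22 = A[k:, :k], A[k:, k:]
    b1, b2 = b[:k], b[k:]
    Y = np.linalg.solve(A11, A12)        # A11^{-1} A12
    y1 = np.linalg.solve(A11, b1)        # A11^{-1} b1
    S = A22 - A21 @ Y                    # Schur complement
    x2 = np.linalg.solve(S, b2 - A21 @ y1)
    x1 = y1 - Y @ x2                     # back-substitute for the interior
    return np.concatenate([x1, x2])
```

In a hybrid solver the A11 block is itself block-diagonal over subdomains, so the `A11` solves run independently per thread, and the fill-in policy mentioned above governs which entries of S are kept.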
Recent trends in high-performance computing and deep learning have led to the proliferation of studies on large-scale deep neural network training. However, the frequent communication requirements among computation nodes drastically slow the overall training speeds, which causes bottlenecks in distributed training, particularly in clusters with limited network bandwidths. To mitigate the drawbacks of distributed communications, researchers have proposed various optimization strategies. In this paper, we provide a comprehensive survey of communication strategies from both an algorithm viewpoint and a computer network perspective. Algorithm optimizations focus on reducing the communication volumes used in distributed training, while network optimizations focus on accelerating the communications between distributed devices. At the algorithm level, we describe how to reduce the number of communication rounds and transmitted bits per round. In addition, we elucidate how to overlap computation and communication. At the network level, we discuss the effects caused by network infrastructures, including logical communication schemes and network protocols. Finally, we extrapolate the potential future challenges and new research directions to accelerate communications for distributed deep neural network training.
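As a concrete example of reducing the transmitted bits per round, top-k gradient sparsification sends only the largest-magnitude gradient entries and keeps the rest as a local residual for error feedback on the next round. A minimal sketch (the function name and the (index, value) wire format are illustrative):

```python
def topk_sparsify(grad, k):
    """Keep the k largest-magnitude gradient entries for transmission;
    return (index, value) pairs to send plus the residual that stays
    local and is added back into the next round's gradient."""
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    sparse = [(i, grad[i]) for i in sorted(keep)]
    residual = [0.0 if i in keep else g for i, g in enumerate(grad)]
    return sparse, residual
```

Sending k pairs instead of the full dense vector trades a small accuracy perturbation (bounded by the error-feedback residual) for a large reduction in per-round traffic.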
3-D magnetotelluric (MT) forward modeling has long faced the twin problems of high memory requirements and long computing times. In this article, we design a scalable parallel algorithm for 3-D MT finite element modeling in anisotropic media. The parallel algorithm is based on distributed mesh storage, includes multiple parallel granularities, and is implemented with multiple tools. The Message Passing Interface (MPI) is used to exploit process parallelism over subdomains, frequencies, and equation solves. Thread parallelism for merge sorting, element analysis, matrix assembly, and imposing Dirichlet boundary conditions is developed with Open Multi-Processing (OpenMP). We validate the algorithm through several model simulations and study the effects of topography and conductivity anisotropy on apparent resistivities and phase responses. Scalability tests are performed on the Tianhe-2 supercomputer to analyze the parallel performance of the different parallel granularities. Three parallel direct solvers, SuperLU (Supernodal LU), MUMPS (MUltifrontal Massively parallel sparse direct Solver), and PaStiX (parallel Sparse matriX package), are compared for solving the sparse systems of equations. Based on these results, reasonable parallel parameters are suggested for practical applications. The developed parallel algorithm is shown to be efficient and scalable.
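The two-level decomposition (processes over frequencies or subdomains, threads over elements) can be mimicked in a toy Python sketch; here a scalar stands in for the assembled finite element matrix, Python threads stand in for OpenMP, and all names are illustrative:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(items, n):
    """Round-robin split of work items (frequencies, subdomains, or
    elements) into n bins, one per process or thread."""
    return [items[i::n] for i in range(n)]

def assemble_frequency(freq, elements, n_threads=4):
    """Toy per-frequency assembly: each thread 'analyzes' its share of
    the elements and the per-element contributions are summed."""
    def contrib(chunk):
        return sum(freq * e for e in chunk)  # stand-in for element analysis
    with ThreadPoolExecutor(n_threads) as pool:
        return sum(pool.map(contrib, partition(elements, n_threads)))
```

In the real code the outer `partition` over frequencies maps to MPI ranks, and the summation inside a frequency maps to OpenMP-threaded matrix assembly.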
ISBN (print): 9781510651098; 9781510651081
The objective of this paper is to present an efficient parallel implementation of an iterative compact high-order numerical solver for the 3D Helmholtz equation on multicore computers. The high-order parallel iterative algorithm combines a Krylov subspace-type method with a direct parallel fast Fourier transform (FFT)-type preconditioner from the authors' previous work (Ref. 7). We present results of the algorithm on simulated data with realistic parameter ranges for soil and mine-like targets. The algorithm incorporates second-, fourth-, and sixth-order compact finite difference schemes. The accuracy of the fourth- and sixth-order compact approximations is reported alongside the scalability of our implementation in the parallel programming environment.
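To illustrate what "fourth-order compact" buys: the compact relation (u_{i-1} - 2u_i + u_{i+1})/h^2 = (f_{i-1} + 10 f_i + f_{i+1})/12, with f = u'', has O(h^4) truncation error while using only a three-point stencil. A quick numerical check on u = sin(x) (our toy example, not the paper's solver):

```python
import math

def compact_residual(h, x=0.7):
    """Residual of the fourth-order compact relation
        (u_{i-1} - 2u_i + u_{i+1})/h^2 = (f_{i-1} + 10 f_i + f_{i+1})/12
    for u = sin (so f = u'' = -sin); it shrinks like O(h^4)."""
    u = [math.sin(x + k * h) for k in (-1, 0, 1)]
    f = [-math.sin(x + k * h) for k in (-1, 0, 1)]
    lhs = (u[0] - 2 * u[1] + u[2]) / h ** 2
    rhs = (f[0] + 10 * f[1] + f[2]) / 12
    return abs(lhs - rhs)
```

Halving h should shrink the residual by roughly 2^4 = 16, confirming fourth-order accuracy; the plain central difference (rhs replaced by f_i) would only give a factor of 4.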
The parallel alternating direction method of multipliers (ADMM) algorithm is widely recognized for its effectiveness in handling large-scale datasets stored in a distributed manner, making it a popular choice for solv...
Sorting is a fundamental task in computing and plays a central role in information technology. The advent of rack-scale and warehouse-size data processing has shaped the architecture of data analysis platforms towards supercomputing. In turn, established techniques on supercomputers have become relevant to a wider range of application domains. This work is concerned with multi-way mergesort with exact splitting on distributed memory architectures. At its core, our approach leverages a novel parallel algorithm for multi-way selection problems. Remarkably concise, the algorithm relies on MPI_Allgather and MPI_Reduce_scatter_block, two collective communication schemes that find hardware support in most high-end networks. A software implementation of our approach is used to process the terabyte-size Data Challenge 2 signal released by the SKA radio telescope organization. On the supercomputer considered herein, our approach outperforms the state of the art by up to 2.6x using 9,216 cores. Our implementation is released as a compact open-source library compliant with the MPI programming model. By supporting the most popular elementary key types, and arbitrary fixed-size value types, the library can be straightforwardly integrated into third-party MPI-based software.
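Exact splitting hinges on multi-way selection: finding the key of a given global rank across several sorted runs without merging them. A sequential Python sketch of the idea, bisecting on the key value (integer keys assumed; the paper's parallel version replaces the per-run counting with collective reductions across ranks):

```python
from bisect import bisect_right

def multiway_select(runs, r):
    """Return the r-th smallest key (1-based) across several sorted runs
    of integers, by binary search on the key value: count how many
    elements are <= a candidate, and narrow until the rank matches."""
    lo = min(run[0] for run in runs if run)
    hi = max(run[-1] for run in runs if run)
    while lo < hi:
        mid = (lo + hi) // 2
        count = sum(bisect_right(run, mid) for run in runs)
        if count >= r:
            hi = mid       # the r-th smallest is <= mid
        else:
            lo = mid + 1   # it is strictly larger than mid
    return lo
```

Each selection costs O(k log n log V) comparisons over k runs, and no data moves until the split points are known, which is what makes the splitting "exact".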
The parallel alternating direction method of multipliers (ADMM) algorithms have gained popularity in statistics and machine learning due to their efficient handling of large sample data problems. However, the parallel...
In probabilistic state inference, we seek to estimate the state of an (autonomous) agent from noisy observations. It can be shown that, under certain assumptions, finding the estimate is equivalent to solving a linear least squares problem. Such a problem is solved by calculating the upper triangular matrix R from the coefficient matrix A, using the QR or Cholesky factorization; this matrix is commonly referred to as the "square root matrix". In sequential estimation problems, we are often interested in periodic optimization of the state variable order, e.g., to reduce fill-in or to apply a predictive variable ordering tactic; however, changing the variable order implies expensive re-factorization of the system. Thus, we address the problem of modifying an existing square root matrix R to convey reordering of the variables. To this end, we identify several conclusions regarding the effect of column permutation on the factorization, allowing efficient modification of R without accessing A at all, or with minimal re-factorization. The proposed parallelizable algorithm achieves a significant improvement in performance over the state-of-the-art incremental Smoothing And Mapping (iSAM2) algorithm, which utilizes incremental factorization to update R.
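A key identity behind such updates is A^T A = R^T R, so for a column permutation P the new factor of AP can be obtained by re-factoring the small triangular matrix RP instead of the tall matrix AP. A NumPy sketch of this identity (not the paper's algorithm, which further exploits the permutation's structure to avoid even this full QR):

```python
import numpy as np

def permute_R(R, perm):
    """Given R from A = QR and a column permutation, return the upper
    triangular factor of A[:, perm] without touching A: since
    (AP)^T (AP) = (RP)^T (RP), it suffices to re-factor R[:, perm]."""
    Rp = np.linalg.qr(R[:, perm], mode='r')
    # normalize signs so the diagonal is positive (QR is unique up to
    # row signs, and sign-fixing makes the factor comparable)
    s = np.sign(np.diag(Rp))
    s[s == 0] = 1.0
    return s[:, None] * Rp
```

Because R is n x n while A is m x n with m >> n in estimation problems, working on RP alone is already a large saving even before any incremental tricks.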
Multinomial logistic regression is a well-studied tool for classification and has been widely used in fields like image processing, computer vision, and bioinformatics, to name a few. Under a supervised classification scenario, a multinomial logistic regression model learns a weight vector to differentiate between any two classes by optimizing over the likelihood objective. With the advent of big data, the inundation of data has resulted in large-dimensional weight vectors and a huge number of classes, which makes the classical methods for model estimation computationally unviable. To handle this issue, we propose a parallel iterative algorithm, the Parallel Iterative Algorithm for MultiNomial LOgistic Regression (PIANO), which is based on the majorization-minimization procedure and can update each element of the weight vectors in parallel. Further, we show that PIANO can be easily extended to solve the sparse multinomial logistic regression problem, an extensively studied problem because of its attractive feature-selection property. In particular, we work out the extension of PIANO to the sparse multinomial logistic regression problem with ℓ1 and ℓ0 regularizations. We also prove that PIANO converges to a stationary point of the multinomial and the sparse multinomial logistic regression problems. Simulations were conducted to compare PIANO with existing methods, and the proposed algorithm was found to perform better in terms of speed of convergence.
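PIANO's MM surrogate is not reproduced here, but the objective it minimizes is the multinomial logistic negative log-likelihood. A plain-gradient-descent sketch on that objective (our illustration of the likelihood and its gradient, not the authors' update rule):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)   # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def nll(W, X, Y):
    """Multinomial logistic negative log-likelihood; rows of X are
    samples, columns of W are per-class weight vectors, Y is one-hot."""
    return -np.sum(Y * np.log(softmax(X @ W) + 1e-12))

def gradient_step(W, X, Y, lr=0.1):
    """One plain gradient step; PIANO instead minimizes an MM surrogate
    whose coordinates decouple, so every weight entry updates in parallel."""
    return W - lr * X.T @ (softmax(X @ W) - Y)
```

The separability PIANO exploits is exactly what this monolithic gradient step lacks: each entry of the surrogate can be minimized independently, which is what makes element-wise parallel updates possible.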