We consider the problem of sampling n numbers from the range {1,..., N} without replacement on modern architectures. The main result is a simple divide-and-conquer scheme that makes sequential algorithms more cache efficient and leads to a parallel algorithm running in expected time O(n/p + log p) on p processors, i.e., scales to massively parallel machines even for moderate values of n. The amount of communication between the processors is very small (at most O(log p)) and independent of the sample size. We also discuss modifications needed for load balancing, online sampling, sampling with replacement, Bernoulli sampling, and vectorization on SIMD units or GPUs.
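The divide-and-conquer idea can be illustrated with a minimal sequential sketch. This is not the authors' implementation. It assumes the scheme splits the range in half, draws how many of the n samples fall in the left half from a hypergeometric distribution, and recurses on both halves independently. The names `draw_hypergeometric`, `sample_range`, and `base` are hypothetical, and the hypergeometric draw is a naive O(n) simulation used only to keep the sketch free of dependencies.

```python
import random


def draw_hypergeometric(rng, good, bad, draws):
    """Count 'good' items among `draws` items taken without replacement.

    Naive O(draws) simulation; a stand-in for a faster hypergeometric sampler.
    """
    hits = 0
    for _ in range(draws):
        # Pick a good item with probability good / (good + bad).
        if rng.random() * (good + bad) < good:
            hits += 1
            good -= 1
        else:
            bad -= 1
    return hits


def sample_range(rng, lo, hi, n, base=64):
    """Return n distinct integers from {lo, ..., hi} (inclusive)."""
    size = hi - lo + 1
    if n == 0:
        return []
    if size <= base:
        # Small subproblem: fall back to a standard sequential sampler.
        return rng.sample(range(lo, hi + 1), n)
    left_size = size // 2
    mid = lo + left_size - 1  # left half is lo..mid, right half is mid+1..hi
    # Decide how many of the n samples land in the left half.
    n_left = draw_hypergeometric(rng, left_size, size - left_size, n)
    # The two halves are now independent subproblems.
    return (sample_range(rng, lo, mid, n_left, base)
            + sample_range(rng, mid + 1, hi, n - n_left, base))


rng = random.Random(1)
sample = sample_range(rng, 1, 10_000, 500)
```

Because the two recursive calls share no state beyond the split counts, they could in principle be handed to different processors, which is consistent with the small communication volume the abstract describes.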
Many complex physical processes are modeled by coupled systems of partial differential equations (PDEs). Often, the numerical approximation of these PDEs requires the solution of large sparse nonsymmetric systems of equations. In this paper the authors compare the parallel performance of a number of preconditioned Krylov subspace methods on a large-scale multiple instruction multiple data (MIMD) machine. These methods are among the most robust and efficient iterative algorithms for the solution of large sparse linear systems. In this comparison, the focus is on parallel issues associated with preconditioners within the generalized minimum residual (GMRES), conjugate gradient squared (CGS), biconjugate gradient stabilized (Bi-CGSTAB), and quasi-minimal residual CGS (QMRCGS) methods. Conclusions are drawn on the effectiveness of the different schemes based on results obtained from a 1024-processor nCUBE 2 hypercube.
This paper illustrates the concept of multiphase parallel structuring of algorithms on reconfigurable computers. Reconfigurable network architectured computers are described and a paradigm for programming them is defined. The execution behavior of two linear system solving techniques is determined and compared. This paper does not attempt a traditional analysis of linear system solvers: instead it presents a study of the scheduling and data flow requirements of a selected pair of algorithms.
The time-symmetric block time-step (TSBTS) algorithm is a newly developed efficient scheme for N-body integrations. It is constructed on an era-based iteration. In this work, we re-designed the TSBTS integration scheme with a dynamically changing era size. A number of numerical tests were performed to show the importance of choosing the size of the era, especially for long-time integrations. Our second aim was to show that the TSBTS scheme is as suitable as previously known schemes for developing parallel N-body codes. In this work, we relied on a parallel scheme using the copy algorithm for the time-symmetric scheme. We implemented a hybrid of data and task parallelization for force calculation to handle load balancing problems that can appear in practice. Using the Plummer model initial conditions for different numbers of particles, we obtained the expected efficiency and speedup for a small number of particles. Although parallelization of the direct N-body codes is negatively affected by the communication/calculation ratios, we obtained good load-balanced results. Moreover, we were able to conserve the advantages of the algorithm (e.g., energy conservation for long-term simulations).
Multinomial Logistic Regression is a well-studied tool for classification and has been widely used in fields such as image processing, computer vision, and bioinformatics, to name a few. Under a supervised classification scenario, a Multinomial Logistic Regression model learns a weight vector to differentiate between any two classes by optimizing over the likelihood objective. With the advent of big data, the inundation of data has resulted in high-dimensional weight vectors and has also given rise to a huge number of classes, which makes the classical methods for model estimation computationally infeasible. To handle this issue, we propose a parallel iterative algorithm, the Parallel Iterative Algorithm for MultiNomial LOgistic Regression (PIANO), which is based on the Majorization Minimization procedure and can update each element of the weight vectors in parallel. Further, we show that PIANO can be easily extended to solve the Sparse Multinomial Logistic Regression problem, an extensively studied problem because of its attractive feature selection property. In particular, we work out the extension of PIANO to solve the Sparse Multinomial Logistic Regression problem with ℓ1 and ℓ0 regularizations. We also prove that PIANO converges to a stationary point of the Multinomial and the Sparse Multinomial Logistic Regression problems. Simulations were conducted to compare PIANO with the existing methods, and it was found that the proposed algorithm performs better than the existing methods in terms of speed of convergence. (C) 2022 Elsevier B.V. All rights reserved.
We present a fast parallel algorithm for finding the blocks or biconnected components of an undirected graph G = (V, E) having n vertices and m edges. Our techniques are based on partitioning the vertex set V into adjacency-level sets using information contained in the distance matrix D of the graph. Let tD and pD be the time and number of processors, respectively, for the computation of the distance matrix of a graph G on a CRCW-PRAM computational model. We show that the location of all cut vertices and bridges of a graph can be done in time O(log δ + tD) by using O(n m/tD) processors, where δ is the maximum degree of a vertex in G. Based on these results, we define a digraph Gd and we prove certain properties on its distance matrix leading to a parallel block-finding algorithm running in time O(log δ + tD) with O(n m/tD) processors on the same computational model. We also show that other connectivity-related problems can be efficiently solved using distance matrices.
The development of the smart grid and the increasing scale of power systems put considerable pressure on the electromagnetic transient simulation of a power system. The graphics processing unit (GPU), which features massive numbers of concurrent threads and excellent floating-point performance, offers a new opportunity for power system simulation. This study introduces a parallel lower-upper (LU) triangular decomposition algorithm and a GPU-based calculation strategy for electromagnetic transient simulation. In this scheme, the GPU mainly performs the computationally intensive part of the simulation in parallel on its multiple built-in processing cores, and the CPU is assigned to updating history terms and controlling the flow of the simulation. Comparison with the results of CPU-only implementations verifies the validity and efficiency of the proposed method.
Presented is a parallel algorithm based on the fast multipole method (FMM) for the Helmholtz equation. This variant of the FMM is useful for computing radar cross sections and antenna radiation patterns. The FMM decomposes the impedance matrix into sparse components, reducing the operation count of the matrix-vector multiplication in iterative solvers to O(N^(3/2)) (where N is the number of unknowns). The parallel algorithm divides the problem into groups and assigns the computation involved with each group to a processor node. Careful consideration is given to the communications costs. A time complexity analysis of the algorithm is presented and compared with empirical results from a Paragon XP/S running the lightweight Sandia/University of New Mexico operating system (SUNMOS). For a 90,000-unknown problem running on 60 nodes, the sparse representation fits in memory and the algorithm computes the matrix-vector product in 1.26 seconds. It sustains an aggregate rate of 1.4 Gflop/s. The corresponding dense matrix would occupy over 100 Gbytes and, assuming that I/O is free, would require on the order of 50 seconds to form the matrix-vector product.
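The dense-matrix figures quoted above can be sanity-checked with a short calculation. The sketch below assumes each matrix entry is a double-precision complex number (16 bytes), which the abstract does not state, and it ignores the constant factors hidden by the O-notation, so the operation-count ratio is only indicative.

```python
import math

n_unknowns = 90_000        # problem size quoted in the abstract
bytes_per_entry = 16       # assumption: double-precision complex entries

# Storage for the full dense impedance matrix.
dense_entries = n_unknowns ** 2
dense_gbytes = dense_entries * bytes_per_entry / 1e9

# Leading-order operation counts for one matrix-vector product.
dense_ops = n_unknowns ** 2                       # O(N^2) for the dense product
fmm_ops = n_unknowns * math.isqrt(n_unknowns)     # O(N^(3/2)); 90,000 is a perfect square
ops_ratio = dense_ops // fmm_ops                  # equals sqrt(N) up to constant factors

print(f"dense matrix: {dense_gbytes:.1f} GB, op-count ratio: {ops_ratio}")
```

Under the 16-byte assumption the dense matrix comes out above the 100 Gbytes stated in the abstract, and the leading-order operation count drops by a factor of sqrt(N).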
This paper presents the results of a study conducted to evaluate the inherent memory reference behavior of several engineering/scientific applications, executing on shared memory, MIN-based, parallel systems. In this study, system sizes of two to 64 processors were evaluated. A trace-driven simulation model was used to obtain dynamic reference characteristics of the code. Included in this code were explicit declarations of shared variables. Our results indicate that a significant amount of explicitly declared shared data is accessed as either read-only by several processors, or read-write by a single processor. Furthermore, lines containing synchronization variables tend to see small ownership times at a processor and are accessed by several processors in the system. We also note that, as expected, relatively more references are to data with smaller ownership times as the number of processors increases. Finally, the application data set size can have an impact on ownership time as the number of processors increases.
Computationally efficient serial and parallel algorithms for estimating the general linear model are proposed. The sequential block-recursive algorithm is an adaptation of a known Givens strategy that has as a main component the Generalized QR decomposition. The proposed algorithm is based on orthogonal transformations and exploits the triangular structure of the Cholesky QRD factor of the variance-covariance matrix. Specifically, it computes the estimator of the general linear model by recursively solving a series of progressively smaller generalized linear least squares problems. The new algorithm is found to significantly outperform the corresponding LAPACK routine. A parallel version of the new sequential algorithm, which utilizes an efficient distribution of the matrices over the processors and has low inter-processor communication, is developed. The theoretical computational complexity of the parallel algorithms is derived and analyzed. Experimental results are presented which confirm the theoretical analysis. The parallel strategy is found to be scalable and highly efficient for large-scale general linear estimation problems. (c) 2005 Elsevier B.V. All rights reserved.