ISBN (print): 9783030281632; 9783030281625
We construct memory-optimized and time-efficient parallel algorithms (and the corresponding programs) taking advantage of regularized modified alpha-processes, namely the modified steepest descent method and the modified minimal residual method, for solving the nonlinear equation of the structural inverse gravimetry problem. Memory optimization relies on the block-Toeplitz structure of the Jacobian matrix. The algorithms are implemented on multicore CPUs and GPUs through the use of, respectively, OpenMP and NVIDIA CUDA technologies. We analyze the efficiency and speedup of the algorithms. In addition, we solve a model problem of gravimetry and conduct a comparative study regarding the number of iterations and computation time against algorithms based on conjugate gradient-type methods and the componentwise gradient method. The comparison demonstrates that the algorithms based on alpha-processes perform better, reducing the number of iterations and the computation time by as much as 50%.
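The abstract does not spell out how the block-Toeplitz structure is exploited; the following minimal Python sketch (not the authors' code) illustrates the underlying memory-saving idea for a plain Toeplitz matrix: store only the first row and first column and form matrix-vector products from them, so storage drops from O(n^2) to O(n). In practice such products can also be computed in O(n log n) time with FFTs.

```python
import numpy as np

def toeplitz_matvec(first_col, first_row, x):
    """Multiply a Toeplitz matrix by x without forming the full matrix.

    The matrix is defined by its first column and first row only, so the
    storage cost is O(n) instead of O(n^2). Entry T[i, j] equals
    first_col[i - j] if i >= j, else first_row[j - i].
    """
    n = len(x)
    y = np.zeros(n)
    for i in range(n):
        for j in range(n):
            t_ij = first_col[i - j] if i >= j else first_row[j - i]
            y[i] += t_ij * x[j]
    return y

# Example: a 4x4 Toeplitz matrix stored with 7 numbers instead of 16.
c = np.array([4.0, 1.0, 0.5, 0.25])   # first column
r = np.array([4.0, 2.0, 1.0, 0.5])    # first row (r[0] must equal c[0])
x = np.array([1.0, 2.0, 3.0, 4.0])
print(toeplitz_matvec(c, r, x))
```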
We present a parallel large neighborhood search framework for finding high quality primal solutions for general mixed-integer programs (MIPs). The approach simultaneously solves a large number of sub-MIPs with the dual objective of reducing infeasibility and optimizing with respect to the original objective. Both goals are achieved by solving restricted versions of two auxiliary MIPs, where subsets of the variables are fixed. In contrast to prior approaches, ours does not require a feasible starting solution. We leverage parallelism to perform multiple searches simultaneously, with the objective of increasing the effectiveness of our heuristic. We computationally compare the proposed framework with a state-of-the-art MIP solver in terms of solution quality, scalability, reproducibility, and parallel efficiency. Results show the efficacy of our approach in finding high quality solutions quickly both as a standalone primal heuristic and when used in conjunction with an exact algorithm.
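As a rough illustration of the fix-and-solve pattern described above (not the paper's framework), the toy Python sketch below fixes random subsets of variables of a tiny 0-1 knapsack "MIP", solves the restricted problems in parallel processes, and keeps the best feasible result. Unlike the proposed approach, it starts from a trivial feasible incumbent and brute-forces the free variables; the problem data and the helper names are made up for the example.

```python
import itertools
import random
from concurrent.futures import ProcessPoolExecutor

# Toy 0-1 knapsack used as a stand-in MIP: maximize v.x subject to w.x <= CAP.
VALUES  = [10, 7, 4, 9, 6, 8, 3, 5]
WEIGHTS = [ 5, 4, 2, 6, 3, 5, 1, 4]
CAP = 14

def solve_sub_mip(fixed):
    """Solve the restricted problem where the variables in `fixed` are pinned.

    Only the free variables are enumerated, mimicking the much smaller
    sub-MIP solved inside a large neighborhood search.
    """
    free = [i for i in range(len(VALUES)) if i not in fixed]
    best = None
    for bits in itertools.product([0, 1], repeat=len(free)):
        x = dict(fixed)
        x.update(zip(free, bits))
        weight = sum(WEIGHTS[i] * x[i] for i in x)
        value = sum(VALUES[i] * x[i] for i in x)
        if weight <= CAP and (best is None or value > best[0]):
            best = (value, x)
    return best

def random_fixing(incumbent, n_fixed, seed):
    """Fix a random subset of variables to the incumbent's values."""
    rng = random.Random(seed)
    idx = rng.sample(range(len(VALUES)), n_fixed)
    return {i: incumbent[i] for i in idx}

if __name__ == "__main__":
    incumbent = {i: 0 for i in range(len(VALUES))}   # trivial feasible point
    # Launch several independent neighborhoods in parallel.
    fixings = [random_fixing(incumbent, 4, s) for s in range(8)]
    with ProcessPoolExecutor() as pool:
        results = list(pool.map(solve_sub_mip, fixings))
    value, solution = max((r for r in results if r is not None),
                          key=lambda r: r[0])
    print("best value found:", value, "solution:", solution)
```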
Ionized field calculation for high-voltage direct current (HVDC) transmission lines is a computationally demanding problem, which can benefit from the application of massively parallel high-performance computing architectures. The finite element method (FEM) commonly employed to solve this problem is both memory and execution-time intensive. In this paper, a finite-difference relaxation (FDR) method is proposed to solve a unipolar and a bipolar ionized field problem in an HVDC line. The novel FDR method has several advantages over FEM. First, the scheme is suitable for massively parallel computation and runs much faster: compared with the commercial FEM software Comsol Multiphysics, the speed-up is more than 14 times for the CPU parallelization and 35 times for the graphics processor (GPU) implementation, while providing high accuracy. Moreover, the set of equations in FDR need not be assembled; instead, it is solved by a relaxation scheme and requires much less memory than FEM. Additionally, a differentiated grid size with interpolation techniques is proposed to improve the flexibility of FDR for problem domains containing irregular geometries or disproportionate sizes.
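The paper's FDR scheme couples the potential and space-charge equations; the sketch below shows only a relaxation kernel on a model 2D Poisson problem, illustrating why such updates parallelize well (every node is updated from its neighbours independently). The grid size, iteration count, and relaxation factor are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

def jacobi_relax(phi, rho, h, n_iters=500, omega=1.0):
    """Plain Jacobi relaxation for a 2D Poisson equation  -lap(phi) = rho.

    Each sweep updates every interior node from its four neighbours, so all
    nodes can be updated independently -- the property that makes relaxation
    schemes attractive for massively parallel hardware.
    """
    for _ in range(n_iters):
        new = phi.copy()
        new[1:-1, 1:-1] = (1 - omega) * phi[1:-1, 1:-1] + omega * 0.25 * (
            phi[2:, 1:-1] + phi[:-2, 1:-1] +
            phi[1:-1, 2:] + phi[1:-1, :-2] +
            h * h * rho[1:-1, 1:-1]
        )
        phi = new
    return phi

# Toy example: unit square grid, zero boundary potential, uniform charge.
n = 65
phi = np.zeros((n, n))
rho = np.ones((n, n))
phi = jacobi_relax(phi, rho, h=1.0 / (n - 1))
print("potential at the centre:", phi[n // 2, n // 2])
```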
We consider the problem of nonnegative tensor factorization. Our aim is to derive an efficient algorithm that is also suitable for parallel implementation. We adopt the alternating optimization framework and solve each matrix nonnegative least-squares problem via a Nesterov-type algorithm for strongly convex problems. We describe a parallel implementation of the algorithm and measure the attained speedup in a multicore computing environment. It turns out that the derived algorithm is a competitive candidate for the solution of very large-scale dense nonnegative tensor factorization problems.
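A minimal sketch of the per-factor subproblem, assuming the standard alternating-optimization formulation in which each factor update is a nonnegative least-squares problem: a Nesterov-type accelerated projected-gradient loop with the momentum weight fixed from the extreme eigenvalues of A^T A (the strongly convex case). This is an illustration of the method class, not the authors' algorithm.

```python
import numpy as np

def nesterov_nnls(A, B, n_iters=300):
    """Accelerated projected gradient for min ||A X - B||_F^2 subject to X >= 0.

    For a strongly convex quadratic, the momentum weight can be fixed from
    the extreme eigenvalues of A^T A (assumed positive definite here).
    """
    AtA, AtB = A.T @ A, A.T @ B
    eigs = np.linalg.eigvalsh(AtA)
    L, mu = eigs[-1], eigs[0]
    beta = (np.sqrt(L) - np.sqrt(mu)) / (np.sqrt(L) + np.sqrt(mu))
    X = np.maximum(np.linalg.lstsq(A, B, rcond=None)[0], 0)  # warm start
    Y = X.copy()
    for _ in range(n_iters):
        X_new = np.maximum(Y - (AtA @ Y - AtB) / L, 0)  # gradient step + projection
        Y = X_new + beta * (X_new - X)                   # momentum extrapolation
        X = X_new
    return X

# In an alternating-optimization NTF loop, A would be a Khatri-Rao product of
# the other factor matrices and B the matricized tensor.
A = np.random.rand(40, 5)
B = np.random.rand(40, 8)
X = nesterov_nnls(A, B)
print("residual:", np.linalg.norm(A @ X - B))
```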
ISBN (digital): 9781450362290; ISBN (print): 9781450362290
Suffix arrays and trees are important and fundamental string data structures which lie at the foundation of many string algorithms, with important applications in computational biology, text processing, and information retrieval. Recent work enables the efficient parallel construction of suffix arrays and trees requiring at most O(n/p) memory per process in distributed memory. However, querying these indexes in distributed memory has not been studied extensively. Querying common string indexes such as suffix arrays, enhanced suffix arrays, and the FM-index requires random accesses into O(n) memory, which in distributed-memory settings becomes prohibitively expensive. In this paper, we introduce a novel distributed string index, the Distributed Enhanced Suffix Array (DESA). We present efficient algorithms for the construction and querying of this distributed data structure, all while requiring only O(n/p) memory per process. We further provide a scalable parallel implementation and demonstrate its performance and scalability.
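For context, the sketch below shows the classic single-machine way to query a suffix array: two binary searches whose comparisons each perform a random access into the text. It is exactly this access pattern that becomes expensive in distributed memory; the DESA construction and query algorithms themselves are not reproduced here.

```python
def build_suffix_array(text):
    """Naive construction (O(n^2 log n)); fine for a small illustration."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    """Report all occurrences of `pattern` using binary search over `sa`.

    Every comparison reads text[sa[mid]: ...], i.e. a random access into the
    text -- the access pattern that becomes costly once the text and the
    index are spread across distributed memory.
    """
    def lower_bound(strict):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            suffix = text[sa[mid]:sa[mid] + len(pattern)]
            if suffix < pattern or (strict and suffix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    start, end = lower_bound(strict=False), lower_bound(strict=True)
    return sorted(sa[start:end])

text = "mississippi"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "issi"))   # positions 1 and 4
```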
As data sets become larger and more complicated, an extreme learning machine (ELM) that runs in a traditional serial environment cannot realize its potential to be fast and effective. Although a parallel ELM (PELM) based on MapReduce processes large-scale data with a higher learning speed than the same ELM algorithms in a serial environment, some operations, such as storing intermediate results on disk and keeping multiple copies for each task, are indispensable; these operations create a large amount of extra overhead and degrade the learning speed and efficiency of PELMs. In this paper, an efficient ELM based on the Spark framework (SELM), which includes three parallel subalgorithms, is proposed for big data classification. By partitioning the corresponding data sets reasonably, the hidden layer output matrix calculation algorithm, the matrix U decomposition algorithm, and the matrix V decomposition algorithm perform most of the computations locally. At the same time, they retain the intermediate results in distributed memory and cache the diagonal matrix as a broadcast variable instead of making several copies for each task, which greatly reduces costs and strengthens the learning ability of the SELM. Finally, we implement our SELM algorithm to classify large data sets. Extensive experiments have been conducted to validate the effectiveness of the proposed algorithms. As shown, our SELM achieves an 8.71x speedup on a cluster with ten nodes, and reaches a 13.79x speedup with 15 nodes, an 18.74x speedup with 20 nodes, a 23.79x speedup with 25 nodes, a 28.89x speedup with 30 nodes, and a 33.81x speedup with 35 nodes.
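A single-machine sketch of the computation pattern the abstract describes, with NumPy standing in for Spark: the partition loop plays the role of mapPartitions, and U and V are assumed here to denote the accumulated H^T H and H^T T matrices. Each data block contributes only small local matrices, so the full hidden-layer output matrix never has to be materialized or shipped.

```python
import numpy as np

def train_elm(X, T, n_hidden=64, C=1.0, n_partitions=4, seed=0):
    """Single-machine sketch of ELM training with partition-wise accumulation."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], n_hidden))   # random input weights
    b = rng.standard_normal(n_hidden)                  # random biases
    U = np.zeros((n_hidden, n_hidden))                 # accumulates H^T H
    V = np.zeros((n_hidden, T.shape[1]))               # accumulates H^T T
    for X_part, T_part in zip(np.array_split(X, n_partitions),
                              np.array_split(T, n_partitions)):
        H = 1.0 / (1.0 + np.exp(-(X_part @ W + b)))    # sigmoid hidden layer
        U += H.T @ H
        V += H.T @ T_part
    beta = np.linalg.solve(U + np.eye(n_hidden) / C, V)  # output weights
    return W, b, beta

def predict(X, W, b, beta):
    H = 1.0 / (1.0 + np.exp(-(X @ W + b)))
    return H @ beta

# Tiny usage example with one-hot targets.
X = np.random.rand(200, 10)
T = np.eye(3)[np.random.randint(0, 3, 200)]
W, b, beta = train_elm(X, T)
print("train accuracy:", (predict(X, W, b, beta).argmax(1) == T.argmax(1)).mean())
```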
Deep Neural Networks (DNNs) are becoming an important tool in modern computing applications. Accelerating their training is a major challenge and techniques range from distributed algorithms to low-level circuit design. In this survey, we describe the problem from a theoretical perspective, followed by approaches for its parallelization. We present trends in DNN architectures and the resulting implications on parallelization strategies. We then review and model the different types of concurrency in DNNs: from the single operator, through parallelism in network inference and training, to distributed deep learning. We discuss asynchronous stochastic optimization, distributed system architectures, communication schemes, and neural architecture search. Based on those approaches, we extrapolate potential directions for parallelism in deep learning.
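As a small illustration of one of the concurrency types the survey covers, the sketch below performs a synchronous data-parallel SGD step on a toy linear-regression problem; averaging the per-worker gradients plays the role of an all-reduce. It is an illustrative example, not drawn from the survey.

```python
import numpy as np

def data_parallel_sgd_step(params, grad_fn, data_shards, lr=0.1):
    """One synchronous data-parallel SGD step.

    Each worker computes a gradient on its own shard of the minibatch; the
    gradients are then averaged (the role of an all-reduce in a distributed
    setting) and applied to the shared parameters.
    """
    grads = [grad_fn(params, shard) for shard in data_shards]  # one per worker
    avg_grad = sum(grads) / len(grads)                         # "all-reduce"
    return params - lr * avg_grad

# Toy example: linear regression split across four simulated workers.
rng = np.random.default_rng(0)
X, w_true = rng.standard_normal((400, 5)), np.arange(5.0)
y = X @ w_true
shards = list(zip(np.array_split(X, 4), np.array_split(y, 4)))

def grad_fn(w, shard):
    Xs, ys = shard
    return 2 * Xs.T @ (Xs @ w - ys) / len(ys)

w = np.zeros(5)
for _ in range(200):
    w = data_parallel_sgd_step(w, grad_fn, shards)
print("recovered weights:", np.round(w, 3))
```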
ISBN (print): 9781728112466
Scalable QR factorization algorithms for solving least squares and eigenvalue problems are critical given the increasing parallelism within modern machines. We introduce a more general parallelization of the CholeskyQR2 algorithm and show its effectiveness for a wide range of matrix sizes. Our algorithm executes over a 3D processor grid, the dimensions of which can be tuned to trade off costs in synchronization, interprocessor communication, computational work, and memory footprint. We implement this algorithm, yielding a code that can achieve a factor of Theta(P^(1/6)) less interprocessor communication on P processors than any previous parallel QR implementation. Our performance study on Intel Knights Landing and Cray XE supercomputers demonstrates the effectiveness of this CholeskyQR2 parallelization on a large number of nodes. Specifically, relative to ScaLAPACK's QR, on 1024 nodes of Stampede2, our CholeskyQR2 implementation is faster by 2.6x-3.3x in strong scaling tests and by 1.1x-1.9x in weak scaling tests.
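For reference, a shared-memory NumPy sketch of the CholeskyQR2 kernel that the parallelization builds on (the 3D processor-grid distribution itself is not shown): one CholeskyQR round followed by a second round on the computed Q to restore orthogonality.

```python
import numpy as np

def cholesky_qr2(A):
    """CholeskyQR2: two rounds of CholeskyQR to recover orthogonality.

    One round computes R from the Cholesky factor of A^T A and sets
    Q = A R^{-1}; repeating the process on Q repairs the loss of
    orthogonality that a single round suffers for ill-conditioned A. In a
    distributed setting, A is partitioned by rows and A^T A is formed with a
    reduction.
    """
    def cholesky_qr(M):
        R = np.linalg.cholesky(M.T @ M).T          # upper-triangular factor
        Q = np.linalg.solve(R.T, M.T).T            # Q = M R^{-1}
        return Q, R

    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1

A = np.random.rand(1000, 20)
Q, R = cholesky_qr2(A)
print("orthogonality error:", np.linalg.norm(Q.T @ Q - np.eye(20)))
print("factorization error:", np.linalg.norm(Q @ R - A))
```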
ISBN (print): 9781450366038
Asymmetric data patterns and workloads pose a challenge to massively parallel algorithm design, in particular for modern wide-SIMD architectures exhibiting several levels of parallelism. We propose a simple-to-use primitive that enables programmers to design algorithms with arbitrary data expansion or compaction while hiding the architecture details. We evaluate and characterize the performance of the primitive for a range of workloads, both synthetic and real-world. The results demonstrate that the primitive can be an effective tool in the toolbox of designers of parallel algorithms.
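The paper's primitive and its SIMD implementation are not reproduced here; the sketch below only illustrates the standard prefix-sum formulation of data expansion and compaction that such primitives are typically built on: a scan assigns every input element its output offset, after which all writes are independent.

```python
import numpy as np

def expand_compact(values, counts):
    """Expand/compact a data stream with an exclusive prefix sum (scan).

    Each input element i produces counts[i] output copies (0 drops the
    element, i.e. compaction; >1 expands it). The scan gives every element
    its output offset, so the writes in the loop are independent of one
    another and could be issued in parallel.
    """
    offsets = np.concatenate(([0], np.cumsum(counts)[:-1]))   # exclusive scan
    out = np.empty(int(np.sum(counts)), dtype=np.asarray(values).dtype)
    for i, (v, c) in enumerate(zip(values, counts)):          # parallel-for in spirit
        out[offsets[i]:offsets[i] + c] = v
    return out

print(expand_compact([10, 20, 30, 40], [2, 0, 3, 1]))  # -> [10 10 30 30 30 40]
```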
ISBN (print): 9781728155920
In induction machine simulation, it usually takes quite a long time to reach the steady state due to the large time constant. In this paper, two methods are proposed to speed up the transient process of reaching the steady state. In both methods, the initial condition of the simulation is estimated from the solution of an FEA model with a locked rotor and equivalent conductivity/resistance. The effectiveness of the methods is validated by two examples, with a comparison of the performance of the two methods.