• Currently, domain propagation in state-of-the-art MIP solvers is single-threaded only.
• The paper presents a novel, efficient GPU algorithm to perform domain propagation.
• Challenges are dynamic algorithmic behavior, dependency structures, and sparsity patterns.
• The algorithm is capable of running entirely on the GPU with no CPU involvement.
• We achieve speed-ups of around 10x to 20x, and up to 180x on favorably large instances.
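For readers unfamiliar with domain propagation, the sketch below shows the serial core operation that such a GPU algorithm parallelizes: tightening variable bounds over a single linear constraint via an activity argument. It is a minimal illustration under assumed data layouts (plain Python lists, a single <= constraint), not the paper's GPU kernel; the function name propagate is hypothetical.

# A minimal sketch of domain propagation for one linear constraint
#   sum_j a[j] * x[j] <= b   with current variable bounds lb, ub.
def propagate(a, b, lb, ub):
    # Minimal activity: each term takes its smallest attainable value.
    min_act = sum(c * (lb[j] if c > 0 else ub[j]) for j, c in enumerate(a))
    new_lb, new_ub = lb[:], ub[:]
    for j, c in enumerate(a):
        if c == 0:
            continue
        # Residual minimal activity with variable j removed.
        res = min_act - c * (lb[j] if c > 0 else ub[j])
        if c > 0:
            new_ub[j] = min(ub[j], (b - res) / c)   # tighten upper bound
        else:
            new_lb[j] = max(lb[j], (b - res) / c)   # tighten lower bound
    return new_lb, new_ub

# Example: 2x + 3y <= 6 with x, y in [0, 10] tightens to x <= 3, y <= 2.
print(propagate([2.0, 3.0], 6.0, [0.0, 0.0], [10.0, 10.0]))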
In this paper, we present a new parallel accurate algorithm, called PAccSumK, for computing the summation of floating-point numbers. It is based on the AccSumK algorithm. In our experiments, for summation problems with large condition numbers, our algorithm outperforms the PSumK algorithm in both accuracy and computing time. The reason is that our algorithm builds on the more accurate AccSumK algorithm, whereas PSumK uses the SumK algorithm. The proposed parallel algorithm is designed to compute a result as if computed internally in K-fold working precision. Numerical results are presented showing the performance and accuracy of our new parallel summation algorithm. (c) 2021 Elsevier B.V. All rights reserved.
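As background, the following minimal serial sketch shows the building blocks that SumK-style algorithms are made of: the TwoSum error-free transformation and K-fold distillation sweeps. The paper's AccSumK and PAccSumK themselves are not reproduced here; the function names two_sum and sum_k are illustrative.

# Error-free transformation (Knuth's TwoSum): a + b = s + e exactly.
def two_sum(a, b):
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e

# K-fold cascaded summation in the spirit of SumK.
def sum_k(p, K=2):
    p = list(p)
    for _ in range(K - 1):              # K-1 distillation sweeps
        for i in range(1, len(p)):
            p[i], p[i - 1] = two_sum(p[i], p[i - 1])
    return sum(p[:-1]) + p[-1]          # final ordinary summation

# Ill-conditioned example: naive summation loses the 1.0 terms.
data = [1e16, 1.0, -1e16, 1.0]
print(sum(data), sum_k(data, K=2))      # 1.0 (naive) vs 2.0 (exact)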
ISBN:
(Print) 9781665440660
Stochastic Gradient Descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous shared-memory parallel SGD (AsyncSGD), including synchronization-free algorithms such as HOGWILD!, has received interest in certain contexts, due to reduced overhead compared to synchronous parallelization. Although these methods induce staleness and inconsistency, they have shown speedup for problems with smooth, strongly convex targets and gradient sparsity. Recent works take important steps towards understanding the potential of parallel SGD for problems not conforming to these strong assumptions, in particular for deep learning (DL). There is, however, a gap in the current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular how mechanisms for synchronization and consistency play a role. We contribute to answering questions in this gap by studying a spectrum of parallel algorithmic implementations of AsyncSGD, aiming to understand how shared-data synchronization influences the convergence properties in fundamental DL applications. We focus on the impact of consistency-preserving non-blocking synchronization on SGD convergence, and on sensitivity to hyperparameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD, employing lock-free synchronization, effectively balancing throughput and latency. Leashed-SGD features a natural contention-regulating mechanism, as well as dynamic memory management, allocating space only when needed. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention. We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing to baselines for two prominent deep learning (DL) applications: multilayer perceptrons (MLP) and convolutional neural networks (CNN).
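To make the setting concrete, here is a minimal HOGWILD!-style sketch in which worker threads update a shared parameter vector with no locks at all; it illustrates only the synchronization-free end of the spectrum, not the Leashed-SGD framework. The least-squares model, learning rate, and sizes are hypothetical.

# Minimal synchronization-free AsyncSGD sketch (HOGWILD! style).
import threading, random

dim, lr, steps = 16, 0.01, 1000
w = [0.0] * dim                          # shared parameters, no lock
data = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(256)]

def grad(x, w):
    # Gradient of the per-sample loss 0.5 * (w.x - 1)^2.
    err = sum(wi * xi for wi, xi in zip(w, x)) - 1.0
    return [err * xi for xi in x]

def worker():
    for _ in range(steps):
        g = grad(random.choice(data), w)
        for i in range(dim):
            # Unsynchronized read-modify-write: stale and inconsistent by
            # design (CPython's GIL serializes bytecodes, but there is no
            # program-level synchronization).
            w[i] -= lr * g[i]

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
loss = sum((sum(wi * xi for wi, xi in zip(w, x)) - 1.0) ** 2
           for x in data) / len(data)
print("mean squared error:", loss)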
ISBN:
(Print) 9781665484855
The emerging memristive Memory Processing Unit (mMPU) overcomes the memory wall through memristive devices that unite storage and logic for real processing-in-memory (PIM) systems. At the core of the mMPU is stateful logic, which is accelerated with memristive partitions to enable logic with massive inherent parallelism within crossbar arrays. This paper vastly accelerates the fundamental operations of matrix-vector multiplication and convolution in the mMPU, with either full-precision or binary elements. These proposed algorithms establish an efficient foundation for large-scale mMPU applications such as neural networks, image processing, and numerical methods. We overcome the inherent asymmetry limitation in previous in-memory full-precision matrix-vector multiplication solutions by utilizing techniques from block matrix multiplication and reduction. We present the first fast in-memory binary matrix-vector multiplication algorithm by utilizing memristive partitions with a tree-based popcount reduction (39x faster than previous work). For convolution, we present a novel in-memory input-parallel concept which we utilize for a full-precision algorithm that overcomes the asymmetry limitation in convolution while also improving latency (2x faster than previous work), and the first fast binary algorithm (12x faster than previous work).
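The arithmetic behind binary matrix-vector multiplication can be sketched in a few lines: with ±1 elements packed into bit masks, each dot product is an XNOR followed by a popcount, and the popcount is exactly the reduction a tree-based in-memory scheme accelerates. The host-side encoding below is a hypothetical illustration; the crossbar/partition mapping is not shown.

# Binary (+1/-1) matrix-vector product via XNOR + popcount.
def bin_matvec(M_bits, v_bits, n):
    # Each row and the vector are n-bit masks: bit 1 encodes +1, bit 0 encodes -1.
    out = []
    for row in M_bits:
        agree = ~(row ^ v_bits) & ((1 << n) - 1)  # XNOR: positions with equal sign
        pop = bin(agree).count("1")               # popcount (tree-reducible)
        out.append(2 * pop - n)                   # dot product in {-n, ..., n}
    return out

# Example: a 2x3 matrix of +/-1 times the all-(+1) vector.
print(bin_matvec([0b101, 0b011], 0b111, 3))       # [1, 1]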
ISBN:
(Print) 9781450395298
Polygon overlay operations are used for various purposes such as GIS, VLSI, and geometric operations. Recent articles present algorithms using the GPU to perform the polygon overlay operation. We present two algorithms implemented on the GPU that focus on the active list of the traditional serial plane sweep algorithm. The presented results show improvement in execution time with respect to recent algorithms.
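As a reminder of the structure the two GPU algorithms target, the sketch below shows a serial sweep maintaining an active list, on a deliberately simplified interval-overlap variant: elements enter the active list at their left endpoint and leave at their right. The real polygon overlay sweep additionally handles segment geometry and intersection events, which are omitted here.

# Plane-sweep skeleton: an active list driven by enter/leave events.
def sweep_overlaps(segments):
    # segments: list of (x_left, x_right); report index pairs whose ranges overlap.
    events = sorted((x, kind, sid)
                    for sid, (xl, xr) in enumerate(segments)
                    for x, kind in ((xl, 0), (xr, 1)))    # 0 = enter, 1 = leave
    active, pairs = set(), []
    for x, kind, sid in events:
        if kind == 0:
            pairs += [(other, sid) for other in active]   # overlaps with actives
            active.add(sid)
        else:
            active.discard(sid)
    return pairs

print(sweep_overlaps([(0, 5), (3, 9), (6, 8)]))           # [(0, 1), (1, 2)]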
An emerging datacenter network (DCN) with high scalability called HSDC is a server-centric DCN that can help cloud computing in supporting many inherent cloud services. For example, a server-centric DCN can initiate routing for data transmission. This paper investigates the construction of independent spanning trees (ISTs for short), a set of rooted spanning trees associated with the disjoint-path property, in HSDC. Regarding multiple spanning trees as a routing protocol, ISTs have applications in data transmission, e.g., fault-tolerant broadcasting and secure message distribution. We first establish the vertex-symmetry of HSDC. Then, using the structure that the n-dimensional HSDC is a compound graph of an n-dimensional hypercube Q_n and the n-clique K_n, we amend the algorithm constructing ISTs for Q_n to obtain the algorithm required by HSDC. Unlike most algorithms that recursively construct tree structures, our algorithm can find every node's parent in each spanning tree directly via an easy computation that relies only on the node address and the tree index. Consequently, we can implement the algorithm for constructing n ISTs in O(nN) time, where N = n·2^n is the number of vertices of the n-dimensional HSDC, or parallelize the algorithm to run in O(n) time using N processors. Remarkably, the diameter of the constructed ISTs is about twice the diameter of Q_n. (C) 2021 Elsevier Inc. All rights reserved.
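To give a flavor of direct, address-based computation on the hypercube part, the sketch below shows the classical construction of n internally disjoint paths from a node v to the root 0 of Q_n: path t touches bit positions in cyclic order starting at the tree index t, setting bit t first and clearing it last when it starts at 0. This is only the textbook hypercube idea that such IST constructions build on; the paper's amended rule for HSDC's compound structure is not reproduced.

# Path t (0 <= t < n) from node v != 0 to the root 0 of the hypercube Q_n.
def disjoint_path(v, t, n):
    path, x = [v], v
    defer = ((v >> t) & 1) == 0     # bit t is 0: set it now, clear it last
    x ^= 1 << t                     # toggle bit t first either way
    path.append(x)
    for k in range(1, n):
        j = (t + k) % n
        if (x >> j) & 1:
            x ^= 1 << j             # clear remaining set bits in cyclic order
            path.append(x)
    if defer:
        x ^= 1 << t                 # finally clear the bit set at the start
        path.append(x)
    return path

# Three pairwise internally disjoint paths from 110 to 000 in Q_3.
for t in range(3):
    print([format(u, "03b") for u in disjoint_path(0b110, t, 3)])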
ISBN:
(Digital) 9781665497862
ISBN:
(Print) 9781665497862
As data scales continue to increase, studying the porting and implementation of shared-memory parallel algorithms for distributed-memory architectures becomes increasingly important. We consider the problem of biconnectivity for the current study, which identifies cut vertices and cut edges in a graph. As part of our study, we implemented and optimized a shared-memory biconnectivity algorithm based on color propagation within a distributed-memory context. This algorithm is neither work- nor time-efficient. However, when we compare to distributed implementations of theoretically efficient algorithms, we find that simple non-optimal algorithms can greatly outperform time-efficient algorithms in practice when implemented for real distributed-memory environments and real data. Overall, our distributed implementation for computing graph biconnectivity demonstrates an average strong-scaling speedup of 15x across 64 MPI ranks on a suite of irregular real-world inputs. We also note an average of 11x and 7.3x speedup relative to the optimal serial algorithm and the fastest shared-memory implementation for the biconnectivity problem, respectively.
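The color-propagation primitive such algorithms build on is simple to state: every vertex repeatedly adopts the minimum label among its neighbors until a fixed point, which labels connected components. The serial sketch below shows only that primitive; the cut-vertex logic and the MPI distribution of the paper are not reproduced.

# Label (color) propagation for connected components.
def color_propagation(n, edges):
    color = list(range(n))              # every vertex starts as its own color
    changed = True
    while changed:                      # at most O(diameter) sweeps
        changed = False
        for u, v in edges:
            lo = min(color[u], color[v])
            if color[u] != lo: color[u], changed = lo, True
            if color[v] != lo: color[v], changed = lo, True
    return color

# Two components: {0, 1, 2} and {3, 4}.
print(color_propagation(5, [(0, 1), (1, 2), (3, 4)]))   # [0, 0, 0, 3, 3]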
ISBN:
(Print) 9781450392600
In addition to its theoretical interest, computing with circuits has found applications in many emerging domains. Yet, the exact circuit complexity of query evaluation has remained an unexplored topic. In this paper, we present circuit constructions for conjunctive queries under degree constraints, with polylogarithmic depth and size matching the polymatroid bound up to polylogarithmic factors. We also propose a definition of output-sensitive circuit families and obtain such circuits with sizes matching their RAM counterparts.
ISBN:
(Print) 9789082797091
We consider the problem of nonnegative tensor completion. We adopt the alternating optimization framework and solve each nonnegative matrix least-squares problem via an accelerated variation of stochastic gradient descent. The step-sizes used by the algorithm determine its behavior to a large extent. We propose two new strategies for computing the step-sizes and experimentally test their effectiveness using both synthetic and real-world data.
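For context, each alternating-optimization step solves a nonnegative least-squares subproblem. The sketch below solves such a subproblem with plain projected gradient descent and the standard 1/L step-size, i.e., the non-accelerated baseline that step-size strategies like the paper's improve upon; the two proposed strategies themselves are not shown, and the example data are hypothetical.

# Nonnegative least squares: min_x ||Ax - b||^2 subject to x >= 0.
import numpy as np

def nnls_pgd(A, b, iters=500):
    L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b)           # least-squares gradient
        x = np.maximum(x - g / L, 0.0)  # gradient step, then project onto x >= 0
    return x

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = A @ np.array([0.5, 0.0]) + 0.01     # nearly consistent nonnegative target
print(nnls_pgd(A, b))                   # approximately [0.5, 0.0]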
ISBN:
(Print) 9781450392815
Large-scale simulations of wave-type equations have many industrial applications, such as in oil and gas exploration. Realistic simulations, which involve a vast amount of data, are often performed on multiple nodes of an HPC cluster. Using GPUs for these simulations is attractive due to the considerable parallelizability of the algorithms. Many industry-relevant simulations have characteristics in their physics or geometry that can be exploited to improve computational efficiency. Furthermore, the choice of simulation algorithm impacts computational efficiency significantly. In this work, we exploit these features to significantly improve performance for a class of problems. Specifically, we use the discontinuous Galerkin (DG) finite element method, along with the Gauss-Lobatto-Legendre (GLL) integration scheme on hexahedral elements with straight faces, which greatly reduces the number of BLAS operations and simplifies the computations to Level-1 BLAS operations, reducing the turnaround time for wave simulation. However, attaining peak GPU performance is often not possible in these codes, which exacerbate bottlenecks caused by data movement, even when modern GPUs with the latest high-bandwidth memory are used. We have developed GAPS, an efficient and scalable GPU-accelerated PDE solver for wave simulation, using hardware- and data-movement-aware algorithms. While significant speed-up over CPUs can be achieved, data movement still limits GPU performance. We present several optimization strategies, including kernel fusion, look-up-table-based neighbor search, improved shared-memory utilization, and SM-occupancy-aware register allocation. They improve performance up to 84.15x over CPU implementations and 1.84x over base GPU implementations on average. We then extend GAPS to support multiple GPUs on multi-node HPC clusters for large-scale wave simulations, and perform additional optimizations to reduce communication overhead.
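The reduction to Level-1 BLAS mentioned above has a compact explanation: when the GLL quadrature points coincide with the GLL interpolation nodes on straight-faced elements, the element mass matrix becomes diagonal, so applying its inverse is an elementwise scaling rather than a dense solve. The 1D sketch below illustrates this property with hypothetical data; GAPS's GPU kernels are not shown.

# Diagonal mass matrix under GLL collocation (1D element, degree 2).
import numpy as np

nodes   = np.array([-1.0, 0.0, 1.0])      # GLL nodes on [-1, 1]
weights = np.array([1/3, 4/3, 1/3])       # GLL quadrature weights (sum to 2)

M = np.diag(weights)             # mass matrix is diagonal when nodes = quad points
r = np.array([0.3, -0.1, 0.2])   # residual from a DG right-hand side (made up)

du = r / weights                 # Level-1 scaling replaces a dense solve of M du = r
print(np.allclose(M @ du, r))    # True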