• Currently, domain propagation in state-of-the-art MIP solvers is single-threaded only.
• The paper presents a novel, efficient GPU algorithm to perform domain propagation.
• Challenges are dynamic algorithmic behavior, dependency structures, and sparsity patterns.
• The algorithm is capable of running entirely on the GPU with no CPU involvement.
• We achieve speed-ups of around 10x to 20x, and up to 180x on favorably large instances.
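For readers unfamiliar with domain propagation, the sketch below shows the serial core operation that such a GPU algorithm parallelizes: tightening variable bounds over a single linear constraint via an activity argument. It is a minimal illustration under assumed data layouts (plain Python lists, a single <= constraint), not the paper's GPU kernel; the function name propagate is hypothetical.

# A minimal sketch of domain propagation for one linear constraint
#   sum_j a[j] * x[j] <= b   with current variable bounds lb, ub.
def propagate(a, b, lb, ub):
    # Minimal activity: each term takes its smallest attainable value.
    min_act = sum(c * (lb[j] if c > 0 else ub[j]) for j, c in enumerate(a))
    new_lb, new_ub = lb[:], ub[:]
    for j, c in enumerate(a):
        if c == 0:
            continue
        # Residual minimal activity with variable j removed.
        res = min_act - c * (lb[j] if c > 0 else ub[j])
        if c > 0:
            new_ub[j] = min(ub[j], (b - res) / c)   # tighten upper bound
        else:
            new_lb[j] = max(lb[j], (b - res) / c)   # tighten lower bound
    return new_lb, new_ub

# Example: 2x + 3y <= 6 with x, y in [0, 10] tightens to x <= 3, y <= 2.
print(propagate([2.0, 3.0], 6.0, [0.0, 0.0], [10.0, 10.0]))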
In this paper, we present a new parallel accurate algorithm, called PAccSumK, for computing the summation of floating-point numbers. It is based on the AccSumK algorithm. In our experiments, for summation problems with large condition numbers, our algorithm outperforms the PSumK algorithm in both accuracy and computing time. The reason is that our algorithm builds on the more accurate AccSumK algorithm, whereas PSumK uses the SumK algorithm. The proposed parallel algorithm is designed to compute a result as if computed internally in K-fold working precision. Numerical results are presented showing the performance and accuracy of our new parallel summation algorithm. (c) 2021 Elsevier B.V. All rights reserved.
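As background, the following minimal serial sketch shows the building blocks that SumK-style algorithms are made of: the TwoSum error-free transformation and K-fold distillation sweeps. The paper's AccSumK and PAccSumK themselves are not reproduced here; the function names two_sum and sum_k are illustrative.

# Error-free transformation (Knuth's TwoSum): a + b = s + e exactly.
def two_sum(a, b):
    s = a + b
    z = s - a
    e = (a - (s - z)) + (b - z)
    return s, e

# K-fold cascaded summation in the spirit of SumK.
def sum_k(p, K=2):
    p = list(p)
    for _ in range(K - 1):              # K-1 distillation sweeps
        for i in range(1, len(p)):
            p[i], p[i - 1] = two_sum(p[i], p[i - 1])
    return sum(p[:-1]) + p[-1]          # final ordinary summation

# Ill-conditioned example: naive summation loses the 1.0 terms.
data = [1e16, 1.0, -1e16, 1.0]
print(sum(data), sum_k(data, K=2))      # 1.0 (naive) vs 2.0 (exact)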
ISBN:
(Print) 9781665440660
Stochastic Gradient Descent (SGD) is an essential element in Machine Learning (ML) algorithms. Asynchronous shared-memory parallel SGD (AsyncSGD), including synchronization-free algorithms such as HOGWILD!, has received interest in certain contexts, due to reduced overhead compared to synchronous parallelization. Although these methods induce staleness and inconsistency, they have shown speedup for problems with smooth, strongly convex targets and gradient sparsity. Recent works take important steps towards understanding the potential of parallel SGD for problems not conforming to these strong assumptions, in particular for deep learning (DL). There is, however, a gap in the current literature in understanding when AsyncSGD algorithms are useful in practice, and in particular how mechanisms for synchronization and consistency play a role. We contribute to answering questions in this gap by studying a spectrum of parallel algorithmic implementations of AsyncSGD, aiming to understand how shared-data synchronization influences the convergence properties in fundamental DL applications. We focus on the impact of consistency-preserving non-blocking synchronization on SGD convergence, and on sensitivity to hyperparameter tuning. We propose Leashed-SGD, an extensible algorithmic framework of consistency-preserving implementations of AsyncSGD, employing lock-free synchronization, effectively balancing throughput and latency. Leashed-SGD features a natural contention-regulating mechanism, as well as dynamic memory management, allocating space only when needed. We argue analytically about the dynamics of the algorithms, memory consumption, the threads' progress over time, and the expected contention. We provide a comprehensive empirical evaluation, validating the analytical claims, benchmarking the proposed Leashed-SGD framework, and comparing to baselines for two prominent deep learning (DL) applications: multilayer perceptrons (MLP) and convolutional neural networks (CNN).
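To make the setting concrete, here is a minimal HOGWILD!-style sketch in which worker threads update a shared parameter vector with no locks at all; it illustrates only the synchronization-free end of the spectrum, not the Leashed-SGD framework. The least-squares model, learning rate, and sizes are hypothetical.

# Minimal synchronization-free AsyncSGD sketch (HOGWILD! style).
import threading, random

dim, lr, steps = 16, 0.01, 1000
w = [0.0] * dim                          # shared parameters, no lock
data = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(256)]

def grad(x, w):
    # Gradient of the per-sample loss 0.5 * (w.x - 1)^2.
    err = sum(wi * xi for wi, xi in zip(w, x)) - 1.0
    return [err * xi for xi in x]

def worker():
    for _ in range(steps):
        g = grad(random.choice(data), w)
        for i in range(dim):
            # Unsynchronized read-modify-write: stale and inconsistent by
            # design (CPython's GIL serializes bytecodes, but there is no
            # program-level synchronization).
            w[i] -= lr * g[i]

threads = [threading.Thread(target=worker) for _ in range(4)]
for t in threads: t.start()
for t in threads: t.join()
loss = sum((sum(wi * xi for wi, xi in zip(w, x)) - 1.0) ** 2
           for x in data) / len(data)
print("mean squared error:", loss)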
ISBN:
(Print) 9781665484855
The emerging memristive Memory Processing Unit (mMPU) overcomes the memory wall through memristive devices that unite storage and logic for real processing-in-memory (PIM) systems. At the core of the mMPU is stateful logic, which is accelerated with memristive partitions to enable logic with massive inherent parallelism within crossbar arrays. This paper vastly accelerates the fundamental operations of matrix-vector multiplication and convolution in the mMPU, with either full-precision or binary elements. These proposed algorithms establish an efficient foundation for large-scale mMPU applications such as neural networks, image processing, and numerical methods. We overcome the inherent asymmetry limitation in previous in-memory full-precision matrix-vector multiplication solutions by utilizing techniques from block matrix multiplication and reduction. We present the first fast in-memory binary matrix-vector multiplication algorithm by utilizing memristive partitions with a tree-based popcount reduction (39x faster than previous work). For convolution, we present a novel in-memory input-parallel concept which we utilize for a full-precision algorithm that overcomes the asymmetry limitation in convolution while also improving latency (2x faster than previous work), and the first fast binary algorithm (12x faster than previous work).
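The arithmetic behind binary matrix-vector multiplication can be sketched in a few lines: with ±1 elements packed into bit masks, each dot product is an XNOR followed by a popcount, and the popcount is exactly the reduction a tree-based in-memory scheme accelerates. The host-side encoding below is a hypothetical illustration; the crossbar/partition mapping is not shown.

# Binary (+1/-1) matrix-vector product via XNOR + popcount.
def bin_matvec(M_bits, v_bits, n):
    # Each row and the vector are n-bit masks: bit 1 encodes +1, bit 0 encodes -1.
    out = []
    for row in M_bits:
        agree = ~(row ^ v_bits) & ((1 << n) - 1)  # XNOR: positions with equal sign
        pop = bin(agree).count("1")               # popcount (tree-reducible)
        out.append(2 * pop - n)                   # dot product in {-n, ..., n}
    return out

# Example: a 2x3 matrix of +/-1 times the all-(+1) vector.
print(bin_matvec([0b101, 0b011], 0b111, 3))       # [1, 1]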
ISBN:
(Print) 9781450395298
Polygon overlay operations are used for various purposes such as GIS, VLSI, and geometric operations. Recent articles present algorithms using the GPU to perform the polygon overlay operation. We present two algorithms implemented on the GPU that focus on the active list of the traditional serial plane sweep algorithm. The presented results show improvement in execution time with respect to recent algorithms.
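As a reminder of the structure the two GPU algorithms target, the sketch below shows a serial sweep maintaining an active list, on a deliberately simplified interval-overlap variant: elements enter the active list at their left endpoint and leave at their right. The real polygon overlay sweep additionally handles segment geometry and intersection events, which are omitted here.

# Plane-sweep skeleton: an active list driven by enter/leave events.
def sweep_overlaps(segments):
    # segments: list of (x_left, x_right); report index pairs whose ranges overlap.
    events = sorted((x, kind, sid)
                    for sid, (xl, xr) in enumerate(segments)
                    for x, kind in ((xl, 0), (xr, 1)))    # 0 = enter, 1 = leave
    active, pairs = set(), []
    for x, kind, sid in events:
        if kind == 0:
            pairs += [(other, sid) for other in active]   # overlaps with actives
            active.add(sid)
        else:
            active.discard(sid)
    return pairs

print(sweep_overlaps([(0, 5), (3, 9), (6, 8)]))           # [(0, 1), (1, 2)]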
An emerging datacenter network (DCN) with high scalability called HSDC is a server-centric DCN that can help cloud computing in supporting many inherent cloud services. For example, a server-centric DCN can initiate routing for data transmission. This paper investigates the construction of independent spanning trees (ISTs for short), a set of rooted spanning trees associated with the disjoint-path property, in HSDC. Regarding multiple spanning trees as a routing protocol, ISTs have applications in data transmission, e.g., fault-tolerant broadcasting and secure message distribution. We first establish the vertex-symmetry of HSDC. Then, using the structure that the n-dimensional HSDC is a compound graph of an n-dimensional hypercube Q_n and the n-clique K_n, we amend the algorithm constructing ISTs for Q_n to obtain the algorithm required by HSDC. Unlike most algorithms that recursively construct tree structures, our algorithm can find every node's parent in each spanning tree directly via an easy computation that relies only on the node address and the tree index. Consequently, we can implement the algorithm for constructing n ISTs in O(nN) time, where N = n·2^n is the number of vertices of the n-dimensional HSDC, or parallelize the algorithm to run in O(n) time using N processors. Remarkably, the diameter of the constructed ISTs is about twice the diameter of Q_n. (C) 2021 Elsevier Inc. All rights reserved.
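To give a flavor of direct, address-based computation on the hypercube part, the sketch below shows the classical construction of n internally disjoint paths from a node v to the root 0 of Q_n: path t touches bit positions in cyclic order starting at the tree index t, setting bit t first and clearing it last when it starts at 0. This is only the textbook hypercube idea that such IST constructions build on; the paper's amended rule for HSDC's compound structure is not reproduced.

# Path t (0 <= t < n) from node v != 0 to the root 0 of the hypercube Q_n.
def disjoint_path(v, t, n):
    path, x = [v], v
    defer = ((v >> t) & 1) == 0     # bit t is 0: set it now, clear it last
    x ^= 1 << t                     # toggle bit t first either way
    path.append(x)
    for k in range(1, n):
        j = (t + k) % n
        if (x >> j) & 1:
            x ^= 1 << j             # clear remaining set bits in cyclic order
            path.append(x)
    if defer:
        x ^= 1 << t                 # finally clear the bit set at the start
        path.append(x)
    return path

# Three pairwise internally disjoint paths from 110 to 000 in Q_3.
for t in range(3):
    print([format(u, "03b") for u in disjoint_path(0b110, t, 3)])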
ISBN:
(Digital) 9781665497862
ISBN:
(Print) 9781665497862
As data scales continue to increase, studying the porting and implementation of shared-memory parallel algorithms for distributed-memory architectures becomes increasingly important. We consider the problem of biconnectivity for the current study, which identifies cut vertices and cut edges in a graph. As part of our study, we implemented and optimized a shared-memory biconnectivity algorithm based on color propagation within a distributed-memory context. This algorithm is neither work- nor time-efficient. However, when we compare to distributed implementations of theoretically efficient algorithms, we find that simple non-optimal algorithms can greatly outperform time-efficient algorithms in practice when implemented for real distributed-memory environments and real data. Overall, our distributed implementation for computing graph biconnectivity demonstrates an average strong-scaling speedup of 15x across 64 MPI ranks on a suite of irregular real-world inputs. We also note an average of 11x and 7.3x speedup relative to the optimal serial algorithm and the fastest shared-memory implementation for the biconnectivity problem, respectively.
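The color-propagation primitive such algorithms build on is simple to state: every vertex repeatedly adopts the minimum label among its neighbors until a fixed point, which labels connected components. The serial sketch below shows only that primitive; the cut-vertex logic and the MPI distribution of the paper are not reproduced.

# Label (color) propagation for connected components.
def color_propagation(n, edges):
    color = list(range(n))              # every vertex starts as its own color
    changed = True
    while changed:                      # at most O(diameter) sweeps
        changed = False
        for u, v in edges:
            lo = min(color[u], color[v])
            if color[u] != lo: color[u], changed = lo, True
            if color[v] != lo: color[v], changed = lo, True
    return color

# Two components: {0, 1, 2} and {3, 4}.
print(color_propagation(5, [(0, 1), (1, 2), (3, 4)]))   # [0, 0, 0, 3, 3]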
ISBN:
(Print) 9781450392600
In addition to its theoretical interest, computing with circuits has found applications in many emerging domains. Yet, the exact circuit complexity of query evaluation has remained an unexplored topic. In this paper, we present circuit constructions for conjunctive queries under degree constraints, with polylogarithmic depth and size matching the polymatroid bound up to polylogarithmic factors. We also propose a definition of output-sensitive circuit families and obtain such circuits with sizes matching their RAM counterparts.
ISBN:
(Print) 9789082797091
We consider the problem of nonnegative tensor completion. We adopt the alternating optimization framework and solve each nonnegative matrix least-squares problem via an accelerated variation of stochastic gradient descent. The step-sizes used by the algorithm determine its behavior to a large extent. We propose two new strategies for computing the step-sizes and experimentally test their effectiveness using both synthetic and real-world data.
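For context, each alternating-optimization step solves a nonnegative least-squares subproblem. The sketch below solves such a subproblem with plain projected gradient descent and the standard 1/L step-size, i.e., the non-accelerated baseline that step-size strategies like the paper's improve upon; the two proposed strategies themselves are not shown, and the example data are hypothetical.

# Nonnegative least squares: min_x ||Ax - b||^2 subject to x >= 0.
import numpy as np

def nnls_pgd(A, b, iters=500):
    L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        g = A.T @ (A @ x - b)           # least-squares gradient
        x = np.maximum(x - g / L, 0.0)  # gradient step, then project onto x >= 0
    return x

A = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
b = A @ np.array([0.5, 0.0]) + 0.01     # nearly consistent nonnegative target
print(nnls_pgd(A, b))                   # approximately [0.5, 0.0]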
ISBN:
(Print) 9781450392815
Large-scale simulations of wave-type equations have many industrial applications, such as in oil and gas exploration. Realistic simulations, which involve a vast amount of data, are often performed on multiple nodes of an HPC cluster. Using GPUs for these simulations is attractive due to the considerable parallelizability of the algorithms. Many industry-relevant simulations have characteristics in their physics or geometry that can be exploited to improve computational efficiency. Furthermore, the choice of simulation algorithm impacts computational efficiency significantly. In this work, we exploit these features to significantly improve performance for a class of problems. Specifically, we use the discontinuous Galerkin (DG) finite element method, along with the Gauss-Lobatto-Legendre (GLL) integration scheme on hexahedral elements with straight faces, which greatly reduces the number of BLAS operations and simplifies the computations to Level-1 BLAS operations, reducing the turnaround time for wave simulation. However, attaining peak GPU performance is often not possible in these codes, which exacerbate bottlenecks caused by data movement, even when modern GPUs with the latest high-bandwidth memory are used. We have developed GAPS, an efficient and scalable GPU-accelerated PDE solver for wave simulation, using hardware- and data-movement-aware algorithms. While significant speed-up over CPUs can be achieved, data movement still limits GPU performance. We present several optimization strategies, including kernel fusion, look-up-table-based neighbor search, improved shared-memory utilization, and SM-occupancy-aware register allocation. They improve performance up to 84.15x over CPU implementations and 1.84x over base GPU implementations on average. We then extend GAPS to support multiple GPUs on multi-node HPC clusters for large-scale wave simulations, and perform additional optimizations to reduce communication overhead.
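The reduction to Level-1 BLAS mentioned above has a compact explanation: when the GLL quadrature points coincide with the GLL interpolation nodes on straight-faced elements, the element mass matrix becomes diagonal, so applying its inverse is an elementwise scaling rather than a dense solve. The 1D sketch below illustrates this property with hypothetical data; GAPS's GPU kernels are not shown.

# Diagonal mass matrix under GLL collocation (1D element, degree 2).
import numpy as np

nodes   = np.array([-1.0, 0.0, 1.0])      # GLL nodes on [-1, 1]
weights = np.array([1/3, 4/3, 1/3])       # GLL quadrature weights (sum to 2)

M = np.diag(weights)             # mass matrix is diagonal when nodes = quad points
r = np.array([0.3, -0.1, 0.2])   # residual from a DG right-hand side (made up)

du = r / weights                 # Level-1 scaling replaces a dense solve of M du = r
print(np.allclose(M @ du, r))    # True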