The design of memory consistency models for both hardware and soft-ware is a difficult task. It is particularly difficult for a programming language because the target audience is much wider than the target audience f...
详细信息
ISBN:
(纸本)0769515797
The design of memory consistency models for both hardware and soft-ware is a difficult task. It is particularly difficult for a programming language because the target audience is much wider than the target audience for a machine language, making usability a more important criteria. Adding to this problem is the fact that the programming language community has little experience designing programming language consistency models, and therefore each new attempt is very much a voyage into uncharted territory. A concrete example of the difficulties of the task is the current Java Memory Model. Although designed to be easy to use by Java programmers, it is poorly understood and at least one common idiom (the "double check idiom") to exploit the model is unsafe. In this paper we describe the design of an optimizing Java compiler that will accept, as either input or as an interface implementation, a consistency model for the code to be compiled The compiler will use escape analysis, Shasha and Snir's delay set analysis, and our CSSA program representation to normalize the effects of different consistency models on optimizations and analysis. When completed, the compiler will serve as a testbed to prototype new memory models, and to measure the differences of different memory models on program performance.
Given a graph, finding the distance-2 maximal independent set (MIS-2) of the vertices is a problem that is useful in several contexts such as algebraic multigrid coarsening or multilevel graph partitioning. Such multi...
详细信息
ISBN:
(纸本)9781665481069
Given a graph, finding the distance-2 maximal independent set (MIS-2) of the vertices is a problem that is useful in several contexts such as algebraic multigrid coarsening or multilevel graph partitioning. Such multilevel methods rely on finding the independent vertices so they can be used as seeds for aggregation in a multilevel scheme. We present a parallel MIS-2 algorithm to improve performance on modern accelerator hardware. This algorithm is implemented using the Kokkos programming model to enable performance portability. We demonstrate the portability of the algorithm and the performance on a variety of architectures (x86/ARM CPUs and NVIDIA/AMD GPUs). The resulting algorithm is also deterministic, producing an identical result for a given input across all of these platforms. The new MIS-2 implementation outperforms implementations in state of the art libraries like CUSP and ViennaCL by 3-8x while producing similar quality results. We further demonstrate the benefits of this approach by developing parallel graph coarsening scheme for two different use cases. First, we develop an algebraic multigrid (AMG) aggregation scheme using parallel MIS-2 and demonstrate the benefits as opposed to previous approaches used in the MueLu multigrid package in Trilinos. We also describe an approach for implementing a parallel multicolor "cluster" Gauss-Seidel preconditioner using this MIS-2 coarsening, and demonstrate better performance with an efficient, parallel, multicolor Gauss-Seidel algorithm.
We present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: D-CMP with a private cache f...
详细信息
ISBN:
(纸本)9781595939739
We present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: D-CMP with a private cache for each core, S-CMP with a single cache shared by all cores, and Multicore, which has private L, caches and a shared L-2 cache. We derive results for three classes of problems: local dependency dynamic programming (LDDP), Gaussian Elimination Paradigm (GEP), and parenthesis problem. For each class of problems, we develop a generic CMP algorithm with an associated tiling sequence. We then tailor this tiling sequence to each caching model and provide a parallel schedule that results in a cache-efficient parallel execution up to the critical path length of the underlying dynamic programming algorithm. We present experimental results on an 8-core Opteron for two sequence alignment problems that are important examples of LDDP. Our experimental results show good speed-ups for simple versions of our algorithms.
Array processors are gaining importance as fabrication technology scales down. Thus a simple and systematic method to implement algorithms in array architectures is desirable. In this paper a general Integer Linear Pr...
详细信息
ISBN:
(纸本)9780780393899
Array processors are gaining importance as fabrication technology scales down. Thus a simple and systematic method to implement algorithms in array architectures is desirable. In this paper a general Integer Linear Program (ILP) is formulated to minimize computation time and interconnection length. The ILP is solved with various constraints on the parallel computation time and interconnect space. Different cost functions as desired by the designer can be added to the optimization process. Viterbi decoder is considered as a case study. Optimum array architecture is obtained using the proposed ILP.
The OpenMP Application programming Interface is an emerging standard for parallelprogramming on shared-memory multiprocessors. For on-chip multiprocessor architectures, OpenMP offers an easy-to-use parallel programmi...
详细信息
ISBN:
(纸本)1581135769
The OpenMP Application programming Interface is an emerging standard for parallelprogramming on shared-memory multiprocessors. For on-chip multiprocessor architectures, OpenMP offers an easy-to-use parallelprogramming environment to develop multi-threaded applications on on-chip multiprocessors. Currently, a portable OpenMP compiler, called Omni OpenMP compiler system is being developed.
Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, u...
详细信息
ISBN:
(纸本)9781450301787
Active messages have proven to be an effective approach for certain communication problems in high performance computing. Many MPI implementations, as well as runtimes for Partitioned Global Address Space languages, use active messages in their low-level transport layers. However, most active message frameworks have low-level programming interfaces that require significant programming effort to use directly in applications and that also prevent optimization opportunities. In this paper we present AM++, a new user-level library for active messages based on generic programming techniques. Our library allows message handlers to be run in an explicit loop that can be optimized and vectorized by the compiler and that can also be executed in parallel on multicore architectures. Runtime optimizations, such as message combining and filtering, are also provided by the library, removing the need to implement that functionality at the application level. Evaluation of AM++ with distributed-memory graph algorithms shows the usability benefits provided by these library features, as well as their performance advantages.
We propose a MIP programming model for the bird's nest detection on the railway catenary, which performs coarse-to-fine strategy based on a cascaded YOLO network, and calculates the coarse-level and fine-level det...
详细信息
We describe a very simple deterministic parallel algorithm for linear programming in fixed dimension d that takes poly(log log n) time in the common CRCW PRAM model and does optimal O(n) work. Our algorithm is based o...
详细信息
ISBN:
(纸本)9780897918091
We describe a very simple deterministic parallel algorithm for linear programming in fixed dimension d that takes poly(log log n) time in the common CRCW PRAM model and does optimal O(n) work. Our algorithm is based on multidimensional search and an effective use of approximation algorithms to speed-up the basic search in the CRCW model. Our method also yields very fast poly(log log n) algorithm for smallest enclosing sphere and approximate ham-sandwich cuts as well as an O(log n) time work-optimal algorithm for exact ham-sandwich cuts of separable point sets. For all these problems, particularly for the fixed-dimensional linear programming, o(log n) time efficient deterministic PRAM algorithms were not known until very recently.
Issues related to exploitation of parallelism in perceptual grouping process, a generally accepted paradigm for image segmentation, are explored. The segmentation process is analyzed at three levels: tasks which are p...
详细信息
This book constitutes the refereed proceedings of the 10th internationalsymposium on parallelarchitectures, algorithms and programming, PAAP 2019, held in Guangzhou, China, in December 2019.
ISBN:
(数字)9789811527678
ISBN:
(纸本)9789811527661
This book constitutes the refereed proceedings of the 10th internationalsymposium on parallelarchitectures, algorithms and programming, PAAP 2019, held in Guangzhou, China, in December 2019.
暂无评论