In this paper, we describe a parallel Branch-and-Bound (B&B) algorithm with a history-based domination technique, and we apply it to the Sequential Ordering Problem (SOP). To the best of our knowledge, the propose...
详细信息
ISBN:
(纸本)9781450392044
In this paper, we describe a parallel Branch-and-Bound (B&B) algorithm with a history-based domination technique, and we apply it to the Sequential Ordering Problem (SOP). To the best of our knowledge, the proposed algorithm is the first parallel B&B algorithm that includes a history-based domination technique and is the first parallel B&B algorithm for solving the SOP using a pure B&B approach. the proposed algorithm takes a pool-based approach and employs a collection of novel techniques that we have developed to achieve effective parallel exploration of the solution space, including parallel history domination, history table memory management, and a thread restart technique. the proposed algorithm was experimentally evaluated using the SOPLIB and TSPLIB benchmarks. the results show that using ten threads with a time limit of one hour on the medium-difficulty instances, the proposed algorithm gives a geometric-mean speedup of 19.9 on SOPLIB and 10.23 on TSPLIB, with super-linear speedups up to 65x seen on 17 instances.
the increasing demand in HPC to utilize accelerators has motivated the development of pragma-based directives to target these devices. OmpSs-2 and OpenACC are both directive-based solutions that allow application prog...
详细信息
ISBN:
(纸本)9781450392044
the increasing demand in HPC to utilize accelerators has motivated the development of pragma-based directives to target these devices. OmpSs-2 and OpenACC are both directive-based solutions that allow application programmers to utilize accelerators. the two leverage distinct types of parallelism: task parallelism and data parallelism, respectively. Non-trivial scientific applications can benefit from both types of available parallelism. However, the combination of pragma-based models is difficult to coordinate, as both assume full control and are unaware of each other at runtime. We propose an interoperation mechanism to enable novel composability across pragma-based programming models. We study and propose a clear separation of duties and implement our approach by augmenting the OmpSs-2 programming model, compiler and runtime to support OmpSs-2 + OpenACC programming.
We introduce a new approach to automatically extract an idealized logical structure from a parallel execution trace. We use this structure to define intuitive metrics such as the lateness of a process involved in a pa...
详细信息
ISBN:
(纸本)9781450326568
We introduce a new approach to automatically extract an idealized logical structure from a parallel execution trace. We use this structure to define intuitive metrics such as the lateness of a process involved in a parallel execution. By analyzing and illustrating traces in terms of logical steps, we leverage a developer's understanding of the happened-before relations in a parallel program. this technique can uncover dependency chains, elucidate communication patterns, and highlight sources and propagation of delays, all of which may be obscured in a traditional trace visualization.
Tensor computations present significant performance challenges that impact a wide spectrum of applications. Efforts on improving the performance of tensor computations include exploring data layout, execution scheduli...
详细信息
ISBN:
(纸本)9781450368186
Tensor computations present significant performance challenges that impact a wide spectrum of applications. Efforts on improving the performance of tensor computations include exploring data layout, execution scheduling, and parallelism in common tensor kernels. this work presents a benchmark suite for arbitrary-order sparse tensor kernels using state-of-the-art tensor formats: coordinate (COO) and hierarchical coordinate (HiCOO). It demonstrates a set of reference tensor kernel implementations and some observations on Intel CPUs and NVIDIA GPUs. the full paper can be referred to at http://***/abs/2001.00660.
the proceedings contain 44 papers. the topics discussed include: predicate RCU: an RCU for scalable concurrent updates;automatic scalable atomicity via semantic locking;a framework for practical parallel fast matrix m...
ISBN:
(纸本)9781450332057
the proceedings contain 44 papers. the topics discussed include: predicate RCU: an RCU for scalable concurrent updates;automatic scalable atomicity via semantic locking;a framework for practical parallel fast matrix multiplication;PLUTO+: near-complete modeling of affine transformations for parallelism and locality;distributed memory code generation for mixed irregular/regular computations;performance implications of dynamic memory allocators on transactional memory systems;low-overhead software transactional memory with progress guarantees and strong semantics∗;barrier elision for production parallel programs;scalable and efficient implementation of 3D unstructured meshes computation: a case study on matrix assembly;and diagnosing the causes and severity of one-sided message contention.
Chase and Lev9;s concurrent deque is a key data structure in shared-memory parallelprogramming and plays an essential role in work-stealing schedulers. We provide the first correctness proof of an optimized implem...
详细信息
Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored per...
详细信息
ISBN:
(纸本)9781450368186
Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored performing network operations from within a GPU kernel itself. However, these approaches typically involve the CPU in the critical path, which leads to high latency and ineicient utilization of network and/or GPU resources. In this work, we introduce GPU Initiated OpenSHMEM (GIO), a new intra-kernel PGAS programming model and runtime that enables GPUs to communicate directly with a NIC without the intervention of the CPU. We accomplish this by exploring the GPU's coarse-grained memory model and correcting semantic mismatches when GPUs wish to directly interact withthe network. GIO also reduces latency by relying on a novel template-based design to minimize the overhead of initiating a network operation. We illustrate that for structured applications like a Jacobi 2D stencil, GIO can improve application performance by up to 40% compared to traditional kernel-boundary networking. Furthermore, we demonstrate that on irregular applications like Sparse Triangular Solve (SpTS), GIO provides up to 44% improvement compared to existing intra-kernel networking schemes.
the Toolkit for Accurate Scientific Software (TASS) is a suite of tools for the formal verification of MPI-based parallel programs used in computational science. TASS can verify various safety properties as well as co...
详细信息
ISBN:
(纸本)9781450301190
the Toolkit for Accurate Scientific Software (TASS) is a suite of tools for the formal verification of MPI-based parallel programs used in computational science. TASS can verify various safety properties as well as compare two programs for functional equivalence. the TASS front end takes an integer n >= 1 and a C/MPI program, and constructs an abstract model of the program with n processes. Procedures, structs, (multi-dimensional) arrays, heap-allocated data, pointers, and pointer arithmetic are all representable in a TASS model. the model is then explored using symbolic execution and explicit state space enumeration. A number of techniques are used to reduce the time and memory consumed. A variety of realistic MPI programs have been verified with TASS, including Jacobi iteration and manager-worker type programs, and some subtle defects have been discovered. TASS is written in Java and is available from http://***/tass under the Gnu Public License.
It is widely acknowledged in high-performance computing circles that parallel input/output needs substantial improvement in order to make scalable computers truly usable. We present a data storage model that allows pr...
详细信息
It is widely acknowledged in high-performance computing circles that parallel input/output needs substantial improvement in order to make scalable computers truly usable. We present a data storage model that allows processors independent access to their own data and a corresponding compilation strategy that integrates data-parallel computation with data distribution for out-of-core problems. Our results compare several communication methods and I/O optimizations using two out-of-core problems, Jacobi iteration and LU factorization.
Traditional concurrent programming involves manipulating shared mutable state. Alternatives to this programming style are communicating sequential processes (CSP) [1] and actor [2] models, which share data via explici...
详细信息
暂无评论