ISBN (Print): 9781509036837
Programming models multiply seemingly without bound. They emerge from university and corporate research labs at a rate that outstrips anyone's ability to cope. For all this prodigious effort, however, only a remarkably tiny number of these models are actually used to any significant degree. In this talk, we will explore the emergence of new programming models, the sociology connected to their origins, and the factors that allow certain ones to succeed. We will then consider the changes we see just over the horizon in hardware and ask the question: "Are we entering a period where new parallel programming models might actually succeed?" We will then discuss our work to understand the commonly found species of programming models with ExaScale ambitions. In particular, we expose these programming models to our suite of tests (https://***/ParRes/Kernels) to explore the survival of the fittest programming model: one that will hopefully carry us into the era of ExaScale computers.
ISBN (Print): 9781509021413
Multiplication of a sparse matrix with a dense matrix is a building block of an increasing number of applications in many areas such as machine learning and graph algorithms. However, most previous work on parallel matrix multiplication considered only the cases in which both operands are dense or both are sparse. This paper analyzes the communication lower bounds and compares the communication costs of various classic parallel algorithms in the context of sparse-dense matrix-matrix multiplication. We also present new communication-avoiding algorithms based on a 1D decomposition, called 1.5D, which - while suboptimal in the dense-dense and sparse-sparse cases - outperform the 2D and 3D variants both theoretically and in practice for sparse-dense multiplication. Our analysis separates one-time costs from per-iteration costs in an iterative machine learning context. Experiments demonstrate speedups of up to 100x over a baseline 3D SUMMA implementation and show parallel scaling on over ten thousand cores.
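The 1.5D decomposition itself is the paper's contribution; as a grounding point, the sketch below shows only the local kernel that any of the distributed variants ultimately runs on its own blocks: a CSR sparse matrix times a dense row-major matrix. The function and array names are illustrative, not taken from the paper.

```c
/*
 * Minimal sequential sketch of the local sparse-dense product
 * C += A * B, with A (m x k) stored in CSR and B (k x n), C (m x n)
 * dense, row-major. Names are illustrative only.
 */
void spmm_csr(int m, int n,
              const int *row_ptr,          /* CSR row pointers, length m+1 */
              const int *col_idx,          /* CSR column indices           */
              const double *val,           /* CSR nonzero values           */
              const double *B, double *C)
{
    for (int i = 0; i < m; i++)
        for (int p = row_ptr[i]; p < row_ptr[i + 1]; p++) {
            double a = val[p];
            const double *Brow = B + (long)col_idx[p] * n;  /* row of B hit by this nonzero */
            for (int j = 0; j < n; j++)
                C[(long)i * n + j] += a * Brow[j];
        }
}
```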
ISBN (Print): 9781509021413
This work presents a hierarchical, parallel, dynamic dependence analysis for inferring run-time dependencies between recursively parallel tasks in the OmpSs programming model. To evaluate the dependence analysis we implement PARTEE, a scalable runtime system that supports implicit synchronization between nested parallel tasks. We evaluate the performance of the resulting runtime system and compare it to Nanos++, the state-of-the-art OmpSs implementation, and Cilk, a high-performance task-parallel runtime system without implicit task synchronization. We find that i) PARTEE is able to handle finer-grained tasks than Nanos++, ii) PARTEE's performance is comparable to that of Cilk, and iii) in cases where task dependencies are irregular, PARTEE outperforms Cilk by up to 103%.
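For readers unfamiliar with this style of tasking, the sketch below is not PARTEE or Nanos++ code; it uses standard OpenMP depend clauses as a close analogue of the dataflow annotations from which such runtimes infer the task graph at run time (here, the third task must wait for the first two).

```c
/* A small OpenMP-tasking analogue of dataflow-style task dependencies. */
#include <stdio.h>

int main(void)
{
    double a = 0.0, b = 0.0, c = 0.0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1.0;                      /* produces a */

        #pragma omp task depend(out: b)
        b = 2.0;                      /* produces b */

        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                    /* runs only after both producers finish */

        #pragma omp taskwait
    }
    printf("c = %f\n", c);
    return 0;
}
```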
ISBN (Print): 9781509021413
Ever since Clayton Christensen coined the terms "disruptive technologies" and "disruptive innovations" in the 1990s, researchers and entrepreneurs have loved the word "disruptive," because disrupting current knowledge or products helps us accelerate knowledge discovery and move society into a new era. What is disruptive research? What is disruptive innovation? How do they happen? To answer such questions, in this talk I will share my experience from co-leading the ImageNet project, which built a knowledge base for the computer vision and machine learning community, and from co-founding Data Domain, Inc., which built deduplication storage ecosystems to replace tape library infrastructure in data centers.
ISBN (Print): 9781509036837
The clustering coefficient and the transitivity ratio are concepts often used in network analysis, which creates a need for fast practical algorithms for counting triangles in large graphs. Previous research in this area focused on sequential algorithms, MapReduce parallelization, and fast approximations. In this paper we propose a parallel triangle counting algorithm for CUDA GPUs. We describe the implementation details necessary to achieve high performance and present an experimental evaluation of our approach. The algorithm achieves a 15x to 35x speedup over our CPU implementation and finds 8.8 billion triangles in a graph with 180 million edges in 12 seconds on the Nvidia GeForce GTX 980 GPU.
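As background, the sketch below shows the sequential core that GPU triangle-counting kernels typically parallelize: intersecting sorted adjacency lists over an edge set oriented from lower to higher vertex id, so each triangle is counted exactly once. This is the standard technique, not the paper's CUDA implementation; the CSR layout and names are assumptions.

```c
/*
 * Sequential intersection-based triangle counting. The graph is assumed
 * to be in CSR form, each adjacency list sorted, and each undirected
 * edge stored once, oriented from lower to higher vertex id.
 */
long long count_triangles(int n, const int *row_ptr, const int *col_idx)
{
    long long triangles = 0;
    for (int u = 0; u < n; u++)
        for (int p = row_ptr[u]; p < row_ptr[u + 1]; p++) {
            int v = col_idx[p];
            /* merge-intersect the out-neighbour lists of u and v */
            int i = row_ptr[u], j = row_ptr[v];
            while (i < row_ptr[u + 1] && j < row_ptr[v + 1]) {
                if (col_idx[i] < col_idx[j])      i++;
                else if (col_idx[i] > col_idx[j]) j++;
                else { triangles++; i++; j++; }   /* common neighbour closes a triangle */
            }
        }
    return triangles;
}
```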
ISBN (Print): 9783662523315
The field of Intelligent Systems and applications has expanded enormously during the last two decades. Theoretical and practical results in this area are growing rapidly due to many successful applications and new theories derived from many diverse problems. This book is dedicated to Intelligent Systems and applications in many different aspects; in particular, it provides highlights of current research in Intelligent Systems and applications. It consists of research papers on the following specific topics: Authentication, Identification, and Signature; Intrusion Detection; Steganography, Data Hiding, and Watermarking; Database, System, and Communication Security; Computer Vision, Object Tracking, and Pattern Recognition; Image Processing, Medical Image Processing, and Video Coding; Digital Content, Digital Life, and Human-Computer Interaction; Parallel, Peer-to-Peer, Distributed, and Cloud Computing; and Software Engineering and Programming Languages. This book provides a reference to theoretical problems as well as practical solutions and applications for the state-of-the-art results in Intelligent Systems and applications on the aforementioned topics. In particular, both the academic community (graduate students, post-doctoral researchers, and faculty) in Electrical Engineering, Computer Science, and Applied Mathematics, and the industrial community (engineers, engineering managers, programmers, research lab staff and managers, security managers) will find this book interesting.
ISBN (Print): 9781509021413
Graphics Processing Units (GPUs) have evolved to become high-performance processors for general-purpose data-parallel applications. Most GPU execution exploits a Single Instruction Multiple Data (SIMD) model. Typically, little attention is paid to whether the input data to the SIMD lanes are the same or different. We have observed that a significant number of SIMD instructions demonstrate scalar characteristics, i.e., they operate on the same data across their active lanes. Treating them as normal SIMD instructions results in redundant and inefficient GPU execution. To better serve both scalar and vector operations, we propose a novel scalar-vector GPU architecture. Our specialized scalar pipeline handles scalar instructions efficiently with only a single copy of the data, freeing the SIMD pipeline for normal vector execution. We propose a novel synchronization scheme to resolve data dependencies between scalar and vector instructions. With our optimized warp scheduling and instruction dispatching schemes, the scalar-vector GPU architecture achieves performance improvements of 19% on average on the Parboil and Rodinia benchmark suites. We also examine the effects of varying warp sizes on scalar-vector execution and explore subwarp execution for power efficiency. Our results show that, on average, power is reduced by 18%.
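The enabling observation is detecting when an instruction's active lanes all carry identical operand values. As a schematic illustration only (WARP_SIZE, the operand layout, and the active-mask encoding are assumptions, not details of the proposed hardware), such a uniformity check might look as follows.

```c
/* Schematic check for a "scalar" SIMD instruction: all active lanes
 * carry the same operand value, so one scalar issue would suffice. */
#include <stdbool.h>
#include <stdint.h>

#define WARP_SIZE 32

bool is_scalar_op(const uint32_t operand[WARP_SIZE], uint32_t active_mask)
{
    int first = -1;
    for (int lane = 0; lane < WARP_SIZE; lane++) {
        if (!(active_mask & (1u << lane)))
            continue;                          /* lane inactive: ignore it   */
        if (first < 0)
            first = lane;                      /* remember first active lane */
        else if (operand[lane] != operand[first])
            return false;                      /* data diverges: vector op   */
    }
    return true;                               /* uniform across active lanes */
}
```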
ISBN (Print): 9781467389471
Modern discrete GPUs have been the processors of choice for accelerating compute-intensive applications, but using them in large-scale data processing is extremely challenging. Unfortunately, they do not provide important I/O abstractions long established in the CPU context, such as memory mapped files, which shield programmers from the complexity of buffer and I/O device management. However, implementing these abstractions on GPUs poses a problem: the limited GPU virtual memory system provides no address space management and page fault handling mechanisms to GPU developers, and does not allow modifications to memory mappings for running GPU programs. We implement ActivePointers, a software address translation layer and paging system that introduces native support for page faults and virtual address space management to GPU programs, and enables the implementation of fully functional memory mapped files on commodity GPUs. Files mapped into GPU memory are accessed using active pointers, which behave like regular pointers but access the GPU page cache under the hood, and trigger page faults which are handled on the GPU. We design and evaluate a number of novel mechanisms, including a translation cache in hardware registers and translation aggregation for deadlock-free page fault handling of threads in a single warp. We extensively evaluate ActivePointers on commodity NVIDIA GPUs using microbenchmarks, and also implement a complex image processing application that constructs a photo collage from a subset of 10 million images stored in a 40GB file. The GPU implementation maps the entire file into GPU memory and accesses it via active pointers. The use of active pointers adds only up to 1% to the application's runtime, while enabling speedups of up to 3.9x over a combined CPU+GPU implementation and 2.6x over a 12-core CPU-only implementation which uses AVX vector instructions.
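To make the abstraction concrete, here is a toy, CPU-side analogue of an active pointer: a fat pointer that translates a file offset through a small software page cache and pulls the page in from the backing file on first touch. All names and the direct-mapped cache are invented for the illustration; the real system performs this translation inside GPU kernels, with a register-resident translation cache and warp-level fault aggregation.

```c
#include <unistd.h>

#define PAGE 4096

/* One slot of a tiny, direct-mapped software page cache. */
struct apage { long tag; int valid; char data[PAGE]; };

struct active_ptr {
    int fd;                /* backing file descriptor */
    struct apage *cache;   /* direct-mapped page cache */
    long nslots;           /* number of cache slots    */
};

/* Dereference: translate a file offset through the page cache,
 * "faulting" the page in from the file on a miss (errors elided). */
static char ap_read(struct active_ptr *ap, long off)
{
    long pageno = off / PAGE;
    struct apage *slot = &ap->cache[pageno % ap->nslots];
    if (!slot->valid || slot->tag != pageno) {     /* software page fault */
        pread(ap->fd, slot->data, PAGE, (off_t)(pageno * PAGE));
        slot->tag = pageno;
        slot->valid = 1;
    }
    return slot->data[off % PAGE];
}
```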
ISBN (Print): 9781509021413
This paper describes how we solved 12 previously unsolved mixed-integer programming (MIP) instances from the MIPLIB benchmark sets. To achieve these results we used an enhanced version of ParaSCIP, setting a new record for the largest-scale MIP computation: up to 80,000 cores in parallel on the Titan supercomputer. In this paper we describe the basic parallelization mechanism of ParaSCIP, improvements to its dynamic load balancing, and novel techniques to exploit the power of parallelization for MIP solving. We give a detailed overview of computing times and statistics for solving the open MIPLIB instances.
ISBN (Print): 9781509021413
Achieving reproducibility of scientific results in parallel computing is both a challenge and a source of active research. A significant contributor to non-reproducibility is the rounding error introduced into calculations by the non-associativity of floating-point addition. Scientific applications that rely on the accumulation of many small values, such as climate and N-body simulations, are susceptible to this type of error. This paper proposes a variant of an existing fixed-point method for real-number summation that yields sums with perfect precision, invariant to summation order and system architecture. The new method improves upon the existing technique by exhibiting improved performance for large numbers of summands, introducing tunable fractional precision to place precision where it is needed, and eliminating the aliasing problem of the original method. The proposed technique is described and its performance is demonstrated in the OpenMP, MPI, CUDA, and Xeon Phi parallel programming environments. In particular, the proposed method outperforms the previous state of the art for larger problems involving over one million summands at high precision. With the anticipated convergence of exascale high-performance computing and big data analytics on hybrid architectures, computational reproducibility will become an even more difficult problem than it is today. Use of numerical techniques such as the method proposed here can help to mitigate the impact of error and variation within simulations at these large scales.
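The core idea is that integer (fixed-point) addition is exact and associative, so the accumulated sum does not depend on the order in which contributions arrive. The sketch below illustrates only that idea on a CPU; it is not the paper's method, which covers the full floating-point range, tunable fractional precision, and the aliasing issue. FRAC_BITS and the 128-bit accumulator (a GCC/Clang extension) are assumptions for the illustration.

```c
/* Order-invariant summation via fixed-point accumulation (simplified). */
#include <stdio.h>

#define FRAC_BITS 40   /* fractional bits kept; assumes magnitudes stay modest */

/* Scale to fixed point; truncation happens per element, so the
 * accumulated integer sum is independent of summation order. */
static __int128 to_fixed(double x)
{
    return (__int128)(x * (double)(1LL << FRAC_BITS));
}

double fixed_point_sum(const double *x, long n)
{
    __int128 acc = 0;                  /* exact, associative accumulation */
    for (long i = 0; i < n; i++)
        acc += to_fixed(x[i]);
    return (double)acc / (double)(1LL << FRAC_BITS);
}

int main(void)
{
    double a[4] = { 1e-3, 2.5, -1.25, 3e-4 };
    printf("%.12f\n", fixed_point_sum(a, 4));
    return 0;
}
```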