ISBN (print): 9798400700156
We introduce Stream-K, a work-centric parallelization of matrix multiplication (GEMM) and related computations in dense linear algebra. Whereas contemporary decompositions are primarily tile-based, our method operates by partitioning an even share of the aggregate inner loop iterations among physical processing elements. This provides near-perfect utilization of computing resources, regardless of how efficiently the output tiling for any given problem quantizes across the underlying processing elements.
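To make the contrast with tile-based decompositions concrete, here is a minimal sketch of the work-centric split, with hypothetical names (not the paper's implementation): the aggregate inner-loop (MAC) iterations of all output tiles form one linear work space that is divided almost evenly among workers, so a worker's share may begin and end in the middle of a tile, and partially covered tiles need a cross-worker accumulation ("fixup") step.

```cpp
#include <cstdint>

// Illustrative Stream-K-style work split: all MAC-loop iterations of all
// output tiles are treated as one linear range and divided near-evenly
// among workers, independent of tile boundaries.
struct WorkRange { int64_t begin, end; };  // [begin, end) in global iteration space

WorkRange stream_k_share(int64_t num_tiles, int64_t iters_per_tile,
                         int worker, int num_workers) {
    int64_t total = num_tiles * iters_per_tile;
    int64_t begin = total * worker / num_workers;        // near-equal split
    int64_t end   = total * (worker + 1) / num_workers;
    return {begin, end};
}

// Map a global iteration back to (tile, offset within tile). A worker that
// covers only part of a tile produces a partial sum that must be combined
// with other workers' partials before that tile's output is final.
inline void locate(int64_t iter, int64_t iters_per_tile,
                   int64_t& tile, int64_t& offset) {
    tile   = iter / iters_per_tile;
    offset = iter % iters_per_tile;
}
```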
ISBN (print): 9798400700156
Bayesian networks (BNs) are attractive because they are graphical and interpretable machine learning models. However, exact inference on BNs is time-consuming, especially for complex problems. To improve efficiency, we propose a fast BN exact inference solution named Fast-BNI on multi-core CPUs. Fast-BNI enhances the efficiency of exact inference through hybrid parallelism that tightly integrates coarse- and fine-grained parallelism. We also propose techniques to further simplify the bottleneck operations of BN exact inference. The Fast-BNI source code is freely available at https://***/jjiantong/FastBN.
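The abstract does not spell out the parallel structure, so the following is only a generic C++/OpenMP sketch of what tightly integrated coarse- and fine-grained parallelism can look like (illustrative names, not Fast-BNI's actual code): a dynamic coarse loop over independent units of inference work, with a fine-grained loop inside each expensive per-unit operation.

```cpp
#include <omp.h>
#include <vector>

// Fine grain: parallelize the heavy computation inside one unit of work.
void process_unit(std::vector<double>& table) {
    const long long n = static_cast<long long>(table.size());
    #pragma omp parallel for
    for (long long i = 0; i < n; ++i)
        table[i] *= 2.0;  // stand-in for a factor-table update
}

// Coarse grain: independent units run concurrently; dynamic scheduling
// balances units of very different sizes.
void hybrid_inference(std::vector<std::vector<double>>& units) {
    omp_set_nested(1);  // allow the two levels to coexist
    const long long m = static_cast<long long>(units.size());
    #pragma omp parallel for schedule(dynamic)
    for (long long u = 0; u < m; ++u)
        process_unit(units[u]);
}
```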
ISBN (print): 9781450326568
Today, almost all computer architectures are parallel and heterogeneous: a combination of multiple CPUs, GPUs, and specialized processors. This creates a challenging problem for application developers who want to develop high-performance programs without the effort required to use low-level, architecture-specific parallel programming models (e.g., OpenMP for CMPs, CUDA for GPUs, MPI for clusters). Domain-specific languages (DSLs) are a promising solution to this problem because they provide an avenue for high-level, application-specific abstractions with implicit parallelism to be mapped directly to low-level, architecture-specific programming models, providing both high programmer productivity and high execution performance.

In this talk I will describe an approach to building high-performance DSLs, which is based on DSL embedding in a general-purpose programming language, metaprogramming, and a DSL infrastructure called Delite. I will describe how we transform DSL programs into efficient first-order low-level code using domain-specific optimization, parallelism and locality optimization with parallel patterns, and architecture-specific code generation. All optimizations and transformations are implemented in Delite: an extensible DSL compiler infrastructure that significantly reduces the effort required to develop new DSLs. Delite DSLs for machine learning, data querying, graph analysis, and scientific computing all achieve performance competitive with manually parallelized C++ code.
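Delite DSLs are embedded in Scala, so the following C++17 snippet is only a language-neutral analogy for the central idea: the programmer composes high-level parallel patterns, and the framework (here, the standard library's execution policies) decides how to map them to hardware.

```cpp
#include <execution>
#include <functional>
#include <numeric>
#include <vector>

// Analogy only: one fused map (x -> x*x) + reduce (sum) pattern. The
// runtime, not the programmer, chooses how to parallelize it.
double mean_square(const std::vector<double>& v) {
    double sum = std::transform_reduce(std::execution::par_unseq,
                                       v.begin(), v.end(), 0.0,
                                       std::plus<>{},
                                       [](double x) { return x * x; });
    return sum / static_cast<double>(v.size());
}
```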
ISBN (print): 9798400700156
CUDA, as one of the most popular choices for GPU programming, can be executed only on NVIDIA GPUs. To execute CUDA on non-NVIDIA devices, researchers have proposed translating CUDA to other programming languages. However, this approach cannot achieve high coverage due to the challenges in source-to-source translation.

We propose a framework, CuPBoP, that executes CUDA programs on non-NVIDIA devices without relying on other programming languages. CuPBoP consists of two parts. The compilation part applies transformations on CUDA host/kernel IRs. The runtime part consists of the runtime libraries for CUDA built-in functions. For the CPU backends, compared with the existing frameworks, CuPBoP achieves the highest coverage on all CPUs that we evaluate (x86, aarch64, RISC-V). We make CuPBoP publicly available to inspire more work in this area.
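A simplified view of how frameworks in this space run GPU kernels on CPUs (not CuPBoP's actual IR-level transformation, which must also handle barriers, shared memory, and built-ins): the kernel body is wrapped in loops over the grid and block dimensions.

```cpp
// CUDA form:
//   __global__ void add(const float* a, const float* b, float* c) {
//       int i = blockIdx.x * blockDim.x + threadIdx.x;
//       c[i] = a[i] + b[i];
//   }

// CPU form: the implicit thread grid becomes explicit loops, and the
// thread index is reconstructed from the loop variables.
void add_cpu(const float* a, const float* b, float* c,
             int gridDim_x, int blockDim_x) {
    for (int blockIdx_x = 0; blockIdx_x < gridDim_x; ++blockIdx_x)         // grid loop
        for (int threadIdx_x = 0; threadIdx_x < blockDim_x; ++threadIdx_x) {
            int i = blockIdx_x * blockDim_x + threadIdx_x;                 // rebuilt thread id
            c[i] = a[i] + b[i];
        }
}
```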
ISBN (print): 9781450362252
With the rapid change of computing architectures and the variety of programming models, the ability to develop performance-portable applications has become of great importance. This is particularly true in large production codes, where developing and maintaining hardware-specific versions is untenable.

To simplify the development of performance-portable code, we introduce RAJA, our C++ library that allows developers to write single-source applications that can target multiple hardware and programming-model back-ends. We provide a thorough introduction to all of RAJA's features and walk through some hands-on examples that will allow attendees to understand how RAJA might benefit their own applications. Attendees should bring a laptop computer to participate in the hands-on sessions.

This tutorial will introduce attendees to RAJA, a C++ library for developing performance-portable applications. Attendees will learn how to write performance-portable code that can execute on a range of programming models (OpenMP, CUDA, Intel TBB, and HCC) and hardware (CPU, GPU, Xeon Phi). Specifically, attendees will learn how to convert existing C++ applications to use RAJA, and how to use RAJA's programming abstractions to expose existing parallelism in their applications without complex algorithm rewrites. We will also cover specific guidelines for using RAJA in a large application, including some common "gotchas" and how to handle memory management. Finally, attendees will learn how to categorize loops to allow for simple and systematic performance tuning on any architecture.
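For readers unfamiliar with RAJA, a minimal single-source loop looks like the following (policy names as in the RAJA documentation); swapping the execution policy retargets the same loop body to another back-end.

```cpp
#include <RAJA/RAJA.hpp>

// Minimal RAJA loop: the execution policy template parameter selects
// the back-end; the loop body itself does not change.
void daxpy(double* y, const double* x, double a, int n) {
    RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, n),
        [=](int i) { y[i] += a * x[i]; });   // sequential back-end

    // Retarget by swapping the policy, e.g. RAJA::omp_parallel_for_exec
    // (OpenMP) or RAJA::cuda_exec<256> (CUDA, with a device lambda and
    // device-visible memory).
}
```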
ISBN (print): 9781595936028
The polyhedral model is a well-developed formalism that has been extensively used in a variety of contexts, viz. the automatic parallelization of loop programs, program verification, locality, hardware generation and, more recently, the automatic reduction of asymptotic program complexity. Such analyses and transformations rely on certain closure properties. However, the model is limited in expressivity, and the need for a more general class of programs is widely known.

We provide the extension to Z-polyhedra, which are the intersection of polyhedra and lattices. We prove the required closure properties using a novel representation and interpretation of Z-polyhedra. In addition, we also prove closure in the Z-polyhedral model under images by dependence functions, thereby proving that unions of LBLs (linearly bounded lattices), widely assumed to be a richer class of sets, are equal to unions of Z-polyhedra. Another corollary of this result is the equivalence of unions of Z-polyhedra and Presburger sets. Our representation and closure properties constitute the foundations of the Z-polyhedral model. As an example, we present the transformation for automatic reduction of complexity in the Z-polyhedral model.
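In the notation standard for this literature (paraphrased here, not quoted from the paper), the central object can be written as follows: a Z-polyhedron is the intersection of a polyhedron and an affine integer lattice.

```latex
\[
  \mathcal{P} = \{\, x \in \mathbb{Q}^n \mid A x \ge b \,\}
  \qquad \text{(polyhedron)}
\]
\[
  \mathcal{L} = \{\, V\kappa + v \mid \kappa \in \mathbb{Z}^m \,\}
  \qquad \text{(affine integer lattice)}
\]
\[
  \mathcal{Z} = \mathcal{P} \cap \mathcal{L}
  \qquad \text{(Z-polyhedron)}
\]
```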
ISBN (print): 9781450368186
Current state-of-the-art in GPU networking utilizes a host-centric, kernel-boundary communication model that reduces performance and increases code complexity. To address these concerns, recent works have explored performing network operations from within a GPU kernel itself. However, these approaches typically involve the CPU in the critical path, which leads to high latency and inefficient utilization of network and/or GPU resources.

In this work, we introduce GPU Initiated OpenSHMEM (GIO), a new intra-kernel PGAS programming model and runtime that enables GPUs to communicate directly with a NIC without the intervention of the CPU. We accomplish this by exploring the GPU's coarse-grained memory model and correcting semantic mismatches when GPUs wish to directly interact with the network. GIO also reduces latency by relying on a novel template-based design to minimize the overhead of initiating a network operation. We illustrate that for structured applications like a Jacobi 2D stencil, GIO can improve application performance by up to 40% compared to traditional kernel-boundary networking. Furthermore, we demonstrate that on irregular applications like Sparse Triangular Solve (SpTS), GIO provides up to 44% improvement compared to existing intra-kernel networking schemes.
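For context, here is the host-side OpenSHMEM idiom whose semantics GIO's intra-kernel model mirrors (the calls below are the standard host API; GIO's contribution is that one-sided operations with these semantics are issued from inside a running GPU kernel, keeping the CPU out of the critical path, and its device-side API is not reproduced here).

```cpp
#include <cstddef>
#include <shmem.h>

// Standard host-side OpenSHMEM one-sided put: write a halo region
// directly into a peer's symmetric buffer, then wait for completion.
void exchange_halo(double* remote_buf, const double* local_halo,
                   std::size_t n, int peer) {
    shmem_double_put(remote_buf, local_halo, n, peer);  // one-sided write to peer
    shmem_quiet();                                      // ensure remote completion
}
```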
ISBN (print): 9781450326568
This talk has two parts. The first part will discuss possible directions for computer architecture research, including architecture as infrastructure, energy first, impact of new technologies, and cross-layer opportunities. This part is based on a 2012 Computing Community Consortium (CCC) whitepaper effort led by Hill, as well as other recent National Academy and ISAT studies. See: http://***/ccc/docs/init/***.

The second part of the talk will discuss one or more examples of the cross-layer research advocated in the first part. For example, our analysis shows that many "big-memory" server workloads, such as databases, in-memory caches, and graph analytics, pay a high cost for page-based virtual memory: up to 50% of execution time wasted. Via small changes to the operating system (Linux) and hardware (x86-64 MMU), this work reduces the execution time these workloads waste to less than 0.5%. The key idea is to map part of a process's linear virtual address space with a new incarnation of segmentation, while providing compatibility by mapping the rest of the virtual address space with paging.
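A sketch of the translation split the second part describes (register names illustrative): virtual addresses inside one contiguous segment translate with base/limit/offset registers and never touch the page tables, while everything else falls back to conventional paging.

```cpp
#include <cstdint>

// Illustrative segment descriptor: [base, limit) in virtual address
// space; offset converts a hit to a physical address.
struct DirectSegment { uint64_t base, limit, offset; };

uint64_t page_walk(uint64_t va);  // conventional x86-64 page-table walk (defined elsewhere)

uint64_t translate(uint64_t va, const DirectSegment& ds) {
    if (va >= ds.base && va < ds.limit)
        return va + ds.offset;   // segment hit: add a constant, no TLB or walk
    return page_walk(va);        // segment miss: ordinary paging
}
```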
ISBN (print): 9781605583976
Assuming that the multicore revolution plays out the way the microprocessor industry expects, it seems that within a decade most programming will involve parallelism at some level. One needs to ask how this affects the way we teach computer science, or even how we have people think about computation. With regard to teaching, there seem to be three basic choices: (1) we only train a small number of experts in parallel computation who develop a collection of libraries, and everyone else just uses them; (2) we leave our core curriculum pretty much as is, but add some advanced courses on parallelism or perhaps tack on a few lectures at the end of existing courses; or (3) we start teaching parallelism from the start and embed it throughout the curriculum, with the idea of getting students to think about parallelism as the most natural form of computation and sequential computation as a special case.

This talk will examine some of the implications of the third option. It will argue that thinking about parallelism, when treated in an appropriate way, might be as easy as, or easier than, thinking sequentially. A key prerequisite, however, is to identify what the core ideas in parallelism are and how they might be layered and integrated with existing concepts. Another, more difficult, issue is how to cleanly integrate these ideas among courses. After all, much of the success of sequential computation follows from the concept of a random access machine and its ability to serve as a simple, albeit imperfect, interface between programming languages, algorithm analysis, and hardware design. The talk will go through an initial list of some core ideas in parallelism, and an approach to integrating these ideas between parallel algorithms, programming languages, and, to some extent, hardware. This requires, however, moving away from the concept of a machine model as an interface for thinking about computation.