To achieve high performance, modern HPC systems take advantage of heterogeneous GPU architectures. Often these GPUs are programmed using a vendor-preferred parallel programming model. Unfortunately, this often results...
ISBN: (Print) 9781479941155
This paper presents experience using a research-infused teaching approach in an undergraduate parallel programming course. The research-teaching nexus is applied at various levels: first through research-led teaching of core parallel programming concepts, as well as teaching the latest developments from the affiliated research group. The bulk of the course, however, focuses on student-driven research-based and research-tutored teaching approaches, in which students actively participate in group research projects, are fully immersed in the learning activities of their respective projects, and at the same time take part in discussions of wider parallel programming topics across the other groups. This close affiliation between the undergraduate course and the research group yields a wide range of benefits for all involved.
Taskflow aims to streamline the building of parallel and heterogeneous applications using a lightweight task-graph-based approach. Taskflow introduces an expressive task graph programming model to assist developers in implementing parallel and heterogeneous decomposition strategies on a heterogeneous computing platform. Our programming model distinguishes itself as a very general class of task graph parallelism with in-graph control flow, enabling end-to-end parallel optimization. To support our model with high performance, we design an efficient system runtime that solves many of the new scheduling challenges arising from our model and optimizes performance across latency, energy efficiency, and throughput. We have demonstrated the promising performance of Taskflow in real-world applications. As an example, Taskflow solves a large-scale machine learning workload up to 29% faster, with 1.5x lower memory usage and 1.9x higher throughput, than the industrial system oneTBB on a machine with 40 CPUs and 4 GPUs. We have open-sourced Taskflow and deployed it to a large number of users in the open-source community.
ISBN: (Print) 9798350311990
Experimental research in physics can be a costly and time-consuming venture, requiring simulation-based approaches to effectively narrow the scope of experiments to only the most promising cases. Our multidisciplinary research in this paper demonstrates how the simulation of light-driven nanoparticles can benefit substantially from GPU-based parallelism. We develop a novel ray-tracing strategy, implement it in C++/CUDA, and extend it with a parallel differential equation solver. Our implementation relies on a custom memory layout optimization to tackle the computational challenges in the field and provide accurate solutions in near real time. We evaluate our approach on a variety of popular GPU architectures, including advanced data-center GPUs such as the Nvidia V100, as well as consumer-grade hardware such as the Nvidia RTX 2080 Ti and Nvidia GTX 1080. Our GPU-based approach achieves a speedup of up to 20x compared to a parallel CPU-based prototype implementation.
Driven by the demands of deep learning, many hardware accelerators, including GPUs, have begun to include specialized tensor processing units to accelerate matrix operations. However, general-purpose GPU applications with few or no large dense matrix operations cannot benefit from these tensor units. This article proposes Tensorox, a framework that exploits the half-precision tensor cores available on recent GPUs for approximable, non-deep-learning applications. In essence, a shallow neural network is trained on the input-output mapping of the function to be approximated. The key innovation in our implementation is the use of the small, dimension-restricted tensor operations in Nvidia GPUs to run multiple instances of the approximation neural network in parallel. With the proper scaling and training methods, our approximation yielded an overall accuracy higher than naively running the original programs in half precision. Furthermore, Tensorox allows the degree of approximation to be adjusted at runtime. For the 10 benchmarks we tested, we achieved speedups from 2x to 112x compared to the original single-precision floating-point implementations, while keeping the error caused by the approximation below 10 percent in most applications.
ISBN: (Print) 9798400702532
Scientists, engineers, and researchers leverage high-performance computing (HPC) systems to perform complex computations and process large amounts of data. Designing, developing, and operating HPC systems have a steep learning curve, thus making it crucial to train a highly skilled and knowledgeable workforce in order to keep up with the rapidly evolving field, drive innovation, and meet the increasing demand for HPC across various sectors. Limited access to HPC educational resources is the main deterrent to training HPC talent. This paper addresses two primary culprits for the limited access: the high cost of production systems and the lack of realistic full-stack HPC training. Cutting-edge hardware is usually expensive and requires specialized facilities. Moreover, large HPC facilities typically discourage experimenting with the systems since they run production computation workloads and require minimal disturbance. Furthermore, HPC training often does not reflect the scale or complexity of production systems. This lack of realistic training support makes education in this area particularly difficult and ineffective. This paper proposes an educational framework for HPC that includes the development of a low-cost and flexible platform design for users in diverse fields. It allows study and experimentation with multiple realistic elements involved in a production HPC ecosystem. DEMAC, the Delaware Modular Assembly Cluster, is a set of 3D-printable frames designed to house embedded systems and auxiliary systems in a way that emulates HPC platforms. The teaching framework focuses on practical training as an education model in which learners reinforce theoretical knowledge with hands-on experience. If successful, this effort will contribute fundamentally to scientific research, technological advancements, HPC workforce development, and economic growth.
ISBN: (Print) 9798350332865
Sensor technology has impacted and transformed several commercial and military industries, from automotive to defense, to personal security applications. Adopting Computer Vision techniques such as Object Detection helps boost the system's capabilities in tracking and classification. A previous study looked at optimizing a Radar System that relied on the processing power of the Raspberry Pi 4. Since the Raspberry Pi 4 is not an ideal device for a real-time environment, this experiment aims to improve on that aspect by utilizing the processing power of the Jetson TX2 to boost the system's speed and reduce latency between the Object Detection and Velocity calculation. The experiment will be repeated using the Jetson hardware to mimic the same scenarios in common tracking and surveillance systems.
ISBN: (Print) 9781665476522
Hardware Transactional Memory (HTM) is a high-performance instantiation of the powerful programming abstraction of transactional memory, which simplifies the daunting, yet critically important, task of parallel programming. While many HTM implementations of varying complexity exist in the literature, commercially available HTMs impose rigid restrictions on transaction and system behavior, limiting their practical use. A key constraint is the limited size of supported transactions, implicitly capped by hardware buffering capacity. We identify the opportunity to expand the effective capacity of these limited hardware structures by being more selective about which memory accesses need to be tracked. We leverage compiler and virtual memory support to identify safe memory accesses, which can never cause a transaction abort, and pass them as safety hints to the underlying HTM. With minor extensions over a conventional HTM implementation, HinTM uses these hints to allocate transactional state tracking resources to unsafe accesses only, thus expanding the HTM's effective capacity and, conversely, reducing capacity aborts. We demonstrate that HinTM effectively augments the performance of a range of baseline HTM configurations. When coupled with a POWER8 HTM implementation, HinTM eliminates 64% of transactional capacity aborts, achieving a 1.4x average speedup, and up to 8.7x.
ISBN: (Print) 9783031407437; 9783031407444
As the supercomputing landscape diversifies, solutions such as Kokkos for writing vendor-agnostic applications and libraries have risen in popularity. Kokkos provides a programming model designed for performance portability, which allows developers to write a single-source implementation that can run efficiently on various architectures. At its heart, Kokkos maps parallel algorithms to architecture- and vendor-specific backends written in lower-level programming models such as CUDA and HIP. Another approach to writing vendor-agnostic parallel code is OpenMP's directive-based approach, which lets developers annotate code to express parallelism. It is implemented at the compiler level and is supported by all major high-performance computing vendors, as well as the primary open-source toolchains GNU and LLVM. Since its inception, Kokkos has used OpenMP to parallelize on CPU architectures. In this paper, we explore leveraging OpenMP for a GPU backend and discuss the challenges we encountered when mapping the Kokkos APIs and semantics to OpenMP target constructs. As an exemplar workload we chose a simple conjugate gradient solver for sparse matrices. We find that performance on NVIDIA and AMD GPUs varies widely based on details of the implementation strategy and the chosen compiler. Furthermore, the performance of the OpenMP implementations decreases with increasing complexity of the investigated algorithms.
ISBN: (Print) 9798400702969
We present a dependent type system for enforcing array-size consistency in an ML-style functional array language. Our goal is to enforce shape-consistency at compile time and allow nontrivial transformations on array shapes, without the complexity such features tend to introduce in dependently typed languages. Sizes can be arbitrary expressions and size equality is purely syntactical, which fits naturally within a scheme that interprets size-polymorphic functions as having implicit arguments. When non-syntactical equalities are needed, we provide dynamic checking. In contrast to other dependently typed languages, we automate the book-keeping involved in tracking existential sizes, such as when filtering arrays. We formalise a large subset of the presented type system and prove it sound. We also discuss how to adapt the type system for a real implementation, including type inference, within the Futhark programming language.