ISBN: 9798400702976 (print)
This paper explores how design patterns could be revisited in the era of mainstream functional programming languages. I discuss the kinds of knowledge that ought to be represented as functional design patterns: architectural concepts that are relatively self-contained, but whose entirety cannot be represented as a language-level abstraction. I present four concrete examples embodying this idea: the Witness, the State Machine, the Parallel Lists, and the Registry. Each pattern is implemented in Rust to demonstrate how careful use of a sophisticated type system can better model each domain construct and thereby catch user mistakes at compile time.
ISBN: 9798400707452 (print)
Approximate computing is a well-known method [7] to achieve higher performance or lower energy consumption while accepting a loss of output accuracy. Many applications, such as image processing and neural networks, are tolerant of a certain amount of error and have the potential for significant improvements in execution time and energy consumption. The most advanced software approximation techniques are mixed precision, which uses a lower-precision data representation for both integer and floating-point variables [1, 4]; perforation, which skips instruction blocks in a program, iterations in a loop, or data in buffers, assuming that nearby data have similar values [2, 5, 6, 8]; and relaxed synchronization, which removes synchronization points that represent one of the major bottlenecks in parallel applications [3, 9]. These approaches differ in both the performance achieved and the error produced. Usually, perforation and synchronization elision achieve higher performance than mixed precision but produce more errors. In particular, synchronization elision introduces non-deterministic errors that are complex to handle. Support for approximate computing is provided by the SYCL heterogeneous programming model, which is often used for developing portable HPC applications. SYCL supports approximate computing by providing a set of built-in functions and data types that can be used to perform approximate operations, such as half-floating-point reductions and bit-level operations. In this technical talk, we present SYprox, a SYCL-based API supporting a broad set of approximation techniques in modern C++. SYprox introduces a set of semantics that extend SYCL’s buffers and accessors to provide a high-level, easy-to-use programming API. It supports data perforation and elision patterns for efficient approximation, as well as signal reconstruction algorithms for error mitigation. Figure 1 (a) depicts the accurate execution of an application while Figure 1 (b) shows the
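The perforation idea described in this abstract (skip part of the data, assume neighbors are similar, then reconstruct) can be illustrated with a minimal sketch. It is written in plain Python rather than SYCL, and the function names `perforated_sum`, `perforate`, and `reconstruct` are hypothetical; they are not part of SYprox or SYCL. Nearest-neighbor reconstruction is assumed here only because it is the simplest option.

```python
def perforated_sum(values, stride):
    """Approximate sum(values) by reading only every `stride`-th element.

    The partial sum is rescaled by the ratio of total to sampled elements,
    which assumes skipped elements resemble their sampled neighbors.
    """
    if stride < 1:
        raise ValueError("stride must be at least 1")
    sampled = values[::stride]
    if not sampled:
        return 0.0
    return sum(sampled) * len(values) / len(sampled)


def perforate(values, stride):
    """Keep every `stride`-th element and drop the rest (data perforation)."""
    return values[::stride]


def reconstruct(samples, stride, length):
    """Rebuild a buffer of `length` elements from perforated samples.

    Each skipped position is filled with the closest preceding kept sample
    (nearest-neighbor reconstruction, an assumption of this sketch).
    """
    return [samples[i // stride] for i in range(length)]


if __name__ == "__main__":
    data = [float(i) for i in range(100)]
    exact = sum(data)
    approx = perforated_sum(data, 2)
    print("exact:", exact, "approx:", approx,
          "relative error:", abs(exact - approx) / exact)
```

With a stride of 2 the loop touches half of the data, and on this smooth input the relative error stays near 1%. On data where neighbors differ strongly, the error would be larger, which is the trade-off the abstract describes.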
ISBN: 9781450384414 (print)
Coherence-induced cache misses are an important factor limiting the scalability of shared-memory parallel programs. Many coherence misses are avoidable, namely misses due to false sharing, which occurs when different threads write to different memory addresses contained within the same cache block, causing unnecessary invalidations. Past work has proposed numerous ways to mitigate false sharing, ranging from coherence protocols optimized for certain sharing patterns to software tools for false-sharing detection and repair. Our work leverages approximate computing and store value similarity in error-tolerant multi-threaded applications. We introduce a novel cache coherence protocol which implements an approximate store instruction and coherence states to allow some limited incoherence within approximatable shared data, mitigating both coherence misses and coherence traffic for various sharing patterns. For applications from the Phoenix and AxBench suites, we see dynamic energy improvements within the NoC and memory hierarchy of up to 50.1% and speedups of up to 37.3%, with low output error, for approximate applications that exhibit false sharing.
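The definition of false sharing given in this abstract can be made concrete with a small offline check over a write trace. This is only an illustrative sketch and not the proposed coherence protocol: the trace format, the 64-byte block size, and the name `find_false_sharing` are assumptions made for the example.

```python
from collections import defaultdict

CACHE_BLOCK_BYTES = 64  # assumed cache block size; real hardware may differ


def find_false_sharing(writes, block_bytes=CACHE_BLOCK_BYTES):
    """Return the sorted cache block indices that exhibit false sharing.

    `writes` is an iterable of (thread_id, byte_address) pairs. A block is
    reported when two different threads write to two different addresses
    that map to that block. Writes by several threads to the same address
    are true sharing and are not reported on their own.
    """
    per_block = defaultdict(set)
    for thread_id, address in writes:
        per_block[address // block_bytes].add((thread_id, address))

    flagged = []
    for block, accesses in per_block.items():
        falsely_shared = any(
            t1 != t2 and a1 != a2
            for t1, a1 in accesses
            for t2, a2 in accesses
        )
        if falsely_shared:
            flagged.append(block)
    return sorted(flagged)


if __name__ == "__main__":
    trace = [
        (0, 0), (1, 8),      # two threads, different addresses, block 0
        (0, 128), (1, 128),  # two threads, same address: true sharing
        (2, 256),            # a single thread in its own block
    ]
    print(find_false_sharing(trace))
```

In this trace only block 0 is reported. Moving the second thread's counter from address 8 to address 64 places it in a separate block and removes the report, which is the usual padding-based software repair.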
Knowledge acquisition from graph-structured data is an important task in machine learning and data mining. TTSP (Two-Terminal Series Parallel) graphs are used as data models for electric networks and scheduling. We pr...
ISBN: 9780738143057 (print)
Parallel patterns, views, and spaces are promising abstractions to capture the programmer's intent as well as the contextual information that can be used by an underlying runtime to efficiently map software to parallel hardware. These abstractions can be valuable in cases where an algorithm must accommodate requirements of code and performance portability across hardware architectures and vendor programming models. Kokkos is a parallel programming model for host and accelerator architectures that relies on these abstractions and targets these requirements. It consists of a pure C++ interface, a specification, and a programming library. The programming library exposes patterns and types and maps them to an underlying abstract machine model. The abstract machine model offers a generic view of parallel hardware. While Kokkos is gaining popularity in large-scale HPC applications at some DOE laboratories, we believe that the implemented concepts are of interest to a broader audience, including academia, as they may contribute to a generic, vendor- and architecture-independent education in parallel programming. In this work, we give an insight into the design considerations of this programming model and list important abstractions. Further, we document best practices obtained from giving virtual classes on Kokkos and give pointers to resources that the reader may consider valuable for a lecture on generic parallel programming for students with preexisting knowledge of this matter.
ISBN: 9781450381857 (print)
The steady growth of data volume produced as continuous streams makes it paramount to develop software capable of providing timely results to users. The Actor Model (AM) offers a high level of abstraction suited for developing scalable message-passing applications. It allows the application developer to focus on the application logic, moving the burden of implementing fast and reliable inter-actor message exchange to the implementation framework. In this paper, we focus on evaluating the model in high-data-rate streaming applications targeting scale-up servers. Our approach leverages Parallel Pattern (PP) abstractions to model streaming computations and introduces optimizations that could otherwise be challenging to implement without violating the Actor Model's semantics. The experimental analysis demonstrates that the new implementation skeletons we propose for our PPs can bring significant performance boosts (more than 2x) in high-data-rate streaming applications implemented in CAF.
ISBN: 9780738110707 (print)
Performance analysis is critical for pinpointing bottlenecks in parallel applications. Several profilers exist to instrument parallel programs on HPC systems and gather performance data. Hatchet is an open-source Python library that can read profiling output of several tools, and enables the user to perform a variety of programmatic analyses on hierarchical performance profiles. In this paper, we augment Hatchet to support new features: a query language for representing call path patterns that can be used to filter a calling context tree, visualization support for displaying and interacting with performance profiles, and new operations for performing analyses on multiple datasets. Additionally, we present performance optimizations in Hatchet's HPCToolkit reader and the unify operation to enable scalable analysis of large datasets.
ISBN: 9780738110707 (print)
A common simplification made when modeling the performance of a parallel program is the assumption that the performance behavior of all processes or threads is largely uniform. Empirical performance-modeling tools such as Extra-P exploit this common pattern to make their modeling process more noise resilient, mitigating the effect of outliers by summarizing performance measurements of individual functions across all processes. While the underlying assumption does not hold equally for all applications, knowing the qualitative differences in how the performance of individual processes changes as execution parameters are varied can reveal important performance bottlenecks such as malicious patterns of load imbalance. A challenge for empirical modeling tools, however, arises from the fact that the behavioral class of a process may depend on the process configuration, letting process ranks migrate between classes as the number of processes grows. In this paper, we introduce a novel approach to the problem of modeling spatially diverging performance based on a certain type of process clustering. We apply our technique to identify a previously unknown performance bottleneck in the BoSSS fluid-dynamics code. Removing it made the code regions in question run up to 20 times faster and the application as a whole run up to 4.5 times faster.
ISBN: 9781665415576 (print)
Fast domain propagation of linear constraints has become a crucial component of today's best algorithms and solvers for mixed integer programming and pseudo-Boolean optimization to achieve peak solving performance. Irregularities in the form of dynamic algorithmic behaviour, dependency structures, and sparsity patterns in the input data make efficient implementations of domain propagation on GPUs and, more generally, on parallel architectures challenging. This is one of the main reasons why domain propagation in state-of-the-art solvers is single-threaded only. In this paper, we present a new algorithm for domain propagation which (a) avoids these problems and allows for an efficient implementation on GPUs, and (b) is capable of running propagation rounds entirely on the GPU, without any need for synchronization or communication with the CPU. We present extensive computational results which demonstrate the effectiveness of our approach and show that ample speedups are possible on practically relevant problems: on state-of-the-art GPUs, our geometric mean speed-up for reasonably large instances is around 10x to 20x and can be as high as 195x on favorably large instances.
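For readers unfamiliar with the operation being accelerated, the following is a minimal sequential sketch of textbook activity-based bound tightening for constraints of the form a·x ≤ b. It is not the GPU algorithm from the paper; the function name `propagate`, the dense coefficient lists, and the requirement that all bounds be finite are assumptions of this example.

```python
def propagate(constraints, lower, upper, max_rounds=100, eps=1e-9):
    """Tighten variable bounds using constraints of the form a . x <= rhs.

    `constraints` is a list of (coeffs, rhs) pairs with one coefficient per
    variable. All bounds are assumed finite. Returns new (lower, upper) lists.
    """
    lower, upper = list(lower), list(upper)
    for _ in range(max_rounds):
        changed = False
        for coeffs, rhs in constraints:
            # Smallest value each term a_j * x_j can take within the bounds.
            contrib = [
                0.0 if a == 0 else a * (lower[j] if a > 0 else upper[j])
                for j, a in enumerate(coeffs)
            ]
            min_activity = sum(contrib)
            for j, a in enumerate(coeffs):
                if a == 0:
                    continue
                # Room left for term j when all other terms are at minimum.
                slack = rhs - (min_activity - contrib[j])
                bound = slack / a
                if a > 0 and bound < upper[j] - eps:
                    upper[j] = bound  # a_j > 0 yields a tighter upper bound
                    changed = True
                elif a < 0 and bound > lower[j] + eps:
                    lower[j] = bound  # a_j < 0 yields a tighter lower bound
                    changed = True
        if not changed:
            break  # fixed point reached: no bound moved in this round
    return lower, upper


if __name__ == "__main__":
    # x + y <= 4 and x - y <= -2, with x in [0, 10] and y in [1, 10]
    cons = [([1, 1], 4), ([1, -1], -2)]
    print(propagate(cons, [0, 1], [10, 10]))
```

On this small system, one round tightens the domains to x in [0, 2] and y in [2, 4], and a second round confirms that nothing changes further. The data dependence between rounds and between constraints sharing variables is the kind of irregularity the abstract identifies as an obstacle to parallel implementations.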
ISBN: 9780738123547 (print)
FPGAs have found increasing adoption in data center applications since a new generation of high-level tools has become available which noticeably reduces development time for FPGA accelerators and still provides high-quality results. There is, however, no high-level benchmark suite available that specifically enables a comparison of FPGA architectures, programming tools, and libraries for HPC applications. To fill this gap, we have developed an OpenCL-based open-source implementation of the HPCC benchmark suite for Xilinx and Intel FPGAs. This benchmark can serve to analyze the current capabilities of FPGA devices, cards, and development tool flows, track progress over time, and point out specific difficulties for FPGA acceleration in the HPC domain. Additionally, the benchmark documents proven performance optimization patterns. We will continue optimizing and porting the benchmark for new generations of FPGAs and design tools, and we encourage active participation to create a valuable tool for the community.