ISBN:
(Print) 9798350396249
Reduced-precision floating-point (FP) arithmetic is being widely adopted to reduce memory footprint and execution time on battery-powered Internet of Things (IoT) end-nodes. However, reduced-precision computations must meet end-to-end precision constraints to be acceptable at the application level. This work introduces TransLib, an open-source kernel library based on transprecision computing principles, which provides knobs to exploit different FP data types (i.e., float, float16, and bfloat16), also considering the trade-off between homogeneous and mixed-precision solutions. We demonstrate the capabilities of the proposed library on PULP, a 32-bit microcontroller (MCU) coupled with a parallel, programmable accelerator. On average, TransLib kernels achieve an IPC of 0.94 and a speed-up of 1.64x using 16-bit vectorization. The parallel variants achieve speed-ups of 1.97x, 3.91x, and 7.59x on 2, 4, and 8 cores, respectively. The memory footprint reduction is between 25% and 50%. Finally, we show that the mixed-precision variants increase accuracy by 30x at the cost of 2.09x execution time and 1.35x memory footprint compared to the vectorized float16 variants.
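The homogeneous- vs mixed-precision trade-off described above can be illustrated with a minimal, hedged sketch in pure Python. This is not the TransLib API or the PULP toolchain: float64 stands in as the "wide" type and emulated float32 as the "narrow" one (in place of the float16/bfloat16 types the library targets), using only the standard library.

```python
import struct

def f32(x):
    """Round a Python float to the nearest IEEE-754 float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

xs = [1.0 / (i + 1) for i in range(100_000)]

ref = sum(xs)                                  # float64 reference
homogeneous = 0.0
for x in xs:
    # Homogeneous narrow precision: storage AND accumulation are narrow.
    homogeneous = f32(homogeneous + f32(x))
# "Mixed precision": narrow storage, wide (float64) accumulation.
mixed = sum(f32(x) for x in xs)

err_homo = abs(homogeneous - ref) / ref
err_mixed = abs(mixed - ref) / ref
assert err_mixed < err_homo  # wider accumulation recovers most of the accuracy
```

The same effect, amplified, is what makes a float16-storage/float32-accumulate kernel more accurate than an all-float16 one, at the cost of extra conversions and wider registers.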
Research on high-level parallel programming approaches systematically evaluates the performance of applications written using these approaches and informally argues that high-level parallel programming languages or libraries increase the productivity of programmers. In this paper we present a methodology for evaluating the trade-off between programming effort and performance of applications developed using different programming models. We apply this methodology to several implementations of a function solving the all nearest smaller values problem. The high-level implementation is based on a new version of the BSP homomorphism algorithmic skeleton.
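For readers unfamiliar with the case study, a hedged sketch of the standard sequential stack-based solution to the all nearest smaller values (ANSV) problem follows. This is not the paper's BSP-homomorphism version; it only shows what the kernel computes.

```python
def ansv_left(xs):
    """For each element, the nearest value to its left that is strictly
    smaller, or None if no such value exists. Runs in O(n) with a stack."""
    result, stack = [], []
    for x in xs:
        # Pop values >= x: they can never be the nearest smaller value
        # for x or for anything to the right of x.
        while stack and stack[-1] >= x:
            stack.pop()
        result.append(stack[-1] if stack else None)
        stack.append(x)
    return result

print(ansv_left([0, 8, 4, 12, 2, 10, 6, 14]))
# [None, 0, 0, 4, 0, 2, 2, 6]
```

The parallel (BSP) formulations of ANSV partition the input across processors and exchange unmatched stack boundaries, which is what makes the problem a good productivity-vs-performance benchmark.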
This paper presents an overview of the "Applied Parallel Computing" course taught to final-year Software Engineering undergraduate students in Spring 2014 at NUST, Pakistan. The main objective of the course was to introduce practical parallel programming tools and techniques for shared- and distributed-memory concurrent systems. A unique aspect of the course was that Java was used as the principal programming language. The course was divided into three sections. The first section covered parallel programming techniques for shared-memory systems, including multicore and Symmetric Multi-Processor (SMP) systems. In this section, the Java threads API was taught as a viable programming model for such systems. The second section was dedicated to parallel programming tools meant for distributed-memory systems, including clusters and networks of computers. We used MPJ Express -- a Java MPI library -- for conducting programming assignments and lab work for this section. The third and final section introduced advanced topics, including the MapReduce programming model using Hadoop and General-Purpose Computing on Graphics Processing Units (GPGPU).
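The MapReduce model covered in the third section can be sketched with the classic word-count example. This is a hedged, single-process illustration in plain Python rather than Hadoop's Java API: map emits (key, 1) pairs, the framework groups pairs by key, and reduce sums each group.

```python
from collections import defaultdict

def map_phase(doc):
    # Emit one (word, 1) pair per word occurrence.
    return [(word, 1) for word in doc.split()]

def shuffle(pairs):
    # Group all values by key, as the framework does between map and reduce.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_phase(groups):
    # Sum each key's group of counts.
    return {k: sum(vs) for k, vs in groups.items()}

docs = ["to be or not to be", "be quick"]
pairs = [p for d in docs for p in map_phase(d)]
counts = reduce_phase(shuffle(pairs))
print(counts["be"])  # 3
```

In Hadoop the map and reduce functions run on different cluster nodes and the shuffle happens over the network, but the dataflow is exactly this.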
ISBN:
(Print) 9781450326056
Chapel is a programming language being developed for high-performance applications. It is well suited for teaching parallelism in a wide variety of undergrad courses. Chapel is easy to learn since it supports a low-overhead style like a scripting language as well as a full OO style. It is concise, needing a single keyword to launch an asynchronous task, run a parallel loop, or perform a reduction. This helps undergrads focus on the main point of examples and lets them quickly try different parallel algorithms. It is also versatile, usable on both multicore systems and clusters. In this workshop, attendees will learn basics of Chapel, complete hands-on exercises, and see possible uses in algorithms, programming languages, and parallel programming courses. Laptop with SSH client required.
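The three patterns the tutorial highlights (an asynchronous task, a parallel loop, and a reduction, each a single keyword in Chapel: begin, forall, reduce) can be approximated with Python's standard library. This is a hedged analogy for readers without a Chapel installation, not Chapel syntax.

```python
from concurrent.futures import ThreadPoolExecutor

def square(x):
    return x * x

with ThreadPoolExecutor() as pool:
    async_task = pool.submit(square, 7)          # ~ Chapel: begin square(7);
    squares = list(pool.map(square, range(5)))   # ~ Chapel: forall i in 0..4
    total = sum(squares)                         # ~ Chapel: + reduce squares

print(async_task.result(), squares, total)  # 49 [0, 1, 4, 9, 16] 30
```

The pedagogical point of the abstract stands out in the comparison: what takes an executor, a future, and explicit collection here is one keyword each in Chapel, which is why students can try several parallel algorithms quickly.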
ISBN:
(Print) 9783981537000
The SystemC/TLM technologies are widely accepted in the industry for fast system-level simulation. An important limitation of SystemC regarding performance is that the reference implementation is sequential, and the official semantics makes parallel execution difficult. As the number of cores in computers increases quickly, the ability to take advantage of host parallelism during a simulation is becoming a major concern. Most existing work on parallelization of SystemC targets cycle-accurate simulation and would be inefficient on loosely timed systems, since it cannot run in parallel processes that do not execute simultaneously. We propose an approach that explicitly targets loosely timed systems and offers the user a set of primitives to express tasks with duration, as opposed to the notion of time in SystemC, which allows only instantaneous computations and time elapses without computation. Our tool exploits this notion of duration to run the simulation in parallel. It runs on top of any (unmodified) SystemC implementation, which lets legacy SystemC code continue running as-is. This allows the user to focus on the performance-critical parts of the program that need to be parallelized.
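A hedged toy model of the distinction the paper draws: a SystemC process alternates instantaneous computation with wait(t), so no two computations ever occupy overlapping simulated intervals; a task with duration instead declares that its work occupies [start, start + duration], and it is exactly the tasks with overlapping intervals that may run on parallel host threads. The tuples and names below are illustrative, not the tool's API.

```python
# (name, start_time, duration) -- tasks with duration, not instantaneous steps.
tasks = [("dma", 0, 40), ("cpu", 10, 30), ("uart", 50, 5)]

def overlaps(a, b):
    """True if the two tasks' simulated intervals intersect."""
    sa, da, sb, db = a[1], a[2], b[1], b[2]
    return sa < sb + db and sb < sa + da

# Pairs whose durations overlap are candidates for parallel host execution.
pairs = [(a[0], b[0]) for i, a in enumerate(tasks)
         for b in tasks[i + 1:] if overlaps(a, b)]
print(pairs)  # [('dma', 'cpu')]
```

With purely instantaneous computations (duration 0 everywhere), no intervals overlap and a sequential kernel loses nothing, which is why exposing durations is the key to host parallelism on loosely timed models.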
Utilizing parallel systems to their full potential can be challenging for general-purpose developers. A solution to this problem is to create high-level abstractions using Domain-Specific Languages (DSL). We create a ...
NAS Parallel Benchmarks (NPB) is a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available with different parallel...
ISBN:
(Print) 9788885741881
The design and development of the inter-process communication pattern called Pipeline is presented as a proposal of Parallel Object Composition to solve, in a simple way, problems that can be addressed with this parallel control structure. A class library called JPMI (Java Passing Message Interface) is used for parallel programming with message passing and to implement an original version of the well-known video game SIMON, with the objective, on the one hand, of showing the usefulness of this design within Structured Parallel Programming and, on the other hand, of showing that this proposal guarantees good performance in the execution of real-time applications, of which video games are a prime example. The parallel algorithm implemented as a Composition of Parallel Objects is based on the development and use of a methodology where the algorithmic design represents the parallel control structure common to a given algorithmic technique that can use the pipeline communication pattern, generating a generic and abstract parallel program from which programs that solve specific problems using the same communication pattern can be derived. The implementation of this proposal within Structured Parallel Programming aims to offer programmers who are new to parallelism reusable, generic, and uniform code abstract enough to be suitable for any problem that can be solved with a pipeline implemented on top of a parallel message-passing structure. This proposal, as particularized for the SIMON video game, is compared with an alternative using the Boost thread library and ZeroC Ice for remote invocation of distributed objects. The execution times and speedups of both proposals are compared to identify how similar or different their respective performances are, with training tests using AI modules with sequences of 500000 colors on a cluster of 2 Intel Xeon CPUs of 8 cores each and 2 nodes,
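The pipeline pattern at the heart of the proposal reduces to a chain of stages, each consuming from its input channel and producing to the next. A hedged sketch with Python threads and queues (standing in for JPMI message passing; the termination token and stage functions are illustrative):

```python
import threading
import queue

def stage(fn, q_in, q_out):
    """One pipeline stage: apply fn to each item, forward the result.
    None is the termination token and is propagated downstream."""
    while True:
        item = q_in.get()
        if item is None:
            q_out.put(None)
            break
        q_out.put(fn(item))

q0, q1, q2 = queue.Queue(), queue.Queue(), queue.Queue()
threads = [
    threading.Thread(target=stage, args=(lambda x: x + 1, q0, q1)),
    threading.Thread(target=stage, args=(lambda x: x * 2, q1, q2)),
]
for t in threads:
    t.start()
for item in [1, 2, 3, None]:   # feed the pipeline, then terminate it
    q0.put(item)
for t in threads:
    t.join()

results = []
while (item := q2.get()) is not None:
    results.append(item)
print(results)  # [4, 6, 8]
```

The genericity the paper argues for is visible here: the control structure (stage, channels, termination) is fixed, and only the per-stage functions change from one problem to the next.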
The wide adoption of SYCL as an open-standard API for accelerating C++ software in domains such as HPC, automotive, artificial intelligence, machine learning, and other areas necessitates efficient compiler and runtime support for a growing number of platforms. Existing SYCL implementations provide support for various devices such as CPUs, GPUs, DSPs, and FPGAs, typically via OpenCL or CUDA backends. While accelerators have increased the performance of user applications significantly, employing CPU devices for further performance improvement is beneficial due to the significant presence of CPUs in existing data centers. SYCL applications on CPUs currently go through an OpenCL backend. Though an OpenCL backend is valuable in supporting accelerators, it may introduce additional overhead for CPUs since the host and the device are the same. Overheads such as run-time compilation of the kernel, transfer of input/output memory to/from the OpenCL device, and invocation of the OpenCL kernel may not be necessary when running on the CPU. While some of these overheads (such as data transfer) can be avoided by modifying the application, doing so can compromise the SYCL application's ability to achieve performance portability on other devices. In this article, we propose an alternative approach to running SYCL applications on CPUs. We bypass OpenCL and use a CPU-directed compilation flow, along with the integration of whole-function vectorization, to generate optimized host and device code together in the same translation unit. We compare the performance of our approach, the CPU-directed compilation flow, with an OpenCL backend for existing SYCL-based applications, with no code modification, for the BabelStream benchmark, Matmul from the ComputeCpp SDK, N-body simulation benchmarks, and SYCL-BLAS (Aliaga et al., Proceedings of the 5th International Workshop on OpenCL, 2017), on CPUs from different vendors and architectures. We report a performance improvement of
OpenMP is the predominant standard for shared memory systems in high-performance computing (HPC), offering a tasking paradigm for parallelism. However, existing OpenMP implementations, like GCC and LLVM, face computat...
详细信息