The Controller model is a heterogeneous parallel programming model implemented as a library. It transparently manages the coordination, communication, and kernel-launching details on different heterogeneous computing devices. It exploits native or vendor-specific programming models and compilers, such as OpenMP, CUDA, or OpenCL, thus retaining the performance potential they offer. This work discusses the integration of FPGAs into the Controller model using high-level synthesis tools and OpenCL. A new Controller backend for FPGAs is presented, based on a previous OpenCL backend for GPUs. We discuss new configuration parameters for FPGA kernels and key ideas for adapting the original OpenCL backend while maintaining the portability of the original model. We present an experimental study comparing the performance and development-effort metrics obtained with the Controller model, Intel oneAPI, and reference codes programmed directly in OpenCL. The results show that using the Controller library has advantages and drawbacks compared with Intel oneAPI, while compared with OpenCL it greatly reduces programming effort with negligible performance overhead.
ISBN:
(Print) 9781665462020
With the advent of renewable energy, smart grids, and cutting-edge measurement technologies, modern power systems are becoming more complex. As a result, analyzing modern power systems requires more computational power. High-Performance Computing (HPC) is the most viable option for meeting this demand. In India's power sector, the use of HPC is minimal. Hence, we introduce HPC-based power flow analysis, the most widely used application for analyzing such systems. The paper demonstrates the importance of HPC for power flow analysis. It also discusses a modified Gaussian elimination method that exploits the sparse nature of the Jacobian matrix to speed up the computation. Open Multi-Processing (OpenMP) is used to implement the parallel computing. The parallel power flow analysis is simulated on the Central Processing Unit (CPU) and Graphics Processing Unit (GPU) nodes of C-DAC's PARAM Utkarsh supercomputer for various power system networks. The speedup obtained with HPC for the Polish 9241-bus network is 216.14 times over the sequential computation.
ISBN:
(Print) 9781665464956
Visual impairments are a global health issue with profound socioeconomic ramifications in both the developing and the developed world. Ongoing research projects aim to investigate the influence of light on the perception of low-vision individuals, but as of today there is neither clear knowledge nor extensive data regarding the influence of light in low-vision situations. This research addresses these issues by introducing a methodology and a system to simulate visual impairments. A pipeline based on eye anatomy, coupled with real-time image processing algorithms, dynamically simulates the low-vision-specific characteristics of selected impairments in mixed reality. A new approach based on massively parallelized processing, combined with efficient modeling of eye refractive errors, aims to improve the accuracy of the low-vision simulation.
ISBN:
(Digital) 9781728175683
ISBN:
(Print) 9781728175683
Parallel programming models (e.g., OpenMP) are increasingly used to improve the performance of real-time applications on modern processors. Nevertheless, these processors have complex architectures, making their timing behavior very difficult to understand. The main limitation of most existing works is that they apply static timing analysis to simpler models, or measurement-based analysis using traditional platforms (e.g., single core) or considering only sequential algorithms. How to provide an efficient configuration for allocating a parallel program to the computing units of the processor is still an open challenge. This paper studies the problem of performing timing analysis on complex multi-core platforms, presenting a methodology to understand an application's timing behavior and to guide the configuration of the platform. As an example, the paper uses an OpenMP-based program of the Heat benchmark on an NVIDIA Jetson AGX Xavier. The main objectives are to analyze the execution time of OpenMP tasks, specify the best configuration of OpenMP directives, identify critical tasks, and discuss the predictability of the system/application. A Linux perf-based measurement tool, which has been extended by our team, is applied to measure each task across multiple executions in terms of total CPU cycles, the number of cache accesses, and the number of cache misses at different cache levels, including L1, L2, and L3. The evaluation process uses the performance metrics measured by our tool to study the predictability of the system/application.
ISBN:
(Print) 9783031061561; 9783031061554
Communication is critical to the scalable and efficient performance of scientific simulations on extreme-scale computing systems. Part of the promise of task-based programming models is that they can naturally overlap communication with computation and exploit locality between tasks. Copy-based semantics using eager communication protocols easily enable such asynchrony by relieving the user of the responsibility for buffer management, both on the sender and on the receiver. However, these semantics increase memory allocations and copies, and in turn affect application memory footprint and performance, especially with large message buffers. In this work we describe how so-called "zero copy" messaging semantics can be supported in Converse, the message-driven parallel programming framework used by Charm++, by implementing support for user-owned buffer transfers in its lower-level runtime system, LRTS. These semantics work on user-provided buffers and do not semantically require copies by either the user or the runtime system. We motivate our work by reviewing the existing messaging model in Converse/Charm++, identify its semantic shortcomings, and define new LRTS and Converse APIs to support zero-copy communication based on RDMA capabilities. We demonstrate the utility of our new communication interfaces with benchmarks written in Converse. The result is up to a 91% improvement in message latency, along with improved memory usage. These advances will enable future work on user-facing APIs in Charm++.
ISBN:
(Print) 9781450392044
Programming languages offering functions on collections of values, such as map, reduce, scan, and filter, have been in use for over fifty years. Such collections have proven particularly useful in the context of parallelism because these functions are naturally parallel. However, if implemented naively they generate temporary intermediate collections that can significantly increase memory usage and runtime. To avoid this pitfall, many approaches use "fusion" to combine operations and avoid temporary results. However, most of these approaches involve significant changes to a compiler and are limited to a small set of functions, such as maps and reduces. In this paper we present a library-based approach that fuses widely used operations such as scans, filters, and flattens. In conjunction with existing techniques, this covers most of the common operations on collections. Our approach is based on a novel technique that parallelizes over blocks, with streams within each block. We demonstrate the approach by implementing libraries targeting multicore parallelism in two languages: Parallel ML and C++, which have very different semantics and compilers. To help users understand when to use the approach, we define a cost semantics that indicates when fusion occurs and how it reduces memory allocations. We present experimental results for a dozen benchmarks that demonstrate significant reductions in both time and space. In most cases the approach generates code that is near optimal for the machines it runs on.
Parallel programming skills may require a long time to acquire. "Thinking in parallel" is a skill that requires time, effort, and experience. In this work, we propose to facilitate students' learning of parallel programming by using instant messaging. Our aim was to find out whether students' interaction through instant messaging tools benefits the learning process. To do so, we asked several students of an HPC course in the Master's degree in Computer Science at the University of Leon to develop a specific parallel application, each of them using a different application programming interface: OpenMP, MPI, CUDA, or OpenCL. Even though these APIs are different, there are common points in the design process. We encouraged students to interact with each other using Gitter, an instant messaging tool for GitHub users. Our analysis of the communications and results demonstrates that the direct interaction of students through the Gitter tool has a positive impact on the learning process.
The multiple signal classification algorithm (MUSICAL) is a statistical super-resolution technique for wide-field fluorescence microscopy. Although MUSICAL has several advantages, such as its high resolution, its low computational performance has limited its adoption. This paper analyzes the performance and scalability of MUSICAL with the aim of improving its computational performance. We first optimize MUSICAL for performance analysis using the latest high-performance computing libraries and parallel programming techniques. We then provide insights into MUSICAL's performance bottlenecks. Based on these insights, we develop a new parallel MUSICAL in C++ using Intel Threading Building Blocks and the Intel Math Kernel Library. Our experimental results show that the new parallel MUSICAL achieves a speedup of up to 30.36x on a commodity machine with 32 cores, with an efficiency of 94.88%. The results also show that it outperforms the previous Matlab, Java, and Python versions of MUSICAL by 30.43x, 2.63x, and 1.69x, respectively, on commodity machines.
Descriptive complexity provides intrinsic, i.e. machine-independent, characterizations of the main complexity classes. On the other hand, logic can be useful for designing programs in a natural declarative way. This is especially important for parallel computation models such as cellular automata, since designing parallel programs is considered a difficult task. This paper establishes three logical characterizations of the three classical complexity classes modeling minimal time, called real-time, of one-dimensional cellular automata according to their canonical variations: unidirectional or bidirectional communication, and input word given in a parallel or sequential way. Our three logics are natural restrictions of existential second-order Horn logic with built-in successor and predecessor functions. These logics correspond exactly to the three ways of deciding a language on a square grid circuit of side n according to one of the three natural locations of an input word of length n: along a side of the grid, on the diagonal that contains the output cell (placed on the vertex (n,n) of the square grid), or on the diagonal opposite to the output cell. The key ingredient of our results is a normalization method that transforms a formula from one of our three logics into an equivalent normalized formula that closely mimics a grid circuit. Then, we extend our logics by allowing a limited use of negation on hypotheses, as in Stratified Datalog. By revisiting in detail a number of representative classical problems (recognition of the set of primes by Fisher's algorithm, Dyck language recognition, the Firing Squad Synchronization problem, etc.), we show that this extension makes programming easier, and we prove that it does not change the real-time complexity of our logics. Finally, based on our experience in expressing these representative problems in logic, we argue that our logics are high-level programming languages: they make it possible to express in a natural, c...
ISBN:
(Print) 9781479956180
This paper describes a compiler extension to our prototype extensible C translator that adds new features for parallel execution of matrix operations and shows their application to problems in spatio-temporal data mining. The extension provides new language features for constructing new matrices, mapping functions over elements of a matrix, and accumulating operations that, for example, can sum values in a matrix. It also provides the appropriate semantic analysis to check for errors before translating the constructs down to parallel C code. The extension also provides features that let the programmer indicate how the extension translates these matrix constructs down to C code. Programmers seeking higher levels of performance can specify how the underlying for-loops are structured so that code using, for example, loop-tiling techniques or vector processors, is generated. In general, compiler extensions supported by our approach allow new domain-specific syntax and semantic analyses to be easily added to the host language. Specifications of the host C language and the extensions are composed to create a custom translator that maps extended C programs down to plain (parallel) C code, checking for domain-specific errors and applying high-level domain-specific optimizations in the process.