ISBN: 9781538683842 (print)
This paper proposes OMP-WHIP, a profiler that measures inherent parallelism in the program for a given input and provides what-if analyses to estimate improvements in parallelism. We propose a novel OpenMP series-parallel graph representation (OSPG) that precisely captures series-parallel relations induced by various directives between different fragments of dynamic execution. OMP-WHIP constructs the OSPG and measures the computation performed by each dynamic fragment using hardware performance counters. This series-parallel representation along with measurement of computation is a performance model of the program for a given input, which enables computation of inherent parallelism. This novel performance model also enables what-if analyses where a programmer can estimate improvements in parallelism when bottlenecks are addressed. We have used OMP-WHIP to identify parallelism bottlenecks in more than forty applications and then designed strategies to improve the speedup in seven applications.
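The abstract's idea of a series-parallel performance model can be illustrated with the standard work/span calculation. The following is a minimal sketch, not OMP-WHIP's actual OSPG: the `Node` class, its three node kinds, and the cost numbers are all hypothetical, and the costs stand in for what the paper measures with hardware performance counters.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical series-parallel tree node (not the paper's OSPG format)."""
    kind: str                      # "work", "series", or "parallel"
    cost: float = 0.0              # measured computation; used by "work" leaves only
    children: List["Node"] = field(default_factory=list)


def work(node: Node) -> float:
    """Total computation: the sum over all leaves."""
    if node.kind == "work":
        return node.cost
    return sum(work(child) for child in node.children)


def span(node: Node) -> float:
    """Critical path: children add up in series, the longest one wins in parallel."""
    if node.kind == "work":
        return node.cost
    child_spans = [span(child) for child in node.children]
    if node.kind == "series":
        return sum(child_spans)
    return max(child_spans, default=0.0)


def parallelism(node: Node) -> float:
    """Inherent parallelism = work / span (0.0 for an empty tree)."""
    s = span(node)
    return work(node) / s if s > 0 else 0.0


def leaf(cost: float) -> Node:
    return Node("work", cost)


# Serial setup, a parallel region with one oversized fragment, serial teardown.
baseline = Node("series", children=[
    leaf(10),
    Node("parallel", children=[leaf(40), leaf(40), leaf(80)]),
    leaf(10),
])

# What-if: assume the 80-unit fragment could be split into two 40-unit fragments.
what_if = Node("series", children=[
    leaf(10),
    Node("parallel", children=[leaf(40), leaf(40), leaf(40), leaf(40)]),
    leaf(10),
])

print(parallelism(baseline))  # 180 / 100
print(parallelism(what_if))   # 180 / 60
```

In this toy model a what-if analysis is simply a re-evaluation of the same tree with a hypothetical change applied: total work stays at 180 units, while the critical path drops from 100 to 60.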
ISBN: 9781450364447 (print)
Automatic parallelization of sequential code has become increasingly relevant in multicore programming. In particular, loop parallelization continues to be a promising optimization technique for scientific applications, and can provide considerable speedups for program execution. Furthermore, if we can verify that there are no true data dependencies between loop iterations, they can be easily parallelized. This paper describes Clava AutoPar, a library for the Clava weaver that performs automatic and symbolic parallelization of C code. The library is composed of two main parts: parallel loop detection and source-to-source code parallelization. The system is entirely automatic and attempts to statically detect parallel loops for a given input program, without any user intervention or profiling information. We obtained a geometric mean speedup of 1.5 for a set of programs from the C version of the NAS benchmark, and experimental results suggest that the performance obtained with Clava AutoPar is comparable to or better than that of other similar research and commercial tools.
ISBN: 9781538643686 (print)
The sophisticated nature of parallel computing concepts makes parallel programming challenging. This has encouraged higher-level frameworks that conceal much of the complexity behind abstraction layers. Paradigms in this category are mostly performance-centric and do not give the same attention to the robustness of asynchronous executions, even though current applications demand consistency in addition to fast performance. Therefore, programming environments that offer high-level support for asynchronous exception handling are more likely to gain popularity. This paper discusses our latest enhancements to @PT, a parallel programming environment that is based on Java annotations. The proposed concept promotes the robustness of parallelized programs by adhering to the familiar exception handling standards of sequential code and by reducing asynchronous execution concerns at the API level. This study suggests that the concept simplifies efficient management of asynchronous exceptions, which appears to be a challenge in parallel programming.
ISBN: 9781538683842 (print)
In this paper, we present a fully-dynamic graph data structure for the Graphics Processing Unit (GPU). It delivers high update rates while keeping a low memory footprint using autonomous memory management directly on the GPU. The data structure is fully-dynamic, allowing not only for edge but also vertex updates. Performing the memory management on the GPU allows for fast initialization times and efficient update procedures without additional intervention or reallocation procedures from the host. Our optimized approach performs initialization completely in parallel, up to 300x faster than previous work. It achieves up to 200 million edge updates per second for sorted and unsorted update batches, up to 30x faster than previous work. Furthermore, it can perform more than 300 million adjacency queries and millions of vertex updates per second. Thanks to efficient memory management techniques such as a queuing approach, currently unused memory is reused later on by the framework, permitting the storage of tens of millions of vertices and hundreds of millions of edges in GPU memory. We evaluate algorithmic performance using a PageRank and a Static Triangle Counting (STC) implementation, demonstrating the suitability of the framework even for memory-access-intensive algorithms.
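The queue-based reuse of freed memory mentioned in the abstract can be sketched on the CPU as follows. This is an illustration of the general idea only, assuming fixed-size pages of adjacency data; `PagePool`, `DynamicGraph`, and their methods are hypothetical names and do not reflect the paper's GPU implementation.

```python
from collections import deque


class PagePool:
    """Hypothetical fixed-size page allocator that recycles freed pages via a queue."""

    def __init__(self, num_pages: int):
        self.num_pages = num_pages
        self.next_fresh = 0          # index of the next never-used page
        self.free_queue = deque()    # indices of pages returned by deletions

    def allocate(self) -> int:
        # Prefer a recycled page so the footprint grows only when necessary.
        if self.free_queue:
            return self.free_queue.popleft()
        if self.next_fresh >= self.num_pages:
            raise MemoryError("page pool exhausted")
        page = self.next_fresh
        self.next_fresh += 1
        return page

    def release(self, page: int) -> None:
        self.free_queue.append(page)


class DynamicGraph:
    """Adjacency lists stored as chains of fixed-size pages drawn from one pool."""

    def __init__(self, num_pages: int, page_size: int = 4):
        self.pool = PagePool(num_pages)
        self.page_size = page_size
        self.pages = {}        # page index -> list of neighbour ids
        self.adjacency = {}    # vertex id -> list of page indices

    def add_edge(self, src: int, dst: int) -> None:
        chain = self.adjacency.setdefault(src, [])
        # Start a new page when the vertex has none or its last page is full.
        if not chain or len(self.pages[chain[-1]]) == self.page_size:
            page = self.pool.allocate()
            self.pages[page] = []
            chain.append(page)
        self.pages[chain[-1]].append(dst)

    def remove_vertex(self, vertex: int) -> None:
        # Returns the vertex's pages to the queue. Simplification: edges that
        # point at this vertex from other vertices are not cleaned up here.
        for page in self.adjacency.pop(vertex, []):
            del self.pages[page]
            self.pool.release(page)

    def neighbours(self, vertex: int) -> list:
        return [n for page in self.adjacency.get(vertex, []) for n in self.pages[page]]


graph = DynamicGraph(num_pages=2, page_size=2)
for dst in (1, 2, 3):
    graph.add_edge(0, dst)       # vertex 0 now occupies both pages in the pool
graph.remove_vertex(0)           # both pages go back onto the free queue
graph.add_edge(5, 6)             # succeeds by reusing a freed page
```

Without the free queue, the last `add_edge` call would fail because the pool of two pages had already been handed out once.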
ISBN: 9781728147871 (digital); 9781728147888 (print)
The importance of concurrent and distributed programming is increasing in Computer Science curricula. This exploratory research identifies additional notions required by the official topics of the "Parallel and Concurrent Programming" course taught at the University of Costa Rica. This paper characterizes the previous knowledge that students had of these notions and the extracurricular effort they made to compensate for the notions they lacked. Findings show that students were able to overcome these gaps at the expense of more extracurricular effort. Exploratory evidence indicates that students' choice of professors in previous courses influenced their performance and extracurricular effort in the parallel programming course.
ISBN: 9783319975894; 9783319975887 (print)
Obtaining depth information by stereo matching is one of the key steps in 3D reconstruction. Many practical applications place high requirements on both processing speed and the accuracy of the results. Many algorithms, such as SGM, achieve good accuracy. However, their processing speed often does not meet real-time requirements. In this paper, we improve the traditional stereo matching method SGM so that it meets real-time requirements. We implement our improved algorithm on the TX2 with parallel programming. In the experiments, our algorithm reaches 21 fps on video captured by a ZED camera at 640 × 360 pixels, with 32 disparity levels and 4 path directions for the traditional SGM method. To measure the accuracy of our method, we examine our algorithm separately on the Middlebury dataset as an indoor scene and on video captured by the ZED camera as an outdoor scene. The results show that we achieve a good balance between speed and accuracy.
ISBN: 9783030050542; 9783030050535 (print)
JavaScript is widely used for client-side scripting. *** is a JavaScript runtime environment that allows JavaScript to be used for building scalable network applications on the server side. However, *** does not support parallel programming, making it difficult to improve application performance. Meanwhile, persistent memory (PM) shows promise for server-side applications, yet little research exists on enabling scripting languages to support PM-based parallel programming. In this paper, we introduce SPMP, JavaScript support for shared persistent memory on ***. With SPMP, each process holds a PersistentArrayBuffer, an object that is responsible for allocating, managing, and accessing persistent memory. Multiple processes can then share persistent memory and communicate with each other through their PersistentArrayBuffer objects. Furthermore, SPMP supports dynamic load-balancing strategies, ensures data coherency, and supports data persistence in secondary storage. We have evaluated SPMP against Extended Memory Semantics (EMS, a state-of-the-art model for parallel programming on ***) on two data-intensive tasks. The results show that SPMP is 100x to 300x faster than EMS on five basic operations, and 2x faster on complicated parallel computing tasks such as counting words, owing to its particular approach to memory allocation and mapping.
ISBN: 9783319920405; 9783319920399 (print)
As chip multi-processors (CMPs) are becoming more and more complex, software solutions such as parallel programming models are attracting a lot of attention. Task-based parallel programming models offer an appealing approach to utilizing complex CMPs. However, the increasing number of cores on modern CMPs is pushing research towards the use of fine-grained parallelism. Task-based programming models need to be able to handle such workloads and offer performance and scalability. Using specialized hardware to boost the performance of task-based programming models is a common practice in the research community. Our paper makes the observation that task creation becomes a bottleneck when we execute fine-grained parallel applications with many task-based programming models. As the number of cores increases, the time spent generating the tasks of the application becomes more critical to the entire execution. To overcome this issue, we propose TaskGenX. TaskGenX offers a solution for minimizing task creation overheads and relies on both the runtime system and dedicated hardware. On the runtime system side, TaskGenX decouples task creation from the other runtime activities. It then transfers this part of the runtime to specialized hardware. We draw up the requirements for this hardware in order to boost the execution of highly parallel applications. From our evaluation using 11 parallel workloads on both symmetric and asymmetric multicore systems, we obtain performance improvements of up to 15x, averaging 3.1x over the baseline.
ISBN: 9781538672594 (print)
Weather radar is a system that utilizes advanced radio wave engineering to detect precipitation in the atmosphere. One of the wave generation techniques used in weather radar is frequency-modulated continuous wave (FMCW), with dual polarization for differentiating detected precipitation types by shape and size. Weather radar signal processing is usually performed using digital signal processors and field-programmable gate arrays (FPGAs), which perform well but make system development and deployment difficult. A software implementation of weather radar signal processing enables easier and faster development and deployment, at the cost of performance when done serially. A parallel implementation using general-purpose graphics processing units (GP-GPUs) may provide the best of both worlds, with easier development and deployment than hardware-based solutions and better performance than serial CPU implementations. In this paper, the implementation of various optimization strategies for weather radar signal processing in a GP-GPU environment on the Nvidia CUDA platform is shown. Performance measurements show that, among the optimization strategies implemented, only the use of multiple CUDA streams gives a significant performance gain. This paper contributes to attempts to build a full weather radar signal processing stack on the GPU.
ISBN: 9783800749577 (print)
In fault tolerant systems, applications are replicated and executed to enable error detection and recovery. If one replica application fails, another is able to take its place and provide the correct results. This concept can benefit from parallel execution on separate execution units. The rise of multicore platforms supports the development of parallel software, by providing the adequate hardware. However, this raises challenges regarding the synchronization of the redundant strings of execution. Replica determinism means that given the same input, identical programs provide the same output. To ensure replica determinism, requirements regarding the synchronization can be split in two domains: data and time. This paper examines the state of the art of synchronization techniques for parallel replicated execution in the context of fault tolerant systems. We analyze the requirements regarding synchronization within the time and data domain and compare different concepts of hardware (multicore, multiprocessor and multi-PCB) and software (processes, threads).
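The replica-determinism idea in the abstract can be illustrated with a small sketch, assuming three thread-based replicas and a majority vote over their outputs. The function names `run_replicas`, `vote`, `healthy`, and `faulty` are hypothetical, and the scheme shown (a join point followed by voting) is a textbook arrangement rather than any specific technique surveyed in the paper.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def run_replicas(replicas, value):
    """Run every replica on the same input in parallel and collect all outputs.

    Waiting for every future acts as a synchronization point in the time
    domain: no output is compared before all replicas have finished.
    """
    with ThreadPoolExecutor(max_workers=len(replicas)) as pool:
        futures = [pool.submit(replica, value) for replica in replicas]
        return [future.result() for future in futures]


def vote(outputs):
    """Data-domain check: return the strict-majority output or raise.

    Outputs must be hashable so that identical results can be counted.
    """
    winner, count = Counter(outputs).most_common(1)[0]
    if count <= len(outputs) // 2:
        raise RuntimeError("no majority among replica outputs")
    return winner


def healthy(x):
    return x * x


def faulty(x):
    # Simulated fault: deviates from the deterministic result.
    return x * x + 1


result = vote(run_replicas([healthy, healthy, faulty], 7))
print(result)
```

With three replicas, one faulty output is masked by the other two. With only two replicas, a mismatch can be detected but not resolved, which `vote` reports by raising an error.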