ISBN (print): 9781479985487
This paper presents an experimental account of how OpenMP can be used to obtain high performance from the quicksort algorithm by parallelizing key sections of its code. When this work was in progress, I was unsure whether the time savings would be noticeable, but the results exceeded my expectations. The only problem I faced was the unpredictability of the parallel version's execution time in some cases; even then, the gain over the sequential version remained clear.
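The abstract does not include the author's code; the following is a minimal sketch of one common way to parallelize quicksort with OpenMP tasks (the function names and the sequential-fallback cutoff are illustrative, not taken from the paper).

#include <algorithm>
#include <utility>
#include <omp.h>

// Illustrative task-based parallel quicksort; the cutoff below which we
// fall back to sequential sorting is an arbitrary tuning choice.
static void quicksort_tasks(int* a, int lo, int hi) {
    if (lo >= hi) return;
    if (hi - lo < 10000) {                  // small range: sort sequentially
        std::sort(a + lo, a + hi + 1);
        return;
    }
    int pivot = a[(lo + hi) / 2];
    int i = lo, j = hi;
    while (i <= j) {                        // Hoare-style partition
        while (a[i] < pivot) ++i;
        while (a[j] > pivot) --j;
        if (i <= j) std::swap(a[i++], a[j--]);
    }
    #pragma omp task shared(a)              // recurse on each half as a task
    quicksort_tasks(a, lo, j);
    #pragma omp task shared(a)
    quicksort_tasks(a, i, hi);
    #pragma omp taskwait
}

void parallel_quicksort(int* a, int n) {
    #pragma omp parallel                    // create the thread team once
    #pragma omp single nowait               // one thread seeds the task tree
    quicksort_tasks(a, 0, n - 1);
}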
ISBN (print): 9781450335881
Pipe-while loops have been proposed as a language construct for expressing pipeline parallelism in task-parallel languages. However, this loop construct has only been prototyped in research systems that lack compiler support. We demonstrate how to extend Intel® Cilk™ Plus, a production-quality task-parallel language, to implement pipe-while loops. We propose an extension to the compiler-runtime application binary interface (ABI) of Cilk Plus to support pipe-while loops. This extension maintains compatibility with existing language constructs and existing Cilk Plus binaries. We validate this ABI by prototyping the required runtime modifications and simulating the required front-end compiler transformations using preprocessor macros and C++ lambda functions.
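The paper's pipe-while syntax and the modified Cilk Plus runtime are not shown in the abstract. As a rough, hypothetical illustration of the pipeline pattern itself, written in plain C++ with lambdas and std::async rather than Cilk Plus:

#include <future>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical three-stage pipeline over a list of work items:
//   stage 1 (serial, in order): produce an item
//   stage 2 (parallel):         transform the item
//   stage 3 (serial, in order): consume the result
// This only emulates the pattern of a pipeline loop; it uses no Cilk Plus
// keywords and none of the paper's runtime or ABI machinery.
int main() {
    std::vector<std::string> inputs = {"a", "bb", "ccc", "dddd"};
    std::vector<std::future<std::size_t>> in_flight;

    for (const auto& item : inputs) {                  // stage 1: serial
        in_flight.push_back(std::async(std::launch::async,
            [item] { return item.size() * 1000; }));   // stage 2: parallel
    }
    for (auto& f : in_flight) {                        // stage 3: serial,
        std::cout << f.get() << "\n";                  // in issue order
    }
}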
ISBN (print): 9781479983544
Big data has brought new challenges to Top-k queries in terms of data partitioning and the parallel programming model. To overcome these problems, a new MapReduce-based Top-k query algorithm for big data is proposed. Building on the features of MapReduce, this paper presents an in-depth study of Top-k queries on big data from the perspectives of data partitioning, data reduction, and related aspects. Theoretical and experimental results show that the proposed algorithm yields a significant improvement in efficiency.
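The algorithm's details are not given in the abstract; the general idea of computing a partition-local Top-k in the map phase and merging candidates in the reduce phase can be sketched in ordinary C++ (the data, the partitioning, and k are placeholders, and no Hadoop/MapReduce API is used):

#include <algorithm>
#include <functional>
#include <queue>
#include <vector>

// "Map" side: each partition keeps only its own k largest values in a
// min-heap, so only k candidates per partition reach the reduce phase.
std::vector<int> local_top_k(const std::vector<int>& partition, std::size_t k) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> heap;
    for (int v : partition) {
        heap.push(v);
        if (heap.size() > k) heap.pop();   // evict the current minimum
    }
    std::vector<int> out;
    while (!heap.empty()) { out.push_back(heap.top()); heap.pop(); }
    return out;
}

// "Reduce" side: merge the per-partition candidates and keep the global k.
std::vector<int> global_top_k(const std::vector<std::vector<int>>& candidates,
                              std::size_t k) {
    std::vector<int> all;
    for (const auto& c : candidates) all.insert(all.end(), c.begin(), c.end());
    std::sort(all.begin(), all.end(), std::greater<int>());
    if (all.size() > k) all.resize(k);
    return all;
}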
Developing complex computation-intensive and data-intensive scientific applications requires effective utilization of the computational power of the available computing platforms, including grids, clouds, clusters, multi-core and many-core processors, and graphics processing units (GPUs). However, scientists who need to leverage such platforms are usually not parallel or distributed programming experts. Thus, they face numerous challenges when implementing and porting their software-based experimental tools to such platforms. In this paper, we introduce a sequential-to-parallel engineering approach to help scientists engineer their scientific applications. Our approach is based on capturing sequential program details, planned parallelization aspects, and program deployment details using a set of domain-specific visual languages (DSVLs). Then, using code generation, we generate the corresponding parallel program with the necessary parallel and distributed programming models (MPI, OpenCL, or OpenMP). We summarize three case studies (matrix multiplication, N-Body simulation, and digital signal processing) to evaluate our approach.
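The generated code itself is not shown in the abstract; as a rough idea of what a generated OpenMP version of the matrix-multiplication case study might resemble (the matrix layout and the naming are assumptions, not the paper's output):

// Illustrative OpenMP matrix multiplication, C = A * B, with square
// row-major matrices of order n; this is not the paper's generated code.
void matmul_omp(const double* A, const double* B, double* C, int n) {
    #pragma omp parallel for collapse(2)
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            double sum = 0.0;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}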
ISBN (print): 9781479989386
Tasking is a prominent parallel programming model. In this paper we conduct a first study into the feasibility of task-parallel execution at the CUDA grid level, rather than the stream/kernel level, for regular task graphs with fixed in-out dependencies, similar to those found in wavefront computational patterns, making the findings broadly applicable. We propose and evaluate three CUDA task-progression algorithms, in which threadblocks cooperatively process the task graph, and analyze their performance in terms of tasking throughput, atomics, and memory I/O overheads. Our initial results demonstrate a throughput of 38 million tasks/second on a Kepler K20X architecture.
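The three task-progression algorithms are not detailed in the abstract. As a CPU-side analogue of the general idea, workers cooperatively claiming ready tasks from a dependency-counted graph via atomics might look like this (plain C++ threads stand in for CUDA threadblocks; this is not the authors' implementation):

#include <atomic>
#include <cstddef>
#include <thread>
#include <vector>

// Cooperative task progression: workers repeatedly scan the graph, atomically
// claim any task whose dependencies are met, "execute" it, and release its
// successors by decrementing their dependency counters. The caller initializes
// each task's unmet counter with its predecessor count. Workers busy-wait
// until the graph is drained, which is acceptable for an illustration.
struct Task {
    std::atomic<int> unmet{0};          // unmet predecessor count
    std::atomic<bool> claimed{false};   // set by the worker that runs it
    std::vector<int> successors;        // tasks that depend on this one
};

void run_graph(std::vector<Task>& tasks, int num_workers) {
    std::atomic<int> remaining{static_cast<int>(tasks.size())};
    auto worker = [&] {
        while (remaining.load() > 0) {
            for (std::size_t i = 0; i < tasks.size(); ++i) {
                bool expect = false;
                if (tasks[i].unmet.load() == 0 &&
                    tasks[i].claimed.compare_exchange_strong(expect, true)) {
                    // (the task body would run here)
                    for (int s : tasks[i].successors)
                        tasks[s].unmet.fetch_sub(1);
                    remaining.fetch_sub(1);
                }
            }
        }
    };
    std::vector<std::thread> pool;
    for (int w = 0; w < num_workers; ++w) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}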
Traditional methods for processing large images are extremely time-intensive. Moreover, conventional image processing methods do not take advantage of available computing resources such as multicore central processing units (CPUs) and manycore general-purpose graphics processing units (GP-GPUs). Studies suggest that applying parallel programming techniques to various image filters should improve overall performance without compromising the existing resources. Recent studies also suggest that a parallel implementation of image processing on a compute unified device architecture (CUDA)-accelerated CPU/GPU system has the potential to process images very quickly. In this paper, we introduce a CUDA-accelerated image processing method suitable for multicore/manycore systems. Using a bitmap file, we implement image processing and filtering through a traditional sequential C program and a newly introduced parallel CUDA/C program. A key step of the proposed algorithm is to load the pixel bytes into a one-dimensional array whose length equals image width × image height × bytes per pixel, so that the image can be processed in parallel. According to experimental results, the proposed CUDA-accelerated parallel image processing algorithm achieves a speedup factor of up to 365 for an image with 8,192×8,192 pixels.
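The indexing scheme described above, a flat array of width × height × bytes-per-pixel bytes, can be sketched as follows; the grayscale filter is only an illustrative stand-in for the paper's filters, written in plain C++ rather than CUDA/C:

#include <cstddef>
#include <cstdint>
#include <vector>

// Flat pixel buffer of size width * height * bytes_per_pixel, indexed as
// (y * width + x) * bytes_per_pixel, so every pixel can be processed
// independently (and hence in parallel). Assumes at least 3 bytes per pixel
// in B, G, R order, as in a 24-bit BMP.
void grayscale_inplace(std::vector<std::uint8_t>& pixels,
                       std::size_t width, std::size_t height,
                       std::size_t bytes_per_pixel) {
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            std::size_t i = (y * width + x) * bytes_per_pixel;
            std::uint8_t b = pixels[i], g = pixels[i + 1], r = pixels[i + 2];
            std::uint8_t gray =
                static_cast<std::uint8_t>(0.114 * b + 0.587 * g + 0.299 * r);
            pixels[i] = pixels[i + 1] = pixels[i + 2] = gray;
        }
    }
}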
One of the most relevant problems for large organizations is the choice of locations for establishing facilities, distribution centers, or retail stores. This logistics issue involves a strategic decision that can significantly affect the effective cost of the product. Several papers tackle this issue, known as the Facility Location Problem. The objective of this paper is to analyze applicable heuristics previously developed by other authors and to define a mathematical formulation for the fuel distribution industry in Brazil. We start from an analysis of the upstream and downstream flows practiced in this segment and of how the corresponding transportation costs, including taxes, are formed. We then propose the use of parallel programming techniques based on the Message Passing Interface (MPI) with the objective of reducing transportation costs within a reasonable execution time. Results show that this approach provides substantial performance gains compared to serial execution.
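The paper's formulation is not reproduced in the abstract; the generic MPI pattern of distributing candidate solutions across ranks and reducing to the cheapest one might look roughly like this (the cost function and the candidate encoding are placeholders, not the authors' model):

#include <mpi.h>
#include <cstdio>

// Placeholder cost model: in the real problem this would evaluate upstream
// and downstream transportation costs (including taxes) for a candidate
// facility configuration.
static double evaluate_cost(int candidate) {
    return 1000.0 + 7.0 * ((candidate * 31) % 97);
}

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int num_candidates = 100000;
    double local_best = 1e30;
    // Each rank evaluates a strided subset of the candidate locations.
    for (int c = rank; c < num_candidates; c += size) {
        double cost = evaluate_cost(c);
        if (cost < local_best) local_best = cost;
    }
    double global_best;
    MPI_Reduce(&local_best, &global_best, 1, MPI_DOUBLE, MPI_MIN, 0,
               MPI_COMM_WORLD);
    if (rank == 0) std::printf("best transportation cost: %f\n", global_best);
    MPI_Finalize();
}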
Summary form only given. The electromagnetic transient (EMT) simulation of a large-scale power system consumes so much computational power that parallel programming techniques are urgently needed in this area. For example, realistic-sized power systems include thousands of buses, generators, and transmission lines. Massive-thread computing is one of the key developments that can increase EMT computational capabilities substantially when the processing unit has enough hardware cores. Compared to the traditional CPU, the graphics processing unit (GPU) has many more cores with distributed memory, which can offer higher data throughput. This paper proposes a massive-thread EMT program (MT-EMTP) and develops massive-thread parallel modules for linear passive elements, the universal line model, and the universal machine model for offline EMT simulation. An efficient node-mapping structure is proposed to transform the original power system admittance matrix into a block-node diagonal sparse format to exploit the massive-thread parallel GPU architecture. The developed MT-EMTP program has been tested on large-scale power systems of up to 2458 three-phase buses with detailed component modeling. The simulation results and execution times are compared with mainstream commercial software, EMTP-RV, to show the improvement in performance with equivalent accuracy.
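The block-node diagonal format is not defined in the abstract; purely as an illustration of the general idea of block-diagonal sparse storage, and not the MT-EMTP data structure, one might store it as:

#include <vector>

// Hypothetical block-diagonal sparse storage: each diagonal block is a small
// dense matrix kept contiguously so that independent blocks can be processed
// by separate groups of threads. This only illustrates the general idea, not
// the node-mapping structure used by MT-EMTP.
struct BlockDiagonal {
    std::vector<int> block_size;     // order of each diagonal block
    std::vector<int> block_offset;   // start of each block in `values`
    std::vector<double> values;      // dense blocks, stored back to back
    std::vector<int> node_map;       // original node index -> block-local index
};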
ISBN (print): 9781467369985
A technique for the enhancement of point targets in clutter is described. The local 3-D spectrum at each pixel is estimated recursively. An optical flow field for the textured background is then generated using the 3-D autocorrelation function, and the local velocity estimates are used to apply high-pass, velocity-selective spatiotemporal filters with finite impulse responses (FIRs) to subtract the background clutter signal, leaving the foreground target signal plus noise. Parallel software implementations using a multicore central processing unit (CPU) and a graphics processing unit (GPU) are investigated.
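The velocity-selective filters themselves are not specified in the abstract; as a much-simplified illustration of the clutter-subtraction step, a per-pixel temporal FIR high-pass with arbitrarily chosen taps could be applied as follows (this is not the paper's spatiotemporal filter bank):

#include <cstddef>
#include <vector>

// Per-pixel temporal FIR filtering over a stack of frames:
//   out[p] = sum_k h[k] * frame[t - k][p]
// A high-pass tap set (here simply a frame difference) suppresses slowly
// varying background and keeps fast-moving point targets.
std::vector<float> temporal_highpass(const std::vector<std::vector<float>>& frames,
                                     std::size_t t) {
    const std::vector<float> h = {1.0f, -1.0f};     // illustrative FIR taps
    std::vector<float> out(frames[t].size(), 0.0f);
    for (std::size_t k = 0; k < h.size() && k <= t; ++k)
        for (std::size_t p = 0; p < out.size(); ++p)
            out[p] += h[k] * frames[t - k][p];
    return out;
}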
ISBN (print): 9781510817982
Communication patterns that appear in RDMA-based parallel programming are different from those in standard MPI programming. In this paper, we evaluated the performance of typical communication patterns in RDMA-based applications using simulations. The simulations predicted execution times with errors of less than 10%. This accuracy was sufficient to detect a performance change between two different synchronization points in an application. We also showed that these simulations were useful for analyzing the cause of performance degradation in a communication pattern.
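The abstract does not name the RDMA interface that was simulated; as a generic illustration of a one-sided communication pattern of the kind discussed, MPI's RMA interface (commonly layered over RDMA) can express a remote put bracketed by synchronization points:

#include <mpi.h>
#include <cstdio>

// Generic one-sided (RDMA-style) pattern: rank 0 puts a value directly into
// rank 1's exposed window; the fences mark synchronization points of the kind
// whose cost such simulations analyze. Run with at least two ranks. This is
// an illustration, not the authors' benchmark code.
int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buffer = -1;                             // exposed memory on every rank
    MPI_Win win;
    MPI_Win_create(&buffer, sizeof(int), sizeof(int), MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                       // open the access epoch
    if (rank == 0) {
        int payload = 42;
        MPI_Put(&payload, 1, MPI_INT, /*target rank=*/1, /*displacement=*/0,
                1, MPI_INT, win);                // one-sided write, no recv call
    }
    MPI_Win_fence(0, win);                       // close the epoch

    if (rank == 1) std::printf("received %d via one-sided put\n", buffer);
    MPI_Win_free(&win);
    MPI_Finalize();
}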