ISBN (print): 9781479941162
Coprocessors based on the Intel Many Integrated Core (MIC) Architecture have been adopted in many high-performance computing clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC processors to achieve parallelism. In this work, we conduct a detailed study of the performance and scalability of MIC processors under different programming models using the Beacon computer cluster. Our findings are as follows. (1) On the Beacon cluster, the native MPI programming model on the MIC processors typically outperforms the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve the performance of parallel applications on clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule them on as few MIC processors as possible to reduce cross-processor communication overhead. (4) The hybrid MPI programming model, in which data processing is distributed to both MIC cores and CPU cores, can outperform the native MPI programming model.
We propose Chunks and Tasks, a parallel programming model built on abstractions for both data and work. The application programmer specifies how data and work can be split into smaller pieces, chunks and tasks, respectively. The Chunks and Tasks library maps the chunks and tasks to physical resources. In this way we seek to combine user friendliness with high performance. An application programmer can express a parallel algorithm using a few simple building blocks, defining data and work objects and their relationships. No explicit communication calls are needed; the distribution of both work and data is handled by the Chunks and Tasks library. This makes efficient implementation of complex applications that require dynamic distribution of work and data easier. At the same time, Chunks and Tasks imposes restrictions on data access and task dependencies that facilitate the development of high-performance parallel back ends. We discuss the fundamental abstractions underlying the programming model, as well as performance, determinism, and fault resilience considerations. We also present a pilot C++ library implementation for clusters of multicore machines and demonstrate its performance for irregular block-sparse matrix-matrix multiplication. (C) 2013 Elsevier B.V. All rights reserved.
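The split-and-recurse structure described above can be illustrated with a minimal Python sketch. The names `split_chunk` and `sum_task` are ours for illustration, not the actual Chunks and Tasks C++ API: the programmer only states how a chunk of data and a task of work decompose, and a runtime is then free to map the resulting pieces onto physical resources.

```python
# Illustrative sketch of the Chunks-and-Tasks idea; in the real C++
# library a runtime distributes chunks and tasks across a cluster,
# while here child tasks are simply evaluated recursively.

def split_chunk(data):
    """User-specified rule: how a data chunk splits into smaller chunks."""
    mid = len(data) // 2
    return data[:mid], data[mid:]

def sum_task(data, leaf_size=4):
    """User-specified rule: how a task splits into subtasks.

    A task either computes directly on a small enough chunk, or
    registers child tasks and combines their results; no explicit
    communication calls appear in user code.
    """
    if len(data) <= leaf_size:
        return sum(data)
    left, right = split_chunk(data)
    # A real runtime would be free to schedule these child tasks on
    # remote resources; the user code only declares the dependency.
    return sum_task(left, leaf_size) + sum_task(right, leaf_size)

print(sum_task(list(range(100))))  # 4950
```

Because the user never names a processor or sends a message, the same decomposition can be executed by very different back ends, which is what the model's restrictions on data access are meant to enable.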
ISBN (print): 9781479960200
In this paper we consider a software implementation of an algorithm for finding the boundaries of objects in images using the Sobel operator. The software implementation is presented in structural-graphical form. We propose a semi-automatic parallelization of the considered program. The parallelized algorithm was implemented in a software product, and the efficiency of the parallelization was analyzed.
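For reference, the Sobel operator convolves the image with two 3x3 kernels and takes the gradient magnitude at each pixel. A minimal pure-Python sketch (illustrative only, not the paper's implementation) looks like this:

```python
import math

# Sobel edge detection over a grayscale image given as a list of
# row lists; returns the gradient-magnitude image.

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel

def sobel(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):               # skip the 1-pixel border
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(-1, 2):
                for dx in range(-1, 2):
                    p = img[y + dy][x + dx]
                    gx += GX[dy + 1][dx + 1] * p
                    gy += GY[dy + 1][dx + 1] * p
            out[y][x] = math.hypot(gx, gy)  # gradient magnitude
    return out
```

Note that every output pixel depends only on its 3x3 input neighborhood, so the two outer loops carry no dependencies; this per-pixel independence is what makes the filter a natural target for the kind of parallelization the paper studies.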
ISBN (print): 9781479927289
Current workstations can offer truly impressive raw computational power: up to 10 TFlops on a single machine equipped with multiple CPUs and accelerators such as the Intel Xeon Phi or GPU devices. Such results can only be achieved through massive parallelism of computational devices; thus the actual barrier posed by the exploitation of modern heterogeneous HPC resources is the difficulty of developing and/or efficiently porting software on such architectures. In this paper, we present an experimental study of the achievable performance of a widely used, computationally intensive application, the Fourier Transform, i.e. the Discrete Fourier Transform (DFT) and the Fast Fourier Transform (FFT). We propose an evaluation of the benefits obtained by exploiting such resources in terms of performance and programming effort in the development of the code, with an emphasis on the programming approach adopted for code parallelization. With the exception of the interesting performance achieved by exploiting GPUs for the DFT algorithm, the use of state-of-the-art software libraries provides the best solution, since they represent a good compromise between programming effort and performance.
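As background for this comparison, the two transforms compute the same result at different algorithmic cost: a direct DFT takes O(n^2) operations while the radix-2 FFT takes O(n log n). A small pure-Python sketch of both (illustrative only; the paper evaluates optimized libraries and accelerators, not code like this):

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])   # recurse on halves
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) * odd[k]
               for k in range(n // 2)]
    return ([even[k] + twiddle[k] for k in range(n // 2)] +
            [even[k] - twiddle[k] for k in range(n // 2)])
```

The same asymptotic gap between the two formulations is what the evaluated FFT libraries, and the GPU implementations of the direct DFT, trade against programming effort.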
ISBN (print): 9780769550732
This paper describes the online, real-time traffic information system OLSIMv4 which is the updated version of the traffic information platform for the large-scale, real-world highway network of North Rhine-Westphalia. OLSIMv4 gathers its traffic information from microscopic traffic simulations that are based on loop detector data. The simulations take advantage of the topological road traffic network information such as speed limits, lane closings or mergings, and overtaking restrictions. As a result OLSIMv4 is prepared to use dynamic traffic information as provided by variable traffic signs and traffic or road works messages. Additionally, OLSIMv4 exploits thread-level parallelism on multi-core machines using a coarse-grained parallel simulation model. Moreover, it substitutes nonexistent and faulty loop detector data with calculated values in order to provide failure-safety. Its simulation results are available for four varying time horizons and they are in good accordance with empirical findings even in scenarios with larger distances between subsequent loop detectors.
The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
ISBN (print): 9780769546759
With the advent of the multicore era, the number of cores per computational node is increasing faster than the amount of memory. This diminishing memory-to-core ratio sometimes even prevents pure MPI applications from exploiting all cores available on each node. A possible solution is to add a shared-memory programming model like OpenMP inside the application to share variables between OpenMP threads that would otherwise be duplicated for each MPI task. Going hybrid can thus improve overall memory consumption, but may be a tedious task for large applications. To allow this data sharing without the overhead of mixing multiple programming models, we propose an MPI extension called Hierarchical Local Storage (HLS) that allows application developers to share common variables between MPI tasks on the same node. HLS is designed as a set of directives that preserve the original parallel semantics of the code and are compatible with the C, C++ and Fortran languages and the OpenMP programming model. This new mechanism is implemented inside a state-of-the-art MPI 1.3 compliant runtime called MPC. Experiments show that the HLS mechanism can effectively reduce the memory consumption of HPC applications. Moreover, by reducing data duplication in the shared cache of modern multicores, the HLS mechanism can also improve the performance of memory-intensive applications.
ISBN (print): 9780769548791
The conventional unified parallel computation model has become more and more complicated, offering weak pertinence and little guidance for each parallel computing phase. Therefore, a general layered and heterogeneous approach to parallel computation model research is proposed in this paper. The general layered heterogeneous parallel computation model is composed of a parallel algorithm design model, a parallel programming model, and a parallel execution model, with each model corresponding to one of the three computing phases. The properties of each model are described and research directions are also given. In the parallel algorithm design model, a high-level language is designed for algorithm designers, and a corresponding interpretation system based on text scanning is proposed to map the high-level language to machine language that runs on heterogeneous software and hardware architectures. A parallel method library and a parameter library are also provided to achieve comprehensive utilization of the different computing resources and to assign parallel tasks reasonably. Theoretical analysis shows that the general layered heterogeneous parallel computation model is clear and has a single goal for each parallel computing phase.
ISBN (print): 9780769547497
The continuous proliferation of multicore architectures has placed developers under great pressure to parallelize their applications in line with what such platforms offer. Unfortunately, traditional low-level programming models exacerbate the difficulties of building large and complex parallel applications. High-level parallel programming models are in high demand as they significantly reduce the burden on programmers and provide enough abstraction to accommodate hardware heterogeneity. In this paper, we propose a flexible parallelization methodology, and we introduce a new task-based hybrid programming model (MHPM) designed to provide high productivity and expressiveness without sacrificing performance. We show that MHPM allows easy expression of both sequential execution and several types of parallelism, including task, data and temporal parallelism, at all levels of granularity inside a single structured homogeneous programming model. In order to demonstrate the potential of our approach, we present a pure C++ implementation of MHPM, and we show that, despite its high abstraction, it provides performance comparable to lower-level programming models.
ISBN (print): 9780769549545
In this paper we present DFScala, a library for constructing and executing dataflow graphs in the Scala language. Through the use of Scala this library allows the programmer to construct coarse-grained dataflow graphs that take advantage of functional semantics for the dataflow graph and both functional and imperative semantics within the dataflow nodes. This combination allows for very clean code which exhibits the properties of dataflow programs, but which we believe is more accessible to imperative programmers. We first describe DFScala in detail, before using a number of benchmarks to evaluate both its scalability and its absolute performance relative to existing codes. DFScala has been constructed as part of the Teraflux project and is being used extensively as a basis for further research into dataflow programming.
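The core dataflow mechanism, nodes that fire once all of their inputs have arrived, can be sketched in a few lines (Python here for brevity, with invented names; DFScala itself is a Scala library and its API differs):

```python
# Toy dataflow-graph executor in the spirit of a dataflow runtime:
# each node holds a function, fires when every input slot has been
# filled, and pushes its result along outgoing edges.

class Node:
    def __init__(self, fn, n_inputs):
        self.fn = fn
        self.inputs = [None] * n_inputs
        self.missing = n_inputs
        self.targets = []            # (downstream node, input slot)

    def connect(self, target, slot):
        self.targets.append((target, slot))

    def receive(self, slot, value, ready):
        self.inputs[slot] = value
        self.missing -= 1
        if self.missing == 0:        # all inputs present: schedule
            ready.append(self)

def run(sources):
    """Execute the graph sequentially; a real runtime could fire all
    ready nodes in parallel, since they share no mutable state."""
    ready = list(sources)
    results = {}
    while ready:
        node = ready.pop()
        results[node] = node.fn(*node.inputs)
        for target, slot in node.targets:
            target.receive(slot, results[node], ready)
    return results

# Build the graph (2, 3) -> add and execute it.
a, b = Node(lambda: 2, 0), Node(lambda: 3, 0)
add = Node(lambda x, y: x + y, 2)
a.connect(add, 0)
b.connect(add, 1)
print(run([a, b])[add])  # 5
```

The single-assignment discipline (each input slot is written exactly once) is what gives dataflow graphs their deterministic, race-free semantics regardless of firing order.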