ISBN (print): 9781479941162
Coprocessors based on the Intel Many Integrated Core (MIC) Architecture have been adopted in many high-performance computing clusters. Typical parallel programming models, such as MPI and OpenMP, are supported on MIC processors to achieve parallelism. In this work, we conduct a detailed study of the performance and scalability of MIC processors under different programming models using the Beacon computer cluster. Our findings are as follows. (1) On the Beacon cluster, the native MPI programming model on the MIC processors typically outperforms the offload programming model, which offloads the workload to MIC cores using OpenMP. (2) On top of the native MPI programming model, multithreading inside each MPI process can further improve the performance of parallel applications on clusters with MIC coprocessors. (3) Given a fixed number of MPI processes, it is a good strategy to schedule them on as few MIC processors as possible to reduce cross-processor communication overhead. (4) The hybrid MPI programming model, in which data processing is distributed to both MIC cores and CPU cores, can outperform the native MPI programming model.
We propose Chunks and Tasks, a parallel programming model built on abstractions for both data and work. The application programmer specifies how data and work can be split into smaller pieces, chunks and tasks, respectively. The Chunks and Tasks library maps the chunks and tasks to physical resources. In this way we seek to combine user friendliness with high performance. An application programmer can express a parallel algorithm using a few simple building blocks, defining data and work objects and their relationships. No explicit communication calls are needed; the distribution of both work and data is handled by the Chunks and Tasks library. This makes efficient implementation of complex applications that require dynamic distribution of work and data easier. At the same time, Chunks and Tasks imposes restrictions on data access and task dependencies that facilitate the development of high-performance parallel back ends. We discuss the fundamental abstractions underlying the programming model, as well as performance, determinism, and fault resilience considerations. We also present a pilot C++ library implementation for clusters of multicore machines and demonstrate its performance for irregular block-sparse matrix-matrix multiplication. (C) 2013 Elsevier B.V. All rights reserved.
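The split-and-recurse structure described above can be illustrated with a minimal Python sketch. The names `split_chunk` and `sum_task` are ours for illustration, not the actual Chunks and Tasks C++ API: the programmer only states how a chunk of data and a task of work decompose, and a runtime is then free to map the resulting pieces onto physical resources.

```python
# Illustrative sketch of the Chunks-and-Tasks idea; in the real C++
# library a runtime distributes chunks and tasks across a cluster,
# while here child tasks are simply evaluated recursively.

def split_chunk(data):
    """User-specified rule: how a data chunk splits into smaller chunks."""
    mid = len(data) // 2
    return data[:mid], data[mid:]

def sum_task(data, leaf_size=4):
    """User-specified rule: how a task splits into subtasks.

    A task either computes directly on a small enough chunk, or
    registers child tasks and combines their results; no explicit
    communication calls appear in user code.
    """
    if len(data) <= leaf_size:
        return sum(data)
    left, right = split_chunk(data)
    # A real runtime would be free to schedule these child tasks on
    # remote resources; the user code only declares the dependency.
    return sum_task(left, leaf_size) + sum_task(right, leaf_size)

print(sum_task(list(range(100))))  # 4950
```

Because the user never names a processor or sends a message, the same decomposition can be executed by very different back ends, which is what the model's restrictions on data access are meant to enable.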
ISBN (print): 9781479960200
In this paper we consider a software implementation of an algorithm for finding the boundaries of objects in images using the Sobel operator. The software implementation is presented in structural-graphical form. We propose a semi-automatic parallelization of the considered program. The parallelized algorithm was implemented in a software product, and the efficiency of the parallelization was analyzed.
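For reference, the Sobel operator convolves the image with two 3x3 kernels and takes the gradient magnitude at each pixel. A minimal pure-Python sketch (illustrative only, not the paper's implementation) looks like this:

```python
import math

# Sobel edge detection over a grayscale image given as a list of
# row lists; returns the gradient-magnitude image.

GX = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]   # horizontal gradient kernel
GY = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]   # vertical gradient kernel

def sobel(img):
    h, w = len(img), len(img[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(1, h - 1):               # skip the 1-pixel border
        for x in range(1, w - 1):
            gx = gy = 0
            for dy in range(-1, 2):
                for dx in range(-1, 2):
                    p = img[y + dy][x + dx]
                    gx += GX[dy + 1][dx + 1] * p
                    gy += GY[dy + 1][dx + 1] * p
            out[y][x] = math.hypot(gx, gy)  # gradient magnitude
    return out
```

Note that every output pixel depends only on its 3x3 input neighborhood, so the two outer loops carry no dependencies; this per-pixel independence is what makes the filter a natural target for the kind of parallelization the paper studies.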
ISBN (print): 9781479927289
Current workstations can offer truly impressive raw computational power: up to 10 TFlops on a single machine equipped with multiple CPUs and accelerators such as the Intel Xeon Phi or GPU devices. Such results can only be achieved through massive parallelism of computational devices; thus the actual barrier posed by the exploitation of modern heterogeneous HPC resources is the difficulty of developing and/or efficiently porting software on such architectures. In this paper, we present an experimental study of the achievable performance of a widely used, computationally intensive application, the Fourier Transform, i.e. the Discrete Fourier Transform (DFT) and the Fast Fourier Transform (FFT). We propose an evaluation of the benefits obtained by exploiting such resources in terms of performance and programming effort in the development of the code, with an emphasis on the programming approach adopted for code parallelization. With the exception of the interesting performance achieved by exploiting GPUs for the DFT algorithm, the use of state-of-the-art software libraries provides the best solution, since they represent a good compromise between programming effort and performance.
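As background for this comparison, the two transforms compute the same result at different algorithmic cost: a direct DFT takes O(n^2) operations while the radix-2 FFT takes O(n log n). A small pure-Python sketch of both (illustrative only; the paper evaluates optimized libraries and accelerators, not code like this):

```python
import cmath

def dft(x):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
                for t in range(n))
            for k in range(n)]

def fft(x):
    """Radix-2 Cooley-Tukey FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])   # recurse on halves
    twiddle = [cmath.exp(-2j * cmath.pi * k / n) * odd[k]
               for k in range(n // 2)]
    return ([even[k] + twiddle[k] for k in range(n // 2)] +
            [even[k] - twiddle[k] for k in range(n // 2)])
```

The same asymptotic gap between the two formulations is what the evaluated FFT libraries, and the GPU implementations of the direct DFT, trade against programming effort.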
ISBN (print): 9780769550732
This paper describes the online, real-time traffic information system OLSIMv4 which is the updated version of the traffic information platform for the large-scale, real-world highway network of North Rhine-Westphalia. OLSIMv4 gathers its traffic information from microscopic traffic simulations that are based on loop detector data. The simulations take advantage of the topological road traffic network information such as speed limits, lane closings or mergings, and overtaking restrictions. As a result OLSIMv4 is prepared to use dynamic traffic information as provided by variable traffic signs and traffic or road works messages. Additionally, OLSIMv4 exploits thread-level parallelism on multi-core machines using a coarse-grained parallel simulation model. Moreover, it substitutes nonexistent and faulty loop detector data with calculated values in order to provide failure-safety. Its simulation results are available for four varying time horizons and they are in good accordance with empirical findings even in scenarios with larger distances between subsequent loop detectors.
The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
ISBN (print): 9780769546759
With the advent of the multicore era, the number of cores per computational node is increasing faster than the amount of memory. This diminishing memory-to-core ratio sometimes even prevents pure MPI applications from exploiting all cores available on each node. A possible solution is to add a shared-memory programming model like OpenMP inside the application to share variables between OpenMP threads that would otherwise be duplicated for each MPI task. Going hybrid can thus improve overall memory consumption, but may be a tedious task for large applications. To allow this data sharing without the overhead of mixing multiple programming models, we propose an MPI extension called Hierarchical Local Storage (HLS) that allows application developers to share common variables between MPI tasks on the same node. HLS is designed as a set of directives that preserve the original parallel semantics of the code and are compatible with the C, C++ and Fortran languages and the OpenMP programming model. This new mechanism is implemented inside a state-of-the-art MPI 1.3 compliant runtime called MPC. Experiments show that the HLS mechanism can effectively reduce the memory consumption of HPC applications. Moreover, by reducing data duplication in the shared cache of modern multicores, the HLS mechanism can also improve the performance of memory-intensive applications.
ISBN (print): 9780769548791
The conventional unified parallel computation model has become more and more complicated, offering weak pertinence and little guidance for each parallel computing phase. Therefore, a general layered and heterogeneous approach to parallel computation model research is proposed in this paper. The general layered heterogeneous parallel computation model is composed of a parallel algorithm design model, a parallel programming model, and a parallel execution model, with each model corresponding to one of the three computing phases. The properties of each model are described and research directions are also given. In the parallel algorithm design model, a high-level language is designed for algorithm designers, and a corresponding interpretation system based on text scanning is proposed to map the high-level language to machine language that runs on heterogeneous software and hardware architectures. A parallel method library and a parameter library are also provided to achieve comprehensive utilization of the different computing resources and to assign parallel tasks reasonably. Theoretical analysis shows that the general layered heterogeneous parallel computation model is clear and has a single goal for each parallel computing phase.
ISBN (print): 9780769547497
The continuous proliferation of multicore architectures has placed developers under great pressure to parallelize their applications in line with what such platforms offer. Unfortunately, traditional low-level programming models exacerbate the difficulties of building large and complex parallel applications. High-level parallel programming models are in high demand as they significantly reduce the burden on programmers and provide enough abstraction to accommodate hardware heterogeneity. In this paper, we propose a flexible parallelization methodology, and we introduce a new task-based hybrid programming model (MHPM) designed to provide high productivity and expressiveness without sacrificing performance. We show that MHPM allows easy expression of both sequential execution and several types of parallelism, including task, data and temporal parallelism, at all levels of granularity inside a single structured homogeneous programming model. In order to demonstrate the potential of our approach, we present a pure C++ implementation of MHPM, and we show that, despite its high abstraction, it provides performance comparable to lower-level programming models.
ISBN (print): 9780769549545
In this paper we present DFScala, a library for constructing and executing dataflow graphs in the Scala language. Through the use of Scala this library allows the programmer to construct coarse-grained dataflow graphs that take advantage of functional semantics for the dataflow graph and both functional and imperative semantics within the dataflow nodes. This combination allows for very clean code which exhibits the properties of dataflow programs, but which we believe is more accessible to imperative programmers. We first describe DFScala in detail, before using a number of benchmarks to evaluate both its scalability and its absolute performance relative to existing codes. DFScala has been constructed as part of the Teraflux project and is being used extensively as a basis for further research into dataflow programming.
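The core dataflow mechanism, nodes that fire once all of their inputs have arrived, can be sketched in a few lines (Python here for brevity, with invented names; DFScala itself is a Scala library and its API differs):

```python
# Toy dataflow-graph executor in the spirit of a dataflow runtime:
# each node holds a function, fires when every input slot has been
# filled, and pushes its result along outgoing edges.

class Node:
    def __init__(self, fn, n_inputs):
        self.fn = fn
        self.inputs = [None] * n_inputs
        self.missing = n_inputs
        self.targets = []            # (downstream node, input slot)

    def connect(self, target, slot):
        self.targets.append((target, slot))

    def receive(self, slot, value, ready):
        self.inputs[slot] = value
        self.missing -= 1
        if self.missing == 0:        # all inputs present: schedule
            ready.append(self)

def run(sources):
    """Execute the graph sequentially; a real runtime could fire all
    ready nodes in parallel, since they share no mutable state."""
    ready = list(sources)
    results = {}
    while ready:
        node = ready.pop()
        results[node] = node.fn(*node.inputs)
        for target, slot in node.targets:
            target.receive(slot, results[node], ready)
    return results

# Build the graph (2, 3) -> add and execute it.
a, b = Node(lambda: 2, 0), Node(lambda: 3, 0)
add = Node(lambda x, y: x + y, 2)
a.connect(add, 0)
b.connect(add, 1)
print(run([a, b])[add])  # 5
```

The single-assignment discipline (each input slot is written exactly once) is what gives dataflow graphs their deterministic, race-free semantics regardless of firing order.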