检索结果-内蒙古大学图书馆

Reaching bandwidth saturation using transparent injection parallelization

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS 2017年第5期31卷 405-421页

作者： Chaimov, Nicholas Ibrahim, Khaled Z. Williams, Samuel Iancu, Costin Univ Oregon Eugene OR 97403 USA Lawrence Berkeley Natl Lab Computat Res Div Berkeley CA USA Lawrence Berkeley Natl Lab Performance & Algorithms Res Grp Berkeley CA USA

Although logically available, applications may not exploit enough instantaneous communication concurrency to maximize network utilization on HPC systems. This is exacerbated in hybrid programming models that combine single program multiple data with OpenMP or CUDA. We present the design of a multi-threaded runtime able to transparently increase the instantaneous network concurrency and to provide near saturation bandwidth, independent of the application configuration and dynamic behavior. The runtime offloads communication requests from application level tasks to multiple communication servers. The servers use system specific performance models to attain network saturation. Our techniques alleviate the need for spatial and temporal application level message concurrency optimizations. Experimental results show improved message throughput and bandwidth by as much as 150% for 4 KB messages on InfiniBand and by as much as 120% for 4 KB messages on Cray Aries. For more complex operations such as all-to-all collectives, we observe as much as 30% speedup. This translates into 23% speedup on 12,288 cores for a NAS FT implemented using FFTW. We observe as much as 76% speedup on 1500 cores for an already optimized UPC+OpenMP geometric multigrid application using hybrid parallelism. For the geometric multigrid GPU implementation, we observe as much as 44% speedup on 512 GPUs.

关键词： Communication concurrency multi-threaded runtime one sided communication parallel injection UPC

来源：评论

学校读者我要写书评

暂无评论

PPoPP 2014 - Proceedings of the 2014 acm sigplan symposium on principles and practice of parallel programming

PPoPP 2014 - Proceedings of the 2014 ACM SIGPLAN Symposium o...

引用

2014 19th acm sigplan symposium on principles and practice of parallel programming, PPoPP 2014

ISBN: (纸本)9781450326568

The proceedings contain 43 papers. The topics discussed include: predator: predictive false sharing detection;concurrency testing using schedule bounding: an empirical study;trace driven dynamic deadlock detection and reproduction;efficient search for inputs causing high floating-point errors;portable, MPI-interoperable coarray Fortran;eliminating global interpreter locks in ruby through hardware transactional memory;leveraging hardware message passing for efficient thread synchronization;well-structured futures and cache locality;time-warp: lightweight abort minimization in transactional memory;beyond parallel programming with domain specific languages;a decomposition for in-place matrix transposition;in-place transposition of rectangular matrices on accelerators;and parallelizing dynamic programming through rank convergence.

关键词： FORTRAN (programming language)

来源：评论

学校读者我要写书评

暂无评论

JParEnt: parallel entropy decoding for JPEG decompression on heterogeneous multicore architectures 7

JParEnt: Parallel entropy decoding for JPEG decompression on...

引用

Euro-Par Conference / 7th International Workshop on programming Models and Applications for Multicores and Manycores (PMAM) held in conjunction with the 21st sigplan symposium on principles and practice of parallel programming (PPoPP)

作者： Sodsong, Wasuwee Jung, Minyoung Park, Jinwoo Burgstaller, Bernd Yonsei Univ Dept Comp Sci Seoul South Korea

ISBN: (纸本)9781450341967

The JPEG format employs Huffman codes to compress the entropy data of an image. Huffman codewords are of variable length, which makes parallel entropy decoding a difficult problem. To determine the start position of a codeword in the bitstream, the previous codeword must be decoded first. We present JParEnt, a new approach to parallel entropy decoding for JPEG decompression on heterogeneous multicores. JParEnt conducts JPEG decompression in two steps: (1)an efficient sequential scan of the entropy data on the CPU to determine the start-positions (boundaries) of coefficient blocks in the bitstream, followed by (2)a parallel entropy decoding step on the graphics processing unit (GPU). The block boundary scan constitutes a reinterpretation of the Huffman-coded entropy data to determine codeword boundaries in the bitstream. We introduce a dynamic workload partitioning scheme to account for GPUs of low compute power relative to the CPU. This configuration has become common with the advent of SoCs with integrated graphics processors (IGPs). We leverage additional parallelism through pipelined execution across CPU and GPU. For systems providing a unified address space between CPU and GPU, we employ zero-copy to completely eliminate the data transfer overhead. Our experimental evaluation of JParEnt was conducted on six heterogeneous multicore systems: one server and two desktops with dedicated GPUs, one desktop with an IGP, and two embedded systems. For a selection of more than 1000JPEG images, JParEnt outperforms the SIMD-implementation of the libjpeg-turbo library by up to a factor of 4.3x, and the previously fastest JPEG decompression method for heterogeneous multicores by up to a factor of 2.2x. JParEnt's entropy data scan consumes 45% of the entropy decoding time of libjpeg-turbo on average. Given this new ratio for the sequential part of JPEG decompression, JParEnt achieves up to97% of the maximum attainable speedup (95% on average). On the IGP-based desktop platform,

关键词： JPEG decoding entropy decoding prefix codes heterogeneous multicores GPU programming

来源：评论

学校读者我要写书评

暂无评论

A programming System for Future Proofing Performance Critical Libraries 16

A Programming System for Future Proofing Performance Critica...

引用

21st acm sigplan symposium on principles and practice of parallel programming (PPoPP)

作者： Chang, Li-Wen El Hajj, Izzat Kim, Hee-Seok Gomez-Luna, Juan Dakkak, Abdul Hwu, Wen-Mei Univ Illinois Champaign IL 61801 USA Univ Cordoba E-14071 Cordoba Spain

ISBN: (纸本)9781450340922

We present Tangram, a programming system for writing performance-portable programs. The language enables programmers to write computation and composition codelets, supported by tuning knobs and primitives for expressing data parallelism and work decomposition. The compiler and runtime use a set of techniques such as hierarchical composition, coarsening, data placement, tuning, and runtime selection based on input characteristics and micro-profiling. The resulting performance is competitive with optimized vendor libraries.

关键词： Libraries

来源：评论

学校读者我要写书评

暂无评论

Samsara parallel: A Non-BSP parallel-in-Time Model 16

Samsara Parallel: A Non-BSP Parallel-in-Time Model

引用

21st acm sigplan symposium on principles and practice of parallel programming (PPoPP)

作者： Chen, Yifeng Huang, Kun Wang, Bei Li, Guohui Cui, Xiang Peking Univ HCST Key Lab EECS Beijing 100871 Peoples R China Chinese Acad Sci Dalian Inst Chem Phys Dalian 116023 Peoples R China

ISBN: (纸本)9781450340922

Many time-dependent problems like molecular dynamics of protein folding require a large number of time steps. The latencies and overheads of common-purpose clusters with accelerators are too big for high-frequency iteration. We introduce an algorithmic model called Samsara parallel (or SP) which, unlike BSP, relies on asynchronous communications and can repeatedly return to earlier time steps to refine the precision of computation. This also extends a line of research called parallel-in-Time in computational chemistry and physics.

关键词： BSP parallel-in-time Molecular Dynamics

来源：评论

学校读者我要写书评

暂无评论

Data-centric Combinatorial Optimization of parallel Code 16

Data-centric Combinatorial Optimization of Parallel Code

引用

21st acm sigplan symposium on principles and practice of parallel programming (PPoPP)

作者： Luo, Hao Chen, Guoyang Li, Pengcheng Ding, Chen Shen, Xipeng Univ Rochester Rochester NY 14627 USA North Carolina State Univ Raleigh NC 27695 USA

ISBN: (纸本)9781450340922

Memory performance is one essential factor for tapping into the full potential of the massive parallelism of GPU. It has motivated some recent efforts in GPU cache modeling. This paper presents a new data-centric way to model the performance of a system with heterogeneous memory resources. The new model is composable, meaning it can predict the performance difference due to placing data differently by profiling the execution just once.

关键词： Locality metrics Locality modeling Footprint

来源：评论

学校读者我要写书评

暂无评论

parallel Type-checking with Haskell using Saturating LVars and Stream Generators 16

Parallel Type-checking with Haskell using Saturating LVars a...

引用

21st acm sigplan symposium on principles and practice of parallel programming (PPoPP)

作者： Newton, Ryan R. Agacan, Omer S. Fogg, Peter Tobin-Hochstadt, Sam Indiana Univ Bloomington IN 47405 USA edX Cambridge MA USA

ISBN: (纸本)9781450340922

Given the sophistication of recent type systems, unification-based type-checking and inference can be a time-consuming phase of compilation-especially when union types are combined with subtyping. It is natural to consider improving performance through parallelism, but these algorithms are challenging to parallelize due to complicated control structure and difficulties representing data in a way that is both efficient and supports concurrency. We provide techniques that address these problems based on the LVish approach to deterministic-by-default parallel programming. We extend LVish with Saturating LVars, the first LVars implemented to release memory during the object's lifetime. Our design allows us to achieve a parallel speedup on worst-case (exponential) inputs of Hindley-Milner inference, and on the Typed Racket type-checking algorithm, which yields up an 8.46x parallel speedup on 14 cores for type-checking examples drawn from the Racket repository.

关键词： parallel programming

来源：评论

学校读者我要写书评

暂无评论

Merge-based Sparse Matrix-Vector Multiplication (SpMV) using the CSR Storage Format 16

Merge-based Sparse Matrix-Vector Multiplication (SpMV) using...

引用

21st acm sigplan symposium on principles and practice of parallel programming (PPoPP)

作者： Merrill, Duane Garland, Michael NVIDIA Corp Santa Clara CA 95050 USA

ISBN: (纸本)9781450340922

We present a perfectly balanced, "merge-based" parallel method for computing sparse matrix-vector products (SpMV). Our algorithm operates directly upon the Compressed Sparse Row (CSR) sparse matrix format, a predominant in-memory representation for general-purpose sparse linear algebra computations. Our CsrMV performs an equitable multi-partitioning of the input dataset, ensuring that no single thread can be overwhelmed by assignment to (a) arbitrarily-long rows or (b) an arbitrarily-large number of zerolength rows. This parallel decomposition requires neither offline preprocessing nor specialized/ancillary data formats. We evaluate our method on both CPU and GPU microarchitecture across an enormous corpus of diverse real world matrix datasets. We show that traditional CsrMV methods are inconsistent performers subject to order-of-magnitude slowdowns, whereas the performance response of our method is substantially impervious to row-length heterogeneity.

关键词： SpMV segmented reduction sparse matrix sparse graph parallel decomposition merge merge-path GPU

来源：评论

学校读者我要写书评

暂无评论

Declarative Coordination of Graph-based parallel Programs 16

Declarative Coordination of Graph-based Parallel Programs

引用

21st acm sigplan symposium on principles and practice of parallel programming (PPoPP)

作者： Cruz, Flavio Rocha, Ricardo Goldstein, Seth Copen Univ Porto CRACS Rua Campo Alegre 1021 P-4169007 Oporto Portugal Univ Porto INESC TEC Rua Campo Alegre 1021 P-4169007 Oporto Portugal Univ Porto Fac Sci Rua Campo Alegre 1021 P-4169007 Oporto Portugal Carnegie Mellon Univ Pittsburgh PA 15213 USA

ISBN: (纸本)9781450340922

Declarative programming has been hailed as a promising approach to parallel programming since it makes it easier to reason about programs while hiding the implementation details of parallelism from the programmer. However, its advantage is also its disadvantage as it leaves the programmer with no straightforward way to optimize programs for performance. In this paper, we introduce Coordinated Linear Meld (CLM), a concurrent forward-chaining linear logic programming language, with a declarative way to coordinate the execution of parallel programs allowing the programmer to specify arbitrary scheduling and data partitioning policies. Our approach allows the programmer to write graph-based declarative programs and then optionally to use coordination to fine-tune parallel performance. In this paper we specify the set of coordination facts, discuss their implementation in a parallel virtual machine, and show-through example-how they can be used to optimize parallel execution. We compare the performance of CLM programs against the original uncoordinated Linear Meld and several other frameworks.

关键词： Design Languages Performance parallel programming Linear Logic

来源：评论

学校读者我要写书评

暂无评论

POSTER: An Architecture and programming Model for Accelerating parallel Commutative Computations via Privatization 17

POSTER: An Architecture and Programming Model for Accelerati...

引用

Proceedings of the 22nd acm sigplan symposium on principles and practice of parallel programming

作者： Vignesh Balaji Dhruva Tirumala Brandon Lucia Carnegie Mellon University Pittsburgh PA USA

ISBN: (纸本)9781450344937

Synchronization and data movement are the key impediments to an efficient parallel execution. To ensure that data shared by multiple threads remain consistent, the programmer must use synchronization (e.g., mutex locks) to serialize threads' accesses to data. This limits parallelism because it forces threads to sequentially access shared resources. Additionally, systems use cache coherence to ensure that processors always operate on the most up-to-date version of a value even in the presence of private caches. Coherence protocol implementations cause processors to serialize their accesses to shared data, further limiting parallelism and performance.

关键词： cache-coherence shared memory parallel programming commutativity

来源：评论

学校读者我要写书评

暂无评论

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

时间限定

文献类型

馆藏选择

核心期刊

语言

文献类型

帮助

文字说明：

检索规则说明：

检索范例：

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案：

请选择收藏分类：

通借通还

建议与咨询 留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

时间限定

文献类型

馆藏选择

核心期刊

语言

文献类型

帮助

文字说明：

检索规则说明：

检索范例：

分类表

所选分类

限定检索结果

文献类型

馆藏范围

日期分布

学科分类号

主题

机构

作者

语言

请选择保存的检索档案： 新增检索档案 确定 取消

请选择收藏分类： 新增自定义分类 确定 取消

通借通还

建议与咨询留下您的常用邮箱和电话号码，以便我们向您反馈解决方案和替代方法

请选择保存的检索档案：

请选择收藏分类：