Operating chips at high energy efficiency is one of the major challenges for modern large-scale supercomputers. Low-voltage operation of transistors increases energy efficiency but leads to frequency and power variation across cores on the same chip. Finding energy-optimal configurations for such chips is a hard problem. In this work, we study how integer linear programming techniques can be used to obtain energy-efficient configurations of chips that have heterogeneous cores. Our proposed methodologies give optimal configurations, compared with competent but sub-optimal heuristics, while incurring negligible timing overhead. The proposed ParSearch method gives up to 13.2% and 7% savings in energy while causing only a 2% increase in execution time of two HPC applications: miniMD and Jacobi, respectively. Our results show that integer linear programming can be a very powerful online method to obtain energy-optimal configurations.
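The optimization problem the abstract describes can be illustrated with a toy model: each heterogeneous core offers a set of (frequency, power) operating points, and the goal is to pick one point per core minimizing energy. A minimal sketch, using exhaustive search in place of a real ILP solver; the core tables and the power/time model are assumptions, not the paper's ParSearch:

```python
from itertools import product

# Hypothetical per-core operating points (GHz, Watts). After low-voltage
# binning, cores on the same chip expose different (freq, power) pairs,
# which is what makes the assignment problem non-trivial.
cores = {
    0: [(1.0, 3.0), (1.5, 5.0), (2.0, 8.0)],
    1: [(1.1, 3.5), (1.4, 4.5), (1.9, 7.5)],
}
work = 1e9  # cycles of work per core (assumed identical across cores)

def energy(assignment):
    # Execution time is set by the slowest core; energy = total power * time.
    time = max(work / (f * 1e9) for f, _ in assignment)
    power = sum(p for _, p in assignment)
    return power * time

# Exhaustive search over all per-core choices stands in for the ILP solve.
best = min(product(*cores.values()), key=energy)
```

In this toy instance the lowest-frequency pair wins because the slowest core bounds execution time, so raising one core's frequency only adds power; an ILP formulation encodes the same trade-off with binary selection variables per operating point.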
Stream processing is a compute paradigm that has been around for decades, yet until recently has failed to garner the same attention as other mainstream languages and libraries (e.g. C++, OpenMP, MPI). Stream processing has great promise: the ability to safely exploit extreme levels of parallelism to process huge volumes of streaming data. There have been many implementations, both libraries and full languages. The full languages implicitly assume that the streaming paradigm cannot be fully exploited in legacy languages, while library approaches are often preferred for being integrable with the vast expanse of extant legacy code. Libraries, however, are often criticized for yielding to the shape of their respective languages. RaftLib aims to fully exploit the stream processing paradigm, enabling a full spectrum of streaming graph optimizations, while providing a platform for exploring integrability with legacy C/C++ code. RaftLib is built as a C++ template library, enabling programmers to utilize the robust C++ standard library, and other legacy code, along with RaftLib's parallelization framework. RaftLib supports several online optimization techniques: dynamic queue optimization, automatic parallelization, and real-time low-overhead performance monitoring.
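The streaming-graph idea can be sketched in a few lines: compute kernels that communicate only through bounded FIFO queues, so each can run on its own thread. A minimal Python sketch (RaftLib itself is a C++ template library; the kernel and queue names here are illustrative, not its API):

```python
import queue
import threading

# A "kernel" repeatedly pulls from its input queue, applies a function,
# and pushes downstream. A None sentinel shuts the pipeline down in order.
def kernel(fn, inq, outq):
    while True:
        item = inq.get()
        if item is None:
            if outq is not None:
                outq.put(None)        # propagate shutdown downstream
            break
        result = fn(item)
        if outq is not None:
            outq.put(result)

q1 = queue.Queue(maxsize=8)           # bounded queues provide backpressure
q2 = queue.Queue(maxsize=8)
results = []

square = threading.Thread(target=kernel, args=(lambda x: x * x, q1, q2))
sink = threading.Thread(target=kernel, args=(results.append, q2, None))
square.start(); sink.start()
for i in range(5):
    q1.put(i)                         # the source feeds the graph
q1.put(None)
square.join(); sink.join()
```

Because kernels share no state beyond the queues, the runtime is free to resize queues or replicate kernels, the kind of online optimization the abstract mentions.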
Although logically available, applications may not exploit enough instantaneous communication concurrency to maximize network utilization on HPC systems. This is exacerbated in hybrid programming models that combine single program multiple data with OpenMP or CUDA. We present the design of a multi-threaded runtime able to transparently increase the instantaneous network concurrency and to provide near-saturation bandwidth, independent of the application configuration and dynamic behavior. The runtime offloads communication requests from application-level tasks to multiple communication servers. The servers use system-specific performance models to attain network saturation. Our techniques alleviate the need for spatial and temporal application-level message concurrency optimizations. Experimental results show improved message throughput and bandwidth by as much as 150% for 4 KB messages on InfiniBand and by as much as 120% for 4 KB messages on Cray Aries. For more complex operations such as all-to-all collectives, we observe as much as 30% speedup. This translates into 23% speedup on 12,288 cores for a NAS FT implemented using FFTW. We observe as much as 76% speedup on 1,500 cores for an already optimized UPC+OpenMP geometric multigrid application using hybrid parallelism. For the geometric multigrid GPU implementation, we observe as much as 44% speedup on 512 GPUs.
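The offload pattern at the heart of this design can be sketched simply: application tasks post communication requests to a shared queue, and a pool of dedicated server threads drains it, so the number of in-flight operations is decoupled from the application's threading. A toy sketch with an assumed stand-in for the transport layer (`fake_network_send` is not a real API):

```python
import queue
import threading

requests = queue.Queue()   # application tasks post requests here
sent = []
lock = threading.Lock()

def fake_network_send(msg):
    # Placeholder for the real transport (e.g. an RDMA put); assumption.
    with lock:
        sent.append(msg)

def comm_server():
    # Each server drains the shared queue; running several servers raises
    # the instantaneous network concurrency without changing the app.
    while True:
        msg = requests.get()
        if msg is None:
            break
        fake_network_send(msg)

servers = [threading.Thread(target=comm_server) for _ in range(4)]
for s in servers:
    s.start()
for i in range(100):
    requests.put(("payload", i))   # app-level tasks enqueue and move on
for _ in servers:
    requests.put(None)             # one shutdown sentinel per server
for s in servers:
    s.join()
```

The paper's runtime additionally sizes the server pool from system-specific performance models; this sketch only shows the decoupling itself.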
ISBN (print): 9781450341967
The JPEG format employs Huffman codes to compress the entropy data of an image. Huffman codewords are of variable length, which makes parallel entropy decoding a difficult problem: to determine the start position of a codeword in the bitstream, the previous codeword must be decoded first. We present JParEnt, a new approach to parallel entropy decoding for JPEG decompression on heterogeneous multicores. JParEnt conducts JPEG decompression in two steps: (1) an efficient sequential scan of the entropy data on the CPU to determine the start positions (boundaries) of coefficient blocks in the bitstream, followed by (2) a parallel entropy decoding step on the graphics processing unit (GPU). The block boundary scan constitutes a reinterpretation of the Huffman-coded entropy data to determine codeword boundaries in the bitstream. We introduce a dynamic workload partitioning scheme to account for GPUs of low compute power relative to the CPU. This configuration has become common with the advent of SoCs with integrated graphics processors (IGPs). We leverage additional parallelism through pipelined execution across CPU and GPU. For systems providing a unified address space between CPU and GPU, we employ zero-copy to completely eliminate the data transfer overhead. Our experimental evaluation of JParEnt was conducted on six heterogeneous multicore systems: one server and two desktops with dedicated GPUs, one desktop with an IGP, and two embedded systems. For a selection of more than 1000 JPEG images, JParEnt outperforms the SIMD implementation of the libjpeg-turbo library by up to a factor of 4.3x, and the previously fastest JPEG decompression method for heterogeneous multicores by up to a factor of 2.2x. JParEnt's entropy data scan consumes 45% of the entropy decoding time of libjpeg-turbo on average. Given this new ratio for the sequential part of JPEG decompression, JParEnt achieves up to 97% of the maximum attainable speedup (95% on average). On the IGP-based desktop platform,
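The two-phase idea generalizes beyond JPEG: boundaries of variable-length codewords must be found by one sequential pass, after which each codeword (or block of codewords) can be decoded independently. A minimal sketch with a made-up three-symbol prefix code, not the actual JPEG Huffman tables:

```python
# Hypothetical prefix code: '0' -> a, '10' -> b, '11' -> c.
CODE = {"0": "a", "10": "b", "11": "c"}

def scan_boundaries(bits):
    """Phase 1 (sequential): record the start index of every codeword.
    Only codeword *lengths* matter here, so the scan is cheaper than a
    full decode -- the same observation JParEnt exploits."""
    starts, i = [], 0
    while i < len(bits):
        starts.append(i)
        i += 1 if bits[i] == "0" else 2   # '0' is 1 bit, '10'/'11' are 2
    return starts

def decode_at(bits, start):
    """Phase 2: with boundaries known, each codeword decodes independently
    (on a GPU, one thread per block of codewords)."""
    if bits[start] == "0":
        return CODE["0"]
    return CODE[bits[start:start + 2]]

bits = "0101100"
starts = scan_boundaries(bits)
decoded = "".join(decode_at(bits, s) for s in starts)
```

In JParEnt the sequential phase finds coefficient-block boundaries rather than individual codewords, and phase 2 runs as a GPU kernel, but the dependency structure is the same.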
ISBN (print): 9781450332057
The proceedings contain 44 papers. The topics discussed include: predicate RCU: an RCU for scalable concurrent updates; automatic scalable atomicity via semantic locking; a framework for practical parallel fast matrix multiplication; PLUTO+: near-complete modeling of affine transformations for parallelism and locality; distributed memory code generation for mixed irregular/regular computations; performance implications of dynamic memory allocators on transactional memory systems; low-overhead software transactional memory with progress guarantees and strong semantics; barrier elision for production parallel programs; scalable and efficient implementation of 3D unstructured meshes computation: a case study on matrix assembly; and diagnosing the causes and severity of one-sided message contention.
ISBN (print): 9781450340922
Given the sophistication of recent type systems, unification-based type-checking and inference can be a time-consuming phase of compilation, especially when union types are combined with subtyping. It is natural to consider improving performance through parallelism, but these algorithms are challenging to parallelize due to complicated control structure and difficulties representing data in a way that is both efficient and supports concurrency. We provide techniques that address these problems based on the LVish approach to deterministic-by-default parallel programming. We extend LVish with Saturating LVars, the first LVars implemented to release memory during the object's lifetime. Our design allows us to achieve a parallel speedup on worst-case (exponential) inputs of Hindley-Milner inference, and on the Typed Racket type-checking algorithm, which yields up to an 8.46x parallel speedup on 14 cores for type-checking examples drawn from the Racket repository.
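The LVar idea underlying LVish is that shared state may only grow monotonically via a lattice join, so concurrent writes commute and the final value is deterministic regardless of scheduling. A minimal sketch of a set-valued LVar in Python (illustrative only; LVish is a Haskell library, and this omits threshold reads and the saturation/memory-release behavior the paper adds):

```python
import threading

class SetLVar:
    """A cell whose state only grows by set union (the lattice join).
    Because union is commutative, associative, and idempotent, any
    interleaving of put() calls yields the same final state."""

    def __init__(self):
        self._state = set()
        self._lock = threading.Lock()

    def put(self, elem):
        with self._lock:
            self._state.add(elem)     # join with the singleton {elem}

    def freeze(self):
        # Reading the exact contents is only deterministic once all
        # writers are done ("freezing" in LVish terms).
        with self._lock:
            return frozenset(self._state)

lv = SetLVar()
threads = [threading.Thread(target=lv.put, args=(i % 3,)) for i in range(12)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Twelve racing writers, many of them redundant, still produce exactly {0, 1, 2}; that determinism is what makes LVars a safe substrate for parallel unification.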
ISBN (print): 9781450340922
We present AUTOGEN, an algorithm that, for a wide class of dynamic programming (DP) problems, automatically discovers highly efficient cache-oblivious parallel recursive divide-and-conquer algorithms from inefficient iterative descriptions of DP recurrences. AUTOGEN analyzes the set of DP table locations accessed by the iterative algorithm when run on a DP table of small size, and automatically identifies a recursive access pattern and a corresponding provably correct recursive algorithm for solving the DP recurrence. We use AUTOGEN to autodiscover efficient algorithms for several well-known problems. Our experimental results show that several autodiscovered algorithms significantly outperform parallel looping and tiled loop-based algorithms. Also, these algorithms are less sensitive to fluctuations of memory and bandwidth compared with their looping counterparts, and their running times and energy profiles remain relatively more stable. To the best of our knowledge, AUTOGEN is the first algorithm that can automatically discover new nontrivial divide-and-conquer algorithms.
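AUTOGEN's starting point, running an iterative DP on a small table while recording which cells each update reads, can be sketched directly. The longest-common-subsequence recurrence below is a stand-in example (not necessarily one from the paper); the recorded (written, read) pairs are exactly the dependency information from which a recursive access pattern would be inferred:

```python
def lcs_accesses(a, b):
    """Run the iterative LCS DP and trace every table access."""
    n, m = len(a), len(b)
    T = [[0] * (m + 1) for _ in range(n + 1)]
    trace = []  # list of (cell written, cells read) pairs
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                T[i][j] = T[i - 1][j - 1] + 1
                trace.append(((i, j), [(i - 1, j - 1)]))
            else:
                T[i][j] = max(T[i - 1][j], T[i][j - 1])
                trace.append(((i, j), [(i - 1, j), (i, j - 1)]))
    return T[n][m], trace

length, trace = lcs_accesses("ab", "ba")
```

Here every cell reads only cells above and to its left, so quadrant-recursive evaluation (top-left block first, bottom-right last) is valid; recognizing and proving such patterns automatically is what AUTOGEN contributes.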
ISBN (print): 9781450326568
The proceedings contain 43 papers. The topics discussed include: predator: predictive false sharing detection; concurrency testing using schedule bounding: an empirical study; trace-driven dynamic deadlock detection and reproduction; efficient search for inputs causing high floating-point errors; portable, MPI-interoperable coarray Fortran; eliminating global interpreter locks in Ruby through hardware transactional memory; leveraging hardware message passing for efficient thread synchronization; well-structured futures and cache locality; time-warp: lightweight abort minimization in transactional memory; beyond parallel programming with domain-specific languages; a decomposition for in-place matrix transposition; in-place transposition of rectangular matrices on accelerators; and parallelizing dynamic programming through rank convergence.
ISBN (print): 9781450332057
This paper studies the essence of heterogeneity from the perspective of language mechanism design. The proposed mechanism, called tiles, is a program construct that bridges two relative levels of computation: an outer level of source data in larger, slower, or more distributed memory, and an inner level of data blocks in smaller, faster, or more localized memory.
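The two-level structure the abstract describes is the familiar tiling pattern: an outer loop streams blocks of a large data set, and an inner computation works on one tile at a time. A minimal illustrative sketch (plain Python, not the paper's language mechanism):

```python
def tiled_sum(data, tile=4):
    """Sum a large sequence one tile at a time."""
    total = 0
    for start in range(0, len(data), tile):   # outer level: walk the blocks
        block = data[start:start + tile]       # bring one tile "closer"
        total += sum(block)                    # inner level: local compute
    return total
```

The point of a dedicated tiles construct is that the outer/inner split is expressed once and can then be mapped to whichever memory hierarchy (cache, scratchpad, distributed node) the target exposes.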
ISBN (print): 9781450332057
This paper proposes a novel SPMD programming model for OpenACC. Our model integrates the different granularities of parallelism, from vector-level parallelism to node-level parallelism, into a single, unified model based on OpenACC. It allows programmers to write programs for multiple accelerators using a uniform programming model, whether they are in shared- or distributed-memory systems. We implement a prototype of our model and evaluate its performance on a GPU-based supercomputer using three benchmark applications.
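The SPMD structure itself, every rank running the same program and selecting its share of the data by rank, is independent of whether ranks map to vector lanes, cores, or nodes. A minimal sketch in plain Python threads (illustrative only; the paper's model is built on OpenACC directives):

```python
import threading

def spmd_worker(rank, nranks, data, out):
    # The same program runs on every rank; only the rank-based data
    # partition differs (a cyclic distribution here).
    out[rank] = sum(x * x for x in data[rank::nranks])

def run(data, nranks=4):
    out = [0] * nranks
    threads = [threading.Thread(target=spmd_worker,
                                args=(r, nranks, data, out))
               for r in range(nranks)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(out)   # combine the per-rank partial results
```

A unified SPMD model keeps this rank/partition structure fixed while the runtime decides whether a rank becomes a GPU gang, a node, or a thread, which is the uniformity the abstract claims.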