Search Results - Inner Mongolia University Library

22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Ren, Bin; Krishnamoorthy, Sriram; Agrawal, Kunal; Kulkarni, Milind

Affiliations: Coll William & Mary, Williamsburg, VA 23187, USA; Pacific Northwest Natl Labs, Richland, WA, USA; Washington Univ St Louis, St Louis, MO, USA; Purdue Univ, W Lafayette, IN 47907, USA

ISBN: (print) 9781450344937

Modern hardware contains parallel execution resources that are well-suited for data parallelism (vector units) and task parallelism (multicores). However, most work on parallel scheduling focuses on one type of hardware or the other. In this work, we present a scheduling framework that allows for a unified treatment of task- and data-parallelism. Our key insight is an abstraction, task blocks, that uniformly handles data-parallel iterations and task-parallel tasks, allowing them to be scheduled on vector units or executed independently on multicores. Our framework allows us to define schedulers that can dynamically select between executing task blocks on vector units or multicores. We show that these schedulers are asymptotically optimal, and deliver the maximum amount of parallelism available in computation trees. To evaluate our schedulers, we develop program transformations that can convert mixed data- and task-parallel programs into task-block-based programs. Using a prototype instantiation of our scheduling framework, we show that, on an 8-core system, we can simultaneously exploit vector and multicore parallelism to achieve 14x-108x speedup over sequential baselines.

Keywords: Task parallelism; Data parallelism; General Scheduler

Eunomia: Scaling Concurrent Search Trees under Contention Using HTM 17

22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Wang, Xin; Zhang, Weihua; Wang, Zhaoguo; Wei, Ziyun; Chen, Haibo; Zhao, Wenyun

Affiliations: Fudan Univ, Software Sch, Shanghai, Peoples R China; Fudan Univ, Shanghai Key Lab Data Sci, Shanghai, Peoples R China; Fudan Univ, Sch Comp Sci, Shanghai, Peoples R China; Shanghai Jiao Tong Univ, Inst Parallel & Distributed Syst, Shanghai, Peoples R China; NYU, Comp Sci Dept, New York, NY 10003, USA

ISBN: (print) 9781450344937

While hardware transactional memory (HTM) has recently been adopted to construct efficient concurrent search tree structures, such designs fail to deliver scalable performance under contention. In this paper, we first conduct a detailed analysis of an HTM-based concurrent B+Tree, which uncovers several reasons for excessive HTM aborts induced by both false and true conflicts under contention. Based on the analysis, we advocate Eunomia, a design pattern for search trees which contains several principles to reduce HTM aborts, including splitting HTM regions with version-based concurrency control to reduce HTM working sets, partitioned data layout to reduce false conflicts, proactively detecting and avoiding true conflicts, and adaptive concurrency control. To validate their effectiveness, we apply such designs to construct a scalable concurrent B+Tree using HTM. Evaluation using key-value store benchmarks on a 20-core HTM-capable multi-core machine shows that Eunomia leads to 5X-11X speedup under high contention, while incurring small overhead under low contention.

Keywords: Hardware Transactional Memory; Concurrent Search Tree; Opportunistic Consistency

Self-Checkpoint: An In-Memory Checkpoint Method Using Less Space and Its Practice on Fault-Tolerant HPL 17

22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Tang, Xiongchao; Zhai, Jidong; Yu, Bowen; Chen, Wenguang; Zheng, Weimin

Affiliations: Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China

ISBN: (print) 9781450344937

Fault tolerance is increasingly important in high performance computing due to the substantial growth of system scale and decreasing system reliability. In-memory/diskless checkpoint has gained extensive attention as a solution to avoid the IO bottleneck of traditional disk-based checkpoint methods. However, applications using previous in-memory checkpoint methods are left with little available memory space. To provide high reliability, previous in-memory checkpoint methods either need to keep two copies of checkpoints to tolerate failures while updating old checkpoints or trade performance for space by flushing in-memory checkpoints to disk. In this paper, we propose a novel in-memory checkpoint method, called self-checkpoint, which can not only achieve the same reliability as previous in-memory checkpoint methods, but also increase the available memory space for applications by almost 50%. To validate our method, we apply the self-checkpoint to an important problem, fault-tolerant HPL. We implement a scalable and fault-tolerant HPL based on this new method, called SKT-HPL, and validate it on two large-scale systems. Experimental results with 24,576 processes show that SKT-HPL achieves over 95% of the performance of the original HPL. Compared to the state-of-the-art in-memory checkpoint method, it improves the available memory size by 47% and the performance by 5%.

Keywords: Fault Tolerance; In-Memory Checkpoint; Fault-Tolerant HPL; Memory Consumption

Optimizing the Four-Index Integral Transform Using Data Movement Lower Bounds Analysis 17

22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Rajbhandari, Samyam; Rastello, Fabrice; Kowalski, Karol; Krishnamoorthy, Sriram; Sadayappan, P.

Affiliations: Ohio State Univ, Columbus, OH 43210, USA; INRIA, Rocquencourt, France; Pacific Northwest Natl Lab, Richland, WA 99352, USA

ISBN: (print) 9781450344937

The four-index integral transform is a fundamental and computationally demanding calculation used in many computational chemistry suites such as NWChem. It transforms a four-dimensional tensor from one basis to another. This transformation is most efficiently implemented as a sequence of four tensor contractions that each contract a four-dimensional tensor with a two-dimensional transformation matrix. Differing degrees of permutation symmetry in the intermediate and final tensors in the sequence of contractions cause intermediate tensors to be much larger than the final tensor and limit the number of electronic states in the modeled systems. Loop fusion, in conjunction with tiling, can be very effective in reducing the total space requirement, as well as data movement. However, the large number of possible choices for loop fusion and tiling, and data/computation distribution across a parallel system, make it challenging to develop an optimized parallel implementation for the four-index integral transform. We develop a novel approach to address this problem, using lower bounds modeling of data movement complexity. We establish relationships between available aggregate physical memory in a parallel computer system and ineffective fusion configurations, enabling their pruning and consequent identification of effective choices and a characterization of optimality criteria. This work has resulted in the development of a significantly improved implementation of the four-index transform that enables higher performance and the ability to model larger electronic systems than the current implementation in the NWChem quantum chemistry software suite.
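
As a small illustration of the "sequence of four tensor contractions" mentioned in the abstract, the sketch below transforms one index at a time and checks the result against a direct eight-loop evaluation. It is only a toy under stated assumptions: the tensor is stored as a plain dictionary, permutation symmetry is ignored, and the names `transform_one_index`, `four_index_stepwise`, and `four_index_direct` are hypothetical helpers, not code from the paper or from NWChem.

```python
import itertools
import random


def transform_one_index(tensor, matrix, axis, n):
    """Contract one axis of a 4-D tensor (dict keyed by index tuples) with matrix."""
    out = {}
    for idx in itertools.product(range(n), repeat=4):
        total = 0.0
        for k in range(n):
            src = list(idx)
            src[axis] = k  # sum over the old index on this axis
            total += matrix[k][idx[axis]] * tensor[tuple(src)]
        out[idx] = total
    return out


def four_index_stepwise(tensor, matrix, n):
    """Four successive contractions, one per tensor index: O(n^5) work in total."""
    for axis in range(4):
        tensor = transform_one_index(tensor, matrix, axis, n)
    return tensor


def four_index_direct(tensor, matrix, n):
    """Reference evaluation that sums over all four old indices at once: O(n^8)."""
    out = {}
    for a, b, c, d in itertools.product(range(n), repeat=4):
        total = 0.0
        for i, j, k, l in itertools.product(range(n), repeat=4):
            total += (matrix[i][a] * matrix[j][b] * matrix[k][c]
                      * matrix[l][d] * tensor[(i, j, k, l)])
        out[(a, b, c, d)] = total
    return out


if __name__ == "__main__":
    n = 3  # tiny size so the O(n^8) reference stays fast
    rng = random.Random(0)
    tensor = {idx: rng.random() for idx in itertools.product(range(n), repeat=4)}
    matrix = [[rng.random() for _ in range(n)] for _ in range(n)]
    stepwise = four_index_stepwise(tensor, matrix, n)
    direct = four_index_direct(tensor, matrix, n)
    print(max(abs(stepwise[key] - direct[key]) for key in stepwise))
```

The stepwise version keeps a full intermediate tensor alive after each contraction, which hints at why the abstract treats intermediate storage, fusion, and tiling as the central concerns.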

Keywords: four-index; distributed algorithm; tensors; lower bounds; parallel algorithm; fusion; 4-index; processor mapping; optimal schedule; communication optimization; scheduling; tensor contraction; optimizing; 4-index transform

Reducing the burden of parallel loop schedulers for many-core processors 18

Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Mahwish Arif; Hans Vandierendonck

Affiliations: Queen's University Belfast

ISBN: (print) 9781450349826

This work proposes a low-overhead half-barrier pattern to schedule fine-grain parallel loops and considers its integration in the Intel OpenMP and Cilkplus schedulers. Experimental evaluation demonstrates that the scheduling overhead of our techniques is 43% lower than Intel OpenMP and 12.1x lower than Cilk. We observe 22% speedup on 48 threads, with a peak of 2.8x speedup.

Deadlock-free buffer configuration for stream computing

International Journal of High Performance Computing Applications, 2017, vol. 31, no. 5, pp. 441-450

Authors: Li, Peng; Beard, Jonathan C.; Buhler, Jeremy D.

Affiliations: Washington Univ, Dept Comp Sci & Engn, St Louis, MO 63130, USA

Stream computing is a popular paradigm for parallel and distributed computing, where compute nodes are connected by first-in first-out data channels. Each channel can be considered as a concatenation of several data buffers, including an output buffer for the sender and an input buffer for the receiver. The configuration of buffer sizes impacts the performance as well as the correctness of the application. In this article, we focus on application deadlocks that are caused by incorrect configuration of buffer sizes. We describe three types of deadlock in streaming applications, categorized by how they can be created. To avoid them, we first prove necessary and sufficient conditions for deadlock-free computations; then, based on the theorems, we propose both compile-time and runtime solutions for deadlock avoidance.

Keywords: Buffer configuration; deadlock avoidance; feedback channels; parallel and distributed computing; streaming computing

Performance challenges in modular parallel programs 18

Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Umut A. Acar; Vitaly Aksenov; Arthur Charguéraud; Mike Rainey

Affiliations: Carnegie Mellon University and Inria, France; Inria, France and ITMO University, Russia; Inria, France and Université de Strasbourg, CNRS, ICube, France; Inria, France

ISBN: (print) 9781450349826

Over the past decade, many programming languages and systems for parallel computing have been developed, including Cilk, Fork/Join Java, Habanero Java, parallel Haskell, parallel ML, and X10. Although these systems raise the level of abstraction at which parallel code is written, performance continues to require the programmer to perform extensive optimizations and tuning, often by taking various architectural details into account. One such key optimization is granularity control, which requires the programmer to determine when and how parallel tasks should be *** this paper, we briefly describe some of the challenges associated with automatic granularity control when trying to achieve portable performance for parallel programs with arbitrary nesting of parallel constructs. We consider a result from the functional-programming community, whose starting point is to consider an "oracle" that can predict the work of parallel codes, and thereby control granularity. We discuss the challenges in implementing such an oracle and proving that it has the desired theoretical properties under the nested-parallel programming model.
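
To make the idea of oracle-guided granularity control concrete, here is a minimal sketch under loud assumptions: `predicted_work` is a hypothetical stand-in for the oracle (it simply assumes work is proportional to the range length), and the "fork" is only counted rather than executed in parallel. It is not the authors' algorithm, just a toy showing how a work prediction can decide between forking and running sequentially.

```python
def predicted_work(lo, hi):
    """Hypothetical oracle: assume work is proportional to the range length."""
    return hi - lo


def controlled_sum(xs, lo, hi, threshold, stats):
    """Sum xs[lo:hi], forking only when the predicted work exceeds threshold."""
    threshold = max(1, threshold)  # guard so the recursion always terminates
    if predicted_work(lo, hi) <= threshold:
        stats["sequential_leaves"] += 1
        return sum(xs[lo:hi])  # small task: run sequentially
    mid = (lo + hi) // 2
    stats["forks"] += 1  # a real runtime would spawn the two halves in parallel here
    left = controlled_sum(xs, lo, mid, threshold, stats)
    right = controlled_sum(xs, mid, hi, threshold, stats)
    return left + right


def run(n, threshold):
    """Sum 0..n-1 and report how many forks and sequential leaves were created."""
    xs = list(range(n))
    stats = {"forks": 0, "sequential_leaves": 0}
    total = controlled_sum(xs, 0, n, threshold, stats)
    return total, stats


if __name__ == "__main__":
    for threshold in (1, 64, 1024):
        print(threshold, run(1024, threshold))
```

A small threshold creates many tiny tasks whose scheduling overhead would dominate, while a large one leaves no parallelism; choosing that threshold portably is the tuning burden the abstract describes.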

It's Time for a New Old Language 17

22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Steele, Guy L., Jr.

Affiliations: Oracle Labs, Burlington, MA 01803, USA

ISBN: (print) 9781450344937

The most popular programming language in computer science has no compiler or interpreter. Its definition is not written down in any one place. It has changed a lot over the decades, and those changes have introduced ambiguities and inconsistencies. Today, dozens of variations are in use, and its complexity has reached the point where it needs to be re-explained, at least in part, every time it is used. Much effort has been spent in hand-translating between this language and other languages that do have compilers. The language is quite amenable to parallel computation, but this fact has gone unexploited. In this talk we will summarize the history of the language, highlight the variations and some of the problems that have arisen, and propose specific solutions. We suggest that it is high time that this language be given a complete formal specification, and that compilers, IDEs, and proof-checkers be created to support it, so that all the best tools and techniques of our trade may be applied to it also.

Keywords: programming languages; compilers; specifications

Energy-optimal configuration selection for manycore chips with variation

International Journal of High Performance Computing Applications, 2017, vol. 31, no. 5, pp. 451-466

Authors: Langer, Akhil; Totoni, Ehsan; Palekar, Udatta; Kale, Laxmikant V.

Affiliations: Intel Fed, 1906 Fox Dr, Champaign, IL 61820, USA; Univ Illinois, Coll Business, Urbana, IL 61801, USA; Univ Illinois, Dept Comp Sci, 1304 W Springfield Ave, Urbana, IL 61801, USA

Operating chips at high energy efficiency is one of the major challenges for modern large-scale supercomputers. Low-voltage operation of transistors increases the energy efficiency but leads to frequency and power variation across cores on the same chip. Finding energy-optimal configurations for such chips is a hard problem. In this work, we study how integer linear programming techniques can be used to obtain energy-efficient configurations of chips that have heterogeneous cores. Our proposed methodologies give optimal configurations, as compared with competent but sub-optimal heuristics, while having negligible timing overhead. The proposed ParSearch method gives up to 13.2% and 7% savings in energy while causing only a 2% increase in execution time of two HPC applications: miniMD and Jacobi, respectively. Our results show that integer linear programming can be a very powerful online method to obtain energy-optimal configurations.

Keywords: energy; power; optimization; multicore chips; low-voltage computing; near-threshold voltage computing; process variation; heterogeneity; integer programming; quadratic integer programming

RaftLib: A C++ template library for high performance stream parallel processing

International Journal of High Performance Computing Applications, 2017, vol. 31, no. 5, pp. 391-404

Authors: Beard, Jonathan C.; Li, Peng; Chamberlain, Roger D.

Affiliations: ARM Res, 5707 Southwest Pkwy, Suite 100, Austin, TX 78735, USA; Amazon Inc, Seattle, WA, USA; Washington Univ, Dept Comp Sci & Engn, St Louis, MO 63130, USA

Stream processing is a compute paradigm that has been around for decades, yet until recently has failed to garner the same attention as other mainstream languages and libraries (e.g. C++, OpenMP, MPI). Stream processing has great promise: the ability to safely exploit extreme levels of parallelism to process huge volumes of streaming data. There have been many implementations, both libraries and full languages. The full languages implicitly assume that the streaming paradigm cannot be fully exploited in legacy languages, while library approaches are often preferred for being integrable with the vast expanse of extant legacy code. Libraries, however, are often criticized for yielding to the shape of their respective languages. RaftLib aims to fully exploit the stream processing paradigm, enabling a full spectrum of streaming graph optimizations, while providing a platform for the exploration of integrability with legacy C/C++ code. RaftLib is built as a C++ template library, enabling programmers to utilize the robust C++ standard library, and other legacy code, along with RaftLib's parallelization framework. RaftLib supports several online optimization techniques: dynamic queue optimization, automatic parallelization, and real-time low-overhead performance monitoring.

Keywords: Stream processing; big-data; C++ template library; high performance computing; RaftLib; performance monitoring; parallel processing
