Search Results - Inner Mongolia University Library

Energy-optimal configuration selection for manycore chips with variation

International Journal of High Performance Computing Applications, 2017, vol. 31, no. 5, pp. 451-466

Authors: Langer, Akhil; Totoni, Ehsan; Palekar, Udatta; Kale, Laxmikant V. (Intel Fed, 1906 Fox Dr, Champaign, IL 61820, USA; Univ Illinois, Coll Business, Urbana, IL 61801, USA; Univ Illinois, Dept Comp Sci, 1304 W Springfield Ave, Urbana, IL 61801, USA)

Operating chips at high energy efficiency is one of the major challenges for modern large-scale supercomputers. Low-voltage operation of transistors increases the energy efficiency but leads to frequency and power variation across cores on the same chip. Finding energy-optimal configurations for such chips is a hard problem. In this work, we study how integer linear programming techniques can be used to obtain energy-efficient configurations of chips that have heterogeneous cores. Our proposed methodologies give optimal configurations as compared with competent but sub-optimal heuristics while having negligible timing overhead. The proposed ParSearch method gives up to 13.2% and 7% savings in energy while causing only a 2% increase in execution time of two HPC applications: miniMD and Jacobi, respectively. Our results show that integer linear programming can be a very powerful online method to obtain energy-optimal configurations.

Keywords: energy power optimization multicore chips low-voltage computing near-threshold voltage computing process variation heterogeneity integer programming quadratic integer programming

Interaction of parallel programming constructs and coherence protocols

Proceedings of the 1997 6th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Bianchini, Ricardo; Carrera, Enrique V.; Kontothanassis, Leonidas (Federal Univ of Rio de Janeiro, Rio de Janeiro, Brazil)

Some of the most common parallel programming idioms include locks, barriers, and reduction operations. The interaction of these programming idioms with the multiprocessor's coherence protocol has a significant impact on performance. In addition, the advent of machines that support multiple coherence protocols prompts the question of how to best implement such parallel constructs, i.e., what combination of implementation and coherence protocol yields the best performance. In this paper we study the running time and communication behavior of (1) centralized (ticket) and MCS spin locks, (2) centralized, dissemination, and tree-based barriers, and (3) parallel and sequential reductions, under pure and competitive update coherence protocols; results for a write-invalidate protocol are presented mostly for comparison purposes. Our experiments indicate that parallel programming techniques that are well-established for write-invalidate protocols, such as MCS locks and parallel reductions, are often inappropriate for update-based protocols. In contrast, techniques such as dissemination and tree barriers achieve superior performance under update-based protocols. Our results also show that the implementation of parallel programming idioms must take the coherence protocol into account, since update-based protocols often lead to different design decisions than write-invalidate protocols. Our main conclusion is that protocol-conscious implementation of parallel programming structures can significantly improve application performance; for multiprocessors that can support more than one coherence protocol, both the protocol and implementation should be taken into account when exploiting parallel constructs.

Keywords: parallel processing systems
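
For readers unfamiliar with the dissemination barrier named in the abstract above, the following is a minimal sketch of its communication pattern as it is commonly described: in round k, thread i signals thread (i + 2^k) mod P, and the barrier completes after ceil(log2 P) rounds. The function names are hypothetical and chosen for illustration only; this is not the implementation evaluated in the paper, and it models only who signals whom, not the coherence traffic the paper measures.

```python
def dissemination_schedule(num_threads):
    """Return one list per round; entry i is the thread that thread i signals."""
    # Number of rounds is ceil(log2(P)), computed with integer arithmetic.
    rounds = (num_threads - 1).bit_length()
    return [[(i + 2 ** k) % num_threads for i in range(num_threads)]
            for k in range(rounds)]


def everyone_heard_from_everyone(num_threads):
    """Simulate the rounds and check that every thread has (transitively)
    learned that every other thread reached the barrier."""
    known = [{i} for i in range(num_threads)]
    for partners in dissemination_schedule(num_threads):
        # All signals in a round carry the knowledge held before the round.
        snapshot = [set(s) for s in known]
        for sender, receiver in enumerate(partners):
            known[receiver] |= snapshot[sender]
    return all(len(s) == num_threads for s in known)
```

Calling `dissemination_schedule(4)` yields two rounds, and `everyone_heard_from_everyone` can be used to confirm that the pattern also works when the thread count is not a power of two.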

TigerQuoll: Parallel Event-based JavaScript

18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Bonetta, Daniele; Binder, Walter; Pautasso, Cesare (Univ Lugano USI, Fac Informat, Lugano, Switzerland)

ISBN: (print) 9781450319225

JavaScript, the most popular language on the Web, is rapidly moving to the server-side, becoming even more pervasive. Still, JavaScript lacks support for shared memory parallelism, making it challenging for developers to exploit multicores present in both servers and clients. In this paper we present TigerQuoll, a novel API and runtime for parallel programming in JavaScript. TigerQuoll features an event-based API and a parallel runtime allowing applications to exploit a mutable shared memory space. The programming model of TigerQuoll features automatic consistency and concurrency management, such that developers do not have to deal with shared-data synchronization. TigerQuoll supports an innovative transaction model that allows for eventual consistency to speed up high-contention workloads. Experiments show that TigerQuoll applications scale well, allowing one to implement common parallelism patterns in JavaScript.

Keywords: Languages Performance JavaScript Event-based programming Parallelism Eventual Transactions

POSTER: LB-HM: Load Balance-Aware Data Placement on Heterogeneous Memory for Task-Parallel HPC Applications

27th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Xie, Zhen; Liu, Jie; Ma, Sam; Li, Jiajia; Li, Dong (Univ Calif Merced, CA, USA; Coll William Mary, Williamsburg, VA, USA)

ISBN: (print) 9781450392044

The emergence of heterogeneous memory (HM) provides a cost-effective and high-performance solution to memory-consuming HPC applications. However, when using HM, migrating data objects wisely is critical for high performance. In this work, we introduce a load balance-aware page management system, named LB-HM. LB-HM introduces task semantics during memory profiling, rather than being application-agnostic. Evaluating with a set of memory-consuming HPC applications, we show that LB-HM reduces existing load imbalance and leads to an average of 17.1% and 15.4% (up to 26.0% and 23.2%) performance improvement, compared with a hardware-based solution and an industry-quality software-based solution on Optane-based HM.

Keywords: Semantics

Load Balancing on Speed

15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Hofmeyr, Steven; Iancu, Costin; Blagojevic, Filip (Lawrence Berkeley Natl Lab, Berkeley, CA, USA)

ISBN: (print) 9781605587080

To fully exploit multicore processors, applications are expected to provide a large degree of thread-level parallelism. While adequate for low core counts and their typical workloads, the current load balancing support in operating systems may not be able to achieve efficient hardware utilization for parallel workloads. Balancing run queue length globally ignores the needs of parallel applications where threads are required to make equal progress. In this paper we present a load balancing technique designed specifically for parallel applications running on multicore systems. Instead of balancing run queue length, our algorithm balances the time a thread has executed on "faster" and "slower" cores. We provide a user-level implementation of speed balancing on UMA and NUMA multi-socket architectures running Linux and discuss behavior across a variety of workloads, usage scenarios and programming models. Our results indicate that speed balancing, when compared to the native Linux load balancing, improves performance and provides good performance isolation in all cases considered. Speed balancing is also able to provide comparable or better performance than DWRR, a fair multi-processor scheduling implementation inside the Linux kernel. Furthermore, parallel application performance is often determined by the implementation of synchronization operations, and speed balancing alleviates the need for tuning the implementations of such primitives.

Keywords: Experimentation Theory Performance Measurement Languages Design Parallel programming Operating System Load Balancing Speed Balancing Multicore Multisocket
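
The contrast drawn in the abstract above, between balancing run queue length and balancing the time threads spend on "faster" and "slower" cores, can be illustrated with a toy model. The sketch below is not the authors' algorithm: it assumes two identical cores, one thread running alone on the first core, the remaining threads sharing the second core equally, and (when `rotate` is set) a simple round-robin choice of which thread gets the uncontended core. The function name and parameters are hypothetical.

```python
def simulate(num_threads, intervals, rotate):
    """Toy model: one thread runs alone on core 0, the others share core 1.

    Requires num_threads >= 2. Returns the accumulated progress per thread,
    where a thread alone on a core gains 1.0 per interval.
    """
    progress = [0.0] * num_threads
    for step in range(intervals):
        # With rotation, a different thread gets the uncontended core each
        # interval; without it, thread 0 keeps that core for the whole run.
        alone = step % num_threads if rotate else 0
        for t in range(num_threads):
            share = 1.0 if t == alone else 1.0 / (num_threads - 1)
            progress[t] += share
    return progress
```

With three threads over six intervals, the static placement (queue lengths of 1 and 2, which is as balanced as queue length can get) leaves two threads at half the progress of the third, while rotation gives all three equal progress. For a barrier-synchronized application the slowest thread sets the pace, which is why equal progress matters in this model.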

Expressing Graph Algorithms Using Generalized Active Messages

18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Edmonds, Nick; Willcock, Jeremiah; Lumsdaine, Andrew (Indiana Univ, Bloomington, IN 47405, USA)

ISBN: (print) 9781450319225

Recently, graph computation has emerged as an important class of high-performance computing application whose characteristics differ markedly from those of traditional, compute-bound kernels. Libraries such as BLAS, LAPACK, and others have been successful in codifying best practices in numerical computing. The data-driven nature of graph applications necessitates a more complex application stack incorporating runtime optimization. In this paper, we present a method of phrasing graph algorithms as collections of asynchronous, concurrently executing, concise code fragments which may be invoked both locally and in remote address spaces. A runtime layer performs a number of dynamic optimizations, including message coalescing, message combining, and software routing. Practical implementations and performance results are provided for a number of representative algorithms.

Keywords: Parallel Graph Algorithms Active Messages Parallel Programming Models
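
Of the runtime optimizations listed in the abstract above, message coalescing is the simplest to illustrate: many small messages bound for the same destination are buffered and sent as one batch. The sketch below is a generic illustration under assumed names (`CoalescingSender`, `batch_size`, and a `transport` callable are all hypothetical), not the runtime layer described in the paper.

```python
from collections import defaultdict


class CoalescingSender:
    """Buffers small messages per destination and sends them in batches."""

    def __init__(self, batch_size, transport):
        self.batch_size = batch_size
        # transport is any callable taking (destination, list_of_messages).
        self.transport = transport
        self.buffers = defaultdict(list)

    def send(self, dest, message):
        """Queue a message; ship the batch once the buffer is full."""
        buf = self.buffers[dest]
        buf.append(message)
        if len(buf) >= self.batch_size:
            self.flush(dest)

    def flush(self, dest):
        """Send whatever is buffered for one destination, if anything."""
        buf = self.buffers.pop(dest, [])
        if buf:
            self.transport(dest, buf)

    def flush_all(self):
        """Drain every buffer, e.g. at the end of an algorithm phase."""
        for dest in list(self.buffers):
            self.flush(dest)
```

A caller would typically invoke `flush_all()` at a synchronization point so that partially filled buffers are not left unsent. The batch size trades per-message overhead against latency.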

DOJ: Dynamically Parallelizing Object-Oriented Programs

17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Eom, Yong Hun; Yang, Stephen; Jenista, James C.; Demsky, Brian (Univ Calif Irvine, Irvine, CA 92717, USA)

ISBN: (print) 9781450311601

We present Dynamic Out-of-Order Java (DOJ), a dynamic parallelization approach. In DOJ, a developer annotates code blocks as tasks to decouple these blocks from the parent execution thread. The DOJ compiler then analyzes the code to generate heap examiners that ensure the parallel execution preserves the behavior of the original sequential program. Heap examiners dynamically extract heap dependences between code blocks and determine when it is safe to execute a code block. We have implemented DOJ and evaluated it on twelve benchmarks. We achieved an average compilation speedup of 31.15x over OoOJava and an average execution speedup of 12.73x over sequential versions of the benchmarks.

Keywords: Algorithms Performance Parallel programming Dynamic Analysis Object-Oriented Analysis Heap Analysis Parallelization

Analysis of event synchronization in a parallel programming tool

2nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP 1990

Authors: Callahan, David; Kennedy, Ken; Subhlok, Jaspal (Tera Computer Company, 400 North 34th Street, Seattle, WA 98103, United States; Department of Computer Science, Rice University, Houston, TX 77251-1892, United States)

ISBN: (print) 0897913507

Understanding synchronization is important for a parallel programming tool that uses dependence analysis as the basis for advising programmers on the correctness of parallel constructs. This paper discusses static analysis methods that can be applied to parallel programs with event variable synchronization. The objective is to be able to predict potential data races in a parallel program. The focus is on how dependences and synchronization statements inside loops can be used to analyze complete programs with parallel loop and parallel case style parallelism. © 1990 ACM.

Keywords: Synchronization

Provably and Practically Efficient Granularity Control

24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP)

Authors: Acar, Umut A.; Aksenov, Vitaly; Chargueraud, Arthur; Rainey, Mike (Carnegie Mellon Univ, Pittsburgh, PA 15213, USA; INRIA, Paris, France; ITMO Univ, St Petersburg, Russia; Univ Strasbourg, ICube, CNRS, Strasbourg, France; Indiana Univ, Bloomington, IN 47405, USA)

ISBN: (print) 9781450362252

Over the past decade, many programming languages and systems for parallel computing have been developed, e.g., Fork/Join and Habanero Java, Parallel Haskell, Parallel ML, and X10. Although these systems raise the level of abstraction for writing parallel codes, performance continues to require labor-intensive optimizations for coarsening the granularity of parallel executions. In this paper, we present provably and practically efficient techniques for controlling granularity within the run-time system of the language. Our starting point is "oracle-guided scheduling", a result from the functional-programming community that shows that granularity can be controlled by an "oracle" that can predict the execution time of parallel codes. We give an algorithm for implementing such an oracle and prove that it has the desired theoretical properties under the nested-parallel programming model. We implement the oracle in C++ by extending Cilk and evaluate its practical performance. The results show that our techniques can essentially eliminate hand tuning while closely matching the performance of hand-tuned codes.

Keywords: parallel programming languages granularity control
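
The idea of oracle-guided granularity control described in the abstract above can be sketched roughly as follows: before splitting work into parallel tasks, ask an oracle for a predicted running time, and run sequentially when the prediction falls below a threshold. This is a simplified illustration under stated assumptions, not the paper's algorithm or its C++/Cilk implementation. The `Oracle` class, the moving-average update, the threshold name `kappa`, and the use of range length as the cost measure are all hypothetical choices, and the "parallel" branch here simply recurses where a real runtime would spawn tasks.

```python
import time


class Oracle:
    """Predicts running time as cost times an estimated seconds-per-unit rate."""

    def __init__(self, seconds_per_unit=1e-7):
        # Assumed starting guess; refined from measured sequential runs.
        self.seconds_per_unit = seconds_per_unit

    def predict(self, cost):
        return cost * self.seconds_per_unit

    def report(self, cost, elapsed):
        # Blend the observed rate into the estimate (simple moving average).
        if cost > 0 and elapsed > 0:
            observed = elapsed / cost
            self.seconds_per_unit = 0.5 * self.seconds_per_unit + 0.5 * observed


def controlled_sum(xs, lo, hi, oracle, kappa, stats):
    """Sum xs[lo:hi], splitting only when the predicted time exceeds kappa."""
    cost = hi - lo
    if cost <= 1 or oracle.predict(cost) <= kappa:
        # Small enough: run sequentially and feed the measurement back.
        start = time.perf_counter()
        total = sum(xs[lo:hi])
        oracle.report(cost, time.perf_counter() - start)
        stats["sequential_leaves"] += 1
        return total
    stats["splits"] += 1
    mid = (lo + hi) // 2
    # A real runtime would spawn these two calls as parallel tasks.
    left = controlled_sum(xs, lo, mid, oracle, kappa, stats)
    right = controlled_sum(xs, mid, hi, oracle, kappa, stats)
    return left + right
```

With a very large `kappa` the whole range runs as one sequential leaf, and with `kappa` set to zero the range is split down to single elements. Values in between let the measured rate decide where to stop splitting.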

OpenCL as a Unified Programming Model for Heterogeneous CPU/GPU Clusters

17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

Authors: Kim, Jungwon; Seo, Sangmin; Lee, Jun; Nah, Jeongho; Jo, Gangwon; Lee, Jaejin (Seoul Natl Univ, Sch Comp Sci & Engn, Ctr Manycore Programming, Seoul 151744, South Korea)

ISBN: (print) 9781450311601

In this paper, we propose an OpenCL framework for heterogeneous CPU/GPU clusters, and show that the framework achieves both high performance and ease of programming. The framework provides an illusion of a single system for the user. It allows the application to utilize multiple heterogeneous compute devices, such as multicore CPUs and GPUs, in a remote node as if they were in a local node. No communication API, such as the MPI library, is required in the application source. We implement the OpenCL framework and evaluate its performance on a heterogeneous CPU/GPU cluster that consists of one host node and nine compute nodes using eleven OpenCL benchmark applications.

Keywords: Algorithm Design Experimentation Languages Measurement Performance OpenCL Clusters Heterogeneous computing Programming models
