This paper gives a brief description of recent O.R. activity in China. It consists of four parts: mathematical programming; queueing theory and Markov decision processes; reliability theory; simulation. Emphasis is placed on the current situation of practical O.R.
This paper provides the vision of the Barcelona Supercomputing Center towards exascale computing. We believe that it is key to have unified views of future computer systems, looking at the good ideas, developments, and practices from the past and applying them at the scalability levels we want to consider. The programming model is Alexander's sword with which to cut the Gordian knot of exascale systems based on massive multicore architectures. The implementation of the programming model should decouple the way programs are written by the user (parallelism, address spaces, etc.) from the way they are executed by the runtime (execution vehicles, memory containers, malleability and load balancing, fault tolerance, etc.) on a specific target architecture. At the application level, it will be crucial to ensure that porting guarantees applications' survival for some decades, or their clean upgrade to the foreseeable explosion of hardware platforms. Performance tools and analysis practices are in their infancy with regard to providing the required exascale support. BSC would like to contribute this vision and its ongoing efforts to the holistic exascale initiative.
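A minimal illustrative sketch of the decoupling idea described above, not BSC's actual programming model or runtime: the user only marks work as tasks, while a separate (hypothetical) Runtime object owns the execution vehicles and could change the worker count or placement without touching the user's code. All names below are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def task(fn):
    """Mark a function as a task; scheduling is left entirely to the runtime."""
    fn.is_task = True
    return fn


@task
def stage(chunk):
    """Example task body: a small piece of work on one chunk of data."""
    return sum(x * x for x in chunk)


class Runtime:
    """Hypothetical runtime: owns the execution vehicles (here, a thread pool)."""

    def __init__(self, workers=4):
        # Worker count, placement, and load balancing are runtime decisions;
        # nothing in the user's task code refers to them.
        self.pool = ThreadPoolExecutor(max_workers=workers)

    def run(self, fn, inputs):
        futures = [self.pool.submit(fn, c) for c in inputs]
        return [f.result() for f in futures]


if __name__ == "__main__":
    data = [list(range(i, i + 1000)) for i in range(0, 8000, 1000)]
    print(sum(Runtime(workers=4).run(stage, data)))
```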
An introduction is presented in which the editor discusses various reports within the issue on topics including Message Passing Interface (MPI), parallel input/output (I/O), and parallel programming.
For decades, the RPC abstraction has been known to be fraught with serious problems related to partial failure, latency, and concurrency. Still, many developers continue to use RPC—some are even developing and open-sourcing new RPC systems—
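A small illustrative sketch of why the problems named above make a remote call harder than a local one; it uses no specific RPC framework, and every function below is invented for illustration. The simulated remote call can time out after the work may already have happened, so the caller is forced to handle retries and an ambiguous final state.

```python
import random
import time


def remote_add(a, b, timeout=0.05):
    """Simulated RPC: latency is variable, and a timeout leaves the outcome unknown."""
    latency = random.uniform(0.0, 0.1)
    if latency > timeout:
        raise TimeoutError("no reply within %.2fs; did the call happen?" % timeout)
    time.sleep(latency)
    return a + b


def call_with_retries(fn, *args, attempts=3):
    """What a local '+' never needs: explicit retry and failure handling."""
    for i in range(attempts):
        try:
            return fn(*args)
        except TimeoutError as exc:
            print("attempt %d failed: %s" % (i + 1, exc))
    raise RuntimeError("remote call failed; the result is unknown (partial failure)")


if __name__ == "__main__":
    print(call_with_retries(remote_add, 2, 3))
```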
The evolutionary path of microprocessor design includes both multicore and many-core architectures. Harnessing the most computing throughput from these architectures requires concurrent or parallel execution of instructions. The authors describe the challenges facing the industry as parallel-computing platforms become even more widely available.
Within the computing continuum, SBCs (single-board computers) are essential in the Edge and Fog, with many featuring multiple processing cores and GPU accelerators. In this way, parallel computing plays a crucial role in enabling the full computational potential of SBCs. However, selecting the best-suited solution in this context is inherently complex due to the intricate interplay between PPI (parallel programming interface) strategies, SBC architectural characteristics, and application characteristics and constraints. To our knowledge, no solution presents a combined discussion of these three aspects. To tackle this problem, this article aims to provide a benchmark of the best-suited PPIs given a set of hardware and application characteristics and requirements. Compared to existing benchmarks, we introduce new metrics, additional applications, various parallelism interfaces, and extra hardware devices. Therefore, our contributions are the methodology to benchmark parallelism on SBCs and the characterization of the best-performing PPIs and parallelism strategies for given situations. We are confident that parallel computing will be mainstream in edge and fog computing; thus, our solution provides the first insights into which kind of application and parallel programming interface is best suited to a particular SBC hardware platform.
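A simplified sketch of the kind of measurement such a benchmarking methodology rests on, not the paper's actual harness or metrics: one stand-in kernel is timed at several worker counts, and time and speedup are recorded per configuration. The kernel, job sizes, and worker counts below are assumptions chosen for illustration.

```python
from multiprocessing import Pool
import time


def kernel(n):
    # Stand-in for a real application kernel from a benchmark suite.
    return sum(i * i for i in range(n))


def measure(workers, jobs):
    """Return the wall-clock time of running all jobs with the given worker count."""
    start = time.perf_counter()
    if workers == 1:
        for n in jobs:
            kernel(n)
    else:
        with Pool(workers) as pool:
            pool.map(kernel, jobs)
    return time.perf_counter() - start


if __name__ == "__main__":
    jobs = [1_000_000] * 16
    baseline = measure(1, jobs)
    for w in (1, 2, 4):  # core counts typical of small single-board computers
        t = measure(w, jobs)
        print("workers=%d  time=%.2fs  speedup=%.2fx" % (w, t, baseline / t))
```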
Novel interconnect technologies offer solutions to on-chip communication scalability problems. This article outlines the prospects of wireless on-chip communication technologies pointing toward low-latency and energy-efficient broadcast even in large-scale chip multiprocessors. It also discusses the challenges and potential impact of adopting these technologies as key enablers of unconventional hardware architectures and algorithmic approaches to significantly improve the performance, energy efficiency, scalability, and programmability of many-core chips.
Large-scale systems increasingly exhibit a differential between intra-chip and inter-chip communication performance, especially in hybrid systems using accelerators. Processor-cores on the same socket are able to communicate at lower latencies, and with higher bandwidths, than cores on different sockets either within the same node or between nodes. A key challenge is to efficiently use this communication hierarchy and hence optimize performance. We consider here the class of applications that contains wave-front processing. In these applications, data can only be processed after their upstream neighbors have been processed. Similar dependencies arise between processors, where communication is required to pass boundary data downstream and its cost is typically determined by the slowest communication channel in use. In this work we develop a novel hierarchical wave-front approach that reduces the use of slower communications in the hierarchy, but at the cost of additional steps in the parallel computation and higher use of on-chip communications. This tradeoff is explored using a performance model. An implementation using the reverse-acceleration programming model on the petascale Roadrunner system demonstrates a 27% performance improvement at full system-scale on a kernel application. The approach is generally applicable to large-scale multi-core and accelerated systems where a differential in communication performance exists. (C) 2011 Elsevier B.V. All rights reserved.
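A minimal single-node sketch of the wave-front dependency pattern described above (not the hierarchical Roadrunner implementation): each cell of a 2-D grid depends on its upstream neighbours, so the sweep proceeds by anti-diagonals, and the cells within one anti-diagonal form the independent work that wave-front codes spread across processors. The grid size and stencil are illustrative assumptions.

```python
import numpy as np

n = 6
grid = np.zeros((n, n))
grid[0, :] = 1.0  # boundary data arriving from upstream neighbours
grid[:, 0] = 1.0

# Sweep by anti-diagonals: cell (i, j) needs (i-1, j) and (i, j-1) first.
for d in range(2, 2 * n - 1):
    # All cells on one anti-diagonal are independent of each other, which is
    # the parallelism a wave-front code distributes across processors.
    for i in range(max(1, d - n + 1), min(d, n)):
        j = d - i
        grid[i, j] = 0.5 * (grid[i - 1, j] + grid[i, j - 1])

print(grid)
```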
The latest improvements in programming languages and models have focused on simplicity and abstraction, leading Python to the top of the list of programming languages. However, there is still room for improvement when it comes to preventing users from dealing directly with distributed and parallel computing issues. This paper proposes and evaluates AutoParallel, a Python module to automatically find an appropriate task-based parallelisation of affine loop nests and execute them in parallel on a distributed computing infrastructure. It is based on sequential programming and requires a single annotation (in the form of a Python decorator) so that anyone with intermediate-level programming skills can scale up an application to hundreds of cores. The evaluation demonstrates that AutoParallel goes one step further in easing the development of distributed applications. On the one hand, the programmability evaluation highlights the benefits of using a single Python decorator instead of manually annotating each task and its parameters or, even worse, having to develop the parallel code explicitly (e.g., using OpenMP or MPI). On the other hand, the performance evaluation demonstrates that AutoParallel is capable of automatically generating task-based workflows from sequential Python code while achieving the same performance as manually taskified versions of established state-of-the-art algorithms (i.e., Cholesky, LU, and QR decompositions). Finally, AutoParallel is also capable of automatically building data blocks to increase the tasks' granularity, freeing the user from creating the data chunks and re-designing the algorithm. For advanced users, we believe that this feature can be useful as a baseline for designing blocked algorithms.
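A hypothetical sketch of the programming style the abstract describes; the decorator below is a placeholder, not AutoParallel's actual API. The point is that the user writes a plain sequential affine loop nest and adds a single annotation, leaving taskification and distributed execution to the module.

```python
def autoparallel(fn):
    """Placeholder decorator: the real module would analyse the affine loop nest,
    derive task dependencies, and hand execution to a distributed runtime."""
    return fn


@autoparallel
def matmul(a, b, c, n):
    # Sequential affine loop nest: bounds and array accesses are linear in the
    # loop indices, which is what makes automatic taskification possible.
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]


if __name__ == "__main__":
    n = 4
    a = [[1.0] * n for _ in range(n)]
    b = [[2.0] * n for _ in range(n)]
    c = [[0.0] * n for _ in range(n)]
    matmul(a, b, c, n)
    print(c[0][0])  # 8.0 for these inputs
```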
We describe here the design and performance of OdinMP/CCp, which is a portable compiler for C programs using the OpenMP directives for parallel processing with shared memory. OdinMP/CCp was written in Java for portability reasons; it takes a C program with OpenMP directives and produces a C program for POSIX threads. We describe some of the ideas behind the design of OdinMP/CCp and show some performance results achieved on an SGI Origin 2000 and a Sun E10000. Speedup measurements relative to a sequential version of the test programs show that OpenMP programs using OdinMP/CCp exhibit excellent performance on the Sun E10000 and reasonable performance on the Origin 2000. Copyright (C) 2000 John Wiley & Sons, Ltd.
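A conceptual stand-in, written in Python rather than the C/pthreads code OdinMP/CCp actually emits, for what lowering a parallel loop onto explicit threads involves: partitioning the iteration space, giving each thread a chunk and private storage, and joining at the loop's implicit barrier. All names and sizes are illustrative.

```python
import threading

N = 1_000_000
NUM_THREADS = 4
partial = [0] * NUM_THREADS  # per-thread results avoid a shared-variable race


def chunk_body(tid):
    # Each thread receives a contiguous slice of the original iteration space.
    lo = tid * N // NUM_THREADS
    hi = (tid + 1) * N // NUM_THREADS
    acc = 0
    for i in range(lo, hi):  # the original loop body
        acc += i * i
    partial[tid] = acc


threads = [threading.Thread(target=chunk_body, args=(t,)) for t in range(NUM_THREADS)]
for t in threads:
    t.start()
for t in threads:  # joining corresponds to the loop's implicit barrier
    t.join()
print(sum(partial))
```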