Search results - Inner Mongolia University Library

TensorFlow at Scale: Performance and productivity analysis of distributed training with Horovod, MLSL, and Cray PE ML

Concurrency and Computation: Practice and Experience, 2019, Vol. 31, No. 16

Authors: Kurth, Thorsten; Smorkalov, Mikhail; Mendygral, Peter; Sridharan, Srinivas; Mathuriya, Amrita. Affiliations: Lawrence Berkeley Natl Lab Natl Energy Res Sci Comp Ctr Berkeley CA 94270 USA; Intel Corp Software & Serv Grp Moscow Russia; Cray Inc Cray Programming Environm Performance Engn Bloomington MN USA; Intel Corp Parallel Comp Labs Bangalore Karnataka India; Intel Corp Data Ctr Grp Hillsboro OR USA

Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, and thus, an increasing amount of computing resources is required in order to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks which allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both features, i.e., good performance as well as flexibility. In this paper, we discuss different solutions for scaling the TensorFlow Framework to thousands of nodes on contemporary Cray XC supercomputing systems.

Keywords: deep learning; performance; scalability

SPar: A DSL for High-Level and Productive Stream Parallelism

Authors: Griebler, Dalvan; Danelutto, Marco; Torquati, Massimo; Fernandes, Luiz Gustavo. Affiliations: Av. Ipiranga 6681 - Building 32 Porto Alegre - CEP 90619-900 Brazil; Department of Computer Science Parallel Programming Models Group Largo Pontecorvo 3 Pisa 56127 Italy

This paper introduces SPar, an internal C++ Domain-Specific Language (DSL) that supports the development of classic stream parallel applications. The DSL uses standard C++ attributes to introduce annotations tagging the notable components of stream parallel applications: stream sources and stream processing stages. A set of tools process SPar code (C++ annotated code using the SPar attributes) to generate FastFlow C++ code that exploits the stream parallelism denoted by SPar annotations while targeting shared memory multi-core architectures. We outline the main SPar features along with the main implementation techniques and tools. Also, we show the results of experiments assessing the feasibility of the entire approach as well as SPar's performance and expressiveness. © 2017 World Scientific Publishing Company.

Keywords: parallel programming

Scaling Score-P to the next level

Procedia Computer Science, 2017, Vol. 108, pp. 2180-2189

Authors: Daniel Lorenz; Christian Feld. Affiliations: Laboratory for Parallel Programming Technische Universität Darmstadt Darmstadt Germany; Jülich Supercomputing Centre Forschungszentrum Jülich GmbH Jülich Germany

As part of performance measurements with Score-P, a description of the system and the execution locations is recorded into the performance measurement reports. For large-scale measurements using a million or more processes, the global system description can consume all the available memory. While the information stored process-locally during measurement is small, the memory requirement becomes a bottleneck in the process of constructing a global representation of the whole system. To address this problem we implemented a new system description in Score-P that exploits regular structures of the system, and results, on homogeneous systems, in a system description of constant size. Furthermore, we present a parallel algorithm to create a global view from the process-local information. The scalable system description comes at the price that it is no longer possible to assign individual names to each system element, but only enumerate elements of the same type. We have successfully tested the new approach on the full JUQUEEN system with up to nearly two million processes.

Keywords: Performance analysis; data compression; exascale computing

Automatic Parallel Pattern Detection in the Algorithm Structure Design Space

30th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Huda, Zia Ul; Atre, Rohit; Jannesari, Ali; Wolf, Felix. Affiliations: Tech Univ Darmstadt Lab Parallel Programming Darmstadt Germany; Rhein Westfal TH Aachen Aachen Germany

ISBN: (print) 9781509021406

Parallel design patterns have been developed to help programmers efficiently design and implement parallel applications. However, identifying a suitable parallel pattern for a specific code region in a sequential application is a difficult task. Transforming an application according to support structures applicable to these parallel patterns is also very challenging. In this paper, we present a novel approach to automatically find parallel patterns in the algorithm structure design space of sequential applications. In our approach, we classify code blocks in a region according to the appropriate support structure of the detected pattern. This classification eases the transformation of a sequential application into its parallel version. We evaluated our approach on 17 applications from four different benchmark suites. Our method identified suitable algorithm structure patterns in the sequential applications. We confirmed our results by comparing them with the existing parallel versions of these applications. We also implemented the patterns we detected in cases in which parallel implementations were not available and achieved speedups of up to 14x.

Keywords: parallelism; parallel patterns; task parallelism

SOLVERS FOR O(N) ELECTRONIC STRUCTURE IN THE STRONG SCALING LIMIT

SIAM Journal on Scientific Computing, 2016, Vol. 38, No. 1, pp. C1-C21

Authors: Bock, Nicolas; Challacombe, Matt; Kale, Laxmikant V. Affiliations: Los Alamos Natl Lab Div Theoret Los Alamos NM 87544 USA; Univ Illinois Dept Comp Sci Parallel Programming Lab Champaign IL 61801 USA

We present a hybrid OpenMP/Charm++ framework for solving the O(N) self-consistent-field eigenvalue problem with parallelism in the strong scaling regime, P ≫ N, where P is the number of cores, and N is a measure of system size, i.e., the number of matrix rows/columns, basis functions, atoms, molecules, etc. This result is achieved with a nested approach to spectral projection and the sparse approximate matrix multiply [Bock and Challacombe, SIAM J. Sci. Comput., 35 (2013), pp. C72-C98], and involves a recursive, task-parallel algorithm, often employed by generalized N-Body solvers, to occlusion and culling of negligible products in the case of matrices with decay. Employing classic technologies associated with generalized N-Body solvers, including overdecomposition, recursive task parallelism, orderings that preserve locality, and persistence-based load balancing, we obtain scaling beyond hundreds of cores per molecule for small water clusters ((H2O)_N, N ∈ {30, 90, 150}, P/N ≈ {819, 273, 164}) and find support for an increasingly strong scalability with increasing system size N.
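As a quick reading aid for the cores-per-molecule figures in this abstract, the short Python sketch below multiplies each reported P/N ratio back by its cluster size N. The abstract does not state P itself and the ratios are rounded, so the products are only approximate; the variable names are illustrative and not taken from the paper.

```python
# Values copied from the abstract: cluster sizes N and rounded ratios P/N.
cluster_sizes = [30, 90, 150]
cores_per_molecule = [819, 273, 164]

# Multiplying each ratio back by N gives an approximate core count P.
implied_cores = [n * r for n, r in zip(cluster_sizes, cores_per_molecule)]

for n, p in zip(cluster_sizes, implied_cores):
    print(f"N = {n:3d}: P is roughly {p}")
```

The three products land within about 0.2% of each other, which is consistent with (but does not prove) all three clusters having been run on roughly the same number of cores.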

Keywords: sparse approximate matrix multiply sparse linear algebra SpAMM reduced complexity algorithm linear scaling quantum chemistry spectral projection N-Body Charm plus matrices with decay parallel irregular space filling curve persistence load balancing overdecomposition

A Learner-Centered Computational Experience in Nanotechnology for Undergraduate STEM Students

6th IEEE Integrated STEM Education Conference (ISEC)

Authors: Asaduzzaman, Abu; Asmatulu, Ramazan. Affiliations: Wichita State Univ Comp Architecture & Parallel Programming Dept Elect Engn & Comp Sci Wichita KS 67260 USA; Wichita State Univ Dept Mech Engn Wichita KS 67260 USA

ISBN: (print) 9781467397735

According to recent studies, the current state of Science, Technology, Engineering, and Mathematics (STEM) education in the U.S. has not been impressive. In this paper, we introduce an interdisciplinary learner-centered computational experience in nanotechnology for undergraduate STEM students. Three important tasks associated with this work are applying power-aware data-regrouping based parallel computation to analyze nanoscale materials; updating and/or developing "hands-on computational experience in nanotechnology" courses; and assessing students' learning experience and interest in high performance computing (HPC) simulation for nanotechnology. The proposed activities have potential to improve motivation, engagement, and learning of STEM students, enhancing the Engaged Student Learning environment. The tasks described in this work incorporate many-core computing, nanomanufacturing, and energy savings, and are aimed at advancing HPC with fundamental understanding of nanostructured fiber behavior, which in turn will allow the use of effective materials for renewable energy conversion. Activities to address industry-oriented real-world problems will attract new students to the STEM education, as the job market in related fields is growing.

Keywords: Interdisciplinary education; parallel computing; nanotechnology; STEM education

Design Productivity of a High Level Synthesis Compiler versus HDL

International Conference on Embedded Computer Systems - Architectures, Modeling and Simulation (SAMOS)

Authors: Pelcat, Maxime; Bourrasset, Cedric; Maggiani, Luca; Berry, Francois. Affiliations: UBL IETR INSA Rennes F-35708 Rennes France; CINES Atos Bull Ctr Excellence Parallel Programming Montpellier France; Inst Pascal F-63178 Aubiere France; Scuola Super Sant Anna Pisa Italy

ISBN: (print) 9781509030767

The complexity of hardware systems is currently growing faster than the productivity of system designers and programmers. This phenomenon is called Design Productivity Gap and results in inflating design costs. In this paper, the notion of Design Productivity is precisely defined, as well as a metric to assess the Design Productivity of a High-Level Synthesis (HLS) method versus a manual hardware description. The proposed Design Productivity metric evaluates the trade-off between design efficiency and implementation quality. The method is generic enough to be used for comparing several HLS methods of different natures, opening opportunities for further progress in Design Productivity. To demonstrate the Design Productivity evaluation method, an HLS compiler based on the CAPH language is compared to manual VHDL writing. The causes that make VHDL lower level than CAPH are discussed. Versions of the sub-pixel interpolation filter from the MPEG HEVC standard are implemented and a design productivity gain of 2.3x in average is measured for the CAPH HLS method. It results from an average gain in design time of 4.4x and an average loss in quality of 1.9x.
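The three figures quoted at the end of this abstract can be related by simple arithmetic. The sketch below assumes, and this is only an assumption since the abstract does not give the formula, that the productivity gain is the design-time gain divided by the quality loss. The variable names are illustrative.

```python
# Figures quoted in the abstract (averages over the implemented filter versions).
design_time_gain = 4.4   # CAPH HLS designs took about 4.4x less design time
quality_loss = 1.9       # at the cost of about 1.9x lower implementation quality

# Assumed relation (not stated in the abstract): gain divided by loss.
productivity_gain = design_time_gain / quality_loss

print(f"productivity gain is about {productivity_gain:.2f}x")
```

Under that assumption the ratio rounds to the 2.3x reported in the abstract. Because the reported values are averages, a ratio of averages need not match the paper's exact computation.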

Keywords: High level synthesis

Automatic Parallel Pattern Detection in the Algorithm Structure Design Space

International Symposium on Parallel and Distributed Processing (IPDPS)

Authors: Zia Ul Huda; Rohit Atre; Ali Jannesari; Felix Wolf. Affiliations: Laboratory for Parallel Programming Technische Universität Darmstadt Darmstadt Germany; RWTH Aachen University Aachen Germany

ISBN: (print) 9781509021413

Parallel design patterns have been developed to help programmers efficiently design and implement parallel applications. However, identifying a suitable parallel pattern for a specific code region in a sequential application is a difficult task. Transforming an application according to support structures applicable to these parallel patterns is also very challenging. In this paper, we present a novel approach to automatically find parallel patterns in the algorithm structure design space of sequential applications. In our approach, we classify code blocks in a region according to the appropriate support structure of the detected pattern. This classification eases the transformation of a sequential application into its parallel version. We evaluated our approach on 17 applications from four different benchmark suites. Our method identified suitable algorithm structure patterns in the sequential applications. We confirmed our results by comparing them with the existing parallel versions of these applications. We also implemented the patterns we detected in cases in which parallel implementations were not available and achieved speedups of up to 14x.

Keywords: Pipelines; Parallel processing; Positron emission tomography; Linear regression; Algorithm design and analysis; Benchmark testing; Software

Power management of extreme-scale networks with on/off links in runtime systems

ACM Transactions on Parallel Computing, 2015, Vol. 1, No. 2, pp. 1–21

Authors: Totoni, Ehsan; Jain, Nikhil; Kale, Laxmikant V. Affiliations: Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana-Champaign Urbana IL 61801 United States

Networks are among major power consumers in large-scale parallel systems. During execution of common parallel applications, a sizeable fraction of the links in the high-radix interconnects are either never used or are underutilized. We propose a runtime system based adaptive approach to turn off unused links, which has various advantages over the previously proposed hardware and compiler based approaches. We discuss why the runtime system is the best system component to accomplish this task, and test the effectiveness of our approach using real applications (including NAMD, MILC), and application benchmarks (including NAS Parallel Benchmarks, Stencil). These codes are simulated on representative topologies such as 6-D Torus and multilevel directly connected network (similar to IBM PERCS in Power 775 and Dragonfly in Cray Aries). For common applications with near-neighbor communication pattern, our approach can save up to 20% of total machine's power and energy, without any performance penalty. © 2015 ACM.

Keywords: Topology

Preventing the explosion of exascale profile data with smart thread-level aggregation

4th Workshop on Extreme Scale Programming Tools, ESPT 2015

Authors: Lorenz, Daniel; Shudler, Sergei; Wolf, Felix. Affiliations: Laboratory for Parallel Programming Technische Universität Darmstadt Darmstadt Germany

ISBN: (print) 9781450339971

State of the art performance analysis tools, such as Score-P, record performance profiles on a per-thread basis. However, for exascale systems the number of threads is expected to be in the order of a billion threads, and this would result in extremely large performance profiles. In most cases the user almost never inspects the individual per-thread data. In this paper, we propose to aggregate per-thread performance data in each process to reduce its amount to a reasonable size. Our goal is to aggregate the threads such that the thread-level performance issues are still visible and analyzable. Therefore, we implemented four aggregation strategies in Score-P: (i) SUM: aggregates all threads of a process into a process profile; (ii) SET: calculates statistical key data as well as the sum; (iii) KEY: identifies three threads (i.e., key threads) of particular interest for performance analysis and aggregates the rest of the threads; (iv) CALLTREE: clusters threads that have the same call-tree structure. For each one of these strategies we evaluate the compression ratio and how they maintain thread-level performance behavior information. The aggregation does not incur any additional performance overhead at application run-time. © 2015 ACM.
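To make the first two strategies named in this abstract concrete, here is a minimal Python sketch of SUM-style and SET-style aggregation over per-thread values for one code region. It is not Score-P code. The function names, the sample data, and the choice of minimum, maximum, and mean as the "statistical key data" are all assumptions made for illustration.

```python
from statistics import mean

def aggregate_sum(per_thread):
    """SUM-style: collapse all threads of a process into one process value."""
    return sum(per_thread)

def aggregate_set(per_thread):
    """SET-style: keep a few statistics next to the sum.

    Which statistics Score-P records is not stated in the abstract;
    min, max and mean are assumed here.
    """
    return {
        "sum": sum(per_thread),
        "min": min(per_thread),
        "max": max(per_thread),
        "mean": mean(per_thread),
        "count": len(per_thread),
    }

# Hypothetical per-thread times (seconds) for one region in one process.
thread_times = [4.0, 4.0, 4.0, 12.0]

process_profile = aggregate_sum(thread_times)
stats = aggregate_set(thread_times)
print(process_profile, stats)
```

With this sample data the sum alone hides the one slow thread, while the gap between the maximum and the mean in the SET-style record still reveals the imbalance. That is the kind of thread-level issue the abstract says should stay visible after aggregation.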

Keywords: Aggregates
