Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, and thus an increasing amount of computing resources is required to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks that allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both features, i.e., good performance as well as flexibility. In this paper, we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems.
This paper introduces SPar, an internal C++ Domain-Specific Language (DSL) that supports the development of classic stream parallel applications. The DSL uses standard C++ attributes to introduce annotations tagging t...
As part of performance measurements with Score-P, a description of the system and the execution locations is recorded into the performance measurement reports. For large-scale measurements using a million or more processes, the global system description can consume all the available memory. While the information stored process-locally during measurement is small, the memory requirement becomes a bottleneck when constructing a global representation of the whole system. To address this problem, we implemented a new system description in Score-P that exploits regular structures of the system and results, on homogeneous systems, in a system description of constant size. Furthermore, we present a parallel algorithm to create a global view from the process-local information. The scalable system description comes at a price: it is no longer possible to assign individual names to each system element, only to enumerate elements of the same type. We have successfully tested the new approach on the full JUQUEEN system with up to nearly two million processes.
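The idea of a constant-size description for a homogeneous system can be illustrated with a minimal Python sketch. It is not Score-P's actual data structure; the `Level` class, the function names, and the rack/node/process hierarchy are hypothetical. The description stores one (type, count) pair per hierarchy level, and elements are identified only by their enumeration index within their type, which mirrors the trade-off described in the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Level:
    """One level of a regular hierarchy (hypothetical representation)."""
    kind: str   # element type, e.g. "node"
    count: int  # number of elements of this type per parent element


def total_locations(levels):
    """Number of leaf elements described by the per-level counts."""
    total = 1
    for level in levels:
        total *= level.count
    return total


def element_path(levels, index):
    """Map a global leaf index to (type, local index) pairs, root first.

    No per-element names are stored; elements are only enumerated.
    """
    if not 0 <= index < total_locations(levels):
        raise IndexError(index)
    path = []
    for level in reversed(levels):
        index, local = divmod(index, level.count)
        path.append((level.kind, local))
    return list(reversed(path))


# Hypothetical homogeneous machine: 4 racks, 32 nodes each, 16 processes each.
levels = [Level("rack", 4), Level("node", 32), Level("process", 16)]
print(total_locations(levels))
print(element_path(levels, 17))
```

The size of `levels` depends only on the depth of the hierarchy, not on the number of processes, which is why such a description stays constant in size as a homogeneous system grows.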
ISBN: 9781509021406 (print)
Parallel design patterns have been developed to help programmers efficiently design and implement parallel applications. However, identifying a suitable parallel pattern for a specific code region in a sequential application is a difficult task. Transforming an application according to support structures applicable to these parallel patterns is also very challenging. In this paper, we present a novel approach to automatically find parallel patterns in the algorithm structure design space of sequential applications. In our approach, we classify code blocks in a region according to the appropriate support structure of the detected pattern. This classification eases the transformation of a sequential application into its parallel version. We evaluated our approach on 17 applications from four different benchmark suites. Our method identified suitable algorithm structure patterns in the sequential applications. We confirmed our results by comparing them with the existing parallel versions of these applications. We also implemented the patterns we detected in cases in which parallel implementations were not available and achieved speedups of up to 14x.
We present a hybrid OpenMP/Charm++ framework for solving the O(N) self-consistent-field eigenvalue problem with parallelism in the strong scaling regime, P >> N, where P is the number of cores and N is a measure of system size, i.e., the number of matrix rows/columns, basis functions, atoms, molecules, etc. This result is achieved with a nested approach to spectral projection and the sparse approximate matrix multiply [Bock and Challacombe, SIAM J. Sci. Comput., 35 (2013), pp. C72-C98], and involves a recursive, task-parallel algorithm, often employed by generalized N-Body solvers, applied to the occlusion and culling of negligible products in the case of matrices with decay. Employing classic technologies associated with generalized N-Body solvers, including overdecomposition, recursive task parallelism, orderings that preserve locality, and persistence-based load balancing, we obtain scaling beyond hundreds of cores per molecule for small water clusters ((H2O)_N, N ∈ {30, 90, 150}, P/N ≈ {819, 273, 164}) and find support for an increasingly strong scalability with increasing system size N.
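As a quick reading aid for the reported ratios, the short Python sketch below multiplies each cluster size N by the quoted cores-per-molecule ratio P/N to recover the implied core count P. The variable names are illustrative, and because the abstract gives the ratios only approximately, the resulting values are estimates rather than figures stated by the authors.

```python
# Cluster sizes N and approximate cores per molecule P/N, as quoted above.
cluster_sizes = [30, 90, 150]
cores_per_molecule = [819, 273, 164]

# Implied total core counts P = N * (P/N); approximate by construction.
implied_cores = [n * r for n, r in zip(cluster_sizes, cores_per_molecule)]

for n, r, p in zip(cluster_sizes, cores_per_molecule, implied_cores):
    print(f"N = {n:3d}, P/N ~ {r:3d}, implied P ~ {p}")
```

Under this reading, all three runs correspond to roughly the same number of cores, about 24.6 thousand, so the decreasing P/N reflects growing N at a nearly fixed P.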
ISBN: 9781467397735 (print)
According to recent studies, the current state of Science, Technology, Engineering, and Mathematics (STEM) education in the U.S. has not been impressive. In this paper, we introduce an interdisciplinary learner-centered computational experience in nanotechnology for undergraduate STEM students. Three important tasks associated with this work are applying power-aware data-regrouping based parallel computation to analyze nanoscale materials; updating and/or developing "hands-on computational experience in nanotechnology" courses; and assessing students' learning experience and interest in high performance computing (HPC) simulation for nanotechnology. The proposed activities have the potential to improve motivation, engagement, and learning of STEM students, enhancing the Engaged Student Learning environment. The tasks described in this work incorporate many-core computing, nanomanufacturing, and energy savings, and are aimed at advancing HPC with fundamental understanding of nanostructured fiber behavior, which in turn will allow the use of effective materials for renewable energy conversion. Activities that address industry-oriented real-world problems will attract new students to STEM education, as the job market in related fields is growing.
ISBN: 9781509030767 (print)
The complexity of hardware systems is currently growing faster than the productivity of system designers and programmers. This phenomenon is called the Design Productivity Gap and results in inflating design costs. In this paper, the notion of Design Productivity is precisely defined, and a metric is introduced to assess the Design Productivity of a High-Level Synthesis (HLS) method versus a manual hardware description. The proposed Design Productivity metric evaluates the trade-off between design efficiency and implementation quality. The method is generic enough to be used for comparing several HLS methods of different natures, opening opportunities for further progress in Design Productivity. To demonstrate the Design Productivity evaluation method, an HLS compiler based on the CAPH language is compared to manual VHDL writing. The causes that make VHDL lower level than CAPH are discussed. Versions of the sub-pixel interpolation filter from the MPEG HEVC standard are implemented, and a design productivity gain of 2.3x on average is measured for the CAPH HLS method. It results from an average gain in design time of 4.4x and an average loss in quality of 1.9x.
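The three reported figures are consistent with a simple ratio, which the following Python sketch checks. It assumes, as a reading of the abstract rather than the paper's stated definition, that the productivity gain is the gain in design time divided by the loss in quality; the function name is hypothetical. Since the abstract reports averages, an average of per-design ratios need not equal the ratio of the averages exactly.

```python
def productivity_gain(design_time_gain, quality_loss):
    """Assumed metric: design-time gain divided by quality loss."""
    if design_time_gain <= 0 or quality_loss <= 0:
        raise ValueError("both factors must be positive")
    return design_time_gain / quality_loss


# Average figures quoted in the abstract: 4.4x time gain, 1.9x quality loss.
gain = productivity_gain(4.4, 1.9)
print(f"{gain:.2f}x")
```

Rounded to one decimal place, this ratio matches the 2.3x gain reported for the CAPH HLS method.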
ISBN: 9781509021413 (print)
Parallel design patterns have been developed to help programmers efficiently design and implement parallel applications. However, identifying a suitable parallel pattern for a specific code region in a sequential application is a difficult task. Transforming an application according to support structures applicable to these parallel patterns is also very challenging. In this paper, we present a novel approach to automatically find parallel patterns in the algorithm structure design space of sequential applications. In our approach, we classify code blocks in a region according to the appropriate support structure of the detected pattern. This classification eases the transformation of a sequential application into its parallel version. We evaluated our approach on 17 applications from four different benchmark suites. Our method identified suitable algorithm structure patterns in the sequential applications. We confirmed our results by comparing them with the existing parallel versions of these applications. We also implemented the patterns we detected in cases in which parallel implementations were not available and achieved speedups of up to 14x.
Networks are among major power consumers in large-scale parallel systems. During execution of common parallel applications, a sizeable fraction of the links in the high-radix interconnects are either never used or are...
State-of-the-art performance analysis tools, such as Score-P, record performance profiles on a per-thread basis. However, for exascale systems the number of threads is expected to be on the order of a billion threads,...