ISBN (print): 9783030178727; 9783030178710
Molecular dynamics (MD) simulation allows for the study of static and dynamic properties of molecular ensembles at various scales, from monatomic systems to macromolecules such as proteins and nucleic acids. It has applications in biology, materials science, biochemistry, and biophysics. Recent developments in simulation techniques spurred the emergence of the computational molecular engineering (CME) field, which focuses specifically on the needs of industrial users in engineering. Within CME, the simulation code ms2 allows users to calculate thermodynamic properties of bulk fluids. It is a parallel code that aims to scale the temporal range of the simulation while keeping the execution time minimal. In this paper, we use empirical performance modeling to study the impact of simulation parameters on the execution time. Our approach is a systematic workflow that can be used as a blueprint in other fields that aim to scale their simulation codes. We show that the generated models can help users better understand how to scale the simulation with minimal increase in execution time.
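To make the modeling step concrete, here is a minimal sketch (assuming Extra-P-style hypothesis functions of the form t(p) = c1 + c2 * p^i * log2(p)^j) that fits a small grid of candidates to runtime measurements and reports the best fit; the measurement data below is hypothetical, not taken from the paper.

    import numpy as np

    def fit_models(p, t):
        """Try a small grid of (i, j) exponents; return the best hypothesis."""
        best = None
        for i in (0, 0.5, 1, 1.5, 2):        # polynomial exponents
            for j in (0, 1, 2):              # logarithmic exponents
                term = p**i * np.log2(p)**j
                A = np.column_stack([np.ones_like(p), term])
                coef, *_ = np.linalg.lstsq(A, t, rcond=None)
                rss = float(np.sum((A @ coef - t) ** 2))
                if best is None or rss < best[0]:
                    best = (rss, i, j, coef)
        return best

    # Hypothetical measurements: runtime at increasing process counts.
    p = np.array([2., 4., 8., 16., 32.])
    t = np.array([1.1, 2.3, 4.9, 10.4, 22.1])    # roughly O(p log p) growth
    rss, i, j, (c1, c2) = fit_models(p, t)
    print(f"t(p) = {c1:.2f} + {c2:.3f} * p^{i} * log2(p)^{j}")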
Properties of the redundant residue number system (RRNS) are used for detecting and correcting errors during data storage, processing, and transmission. However, detection and correction of a single error require significant decoding time because of the iterative calculations needed to locate the error. In this paper, we provide a performance evaluation of the Asmuth-Bloom and Mignotte secret sharing schemes with three different mechanisms for error detection and correction: Projection, Syndrome, and AR-RRNS. We consider the best-case scenario, in which no error occurs, and the worst-case scenario, in which error detection takes the longest time. Examining the overall coding/decoding performance on real data, we show that the AR-RRNS method outperforms Projection and Syndrome by 68% and 52%, respectively, in the worst-case scenario.
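As a minimal illustration of one of these mechanisms, the sketch below implements projection-based detection and correction of a single residue error in a toy RRNS; the moduli are illustrative and the code does not reproduce the evaluated secret sharing schemes.

    from math import prod

    def crt(residues, moduli):
        """Chinese remainder theorem reconstruction."""
        M = prod(moduli)
        x = 0
        for r, m in zip(residues, moduli):
            Mi = M // m
            x += r * Mi * pow(Mi, -1, m)
        return x % M

    INFO = [3, 5, 7]        # information moduli: legitimate range [0, 105)
    REDUNDANT = [11, 13]    # redundancy enables detection and correction
    MODULI = INFO + REDUNDANT
    LEGIT = prod(INFO)

    def encode(x):
        return [x % m for m in MODULI]

    def decode(residues):
        """Projection: drop one residue at a time; the projection that
        falls back into the legitimate range identifies the bad digit."""
        x = crt(residues, MODULI)
        if x < LEGIT:
            return x                                  # no error detected
        for i in range(len(MODULI)):
            y = crt(residues[:i] + residues[i+1:], MODULI[:i] + MODULI[i+1:])
            if y < LEGIT:
                return y                              # residue i was corrupted
        raise ValueError("uncorrectable error")

    code = encode(42)
    code[1] = (code[1] + 2) % MODULI[1]               # inject a single error
    assert decode(code) == 42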
ISBN (print): 9783319969831; 9783319969824
The wall-clock execution time of applications on HPC clusters is commonly subject to run-to-run variation, often caused by external interference from concurrently running jobs. Because this interference is irregular from the perspective of the affected job, performance analysts do not consider it an intrinsic part of application execution, which is why they wish to factor it out when measuring execution time. However, if the chances are high enough that at least one interference event strikes while the job is running, merely repeating runs several times and picking the fastest one does not guarantee a measurement free of external influence. In this paper, we present a novel approach to estimating the impact of sporadic and high-impact interference on bulk-synchronous MPI applications. An evaluation with several realistic benchmarks shows that the impact of interference can be estimated from just a single run.
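The following sketch conveys the underlying intuition (it is an illustrative estimator, not necessarily the paper's): in a bulk-synchronous application, sporadic interference shows up as outlier iteration durations, which can be detected robustly and replaced by a baseline to estimate the interference-free runtime from one run.

    import numpy as np

    def interference_free_estimate(iter_times, k=3.0):
        """Estimate total runtime with interference factored out.

        Iterations whose duration exceeds median + k * MAD are treated as
        hit by interference and replaced by the median duration."""
        t = np.asarray(iter_times, dtype=float)
        med = np.median(t)
        mad = np.median(np.abs(t - med))
        hit = t > med + k * mad
        clean = np.where(hit, med, t)
        return clean.sum(), int(hit.sum())

    # Hypothetical trace: 100 ms iterations, two struck by interference.
    trace = [0.10] * 20
    trace[5], trace[13] = 0.45, 0.31
    est, hits = interference_free_estimate(trace)
    print(f"estimated interference-free runtime: {est:.2f}s ({hits} events)")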
ISBN (print): 9781728160276
The analysis of runtime performance is important during the development and throughout the life cycle of HPC applications. One important objective in performance analysis is to identify regions in the code that show significant runtime increases with larger problem sizes or more processes. One approach to identifying such regions is empirical performance modeling, i.e., building performance models based on measurements. While the modeling itself has already been streamlined and automated, generating the required measurements is time consuming and tedious. In this paper, we propose an approach that automatically adjusts the instrumentation to reduce overhead and focus the measurements on relevant regions, i.e., those that show increasing runtime with larger input parameters or an increasing number of MPI ranks. Our approach employs Extra-P to generate performance models, which it then uses to extrapolate runtime and, finally, to decide which functions should be kept for measurement. The analysis also expands the instrumentation by heuristically adding functions based on static source-code features. We evaluate our approach using benchmarks from SPEC CPU 2006, SU2, and parallel MILC. The evaluation shows that our approach can filter out functions of little interest and generate profiles that contain mostly relevant regions. For example, the overhead for SU2 can be reduced automatically from 200% to 11% compared to filtered Score-P measurements.
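The decision step can be sketched as follows, assuming per-function runtime models are already available; the models, the target scale, and the share threshold below are hypothetical.

    import math

    # Hypothetical per-function models: name -> runtime t(p) in seconds.
    models = {
        "solve":     lambda p: 0.5 + 0.02 * p * math.log2(p),  # grows fast
        "exchange":  lambda p: 0.1 + 0.01 * p,                 # grows mildly
        "log_stats": lambda p: 0.003,                          # constant
    }

    def keep_for_measurement(models, target_p, share_threshold=0.01):
        """Keep a function if its extrapolated runtime at target_p exceeds
        the given share of the total extrapolated runtime."""
        at_target = {f: m(target_p) for f, m in models.items()}
        total = sum(at_target.values())
        return {f for f, t in at_target.items() if t / total >= share_threshold}

    print(keep_for_measurement(models, target_p=4096))
    # -> {'solve', 'exchange'}; the constant-time helper is filtered out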
Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, so an increasing amount of computing resources is required to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks that allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both good performance and flexibility. In this paper, we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems.
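The mechanism that such scaling solutions ultimately rely on is synchronous data-parallel training with gradient reduction across nodes. The sketch below illustrates that mechanism with mpi4py and NumPy instead of TensorFlow itself; the model, data, and hyperparameters are hypothetical.

    # Run with: mpirun -np 4 python sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    rng = np.random.default_rng(seed=rank)      # each rank owns a data shard
    w = np.zeros(8)                             # replicated model weights

    for step in range(100):
        x = rng.standard_normal((32, 8))        # local mini-batch
        y = x @ np.arange(8.0) + 0.1 * rng.standard_normal(32)
        grad = -2 * x.T @ (y - x @ w) / len(y)  # gradient of the MSE loss

        avg = np.empty_like(grad)               # average gradients globally
        comm.Allreduce(grad, avg, op=MPI.SUM)
        w -= 0.05 * (avg / size)                # synchronous SGD update

    if rank == 0:
        print("learned weights:", np.round(w, 2))   # approaches 0..7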
Several approaches implement efficient BFS algorithms for multicores and for GPUs. However, when targeting heterogeneous architectures, it is still an open problem how to distribute the work among the CPU cores and the accelerators. In this paper, we assess several approaches to performing BFS on heterogeneous chips comprising a multicore CPU and an integrated GPU. In particular, we propose three heterogeneous approaches that exploit collaboration between both devices: Selective, Concurrent, and Asynchronous. We identify how to take advantage of the features of social-network graphs, which are a particular example of highly connected graphs (requiring fewer iterations but exhibiting more unbalanced work), as well as the drawbacks of each algorithmic implementation. One key feature of our approaches is that they switch between different versions of the algorithm depending on the device that collaborates in the computation. Through exhaustive evaluation, we find that our heterogeneous implementations can be up to 1.56x faster and 1.32x more energy efficient than the best baseline that uses only one device, with the overhead relative to an oracle scheduler staying below 10%. We also compare with another related heterogeneous approach, finding that ours can be up to 3.6x faster.
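The idea of switching between algorithm variants at runtime can be illustrated with the classic top-down/bottom-up BFS heuristic shown below; note that this sketch switches on frontier size, whereas the paper's approaches switch on the collaborating device.

    def bfs(adj, source):
        """Level-synchronous BFS that switches variants by frontier size."""
        n = len(adj)
        dist = [-1] * n
        dist[source] = 0
        frontier, level = {source}, 0
        while frontier:
            level += 1
            if len(frontier) < n / 4:            # small frontier: top-down
                nxt = {w for v in frontier for w in adj[v] if dist[w] == -1}
            else:                                # large frontier: bottom-up
                nxt = {v for v in range(n) if dist[v] == -1
                       and any(dist[u] == level - 1 for u in adj[v])}
            for v in nxt:
                dist[v] = level
            frontier = nxt
        return dist

    # Tiny undirected example graph as adjacency lists.
    adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
    print(bfs(adj, 0))                           # -> [0, 1, 1, 2, 3]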
ISBN (print): 9781538676493
In this paper, we investigate two implementations of the LLL lattice basis reduction algorithm in the popular NTL and fplll libraries, which helps to assess the security of lattice-based cryptographic schemes. The work has two main contributions. First, we present a novel method to develop performance models, using the unpredictability of LLL's behavior, which depends on the structure of the input lattice, as an illustrative example. The model generation approach is based on profiled training measurements of the code; the final runtime performance models are constructed by an extended version of the open-source tool Extra-P, which systematically considers a variety of hypothesis functions via shared-memory-parallelized simulated annealing. We employ three kinds of lattice bases for our tests: random lattice bases of Goldstein-Mayer form with linear and quadratic growth in the bit length of their entries, and NTRU-like matrices. The derived performance models fit the experimental data very well and vary widely in complexity, which we compare to predictions by theoretical upper bounds and previous average-case estimates. The modeling principles demonstrated with the LLL use case are directly applicable to other algorithms in cryptography and to general serial and parallel algorithms. Second, we evaluate the common approach of estimating the runtime from the number of floating-point or bit operations executed within an algorithm, combined with theoretical assumptions about the executing processor (clock rate, operations per tick). Our experiments show that this approach leads to unreliable runtime estimates.
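To give a flavor of the hypothesis search, here is a compact, single-threaded sketch of simulated annealing over the exponents of t(n) = c1 + c2 * n^i * log2(n)^j; the neighbor move, cooling schedule, and exponent bounds are illustrative choices, not Extra-P's.

    import numpy as np

    def rss(i, j, n, t):
        """Least-squares residual for t = c1 + c2 * n^i * log2(n)^j."""
        A = np.column_stack([np.ones_like(n), n**i * np.log2(n)**j])
        coef, *_ = np.linalg.lstsq(A, t, rcond=None)
        return float(np.sum((A @ coef - t) ** 2))

    def anneal(n, t, steps=2000, temp=1.0, cool=0.995):
        rng = np.random.default_rng(0)
        state, cost = (1.0, 0.0), rss(1.0, 0.0, n, t)     # start from t ~ n
        best, best_cost = state, cost
        for _ in range(steps):
            i = float(np.clip(state[0] + rng.normal(0, 0.25), 0, 4))
            j = float(np.clip(state[1] + rng.normal(0, 0.25), 0, 3))
            c = rss(i, j, n, t)
            if c < cost or rng.random() < np.exp((cost - c) / temp):
                state, cost = (i, j), c
                if cost < best_cost:
                    best, best_cost = state, cost
            temp *= cool
        return best

    n = np.array([64., 128., 256., 512., 1024.])
    t = 0.01 * n**2                     # synthetic quadratic runtimes
    print(anneal(n, t))                 # typically finds exponents near (2, 0)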
The recently developed Threaded Many-core Memory (TMM) model provides a framework for analyzing algorithms for highly threaded many-core machines such as GPUs and Cray supercomputers. In particular, it tries to capture the fact that these machines hide memory latencies through the use of a large number of threads and large memory bandwidth. The TMM model analysis contains two components: computational and memory complexity. A model is only useful if it can explain and predict empirical data. In this work, we investigate the effectiveness of the TMM model. Under this model, we analyze algorithms for five classic problems (suffix tree/array for string matching, fast Fourier transform, merge sort, list ranking, and all-pairs shortest paths) on a variety of GPUs. We also analyze memory access, matrix multiplication, and a sequence alignment algorithm on a set of Cray XMT supercomputers and on the latest NVIDIA and AMD GPUs. We compare the results of the analysis with our own experimental findings and those of other researchers who have implemented and measured the performance of these algorithms on a spectrum of diverse GPUs and Cray machines. We find that the TMM model is able to predict important, non-trivial, and sometimes previously unexplained trends and artifacts in the experimental data.
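One simplified reading of the TMM idea can be expressed as follows: with enough threads in flight, latency is hidden and runtime is bounded by the slower of the compute and memory subsystems. The formula and machine parameters in this sketch are placeholders for illustration, not the TMM model's actual definitions.

    def predicted_time(compute_ops, mem_accesses, machine):
        t_compute = compute_ops / machine["flops_per_sec"]
        # Latency-bound if too few threads; bandwidth-bound otherwise.
        t_latency = mem_accesses * machine["latency_sec"] / machine["threads"]
        t_bandwidth = (mem_accesses * machine["bytes_per_access"]
                       / machine["bandwidth"])
        return max(t_compute, t_latency, t_bandwidth)

    gpu = {"flops_per_sec": 5e12, "latency_sec": 400e-9,
           "threads": 20000, "bytes_per_access": 4, "bandwidth": 300e9}

    # Hypothetical kernel: 1e10 flops, 2e9 memory accesses.
    print(f"predicted: {predicted_time(1e10, 2e9, gpu) * 1e3:.1f} ms")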
As part of performance measurements with Score-P, a description of the system and the execution locations is recorded in the performance measurement reports. For large-scale measurements using a million or more processes, the global system description can consume all the available memory. While the information stored process-locally during measurement is small, the memory requirement becomes a bottleneck when constructing a global representation of the whole system. To address this problem, we implemented a new system description in Score-P that exploits regular structures of the system and results, on homogeneous systems, in a system description of constant size. Furthermore, we present a parallel algorithm to create a global view from the process-local information. The scalable system description comes at the price that it is no longer possible to assign individual names to each system element; elements of the same type can only be enumerated. We have successfully tested the new approach on the full JUQUEEN system with up to nearly two million processes.
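The space saving can be pictured with a small sketch: a homogeneous system tree is described by one (type, children-per-parent) pair per level, so its description size is constant in the number of processes, and any element is recovered by index arithmetic instead of a stored name. Level names and fan-outs below are hypothetical.

    LEVELS = [("machine", 1), ("rack", 28), ("node", 1024), ("core", 16)]

    def total_elements(levels):
        """Expand the implicit tree sizes without materializing the tree."""
        counts, n = [], 1
        for name, fanout in levels:
            n *= fanout
            counts.append((name, n))
        return counts

    def element_path(levels, leaf_index):
        """Locate one leaf by index arithmetic alone; enumeration replaces
        per-element names."""
        path = []
        for name, fanout in reversed(levels):
            leaf_index, pos = divmod(leaf_index, fanout)
            path.append((name, pos))
        return list(reversed(path))

    print(total_elements(LEVELS))        # four tuples describe 458752 cores
    print(element_path(LEVELS, 12345))   # rack/node/core position of core 12345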
Simulations of many real-world problems are executed on high-performance computing systems, but the power consumption of these systems is a growing concern because large simulations consume ever more energy. In this context, load balancers emerge as a promising way to support computational science methods. In response to this challenge, we developed a new heterogeneous energy-aware load balancer called H-ENERGYLB that reduces the average power demand of systems with heterogeneous processors and saves energy when scientific applications with imbalanced load are executed. Our new load-balancing strategy combines dynamic load balancing with DVFS techniques: tasks are remapped to mitigate the workload imbalance, and the clock frequency of underloaded computing cores, which experience some residual imbalance even after the remapping, is reduced. Experiments with three applications on two different heterogeneous architectures show that H-ENERGYLB yields average power reductions of 7.14% and average energy savings of 36.6% compared to other load balancers.
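A minimal sketch of the combined strategy, with hypothetical task costs and frequency levels (real DVFS would use the platform's discrete P-states): remap tasks greedily, then scale each core's clock to its residual load share.

    F_MAX, F_MIN = 2.4, 1.2   # GHz

    def greedy_remap(task_loads, n_cores):
        """Longest-processing-time-first mapping of tasks to cores."""
        cores = [0.0] * n_cores
        mapping = {}
        for tid, load in sorted(enumerate(task_loads), key=lambda x: -x[1]):
            target = min(range(n_cores), key=cores.__getitem__)
            cores[target] += load
            mapping[tid] = target
        return mapping, cores

    def dvfs_frequencies(core_loads):
        """Scale each core's clock to its residual load share."""
        peak = max(core_loads)
        return [max(F_MIN, F_MAX * load / peak) for load in core_loads]

    tasks = [8, 5, 5, 4, 3, 2, 1, 1]          # hypothetical task costs
    mapping, loads = greedy_remap(tasks, 4)
    print("per-core load:", loads)
    print("per-core GHz: ", [round(f, 2) for f in dvfs_frequencies(loads)])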