ISBN (print): 9783030178727; 9783030178710
Molecular dynamics (MD) simulation allows for the study of static and dynamic properties of molecular ensembles at various scales, from monatomic systems to macromolecules such as proteins and nucleic acids. It has applications in biology, materials science, biochemistry, and biophysics. Recent developments in simulation techniques spurred the emergence of the computational molecular engineering (CME) field, which focuses specifically on the needs of industrial users in engineering. Within CME, the simulation code ms2 allows users to calculate thermodynamic properties of bulk fluids. It is a parallel code that aims to scale the temporal range of the simulation while keeping the execution time minimal. In this paper, we use empirical performance modeling to study the impact of simulation parameters on the execution time. Our approach is a systematic workflow that can be used as a blueprint in other fields that aim to scale their simulation codes. We show that the generated models can help users better understand how to scale the simulation with minimal increase in execution time.
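To make the modeling step concrete, here is a minimal sketch (assuming Extra-P-style hypothesis functions of the form t(p) = c1 + c2 * p^i * log2(p)^j) that fits a small grid of candidates to runtime measurements and reports the best fit; the measurement data below is hypothetical, not taken from the paper.

    import numpy as np

    def fit_models(p, t):
        """Try a small grid of (i, j) exponents; return the best hypothesis."""
        best = None
        for i in (0, 0.5, 1, 1.5, 2):        # polynomial exponents
            for j in (0, 1, 2):              # logarithmic exponents
                term = p**i * np.log2(p)**j
                A = np.column_stack([np.ones_like(p), term])
                coef, *_ = np.linalg.lstsq(A, t, rcond=None)
                rss = float(np.sum((A @ coef - t) ** 2))
                if best is None or rss < best[0]:
                    best = (rss, i, j, coef)
        return best

    # Hypothetical measurements: runtime at increasing process counts.
    p = np.array([2., 4., 8., 16., 32.])
    t = np.array([1.1, 2.3, 4.9, 10.4, 22.1])    # roughly O(p log p) growth
    rss, i, j, (c1, c2) = fit_models(p, t)
    print(f"t(p) = {c1:.2f} + {c2:.3f} * p^{i} * log2(p)^{j}")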
Properties of the redundant residue number system (RRNS) are used for detecting and correcting errors during data storage, processing, and transmission. However, detection and correction of a single error require significant decoding time because of the iterative calculations needed to locate the error. In this paper, we provide a performance evaluation of the Asmuth-Bloom and Mignotte secret sharing schemes with three different mechanisms for error detection and correction: Projection, Syndrome, and AR-RRNS. We consider the best-case scenario, in which no error occurs, and the worst-case scenario, in which error detection takes the longest time. Examining the overall coding/decoding performance on real data, we show that the AR-RRNS method outperforms Projection and Syndrome by 68% and 52%, respectively, in the worst-case scenario.
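As a minimal illustration of one of these mechanisms, the sketch below implements projection-based detection and correction of a single residue error in a toy RRNS; the moduli are illustrative and the code does not reproduce the evaluated secret sharing schemes.

    from math import prod

    def crt(residues, moduli):
        """Chinese remainder theorem reconstruction."""
        M = prod(moduli)
        x = 0
        for r, m in zip(residues, moduli):
            Mi = M // m
            x += r * Mi * pow(Mi, -1, m)
        return x % M

    INFO = [3, 5, 7]        # information moduli: legitimate range [0, 105)
    REDUNDANT = [11, 13]    # redundancy enables detection and correction
    MODULI = INFO + REDUNDANT
    LEGIT = prod(INFO)

    def encode(x):
        return [x % m for m in MODULI]

    def decode(residues):
        """Projection: drop one residue at a time; the projection that
        falls back into the legitimate range identifies the bad digit."""
        x = crt(residues, MODULI)
        if x < LEGIT:
            return x                                  # no error detected
        for i in range(len(MODULI)):
            y = crt(residues[:i] + residues[i+1:], MODULI[:i] + MODULI[i+1:])
            if y < LEGIT:
                return y                              # residue i was corrupted
        raise ValueError("uncorrectable error")

    code = encode(42)
    code[1] = (code[1] + 2) % MODULI[1]               # inject a single error
    assert decode(code) == 42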
ISBN (print): 9783319969831; 9783319969824
The wall-clock execution time of applications on HPC clusters is commonly subject to run-to-run variation, often caused by external interference from concurrently running jobs. Because this interference is irregular from the perspective of the affected job, performance analysts do not consider it an intrinsic part of application execution, which is why they wish to factor it out when measuring execution time. However, if the chances are high enough that at least one interference event strikes while the job is running, merely repeating runs several times and picking the fastest one does not guarantee a measurement free of external influence. In this paper, we present a novel approach to estimating the impact of sporadic and high-impact interference on bulk-synchronous MPI applications. An evaluation with several realistic benchmarks shows that the impact of interference can be estimated from just a single run.
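The following sketch conveys the underlying intuition (it is an illustrative estimator, not necessarily the paper's): in a bulk-synchronous application, sporadic interference shows up as outlier iteration durations, which can be detected robustly and replaced by a baseline to estimate the interference-free runtime from one run.

    import numpy as np

    def interference_free_estimate(iter_times, k=3.0):
        """Estimate total runtime with interference factored out.

        Iterations whose duration exceeds median + k * MAD are treated as
        hit by interference and replaced by the median duration."""
        t = np.asarray(iter_times, dtype=float)
        med = np.median(t)
        mad = np.median(np.abs(t - med))
        hit = t > med + k * mad
        clean = np.where(hit, med, t)
        return clean.sum(), int(hit.sum())

    # Hypothetical trace: 100 ms iterations, two struck by interference.
    trace = [0.10] * 20
    trace[5], trace[13] = 0.45, 0.31
    est, hits = interference_free_estimate(trace)
    print(f"estimated interference-free runtime: {est:.2f}s ({hits} events)")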
ISBN (print): 9781728160276
The analysis of runtime performance is important during the development and throughout the life cycle of HPC applications. One important objective in performance analysis is to identify regions in the code that show significant runtime increases with larger problem sizes or more processes. One approach to identifying such regions is empirical performance modeling, i.e., building performance models based on measurements. While the modeling itself has already been streamlined and automated, generating the required measurements is time consuming and tedious. In this paper, we propose an approach that automatically adjusts the instrumentation to reduce overhead and focus the measurements on relevant regions, i.e., those that show increasing runtime with larger input parameters or an increasing number of MPI ranks. Our approach employs Extra-P to generate performance models, which it then uses to extrapolate runtime and, finally, to decide which functions should be kept for measurement. The analysis also expands the instrumentation by heuristically adding functions based on static source-code features. We evaluate our approach using benchmarks from SPEC CPU 2006, SU2, and parallel MILC. The evaluation shows that our approach can filter out functions of little interest and generate profiles that contain mostly relevant regions. For example, the overhead for SU2 can be reduced automatically from 200% to 11% compared to filtered Score-P measurements.
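The decision step can be sketched as follows, assuming per-function runtime models are already available; the models, the target scale, and the share threshold below are hypothetical.

    import math

    # Hypothetical per-function models: name -> runtime t(p) in seconds.
    models = {
        "solve":     lambda p: 0.5 + 0.02 * p * math.log2(p),  # grows fast
        "exchange":  lambda p: 0.1 + 0.01 * p,                 # grows mildly
        "log_stats": lambda p: 0.003,                          # constant
    }

    def keep_for_measurement(models, target_p, share_threshold=0.01):
        """Keep a function if its extrapolated runtime at target_p exceeds
        the given share of the total extrapolated runtime."""
        at_target = {f: m(target_p) for f, m in models.items()}
        total = sum(at_target.values())
        return {f for f, t in at_target.items() if t / total >= share_threshold}

    print(keep_for_measurement(models, target_p=4096))
    # -> {'solve', 'exchange'}; the constant-time helper is filtered out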
Deep learning has proven to be a successful tool for solving a large variety of problems in various scientific fields and beyond. In recent years, the models as well as the available datasets have grown bigger and more complicated, so an increasing amount of computing resources is required to train these models in a reasonable amount of time. Besides being able to use HPC resources, deep learning model developers want flexible frameworks that allow for rapid prototyping. One of the most important of these frameworks is Google TensorFlow, which provides both good performance and flexibility. In this paper, we discuss different solutions for scaling the TensorFlow framework to thousands of nodes on contemporary Cray XC supercomputing systems.
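The mechanism that such scaling solutions ultimately rely on is synchronous data-parallel training with gradient reduction across nodes. The sketch below illustrates that mechanism with mpi4py and NumPy instead of TensorFlow itself; the model, data, and hyperparameters are hypothetical.

    # Run with: mpirun -np 4 python sketch.py
    import numpy as np
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()

    rng = np.random.default_rng(seed=rank)      # each rank owns a data shard
    w = np.zeros(8)                             # replicated model weights

    for step in range(100):
        x = rng.standard_normal((32, 8))        # local mini-batch
        y = x @ np.arange(8.0) + 0.1 * rng.standard_normal(32)
        grad = -2 * x.T @ (y - x @ w) / len(y)  # gradient of the MSE loss

        avg = np.empty_like(grad)               # average gradients globally
        comm.Allreduce(grad, avg, op=MPI.SUM)
        w -= 0.05 * (avg / size)                # synchronous SGD update

    if rank == 0:
        print("learned weights:", np.round(w, 2))   # approaches 0..7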
Several approaches implement efficient BFS algorithms for multicores and for GPUs. However, when targeting heterogeneous architectures, it is still an open problem how to distribute the work among the CPU cores and the accelerators. In this paper, we assess several approaches to performing BFS on heterogeneous chips comprising a multicore CPU and an integrated GPU. In particular, we propose three heterogeneous approaches that exploit collaboration between both devices: Selective, Concurrent, and Asynchronous. We identify how to take advantage of the features of social-network graphs, which are a particular example of highly connected graphs (requiring fewer iterations but exhibiting more unbalanced work), as well as the drawbacks of each algorithmic implementation. One key feature of our approaches is that they switch between different versions of the algorithm depending on the device that collaborates in the computation. Through exhaustive evaluation, we find that our heterogeneous implementations can be up to 1.56x faster and 1.32x more energy efficient than the best baseline that uses only one device, with the overhead relative to an oracle scheduler staying below 10%. We also compare with another related heterogeneous approach, finding that ours can be up to 3.6x faster.
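The idea of switching between algorithm variants at runtime can be illustrated with the classic top-down/bottom-up BFS heuristic shown below; note that this sketch switches on frontier size, whereas the paper's approaches switch on the collaborating device.

    def bfs(adj, source):
        """Level-synchronous BFS that switches variants by frontier size."""
        n = len(adj)
        dist = [-1] * n
        dist[source] = 0
        frontier, level = {source}, 0
        while frontier:
            level += 1
            if len(frontier) < n / 4:            # small frontier: top-down
                nxt = {w for v in frontier for w in adj[v] if dist[w] == -1}
            else:                                # large frontier: bottom-up
                nxt = {v for v in range(n) if dist[v] == -1
                       and any(dist[u] == level - 1 for u in adj[v])}
            for v in nxt:
                dist[v] = level
            frontier = nxt
        return dist

    # Tiny undirected example graph as adjacency lists.
    adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
    print(bfs(adj, 0))                           # -> [0, 1, 1, 2, 3]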
ISBN (print): 9781538676493
In this paper, we investigate two implementations of the LLL lattice basis reduction algorithm in the popular NTL and fplll libraries, which helps to assess the security of lattice-based cryptographic schemes. The work has two main contributions. First, we present a novel method to develop performance models, using the unpredictability of LLL's behavior, which depends on the structure of the input lattice, as an illustrative example. The model generation approach is based on profiled training measurements of the code; the final runtime performance models are constructed by an extended version of the open-source tool Extra-P, which systematically considers a variety of hypothesis functions via shared-memory-parallelized simulated annealing. We employ three kinds of lattice bases for our tests: random lattice bases of Goldstein-Mayer form with linear and quadratic growth in the bit length of their entries, and NTRU-like matrices. The derived performance models fit the experimental data very well and vary widely in complexity, which we compare to predictions by theoretical upper bounds and previous average-case estimates. The modeling principles demonstrated with the LLL use case are directly applicable to other algorithms in cryptography and to general serial and parallel algorithms. Second, we evaluate the common approach of estimating the runtime from the number of floating-point or bit operations executed within an algorithm, combined with theoretical assumptions about the executing processor (clock rate, operations per tick). Our experiments show that this approach leads to unreliable runtime estimates.
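To give a flavor of the hypothesis search, here is a compact, single-threaded sketch of simulated annealing over the exponents of t(n) = c1 + c2 * n^i * log2(n)^j; the neighbor move, cooling schedule, and exponent bounds are illustrative choices, not Extra-P's.

    import numpy as np

    def rss(i, j, n, t):
        """Least-squares residual for t = c1 + c2 * n^i * log2(n)^j."""
        A = np.column_stack([np.ones_like(n), n**i * np.log2(n)**j])
        coef, *_ = np.linalg.lstsq(A, t, rcond=None)
        return float(np.sum((A @ coef - t) ** 2))

    def anneal(n, t, steps=2000, temp=1.0, cool=0.995):
        rng = np.random.default_rng(0)
        state, cost = (1.0, 0.0), rss(1.0, 0.0, n, t)     # start from t ~ n
        best, best_cost = state, cost
        for _ in range(steps):
            i = float(np.clip(state[0] + rng.normal(0, 0.25), 0, 4))
            j = float(np.clip(state[1] + rng.normal(0, 0.25), 0, 3))
            c = rss(i, j, n, t)
            if c < cost or rng.random() < np.exp((cost - c) / temp):
                state, cost = (i, j), c
                if cost < best_cost:
                    best, best_cost = state, cost
            temp *= cool
        return best

    n = np.array([64., 128., 256., 512., 1024.])
    t = 0.01 * n**2                     # synthetic quadratic runtimes
    print(anneal(n, t))                 # typically finds exponents near (2, 0)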
The recently developed Threaded Many-core Memory (TMM) model provides a framework for analyzing algorithms for highly threaded many-core machines such as GPUs and Cray supercomputers. In particular, it tries to capture the fact that these machines hide memory latencies through the use of a large number of threads and large memory bandwidth. The TMM model analysis contains two components: computational and memory complexity. A model is only useful if it can explain and predict empirical data. In this work, we investigate the effectiveness of the TMM model. Under this model, we analyze algorithms for five classic problems (suffix tree/array for string matching, fast Fourier transform, merge sort, list ranking, and all-pairs shortest paths) on a variety of GPUs. We also analyze memory access, matrix multiplication, and a sequence alignment algorithm on a set of Cray XMT supercomputers and on the latest NVIDIA and AMD GPUs. We compare the results of the analysis with our own experimental findings and those of other researchers who have implemented and measured the performance of these algorithms on a spectrum of diverse GPUs and Cray machines. We find that the TMM model is able to predict important, non-trivial, and sometimes previously unexplained trends and artifacts in the experimental data.
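One simplified reading of the TMM idea can be expressed as follows: with enough threads in flight, latency is hidden and runtime is bounded by the slower of the compute and memory subsystems. The formula and machine parameters in this sketch are placeholders for illustration, not the TMM model's actual definitions.

    def predicted_time(compute_ops, mem_accesses, machine):
        t_compute = compute_ops / machine["flops_per_sec"]
        # Latency-bound if too few threads; bandwidth-bound otherwise.
        t_latency = mem_accesses * machine["latency_sec"] / machine["threads"]
        t_bandwidth = (mem_accesses * machine["bytes_per_access"]
                       / machine["bandwidth"])
        return max(t_compute, t_latency, t_bandwidth)

    gpu = {"flops_per_sec": 5e12, "latency_sec": 400e-9,
           "threads": 20000, "bytes_per_access": 4, "bandwidth": 300e9}

    # Hypothetical kernel: 1e10 flops, 2e9 memory accesses.
    print(f"predicted: {predicted_time(1e10, 2e9, gpu) * 1e3:.1f} ms")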
As part of performance measurements with Score-P, a description of the system and the execution locations is recorded in the performance measurement reports. For large-scale measurements using a million or more processes, the global system description can consume all the available memory. While the information stored process-locally during measurement is small, the memory requirement becomes a bottleneck when constructing a global representation of the whole system. To address this problem, we implemented a new system description in Score-P that exploits regular structures of the system and results, on homogeneous systems, in a system description of constant size. Furthermore, we present a parallel algorithm to create a global view from the process-local information. The scalable system description comes at the price that it is no longer possible to assign individual names to each system element; elements of the same type can only be enumerated. We have successfully tested the new approach on the full JUQUEEN system with up to nearly two million processes.
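The space saving can be pictured with a small sketch: a homogeneous system tree is described by one (type, children-per-parent) pair per level, so its description size is constant in the number of processes, and any element is recovered by index arithmetic instead of a stored name. Level names and fan-outs below are hypothetical.

    LEVELS = [("machine", 1), ("rack", 28), ("node", 1024), ("core", 16)]

    def total_elements(levels):
        """Expand the implicit tree sizes without materializing the tree."""
        counts, n = [], 1
        for name, fanout in levels:
            n *= fanout
            counts.append((name, n))
        return counts

    def element_path(levels, leaf_index):
        """Locate one leaf by index arithmetic alone; enumeration replaces
        per-element names."""
        path = []
        for name, fanout in reversed(levels):
            leaf_index, pos = divmod(leaf_index, fanout)
            path.append((name, pos))
        return list(reversed(path))

    print(total_elements(LEVELS))        # four tuples describe 458752 cores
    print(element_path(LEVELS, 12345))   # rack/node/core position of core 12345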
Simulations of many real-world problems are executed on high-performance computing systems, but the power consumption of these systems is a growing concern because large simulations consume ever more energy. In this context, load balancers emerge as a promising way to support computational science methods. In response to this challenge, we developed a new heterogeneous energy-aware load balancer called H-ENERGYLB that reduces the average power demand of systems with heterogeneous processors and saves energy when scientific applications with imbalanced load are executed. Our new load-balancing strategy combines dynamic load balancing with DVFS techniques: tasks are remapped to mitigate the workload imbalance, and the clock frequency of underloaded computing cores, which experience some residual imbalance even after the remapping, is reduced. Experiments with three applications on two different heterogeneous architectures show that H-ENERGYLB yields average power reductions of 7.14% and average energy savings of 36.6% compared to other load balancers.
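A minimal sketch of the combined strategy, with hypothetical task costs and frequency levels (real DVFS would use the platform's discrete P-states): remap tasks greedily, then scale each core's clock to its residual load share.

    F_MAX, F_MIN = 2.4, 1.2   # GHz

    def greedy_remap(task_loads, n_cores):
        """Longest-processing-time-first mapping of tasks to cores."""
        cores = [0.0] * n_cores
        mapping = {}
        for tid, load in sorted(enumerate(task_loads), key=lambda x: -x[1]):
            target = min(range(n_cores), key=cores.__getitem__)
            cores[target] += load
            mapping[tid] = target
        return mapping, cores

    def dvfs_frequencies(core_loads):
        """Scale each core's clock to its residual load share."""
        peak = max(core_loads)
        return [max(F_MIN, F_MAX * load / peak) for load in core_loads]

    tasks = [8, 5, 5, 4, 3, 2, 1, 1]          # hypothetical task costs
    mapping, loads = greedy_remap(tasks, 4)
    print("per-core load:", loads)
    print("per-core GHz: ", [round(f, 2) for f in dvfs_frequencies(loads)])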