ISBN:
(print) 9780738110707
A common simplification made when modeling the performance of a parallel program is the assumption that the performance behavior of all processes or threads is largely uniform. Empirical performance-modeling tools such as Extra-P exploit this common pattern to make their modeling process more noise resilient, mitigating the effect of outliers by summarizing performance measurements of individual functions across all processes. While the underlying assumption does not hold equally for all applications, knowing the qualitative differences in how the performance of individual processes changes as execution parameters are varied can reveal important performance bottlenecks such as malicious patterns of load imbalance. A challenge for empirical modeling tools, however, arises from the fact that the behavioral class of a process may depend on the process configuration, letting process ranks migrate between classes as the number of processes grows. In this paper, we introduce a novel approach to the problem of modeling spatially diverging performance based on a certain type of process clustering. We apply our technique to identify a previously unknown performance bottleneck in the BoSSS fluid-dynamics code. Removing it made the code regions in question run up to 20 times faster and the application as a whole up to 4.5 times faster.
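The idea of grouping ranks into behavioral classes can be sketched as follows. This is an illustrative toy, not Extra-P's actual algorithm; the helper names are hypothetical. Each rank's measured runtimes across problem sizes are reduced to a scaling exponent, and ranks whose exponents lie close together form one class, so a rank that scales quadratically stands out from its linearly scaling peers.

```python
# Illustrative sketch (not Extra-P's actual algorithm): group MPI ranks into
# behavioral classes by the growth rate of their measured runtime as an
# execution parameter (here, problem size) is varied.
import math

def growth_exponent(sizes, times):
    """Least-squares slope of log(time) vs. log(size): time ~ size**slope."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(t) for t in times]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
           sum((x - mx) ** 2 for x in xs)

def classify_ranks(sizes, per_rank_times, tol=0.25):
    """Cluster ranks whose scaling exponents differ by less than `tol`."""
    classes = []  # list of (representative exponent, [rank ids])
    for rank, times in enumerate(per_rank_times):
        e = growth_exponent(sizes, times)
        for rep, members in classes:
            if abs(e - rep) < tol:
                members.append(rank)
                break
        else:
            classes.append((e, [rank]))
    return classes
```

With measurements at sizes 100, 200, 400, two ranks whose time doubles with size land in one class, while a rank whose time quadruples forms a class of its own, which is exactly the kind of spatial divergence the paper targets.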
ISBN:
(digital) 9781119431152
ISBN:
(print) 9781119431107
A complete textbook and reference for engineers to learn the fundamentals of computer programming with modern C++ Introduction to Programming with C++ for Engineers is an original presentation teaching the fundamentals of computer programming and modern C++ to engineers and engineering students. Professor Cyganek, a highly regarded expert in his field, walks users through the basics of data structures and algorithms with the help of a core subset of C++ and the Standard Library, progressing to the object-oriented domain and advanced C++ features, computer arithmetic, memory management, and the essentials of parallel programming, showing with real-world examples how to complete tasks. He also guides users through the software development process and good programming practices, not shying away from explaining low-level features and programming tools. Being a textbook with summarizing tables and diagrams, the book also serves as a highly useful reference for C++ programmers at all levels. Introduction to Programming with C++ for Engineers teaches how to program by: Guiding users from simple techniques with modern C++ and the Standard Library, to more advanced object-oriented design methods and language features Providing meaningful examples that facilitate understanding of the programming techniques and the C++ language constructions Fostering good programming practices which create better professional programmers Minimizing text descriptions, opting instead for comprehensive figures, tables, diagrams, and other explanatory material Granting access to a complementary website that contains example code and useful links to resources that further improve the reader's coding ability Including test and exam questions for the reader's review at the end of each chapter Engineering students, students of other sciences who rely on computer programming, and professionals in various fields will find this book invaluable when learning to program with C++.
ISBN:
(print) 9783030576752; 9783030576745
Current scientific workflows are large and complex. They normally perform thousands of simulations whose results, combined with searching and data-analytics algorithms to infer new knowledge, generate a very large amount of data. To this end, workflows comprise many tasks, and some of them may fail. Most of the work on failure management in workflow managers and runtimes focuses on recovering from failures caused by resources (retrying or resubmitting the failed computation on other resources, etc.). However, some failures can be caused by the application itself (corrupted data, algorithms that do not converge under certain conditions, etc.), and these fault-tolerance mechanisms are not sufficient to achieve a successful workflow execution. In these cases, developers have to add code to their applications to prevent and manage the possible failures. In this paper, we propose a simple interface and a set of transparent runtime mechanisms to simplify how scientists deal with application-based failures in task-based parallel workflows. We have validated our proposal with use cases from e-science and machine learning to show the benefits of the proposed interface and mechanisms in terms of programming productivity and performance.
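An interface of this kind might look like the following sketch. The decorator name and its options are hypothetical, not the actual API proposed in the paper; the point is that the scientist declares, per task, how an application-level failure should be handled, and the runtime applies the policy transparently.

```python
# Minimal sketch of an application-level failure-management interface for
# task-based workflows (hypothetical names, not the paper's actual API).
import functools

def task(retries=0, on_failure="fail", default=None):
    """Wrap a workflow task: retry it `retries` times; if it still raises,
    either re-raise ("fail") or return a placeholder result ("ignore")."""
    def decorate(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(retries + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception:
                    if attempt == retries and on_failure == "fail":
                        raise
            return default  # "ignore": let the rest of the workflow continue
        return wrapper
    return decorate

@task(retries=2, on_failure="ignore", default=None)
def simulate(x):
    # stands in for a simulation that fails to converge for some inputs
    if x < 0:
        raise ValueError("algorithm did not converge")
    return x ** 0.5
```

With this policy, a non-converging simulation yields a placeholder result instead of aborting the whole workflow, which is the productivity benefit the abstract describes.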
The message-passing model, represented by MPI (Message Passing Interface), is the principal parallel programming tool for distributed computer systems. Most MPI programs contain collective communications, which involve all the processes of a parallel program. The effectiveness of collective communications substantially affects the total execution time of a program. In this work, we consider the problem of designing adaptive algorithms for collective communications, using barrier synchronization, one of the most common types of collective communication, as an example. We developed an adaptive barrier-synchronization algorithm that suboptimally selects the barrier scheme in parallel MPI programs among the Central Counter, Combining Tree, and Dissemination Barrier algorithms. The adaptive algorithm chooses the barrier algorithm with the minimal estimated execution time in the LogP model, which accounts for the performance of computational resources and of the interconnect for point-to-point communications. The proposed algorithm has been implemented for MPI. We present the results of experiments on cluster systems and analyse how algorithm selection depends on the LogP parameter values. In particular, for fewer than 20 processes the adaptive algorithm selects Combining Tree, while for a larger number of processes it selects Dissemination Barrier. The developed algorithm reduces the average time of barrier synchronization by 4% compared with the most common deterministic barrier algorithms.
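The selection step can be sketched as follows. The LogP cost formulas below are simplified illustrative assumptions, not the exact model from the paper: L is latency, o is per-message send/receive overhead, and g is the gap between consecutive messages.

```python
# Sketch of adaptive barrier selection under (simplified, assumed) LogP costs.

def t_central_counter(P, L, o, g):
    # every process notifies a central root, which then releases all of them;
    # the root serializes both phases at one message per gap
    return 2 * ((P - 1) * max(g, o) + L + o)

def t_combining_tree(P, L, o, g, k=2):
    # gather up and release down a k-ary tree
    depth, nodes = 0, 1
    while nodes < P:          # depth = ceil(log_k P), computed exactly
        nodes *= k
        depth += 1
    return 2 * depth * (k * max(g, o) + L + o)

def t_dissemination(P, L, o, g):
    # ceil(log2 P) rounds; each process sends and receives one message per round
    rounds, nodes = 0, 1
    while nodes < P:
        nodes *= 2
        rounds += 1
    return rounds * (L + 2 * o + g)

def select_barrier(P, L, o, g):
    """Pick the scheme with the smallest LogP time estimate,
    as the adaptive algorithm does."""
    times = {
        "central_counter": t_central_counter(P, L, o, g),
        "combining_tree": t_combining_tree(P, L, o, g),
        "dissemination": t_dissemination(P, L, o, g),
    }
    return min(times, key=times.get)
```

Because the central counter grows linearly in P while the tree-based schemes grow logarithmically, the estimates naturally favor different algorithms at different scales, which is the behavior the experiments report.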
ISBN:
(print) 9783030527938; 9783030527945
Modern compute architectures often consist of multiple CPU cores to achieve their performance, as physical properties put a limit on the execution speed of a single processor. This trend is also visible in the embedded and real-time domain, where programmers are forced to parallelize their software to meet deadlines. Additionally, embedded systems rely increasingly on modular applications that can easily be adapted to different system loads and hardware configurations. To parallelize applications under these dynamic conditions, dispatching frameworks like Threading Building Blocks (TBB) are often used in the desktop and server segment. More recently, Embedded Multicore Building Blocks (EMB2) was developed as a task-based programming solution designed with the constraints of embedded systems in mind. In this paper, we discuss how task-based programming fits such systems by analyzing scheduler implementation variants, with a focus on classic work stealing and the libraries TBB and EMB2. Based on the state of the art, we introduce a novel resource-trading concept that allows static memory allocation in a work-stealing runtime while holding strict space and time bounds. We conduct benchmarks between an early prototype of the concept, TBB, and EMB2, showing that resource trading does not introduce additional runtime overheads, while unfortunately also not improving on execution-time variances.
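For readers unfamiliar with the baseline, classic work stealing can be sketched in a few lines. This toy uses a fixed per-worker capacity to hint at why static memory allocation matters for strict space bounds; the resource-trading mechanism itself is out of scope here, and the class names are hypothetical.

```python
# Toy single-threaded model of work-stealing deques with a static capacity.
# Real runtimes (TBB, EMB2) use concurrent deques; this only shows the policy.
from collections import deque

class Worker:
    def __init__(self, capacity):
        self.tasks = deque()
        self.capacity = capacity   # static bound, fixed at allocation time

    def push(self, task):
        if len(self.tasks) >= self.capacity:
            raise MemoryError("deque full: would violate the space bound")
        self.tasks.append(task)

    def pop(self):                 # LIFO from own tail: cache-friendly
        return self.tasks.pop() if self.tasks else None

    def steal_from(self, victim):  # FIFO from the victim's head
        return victim.tasks.popleft() if victim.tasks else None
```

The owner works newest-first while thieves take the oldest task, which tends to hand over large unexplored subtrees; bounding the deque capacity is what a truly static allocation scheme must guarantee never overflows.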
ISBN:
(print) 9780738143057
Parallel technologies evolve at a fast rate, with new hardware and programming frameworks being introduced every few years. Keeping a Parallel and Distributed Computing (PDC) lecture up to date is a challenge in itself, let alone when one has to consider the synergies with other courses and the industry-driven shifts in direction that echo inside the student body. This paper details the process of aligning the parallel and distributed curriculum at the Military Technical Academy of Bucharest (MTA) over the last five years with government and industry demands as well as faculty and student expectations. The result has been an adaptation and update of the previous lectures and assignments on PDC, and the creation of a new course that relies heavily on parallel technologies to provide a modern outlook on software security and the tools used to combat cyber threats. Concepts and assignments originally designed for a PDC course have molded perfectly into a new supporting paradigm focused on malicious-code (malware) analysis.
ISBN:
(digital) 9781728156118
ISBN:
(print) 9781728156118
Nowadays, with tremendous speed and continuous development, the world's connection to the Internet has increased to the point that it has become part of daily life. As technology evolves, encryption has become a priority to protect sensitive data from hacking and piracy. However, this process takes a long time to transfer, handle, and process data by the appropriate electronic means. In this paper we present a new methodology that relies on a distributed-memory architecture based on the Message Passing Interface (MPI) to enhance the performance of the sequential algorithm. Our results show that the new model gives a 2x speedup compared to the previous one.
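The abstract does not name the cipher, so the following sketch only shows the distributed-memory pattern: the message is split into contiguous blocks, one per MPI rank, and each rank encrypts its block independently. A toy XOR "cipher" stands in for the real algorithm, and the ranks are simulated in a loop; with mpi4py, each rank would run on its own slice concurrently.

```python
# Block decomposition of the data across ranks, as an MPI program would do.

def block_bounds(n, size, rank):
    """Contiguous slice [lo, hi) of n bytes owned by `rank` of `size` ranks."""
    base, rem = divmod(n, size)
    lo = rank * base + min(rank, rem)
    hi = lo + base + (1 if rank < rem else 0)
    return lo, hi

def xor_encrypt(data, key, offset=0):
    """Toy XOR 'cipher' standing in for the real algorithm; `offset` keeps the
    repeating key aligned with the byte's position in the full message."""
    return bytes(b ^ key[(offset + i) % len(key)] for i, b in enumerate(data))

def parallel_encrypt(data, key, size=4):
    """Simulate `size` ranks, each encrypting only its own block."""
    out = bytearray(len(data))
    for rank in range(size):          # in MPI, each rank runs independently
        lo, hi = block_bounds(len(data), size, rank)
        out[lo:hi] = xor_encrypt(data[lo:hi], key, offset=lo)
    return bytes(out)
```

Passing the block's global offset into the per-rank call is the detail that makes the parallel result bit-identical to the sequential one; without it, each rank would restart the key stream at its block boundary.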
ISBN:
(print) 9780738110448
This work explores the effects of nonassociativity of floating-point addition on Message Passing Interface (MPI) reduction operations. Previous work indicates floating-point summation error comprises two independent factors: error based on the summation algorithm and error based on the summands themselves. We find evidence to suggest, for MPI reductions, the error based on summands has a much greater effect than the error based on the summation algorithm. We begin by sampling from the state space of all possible summation orders for MPI reduction algorithms. Next, we show the effect of different random number distributions on summation error, taking a 1000-digit precision floating-point accumulator as ground truth. Our results show empirical error bounds that are much tighter than existing analytical bounds. Last, we simulate different allreduce algorithms on the high performance computing (HPC) proxy application Nekbone and find that the error is relatively stable across algorithms. Our approach provides HPC application developers with more realistic error bounds of MPI reduction operations. Quantifying the small but nonzero discrepancies between reduction algorithms can help developers ensure correctness and aid reproducibility across MPI implementations and cluster topologies.
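The effect under study can be demonstrated in a few lines: the same summands reduced in different orders (left-to-right, as a serial reduction, versus pairwise, as a tree-shaped allreduce) give different floating-point results, and `math.fsum` plays the role of the high-precision ground-truth accumulator. This is a generic illustration of nonassociativity, not the paper's sampling methodology.

```python
import math

def linear_sum(xs):
    """Left-to-right summation, like a serial reduction."""
    acc = 0.0
    for x in xs:
        acc += x
    return acc

def tree_sum(xs):
    """Pairwise summation, like a binomial-tree allreduce."""
    if len(xs) == 1:
        return xs[0]
    mid = len(xs) // 2
    return tree_sum(xs[:mid]) + tree_sum(xs[mid:])

# The same summands, three different answers (exact value is 2.0):
xs = [1e16, 1.0, -1e16, 1.0]
# linear_sum(xs) -> 1.0   (the first 1.0 is absorbed by 1e16)
# tree_sum(xs)   -> 0.0   (both 1.0s are absorbed before the halves cancel)
# math.fsum(xs)  -> 2.0   (exact summation, the "ground truth")
```

The summands here are deliberately adversarial; as the paper observes, which values are being summed can matter more than which reduction order the algorithm happens to use.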
ISBN:
(digital) 9783030343569
ISBN:
(print) 9783030343569; 9783030343552
The natural and design limitations of processor evolution, e.g., frequency scaling and memory-bandwidth bottlenecks, push towards scaling applications across multiple-node configurations in addition to exploiting the power of each single node. This introduces new challenges in porting applications to the new infrastructure, especially in heterogeneous environments. Domain decomposition and handling the resulting necessary communication is not a trivial task. In general, tools cannot decide automatically how to parallelize code, as a result of the semantics of general-purpose languages. To spare scientists such problems, we introduce the Memory-Oblivious Data Access (MODA) technique, and use it to scale code to configurations ranging from a single node to multiple nodes, supporting different architectures, without requiring changes to the source code of the application. We present a technique to automatically identify the necessary communication based on higher-level semantics. The extracted information enables tools to generate code that handles the communication. A prototype was developed to implement the techniques and used to evaluate the approach. The results show the effectiveness of the techniques in scaling code on multi-core processors and on GPU-based machines. Comparing the ratios of the achieved GFLOPS to the number of nodes in each run, repeated for different numbers of nodes, shows that the achieved scaling efficiency is around 100%. This was repeated with up to 100 nodes. An exception is the single-node configuration using a GPU, in which no communication is needed, and hence no data movement between GPU and host memory is needed, which yields higher GFLOPS.
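The core idea of deriving communication from higher-level semantics can be illustrated on a 1-D stencil. This is a hand-rolled toy, not the actual MODA implementation: given the stencil's declared access offsets and a block decomposition of the grid, it computes which remote cells each rank must receive before updating its block, i.e., the halo exchange a tool could then generate code for.

```python
# Derive necessary communication from an access pattern (illustrative only).

def block_of(i, n, size):
    """Owner rank of cell i under a near-even block decomposition."""
    base, rem = divmod(n, size)
    cut = rem * (base + 1)        # the first `rem` ranks own base+1 cells
    return i // (base + 1) if i < cut else rem + (i - cut) // base

def halo_cells(n, size, rank, offsets):
    """Remote cells that `rank` reads, grouped by the rank that owns them."""
    base, rem = divmod(n, size)
    lo = rank * base + min(rank, rem)
    hi = lo + base + (1 if rank < rem else 0)
    needed = {}
    for i in range(lo, hi):       # every cell this rank updates
        for off in offsets:       # every cell the stencil reads for it
            j = i + off
            if 0 <= j < n and not (lo <= j < hi):
                needed.setdefault(block_of(j, n, size), set()).add(j)
    return needed
```

For a three-point stencil (offsets -1, 0, +1) on 12 cells over 3 ranks, the middle rank needs exactly one boundary cell from each neighbor; the same bookkeeping, applied to the real access semantics, is what lets tools insert the data movement without source-code changes.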
ISBN:
(print) 9783030576752; 9783030576745
Programming models for task-based parallelization based on compile-time directives are very effective at uncovering the parallelism available in HPC applications. Despite that, the process of correctly annotating complex applications is error-prone and may hinder the general adoption of these models. In this paper, we target the OmpSs-2 programming model and present a novel toolchain able to detect parallelization errors coming from non-compliant OmpSs-2 applications. Our toolchain verifies compliance with the OmpSs-2 programming model using local task analysis to deal with each task separately, and structural induction to extend the analysis to the whole program. To improve the effectiveness of our tools, we also introduce some ad-hoc verification annotations, which can be used manually or automatically to disable the analysis of specific code regions. Experiments run on a sample of representative kernels and applications show that our toolchain can be successfully used to verify the parallelization of complex real-world applications.
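The flavor of the local task analysis can be conveyed with a toy check. This is a deliberately simplified model, not the paper's toolchain: each task declares its data dependencies (in/out, as OmpSs-2 annotations do), and the check reports any variable the task body accesses that its annotations fail to cover, task by task, before an induction step would extend the result to the whole program.

```python
# Toy per-task compliance check: declared dependencies vs. actual accesses.

def check_task(declared_in, declared_out, reads, writes):
    """Return the set of accesses not covered by the task's annotations
    (empty set means the task is compliant)."""
    missing = set()
    # a read must be declared as an input or an output (out implies access)
    missing |= set(reads) - set(declared_in) - set(declared_out)
    # a write must be declared as an output
    missing |= set(writes) - set(declared_out)
    return missing
```

A task that reads `a`, and reads and writes `b`, with `in(a) out(b)` annotations passes, while one that silently writes an undeclared variable is flagged, which is the class of annotation error the toolchain is built to catch.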