The message passing model, represented by MPI (Message Passing Interface), is the principal parallel programming tool for distributed computer systems. Most MPI programs contain collective communications, which involve all the processes of a parallel program. The effectiveness of collective communications substantially affects the total execution time of a program. In this work, we consider the problem of designing adaptive algorithms for collective communications, using barrier synchronization, one of the most common types of collective communication, as an example. We developed an adaptive barrier synchronization algorithm that suboptimally selects a barrier synchronization scheme for parallel MPI programs from among the Central Counter, Combining Tree, and Dissemination Barrier algorithms. The adaptive algorithm chooses the barrier algorithm with the minimal estimated execution time in the LogP model, which accounts for the performance of the computational resources and of the interconnect for point-to-point communications. The proposed algorithm has been implemented for MPI. We present the results of experiments on cluster systems and analyze how the algorithm selection depends on the LogP parameter values. In particular, for fewer than 20 processes the adaptive algorithm selects Combining Tree, while for a larger number of processes it selects Dissemination Barrier. The developed algorithm reduces the average barrier synchronization time by 4% compared with the most common deterministic barrier algorithms.
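The selection idea can be sketched with simplified LogP-style cost expressions. This is a minimal illustration, not the paper's exact cost model: the formulas below (latency L, send/receive overhead o, gap g, P processes) omit the contention and synchronization terms that make Combining Tree win for small process counts in the paper's experiments.

```python
import math

def barrier_costs(L, o, g, P):
    """Rough LogP-style time estimates for three barrier schemes
    (a simplified sketch, not the paper's exact model)."""
    rounds = math.ceil(math.log2(P))
    return {
        # root gathers P-1 notifications, then releases all processes
        "central_counter": 2 * (P - 1) * max(g, o) + 2 * (L + 2 * o),
        # binary combining tree: gather-up phase plus release-down phase
        "combining_tree": 2 * rounds * (L + 2 * o),
        # dissemination: ceil(log2 P) rounds, one send/recv pair each
        "dissemination": rounds * (L + 2 * o),
    }

def pick_barrier(L, o, g, P):
    """Choose the scheme with the minimal estimated time."""
    costs = barrier_costs(L, o, g, P)
    return min(costs, key=costs.get)
```

Under these simplified formulas Dissemination Barrier always dominates Combining Tree; the adaptive choice only becomes interesting once per-node contention is modeled.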
ISBN (print): 9783030527938; 9783030527945
Modern compute architectures often consist of multiple CPU cores to achieve their performance, as physical properties put a limit on the execution speed of a single processor. This trend is also visible in the embedded and real-time domain, where programmers are forced to parallelize their software to keep deadlines. Additionally, embedded systems rely increasingly on modular applications that can easily be adapted to different system loads and hardware configurations. To parallelize applications under these dynamic conditions, dispatching frameworks like Threading Building Blocks (TBB) are often used in the desktop and server segment. More recently, Embedded Multicore Building Blocks (EMB2) was developed as a task-based programming solution designed with the constraints of embedded systems in mind. In this paper, we discuss how task-based programming fits such systems by analyzing scheduler implementation variants, with a focus on classic work-stealing and the libraries TBB and EMB2. Based on the state of the art, we introduce a novel resource-trading concept that allows static memory allocation in a work-stealing runtime while holding strict space and time bounds. We conduct benchmarks between an early prototype of the concept, TBB, and EMB2, showing that resource-trading does not introduce additional runtime overheads, while unfortunately also not improving on execution-time variances.
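The classic work-stealing structure the paper analyzes can be illustrated with a toy deque: the owning worker pushes and pops tasks at one end (LIFO, good locality), while idle thieves steal from the other end (FIFO, oldest and typically largest tasks). This is an illustrative sketch only; production runtimes such as TBB use lock-free double-ended queues, not a single mutex as here.

```python
import collections
import threading

class WorkStealingDeque:
    """Toy work-stealing deque: owner works LIFO at the bottom,
    thieves steal FIFO from the top. A pedagogical sketch, not
    TBB's or EMB2's actual (lock-free) implementation."""

    def __init__(self):
        self._deque = collections.deque()
        self._lock = threading.Lock()  # real deques use CAS, not a mutex

    def push(self, task):
        """Owner side: enqueue a newly spawned task."""
        with self._lock:
            self._deque.append(task)

    def pop(self):
        """Owner side: take the newest task (cache-warm), or None."""
        with self._lock:
            return self._deque.pop() if self._deque else None

    def steal(self):
        """Thief side: take the oldest task, or None if empty."""
        with self._lock:
            return self._deque.popleft() if self._deque else None
```

Owner and thieves touching opposite ends is what keeps contention low in real schedulers.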
ISBN (print): 9780738143057
Parallel technologies evolve at a fast rate, with new hardware and programming frameworks being introduced every few years. Keeping a Parallel and Distributed Computing (PDC) lecture up to date is a challenge in itself, let alone when one has to consider the synergies with other courses and the shifts in direction that are industry-driven and echo inside the student body. This paper details the process of aligning the parallel and distributed curriculum at the Military Technical Academy of Bucharest (MTA) over the last five years with government and industry demands as well as faculty and student expectations. The result has been an adaptation and update of the previous lectures and assignments on PDC, and the creation of a new course that relies heavily on parallel technologies to provide a modern outlook on software security and the tools used to combat cyber threats. Concepts and assignments originally designed for a PDC course have fit naturally into a new supporting paradigm focused on malicious code (malware) analysis.
ISBN (digital): 9781728156118
ISBN (print): 9781728156118
Nowadays, with tremendous speed and continuous development, the world's connection to the Internet has grown to the point that it has become part of everyday life. As technology evolves, encryption has become a priority for protecting sensitive data from hacking and piracy. However, this process takes a long time to transfer, handle, and process data by the appropriate electronic means. In this paper we present a new methodology that relies on a distributed-memory architecture based on the Message Passing Interface (MPI) to enhance the performance of the sequential algorithm. Our results show that the new model gives a 2x speedup compared to the previous one.
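The abstract does not detail the decomposition, but the usual first step in such an MPI port is to split the data into one contiguous block per rank. The helper below is our own illustrative sketch (the name `block_range` is not from the paper): it computes the half-open index range a given rank would encrypt, distributing any remainder to the lowest ranks.

```python
def block_range(n, size, rank):
    """Half-open [start, stop) block of n items owned by `rank` out of
    `size` MPI processes. Illustrative helper, not from the paper;
    the `size` extra items of an uneven split go to the lowest ranks."""
    base, extra = divmod(n, size)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop
```

In an actual MPI program, each rank would call this with its own `rank` from `MPI_Comm_rank` and encrypt only its slice before the results are gathered.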
ISBN (print): 9780738110448
This work explores the effects of the nonassociativity of floating-point addition on Message Passing Interface (MPI) reduction operations. Previous work indicates that floating-point summation error comprises two independent factors: error based on the summation algorithm and error based on the summands themselves. We find evidence to suggest that, for MPI reductions, the error based on the summands has a much greater effect than the error based on the summation algorithm. We begin by sampling from the state space of all possible summation orders for MPI reduction algorithms. Next, we show the effect of different random number distributions on summation error, taking a 1000-digit-precision floating-point accumulator as ground truth. Our results show empirical error bounds that are much tighter than existing analytical bounds. Last, we simulate different allreduce algorithms on the high-performance computing (HPC) proxy application Nekbone and find that the error is relatively stable across algorithms. Our approach provides HPC application developers with more realistic error bounds for MPI reduction operations. Quantifying the small but nonzero discrepancies between reduction algorithms can help developers ensure correctness and aid reproducibility across MPI implementations and cluster topologies.
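The nonassociativity at the heart of this work is easy to reproduce: summing the same three values in two different orders gives two different results. The snippet below is a minimal illustration (using Python's correctly rounded `math.fsum` as a small stand-in for the paper's 1000-digit accumulator), not the paper's methodology.

```python
import math

def naive_sum(values):
    """Left-to-right accumulation, as one fixed reduction order would do."""
    total = 0.0
    for v in values:
        total += v
    return total

# Same summands, two orders: 1.0 is below the rounding granularity of
# 1e16, so whether it survives depends on when the cancellation happens.
a = [1e16, 1.0, -1e16]   # 1.0 is absorbed into 1e16, then cancelled away
b = [1e16, -1e16, 1.0]   # cancellation happens first, so 1.0 survives
exact = math.fsum(a)     # correctly rounded ground truth: 1.0
```

Different MPI reduction trees impose different such orders, which is exactly why bitwise results can differ across implementations and topologies.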
ISBN (digital): 9783030343569
ISBN (print): 9783030343569; 9783030343552
The natural and design limitations of processor evolution, e.g., frequency scaling and memory-bandwidth bottlenecks, push towards scaling applications across multiple-node configurations in addition to exploiting the power of each single node. This introduces new challenges when porting applications to the new infrastructure, especially in heterogeneous environments. Domain decomposition and handling the resulting necessary communication is not a trivial task. In general, tools cannot automatically decide how to parallelize code because of the semantics of general-purpose languages. To spare scientists such problems, we introduce the Memory-Oblivious Data Access (MODA) technique and use it to scale code to configurations ranging from a single node to multiple nodes, supporting different architectures, without requiring changes in the source code of the application. We present a technique to automatically identify the necessary communication based on higher-level semantics. The extracted information enables tools to generate code that handles the communication. A prototype was developed to implement the techniques and used to evaluate the approach. The results show the effectiveness of the techniques in scaling code on multi-core processors and on GPU-based machines. Comparing the ratio of achieved GFLOPS to the number of nodes in each run, repeated over different numbers of nodes, shows that the achieved scaling efficiency is around 100%. This was repeated with up to 100 nodes. An exception is the single-node configuration using a GPU, in which no communication, and hence no data movement between GPU and host memory, is needed, which yields higher GFLOPS.
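To make "identifying necessary communication" concrete, consider the simplest case: a 1D stencil over an evenly block-decomposed array. The cells a rank reads but does not own are exactly its halo, which must be received from neighbours. This is a simplified sketch of the idea, not MODA's actual analysis, and the function name and even-division assumption are ours.

```python
def halo_indices(n, size, rank, stencil=1):
    """Remote cell indices `rank` must receive for a 1D stencil of
    width `stencil` under an even block decomposition of n cells over
    `size` ranks (assumes size divides n). Illustrative only; MODA
    derives this from higher-level access semantics, not hardcoded."""
    block = n // size
    lo, hi = rank * block, (rank + 1) * block        # owned range
    left = list(range(max(lo - stencil, 0), lo))     # from rank-1
    right = list(range(hi, min(hi + stencil, n)))    # from rank+1
    return left, right
```

A code generator can turn each nonempty halo list directly into a receive (and the mirrored send) for the corresponding neighbour.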
ISBN (print): 9783030576752; 9783030576745
Programming models for task-based parallelization based on compile-time directives are very effective at uncovering the parallelism available in HPC applications. Despite that, the process of correctly annotating complex applications is error-prone and may hinder the general adoption of these models. In this paper, we target the OmpSs-2 programming model and present a novel toolchain able to detect parallelization errors in non-compliant OmpSs-2 applications. Our toolchain verifies compliance with the OmpSs-2 programming model using local task analysis to deal with each task separately, and structural induction to extend the analysis to the whole program. To improve the effectiveness of our tools, we also introduce some ad-hoc verification annotations, which can be used manually or automatically to disable the analysis of specific code regions. Experiments run on a sample of representative kernels and applications show that our toolchain can be successfully used to verify the parallelization of complex real-world applications.
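The kind of error such a toolchain hunts for can be sketched as a race check on annotations: two tasks touch the same variable, at least one access is a write, and neither declares a dependency on it. The checker below is a toy in the spirit of, but not identical to, the paper's local task analysis; all names and the data layout are our own.

```python
def conflicting_tasks(tasks):
    """Report (task1, task2, var) triples where a shared variable is
    written by some task but not covered by declared dependencies on
    both sides. Toy sketch, not the OmpSs-2 toolchain's algorithm.
    Each task is (name, accesses, declared_deps), where accesses maps
    variable -> 'r' or 'w' and declared_deps is a set of variables."""
    bad = []
    for i, (n1, acc1, dep1) in enumerate(tasks):
        for n2, acc2, dep2 in tasks[i + 1:]:
            for var in sorted(set(acc1) & set(acc2)):
                writes = 'w' in (acc1[var], acc2[var])
                declared = var in dep1 and var in dep2
                if writes and not declared:
                    bad.append((n1, n2, var))
    return bad
```

Real directive-based models additionally order tasks by dependency direction (in/out/inout); this sketch only checks that a dependency exists at all.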
ISBN (print): 9781728162515
Some graph analyses, such as those of social and biological networks, need large-scale graph construction and maintenance over a distributed memory space. Distributed data-streaming tools, including MapReduce and Spark, restrict some computational freedom for incremental graph modification and run-time graph visualization. Instead, we take an agent-based approach. We construct a graph from a scientific dataset in CSV, tab, or XML format; dispatch many reactive agents on it; and analyze the graph in the form of their collective group behavior: propagation, flocking, and collision. The key to success is how to automate the run-time construction and visualization of agent-navigable graphs mapped over distributed memory. We implemented this distributed graph-computing support in the multi-agent spatial simulation (MASS) library, coupled with the Cytoscape graph visualization software. This paper presents the MASS implementation techniques and demonstrates its execution performance in comparison to MapReduce and Spark, using two benchmark programs: (1) incremental construction of a complete graph and (2) KD-tree construction.
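For readers unfamiliar with the second benchmark, a KD tree recursively splits points along alternating axes around the median. The sketch below is the generic single-node algorithm; the paper's MASS, MapReduce, and Spark versions build it over distributed memory, which this snippet does not attempt.

```python
def build_kdtree(points, depth=0):
    """Recursive 2-D KD-tree build: split on alternating axes at the
    median point. Generic textbook version, single-node only."""
    if not points:
        return None
    axis = depth % 2                         # 0 -> x split, 1 -> y split
    pts = sorted(points, key=lambda p: p[axis])
    mid = len(pts) // 2                      # median goes in this node
    return {
        "point": pts[mid],
        "left": build_kdtree(pts[:mid], depth + 1),
        "right": build_kdtree(pts[mid + 1:], depth + 1),
    }
```

Each recursion level is independent of its sibling, which is what makes the construction amenable to the agent-based and data-streaming parallelizations the paper benchmarks.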
ISBN (print): 9781450368827
Read-copy update (RCU) can provide ideal scalability for read-mostly workloads, but some believe that it provides only poor performance for updates. This belief is due to the lack of RCU-centric update synchronization mechanisms. RCU instead works with a range of update-side mechanisms, such as locking. In fact, many developers embrace simplicity by using global locking. Logging, hardware transactional memory, and fine-grained locking can provide better scalability, but each of these approaches has limitations, such as imposing overhead on readers or poor scalability on non-uniform memory access (NUMA) systems, mainly due to their lack of NUMA-aware design principles. This paper introduces an RCU extension (RCX) that provides highly scalable RCU updates on NUMA systems while retaining RCU's read-side benefits. RCX is a software-based synchronization mechanism combining hardware transactional memory (HTM) and traditional locking based on our NUMA-aware design principles for RCU. Microbenchmarks on a NUMA system with 144 hardware threads show that RCX has up to 22.6 times better performance and up to 145 times lower HTM abort rates compared to a state-of-the-art RCU/HTM combination. To demonstrate the effectiveness and applicability of RCX, we have applied it to parallelize parts of the Linux kernel memory management system and an in-memory database system. The optimized kernel and database show up to 24 and 17 times better performance than the original versions, respectively.
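The read-side benefit RCX preserves is easiest to see in the basic RCU pattern: readers dereference a pointer without any lock, while writers copy the data, modify the copy, and publish it with a single pointer swap, so in-flight readers keep seeing a consistent old snapshot. The cell below is a toy illustration of that pattern only; it is not the paper's RCX mechanism and omits grace periods and memory reclamation entirely.

```python
import threading

class RcuCell:
    """Toy read-copy-update cell over a dict. Readers take no lock;
    writers copy-modify-publish under a lock. Illustration only:
    no grace-period tracking, no reclamation, not the paper's RCX."""

    def __init__(self, value):
        self._current = value              # the published snapshot
        self._writer_lock = threading.Lock()

    def read(self):
        """Lock-free reader path: grab the current snapshot pointer."""
        return self._current

    def update(self, mutate):
        """Serialized writer path: copy, mutate the copy, publish."""
        with self._writer_lock:
            copy = dict(self._current)     # read-COPY...
            mutate(copy)                   # ...UPDATE the private copy...
            self._current = copy           # ...publish in one assignment
```

A reader that took its snapshot before an update still holds the old, internally consistent version, which is exactly why RCU reads scale so well.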
ISBN (print): 9783030410056; 9783030410049
Performance-portability frameworks allow developers to write code for a familiar High-Performance Computing (HPC) architecture and to minimize the development effort needed over time to port it to other HPC architectures with little to no loss of performance. In our research, we conducted experiments with the same codebase using Serial, OpenMP, and CUDA execution and memory spaces and compared it to the Kokkos performance-portability framework. We assessed how well these approaches meet the goals of performance portability by solving a thermal conduction model on a 2D plate on multiple architectures (NVIDIA K20, P100, V100, and XAVIER; Intel Xeon; IBM POWER9; ARM64) and collected execution times (wall-clock) and performance counters with perf and nvprof for analysis. We used the Serial model to establish a baseline and to confirm that the model converges in both the native and Kokkos code. The OpenMP and CUDA models were used to analyze the parallelization strategy as compared to the Kokkos framework for the same execution and memory spaces.
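The thermal conduction model on a 2D plate is typically iterated with a Jacobi-style stencil: each interior cell is replaced by the average of its four neighbours until the grid converges. The step below is a serial sketch of that kernel under our own assumptions (five-point stencil, fixed boundary values); the paper's exact discretization may differ, and its point is precisely that this loop nest is what gets mapped to OpenMP, CUDA, or Kokkos execution spaces.

```python
def jacobi_step(grid):
    """One Jacobi iteration of 2-D heat diffusion on a plate:
    interior cells become the average of their four neighbours,
    boundary cells stay fixed. Serial sketch; assumed stencil."""
    n, m = len(grid), len(grid[0])
    new = [row[:] for row in grid]          # copy keeps boundaries fixed
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new
```

Every interior cell is independent within a step, so the double loop parallelizes directly, which is what makes this model a good portability benchmark.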