Search results - Inner Mongolia University Library

ISC High Performance 2025 Research Paper Proceedings (40th International Conference)

Authors: Anh Tran, Ignacio Laguna, Konstantinos Parasyris, Giorgis Georgakoudis, Ganesh Gopalakrishnan; Kahlert School of Computing, University of Utah, Utah, USA; Lawrence Livermore National Laboratory, Center for Applied Scientific Computing, California, USA

ISBN: (digital) 9783982633619

There is an unmet need for static data race checkers that can analyze incomplete programs typical of early program development stages, and are also easy to adapt to different parallel programming models. In this work, we present a novel race checking approach based on Graph Neural Networks (GNN) called GORC that has these attributes. GORC is trained on ProGraML control/data graph representations extracted from OpenMP programs that are labeled as racy or race-free, and helps predict races in unseen OpenMP programs. We provide a detailed evaluation of GORC, demonstrating that our approach can deliver high accuracy while also handling many more programs than existing static race checkers. Despite the scarcity of training data, GORC achieves a higher recall rate than LLOV, a widely cited static race checker for OpenMP. It outperforms state-of-the-art ML-based techniques for OpenMP data race detection on three different datasets. This paper describes GORC's architecture, detailed evaluations, and a novel attribution study that confirms that GORC is learning features relevant to producing data race classifications.

Keywords: Adaptation models; Analytical models; Accuracy; Parallel programming; Training data; Detectors; Benchmark testing; Graph neural networks; Data models
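For readers new to the term used in this record, a data race is commonly defined as two accesses to the same memory location from different threads, at least one of which is a write, with nothing ordering them. The sketch below encodes only that definition over a hand-written access log. It is not GORC, which the abstract describes as a GNN over program graphs. The `Access` record, its `lock` field, and `find_races` are hypothetical names introduced for illustration, and treating "both accesses hold the same lock" as the only form of ordering is a simplifying assumption.

```python
from dataclasses import dataclass
from itertools import combinations
from typing import Optional


@dataclass(frozen=True)
class Access:
    """One memory access in a hypothetical execution log."""
    thread: int
    location: str
    is_write: bool
    lock: Optional[str] = None  # name of the lock held during the access, if any


def find_races(accesses):
    """Return every pair of accesses that matches the data race definition."""
    races = []
    for a, b in combinations(accesses, 2):
        same_location = a.location == b.location
        different_threads = a.thread != b.thread
        one_writes = a.is_write or b.is_write
        # Assumption: a common lock is the only synchronization we model.
        protected = a.lock is not None and a.lock == b.lock
        if same_location and different_threads and one_writes and not protected:
            races.append((a, b))
    return races


# Two threads updating a shared accumulator without protection.
racy_log = [Access(0, "sum", True), Access(1, "sum", True)]
# The same updates, each made while holding one common lock.
safe_log = [
    Access(0, "sum", True, lock="critical"),
    Access(1, "sum", True, lock="critical"),
]

print(len(find_races(racy_log)))  # 1
print(len(find_races(safe_log)))  # 0
```

A checker of this kind needs a complete access log, which is the limitation the abstract points to when it mentions incomplete programs.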

Compiler-Aided Correctness Checking of CUDA-Aware MPI Applications

High Performance Computing, Networking, Storage and Analysis, SC-W: Workshops of the International Conference for

Authors: Alexander Hück, Tim Ziegler, Simon Schwitanski, Joachim Jenke, Christian Bischof; Technical University Darmstadt, Darmstadt, Germany; RWTH Aachen University, Aachen, Germany

ISBN: (digital) 9798350355543

ISBN: (print) 9798350355550

Hybrid MPI + X models, combining the Message Passing Interface (MPI) with node-level parallel programming models, increase complexity and introduce additional correctness issues. This work addresses the challenges of detecting data races in hybrid CUDA-aware MPI applications due to the asynchronous and non-blocking nature of CUDA and MPI APIs. We introduce CuSan, an LLVM compiler extension, and runtime that tracks CUDA-specific concurrency, synchronization, and memory access semantics. We integrate CuSan with MUST, a dynamic MPI correctness tool, and ThreadSanitizer (TSan), a thread-level data race detector. MUST with TSan can already detect concurrency issues for multi-threaded MPI codes. Together with CuSan, these tools allow for comprehensive correctness checking of concurrency issues in CUDA-aware MPI applications. Our evaluation of two mini-apps reveals runtime overhead of CuSan ranging from 6× to 36×, depending on the amount of memory tracked by TSan, compared to the uninstrumented version. Memory overhead consistently remains under 1.8×. CuSan is available at https://***/tudasc/cusan.

Keywords: Concurrent computing; Runtime; Parallel programming; Semantics; Memory management; Graphics processing units; Optical fiber networks; Optical fiber devices; Synchronization; Message systems

mpiPython: Extensions of Collective Operations

International Conference on Information and Computer Technologies (ICICT)

Authors: Judah Nava, Jaden Jinu Lee, Hanku Lee; Computer Science and Information Systems, Minnesota State University Moorhead, Moorhead, MN

ISBN: (digital) 9798350385625

ISBN: (print) 9798350385632

Despite performance limitations due to its interpreted nature, Python remains a dominant language among scientists and engineers. Enhancing its capabilities for parallel programming unlocks significant potential within parallel and cloud computing environments. mpiPython, a Python binding for message-passing interfaces, empowers Python for Single Program Multiple Data (SPMD) execution, enabling efficient parallel computations. Additionally, Python's inherent accessibility and versatility foster a growing demand for scaling and parallelizing it on distributed cloud environments. This paper extends mpiPython, bridging the gap in collective operations for parallel computing. The extension builds upon the original mpiPython's class-based structure, emphasizing two core principles: supporting vanilla Python with MPI and focusing on a C-based CPU-focused implementation. Unlike existing implementations like mpi4py, mpiPython directly interacts with the Python C API, offering greater control. Two new functions, MPI Gather and MPI Reduce, significantly improve efficiency and streamline collective operations between working nodes. The results demonstrate mpiPython's ability to perform at the level of other libraries while prioritizing a simple implementation accessible to a broad range of users.

Keywords: Cloud computing; Technological innovation; Reviews; Parallel programming; Parallel processing; Performance gain; Libraries
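The two collectives named in this abstract have simple semantics in the MPI standard: a gather delivers every rank's contribution to a root rank in rank order, and a reduce combines all contributions with an operator and delivers the single result to the root. The sketch below models those semantics in one Python process with plain lists. It is not mpiPython's API, which this record does not show, and it performs no real communication. The function names `gather` and `reduce_to_root` are hypothetical.

```python
import operator
from functools import reduce


def gather(per_rank_values, root=0):
    """Model of a gather: the root ends up with all values in rank order.

    Returns what each rank holds afterwards (None for non-root ranks).
    """
    return [
        list(per_rank_values) if rank == root else None
        for rank in range(len(per_rank_values))
    ]


def reduce_to_root(per_rank_values, op=operator.add, root=0):
    """Model of a reduce: the root ends up with op applied across all ranks."""
    combined = reduce(op, per_rank_values)
    return [
        combined if rank == root else None
        for rank in range(len(per_rank_values))
    ]


# Each of four simulated ranks contributes the square of its rank number.
values = [rank * rank for rank in range(4)]

print(gather(values))          # [[0, 1, 4, 9], None, None, None]
print(reduce_to_root(values))  # [14, None, None, None]
```

In an SPMD program every rank would run the same script and call the collective once; the list index here stands in for the rank number.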

Towards a Scalable and Efficient PGAS-based Distributed OpenMP

arXiv, 2024

Authors: Shan, Baodi; Araya-Polo, Mauricio; Chapman, Barbara; Stony Brook University, Stony Brook, NY 11794, United States; TotalEnergies EP Research & Technology US LLC, Houston, TX 77002, United States

MPI+X has been the de facto standard for distributed memory parallel programming. It is widely used primarily as an explicit two-sided communication model, which often leads to complex and error-prone code. Alternatively, the PGAS model utilizes efficient one-sided communication and more intuitive communication primitives. In this paper, we present a novel approach that integrates PGAS concepts into the OpenMP programming model, leveraging the LLVM compiler infrastructure and the GASNet-EX communication library. Our model addresses the complexity associated with traditional MPI+OpenMP programming models while ensuring excellent performance and scalability. We evaluate our approach using a set of micro-benchmarks and application kernels on two distinct platforms: Ookami from Stony Brook University and NERSC Perlmutter. The results demonstrate that DiOMP achieves superior bandwidth and lower latency compared to MPI+OpenMP, up to 25% higher bandwidth and down to 45% on latency. DiOMP offers a promising alternative to the traditional MPI+OpenMP hybrid programming model, towards providing a more productive and efficient way to develop high-performance parallel applications for distributed memory systems. © 2024, CC BY.

Keywords: Parallel programming

Chaining Transactions for Effective Concurrency Management in Hardware Transactional Memory

IEEE/ACM International Symposium on Microarchitecture (MICRO)

Authors: Víctor Nicolás-Conesa, Rubén Titos-Gil, Ricardo Fernández-Pascual, Manuel E. Acacio, Alberto Ros; Computer Engineering Department, University of Murcia, Murcia, Spain

ISBN: (digital) 9798350350579

ISBN: (print) 9798350350586

Hardware Transactional Memory (HTM) offers the opportunity to ease parallel programming. However, driven by hardware limitations, commercial implementations eschew the complexity involved in early sophisticated proposals from academia, and, among other things, opt for simple conflict resolution policies that inevitably increase transaction aborts. To increase thread level parallelism, previous works propose conflict resolution schemes that, instead of aborting, add a second level of speculation consisting in using not-yet-committed data from another transaction. This policy, which we refer to as requester-speculates, has not yet been considered in the context of the kind of best-effort HTM support provided by commercial processors. This work proposes CHAining TransactionS (CHATS), a simple yet effective realization of the requester-speculates conflict resolution policy in which cyclic dependencies between transactions are avoided and the commit ordering respects the dependencies that transactions make once speculative values are communicated. The ultimate result is a best-effort HTM implementation that forces a partial order between transactions in a way that ensures effective utilization of forwarded data and that gets away from the complexity of previous proposals. Simulations using gem5 demonstrate the effectiveness of CHATS in both commercial-like setups and academic state-of-the-art best-effort systems (22% and 16% reduction in execution time, on average, respectively). These improvements are achieved by requiring less than 280 bytes of extra storage.

Keywords: Concurrent computing; Context; Microarchitecture; Parallel programming; Instruction sets; Memory management; Parallel processing; Hardware; Complexity theory; Proposals
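The abstract states two properties of the requester-speculates policy: forwarding must never create a cyclic dependency between transactions, and a transaction may commit only after the transactions it consumed data from. The sketch below illustrates those two properties at the level of a dependency graph in software. It says nothing about how CHATS realizes them in hardware, which this record does not describe. The class `ForwardingGraph` and its methods are hypothetical names introduced for illustration.

```python
from collections import defaultdict


class ForwardingGraph:
    """Tracks which transactions consumed uncommitted data from which others."""

    def __init__(self):
        # consumer -> set of producers whose speculative data it has read
        self.producers = defaultdict(set)

    def _depends_on(self, start, target):
        """True if `start` transitively depends on `target` (depth-first search)."""
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            if node == target:
                return True
            if node in seen:
                continue
            seen.add(node)
            stack.extend(self.producers[node])
        return False

    def try_forward(self, producer, consumer):
        """Record consumer -> producer unless that would close a cycle."""
        if producer == consumer or self._depends_on(producer, consumer):
            return False  # the caller must fall back, for example by aborting
        self.producers[consumer].add(producer)
        return True

    def can_commit(self, tx, committed):
        """A transaction may commit once all of its producers have committed."""
        return self.producers[tx] <= committed


graph = ForwardingGraph()
print(graph.try_forward("T1", "T2"))  # True: T2 now depends on T1
print(graph.try_forward("T2", "T3"))  # True: T3 now depends on T2
print(graph.try_forward("T3", "T1"))  # False: T1 -> T3 -> T2 -> T1 would be a cycle
print(graph.can_commit("T2", set()))  # False: T1 has not committed yet
print(graph.can_commit("T2", {"T1"}))  # True
```

The resulting commit order is a partial order: T1 must precede T2 and T2 must precede T3, while unrelated transactions remain unordered.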

Performance Evaluation of CUDA Parallel Matrix Multiplication using Julia and C++

IEEE International Symposium on Embedded Multicore Socs (MCSoC)

Authors: Robertus Hudi, Mikael Silvano, Kennedy Suganto; Department of Informatics, Universitas Pelita Harapan, Tangerang, Indonesia

ISBN: (digital) 9798331530471

ISBN: (print) 9798331530488

Compute Unified Device Architecture (CUDA) was developed as a GPU parallel programming platform and API, primarily designed for use with C/C++. Over the years, fundamental linear algebra functionalities on CUDA have reached a mature state, and many of these are now accessible on CUDA's GitHub repository. As other high-level programming languages have begun incorporating CUDA-compatible methods into their libraries, the Julia programming language introduced CUDA support in 2021, aiming to offer an abstraction level similar to that of C implementations. However, research has shown that Julia's linear algebra computations, despite leveraging CUDA for parallelization and computational reduction, have yet to match the execution speed achieved by C implementations. This study uses matrix multiplication as a representative linear algebra computation, given its well-optimized CUDA kernel. Outputs of the study include an Nsight report file and an SQLite database, which are analysed using NVIDIA Nsight Systems to assess each kernel's runtime and memory usage for performance evaluation. Findings indicate that Julia's CUDA kernel invocation has a high runtime overhead, growing at a rate of O(n^2), which presents a bottleneck when performing high-throughput computations on square binary matrices. This paper suggests that resolving this issue may involve developing a custom CUDA kernel in Julia that employs a more efficient reduction technique to reduce overhead and enhance performance.

Keywords: Performance evaluation; Computer languages; Runtime; Parallel programming; Graphics processing units; Linear algebra; Libraries; Stability analysis; Kernel; Software development management

DroMPI: Parallel Computation Over Drop Computing

Cluster, Cloud and Internet Computing Workshops (CCGridW), IEEE/ACM International Symposium on

Authors: George-Mircea Grosu, Silvia-Elena Nistor, Radu-Ioan Ciobanu, Ciprian Dobre, Florin Pop; National University of Science and Technology Politehnica Bucharest, Romania; National Institute for Research and Development in Informatics (ICI), Bucharest, Romania; Academy of Romanian Scientists, Bucharest, Romania

ISBN: (digital) 9798350377514

ISBN: (print) 9798350377521

With the advancement of technology and the spread of multi-core systems, the need for parallelization arises and the interest in programming models is growing. At the same time, new distributed computing models have been proposed, being in fierce competition to obtain the highest possible performance. The Drop Computing Paradigm proposes the idea of decentralized computing over ad-hoc opportunistic networks of mobile and Edge devices. In this respect, the Drop Computing model does not only aim to achieve a minimum turnaround time but also to optimize other characteristics related to mobile devices, such as limited resources and opportunistic communication. Therefore, it is necessary to define a new programming model called DroMPI that intends to extend the capabilities of current parallel and distributed programming models, based on the Drop Computing paradigm. To this end, the solution aims to develop a library that takes advantage of hardware capabilities in the interest of the Drop Computing paradigm and also provides programmers with a high-level programming interface. The library's features will be based on the Message Passing Interface (MPI) standard, which will be responsible for inter-node parallelization. The name of the library, DroMPI, is an acronym for Drop Computing and MPI. The implementation of the model will be responsible for the management of communication between nodes and for providing an Application Programming Interface (API) for the development of parallel applications in the Drop Computing paradigm.

Keywords: Productivity; Adaptation models; Parallel programming; Terminology; Computational modeling; Message passing; Libraries; Resource management; Standards; Application programming interfaces

Design and Development of RISC-V Based Virtual Cluster using QEMU Simulator

Frontiers of Information Technology (FIT)

Authors: Tassadaq Hussain, Muhammad Wasay Tahir, Abdul Qadeer, Eduard Ayguade; Centre for AI & Big Data, Namal University, Mianwali, Pakistan; Supercomputing, Namal University, Mianwali; Barcelona Supercomputing Center, Spain

ISBN: (digital) 9798331510503

ISBN: (print) 9798331510510

In this era, we are witnessing a huge amount of data growth, and artificial intelligence applications are the only solutions to process this data. These AI applications demand massive computations that a single machine is not capable of. Hence high-performance cluster machines are required. Several existing HPC clusters are available with x86-64, AMD64, PowerPC, and AArch64 processor architectures; however, full-stack open-source software and open-hardware clusters are missing. In this work, we have developed an open-hardware RISC-V based HPC cluster using the QEMU simulator. The cluster has a master node and four slave nodes, which are emulated using QEMU. It uses an open-source Linux operating system, distributed and parallel programming and compiler toolchain. The work involved configuring multiple virtual nodes, setting up file systems, establishing networking and installing a distributed compiler toolchain. Several benchmark applications are used to measure the performance of the virtual cluster and compare it with that of a physical RISC-V cluster having 4 nodes. The benchmarking measured execution time and computation cost to evaluate the comparative performance. The findings indicate that the physical cluster performs better than the virtual cluster, with the performance difference reaching 50% under some configurations. Despite these drawbacks, the virtual cluster approach provides a simple and scalable HPC configuration well suited for development and testing, especially when physical resources are limited. This work creates possibilities in utilizing RISC-V for HPC, cloud computing, edge computing, and IoT applications.

Keywords: Cloud computing; Technological innovation; Program processors; Parallel programming; Benchmark testing; Main-secondary; Internet of Things; Artificial intelligence; Virtualization; Edge computing

Towards a Domain-Specific Language for Patterns-Oriented Parallel Programming

17th Brazilian Symposium on Programming Languages (SBLP)

Authors: Griebler, Dalvan; Fernandes, Luiz Gustavo; Pontificia Univ Catolica Rio Grande Sul (PUCRS), GMAP Res Grp, FACIN, PPGCC, BR-90619900 Porto Alegre, RS, Brazil

ISBN: (print) 9783642409226; 9783642409219

Pattern-oriented programming has been used in parallel code development for many years now. During this time, several tools (mainly frameworks and libraries) proposed the use of patterns based on programming primitives or templates. The implementation of patterns using those tools usually requires human expertise to correctly set up communication/synchronization among processes. In this work, we propose the use of a Domain Specific Language to create pattern-oriented parallel programs (DSL-POPP). This approach has the advantage of offering a higher programming abstraction level in which communication/synchronization among processes is hidden from programmers. We compensate for the reduction in programming flexibility by offering the possibility to use combined and/or nested parallel patterns (i.e., parallelism in levels), allowing the design of more complex parallel applications. We conclude this work by presenting an experiment in which we develop a parallel application exploiting combined and nested parallel patterns in order to demonstrate the main properties of DSL-POPP.

Keywords: Parallel programming

Towards an Optimized Heterogeneous Distributed Task Scheduler in OpenMP Cluster

High Performance Computing, Networking, Storage and Analysis, SC-W: Workshops of the International Conference for

Authors: Rémy Neveu, Rodrigo Ceccato, Gustavo Leite, Guido Araujo, Jose M. Monsalve Diaz, Hervé Yviquel; Instituto de Computação, Universidade Estadual de Campinas (UNICAMP), Campinas, Brazil

ISBN: (digital) 9798350355543

ISBN: (print) 9798350355550

This paper addresses the challenges of optimizing task scheduling for a distributed, task-based execution model in OpenMP for cluster computing environments. Traditional OpenMP implementations are primarily designed for shared-memory parallelism and offer limited control over task scheduling. However, improved scheduling mechanisms are critical to achieving performance and portability in distributed and heterogeneous environments. OpenMP Cluster (OMPC) was introduced to overcome these limitations, extending OpenMP with the Heterogeneous Earliest Finish Time (HEFT) task scheduling algorithm tailored for large-scale systems. To improve scheduling and enable better system utilization, the runtime system must resolve challenges such as changes in the application balance, amount of parallelism, and varying communication *** work presents three key contributions: first, the refactoring of the OMPC runtime to unify task scheduling across devices and hosts; second, the optimization of the HEFT-based scheduling algorithm to ensure efficient task execution in distributed environments; and third, an extensive evaluation of Work Stealing and HEFT scheduling mechanisms in real-world clusters. While the HEFT implementation in OMPC is not fully optimized, this work provides a significant step toward improving distributed task scheduling in cluster computing, offering insights and incremental advancements that support the development of scalable and high-performance applications. Results show improvements of up to 24% in scheduling time while opening up to more extensions in the scheduling methods.

Keywords: Runtime; Scheduling algorithms; Parallel programming; Optimal scheduling; Cluster computing; Parallel processing; Dynamic scheduling; Large-scale systems; Resource management; Iterative methods
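The HEFT algorithm named in this abstract is usually presented in two phases: rank tasks by "upward rank" (average cost plus the longest remaining path to an exit task), then place each task, in decreasing rank order, on the processor that gives the earliest finish time. The sketch below follows that textbook outline on a four-task graph. It is not the OMPC runtime's implementation, and it simplifies the classic algorithm by appending each task after the last one on a processor instead of searching idle gaps. The function `heft` and the input dictionaries `costs`, `succ`, and `comm` are hypothetical names, and all costs are assumed to be positive.

```python
from functools import lru_cache


def heft(costs, succ, comm):
    """Simplified HEFT.

    costs[t][p]: run time of task t on processor p.
    succ[t]: list of successor tasks of t.
    comm[(t, s)]: transfer time from t to s when they run on different processors.
    Returns (order, placement) with placement[t] = (processor, start, finish).
    """
    tasks = list(costs)
    n_proc = len(next(iter(costs.values())))
    pred = {t: [] for t in tasks}
    for t in tasks:
        for s in succ.get(t, []):
            pred[s].append(t)

    @lru_cache(maxsize=None)
    def upward_rank(t):
        average_cost = sum(costs[t]) / n_proc
        tail = max(
            (comm.get((t, s), 0) + upward_rank(s) for s in succ.get(t, [])),
            default=0,
        )
        return average_cost + tail

    # With positive costs, a task always outranks its successors, so this
    # order never schedules a task before its predecessors.
    order = sorted(tasks, key=upward_rank, reverse=True)

    proc_free = [0.0] * n_proc  # time at which each processor becomes idle
    placement = {}
    for t in order:
        best = None
        for p in range(n_proc):
            # Inputs are ready once every predecessor has finished and, if it
            # ran elsewhere, its data has been transferred.
            ready = max(
                (
                    placement[q][2]
                    + (0 if placement[q][0] == p else comm.get((q, t), 0))
                    for q in pred[t]
                ),
                default=0.0,
            )
            start = max(ready, proc_free[p])
            finish = start + costs[t][p]
            if best is None or finish < best[2]:
                best = (p, start, finish)
        placement[t] = best
        proc_free[best[0]] = best[2]
    return order, placement


# Diamond-shaped task graph on two processors with different speeds.
costs = {"A": [2, 3], "B": [3, 2], "C": [4, 4], "D": [1, 2]}
succ = {"A": ["B", "C"], "B": ["D"], "C": ["D"]}
comm = {("A", "B"): 1, ("A", "C"): 1, ("B", "D"): 1, ("C", "D"): 1}

order, placement = heft(costs, succ, comm)
print(order)  # ['A', 'C', 'B', 'D']
print(max(finish for _, _, finish in placement.values()))  # 7.0
```

In this example B is placed on the second processor even though its input must be transferred, because waiting for the first processor would finish later.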
