ISBN (digital): 9798350352917
ISBN (print): 9798350352924; 9798350352917
This paper presents techniques for theoretically and practically efficient and scalable Schrödinger-style quantum circuit simulation. Our approach partitions a quantum circuit into a hierarchy of subcircuits and simulates the subcircuits on multi-node GPUs, exploiting available data parallelism while minimizing communication costs. To minimize those costs, we formulate an Integer Linear Program that rewards simulation of "nearby" gates on "nearby" GPUs. To maximize throughput, we use a dynamic programming algorithm to compute the subcircuit simulated by each kernel at a GPU. We realize these techniques in Atlas, a distributed, multi-GPU quantum circuit simulator. Our evaluation on a variety of quantum circuits shows that Atlas outperforms state-of-the-art GPU-based simulators by more than 2x on average and is able to run larger circuits via offloading to DRAM, outperforming other large-circuit simulators by two orders of magnitude.
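To make the placement objective concrete, below is a minimal sketch, assuming the PuLP library with its bundled CBC solver, of an ILP with this flavor of objective; the gate set, GPU nearness weights, and the linearization are illustrative, not Atlas's actual formulation.

    from pulp import LpProblem, LpMaximize, LpVariable, lpSum, LpBinary, PULP_CBC_CMD

    gates = ["g0", "g1", "g2", "g3"]
    gpus = [0, 1]
    adjacent = [("g0", "g1"), ("g1", "g2"), ("g2", "g3")]            # gates sharing qubits
    nearness = {(0, 0): 1.0, (1, 1): 1.0, (0, 1): 0.2, (1, 0): 0.2}  # reward per GPU pair

    prob = LpProblem("gate_placement", LpMaximize)
    x = LpVariable.dicts("x", (gates, gpus), cat=LpBinary)                       # gate -> GPU
    y = LpVariable.dicts("y", (range(len(adjacent)), gpus, gpus), cat=LpBinary)  # linearization

    # Objective: reward placing adjacent gates on nearby GPUs.
    prob += lpSum(nearness[d, e] * y[k][d][e]
                  for k in range(len(adjacent)) for d in gpus for e in gpus)

    # Each gate is simulated on exactly one GPU.
    for g in gates:
        prob += lpSum(x[g][d] for d in gpus) == 1

    # y[k][d][e] may be 1 only if both endpoints of pair k are placed accordingly.
    for k, (g, h) in enumerate(adjacent):
        for d in gpus:
            for e in gpus:
                prob += y[k][d][e] <= x[g][d]
                prob += y[k][d][e] <= x[h][e]

    prob.solve(PULP_CBC_CMD(msg=0))
    for g in gates:
        print(g, "->", next(d for d in gpus if x[g][d].value() == 1))

The auxiliary variables y stand in for the nonlinear product x[g][d] * x[h][e]; bounding y above by both factors is a tight linearization here because the objective maximizes the reward.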
ISBN (digital): 9798350352917
ISBN (print): 9798350352924; 9798350352917
High energy physics experiments produce petabytes of data annually that must be reduced to gain insight into the laws of nature. Early-stage reduction executes long-running, high-throughput workflows across thousands of nodes spanning multiple facilities to produce shared datasets. Later stages are typically written by individuals or small groups and must be refined and re-run many times for correctness. Reducing the iteration time of these later stages is key to accelerating discovery. We describe our experience reshaping late-stage analysis applications to run on thousands of nodes. It is not enough merely to increase scale: changes are needed throughout the stack, including storage systems, data management, task scheduling, and application design. We demonstrate these changes as applied to two analysis applications built on open-source data analysis frameworks (Coffea, Dask, TaskVine). We evaluate the performance of the applications on opportunistic campus clusters, showing effective scaling up to 7,200 cores and significant speedups.
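For intuition, here is a minimal sketch, using Dask alone, of the map-reduce shape of such a late-stage step: per-file histogram fills reduced into one result. `make_events` is a hypothetical stand-in for reading real event files; the actual applications run Coffea tasks through schedulers such as TaskVine.

    import dask
    import numpy as np

    def make_events(seed, n=100_000):
        # Stand-in for reading one input file of events.
        rng = np.random.default_rng(seed)
        return rng.exponential(scale=25.0, size=n)  # a pT-like falling spectrum

    @dask.delayed
    def partial_hist(seed, bins):
        # One task per "file": fill a partial histogram.
        counts, _ = np.histogram(make_events(seed), bins=bins)
        return counts

    bins = np.linspace(0.0, 200.0, 51)
    parts = [partial_hist(seed, bins) for seed in range(64)]
    total = dask.delayed(sum)(parts)        # reduce partial counts into one histogram

    (counts,) = dask.compute(total)
    print(counts[:5])

With a distributed scheduler attached, the same graph fans the per-file tasks out across cluster nodes; only the reduction is serial in this sketch.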
Relational Hoare logic [18] extends the applicability of modular deductive verification to encompass the verification of crucial two-run properties, such as confidentiality. Most of the current research on rel...
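For context, the judgment at the core of such logics relates two runs; a generic presentation (standard notation, not necessarily this paper's) is:

    % Relational Hoare quadruple: if runs of c_1 and c_2 start in states
    % related by precondition P, the final states are related by Q.
    \{P\}\; c_1 \sim c_2 \;\{Q\}

    % Confidentiality (noninterference) as a two-run property of one program c:
    % low-equivalent initial states must yield low-equivalent final states.
    \{\sigma_1 =_L \sigma_2\}\; c \sim c \;\{\sigma_1' =_L \sigma_2'\}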
HPC is a widely used term, often referring to the applications, architectures, programming models, and tools targeting highly parallel machines such as those of the *** lists. Recent advances in computing hardware re...
Author:
Marzolla, Moreno
Center for Inter-Department Industrial Research ICT Bologna Italy
Mini-applications are widely used in parallel computing for testing and benchmarking purposes. However, many existing mini-applications are not suitable for teaching, since they require advanced knowledge of algebra, ...
Hardware Transactional Memory (HTM) offers the opportunity to ease parallel programming. However, driven by hardware limitations, commercial implementations eschew the complexity involved in early sophisticated propos...
The US Department of Energy's fastest supercomputers and forthcoming exascale systems employ Graphics Processing Units (GPUs) to increase the computational performance of compute nodes. However, the complexity of GPU architectures makes tailoring sophisticated applications to achieve high performance on GPU-accelerated systems a major challenge. At best, prior performance tools for GPU code provide only coarse-grained tuning advice at the kernel level. In this article, we describe GPA, a performance advisor that suggests potential code optimizations at a hierarchy of levels, including individual lines, loops, and functions. To gather the fine-grained measurements needed to produce such insights, GPA uses instruction sampling and binary instrumentation to monitor execution of GPU code. At the time of this writing, GPU instruction sampling is available only on NVIDIA GPUs. To understand performance losses, GPA uses data-flow analysis to approximately attribute measured instruction stalls back to their causes. It then analyzes stall patterns using information about a program's structure and the GPU architecture to identify optimization strategies that address the observed inefficiencies, and employs detailed performance models to estimate the potential speedup each optimization might provide. Experiments with benchmarks and applications show that GPA provides useful advice for tuning GPU code. We applied GPA to analyze and tune a collection of codes on NVIDIA V100 and A100 GPUs. GPA suggested optimizations that it estimated would accelerate the set of codes by a geometric mean of 1.21x; applying them accelerated the codes by a geometric mean of 1.19x.
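A minimal sketch of the general idea behind the attribution step follows; the sampled data, def-use table, and addresses are all hypothetical, and GPA's real data-flow analysis over GPU binaries is far more involved.

    # Sampled stalls: program counter -> (stall reason, sample count).
    samples = {
        0x120: ("memory_dependency", 940),
        0x128: ("execution_dependency", 310),
    }

    # Def-use edges recovered from the binary: stalled pc -> pc of the
    # instruction that defines the operand being waited on.
    def_site = {
        0x120: 0x100,   # 0x100 is a global load feeding 0x120
        0x128: 0x11c,   # 0x11c is an FMA feeding 0x128
    }

    # Walk each stall back to its defining instruction and accumulate blame.
    blame = {}
    for pc, (reason, count) in samples.items():
        cause_pc = def_site.get(pc, pc)   # fall back to the stalled pc itself
        blame[cause_pc] = blame.get(cause_pc, 0) + count

    for pc, count in sorted(blame.items(), key=lambda kv: -kv[1]):
        print(f"instruction {pc:#x} blamed for {count} stall samples")

The point of shifting blame from the stalled instruction to its definition site is that the fix (e.g., hoisting or prefetching a load) belongs at the producer, not at the consumer where the stall is observed.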
Modern industrial robots can work alongside human workers and coordinate with other robots. This lets them perform complex tasks, but doing so requires complex programming. Robots are therefore typically programmed by experts, and there are not enough experts to meet the growing demand for robots. To reduce the need for experts, researchers have tried to make robot programming accessible to factory workers without programming experience. However, none of that previous work supports coordinating multiple robot arms that work on the same task. In this paper we present four block-based programming language designs that enable end-users to program two-armed robots. We analyze the benefits and trade-offs of each design with respect to expressiveness and user cognition, and evaluate the designs through a survey of 273 professional participants, 110 of whom had no previous programming experience. We further present an interactive experiment based on a prototype implementation of the design we deem best. This experiment confirmed that novices can successfully use our prototype to complete realistic robotics tasks. This work contributes to making coordinated programming of robots accessible to end-users, and further explores how visual programming elements can make traditionally challenging programming tasks more beginner-friendly.
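As a minimal sketch in Python, not one of the paper's actual designs, the following shows how an explicit synchronization block can coordinate two linear per-arm block programs; the block names and the mini-interpreter are hypothetical.

    import threading
    import time

    sync = threading.Barrier(2)   # both arms rendezvous at each "sync" block

    def run_arm(name, blocks):
        # Interpret a linear program of (block, argument) tuples for one arm.
        for block in blocks:
            if block[0] == "move":
                print(f"{name}: move to {block[1]}")
                time.sleep(0.1)           # stand-in for motion
            elif block[0] == "grip":
                print(f"{name}: {block[1]} gripper")
            elif block[0] == "sync":
                sync.wait()               # wait until the other arm arrives

    left = [("move", "above_part"), ("grip", "close"), ("sync",), ("move", "handoff")]
    right = [("move", "handoff"), ("sync",), ("grip", "close"), ("move", "bin")]

    threads = [threading.Thread(target=run_arm, args=(n, p))
               for n, p in (("left", left), ("right", right))]
    for t in threads: t.start()
    for t in threads: t.join()

The barrier guarantees that the right arm does not close its gripper for the handoff until the left arm has picked up the part, which is exactly the class of cross-arm ordering that single-arm end-user tools cannot express.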
The NAS Parallel Benchmarks (NPB) are a standard benchmark suite used in the evaluation of parallel hardware and software. Several research efforts from academia have made these benchmarks available in parallel programming models beyond the original OpenMP and MPI versions. This work joins those efforts by providing a new CUDA implementation of NPB. Our contribution covers several aspects beyond the implementation. First, we define design principles based on best programming practices for GPUs and apply them to each benchmark using CUDA. Second, we provide easy-to-use parametrization support for configuring the number of threads per block. Third, we conduct a broad study of the impact of the number of threads per block on the benchmarks. Fourth, we propose and evaluate five strategies for finding a better threads-per-block configuration. The results reveal that changing the number of threads per block alone yields performance improvements of 8% up to 717% across the benchmarks. Fifth, we conduct a comparative analysis with the literature, evaluating performance, memory consumption, required code refactoring, and parallelism implementations. The performance results show improvements of up to 267% over the best benchmark versions available. We also identify the best and worst design choices with respect to the trade-off between code size and performance. Lastly, we highlight the challenges of implementing parallel CFD applications for GPUs and how their computations affect GPU behavior.
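Below is a minimal sketch, assuming CuPy on a CUDA-capable machine, of the kind of threads-per-block sweep such a study performs; the saxpy kernel is illustrative, not one of the NPB kernels.

    import cupy as cp

    saxpy_src = r'''
    extern "C" __global__
    void saxpy(const float a, const float* x, float* y, const int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];
    }
    '''
    saxpy = cp.RawKernel(saxpy_src, 'saxpy')

    n = 1 << 24
    x = cp.random.rand(n, dtype=cp.float32)
    y = cp.random.rand(n, dtype=cp.float32)

    # Warm-up launch so JIT compilation is excluded from the timings.
    saxpy((1,), (32,), (cp.float32(0.0), x, y, cp.int32(0)))

    for threads in (32, 64, 128, 256, 512, 1024):
        blocks = (n + threads - 1) // threads
        start, end = cp.cuda.Event(), cp.cuda.Event()
        start.record()
        saxpy((blocks,), (threads,), (cp.float32(2.0), x, y, cp.int32(n)))
        end.record()
        end.synchronize()
        ms = cp.cuda.get_elapsed_time(start, end)
        print(f"{threads:4d} threads/block: {ms:.3f} ms")

Even for a kernel this simple, the sweep typically shows measurable differences across block sizes, since the choice affects occupancy and scheduling; for the compute-heavier NPB kernels the paper reports much larger swings.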