ISBN (Print): 9783319321493; 9783319321486
Transistor size reduction and more aggressive power modes in HPC platforms make chip components more error prone. In this context, HPC applications can have diverse levels of tolerance to memory errors, which may change the execution in different ways. Because tolerance to memory errors depends on write frequency and access patterns, different programming models may exhibit different failure rates and alleviate the performance loss caused by the overhead of fault-tolerance mechanisms. In this paper, we explore how tolerant the two main parallel programming models, message passing and shared memory, are to memory errors: we perform a memory vulnerability analysis and conduct error-propagation experiments to observe the effect of memory errors on program flow. Our results show the need for soft-error resiliency methods based on the memory behavior of programs, and for evaluating the trade-offs between performance and reliability.
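The error-propagation experiments contrast a clean run with one perturbed by an injected fault. Below is a minimal, hypothetical sketch of that methodology (inject_bit_flip is illustrative, not the authors' tool): it flips one random bit in an application buffer to emulate a soft memory error, then reports how far the result drifts from the golden run.

    #include <cstddef>
    #include <cstdint>
    #include <cstdio>
    #include <random>
    #include <vector>

    // Flip one random bit in a buffer, emulating a soft memory error.
    void inject_bit_flip(void* data, size_t bytes, std::mt19937& rng) {
        std::uniform_int_distribution<size_t> byte_at(0, bytes - 1);
        std::uniform_int_distribution<int> bit_at(0, 7);
        auto* p = static_cast<uint8_t*>(data);
        p[byte_at(rng)] ^= uint8_t(1) << bit_at(rng);
    }

    int main() {
        std::mt19937 rng(42);
        std::vector<double> v(1 << 20, 1.0);
        double clean = 0.0, faulty = 0.0;
        for (double x : v) clean += x;                        // golden run
        inject_bit_flip(v.data(), v.size() * sizeof(double), rng);
        for (double x : v) faulty += x;                       // perturbed run
        std::printf("clean=%.2f faulty=%.2f drift=%e\n",
                    clean, faulty, faulty - clean);
    }

Repeating such a trial many times, over buffers with different write frequencies and access patterns, is what lets a vulnerability analysis distinguish benign upsets from those that corrupt results.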
Computational science has benefited in recent years from emerging accelerators that increase the performance of scientific simulations, but using these devices complicates the programming task. This paper presents AMA: a set of optimization techniques to efficiently manage multi-accelerator systems. AMA maximizes the overlap of computation and communication in a blocking-free way, so the spare time can be used to do other work while waiting for device operations. Implemented on top of a task-based framework, the experimental evaluation of AMA on a quad-GPU node shows that we reach the performance of hand-tuned native CUDA code, with the advantage of fully hiding the device management. In addition, we obtain more than a 2x speed-up with respect to the original framework implementation.
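The blocking-free overlap that AMA exploits can be illustrated at host level with standard C++ futures. This is a hedged analogue, not AMA's API: fake_device_op stands in for an asynchronous transfer or kernel launch, and the host performs other work instead of waiting.

    #include <chrono>
    #include <cstdio>
    #include <functional>
    #include <future>
    #include <numeric>
    #include <thread>
    #include <vector>

    // Stand-in for an asynchronous device transfer or kernel launch.
    double fake_device_op(const std::vector<double>& a) {
        std::this_thread::sleep_for(std::chrono::milliseconds(50));
        return std::accumulate(a.begin(), a.end(), 0.0);
    }

    int main() {
        std::vector<double> a(1 << 20, 0.5);
        // Launch the "device" operation without blocking the host.
        auto pending = std::async(std::launch::async, fake_device_op,
                                  std::cref(a));
        // Blocking-free: do other useful work while it is in flight.
        double host_work = 0.0;
        for (int i = 0; i < 1000; ++i) host_work += i * 0.001;
        // Synchronize only when the device result is actually needed.
        std::printf("device=%.1f host=%.3f\n", pending.get(), host_work);
    }

In AMA itself this pattern is driven by the task-based runtime, which fills the wait time with ready tasks automatically rather than with hand-written host code.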
ISBN (Print): 9781509000883
Parallel systems that employ CPUs and GPUs as two heterogeneous computational units have become immensely popular due to their ability to maximize performance under restrictive thermal budgets. However, programming heterogeneous systems via traditional programming models like OpenCL or CUDA involves rewriting large portions of application code, and it leads to code that is not performance portable across different architectures, or even across different generations of the same architecture. In this paper, we evaluate the current state of two emerging parallel programming models: C++ AMP and OpenACC. These paradigms require minimal code changes and rely on compilers to interact with the low-level hardware, thereby producing performance-portable code from an application standpoint. We analyze the performance and productivity of the emerging programming models and compare them with OpenCL using a diverse set of applications on two different architectures: a CPU coupled with a discrete GPU, and an Accelerated Processing Unit (APU). Our experiments demonstrate that while the emerging programming models improve programmer productivity, they do not yet expose enough flexibility to extract maximum performance compared to traditional programming models.
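As a flavor of the directive-based style evaluated here, the sketch below expresses SAXPY with a single OpenACC directive; contrast this with the contexts, buffers, and kernel boilerplate OpenCL requires for the same loop. Without an OpenACC compiler (e.g., nvc++ -acc), the pragma is ignored and the loop simply runs serially. The problem size and data are illustrative, not the paper's benchmarks.

    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 20;
        const float alpha = 2.0f;
        std::vector<float> x(n, 1.0f), y(n, 3.0f);
        float* xp = x.data();
        float* yp = y.data();
        // One directive offloads the loop and describes data movement;
        // the compiler handles the low-level hardware interaction.
        #pragma acc parallel loop copyin(xp[0:n]) copy(yp[0:n])
        for (int i = 0; i < n; ++i)
            yp[i] = alpha * xp[i] + yp[i];
        std::printf("y[0]=%f\n", yp[0]);   // expect 5.0
    }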
Since Aurora entered the TOP500 list in November 2023, the top ten systems have seen shifts in the ratio of GPU vendors represented. With each vendor supplying its own preferred programming models for its hardware, it becomes relevant to compare the portability of these models on other hardware platforms. For the present paper, we implemented the N-body problem with different optimizations using native and portable programming frameworks. For each framework, we determined the best-performing optimized version on one target architecture and compared the performance achieved on each platform.
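The computational core being ported is the classic direct-sum force calculation. A minimal C++ version is sketched below; the softening constant and problem size are illustrative choices, not the paper's configuration.

    #include <cmath>
    #include <cstdio>
    #include <vector>

    struct Body { float x, y, z, m; };

    // Direct O(N^2) acceleration accumulation -- the kernel each
    // framework implements and optimizes for its target hardware.
    void accelerations(const std::vector<Body>& b, std::vector<float>& ax,
                       std::vector<float>& ay, std::vector<float>& az) {
        const float eps2 = 1e-6f;   // softening to avoid division by zero
        for (size_t i = 0; i < b.size(); ++i) {
            float fx = 0, fy = 0, fz = 0;
            for (size_t j = 0; j < b.size(); ++j) {
                float dx = b[j].x - b[i].x, dy = b[j].y - b[i].y,
                      dz = b[j].z - b[i].z;
                float r2 = dx * dx + dy * dy + dz * dz + eps2;
                float inv = 1.0f / std::sqrt(r2);
                float s = b[j].m * inv * inv * inv;
                fx += dx * s; fy += dy * s; fz += dz * s;
            }
            ax[i] = fx; ay[i] = fy; az[i] = fz;
        }
    }

    int main() {
        std::vector<Body> b(256, Body{0, 0, 0, 1.0f});
        for (size_t i = 0; i < b.size(); ++i) b[i].x = float(i);
        std::vector<float> ax(b.size()), ay(b.size()), az(b.size());
        accelerations(b, ax, ay, az);
        std::printf("ax[0]=%e\n", ax[0]);
    }

Typical optimizations then vary by platform: tiling the inner loop through shared or local memory on GPUs, and vectorizing it on CPUs.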
Over the years, Field-Programmable Gate Arrays (FPGAs) have been gaining popularity in the High Performance Computing (HPC) field because their reconfigurability enables very fine-grained optimizations at low energy cost. However, the differing characteristics, architectures, and network topologies of clusters have hindered the use of FPGAs at large scale. In this work, we present an evolution of OmpSs@FPGA, a high-level task-based programming model and extension of OmpSs-2 that aims to unify FPGA clusters through a message-passing interface compatible with FPGA accelerators. These accelerators are programmed with C/C++ pragmas and synthesized with High-Level Synthesis tools. The new framework includes a custom protocol to exchange messages between FPGAs, agnostic of the architecture and network type. On top of that, we present a new communication paradigm called Implicit Message Passing (IMP), in which the user does not need to call any message-passing API; instead, the runtime automatically infers data movement between nodes. We test classic message passing and IMP with three benchmarks on two different FPGA clusters. One is cloudFPGA, a disaggregated platform with AMD FPGAs that are connected to the network only through UDP/TCP/IP. The other is ESSPER, composed of CPU-attached Intel FPGAs that have a private network at the Ethernet level. In both cases, we demonstrate that IMP with OmpSs@FPGA can increase the productivity of FPGA programmers at large scale by simplifying communication between nodes, without limiting the scalability of applications. We implement the N-body, Heat simulation, and Cholesky decomposition benchmarks, and show that FPGA clusters achieve 2.6x and 2.4x better performance per watt than a CPU-only supercomputer for N-body and Heat, respectively.
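The idea behind IMP can be approximated with standard OpenMP task dependencies, sketched below. This is an analogue, not the actual OmpSs@FPGA syntax: the in/out annotations carry exactly the information a runtime needs to infer data movement between nodes on the user's behalf. Compiled without OpenMP support, the pragmas are ignored and the code runs serially.

    #include <cstdio>

    int main() {
        float a[4] = {1, 2, 3, 4}, b[4] = {0};
        #pragma omp parallel
        #pragma omp single
        {
            // Producer task: writes b. Under IMP, the runtime would
            // also know b must be shipped to whichever node or FPGA
            // runs the consumer task.
            #pragma omp task depend(out: b)
            for (int i = 0; i < 4; ++i) b[i] = 2 * a[i];

            // Consumer task: the in-dependency makes the transfer
            // implicit; the user never calls a message-passing API.
            #pragma omp task depend(in: b)
            std::printf("b[3]=%f\n", b[3]);
        }
        return 0;
    }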
Within the computing continuum, SBCs (single-board computers) are essential in the Edge and Fog, with many featuring multiple processing cores and GPU accelerators. Parallel computing therefore plays a crucial role in unlocking the full computational potential of SBCs. However, selecting the best-suited solution in this context is inherently complex due to the intricate interplay between PPI (parallel programming interface) strategies, SBC architectural characteristics, and application characteristics and constraints. To our knowledge, no existing solution discusses these three aspects in combination. To tackle this problem, this article provides a benchmark of the best-suited PPIs for a given set of hardware and application characteristics and requirements. Compared to existing benchmarks, we introduce new metrics, additional applications, various parallelism interfaces, and extra hardware devices. Our contributions are, therefore, a methodology to benchmark parallelism on SBCs and a characterization of the best-performing PPIs and parallelism strategies for given situations. We are confident that parallel computing will become mainstream for processing in edge and fog computing; thus, our solution provides the first insights into which kind of application and parallel programming interface is best suited to a particular SBC hardware.
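The kind of measurement such a benchmark performs can be sketched with standard C++ threads: time one kernel at several thread counts and compare. The kernel, problem size, and thread counts below are illustrative placeholders, not the article's workloads or metrics.

    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    // A simple compute-bound kernel chunk.
    double work_chunk(long begin, long end) {
        double s = 0.0;
        for (long i = begin; i < end; ++i) s += 1.0 / double(i + 1);
        return s;
    }

    // Run the kernel split evenly across a given number of threads.
    double run(int threads, long n) {
        std::vector<std::thread> pool;
        std::vector<double> part(threads, 0.0);
        long chunk = n / threads;
        for (int t = 0; t < threads; ++t)
            pool.emplace_back([&, t] {
                part[t] = work_chunk(t * chunk, (t + 1) * chunk);
            });
        for (auto& th : pool) th.join();
        double s = 0.0;
        for (double p : part) s += p;
        return s;
    }

    int main() {
        const long n = 1L << 24;
        for (int t : {1, 2, 4}) {   // typical SBC core counts
            auto t0 = std::chrono::steady_clock::now();
            volatile double s = run(t, n);
            (void)s;
            auto ms = std::chrono::duration<double, std::milli>(
                          std::chrono::steady_clock::now() - t0).count();
            std::printf("%d threads: %.1f ms\n", t, ms);
        }
    }

A full benchmark would repeat this across PPIs (e.g., OpenMP, threads, GPU interfaces), applications, and boards, and record energy alongside time.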
From edge to exascale, computer architectures are becoming more heterogeneous and complex. The systems typically have fat nodes, with multicore CPUs and multiple hardware accelerators such as GPUs, FPGAs, and DSPs. This complexity is causing a crisis in programming systems and performance portability. Several programming systems are working to address these challenges, but the increasing architectural diversity is forcing software stacks and applications to be specialized for each architecture. As we show, all of these approaches critically depend on their software framework for discovery, execution, scheduling, and data orchestration. To address this challenge, we believe that a more agile and proactive software framework is essential to increase performance portability and improve user productivity. To this end, we have designed and implemented IRIS: a performance-portable framework for cross-platform heterogeneous computing. IRIS can discover available resources, manage multiple diverse programming platforms (e.g., CUDA, Hexagon, HIP, Level Zero, OpenCL, OpenMP) simultaneously in the same execution, respect data dependencies, orchestrate data movement proactively, and provide for user-configurable scheduling. To simplify data movement, IRIS introduces a shared virtual device memory with relaxed consistency among different heterogeneous devices. IRIS also adds an automatic kernel workload partitioning technique using the polyhedral model so that it can resize kernels for a wide range of devices. Our evaluation on three architectures, ranging from Qualcomm Snapdragon to a Summit supercomputer node, shows that IRIS improves portability across a wide range of diverse heterogeneous architectures with negligible overhead.
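As a conceptual sketch only (hypothetical types and names, not the IRIS API), the core idea reduces to one task abstraction dispatched to whichever registered backend the scheduling policy selects:

    #include <cstdio>
    #include <functional>
    #include <map>
    #include <string>
    #include <vector>

    // A task names a kernel and the data it operates on.
    struct Task { std::string kernel; std::vector<float>* data; };

    using Backend = std::function<void(Task&)>;

    int main() {
        std::vector<float> v(8, 1.0f);
        // Registered backends; a real runtime would discover CUDA, HIP,
        // Level Zero, OpenCL, or OpenMP devices at startup and choose
        // one per task via its scheduler and data-locality information.
        std::map<std::string, Backend> backends = {
            {"openmp", [](Task& t) { for (auto& x : *t.data) x *= 2.0f; }},
            {"serial", [](Task& t) { for (auto& x : *t.data) x *= 2.0f; }},
        };
        Task t{"scale2", &v};
        // User-configurable policy, reduced here to "first available".
        backends.begin()->second(t);
        std::printf("v[0]=%f\n", v[0]);
    }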
ISBN (Print): 9783031733697; 9783031733703
The evolution of parallel computing architectures presents new challenges for developing efficient parallelized codes. The emergence of heterogeneous systems has given rise to multiple programming models, each requiring careful adaptation to maximize performance. In this context, we propose reevaluating memory layout designs for computational tasks within larger nodes by comparing various architectures. To gain insight into the performance discrepancies between shared-memory and shared-address-space settings, we systematically measure the bandwidth between cores and sockets using different methodologies. Our findings reveal significant performance differences, suggesting that MPI running inside UNIX processes may not fully utilize its intra-node bandwidth potential. In light of our work on the MPC thread-based MPI runtime, which can leverage shared memory to achieve higher performance due to its optimized layout, we advocate for enabling the use of shared memory within the MPI standard.
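For reference, MPI-3 already exposes a restricted form of intra-node shared memory through shared windows, the closest existing mechanism to what the authors advocate extending. A minimal sketch is below (all calls are from the MPI standard; synchronization is simplified, and a strict implementation would add MPI_Win_sync around the load/store accesses):

    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        // Group the ranks that share a physical node.
        MPI_Comm node;
        MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                            MPI_INFO_NULL, &node);
        int rank;
        MPI_Comm_rank(node, &rank);

        // Rank 0 allocates the shared segment; others attach with size 0.
        double* base = nullptr;
        MPI_Win win;
        MPI_Aint bytes = (rank == 0) ? 1024 * sizeof(double) : 0;
        MPI_Win_allocate_shared(bytes, sizeof(double), MPI_INFO_NULL,
                                node, &base, &win);

        // Every rank obtains a direct pointer to rank 0's segment and
        // accesses it with plain loads/stores -- no messages involved.
        MPI_Aint qbytes; int disp; double* shared = nullptr;
        MPI_Win_shared_query(win, 0, &qbytes, &disp, &shared);

        if (rank == 0) shared[0] = 42.0;
        MPI_Barrier(node);              // order the write before the reads
        std::printf("rank %d sees %f\n", rank, shared[0]);

        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }

Compile with mpicxx and run with mpirun on a single node; a thread-based runtime such as MPC can offer the same direct access without this explicit window setup, which is the layout advantage the abstract refers to.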
In a bid to improve living standards, the African Union and the African Development Bank are encouraging the free movement of goods and productivity via technological innovation. In 2018, they even signed a continental free ...