Search Results - Inner Mongolia University Library

Architectural Adaptation and Performance-Energy Optimization for CFD Application on AMD EPYC Rome

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2021, Vol. 32, No. 12, pp. 2852-2866

Authors: Szustak, Lukasz; Wyrzykowski, Roman; Kuczynski, Lukasz; Olas, Tomasz (Czestochowa Tech Univ, Dept Comp Sci, PL-42201 Czestochowa, Poland)

The advantages of the second-generation AMD EPYC Rome processors can be successfully used in the race to Exascale. However, the novel architecture's complexity makes it challenging to adapt demanding scientific codes - like stencil ones - to platforms with Rome CPUs. This article tackles this challenge by exploring the adaptation of the stencil-based CFD (computational fluid dynamics) application called MPDATA to these processors' influential features. We show that the previously proposed parametric adaptation methodology can be profitably applied to extend the performance portability of the memory-bound MPDATA on the AMD EPYC architecture. The extension of the parametric adaptation on the novel architecture requires careful consideration of two relevant aspects that reflect splitting the Rome architecture into multiple dies - features of the cache hierarchy and partitioning cores into work teams. The article also investigates the correlation between the performance optimizations and energy efficiency for a ccNUMA platform powered by top-of-the-line 64-core AMD Rome 7742 CPUs, comparing the results against two servers with Intel Xeon Scalable processors of different generations. Even without appealing to prices, the achieved performance and energy efficiency results are a solid argument confirming the competitiveness of AMD Rome processors against Intel Xeon CPUs in scientific applications.

Keywords: Program processors; Computer architecture; Optimization; Servers; Libraries; Solids; Sockets; CFD; MPDATA; AMD EPYC Rome; shared-memory programming; performance portability; energy efficiency

Toward a Standard Interface for User-Defined Scheduling in OpenMP


15th International Workshop on OpenMP (IWOMP)

Authors: Kale, Vivek; Iwainsky, Christian; Klemm, Michael; Korndoerfer, Jonas H. Mueller; Ciorba, Florina M. (Brookhaven Natl Lab, Upton, NY 11973, USA; Tech Univ Darmstadt, Darmstadt, Germany; Intel Deutschland GmbH, Feldkirchen, Germany; Univ Basel, Basel, Switzerland)

ISBN: (print) 9783030285968; 9783030285951

Parallel loops are an important part of OpenMP programs. Efficient scheduling of parallel loops can improve performance of the programs. The current OpenMP specification only offers three options for loop scheduling, which are insufficient in certain instances. Given the large number of other possible scheduling strategies, standardizing each of them is infeasible. A more viable approach is to extend the OpenMP standard to allow a user to define loop scheduling strategies within her application. The approach will enable standard-compliant application-specific scheduling. This work analyzes the principal components required by user-defined scheduling and proposes two competing interfaces as candidates for the OpenMP standard. We conceptually compare the two proposed interfaces with respect to the three host languages of OpenMP, i.e., C, C++, and Fortran. These interfaces serve the OpenMP community as a basis for discussion and prototype implementation supporting user-defined scheduling in an OpenMP library.

Keywords: OpenMP; Multithreaded applications; shared-memory programming; Multicore; Loop scheduling; Self-scheduling; User-defined loop scheduling; Dynamic load balancing; High performance computing
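
To make the abstract's notion of a loop scheduling strategy concrete, here is a minimal Python sketch. It is not either of the interfaces the paper proposes. It only computes the chunk sizes that two well-known strategies would hand out: a static block split, and a guided self-scheduling rule in which each chunk is the remaining iteration count divided by the thread count, rounded up. The names `static_chunks` and `guided_chunks` and the `min_chunk` parameter are hypothetical, chosen for illustration.

```python
import math


def static_chunks(n_iters, n_threads):
    """Split n_iters into n_threads near-equal contiguous blocks.

    Illustrative only: one block per thread, decided before the loop runs.
    """
    base, extra = divmod(n_iters, n_threads)
    # The first `extra` threads take one additional iteration each.
    return [base + (1 if t < extra else 0) for t in range(n_threads)]


def guided_chunks(n_iters, n_threads, min_chunk=1):
    """Chunk sizes under a guided self-scheduling rule (assumed form).

    Each chunk is ceil(remaining / n_threads), never smaller than
    min_chunk and never larger than what is left.
    """
    chunks = []
    remaining = n_iters
    while remaining > 0:
        size = max(min_chunk, math.ceil(remaining / n_threads))
        size = min(size, remaining)
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    print(static_chunks(10, 4))
    print(guided_chunks(10, 4))
```

A user-defined scheduling interface of the kind the abstract argues for would, roughly speaking, let an application supply such a chunk-selection rule itself instead of relying only on the built-in schedule kinds.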

A Pthreads Wrapper for Fortran 2003

ACM TRANSACTIONS ON MATHEMATICAL SOFTWARE, 2014, Vol. 40, No. 3, pp. 19-19

Authors: Awile, Omar; Sbalzarini, Ivo F. (Swiss Fed Inst Technol, Inst Theoret Comp Sci, MOSAIC Grp, Zurich, Switzerland)

With the advent of multicore processors, numerical and mathematical software relies on parallelism in order to benefit from hardware performance increases. We present the design and use of a Fortran 2003 wrapper for POSIX threads, called forthreads. Forthreads is complete in the sense that it provides native Fortran 2003 interfaces to all pthreads routines where possible. We demonstrate the use and efficiency of forthreads for SIMD parallelism and task parallelism. We present forthreads/MPI implementations that enable hybrid shared-/distributed-memory parallelism in Fortran 2003. Our benchmarks show that forthreads offers performance comparable to that of OpenMP, but better thread control and more freedom. We demonstrate the latter by presenting a multithreaded Fortran 2003 library for POSIX Internet sockets, enabling interactive numerical simulations with runtime control.

Keywords: Algorithms; Languages; Performance; Standardization; POSIX threads; pthreads; Fortran; Fortran 2003; scientific computing; shared-memory programming; mathematical software; parallel particle-mesh; PPM library

Tpetra, and the use of generic programming in scientific computing

SCIENTIFIC PROGRAMMING, 2012, Vol. 20, No. 2, pp. 115-128

Authors: Baker, C. G.; Heroux, M. A. (Oak Ridge Natl Lab, Computat Engn & Energy Sci Grp, Oak Ridge, TN 37831, USA; Sandia Natl Labs, Dept Scalable Algorithms, Albuquerque, NM 87185, USA)

We present Tpetra, a Trilinos package for parallel linear algebra primitives implementing the Petra object model. We describe Tpetra's design, based on generic programming via C++ templated types and template metaprogramming. We discuss some benefits of this approach in the context of scientific computing, with illustrations consisting of code and notable empirical results.

Keywords: Generic programming; scientific computing; template metaprogramming; shared-memory programming; distributed-memory programming; many-core computing

An advanced compiler framework for non-cache-coherent multiprocessors

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2002, Vol. 13, No. 3, pp. 241-259

Authors: Paek, Y; Navarro, A; Zapata, E; Hoeflinger, J; Padua, D (Korea Adv Inst Sci & Technol, Dept Elect Engn & Comp Sci, Yusong Ku, Taejon 305701, South Korea; Univ Malaga, Dept Comp Architecture, Malaga 29080, Spain; Intel Corp, Champaign, IL 61820, USA; Univ Illinois, Dept Comp Sci, Urbana, IL 61801, USA)

The Cray T3D and T3E are non-cache-coherent (NCC) computers with a NUMA structure. They have been shown to exhibit a very stable and scalable performance for a variety of application programs. Considerable evidence suggests that they are more stable and scalable than many other shared-memory multiprocessors. However, the principal drawback of these machines is a lack of programmability, caused by the absence of the global cache coherence that is necessary to provide a convenient shared view of memory in hardware. This forces the programmer to keep careful track of where each piece of data is stored, a complication that is unnecessary when a pure shared-memory view is presented to the user. We believe that a remedy for this problem is advanced compiler technology. In this paper, we present our experience with a compiler framework for automatic parallelization and communication generation that has the potential to reduce the time-consuming hand-tuning that would otherwise be necessary to achieve good performance with this type of machine. From our experiments, we learned that our compiler performs well for a variety of applications on the T3D and T3E and we found a few sophisticated techniques that could improve performance even more once they are fully implemented in the compiler.

Keywords: compiler; array privatization; dependence analysis; multiprocessors; noncoherent caches; shared-memory programming; Put/Get

HPF to OpenMP on the Origin2000: a case study

CONCURRENCY-PRACTICE AND EXPERIENCE, 2000, Vol. 12, No. 12, pp. 1147-1154

Authors: Brieger, L (CRS4, Geophys, I-09010 Uta, Italy)

The geophysics group at CRS4 has long developed echo reconstruction codes in HPF on distributed-memory machines. Now, however, with the arrival of shared-memory machines and their native OpenMP compilers, the transfer to OpenMP would seem to present the logical next step in our code development strategy. Recent experience with porting one of our important HPF codes to OpenMP does not bear this out, at least not on the Origin2000. The OpenMP code suffers from the immaturity of the standard, and the operating system's handling of UNIX threads seems to severely penalize OpenMP performance. On the other hand, the HPF code on the Origin2000 is fast, scalable and not disproportionately sensitive to load on the machine. Copyright (C) 2000 John Wiley & Sons, Ltd.

Keywords: shared-memory programming; OpenMP; HPF; Origin2000

Quantitative characterization and analysis of the I/O behavior of a commercial distributed-shared-memory machine

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2000, Vol. 11, No. 5, pp. 509-526

Authors: Bordawekar, RR (IBM Corp, TJ Watson Res Ctr, Hawthorne, NY 10532, USA)

This paper presents a unified evaluation of the I/O behavior of a commercial clustered DSM machine, the HP Exemplar. Our study has the following objectives: 1) To evaluate the impact of different interacting system components, namely, architecture, operating system, and programming model, on the overall I/O behavior and identify possible performance bottlenecks, and 2) To provide hints to the users for achieving high out-of-box I/O throughput. We find that for the DSM machines that are built as a cluster of SMP nodes, integrated clustering of computing and I/O resources, both hardware and software, is not advantageous for two reasons. First, within an SMP node, the I/O bandwidth is often restricted by the performance of the peripheral components and cannot match the memory bandwidth. Second, since the I/O resources are shared as a global resource, the file-access costs become nonuniform and the I/O behavior of the entire system, in terms of both scalability and balance, degrades. We observe that the buffered I/O performance is determined not only by the I/O subsystem, but also by the programming model, global-shared memory subsystem, and data-communication mechanism. Moreover, programming-model support can be used effectively to overcome the performance constraints created by the architecture and operating system. For example, on the HP Exemplar, users can achieve high I/O throughput by using features of the programming model that balance the sharing and locality of the user buffers and file systems. Finally, we believe that at present, the I/O subsystems are being designed in isolation, and there is a need for mending the traditional memory-oriented design approach to address this problem.

Keywords: input-output; operating/file systems; distributed-shared-memory architecture; clustered computing; performance evaluation; shared-memory programming

Shared memory programming in metacomputing environments: The global array approach

JOURNAL OF SUPERCOMPUTING, 1997, Vol. 11, No. 2, pp. 119-136

Authors: Nieplocha, J; Harrison, RJ (Pacific Northwest National Laboratory, Richland)

The performance of the Global Array shared-memory nonuniform memory-access programming model is explored in a wide-area-network (WAN) distributed supercomputer environment. The Global Array model is extended by introducing a concept of mirrored arrays that, thanks to the caching and user-controlled consistency of the shared data structures, can reduce the application sensitivity to the network latency. Latencies and bandwidths for remote memory access are studied, and the performance of a large application from computational chemistry is evaluated using both fully distributed and mirrored arrays. Excellent performance can be obtained with mirroring if even modest (0.5 MB/s) network bandwidth is available.

Keywords: metacomputing; shared-memory programming; NUMA memory architecture; global arrays; distributed arrays

HPF to OpenMP on the Origin2000: a case study

Concurrency and Computation: Practice and Experience, 2000, Vol. 12, No. 12

Authors: Leesa Brieger (Geophysics, CRS4, C.P. 94, I-09010 Uta, Italy)

The geophysics group at CRS4 has long developed echo reconstruction codes in HPF on distributed-memory machines. Now, however, with the arrival of shared-memory machines and their native OpenMP compilers, the transfer to OpenMP would seem to present the logical next step in our code development strategy. Recent experience with porting one of our important HPF codes to OpenMP does not bear this out—at least not on the Origin2000. The OpenMP code suffers from the immaturity of the standard, and the operating system's handling of UNIX threads seems to severely penalize OpenMP performance. On the other hand, the HPF code on the Origin2000 is fast, scalable and not disproportionately sensitive to load on the machine. Copyright © 2000 John Wiley & Sons, Ltd.

Keywords: shared-memory programming; OpenMP; HPF; Origin2000
