Search Results - Inner Mongolia University Library

Author: Dinan, James Scott

Affiliation: The Ohio State University

Degree level: Ph.D.

Applications that exhibit irregular, dynamic, and unbalanced parallelism are growing in number and importance in the computational science and engineering communities. These applications span many domains including computational chemistry, physics, biology, and data mining. In such applications, the units of computation are often irregular in size and the availability of work may depend on the dynamic, often recursive, behavior of the program. Because of these properties, it is challenging for these programs to achieve high levels of performance and scalability on modern high performance clusters.

A new family of programming models, called the Partitioned Global Address Space (PGAS) family, provides the programmer with a global view of shared data and allows for asynchronous, one-sided access to data regardless of where it is physically stored. In this model, the global address space is distributed across the memories of multiple nodes and, for any given node, is partitioned into local patches that have high affinity and low access cost and remote patches that have a high access cost due to communication. The PGAS data model relaxes conventional two-sided communication semantics and allows the programmer to access remote data without the cooperation of the remote processor. Thus, this model is attractive for supporting irregular and dynamic applications on distributed memory clusters. However, in spite of the flexible data model, PGAS execution models require the programmer to explicitly partition the computation into a process-centric execution.

In this work, we build a complete environment to support irregular and dynamic parallel computations on large scale clusters by extending the PGAS data model with a task parallel execution model. Task parallelism allows the programmer to express their computation as a dynamic collection of tasks. The execution of these tasks is managed by a scalable and efficient runtime system that performs dynamic

Keywords: Computer Science; High performance computing; parallel programming models; Scalable runtime systems

UCX: An Open Source Framework for HPC Network APIs and Beyond

IEEE 23rd Annual Symposium on High-Performance Interconnects

Authors: Shamis, Pavel; Venkata, Manjunath Gorentla; Lopez, M. Graham; Baker, Matthew B.; Hernandez, Oscar; Itigin, Yossi; Dubman, Mike; Shainer, Gilad; Graham, Richard L.; Liss, Liran; Shahar, Yiftah; Potluri, Sreeram; Rossetti, Davide; Becker, Donald; Poole, Duncan; Lamb, Christopher; Kumar, Sameer; Stunkel, Craig; Bosilca, George; Bouteiller, Aurelien

Affiliations: Oak Ridge Natl Lab, Oak Ridge, TN 37831 USA; Mellanox Technol, Yokneam Illit, Israel; NVIDIA Corp, Santa Clara, CA USA; IBM Corp, Armonk, NY USA; Univ Tennessee, Knoxville, TN USA

ISBN: (print) 9781467391603

This paper presents Unified Communication X (UCX), a set of network APIs and their implementations for high throughput computing. UCX comes from the combined effort of national laboratories, industry, and academia to design and implement a high-performing and highly-scalable network stack for next generation applications and systems. The UCX design provides the ability to tailor its APIs and network functionality to suit a wide variety of application domains and hardware. We envision these APIs to satisfy the networking needs of many programming models such as Message Passing Interface (MPI), OpenSHMEM, Partitioned Global Address Space (PGAS) languages, task-based paradigms and I/O bound applications. To evaluate the design we implement the APIs and protocols, and measure the performance of overhead-critical network primitives fundamental for implementing many parallel programming models and system libraries. Our results show that the latency, bandwidth, and message rate achieved by the portable UCX prototype are very close to those of the underlying driver. With UCX, we achieved a message exchange latency of 0.89 us, a bandwidth of 6138.5 MB/s, and a message rate of 14 million messages per second. As far as we know, this is the highest bandwidth and message rate achieved by any network stack (publicly known) on this hardware.

Keywords: application program interfaces; input-output programs; message passing; parallel programming; public domain software; HPC network APIs; I/O bound applications; MPI; OpenSHMEM; PGAS languages; UCX; Unified Communication X; high throughput computing; highly-scalable network stack; message passing interface; open source framework; parallel programming models; partitioned global address space languages; system libraries; task-based paradigms; Bandwidth; Electronics packaging; Hardware; Libraries; Memory management; Programming; Protocols; HPC; InfiniBand; Middleware; PGAS; RDMA; remote procedure calls; open source software; High Performance Computing; Computer hardware; Store management

Communication Avoiding 2D Stencil Implementations over PaRSEC Task-Based Runtime

34th IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Authors: Pei, Yu; Cao, Qinglei; Bosilca, George; Luszczek, Piotr; Eijkhout, Victor; Dongarra, Jack

Affiliations: Univ Tennessee, Innovat Comp Lab, Knoxville, TN 37996 USA; Univ Texas Austin, Texas Adv Comp Ctr, Austin, TX USA

ISBN: (print) 9781728174457

Stencil computation or general sparse matrix-vector product (SpMV) are key components in many algorithms like geometric multigrid or Krylov solvers. But their low arithmetic intensity means that memory bandwidth and network latency will be the performance limiting factors. The current architectural trend favors computations over bandwidth, worsening the already unfavorable imbalance. Previous work approached stencil kernel optimization either by improving memory bandwidth usage or by providing a Communication Avoiding (CA) scheme to minimize network latency in repeated sparse vector multiplication by replicating remote work in order to delay communications on the critical path. Focusing on minimizing communication bottleneck in distributed stencil computation, in this study we combine a CA scheme with the computation and communication overlapping that is inherent in a dataflow task-based runtime system such as PaRSEC to demonstrate their combined benefits. We implemented the 2D five point stencil (Jacobi iteration) in PETSc, and over PaRSEC in two flavors, full communications (base-PaRSEC) and CA-PaRSEC which operate directly on a 2D compute grid. Our results running on two clusters, NaCL and Stampede2 indicate that we can achieve 2X speedup over the standard SpMV solution implemented in PETSc, and in certain cases when kernel execution is not dominating the execution time, the CA-PaRSEC version achieved up to 57% and 33% speedup over base-PaRSEC implementation on NaCL and Stampede2 respectively.

Keywords: 2D stencil; communication avoiding; parallel programming models
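For readers unfamiliar with the kernel named in the abstract above, one sweep of a 2D five-point Jacobi iteration can be sketched as follows. This is a minimal single-process illustration in pure Python, assuming the Laplace-equation form of the update; it is not the PETSc or PaRSEC implementation from the paper. The function name `jacobi_step`, the 4x4 grid, and the boundary values are hypothetical and chosen only for illustration.

```python
def jacobi_step(grid):
    """Return a new grid after one five-point Jacobi sweep.

    The stencil touches five points: the centre cell being updated and
    its north, south, west, and east neighbours. In this assumed
    Laplace-equation form, the new centre value is the average of the
    four neighbours. Boundary cells are copied unchanged.
    """
    rows, cols = len(grid), len(grid[0])
    new = [row[:] for row in grid]  # copy so every read sees old values
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            new[i][j] = 0.25 * (grid[i - 1][j] + grid[i + 1][j]
                                + grid[i][j - 1] + grid[i][j + 1])
    return new


# Hypothetical 4x4 grid: top boundary held at 1.0, everything else 0.0.
grid = [[1.0] * 4] + [[0.0] * 4 for _ in range(3)]
grid = jacobi_step(grid)
```

In a distributed run, each process would own a block of the grid and need its neighbours' edge rows and columns before every sweep. The communication-avoiding scheme described in the abstract replicates some remote work so that these exchanges can be delayed.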

Transparent execution of task-based parallel applications in Docker with COMP Superscalar

25th Euromicro International Conference on Parallel, Distributed and Network-Based Processing (PDP)

Authors: Anton, Victor; Ramon-Cortes, Cristian; Ejarque, Jorge; Badia, Rosa M.

Affiliations: BSC, Barcelona, Spain; CSIC, IIIA, Artificial Intelligence Res Inst, Spanish Natl Res Council, Barcelona, Spain

ISBN: (print) 9781509060580

This paper presents a framework to easily build and execute parallel applications in container-based distributed computing platforms in a user-transparent way. The proposed framework is a combination of COMP Superscalar and Docker. We have built a prototype and evaluated its overhead in the building, deployment, and execution phases. We have observed a significant gain compared with cloud environments during the building and deployment phases. In contrast, we have detected an extra overhead during execution, which is mainly due to the multi-host Docker networking.

Keywords: Cloud Computing; Linux Containers; Distributed Systems; parallel programming models

Programming bare-metal accelerators with heterogeneous threading models: a case study of Matrix-3000

Frontiers of Information Technology & Electronic Engineering, 2023, Vol. 24, No. 4, pp. 509-520

Authors: Jianbin FANG; Peng ZHANG; Chun HUANG; Tao TANG; Kai LU; Ruibo WANG; Zheng WANG

Affiliations: College of Computer Science and Technology, National University of Defense Technology, Changsha 410073, China; School of Computing, University of Leeds, Leeds LS2 9JT, UK

As the hardware industry moves toward using specialized heterogeneous many-core processors to avoid the effects of the power wall, software developers are finding it hard to deal with the complexity of these *** this paper, we share our experience of developing a programming model and its supporting compiler and libraries for Matrix-3000, which is designed for next-generation exascale supercomputers but has a complex memory hierarchy and processor *** assist its software development, we have developed a software stack from scratch that includes a low-level programming interface and a high-level OpenCL *** low-level programming model offers native programming support for using the bare-metal accelerators of Matrix-3000, while the high-level model allows programmers to use the OpenCL programming *** detail our design choices and highlight the lessons learned from developing system software to enable the programming of bare-metal *** programming models have been deployed in the production environment of an exascale prototype system.

Keywords: Heterogeneous computing; parallel programming models; Programmability; Compilers; Runtime systems

A high-productivity task-based programming model for clusters

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2012, Vol. 24, No. 18, pp. 2421-2448

Authors: Tejedor, Enric; Farreras, Montse; Grove, David; Badia, Rosa M.; Almasi, Gheorghe; Labarta, Jesus

Affiliations: Barcelona Supercomp Ctr BSC CNS, Barcelona 08034, Spain; Univ Politecn Cataluna UPC, Barcelona, Spain; IBM Corp, Thomas J Watson Res Ctr, Yorktown Hts, NY 10598 USA; CSIC, Artificial Intelligence Res Inst IIIA, Barcelona, Spain

Programming for large-scale, multicore-based architectures requires adequate tools that offer ease of programming and do not hinder application performance. StarSs is a family of parallel programming models based on automatic function-level parallelism that targets productivity. StarSs deploys a data-flow model: it analyzes dependencies between tasks and manages their execution, exploiting their concurrency as much as possible. This paper introduces Cluster Superscalar (ClusterSs), a new StarSs member designed to execute on clusters of SMPs (Symmetric Multiprocessors). ClusterSs tasks are asynchronously created and assigned to the available resources with the support of the IBM APGAS runtime, which provides an efficient and portable communication layer based on one-sided communication. We present the design of ClusterSs on top of APGAS, as well as the programming model and execution runtime for Java applications. Finally, we evaluate the productivity of ClusterSs, both in terms of programmability and performance and compare it to that of the IBM X10 language. Copyright (c) 2012 John Wiley & Sons, Ltd.

Keywords: parallel programming models; high performance computing; asynchronous execution; productivity

PGAS-FMM: Implementing a distributed fast multipole method using the X10 programming language

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2014, Vol. 26, No. 3, pp. 712-727

Authors: Milthorpe, Josh; Rendell, Alistair P.; Huber, Thomas

Affiliations: Australian Natl Univ, Res Sch Comp Sci, Canberra, ACT 0200 Australia; Australian Natl Univ, Res Sch Chem, Canberra, ACT 0200 Australia

The fast multipole method (FMM) is a complex, multi-stage algorithm over a distributed tree data structure, with multiple levels of parallelism and inherent data locality. X10 is a modern partitioned global address space language with support for asynchronous activities. The parallel tasks comprising FMM may be expressed in X10 by using a scalable pattern of activities. This paper demonstrates the use of X10 to implement FMM for simulation of electrostatic interactions between ions in a cyclotron resonance mass spectrometer. X10's task-parallel model is used to express parallelism by using a pattern of activities mapping directly onto the tree. X10's work stealing runtime handles load balancing fine-grained parallel activities, avoiding the need for explicit work sharing. The use of global references and active messages to create and synchronize parallel activities over a distributed tree structure is also demonstrated. In contrast to previous simulations of ion trajectories in cyclotron resonance mass spectrometers, our code enables both simulation of realistic particle numbers and guaranteed error bounds. Single-node performance is comparable with the fastest published FMM implementations, and critical expansion operators are faster for high accuracy calculations. A comparison of parallel and sequential codes shows the overhead of activity management and work stealing in this application is low. Scalability is evaluated for 8k cores on a Blue Gene/Q system and 512 cores on a Nehalem/InfiniBand cluster. Copyright (c) 2013 John Wiley & Sons, Ltd.

Keywords: X10; partitioned global address space (PGAS); active messages; parallel programming models; scientific computing; fast multipole method

Enhancing Kokkos with OpenACC

INTERNATIONAL JOURNAL OF HIGH PERFORMANCE COMPUTING APPLICATIONS, 2024, Vol. 38, No. 5, pp. 409-426

Authors: Valero-Lara, Pedro; Lee, Seyong; Gonzalez-Tallada, Marc; Denny, Joel; Teranishi, Keita; Vetter, Jeffrey S.

Affiliations: Oak Ridge Natl Lab, 1 Bethel Valley Rd, Oak Ridge, TN 37830 USA; Univ Politecn Cataluna, Barcelona, Spain

C++ template metaprogramming has emerged as a prominent approach for achieving performance portability in heterogeneous computing. Kokkos represents a notable paradigm in this domain, offering programmers a suite of high-level abstractions for generic programming while deferring much of the device-specific code generation and optimization to the compiler through template specializations. Kokkos furnishes a range of device-specific code specializations across multiple back ends, including CUDA and HIP. Diverging from conventional back ends, the OpenACC implementation presents a high-level, multicompiler, multidevice, and directive-based programming model. This paper presents recent advancements in the OpenACC back end for Kokkos (i.e., KokkACC) and focuses on its integration into the Kokkos ecosystem, exploration of automatic device selection capabilities to enhance productivity, and performance evaluation on modern hardware such as NVIDIA H100 GPUs. The study includes implementation details and a thorough performance assessment across various computational benchmarks, including minibenchmarks (AXPY and DOT product), miniapps (LULESH, MiniFE, and SNAP-LAMMPS), and a scientific kernel based on the lattice Boltzmann method.

Keywords: OpenACC; C++ metaprogramming; Kokkos; CUDA; OpenMP target; parallel programming models

EVALUATING COMPUTATIONAL COSTS WHILE HANDLING DATA AND CONTROL PARALLELISM

PARALLEL PROCESSING LETTERS, 2008, Vol. 18, No. 1, pp. 165-174

Author: Campa, Sonia

Affiliation: Univ Pisa, Dept Comp Sci, I-56123 Pisa, Italy

The aim of this work is to introduce a computational cost system associated with a semantic framework for handling data and control parallelism orthogonally. In such a framework, a parallel application is described by a semantic expression that combines data access and control parallelism abstractions in an orthogonal manner. The evaluation of such an expression is driven by a set of rewriting rules, each of which is paired with a computational cost. We show how to evaluate the final cost of the application, and how this information, together with the capabilities of the semantic framework, can be exploited to increase overall performance.

Keywords: parallel programming models; data parallelism; control parallelism; formal semantics; cost modelling

Parallel signal processing with S-Net

Procedia Computer Science, 2010, Vol. 1, No. 1, pp. 2085-2094

Authors: Frank Penczek; Stephan Herhut; Clemens Grelck; Sven-Bodo Scholz; Alex Shafarenko; Rémi Barrère; Eric Lenormand

Affiliations: University of Hertfordshire, School of Computer Science, Hatfield, UK; University of Amsterdam, Institute of Informatics, Amsterdam, The Netherlands; Thales Research & Technologies, Palaiseau, France

We argue that programming high-end stream-processing applications requires a form of coordination language that enables the designer to represent interactions between stream-processing functions asynchronously. We further argue that the level of abstraction that current programming tools engender should be drastically increased and present a coordination language and component technology that is suitable for that purpose. We demonstrate our approach on a real radar-data processing application from which we reuse all existing components and present speed-ups that we were able to achieve on contemporary multi-core hardware.

Keywords: parallel programming models; Component models; Signal processing; Stream processing
