ISBN (Print): 9781665465007
One obstacle to application development on multi-FPGA systems with high-level synthesis (HLS) is the lack of a standard programming interface. Implementing and debugging an application on multiple FPGA boards is difficult without one. The Message Passing Interface (MPI) is a standard parallel programming interface commonly used in distributed-memory systems. This paper presents a tool-independent MPI library called FiC-MPI that can be used in HLS for multi-FPGA systems in which the FPGA nodes are directly connected. With FiC-MPI, a variety of parallel software, including general-purpose benchmarks, can be implemented easily. FiC-MPI was implemented and evaluated on the M-KUBOS cluster, which consists of Zynq MPSoC boards connected by a static time-division multiplexing network. Using the FiC-MPI simulator, parallel programs can be debugged before being deployed on the real machine. As a case study, the Himeno-BMT benchmark was implemented with FiC-MPI. It achieved 178.7 MFLOPS on a single node and scaled to 643.7 MFLOPS on four nodes and 896.9 MFLOPS on six nodes of the M-KUBOS cluster. The implementation demonstrates the ease of developing parallel programs with FiC-MPI on multi-FPGA systems.
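For reference, the sketch below illustrates the message-passing style that an MPI library for HLS adopts, written against the standard MPI C API; FiC-MPI's actual HLS-facing function names and signatures are not given in the abstract, so the standard calls shown here are purely illustrative.

    // Sketch of an MPI-style collective reduction across nodes (ranks).
    // Written against standard MPI; FiC-MPI's own API is not shown in the abstract.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);

        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        // Each node owns a slice of the data; a single value stands in for it here.
        double local = static_cast<double>(rank + 1);
        double global = 0.0;

        // Collective reduction across all nodes.
        MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

        if (rank == 0)
            std::printf("sum over %d ranks = %f\n", size, global);

        MPI_Finalize();
        return 0;
    }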
ISBN (Print): 9781665454452
Python's ease of use and rich collection of numeric libraries make it an excellent choice for rapidly developing scientific applications. However, composing these libraries to take advantage of complex heterogeneous nodes is still difficult. To simplify writing multi-device code, we created Parla, a heterogeneous task-based programming framework that fully supports Python's scientific programming stack. Parla's API is based on Python decorators and allows users to wrap code in Parla tasks for parallel execution. Parla arrays enable automatic movement of data between devices. The Parla runtime handles resource-aware mapping, scheduling, and execution of tasks. Compared to other Python tasking systems, Parla is unique in its parallelization of tasks within a single process, its GPU context and resource-aware runtime, and its design around gradual adoption to provide easy migration of and integration into existing Python applications. We show that Parla can achieve performance competitive with hand-optimized code while improving ease of development.
ISBN (Print): 9781665460224
The wide adoption of Deep Neural Networks (DNN) has served as an incentive to design and manufacture powerful and specialized hardware technologies, targeting systems from Edge devices to Cloud and HPC. While the adoption of ONNX as a de facto standard for DNN model description provides portability across various AI frameworks, supporting DNN models on various hardware architectures remains challenging. SYCL provides a C++-based portable parallel programming model to target various devices, so enabling a SYCL backend for an AI framework can lead to a hardware-agnostic model for heterogeneous systems. This paper proposes a SYCL backend for ONNXRuntime as a possible solution towards the performance portability of deep learning algorithms. The proposed backend uses the existing state-of-the-art SYCL-DNN and SYCL-BLAS libraries to invoke tuned SYCL kernels for DNN operations. Our performance evaluation shows that the proposed approach can achieve performance comparable to state-of-the-art optimized vendor-specific libraries.
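For context, the portability argument rests on SYCL's single-source C++ kernels. The minimal sketch below shows the style of kernel that libraries such as SYCL-DNN and SYCL-BLAS provide in tuned form; it is a generic element-wise kernel written against the SYCL 2020 API, not code from the proposed backend.

    // Generic SYCL 2020 element-wise kernel: one source, any SYCL device.
    #include <sycl/sycl.hpp>
    #include <vector>
    #include <cstdio>

    int main() {
        constexpr size_t n = 1024;
        std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n, 0.0f);

        sycl::queue q;  // selects a default device: CPU, GPU, or accelerator
        {
            sycl::buffer<float> bufA(a.data(), sycl::range<1>(n));
            sycl::buffer<float> bufB(b.data(), sycl::range<1>(n));
            sycl::buffer<float> bufC(c.data(), sycl::range<1>(n));

            q.submit([&](sycl::handler& h) {
                sycl::accessor A(bufA, h, sycl::read_only);
                sycl::accessor B(bufB, h, sycl::read_only);
                sycl::accessor C(bufC, h, sycl::write_only, sycl::no_init);
                // Element-wise addition expressed once, portable across devices.
                h.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
                    C[i] = A[i] + B[i];
                });
            });
        }  // buffer destruction copies the result back to the host vector

        std::printf("c[0] = %f\n", c[0]);
        return 0;
    }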
ISBN (Print): 9781665473675
oneAPI is a major initiative by Intel aimed at making it easier to program heterogeneous architectures used in high-performance computing using a unified application programming interface (API). While raising the abstraction level via a unified API represents a promising step for the current generation of students and practitioners to embrace high-performance computing, we argue that a curriculum of well-developed software engineering methods and well-crafted exemplars will be necessary to ensure interest by this audience and those who teach them. We aim to bridge the gap by developing a curriculum, codenamed UnoAPI, that takes a more holistic approach by looking beyond language and framework to include the broader development ecosystem, similar to the experience found in popular HPC languages such as Python. We hope to make parallel programming a more attractive option by making it look more like general application development in the modern languages used by most students and educators today. Our curriculum emanates from the perspective of well-crafted exemplars grounded in the foundations of computer systems, given that most HPC architectures of interest begin from the systems tradition, with an integrated treatment of essential principles of distributed systems, programming languages, and software engineering. We argue that a curriculum should cover the essence of these topics to attract students to HPC and enable them to confidently solve computational problems using oneAPI. At the time of this submission, we have shared our materials with a small group of undergraduate sophomores, and their responses have been encouraging in terms of self-reported comprehension and ability to reproduce the compilation and execution of exemplars on their personal systems. We plan a follow-up study with a larger cohort by incorporating some of our materials into our existing course on High-Performance Computing.
Graph Neural Networks (GNNs) have shown great success in graph learning, including physics systems, protein interfaces, disease classification, and molecular fingerprints. Due to the complexity of real-world tasks and the size of graph datasets, current GNN models are becoming bigger and more complicated in order to improve learning ability and prediction accuracy. In addition, GNN processing consists of two major components, graph operations and neural network (NN) operations, which are executed alternately. This interleaved, complex processing poses a challenge for many computational platforms, especially those without accelerators. Optimization frameworks designed solely for deep learning or for graph computing cannot achieve good performance on GNNs. In this work, we first investigate the performance bottlenecks of GNN processing on multi-core processors. We then apply a set of optimization strategies to leverage the capabilities of multi-core processors for GNN processing. Specifically, we build a set of microkernels for the graph operations of GNNs using assembly instructions that exploit the SIMD processing units within a core, and we implement a task allocator based on a greedy algorithm to balance the workload across CPU cores. In addition, we optimize the NN operations according to their characteristics in GNNs. Experimental results show that the proposed methods achieve up to 2.88x and 4.08x performance improvements for various GNN models on Phytium 2000+ and Kunpeng 920, respectively, over a state-of-the-art GNN framework.
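The abstract mentions a greedy task allocator for balancing work across cores. A minimal sketch of one common greedy strategy, assigning each task (largest estimated cost first) to the currently least-loaded core, is shown below; the paper's actual cost model and data structures are not described in the abstract.

    // Greedy load balancing: sort tasks by estimated cost (descending) and
    // assign each to the least-loaded core. Generic sketch, not the paper's code.
    #include <algorithm>
    #include <cstdio>
    #include <functional>
    #include <queue>
    #include <utility>
    #include <vector>

    std::vector<int> greedy_assign(const std::vector<double>& cost, int num_cores) {
        std::vector<int> order(cost.size());
        for (size_t i = 0; i < order.size(); ++i) order[i] = static_cast<int>(i);
        std::sort(order.begin(), order.end(),
                  [&](int a, int b) { return cost[a] > cost[b]; });

        // Min-heap of (accumulated load, core id).
        using Load = std::pair<double, int>;
        std::priority_queue<Load, std::vector<Load>, std::greater<Load>> cores;
        for (int c = 0; c < num_cores; ++c) cores.push({0.0, c});

        std::vector<int> assignment(cost.size());
        for (int t : order) {
            auto [load, core] = cores.top();
            cores.pop();
            assignment[t] = core;
            cores.push({load + cost[t], core});
        }
        return assignment;
    }

    int main() {
        std::vector<double> cost = {5.0, 3.0, 8.0, 1.0, 4.0, 2.0};  // hypothetical task costs
        auto a = greedy_assign(cost, 2);
        for (size_t i = 0; i < a.size(); ++i)
            std::printf("task %zu -> core %d\n", i, a[i]);
        return 0;
    }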
ISBN (Print): 9781665490207
OpenACC is a high-level, directive-based parallel programming model that can manage the complexity of heterogeneous architectures and abstract it from the user. The model's portability across CPUs and accelerators has earned it a wide variety of users, which makes it crucial to analyze the reliability of the compilers' implementations. To address this challenge, the OpenACC Validation and Verification team has proposed a validation testsuite that verifies OpenACC implementations across various compilers, together with an infrastructure for more streamlined execution. This paper covers the following aspects: (a) the new developments since the last publication on the testsuite, (b) an outline of the use of the infrastructure, (c) a discussion of tests that highlight our workflow process, (d) an analysis of the results from executing the testsuite on various systems, and (e) an outline of future developments.
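As an illustration of what such directive feature tests typically look like, the sketch below offloads a loop with an OpenACC directive and verifies the result on the host; it is a generic example, not a test taken from the V&V testsuite.

    // Shape of a typical directive feature test: offload a loop with OpenACC,
    // then verify the result on the host. Illustrative only.
    // Compile with an OpenACC compiler, e.g. "nvc++ -acc" or "g++ -fopenacc".
    #include <cstdio>
    #include <vector>

    int main() {
        const int n = 1 << 16;
        std::vector<float> x(n, 2.0f), y(n, 1.0f);
        float* px = x.data();
        float* py = y.data();
        const float a = 3.0f;

        #pragma acc parallel loop copyin(px[0:n]) copy(py[0:n])
        for (int i = 0; i < n; ++i)
            py[i] = a * px[i] + py[i];

        // Host-side verification: every element must equal a*2 + 1 = 7.
        int errors = 0;
        for (int i = 0; i < n; ++i)
            if (py[i] != 7.0f) ++errors;

        if (errors == 0) std::printf("PASS\n");
        else std::printf("FAIL: %d mismatches\n", errors);
        return 0;
    }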
Despite the various research initiatives and proposed programming models, efficient solutions for parallel programming in HPC clusters still rely on a complex combination of different programming models (e.g., OpenMP ...
This paper presents a multi-tiered approach to designing learning experiences in HPC for undergraduate students that significantly reinforces comprehension of CS topics while introducing new concepts in parallel and distributed computing. The paper details the experience of students working on the design, construction, and testing of a computing cluster, including budgeting, hardware purchase and setup, software installation and configuration, interconnection networks, communication, benchmarking, and running parallel code using MPI and OpenMP. The case study of building a relatively low-cost, small-scale computing cluster, which can be used as a template for CS senior projects or independent studies, also yielded an opportunity to involve students in the creation of teaching tools for parallel computing at many levels of the CS curriculum.
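A minimal hybrid MPI + OpenMP program of the kind students might run on such a cluster is sketched below; it estimates pi by numerical integration and is illustrative only, not code from the paper.

    // Hybrid MPI + OpenMP: each rank integrates part of the interval for pi,
    // using OpenMP threads within the node. Illustrative only.
    // Build e.g.: mpicxx -fopenmp pi.cpp -o pi && mpirun -np 4 ./pi
    #include <mpi.h>
    #include <omp.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const long n = 100000000;     // total quadrature points
        const double h = 1.0 / n;
        double local = 0.0;

        // Strided decomposition across ranks; OpenMP parallelizes within a rank.
        #pragma omp parallel for reduction(+:local)
        for (long i = rank; i < n; i += size) {
            double x = (i + 0.5) * h;
            local += 4.0 / (1.0 + x * x);
        }

        double pi = 0.0;
        MPI_Reduce(&local, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)
            std::printf("pi ~= %.12f\n", pi * h);

        MPI_Finalize();
        return 0;
    }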
Contingency analysis (CA) is one of the critical tools of static security assessment (SSA). It is used to forecast the operating states of a power system under one or more outages of generators, transmission lines, transformers, etc. Performing SSA requires repetitive load flow analyses to obtain the bus voltages, bus injections, and line flows for each possible outage. Such repetitive load flow analysis demands a huge computational effort, which calls for efficient system modelling for faster load flow solutions, parallel programming, and high-performance computing (HPC). In this paper, N-1-1 CA is analysed using the fast decoupled load flow (FDLF) with a strategy for screening and ranking the catastrophic contingencies. The paper explores a computationally efficient method to analyze the severity and ranking of N-1-1 contingencies for large power system SSA. The performance of the FDLF-based SSA method is demonstrated on the standard IEEE 14-bus and 118-bus systems.
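One common textbook approach to contingency screening and ranking is to compute a line-overload performance index for each outage and sort by severity. The sketch below uses made-up post-contingency flows and a generic index; it is not the paper's exact formulation, which is not given in the abstract.

    // Generic contingency screening sketch: rank each outage by a line-overload
    // performance index PI = sum_l (P_l / P_l_max)^(2m). Hypothetical data.
    #include <algorithm>
    #include <cmath>
    #include <cstdio>
    #include <string>
    #include <vector>

    struct Contingency {
        std::string label;               // e.g. the outaged element
        std::vector<double> flow;        // post-contingency line flows (MW)
        std::vector<double> limit;       // line ratings (MW)
    };

    double performance_index(const Contingency& c, int m = 1) {
        double pi = 0.0;
        for (size_t l = 0; l < c.flow.size(); ++l)
            pi += std::pow(c.flow[l] / c.limit[l], 2 * m);
        return pi;
    }

    int main() {
        std::vector<Contingency> cases = {
            {"line 1-2 out", {90, 40, 70}, {100, 80, 100}},
            {"line 2-3 out", {60, 95, 30}, {100, 80, 100}},
            {"line 4-5 out", {50, 45, 55}, {100, 80, 100}},
        };

        // Higher PI means a more severe contingency; sort descending.
        std::sort(cases.begin(), cases.end(), [](const auto& a, const auto& b) {
            return performance_index(a) > performance_index(b);
        });

        for (const auto& c : cases)
            std::printf("%-14s PI = %.3f\n", c.label.c_str(), performance_index(c));
        return 0;
    }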
The rise of machine learning (ML) applications and their use of mixed precision to perform interesting science are driving forces behind AI for science on HPC. The convergence of ML and HPC with mixed precision offers the possibility of transformational changes in computational science. The HPL-AI benchmark is designed to measure the performance of mixed precision arithmetic, as opposed to the HPL benchmark, which measures double-precision performance. Pushing the limits of systems at extreme scale is nontrivial; little public literature explores the optimization of mixed precision computations at this scale. In this work, we demonstrate how to scale up the HPL-AI benchmark on the pre-exascale Summit and exascale Frontier systems at the Oak Ridge Leadership Computing Facility (OLCF) with a cross-platform design. We present the implementation, performance results, and a guideline of optimization strategies employed for delivering portable performance on both AMD and NVIDIA GPUs at extreme scale.
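The core idea that HPL-AI exercises is to perform the expensive factorization and solve in low precision and then recover double-precision accuracy through iterative refinement. The toy sketch below uses float as the low precision (standing in for FP16 on GPUs) and refines in double; it is illustrative only and far from the benchmark's blocked, distributed implementation, which factors once and reuses the factors.

    // Mixed-precision iterative refinement: solve in float, refine in double.
    #include <cstdio>
    #include <vector>

    // Solve A x = b in float with Gaussian elimination (no pivoting; adequate
    // for the diagonally dominant toy matrix below).
    std::vector<float> solve_low_precision(std::vector<float> A,
                                           std::vector<float> b, int n) {
        for (int k = 0; k < n; ++k) {
            for (int i = k + 1; i < n; ++i) {
                float m = A[i * n + k] / A[k * n + k];
                for (int j = k; j < n; ++j) A[i * n + j] -= m * A[k * n + j];
                b[i] -= m * b[k];
            }
        }
        std::vector<float> x(n);
        for (int i = n - 1; i >= 0; --i) {
            float s = b[i];
            for (int j = i + 1; j < n; ++j) s -= A[i * n + j] * x[j];
            x[i] = s / A[i * n + i];
        }
        return x;
    }

    int main() {
        const int n = 4;
        std::vector<double> A = {10, 1, 2, 0,
                                  1, 12, 0, 3,
                                  2, 0, 9, 1,
                                  0, 3, 1, 11};
        std::vector<double> b = {13, 16, 12, 15};

        std::vector<float> Af(A.begin(), A.end());   // low-precision copy of A
        std::vector<double> x(n, 0.0);

        // Refinement loop: residual in double, correction solved in float.
        for (int iter = 0; iter < 5; ++iter) {
            std::vector<double> r(n);
            for (int i = 0; i < n; ++i) {
                r[i] = b[i];
                for (int j = 0; j < n; ++j) r[i] -= A[i * n + j] * x[j];
            }
            std::vector<float> rf(r.begin(), r.end());
            std::vector<float> d = solve_low_precision(Af, rf, n);
            for (int i = 0; i < n; ++i) x[i] += static_cast<double>(d[i]);
        }

        for (int i = 0; i < n; ++i) std::printf("x[%d] = %.15f\n", i, x[i]);
        return 0;
    }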