Search results - Inner Mongolia University Library

Studying the expressiveness and performance of parallelization abstractions for linear pipelines

14th International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM 2023 - Part of PPoPP 2023

Authors: Mastoras, Aristeidis; Yzelman, Albert-Jan N. (Computing Systems Laboratory, Zurich Research Center, Huawei Technologies, Zurich, Switzerland)

ISBN: (print) 9798400701153

Semi-automatic parallelization provides abstractions that simplify the programming effort and allow the user to make decisions that cannot be made by tools. However, abstractions for general-purpose systems usually do not carry sufficient knowledge about the structure of the program, and thus parallelization with them may lead to poor *** this paper, we present a popular class of programs, called linear pipelines, that cannot be easily and efficiently parallelized with general-purpose abstractions. We discuss the difficulties and inefficiencies of parallelizing linear pipelines with general-purpose abstractions, and we explain how pattern-specific abstractions overcome these problems. We present the properties of linear pipelines that should be described with pattern-specific abstractions and how these properties are exploited by the state of the art. In addition, we discuss the importance of exposing the performance parameters and how they are combined by pattern-specific knowledge. We claim that designing pattern-specific abstractions for general-purpose programming models is one way to simplify parallel programming and improve performance without sacrificing any expressive power. Consequently, we propose possible pattern-specific extensions to general-purpose parallel programming models, e.g., OpenMP, to support easy and efficient parallelization of linear pipelines. © 2023 ACM.

Keywords: parallel programming

A Task-Parallel Runtime for Heterogeneous Multi-node Vector Systems

23rd International Conference on Parallel and Distributed Computing, Applications, and Technologies, PDCAT 2022

Authors: Ide, Kazuki; Takahashi, Keichi; Shimomura, Yoichi; Takizawa, Hiroyuki (Graduate School of Information Sciences, Tohoku University, Miyagi, Sendai 980-8578, Japan; Cyberscience Center, Tohoku University, Miyagi, Sendai 980-8578, Japan)

ISBN: (print) 9783031299261

In recent years, high-performance computing systems have been equipped not only with host processors but also with accelerators, and they are becoming both more heterogeneous and larger in scale. The task-parallel execution model is a promising way to utilize such large-scale systems efficiently because it minimizes synchronization in comparison with traditional models. In this paper, we propose a task-parallel runtime system that considers the processors for task management and task execution separately; those two roles can be assigned to different processors. This paper focuses on NEC SX-Aurora TSUBASA as an example of heterogeneous multi-node systems, which are equipped with two kinds of general-purpose processors, to exploit the system heterogeneity for efficient task-parallel execution. Specifically, the proposed runtime system is used to select an appropriate processor for task management, depending on several execution conditions. The performance of the proposed runtime is discussed by running a Cholesky factorization implementation. The evaluation results show that the proposed runtime system can improve performance by more than 25% in comparison with a conventional implementation. © 2023, The Author(s), under exclusive license to Springer Nature Switzerland AG.

Keywords: parallel programming

Parallel programming patterns for multi-processor SoC: Application to video processing

Authors: Paulin, Pierre G.; Ozcan, Ali Erdem; Gagne, Vincent; Lavigueur, Bruno; Benny, Olivier (Inc., 16 Fitzgerald Road, Ottawa K2H 8R6, Canada)

Efficient, scalable and productive parallel programming is a major challenge for exploiting future multiprocessor SoC platforms. This article presents the MultiFlex programming environment, which has been developed to address this challenge. It is targeted for use on Platform 2012, a scalable multi-processor fabric. The MultiFlex environment supports high-level simulation and iterative platform mapping, and includes tools for programming-model-aware debug, trace, visualization and analysis. This article focuses on the two classes of programming abstractions supported in MultiFlex. The first is a set of Parallel Programming Patterns (PPP), which offer a rich set of programming abstractions for implementing efficient data- and task-level parallel applications. The second is a Reactive Task Management (RTM) abstraction, which offers a lightweight C-based API to support dynamic dispatching of small-grain tasks on tightly coupled parallel processing resources. The use of the MultiFlex native programming model is illustrated through the capture and mapping of two representative video applications. The first is a high-quality rescaling (HQR) application on a multiprocessor platform. We present the details of the optimization process required for mapping the HQR application, for which the reference code requires 350 GIPS (giga instructions per second), onto a 16-processor cluster. Our results show that the parallel implementation using the PPP model offers almost linear acceleration with respect to the number of processing elements. The second application is a high-definition VC-1 decoder. For this application, we illustrate two different parallel programming model variants, one using PPPs, the other based on RTM. These two versions are mapped onto two variants of a homogeneous version of the Platform 2012 multi-core fabric. © 2013 ACM.

Keywords: parallel programming

A General-purpose Parallel and Heterogeneous Task Programming System for VLSI CAD

IEEE International Conference on Computer-Aided Design

Author: Tsung-Wei Huang (University of Utah)

ISBN: (digital) 9781665423243

This paper introduces Taskflow to address the critical question of “How can we make it easier to implement and deploy parallel computer-aided design (CAD) algorithms on large heterogeneous nodes with high performance and simultaneous high productivity?” Parallelizing CAD is an extremely challenging job. Modern CAD applications exhibit unique computational patterns and user requirements that need very strategic decomposition to benefit from parallelism. Taskflow helps researchers and developers manage the implementation complexity of parallel algorithms by introducing a new high-level programming model supported by an efficient runtime. By capitalizing on emerging parallelism comprising manycore central processing units (CPUs), graphics processing units (GPUs), and custom accelerators, Taskflow enables CAD to achieve new performance and productivity milestones that were previously out of reach.

Keywords: Task analysis; Timing; Solid modeling; Tools; Productivity; Parallel programming; Object oriented modeling

Parallel programming with big operators

18th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP 2013

Authors: Park, Changhee; Steele Jr., Guy L.; Tristan, Jean-Baptiste (KAIST, Seoul, Republic of Korea; Oracle Labs, Burlington, MA, United States)

ISBN: (print) 9781450319225

In the sciences, it is common to use the so-called "big operator" notation to express the iteration of a binary operator (the reducer) over a collection of values. Such a notation typically assumes that the reducer is associative and abstracts the iteration process. Consequently, from a programming point-of-view, we can organize the reducer operations to minimize the depth of the overall reduction, allowing a potentially parallel evaluation of a big operator expression. We believe that the big operator notation is indeed an effective construct to express parallel computations in the Generate/Map/Reduce programming model, and our goal is to introduce it in programming languages to support parallel programming. The effective definition of such a big operator expression requires a simple way to generate elements, and a simple way to declare algebraic properties of the reducer (such as its identity, or its commutativity). In this poster, we want to present an extension of Scala with support for big operator expressions. We show how big operator expressions are defined and how the API is organized to support the simple definition of reducers with their algebraic properties. © 2013 Authors.

Keywords: parallel programming
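
To make the idea in this abstract concrete, here is a minimal Python sketch of a big-operator expression evaluated as a balanced reduction tree. It is only an illustration and not the Scala extension the poster describes: the names `Reducer`, `big`, and `SUM` are hypothetical, and the sketch runs sequentially, only marking where a parallel runtime could split the work. It assumes the reducer is associative, as the abstract states.

```python
from dataclasses import dataclass
from typing import Callable, Generic, Sequence, TypeVar

T = TypeVar("T")


@dataclass(frozen=True)
class Reducer(Generic[T]):
    """Hypothetical type: a binary operator plus its declared identity."""
    op: Callable[[T, T], T]
    identity: T


def big(reducer: Reducer[T], values: Sequence[T]) -> T:
    """Evaluate a big-operator expression as a balanced reduction tree.

    Associativity of reducer.op is what makes this regrouping valid.
    The tree has logarithmic depth in the number of values.
    """
    if len(values) == 0:
        return reducer.identity
    if len(values) == 1:
        return values[0]
    mid = len(values) // 2
    # The two halves are independent of each other, so a parallel
    # runtime could evaluate them concurrently before combining.
    left = big(reducer, values[:mid])
    right = big(reducer, values[mid:])
    return reducer.op(left, right)


# Sum of squares for i = 1..4, written as "generate, map, reduce".
SUM = Reducer(op=lambda a, b: a + b, identity=0)
total = big(SUM, [i * i for i in range(1, 5)])  # 1 + 4 + 9 + 16 = 30
```

Declaring the identity alongside the operator lets the empty collection have a well-defined result, which is one of the algebraic properties the abstract mentions.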

Exploring High Performance and Energy Efficient Graph Processing on GPU

Author: Watling, Robert P. (Michigan Technological University)

Degree level: M.S., Master of Science/Master of Surgery

Parallel graph processing is central to analytical computer science applications, and GPUs have proven to be an ideal platform for parallel graph processing. Existing GPU graph processing frameworks present performance improvements but often neglect two issues: the unpredictability of a given input graph and the energy consumption of the graph processing. Our prototype software, EEGraph (Energy Efficiency of Graph processing), is a flexible system consisting of several graph processing algorithms with configurable parameters for vertex update synchronization, vertex activation, and memory management along with a lightweight software-based GPU energy measurement scheme. We observe relationships between different configurations of our software, performance, and GPU energy for processing in-memory and out-of-memory graphs. The ideal parameters are discovered for specific input graphs by analyzing the observed relationships. We also present the utility of subgraph generation to predict the performance and energy consumption of complete graph configurations. EEGraph improves upon state-of-the-art GPU-based graph processing software by 2.08 times for performance and 1.60 times for GPU energy for processing in-memory graph datasets. Additionally, EEGraph improves upon the state-of-the-art by 3.30 times for performance and 1.63 times for GPU energy for processing large out-of-memory graph datasets.

Keywords: Computer systems; Energy; Graphical processing unit; Graphs; High performance computing; Parallel programming

Sisal 3.2: functional language for scientific parallel programming

Enterprise Information Systems, 2013, vol. 7, no. 2, pp. 227-236

Author: Kasyanov, Victor (Russian Acad Sci, Inst Informat Syst, Novosibirsk 630090, Russia)

Sisal 3.2 is a new input language of the system of functional programming (SFP), which is under development at the Institute of Informatics Systems in Novosibirsk as an interactive visual environment for supporting scientific parallel programming. This paper contains an overview of Sisal 3.2 and a description of its new features compared with previous versions of the SFP input language, such as multidimensional array support, new abstractions like parametric types and generalised procedures, more flexible user-defined reductions, improved interoperability with other programming languages, and the specification of several optimising source text annotations.

Keywords: functional programming; dataflow languages; parallel programming; scientific computations

Many Sequential Iterative Algorithms Can Be Parallel and (Nearly) Work-efficient

34th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA)

Authors: Shen, Zheqi; Wan, Zijin; Gu, Yan; Sun, Yihan (UC Riverside, Riverside, CA 92521, USA)

ISBN: (print) 9781450391467

Some recent papers showed that many sequential iterative algorithms can be directly parallelized by identifying the dependences between the input objects. This approach yields many simple and practical parallel algorithms, but there are still challenges in achieving work-efficiency and high parallelism. Work-efficiency means that the number of operations is asymptotically the same as the best sequential solution. This can be hard for certain problems where the number of dependences between objects is asymptotically more than the optimal sequential work, and we cannot even afford the cost to generate them. To achieve high parallelism, we always want to process as many objects as possible in parallel. The goal is to achieve $\tilde{O}(D)$ span for a problem with the deepest dependence length D. We refer to this property as round-efficiency. This paper presents work-efficient and round-efficient algorithms for a variety of classic problems and proposes general approaches to do so. To efficiently parallelize many sequential iterative algorithms, we propose the phase-parallel framework. The framework assigns a rank to each object and processes the objects based on the order of their ranks. All objects with the same rank can be processed in parallel. To enable work-efficiency and high parallelism, we use two types of general techniques. Type 1 algorithms aim to use range queries to extract all objects with the same rank to avoid evaluating all the dependences. We discuss activity selection and Dijkstra's algorithm using the Type 1 framework. Type 2 algorithms aim to wake up an object when the last object it depends on is finished. We discuss activity selection, longest increasing subsequence (LIS), greedy maximal independent set (MIS), and many other algorithms using the Type 2 framework. All of our algorithms are (nearly) work-efficient and round-efficient, and some of them (e.g., LIS) are the first to achieve both. Many of them improve the previous best bounds. Moreover,

Keywords: parallel algorithms; phase-parallel framework; parallel programming; sequential iterative algorithms; activity selection; longest increasing subsequence; maximal independent set; independence system
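
The rank-based processing described in this abstract can be illustrated with a small Python sketch of the "wake up an object when its last dependence finishes" idea (Type 2). This is a sequential simulation under stated assumptions, not the paper's algorithm: the function name `phase_parallel_rounds` and the explicit `depends_on` map are hypothetical, and materializing every dependence as done here is exactly the cost the abstract says a work-efficient algorithm has to avoid.

```python
from collections import defaultdict
from typing import Dict, List


def phase_parallel_rounds(num_objects: int,
                          depends_on: Dict[int, List[int]]) -> List[List[int]]:
    """Group objects 0..num_objects-1 into rounds of equal rank.

    depends_on[v] lists the objects v must wait for. Objects in the
    same round do not depend on each other, so a parallel runtime
    could process each round's objects concurrently.
    """
    # Number of unfinished dependences per object.
    remaining = [len(depends_on.get(v, ())) for v in range(num_objects)]
    # Reverse map: which objects are waiting on u.
    dependents = defaultdict(list)
    for v, deps in depends_on.items():
        for u in deps:
            dependents[u].append(v)

    frontier = [v for v in range(num_objects) if remaining[v] == 0]
    rounds = []
    while frontier:
        rounds.append(sorted(frontier))
        next_frontier = []
        for u in frontier:
            for v in dependents[u]:
                remaining[v] -= 1
                if remaining[v] == 0:
                    # The last dependence of v just finished: wake v up.
                    next_frontier.append(v)
        frontier = next_frontier
    return rounds


# LIS-style dependences: position j waits for every earlier, smaller value.
values = [3, 1, 4, 2, 5]
deps = {j: [i for i in range(j) if values[i] < values[j]]
        for j in range(len(values))}
rounds = phase_parallel_rounds(len(values), deps)
# rounds == [[0, 1], [2, 3], [4]]
```

In this example the number of rounds, three, equals the length of a longest increasing subsequence of `values`, which matches the role of the deepest dependence length D in the abstract.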

STMatch: Accelerating Graph Pattern Matching on GPU with Stack-Based Loop Optimizations

International Conference for High Performance Computing, Networking, Storage and Analysis (HPC)

Authors: Wei, Yihua; Jiang, Peng (Univ Iowa, Iowa City, IA 52242, USA)

ISBN: (print) 9781665454445

Graph pattern matching is a fundamental task in many graph analytics and graph mining applications. As an NP-hard problem, it is often a performance bottleneck in these applications. Previous work has proposed to use GPU to accelerate the computation. However, we find that the existing GPU solutions fail to show a performance advantage over the state-of-the-art CPU implementation due to their subgraph-centric design. This work proposes a novel stack-based graph pattern matching system on GPU that avoids the synchronization and memory consumption issues of the previous subgraph-centric systems. We also propose a two-level work-stealing and a loop-unrolling technique to improve the inter-warp and intra-warp GPU resource utilization of our system. The experiments show that our system significantly advances the state-of-the-art for graph pattern matching on GPU.

Keywords: parallel programming; Backtracking

Vectorizing Sparse Matrix Computations with Partially-Strided Codelets

International Conference for High Performance Computing, Networking, Storage and Analysis (HPC)

Authors: Cheshmi, Kazem; Cetinic, Zachary; Dehnavi, Maryam Mehri (Univ Toronto, Toronto, ON, Canada)

ISBN: (print) 9781665454445

The compact data structures and irregular computation patterns in sparse matrix computations introduce challenges to vectorizing these codes. Available approaches primarily vectorize strided computation regions of a sparse code. In this work, we propose a locality-based codelet mining (LCM) algorithm that efficiently searches for strided and partially strided regions in sparse matrix computations for vectorization. We also present a classification of partially strided codelets and a differentiation-based approach to generate codelets from memory accesses in the sparse computation. LCM is implemented as an inspector-executor framework called LCM I/E that generates vectorized code for the sparse matrix-vector multiplication (SpMV), sparse matrix times dense matrix (SpMM), and sparse triangular solver (SpTRSV). LCM I/E outperforms the MKL library with an average speedup of 1.67x, 4.1x, and 1.75x for SpMV, SpTRSV, and SpMM, respectively. It is also faster than the state-of-the-art inspector-executor framework Sympiler [1] for the SpTRSV kernel with an average speedup of 1.9x.

Keywords: Vectorization; Parallel programming; Polyhedral analysis; Sparse matrix computation
