Data redistribution aims to reshuffle data to optimize some objective for an algorithm. The objective can be multi-dimensional, such as improving computational load balance or decreasing communication volume or cost, with the ultimate goal of increasing efficiency and therefore reducing the time-to-solution of the algorithm. The classic redistribution problem focuses on optimally scheduling communications when reshuffling data between two regular, usually block-cyclic, data distributions. Besides the distribution, data size is also a performance-critical parameter because it affects the reshuffling algorithm in terms of cache use, communication efficiency, and potential parallelism. In addition, task-based runtime systems have recently gained popularity as a candidate for addressing the programming complexity on the way to exascale. In this scenario, it becomes paramount to develop a flexible redistribution algorithm for task-based runtime systems that supports all types of regular and irregular data distributions and takes data size into account. In this article, we detail a flexible redistribution algorithm and implement an efficient approach in a task-based runtime system, PaRSEC. Performance results show strong capability compared to the theoretical bound and to ScaLAPACK, and application results highlight increased efficiency with little overhead across data distributions, data sizes, and data formats.
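As a minimal illustration of the classic redistribution problem described above, the following sketch (not the paper's algorithm or PaRSEC's implementation) computes which process owns each block under a source and a target 1D block-cyclic distribution and derives the resulting communication pattern; all names and parameters are illustrative.

    # Illustrative sketch of 1D block-cyclic redistribution (not the paper's algorithm).
    # A vector of `nblocks` blocks is laid out block-cyclically over `p` processes;
    # redistributing to a different process count means every block whose owner
    # changes must be communicated.

    def owner(block, p):
        """Owner of `block` in a 1D block-cyclic distribution over `p` processes."""
        return block % p

    def redistribution_plan(nblocks, p_src, p_dst):
        """Return {(src, dst): [blocks]} for blocks that must move."""
        plan = {}
        for b in range(nblocks):
            src, dst = owner(b, p_src), owner(b, p_dst)
            if src != dst:
                plan.setdefault((src, dst), []).append(b)
        return plan

    if __name__ == "__main__":
        # 12 blocks redistributed from 4 to 3 processes.
        for (src, dst), blocks in sorted(redistribution_plan(12, 4, 3).items()):
            print(f"P{src} -> P{dst}: blocks {blocks}")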
ISBN (print): 9780738110646
Task-based programming models have shown their potential for efficiency and scalability in parallel and distributed systems. With such a model, a parallel application is broken down into a graph of tasks, which are subsequently scheduled for execution. Recently, implementations of task-based models have addressed distributed-memory and heterogeneous systems with accelerators. However, scheduling tasks and allocating resources at runtime remains a challenge. In this paper, we propose coordinated and cooperative task scheduling across multiple applications. The main idea is to exploit an application's idle time, e.g., due to imbalance, to serve tasks from another application. The experiments use Chameleon, a task-based framework for reactive tasking in distributed-memory systems. In various example scenarios, we show improvements in CPU utilization of 5% to 15% through coordinated scheduling.
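The coordination idea, serving another application's tasks during idle time, can be sketched as follows; this is a toy single-process illustration of the concept, not Chameleon's actual API.

    # Toy illustration of cooperative scheduling between two applications:
    # whenever application A runs out of local work, it executes tasks from
    # application B's queue instead of idling. Not Chameleon's API.
    from collections import deque

    def run_cooperatively(queue_a, queue_b):
        executed = []                          # (who ran it, task) pairs
        while queue_a or queue_b:
            if queue_a:                        # prefer local work
                executed.append(("A", queue_a.popleft()))
            else:                              # A is idle: serve B's tasks
                executed.append(("A helps B", queue_b.popleft()))
            if queue_b:
                executed.append(("B", queue_b.popleft()))
        return executed

    if __name__ == "__main__":
        a = deque(f"a{i}" for i in range(3))   # A is under-loaded
        b = deque(f"b{i}" for i in range(6))   # B is over-loaded
        for who, task in run_cooperatively(a, b):
            print(who, task)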
ISBN (print): 9781665481069
We present a general framework that couples the PaRSEC runtime system and the HiCMA numerical library to solve challenging 3D data-sparse problems. Though formally dense, many matrix operators possess a rank-structured property that can be exploited during the most time-consuming computational phase, i.e., the matrix factorization. In particular, this work highlights how a software bundle powered by a task-based programming model can address the heterogeneous workloads engendered by compressing the dense operator. Using Tile Low-Rank (TLR) approximation, our approach consists of capturing the most significant information in each tile of the matrix using a threshold that satisfies the application's accuracy requirements. Matrix operations are performed on the compressed data layout, reducing memory footprint and algorithmic complexity. Our proposed software solution accommodates a range of traditional data structures of linear algebra, i.e., from dense and data-sparse to sparse, within a single matrix operation. Separation of concerns is at the heart of the design: a hardware-agnostic implementation, asynchronous execution with a dynamic runtime system, and high-performance numerical kernels, to prepare scientific applications to embrace exascale opportunities. This ambition necessitates extensions to PaRSEC that incorporate information related to data structure and rank distribution into the runtime decision-making. We introduce two runtime optimizations to address the challenges encountered when confronted with a large rank disparity: (1) a trimming procedure performed at runtime to cut away data dependencies from the directed acyclic graph that are discovered to be no longer required after compression, and (2) a rank-aware diamond-shaped data distribution to mitigate load-imbalance overheads, reduce data movement, and conserve memory footprint. We assess our implementation using 3D unstructured mesh deformation based on Radial Basis Function (RBF) interpolation. We report performance re...
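The per-tile compression step the abstract refers to can be illustrated with a truncated SVD governed by an accuracy threshold; the NumPy sketch below is purely illustrative and is not HiCMA's compression kernel.

    # Minimal sketch of tile low-rank (TLR) compression via truncated SVD
    # (illustrative only, not HiCMA's actual compression kernel).
    import numpy as np

    def compress_tile(tile, tol):
        """Return (U, V) with tile ~= U @ V, keeping singular values above `tol`."""
        u, s, vt = np.linalg.svd(tile, full_matrices=False)
        rank = max(1, int(np.sum(s > tol)))      # keep at least rank 1
        return u[:, :rank] * s[:rank], vt[:rank, :]

    if __name__ == "__main__":
        # A smooth (numerically low-rank) tile, e.g. samples of 1/(1+|x-y|).
        x = np.linspace(0.0, 1.0, 256)
        tile = 1.0 / (1.0 + np.abs(x[:, None] - x[None, :]))
        U, V = compress_tile(tile, tol=1e-8)
        print("rank:", U.shape[1], "error:", np.linalg.norm(tile - U @ V))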
ISBN (print): 9781665440660
The task-based programming model associated with dynamic runtime systems has gained popularity for challenging problems characterized by workload imbalance, heterogeneous resources, or extreme concurrency. During the last decade, low-rank matrix approximations, whose main idea consists of exploiting data sparsity, typically by compressing off-diagonal tiles up to an application-specific accuracy threshold, have been adopted to address the curse of dimensionality at extreme scale. In this paper, we create a bridge between the runtime and the linear algebra by communicating knowledge of the data sparsity to the runtime. We design and implement this synergistic approach with high user productivity in mind, in the context of the PaRSEC runtime system and the HiCMA numerical library. This requires extending PaRSEC with new features to integrate rank information into the dataflow so that proper decisions can be made at runtime. We focus on the tile low-rank (TLR) Cholesky factorization for solving 3D data-sparse covariance matrix problems arising in environmental applications. In particular, we employ the 3D exponential model of the Matérn matrix kernel, which exhibits challenging nonuniform high ranks in off-diagonal tiles. We first provide dynamic data structure management driven by a performance model to reduce extra floating-point operations. Next, we optimize the memory footprint of the application by relying on a dynamic memory allocator, supported by a rank-aware data distribution to cope with the workload imbalance. Finally, we expose further parallelism using kernel recursive formulations to shorten the critical path. Our resulting high-performance implementation outperforms existing data-sparse TLR Cholesky factorizations by up to 7-fold on a large-scale distributed-memory system, while reducing the memory footprint by up to a 44-fold factor. This multidisciplinary work highlights the need to empower runtime systems beyond their original duty of task scheduling fo...
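The rank-aware data distribution mentioned above aims to balance work when off-diagonal ranks vary widely; the simple greedy illustration below, which is not the paper's actual distribution, assigns tiles to processes by an estimated rank-dependent cost.

    # Greedy illustration of a rank-aware tile distribution: assign each tile
    # to the currently least-loaded process, weighting tiles by an estimated
    # cost derived from their rank. Not the paper's actual distribution.
    import heapq

    def rank_aware_distribution(tile_ranks, nprocs):
        """tile_ranks: {tile_id: rank}. Returns {tile_id: process}."""
        heap = [(0.0, p) for p in range(nprocs)]   # (accumulated cost, process)
        heapq.heapify(heap)
        assignment = {}
        # Place the most expensive tiles first for a tighter balance.
        for tile, rank in sorted(tile_ranks.items(), key=lambda kv: -kv[1]):
            cost = float(rank) ** 2                # crude rank-squared cost proxy
            load, proc = heapq.heappop(heap)
            assignment[tile] = proc
            heapq.heappush(heap, (load + cost, proc))
        return assignment

    if __name__ == "__main__":
        # Lower-triangular tiles with a few high-rank near-diagonal tiles.
        ranks = {(i, j): 5 + 40 * (abs(i - j) == 1) for i in range(4) for j in range(i)}
        print(rank_aware_distribution(ranks, nprocs=3))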
We introduce AL4SAN, a lightweight library for abstracting the APIs of task-based runtime engines. AL4SAN unifies the expression of tasks and their data dependencies. It supports various dynamic runtime systems relying on compiler technology and user-defined APIs. It enables a single application to employ different runtimes and their respective scheduling components, while keeping the user oblivious to the underlying hardware configurations. AL4SAN exposes common front-end APIs and connects to different back-end runtimes. Experiments on performance and overhead assessments are reported on various shared- and distributed-memory systems, possibly equipped with hardware accelerators. A range of workloads, from compute-bound to memory-bound regimes, are employed as proxies for current scientific applications. The low overhead (less than 10 percent) achieved across this variety of workloads enables AL4SAN to be deployed for fast development of task-based numerical algorithms. More interestingly, AL4SAN enables runtime interoperability by switching runtimes at runtime. Blending runtime systems achieves a twofold speedup on a task-based generalized symmetric eigenvalue solver, relative to state-of-the-art implementations. The ultimate goal of AL4SAN is not to create a new runtime, but to strengthen co-design of existing runtimes and applications, while facilitating user productivity and code portability. The code of AL4SAN is freely available at https://***/ecrc/al4san, with extensions in progress.
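The front-end/back-end idea, a unified task-insertion API dispatched to interchangeable runtime back-ends, can be sketched generically; all class and method names below are hypothetical and do not reproduce AL4SAN's actual API.

    # Hypothetical sketch of a unified task-insertion front-end dispatching to
    # interchangeable runtime back-ends. Names are illustrative and do NOT
    # correspond to AL4SAN's actual API.

    class SerialBackend:
        def submit(self, func, *args):
            func(*args)                        # run the task immediately
        def wait_all(self):
            pass

    class DeferredBackend:
        def __init__(self):
            self.pending = []
        def submit(self, func, *args):
            self.pending.append((func, args))  # defer until synchronization
        def wait_all(self):
            for func, args in self.pending:
                func(*args)
            self.pending.clear()

    class TaskFrontend:
        """Single application-facing API; the back-end can be swapped at runtime."""
        def __init__(self, backend):
            self.backend = backend
        def insert_task(self, func, *args):
            self.backend.submit(func, *args)
        def barrier(self):
            self.backend.wait_all()

    if __name__ == "__main__":
        rt = TaskFrontend(DeferredBackend())
        rt.insert_task(print, "task 1")
        rt.insert_task(print, "task 2")
        rt.barrier()                           # deferred tasks execute here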
ISBN (print): 9781450359337
Advances in energy harvesting circuits and energy-efficient processor architectures create the potential for batteryless computing and sensing systems called transiently powered computers. These computers can only operate intermittently due to the fluctuating nature of ambient energy. Intermittent operation requires a new programming model that preserves forward progress and maintains data consistency, both of which are challenging. We propose a structured task-based programming model, namely PureMEM, to cope with these challenges. We discuss how PureMEM prevents interdependencies caused by the unstructured control flow encountered in intermittent operation, enables re-usability of tasks, provides dynamic memory management, and supports error handling. We also present intermittent programs to exemplify the features of PureMEM.
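Forward progress and data consistency under intermittent power are commonly obtained by committing each task's outputs atomically to non-volatile memory and re-running the interrupted task after a failure; the simulated sketch below illustrates this general pattern and is not PureMEM's actual API.

    # Simulated sketch of task-based intermittent execution: each task reads the
    # last committed state, writes into a scratch copy, and commits atomically
    # only when it finishes; a power failure simply re-runs the interrupted task.
    # General pattern only, not PureMEM's actual API.
    import random

    def run_intermittently(tasks, state, failure_rate=0.3, seed=1):
        rng = random.Random(seed)
        i = 0
        while i < len(tasks):
            scratch = dict(state)              # work on a volatile copy
            tasks[i](scratch)
            if rng.random() < failure_rate:    # power failure before commit:
                continue                       # nothing persisted, re-run task i
            state = scratch                    # atomic commit to "non-volatile" state
            i += 1                             # advance to the next task
        return state

    if __name__ == "__main__":
        tasks = [lambda s: s.update(count=s.get("count", 0) + 1) for _ in range(5)]
        print(run_intermittently(tasks, {}))   # -> {'count': 5}, despite failures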
ISBN (print): 9781450360791
Asynchronous task-based programming models are gaining popularity to address the programmability and performance challenges in high performance computing. One of the main attractions of these models and runtimes is their potential to automatically expose and exploit overlap of computation with communication. However, we find that inefficient interactions between these programming models and the underlying messaging layer (in most cases, MPI) limit the achievable computation-communication overlap and negatively impact the performance of parallel programs. We address this challenge by exposing and exploiting information about MPI internals in a task-based runtime system to make better task-creation and scheduling decisions. In particular, we present two mechanisms for exchanging information between MPI and a task-based runtime, and analyze their trade-offs. Further, we present a detailed evaluation of the proposed mechanisms implemented in MPI and a task-based runtime. We show performance improvements of up to 16.3% and 34.5% for proxy applications with point-to-point and collective communication, respectively.
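The computation-communication overlap the paper targets can be illustrated with a non-blocking receive that is polled while independent tasks execute; the mpi4py sketch below shows only this generic overlap, not the paper's proposed MPI/runtime information-exchange mechanisms.

    # Minimal mpi4py sketch of computation-communication overlap: a non-blocking
    # receive is progressed by polling while independent tasks execute.
    # Run with: mpirun -n 2 python overlap.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()
    buf = np.empty(1_000_000, dtype="d")

    if rank == 0:
        buf[:] = 42.0
        comm.Send(buf, dest=1, tag=7)
    elif rank == 1:
        req = comm.Irecv(buf, source=0, tag=7)     # post the receive early
        independent_work = list(range(1000))
        done = 0
        while not req.Test():                      # progress MPI while computing
            if independent_work:
                done += independent_work.pop()     # stand-in for a ready task
        print("overlapped", done, "units of work; received", buf[0])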
ISBN (print): 9781728160269
This paper highlights the necessary development of new instrumentation tools within the PaRSEC task-based runtime system to leverage the performance of low-rank matrix computations. In particular, the tile low-rank (TLR) Cholesky factorization represents one of the most critical matrix operations toward solving challenging large-scale scientific applications. The challenge resides in the heterogeneous arithmetic intensity of the various computational kernels, which stresses PaRSEC's dynamic engine when orchestrating task executions at runtime. Such an irregular workload imposes the deployment of new scheduling heuristics that privilege the critical path, while exposing task parallelism to maximize hardware occupancy. To measure the effectiveness of PaRSEC's engine and its various scheduling strategies for tackling such workloads, it becomes paramount to implement adequate performance analysis and profiling tools tailored to fine-grained and heterogeneous task execution. This permits us not only to gain insights from PaRSEC, but also to identify potential application performance bottlenecks. These instrumentation tools may foster synergy between application and PaRSEC developers, for productivity as well as high-performance computing purposes. We demonstrate the benefits of these tools while assessing the performance of the TLR Cholesky factorization from data-distribution, communication-reducing, and synchronization-reducing perspectives. This tool-assisted performance analysis results in three major contributions: a new hybrid data distribution, a new hierarchical TLR Cholesky algorithm, and a new performance model for tuning the tile size. The new TLR Cholesky factorization achieves an 8x performance speedup over existing implementations on massively parallel supercomputers, toward solving large-scale 3D climate and weather prediction applications.
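Fine-grained task profiling of the kind described can be approximated by timestamping each task execution and aggregating a per-kernel trace; the sketch below is purely illustrative and unrelated to PaRSEC's actual instrumentation tools.

    # Minimal sketch of fine-grained task instrumentation: wrap task execution
    # with timestamps and keep a per-kernel trace for offline analysis.
    import time
    from collections import defaultdict

    class TaskTracer:
        def __init__(self):
            self.events = []                          # (kernel, start, end)
        def run(self, kernel_name, func, *args):
            t0 = time.perf_counter()
            result = func(*args)
            self.events.append((kernel_name, t0, time.perf_counter()))
            return result
        def summary(self):
            total = defaultdict(float)
            for kernel, t0, t1 in self.events:
                total[kernel] += t1 - t0
            return dict(total)

    if __name__ == "__main__":
        tracer = TaskTracer()
        tracer.run("gemm", sum, range(1_000_000))     # stand-in compute kernels
        tracer.run("potrf", sorted, list(range(1000, 0, -1)))
        print(tracer.summary())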
ISBN (print): 9781728147345
We propose a new framework for deploying Reverse Time Migration (RTM) simulations on distributed-memory systems equipped with multiple GPUs. Our software infrastructure engine, TB-RTM, relies on the STARPU dynamic runtime system to orchestrate the asynchronous scheduling of RTM computational tasks on the underlying resources. Besides dealing with the challenging hardware heterogeneity, TB-RTM supports tasks with different workload characteristics, which stress disparate components of the hardware system. RTM is challenging in that it operates intensively at both ends of the memory hierarchy, with compute kernels running at the highest level of the memory system, possibly in GPU main memory, while I/O kernels save solution data to fast storage. We consider how to span the wide performance gap between the two extreme ends of the memory system, i.e., GPU memory and fast storage, on which large-scale RTM simulations routinely execute. To maximize hardware occupancy while maintaining high memory bandwidth throughout the memory subsystem, our framework uses the new out-of-core (OOC) feature from STARPU to prefetch data solutions in and out, not only from/to the GPU/CPU main memory but also from/to the fast storage system. The OOC technique may create opportunities for overlapping expensive data movement with computations. The TB-RTM framework addresses this challenging heterogeneity problem with a systematic approach that is oblivious to the targeted hardware architectures. Our resulting RTM framework can effectively be deployed on massively parallel GPU-based systems, while delivering performance scalability up to 500 GPUs.
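The out-of-core overlap described above can be sketched, independently of STARPU, as a background prefetch of the next snapshot from storage while the current one is processed; file names and sizes in the sketch are made up.

    # Illustrative out-of-core pattern: prefetch the next snapshot from storage on
    # a background thread while the current snapshot is being processed, so I/O
    # overlaps with computation. Independent of STARPU; paths and sizes are made up.
    import concurrent.futures as cf
    import numpy as np, os, tempfile

    def write_snapshots(directory, n, size):
        for i in range(n):
            np.save(os.path.join(directory, f"snap{i}.npy"), np.full(size, float(i)))

    def process(snapshot):
        return float(snapshot.sum())               # stand-in for an RTM kernel

    if __name__ == "__main__":
        with tempfile.TemporaryDirectory() as d, cf.ThreadPoolExecutor(1) as io:
            write_snapshots(d, n=4, size=1_000_000)
            load = lambda i: np.load(os.path.join(d, f"snap{i}.npy"))
            nxt = io.submit(load, 0)               # prefetch the first snapshot
            for i in range(4):
                current = nxt.result()             # wait only if I/O is behind
                if i + 1 < 4:
                    nxt = io.submit(load, i + 1)   # prefetch next while computing
                print("snapshot", i, "->", process(current))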
We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any Message Passing Interface (MPI) calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Second, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.
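For reference, the classical first-order (Young/Daly) model gives an optimal checkpoint interval of roughly sqrt(2 * C * MTBF), where C is the checkpoint cost and MTBF the mean time between failures; the sketch below computes this textbook estimate, which is not necessarily the paper's closed formula for the unified scheme.

    # Classical first-order (Young/Daly) estimate of the optimal checkpoint
    # interval, W_opt ~ sqrt(2 * C * MTBF). Shown only as a textbook reference
    # point; the paper's unified closed formulas are not reproduced here.
    import math

    def young_daly_interval(checkpoint_cost_s, mtbf_s):
        return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

    if __name__ == "__main__":
        # e.g. a 60 s checkpoint on a system with a 24 h mean time between failures
        w = young_daly_interval(60.0, 24 * 3600.0)
        print(f"checkpoint roughly every {w / 60:.1f} minutes")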