Search Results - Inner Mongolia University Library

PASTA: Programming and Automation Support for Scalable Task-Parallel HLS Programs on Modern Multi-Die FPGAs

ACM Transactions on Reconfigurable Technology and Systems, 2024, Vol. 17, No. 3, pp. 1-31

Authors: Khatti, Moazin; Tian, Xingyu; Baroughi, Ahmad Sedigh; Baranwal, Akhil Raj; Chi, Yuze; Guo, Licheng; Cong, Jason; Fang, Zhenman

Affiliations: Simon Fraser Univ Sch Engn Sci Burnaby BC Canada; Univ Calif Los Angeles Comp Sci Dept Los Angeles CA USA

In recent years, the adoption of FPGAs in datacenters has increased, with a growing number of users choosing High-Level Synthesis (HLS) as their preferred programming method. While HLS simplifies FPGA programming, one notable challenge arises when scaling up designs for modern datacenter FPGAs that comprise multiple dies. The extra delays introduced due to die crossings and routing congestion can significantly degrade the frequency of large designs on these FPGA boards. Due to the gap between HLS design and physical design, it is challenging for HLS programmers to analyze and identify the root causes, and fix their HLS design to achieve better timing closure. Recent efforts have aimed to address these issues by employing coarse-grained floorplanning and pipelining strategies on task-parallel HLS designs where multiple tasks run concurrently and communicate through FIFO stream channels. However, many applications are not streaming friendly and many existing accelerator designs heavily rely on buffer-channel-based communication between tasks. In this work, we take a step further to support a task-parallel programming model where tasks can communicate via both FIFO stream channels and buffer channels. To achieve this goal, we design and implement the PASTA framework, which takes a large task-parallel HLS design as input and automatically generates a high-frequency FPGA accelerator via HLS and physical design co-optimization. Our framework introduces a latency-insensitive buffer channel design, which supports memory partitioning and ping-pong buffering while remaining compatible with vendor HLS tools. On the frontend, we provide an easy-to-use programming model for utilizing the proposed buffer channel, while on the backend, we implement efficient placement and pipelining strategies for the proposed buffer channel. To validate the effectiveness of our framework, we test it on four widely used Rodinia HLS benchmarks and two real-world accelerator designs and show an avera

Keywords: Multi-die FPGA; high-level synthesis; task-parallel programming; buffer channel; hardware acceleration; frequency optimization; coarse-grained floorplanning
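The abstract above mentions a buffer channel with ping-pong buffering. As a rough software analogy only (this is not PASTA's hardware design or its programming interface), the sketch below shows the general idea of ping-pong buffering between a producer task and a consumer task: two banks are exchanged so that one can be filled while the other is read with random access. The names `PingPongChannel` and `run_pipeline` are hypothetical and exist only for this illustration.

```python
import queue
import threading


class PingPongChannel:
    """Hypothetical two-bank buffer channel (software analogy, not PASTA's API)."""

    def __init__(self, bank_size):
        self.banks = [[0] * bank_size, [0] * bank_size]
        self.free = queue.Queue()  # bank indices the producer may fill
        self.full = queue.Queue()  # bank indices the consumer may read
        self.free.put(0)
        self.free.put(1)


def run_pipeline(tiles, bank_size):
    """Stream fixed-size tiles through the channel; return (sums, bank order)."""
    chan = PingPongChannel(bank_size)
    sums, bank_order = [], []

    def producer():
        for tile in tiles:
            idx = chan.free.get()        # wait for an empty bank
            chan.banks[idx][:] = tile    # fill the whole bank
            chan.full.put(idx)           # hand the bank to the consumer
        chan.full.put(None)              # sentinel: no more tiles

    def consumer():
        while True:
            idx = chan.full.get()        # wait for a filled bank
            if idx is None:
                break
            bank = chan.banks[idx]
            # Random access is allowed here, unlike with a FIFO stream:
            # the bank is read back to front.
            sums.append(sum(bank[i] for i in reversed(range(bank_size))))
            bank_order.append(idx)
            chan.free.put(idx)           # return the bank to the producer

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sums, bank_order
```

Calling `run_pipeline([[1, 2], [3, 4], [5, 6]], 2)` passes three tiles through the two banks. Because banks are released in the order they were filled, the producer alternates between bank 0 and bank 1.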

He..ro DB: A Concept for Parallel Data Processing on Heterogeneous Hardware

33rd International Conference on Architecture of Computing Systems (ARCS)

Authors: Mueller, Michael; Leich, Thomas; Pionteck, Thilo; Saake, Gunter; Teubner, Jens; Spinczyk, Olaf

Affiliations: Osnabruck Univ Inst Comp Sci ESS Grp Osnabruck Germany; Harz Univ Appl Sci Wernigerode Germany; Otto Von Guericke Univ Magdeburg Germany; TU Dortmund Univ Dept Comp Sci DBIS Grp Dortmund Germany

ISBN: (print) 9783030527938; 9783030527945

Due to the growing demand for processing power and energy efficiency by today's data-intensive applications, developers have to deal with heterogeneous hardware platforms composed of specialized computing resources. These are highly efficient for certain workloads but difficult to handle from the software engineering perspective. Even state-of-the-art database management systems do not exploit all heterogeneous hardware components, as their characteristics differ significantly. They are thus hard to integrate within a coherent database architecture. To address this problem, we propose a design concept that is based on a layered system software architecture: He..ro DB transforms a data-flow graph that describes the data-processing application to a task-based execution plan. Task implementations for the different computing resources and a reasonable degree of parallelism are chosen automatically based on available resources. The concept can cover any hardware configuration and application scenario. It is versatile and offers opportunities for independent optimization on each layer.

Keywords: Heterogeneous many-core systems; Data processing; Databases; task-parallel programming

Load Balancing Prioritized Tasks via Work-Stealing

21st International Conference on Parallel and Distributed Computing (Euro-Par)

Authors: Imam, Shams; Sarkar, Vivek

Affiliations: Rice Univ Dept Comp Sci Houston TX 77005 USA

ISBN: (print) 9783662480960; 9783662480953

Work-stealing schedulers focus on minimizing overhead in task scheduling. Consequently, they avoid features, such as task priorities, which can add overhead to the implementation. Thus in such schedulers, low priority tasks may be scheduled earlier, delaying the execution of higher priority tasks and possibly increasing overall execution time. In this paper, we develop a decentralized work-stealing scheduler that dynamically schedules fixed-priority tasks in a non-preemptive manner. We adhere, as closely as possible, to the priority order while scheduling tasks by accepting some overhead to preserve order. Our approach uses non-blocking operations, is workload independent, and we achieve performance even in the presence of fine-grained tasks. Experimental results show that the Java implementation of our scheduler performs favorably compared to other schedulers (priority and non-priority) available in the Java standard library.

Keywords: Work-stealing; Multi-level queue; Priority levels; Priority scheduling; Load balancing; task-parallel programming
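The abstract and keywords above describe priority-aware work stealing over multi-level queues. The following is a minimal, single-threaded sketch of that general idea, under several assumptions: it is not the authors' Java scheduler, it uses no non-blocking operations, and the names `Worker`, `pop_local`, `steal`, and `run` are hypothetical. Each worker keeps one deque per priority level (index 0 is the highest priority); an owner takes from the newest end of its highest non-empty level, and an idle worker steals from the oldest end of a victim's highest non-empty level.

```python
from collections import deque


class Worker:
    """Hypothetical worker holding one deque per priority level (0 = highest)."""

    def __init__(self, num_levels):
        self.levels = [deque() for _ in range(num_levels)]

    def push(self, priority, task):
        self.levels[priority].append(task)

    def pop_local(self):
        # The owner takes the newest task of its highest non-empty level.
        for level in self.levels:
            if level:
                return level.pop()
        return None

    def steal(self):
        # A thief takes the oldest task of this worker's highest non-empty level.
        for level in self.levels:
            if level:
                return level.popleft()
        return None


def run(workers):
    """Round-robin simulation; returns (worker_id, task) pairs in execution order."""
    trace = []
    progress = True
    while progress:
        progress = False
        for wid, worker in enumerate(workers):
            task = worker.pop_local()
            if task is None:
                # Local queues are empty: try to steal from the other workers.
                for vid, victim in enumerate(workers):
                    if vid != wid:
                        task = victim.steal()
                        if task is not None:
                            break
            if task is not None:
                trace.append((wid, task))
                progress = True
    return trace
```

In this simulation, if worker 0 holds two high-priority tasks, one mid-priority task, and one low-priority task while worker 1 is idle, worker 1 steals a high-priority task first, so neither worker runs lower-priority work while higher-priority work is still queued. A real scheduler would need concurrent deques and would accept some reordering between workers.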

Futures for Dynamic Dependencies - Parallelizing the H-LU Factorization

2nd International Workshop on Asynchronous Many-Task Systems and Applications (WAMTA)

Authors: Nather, Ruediger; Fohry, Claudia

Affiliations: Univ Kassel Dept Elect Engn & Comp Sci Kassel Hessen Germany

ISBN: (print) 9783031617621; 9783031617638

The LU factorization of hierarchical matrices (H-matrices) is a challenging problem for efficient parallelization, due to complex dependency patterns. Previous research suggested the usage of tasks, but existing task-based algorithms still need a preprocessing step to prepare information about the matrix structure. In consequence, this structure must not change afterwards. This paper proposes a novel algorithm that eliminates the need for preprocessing. Its core idea is usage of the future construct. A particularly expressive type of future is needed that is not yet supported by current AMT runtime systems. This paper defines the type and shows that it promotes a clear and concise way to program parallel H-LU factorization.

Keywords: Futures; task-parallel programming; Dynamic task-level Dependencies; Hierarchical Linear Algebra; LU Factorization

Scalable Task Parallelism for NUMA: A Uniform Abstraction for Coordinated Scheduling and Memory Management

International Conference on Parallel Architectures and Compilation (PACT)

Authors: Drebes, Andi; Pop, Antoniu; Heydemann, Karine; Cohen, Albert; Drach, Nathalie

Affiliations: Univ Manchester Sch Comp Sci Oxford Rd Manchester M13 9PL Lancs England; UPMC Univ Paris 06 Sorbonne Univ UMR 7606 CNRS LIP6 4 Pl Jussieu F-75252 Paris 05 France; INRIA 45 Rue Ulm F-75005 Paris France; Ecole Normale Super 45 Rue Ulm F-75005 Paris France

ISBN: (print) 9781450341219

Dynamic task-parallel programming models are popular on shared-memory systems, promising enhanced scalability, load balancing and locality. Yet these promises are undermined by non-uniform memory access (NUMA). We show that using NUMA-aware task and data placement, it is possible to preserve the uniform abstraction of both computing and memory resources for task-parallel programming models while achieving high data locality. Our data placement scheme guarantees that all accesses to task output data target the local memory of the accessing core. The complementary task placement heuristic improves the locality of task input data on a best effort basis. Our algorithms take advantage of data-flow style task parallelism, where the privatization of task data enhances scalability by eliminating false dependences and enabling fine-grained dynamic control over data placement. The algorithms are fully automatic, application independent, performance-portable across NUMA machines, and adapt to dynamic changes. Placement decisions use information about inter-task data dependences readily available in the run-time system and placement information from the operating system. We achieve 94% of local memory accesses on a 192-core system with 24 NUMA nodes, up to 5x higher performance than NUMA-aware hierarchical work stealing, and even 5.6x compared to static interleaved allocation. Finally, we show that state-of-the-art dynamic page migration by the operating system cannot catch up with frequent affinity changes between cores and data and thus fails to accelerate task-parallel applications.

Keywords: task-parallel programming; NUMA; Scheduling; Memory allocation; Data-flow programming
