Search Results - Inner Mongolia University Library

High-performance dataflow computing in hybrid memory systems with UPC++ DepSpawn

JOURNAL OF SUPERCOMPUTING, 2021, vol. 77, no. 7, pp. 7676-7689

Authors: Fraguela, Basilio B.; Andrade, Diego. Affiliation: Univ A Coruna CITIC Res Ctr Informat & Commun Technol La Coruna 15071 Spain

Dataflow computing is a very attractive paradigm for high-performance computing, given its ability to trigger computations as soon as their inputs are available. UPC++ DepSpawn is a novel task-based library that supports this model in hybrid shared/distributed memory systems on top of a Partitioned Global Address Space environment. While the initial version of the library provided good results, it suffered from a key restriction that heavily limited its performance and scalability. Namely, each process had to consider all the tasks in the application rather than only those of interest to it, an overhead that naturally grows with both the number of processes and tasks in the system. In this paper, this restriction is lifted, enabling our library to provide higher levels of performance. As a result, in experiments using 768 cores the performance improved by up to 40.1%, with an average improvement of 16.1%.

Keywords: dataflow computing; Hybrid parallelism; PGAS; Runtimes; Programmability; High-performance computing

On communication efficient dataflow computing in software defined networking enabled cloud

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2021, vol. 33, no. 7, pp. 1-1

Authors: Li, Yuepeng; Zeng, Deze; Zheng, Long. Affiliations: China Univ Geosci Hubei Key Lab Intelligent Geoinformat Proc Sch Comp Sci Wuhan Hubei Peoples R China; Shenzhen Univ Coll Comp Sci & Software Engn Shenzhen Peoples R China

Dataflow computing has become a promising alternative to the traditional control-centric computing paradigm for big data processing. Big data processing often happens in cloud computing environments, as the datacenter provisions a large amount of resources. Dataflow computing, as a data-centric computing paradigm, requires the dataflows to be shuffled among different codelets (i.e., data processing units) deployed in the datacenter servers. It is important to schedule dataflow transfers well for communication efficiency. It is widely held that the datacenter network should be managed by software defined networking (SDN) technology for flexibility. In an SDN-managed datacenter, a dataflow requires a forwarding rule in the forwarding table of each switch on its routing path. However, SDN switches are limited in forwarding table size. This introduces a non-negligible issue in the codelet deployment problem. Therefore, we are motivated to take such forwarding table size constraints into account in the problem of dataflow codelet deployment in datacenters managed by SDN. In particular, we aim at minimizing the communication cost while guaranteeing the dataflow computing performance at the same time. The communication cost minimization problem is formulated as an integer linear program, which is relaxed to design a heuristic algorithm. The experimental results show that our relaxation algorithm can significantly improve communication cost efficiency via ingenious codelet placement.

Keywords: codelet deployment; dataflow computing; software defined networking; virtual network embedding

The Case for Polymorphic Registers in Dataflow Computing

INTERNATIONAL JOURNAL OF PARALLEL PROGRAMMING, 2018, vol. 46, no. 6, pp. 1185-1219

Authors: Ciobanu, Catalin Bogdan; Gaydadjiev, Georgi; Pilato, Christian; Sciuto, Donatella. Affiliations: Univ Amsterdam Syst & Network Engn Grp Amsterdam Netherlands; Delft Univ Technol Distributed Syst Grp Delft Netherlands; Maxeler Technol Ltd London England; Univ Lugano Fac Informat Lugano Switzerland; Politecn Milan Dip Elettron Informaz & Bioingn Milan Italy

Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve the throughput up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9 x 9 or larger mask sizes, even in bandwidth-constrained systems.

Keywords: dataflow computing; Parallel memory accesses; Polymorphic register file; Bandwidth; Vector lanes; Convolution; High performance computing; High-level synthesis

A software cache autotuning strategy for dataflow computing with UPC++ DepSpawn

COMPUTATIONAL AND MATHEMATICAL METHODS, 2021, vol. 3, no. 6

Authors: Fraguela, Basilio B.; Andrade, Diego. Affiliation: Univ A Coruna CITIC Ctr Singular Invest Galicia Comp Architecture Grp La Coruna Spain

Dataflow computing allows computations to start as soon as all their dependencies are satisfied. This is particularly useful in applications with irregular or complex patterns of dependencies, which would otherwise involve either coarse-grained synchronizations that degrade performance or high programming costs. A recent proposal for the easy development of performant dataflow algorithms in hybrid shared/distributed memory systems is UPC++ DepSpawn. Among the many techniques it applies to provide good performance is a software cache that minimizes the communications among the processes involved. In this article we provide the details of the implementation and operation of this cache, and we present an autotuning strategy that simplifies its usage by freeing the user from having to estimate an adequate size for this cache. Instead, the runtime is now able to define reasonably sized caches that provide near-optimal behavior.

Keywords: autotuning; dataflow computing; distributed memory; locality; PGAS; runtimes

ADD: Accelerator Design and Deploy - A tool for FPGA high-performance dataflow computing

CONCURRENCY AND COMPUTATION-PRACTICE & EXPERIENCE, 2019, vol. 31, no. 18

Authors: Penha, Jeronimo C.; Silva, Lucas B.; Silva, Jansen M.; Coelho, Kristtopher K.; Baranda, Hector P.; Nacif, Jose Augusto M.; Ferreira, Ricardo S. Affiliations: Univ Fed Vicosa Comp Sci Dept Vicosa MG Brazil; Univ Fed Vicosa Sci & Technol Inst UFV Florestal Campus Vicosa MG Brazil; Ctr Fed Educ Tecnol Minas Gerais Comp Sci & Mech Engn Dept Leopoldina Campus Belo Horizonte MG Brazil

Dataflow-based FPGA accelerators have become a promising alternative to deliver energy-efficient high-performance computing. However, FPGA programming is still a challenge. This paper presents Accelerator Design and Deploy (ADD), a high-level framework to specify, simulate, and implement dataflow accelerators for streaming applications. The framework includes an open dataflow operator library, and templates are provided to easily design new operators. The framework also provides both a high-level simulation and an accurate circuit-level simulation with short execution times. Moreover, ADD provides software and hardware APIs to simplify the integration process, extending the benefits of portability from low-cost FPGA boards to high-performance datacenter FPGA platforms. Our framework supports coupling with high-level programming languages, and it has been validated on two FPGA platforms: the Intel high-performance CPU-FPGA heterogeneous computing platform and an educational FPGA kit. We show that our simple approach presents competitive performance, both in time and energy, when compared to multi-core and GPU accelerators.

Keywords: dataflow computing; FPGA accelerators; heterogeneous architectures; high-performance computing; overlay

Productivity, Portability, Performance, and Reproducibility: Data-Centric Python

IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, 2025, vol. 36, no. 5, pp. 804-820

Authors: Ziogas, Alexandros Nikolaos; Schneider, Timo; Ben-Nun, Tal; Calotoiu, Alexandru; De Matteis, Tiziano; de Fine Licht, Johannes; Lavarini, Luca; Hoefler, Torsten. Affiliations: Swiss Fed Inst Technol Dept Informat Technol & Elect Engn CH-8092 Zurich Switzerland; Swiss Fed Inst Technol Dept Comp Sci CH-8092 Zurich Switzerland; Lawrence Livermore Natl Lab Ctr Appl Sci Comp Livermore CA 94550 USA; Vrije Univ Amsterdam Dept Comp Sci NL-1081 HV Amsterdam Netherlands; NextSilicon CH-8005 Zurich Switzerland; 1plusX CH-8005 Zurich Switzerland

Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High-Performance Computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. This work presents a workflow that retains Python's high productivity while achieving portable performance across different architectures. The workflow's key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes. Our benchmarks were reproduced in the Student Cluster Competition (SCC) during the Supercomputing Conference (SC) 2022. We present and discuss the student teams' results.

Keywords: Productivity; Codes; Semantics; Computer architecture; Supercomputers; Software; Field programmable gate arrays; Optimization; Python; Computer languages; high-performance computing; dataflow computing; parallel programming; distributed computing

From C/C++ Code to High-Performance Dataflow Circuits

IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, 2022, vol. 41, no. 7, pp. 2142-2155

Authors: Josipovic, Lana; Guerrieri, Andrea; Ienne, Paolo. Affiliation: Ecole Polytech Fed Lausanne Sch Comp & Commun Sci CH-1015 Lausanne Switzerland

High-level synthesis (HLS) tools typically generate statically scheduled datapaths. Static scheduling implies that the resulting circuits have a hard time exploiting parallelism in code with potential memory dependences, with control dependences, or where performance is limited by long latency control decisions. In this work, we describe an HLS approach which generates dynamically scheduled, dataflow circuits out of imperative code. We detail a complete set of rules to transform a standard compiler intermediate representation into a high-performance dataflow circuit that is able to dynamically resolve memory dependences and adapt its behavior on the fly to particular control flow decisions and operation latencies. Compared to a traditional HLS tool, the result is a different tradeoff between performance and circuit complexity: statically scheduled circuits display the best performance per cost in regular applications, but general-purpose, irregular, and control-dominated computing tasks require the runtime flexibility of dynamic scheduling. Therefore, enabling dynamic behavior in HLS is key to dealing with the increasing computational demands of new contexts and broader application domains.

Keywords: Schedules; Dynamic scheduling; Tools; Field programmable gate arrays; Standards; Processor scheduling; Pipeline processing; Buffer storage; circuit optimization; dataflow computing; high-level synthesis (HLS); memory architecture

GANDAFL: Dataflow Acceleration for Short Read Alignment on NGS Data

IEEE TRANSACTIONS ON COMPUTERS, 2022, vol. 71, no. 11, pp. 3018-3031

Authors: Koliogeorgi, Konstantina; Xydis, Sotirios; Gaydadjiev, Georgi; Soudris, Dimitrios. Affiliations: Natl Tech Univ Athens Microprocessors & Digital Syst Lab Elect & Comp Engn Athens 10682 Greece; Harokopio Univ Athens 17671 Greece; Maxeler Technol London W6 0ND England

DNA read alignment is an integral part of genome study, which has been revolutionised thanks to the growth of Next Generation Sequencing (NGS) technologies. The inherent computational intensity of string matching algorithms such as Smith-Waterman (SmW) and the vast amount of NGS input data create a bottleneck in the workflows. Accelerated reconfigurable computing has been extensively leveraged to alleviate this bottleneck, focusing on high-performance albeit standalone implementations. In existing accelerated solutions, effective co-design of NGS short-read alignment still remains an open issue, mainly due to a narrow view of real integration aspects, such as system-wide communication and accelerator call overheads. In this paper, we first propose GANDAFL, a novel Genome AligNment DAta-FLow architecture for the SmW Matrix-fill and Traceback stages to perform high-throughput short-read alignment on NGS data. We then propose a radical software restructuring of the widely used Bowtie2 aligner that allows read alignment by batches to expose acceleration capabilities. Batch alignment minimizes the calling overhead of the accelerators, whereas moving both Matrix-fill and Traceback on chip eliminates the communication data overheads. The standalone solution delivers up to x116 and x2 speedup over state-of-the-art software and hardware accelerators respectively, and the GANDAFL-enhanced Bowtie2 aligner delivers a x1.9 speedup.

Keywords: Sequential analysis; Genomics; Bioinformatics; Software; Field programmable gate arrays; Task analysis; DNA; Next generation sequencing; reconfigurable acceleration; dataflow computing; Bowtie2; smith waterman; traceback

VR-PRUNE: Decidable Variable-Rate Dataflow for Signal Processing Systems

IEEE TRANSACTIONS ON SIGNAL PROCESSING, 2022, vol. 70, pp. 1819-1833

Authors: Boutellier, Jani; Ma, Yujunrong; Wu, Jiahao; Khan, Mir; Bhattacharyya, Shuvra S. Affiliations: Univ Vaasa Sch Technol & Innovat Vaasa 65200 Finland; Univ Maryland Dept Elect & Comp Engn College Pk MD 20742 USA; Tampere Univ Fac Informat Technol & Commun Sci Tampere 33014 Finland; Univ Maryland Inst Adv Comp Studies College Pk MD 20742 USA

The dataflow concept has been successfully used for modeling and synthesizing signal processing applications for decades, and recently, dataflow has also been discovered to match the computation model of machine learning applications, leading to extremely successful dataflow-based application design frameworks. One of the most attractive features of dataflow, especially for signal processing, is related to its formal nature: when properly defined, a dataflow-based application model can be analytically verified for correctness at the stage of application design. This paper proposes VR-PRUNE, a novel dataflow model of computation that is aimed at the design of high-performance signal processing software, together with runtime support that allows efficient application deployment to heterogeneous GPU-equipped platforms. Compared to prior work, VR-PRUNE features variable token rate processing, which enables designing adaptive signal processing applications and implementing solutions that, e.g., allow trading off between power consumption and filtering bandwidth at runtime. The paper presents the formal concepts of VR-PRUNE, as well as four application examples from domains related to signal processing, accompanied by quantitative results, which show that using VR-PRUNE enables, for example, application power-performance scaling and, on the other hand, describing adaptive application behavior with 59% fewer dataflow graph components compared to previous work.

Keywords: Computational modeling; Signal processing; Analytical models; Runtime; System recovery; Ports (computers); Petri nets; dataflow computing; design automation; signal processing; parallel processing

Evaluating a Flow-Based Programming Approach as an Alternative for Developing CEP Applications in IoT

IEEE INTERNET OF THINGS JOURNAL, 2022, vol. 9, no. 13, pp. 11489-11499

Authors: Ortiz, Guadalupe; Castillo, Ivan; Garcia-de-Prado, Alfonso; Boubeta-Puig, Juan. Affiliations: Univ Cadiz Dept Comp Sci & Engn Cadiz 11519 Spain; Univ Cadiz Comp Architecture & Technol Dept Cadiz 11519 Spain

One of the main advantages brought by the Internet of Things (IoT) is the possibility of having large amounts of data from several sources that allow us, once analyzed, to make decisions in various domains in real time. This implies the need to be able to process large volumes of data in more or less limited processing times depending on the application domain. In this sense, complex event processing (CEP), used in conjunction with an enterprise service bus (ESB), has proven to be very efficient in multiple domains. In search of greater efficiency, some CEP engines offer the option of using flow-based programming (FBP) rather than their traditional programming using CEP together with an event bus. However, its use, while it may be more efficient, can lead to other limitations. In this article, we analyze and describe the performance and limitations of using a CEP engine with an ESB versus a CEP engine with FBP. This will allow developers to decide which option is more convenient for their IoT system depending on the application domain and its specific needs.

Keywords: Programming; Engines; Service-oriented architecture; Computer architecture; Real-time systems; Benchmark testing; Virtual machining; Complex event processing (CEP); dataflow; dataflow computing; enterprise service bus (ESB); flow-based programming (FBP); Internet of Things (IoT)
