Dataflow computing is a very attractive paradigm for high-performance computing, given its ability to trigger computations as soon as their inputs are available. UPC++ DepSpawn is a novel task-based library that supports this model in hybrid shared/distributed memory systems on top of a Partitioned Global Address Space environment. While the initial version of the library provided good results, it suffered from a key restriction that heavily limited its performance and scalability. Namely, each process had to consider all the tasks in the application rather than only those of interest to it, an overhead that naturally grows with both the number of processes and tasks in the system. In this paper, this restriction is lifted, enabling our library to provide higher levels of performance. In experiments using 768 cores, performance improved by up to 40.1%, with an average improvement of 16.1%.
Dataflow computing has become a promising alternative to the traditional control-centric computing paradigm for facilitating big data processing. Big data processing often happens in cloud computing environments, as the datacenter provisions a large amount of resources. Dataflow computing, as a data-centric computing paradigm, requires dataflows to be shuffled among different codelets (i.e., data processing units) deployed on the datacenter servers. Scheduling dataflow transfers well is therefore important for communication efficiency. It is widely held that the datacenter network should be managed by software-defined networking (SDN) technology for flexibility. In an SDN-managed datacenter, a dataflow requires a forwarding rule in the forwarding table of each switch on its routing path. However, SDN switches are limited in forwarding table size, which introduces a non-negligible issue in the codelet deployment problem. We are therefore motivated to take such forwarding table size constraints into account in the problem of dataflow codelet deployment in SDN-managed datacenters. In particular, we aim at minimizing the communication cost while guaranteeing the dataflow computing performance. The communication cost minimization problem is formulated as an integer linear program, which is relaxed to design a heuristic algorithm. The experimental results show that our relaxation algorithm can significantly improve communication cost efficiency via ingenious codelet placement.
Heterogeneous systems are becoming increasingly popular, delivering high performance through hardware specialization. However, sequential data accesses may have a negative impact on performance. Data parallel solutions such as Polymorphic Register Files (PRFs) can potentially accelerate applications by facilitating high-speed, parallel access to performance-critical data. This article shows how PRFs can be integrated into dataflow computational platforms. Our semi-automatic, compiler-based methodology generates customized PRFs and modifies the computational kernels to efficiently exploit them. We use a separable 2D convolution case study to evaluate the impact of memory latency and bandwidth on performance compared to a state-of-the-art NVIDIA Tesla C2050 GPU. We improve the throughput by up to 56.17X and show that the PRF-augmented system outperforms the GPU for 9 x 9 or larger mask sizes, even in bandwidth-constrained systems.
Dataflow computing allows computations to start as soon as all their dependencies are satisfied. This is particularly useful in applications with irregular or complex patterns of dependencies, which would otherwise involve either coarse-grain synchronizations that degrade performance or high programming costs. A recent proposal for the easy development of performant dataflow algorithms in hybrid shared/distributed memory systems is UPC++ DepSpawn. Among the many techniques it applies to provide good performance is a software cache that minimizes the communications among the processes involved. In this article we provide the details of the implementation and operation of this cache, and we present an autotuning strategy that simplifies its usage by freeing the user from having to estimate an adequate size for this cache. Rather, the runtime is now able to define reasonably sized caches that provide near-optimal behavior.
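The firing rule described in this abstract (a computation starts once all of its dependencies are satisfied) can be illustrated with a minimal, single-process sketch. This is not the UPC++ DepSpawn API or its runtime; the function `run_dataflow`, the `tasks` and `deps` dictionaries, and the example task names are all hypothetical and chosen only for illustration.

```python
from collections import deque


def run_dataflow(tasks, deps):
    """Run tasks as soon as their dependencies are satisfied (sequential sketch).

    tasks: dict mapping a task name to a callable. The callable receives the
           results of its dependencies, in the order listed in deps.
    deps:  dict mapping a task name to the list of task names it depends on.
           Tasks that are absent from deps have no dependencies.
    Returns (results, order): the result of each task and the firing order.
    """
    # Number of unsatisfied dependencies per task.
    pending = {name: len(deps.get(name, [])) for name in tasks}
    # Reverse edges: which tasks are waiting on each task.
    dependents = {name: [] for name in tasks}
    for name in tasks:
        for dep in deps.get(name, []):
            dependents[dep].append(name)

    # Tasks with no dependencies can fire immediately.
    ready = deque(name for name, count in pending.items() if count == 0)
    results, order = {}, []
    while ready:
        name = ready.popleft()
        args = [results[dep] for dep in deps.get(name, [])]
        results[name] = tasks[name](*args)
        order.append(name)
        # Completing a task may satisfy the last dependency of its dependents.
        for child in dependents[name]:
            pending[child] -= 1
            if pending[child] == 0:
                ready.append(child)

    if len(order) != len(tasks):
        raise ValueError("dependency cycle: some tasks never became ready")
    return results, order


# Hypothetical example graph: c needs a and b, d needs c.
example_tasks = {
    "a": lambda: 2,
    "b": lambda: 3,
    "c": lambda a, b: a + b,
    "d": lambda c: c * 10,
}
example_deps = {"c": ["a", "b"], "d": ["c"]}
example_results, example_order = run_dataflow(example_tasks, example_deps)
print(example_results["d"], example_order)
```

A real dataflow runtime would dispatch ready tasks to worker threads or remote processes instead of running them in a loop; the sketch only shows how readiness is derived from dependency counts rather than from explicit synchronization points.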
Dataflow-based FPGA accelerators have become a promising alternative to deliver energy-efficient high-performance computing. However, FPGA programming is still a challenge. This paper presents Accelerator Design and Deploy (ADD), a high-level framework to specify, simulate, and implement dataflow accelerators for streaming applications. The framework includes an open dataflow operator library, and templates are provided to easily design new operators. The framework also provides both a high-level simulation and an accurate circuit-level simulation with short execution times. Moreover, ADD provides software and hardware APIs to simplify the integration process, extending the benefits of portability from low-cost FPGA boards to high-performance datacenter FPGA platforms. Our framework supports coupling with high-level programming languages, and it has been validated on two FPGA platforms: the Intel high-performance CPU-FPGA heterogeneous computing platform and an educational FPGA kit. We show that our simple approach presents competitive performance, both in time and energy, when compared to multi-core and GPU accelerators.
Python has become the de facto language for scientific computing. Programming in Python is highly productive, mainly due to its rich science-oriented software ecosystem built around the NumPy module. As a result, the demand for Python support in High-Performance computing (HPC) has skyrocketed. However, the Python language itself does not necessarily offer high performance. This work presents a workflow that retains Python's high productivity while achieving portable performance across different architectures. The workflow's key features are HPC-oriented language extensions and a set of automatic optimizations powered by a data-centric intermediate representation. We show performance results and scaling across CPU, GPU, FPGA, and the Piz Daint supercomputer (up to 23,328 cores), with 2.47x and 3.75x speedups over previous-best solutions, first-ever Xilinx and Intel FPGA results of annotated Python, and up to 93.16% scaling efficiency on 512 nodes. Our benchmarks were reproduced in the Student Cluster Competition (SCC) during the Supercomputing Conference (SC) 2022. We present and discuss the student teams' results.
High-level synthesis (HLS) tools typically generate statically scheduled datapaths. Static scheduling implies that the resulting circuits have a hard time exploiting parallelism in code with potential memory dependences, with control dependences, or where performance is limited by long latency control decisions. In this work, we describe an HLS approach which generates dynamically scheduled, dataflow circuits out of imperative code. We detail a complete set of rules to transform a standard compiler intermediate representation into a high-performance dataflow circuit that is able to dynamically resolve memory dependences and adapt its behavior on the fly to particular control flow decisions and operation latencies. Compared to a traditional HLS tool, the result is a different tradeoff between performance and circuit complexity: statically scheduled circuits display the best performance per cost in regular applications, but general-purpose, irregular, and control-dominated computing tasks require the runtime flexibility of dynamic scheduling. Therefore, enabling dynamic behavior in HLS is key to dealing with the increasing computational demands of new contexts and broader application domains.
DNA read alignment is an integral part of genome study, which has been revolutionised thanks to the growth of Next Generation Sequencing (NGS) technologies. The inherent computational intensity of string matching algorithms such as Smith-Waterman (SmW) and the vast amount of NGS input data create a bottleneck in the workflows. Accelerated reconfigurable computing has been extensively leveraged to alleviate this bottleneck, focusing on high-performance albeit standalone implementations. In existing accelerated solutions, effective co-design of NGS short-read alignment still remains an open issue, mainly due to a narrow view of real integration aspects, such as system-wide communication and accelerator call overheads. In this paper, we first propose GANDAFL, a novel Genome AligNment DAta-FLow architecture for the SmW Matrix-fill and Traceback stages to perform high-throughput short-read alignment on NGS data. We then propose a radical software restructuring of the widely used Bowtie2 aligner that allows read alignment by batches to expose acceleration capabilities. Batch alignment minimizes the calling overhead of the accelerators, whereas moving both Matrix-fill and Traceback on chip eliminates the communication data overheads. The standalone solution delivers up to 116x and 2x speedup over state-of-the-art software and hardware accelerators respectively, and the GANDAFL-enhanced Bowtie2 aligner delivers a 1.9x speedup.
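For readers unfamiliar with the two Smith-Waterman stages named in this abstract, the following is a plain software sketch of Matrix-fill followed by Traceback. It is the textbook algorithm with a linear gap penalty, not the GANDAFL hardware architecture or Bowtie2's implementation; the function name `smith_waterman` and the scoring values (match 2, mismatch -1, gap -1) are assumptions made for illustration.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Local alignment of strings a and b with a linear gap penalty.

    Returns (best_score, aligned_a, aligned_b). The scoring values are
    illustrative defaults, not taken from any particular aligner.
    """
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] is the best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)

    # Stage 1: Matrix-fill.
    for i in range(1, rows):
        for j in range(1, cols):
            score = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                        # start a new local alignment
                H[i - 1][j - 1] + score,  # align a[i-1] with b[j-1]
                H[i - 1][j] + gap,        # gap in b
                H[i][j - 1] + gap,        # gap in a
            )
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Stage 2: Traceback from the highest-scoring cell until a zero cell.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        score = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + score:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1

    return best, "".join(reversed(out_a)), "".join(reversed(out_b))


print(smith_waterman("TTACGTTT", "GGACGTCC"))
```

Traceback needs the filled matrix as its input, so running the two stages on different devices means moving that matrix between them; this is the communication overhead the abstract says is removed by keeping both stages on chip.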
The dataflow concept has been successfully used for modeling and synthesizing signal processing applications for decades, and recently, dataflow has also been discovered to match the computation model of machine learning applications, leading to extremely successful dataflow-based application design frameworks. One of the most attractive features of dataflow, especially for signal processing, is related to its formal nature: when properly defined, a dataflow-based application model can be analytically verified for correctness at the stage of application design. This paper proposes VR-PRUNE, a novel dataflow model of computation that is aimed at the design of high-performance signal processing software, together with runtime support that allows efficient application deployment to heterogeneous GPU-equipped platforms. Compared to prior work, VR-PRUNE features variable token rate processing, which enables designing adaptive signal processing applications and implementing solutions that, e.g., allow trading off between power consumption and filtering bandwidth at runtime. The paper presents the formal concepts of VR-PRUNE, as well as four application examples from domains related to signal processing, accompanied by quantitative results, which show that using VR-PRUNE enables, for example, application power-performance scaling, as well as describing adaptive application behavior with 59% fewer dataflow graph components compared to previous work.
One of the main advantages brought by the Internet of Things (IoT) is the possibility of having large amounts of data from several sources that allow us, once analyzed, to make decisions in various domains in real time. This implies the need to process large volumes of data within processing times that are more or less limited depending on the application domain. In this sense, complex event processing (CEP), used in conjunction with an enterprise service bus (ESB), has proven to be very efficient in multiple domains. In search of greater efficiency, some CEP engines offer the option of using flow-based programming (FBP) rather than their traditional programming using CEP together with an event bus. However, while FBP may be more efficient, its use can lead to other limitations. In this article, we analyze and describe the performance and limitations of using a CEP engine with an ESB versus a CEP engine with FBP. This will allow developers to decide which option is more convenient for their IoT system depending on the application domain and its specific needs.