Reconfigurable systems, and in particular, FPGA-based custom computing machines, offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages for application domains such as image processing, where the use of customized pipelines exploits the inherent coarse-grain parallelism. In this paper we describe a set of program analyses and an implementation that map a sequential and un-annotated C program into a pipelined implementation running on a set of FPGAs, each with multiple external memories. Based on well-known parallel computing analysis techniques, our algorithms perform unrolling for operator parallelization, reuse and data layout for memory parallelization and precise communication analysis. We extend these techniques for FPGA-based systems to automatically partition the application data and computation into custom pipeline stages, taking into account the available FPGA and interconnect resources. We illustrate the analysis components by way of an example, a machine vision program. We present the algorithm results, derived with minimal manual intervention, which demonstrate the potential of this approach for automatically deriving pipelined designs from high-level sequential specifications.
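As a rough sketch of the operator-parallelization step (loop unrolling) mentioned above, the following Python fragment contrasts a rolled 3x3 image filter with a fully unrolled version. It is only an illustration of the transformation, not the paper's C-to-FPGA compiler; the kernel and image are invented for the example.

```python
# Illustrative sketch (not the paper's compiler): unrolling the window loops of a
# 3x3 image filter so each multiply-add becomes an explicit, independent operator.
# On an FPGA, these unrolled operators could be instantiated as a parallel pipeline stage.

def filter_rolled(img, k):
    """Reference 3x3 filter with rolled window loops."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for dy in range(3):          # window loops kept rolled:
                for dx in range(3):      # one multiply-add per iteration
                    acc += k[dy][dx] * img[y + dy - 1][x + dx - 1]
            out[y][x] = acc
    return out

def filter_unrolled(img, k):
    """Same filter with the window loops fully unrolled: nine independent
    multiply-adds that a hardware pipeline could evaluate in parallel."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            out[y][x] = (k[0][0]*img[y-1][x-1] + k[0][1]*img[y-1][x] + k[0][2]*img[y-1][x+1] +
                         k[1][0]*img[y][x-1]   + k[1][1]*img[y][x]   + k[1][2]*img[y][x+1] +
                         k[2][0]*img[y+1][x-1] + k[2][1]*img[y+1][x] + k[2][2]*img[y+1][x+1])
    return out

if __name__ == "__main__":
    img = [[(x * y) % 7 for x in range(8)] for y in range(8)]
    k = [[1, 0, -1], [2, 0, -2], [1, 0, -1]]   # Sobel-like kernel, made up for the demo
    assert filter_rolled(img, k) == filter_unrolled(img, k)
```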
ISBN (print): 9781581131857
We study the problem of executing parallel programs, in particular Cilk programs, on a collection of processors of different speeds. We consider a model in which each processor maintains an estimate of its own speed, where communication between processors has a cost, and where all scheduling must be online. This problem has been considered previously in the fields of asynchronous parallel computing and scheduling theory. Our model is a bridge between the assumptions in these fields. We provide a new, more accurate analysis of an old scheduling algorithm called the maximum utilization scheduler. Based on this analysis, we generalize this scheduling policy and define the high utilization scheduler. We next focus on the Cilk platform and introduce a new algorithm for scheduling Cilk multithreaded parallel programs on heterogeneous processors. This scheduler is inspired by the high utilization scheduler and is modified to fit in a Cilk context. A crucial aspect of our algorithm is that it keeps the original spirit of the Cilk scheduler. In fact, when our new algorithm runs on homogeneous processors, it exactly mimics the dynamics of the original Cilk scheduler.
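To make the setting concrete, here is a toy sketch of a greedy scheduler on processors of different estimated speeds. It is not the paper's maximum or high utilization scheduler, and it ignores communication cost and Cilk's work-stealing structure; the task model and constants are assumptions for the example.

```python
# Toy sketch: greedy assignment of independent tasks to processors with different
# estimated speeds, keeping faster processors busy. Only an illustration of the
# heterogeneous-speed setting, not the paper's scheduling policy.

def schedule(task_works, speeds):
    """task_works: amount of work per task; speeds: estimated speed per processor.
    Each task goes to the processor that would finish it earliest. Returns the makespan."""
    free_at = [0.0] * len(speeds)                      # time each processor becomes idle
    for work in sorted(task_works, reverse=True):      # largest tasks first
        p = min(range(len(speeds)), key=lambda i: free_at[i] + work / speeds[i])
        free_at[p] += work / speeds[p]                 # speed estimate scales the run time
    return max(free_at)

if __name__ == "__main__":
    # one fast, one medium, one slow processor (hypothetical speed estimates)
    print(schedule([4, 3, 3, 2, 1, 1], speeds=[2.0, 1.0, 0.5]))
```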
ISBN (print): 9780897911511
Simple parallel algorithms for the maximal independent set (MIS) problem are presented. The first algorithm is a Monte Carlo algorithm with a very local property. The local property of this algorithm may make it a useful protocol design tool in distributed computing environments and artificial intelligence. One of the main contributions of this paper is the development of powerful and general techniques for converting Monte Carlo algorithms into deterministic algorithms. These techniques are used to convert the Monte Carlo algorithm for the MIS problem into a simple deterministic algorithm with the same parallel running time.
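A hedged sketch of a round-based Monte Carlo MIS construction in the spirit described above (the exact randomized rule in the paper may differ): each surviving vertex draws a random value and enters the set when its value beats those of all surviving neighbors, after which it and its neighbors drop out.

```python
# Round-based Monte Carlo MIS sketch: local random choices decide, per round,
# which vertices join the independent set. Each round is trivially parallel
# across vertices, which is the "local property" the abstract alludes to.
import random

def monte_carlo_mis(adj):
    """adj: dict vertex -> set of neighbors. Returns a maximal independent set."""
    alive = set(adj)
    mis = set()
    while alive:
        r = {v: random.random() for v in alive}            # independent local choices
        winners = {v for v in alive
                   if all(r[v] < r[u] for u in adj[v] if u in alive)}
        mis |= winners
        removed = winners | {u for v in winners for u in adj[v]}
        alive -= removed                                     # winners and their neighbors leave
    return mis

if __name__ == "__main__":
    adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
    s = monte_carlo_mis(adj)
    # independence: no two chosen vertices are adjacent
    assert all(u not in adj[v] for v in s for u in s if u != v)
    # maximality: every vertex is chosen or has a chosen neighbor
    assert all(v in s or any(u in s for u in adj[v]) for v in adj)
```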
ISBN (print): 9781611976465
We study dynamic graph algorithms in the Massively Parallel Computation model, which was inspired by practical data processing systems. Our goal is to provide algorithms that can efficiently handle large batches of edge insertions and deletions. We show algorithms that require fewer rounds to update a solution to problems such as Minimum Spanning Forest, 2-Edge Connected Components, and Maximal Matching than would be required by their static counterparts to compute it from scratch. They work in the most restrictive memory regime, in which local memory per machine is strongly sublinear in the number of graph vertices. Improving on the size of the batch they can handle efficiently would improve on the round complexity of known static algorithms on sparse graphs. Our algorithms can process batches of updates of size Θ(S), for Minimum Spanning Forest and 2-Edge Connected Components, and Θ(S^(1−ϵ)), for Maximal Matching, in O(1) rounds, where S is the local memory of a single machine.
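For concreteness, a small single-machine sketch of the underlying dynamic problem: maintaining a spanning forest under a batch of edge insertions with union-find. This only illustrates the problem being solved; it does not model the paper's MPC algorithms, their round complexity, or the sublinear-memory regime.

```python
# Sequential illustration: a spanning forest updated by a batch of edge insertions.
# An inserted edge joins the forest only if it connects two different trees.
class UnionFind:
    def __init__(self, n):
        self.parent = list(range(n))
    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]   # path halving
            x = self.parent[x]
        return x
    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False                                    # edge closes a cycle
        self.parent[ra] = rb
        return True

def insert_batch(uf, forest, batch):
    """Apply a batch of edge insertions, keeping only tree edges."""
    for u, v in batch:
        if uf.union(u, v):
            forest.append((u, v))
    return forest

if __name__ == "__main__":
    uf, forest = UnionFind(6), []
    insert_batch(uf, forest, [(0, 1), (1, 2), (0, 2), (3, 4)])
    print(forest)   # (0, 2) is skipped: it closes a cycle
```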
ISBN (print): 9780818679773
Significant advances have been made in compilation technology for capitalizing on instruction-level parallelism (ILP). The vast majority of ILP compilation research has been conducted in the context of general-purpose computing, and more specifically the SPEC benchmark suite. At the same time, a number of microprocessor architectures have emerged which have VLIW and SIMD structures that are well matched to the needs of the ILP compilers. Most of these processors are targeted at embedded applications such as multimedia and communications, rather than general-purpose systems. Conventional wisdom, and a history of hand optimization of inner loops, suggests that ILP compilation techniques are well suited to these applications. Unfortunately, there currently exists a gap between the compiler community and embedded application developers. This paper presents MediaBench, a benchmark suite that has been designed to fill this gap. This suite has been constructed through a three-step process: intuition and market driven initial selection, experimental measurement to establish uniqueness, and integration with system synthesis algorithms to establish usefulness.
ISBN (print): 9781665462723
Single-ISA heterogeneous multi-core architectures offer a compelling high-performance and high-efficiency solution to executing task-parallel workloads in mobile systems on chip (SoCs). In addition to task-parallel workloads, many data-parallel applications, such as machine learning, computer vision, and data analytics, increasingly run on mobile SoCs to provide real-time user interactions. Next-generation scalable vector architectures, such as the RISC-V Vector Extension and Arm SVE, have recently emerged as unified vector abstractions for both large- and small-scale systems. In this paper, we propose novel area-efficient high-performance architectures called big.VLITTLE that support next-generation vector architectures to efficiently accelerate data-parallel workloads in conventional big.LITTLE systems. big.VLITTLE architectures reconfigure multiple little cores on demand to work as a decoupled vector engine when executing data-parallel workloads. Our results show that a big.VLITTLE system can achieve 1.6× performance speedup over an area-comparable big.LITTLE system equipped with an integrated vector unit across multiple data-parallel applications and 1.7× speedup compared to an aggressive decoupled vector engine for task-parallel workloads.
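As a hedged illustration of the vector-length-agnostic style that scalable vector extensions such as RVV and Arm SVE encourage (and that a reconfigurable little-core vector engine would execute), the following Python sketch strip-mines a SAXPY loop over a configurable vector length. The function name and parameters are invented for the example.

```python
# Vector-length-agnostic strip-mining sketch: the loop asks how many elements can
# be processed per iteration (a hardware-dependent quantity, modeled here as a
# parameter) and handles the remainder automatically, so the same code would run
# on vector engines of different widths.
def saxpy_vla(a, x, y, vlen):
    """y[i] += a * x[i], processed vlen elements at a time (strip-mined)."""
    n = len(x)
    i = 0
    while i < n:
        vl = min(vlen, n - i)                 # "hardware-chosen" vector length this trip
        for j in range(i, i + vl):            # stands in for a single vector instruction
            y[j] += a * x[j]
        i += vl
    return y

if __name__ == "__main__":
    print(saxpy_vla(2.0, [1, 2, 3, 4, 5], [0, 0, 0, 0, 0], vlen=4))
```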
The CLA-EC is a model obtained by combining the concepts of cellular learning automata and evolutionary algorithms. The parallel structure of the CLA-EC makes it suitable for hardware-based applications, including evolvable hardware. In this paper, based on the SIMD model, a parallel architecture is proposed and implemented on an FPGA. Simulation results show that the proposed architecture can solve optimization problems thousands of times faster than sequential implementations.
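A simplified software sketch of the CLA-EC idea (not the paper's SIMD/FPGA design): each cell keeps one learning automaton per bit, samples a candidate genome from the automata's probabilities, and reinforces the probabilities toward the best genome in its ring neighborhood. The OneMax fitness and all constants are arbitrary choices for the example.

```python
# Simplified CLA-EC-style optimizer: a ring of cells, each with per-bit learning
# automata (kept as probabilities), exchanging the best genomes with neighbors.
import random

BITS, CELLS, STEPS, RATE = 16, 8, 200, 0.1   # illustrative constants

def onemax(genome):
    return sum(genome)                        # toy fitness: count of 1-bits

def sample(probs):
    return [1 if random.random() < p else 0 for p in probs]

def run_cla_ec():
    probs = [[0.5] * BITS for _ in range(CELLS)]      # one automaton per bit per cell
    best = [sample(p) for p in probs]
    for _ in range(STEPS):
        for c in range(CELLS):
            cand = sample(probs[c])                   # evolutionary step: new candidate
            if onemax(cand) > onemax(best[c]):
                best[c] = cand
            # ring neighborhood: learn from the better of the two neighbors
            nb = max(best[(c - 1) % CELLS], best[(c + 1) % CELLS], key=onemax)
            target = max(best[c], nb, key=onemax)
            for i, bit in enumerate(target):          # reinforce toward the target's bits
                probs[c][i] += RATE * ((1.0 if bit else 0.0) - probs[c][i])
    return max(best, key=onemax)

if __name__ == "__main__":
    print(run_cla_ec())   # typically converges to a (near) all-ones genome
```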