ISBN: (Print) 9783319162133
The proceedings contain 50 papers. The special focus in this conference is on Architecture, Modeling, Tools, Applications, Network-on-a-Chip, Cryptography Applications and Extended Abstracts. The topics include: reducing storage costs of reconfiguration contexts by sharing instruction memory cache blocks; a vector caching scheme for streaming FPGA SpMV accelerators; hardware synthesis from functional embedded domain-specific languages; operand-value-based modeling of dynamic energy consumption of soft processors in FPGA; a fully parallel particle filter architecture for FPGAs; teaching advanced reconfigurable architectures and tools; dynamic memory management in Vivado HLS for scalable many-accelerator architectures; place and route tools for the mitigation of single event transients on flash-based FPGAs; an advanced SystemC tracing and analysis framework for extra-functional properties; a run-time partial reconfiguration simulation framework based on dynamically loadable components; architecture virtualization for run-time hardware multithreading on field-programmable gate arrays; a survey on real-time network-on-chip architectures; hardware benchmarking of cryptographic algorithms using high-level synthesis tools; an efficient and flexible FPGA implementation of a face detection system; a dynamically reconfigurable mixed analog-digital filter bank; a timing-driven cycle-accurate simulation for coarse-grained reconfigurable architectures; a novel concept for adaptive signal processing on reconfigurable hardware; a modular acquisition and stimulation system for timestamp-driven neuroscience experiments; DRAM row activation energy optimization for stride memory access on FPGA-based systems; acceleration of data streaming classification using reconfigurable technology; partial reconfiguration for dynamic mapping of task graphs onto a 2D mesh platform; and a challenge of portable and high-speed FPGA accelerators.
ISBN: (Print) 9781450326711
Frequent item counting is one of the most important operations in time series data mining algorithms, and the space-saving algorithm is a widely used approach to this problem. With the rapid rise of data input rates, the most challenging problem in frequent item counting is meeting the requirement of wire-speed processing. In this paper, we propose a streaming-oriented PE-ring framework on FPGA for counting frequent items. Compared with the best existing FPGA implementation, our basic PE-ring framework reduces lookup table resource usage by 50% and achieves the same throughput in a more scalable way. Furthermore, we adopt a SIMD-like cascaded filter for further performance improvements, which outperforms the previous work by up to 3.24 times on some data distributions.
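The space-saving algorithm the abstract builds on can be sketched in a few lines. This is a minimal software reference of the classic counter-based scheme, not the paper's PE-ring hardware; the function and variable names are illustrative:

```python
def space_saving(stream, k):
    """Approximate frequent-item counting with at most k counters
    (the space-saving algorithm): when a new item arrives and all
    counters are in use, evict the minimum counter and let the new
    item inherit its count, which bounds the overestimation error."""
    counts = {}  # item -> estimated count
    for item in stream:
        if item in counts:
            counts[item] += 1
        elif len(counts) < k:
            counts[item] = 1
        else:
            # Evict the item with the smallest count; the newcomer
            # takes over that count plus one.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts
```

A useful invariant for checking an implementation: the counter values always sum to the number of items processed, and an item that is never evicted keeps its exact count.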
ISBN: (Print) 9781450326711
OmpSs is an OpenMP-like directive-based programming model that includes heterogeneous execution (MIC, GPU, SMP, etc.) and runtime management of task dependencies. Indeed, OmpSs has largely influenced the recently appeared OpenMP 4.0 specification. The Zynq All-Programmable SoC combines the features of an SMP and an FPGA and benefits from DLP, ILP and TLP parallelism in order to efficiently exploit new technology improvements and chip resource capacities. In this paper, we focus on programmability and heterogeneous execution support, presenting a successful combination of the OmpSs programming model and the Zynq All-Programmable SoC platforms.
ISBN: (Print) 9781450325929
In this paper, we describe the challenges that place and route tools face in implementing user designs on modern FPGAs while meeting timing and power constraints.
ISBN: (Print) 9781450326711
Packing is a critical step in the CAD flow for cluster-based FPGA architectures, and has a significant impact on the quality of the final placement and routing results. One basic quality metric is routability. Traditionally, minimizing cut (the number of external signals) has been used as the main criterion in packing for routability optimization. This paper shows that minimizing cut is a sub-optimal criterion, and argues for using the Rent characteristic as the new criterion for FPGA packing. We further propose using a recursive bipartitioning-based k-way partitioner to optimize the Rent characteristic during packing. We developed a new packer, PPack2, based on this approach. Compared to T-VPack, PPack2 achieves 35.4%, 35.6%, and 11.2% reductions in wire length, minimum channel width, and critical path delay, respectively. These improvements show that PPack2 outperforms all previous leading packing tools (including iRAC, HDPack, and the original PPack) by a wide margin.
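The Rent characteristic referenced above comes from Rent's rule, the empirical power law T = t * g^p relating a cluster's external terminal count T to its size g, with t the average terminals per block and p the Rent exponent. A small sketch, assuming nothing beyond that textbook relation (the fitting helper is illustrative, not the paper's method):

```python
import math

def rent_terminals(g, t=4.0, p=0.65):
    """Rent's rule: expected external terminal count T = t * g**p
    for a cluster of g logic blocks."""
    return t * g ** p

def fit_rent_exponent(cluster_sizes, terminal_counts):
    """Least-squares fit of (t, p) on a log-log scale, since
    log T = log t + p * log g is linear in log g."""
    xs = [math.log(g) for g in cluster_sizes]
    ys = [math.log(T) for T in terminal_counts]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    p = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    t = math.exp(my - p * mx)
    return t, p
```

Intuitively, a packer that optimizes the Rent characteristic aims to keep p low across cluster sizes, rather than only minimizing cut at one level of the hierarchy.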
ISBN: (Print) 9781467383899
Irregular applications, by their very nature, suffer from poor data locality. This often results in high cache miss rates and many long waits on off-chip memory. Historically, long latencies have been dealt with in two ways: (1) latency mitigation using large cache hierarchies, or (2) latency masking, where threads relinquish control after issuing a memory request. Multithreaded CPUs are designed for a fixed maximum number of threads tailored to an average application. FPGAs, however, can be customized to specific applications. Their massive parallelism is well known, and is ideally suited to dynamically managing hundreds or thousands of threads. Multithreading, in essence, trades memory bandwidth for latency. Therefore, to achieve high throughput, the system must support a large memory bandwidth. Many irregular applications, however, must rely on inter-thread synchronization for parallel execution, and in-memory synchronization suffers from very long memory latencies. In this paper we describe the use of CAMs (Content Addressable Memories) as synchronizing caches for hardware multithreading. We demonstrate and evaluate this mechanism using graph breadth-first search (BFS).
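The synchronization point the abstract targets is easy to see in the BFS kernel itself. A minimal sequential sketch, with a comment marking the test-and-set that a CAM-based synchronizing cache would make atomic across hardware threads (the code is a software analogue, not the paper's design):

```python
from collections import deque

def bfs_levels(adj, root):
    """Level-synchronous BFS over an adjacency-dict graph.
    Returns {vertex: distance from root}."""
    level = {root: 0}            # plays the role of the visited/level array
    frontier = deque([root])
    while frontier:
        next_frontier = deque()
        for u in frontier:
            for v in adj.get(u, ()):
                # In a multithreaded accelerator, this check-and-claim on
                # vertex v is the inter-thread synchronization that must
                # be atomic; here it is a plain dictionary test.
                if v not in level:
                    level[v] = level[u] + 1
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```

When hundreds of threads expand the frontier concurrently, two threads may race to claim the same neighbor; serializing only the racing addresses (rather than all memory traffic) is what makes a small associative structure attractive for this role.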
ISBN: (Print) 9781467383899
In this paper we present a set of techniques that enable the synthesis of efficient custom accelerators for memory-intensive, irregular applications. To address the challenges of irregular applications (large memory footprint, unpredictable fine-grained data accesses, and high synchronization intensity), and to exploit their opportunities (thread-level parallelism, memory-level parallelism), we propose a novel accelerator design that employs an adaptive Distributed Controller (DC) architecture and a Memory Interface Controller (MIC) that supports concurrent and atomic memory operations on a multi-ported/multi-banked shared memory. Among the multitude of algorithms that may benefit from our solution, we focus on the acceleration of graph analytics applications and, in particular, on the synthesis of SPARQL queries on Resource Description Framework (RDF) databases. We achieve this objective by incorporating the synthesis techniques into Bambu, an open-source high-level synthesis tool, and interfacing it with GEMS, the Graph database Engine for Multithreaded Systems. The GEMS front-end generates optimized C implementations of the input queries, modeled as graph pattern matching algorithms, which are then automatically synthesized by Bambu. We validate our approach by synthesizing several SPARQL queries from the Lehigh University Benchmark (LUBM).
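The "queries modeled as graph pattern matching" step can be illustrated concretely: the core of a SPARQL basic graph pattern is a conjunction of triple patterns matched against (subject, predicate, object) triples. A small sketch, with hypothetical data and variable names (not GEMS's generated C, and not actual LUBM vocabulary):

```python
def match_pattern(triples, patterns):
    """Match a conjunctive triple pattern (the core of a SPARQL basic
    graph pattern) against (subject, predicate, object) triples.
    Pattern terms starting with '?' are variables; matching extends a
    variable binding pattern by pattern, like a nested-loop join."""
    def unify(triple, pattern, binding):
        b = dict(binding)
        for term, pat in zip(triple, pattern):
            if pat.startswith('?'):
                if pat in b and b[pat] != term:
                    return None          # conflicting variable binding
                b[pat] = term
            elif pat != term:
                return None              # constant term mismatch
        return b

    bindings = [{}]
    for pattern in patterns:
        bindings = [b2 for b in bindings for t in triples
                    if (b2 := unify(t, pattern, b)) is not None]
    return bindings
```

Each triple pattern adds one join level, which is why such queries expose both thread-level parallelism (independent partial bindings) and irregular, fine-grained memory accesses.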
ISBN: (Print) 9781450326711
When are FPGAs more energy efficient than processors? This question is complicated by technology factors and the wide range of application characteristics that can be exploited to minimize energy. Using a wire-dominated energy model to estimate the absolute energy required for programmable computations, we determine when spatially organized programmable computations (FPGAs) require less energy than temporally organized programmable computations (processors). The point of crossover depends on the metal layers available, the locality, the SIMD word-width regularity, and the compactness of the instructions. When the Rent exponent, p, is less than 0.7, the spatial design is always more energy efficient. When p = 0.8, the technology offers 8 metal layers for routing, and data can be organized into 16b words and processed in tight loops of no more than 128 instructions, the temporal design uses less energy when the number of LUTs is greater than 64K. We further show that heterogeneous multicontext architectures can use even less energy than the p = 0.8, 16b-word temporal case.
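The reported crossover can be condensed into a decision rule. This sketch encodes only the specific data points stated in the abstract; the paper's full energy model is far richer, and the function name and parameters are illustrative:

```python
def prefer_spatial(p, metal_layers, word_bits, loop_len, luts):
    """Illustrative decision rule from the abstract's reported results:
    True  -> spatial (FPGA) organization expected to use less energy,
    False -> temporal (processor) organization expected to use less,
    None  -> outside the data points the abstract reports."""
    if p < 0.7:
        return True                      # spatial always wins
    if (p == 0.8 and metal_layers == 8 and word_bits == 16
            and loop_len <= 128):
        return luts <= 64 * 1024         # temporal wins past ~64K LUTs
    return None
```

The rule makes the trend visible: higher Rent exponents (less locality) and regular, word-wide data in compact loops both push the crossover toward the temporal design.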
Summary form only given. The complete presentation was not made available for publication as part of the conference proceedings. The slowdown of conventional CPU scaling, predicted over a decade ago, has indeed come true. The shift to multicore provided some initial benefit, but scaling the number of cores has not provided the commensurate performance boost to keep pace with historic trends in CPU performance and efficiency. Dataflow and Explicit Dataflow Graph Execution (EDGE) architectures have the potential to provide a path forward for computer architecture scaling without radically altering silicon process technology, but programmability and a lack of available hardware have limited the impact that dataflow and EDGE have had. Field-programmable gate arrays (FPGAs) offer an interesting platform for bridging the gap between conventional von Neumann CPUs and dataflow/EDGE architectures. The programming model for FPGAs, at its core, is dataflow. With some minor modifications, this can become EDGE. This has the potential to solve the issue of hardware, and provides a viable platform for experimenting with dataflow and EDGE compilation and architectures. But FPGAs are far more than just a prototyping platform. In the datacenter, they are possibly the most powerful way of improving the performance and efficiency of cloud applications. At Microsoft, we developed Catapult, an FPGA platform customized to the datacenter, which provides enormous gains in performance and efficiency for datacenter workloads. This talk will discuss the Catapult platform and the synergy between dataflow, EDGE, and the Catapult FPGA platform, and show the enormous potential that computer architects have to extend performance and efficiency in the face of slowing CPU scaling.