ISBN: 9782839918442 (print)
Key-value stores (KVS) have become critical in many applications because of the recent data explosion. There is strong demand to improve the throughput and reduce the latency of KVS. FPGA-based parallel architectures can bring excellent performance and power efficiency. Cuckoo hashing has proven to be an efficient approach to implementing KVS, with good memory utilization and constant worst-case access time. In this paper, an FPGA-based KVS implementation is proposed based on Cuckoo hashing, with decoupled storage to achieve 81.7% memory utilization and a pipeline scheme to achieve high performance. The latency of insert, search and delete operations is only 40 ns, and the throughput for search and delete can reach 200 million requests per second (MRPS), which is 5x faster than [1]. Even when the load factor reaches 0.9, the insert throughput still achieves 147 MRPS.
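As a rough software illustration of why Cuckoo hashing gives constant worst-case search time, the sketch below keeps two tables and probes at most one slot in each. It is not the paper's FPGA design: the class name `CuckooTable`, the use of Python's built-in `hash`, the `max_kicks` limit and the overflow `stash` are all assumptions made for this example, and the decoupled storage and pipelining described in the abstract are not modelled.

```python
class CuckooTable:
    """Two-table cuckoo hash: every key has one candidate slot per table."""

    def __init__(self, slots_per_table=8, max_kicks=32):
        self.n = slots_per_table
        self.max_kicks = max_kicks          # assumed eviction limit
        self.tables = [[None] * self.n for _ in range(2)]
        self.stash = {}                     # assumed overflow area for unplaced items

    def _slot(self, which, key):
        # Two hash functions derived from Python's hash() (illustrative only).
        return hash((which, key)) % self.n

    def _locate(self, key):
        """Return (table, index) of the slot holding key, or None."""
        for which in (0, 1):
            i = self._slot(which, key)
            entry = self.tables[which][i]
            if entry is not None and entry[0] == key:
                return which, i
        return None

    def search(self, key):
        # At most two table probes, independent of the load factor.
        loc = self._locate(key)
        if loc is not None:
            return self.tables[loc[0]][loc[1]][1]
        return self.stash.get(key)

    def delete(self, key):
        loc = self._locate(key)
        if loc is not None:
            self.tables[loc[0]][loc[1]] = None
            return True
        if key in self.stash:
            del self.stash[key]
            return True
        return False

    def insert(self, key, value):
        loc = self._locate(key)
        if loc is not None:                 # key already stored: update in place
            self.tables[loc[0]][loc[1]] = (key, value)
            return
        if key in self.stash:
            self.stash[key] = value
            return
        item, which = (key, value), 0
        for _ in range(self.max_kicks):
            i = self._slot(which, item[0])
            # Place the item and pick up whatever occupied the slot.
            item, self.tables[which][i] = self.tables[which][i], item
            if item is None:
                return
            which ^= 1                      # evicted item tries its other table
        self.stash[item[0]] = item[1]       # give up kicking; keep the item findable


table = CuckooTable()
table.insert("alpha", 1)
table.insert("beta", 2)
print(table.search("alpha"), table.search("gamma"))
```

Insertion is the only operation whose cost varies, since it may trigger a chain of evictions; that is consistent with the abstract reporting a lower insert throughput at a load factor of 0.9 than for search and delete.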
ISBN: 9781424403127 (print)
We describe architectural enhancements to Xilinx FPGAs that provide better support for the creation of dynamically reconfigurable designs. These are augmented by a new design methodology that uses pre-routed IP cores for communication between static and dynamic modules and permits static designs to route through regions otherwise reserved for dynamic modules. A new CAD tool flow to automate the methodology is also presented. The new tools initially target the Virtex-II, Virtex-II Pro and Virtex-4 families and are derived from Xilinx's commercial CAD tools.
ISBN: 9781424410590 (print)
FPGA CAD tools require wirelength predictions to make informed decisions through clustering, placement and routing stages towards power, area or delay based design goals. Unfortunately, there has been minimal work devoted to estimating individual wirelengths early in the CAD flow. Rent's rule can be used to generate a wirelength distribution but cannot be used to predict lengths of individual wires. Hence, this paper explores "structural metrics" that have been found to possess strong predictive qualities in the ASIC domain. To our knowledge this is a first study in the application of these metrics in the FPGA CAD flow. Results show that the studied metrics capture characteristics of placement optimization carried out by VPR, and hence, are good indicators of post-placement wirelengths.
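For context, Rent's rule is usually written as the power law below, relating the number of external terminals T of a block of logic to the number of cells G it contains. The symbols t (average terminals per cell) and p (the Rent exponent) follow the general literature and are not notation taken from this abstract.

$$ T = t \cdot G^{p}, \qquad 0 < p < 1 $$

Because the relation is statistical over partitions of a circuit, it yields a distribution of wirelengths rather than the length of any particular net, which is the limitation the abstract points to.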
ISBN: 9798331530082; 9798331530075 (print)
Convolutional Neural Networks (CNNs) combine large amounts of parallelizable computation with frequent memory access. Field-Programmable Gate Arrays (FPGAs) can achieve low-latency and high-throughput CNN inference by implementing dataflow accelerators that pipeline layer-specific hardware to implement an entire network. By implementing a different processing element for each CNN layer, these layer-pipelined accelerators can achieve high compute density, but having all layers processing in parallel requires high memory bandwidth. Traditionally this has been satisfied by storing all weights on chip, but this is infeasible for the largest CNNs, which are often those most in need of acceleration. In this work we augment a state-of-the-art dataflow accelerator (HPIPE) to leverage both High-Bandwidth Memory (HBM) and on-chip storage, enabling high-performance layer-pipelined dataflow acceleration of large CNNs. Based on profiling results of HBM's latency and throughput against expected address patterns, we develop an algorithm to choose which weight buffers should be moved off chip and how deep the on-chip FIFOs to HBM should be to minimize compute unit stalling. We integrate the new hardware generation within the HPIPE domain-specific CNN compiler and demonstrate good bandwidth efficiency against theoretical limits. Compared to the best prior work we obtain speed-ups of at least 19.4x, 5.1x and 10.5x on ResNet-18, ResNet-50 and VGG-16 respectively.
ISBN: 9789090304281 (print)
Recently, there has been an increased focus on integration of reconfigurable fabric with modern processors. However, existing soft-processors are optimized to leverage older FPGA fabrics, focus primarily on resource minimization and have fixed-pipeline designs that limit the scope for tightly integrated hardware accelerators. In this work, we present Taiga: a 32-bit RISC-V soft-processor architecture supporting the RISC-V Multiply/Divide and Atomic operations extensions (RV32IMA), designed to support Linux-based shared-memory systems. The processor design is highly configurable and features a standardized interface for functional units, allowing for ease of integration of new functional units. Despite a more complex pipeline, our design uses approximately 33% fewer slices while clocking 39% faster than a LEON3-based system built on a Xilinx Zynq XC7Z020.
ISBN: 9781424410590 (print)
This paper develops a formal model of process migration that describes programs, processes, and the migration of those processes within a migration realm. A migration realm is a group of processors modeled as finite state machines. The model is motivated by a migration application between software and Field-Programmable Gate Array (FPGA) hardware, and the theorems of the model guide the use of FPGA resources while guaranteeing complete and correct execution of a process. By defining different types of migration realms, this paper also develops a migration realm taxonomy.
ISBN: 9781424410590 (print)
A method is described for enumerating the frequencies of DNA subsequences on a system comprising a host computer and a field-programmable gate array (FPGA) board with one FPGA. Frequencies of subsequences with lengths of up to K0 + K1 + K2 (24 in the current implementation) are enumerated in three phases. In these three phases, subsequences with lengths of up to K0, K0 + K1, and K0 + K1 + K2, respectively, are enumerated; these three phases are executed simultaneously on a pipelined circuit, resulting in high performance. The enumeration of frequent subsequences in databases, which are becoming larger and larger, will enable subsequences that are unique and/or repeatedly used in many parts of the sequences to be found.
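The basic operation being accelerated, counting how often each fixed-length subsequence occurs, can be sketched in software as follows. This is a plain single-pass counter and does not reproduce the three-phase pipelined circuit from the abstract; the function names `count_kmers` and `decode`, and the 2-bit base encoding, are assumptions chosen for this example.

```python
from collections import Counter

# Assumed 2-bit encoding of the four bases.
CODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def count_kmers(seq, k):
    """Count every length-k subsequence of seq, keyed by its 2-bit packed value."""
    mask = (1 << (2 * k)) - 1   # keeps only the last k bases
    counts = Counter()
    word = 0                    # packed window of the most recent bases
    valid = 0                   # consecutive valid bases seen so far
    for base in seq:
        code = CODE.get(base)
        if code is None:        # unknown symbol (e.g. 'N') resets the window
            word, valid = 0, 0
            continue
        word = ((word << 2) | code) & mask
        valid += 1
        if valid >= k:
            counts[word] += 1
    return counts


def decode(word, k):
    """Turn a packed value back into its base string."""
    return "".join("ACGT"[(word >> (2 * (k - 1 - i))) & 3] for i in range(k))


counts = count_kmers("ACGTACGTNACG", 4)
for word, freq in counts.most_common():
    print(decode(word, 4), freq)
```

Packing each base into two bits means a subsequence of length 24 fits in a 48-bit word, which is one reason this kind of counting maps naturally onto fixed-width hardware.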
A multi-threaded microprocessor with a customisable instruction set, CUStomisable Threaded ARchitecture (CUSTARD), is proposed. CUSTARD features include design space exploration and a compiler for automatic selection of custom instructions. Custom instructions, optimised for a specific application, accelerate frequently performed computations by implementing them as dedicated hardware. Field-programmable gate array implementations of CUSTARD are evaluated using media and cryptography benchmarks and compared with the commercial MicroBlaze processor. An area overhead as low as 28% for four interleaved threads and a speedup of up to 355% over a processor without custom instructions are demonstrated.
ISBN: 9781424410590 (print)
A new scalable systolic hardware architecture for RSA cryptosystems is presented. The kernel of the architecture can operate with different input precisions, which enables area-time tradeoffs in the design. The add-shift Montgomery algorithm is used for modular multiplication. Unlike previous approaches, after the add operation the result is shifted to the previous systole to divide by the radix. This simplifies the structure of the processing elements. The R-L binary Montgomery exponentiation algorithm is used, and the square and multiply operations are performed in parallel. The architecture is implemented in Xilinx Virtex-5 FPGA (Field-Programmable Gate Array) chips for different radixes. The DSP48E slices in the FPGA chips are used to increase the throughput of the design. The results are compared with the literature. The highest performance per area is obtained with the radix-2^16 design.
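The two algorithms named above can be sketched in software as follows. This is a radix-2 version on Python integers, not the systolic, higher-radix design of the paper; the function names `mont_mul` and `mont_pow` are assumptions for this example, and the modulus is assumed to be odd, as it is for RSA.

```python
def mont_mul(a, b, n, k):
    """Add-shift Montgomery product: a * b * 2^(-k) mod n, for odd n and a, b < n < 2^k."""
    s = 0
    for i in range(k):
        s += ((a >> i) & 1) * b   # add step: accumulate b when bit i of a is set
        if s & 1:                 # make s even by adding the (odd) modulus
            s += n
        s >>= 1                   # shift step: exact division by the radix 2
    return s - n if s >= n else s


def mont_pow(base, exp, n):
    """Right-to-left (R-L) binary exponentiation in the Montgomery domain."""
    if n < 3 or n % 2 == 0:
        raise ValueError("modulus must be odd and at least 3")
    k = n.bit_length()
    r2 = pow(2, 2 * k, n)               # R^2 mod n, with R = 2^k
    x = mont_mul(base % n, r2, n, k)    # base converted to Montgomery form
    acc = mont_mul(1, r2, n, k)         # Montgomery form of 1
    while exp:
        # The multiply and the square both read the old x, so they are
        # independent and could run side by side in hardware.
        if exp & 1:
            acc = mont_mul(acc, x, n, k)
        x = mont_mul(x, x, n, k)
        exp >>= 1
    return mont_mul(acc, 1, n, k)       # convert back out of Montgomery form


print(mont_pow(7, 65537, 3233) == pow(7, 65537, 3233))
```

Scanning the exponent from the least significant bit is what makes the square and the multiply in each iteration independent of one another; the left-to-right variant has a data dependency between them.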
ISBN: 9781728199023 (print)
Radar is one of the domains where adaptability is paramount and algorithms must be adapted to system state. However, most systems include static implementations on FPGA or ASIC to process the massive amount of data from multiple sensors in parallel. The classic approach is to configure hardware logic through registers to switch radar modes, which requires hardwiring all configurations. In embedded systems, FPGA dynamic partial reconfiguration (DPR) is a promising solution to reuse scarce resources. In this paper, we use DPR for radar processing in order to switch between a classic discrete Fourier transform (DFT) sum and a fast Fourier transform (FFT) to enhance Doppler extraction. Our study explores the pros and cons of both methods. Based on these observations, we propose a new architecture and decision method that relies on radar QoS to enable an efficient self-adaptive solution. Finally, we provide a case study and a hardware-in-the-loop simulation with a reconfigurable radar implementation.
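The two transforms being switched between can be compared in a few lines of software. The sketch below is a generic textbook formulation, not the paper's hardware: `dft_bin`, `dft` and `fft` are names chosen for this example, and the FFT shown is the recursive radix-2 form, which assumes a power-of-two input length.

```python
import cmath


def dft_bin(x, k):
    """Direct DFT sum for a single frequency bin k: O(N) work per bin."""
    n = len(x)
    return sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))


def dft(x):
    """All N bins by direct summation: O(N^2) work in total."""
    return [dft_bin(x, k) for k in range(len(x))]


def fft(x):
    """Recursive radix-2 FFT: all N bins in O(N log N); len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        tw = cmath.exp(-2j * cmath.pi * k / n) * odd[k]   # twiddle factor times odd half
        out[k] = even[k] + tw
        out[k + n // 2] = even[k] - tw
    return out


# A synthetic slow-time signal with a single Doppler tone in bin 3 (illustrative data).
pulses = [cmath.exp(2j * cmath.pi * 3 * t / 16) for t in range(16)]
spectrum = fft(pulses)
peak = max(range(len(spectrum)), key=lambda k: abs(spectrum[k]))
print(peak, abs(dft_bin(pulses, peak)))
```

One general trade-off visible here is that the direct sum can evaluate only the bins of interest, while the FFT is cheaper when the full spectrum is needed; the abstract does not spell out which pros and cons the authors weigh.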