ISBN: 9781424410590 (print)
With the increasing capacity of FPGAs following Moore's law, it is possible to build, in a single FPGA, a large system on chip (SoC) composed of several cores. Their performance depends strongly on their interconnection structure. Traditional and hierarchical buses are not suitable. Networks on Chip (NoCs), due to characteristics such as scalability, flexibility, and high bandwidth, have been proposed as a valid approach to meet communication requirements in SoCs. Most current NoCs use a mesh topology. With a mesh topology, the central channels are heavily loaded, which often leads to congestion in the center area of the mesh. The solution in such a situation is to add routers to the mesh or to use a torus topology which, thanks to the symmetry introduced by connecting the routers on opposite edges, copes well with congestion at the cost of a small increase in resources. In this paper, we propose a scalable implementation of a NoC for FPGAs using a torus topology. We propose a router architecture, a routing algorithm, and a solution to the problem introduced by the long wires in a torus topology.
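The abstract does not give the routing algorithm itself, so the following is only a minimal sketch of why wraparound links help: it compares hop counts in a mesh and a torus of the same size. The function names and the 4x4 grid are illustrative assumptions, not taken from the paper.

```python
def mesh_hops(src, dst):
    """Hop count between two (x, y) routers in a mesh (Manhattan distance)."""
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def torus_hops(src, dst, width, height):
    """Hop count in a torus: in each dimension a packet may take the
    wraparound link between opposite edges if that path is shorter."""
    dx = abs(src[0] - dst[0])
    dy = abs(src[1] - dst[1])
    return min(dx, width - dx) + min(dy, height - dy)


def average_hops(width, height, hops):
    """Average hop count over all ordered pairs of distinct routers."""
    nodes = [(x, y) for x in range(width) for y in range(height)]
    pairs = [(a, b) for a in nodes for b in nodes if a != b]
    return sum(hops(a, b) for a, b in pairs) / len(pairs)


if __name__ == "__main__":
    w = h = 4  # hypothetical grid size, chosen only for illustration
    print("mesh :", average_hops(w, h, mesh_hops))
    print("torus:", average_hops(w, h, lambda a, b: torus_hops(a, b, w, h)))
```

Corner-to-corner traffic shows the effect most clearly: in a 4x4 grid the mesh needs six hops where the torus needs two, so less traffic has to cross the center of the grid.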
A novel field-programmable gate array (FPGA) logic synthesis technique that determines whether a logic function can be implemented in a given programmable circuit is presented, and how this problem can be formalised and solved using quantified Boolean satisfiability is described. This technique is general enough to be applied to any type of logic function and programmable circuit; thus, it has many applications to FPGAs. The application demonstrated is FPGA programmable logic block evaluation, and the results show that this tool allows radical new features of FPGA logic blocks to be evaluated in a rigorous scientific way.
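One way to read the question posed here is: does there exist a configuration of the programmable circuit such that, for all input values, the circuit output equals the target function? The sketch below answers that question by brute-force enumeration on toy circuits. It is not a quantified Boolean satisfiability solver and not the authors' tool, and the helper names (`find_configuration`, `lut2`, `and_or_cell`) are hypothetical.

```python
from itertools import product


def find_configuration(target, circuit, n_inputs, n_config_bits):
    """Return a configuration c with circuit(c, x) == target(x) for every
    input x, or None if no such configuration exists (exists-forall check)."""
    for config in product((0, 1), repeat=n_config_bits):
        if all(circuit(config, x) == target(x)
               for x in product((0, 1), repeat=n_inputs)):
            return config
    return None


def lut2(config, x):
    """Hypothetical 2-input LUT: the 4 configuration bits are its truth table."""
    return config[2 * x[0] + x[1]]


def and_or_cell(config, x):
    """Hypothetical cell with one configuration bit: AND if 0, OR if 1."""
    return (x[0] | x[1]) if config[0] else (x[0] & x[1])


def xor2(x):
    """Target logic function: 2-input XOR."""
    return x[0] ^ x[1]


if __name__ == "__main__":
    print(find_configuration(xor2, lut2, 2, 4))         # a LUT can realise XOR
    print(find_configuration(xor2, and_or_cell, 2, 1))  # the AND/OR cell cannot
```

Enumeration grows exponentially in both the configuration bits and the inputs, which is why a solver-based formulation is attractive for realistic logic blocks.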
In this paper, we present the design and practical use of a programmable logic controller (PLC) training station for use in an undergraduate electrical and computer engineering curriculum. The trainer, based on the Al...
ISBN: 9781424410590 (print)
The programmable clock networks in FPGAs have a significant impact on overall power, area, and delay. Not only does the clock network itself dissipate a significant amount of power, since it connects to every latch on the FPGA and toggles every cycle, but the design of the clock network also affects how efficiently the rest of the application can be implemented, since it imposes constraints on the CAD tools that map the application onto the FPGA. To examine this tradeoff, this paper describes and compares new clock-aware placement techniques and then examines how the clock network architecture affects overall power, area, and delay. Our results show that the techniques used to make placement clock-aware have a significant influence on power and delay. On average, circuits placed using the most effective techniques dissipated 9.9% less energy and were 2.4% faster than circuits placed using the least effective techniques. Moreover, the results show that the clock network architecture is also important. On average, FPGAs with an efficient clock network were up to 12.5% more energy efficient and 7.2% faster than other FPGAs.
ISBN: 9781424410590 (print)
We are developing a set of reusable design blocks and several prototype systems for emulation of multi-core architectures in FPGAs. RAMP Blue is the first of these prototypes and was designed to emulate a distributed-memory message-passing architecture. The system consists of 768-1008 MicroBlaze cores in 64-84 Virtex-II Pro 70 FPGAs on 16-21 BEE2 boards, surpassing the milestone of 1000 cores in a standard 42U rack. An architecture based on point-to-point channels and switches using a combination of custom and generic hardware provides the functionality. Virtual-cut-through dimensional routing on one of two hybrid topologies with virtual channels provides the connectivity. A control network with a tree topology provides management and debugging capabilities. A software infrastructure consisting of GCC, uClinux and UPC allows running off-the-shelf applications and scientific benchmarks. Initial performance is encouraging for emulation purposes. In this paper we report on the design and implementation of RAMP Blue and discuss our experiences and lessons learned.
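The system sizes quoted above can be cross-checked with simple arithmetic, reading the core count as a range from 768 to 1008. The sketch below uses only the figures in the abstract; the per-FPGA and per-board ratios are derived values, and the helper name is an illustrative assumption.

```python
def per_unit(total, units):
    """Divide a total evenly across units, failing if it does not divide."""
    quotient, remainder = divmod(total, units)
    if remainder:
        raise ValueError(f"{total} does not divide evenly across {units}")
    return quotient


# (cores, FPGAs, boards) for the smallest and largest configurations quoted.
configurations = [(768, 64, 16), (1008, 84, 21)]

for cores, fpgas, boards in configurations:
    print(f"{cores} cores: {per_unit(cores, fpgas)} cores per FPGA, "
          f"{per_unit(fpgas, boards)} of these FPGAs per board")
```

Both ends of the range give the same ratios, which is consistent with the system scaling by adding boards rather than by changing the per-FPGA design.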
ISBN: 9781424410590 (print)
This paper introduces a software-supported methodology for exploring and evaluating 3D FPGA architectures. Two new CAD tools are developed: (i) 3DPRO, for placement and routing on 3D FPGAs, and (ii) 3DPower, for power/energy estimation on such architectures. We mainly focus our exploration on the total number of layers and the number of vertical interconnects (vias). The efficiency of the proposed architecture is evaluated through an exhaustive exploration of via connections under the Energy×Delay Product criterion. Experimental results on the 20 largest MCNC benchmarks demonstrate the effectiveness of our solution. Considering 3D architectures with 4 layers and two scenarios of fabricated via densities (30% and 70%), we achieve average decreases in delay, wire length, and energy consumption of 18%, 17%, and 31%, respectively, compared to 2D FPGAs. We also achieve high utilization of via links.
ISBN: 9781424410590 (print)
The complexity of today's embedded applications requires modern high-performance embedded System-on-Chip (SoC) platforms to be multiprocessor architectures. Advances in FPGA technology make the implementation of such architectures in a single chip (MPSoC) feasible and very appealing. In recent years, FPGA vendors have integrated an enormous amount of hardware resources in their FPGAs, allowing larger and more complex MPSoCs to be built in the FPGA fabric. The main limitation on the size of an MPSoC that can be built in a single FPGA appears to be the amount of on-chip memory. To relax this limitation, the use of external (off-chip) memory has to be considered. State-of-the-art development tools support off-chip memory for (multi-master) shared bus architectures with arbitration of the memory accesses. Such architectures might be efficient for single-processor systems; however, for multiprocessor systems the shared bus concept significantly limits the system's performance even if a DMA mechanism is used. In this paper we present our approach and interface for using an external memory for inter-processor data communication in multiprocessor platforms. We propose a hierarchical memory system with a programmable controller that transfers data between external and on-chip memories using a DMA mechanism. Our approach does not require arbitration, which results in better overall performance. Results demonstrating the effectiveness of the proposed hierarchical memory system are presented as well.
ISBN: 9781424410590 (print)
This paper presents a new direct implementation of a popular RTOS with an associated application, the WiMAX physical layer, on reconfigurable computing architectures. A novel coarse-grained reconfigurable instruction cell based architecture is chosen as the target architecture. First, an RTOS, Micro C/OS-II, was ported to the target architecture; then the WiMAX physical layer program was partitioned into multiple OS tasks, which communicate with each other through the synchronization mechanisms provided by this RTOS. The WiMAX physical layer program has also been implemented on the ARM7TDMI processor. The results show that the performance of the target architecture is much better than that of the ARM7TDMI and is not limited by the bottleneck of memory latency.
ISBN: 9781424410590 (print)
Recently, there has been a surge of interest in using FPGAs for computer architecture research, with applications ranging from emulating and analyzing a new platform to accelerating microarchitectural simulation for design space exploration. This paper proposes and demonstrates a novel use of FPGAs for measuring the efficiency of coherence traffic in an actual computer system. Our approach employs an FPGA acting as a bus agent, interacting with a real CPU in a dual-processor system to measure the intrinsic delay of coherence traffic. This technique eliminates non-deterministic factors in the measurement, such as arbitration delay and stalls in the pipelined bus. It completely isolates the impact of pure coherence traffic delay on system performance while executing workloads natively. Our experiments show that the overall execution time of the benchmark programs on a system with coherence traffic actually increased over one without coherence traffic. This indicates that cache-to-cache transfers are less efficient in an Intel-based server system, and that there is room for further improvement, such as the inclusion of the O state and cache line buffers in the memory controller.
ISBN: 9781728148847 (print)
We present the first open-source TensorFlow to FPGA tool capable of running state-of-the-art DNNs. Running TensorFlow on the Amazon cloud FPGA instances, we provide competitive performance and higher accuracy compared to a proprietary tool, thus providing a public framework for research exploration in the DNN inference space. We also detail the optimizations needed to map modern DNN frameworks to FPGAs, provide novel analysis of design tradeoffs for FPGA DNN accelerators and present experiments across a range of DNNs.