The proceedings contain 24 papers. Topics discussed include logic design, field programmable gate arrays, pipelined routing and scheduling, logic synthesis, architecture of special purpose structures, field programmable gate array partitioning, applications, and bit-serial synthesis.
To truly exploit FPGAs for rapid turn-around development and prototyping, placement times must be reduced to seconds; late-bound, reconfigurable computing applications may demand placement times as short as microseconds. In this paper, we show how a systolic structure can accelerate placement by assigning one processing element to each possible location for an FPGA LUT from a design netlist. We demonstrate that our technique approaches the same quality point as traditional simulated annealing as measured by a simple linear wirelength metric. Experimental results look ahead to compare quality against VPR's fast placer when considering the minimum channel width required to route as the primary optimization criterion. Preliminary results from an FPGA implementation show the feasibility of accelerating simulated annealing by three orders of magnitude using this approach. This means we can place the largest design in the University of Toronto's "FPGA Placement and Routing Challenge" in around 4 ms.
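The abstract above does not define its "simple linear wirelength metric". As a rough illustration only, the sketch below assumes the common half-perimeter wirelength (HPWL) and shows how an annealer might score a candidate swap of two cells. The names `hpwl`, `total_wirelength`, and `swap_delta` and the data layout are hypothetical, not taken from the paper.

```python
# Sketch only: HPWL is an assumed stand-in for the paper's unspecified
# "simple linear wirelength metric". All names here are hypothetical.

def hpwl(net, placement):
    """Half-perimeter wirelength of one net: width plus height of its bounding box."""
    xs = [placement[cell][0] for cell in net]
    ys = [placement[cell][1] for cell in net]
    return (max(xs) - min(xs)) + (max(ys) - min(ys))


def total_wirelength(nets, placement):
    """Sum of HPWL over every net in the netlist."""
    return sum(hpwl(net, placement) for net in nets)


def swap_delta(nets, placement, a, b):
    """Change in total wirelength if cells a and b exchange locations."""
    touched = [net for net in nets if a in net or b in net]
    before = sum(hpwl(net, placement) for net in touched)
    trial = dict(placement)  # copy so the caller's placement is untouched
    trial[a], trial[b] = placement[b], placement[a]
    after = sum(hpwl(net, trial) for net in touched)
    return after - before


# Four cells on a grid, three two-pin nets.
placement = {"a": (0, 0), "b": (3, 0), "c": (0, 1), "d": (3, 1)}
nets = [("a", "b"), ("c", "d"), ("a", "c")]
print(total_wirelength(nets, placement))      # 7
print(swap_delta(nets, placement, "b", "c"))  # -2
```

A negative delta means the swap shortens the wiring. A simulated annealer would always accept such a move and would accept a positive delta with a probability that shrinks as the temperature falls.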
ISBN: (print) 9781450338561
Field programmable gate array (FPGA) implementations of sorting algorithms have proven to be efficient, but existing implementations lack portability and maintainability because they are written in low-level hardware description languages that require substantial domain expertise to develop and maintain. To address this problem, we develop a framework that generates sorting architectures for different requirements (speed, area, power, etc.). Our framework provides ten highly optimized basic sorting architectures, easily composes basic architectures to generate hybrid sorting architectures, enables non-hardware experts to quickly design efficient hardware sorters, and facilitates the development of customized heterogeneous FPGA/CPU sorting systems. Experimental results show that our framework generates architectures that perform at least as well as existing RTL implementations for arrays smaller than 16K elements, and are comparable to RTL implementations for sorting larger arrays. We demonstrate a prototype of an end-to-end system using our sorting architectures for large arrays (16K-130K) on a heterogeneous FPGA/CPU system.
ISBN: (print) 9781595939340
Current advances in chip design and manufacturing have allowed IC manufacturing to approach the nanometer range. As the feature size scales down, greater variability is experienced, forcing designers to reduce performance requirements in order to reserve larger margins. Better-than-worst-case design can be used to address the variability problem and to break the performance limit set by the worst-case delay in the conventional design style, even without considering delay variation. In this paper we present a novel methodology for measuring and optimizing the performance of circuits that operate with a clock period smaller than the worst-case delay. We also develop a novel technology mapping algorithm that optimizes circuits under such a metric. Using our novel mapping algorithm named BTWMap (Better Than Worst-case Mapper) and its area-optimized version named BTWMap+area, we are able to improve the overall circuit latency by 13% and 11%, respectively.
ISBN: (print) 9781450361378
A large semantic gap between the high-level synthesis (HLS) design and the low-level (on-board or RTL) simulation environment often creates a barrier for those who are not FPGA experts. Moreover, such low-level simulation takes a long time to complete. Software-based HLS simulators can help bridge this gap and accelerate the simulation process; however, we found that the current FPGA HLS commercial software simulators sometimes produce incorrect results. In order to solve this correctness issue while maintaining the high speed of a software-based simulator, this paper proposes a new HLS simulation flow named FLASH. The main idea behind the proposed flow is to extract the scheduling information from the HLS tool and automatically construct an equivalent cycle-accurate simulation model while preserving C semantics. Experimental results show that FLASH runs three orders of magnitude faster than the RTL simulation.
Spiking Neural Networks (SNNs) are the next generation of Artificial Neural Networks (ANNs) that utilize an event-based representation to perform more efficient computation. Most SNN implementations have a systolic ar...
ISBN: (print) 9781595936004
The Carnegie Mellon In Silico Vox project seeks to move best-quality speech recognition technology from its current software-only form into a range of efficient all-hardware implementations. The central thesis is that, like graphics chips, the application is simply too performance hungry, and too power sensitive, to stay as a large software application. As a first step in this direction, we describe the design and implementation of a fully functional speech-to-text recognizer on a single Xilinx XUP platform. The design recognizes a 1000 word vocabulary, is speaker-independent, recognizes continuous (connected) speech, and is a "live mode" engine, wherein recognition can start as soon as speech input appears. To the best of our knowledge, this is the most complex recognizer architecture ever fully committed to a hardware-only form. The implementation is extraordinarily small, and achieves the same accuracy as state-of-the-art software recognizers, while running at a fraction of the clock speed.
The routing resources available in recent FPGA architectures (e.g., Xilinx Virtex-II) are very different from those of the older generation of FPGAs (e.g., Xilinx XC4000). The latest FPGA architectures have heterogeneous routing resources, which include directly driven wires of different lengths and connectivity. Most FPGA routing algorithms have been implemented on XC4000-style architectures, and not much work has been reported for architectures with directly driven heterogeneous routing resources. Since routing resources in FPGAs are fixed, it is very important for the routing algorithms to fully exploit the potential of new routing architectures. FPGA routing architectures are usually represented as a routing resource graph (RRG). Constructing an RRG that accurately models all the routing resources of current FPGA architectures is not a trivial task. In this paper we present a simplified scheme to build the RRG for FPGA architectures with heterogeneous routing resources. Using our RRG construction scheme we have built a routability-driven detailed FPGA router named "Bison". We also present a heuristic based on dynamic weight updates, which we have incorporated into the router so that efficient utilization of routing resources can be achieved.
ISBN: (print) 9781450326711
Frequent item counting is one of the most important operations in time series data mining algorithms, and the space saving algorithm is a widely used approach to solving this problem. With the rapid rise of data input speeds, the most challenging problem in frequent item counting is to meet the requirement of wire-speed processing. In this paper, we propose a streaming-oriented PE-ring framework on FPGA for counting frequent items. Compared with the best existing FPGA implementation, our basic PE-ring framework uses 50% fewer lookup table resources and achieves the same throughput in a more scalable way. Furthermore, we adopt a SIMD-like cascaded filter for further performance improvements, which outperforms the previous work by up to 3.24 times for some data distributions.
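For readers unfamiliar with the space saving algorithm named above, here is a minimal software sketch of its textbook form. It is not the PE-ring hardware design from the paper; the function name `space_saving` and the dictionary-based counter table are illustrative choices.

```python
# Sketch only: the textbook space saving algorithm in software.
# This is NOT the paper's PE-ring FPGA design; names are hypothetical.

def space_saving(stream, k):
    """Track at most k counters and return a dict of item -> estimated count."""
    counts = {}
    for item in stream:
        if item in counts:
            # Item is already monitored: just increment it.
            counts[item] += 1
        elif len(counts) < k:
            # A free counter is available: start monitoring the item.
            counts[item] = 1
        else:
            # Table full: evict the item with the smallest count and let
            # the newcomer inherit that count plus one.
            victim = min(counts, key=counts.get)
            counts[item] = counts.pop(victim) + 1
    return counts


print(space_saving("aababcabad", k=2))  # {'a': 5, 'd': 5}
```

The estimates never undercount a monitored item, and they always sum to the stream length. In the example, `a` really occurs 5 times, while `d` occurs once and is overestimated because it inherited the evicted counter. The linear scan for the minimum is the kind of step a hardware design would have to restructure to reach wire speed.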
ISBN: (print) 9781581131932
A method for evaluating and constructing sparse crossbars which are both area efficient and highly routable is presented. The evaluation method uses a network flow algorithm to accurately compute the percentage of random test vectors that can be routed. The construction method attempts to maximize the spread of the switch locations, such that any given subset of input wires can connect to as many output wires as possible. Based on Hall's Theorem, we argue that this increases the likelihood of routing. The hardest test vectors to route are those which attempt to use all of the crossbar outputs. Results in this paper show that area-efficient sparse crossbars can be constructed by providing more outputs than required and a sufficient number of switches. In a few specific case studies, it is shown that sparse crossbars with about 90% fewer switches than a full crossbar can be constructed, and these crossbars are capable of routing over 95% of randomly chosen routing vectors. In one case, a new switch matrix which can replace the one in the Altera FLEX8000 family is shown. This new switch matrix uses approximately 14% more transistors, yet can increase the routability of the most difficult test vectors from 1% to over 96%.
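The routability test described above can be sketched as a bipartite matching problem, which is the unit-capacity special case of network flow. A test vector is routable exactly when every requested input can be given its own output, which is Hall's condition. The code below is an illustrative sketch using a simple augmenting-path search, not the paper's implementation; the `switches` layout and all function names are assumptions.

```python
# Sketch only: routability of a sparse crossbar as bipartite matching.
# `switches[i]` lists the outputs that input i has a switch to.
# All names and the data layout are hypothetical, not from the paper.

def routable(switches, active_inputs):
    """Return True if every active input can reach a distinct output."""
    match = {}  # output wire -> input wire currently assigned to it

    def try_assign(inp, seen):
        for out in switches[inp]:
            if out in seen:
                continue
            seen.add(out)
            # Take a free output, or re-route the current owner elsewhere.
            if out not in match or try_assign(match[out], seen):
                match[out] = inp
                return True
        return False

    return all(try_assign(inp, set()) for inp in active_inputs)


def routable_fraction(switches, vectors):
    """Fraction of the given test vectors that can be routed."""
    return sum(routable(switches, v) for v in vectors) / len(vectors)


# 4 inputs, 3 outputs, 6 switches instead of the 12 in a full crossbar.
switches = {0: [0, 1], 1: [0], 2: [1, 2], 3: [2]}
print(routable(switches, [0, 1, 2]))     # True
print(routable(switches, [0, 1, 2, 3]))  # False: four inputs, three outputs
```

To estimate the percentage the abstract refers to, one could pass `routable_fraction` a list of randomly drawn input subsets.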