A new high-performance systolic architecture for calculating the discrete Fourier transform (DFT) is described which is based on two levels of transform factorization. One level uses an index remapping that converts t...
详细信息
A new high-performance systolic architecture for calculating the discrete Fourier transform (DFT) is described which is based on two levels of transform factorization. One level uses an index remapping that converts the direct transform into structured sets of arithmetically simple four-point transforms. Another level adds a row/column decomposition of the DFT. The architecture supports transform lengths that are not powers of two or based on products of coprime numbers. Compared to previous systolic implementations, the architecture is computationally more efficient and uses less hardware. It provides low latency as well as high throughput, and can do both one- and two-dimensional DFTs. An automated computer-aided design tool was used to find latency and throughput optimal designs that matched the target field programmable gate array structure and functionality.
It is shown that, given the ability to restructure wafer-level designs, there are different ways to employ redundancy. Redundancy is evaluated by estimating system computational availability over a mission lifetime. T...
详细信息
It is shown that, given the ability to restructure wafer-level designs, there are different ways to employ redundancy. Redundancy is evaluated by estimating system computational availability over a mission lifetime. This technique is illustrated using two wafer-scale integration (WSI) case studies. The first is a very-fine-grained programmable systolic data processor (PSDP) that contains 4- and 8-b paths, RAM, and control optimized for signal and data processing applications. The second, the Mosaic multicomputer architecture, is a less fine-grained homogeneous architecture in which each node contains a 16-b microprocessor and associated RAM and ROM. Potential benefits of implementing these parallel processing architectures in wafer scale are discussed
In this study, the authors explore sequential and parallel processing architectures, utilising a custom ultra-low-power (ULP) processing core, to extend the lifetime of health monitoring systems, where slow biosignal ...
详细信息
In this study, the authors explore sequential and parallel processing architectures, utilising a custom ultra-low-power (ULP) processing core, to extend the lifetime of health monitoring systems, where slow biosignal events and highly parallel computations exist. To this end, a single-and a multi-core architecture are proposed and compared. The single-core architecture is composed of one ULP processing core, an instruction memory (IM) and a data memory (DM), while the multi-core architecture consists of several ULP processing cores, individual IMs for each core, a shared DM and an interconnection crossbar between the cores and the DM. These architectures are compared with respect to power/performance trade-offs for different target workloads of online biomedical signal analysis, while exploiting near threshold computing. The results show that with respect to the single-core architecture, the multi-core solution consumes 62% less power for high computation requirements (167 MOps/s), while consuming 46% more power for extremely low computation needs when the power consumption is dominated by leakage. Additionally, the authors show that the proposed ULP processing core, using a simplified instruction set architecture (ISA), achieves energy savings of 54% compared to a reference microcontroller ISA (PIC24).
A generalization of the cube-connected cycles of Preparata and Vuillemin is described which retains the symmetry of these architectures while allowing for constructions of greater density and of arbitrary degree. Thes...
详细信息
A generalization of the cube-connected cycles of Preparata and Vuillemin is described which retains the symmetry of these architectures while allowing for constructions of greater density and of arbitrary degree. These constructions are of a type known as Cayley graphs, and their analysis is greatly facilitated by the applicability of methods from abstract algebra.
This paper describes a novel loop-structured switching network (LSSN) intended for highly parallel processing architectures. With L loops, it can connect up to N = L* log2 L pairs of transmitting and receiving devices...
详细信息
This paper describes a novel loop-structured switching network (LSSN) intended for highly parallel processing architectures. With L loops, it can connect up to N = L* log2 L pairs of transmitting and receiving devices using only N/2 two-by-two switching elements; thus, it is very cost-effective in terms of its component count. Its topology resembles that of the indirect binary n-cube network, but a much higher device-to-switch ratio is achieved because all the links between the switches could be used as both transmitting and receiving stations. It has the advantage of incremental extensibility, and-it could avoid store-and-forward deadlocks (SFD) which prevail in other recirculating packet-switched networks. Our simulation studies show that the average throughput rate and delay of LSSN are close to that of other designs despite its relatively low component count.
Modeling of complex hardware/software systems is becoming more difficult due to the complexity of interactions that occur between hardware and software and the need to model each component at multiple levels of detail...
详细信息
ISBN:
(纸本)0780372522
Modeling of complex hardware/software systems is becoming more difficult due to the complexity of interactions that occur between hardware and software and the need to model each component at multiple levels of detail. System modeling languages such as SystemC are assisting in this area by allowing real application level software to be interfaced with hardware models that maintain great fidelity to the actual hardware realization. This paper describes a project to develop a model of a large complex hardware/software system that is the heart of a parallel processor interconnection architecture being developed at The University of Alabama in Huntsville. The model developed allows the investigators to vary the parameters of system workload, policy for message passing protocols, and hardware features such as size of elasticity buffers and DMA controller burst size in a single homogeneous model. Initial results are encouraging and the hope is that as SystemC synthesis tools become available, the hardware components of the model can be translated automatically into hardware designs for FPGA and other rapid prototyping platforms without redesign or coding.
暂无评论