FLOWMAP ([1]) was the first delay-optimal algorithm for the technology mapping of LUT-based FPGAs. However, even though this algorithm runs in polynomial time, rapid prototyping using FPGAs requires faster solutions. This paper...
The authors propose a new class of interconnection networks called recursive hierarchical swapped networks (RHSN) for general-purpose parallel processing. The node degrees of RHSNs can vary from a small number to as large as required, depending on the recursive and hierarchical composition parameters and the nucleus graph chosen. The diameter of an RHSN can be asymptotically optimal within a small constant factor. They present efficient routing, semigroup computation, ascend/descend, matrix-matrix multiplication, and emulation algorithms, demonstrating the versatility of RHSNs. In particular, on suitably constructed RHSNs, matrix multiplication can be performed faster than the DNS algorithm on a hypercube. Furthermore, ascend/descend algorithms, semigroup computation, and parallel prefix computation can be done with asymptotically fewer communication steps than on a hypercube.
This paper describes λ-FLOW, a new functional synchronous dataflow language for DSP applications. It is independent of the data handled and directly supports modular design. Its sound semantics allows proofs of programs and of time/memory determinism. The target code is dynamically loaded into the compiler with a target description that is defined in fewer than twenty lines of definitions. Because the solving model is static, programs can be implemented on a static parallel architecture.
A sorting device capable of sorting p items in constant time is called a p-sorter. It is known that sorting N items using a p-sorter requires Ω((N log N)/(p log p)) applications of the p-sorter. This bound is tight: there exist algorithms that use O((N log N)/(p log p)) calls to the p-sorter to sort N items. However, there is no known implementable algorithm that can sort N items in O((N log N)/(p log p)) time using a p-sorter. The main contribution of this paper is to propose a simple VLSI architecture and to show that in this architecture N items can be sorted in O((N log N)/(p log p)) calls to the p-sorter, while enforcing conflict-free memory accesses. An important feature of the design is that the total additional VLSI area for hardware, other than the memory for data and the p-sorter, is kept to a minimum.
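The p-sorter model in the abstract above can be illustrated with a small software sketch. This is not the paper's architecture or algorithm: it is a plain merge sort in which every comparison goes through a simulated p-sorter, and it uses roughly (N/p) log(N/p) calls rather than the optimal O((N log N)/(p log p)). The names `make_p_sorter`, `merge_runs`, and `sort_with_p_sorter` are hypothetical, and tagging each item with its source run is an assumption made to keep the merge simple.

```python
from collections import deque


def make_p_sorter(p):
    """Return a simulated p-sorter and a counter of its applications."""
    counter = {"calls": 0}

    def p_sorter(items):
        # The device accepts at most p items per application.
        if len(items) > p:
            raise ValueError("p-sorter received more than p items")
        counter["calls"] += 1
        return sorted(items)

    return p_sorter, counter


def merge_runs(a, b, p, p_sorter):
    """Merge two sorted runs, ordering items only through the p-sorter."""
    half = p // 2
    queues = (deque(a), deque(b))
    merged = []
    while queues[0] or queues[1]:
        # Take up to p/2 head items from each run, tagged with their source.
        batch = []
        for src, q in enumerate(queues):
            for _ in range(min(half, len(q))):
                batch.append((q.popleft(), src))
        batch = p_sorter(batch)
        # The smallest p/2 items of the batch are the smallest remaining overall.
        merged.extend(value for value, _ in batch[:half])
        # Return the rest to the front of the run they came from.
        for value, src in reversed(batch[half:]):
            queues[src].appendleft(value)
    return merged


def sort_with_p_sorter(items, p):
    """Sort a list using only p-sorter calls; return (sorted list, call count)."""
    if p < 2:
        raise ValueError("this sketch assumes p >= 2")
    p_sorter, counter = make_p_sorter(p)
    # Initial runs: one p-sorter application per block of p items.
    runs = [p_sorter(items[i:i + p]) for i in range(0, len(items), p)]
    # Pairwise merging until a single run remains.
    while len(runs) > 1:
        next_runs = []
        for i in range(0, len(runs), 2):
            if i + 1 < len(runs):
                next_runs.append(merge_runs(runs[i], runs[i + 1], p, p_sorter))
            else:
                next_runs.append(runs[i])
        runs = next_runs
    return (runs[0] if runs else []), counter["calls"]
```

Calling `sort_with_p_sorter(data, 8)` returns the sorted list together with the number of p-sorter applications, which can be compared against (N log N)/(p log p) to see how far this simple baseline is from the bound.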
ISBN: (Print) 0780335295
A new generation of high-performance programmable digital signal processors (DSPs) has a highly integrated parallel architecture, incorporating special-purpose hardware features, on-chip memory and multiple processors into a single chip. For such single-chip multiprocessor DSPs, however, a sophisticated performance monitoring tool is essential to achieve maximum performance. The authors discuss the requirements and functionality of performance monitoring tools suitable for single-chip multiprocessor DSPs. As a specific example, they describe the MVP Performance Monitor (MPM), a performance monitoring tool developed for Texas Instruments' TMS320C80 (MVP) that meets these requirements. The effectiveness of the MPM is demonstrated using an 8×8 block-based discrete cosine transform (DCT) implementation. An overall speed-up of 4.67 was achieved by using the MPM.
We propose novel VLSI architectures for computing the discrete wavelet transform (DWT). The proposed architectures employ a memory-based approach: ROM lookup tables are used to implement complex computational modules. Compared with known architectures that employ traditional hardware computational modules, the proposed architectures are faster and more area-efficient. The memory-based architecture is used to implement the block-based DWT with parallel I/O. The resulting architectures are area-efficient and have high throughput and low latency. These architectures are suitable for low-power single-chip implementations, which are useful for DWT-based mobile/visual communication systems.
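The memory-based idea in the abstract above, replacing an arithmetic module with a ROM lookup, can be sketched in software. The abstract does not name a filter, word length, or table organization, so everything specific here is an assumption: a one-level Haar transform, 8-bit unsigned samples, 12 fractional bits, and the hypothetical names `ROM`, `rom_multiply`, and `haar_dwt_level`.

```python
import math

FRAC_BITS = 12                  # fixed-point fraction bits (assumed)
COEFF = 1.0 / math.sqrt(2.0)    # Haar filter coefficient (assumed filter)
MAX_SAMPLE = 255                # 8-bit unsigned input samples (assumed)

# "ROM": the fixed-point product of every possible operand with COEFF.
# Operands are sums or differences of two samples, so they lie in [-510, 510].
OFFSET = 2 * MAX_SAMPLE
ROM = [round(v * COEFF * (1 << FRAC_BITS)) for v in range(-OFFSET, OFFSET + 1)]


def rom_multiply(operand):
    """Multiply by COEFF with a table lookup instead of a multiplier."""
    return ROM[operand + OFFSET]


def haar_dwt_level(samples):
    """One DWT level; returns fixed-point approximation and detail bands."""
    if len(samples) % 2 != 0:
        raise ValueError("expected an even number of samples")
    if any(s < 0 or s > MAX_SAMPLE for s in samples):
        raise ValueError("samples must fit in 8 unsigned bits")
    approx, detail = [], []
    for i in range(0, len(samples), 2):
        x0, x1 = samples[i], samples[i + 1]
        approx.append(rom_multiply(x0 + x1))   # low-pass output
        detail.append(rom_multiply(x0 - x1))   # high-pass output
    return approx, detail
```

Dividing the returned integers by `1 << FRAC_BITS` recovers the real-valued coefficients to within the table's rounding error. The trade-off this illustrates is the one the abstract points to: the multiplier disappears at the cost of table storage.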
Stacked 3D Silicon has been under development for a number of years at a substantial level of investment on the part of Government as well as public and private investors. Volume manufacturing of this technology is now in place and foundry services are provided to designers of Stacked 3D Silicon components and products. Stacked 3D Silicon has already had a major impact on the microelectronics systems and products into which it has been integrated. Examples given include solid-state data recorders, digital signal processors, massively parallel processors, artificial neural networks, image processing, and imaging sensors. Manufacturing and cost issues are identified and discussed along with present status and projections showing that, as volumes rise, no significant premium will be required to incorporate Stacked 3D Silicon into standard products. The performance advantages of Stacked 3D Silicon are very large: the ultra-high density results in gains of hundreds to thousands of times in both speed and power when ICs are designed for 3D. The paper concludes with a picture of the coming next generation of 3D stacked silicon: 10 to 1000 layers of ultra-thin, low-power circuits with thousands of inter-layer interconnects, comprising entire systems in a single cube.
Concurrency between access and execution has been exploited by queues in many decoupled access-execute architectures, but data-dependent control dependencies often prohibit prefetching of data to queues. This paper investigates a technique to facilitate anticipatory loading to queues even in the presence of data-dependent control dependencies. The proposed method consists of fetching along one or both paths of a data-dependent control dependency and inserting consume instructions in the appropriate paths to consume the unnecessarily fetched data. The compiler hoists load instructions above control dependencies as in conventional load hoisting techniques. The technique is seen to be very effective in programs with data-dependent if-then-else constructs. We also present an architecture with multiple access units, the mLSU architecture, which parallelizes the access process. Simulation experiments illustrate that multiple access units improve performance if access processor instruction issue is a bottleneck.
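The consume-instruction idea above can be illustrated with a toy model of a single load queue. This is only a sketch under stated assumptions, not the paper's compiler or the mLSU architecture: the loop body, the arrays `c`, `a`, `b`, and the functions `access_stream` and `execute_stream` are hypothetical, and the two units run one after the other here rather than concurrently. The modelled source loop is `if c[i] > 0: total += a[i] else: total -= b[i]`.

```python
from collections import deque


def access_stream(c, a, b):
    """Access unit: loads hoisted above the branch, issued for both paths."""
    queue = deque()
    for i in range(len(c)):
        queue.append(c[i])  # operand of the data-dependent condition
        queue.append(a[i])  # load belonging to the then-path
        queue.append(b[i])  # load belonging to the else-path
    return queue


def execute_stream(queue, n):
    """Execute unit: uses one operand per iteration and consumes the other."""
    total = 0
    consumed = 0
    for _ in range(n):
        if queue.popleft() > 0:
            total += queue.popleft()   # then-path uses a[i]
            queue.popleft()            # consume: discard the unused b[i]
            consumed += 1
        else:
            queue.popleft()            # consume: discard the unused a[i]
            consumed += 1
            total -= queue.popleft()   # else-path uses b[i]
    return total, consumed
```

Because every iteration removes exactly three entries regardless of the branch outcome, the queue stays aligned with the access stream. The cost is one discarded load per iteration, which is the unnecessarily fetched data the abstract refers to.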