Multimedia applications have become a dominant computing workload for computer systems as well as for wireless-based devices. Due to their repetitive computing and memory intensive nature, they can take effective adva...
详细信息
ISBN:
(纸本)1595931376
Multimedia applications have become a dominant computing workload for computer systems as well as for wireless-based devices. Due to their repetitive computing and memory intensive nature, they can take effective advantage from processor-in-memory (PIM) technology. In this paper, a new low-power PIM-based 32-bit reconfigurable datapath optimized for multimedia applications is presented. The new circuit efficiently performs parallel arithmetic operations on either 8-, 16-, or 32-bit integer data or on 32-bit single precision floating-point data. As a result, high flexibility is provided at a very low hardware cost. When implemented using the UMC 0. 18 mu m 1. 8 V CMOS technology, the proposed datapath exhibits a 285 MHz running frequency, dissipates just 0.12 mW/MHz and occupies a silicon area of only 107,323 mu m(2). When performing 2D-DCT, proposed architecture consumes 74% less power and is 28% more power efficient compared to top-of-the-line commercial TI DSP.
processor-in-memory (PIM) architectures are developed for high-performance computing by integrating processing units with memory blocks into a single chip to reduce the performance gap between the processor and the me...
详细信息
ISBN:
(纸本)9783540852605
processor-in-memory (PIM) architectures are developed for high-performance computing by integrating processing units with memory blocks into a single chip to reduce the performance gap between the processor and the memory. The PIM architecture combines heterogeneous processors in a single system. These processors are characterized by their computation and memory-access capabilities. Therefore, a novel mechanism must be developed to identify their capabilities and dispatch the appropriate tasks to these heterogeneous processing elements. Accordingly, this paper presents a novel parallelizing mechanism, called Critical Block Scheduling to fully utilize all of the heterogeneous processors in the PIM architecture. Integrated with our thread-level parallelizing system, Octans, this mechanism decomposes the original program into blocks, produces corresponding dependence graph, creates a feasible execution schedule, and generates corresponding threads for the host and memoryprocessors. The proposed Critical Block Scheduling not only can parallelize programs for PIM architectures but also can apply on other Multi-processor System-on-Chip (MPSoC) and Chip Multiprocessor (CMP) architectures which consist of multiple heterogeneous processors. The experimental results of real benchmarks are also discussed.
Continuous improvements in semiconductor fabrication density are supporting new classes of System-on-a-Chip (SoC) architectures that combine extensive processing logic/processor with high-density memory. Such architec...
详细信息
Continuous improvements in semiconductor fabrication density are supporting new classes of System-on-a-Chip (SoC) architectures that combine extensive processing logic/processor with high-density memory. Such architectures are generally called processor-in-memory (PIM) or Intelligent memory (I-RAM) and can support high-performance computing by reducing the performance gap between the processor and the memory. The PIM architecture combines various processors in a single system. These processors are characterized by their computation and memory-access capabilities. Therefore, a novel strategy must be developed to identify their capabilities and dispatch the most appropriate jobs to them in order to exploit them fully. Accordingly, this study presents an automatic source-to-source parallelizing system, called statement-analysis-grouping-evaluation (SAGE), to exploit the advantages of PIM architectures. Unlike conventional iteration-based parallelizing systems, SAGE adopts statement-based analyzing approaches. This study addresses the configuration of a PIM architecture with one host processor (i.e., the main processor in state-of-the-art computer systems) and one memoryprocessor (i.e., the computing logic integrated with the memory). The strategy of the SAGE system, in which the original program is decomposed into blocks and a feasible execution schedule is produced for the host and memoryprocessors, is investigated as well. The experimental results for real benchmarks are also discussed. (C) 2003 Elsevier B.V. All rights reserved.
Continuous improvements in semiconductor fabrication density are supporting new classes of Chip Multiprocessor (CMP) architectures that combine extensive processing logic/processor with high-density memory in a single...
详细信息
ISBN:
(纸本)9783540770916
Continuous improvements in semiconductor fabrication density are supporting new classes of Chip Multiprocessor (CMP) architectures that combine extensive processing logic/processor with high-density memory in a single chip. One of the architecture, called processor-in-memory (PIM) can support high-performance computing by combining various processors in a single system. Therefore, a new strategy is developed to identify their capabilities and dispatch the most appropriate jobs to them in order to exploit them fully. This paper presents a novel scheduling mechanism, called Swing Scheduling to fully utilize all of the heterogeneous processors in the PIM architecture. Integrated with our Octans system, this mechanism can decompose the original program into blocks and can produce a feasible execution schedule for the host and memoryprocessors, even for other CMP architectures. The experimental results for real benchmarks are also proposed.
It is widely known that current memory architecture is one of the bottlenecksfor high-performance computers due to the increasing gap between the processor speed and memorylatency. For this reason, several architectur...
详细信息
It is widely known that current memory architecture is one of the bottlenecksfor high-performance computers due to the increasing gap between the processor speed and memorylatency. For this reason, several architectures, called intelligent memory (IRAM) orprocessor-in-memory (PIM), have been studied in recent years aiming to integrate the processor andmemory together. A merit of PIM architecture is that the PIM chips can be used to replace the mainmemory chips in a workstation and act as coprocessors when main processor spawns them. This approachhas been adopted by Active Page, DIVA, and FlexRAM, among others. This class of architecturesprovides a hierarchical hybrid multiprocessor environment: host (main) processors and memoryprocessors. Host processor is more powerful with a deep cache hierarchies and higher latency toaccess memory. By contrast, memoryprocessors are usually less powerful but with a lower latency inmemory access. The major problems we address in this paper are: how to dispatch suitable tasks tothese different processors in PIM by their computing power and characteristics to reduce their idletime, and how to partition the original program then execute simultaneously on these heterogeneousprocessors mixture. Based on our earlier work, we propose the SAGE(Statement-Analysis-Grouping-Evaluation) system to analyze the source program, generate a WeightPartition Dependence Graph (WPG), determine the weight of each block, and then dispatch the mostsuitable jobs to the host and memoryprocessors, respectively. From the experiment, we find thatquite good speedup is obtained, which even exceeds the computation capability ratio in 1-host and1-memoryprocessors environment.
Recovering the symbols in a multiple-input multiple-output (MIMO) receiver is a computationally intensive process. The layered space-time (LST) algorithms provide a reasonable tradeoff between complexity and performan...
详细信息
Recovering the symbols in a multiple-input multiple-output (MIMO) receiver is a computationally intensive process. The layered space-time (LST) algorithms provide a reasonable tradeoff between complexity and performance. Commercial digital signal processors (DSPs) have become a key component in many high-volume products such as cellular telephones. As an alternative to power-hungry DSPs, we propose to use a moderately parallel single-instruction stream, multiple-data stream (SIMD) coprocessor architecture, called DSP-RAM, to implement an LST MIMO receiver that offers high performance with relatively low power consumption. For a, typical indoor wireless environment, a 100-MHz DSP-RAM can potentially provide more than ten times greater decoding throughput at the receiver of a (4,4) MIMO system compared with a conventional 720-MHz DSP. The DSP-RAM processor has been coded in a hardware description language (HDL) and synthesized for both available field-programmable gate arrays (FPGAs) and for a 0.18-mu m CM,OS standard cell implementation.
Motion estimation is the most computationally intensive task in present video compression standards. Parallel processing has proved to be an efficient approach for similar kinds of applications. In this paper, we prop...
详细信息
ISBN:
(纸本)0780375149
Motion estimation is the most computationally intensive task in present video compression standards. Parallel processing has proved to be an efficient approach for similar kinds of applications. In this paper, we propose two parallel implementations of block-based motion estimation for a massively-parallel, processor-in-memory hardware architecture known as Computational RAM (C-RAM). Our simulation study showed that, although the massive parallelism of C-RAM does potentially have great benefits, the use of embedded DRAM and bit-serial arithmetic reduced the achievable speed-up to about 4 compared to 733 MHz Pentium III machine.
暂无评论