An efficient parallel processing method for deblocking filter design in the H.264 video coding standard is presented in this paper. In order to reduce memory references and allow intermediate data to be reused as soon as possible, an advanced filtering order is adopted, and read/write operations on external memory are executed in parallel with the filtering computation. Furthermore, a preloading operation is used to reduce the complexity of the memory structure, and a vertical MB processing order is used to improve the efficiency of intermediate data reuse. As a result, the processing cycles of the proposed architecture with a single-port memory are reduced by 80.5% compared with the most advanced architecture among previous proposals.
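The key idea of overlapping external-memory access with filtering can be illustrated by a minimal software sketch using double buffering; the block names and sizes below are illustrative only and do not reproduce the paper's filtering order or hardware scheduling.

    from concurrent.futures import ThreadPoolExecutor

    NUM_BLOCKS = 8  # number of macroblocks to process (illustrative)

    def fetch_block(i):
        # stand-in for a DMA read of macroblock i from external memory
        return [i] * 16

    def filter_block(block):
        # stand-in for the deblocking-filter computation on one macroblock
        return [p + 1 for p in block]

    with ThreadPoolExecutor(max_workers=1) as prefetcher:
        pending = prefetcher.submit(fetch_block, 0)   # preload the first block
        for i in range(NUM_BLOCKS):
            current = pending.result()                # block i is now on chip
            if i + 1 < NUM_BLOCKS:
                # overlap the next external-memory read with the computation
                pending = prefetcher.submit(fetch_block, i + 1)
            filter_block(current)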
This work presents a SystemC-based design of custom SIMD instructions for accelerating media and telecom codes on a next-generation configurable, extensible processor. The SS_SPARC processing platform incorporates a generic vector unit which can be extended with pipelined SIMD computation units (datapaths) designed either with established (RTL-based) or, in this case, hybrid (SystemC-RTL) methodologies. This work elaborates on a custom methodology for automatically encapsulating the data-parallel sections of the MPEG-4 XviD, G723.1 and G729A reference codes into a SystemC wrapper, which is subsequently synthesized to RTL with a commercial SystemC-synthesis tool. The resulting RTL is then attached to the exposed vector unit of the SS_SPARC engine. We present results from a standard-cell RTL synthesis campaign and the VLSI implementation of a high-end (8-context, 256-bit) and a low-end (2-context, 128-bit) configuration of the vector engine for the workloads of interest.
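As a rough illustration of the kind of data-parallel section such a flow targets, the sketch below computes a 16x16 sum of absolute differences, the dominant kernel in XviD-style motion estimation; the function name and array shapes are assumptions, and the paper's actual wrapper generation is not shown.

    import numpy as np

    def sad_16x16(cur, ref):
        # 16x16 sum of absolute differences; a SIMD datapath would process
        # the 16 pixels of each row in a single vector operation.
        return int(np.abs(cur.astype(np.int16) - ref.astype(np.int16)).sum())

    cur = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
    ref = np.random.randint(0, 256, (16, 16), dtype=np.uint8)
    print(sad_16x16(cur, ref))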
Combining the presented MIMO scheme with multiuser detectors for the uplink suffers from the problems of high computational complexity and channel estimation. Therefore, in this paper we propose a MIMO multiuser detection (MUD) scheme that considerably reduces the system's computational complexity. The proposed algorithm adopts the inverse channel matrix for MIMO decoding, which is not sensitive to the coherency of the channels. Because of the scattering characteristic of the MIMO channel, the inverse channel matrices are always nonsingular, which ensures that the receivers obtain a stable spatial diversity gain. The MUD algorithm can be realized using a parallel modular structure. It is based on a Minimum Mean Square Error (MMSE) criterion. Simulation results show that our MIMO-MUD performs much better than the previously presented MIMO-MUD for the same order of complexity, even though the MIMO CDMA system has only two antennas at each BS and two antennas at each mobile station.
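For reference, a generic linear MMSE detector for a 2x2 channel is sketched below; the variable names are illustrative and the paper's joint MIMO-MUD structure and inverse-channel-matrix decoding are not reproduced.

    import numpy as np

    def mmse_detect(H, y, noise_var):
        # Linear MMSE estimate: x_hat = (H^H H + sigma^2 I)^(-1) H^H y
        HH = H.conj().T
        W = np.linalg.inv(HH @ H + noise_var * np.eye(H.shape[1])) @ HH
        return W @ y

    # 2x2 MIMO example: two transmit and two receive antennas
    H = (np.random.randn(2, 2) + 1j * np.random.randn(2, 2)) / np.sqrt(2)
    x = np.array([1 + 1j, -1 - 1j]) / np.sqrt(2)       # transmitted symbols
    n = 0.05 * (np.random.randn(2) + 1j * np.random.randn(2))
    y = H @ x + n
    print(mmse_detect(H, y, noise_var=0.005))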
This paper presents an improved word-level sequential scheme and parallel architecture for the bit-plane coding of EBCOT used in JPEG 2000. The bit-plane coding adopted by EBCOT is divided into two stages, coding pass prediction and context formation, which work in parallel and are pipelined. Word-level, sequential bit-plane coding is achieved in which the coefficient bits in different bit planes are modelled concurrently and all three coding passes included in each bit plane are completed in a single scan. The results demonstrate that the proposed architecture can efficiently reduce hardware complexity compared with the most up-to-date designs.
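The bit-plane decomposition that the coding passes operate on can be sketched as follows; the coefficient values are arbitrary, and the three EBCOT passes themselves (significance propagation, magnitude refinement, cleanup) are only referenced in the comment, not implemented.

    import numpy as np

    coeffs = np.array([[13, -5, 0, 7],
                       [ 2, -9, 4, 0]])          # arbitrary wavelet coefficients
    mags = np.abs(coeffs)
    msb = int(mags.max()).bit_length() - 1

    # Walk the bit planes from most to least significant; in EBCOT each plane
    # is coded by three passes (significance propagation, magnitude refinement,
    # cleanup), which the proposed architecture completes in a single scan.
    for p in range(msb, -1, -1):
        print("bit plane", p)
        print((mags >> p) & 1)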
State space explosion is the main obstacle for model checking concurrent programs. Among the solutions, partial-order reduction (POR), especially dynamic partial-order reduction (DPOR) [1], is one of the promising app...
ISBN: 0769525091 (print)
A semi-dynamic system is presented that is capable of predicting the performance of parallel programs at runtime. The functionality provided by the system allows for efficient handling of the portability and irregularity of parallel programs. Two forms of parallelism are addressed: loop-level parallelism and task-level parallelism.
ISBN: 0769525091 (print)
A novel extension to external double hashing providing a significant reduction in both successful and unsuccessful search lengths is presented. The experimental and analytical results demonstrate the reductions possible. This method does not restrict the hash table configuration parameters and uses very little additional storage space per bucket. The runtime cost of insertion is only slightly higher than for ordinary external double hashing.
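For context, ordinary external double hashing probes bucket (h1(k) + i*h2(k)) mod m on the i-th attempt; the sketch below uses stand-in hash functions and does not implement the paper's extension.

    import hashlib

    def _h(key, salt):
        # stand-in hash; real implementations would use tuned functions
        digest = hashlib.sha256((salt + key).encode()).digest()
        return int.from_bytes(digest[:8], "big")

    def probe_sequence(key, table_size, max_probes):
        # the i-th probe visits bucket (h1 + i * h2) mod table_size; h2 != 0
        h1 = _h(key, "a") % table_size
        h2 = 1 + _h(key, "b") % (table_size - 1)
        return [(h1 + i * h2) % table_size for i in range(max_probes)]

    print(probe_sequence("record-42", table_size=11, max_probes=5))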
ISBN: 0769525091 (print)
Within the parallel computing domain, field programmable gate arrays (FPGAs) are no longer restricted to their traditional role as substitutes for application-specific integrated circuits, that is, as hardware "hidden" from the end user. Several high-performance computing vendors offer parallel reconfigurable computers employing user-programmable FPGAs. These exciting new architectures allow end users to, in effect, create reconfigurable coprocessors targeting the computationally intensive parts of each problem. The increased capability of contemporary FPGAs, coupled with the embarrassingly parallel nature of the Jacobi iterative method, makes the Jacobi method an ideal candidate for hardware acceleration. This paper introduces a parameterized design for a deeply pipelined, highly parallelized IEEE 64-bit floating-point version of the Jacobi method. A Jacobi circuit is implemented using a Xilinx Virtex-II Pro as the target FPGA device. Implementation statistics and performance estimates are presented.
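The Jacobi iteration itself is simple and, because every component update is independent, maps naturally to a parallel pipeline; a minimal software version is sketched below (the matrix is an arbitrary diagonally dominant example, not from the paper).

    import numpy as np

    def jacobi(A, b, iters=50):
        # x_i(k+1) = (b_i - sum_{j != i} A_ij * x_j(k)) / A_ii
        # each component update depends only on the previous iterate,
        # so all rows can be evaluated in parallel
        D = np.diag(A)
        R = A - np.diag(D)
        x = np.zeros_like(b, dtype=float)
        for _ in range(iters):
            x = (b - R @ x) / D
        return x

    A = np.array([[4.0, 1.0], [2.0, 5.0]])   # diagonally dominant example
    b = np.array([9.0, 12.0])
    print(jacobi(A, b))                      # approaches the solution of Ax = b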
ISBN: 0769525091 (print)
A chip multiprocessor is one of the promising architectures that can overcome the ILP limitation, high power consumption and high heat dissipation that current processors face. On a shared-memory multiprocessor, a performance improvement relies on an efficient method of communication and synchronization via shared variables. The TSVM cache combines communication and synchronization with coherence maintenance on a chip multiprocessor; that is, communication and synchronization via shared variables are realized by one coherence transaction through a high-speed on-chip interconnection. The TSVM cache provides several instructions, each of which has its own coherence maintenance scheme. Combinations of these instructions can realize producer-consumer synchronization, mutual exclusion and barrier synchronization with communication easily and systematically. This paper describes how those instructions construct three primitives and shows the effect of these primitives using a clock-cycle-accurate simulator written in VHDL. The results show that the TSVM cache improves performance by a factor of 9.8 compared with a traditional cache memory, and by a factor of 2 compared with a conventional cache memory with a synchronization mechanism.
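As a software analogy of producer-consumer synchronization through a shared variable, consider the sketch below; in the TSVM cache the flag update and data transfer would be carried by a single coherence transaction rather than by OS-level threading primitives, which are used here purely for illustration.

    import threading

    data = None
    ready = threading.Event()

    def producer():
        global data
        data = 42          # write the shared value
        ready.set()        # signal the consumers

    def consumer(name):
        ready.wait()       # block until the producer has written
        print(name, "read", data)

    consumers = [threading.Thread(target=consumer, args=(f"c{i}",)) for i in range(2)]
    for t in consumers:
        t.start()
    threading.Thread(target=producer).start()
    for t in consumers:
        t.join()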
ISBN: 0769525091 (print)
The amount of Task-Level Parallelism (TLP) in a runtime workload is useful information for determining the efficient usage of multiprocessors. This paper presents mechanisms to dynamically estimate the amount of TLP in runtime workloads. Modifications are added to the operating system (OS) to collect information about processor utilization and task activities, from which TLP can be calculated. By effectively utilizing the Time Stamp Counter (TSC) hardware, task activities can be monitored at fine time resolution, resulting in the capability to estimate TLP at fine granularity. We implement the mechanisms in a recent version of the Linux OS. Evaluation results indicate that the mechanisms can estimate TLP accurately for various kinds of workloads with small overheads.
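One commonly used definition of TLP averages the number of simultaneously busy processors over the intervals in which at least one processor is busy; the paper's exact formula may differ, so the sketch below is only illustrative.

    def tlp(samples):
        # samples: number of busy processors (or runnable tasks) observed at
        # equally spaced, TSC-timed sampling points; intervals where the
        # machine is completely idle are excluded from the average
        busy = [s for s in samples if s > 0]
        return sum(busy) / len(busy) if busy else 0.0

    print(tlp([0, 1, 2, 2, 0, 3, 1, 0]))   # -> 1.8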