ISBN: 9783540729044 (print)
This paper presents a parallel architecture that simultaneously performs block-matching motion estimation (ME) and the discrete cosine transform (DCT). Because DCT and ME are both processed block by block, it is preferable to place them in one module so that they can share resources. Simulink simulations show that the parallel architecture improves running time by 18.6% compared with the conventional sequential architecture.
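For readers unfamiliar with block-matching ME, the following is a minimal software sketch of a full search over a small window. It is not the architecture from the abstract: the sum of absolute differences (SAD) cost, the function names `sad` and `best_motion_vector`, and the frame layout (lists of rows of integers) are all assumptions chosen for illustration.

```python
def sad(frame_a, frame_b, ax, ay, bx, by, size):
    """Sum of absolute differences between two size x size blocks.

    (ax, ay) and (bx, by) are the top-left corners of the blocks in
    frame_a and frame_b. SAD is an assumed cost, not taken from the paper.
    """
    return sum(
        abs(frame_a[ay + r][ax + c] - frame_b[by + r][bx + c])
        for r in range(size)
        for c in range(size)
    )


def best_motion_vector(cur, ref, x, y, size, search):
    """Full search: return the (dx, dy) that minimizes SAD for the block
    at (x, y) in `cur`, looking at most `search` pixels away in `ref`."""
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            # Skip candidate blocks that fall outside the reference frame.
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, x, y, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best[1], best[2]
```

Each block is processed independently of the others, which is the block-by-block property the abstract relies on when it places ME and DCT in one module.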
ISBN: 0769522262 (print)
In several digital signal processing algorithms, computational nodes are organized in consecutive stages, and data is reordered between these stages. Parallel computation of such algorithms with a reduced number of processing elements implies that several computational nodes are assigned to each element. As a drawback, the permutations become more complex and require data storage. In this paper, a systematic design methodology for stride permutation networks is derived. These permutations are represented with Boolean matrices, which are decomposed and mapped directly onto register-based networks. The resulting networks are regular and scalable, and they support any power-of-two stride. In addition, the networks reach the lower bound on the number of registers, indicating area efficiency. Because the proposed methodology is systematic, it can be exploited in automated design generation.
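As a small illustration of what a stride permutation does, the sketch below reorders a vector in software and shows that, for power-of-two sizes and strides, the same reordering is a cyclic rotation of the index bits. This is only the permutation itself, not the register-based network or the matrix decomposition described in the abstract; the function names `stride_permute` and `rotate_left` are hypothetical.

```python
def stride_permute(x, stride):
    """Reorder x by reading every `stride`-th element, then wrapping
    around to the next starting offset (0, 1, ..., stride - 1)."""
    n = len(x)
    if stride <= 0 or n % stride != 0:
        raise ValueError("stride must be positive and divide len(x)")
    return [
        x[offset + k * stride]
        for offset in range(stride)
        for k in range(n // stride)
    ]


def rotate_left(index, n_bits, s_bits):
    """Cyclically rotate an n_bits-wide index left by s_bits positions."""
    mask = (1 << n_bits) - 1
    return ((index << s_bits) | (index >> (n_bits - s_bits))) & mask


# For N = 2**n and stride 2**s, output position j reads input position
# rotate_left(j, n, s), so the permutation only rearranges index bits.
data = list(range(8))
by_stride = stride_permute(data, 2)
by_bits = [data[rotate_left(j, 3, 1)] for j in range(8)]
print(by_stride)  # [0, 2, 4, 6, 1, 3, 5, 7]
print(by_bits == by_stride)
```

Because the mapping acts on index bits alone, it can be written as a Boolean (0/1) matrix acting on the bit vector of the index, which is one way to read the matrix representation the abstract mentions.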
ISBN: 9783540729044 (print)
This paper describes a hybrid two-level parallel method using MPI/OpenMP for computing the eigenvalues of dense symmetric matrices on clusters of SMPs. The eigenvalue computation is based on both the Householder tridiagonalization method and a divide-and-conquer algorithm for the tridiagonal eigenproblem. In the hybrid parallel design, we take a coarse-grain approach to OpenMP shared-memory parallelization, which keeps BLAS-3 operations in the tridiagonalization. Moreover, dynamic work sharing is used in the divide-and-conquer algorithm for the tridiagonal eigenproblem. These choices also reduce the amount of synchronization and could affect load balance. In addition, we analyze the communication overhead of hybrid MPI/OpenMP versus pure MPI. An experimental analysis on the Deepcomp6800 shows that the hybrid algorithm scales well.
ISBN: 9780769529783 (print)
Low-density parity-check (LDPC) codes have recently been included as error-correcting codes in IEEE 802.16e for wireless metropolitan area networks. This paper proposes a flexible, low-complexity LDPC decoder fully compliant with all 114 codes defined by the standard. The decoder runs the layered decoding algorithm to increase the convergence speed and relies on a semi-parallel implementation with serial processing units working in a pipeline to reduce latency. In particular, two different architectures are considered, and their RTL/memory complexity tradeoffs are analyzed. The resulting design yields a throughput ranging from 93 to 497 Mbps with 15 iterations at a clock frequency of 400 MHz. Synthesis in 65 nm CMOS technology shows a chip area of less than 0.59 mm² despite the high flexibility, which compares favourably with similar implementations.
In this paper, a pixel-parallel image sensor/processor architecture with a fine-grain massively parallel SIMD analogue processor array is overviewed and the latest VLSI implementation, SCAMP-3 vision chip, comprising ...
ISBN: 9783540729044 (print)
Separate grid systems are becoming the new information islands as more and more grid systems are deployed. Grid interoperation is one way to solve that problem. This paper introduces the implementation of data interoperation between ChinaGrid and SRB. The data interoperation between them is divided into two parts: data access from SRB to ChinaGrid and from ChinaGrid to SRB. The paper also considers performance optimization issues. The optimization measures yield satisfactory experimental results.
ISBN: 9783540729044 (print)
There are only three real "dimensions" to processor performance increases beyond Moore's law: clock frequency, superscalar instruction issue, and multiprocessing. The first two have been pushed to their logical limits, and we must focus on multiprocessing. SMT (simultaneous multithreading) [1] and CMP (chip multiprocessing) [2] are two architectural approaches to exploiting thread-level parallelism using available on-chip resources. SMT processors execute instructions from different threads in the same cycle, which gives them the unique ability to exploit ILP (instruction-level parallelism) and TLP (thread-level parallelism) simultaneously. EPIC (explicitly parallel instruction computing) emphasizes the importance of the synergy between compiler and hardware. In this paper, we present our efforts to design and implement a parallel environment, which includes an optimizing, portable parallel compiler, OpenUH, and an SMT architecture, EDSMT, based on IA-64. The performance is evaluated using the NAS parallel benchmarks.
ISBN: 9783540729044 (print)
Analytical models for adaptive routing in multicomputer interconnection networks with traditional non-bursty Poisson traffic have been widely reported in the literature. However, traffic loads generated by many real-world parallel applications may exhibit bursty and batch arrival properties, which can significantly affect network performance. This paper develops a new and concise analytical model for hypercubic networks in the presence of bursty and batch arrival traffic modelled by the Compound Poisson Process (CPP) with geometrically distributed batch sizes. The computational complexity of the model is independent of network size. The analytical results are validated by comparison with those obtained from simulation experiments. The model is used to evaluate the effects of bursty traffic with batch arrivals on the performance of interconnection networks.
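The traffic model named above can be illustrated with a short simulation sketch. It generates batch arrival instants from a Poisson process and draws each batch size from a geometric distribution on {1, 2, ...}. This is only the arrival process, not the analytical network model of the abstract; the names `geometric_batch` and `simulate_cpp` and the parameters `rate`, `p`, and `horizon` are assumptions introduced for illustration.

```python
import random


def geometric_batch(rng, p):
    """Sample a batch size k in {1, 2, ...} with P(k) = (1 - p)**(k - 1) * p."""
    k = 1
    while rng.random() >= p:
        k += 1
    return k


def simulate_cpp(rate, p, horizon, seed=0):
    """Return a list of (arrival_time, batch_size) pairs up to `horizon`.

    Batches arrive as a Poisson process with the given rate, so the gaps
    between batches are exponentially distributed. The mean batch size is
    1 / p, giving a mean message rate of rate / p.
    """
    rng = random.Random(seed)
    t = 0.0
    arrivals = []
    while True:
        t += rng.expovariate(rate)  # exponential inter-batch gap
        if t > horizon:
            break
        arrivals.append((t, geometric_batch(rng, p)))
    return arrivals


arrivals = simulate_cpp(rate=2.0, p=0.5, horizon=1000.0)
messages = sum(size for _, size in arrivals)
print(len(arrivals), messages / 1000.0)  # batches, and messages per unit time
```

Smaller values of `p` produce larger batches at the same batch rate, which is one simple way to vary burstiness while keeping the Poisson arrival instants unchanged.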
ISBN: 9783540736226 (print)
Recent advances in embedded processing architectures allow for new powerful algorithms, which exploit the intrinsic parallelism present in image processing applications. This paper describes the results of mapping stochastic image quantisation onto a massively parallel processor. The problem can be modeled in a parallel way. Although the implementation is I/O bound, good speedups are achieved (16× compared with a standard image processing package running on a Pentium processor).
ISBN: 9783540729044 (print)
Organizing streams well is essential if stream programs, especially scientific ones, are to use the parallel computing resources and memory system of a stream processor effectively. In this paper, after analyzing typical scientific programs, we present and characterize two methods for optimizing stream organization: stream reusing and stream transpose. Several representative scientific stream programs, with and without our optimizations, are run on a typical stream processor simulator. Simulation results show that these methods can greatly improve the performance of scientific stream programs.