ISBN: 0780370570 (print)
In this paper, we propose a flexible VLSI-based parallel processing architecture for an improved three-step search (ITSS) motion estimation algorithm. ITSS is superior to the existing three-step search (TSS) algorithm in all cases, and also to the recently proposed new three-step search (NTSS) algorithm when used for low bit-rate video coding, as with the H.261 standard. Based on a VLSI tree processor and an FPGA addressing circuit, the architecture can implement the ITSS algorithm on silicon with the minimum number of gates. Because the architecture is flexible, it can also be extended to implement other three-step search algorithms.
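For readers unfamiliar with the baseline, the following is a minimal sketch of the classic three-step search (TSS), not of the ITSS variant or the VLSI architecture described above. It assumes frames stored as lists of rows of integer pixel values and uses the sum of absolute differences (SAD) as the matching cost; the function names `sad` and `three_step_search` and the default block size are illustrative choices, not taken from the paper.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of `cur` at (bx, by)
    and the block of `ref` displaced by (dx, dy)."""
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + y + dy][bx + x + dx])
    return total


def three_step_search(cur, ref, bx, by, n=4, step=4):
    """Classic TSS: test the centre and its 8 neighbours at distance `step`,
    recentre on the best match, halve the step, and stop after step 1."""
    height, width = len(ref), len(ref[0])
    best_dx, best_dy = 0, 0
    best_cost = sad(cur, ref, bx, by, 0, 0, n)
    while step >= 1:
        cx, cy = best_dx, best_dy  # centre of the search pattern for this step
        for oy in (-step, 0, step):
            for ox in (-step, 0, step):
                dx, dy = cx + ox, cy + oy
                # Skip candidates whose block would leave the reference frame.
                if not (0 <= bx + dx and bx + dx + n <= width and
                        0 <= by + dy and by + dy + n <= height):
                    continue
                cost = sad(cur, ref, bx, by, dx, dy, n)
                if cost < best_cost:
                    best_cost, best_dx, best_dy = cost, dx, dy
        step //= 2
    return best_dx, best_dy, best_cost
```

With the default `step=4`, the loop runs for steps 4, 2, and 1, covering displacements of up to 7 pixels in each direction while visiting at most 25 distinct candidate positions, compared with 225 for an exhaustive search over the same range.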
ISBN: 9781467375894 (print)
Image processing algorithms are widely used in the automotive field for ADAS (Advanced Driver Assistance System) purposes. To embed these algorithms, semiconductor companies offer heterogeneous architectures composed of different processing units, often including a massively parallel computing unit. However, embedding complex algorithms on these SoCs (Systems on Chip) remains difficult because of this heterogeneity: it is not easy to decide how to allocate the parts of a given algorithm to the processing units of a given SoC. To help the automotive industry embed algorithms on heterogeneous architectures, we propose a novel approach for predicting the performance of image processing algorithms on the different computing units of a given heterogeneous SoC. Our methodology predicts an interval of execution time, of varying width, with a degree of confidence, using only a high-level description of the algorithms to embed and a few characteristics of the computing units.
ISBN: 0769525091 (print)
The amount of task-level parallelism (TLP) in a runtime workload is useful information for determining the efficient usage of multiprocessors. This paper presents mechanisms to dynamically estimate the amount of TLP in runtime workloads. The operating system (OS) is modified to collect information about processor utilization and task activities, from which TLP can be calculated. By making effective use of the Time Stamp Counter (TSC) hardware, task activities can be monitored at fine time resolution, which makes it possible to estimate TLP at fine granularity. We implement the mechanisms on a recent version of the Linux OS. Evaluation results indicate that the mechanisms can estimate TLP accurately for various kinds of workloads with small overheads.
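As a rough illustration of how TLP might be derived from such measurements, the sketch below assumes one common definition: the time-weighted average number of simultaneously running tasks, taken over the intervals in which at least one task is running. The abstract does not state the exact metric or data layout used by the authors, so both the definition and the `(duration, running_tasks)` sample format are assumptions made for this example.

```python
def estimate_tlp(samples):
    """Estimate task-level parallelism from (duration, running_tasks) samples.

    Assumed definition (not confirmed by the abstract): the average number of
    concurrently running tasks, weighted by time and excluding idle intervals.
    `duration` could be, for example, a difference of two timestamp readings.
    """
    busy_time = 0.0
    weighted = 0.0
    for duration, running in samples:
        if running > 0:            # idle intervals do not contribute
            busy_time += duration
            weighted += duration * running
    return weighted / busy_time if busy_time else 0.0


# Hypothetical trace: 2 units idle, 4 units with 1 task, 2 with 2, 2 with 4.
trace = [(2, 0), (4, 1), (2, 2), (2, 4)]
tlp = estimate_tlp(trace)
```

Under this definition a value close to 1 indicates a mostly sequential workload, while a value close to the processor count indicates that the machine is kept busy whenever it has work at all.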
ISBN: 9781479909087; 9781479909094 (print)
Parallel processing is a complex topic in computing education and has become an essential part of the curriculum owing to recent developments in both software and hardware. Because parallel computers are expensive, universities cannot always guarantee access to them for teaching. Alternatively, parallel processing can be taught using simulators. Accordingly, a multi-core processor, MCSEP, was developed as a tool for teaching parallel computing and architectures. MCSEP consists of 16 SEP (Students' Experimental Processor) cores connected via a 2D mesh. It can be configured to implement the following parallel architectures from Flynn's taxonomy: Single Instruction Single Data (SISD), Single Instruction Multiple Data (SIMD), and Multiple Instruction Multiple Data (MIMD). In addition, Multiple-SIMD and Multiple-MIMD are also implemented. The salient feature of MCSEP is that each core can be configured with any of the six instruction set architectures (ISAs) available in SEP. MCSEP is designed and modeled in VHDL and can therefore be implemented on FPGAs.
ISBN: 9781424423576 (print)
To address the 'curse of dimensionality' problem, the cell-based filtering scheme has been proposed, but its performance decreases linearly as the dimensionality increases. In this paper, we propose a parallel high-dimensional index structure for content-based information retrieval in order to cope with this linear decrease in retrieval performance. In addition, we devise data insertion, range query, and k-NN query processing algorithms that are suitable for a cluster-based parallel architecture. Finally, we show that the retrieval performance of our parallel index structure improves in proportion to the number of servers in the cluster-based architecture, and that it outperforms a parallel version of the VA-File when the dimensionality is over 10.
ISBN: 0769525091 (print)
In this paper, a technique is proposed to deal with the problem of poor locality and false sharing in irregular codes on shared-memory multiprocessors (SMPs). This technique is based on the locality model for irregular codes previously developed and extensively validated by the authors on mono-processors and multiprocessors. In the model, locality is established at run time from parameters that describe the structure of the sparse matrix that characterizes the irregular accesses. As an example of irregular code with false sharing, a particular implementation of the sparse matrix-vector product (SpM x V) was selected. The problem of increasing locality and decreasing false sharing for an irregular problem is formulated as a graph. An adequate distribution of the graph among processors, followed by a reordering of the nodes inside each processor, produces the solution. The results show important improvements in the behavior of the irregular accesses: reduced execution time and improved program scalability.
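To make the irregular access pattern concrete, here is a minimal sketch of a sparse matrix-vector product over a matrix stored in compressed sparse row (CSR) form, together with a simple contiguous row partition. CSR, the function names `spmv_csr` and `partition_rows`, and the even row split are assumptions chosen for illustration; the abstract does not specify the storage format, and the graph-based distribution and reordering it describes are not reproduced here.

```python
def spmv_csr(row_ptr, col_idx, values, x, start=0, end=None):
    """Compute rows [start, end) of y = A * x for a CSR matrix A."""
    if end is None:
        end = len(row_ptr) - 1
    y = []
    for i in range(start, end):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect read: which element of x is touched depends on the
            # sparsity pattern, which is what makes the accesses irregular.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def partition_rows(n_rows, n_workers):
    """Split rows into contiguous, nearly equal chunks, one per worker."""
    base, extra = divmod(n_rows, n_workers)
    bounds, start = [], 0
    for w in range(n_workers):
        end = start + base + (1 if w < extra else 0)
        bounds.append((start, end))
        start = end
    return bounds


# 3 x 3 example: [[1, 0, 2], [0, 3, 0], [4, 0, 5]] times [1, 2, 3].
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

When each worker writes a contiguous range of the result vector, as in `partition_rows`, writes from different workers can share a cache line only near the chunk boundaries. Reads of `x` remain scattered, which is the kind of locality problem that a reordering of rows and columns aims to reduce.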
ISBN: 1424411351 (print)
In vector predictive coding of a digital signal series, the vector signal series obtained by grouping adjacent samples of the source signal series can be approximated by a vector autoregressive series with stable covariance. Applying the orthogonal projection principle of Hilbert space, this paper formulates a vector predictive coding strategy that is well suited to parallel processing and derives from it an adaptive parallel processing algorithm which, compared with traditional lattice algorithms, markedly reduces computational complexity and storage space.
ISBN: 9783540695004 (print)
In this paper we provide both a qualitative and a quantitative evaluation of a decoupled multithreaded architecture that uses non-blocking threads. Our architecture is based on simple in-order pipelines and complete decoupling of memory accesses from execution pipelines. We extend the architecture to support thread-level speculation using snooping cache coherency protocols. We evaluate the performance gains from speculation by varying the ratio of load/store instructions to computational instructions, the misspeculation rate, and the degree of thread-level speculation. Our architecture presents a viable alternative to complex superscalar and super-speculative CPUs.
In this paper we investigate the problem of finding a delay- and degree-bounded maximum sum of nodes application-level multicast tree. We then prove that the problem is NP-hard, and its relationship with the well-studied ...
This paper is mainly concerned with parallel and distributed implementations of molecular dynamics simulations of the Lennard-Jones potential model. The reported research work studies and experiments with different algorithms ...
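For context, the Lennard-Jones pair potential is V(r) = 4ε[(σ/r)^12 − (σ/r)^6]. The sketch below evaluates it and sums it over all pairs of particles with a plain sequential double loop. It is only a reference illustration of the model: the function names and the reduced units (ε = σ = 1) are choices made here, and the parallel and distributed algorithms studied in the paper are not shown in the truncated abstract and are not reproduced.

```python
import math


def lj_potential(r, epsilon=1.0, sigma=1.0):
    """Lennard-Jones pair potential at separation r (reduced units by default)."""
    sr6 = (sigma / r) ** 6
    return 4.0 * epsilon * (sr6 * sr6 - sr6)


def total_energy(positions, epsilon=1.0, sigma=1.0):
    """Sum the pair potential over all distinct pairs of 3D positions.

    This is the straightforward O(N^2) evaluation with no cutoff or
    neighbour list; it serves only as a baseline.
    """
    energy = 0.0
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            r = math.dist(positions[i], positions[j])
            energy += lj_potential(r, epsilon, sigma)
    return energy
```

The potential is zero at r = σ and reaches its minimum of −ε at r = 2^(1/6) σ, which gives two convenient sanity checks for any implementation.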