ISBN (print): 9781479904945
Stochastic encoding represents a value by the probability of ones in a random bit stream. Computation based on this encoding offers good fault tolerance and low hardware cost, but one of its major drawbacks is long processing time: the bit stream that represents a value must be long enough to guarantee that random fluctuations introduce only small errors into the final computation results. For example, for most digital image processing algorithms, a 512-bit stream is needed to represent an 8-bit pixel value stochastically so that the final computation error stays below 5%. To address this issue, this paper proposes sharing bits between adjacent bit streams that represent adjacent deterministic values. In image processing applications, for instance, the bit stream that represents the current pixel value can share part of the bits in the bit stream that represents the previous pixel value. We use an image contrast stretching algorithm to evaluate this method. Our experimental results show that the proposed method can improve performance by 90%.
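The trade-off between stream length and error described above can be illustrated with a minimal Python sketch. It covers only the basic encoding (probability of ones) and the effect of stream length on the decoding error; it does not attempt the bit-sharing scheme proposed in the paper. The function names, the `max_value` parameter, and the use of a software pseudo-random generator are assumptions made for illustration.

```python
import random

def encode(value, length, rng, max_value=255):
    """Encode `value` as `length` random bits with P(bit == 1) = value / max_value."""
    p = value / max_value
    return [1 if rng.random() < p else 0 for _ in range(length)]

def decode(bits, max_value=255):
    """Estimate the encoded value from the fraction of ones in the stream."""
    return max_value * sum(bits) / len(bits)

def mean_abs_error(value, length, trials, rng):
    """Average absolute decoding error over `trials` independent streams."""
    total = 0.0
    for _ in range(trials):
        total += abs(decode(encode(value, length, rng)) - value)
    return total / trials

if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same streams
    for length in (32, 128, 512):
        # Longer streams average out random fluctuations, so the error shrinks.
        print(length, mean_abs_error(200, length, 200, rng))
```

Under these assumptions, the average error falls as the stream grows, which is why a short stream cannot meet a tight error bound on its own.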
Multiprocessor System on Chip (MPSoC) technology offers an attractive way to reduce the computation time of complex applications. Executing the H.264/AVC encoder on an MPSoC architecture has become an active research topic, since it can mitigate the encoder's algorithmic complexity and help meet real-time constraints. In this paper, we present an efficient MPSoC architecture for the intra prediction process, an important module of the H.264/AVC video encoder, using data-level parallelism (DLP) partitioning. This architecture is tested on an open platform for the virtual design of MPSoC architectures (SoCLiB) and validated on FPGA technology. Experimental results show a gain of 74% in terms of encoding speed when using four processors to code a High Definition Video (HDV) sequence, compared to a uniprocessor architecture.
This paper is mainly concerned with parallel and distributed implementations of molecular dynamics simulations of the Lennard-Jones potential model. The reported research work studies and experiments with different algorithms ...
ISBN (print): 9781467313261
In this paper we explore the effectiveness of solving computationally intensive problems on FPGAs (Field-Programmable Gate Arrays), using the game of Sudoku as an example. Three different Sudoku solvers have been fully implemented and tested on a low-cost FPGA of the Xilinx Spartan-3E family. The first solver can only handle simple puzzles by reasoning, i.e. without search. The second solver applies a breadth-first search algorithm and therefore has virtually no limitation on the type of puzzles it can solve. We prove that, despite the serial nature of the implemented backtracking search algorithms, parallelism can be used efficiently. Thus, the suggested third solver explores the possibility of parallel processing of search tree branches and boosts the performance of the second solver. The trade-offs of the designed solvers are analyzed, the results are compared to software and to other known implementations, and conclusions are drawn on how to improve the suggested architectures.
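As a rough software illustration of the search-based approach, the sketch below solves a Sudoku grid with breadth-first search in Python. It is not the FPGA design from the paper: the grid representation (9x9 lists with 0 for an empty cell), the row-major choice of the next empty cell, and the function names are all assumptions. Each queue entry is an independent partial grid, which hints at why separate branches of the search tree could be processed in parallel.

```python
from collections import deque

def candidates(grid, r, c):
    """Digits that do not yet appear in row r, column c, or the 3x3 box of (r, c)."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [d for d in range(1, 10) if d not in used]

def solve_bfs(grid):
    """Return a completed grid found by breadth-first search, or None if none exists."""
    queue = deque([tuple(tuple(row) for row in grid)])
    while queue:
        state = queue.popleft()
        # Pick the first empty cell in row-major order (an arbitrary choice).
        empty = next(((r, c) for r in range(9) for c in range(9)
                      if state[r][c] == 0), None)
        if empty is None:
            return [list(row) for row in state]
        r, c = empty
        # Every admissible digit opens a new, independent branch.
        for d in candidates(state, r, c):
            rows = [list(row) for row in state]
            rows[r][c] = d
            queue.append(tuple(tuple(row) for row in rows))
    return None
```

Breadth-first search keeps every live branch in memory at once, so this sketch is only practical for puzzles with few empty cells or strong constraints.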
ISBN (print): 9783642341083; 9783642341090
We present the design of an algorithm for constructing the suffix array of a string on manycore GPUs. Despite the wide use of suffix arrays in text processing and extensive research over two decades, there has been a lack of efficient algorithms able to exploit shared memory parallelism (on both multicore CPUs and manycore GPUs) in practice. To the best of our knowledge, we have developed the first approach exposing shared memory parallelism that significantly outperforms state-of-the-art existing implementations for sufficiently large inputs. We reduce the suffix array construction problem to a number of parallel primitives such as prefix sum, radix sort, and random gather and scatter from/to memory. Thus, the performance of the algorithm depends only on the performance of these primitives on the particular shared memory architecture. We demonstrate its performance on manycore GPUs, but the method can also be applied to other parallel architectures, such as multicores, CELL or Intel MIC.
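To make the reduction to primitives more concrete, here is a sequential Python sketch of suffix array construction by prefix doubling. The abstract does not name the construction scheme, so prefix doubling is an assumption chosen because it decomposes naturally into a sort, a running count of key changes (a prefix sum), and indexed reads and writes (gather and scatter). It is not claimed to be the authors' algorithm, and Python's built-in sort stands in for a radix sort.

```python
def suffix_array(s):
    """Return the start positions of the suffixes of s in lexicographic order."""
    n = len(s)
    if n == 0:
        return []
    rank = [ord(ch) for ch in s]  # initial rank: the first character of each suffix
    sa = list(range(n))
    k = 1
    while True:
        # Key of suffix i: rank of its first k characters, then of the next k
        # (a gather from rank[i + k]; -1 marks "past the end of the string").
        def key(i):
            return (rank[i], rank[i + k] if i + k < n else -1)

        sa.sort(key=key)  # stand-in for a parallel radix sort on the key pairs
        new_rank = [0] * n
        for j in range(1, n):
            # Prefix sum over "key differs from predecessor" flags,
            # scattered back to the suffix positions.
            changed = 1 if key(sa[j]) != key(sa[j - 1]) else 0
            new_rank[sa[j]] = new_rank[sa[j - 1]] + changed
        rank = new_rank
        if rank[sa[-1]] == n - 1:  # all ranks distinct: the order is final
            return sa
        k *= 2
```

For example, `suffix_array("banana")` returns `[5, 3, 1, 0, 4, 2]`, corresponding to the suffixes "a", "ana", "anana", "banana", "na", "nana".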
Convolution is one of the most important operators used in image processing. With the constant need to increase the performance in high-end applications and the rise and popularity of parallel architectures, such as G...
ISBN (print): 9783642328206
The solution of large-scale problems in Computational Science and Engineering relies on the availability of accurate, robust and efficient numerical algorithms and software that are able to exploit the power offered by modern computer architectures. Such algorithms and software provide building blocks for prototyping and developing novel applications, and for improving existing ones, by relieving developers of details concerning numerical methods as well as their implementation in new computing environments.
ISBN (print): 9781467323703; 9781467323727
An efficient parallel priority queue is at the core of efforts to parallelize important non-numeric irregular computations such as discrete event simulation scheduling and branch-and-bound algorithms. GPGPUs can provide a powerful computing platform for such non-numeric computations if an efficient parallel priority queue implementation is available. In this paper, aiming at fine-grained applications, we develop an efficient parallel heap system employing CUDA. To our knowledge, this is the first parallel priority queue implementation on many-core architectures, and thus represents a breakthrough. By allowing wide heap nodes to enable thousands of simultaneous deletions of highest-priority items and insertions of new items, and by taking full advantage of CUDA's data-parallel SIMT architecture, we demonstrate up to 30-fold absolute speedup for relatively fine-grained compute loads compared to an optimized sequential priority queue implementation on fast multicores. By comparison, our optimized multicore parallelization of the parallel heap yields only a 2-3 fold speedup for such fine-grained loads. This parallelization of a tree-based data structure on GPGPUs provides a roadmap for future parallelizations of other such data structures.
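The batched behaviour described above, where many highest-priority items are deleted and new items inserted in one step, can be modelled sequentially in Python. The sketch below is only a reference model of what one batch step computes, built on the standard `heapq` module; it is not the CUDA parallel heap and does not use wide heap nodes. The name `batch_step`, the convention that smaller values mean higher priority, and the choice to make newly inserted items eligible for deletion in the same step are assumptions.

```python
import heapq

def batch_step(heap, new_items, r):
    """Insert new_items into heap, then remove and return up to r smallest items.

    `heap` is a list maintained by heapq; smaller values are treated as
    higher priority (an assumption of this sketch).
    """
    for item in new_items:
        heapq.heappush(heap, item)
    count = min(r, len(heap))
    # Items come out in ascending order, i.e. highest priority first.
    return [heapq.heappop(heap) for _ in range(count)]
```

A discrete event simulation could call `batch_step` once per cycle, processing the returned events and feeding the events they generate into the next call.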
As the need for high-quality random number generators is constantly increasing, especially for cryptographic algorithms, the development of high-throughput randomness generators has to be combined with the development o...
ISBN (print): 9781467323703; 9781467323727
Next Generation Sequencing (NGS) platforms typically produce short reads of 50-150 base pairs (bp). The number of such short reads can be up to 6 billion per run. Aligning these short reads to a large genome is a computationally challenging problem. In this paper, we address this problem by considering the design and optimization of parallel sequence alignment on GPU-based hybrid architectures. Even though the sequence alignment algorithm is inherently data-parallel, issues such as (a) space-time trade-offs in the indexing schema, (b) the need for fast candidate location search (CAL) on the GPU, and (c) maintaining low divergence along with low space for the dynamic programming based local alignment make this a very challenging problem. We present the design of our novel parallel algorithm, Graphics processor Accelerated BFAST (GrABFAST), for large-scale read alignment, which overcomes these challenges and demonstrates superior performance compared to Intel multicore architectures. Using 5 large genomes, including those of human, maize, horse, dog and bacteria, we demonstrate a speedup of around 6x using Fermi Tesla C2070 GPUs versus the BFAST algorithm on a 16-core Intel Xeon 5570 architecture.
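The dynamic programming based local alignment mentioned in point (c) can be sketched in Python with a Smith-Waterman style recurrence. This is a generic, sequential illustration and not the GrABFAST or BFAST implementation; the linear gap penalty and the scoring values (`match=2`, `mismatch=-1`, `gap=-2`) are assumptions. Keeping only two rows of the score matrix reflects the low-space concern raised in the abstract.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score of strings a and b (Smith-Waterman, linear gaps)."""
    prev = [0] * (len(b) + 1)  # scores for the previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            cur[j] = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            best = max(best, cur[j])
        prev = cur
    return best
```

Each cell depends only on its left, upper, and upper-left neighbours, so cells on the same anti-diagonal are independent of one another, and separate reads can be aligned independently as well.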