In this paper we propose an architecture design methodology to optimize the throughput of MD4-based hash algorithms. The proposed methodology includes an iteration bound analysis of hash algorithms, which gives the theoretical delay limit, and Data Flow Graph transformations to achieve the iteration bound. We applied the methodology to several MD4-based hash algorithms, namely SHA1, MD5 and RIPEMD-160. Since SHA1 is the algorithm that requires all the techniques we show, we also synthesized the transformed SHA1 algorithm in a 0.18 µm CMOS technology to verify its correctness and its high throughput. To the best of our knowledge, the proposed SHA1 architecture is the first to achieve the theoretical throughput optimum, beating all previously published results. Though we demonstrate a limited number of examples, this design methodology can be applied to any other MD4-based hash algorithm.
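The iteration bound mentioned in this abstract can be illustrated with a small sketch. It uses the textbook definition (the maximum, over all loops of a data flow graph, of the loop's total computation time divided by the number of registers in the loop). The function name, the graph encoding, and the toy three-node graph are assumptions made for illustration; they are not the SHA1 data flow graph or the authors' tooling.

```python
from fractions import Fraction

def iteration_bound(node_delay, edges):
    """Max over simple loops of (sum of node delays) / (registers in the loop).

    node_delay: dict mapping node name -> integer computation time (hypothetical units)
    edges: list of (source, target, registers_on_edge)
    """
    nodes = sorted(node_delay)
    adjacency = {n: [] for n in nodes}
    for source, target, registers in edges:
        adjacency[source].append((target, registers))
    best = None

    def walk(start, node, time, registers, visited):
        nonlocal best
        for nxt, r in adjacency[node]:
            if nxt == start:
                # Closed a loop that begins and ends at `start`.
                total = registers + r
                if total == 0:
                    raise ValueError("a loop without registers cannot be scheduled")
                ratio = Fraction(time, total)
                if best is None or ratio > best:
                    best = ratio
            elif nxt > start and nxt not in visited:
                # Only visit nodes ordered after `start` so each loop is found once.
                walk(start, nxt, time + node_delay[nxt], registers + r, visited | {nxt})

    for s in nodes:
        walk(s, s, node_delay[s], 0, {s})
    return best

# Toy graph: loop A->B->A (time 3, 1 register) and loop B->C->B (time 3, 2 registers).
toy_delay = {"A": 1, "B": 2, "C": 1}
toy_edges = [("A", "B", 0), ("B", "A", 1), ("B", "C", 0), ("C", "B", 2)]
bound = iteration_bound(toy_delay, toy_edges)
```

In the toy graph the loop A->B->A is the critical one, so no retiming or other transformation can make one iteration faster than that loop's ratio.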
ISBN: 9783540854500 (print)
A novel fast scheme for the Discrete Wavelet Transform (DWT) was introduced in recent years under the name of the lifting scheme [4, 7]. This new scheme presents many advantages over the convolution-based approach [3, 7]. For instance, it is very suitable for parallelization. In this paper we present two new parallel FPGA-based implementations of the lifting-based DWT scheme. The first implementation uses pipelining, parallel processing and data reuse to increase the speedup of the algorithm. In the second architecture, a controller is introduced to dynamically deploy a suitable number of clones according to the hardware resources available in a targeted environment. These two architectures are capable of processing large incoming images or multi-framed images in real time. Simulations run in a Xilinx Virtex-5 FPGA environment demonstrate the practical efficiency of our contribution: the first architecture achieved an operating frequency of 289 MHz, and the second demonstrated the controller's ability to deploy the maximum number of clones that the available resources of a targeted FPGA environment allow and to process the task in parallel.
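As a rough illustration of why the lifting scheme parallelizes well, the sketch below implements one level of an integer Haar-style lifting step (split, predict, update) and its inverse. The abstract does not say which wavelet filter the FPGA designs use, so the choice of this filter and the function names are assumptions made only for the example.

```python
def lifting_forward(samples):
    """One lifting level: returns (approximation, detail) integer coefficients."""
    if len(samples) % 2:
        raise ValueError("expected an even number of samples")
    even = samples[0::2]
    odd = samples[1::2]
    # Predict: each odd sample is predicted from its even neighbour.
    detail = [o - e for o, e in zip(odd, even)]
    # Update: adjust the even samples so they approximate the pairwise mean.
    approx = [e + d // 2 for e, d in zip(even, detail)]
    return approx, detail

def lifting_inverse(approx, detail):
    """Undo the update and predict steps in reverse order, then interleave."""
    even = [s - d // 2 for s, d in zip(approx, detail)]
    odd = [d + e for d, e in zip(detail, even)]
    samples = []
    for e, o in zip(even, odd):
        samples.extend([e, o])
    return samples

row = [5, 7, 3, 1, 8, 8]
approx, detail = lifting_forward(row)
restored = lifting_inverse(approx, detail)
```

Every coefficient pair depends only on its own two input samples, so the list comprehensions could be evaluated by independent hardware units, which is the property a pipelined or replicated design can exploit.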
ISBN: 9781424420025 (print)
We present an efficient algorithm for nonlocal image filtering with applications in electron cryomicroscopy. Our denoising algorithm is a reformulation of the recently proposed nonlocal means filter. It builds on the separable property of neighborhood filtering to offer a fast, parallel and vectorized implementation on contemporary shared-memory computer architectures while reducing the theoretical computational complexity of the original filter. In practice, our approach is much faster than a serial, non-vectorized implementation, and it scales linearly with image size. We demonstrate its efficiency on data sets from Caulobacter crescentus tomograms and a cryoimage containing viruses, and we provide visual evidence attesting to the remarkable quality of the nonlocal means scheme in the context of cryoimaging. With this development we provide biologists with an attractive filtering tool to facilitate their scientific discoveries.
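For readers unfamiliar with the nonlocal means filter, the following is a minimal one-dimensional sketch of the naive (unaccelerated) filter: each sample becomes a weighted average of nearby samples, weighted by how similar their surrounding patches are. It is not the separable, vectorized algorithm the abstract describes; the function name, the clamped borders, and the parameters `patch_radius`, `search_radius` and `h` are assumptions chosen for illustration.

```python
import math

def nonlocal_means_1d(signal, patch_radius=1, search_radius=3, h=1.0):
    """Naive nonlocal means on a 1-D list of numbers (illustrative only)."""
    n = len(signal)

    def sample(k):
        # Clamp indices at the borders (an assumed boundary rule).
        return signal[min(max(k, 0), n - 1)]

    filtered = []
    for i in range(n):
        weight_sum = 0.0
        value_sum = 0.0
        for j in range(max(0, i - search_radius), min(n, i + search_radius + 1)):
            # Squared distance between the patches centred on i and on j.
            dist2 = sum(
                (sample(i + t) - sample(j + t)) ** 2
                for t in range(-patch_radius, patch_radius + 1)
            )
            weight = math.exp(-dist2 / (h * h))
            weight_sum += weight
            value_sum += weight * signal[j]
        filtered.append(value_sum / weight_sum)
    return filtered

step = [0.0, 0.0, 0.0, 0.0, 10.0, 10.0, 10.0, 10.0]
smoothed = nonlocal_means_1d(step)
```

Because samples across the step have very dissimilar patches, their weights are negligible and the edge is preserved, which is the behaviour that makes the filter attractive for noisy images. The cost of this naive form grows with the product of search window and patch size per sample, which is the cost that an accelerated formulation aims to reduce.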
This paper proposes a software-based parallel CRC (Cyclic Redundancy Check) algorithm called "N-byte RCC" (Repetition of Computation and Combination). The algorithm iterates two steps: message computation by "slicing-by-4" and combination through "zero block lookup tables". It can parallelize the CRC calculation across any number of processors. To verify the performance of our algorithm, we employ two different communication architectures: a single bus architecture and a 1-star topology NoC (Network on Chip) architecture. For both architectures, we explore our parallel algorithm using a TLM (Transaction Level Model). The simulation results show that the proposed parallel CRC algorithm with the bus and NoC architectures reduces the processing time by 28 percent and 38 percent, respectively, compared to "slicing-by-8", which is the fastest of the other software-based algorithms. Furthermore, the 1-star NoC architecture of the parallel CRC shows higher performance than the single bus architecture regardless of the number of processors.
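The compute-then-combine idea can be sketched using the linearity of a CRC: the CRC of a concatenation equals the CRC of the first part advanced over as many zero bytes as the second part contains, XORed with the CRC of the second part. The sketch below assumes a simplified CRC (polynomial 0x04C11DB7, zero initial value, no bit reflection, no final XOR) and recomputes the zero-block advance bit by bit. The paper's slicing-by-4 tables and zero block lookup tables are not reproduced, and all function names are hypothetical.

```python
POLY = 0x04C11DB7  # CRC-32 generator polynomial; the other parameters are simplified

def crc_update(crc, data):
    """Bitwise, MSB-first CRC update with no reflection and no final XOR."""
    for byte in data:
        crc ^= byte << 24
        for _ in range(8):
            if crc & 0x80000000:
                crc = ((crc << 1) ^ POLY) & 0xFFFFFFFF
            else:
                crc = (crc << 1) & 0xFFFFFFFF
    return crc

def advance_over_zeros(crc, n_bytes):
    """State the register reaches after n_bytes zero bytes (a table lookup in a fast version)."""
    return crc_update(crc, bytes(n_bytes))

def parallel_crc(message, n_chunks):
    """Split the message, compute per-chunk CRCs independently, then combine them."""
    size = max(1, -(-len(message) // n_chunks))  # ceiling division
    chunks = [message[i:i + size] for i in range(0, len(message), size)]
    # Computation step: each chunk starts from 0, so chunks could run on separate processors.
    partial = [crc_update(0, chunk) for chunk in chunks]
    # Combination step: shift the running CRC past each chunk, then fold in its partial CRC.
    combined = 0
    for chunk, part in zip(chunks, partial):
        combined = advance_over_zeros(combined, len(chunk)) ^ part
    return combined

message = b"The quick brown fox jumps over the lazy dog"
serial = crc_update(0, message)
results = [parallel_crc(message, n) for n in range(1, 6)]
```

The combination step is cheap only if advancing over a block of zeros is fast, which is presumably why a practical scheme would precompute it in tables rather than loop over bits as this sketch does.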
ISBN: 9781605580852 (print)
Web search engines are facing formidable performance challenges as they need to process thousands of queries per second over billions of documents. To deal with this heavy workload, current engines use massively parallel architectures of thousands of machines that require large hardware investments. We investigate new ways to build such high-performance IR systems based on Graphical Processing Units (GPUs). GPUs were originally designed to accelerate computer graphics applications through massive on-chip parallelism. Recently a number of researchers have studied how to use GPUs for other problem domains, including databases and scientific computing [2, 3, 5], but we are not aware of previous attempts to use GPUs for large-scale web search. Our contribution here is to design a basic system architecture for GPU-based high-performance IR, and to describe how to perform highly efficient query processing within such an architecture. Preliminary experimental results based on a prototype implementation suggest that significant gains in query processing performance might be obtainable with such an approach.
The evolution of voids (damage) in friction stir welding processes was simulated using a void growth model that incorporates viscoplastic flow and strain hardening of incompressible materials during plastic deformation. The void growth rate is expressed as a function of the void volume fraction, the effective deformation rate, and the ratio of the mean stress to the strength of the material. A steady-state Eulerian finite element formulation was employed to calculate the flow and thermal fields in three dimensions, and the evolution of the strength and damage was evaluated by integrating the evolution equations along the streamlines obtained in the Eulerian configuration. The distribution of internal voids within the material was qualitatively compared with experimental results, and good agreement was observed in terms of the spatial location of voids. The effects of pin geometry and operational parameters such as tool rotational and travel speeds on the evolution of damage were also examined.
ISBN: 9783540855668 (print)
The present paper proposes an adaptive hardware implementation of a microarray image acquisition system, which is a prerequisite for implementing hardware algorithms that process microarray images. Processing techniques for microarray images are also described, together with a hardware implementation of a spot border detection algorithm. The hardware implementation takes advantage of the parallel computation capabilities offered by FPGA technology. Results demonstrating time and cost efficiency are presented for both hardware implementations.
ISBN: 9783540854517 (digital); 9783540854500 (print)
This book constitutes the refereed proceedings of the 14th International Conference on Parallel Computing, Euro-Par 2008, held in Las Palmas de Gran Canaria, Spain, in August 2008. The 86 revised papers presented were carefully reviewed and selected from 264 submissions. The papers are organized in topical sections on support tools and environments; performance prediction and evaluation; scheduling and load balancing; high performance architectures and compilers; parallel and distributed databases; grid and cluster computing; peer-to-peer computing; distributed systems and algorithms; parallel and distributed programming; parallel numerical algorithms; distributed and high-performance multimedia; theory and algorithms for parallel computation; and high performance networks.
ISBN: 9781605583617 (print)
Event stream applications consist of an acyclic graph of components that are traversed by streams of events. Examples of operations in such components are filtering, aggregation, enrichment, and transformation of events. Applications commonly include a mix of common-use library functions and user-defined functions. When an operation depends only on the current input events, the component can be trivially parallelized by replication. However, if the component keeps state that is used to compute the results, the trivial parallelization approach does not work. Parallel versions of common components have been designed, but complex or user-defined components are normally limited by single-thread performance. In this work, we use optimistic parallelization approaches to harness the potential of multi-core processors to scale the performance of stateful operators in event stream applications. In addition, we investigate indulgent ways to allow the user to provide application knowledge that can increase the amount of useful speculative work. The current prototype shows a considerable gain in throughput even though some speculative executions must be disregarded. Copyright 2008 ACM.
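The general idea of optimistic parallelization of a stateful operator can be sketched as follows: events in a batch are executed speculatively against a snapshot of the state, then committed in arrival order, and any speculation whose input state was changed by an earlier commit is discarded and re-executed. The abstract does not describe the authors' mechanism, so this sequential simulation, the per-key running-sum operator, and the name `process_batch` are all assumptions made for illustration.

```python
def process_batch(state, batch):
    """Apply (key, amount) events to a per-key running sum; return the number of aborts.

    state: dict mapping key -> running sum, updated in place
    batch: list of (key, amount) events in arrival order
    """
    snapshot = dict(state)
    # Speculative phase: every event reads only the snapshot, so these
    # computations are independent and could run on separate cores.
    speculative = [snapshot.get(key, 0) + amount for key, amount in batch]

    # Commit phase: strictly in arrival order, validating each speculation.
    written = set()
    aborted = 0
    for (key, amount), value in zip(batch, speculative):
        if key in written:
            # An earlier event in this batch changed the state that was read,
            # so the speculative result is stale: discard it and re-execute.
            value = state.get(key, 0) + amount
            aborted += 1
        state[key] = value
        written.add(key)
    return aborted

totals = {}
aborts = process_batch(totals, [("a", 1), ("b", 2), ("a", 3), ("c", 4)])
```

In this toy run only the second event for key "a" conflicts, so three of the four speculative results are kept. Application knowledge about which events touch the same state, of the kind the abstract alludes to, would reduce the share of wasted speculation.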