ISBN: 9783540695004 (print)
Because of their high-level nature, parallel functional languages offer the programmer some advantages. Unfortunately, the functional programming community has paid little attention to some important practical problems, such as debugging parallel programs. In this paper we introduce the first debugger that works with any parallel extension of the functional language Haskell, the de facto standard in the lazy functional programming community. The debugger is implemented as an independent library, so it can be used with any Haskell compiler. Moreover, the debugger can be used to analyze how much speculative work has been done in any program.
Graphics processing units (GPUs) are widely used in the area of scientific computing. While GPUs provide much higher peak performance, efficient implementation of real applications on the GPU architectures is still a ...
Author: Huang, H. (Chinese Acad Sci, Supercomp Ctr, Comp Network Informat Ctr, Beijing 100080, Peoples R China)
ISBN: 0769515126 (print)
In this paper, we use a new language, TPL (Tensor Product Language), to compute the Fast Fourier Transform. It can provide good performance and portability. We detail the method and its application to the FFT in TPL, and extend it to the Sande-Tukey FFT algorithm.
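The Sande-Tukey algorithm mentioned above is the decimation-in-frequency form of the radix-2 FFT. The sketch below is plain Python, not TPL, and is only meant to illustrate the structure: a butterfly stage, a twiddle multiplication, two half-size transforms, and a final interleaving, which loosely mirror the factors of a tensor-product factorization. The function names `fft_dif` and `dft` are hypothetical and do not come from the paper.

```python
import cmath


def fft_dif(x):
    """Radix-2 decimation-in-frequency (Sande-Tukey) FFT.

    Illustrative sketch; len(x) must be a power of two.
    """
    n = len(x)
    if n == 1:
        return list(x)
    if n % 2:
        raise ValueError("length must be a power of two")
    half = n // 2
    # Butterfly stage: sums feed the even-indexed outputs,
    # twiddled differences feed the odd-indexed outputs.
    sums = [x[k] + x[k + half] for k in range(half)]
    diffs = [
        (x[k] - x[k + half]) * cmath.exp(-2j * cmath.pi * k / n)
        for k in range(half)
    ]
    # Two independent half-size transforms.
    even = fft_dif(sums)
    odd = fft_dif(diffs)
    # Interleave the two halves into natural output order.
    out = [0j] * n
    out[0::2] = even
    out[1::2] = odd
    return out


def dft(x):
    """Direct O(n^2) DFT, used only as a reference for checking."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]
```

The two recursive calls operate on disjoint data, which is one reason this formulation is convenient to map onto parallel hardware.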
ISBN: 9781509028603 (print)
Convolutional neural networks (CNNs) are the state-of-the-art deep learning approach employed in various applications because of their remarkable performance. Convolutions generally dominate the overall computational complexity of a CNN and thus consume most of the computational power in real implementations. In this paper, efficient hardware architectures incorporating the parallel fast finite impulse response (FIR) algorithm (FFA) for CNN convolution are discussed. The theoretical derivation of the 3-parallel and 5-parallel FFAs is presented, and the corresponding 3-parallel and 5-parallel fast convolution units (FCUs) are proposed for the most commonly used 3 x 3 and 5 x 5 convolutional kernels in CNNs, respectively. Compared to conventional CNN convolution architectures, the proposed FCUs significantly reduce the number of multiplications used in convolutions. Additionally, the FCUs minimize the number of reads from the feature map memory. Furthermore, a reconfigurable FCU architecture that suits convolutions with both 3 x 3 and 5 x 5 kernels is proposed. Based on this, an efficient top-level architecture for processing a complete convolutional layer in a CNN is developed. To quantify the benefits of the proposed FCUs, the design of an FCU is coded in RTL and synthesized with TSMC 90 nm CMOS technology. The implementation results demonstrate that 30% and 36% of the computational energy can be saved compared to conventional solutions with 3 x 3 and 5 x 5 kernels, respectively.
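To make the multiplication saving concrete, here is a minimal software sketch of a 3-parallel FFA for a 3-tap filter, which is the case that corresponds to one row of a 3 x 3 kernel. It follows one commonly cited form of the 3-parallel FFA and is an assumption about the general technique, not a reproduction of the paper's derivation or hardware. The names `fir_direct` and `fir_ffa3` are hypothetical. Each block of three outputs uses six multiplications where the direct form uses nine.

```python
def fir_direct(h, x):
    """Reference 3-tap causal FIR: y[n] = h0*x[n] + h1*x[n-1] + h2*x[n-2]."""
    h0, h1, h2 = h
    padded = [0, 0] + list(x)
    return [
        h0 * padded[n + 2] + h1 * padded[n + 1] + h2 * padded[n]
        for n in range(len(x))
    ]


def fir_ffa3(h, x):
    """3-parallel fast FIR sketch: 3 outputs per block from 6 multiplications."""
    h0, h1, h2 = h
    if len(x) % 3:
        raise ValueError("input length must be a multiple of 3")
    # Coefficient sums are computed once, outside the data path.
    h01, h12, h012 = h0 + h1, h1 + h2, h0 + h1 + h2
    prev_p2 = 0  # h2 * x2 from the previous block (one block delay)
    prev_b = 0   # (h1+h2)*(x1+x2) - h1*x1 from the previous block
    y = []
    for k in range(0, len(x), 3):
        x0, x1, x2 = x[k:k + 3]
        # The six multiplications of this block.
        p0 = h0 * x0
        p1 = h1 * x1
        p2 = h2 * x2
        p01 = h01 * (x0 + x1)
        p12 = h12 * (x1 + x2)
        p012 = h012 * (x0 + x1 + x2)
        # Post-additions recover the three outputs.
        a = p01 - p1
        b = p12 - p1
        y.append(p0 - prev_p2 + prev_b)
        y.append(a - p0 + prev_p2)
        y.append(p012 - a - b)
        prev_p2, prev_b = p2, b
    return y
```

The trade is fewer multipliers for more adders, which is usually favorable in hardware because multipliers cost considerably more area and energy than adders.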
Author: El Baz, D. (CNRS, LAAS, 7 Ave Colonel Roche, F-31077 Toulouse 4, France)
ISBN: 9780769527840 (print)
The implementation of parallel asynchronous iterative algorithms on message passing architectures is considered. Several issues related to communication via message passing interfaces or libraries such as MPI-1, MPI-2, PVM or SHMEM are discussed in this survey paper. Practical implementations are proposed.
AIAC algorithms (Asynchronous Iterations Asynchronous Communications) are a particular class of parallel iterative algorithms. Their asynchronous nature makes them more efficient than their synchronous counterparts in numerous cases, as previous work has already shown. The first goal of this article is to compare several parallel programming environments in order to determine whether one of them is best suited to implementing AIAC algorithms efficiently. The main criterion for this comparison is the performance achieved in a global grid computing context on two classical scientific problems. We also take into account two secondary criteria: ease of programming and ease of deployment. The second goal of this study is to extract from this comparison the important features that a parallel programming environment must have in order to be suited to the implementation of AIAC algorithms.
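As a rough illustration of what an asynchronous iteration looks like, the following single-process sketch simulates a Jacobi-style solver in which each "processor" owns a block of unknowns, updates at its own rate, and sends its values to the others at its own rate, so updates routinely use stale data. It is a deterministic simulation written for this listing, not one of the environments or test problems compared in the article. The name `async_jacobi` and its schedule parameters are hypothetical.

```python
def async_jacobi(A, b, blocks, update_every, send_every, steps):
    """Simulate asynchronous Jacobi iterations on A x = b.

    blocks[p]       -- indices owned by simulated processor p
    update_every[p] -- p recomputes its block every this many steps
    send_every[p]   -- p publishes its block every this many steps
    """
    n = len(b)
    # Each processor keeps a private, possibly stale copy of the full vector.
    views = [[0.0] * n for _ in blocks]
    for t in range(steps):
        for p, owned in enumerate(blocks):
            view = views[p]
            if t % update_every[p] == 0:
                # Jacobi update of the owned components from the local view.
                new = {
                    i: (b[i] - sum(A[i][j] * view[j] for j in range(n) if j != i))
                    / A[i][i]
                    for i in owned
                }
                for i, value in new.items():
                    view[i] = value
            if t % send_every[p] == 0:
                # "Message": copy the owned components into the other views.
                for q, other in enumerate(views):
                    if q != p:
                        for i in owned:
                            other[i] = view[i]
    # Assemble the result from each component's owner.
    x = [0.0] * n
    for p, owned in enumerate(blocks):
        for i in owned:
            x[i] = views[p][i]
    return x
```

No step waits for another processor, which is the property that distinguishes this scheme from a synchronous iteration. Convergence in this sketch relies on the matrix being strictly diagonally dominant.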
ISBN: 9781509028238 (print)
In this paper, we compare a wide range of accelerator architectures (GPUs from AMD and NVIDIA, the Xeon Phi, and a DSP) by means of a signal-processing pipeline that processes radio-telescope data. We discuss the mapping of the algorithms from this pipeline to the accelerators and analyze performance. We also analyze energy efficiency, using custom-built, microcontroller-based power sensors that measure the instantaneous power consumption of the accelerators at millisecond time scale. We show that the GPUs are the fastest and most energy-efficient accelerators, and that the differences in performance and energy efficiency are large.
ISBN: 9783540695004 (print)
In this paper we present a novel and complete approach to encapsulating parallelism for relational database query execution that strives for maximum resource utilization for both CPU and disk activities. Its simple and robust design is capable of modeling intra- and inter-operator parallelism for one or more parallel queries in a natural way. In addition, encapsulation guarantees that the bulk of relational operators can remain unmodified, as long as their implementation is thread-safe. We show that with this approach the problem of scheduling parallel tasks is generalized, so that it can be safely entrusted to the underlying operating system (OS) without suffering any performance penalties. On the contrary, relocating all scheduling decisions from the DBMS to the OS guarantees a centralized and therefore near-optimal resource allocation (depending on the OS's abilities) for the complete system that hosts the database server as one of its tasks. Moreover, with this proposal, query parallelization is fully transparent at the SQL interface of the database system. The DB administrator can configure the system for effective parallel query execution by setting two descriptive tuning parameters. A prototype implementation has been integrated into the Transbase® relational DBMS engine.
ISBN: 9781467313261 (print)
In this paper we explore how effectively computationally intensive problems can be solved in an FPGA (Field-Programmable Gate Array), using the game of Sudoku as an example. Three different Sudoku solvers have been fully implemented and tested on a low-cost FPGA of the Xilinx Spartan-3E family. The first solver can only deal with simple puzzles that are solvable by reasoning, i.e. without search. The second solver applies a breadth-first search algorithm and therefore has virtually no limitation on the type of puzzles it can solve. We prove that, despite the serial nature of the implemented backtracking search algorithms, parallelism can be used efficiently. Thus, the suggested third solver explores the possibility of parallel processing of search tree branches and boosts the performance of the second solver. The trade-offs of the designed solvers are analyzed, the results are compared to software and to other known implementations, and conclusions are drawn on how to improve the suggested architectures.
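A software sketch of breadth-first search over partially filled grids may help picture the second solver's approach. This is an assumption about the general technique in plain Python and says nothing about the paper's FPGA architecture. The names `candidates`, `solve_bfs` and `is_valid_solution` are hypothetical, and empty cells are encoded as 0.

```python
from collections import deque


def candidates(grid, r, c):
    """Digits that can be placed at (r, c) without breaking a constraint."""
    used = set(grid[r]) | {grid[i][c] for i in range(9)}
    br, bc = 3 * (r // 3), 3 * (c // 3)
    used |= {grid[i][j] for i in range(br, br + 3) for j in range(bc, bc + 3)}
    return [v for v in range(1, 10) if v not in used]


def solve_bfs(puzzle):
    """Breadth-first search: expand the first empty cell of each queued grid."""
    queue = deque([tuple(tuple(row) for row in puzzle)])
    while queue:
        grid = queue.popleft()
        empty = next(
            ((r, c) for r in range(9) for c in range(9) if grid[r][c] == 0),
            None,
        )
        if empty is None:
            return [list(row) for row in grid]  # no empty cell left: solved
        r, c = empty
        for v in candidates(grid, r, c):
            row = grid[r][:c] + (v,) + grid[r][c + 1:]
            queue.append(grid[:r] + (row,) + grid[r + 1:])
    return None  # search space exhausted without a solution


def is_valid_solution(grid):
    """Check that every row, column and 3 x 3 box holds the digits 1..9."""
    want = set(range(1, 10))
    rows = all(set(row) == want for row in grid)
    cols = all({grid[r][c] for r in range(9)} == want for c in range(9))
    boxes = all(
        {grid[br + i][bc + j] for i in range(3) for j in range(3)} == want
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    return rows and cols and boxes
```

Every grid in the queue can be expanded independently of the others, which is the kind of branch-level independence a parallel solver can exploit.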
ISBN: 0769515126 (print)
For a Markov cipher, the transition probability matrix is doubly stochastic. The eigenvalue of this matrix with the largest magnitude less than one plays an important role in designing Markov ciphers. This paper provides a parallel algorithm for computing that eigenvalue for the doubly stochastic matrix A of size 65535x65535, which comes from a shrunken model of a Markov cipher with 16-bit plaintext and ciphertext. An analysis of the complexity of the parallel algorithm is also given.
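One simple way to estimate the magnitude of that eigenvalue is power iteration restricted to vectors whose entries sum to zero, since the all-ones vector carries the eigenvalue 1 of any doubly stochastic matrix. The sketch below is an assumption about a generic technique, not the paper's parallel algorithm, and the name `subdominant_modulus` is hypothetical. The norm-ratio estimate is reliable for symmetric matrices like the toy example in the test and may oscillate for matrices whose subdominant eigenvalues form a complex pair.

```python
import math


def subdominant_modulus(A, iters=200):
    """Estimate the largest eigenvalue magnitude below 1 of a doubly stochastic A."""
    n = len(A)
    v = [float(i + 1) ** 2 for i in range(n)]  # arbitrary non-constant start
    estimate = 0.0
    for _ in range(iters):
        # Project out the all-ones direction, which belongs to eigenvalue 1.
        mean = sum(v) / n
        v = [vi - mean for vi in v]
        norm = math.sqrt(sum(vi * vi for vi in v))
        if norm == 0.0:
            return 0.0
        v = [vi / norm for vi in v]
        # Matrix-vector product; each output row is independent of the others.
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        estimate = math.sqrt(sum(wi * wi for wi in w))
        v = w
    return estimate
```

The matrix-vector product dominates the cost, and its rows can be computed independently, so this step is the natural candidate for distribution across processors.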