OpenMP is a widely used parallel programming model on traditional multi-core processors. Generally, OpenMP is used to develop fine-grained parallelism through a multithreaded model. The stream programming model is a new kind of parallel programming model for stream architectures. OpenMP bears a resemblance to the stream programming model at some level. The transformation between the two models has attracted much attention from the research community, since it is the foundation of porting programs between the two architectures. Most related research focuses on the efficiency of porting existing parallel programs to new architectures such as GPUs. Very few of these studies, however, address the porting problem systematically, namely, what kinds of parallel programs can or should be transformed into stream programs and mapped to run on stream processors. In this paper, we study how the parallel mechanisms in OpenMP map to the stream programming model, and point out those OpenMP mechanisms that are infeasible or undesirable for stream programs. By analyzing two typical benchmarks, we conclude that a majority of scientific applications are suitable for mapping to the stream programming model. This conclusion supports the idea of accelerating scientific applications with stream processors.
In this paper, we introduce an efficient method to accelerate flow simulations for an isothermal multiphase and multicomponent (MPMC) Lattice Boltzmann method (LBM) on a single-node multi-GPU architecture. Our objective is to improve the performance of multiphase and multicomponent Lattice Boltzmann simulations by using Nvidia GPUDirect technology and Peer-to-Peer (P2P) data transfers. We also study the optimization of Peer-to-Peer communications by means of a clustering algorithm. Several simulations are presented and their performance is discussed in order to validate the method.
Any digitization system must be preceded by an anti-aliasing filter. For wideband high-frequency applications, parallel multi-rate conversion systems such as time-interleaved analog-to-digital converters (TI-ADCs) or hybrid filter bank analog-to-digital converters (HFBs) are attractive solutions. This paper compares the robustness of both techniques with respect to non-idealities of the anti-aliasing filter (AAF). Theoretical results show that the signal-to-noise ratio (SNR) degradation due to out-of-band signals is smaller for HFBs than for TI-ADCs, provided that the analysis filters of the HFB are selective enough. Simulation results show that this is the case even for low-order analysis filters in a four-channel HFB.
ISBN: 9783642122668 (print)
Computationally demanding public key cryptographic algorithms, such as Rivest-Shamir-Adleman (RSA) and Elliptic Curve (EC) cryptosystems, are critically dependent on modular multiplication for their performance. Modular multiplication used in cryptography may be performed in two different algebraic structures, namely GF(N) and GF(2^n), which normally require distinct hardware solutions for speeding up performance. For both fields, Montgomery multiplication is the most widely adopted solution, as it enables efficient hardware implementations, provided that a slightly modified definition of modular multiplication is adopted. In this paper we propose a novel unified architecture for parallel Montgomery multiplication supporting both GF(N) and GF(2^n) finite field operations, which are critical for RSA and ECC public key cryptosystems. The hardware scheme interleaves multiplication and modulo reduction. Furthermore, it relies on a modified Booth recoding scheme for the multiplicand and a radix-4 scheme for the modulus, enabling reduced time delays even for moderately large operand widths. In addition, we present a pipelined architecture based on the parallel blocks previously introduced, enabling very low clock counts and high throughput levels for the long operands used in cryptographic applications. Experimental results, based on 0.18 µm CMOS technology, demonstrate the effectiveness of the proposed techniques, which outperform the best results previously presented in the technical literature.
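The "slightly modified definition of modular multiplication" mentioned in this abstract can be illustrated with a minimal software sketch of textbook Montgomery multiplication over the integers modulo an odd N. This is not the paper's hardware architecture: it omits the interleaving, Booth recoding, radix-4 scheme, pipelining, and the GF(2^n) case. The function names and the choice R = 2^k are assumptions made for illustration only.

```python
def montgomery_setup(n, k):
    """Precompute constants for an odd modulus n and R = 2**k with R > n."""
    if n % 2 == 0 or n >= (1 << k):
        raise ValueError("n must be odd and smaller than 2**k")
    r = 1 << k
    n_prime = (-pow(n, -1, r)) % r  # satisfies n * n_prime == -1 (mod R)
    r2 = (r * r) % n                # used to convert into Montgomery form
    return n_prime, r2


def redc(t, n, k, n_prime):
    """Montgomery reduction: return t * R^-1 mod n, for 0 <= t < n * R."""
    mask = (1 << k) - 1
    m = ((t & mask) * n_prime) & mask  # m = t * n_prime mod R
    u = (t + m * n) >> k               # exact division by R, no trial division
    return u - n if u >= n else u      # u < 2n, so one subtraction suffices


def mod_mul(a, b, n, k):
    """Compute (a * b) mod n through the Montgomery domain."""
    n_prime, r2 = montgomery_setup(n, k)
    a_bar = redc((a % n) * r2, n, k, n_prime)  # a * R mod n
    b_bar = redc((b % n) * r2, n, k, n_prime)  # b * R mod n
    c_bar = redc(a_bar * b_bar, n, k, n_prime) # a * b * R mod n
    return redc(c_bar, n, k, n_prime)          # leave the Montgomery domain
```

The modified product computed by `redc(a_bar * b_bar, ...)` is a·b·R⁻¹ mod N rather than a·b mod N, which is why operands are first scaled by R. The benefit is that reduction needs only shifts and masks by a power of two, which is what makes the method attractive in hardware. Requires Python 3.8 or later for `pow(n, -1, r)`.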
ISBN: 9781467385312 (print)
We study the traffic characteristics of parallel and high performance computing applications in this paper. Applications that utilize multiple cores are increasingly common due to the emergence of multicore processors. However, the design of single-threaded and multi-threaded applications can vary significantly. Furthermore, the on-chip communication profile of multicore systems should be analysed and modelled for characterization and simulation purposes. We investigate several applications running in a full-system simulation environment. The on-chip communication traces are gathered and analysed. We study the detailed low-level profiles of these applications. The applications are categorized into different groups according to their parallel programming paradigms. We find that the trace data follow a power-law model with differing parameters, and we estimate these parameters by applying least-squares linear regression. We propose a generic synthetic traffic model based on the analysis results.
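One common way to fit a power-law model with least-squares linear regression is to take logarithms of both variables, since y = c·x^α becomes a straight line log y = log c + α·log x. The sketch below assumes this log-log approach; the abstract does not state which trace quantities were fitted or how, so the function name `fit_power_law` and its inputs are hypothetical.

```python
import math


def fit_power_law(xs, ys):
    """Fit y = c * x**alpha by ordinary least squares on (log x, log y).

    All xs and ys must be strictly positive. Returns (c, alpha).
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    # Slope of the regression line in log-log space is the exponent alpha.
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    if sxx == 0:
        raise ValueError("xs must contain at least two distinct values")
    alpha = sxy / sxx
    # Intercept in log-log space is log(c).
    c = math.exp(mean_y - alpha * mean_x)
    return c, alpha
```

Fitting each application's trace separately would yield one (c, α) pair per application, which is one way to read the abstract's "differing parameters".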
As one of the most pervasive problems in computer science, string matching is the kernel algorithm in many applications, especially within the communities of information retrieval and computational biology. Meanwhile, the CPU+GPU heterogeneous parallel platform is becoming more and more popular for solving compute-intensive applications. This paper implements a webpage matching system with a GPU-based advanced Aho-Corasick (AC) algorithm, G-AC, whose peak performance is almost 28 times that of the original AC algorithm taken from Snort.
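For readers unfamiliar with the baseline, the following is a minimal sequential sketch of the classic Aho-Corasick multi-pattern matcher. It is neither the G-AC GPU implementation described here nor Snort's code; the function names and data layout are assumptions chosen for readability.

```python
from collections import deque


def build_automaton(patterns):
    """Build the trie, failure links and output lists for the patterns."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        if not pattern:
            continue  # ignore empty patterns
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first traversal so that shallower states are finished first.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, child in goto[state].items():
            queue.append(child)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Inherit matches that end at the failure target.
            out[child] = out[child] + out[fail[child]]
    return goto, fail, out


def find_all(text, patterns):
    """Return (start_index, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, matches = 0, []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            matches.append((i - len(pattern) + 1, pattern))
    return matches
```

The text is scanned once regardless of how many patterns there are. A GPU variant would typically split the input into chunks scanned by separate threads, but how G-AC does this is not described in the abstract.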
With social networks growing increasingly large, fast community detection algorithms, such as the label propagation algorithm, are attracting more attention. But the label propagation algorithm deals vertices with no...
ISBN: 9781509015047 (print)
Programming FPGAs has been an arduous task that requires extensive knowledge of hardware description languages (HDLs), such as Verilog or VHDL, and of low-level hardware details. With OpenCL support for FPGAs, the design, prototyping and implementation of FPGA applications is increasingly moving towards a much higher level of abstraction than the intrinsically low-level HDLs. On the other hand, in the context of traditional (i.e., CPU) software development, OpenCL is still considered low-level and complex because the programmer needs to manually expose parallelism in the code. In this work, we present our approach to enhancing FPGA programmability via GLAF, a visual programming framework that automatically generates synthesizable OpenCL code with an array of FPGA-specific optimizations. We find that our tool facilitates the development process and produces functionally correct and well-performing code on the FPGA for our molecular modeling, gene sequence search, and filtering algorithms.
It is becoming increasingly common to use GPUs (Graphics Processing Units) as accelerators to speed up compute-intensive sections of applications. Since block ciphers are meant to be used for high-speed encryption, it is important to implement them as fast as possible. The block cipher ARIA is a recent encryption standard that uses four different S-boxes. This paper proposes three methods for high-performance implementation of the ARIA encryption algorithm on GPUs. In order to reduce data dependencies, the round functions of ARIA are merged into lookup tables and XOR operations. Encryption is performed in parallel, and the data are arranged appropriately across the different GPU memory spaces. Experimental results demonstrate that these techniques accelerate ARIA encryption significantly. The quantitative performance comparison shows speedups of 18 to 45 times as the plaintext size varies from 4M to 256M.
Parallel simulation has been an active research area for more than a decade, and several parallel simulation algorithms have been proposed. To evaluate parallel simulation environments, there is a need for a common benchmark suite. Such benchmarks should allow the designer of simulation kernels to: (i) evaluate how efficiently the simulation kernel runs on specific architectures; and (ii) evaluate how simulation problems scale on the kernel. The vast majority of benchmarks suggested in the literature focus on the latter problem. The first requirement is, however, equally important, as such benchmarks are needed to implement efficient simulation systems. In this report, we advocate an incremental benchmark methodology which is primarily intended for performance tuning. The benchmark suite is based on a small set of ping models with which it is possible to effectively isolate and estimate the various overheads, contention and latencies encountered in simulation kernels. The use of the benchmark suite is illustrated by applying it to the performance tuning and evaluation of a time warp kernel.