ISBN (print): 9783642349614; 9783642349607
The performance of the elliptic curve method (ECM) for integer factorization plays an important role in the security assessment of RSA-based protocols, as ECM serves as a cofactorization tool inside the number field sieve. The efficient arithmetic for Edwards curves found an application in speeding up ECM. We propose techniques based on generating and combining addition-subtraction chains to optimize Edwards ECM in terms of both performance and memory requirements. This makes our approach very suitable for memory-constrained devices such as graphics processing units (GPUs). For commonly used ECM parameters we are able to lower the required memory by up to a factor of 55 compared to the state-of-the-art Edwards ECM approach. Our ECM implementation on a GTX 580 GPU sets a new throughput record, outperforming the best GPU, CPU and FPGA results reported in the literature.
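To illustrate the kind of addition-subtraction chain the abstract refers to, here is a minimal sketch (not the paper's actual chain-generation or combination method) using the non-adjacent form (NAF), a standard signed-digit recoding that exploits cheap point negation, as on Edwards curves. The group operations `add`, `neg`, `dbl` and the identity `zero` are caller-supplied placeholders.

```python
def naf(k):
    """Non-adjacent form of k: signed digits in {-1, 0, 1}, least
    significant first, with no two adjacent nonzero digits."""
    digits = []
    while k > 0:
        if k & 1:
            d = 2 - (k % 4)  # k mod 4 == 1 -> digit 1, k mod 4 == 3 -> digit -1
            k -= d
        else:
            d = 0
        digits.append(d)
        k //= 2
    return digits

def scalar_mul(k, P, add, neg, dbl, zero):
    """Left-to-right double-and-add/subtract driven by the NAF of k.
    Subtraction steps are where cheap negation pays off."""
    acc = zero
    for d in reversed(naf(k)):
        acc = dbl(acc)
        if d == 1:
            acc = add(acc, P)
        elif d == -1:
            acc = add(acc, neg(P))
    return acc
```

Sanity check in the additive group of integers (where the chain computes plain multiplication): `scalar_mul(151, 3, ...)` yields 453. The NAF cuts the expected number of nonzero digits, and hence additions, from about 1/2 to about 1/3 per bit.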
This paper presents an efficient mapping of the geometric biclustering (GBC) algorithm for neural information processing onto the Graphics Processing Unit (GPU). The proposed designs consist of five different versions which ex...
Many particle simulation codes use interaction lists to store interacting particles. Depending on the physical parameters of the simulation, those interaction lists may occupy a large amount of physical memory, which may limit the number of particles in the simulation. This article discusses several methods that try to reduce the size of interaction lists while maintaining, or even increasing, the number of particle interactions per second. Different techniques are discussed for a parallel shared-memory algorithm on multicore architectures. On those architectures, the memory bandwidth is shared by multiple cores. Since the interaction list is a large shared data structure, it cannot be stored in CPU caches and has to be streamed into the processor several times. A reduction in the size of the interaction list will therefore reduce the number of elements to be reloaded, resulting in more efficient implementations.
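As a rough illustration of what such an interaction list is (a generic sketch, not one of the article's specific compression methods): the list below stores every interacting pair explicitly, while the half list stores each pair once and relies on Newton's third law to recover both partners, halving the data that must be streamed.

```python
import numpy as np

def full_pair_list(pos, cutoff):
    """One entry per ordered interacting pair (i, j), i != j."""
    n = len(pos)
    return [(i, j) for i in range(n) for j in range(n)
            if i != j and np.linalg.norm(pos[i] - pos[j]) < cutoff]

def half_pair_list(pos, cutoff):
    """Each interacting pair stored once (i < j); the force on j is
    minus the force on i, so no information is lost."""
    n = len(pos)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if np.linalg.norm(pos[i] - pos[j]) < cutoff]
```

The half list is exactly half the size of the full list, at the cost of a scattered write per pair when accumulating forces, which is one of the bandwidth-versus-size trade-offs such methods must weigh.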
In this paper we evaluate two life science algorithms, namely Needleman-Wunsch sequence alignment and Direct Coulomb Summation, for GPUs. Whereas for Needleman-Wunsch it is difficult to get good performance numbers, Direct Coulomb Summation is particularly suitable for graphics cards. We present several optimization techniques, analyze the theoretical potential of the optimizations with respect to the algorithms, and measure the effect on execution times. We target the recent NVIDIA Fermi architecture to evaluate the performance impacts of novel hardware features like the cache subsystem on optimizing transformations. We compare the execution times of CUDA and OpenCL code versions for Fermi and predecessor models with parallel OpenMP versions executed on the main CPU.
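Direct Coulomb Summation is simple to state, which is why it maps so well to GPUs: the potential at every grid point is an independent sum over all charges, V_g = sum_i q_i / |r_g - r_i|. A NumPy sketch of the computation (in vacuum units, with assumed array shapes; the paper's GPU kernels and tunings are not reproduced here):

```python
import numpy as np

def direct_coulomb(grid_pts, charges, charge_pos):
    """Potential at each grid point: V_g = sum_i q_i / |r_g - r_i|.
    grid_pts: (G, 3), charges: (N,), charge_pos: (N, 3) -> (G,)."""
    diff = grid_pts[:, None, :] - charge_pos[None, :, :]  # (G, N, 3)
    dist = np.linalg.norm(diff, axis=2)                   # (G, N)
    return (charges[None, :] / dist).sum(axis=1)          # (G,)
```

Every output element reads the same charge array and writes independently, so the arithmetic intensity is high and there are no data dependencies between grid points, precisely the pattern graphics cards reward.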
With the development of the Internet and cloud computing, multimedia data, such as images and videos, have become one of the most common data types being processed. As the scale of multimedia data is still increasing, i...
ISBN (print): 9783642280726; 9783642280733
In the literature, various two-level interconnection networks have been proposed using hypercubes or star graphs. In this paper, a new two-level interconnection network topology called the Metastar, denoted Mstar(k,m), is introduced. The proposed network uses the star graph as its basic building block: the network at the lower level is a star, while at the higher level it is a cube. Its various topological parameters such as packing density, degree, diameter, cost, average distance and Hamiltonicity are investigated. Message routing and broadcasting algorithms are also proposed. A performance analysis in terms of topological parameters is carried out, and the proposed network is shown to be a suitable candidate for large-scale computing.
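The star graph used as the lower-level building block is a standard construction: vertices are the permutations of n symbols, and each edge swaps the first symbol with the symbol in position i. A small sketch of that building block (the Metastar-specific level-two cube wiring is not reproduced here):

```python
from itertools import permutations

def star_graph(n):
    """Star graph S_n: nodes are permutations of 1..n; each of the
    n-1 edges at a node swaps the first symbol with position i."""
    nodes = list(permutations(range(1, n + 1)))
    adj = {v: [] for v in nodes}
    for v in nodes:
        for i in range(1, n):
            u = (v[i],) + v[1:i] + (v[0],) + v[i + 1:]
            adj[v].append(u)
    return adj
```

S_n has n! nodes of degree n-1, so it packs far more nodes per unit degree than a hypercube of comparable degree, which is the packing-density advantage the abstract alludes to.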
In this paper we present a Spiking Neural P system (SNP system) simulator based on graphics processing units (GPUs). In particular, we implement the simulator using NVIDIA CUDA-enabled GPUs. The massively parallel arch...
This paper is mainly concerned with parallel and distributed implementations of molecular dynamics simulations of the Lennard-Jones potential model. The reported research work studies and experiments with different algorithms and parallelization techniques for shared-memory and message-passing architectures, and the programs are executed on single-core processors, multi-core processors, a GPU, and a GPU cluster. The solution based on efficient versions of the neighbor list algorithm and the space division technique is discussed further. The speedups obtained for the multi-core processor, GPU, and GPU cluster, relative to the single-core processor implementation, are analyzed, and the advantages of the algorithms are highlighted.
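The neighbor list technique mentioned above can be sketched as follows (a minimal serial version with an assumed skin distance, not the paper's optimized parallel variant): pairs within cutoff plus skin are listed once, and the Lennard-Jones energy is then summed only over listed pairs inside the true cutoff.

```python
import numpy as np

def neighbor_list(pos, cutoff, skin=0.3):
    """Verlet-style list: for each i, the indices j > i within
    cutoff + skin; the skin lets the list be reused across steps."""
    r = cutoff + skin
    n = len(pos)
    nbrs = [[] for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if np.linalg.norm(pos[i] - pos[j]) < r:
                nbrs[i].append(j)
    return nbrs

def lj_energy(pos, nbrs, cutoff, eps=1.0, sigma=1.0):
    """Total Lennard-Jones energy 4*eps*((sigma/r)^12 - (sigma/r)^6)
    over listed pairs that fall inside the true cutoff."""
    E = 0.0
    for i, js in enumerate(nbrs):
        for j in js:
            r = np.linalg.norm(pos[i] - pos[j])
            if r < cutoff:
                s6 = (sigma / r) ** 6
                E += 4 * eps * (s6 * s6 - s6)
    return E
```

For two particles at the potential minimum r = 2^(1/6)·sigma the energy is exactly -eps, a convenient correctness check; the space-division (cell list) technique would replace the O(N^2) list construction here with an O(N) pass over neighboring cells.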
This paper presents a method for auto-tuning interactive ray tracing on GPUs using a hardware model. Getting full performance from modern GPUs is a challenging task. Workloads which require a guaranteed performance ov...
The global scheduler of a current GPU distributes thread blocks to symmetric multiprocessors (SMs), which schedule threads for execution with the granularity of a warp. Threads in a warp execute the same code path in lockstep, which potentially leads to a large number of wasted cycles under divergent control flow. In order to overcome this general issue of SIMT architectures, we propose techniques to relax divergence on the fly within a computation kernel in order to achieve a much higher total utilization of processing cores. We propose techniques for branch and loop divergence (which may also be combined) that switch to suitable tasks during a GPU kernel run every time divergence occurs. Our newly introduced techniques can easily be applied to arbitrary iterative algorithms, and we evaluate the performance and effectiveness of our approach via synthetic and real-world applications.
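As a toy illustration of why loop divergence wastes cycles (a simplified cost model, not the paper's actual technique): in lockstep execution a warp runs for as many iterations as its slowest thread, so grouping threads with similar trip counts into the same warp, which is one way of "switching to suitable tasks", reduces the total cost.

```python
def warp_cost(trip_counts, warp_size=32):
    """Lockstep model: each warp costs as many loop iterations as the
    largest trip count among its threads; idle lanes still burn cycles."""
    return sum(max(trip_counts[i:i + warp_size])
               for i in range(0, len(trip_counts), warp_size))

# Interleaving puts a long-running thread into every warp; sorting the
# same workload groups similar trip counts, shrinking idle-lane waste.
interleaved = [c for pair in zip(range(32), range(63, 31, -1)) for c in pair]
regrouped = sorted(interleaved)
```

With 64 threads whose trip counts are 0..63, the interleaved assignment costs 63 + 47 = 110 warp-iterations, while the regrouped one costs 31 + 63 = 94, the same work finishing with measurably fewer wasted lockstep cycles.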