Importance sampling is a widely used method for probabilistic inference with Bayesian probabilistic networks. Importance sampling is relatively easy to parallelize, and parallel GPU implementations yield significant sp...
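As a minimal illustration of importance sampling in a Bayesian network, the sketch below uses likelihood weighting, one common form of the method. It is not the implementation from the paper above. The two-node network (Rain -> WetGrass), its probabilities, and the function name are hypothetical and chosen only for the example. Each sample is drawn and weighted independently of the others, which is why the method parallelizes well.

```python
import random

# Hypothetical two-node network: Rain -> WetGrass (all numbers are assumptions).
P_RAIN = 0.2
P_WET_GIVEN_RAIN = {True: 0.9, False: 0.1}


def estimate_p_rain_given_wet(num_samples: int, seed: int = 0) -> float:
    """Estimate P(Rain | WetGrass = true) by likelihood weighting."""
    rng = random.Random(seed)
    weighted_rain = 0.0
    total_weight = 0.0
    for _ in range(num_samples):
        # Sample the non-evidence variable from its prior (the proposal).
        rain = rng.random() < P_RAIN
        # Weight the sample by the likelihood of the observed evidence.
        weight = P_WET_GIVEN_RAIN[rain]
        total_weight += weight
        if rain:
            weighted_rain += weight
    # Self-normalized importance sampling estimate.
    return weighted_rain / total_weight
```

For this toy network the exact posterior is 0.18 / 0.26, so the estimate can be checked against it as the sample count grows.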
ISBN: 9783642246494 (print)
In a commercial Relational Database Management System (RDBMS), sort and join are the most demanding operations, so improving the performance of external sort and external join algorithms that handle large input data sizes is highly beneficial. This paper proposes parallel implementations of multithreaded external sort and external hash join algorithms to accelerate IBM DB2, one of the leading RDBMSs, using an IBM Power Edge of Network (IBM PowerEN (TM)) Peripheral Component Interconnect Express (PCIe) card as an accelerator. Preliminary results show that the proposed parallel implementation of the algorithms on the PowerEN (TM) PCIe card speeds up DB2 sort and join performance by about a factor of two.
ISBN: 9783642246692 (digital)
ISBN: 9783642246685 (print)
This paper compares three parallel point-multiplication algorithms on conic curves over the ring Zn. We propose an algorithm that parallelizes point-multiplication by using the Chinese Remainder Theorem to divide a point-multiplication over the ring Zn into two different point-multiplications over finite fields, which are computed separately. The time complexity and speedup ratio of this parallel algorithm are computed on the basis of our previous research on the basic parallel algorithms in the conic curve cryptosystem. A quantitative performance analysis compares this algorithm with two other algorithms we designed earlier. The comparison demonstrates that the algorithm presented in this paper reduces the time complexity of point-multiplication on conic curves over the ring Zn and is more efficient than the preceding ones.
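The Chinese Remainder Theorem split described above can be sketched with a stand-in group. The example below is not the conic curve arithmetic from the paper: it substitutes modular exponentiation in Zn (n = p * q) for point-multiplication, and the function names and the small primes in the usage note are hypothetical. The structure is the same: one computation modulo n becomes two independent computations modulo p and modulo q, whose results are then recombined.

```python
def crt_combine(r_p: int, r_q: int, p: int, q: int) -> int:
    """Recombine residues mod p and mod q into the unique residue mod p*q.

    Assumes p and q are coprime. Uses Garner's formula.
    """
    q_inv = pow(q, -1, p)            # modular inverse (Python 3.8+)
    h = ((r_p - r_q) * q_inv) % p
    return r_q + h * q


def power_mod_n_via_crt(g: int, k: int, p: int, q: int) -> int:
    """Compute g**k mod (p*q) from two smaller, independent computations."""
    # The two calls below share no data, so they could run on
    # separate processors; here they run one after the other.
    r_p = pow(g % p, k, p)
    r_q = pow(g % q, k, q)
    return crt_combine(r_p, r_q, p, q)
```

With the assumed primes p = 61 and q = 53, `power_mod_n_via_crt(42, 1234, 61, 53)` agrees with the direct computation `pow(42, 1234, 61 * 53)`.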
ISBN: 9783642246692 (digital)
ISBN: 9783642246685 (print)
An emotional agent software architecture for real-time mobile robotic applications has been developed. To allow the agent to solve more dynamically constrained application problems, processor computation time should be reduced so that the time gained can be used to execute more complex processes. In this paper, the response time of the operating processes in each attention cycle of the agent is decreased by parallelizing the highly parallel processes of the architecture, namely the emotional contribution processes. The implementation of these processes has been evaluated on Field Programmable Gate Array (FPGA) and multicore processors.
ISBN: 9783642246494 (print)
Finding optimal phase durations for a controlled intersection is a computationally intensive task requiring O(N^3) operations. In this paper we introduce a cost-optimal parallelization of a dynamic programming algorithm that reduces the complexity to O(N^2). Three implementations that span a wide range of parallel hardware are developed. The first is based on a shared-memory architecture, using the OpenMP programming model. The second is based on message passing, targeting massively parallel machines including high-performance clusters and supercomputers. The third is based on the data-parallel programming model mapped onto Graphics Processing Units (GPUs). Key optimizations include loop reversal, communication pruning, load balancing, and efficient thread-to-processor assignment. Experiments have been conducted on an 8-core server, an IBM BlueGene/L supercomputer (2 node boards, 128 processors), and an Nvidia GeForce GTX470 GPU with 448 cores. Results indicate practical scalability on all platforms, with a maximum speedup of 76x on the GTX470.
ISBN: 9781457714115 (print)
Automation, computational time, and cost are open subjects in microarray image processing. This paper proposes image processing techniques together with their implementations in order to eliminate the shortcomings of the existing software platforms for microarray image processing: user intervention, increased computational time, and cost. For each step of microarray image processing, application-specific hardware architectures are designed that parallelize the algorithms for fast processing. Computational time is estimated and compared with state-of-the-art approaches. The proposed hardware architectures, integrated inside microarray scanners, deliver microarray image characteristics in an automated manner, removing the need for an additional software platform. FPGA technology was chosen for the implementation because of its parallel computation capabilities and ease of reconfiguration.
ISBN: 9781450302418 (print)
Static program analysis supporting software development is often part of edit-compile cycles, and precise program analysis is time consuming. Points-to analysis is a data-flow-based static program analysis used to find object references in programs. Its applications include test case generation, compiler optimization, and program understanding, among others. Recent increases in the processing power of desktop computers come mainly from multiple cores. Parallel algorithms are vital for the simultaneous use of multiple cores. An efficient parallel points-to analysis requires sufficient work for each processing unit. This paper presents a parallelized points-to analysis of object-oriented programs. It exploits the fact that (1) different target methods of polymorphic calls and (2) independent control-flow branches can be analyzed in parallel. Carefully selected thresholds guarantee that each parallel thread has sufficient work to do and that only little work is redundant with other threads. Our experiments show that this approach achieves a maximum speedup of 4.43 on 8 cores for a benchmark suite of Java programs. Copyright 2011 ACM.
ISBN: 9780769543284 (print)
We focus on agent-based simulations in which a large number of agents move in space, obeying some simple rules. Since such simulations are computationally intensive, it is challenging in this context to let the number of agents grow and to increase the quality of the simulation. A fascinating way to meet this need is to exploit parallel architectures. In this paper, we present a novel distributed load balancing scheme for a parallel implementation of such simulations. The purpose of the scheme is to achieve high scalability. Our approach to load balancing is designed to be lightweight and totally distributed: the calculations for the balancing take place at each computational step and influence the successive step. To the best of our knowledge, our approach is the first distributed load balancing scheme in this context. We present both the design and the implementation, which allowed us to perform a number of experiments with up to 1,000,000 agents. Tests show that, although the load balancing algorithm is local, the workload distribution is balanced while the communication overhead is negligible.
ISBN: 9783642246494 (print)
In this work, we propose an efficient quasi-cyclic LDPC (QC-LDPC) decoder simulator that runs on graphics processing units (GPUs). We optimize the data structures of the messages used in the decoding process so that both the read and write processes can be performed in a highly parallel manner by the GPUs. We also propose a highly efficient algorithm to convert the data structure of the messages from one form to another with very little latency. Finally, by using a large number of GPU cores to perform the simple computations simultaneously, our GPU-based LDPC decoder is found to run around 100 times faster than a CPU-based simulator.
ISBN: 9783642246692 (digital)
ISBN: 9783642246685 (print)
An addition chain for a natural number x of n bits is a sequence of numbers a(0), a(1), ... , a(l), such that a(0) = 1, a(l) = x, and a(k) = a(i) + a(j) with 0 <= i, j < k <= l. The addition chain problem asks: what is the minimal number of additions needed to compute x starting from 1? In this paper, we present a new parallel algorithm to generate a short addition chain for x. The algorithm has running time O(log(2) n) using a polynomial number of processors on an EREW PRAM (exclusive read exclusive write parallel random access machine). The algorithm is faster than previous algorithms and is based on the binary method.
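The definition above can be made concrete with a short sequential sketch of the classical left-to-right binary method. This is not the parallel EREW PRAM algorithm from the paper, only the basic method it builds on, and the function names are hypothetical. The second function checks a candidate chain directly against the definition.

```python
def binary_addition_chain(x: int) -> list[int]:
    """Return an addition chain for x built by the left-to-right binary method."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    chain = [1]
    # Scan the bits of x after the leading 1, most significant first.
    for bit in bin(x)[3:]:
        chain.append(chain[-1] * 2)      # doubling: a(k) = a(k-1) + a(k-1)
        if bit == "1":
            chain.append(chain[-1] + 1)  # add a(0) = 1
    return chain


def is_addition_chain(chain: list[int], x: int) -> bool:
    """Check a(0) = 1, a(l) = x, and a(k) = a(i) + a(j) with i, j < k."""
    if not chain or chain[0] != 1 or chain[-1] != x:
        return False
    for k in range(1, len(chain)):
        earlier = chain[:k]
        if not any(chain[k] - a in earlier for a in earlier):
            return False
    return True
```

For x = 23 (binary 10111) the sketch yields 1, 2, 4, 5, 10, 11, 22, 23, which uses 7 additions. The binary method gives a short chain but not always a minimal one: for x = 15 it uses 6 additions, while the chain 1, 2, 3, 6, 12, 15 needs only 5.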