ISBN: 9781450308625 (print)
The proceedings contain 4 papers. The topics discussed include: cache size in a cost model for heterogeneous skeletons; an efficient skew-insensitive algorithm for join processing on grid architectures; formally specifying and analyzing a parallel virtual machine for lazy functional languages using Maude; and type system for a safe execution of parallel programs in BSML.
ISBN: 9783642246494 (print)
In this work, we propose an efficient quasi-cyclic LDPC (QC-LDPC) decoder simulator that runs on graphics processing units (GPUs). We optimize the data structures of the messages used in the decoding process so that both the read and write processes can be performed in a highly parallel manner by the GPU. We also propose a highly efficient algorithm to convert the data structure of the messages from one form to another with very little latency. Finally, by using the large number of cores in the GPU to perform the simple computations simultaneously, our GPU-based LDPC decoder is found to run around 100 times faster than a CPU-based simulator.
ISBN: 9781424441419 (print)
Successful proof-of-concept laboratory experiments on cortically controlled brain-computer interfaces motivate continued development of neural prosthetic microsystems (NPMs). One research direction is to realize real-time spike sorting processors (SSPs) on the NPM. The SSP detects the spikes, extracts their features, and then runs a classification algorithm in real time to distinguish the spikes of the different firing neurons. Several architectures have been designed for spike detection and feature extraction; however, the classification hardware is missing. To complete the SSP, a density-based, hardware-oriented classification algorithm is proposed. Traditional classification algorithms require considerable memory to store all the training features during the processing iterations, which results in considerable power and area for the hardware. The proposed algorithm is based on a density map of the spike features, which can be accumulated online as the spike features arrive. The algorithm therefore saves significant memory space and is well suited to efficient hardware implementation.
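The memory argument above can be illustrated with a small sketch. It is not the paper's algorithm, whose classification rule is not given in the abstract; it only shows how a density map can be accumulated online with fixed storage. The class name `DensityMap`, the two-feature grid, the bin count, the feature range, and the peak rule are all assumptions made for illustration.

```python
class DensityMap:
    """Fixed-size 2-D histogram of spike features, updated one spike at a time.

    Hypothetical illustration: grid size, feature range and peak rule are assumed.
    """

    def __init__(self, bins=16, lo=0.0, hi=1.0):
        self.bins, self.lo, self.hi = bins, lo, hi
        # Memory is bins * bins counters, independent of how many spikes arrive.
        self.counts = [[0] * bins for _ in range(bins)]

    def _cell(self, value):
        """Map a feature value to a bin index, clamping out-of-range values."""
        idx = int((value - self.lo) / (self.hi - self.lo) * self.bins)
        return min(max(idx, 0), self.bins - 1)

    def add(self, f1, f2):
        """Accumulate one spike's (f1, f2) feature pair; the spike is not stored."""
        self.counts[self._cell(f1)][self._cell(f2)] += 1

    def peaks(self, min_count=1):
        """Return cells with at least min_count spikes that exceed all neighbours."""
        found = []
        for i in range(self.bins):
            for j in range(self.bins):
                c = self.counts[i][j]
                if c < min_count:
                    continue
                neighbours = [
                    self.counts[a][b]
                    for a in range(max(i - 1, 0), min(i + 2, self.bins))
                    for b in range(max(j - 1, 0), min(j + 2, self.bins))
                    if (a, b) != (i, j)
                ]
                if all(c > n for n in neighbours):
                    found.append((i, j))
        return found
```

In this sketch each density peak could stand for one candidate neuron, and the storage cost stays at `bins * bins` counters however many spikes are processed, in contrast to keeping every training feature in memory.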
ISBN: 9783642246692 (digital)
ISBN: 9783642246685 (print)
An addition chain for a natural number x of n bits is a sequence of numbers a(0), a(1), ..., a(l) such that a(0) = 1, a(l) = x, and a(k) = a(i) + a(j) with 0 <= i, j < k <= l. The addition chain problem asks: what is the minimal number of additions needed to compute x starting from 1? In this paper, we present a new parallel algorithm to generate a short addition chain for x. The algorithm has running time O(log(2) n) using a polynomial number of processors on an EREW PRAM (exclusive read exclusive write parallel random access machine). The algorithm is faster than previous algorithms and is based on the binary method.
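As a concrete reference point for the definition, the following minimal sketch builds a chain with the classical sequential binary method (double for every bit, add 1 for every set bit). It is not the parallel EREW PRAM algorithm of the paper; the function names `binary_addition_chain` and `is_addition_chain` are hypothetical.

```python
def binary_addition_chain(x):
    """Return an addition chain for x built by the left-to-right binary method."""
    if x < 1:
        raise ValueError("x must be a positive integer")
    chain = [1]
    # Scan the bits of x after the leading 1, most significant first.
    for bit in bin(x)[3:]:
        chain.append(chain[-1] * 2)      # doubling: a(k) = a(k-1) + a(k-1)
        if bit == "1":
            chain.append(chain[-1] + 1)  # add one: a(k) = a(k-1) + a(0)
    return chain


def is_addition_chain(chain, x):
    """Check the definition: starts at 1, ends at x, each term sums two earlier terms."""
    if not chain or chain[0] != 1 or chain[-1] != x:
        return False
    for k in range(1, len(chain)):
        earlier = chain[:k]
        if not any(chain[k] - a in earlier for a in earlier):
            return False
    return True
```

For x = 15 this sketch yields 1, 2, 3, 6, 7, 14, 15 (six additions), whereas the chain 1, 2, 3, 6, 12, 15 needs only five, which shows that the binary method gives a short chain but not necessarily a minimal one.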
ISBN: 9783642246494 (print)
CUDA is an architecture introduced by NVIDIA Corporation that allows software developers to take advantage of GPU resources to increase computational power. This paper presents an approach to accelerating similarity searches of DNA and protein molecules through parallel alignment of their sequences using the GPU and CUDA. To optimally align two biopolymer sequences, such as amino acid or nucleotide sequences, we employ the Smith-Waterman algorithm. We present the optimization steps that lead to very good efficiency of our GPU implementation, and we compare the results of efficiency tests with other known implementations. The results show that it is possible to search bioinformatics databases accurately within a reasonable time.
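For readers unfamiliar with Smith-Waterman, a minimal sequential sketch of the scoring recurrence follows. It is a plain CPU version, not the paper's CUDA implementation, and the scoring values (match 2, mismatch -1, linear gap -1) and the function name `smith_waterman_score` are assumptions chosen for illustration; the abstract does not state the scoring scheme used.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-1):
    """Best local alignment score of a and b with a linear gap penalty.

    Only two rows of the dynamic programming matrix are kept, since each
    cell depends on its left, upper and upper-left neighbours.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr[j] = max(
                0,                   # start a new local alignment here
                prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                prev[j] + gap,       # gap in b
                curr[j - 1] + gap,   # gap in a
            )
            best = max(best, curr[j])
        prev = curr
    return best
```

In a database search, one such score is computed per database sequence, and these computations are independent of each other, which is one reason the problem maps well onto many GPU threads.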
Sorting algorithms have been studied extensively for the past three decades. Their uses are found in many applications, including real-time systems, operating systems, and discrete event simulations. In most cases, the e...
ISBN: 9781450308625 (print)
High-performance architectures are increasingly heterogeneous, with shared and distributed memory components. Programming such architectures is complicated, and performance portability is a major issue as the architectures evolve. This paper proposes a new architectural cost model that accounts for cache size and improves on heterogeneous architectures, and demonstrates a skeleton-based programming model that simplifies programming heterogeneous architectures. We further demonstrate that the cost model can be exploited by skeletons to improve load balancing on heterogeneous architectures. The heterogeneous skeleton model facilitates performance portability, using the architectural cost model to automatically balance load across heterogeneous components of the architecture. For both a data-parallel benchmark and a realistic image processing program, we obtain good performance for the heterogeneous skeleton on homogeneous shared and distributed memory architectures, and on three heterogeneous architectures. We also show that taking cache size into account in the model leads to improved balance and performance.
ISBN: 9783642246692 (digital)
ISBN: 9783642246685 (print)
In this paper, we analyze the jitter and packet loss behavior of voice over Internet protocol (VoIP) traffic by means of network measurements and simulation results. As a result of these analyses, we provide a detailed characterization and accurate modeling of these Quality of Service (QoS) parameters. Our studies reveal that VoIP jitter can be modeled by self-similar and multifractal models. We present a methodology for simulating packet loss. In addition, we found relationships between the Hurst parameter (H) and the packet loss rate (PLR).
ISBN: 9780972741286 (print)
Computations in data mining tasks are executed sequentially on a CPU, usually on a single-processor desktop computer. Traditionally, parallel computing is the use of multiple computing resources to execute computational problems whose parts can be solved simultaneously. Such computations are possible using multi-core CPUs, computers with multiple CPUs, or a network of computers working in parallel. Today's graphics processing units (GPUs) can well be thought of as parallel processors, and applications have proven them to be so. Simply stated, a GPU is capable of simultaneously using multiple internal computing resources, such as 'core processors', to solve a computational problem within a fraction of the time a CPU would need. In this paper, we depict the GPU as an effective desktop multiprocessor machine that handles computationally intense, data-independent tasks in data mining algorithms in parallel, while executing the sequential sections of the program sequentially. We explore the parallel architecture of the GPU for computing core data mining problems, such as clustering, for efficient parallel computing on a desktop computer.
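The split between data-independent and sequential work can be sketched with one iteration of k-means. The choice of k-means is an assumption (the abstract only says "clustering"), the code is plain sequential Python rather than GPU code, and the names `nearest_centroid` and `kmeans_step` are hypothetical.

```python
def nearest_centroid(point, centroids):
    """Index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])),
    )


def kmeans_step(points, centroids):
    """One k-means iteration: label every point, then recompute the centroids."""
    # Data-independent part: each label depends only on one point and the
    # current centroids, so on a GPU each point could be given its own thread.
    labels = [nearest_centroid(p, centroids) for p in points]

    # Reduction part: every centroid needs all points assigned to it, so this
    # step requires coordination and is written sequentially here.
    new_centroids = []
    for c, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == c]
        if members:
            new_centroids.append(
                tuple(sum(col) / len(members) for col in zip(*members))
            )
        else:
            new_centroids.append(tuple(old))  # keep an empty cluster's centroid
    return labels, new_centroids
```

The labelling loop is the kind of computationally intense, data-independent task the abstract refers to, while the centroid update plays the role of a sequential section of the program.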
ISBN: 9783642246494 (print)
Finding optimal phase durations for a controlled intersection is a computationally intensive task requiring O(N^3) operations. In this paper we introduce a cost-optimal parallelization of a dynamic programming algorithm that reduces the complexity to O(N^2). Three implementations that span a wide range of parallel hardware are developed. The first is based on a shared-memory architecture, using the OpenMP programming model. The second is based on message passing, targeting massively parallel machines, including high-performance clusters and supercomputers. The third is based on the data-parallel programming model mapped onto graphics processing units (GPUs). Key optimizations include loop reversal, communication pruning, load balancing, and efficient thread-to-processor assignment. Experiments were conducted on an 8-core server, IBM BlueGene/L supercomputer 2-node boards with 128 processors, and an Nvidia GeForce GTX470 GPU with 448 cores. Results indicate practical scalability on all platforms, with a maximum speedup of 76x on the GTX470.