ISBN (print): 9781467362351
In recent years, filter bank multicarrier (FBMC) has attracted renewed interest for its possible applications in cognitive radio and dynamic spectrum access. A distinctive feature of cognitive radio is its adaptivity to the environment. When the environment changes, a cognitive radio changes its parameters to optimize transmission and reception. It is therefore desirable to design a unified structure and algorithm for FBMC that needs little change across different parameters. In this paper, we propose a unified structure and parallel algorithms to implement FBMC. The FBMC system and the parallel algorithms are constructed from a normalized prototype filter, whose coefficients can be pre-computed and stored. The proposed parallel algorithms have the same structure for various choices of time duration, subcarrier spacing, and bandwidth. Combined with known parallel algorithms for the fast Fourier transform (FFT), the proposed algorithms fully parallelize the computations at the transmitter and receiver, and can therefore run much faster than conventional serial algorithms on modern processors, which usually offer massive parallelism.
ISBN (print): 0780320182
Finding the connected components of a graph is a basic computational problem. In recent years, several exciting results have broken the $\log^2 n$ time barrier for finding connected components on parallel machines using shared memory without concurrent-write capability. This paper presents two new parallel algorithms, both running in less than $\log^2 n$ time. The merit of the first algorithm is that it uses only a sublinear number of processors, yet retains the time complexity of the fastest existing algorithm. The second algorithm is slightly slower, but its work (i.e., the time-processor product) is closer to optimal than that of all previous algorithms running in less than $\log^2 n$ time.
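The abstract does not describe the two algorithms themselves. As background only, the sketch below shows the classic hook-and-shortcut (pointer jumping) scheme for connected components, simulated sequentially in Python. It is not the paper's method and makes no claim about its time or processor bounds; the function name `connected_components` and the edge-list input format are assumptions chosen for illustration.

```python
def connected_components(n, edges):
    """Label each vertex 0..n-1 with the smallest vertex in its component.

    Illustrative hook-and-shortcut scheme, simulated sequentially.
    Not the algorithm of the paper above.
    """
    parent = list(range(n))
    changed = True
    while changed:
        changed = False
        # Hooking: across each edge, attach the larger root to the smaller label.
        for u, v in edges:
            ru, rv = parent[u], parent[v]
            if ru != rv:
                hi, lo = max(ru, rv), min(ru, rv)
                if parent[hi] == hi:  # only a current root may be hooked
                    parent[hi] = lo
                    changed = True
        # Shortcutting: pointer jumping until every vertex points at its root.
        while True:
            jumped = [parent[parent[v]] for v in range(n)]
            if jumped == parent:
                break
            parent = jumped
    return parent


if __name__ == "__main__":
    print(connected_components(6, [(0, 1), (1, 2), (3, 4)]))
```

Each pointer-jumping pass halves the depth of every tree, which is where the logarithmic factors in PRAM analyses of this family of algorithms come from.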
ISBN (print): 9781450335881
We use exponential start time clustering to design faster parallel graph algorithms involving distances. Previous algorithms usually rely on graph decomposition routines with strict restrictions on the diameters of the decomposed pieces. We weaken these bounds in favor of stronger local probabilistic guarantees. This allows more direct analyses of the overall process, giving: (1) linear-work parallel algorithms that construct spanners with $O(k)$ stretch and size $O(n^{1+1/k})$ in unweighted graphs and size $O(n^{1+1/k} \log k)$ in weighted graphs; (2) hopsets that lead to the first parallel algorithm for approximating shortest paths in undirected graphs with $O(m \,\mathrm{polylog}\, n)$ work.
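The abstract names exponential start time clustering without defining it. The sketch below assumes the commonly described form of the technique: every vertex u draws a shift from an exponential distribution with rate beta, and each vertex v joins the cluster of the u that minimizes dist(u, v) minus the shift of u. It is a sequential Python simulation for unweighted graphs; the function name `est_cluster`, the adjacency-dict input, and the parameter `beta` are assumptions for illustration, not details taken from the paper.

```python
import heapq
import random


def est_cluster(adj, beta, seed=0):
    """Assign every vertex of a non-empty unweighted graph to a cluster center.

    adj  : dict mapping each vertex to an iterable of its neighbours
    beta : rate of the exponential distribution (assumed parameter name)
    """
    rng = random.Random(seed)
    shift = {u: rng.expovariate(beta) for u in adj}
    max_shift = max(shift.values())
    # Vertex u starts growing its own cluster at time max_shift - shift[u].
    heap = [(max_shift - shift[u], u, u) for u in adj]
    heapq.heapify(heap)
    center = {}
    while heap:
        t, v, c = heapq.heappop(heap)
        if v in center:
            continue  # v was already claimed by an earlier arrival
        center[v] = c
        for w in adj[v]:
            if w not in center:
                heapq.heappush(heap, (t + 1.0, w, c))  # unit edge length
    return center


if __name__ == "__main__":
    path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}
    print(est_cluster(path, beta=0.5))
```

A smaller `beta` makes large shifts more likely, so fewer and larger clusters form; a larger `beta` yields many small clusters.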
ISBN (print): 9781450383356
Sparse times dense matrix multiplication (SpMM) has applications in well-established fields such as computational linear algebra as well as emerging fields such as graph neural networks. In this study, we evaluate the performance of various techniques for performing SpMM as a distributed computation across many nodes, with a focus on GPU accelerators. We examine how the actual local computational performance of state-of-the-art SpMM implementations affects computational efficiency as dimensions change when we scale to large numbers of nodes, which proves to be an unexpectedly important bottleneck. We consider various distribution strategies, including A-Stationary, B-Stationary, and C-Stationary algorithms, 1.5D and 2D algorithms, and RDMA-based and bulk synchronous methods of data transfer. Our results show that the best choice of algorithm and implementation technique depends not only on the cost of communication for particular matrix sizes and dimensions, but also on the performance of local SpMM operations. Our evaluations reveal that, when GPU accelerators are involved, the best design choices for SpMM differ from the conventional algorithms that are known to perform well for dense matrix-matrix or sparse matrix-sparse matrix multiplication.
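For readers unfamiliar with the local operation that the study treats as a bottleneck, the following is a minimal single-node sketch of SpMM with the sparse operand in compressed sparse row (CSR) form. It is plain Python for readability and says nothing about the GPU kernels or distribution strategies evaluated in the paper; the function name `spmm_csr` and the list-based layout are assumptions for illustration.

```python
def spmm_csr(indptr, indices, data, B):
    """Compute C = A @ B, where A is sparse (CSR) and B is a dense list of rows.

    indptr  : row pointer array of A, length n_rows + 1
    indices : column index of each stored value of A
    data    : stored values of A
    """
    n_rows = len(indptr) - 1
    n_cols = len(B[0]) if B else 0
    C = [[0.0] * n_cols for _ in range(n_rows)]
    for i in range(n_rows):
        row_c = C[i]
        # Visit only the stored (nonzero) entries of row i of A.
        for p in range(indptr[i], indptr[i + 1]):
            a, k = data[p], indices[p]
            row_b = B[k]
            for j in range(n_cols):
                row_c[j] += a * row_b[j]
    return C


if __name__ == "__main__":
    # A = [[1, 0, 2],
    #      [0, 0, 3]]
    indptr, indices, data = [0, 2, 3], [0, 2, 2], [1.0, 2.0, 3.0]
    B = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(spmm_csr(indptr, indices, data, B))
```

The work is proportional to the number of stored entries of A times the number of columns of B, so the shape of B directly changes how much arithmetic each nonzero of A amortizes.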
ISBN (print): 0769511538
We present parallel algorithms to find cut vertices, bridges, and Hamiltonian paths in bounded interval tolerance graphs. For a graph with n vertices, the algorithms require $O(\log n)$ time and use $O(n)$ processors on the Concurrent Read Exclusive Write parallel RAM (CREW PRAM) model of computation. Our approach transforms the original graph problem into a problem in computational geometry. The total work done by the parallel algorithms is comparable to the work done by the best known sequential algorithms for the more restricted classes of graphs, namely interval graphs and permutation graphs. In this sense, our algorithms have optimal complexity.
ISBN (print): 9781618397881
The mechanisms that lead to high tree species diversity in forests are not yet fully understood. One of the leading theories is that interactions with natural enemies can give rare tree species a survival advantage over more common species. One way of exploring such observations is through individual based modeling. An individual based model (IBM) is a bottom-up simulation in which the bulk dynamics emerge from the interaction of individual constituents. Due to their emergent nature, IBMs are population sensitive: achieving a high degree of accuracy is synonymous with matching system population sizes. Consequently, such models may run into the millions of individuals and become computationally intensive. Here the computing power of graphics processing units (GPUs) is used to overcome this computational limitation. The algorithms developed here for GPUs allow the model to be scaled to millions of individuals and run on standard desktop computers. This effectively puts supercomputing power at the fingertips of researchers, students, and forest management services alike. The parallel implementation developed here was compared against a serial implementation running on the central processing unit. The results show a significant performance gain for the parallel implementation while maintaining statistical accuracy. This shows that realistically sized models can be executed efficiently on inexpensive mass-market desktop computer hardware.
ISBN (print): 3540600434
Misra [8] recently introduced a regular data structure called powerlists and used it to show how many data-parallel algorithms, including Batcher's merge sort, bitonic sort, the fast Fourier transform, and prefix sum, can be described concisely using recursion. The elegance of these recursive descriptions is further reflected in deducing properties of these algorithms. This paper shows how such proofs can be easily automated in the theorem prover RRL (Rewrite Rule Laboratory), which is based on equational and rewriting techniques. In particular, the cover set method for automating proofs by induction in RRL generates proofs that preserve, to a large extent, the clarity and succinctness of the hand proofs given in [8]. This is illustrated using a correctness proof of Batcher's merge sort algorithm. Mechanically generated proofs from specifications of powerlists and parallel algorithms using different approaches are contrasted. One gets longer, more complex proofs with many cases if powerlists are modeled as a subtype of lists. However, if powerlists are specified using a proposal by Kapur [2], in which the algebraic specification method is extended to associate applicability conditions with the functions of a data type, thus allowing constructors of a data structure to be partial, then one gets compact and elegant proofs similar to the ones reported in [8]. Applicability conditions can be used to provide contexts for axioms and proofs, just like type information. The effectiveness of the proposed axiomatic method becomes all the more evident when reasoning about nested powerlists for modeling n-dimensional arrays, for example in specifying and reasoning about a transformation for embedding a multi-dimensional array into a hypercube such that adjacency of nodes is preserved by the transformation. A mechanically generated proof of this property of the embedding transformation, which was not proved in Misra's paper, is discussed in detail. This suggests that the proposed
ISBN (print): 0769517315
In this paper, by means of an abstract model of the SIMD type with vertical data processing (the STAR-machine), we present a simple associative parallel algorithm for finding tree paths in undirected graphs. We study applications of this algorithm to updating minimum spanning trees in undirected graphs, determining maximum flow values in a multiterminal network, and finding a fundamental set of circuits with respect to a given spanning tree. These algorithms are given as corresponding STAR procedures whose correctness is proved and whose time complexity is evaluated.
ISBN (print): 3540626166
For a given ordered graph (G, <), we consider the smallest (strongly) chordal graph G' containing G with < as a (strongly) perfect elimination ordering. We call (G, <) a compact representation of G'. We show that a depth-first search tree and a breadth-first search tree can be computed in parallel in polylogarithmic time with a number of processors linear in the size of the compact representation. We also consider the problems of finding a maximum clique and of developing a data structure extension that allows adjacency queries in polylogarithmic time.
ISBN (print): 9781479984909
The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of the CUDA-enabled GPU architecture. It has multiple streaming multiprocessors, each with a shared memory, and a global memory that can be accessed by all threads. The HMM has several parameters: the number $d$ of streaming multiprocessors, the number $p$ of threads per streaming multiprocessor, the number $w$ of memory banks of each shared memory and of the global memory, the shared memory latency $l$, and the global memory latency $L$. The main purpose of this paper is to discuss the optimality of fundamental parallel algorithms running on the HMM. We first show that image convolution for an image with $n \times n$ pixels using a filter of size $(2v+1) \times (2v+1)$ can be done in $O\left(\frac{n^2}{w} + \frac{n^2 L}{dp} + \frac{n^2 v^2}{dw} + \frac{n^2 v^2 l}{dp}\right)$ time units on the HMM. Further, we show that this parallel implementation is time optimal by proving a matching lower bound on the running time. We then show that the product of two $n \times n$ matrices can be computed in $O\left(\frac{n^3}{mw} + \frac{n^3 L}{mdp} + \frac{n^3}{dw} + \frac{n^3 l}{dp}\right)$ time units on the HMM if the capacity of the shared memory in each streaming multiprocessor is $O(m^2)$. This implementation is also proved to be time optimal. We further clarify the conditions under which image convolution and matrix multiplication hide the memory access latency overhead and maximize the global memory throughput and the parallelism. Finally, we provide experimental results on a GeForce GTX Titan to support our theoretical analysis.
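The matrix-product bound above depends on a shared memory of capacity $O(m^2)$, which suggests working on $m \times m$ tiles. The sketch below illustrates only that general blocking idea in sequential Python; it is not the paper's HMM implementation and models neither memory banks nor latency. The function name `tiled_matmul` and the use of nested lists are assumptions for illustration.

```python
def tiled_matmul(A, B, m):
    """Multiply two n x n matrices (lists of rows) using m x m tiles.

    On a GPU-like machine each (i0, j0) output tile would be handled by one
    streaming multiprocessor, with the A and B tiles staged in shared memory.
    """
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for i0 in range(0, n, m):          # tile row of C
        for j0 in range(0, n, m):      # tile column of C
            for k0 in range(0, n, m):  # tiles along the shared dimension
                for i in range(i0, min(i0 + m, n)):
                    row_c = C[i]
                    for k in range(k0, min(k0 + m, n)):
                        a = A[i][k]
                        row_b = B[k]
                        for j in range(j0, min(j0 + m, n)):
                            row_c[j] += a * row_b[j]
    return C


if __name__ == "__main__":
    A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
    print(tiled_matmul(A, A, m=2))
```

Each element brought into a tile is reused up to $m$ times, which is consistent with the factor $m$ in the denominators of the first two terms of the bound.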