Random networks are widely used for modeling and analyzing complex processes. Many mathematical models have been proposed to capture diverse real-world networks. One of the most important aspects of these models is the degree distribution. The Chung-Lu (CL) model is a random network model that can produce networks with any given degree distribution. The complex systems we deal with nowadays are growing larger and more diverse than ever. Generating random networks with a given degree distribution and billions of nodes and edges or more has become a necessity, which requires efficient parallel algorithms. We present an MPI-based distributed-memory parallel algorithm for generating massive random networks using the CL model, which takes time with high probability and O(n) space per processor, where n, m, and P are the number of nodes, edges, and processors, respectively. The time efficiency is achieved by using a novel load-balancing algorithm. Our algorithms scale very well to a large number of processors and can generate massive power-law networks with one billion nodes and 250 billion edges in one minute using 1024 processors.
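To make the Chung-Lu model concrete, here is a minimal sequential sketch in Python. It is not the MPI algorithm from the abstract: it is the naive O(n^2) formulation in which each pair {i, j} becomes an edge with probability min(1, w_i * w_j / S), where S is the sum of all weights. The function name `chung_lu`, the `seed` parameter, and the choice to omit self-loops are assumptions made for illustration.

```python
import random


def chung_lu(weights, seed=None):
    """Naive O(n^2) Chung-Lu sampler (illustrative sketch only).

    Each pair {i, j} with i < j becomes an edge independently with
    probability min(1, w_i * w_j / S), where S = sum(weights).
    Self-loops are omitted, which is an assumption of this sketch.
    """
    rng = random.Random(seed)
    total = sum(weights)
    if total == 0:
        return []  # no expected degree anywhere, so no edges
    n = len(weights)
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            p = min(1.0, weights[i] * weights[j] / total)
            if rng.random() < p:
                edges.append((i, j))
    return edges
```

With weights chosen from a power law, the expected degree of node i is approximately w_i. The quadratic loop over all pairs is what efficient generators avoid, and it is the reason a billion-node network needs a different algorithm.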
In this work, we present the numerical results (using C++) obtained from seven different versions of the LU decomposition algorithms. Four of the algorithms use Crout-like reduction and three of the algorithms use Doolittle-like reduction. (C) 2004 Elsevier Inc. All rights reserved.
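For reference, the textbook Doolittle reduction factors A = LU with ones on the diagonal of L, whereas the Crout form puts the ones on the diagonal of U. The Python sketch below shows the Doolittle variant without pivoting. It is not one of the seven versions compared in the paper (which used C++), and the function name `lu_doolittle` is an assumption.

```python
def lu_doolittle(a):
    """Doolittle LU factorization without pivoting (illustrative sketch).

    Returns (L, U) with L unit lower triangular and U upper triangular
    such that L @ U equals a. Assumes every pivot U[i][i] is nonzero.
    """
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    U = [[0.0] * n for _ in range(n)]
    for i in range(n):
        L[i][i] = 1.0
        # Row i of U uses the rows of U and columns of L already computed.
        for j in range(i, n):
            U[i][j] = a[i][j] - sum(L[i][k] * U[k][j] for k in range(i))
        # Column i of L is scaled by the pivot U[i][i].
        for j in range(i + 1, n):
            L[j][i] = (a[j][i] - sum(L[j][k] * U[k][i] for k in range(i))) / U[i][i]
    return L, U
```

For example, `lu_doolittle([[4, 3], [6, 3]])` gives L = [[1, 0], [1.5, 1]] and U = [[4, 3], [0, -1.5]]. A production routine would add partial pivoting for numerical stability.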
A maximum a posteriori (MAP) algorithm is presented for the estimation of spin-density and spin-spin decay distributions from frequency and phase-encoded magnetic resonance imaging data. Linear spatial localization gradients are assumed: the y-encode gradient applied during the phase preparation time of duration tau before measurement collection, and the x-encode gradient applied during the full data collection time t ≥ 0. The MRI signal model developed in [22] is used, in which a signal resulting from M phase encodes (rows) and N frequency encode dimensions (columns) is modeled as a superposition of MN sine-modulated exponentially decaying sinusoids with unknown spin-density and spin-spin decay parameters. The nonlinear least-squares MAP estimate of the spin-density and spin-spin decay distributions solves for the 2MN spin-density and decay parameters minimizing the squared error between the measured data and the sine-modulated exponentially decaying signal model using an iterative expectation-maximization algorithm. A covariance diagonalizing transformation is derived which decouples the joint estimation of MN sinusoids into M separate N-sinusoid optimizations, yielding an order-of-magnitude speedup in convergence. The MAP solutions are demonstrated to deliver a decrease in standard deviation of image parameter estimates on brain phantom data of greater than a factor of two over Fourier-based estimators of the spin-density and spin-spin decay distributions. A parallel processor implementation is demonstrated which maps the N-sinusoid coupled minimization to separate individual simple minimizations, one for each processor.
The circuit value update problem is the problem of updating values in a representation of a combinational circuit when some of the inputs are changed. We assume for simplicity that each combinational element has bounded fan-in and fan-out and can be evaluated in constant time. This problem is easily solved on an ordinary serial computer in O(W + D) time, where W is the number of elements in the altered subcircuit and D is the subcircuit's embedded depth (its depth measured in the original circuit). In this paper we show how to solve the circuit value update problem efficiently on a P-processor parallel computer. We give a straightforward synchronous, parallel algorithm that runs in O(W/P + D lg P) expected time. Our main contribution, however, is an optimistic, asynchronous, parallel algorithm that runs in O(W/P + D + lg W + lg P) expected time, where W and D are the size and embedded depth, respectively, of the "volatile" subcircuit, the subcircuit of elements that have inputs which either change or glitch as a result of the update. To our knowledge, our analysis provides the first analytical bounds on the running time of an optimistic, asynchronous, parallel algorithm.
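The serial baseline mentioned in this abstract can be sketched as level-ordered event propagation: only elements whose inputs actually changed are re-evaluated, and they are processed in order of their depth in the original circuit so that each is evaluated at most once. The Python sketch below is one plausible reading of that baseline, not the paper's parallel algorithms. The function name `update_circuit` and the data layout (dictionaries `gates`, `fanout`, `level`, `values`) are assumptions made for illustration.

```python
from collections import defaultdict


def update_circuit(gates, fanout, level, values, changes):
    """Serial circuit value update by level-ordered propagation (sketch).

    gates:   gate name -> (function, list of input signal names)
    fanout:  signal name -> list of gates reading that signal
    level:   gate name -> depth in the original circuit (>= 1)
    values:  signal name -> current value (mutated in place)
    changes: primary input name -> new value
    Returns the list of gates that were re-evaluated.
    """
    pending = defaultdict(set)  # level -> gates waiting at that level
    for name, new in changes.items():
        if values[name] != new:
            values[name] = new
            for g in fanout.get(name, ()):
                pending[level[g]].add(g)
    evaluated = []
    lvl = 0
    while pending:
        lvl += 1
        # All inputs of a gate at this level are final, so one evaluation suffices.
        for g in pending.pop(lvl, ()):
            fn, ins = gates[g]
            new = fn(*(values[s] for s in ins))
            evaluated.append(g)
            if new != values[g]:
                values[g] = new
                for h in fanout.get(g, ()):
                    pending[level[h]].add(h)
    return evaluated
```

Because every gate in a fan-out list sits at a strictly higher level than its source, new work is only ever scheduled ahead of the current level, and the loop touches only the altered part of the circuit plus the levels it spans.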
This paper presents parallel algorithms for determining the number of partitions of a given integer N, where the partitions may be subject to restrictions, such as being composed of distinct parts, of a given number of parts, and/or of parts belonging to a specified set. We present a series of adaptive algorithms suitable for varying numbers of processors. The fastest of these algorithms computes the number of partitions of n with largest part equal to k, for 1 ≤ k ≤ n ≤ N, in time O(log^2 N) using O(N^2 / log N) processors. Parallel logarithmic-time algorithms that generate partitions uniformly at random, using these quantities, are also presented. (C) 1996 Academic Press, Inc.
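The quantities named in this abstract, the number p(n, k) of partitions of n with largest part equal to k, satisfy the standard recurrence p(n, k) = p(n-1, k-1) + p(n-k, k) with p(0, 0) = 1. The Python sketch below fills the whole table sequentially in O(N^2) time. It illustrates what is being computed, not the parallel algorithms of the paper, and the function name `partition_table` is an assumption.

```python
def partition_table(N):
    """Return p where p[n][k] is the number of partitions of n whose
    largest part equals k, for 0 <= k <= n <= N (sequential sketch).

    Uses p(n, k) = p(n-1, k-1) + p(n-k, k), with p(0, 0) = 1.
    """
    p = [[0] * (N + 1) for _ in range(N + 1)]
    p[0][0] = 1
    for n in range(1, N + 1):
        for k in range(1, n + 1):
            # Entries with k > n - k were never written and remain 0.
            p[n][k] = p[n - 1][k - 1] + p[n - k][k]
    return p
```

Summing row n over k gives the total number of partitions of n. For instance, row 5 is 1, 2, 2, 1, 1 for k = 1..5, which sums to 7.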
Programmable logic devices (PLDs) continue to grow in size and currently contain several million gates. At the same time, research effort is going into higher-level hardware synthesis methodologies for reconfigurable computing that can exploit PLD technology. In this paper, we explore the effectiveness of, and extend, one such formal methodology in the design of massively parallel algorithms. We take a step-wise refinement approach to the development of correct reconfigurable hardware circuits from formal specifications. A functional programming notation is used for specifying algorithms and for reasoning about them. The specifications are realised through the use of a combination of function decomposition strategies, data refinement techniques, and off-the-shelf refinements based upon higher-order functions. The off-the-shelf refinements are inspired by the operators of communicating sequential processes (CSP) and map easily to programs in Handel-C (a hardware description language). The Handel-C descriptions are directly compiled into reconfigurable hardware. The practical realisation of this methodology is evidenced by a case study of the matrix multiplication algorithm, chosen because it is relatively simple and well known. In this paper, we obtain several hardware implementations with different performance characteristics by applying different refinements to the algorithm. The developed designs are compiled and tested under Celoxica's RC-1000 reconfigurable computer with its 2-million-gate Virtex-E FPGA. Performance analysis and evaluation of these implementations are included. (C) 2006 Elsevier Ltd. All rights reserved.
Reconfigurable devices, such as Field Programmable Gate Arrays (FPGAs), have seen a considerable increase in density. State-of-the-art FPGAs are complex hybrid devices that contain up to several million gates. Recently, research effort has been going into higher-level parallelization and hardware synthesis methodologies that can exploit such programmable technology. In this paper, we explore the effectiveness of one such formal methodology in the design of parallel versions of the Serpent cryptographic algorithm. The suggested methodology adopts a functional programming notation for specifying algorithms and for reasoning about them. The specifications are realized through the use of a combination of function decomposition strategies, data refinement techniques, and off-the-shelf refinements based upon higher-order functions. The refinements are inspired by the operators of Communicating Sequential Processes and map easily to programs in Handel-C (a hardware description language). In the presented research, we obtain several parallel Serpent implementations with different performance characteristics. The developed designs are tested under Celoxica's RC-1000 reconfigurable computer with its two-million-gate Virtex-E FPGA. Performance analysis and evaluation of these implementations are included.
We give an O(log^4 n)-time, O(n^2)-processor CRCW PRAM algorithm to find a Hamiltonian cycle in a strong semicomplete bipartite digraph, B, provided that a factor of B (i.e., a collection of vertex-disjoint cycles covering the vertex set of B) is computed in a preprocessing step. The factor is found (if it exists) using a bipartite matching algorithm, hence placing the whole algorithm in the class Random-NC. We show that any parallel algorithm which can check the existence of a Hamiltonian cycle in a strong semicomplete bipartite digraph in time O(r(n)) using p(n) processors can be used to check the existence of a perfect matching in a bipartite graph in time O(r(n) + n^2/p(n)) using p(n) processors. Hence, our problem belongs to the class NC if and only if perfect matching in bipartite graphs belongs to NC. We also consider the problem of finding a Hamiltonian path in a semicomplete bipartite digraph.
This paper addresses the problem of developing efficient parallel algorithms for the training procedure of a neural network-based Fingerprint Image Comparison (FIC) system. The target architecture is assumed to be a coarse-grain distributed-memory parallel architecture. Two types of parallelism, node parallelism and training set parallelism (TSP), are investigated. Theoretical analysis and experimental results show that node parallelism has low speedup and poor scalability, while TSP proves to have the best speedup performance. TSP, however, is prone to a slow convergence rate. To reduce this effect, a modified training set parallel algorithm using weighted contributions of synaptic connections is proposed. Experimental results show that this algorithm provides a fast convergence rate while keeping the best speedup performance obtained. The combination of TSP with node parallelism is also investigated. This approach achieves good performance, providing better scalability with the trade-off of a slight decrease in speedup. The above algorithms are implemented on a 32-node CM-5.
In this paper, we present some novel algorithms for scheduling hierarchical signal flow graphs in the domain of high-level synthesis. With complex chips that need to be designed in the future, it is expected that the runtimes of these scheduling algorithms will be quite large. The key contributions of this paper are as follows. First, we develop a novel extension of the sequential force-directed scheduling algorithm which naturally handles loops and conditionals by means of a scheme for scheduling hierarchical signal flow graphs. Second, we develop three new parallel algorithms for the scheduling problem. Our parallel algorithms are portable across a wide range of parallel platforms. We report results on a set of high-level synthesis benchmarks on an 8-processor SGI Origin and a 64-processor IBM SP-2. While some parallel algorithms for VLSI CAD reported by earlier researchers have shown a loss in quality of results, our parallel algorithms produce exactly the same results as the sequential algorithms on which they are based.