Experimental techniques such as X-ray crystallography and nuclear magnetic resonance have been useful for the accurate determination of RNA tertiary structures. However, high-throughput structure determination using such methods often becomes difficult, due to the need for a large quantity of pure samples. Computational techniques for the prediction of RNA tertiary structures are thus becoming increasingly popular. Most of the existing prediction algorithms are computationally intensive, and there is a clear need for acceleration. In this paper, we propose a parallelization methodology for the fragment assembly of RNA (FARNA) algorithm, one of the most effective methods for computational prediction of RNA tertiary structure. The proposed parallelization scheme exploits multi-core CPUs and GPUs in harmony to maximize their utilization. We tested our approach with a number of RNA sequences and confirmed that it allows the time required for structure prediction to be significantly reduced. With respect to the baseline architecture equipped with a single CPU core, we achieved a speedup of up to approximately 24x (roughly 4x by multi-core CPUs and 20x by GPUs). Compared with a quad-core CPU setup, the proposed approach delivers an additional 12x speedup by utilizing GPU devices. Given that most PCs these days have a multi-core CPU and a GPU card, our methodology will be very helpful for accelerating algorithms in a cost-effective manner. (C) 2013 Elsevier Ltd. All rights reserved.
Hardware accelerators are becoming increasingly important in heterogeneous systems for many applications, including those that employ matrix decompositions. In recent years, a class of tiled matrix decomposition algorithms has been proposed for out-of-memory computations and multi-core architectures, including GPU-based heterogeneous systems. However, such scalable solutions for large matrices are rarely found on FPGAs. In this paper we perform scalable, high-performance QR and LU matrix decompositions on FPGAs by using the latest tiled decomposition algorithms from high performance linear algebra for off-chip memory access, and loop mapping on multiple processing cores for on-chip computation. (C) 2012 Elsevier B.V. All rights reserved.
Multiple-Input-Multiple-Output communication systems demand fast sphere decoding with high performance. To speed up the computation, we propose a scheme with multiple fixed complexity sphere decoders to construct a parallel soft-output fixed complexity sphere decoder (PFSD). The proposed decoder is highly parallel and has performance comparable to the soft-output list fixed complexity sphere decoder (LFSD) and the K-best sphere decoder. In addition, we propose a parallel QR decomposition algorithm to lower the preprocessing overhead, and a low complexity LLR algorithm to allow parallel update of LLR values. We demonstrate that the PFSD algorithm can increase the throughput and reduce the bit error rate of a soft-output solution in a 4 x 4 16-QAM system, and has superior performance compared to other soft decoders with comparable throughput and computation complexity. The PFSD algorithm has been mapped onto a Xilinx XC4VLX160 FPGA. The resulting PFSD decoder can achieve up to 75 Mbps throughput for a 4 x 4 64-QAM configuration at 100 MHz with low control overhead.
This paper studies an inexact perturbed path-following algorithm in the framework of Lagrangian dual decomposition for solving large-scale separable convex programming problems. Unlike the exact versions considered in the literature, we propose solving the primal subproblems inexactly up to a given accuracy. This leads to an inexactness of the gradient vector and the Hessian matrix of the smoothed dual function. Then an inexact perturbed algorithm is applied to minimize the smoothed dual function. The algorithm consists of two phases, and both make use of the inexact derivative information of the smoothed dual problem. The convergence of the algorithm is analyzed, and the worst-case complexity is estimated. As a special case, an exact path-following decomposition algorithm is obtained and its worst-case complexity is given. Implementation details are discussed, and preliminary numerical results are reported.
Based on two-grid discretizations and domain decomposition, a parallel Oseen-linearized finite element algorithm for the stationary Navier-Stokes equations with a moderate or large viscosity parameter is proposed and analyzed. The key idea of the algorithm is to first solve a nonlinear problem by the Picard iterative method on a coarse grid, and then to solve an Oseen problem in parallel on a fine grid to correct the coarse grid solution. By using a local a priori error estimate for the finite element solution and under the uniqueness condition, error bounds of the corresponding finite element solution are analyzed. Numerical results are also given to demonstrate the high efficiency of the algorithm. (C) 2011 Elsevier B.V. All rights reserved.
We consider a distributed noncooperative control setting in which systems are interconnected via state constraints. Each of these systems is governed by an agent which is responsible for exchanging information with its neighbours and computing a feedback law using a nonlinear model predictive controller to avoid violations of constraints. For this setting we present an algorithm which generates a parallelizable hierarchy among the systems. Moreover, we show both feasibility and stability of the closed loop using only abstract properties of this algorithm. To this end, we utilize a trajectory based stability result which we extend to the distributed setting. (C) 2012 Elsevier B.V. All rights reserved.
A general primal-dual splitting algorithm for solving systems of structured coupled monotone inclusions in Hilbert spaces is introduced and its asymptotic behavior is analyzed. Each inclusion in the primal system features compositions with linear operators, parallel sums, and Lipschitzian operators. All the operators involved in this structured model are used separately in the proposed algorithm, most steps of which can be executed in parallel. This provides a flexible solution method applicable to a variety of problems beyond the reach of the state-of-the-art. Several applications are discussed to illustrate this point.
Given a sequence A of real numbers, we wish to find a list of all nonoverlapping contiguous subsequences of A that are maximal. A maximal subsequence M of A has the property that no proper subsequence of M has a greater sum of values. Furthermore, M may not be contained properly within any subsequence of A with this property. This problem has several applications in Computational Biology and can be solved sequentially in linear time. We present a BSP/CGM algorithm that solves this problem using p processors in O(|A|/p) time and O(|A|/p) space per processor. The algorithm uses a constant number of communication rounds of size at most O(|A|/p). Thus, the algorithm achieves linear speedup and is highly scalable. To our knowledge, there are no previously known parallel BSP/CGM algorithms to solve this problem.
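To make the problem statement above concrete, here is a minimal sequential sketch in Python. It follows the well-known stack-based approach usually attributed to Ruzzo and Tompa, which is an assumption on our part: the abstract names neither the sequential method nor the details of the BSP/CGM algorithm, and this sketch is not that parallel algorithm. The function name `maximal_subsequences` is hypothetical. For brevity the inner backward search is a plain scan, so this version is not guaranteed to run in linear time in the worst case.

```python
def maximal_subsequences(scores):
    """Return all maximal contiguous subsequences as half-open (start, end) index pairs.

    Each stack entry is (start, end, L, R), where L is the cumulative sum just
    before the segment and R is the cumulative sum at the end of the segment.
    """
    stack = []
    total = 0
    for i, s in enumerate(scores):
        before = total
        total += s
        if s <= 0:
            continue  # non-positive scores never start a new candidate
        start, end, left, right = i, i + 1, before, total
        while True:
            # Search right to left for the nearest segment with a smaller L.
            j = len(stack) - 1
            while j >= 0 and stack[j][2] >= left:
                j -= 1
            if j < 0 or stack[j][3] >= right:
                # No segment to merge with: the candidate stands on its own.
                stack.append((start, end, left, right))
                break
            # Otherwise extend the candidate leftwards to absorb segments j..top.
            start, left = stack[j][0], stack[j][2]
            del stack[j:]
    return [(a, b) for a, b, _, _ in stack]


if __name__ == "__main__":
    example = [4, -5, 3, -3, 1, 2, -2, 2, -2, 1, 5]
    print(maximal_subsequences(example))
```

On the example sequence the sketch reports three segments: the single score 4, the single score 3, and the run from index 4 to the end of the sequence.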
An urban scale Eulerian non-reactive multilayer air pollution model is proposed describing convection, turbulent diffusion and emission. A mass-consistent wind field model developed by the authors is included in the air pollution model. An Adaptive Finite Element Method with characteristics in the horizontal directions and Finite Differences in the vertical direction using splitting techniques is proposed to numerically solve the corresponding PDE problem. A parallel version of the algorithm improves the precision of the solution while keeping computation time below the real time of the simulation. A numerical example illustrates the whole problem. (C) 2013 Elsevier Ltd. All rights reserved.
Network coding helps improve communication rate and save bandwidth by performing a special coding at the sending or intermediate nodes. However, encoding and decoding at the nodes create computation overhead on large input data, which causes coding delays. Previous works therefore proposed the progressive method, which can hide decoding delay in waiting time. However, network speed has increased greatly, and progressive schemes are no longer the most efficient decoding method. We therefore present a non-progressive decoding algorithm that can be parallelized more aggressively than progressive network coding; by utilizing multi-core processors, it offsets the advantage that progressive methods gain from hidden decoding time. Moreover, the block algorithm implemented by non-progressive decoding helps to reduce cache misses. In experiments on multi-core systems, our scheme, which relies on matrix inversion and multiplication, improves execution time by 46.0% and reduces last-level cache misses by 89.2% compared to the progressive method. (C) 2012 Elsevier Ltd. All rights reserved.
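The non-progressive idea described above (wait for all coded blocks, invert the coefficient matrix once, then multiply) can be illustrated with a small Python sketch. Several things here are assumptions for illustration only: the abstract does not state the finite field, so the prime field of size 257 is a stand-in chosen for simple arithmetic, and the function names `invert_mod`, `matmul_mod`, and `decode` are hypothetical. The sketch is sequential and does not reproduce the paper's parallel or cache-blocked implementation.

```python
P = 257  # assumed prime field size; the abstract does not name the field


def invert_mod(matrix, p=P):
    """Invert a square matrix over the integers modulo p by Gauss-Jordan elimination."""
    n = len(matrix)
    # Augment the matrix with the identity: [M | I].
    aug = [[x % p for x in row] + [int(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("coefficient matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        inv = pow(aug[col][col], -1, p)  # modular inverse (Python 3.8+)
        aug[col] = [(x * inv) % p for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col]:
                factor = aug[r][col]
                aug[r] = [(a - factor * b) % p for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def matmul_mod(a, b, p=P):
    """Multiply two matrices modulo p."""
    return [[sum(x * y for x, y in zip(row, col)) % p for col in zip(*b)]
            for row in a]


def decode(coeffs, coded, p=P):
    """Non-progressive decode: one inversion, then one matrix multiplication."""
    return matmul_mod(invert_mod(coeffs, p), coded, p)


if __name__ == "__main__":
    source = [[10, 20, 30], [40, 50, 60]]  # two source blocks of three symbols
    coeffs = [[1, 2], [3, 4]]              # coding coefficients, one row per coded block
    coded = matmul_mod(coeffs, source)     # what the receiver collects
    print(decode(coeffs, coded) == source)
```

Because the multiplication step treats each output row and column independently, it is the part that lends itself most naturally to splitting across cores, which is consistent with the abstract's claim about parallelization.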