We are concerned with mapping onto high-performance hybrid architectures a parallel software package that implements a two-level overlapping domain decomposition, that is, a decomposition along the space and time directions, of the four-dimensional variational data assimilation model. The reference architecture belongs to the SCoPE (Sistema Cooperativo Per Elaborazioni scientifiche multidisciplinari) data center, located at the University of Naples Federico II. We consider the initial boundary value problem of the shallow water equation and analyse both strong and weak scaling. With the efficiency always greater than 60% and about 90% in most cases, we find experimentally that the isoefficiency function grows slightly faster than linearly with the number of processes. The results, obtained with the Parallel Computing Toolbox of MATLAB R2013a, agree with the algorithm's performance prediction based on the scale-up factor, confirming that the algorithm is appropriately mapped onto the hybrid architecture.
Let U be a given set of nodes of a parallel computer system, and assume that each node u in U holds a piece of information t(u) called a token. This paper discusses the problem in which each node u in U broadcasts its token t(u) to all nodes in U. We refer to this problem as the group-gossiping problem, which includes the (conventional) gossiping problem as a special case. We consider the group-gossiping problem in n-cubes under a circuit-switching model and propose an optimal group-gossiping algorithm for n-cubes under that model.
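To make the problem statement concrete, the following is a minimal sketch of the classical dimension-exchange scheme on an n-cube, in which every node swaps its known tokens with its neighbour across one dimension per round. It assumes a simple store-and-forward exchange model, not the circuit-switching model of the abstract, and it is not the optimal algorithm the paper proposes. The function name `hypercube_group_gossip` and the use of node identifiers as tokens are illustrative assumptions.

```python
def hypercube_group_gossip(n, group):
    """Simulate dimension-exchange gossiping on an n-cube.

    Nodes are numbered 0 .. 2**n - 1. Each node u in `group` starts with
    one token, represented here by its own identifier (an assumption made
    for illustration). Returns the set of tokens known to every node
    after n exchange rounds.
    """
    size = 1 << n
    if any(not 0 <= u < size for u in group):
        raise ValueError("group contains a node outside the n-cube")

    # Nodes outside the group start with no token.
    known = [{u} if u in group else set() for u in range(size)]

    for d in range(n):
        # In round d, node u and its neighbour across dimension d
        # (u with bit d flipped) merge what they know.
        known = [known[u] | known[u ^ (1 << d)] for u in range(size)]
    return known
```

After n rounds every node of the cube, and in particular every node of the group, holds all tokens of the group. The conventional gossiping problem corresponds to the special case where the group is the whole node set.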
Economic and social development has made financial engineering an increasingly important research area, and a growing number of financial problems cannot be solved directly by analytical formulas. In view of this, algorithms that apply computer technology to financial engineering have emerged. In this study, the Backward Stochastic Differential Equation (BSDE) algorithm is used to investigate and analyse the problem of option pricing in finance. The GBSDE-Theta parallel algorithm, which combines the BSDE-Theta algorithm with a GPU algorithm, is used to establish a computing model for financial engineering that applies to the calculation of enterprise option pricing. The results show that, compared with the basic algorithm, the option prices obtained with the GBSDE-Theta parallel algorithm match the actual option values more closely. With N = 128 time steps and 80,000 simulated paths, the computational model achieves a speedup of about 230 over the serial version. Regarding the relative error of the GBSDE-Theta algorithm, 80 points lie within 3% and only 16 points exceed 3%, which is a relatively small error. These results show that the financial computing system obtained in this study is feasible and effective, and that it can offer a new research direction for other computations in the financial field.
The article considers the conforming identification of the fundamental matrix in the image matching problem. The method divides the initial overdetermined system into lower-dimensional subsystems. On these subsystems, a set of solutions is obtained, from which a subset of the most conforming solutions is selected. The resulting solution is then deduced from this subset. Since the subsystems are formed from all possible combinations of rows of the initial system, the method demonstrates high accuracy and stability, although it is computationally expensive. A comparison is made with the methods of least squares, least absolute deviations, and RANSAC.
Background: The secondary structure that maximizes the number of non-crossing matchings between complementary bases of an RNA sequence of length n can be computed in O(n^3) time using Nussinov's dynamic programming algorithm. The Four-Russians method is a technique that reduces the running time of certain dynamic programming algorithms by a multiplicative factor after a preprocessing step in which solutions to all smaller subproblems of a fixed size are exhaustively enumerated and solved. Frid and Gusfield designed an O(n^3/log n) algorithm for RNA folding using the Four-Russians technique. In their algorithm the preprocessing is interleaved with the algorithm computation. Theoretical results: We simplify the algorithm and the analysis by doing the preprocessing once, prior to the algorithm computation. We call this the two-vector method. We also show variants in which, instead of exhaustive preprocessing, we solve only the subproblems encountered in the main algorithm, once each, and memoize the results. We give a simple proof of correctness and explore the practical advantages over the earlier method. The Nussinov algorithm admits an O(n^2) time parallel algorithm. We show a parallel algorithm using the two-vector idea that improves the time bound to O(n^2/log n). Practical results: We have implemented the parallel algorithm on graphics processing units using the CUDA platform. We discuss the organization of the data structures to exploit coalesced memory access for fast running times. The ideas used to organize the data structures also help to improve the running time of the serial algorithms. For sequences of length up to 6000 bases, the parallel algorithm takes only about 2.5 seconds, and the two-vector serial method takes about 57 seconds on a desktop and 15 seconds on a server. Among the serial algorithms, the two-vector and memoized versions are faster than the Frid-Gusfield algorithm by a factor of 3, and are faster than Nussinov by up to a factor of 20. The source-code f
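For orientation, the baseline cubic-time recurrence that the abstract starts from can be sketched as follows. This is a plain Nussinov-style dynamic program only; it does not include the Four-Russians, two-vector, memoized, or CUDA variants described above. It assumes Watson-Crick pairs only (A-U and C-G) and no minimum hairpin loop length; the function name and the `pairs` parameter are illustrative.

```python
def nussinov(seq, pairs=frozenset({("A", "U"), ("U", "A"),
                                   ("C", "G"), ("G", "C")})):
    """Maximum number of non-crossing base pairs in `seq`, in O(n^3) time.

    Assumption: only the pairs listed in `pairs` are allowed and no
    minimum loop length is enforced.
    """
    n = len(seq)
    if n == 0:
        return 0

    # dp[i][j] = best number of pairs within seq[i..j] (inclusive).
    dp = [[0] * n for _ in range(n)]

    for span in range(1, n):              # j - i, from short to long
        for i in range(n - span):
            j = i + span
            best = dp[i + 1][j]           # case 1: base i stays unpaired
            for k in range(i + 1, j + 1):  # case 2: base i pairs with k
                if (seq[i], seq[k]) in pairs:
                    inside = dp[i + 1][k - 1] if k - 1 >= i + 1 else 0
                    after = dp[k + 1][j] if k + 1 <= j else 0
                    best = max(best, 1 + inside + after)
            dp[i][j] = best
    return dp[0][n - 1]
```

The three nested loops over `span`, `i`, and `k` give the O(n^3) bound; the inner loop over `k` is the part that Four-Russians style methods speed up by a logarithmic factor.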
ISBN: (print) 9780897919845
Consider the selection problem of determining the kth smallest element of a sequence of n elements. Under the CGM (Coarse Grained Multicomputer) model with p processors and O(n/p) local memory, we present a deterministic parallel algorithm for the selection problem that requires O(log p) communication rounds. Besides requiring a low number of communication rounds, the algorithm also attempts to minimize the total amount of data transmitted in each round (only O(p), except in the last round). The basic algorithm is then extended to solve the problem of q simultaneous selections on the same input sequence, also in O(log p) communication rounds and with asymptotically the same local computing time (if q = O(p)). The simultaneous selection algorithm gives rise to a communication-efficient sorting algorithm, with O(log p) communication rounds and a total of O(p^2) data transmitted in each round except the last one. In addition to the theoretical complexities, we present very promising experimental results obtained on two parallel machines that show almost linear speedup, indicating the efficiency and scalability of the proposed algorithms. To our knowledge, this is the best deterministic CGM algorithm in the literature for the selection problem.
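The flavour of a round-based selection scheme can be illustrated with a sequential simulation. The sketch below uses the well-known weighted median of local medians as a pivot: in each round every simulated processor contributes only a constant amount of data (its local median, its size, and two counts), and once the surviving elements fit in one local memory of size about n/p they are gathered and sorted. This is an assumption-laden illustration, not necessarily the algorithm of the paper; the name `cgm_select` and the gathering threshold are hypothetical choices.

```python
def cgm_select(parts, k):
    """Return the k-th smallest element (1-based) over all lists in `parts`.

    `parts[i]` plays the role of the data held by processor i.
    """
    parts = [list(q) for q in parts]
    n = sum(len(q) for q in parts)
    if not 1 <= k <= n:
        raise ValueError("k out of range")
    threshold = max(1, n // len(parts))   # roughly one local memory

    while sum(len(q) for q in parts) > threshold:
        # Step 1: each processor reports (local median, local size).
        summaries = sorted((sorted(q)[len(q) // 2], len(q))
                           for q in parts if q)
        total = sum(w for _, w in summaries)

        # Step 2: the pivot is the weighted median of the local medians.
        acc = 0
        for median, weight in summaries:
            acc += weight
            if 2 * acc >= total:
                pivot = median
                break

        # Step 3: each processor reports how many of its elements are
        # smaller than, or equal to, the pivot.
        less = sum(sum(1 for x in q if x < pivot) for q in parts)
        equal = sum(sum(1 for x in q if x == pivot) for q in parts)

        # Step 4: discard the side that cannot contain the answer.
        if k <= less:
            parts = [[x for x in q if x < pivot] for q in parts]
        elif k <= less + equal:
            return pivot
        else:
            parts = [[x for x in q if x > pivot] for q in parts]
            k -= less + equal

    # Last round: gather the few remaining elements and finish locally.
    return sorted(x for q in parts for x in q)[k - 1]
```

Because the pivot is always an element of the data, every round removes at least one element, so the loop terminates; the weighted median additionally ensures that a constant fraction of the remaining elements is discarded per round.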
Stroke-based rendering is a rendering method that mimics the actual painting technique by drawing stroke by stroke on a blank canvas image. In this paper, we propose a watercolor image generation method using stroke-based rendering. By repeatedly painting strokes while referring to the input image, the proposed method generates an image that is a good approximation of the input image and also has the characteristics of a watercolor painting. To generate a high-quality image, that is, an image that closely resembles an actual watercolor painting, various techniques are employed, including modeling of watercolor paper, detailed physical simulation of the movement of water and pigment, and strokes produced with a brush model. The proposed method generates a large number of strokes and performs a computationally intensive watercolor simulation for each stroke. Therefore, this paper also presents a parallel algorithm for it using a Graphics Processing Unit (GPU). We implemented this parallel algorithm on an NVIDIA A100 GPU. The experimental results show that the CPU implementations with sequential and parallel execution take 34,651 s and 867 s, respectively, to generate a 4K watercolor image of size 3840 × 2144. In contrast, the GPU implementation with parallel execution reduces the time to 44 s.
The knapsack problem is a typical NP-complete problem with 2^n possible solutions to search over. A knapsack instance can therefore be solved in 2^n trials if an exhaustive search is applied. In the past decade, much effort has been devoted to reducing the computation time of this problem below that of exhaustive search. In 1984, Karnin proposed a brilliant parallel algorithm that needs O(2^(n/6)) processors to solve the knapsack problem in O(2^(n/2)) time; that is, the cost of Karnin's parallel algorithm is O(2^(2n/3)). In this paper, we propose a fast search technique that improves Karnin's parallel algorithm by reducing its search time complexity to O(2^(n/3)) with the same O(2^(n/6)) processors. Thus, the cost of the proposed parallel algorithm is O(2^(n/2)). Furthermore, we extend this search technique to the case in which the number of available processors is P = O(2^x), where x ≥ 1. The analytical results show that our search technique is superior to the previously proposed methods. We believe the proposed parallel algorithm is practically feasible now that multiprocessor systems are becoming more and more popular.
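The two cost figures quoted above follow from the usual definition of the cost of a parallel algorithm as the number of processors multiplied by the running time. Written out, with the subscripts K (Karnin) and new (proposed) introduced here only as labels:

```latex
\begin{align*}
C_{\mathrm{K}}   &= P \cdot T_{\mathrm{K}}
                  = O\!\left(2^{n/6}\right) \cdot O\!\left(2^{n/2}\right)
                  = O\!\left(2^{n/6 + n/2}\right)
                  = O\!\left(2^{2n/3}\right), \\
C_{\mathrm{new}} &= P \cdot T_{\mathrm{new}}
                  = O\!\left(2^{n/6}\right) \cdot O\!\left(2^{n/3}\right)
                  = O\!\left(2^{n/6 + n/3}\right)
                  = O\!\left(2^{n/2}\right).
\end{align*}
```

With the processor count held fixed, the whole improvement in cost, a factor of 2^(n/6), comes from the shorter search time.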
The integral image can be used to quickly carry out common pixel-level operations over regular regions of a grey-level image, so it is widely used in computer vision and pattern recognition. In this paper, we first present an intuitive parallel method for computing the integral image. Building on the intuitive method, we then introduce a two-stage method based on a binary tree. In each stage of the algorithm, we perform a top-down traversal of the tree followed by a bottom-up traversal. Finally, we analyze the case of large-scale grey-level images and optimize the computation for the CUDA architecture. Experiments on consumer-level PC hardware show that the GPU-based algorithm outperforms the corresponding CPU-based algorithm in speed for large-scale images.
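As a reminder of what is being computed, here is a minimal sequential sketch of the integral image and of the constant-time region sum it enables. It builds the table in two passes, prefix sums along each row and then down each column; the passes are written this way because the rows (and then the columns) are independent of one another, which is the natural starting point for parallelization. It is not the tree-based two-stage method or the CUDA implementation of the paper, and the function names are illustrative.

```python
from itertools import accumulate


def integral_image(img):
    """Return ii with ii[y][x] = sum of img[0..y][0..x] (inclusive)."""
    # Pass 1: prefix sums along every row (rows are independent).
    ii = [list(accumulate(row)) for row in img]
    # Pass 2: prefix sums down every column (columns are independent).
    for y in range(1, len(ii)):
        ii[y] = [cur + above for cur, above in zip(ii[y], ii[y - 1])]
    return ii


def box_sum(ii, top, left, bottom, right):
    """Sum of the original pixels in rows top..bottom and columns
    left..right (all inclusive), using at most four table lookups."""
    total = ii[bottom][right]
    if top > 0:
        total -= ii[top - 1][right]
    if left > 0:
        total -= ii[bottom][left - 1]
    if top > 0 and left > 0:
        total += ii[top - 1][left - 1]
    return total
```

For example, with the 3 × 3 image `[[1, 2, 3], [4, 5, 6], [7, 8, 9]]`, the sum of the lower-right 2 × 2 block is obtained from four entries of the table instead of four pixel reads, and the same four lookups suffice for a region of any size.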
A new dynamic data structure has been proposed recently in *** are several algorithms for matrix *** none of them has used r-train data structure for storing and multiplying the *** this paper algorithm for matrix multiplication using r-train for parallel machine has been proposed.