Microprocessor clock rates, which for three decades doubled about every 18 months, have essentially stopped increasing. Instead, the number of processor cores (identical processing units capable of all usual microprocessor functions) in a microprocessor is increasing exponentially with time. To increase performance as the number of cores increases, measurement analysis software will have to take advantage of this parallelism. The objectives of this paper are to study one example of a measurement analysis having serial dependencies among the input data and to show that there is a practical parallel algorithm despite the data dependencies within the measured time series. The measurement analysis studied is transition localization in digital signals. A parallel scan-type algorithm is presented. The results of applying the parallel algorithm to both synthetic data and actual measured data are presented, and the speedup obtained on a twenty-four-core computer is analyzed. The parallel method produces exactly the same measurement results, bit for bit, as the original serial method. It is argued that what is desired for this and many other measurement processing algorithms is scalability in throughput with the number of cores. Such scalability is achieved by the proposed algorithm, with throughput scaling up to about a dozen cores.
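To illustrate how a scan-type formulation can remove a serial dependency, the sketch below uses a hypothetical stand-in for transition localization: a two-state hysteresis comparator whose state at each sample depends on the previous sample's state. It is not the algorithm from the paper. The thresholds `lo` and `hi`, the function names, and the chunking scheme are all assumptions made for illustration. Each chunk is first summarized as a map from entry state to exit state, a short serial pass chains those maps, and the chunks are then processed independently with known entry states.

```python
from concurrent.futures import ThreadPoolExecutor

LOW, HIGH = 0, 1

def step(state, x, lo, hi):
    """One comparator step: the state flips only when x reaches the far threshold."""
    if state == LOW and x >= hi:
        return HIGH
    if state == HIGH and x <= lo:
        return LOW
    return state

def summarize(chunk, lo, hi):
    """Transfer map of a chunk: (exit state if entered LOW, exit state if entered HIGH)."""
    exits = []
    for s in (LOW, HIGH):
        for x in chunk:
            s = step(s, x, lo, hi)
        exits.append(s)
    return tuple(exits)

def transitions(chunk, offset, state, lo, hi):
    """List (global index, new state) for every state change inside one chunk."""
    found = []
    for i, x in enumerate(chunk):
        new = step(state, x, lo, hi)
        if new != state:
            found.append((offset + i, new))
        state = new
    return found

def serial_transitions(samples, lo, hi, start=LOW):
    return transitions(samples, 0, start, lo, hi)

def parallel_transitions(samples, lo, hi, n_chunks=4, start=LOW):
    size = max(1, -(-len(samples) // n_chunks))  # ceiling division
    offsets = list(range(0, len(samples), size))
    chunks = [samples[o:o + size] for o in offsets]
    with ThreadPoolExecutor() as pool:
        # Pass 1 (parallel): summarize each chunk without knowing its entry state.
        maps = list(pool.map(lambda c: summarize(c, lo, hi), chunks))
        # Pass 2 (serial, one step per chunk): chain the maps to get entry states.
        entry, s = [], start
        for m in maps:
            entry.append(s)
            s = m[s]
        # Pass 3 (parallel): locate transitions in each chunk from its entry state.
        parts = pool.map(lambda a: transitions(a[0], a[1], a[2], lo, hi),
                         zip(chunks, offsets, entry))
        return [t for part in parts for t in part]
```

Because every chunk is replayed with exactly the state the serial pass would have had at that point, the result matches the serial one exactly. The thread pool only shows the structure; in CPython a process pool or native code would be needed for real speedup.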
This study proposes a novel factorization method for the DCT IV algorithm that allows for breaking it into four or eight sections that can be run in parallel. Moreover, the arithmetic complexity has been significantly reduced. Based on the proposed new algorithm for DCT IV, the speed performance has been improved substantially. The performance of this algorithm was verified using two different GPU systems produced by NVIDIA. The experimental results show that the proposed DCT algorithm achieves a substantial reduction in the total processing time. The proposed method is very efficient, improving the algorithm speed by more than the factor of 4 that was expected from segmenting the DCT algorithm into four sections running in parallel. The speed improvements are about five times higher (at least 5.41 times on Jetson AGX Xavier and 10.11 times on Jetson Orin Nano) compared with the classical sequential implementation of DCT IV. Using a parallel formulation with eight sections running in parallel, the improvement in speed performance is even higher: at least 8.08 times on Jetson AGX Xavier and 11.81 times on Jetson Orin Nano.
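For reference, the DCT IV of a length-N sequence is X_k = sum over n of x_n cos(pi/N (n + 1/2)(k + 1/2)). The sketch below evaluates that definition directly and splits the output indices into independent sections. It is only an illustration of sectioned evaluation: it is not the factorization proposed in the study, and it does not reduce the O(N^2) arithmetic cost. The function names and the `sections` parameter are assumptions.

```python
import math
from concurrent.futures import ThreadPoolExecutor

def dct4_section(x, k_start, k_stop):
    """Directly evaluate DCT IV outputs X_k for k in [k_start, k_stop)."""
    n = len(x)
    return [
        sum(x[j] * math.cos(math.pi / n * (j + 0.5) * (k + 0.5)) for j in range(n))
        for k in range(k_start, k_stop)
    ]

def dct4(x, sections=4):
    """DCT IV with the output range split into independent sections."""
    n = len(x)
    # Section boundaries over the output index k; the sections share no state.
    bounds = [n * s // sections for s in range(sections + 1)]
    with ThreadPoolExecutor(max_workers=sections) as pool:
        parts = pool.map(lambda b: dct4_section(x, b[0], b[1]),
                         zip(bounds, bounds[1:]))
        return [v for part in parts for v in part]
```

A convenient sanity check is that the unnormalized DCT IV is its own inverse up to a factor of 2/N, so applying `dct4` twice and scaling by 2/N should reproduce the input.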
A nodal-based finite element formulation coupled with absorbing boundary conditions has been developed to solve open boundary microwave problems. Only parallel computation makes it possible to model large devices. We show in this paper how the code has been implemented on a parallel shared memory computer. Each step of the code is analyzed. In particular, two types of storage for the matrix and two preconditioning methods for the conjugate gradient algorithm are compared.
Plasma particle simulations are used extensively for the study of nonlinear phenomena in both space and laboratory plasmas. Here, a well-benchmarked plasma simulation code has been implemented on the 32-node JPL Mark III hypercube to study the applicability of parallel architecture to particle simulation models. In the sequential version of the code, about 90% of the computation time is spent updating the particle positions and velocities. When implemented in parallel on the Mark III Hypercube, this part of the code was sped up by a factor of about 27 (83% efficiency). Computation times on the Mark III have also been compared with times on a variety of other computers.
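The quoted figures can be related with two standard formulas: parallel efficiency is speedup divided by node count, and Amdahl's law bounds the whole-code speedup when only part of the run time is accelerated. The sketch below applies both. It assumes that the remaining 10% of the run time is not sped up at all, which the abstract does not state, so the whole-code figure is an illustration and not a reported result. The function names are hypothetical.

```python
def efficiency(speedup, nodes):
    """Parallel efficiency: achieved speedup as a fraction of the node count."""
    return speedup / nodes

def overall_speedup(parallel_fraction, section_speedup):
    """Amdahl's law: whole-code speedup when only one fraction is accelerated."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / section_speedup)

# Rounded figures from the abstract: 32 nodes, particle update sped up about 27x.
print(efficiency(27, 32))          # 0.84375; the quoted 83% corresponds to about 26.6x
# Assumption: the other 10% of the run time stays at its sequential speed.
print(overall_speedup(0.90, 27))   # about 7.5
```

Under that assumption, the part of the code that is not parallelized dominates the total run time once the particle update is 27 times faster.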
In tree-based adaptive mesh refinement, elements are partitioned between processes using a space-filling curve. The curve establishes an ordering between all elements that derive from the same root element, the tree. When representing complex geometries by connecting several trees, the roots of these trees form an unstructured coarse mesh. We present an algorithm to partition the elements of the coarse mesh such that (a) the fine mesh can be load-balanced to equal element counts per process regardless of the element-to-tree map, and (b) each process that holds fine mesh elements has access to the meta data of all relevant trees. As an additional feature, the algorithm partitions the meta data of relevant ghost (halo) trees as well. We develop in detail how each process computes the communication pattern for the partition routine without handshaking and with minimal data movement. We demonstrate the scalability of this approach on up to 917e3 MPI ranks and 371e9 coarse mesh elements, measuring run times of one second or less.
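The two requirements (a) and (b) can be made concrete with a small sketch. It assumes the fine elements are already ordered along the space-filling curve, tree by tree, and that only the number of elements per tree is known. It cuts the curve into contiguous, nearly equal ranges and then works out which trees each process needs. The function names are hypothetical, and the sketch leaves out ghost trees and the handshake-free communication pattern the abstract describes.

```python
from bisect import bisect_right
from itertools import accumulate

def element_range(rank, n_ranks, n_elements):
    """Half-open range [first, last) of curve indices owned by one rank."""
    first = rank * n_elements // n_ranks
    last = (rank + 1) * n_elements // n_ranks
    return first, last

def trees_for_rank(rank, n_ranks, elements_per_tree):
    """Indices of the coarse-mesh trees that contain this rank's elements."""
    # offsets[t] is the curve index of the first element of tree t.
    offsets = [0] + list(accumulate(elements_per_tree))
    first, last = element_range(rank, n_ranks, offsets[-1])
    if first == last:
        return []  # a rank without elements needs no tree metadata
    t_first = bisect_right(offsets, first) - 1
    t_last = bisect_right(offsets, last - 1) - 1
    return list(range(t_first, t_last + 1))

# Three trees holding 5, 3 and 8 fine elements, split across 4 ranks.
for r in range(4):
    print(r, element_range(r, 4, 16), trees_for_rank(r, 4, [5, 3, 8]))
```

In this example every rank owns four elements, while tree 0 is needed by ranks 0 and 1 and tree 2 by ranks 2 and 3. This is the situation in which one tree's metadata has to be available on more than one process.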
This paper formulates an incomplete projection algorithm that is applied to the image recovery problem. The algorithm allows an easy implementation of dynamic load balancing for parallel architectures. Furthermore, the local computation-to-communication load ratio can be adjusted, since each processor performs a finite number of iterations of any projection-type technique, and this number can be provided as a parameter of the algorithm. Numerical results compare favorably with those obtained by the extrapolated method of parallel subgradient projections.
Determining the inner organizational structure of sets of networked elements is of paramount importance for analyzing real-world systems such as social, biological, or economic networks. To this end, it is necessary to identify communities of interrelated nodes within the networks. Recently, a fuzzy community detection approach based on the minimization of a topological error functional has been proposed in the form of a gradient-based algorithm design pattern. However, the intrinsic quadratic algorithmic complexity of the procedure limits the problem size that can be efficiently treated. Here, we extend this approach to larger networks by resorting to parallelism. To do so, we identify the sources of concurrency in the gradient-based algorithm design pattern. To determine the parallelization limits, we develop a two-dimensional performance model as a function of the number of processors and network size. The model makes it possible to compute the maximum possible speedup. Another model is presented to find the maximum problem size tractable in a given amount of time. Application of these models to a set of benchmark networks shows that parallelization improves the proposed fuzzy community detection approach by more than an order of magnitude. This allows treatment of networks with several hundred thousand nodes in a time frame of hours.
In this paper we present a parallel implementation of a well-known heuristic optimisation algorithm (the downhill simplex algorithm developed by Nelder and Mead in 1965) which is well suited for unconstrained optimisation. We present the sequential algorithm as well as the parallel algorithm which we used to generate numerical results. These include results of experiments on neural networks and on a test suite of functions, which demonstrate the parallel algorithm's increased robustness and convergence rate for high-dimensional problems compared to the sequential algorithm. (C) 1998 John Wiley & Sons, Ltd.
International Journal of Parallel, Emergent and Distributed Systems is celebrating its 25th volume. IJPEDS is the continuation of the journal Parallel Algorithms and Applications which existed from 1993 to 2004. Parallel Algorithms and Applications was founded by the late David J. Evans, who served as Editor-in-Chief until 1996. Graham Megson (his former student) served as Editor-in-Chief of PAA from 1996 to 2004. They deserve credit for founding it and for their excellent stewardship of the journal in their roles. I was pleased to serve as Associate Editor/EIC of PAA from the beginning in 1992 until 2004 and Regional Editor thereafter. From 2005 (starting with volume 20), the journal was renamed International Journal of Parallel, Emergent and Distributed Systems, expanding its scope (the new scope includes the areas of emergent and distributed systems, algorithms, architectures and applications), and I took over as Editor-in-Chief. It is my honour and pleasure to serve this journal since it was established and to lead it during the last six years.
Monte-Carlo Tree Search (MCTS) is an adaptive and heuristic tree-search algorithm designed to uncover sub-optimal actions at each decision-making point. This method progressively constructs a search tree by gathering samples throughout its execution. Predominantly applied within the realm of gaming, MCTS has exhibited exceptional achievements. Additionally, it has displayed promising outcomes when employed to solve NP-hard combinatorial optimization problems. MCTS has been adapted for distributed-memory parallel platforms. The primary challenges associated with distributed-memory parallel MCTS are the substantial communication overhead and the necessity to balance the computational load among various processes. In this work, we introduce a novel distributed-memory parallel MCTS algorithm with partial backpropagations, referred to as Parallel Partial-Backpropagation MCTS (PPB-MCTS). Our design approach aims to significantly reduce the communication overhead while maintaining, or even slightly improving, the performance in the context of combinatorial optimization problems. To address the communication overhead challenge, we propose a strategy involving transmitting an additional backpropagation message. This strategy avoids attaching an information table to the communication messages exchanged by the processes, thus reducing the communication overhead. Furthermore, this approach contributes to enhancing the decision-making accuracy during the selection phase. The load balancing issue is also effectively addressed by implementing a shared transposition table among the parallel processes. Furthermore, we introduce two primary methods for managing duplicate states within distributed-memory parallel MCTS, drawing upon techniques utilized in addressing duplicate states within sequential MCTS. Duplicate states can transform the conventional search tree into a Directed Acyclic Graph (DAG). To evaluate the performance of our proposed parallel algorithm, we conduct