We show that the complete binary tree with n > 8 leaves can be embedded in the hypercube with n nodes such that: paths of the tree are mapped onto edge-disjoint paths of the hypercube, at most two tree nodes (one of which is a leaf) are mapped onto each hypercube node, and the maximum distance from a leaf to the root of the tree is log_2 n + 1 hypercube edges (which is optimally short). This embedding facilitates efficient implementation of many P-RAM algorithms on the hypercube.
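For orientation only, the sketch below shows a different and much simpler classical construction, not the embedding of this abstract: labelling the 2^k - 1 nodes of a complete binary tree by inorder rank and reading each label as a k-bit hypercube address places every tree edge within Hamming distance 2. The function names `inorder_tree_edges` and `hamming` are illustrative assumptions.

```python
def inorder_tree_edges(k):
    """Edges (parent, child) of the complete binary tree with 2**k - 1 nodes.

    Nodes are named by their inorder rank 1 .. 2**k - 1; the rank, read as a
    k-bit string, is taken as the hypercube address (an assumed convention).
    """
    edges = []
    for h in range(1, k):                    # h = height of the parent (leaves have height 0)
        step = 1 << h                        # parents at height h are odd multiples of 2**h
        for parent in range(step, 1 << k, 2 * step):
            edges.append((parent, parent - step // 2))  # left child
            edges.append((parent, parent + step // 2))  # right child
    return edges


def hamming(a, b):
    """Number of hypercube edges on a shortest path between addresses a and b."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    k = 4
    edges = inorder_tree_edges(k)
    print(max(hamming(p, c) for p, c in edges))  # longest image of a tree edge
```

Right-child edges differ from the parent in one bit and left-child edges in two, so this labelling uses one tree node per hypercube node but does not give the edge-disjoint, two-nodes-per-vertex properties claimed above.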
With the development of concurrent computing architectures that promise a cost-effective means of obtaining supercomputing performance, there is much interest in applying them to large, computationally intensive problems and in evaluating their actual performance. Of particular interest is the concurrent performance of large-scale electromagnetic scattering problems. Two electromagnetic codes with differing underlying algorithms have been converted to run on the Mark III Hypercube. One is a time-domain finite-difference solution of Maxwell's equations for the scattered fields; the other is a frequency-domain moment-method solution. Important measures for demonstrating the utility of the parallel architecture are the size of the problem that can be solved and the efficiency with which parallelization increases the speed of execution.
In industrial numerical simulations, efficiently generating high-quality tetrahedral meshes remains a significant challenge. Advances in high-performance computing have made parallelization a practical approach to improving the quality of large-scale tetrahedral meshes. This study proposes a fine-grained multithreaded parallel method to accelerate tetrahedral mesh improvement. By utilizing atomic operations, we fundamentally address thread safety concerns. Additionally, through the precise use of atomic operations, task decomposition strategies, and a multithreaded memory model, we minimize the probability of task overlap and data races, thereby enhancing overall parallel mesh improvement efficiency. Experimental results demonstrate that our parallel mesh improver is robust and effective for complex industrial models. On a laptop with 16 threads, we achieved a tenfold increase in tetrahedral mesh improvement speed, with the quality of the improved meshes being comparable to that of the sequential process.
We present BiqBin, an exact solver for linearly constrained binary quadratic problems. Our approach is based on an exact penalty method to first efficiently transform the original problem into an instance of Max-Cut, and then to solve the Max-Cut problem by a branch-and-bound algorithm. All the main ingredients are carefully developed using new semidefinite programming relaxations obtained by strengthening the existing relaxations with a set of hypermetric inequalities, applying the bundle method as the bounding routine and using new strategies for exploring the branch-and-bound tree. Furthermore, an efficient C implementation of a sequential and a parallel branch-and-bound algorithm is presented. The latter is based on a load coordinator-worker scheme using MPI for multi-node parallelization and is evaluated on a high-performance computer. The new solver is benchmarked against BiqCrunch, GUROBI, and SCIP on four families of (linearly constrained) binary quadratic problems. Numerical results demonstrate that BiqBin is a highly competitive solver. The serial version outperforms the other three solvers on the majority of the benchmark instances. We also evaluate the parallel solver and show that it has good scaling properties. The general audience can use it as an on-line service available at http://***.
This paper presents a parallel adaptive clustering (PAC) algorithm to automatically classify data while simultaneously choosing a suitable number of classes. Clustering is an important tool for data analysis and understanding in a broad set of areas including data reduction, pattern analysis, and classification. However, the requirement to specify the number of clusters in advance and the computational burden associated with clustering large sets of data persist as challenges in clustering. We propose a new parallel adaptive clustering (PAC) algorithm that addresses these challenges by adaptively computing the number of clusters and leveraging the power of parallel computing. The algorithm clusters disjoint subsets of the data on parallel computation threads. We develop regularized set k-means to efficiently cluster the results from the parallel threads. A refinement step further improves the clusters. The PAC algorithm offers the capability to adaptively cluster data sets which change over time by reusing the information from previous time steps to decrease computation. We provide theoretical analysis and numerical experiments to characterize the performance of the method, validate its properties, and demonstrate the computational efficiency of the method.
The problem of identifying intersections between two sets of d-dimensional axis-parallel rectangles appears frequently in the context of agent-based simulation studies. For this reason, the High Level Architecture (HLA) specification, a standard framework for interoperability among simulators, includes a Data Distribution Management (DDM) service whose responsibility is to report all intersections between a set of subscription regions and a set of update regions. The algorithms at the core of the DDM service are CPU-intensive and could greatly benefit from the large computing power of modern multi-core processors. In this article, we propose two parallel solutions to the DDM problem that can operate effectively on shared-memory multiprocessors. The first solution is based on a data structure (the interval tree) that allows concurrent computation of intersections between subscription and update regions. The second solution is based on a novel parallel extension of the Sort Based Matching algorithm, whose sequential version is considered among the most efficient solutions to the DDM problem. Extensive experimental evaluation of the proposed algorithms confirms their effectiveness in taking advantage of multiple execution units in a shared-memory architecture.
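To make the matching problem concrete, here is a minimal sequential sketch for the one-dimensional case, assuming closed intervals given as `(lo, hi)` pairs. It is a generic sorted-endpoint sweep in the spirit of sort-based matching, not the article's algorithm or its parallel extension; the name `match_intervals` and the event encoding are assumptions.

```python
def match_intervals(subs, upds):
    """Return the set of (sub_index, upd_index) pairs whose closed intervals overlap.

    subs, upds: lists of (lo, hi) tuples with lo <= hi.
    """
    BEGIN, END = 0, 1  # BEGIN sorts before END, so touching intervals count as overlapping
    events = []
    for i, (lo, hi) in enumerate(subs):
        events.append((lo, BEGIN, "S", i))
        events.append((hi, END, "S", i))
    for j, (lo, hi) in enumerate(upds):
        events.append((lo, BEGIN, "U", j))
        events.append((hi, END, "U", j))
    events.sort()

    active_subs, active_upds = set(), set()
    matches = set()
    for _, kind, owner, idx in events:
        if kind == BEGIN:
            if owner == "S":
                # A subscription opens: it overlaps every update currently open.
                matches.update((idx, u) for u in active_upds)
                active_subs.add(idx)
            else:
                # An update opens: it overlaps every subscription currently open.
                matches.update((s, idx) for s in active_subs)
                active_upds.add(idx)
        else:
            (active_subs if owner == "S" else active_upds).discard(idx)
    return matches


if __name__ == "__main__":
    print(sorted(match_intervals([(0, 5), (10, 12)], [(4, 10), (6, 7)])))
```

For d-dimensional rectangles, one common approach is to run such a sweep per dimension and keep only the pairs reported in every dimension; whether the article does this is not stated in the abstract.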
We present a parallel algorithm for static and dynamic partitioning of unstructured FEM-meshes. The method consists of two parts. First, a fast but inaccurate sequential clustering is determined which is used, together with a simple mapping heuristic, to map the mesh initially onto the processors of a parallel system. The second part of the method uses a massively parallel algorithm to remap and optimize the mesh decomposition, taking several cost functions into account which reflect the characteristics of the underlying hardware and the requirements of the numerical solution method supposed to run after the decomposition. The parallel algorithm first calculates the number of nodes that have to be migrated between pairs of clusters in order to obtain an optimal load balancing. In a second step, nodes to be migrated are chosen according to cost functions optimizing the amount of necessary communication and the shapes of subdomains. The latter criterion is extremely important for the convergence behavior of certain numerical solution methods, especially for preconditioned conjugate gradient methods. The parallel parts of the method are implemented in C under Parix to run on the Parsytec GC systems. Results on up to 64 processors are presented and compared to those of other existing methods. (C) 1998 John Wiley & Sons, Ltd.
In this paper, we present a new state space-based approach to the two-dimensional (2-D) frequency estimation problem, which occurs in various areas of signal processing and communications. The proposed method begins with the construction of a state space model associated with the noiseless data, which contain a summation of 2-D harmonics. Two auxiliary Hankel-block-Hankel-like matrices are then introduced, from which the two frequency components can be derived via matrix factorizations along with frequency shifting properties. Although the algorithm can render high-resolution frequency estimates, it is computationally expensive. To alleviate this overhead, a highly parallelizable implementation via the principal subband component (PSC) of some appropriately chosen transforms is addressed as well. Such a PSC-based transform domain implementation not only reduces the size of the data to be processed, but also suppresses the contaminating noise outside the subband of interest. To reduce the computational complexity of the transformation process, we also suggest that either the discrete Fourier transform (DFT) or the Haar wavelet transform (HWT) be employed. As a consequence, this implementation can achieve substantial computational savings; meanwhile, as demonstrated by the provided simulation results, it retains roughly the same performance as the original algorithm. A comparison with other existing algorithms is made as well to justify the proposed approaches.
We have found provably optimal algorithms for full-domain discrete-ordinate transport sweeps on a class of grids in 2D and 3D Cartesian geometry that are regular at a coarse level but arbitrary within the coarse blocks. We describe these algorithms and show that they always execute the full eight-octant (or four-quadrant if 2D) sweep in the minimum possible number of stages for a given P_x × P_y × P_z partitioning. Computational results confirm that our optimal scheduling algorithms execute sweeps in the minimum possible stage count. Observed parallel efficiencies agree well with our performance model. Our PDT transport code has achieved approximately 68% parallel efficiency with > 1.5M parallel threads, relative to 8 threads, on a simple weak-scaling problem with only three energy groups, 10 directions per octant, and 4096 cells/thread. Our ARDRA code has achieved 71% efficiency with > 1.5M cores, relative to 16 cores, with 36 directions per octant and 48 energy groups. We demonstrate similar efficiencies with PDT on a realistic set of nuclear-reactor test problems, with unstructured meshes that resolve fine geometric details. These results demonstrate that discrete-ordinates transport sweeps can be executed with high efficiency using more than 10^6 parallel processes. (C) 2020 Published by Elsevier Inc.
A novel mechanism for driving residual stress in tokamak plasmas based on k_∥ symmetry breaking by the turbulence intensity gradient is proposed. The physics of this mechanism is explained and its connection to the wave kinetic equation and the wave-momentum flux is described. Applications to the H-mode pedestal, in particular to internal transport barriers, are discussed. Also, the effect of heat transport on the momentum flux is discussed. (C) 2010 American Institute of Physics. [doi:10.1063/1.3503624]