The paper proposes a new approach to the solution of standard and block tridiagonal systems that appear in various areas of technical, scientific, and financial practice. Its goal is to develop an efficient two-phase tridiagonal solver, of which k-step cyclic reduction is a particular case. The main idea of the proposed approach lies in using new model equations for dyadic system reduction. The resulting solver also differs from the known two-phase partitioning solvers in its second phase, where it uses a series of simple explicit formulas to compute the remaining unknowns. Computational experiments measuring speedup confirmed the efficiency of the proposed solver. (C) 2020 Elsevier B.V. All rights reserved.
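Although the paper's new model equations are not given in the abstract, the cyclic reduction that arises as its particular case can be sketched directly. The following Python sketch is our own illustration, not the paper's solver: the first phase recursively eliminates the odd-indexed unknowns, and the second phase recovers them with simple explicit formulas, mirroring the two-phase structure described above. Nonzero pivots (e.g., diagonal dominance) are assumed, and all names are ours.

    import numpy as np

    def cyclic_reduction(a, b, c, d):
        # Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = d[i], 0 <= i < n,
        # with a[0] = c[n-1] = 0, by recursive cyclic reduction.
        n = len(b)
        if n == 1:
            return d / b
        if n == 2:                                  # direct 2x2 solve
            det = b[0] * b[1] - c[0] * a[1]
            return np.array([(b[1] * d[0] - c[0] * d[1]) / det,
                             (b[0] * d[1] - a[1] * d[0]) / det])
        if n % 2 == 0:                              # pad with a decoupled row x[n] = 0
            return cyclic_reduction(np.append(a, 0.0), np.append(b, 1.0),
                                    np.append(c, 0.0), np.append(d, 0.0))[:n]
        # phase 1: eliminate odd-indexed unknowns from the even-indexed equations
        e = np.arange(0, n, 2)
        alpha, gamma = np.zeros(len(e)), np.zeros(len(e))
        alpha[1:] = a[e[1:]] / b[e[1:] - 1]         # no left neighbour at e = 0
        gamma[:-1] = c[e[:-1]] / b[e[:-1] + 1]      # no right neighbour at e = n-1
        a2, c2 = np.zeros(len(e)), np.zeros(len(e))
        a2[1:] = -alpha[1:] * a[e[1:] - 1]
        c2[:-1] = -gamma[:-1] * c[e[:-1] + 1]
        b2 = b[e] - alpha * c[e - 1] - gamma * a[(e + 1) % n]   # alpha[0] = gamma[-1] = 0
        d2 = d[e] - alpha * d[e - 1] - gamma * d[(e + 1) % n]
        x = np.zeros(n)
        x[e] = cyclic_reduction(a2, b2, c2, d2)     # solve the half-size system
        # phase 2: explicit back-substitution for the odd-indexed unknowns
        o = np.arange(1, n, 2)
        x[o] = (d[o] - a[o] * x[o - 1] - c[o] * x[o + 1]) / b[o]
        return x

    # sanity check against a dense solve (diagonally dominant system)
    rng = np.random.default_rng(0)
    n = 9
    a, c, d = rng.random(n), rng.random(n), rng.random(n)
    b = 4.0 + rng.random(n)
    a[0] = c[-1] = 0.0
    T = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
    assert np.allclose(cyclic_reduction(a, b, c, d), np.linalg.solve(T, d))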
In order to realize integrally analysis and optimization of the large airborne radome-enclosed antenna system, a novel optimization strategy is proposed based on an overlapping domain decomposition method by using higher-order MoM and out-of-core solver (HO-OC-DDM), and combining with adaptive mutation particle swarm optimization (AMPSO). The introduction of parallel out-of-core solver and DDM can effectively break the random access memory (RAM) limit. This strategy can decompose difficult-to-solve global optimization problems into multi-domain optimization problems by using domain decomposition method. Finally, take airborne Yagi antenna system as an example, the numerical results show that the design of large airborne radome-enclosed antenna system based on the proposed strategy is convenient and effective.
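The paper's AMPSO variant is not detailed in the abstract; as rough orientation, a generic particle swarm with an adaptive mutation step (perturbing the global best when the swarm stagnates) can be sketched as follows. Everything here, names and constants included, is our assumption, not the authors' implementation.

    import numpy as np

    def ampso(f, lo, hi, n_particles=30, iters=200, seed=0):
        # Minimal particle swarm with an adaptive mutation step: when the
        # global best stagnates, it is perturbed to escape local optima.
        rng = np.random.default_rng(seed)
        x = rng.uniform(lo, hi, (n_particles, len(lo)))
        v = np.zeros_like(x)
        pbest = x.copy()
        pval = np.apply_along_axis(f, 1, x)
        g, gval, stall = pbest[pval.argmin()].copy(), pval.min(), 0
        for t in range(iters):
            w = 0.9 - 0.5 * t / iters                     # decaying inertia
            r1, r2 = rng.random((2, *x.shape))
            v = w * v + 2.0 * r1 * (pbest - x) + 2.0 * r2 * (g - x)
            x = np.clip(x + v, lo, hi)
            val = np.apply_along_axis(f, 1, x)
            better = val < pval
            pbest[better], pval[better] = x[better], val[better]
            if pval.min() < gval:
                g, gval, stall = pbest[pval.argmin()].copy(), pval.min(), 0
            else:
                stall += 1
            if stall > 10:                                # adaptive mutation
                cand = np.clip(g + rng.normal(0.0, 0.1 * (hi - lo)), lo, hi)
                fc = f(cand)
                if fc < gval:
                    g, gval = cand, fc
                stall = 0
        return g, gval

    # e.g. minimise a sphere function over [-5, 5]^3
    best, val = ampso(lambda p: float(np.sum(p * p)),
                      lo=np.full(3, -5.0), hi=np.full(3, 5.0))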
To generate a mesh in a physical domain, an initial mesh of a polygonal domain that approximates the physical domain is introduced. The initial mesh is formed by using a Body Centered Cubic (BCC) lattice, which gives a more efficient node ordering for matrix-vector multiplication. An optimization problem is then considered for the displacement of the initial mesh points, which maintains a good quality of triangles while aiming to fit the initial mesh to the boundary of the physical domain. In the optimization problem, a mesh quality function is employed. The Fréchet derivative of the objective function vanishes at the optimal solution, which yields a nonlinear algebraic system for the optimal solution. The nonlinear algebraic system can be solved by the Picard or Newton method. To resolve the complexity of the physical domain, a very fine initial mesh is often required, but the solution time for the nonlinear algebraic system then becomes problematic. To overcome this limitation, adaptively refined grid cells can be used for the initial BCC mesh, and iterative solvers combined with a domain decomposition preconditioner can be applied to the algebraic system in the Picard or Newton method. The use of iterative solvers with a domain decomposition preconditioner gives a parallel meshing algorithm that makes the proposed scheme more efficient for large-scale problems. Numerical results for various test models are included. (C) 2021 Elsevier Inc. All rights reserved.
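A minimal sketch of the Newton iteration mentioned above may help; newton_solve, its dense linear solve, and the 1-D test problem are our illustrations, not the paper's method, which replaces the inner solve with a preconditioned iterative solver for large meshes.

    import numpy as np

    def newton_solve(F, J, x0, tol=1e-10, max_iter=50):
        # Newton's method for the nonlinear algebraic system F(x) = 0.
        # Each step solves the linearised system J(x) dx = -F(x); for a
        # large mesh this dense solve would be replaced by a Krylov method
        # with a domain decomposition preconditioner, as in the paper.
        x = x0.astype(float).copy()
        for _ in range(max_iter):
            r = F(x)
            if np.linalg.norm(r) < tol:
                break
            x += np.linalg.solve(J(x), -r)
        return x

    # hypothetical 1-D test problem: solve x = cos(x)
    F = lambda x: np.array([x[0] - np.cos(x[0])])
    J = lambda x: np.array([[1.0 + np.sin(x[0])]])
    print(newton_solve(F, J, np.array([0.5])))   # ~0.739085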
Convolutional Neural Networks (CNNs) have been widely adopted in many kinds of artificial intelligence applications. Most of the computational overhead of CNNs is spent on convolutions. An effective approach to reducing this overhead is transforming convolutions in the time domain into multiplications in the frequency domain by means of Fast Fourier Transform (FFT) algorithms, known as FFT-based fast algorithms for convolutions. However, current FFT-based fast implementations only work for unit-strided convolutions (stride 1) and cannot be directly applied to strided convolutions with stride greater than 1, which are usually used in the first layer of CNNs and as an effective alternative to pooling layers for downsampling. In this paper, we first introduce rearrangement- and sampling-based methods for applying FFT-based fast algorithms to strided convolutions, and compare in detail the arithmetic complexities of these two methods and the direct method. Then, highly optimized parallel implementations of the two methods on an ARMv8-based many-core CPU are presented. Lastly, we benchmark these implementations against two GEMM-based implementations on the same ARMv8 CPU. Our experimental results with convolutions of different kernels, feature maps, and batch sizes show that the rearrangement-based method generally outperforms the sampling-based one under the same optimizations, and that both methods achieve much better performance than the GEMM-based ones when the kernels, feature maps, and batch sizes are large. Experimental results on the convolutional layers of popular CNNs further support these conclusions. (C) 2021 Elsevier B.V. All rights reserved.
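The two methods can be illustrated in 1-D. In the sketch below (our own, with hypothetical function names), the sampling-based method computes the full unit-stride FFT correlation and keeps every s-th output, while the rearrangement-based method splits input and kernel into s polyphase components whose unit-stride FFT correlations sum to the strided result.

    import numpy as np

    def corr_fft(x, w):
        # unit-stride 'valid' cross-correlation via a zero-padded FFT
        n = len(x) + len(w) - 1
        y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(w[::-1], n), n)
        return y[len(w) - 1:len(x)]

    def strided_corr_sampling(x, w, s):
        # sampling-based: full stride-1 FFT correlation, keep every s-th output
        return corr_fft(x, w)[::s]

    def strided_corr_rearrange(x, w, s):
        # rearrangement-based: split input and kernel into s polyphase
        # components; their unit-stride correlations sum to the strided result
        n_out = (len(x) - len(w)) // s + 1
        y = np.zeros(n_out)
        for p in range(s):
            xp, wp = x[p::s], w[p::s]
            if len(wp):
                y += corr_fft(xp, wp)[:n_out]
        return y

    # check both methods against the direct strided correlation
    rng = np.random.default_rng(0)
    x, w, s = rng.standard_normal(32), rng.standard_normal(5), 2
    direct = np.array([x[i * s:i * s + len(w)] @ w
                       for i in range((len(x) - len(w)) // s + 1)])
    assert np.allclose(strided_corr_sampling(x, w, s), direct)
    assert np.allclose(strided_corr_rearrange(x, w, s), direct)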
The use of parallelism may overcome some of the constraints imposed by single-processor computing systems. Besides offering faster solutions, parallelized applications can solve bigger or more complex problems. For instance, simulations can be run at finer resolutions, and physical phenomena can potentially be modeled more realistically. We describe in this paper the development of a bio-inspired parallel algorithm used in the three-dimensional simulation of multicellular tissue growth. We report on the different components of the model, in which cellular automata are used to model different types of cell populations that execute persistent random walks on the computational grid, collide, and proliferate until they reach confluence. We also discuss the main issues encountered in parallelizing the model and implementing it on a parallel machine.
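A serial toy version of the cellular-automaton core conveys the idea; the grid size, probabilities, and update rules below are our simplifications, not the paper's calibrated model or its parallel decomposition.

    import numpy as np

    rng = np.random.default_rng(1)
    N, steps, p_div, persist = 32, 100, 0.02, 0.7
    grid = np.zeros((N, N, N), dtype=bool)          # site occupancy
    cells = [(N // 2, N // 2, N // 2, rng.integers(6))]  # (x, y, z, direction)
    grid[N // 2, N // 2, N // 2] = True
    moves = [(1, 0, 0), (-1, 0, 0), (0, 1, 0), (0, -1, 0), (0, 0, 1), (0, 0, -1)]

    for _ in range(steps):
        nxt = []
        for (x, y, z, d) in cells:
            # persistent random walk: keep the old direction with prob. `persist`
            if rng.random() > persist:
                d = rng.integers(6)
            dx, dy, dz = moves[d]
            nx, ny, nz = (x + dx) % N, (y + dy) % N, (z + dz) % N
            if not grid[nx, ny, nz]:                # collision: move only if free
                grid[x, y, z] = False
                grid[nx, ny, nz] = True
                x, y, z = nx, ny, nz
            if rng.random() < p_div:                # proliferation into a free site
                for ddx, ddy, ddz in moves:
                    cx, cy, cz = (x + ddx) % N, (y + ddy) % N, (z + ddz) % N
                    if not grid[cx, cy, cz]:
                        grid[cx, cy, cz] = True
                        nxt.append((cx, cy, cz, rng.integers(6)))
                        break
            nxt.append((x, y, z, d))
        cells = nxt

    print(f"population after {steps} steps: {len(cells)}")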
Computationally efficient sensitivity analysis of a large-scale air pollution model is the central issue of this paper. Sensitivity studies play an important role in the reliability analysis of the results of complex nonlinear models such as those used in air pollution modelling. There are a number of uncertainties in the input data sets, as well as in some internal coefficients that determine the speed of the main chemical reactions in the chemical part of the model. These uncertainties are the subject of our quantitative sensitivity study. Monte Carlo and quasi-Monte Carlo algorithms are used in this study. A large number of numerical experiments with special modifications of the model must be carried out to collect the input data needed for a particular sensitivity study. For this purpose we created an efficient high-performance implementation, SA-DEM, based on the MPI version of the UNI-DEM package. A large number of numerical experiments carried out with SA-DEM on the IBM MareNostrum III at BSC in Barcelona helped us to identify a severe performance problem in an earlier version of the code and to resolve it successfully. The improved implementation proves quite efficient for this challenging computational problem, as our experiments show. Numerical results with performance and scalability analysis are presented in the paper.
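As background for the kind of computation SA-DEM distributes, a plain Monte Carlo estimator of first-order (Sobol) sensitivity indices looks roughly like this; the model function, sample sizes, and names are our placeholders, not UNI-DEM.

    import numpy as np

    def sobol_first_order(f, dim, n=100_000, seed=0):
        # Pick-freeze Monte Carlo estimate of first-order Sobol indices
        # S_i = V[E(Y|X_i)] / V(Y) for a scalar model f with inputs
        # uniform on [0, 1]^dim (f takes an (n, dim) array).
        rng = np.random.default_rng(seed)
        A = rng.random((n, dim))
        B = rng.random((n, dim))
        fA, fB = f(A), f(B)
        var = np.var(np.concatenate([fA, fB]))
        S = np.empty(dim)
        for i in range(dim):
            ABi = A.copy()
            ABi[:, i] = B[:, i]          # vary only the i-th input
            S[i] = np.mean(fB * (f(ABi) - fA)) / var
        return S

    # e.g. Y = X0 + 2*X1 has analytic indices [0.2, 0.8]
    model = lambda X: X[:, 0] + 2.0 * X[:, 1]
    print(sobol_first_order(model, dim=2))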
Projecting a vector onto a simplex is a well-studied problem that arises in a wide range of optimization problems. Numerous algorithms have been proposed for determining the projection; however, the primary focus of the literature is on serial algorithms. We present a parallel method that decomposes the input vector and distributes it across multiple processors for local projection. Our method is especially effective when the resulting projection is highly sparse, which is the case, for instance, in large-scale problems with independent and identically distributed (i.i.d.) entries. Moreover, the method can be adapted to parallelize a broad range of serial algorithms from the literature. We fill in theoretical gaps in serial algorithm analysis and develop similar results for our parallel analogues. Numerical experiments conducted on a wide range of large-scale instances, both real world and simulated, demonstrate the practical effectiveness of the method.
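For reference, one widely used serial algorithm of the kind the paper parallelizes is the classical O(n log n) sort-based projection onto the simplex; the sketch below is that standard method, not the paper's parallel scheme.

    import numpy as np

    def project_simplex(v, z=1.0):
        # Euclidean projection of v onto {x : x >= 0, sum(x) = z}:
        # sort v, find the largest prefix with positive shifted entries,
        # then threshold at the resulting shift theta.
        u = np.sort(v)[::-1]                     # sort descending
        css = np.cumsum(u)
        j = np.arange(1, len(v) + 1)
        rho = np.nonzero(u - (css - z) / j > 0)[0][-1]
        theta = (css[rho] - z) / (rho + 1)       # optimal shift
        return np.maximum(v - theta, 0.0)

    x = project_simplex(np.array([0.6, 1.2, -0.4]))
    print(x, x.sum())                            # [0.2, 0.8, 0.0], sums to 1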
In this paper, simple optimal algorithms are presented for solving some problems related to interval graphs. These problems are the connected component problem, the spanning tree problem, the eccentricity problem, and the single-source all-destinations shortest-path problem. All four problems can be solved in linear time if the endpoints of the intervals are sorted. Moreover, our algorithms can be parallelized very easily, so that the above problems can be solved in O(log n) time with O(n/log n) processors in the EREW PRAM model.
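To give the flavour of such endpoint-sweep algorithms, here is a short serial sketch of the connected component problem on intervals (our illustration; the paper's optimal and EREW PRAM versions differ in detail): after sorting by left endpoint, a new component starts exactly where the next interval begins past the furthest right endpoint seen so far.

    def interval_components(intervals):
        # Connected components of an interval graph by a left-to-right sweep;
        # linear time once the intervals are sorted by left endpoint.
        order = sorted(range(len(intervals)), key=lambda i: intervals[i][0])
        comp, label = [0] * len(intervals), 0
        reach = intervals[order[0]][1]          # furthest right endpoint so far
        for i in order:
            l, r = intervals[i]
            if l > reach:                       # gap: no interval spans it
                label += 1
            comp[i] = label
            reach = max(reach, r)
        return comp

    print(interval_components([(0, 2), (1, 4), (6, 8), (7, 9), (3, 5)]))
    # -> [0, 0, 1, 1, 0]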
Objective: To study machine learning algorithms in a large-data environment. Methods: A divide-and-conquer strategy combined with parallel algorithms. Process: Feature selection, classification, clustering, and association analysis of large data sets. Results and analysis: The experimental results show that, using the divide-and-conquer strategy and parallel algorithms, we can extract hidden but valuable information from large data and improve analysis and problem-solving capability. This result is obtained because the algorithms can effectively extract, retrieve, store, share, analyze, and process data of complex structure and large volume. Conclusion: The divide-and-conquer strategy and parallel algorithms are effective for machine learning on large data and maximize the value of the data.
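As a concrete, if simplified, instance of the divide-and-conquer strategy described above, the sketch below splits a data matrix into chunks, computes partial per-feature statistics in parallel, and merges them into global results; all function names and the statistic chosen are hypothetical, not the paper's algorithms.

    import numpy as np
    from concurrent.futures import ProcessPoolExecutor

    def chunk_stats(chunk):
        # per-chunk partial result: count, sum, and sum of squares
        return len(chunk), chunk.sum(axis=0), (chunk ** 2).sum(axis=0)

    def parallel_feature_stats(X, workers=4):
        # divide: split rows into chunks; conquer: process chunks in
        # parallel; combine: merge partials into global mean and variance
        chunks = np.array_split(X, workers)
        with ProcessPoolExecutor(workers) as pool:
            parts = list(pool.map(chunk_stats, chunks))
        n = sum(p[0] for p in parts)
        s = sum(p[1] for p in parts)
        ss = sum(p[2] for p in parts)
        mean = s / n
        var = ss / n - mean ** 2
        return mean, var

    if __name__ == "__main__":
        X = np.random.default_rng(0).standard_normal((100_000, 8))
        mean, var = parallel_feature_stats(X)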
Modern microscopic volumetric imaging processes lack capturing flexibility and are inconvenient to operate. Additionally, the quality of acquired data cannot be assessed immediately during imaging due to the lack of a coherent real-time visualization system. Thus, to eliminate the need for close user supervision while providing real-time 3D visualization alongside imaging, we propose and describe an innovative approach that integrates imaging and visualization into a single pipeline, called an online incrementally accumulated rendering system. This system is composed of an electronic controller for progressive acquisition, a memory allocator for memory isolation, an efficient memory organization scheme, a compositing scheme to render accumulated datasets, and accumulative frame buffers for displaying non-conflicting outputs. We implement this design using a laser scanning confocal endomicroscope, interfaced with an FPGA prototyping board through a custom hardware circuit. Empirical results from practical deployments in a cancer research center are presented in this paper. (C) 2013 Elsevier B.V. All rights reserved.
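A greatly simplified software analogue of incremental accumulation with non-conflicting display buffers might look like this (our sketch only; the actual system is an FPGA hardware pipeline, and the maximum-intensity compositing rule here is our assumption).

    import numpy as np

    class AccumulatedRenderer:
        # Each newly acquired slice updates a running maximum-intensity
        # projection; a double-buffered frame is swapped atomically so the
        # display never shows a half-updated image.
        def __init__(self, h, w):
            self.buffers = [np.zeros((h, w)), np.zeros((h, w))]
            self.front = 0                      # index of the displayed buffer

        def add_slice(self, slice2d):
            back = 1 - self.front
            # accumulate into the back buffer, then swap
            np.maximum(self.buffers[self.front], slice2d, out=self.buffers[back])
            self.front = back

        def frame(self):
            return self.buffers[self.front]

    r = AccumulatedRenderer(64, 64)
    for z in range(10):                         # slices streaming in from the scanner
        r.add_slice(np.random.default_rng(z).random((64, 64)))
    print(r.frame().max())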