It is well known that solving Maxwell's equations can produce large dense matrices, which makes the problem computationally intensive. Even small problems require a huge amount of memory to manipulate the matrices during the O(N^3) operations involved. The fast multipole method makes it possible to compress and approximate these matrices. Coupled with an iterative solution of the linear system, it reduces the complexity to O(N_iter N log N) operations. In order to use multiprocessor machines and to reduce computation times, we propose here a parallel implementation of the fast multipole method. This article reports our first results, as well as the difficulties encountered. Copyright (C) 2003 John Wiley & Sons, Ltd.
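To give a feel for the gap between the two operation counts quoted in this abstract, the following minimal sketch compares a cubic cost with an N_iter * N * log N cost. It is only an illustration, not the authors' code: the function names, the iteration count of 50, and the use of a base-2 logarithm are assumptions, and all constant factors are ignored.

```python
import math


def dense_direct_ops(n):
    """Operation count of a dense direct solve, taken as n^3 (constants ignored)."""
    return n ** 3


def fmm_iterative_ops(n, n_iter):
    """Operation count of an iterative solve with fast-multipole matrix-vector
    products, taken as n_iter * n * log2(n) (constants ignored)."""
    return n_iter * n * math.log2(n)


# Hypothetical iteration count, chosen only for illustration.
N_ITER = 50

for n in (1_000, 10_000, 100_000):
    dense = dense_direct_ops(n)
    fmm = fmm_iterative_ops(n, N_ITER)
    print(f"N={n:>7}: dense ~{dense:.2e}, FMM+iterative ~{fmm:.2e}, ratio ~{dense / fmm:.1e}")
```

The ratio grows roughly like N^2 / (N_iter log N), so the advantage of the compressed approach widens as the problem size increases, provided the iteration count stays moderate.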
A still image encoder implementation is presented for a multi-DSP system called PARNEU, which has previously been developed for neural network and signal processing applications. The core of the implementation is based on experimental mappings of the discrete wavelet transform (DWT) onto the parallel processor architecture. PARNEU has a flexible interconnection network architecture with message passing, which allows adding more processing units (PUs) to the system whenever more computational power is needed. Program code can be written to adapt to the number of PUs. This is utilized in the presented encoder implementation, with emphasis on load balancing among processors as well as on the balance between communication and computation. Performance of the implementation is measured with a scalable number of processors and compared to a sequential reference implementation. Results show that the DWT phase can be efficiently parallelized on PARNEU with 95.6% of its time spent on true parallel computation. The overall speedup with four processors is 2.25, which could be improved by further optimization of an adaptive scanning phase of the encoder. (C) 2004 Elsevier B.V. All rights reserved.
A LUT with Hierarchical Structure (HS-LUT) is proposed in this paper to realize the Substitution Box (S-box), the only nonlinear component of block ciphers. Different types of S-boxes are analyzed and four of their important features are summarized. Then, a custom 4R/1W memory is proposed as the storage unit of the reconfigurable S-box, and an example set of block ciphers is put forward to describe how to achieve a satisfactory structure for the reconfigurable S-box. The proposed HS-LUT is applicable to different sets of ciphers, and it is implemented in TSMC 40 nm CMOS technology for comparison with similar work. The comparison shows that the proposed HS-LUT achieves a 6.88% to 51.76% improvement in area efficiency.
In this letter, a synthetic aperture radar (SAR) data reformatting approach named the Doppler Keystone transform (DKT) is proposed to correct the range migration of a moving target. By using the DKT, the SAR imaging program, i.e., the 2-D matched filtering, can be transformed into separate 1-D operations along the range or azimuth direction, and therefore, the DKT is suitable for the parallel implementation of SAR imaging of the moving target. Our simulations show that by combining the DKT and the Doppler phase compensation methods, the moving target can be well imaged in the high signal-to-clutter-ratio case.
Motivated by the recent progress of deep spiking neural networks (SNNs), we propose a structure-time parallel strategy based on layered structure and one-time computation over a time window to speed up the prominent spike-based deep learning algorithm named broadcast alignment. Furthermore, a well-designed deep hierarchical model based on the parallel broadcast alignment is proposed for object recognition. The parallel broadcast alignment achieves a significant 137x speedup compared to its original implementation on the MNIST dataset. The object recognition model achieves higher accuracy than the latest spiking deep convolutional neural networks on the ETH-80 dataset. The proposed parallel strategy and the object recognition model will facilitate both the simulation of deep SNNs for studying spiking neural dynamics and the application of spike-based deep learning to real-world problems. (C) 2019 Elsevier Ltd. All rights reserved.
An iterative method based on the textured decomposition (TD) is developed in order to solve the systems of linear equations arising in the p-version of the finite element method. The iteration is used to implement the p-version in parallel on an MIMD computer NCUBE/six. The objectives are twofold: to achieve high computational efficiency (that is, computational load should be balanced among the processors) and simultaneously to achieve rapid convergence. A superelement, consisting of four adjacent rectangular finite elements, is constructed for two-dimensional problems. Based on the structural property of the shape functions, each superelement is partitioned into three blocks in two different ways, and a two-leaf TD is used. Computations for a superelement associated with each leaf are assigned to two processors and are performed in parallel. A new preconditioner is introduced to accelerate convergence in a preconditioned textured decomposition (PTD). A special local communication strategy is used to avoid global assembly and global communication. Two model problems, a Laplace equation on a rectangular domain with a near-singular solution and a Poisson equation on an L-shaped domain, are solved. The conjugate gradient (CG) method, the TD method, and the recursive textured decomposition (RTD) method, each with and without preconditioning, and the classical iterative methods (Jacobi, Gauss-Seidel (GS), successive overrelaxation (SOR)) are used to solve both model problems. Load balance, speedup ratio, and spectral radii of the various iterations are studied. The test results indicate that recursive PTD with a local communication strategy gives at least a 30% improvement in computational time over the other methods.
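The classical iterations named in this abstract (Jacobi, Gauss-Seidel, SOR) can be illustrated with a minimal sketch. It does not reproduce the paper's setting: instead of the two-dimensional p-version finite element systems, it assumes a one-dimensional Poisson model problem with the tridiagonal matrix tridiag(-1, 2, -1), and all function names, the problem size, and the tolerance are hypothetical choices made for illustration.

```python
import math


def neighbours(x, i):
    """Left and right neighbours of x[i], with zero Dirichlet boundary values."""
    left = x[i - 1] if i > 0 else 0.0
    right = x[i + 1] if i < len(x) - 1 else 0.0
    return left, right


def residual_norm(x, b):
    """Max-norm of b - A x for A = tridiag(-1, 2, -1)."""
    worst = 0.0
    for i in range(len(x)):
        left, right = neighbours(x, i)
        worst = max(worst, abs(b[i] - (2.0 * x[i] - left - right)))
    return worst


def jacobi_sweep(x, b):
    """One Jacobi sweep: every update uses only values from the previous iterate."""
    new = [0.0] * len(x)
    for i in range(len(x)):
        left, right = neighbours(x, i)
        new[i] = (b[i] + left + right) / 2.0
    return new


def sor_sweep(x, b, omega):
    """One SOR sweep, updating in place; omega = 1 gives Gauss-Seidel."""
    x = list(x)
    for i in range(len(x)):
        left, right = neighbours(x, i)  # left neighbour is already updated
        gauss_seidel_value = (b[i] + left + right) / 2.0
        x[i] = (1.0 - omega) * x[i] + omega * gauss_seidel_value
    return x


def iterations_to_converge(sweep, n=20, tol=1e-8, max_iter=100_000):
    """Count sweeps until the residual max-norm drops below tol."""
    b = [1.0] * n
    x = [0.0] * n
    for k in range(1, max_iter + 1):
        x = sweep(x, b)
        if residual_norm(x, b) < tol:
            return k
    raise RuntimeError("no convergence within max_iter sweeps")


n = 20
# Optimal relaxation factor for this particular model matrix.
omega_opt = 2.0 / (1.0 + math.sin(math.pi / (n + 1)))

jacobi_iters = iterations_to_converge(jacobi_sweep, n)
gs_iters = iterations_to_converge(lambda x, b: sor_sweep(x, b, 1.0), n)
sor_iters = iterations_to_converge(lambda x, b: sor_sweep(x, b, omega_opt), n)
print("Jacobi:", jacobi_iters, "Gauss-Seidel:", gs_iters, "SOR:", sor_iters)
```

On this model problem the sweep counts reflect the spectral radii of the three iteration matrices: Jacobi needs the most sweeps, Gauss-Seidel roughly half as many, and SOR with the optimal relaxation factor far fewer.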
We present a new, parallel version of the numerical electromagnetics code (NEC). The parallelization is based on a bidimensional block-cyclic distribution of matrices on a rectangular processor grid, assuring a theoretically optimal load balance among the processors. The code is portable to any platform supporting message-passing parallel environments such as the Message Passing Interface (MPI) and Parallel Virtual Machine (PVM), where it could even be executed on heterogeneous clusters of computers running different operating systems. The developed parallel NEC was successfully implemented on two parallel supercomputers featuring different architectures to test portability. Large structures containing up to 24000 segments, which exceed currently available computer resources, were successfully executed, and timing and memory results are presented. The code is applied to analyze the penetration of electromagnetic fields inside a vehicle. The computed results are validated using other numerical methods and experimental data obtained using a simplified model of a vehicle (consisting essentially of the body shell) illuminated by an electromagnetic pulse (EMP) simulator.
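The bidimensional block-cyclic distribution mentioned in this abstract can be sketched as follows. This is a generic illustration of the mapping, not the parallel NEC code itself: the function names, the zero-based indexing, the assumption that the first block belongs to process (0, 0), and the small matrix and grid sizes are all choices made for the example.

```python
from collections import Counter


def owner(i, j, mb, nb, p_rows, p_cols):
    """Grid coordinates of the process owning matrix entry (i, j).

    Blocks of mb x nb entries are dealt out cyclically over a
    p_rows x p_cols process grid, starting at process (0, 0).
    """
    return (i // mb) % p_rows, (j // nb) % p_cols


def load_per_process(n_rows, n_cols, mb, nb, p_rows, p_cols):
    """Number of matrix entries stored by each process."""
    counts = Counter()
    for i in range(n_rows):
        for j in range(n_cols):
            counts[owner(i, j, mb, nb, p_rows, p_cols)] += 1
    return counts


# Hypothetical example: an 8 x 8 matrix, 2 x 2 blocks, a 2 x 2 process grid.
loads = load_per_process(8, 8, 2, 2, 2, 2)
for process in sorted(loads):
    print(process, loads[process])
```

Because consecutive blocks go to different processes in both dimensions, each process holds pieces from all regions of the matrix, which is what keeps the load even as a factorization works its way through the matrix.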
We describe an efficient massively parallel implementation of our variant of the FETI-type domain decomposition method called Total FETI (TFETI) with a lumped preconditioner. Special attention is paid to the discussion of several variants of parallelization of the action of the projections to the natural coarse grid and to the effective regularization of the stiffness matrices of the subdomains. Both numerical and parallel scalability of the proposed TFETI method are demonstrated on a 2D elastostatic benchmark with up to 314,505,600 unknowns and 4800 cores. The results are also important for the implementation of scalable algorithms for the solution of nonlinear contact problems of elasticity by the TFETI-based domain decomposition method. (C) 2013 Civil-Comp Ltd and Elsevier Ltd. All rights reserved.
A landscape modeling system called the Across Trophic-Level System Simulation (or ATLSS) has been developed in an effort to project the consequences of proposed water regulation plans for restoration of the South Florida Everglades. The ATLSS Landscape Fish Model (ALFISH) is a component of the ATLSS package (written in C++), which is used to provide dynamic measures of the spatially-explicit food resources available to wading birds, namely fish. The original (serial) ALFISH model requires as much as 30 h for 31-year simulations of specified scenarios. The model's execution time has been successfully improved (by a factor of 4.5) by partitioning its data input and executing the model simultaneously (in parallel) on those partitions. This paper demonstrates how the model's communications between partitioned data can be blocked to simulate compartmentalization effects on the input data. Minimal effects (below 1%) on the output of the original (serial) version are demonstrated. Regarding portability, both models (serial and parallel) have been successfully executed on two different computing environments: an SMP (Symmetric Multi-Processor) with 14 processors and a 14-processor network cluster. (C) 2004 Elsevier B.V. All rights reserved.
The Agents Kernel Language (AKL) is a general-purpose concurrent constraint language. It combines the programming paradigms of search-oriented languages such as Prolog and process-oriented languages such as GHC. The paper focuses on three essential issues in the parallel implementation of AKL for shared-memory multiprocessors: how to maintain multiple binding environments, how to represent the execution state, and how to distribute work among workers. A simple scheme is used for maintaining multiple binding environments. A worker will immediately see conditional bindings placed on variables, and all workers will have a coherent view of the constraint stores. A locking scheme is used that entails little overhead for operations on local variables. The goals in a guard are represented in a way that allows them to be inserted and removed without any locking. Continuations are used to represent sequences of untried goals. This representation keeps the granularity of work coarser. Available work is distributed among workers in such a way that hot spots are avoided. And-tasks and or-tasks are distributed and scheduled in a uniform way.