ISBN: 0769515126 (print)
A type of incomplete decomposition preconditioner based on local block factorization is considered for the matrices derived from discretizing 2-D or 3-D elliptic partial differential equations. We prove that the condition numbers of the preconditioned matrices are small, which means that the constructed preconditioners are effective. Further, we consider an efficient parallel version of the preconditioner which depends only on a single integer argument. When its value is small, the number of iterations needed to converge on multiple processors is much larger than on a single processor, but as the value increases, the difference decreases step by step. Finally, we report many experiments on a cluster of 6 PCs with main frequencies of 1.8 GHz. The results show that the local block factorizations constructed are efficient in serial implementation when compared to some well-known effective preconditioners, and that the parallel versions are efficient as well.
ISBN: 0769526373 (print)
Embedded computing architectures can be designed to meet a variety of application-specific requirements. However, optimized hardware can require compiler support to realize the potential of the hardware. This is especially true for embedded image processing systems, where significant architectural variation is possible and targeted software can change drastically based on architectural variation. This paper presents methods to compile a single high-level source given a fundamental variation in data-parallel target architectures: processor granularity ranging from a single processor to a massively parallel processor array. The approach uses single PPE virtualization, which supports pixel-level data-parallel expressions that operate on a virtual one pixel per processing element (PPE) network and applies pixel-locating transformations to retarget the code into a given target PPE. Unlike mainstream parallel computing techniques, this technique can be applied to lightweight SIMD targets that do not provide global communication hardware or shared memory.
The ultrafast electron beam X-ray computed tomography (CT) measuring system of the Helmholtz-Zentrum Dresden-Rossendorf (HZDR) is primarily operated for fundamental multiphase flow investigations, e.g. in various technical devices, and for validation of enhanced flow simulation models, e.g. those developed for computational fluid dynamics (CFD) codes. The CT scanner delivers cross-sectional material distributions by contactless measurements with a spatial resolution of approximately 1 mm and a maximum temporal resolution of 8 kHz. Currently, two central time-consuming processes have been identified as limiting the efficient usage of this worldwide unique CT technique: a) the data transfer from the detector system to central data storage (e.g. a computer or database) and b) the data processing. Thus, data pre-processing and data reconstruction algorithms have been adapted for use on multi-core central processing units (CPUs) and even many-core graphics processing units (GPUs). For optimal data processing results, an advanced performance PC with two high-performance graphics processing units operated in parallel, a six-core processor, a high internal data bus speed and a large memory block has been assembled. The newly developed data processing algorithms yield a performance improvement of approximately 137 for the entire data processing sequence compared to the previous universally applicable single-core CPU-based data processing tool. (C) 2016 Elsevier Ltd. All rights reserved.
ISBN: 0780317726 (print)
A new generation of digital signal processors with communication capabilities offers the possibility of building parallel architectures, so-called multi-DSP systems, to meet the increasing requirements of high-performance applications. In this paper these multi-DSP systems are used for the experimental evaluation of specialized computer architectures within the domain of digital signal processing. Integration of these evaluated architectures by means of VLSI techniques leads to special-purpose processors. Two case studies in the fields of automation and telecommunication are presented in order to show the suitability of multi-DSP systems for the evaluation task.
ISBN: 0769515126 (print)
This paper presents the design and analysis of finite difference domain decomposition algorithms for the two-dimensional heat equation; the numerical results show the stability and accuracy of the algorithms. The algorithms in the paper further extend those developed by Dawson and others [6].
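For orientation only, the following is a minimal sketch of a plain explicit finite difference scheme (FTCS) for the two-dimensional heat equation u_t = u_xx + u_yy on the unit square with zero boundary values. It is not the domain decomposition algorithm of the abstract, which is not given here; the function names `ftcs_step` and `solve_heat`, the grid size, and the initial condition are all assumptions chosen for illustration. The scheme is stable only when r = dt / h^2 is at most 1/4.

```python
import math


def ftcs_step(u, r):
    """Advance the grid u by one explicit time step, with r = dt / h**2."""
    n = len(u)
    new = [row[:] for row in u]  # boundary values are carried over unchanged
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            new[i][j] = u[i][j] + r * (
                u[i + 1][j] + u[i - 1][j] + u[i][j + 1] + u[i][j - 1]
                - 4.0 * u[i][j]
            )
    return new


def solve_heat(n=21, r=0.2, steps=50):
    """Return (numerical, exact) values at the centre of the unit square."""
    h = 1.0 / (n - 1)
    dt = r * h * h
    # Assumed initial condition: the first eigenmode sin(pi x) sin(pi y).
    u = [
        [math.sin(math.pi * i * h) * math.sin(math.pi * j * h) for j in range(n)]
        for i in range(n)
    ]
    for _ in range(steps):
        u = ftcs_step(u, r)
    t = steps * dt
    # For this initial condition the exact solution decays as exp(-2 pi^2 t).
    exact = math.exp(-2.0 * math.pi ** 2 * t)
    return u[n // 2][n // 2], exact
```

Calling `solve_heat()` returns the computed and exact centre values, which can be compared to check accuracy for a given grid and time step. In a domain decomposition setting, each subdomain would run such an update on its own block of the grid, with extra work at the interfaces.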
The paper considers efficient computational load distribution for the exact parallel algorithm for the knapsack problem based on packing tree search. We propose an algorithm that provides static and dynamic computational load balancing for the problem in question.
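As a rough illustration of what static load distribution over a packing tree can look like, the sketch below fixes the first few take/skip decisions to produce independent subtrees and then searches each subtree exhaustively. This is an assumption-laden toy, not the algorithm proposed in the paper: the names `search`, `static_subtasks` and `knapsack` are hypothetical, the subtasks are processed sequentially rather than on real workers, and dynamic balancing is not shown.

```python
from itertools import product


def search(weights, values, capacity, start, weight, value):
    """Exhaustive depth-first search over the subtree rooted at item `start`."""
    if start == len(weights):
        return value
    # Branch 1: skip item `start`.
    best = search(weights, values, capacity, start + 1, weight, value)
    # Branch 2: pack item `start` if it still fits.
    if weight + weights[start] <= capacity:
        best = max(
            best,
            search(weights, values, capacity, start + 1,
                   weight + weights[start], value + values[start]),
        )
    return best


def static_subtasks(weights, values, capacity, depth):
    """Fix the first `depth` take/skip decisions; each feasible prefix is one subtask."""
    tasks = []
    for choice in product((0, 1), repeat=depth):
        w = sum(wi for wi, c in zip(weights, choice) if c)
        v = sum(vi for vi, c in zip(values, choice) if c)
        if w <= capacity:
            tasks.append((depth, w, v))
    return tasks


def knapsack(weights, values, capacity, depth=2):
    """Exact 0/1 knapsack value, computed subtree by subtree."""
    depth = min(depth, len(weights))
    tasks = static_subtasks(weights, values, capacity, depth)
    # The subtasks are independent, so a pool of workers could process them in parallel.
    return max(search(weights, values, capacity, s, w, v) for s, w, v in tasks)
```

Subtrees produced this way can differ greatly in size, which is why a purely static split tends to leave some workers idle and why dynamic redistribution is of interest.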
ISBN: 9783319561103 (print)
The proceedings contain 9 papers. The special focus in this conference is on Accelerating Data Analysis and Data Management Systems Using Modern Processor and Storage Architectures. The topics include: efficient range queries on modern CPUs; vectorized time series algorithms on modern commodity CPUs; compression-aware in-memory query processing; overtaking CPU DBMSes with a GPU in whole-query analytic processing with parallelism-friendly execution plan optimization; making in-memory databases fast on modern NICs; an analysis on modern hardware; locality-adaptive parallel hash joins using hardware transactional memory; an embedded in-memory DBMS enabling instant snapshot sharing and runtime fragility in main memory.
In our earlier papers, the parallelization and implementation of Gauss-Seidel (G-S) algorithms for power flow analysis were investigated on a Sequent Balance shared memory (SM) machine. In this paper, we genera...
ISBN: 9781479923410 (print)
2D image convolution is ubiquitous in image processing and computer vision problems such as feature extraction. Exploiting parallelism is a common strategy for accelerating convolution. Parallel processors keep getting faster, but algorithms such as image convolution remain memory-bound on parallel processors such as GPUs. Therefore, reducing memory communication is fundamental to accelerating image convolution. To reduce memory communication, we reorganize the convolution algorithm to prefetch image regions into registers, and we do more work per thread with fewer threads. To enable portability to future architectures, we implement a convolution autotuner that sweeps the design space of memory layouts and loop unrolling configurations. We focus on convolution with small filters (2x2 to 7x7), but our techniques can be extended to larger filter sizes. Depending on filter size, our speedups on two NVIDIA architectures range from 1.2x to 4.5x over state-of-the-art GPU libraries.
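The idea of loading an image region once and producing several outputs from it can be sketched on the CPU as follows. This is only an analogy under stated assumptions, not the authors' GPU implementation: the function name `convolve2d_tiled` and the `tile` parameter are hypothetical, a Python list stands in for registers, each loop iteration over a tile stands in for one thread, and the kernel is applied without flipping (the correlation form common in image processing).

```python
def convolve2d_tiled(image, kernel, tile=2):
    """'Valid' 2D convolution (kernel not flipped), one output tile per work item."""
    k = len(kernel)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0.0] * out_w for _ in range(out_h)]
    for row0 in range(0, out_h, tile):
        for col0 in range(0, out_w, tile):
            # Clip the tile at the bottom and right edges of the output.
            th = min(tile, out_h - row0)
            tw = min(tile, out_w - col0)
            # Load the input patch once; a GPU kernel would hold it in registers.
            patch = [
                image[r][col0:col0 + tw + k - 1]
                for r in range(row0, row0 + th + k - 1)
            ]
            # Reuse the loaded patch for every output in the tile.
            for i in range(th):
                for j in range(tw):
                    acc = 0.0
                    for a in range(k):
                        for b in range(k):
                            acc += patch[i + a][j + b] * kernel[a][b]
                    out[row0 + i][col0 + j] = acc
    return out
```

A larger `tile` means fewer work items, each reusing more of the loaded data, which mirrors the trade-off of doing more work per thread with fewer threads. The best tile size depends on the hardware, which is the kind of parameter an autotuner would sweep.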
We consider the problem of partitioning coarse grain signal flow graphs for execution on a class of hierarchically structured, heterogeneous multiprocessor architectures tailored to match the characteristics of a spec...