ISBN (print): 3540440496
Power-List, ParList and PList data structures are efficient tools for functional descriptions of parallel programs that are divide-and-conquer in nature. The goal of this work is to develop three parallel variants of the Fast Fourier Transformation using these theories. The variants are determined by the degree of the polynomial, which can be a power of two, a prime number, or a product of prime factors. The last variant includes the first two, and represents a general and efficient parallel algorithm for the Fast Fourier Transformation. This general algorithm has a very good time complexity, and can be mapped onto a recursive interconnection network.
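For the power-of-two case, the divide-and-conquer structure can be illustrated with the textbook recursive radix-2 FFT. This is a minimal sequential sketch, not the authors' PowerList formulation: the function name `fft` and the use of Python lists are assumptions made for illustration. The even/odd split plays the role of deconstructing a list by interleaving, and the final concatenation plays the role of joining two halves.

```python
import cmath


def fft(xs):
    """Recursive radix-2 FFT. Assumes len(xs) is a power of two."""
    n = len(xs)
    if n == 1:
        return list(xs)
    # Divide: transform the even- and odd-indexed elements independently.
    # The two recursive calls share no data, so they could run in parallel.
    even = fft(xs[0::2])
    odd = fft(xs[1::2])
    # Conquer: apply the twiddle factors and combine the two half-size results.
    twiddled = [cmath.exp(-2j * cmath.pi * k / n) * odd[k] for k in range(n // 2)]
    first_half = [e + t for e, t in zip(even, twiddled)]
    second_half = [e - t for e, t in zip(even, twiddled)]
    return first_half + second_half
```

Calling `fft([1, 2, 3, 4])` yields, up to rounding, the values 10, -2+2j, -2 and -2-2j. The prime and composite cases described in the abstract need a different decomposition and are not covered by this sketch.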
Hardwired resource allocators for TRAC-like reconfigurable architectures are described. These allocators facilitate searching for available resources in the system and allocating a subset of them to a given request. Various algorithms can be implemented for the search and the allocation of the resources. Tree-structured allocators look particularly attractive, with the cost-delay product being of the order of M (log M)^2 for a system with M resources of the same type. It is shown how this scheme can be extended to allocate multiple types of resources in the system.
ISBN (print): 0769515126
We study parallel solutions to the problem of weighted multiselection: selecting r elements at given weighted ranks from a set S of n weighted elements, where an element has weighted rank k if it is the smallest element such that the aggregate weight of all elements in S not greater than it is at least k. We propose efficient algorithms for two of the most popular parallel architectures, the hypercube and the mesh. For a hypercube with p < n processors, we present a parallel algorithm running in O(n^ε min{r, log p}) time for p = n^(1-ε), 0 < ε < 1, which is cost optimal when r ≥ p. Our algorithm on a √p × √p mesh runs in O(√p + (n/p) log^3 p) time, which is the same as multiselection on the mesh when r ≥ log p, and thus has the same optimality as multiselection in this case.
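The definition of weighted rank can be made concrete with a sequential reference implementation. This is only a sketch of the problem statement, not the hypercube or mesh algorithm from the abstract: the function name `weighted_multiselect` and the `(value, weight)` input format are assumptions, and it further assumes positive weights and ranks between 1 and the total weight.

```python
from bisect import bisect_left
from itertools import accumulate


def weighted_multiselect(items, ranks):
    """Return, for each weighted rank k, the smallest value whose
    cumulative weight (over all values not greater than it) is at least k.

    items: list of (value, weight) pairs with positive weights (assumed).
    ranks: list of weighted ranks, each between 1 and the total weight (assumed).
    """
    ordered = sorted(items)                              # sort by value
    prefix = list(accumulate(w for _, w in ordered))     # running weight totals
    selected = []
    for k in ranks:
        i = bisect_left(prefix, k)                       # first index with prefix[i] >= k
        selected.append(ordered[i][0])
    return selected
```

For the items `[(5, 2), (1, 3), (3, 1)]` the cumulative weights in sorted order are 3, 4 and 6, so ranks 1 through 3 select the value 1, rank 4 selects 3, and ranks 5 and 6 select 5. Sorting once and answering each rank by binary search costs O(n log n + r log n) sequentially.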
ISBN (print): 9789537138127
An analysis of a parallel solution of the N^2-1 Puzzle using clusters is presented. This problem is interesting due to its complexity and related applications, particularly in the field of robotics. A variation of the classic heuristics for estimating the work remaining to reach a solution is analyzed, and it is shown that its use significantly improves the running time of the sequential A* algorithm. Then, a parallel solution on a distributed architecture is presented, and speedup is analyzed based on the number of processors, efficiency, and the possible superlinearity when scaling the problem.
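As a baseline for the sequential algorithm mentioned above, the following is a minimal sketch of A* on the N^2-1 Puzzle using the classic Manhattan-distance heuristic. It does not reproduce the authors' heuristic variation or their parallel scheme. The state encoding (a tuple in row-major order with 0 for the blank) and the names `manhattan`, `neighbours` and `astar` are assumptions chosen for illustration.

```python
import heapq


def manhattan(state, n):
    """Sum of each tile's grid distance from its goal cell (blank ignored)."""
    total = 0
    for idx, tile in enumerate(state):
        if tile:
            goal = tile - 1
            total += abs(idx // n - goal // n) + abs(idx % n - goal % n)
    return total


def neighbours(state, n):
    """Yield every state reachable by sliding one tile into the blank."""
    blank = state.index(0)
    r, c = divmod(blank, n)
    for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        nr, nc = r + dr, c + dc
        if 0 <= nr < n and 0 <= nc < n:
            swap = nr * n + nc
            nxt = list(state)
            nxt[blank], nxt[swap] = nxt[swap], nxt[blank]
            yield tuple(nxt)


def astar(start, n):
    """Return the minimum number of moves to the goal, or None if unreachable."""
    goal = tuple(list(range(1, n * n)) + [0])
    frontier = [(manhattan(start, n), 0, start)]   # entries are (f = g + h, g, state)
    best = {start: 0}                              # cheapest known cost per state
    while frontier:
        _, g, state = heapq.heappop(frontier)
        if state == goal:
            return g
        if g > best[state]:
            continue                               # stale queue entry
        for nxt in neighbours(state, n):
            if g + 1 < best.get(nxt, float("inf")):
                best[nxt] = g + 1
                heapq.heappush(frontier, (g + 1 + manhattan(nxt, n), g + 1, nxt))
    return None
```

For example, `astar((1, 2, 3, 4, 5, 6, 0, 7, 8), 3)` returns 2, since tiles 7 and 8 each slide one cell to the left. A stronger heuristic shrinks the number of expanded states, which is the effect the abstract attributes to its heuristic variation.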
ISBN (print): 9783642143892
We investigate the performance of the routines in LAPACK and the Successive Band Reduction (SBR) toolbox for the reduction of a dense matrix to tridiagonal form, a crucial preprocessing stage in the solution of the symmetric eigenvalue problem, on general-purpose multicore processors. In response to the advances of hardware accelerators, we also modify the code in SBR to accelerate the computation by off-loading a significant part of the operations to a graphics processor (GPU). Performance results illustrate the parallelism and scalability of these algorithms on current high-performance multi-core architectures.
With the shift of the information processing architecture from sequential processing to parallel and distributed processing has come a great change in the role of sensor technology, with the integration of electronic circuits as the background. In other words, the sensor is no longer considered simply as a signal-transforming device but rather as an information processing module. This paper discusses the processing architecture for sensing in terms of parallel processing. Some examples of parallel processing architectures for sensor information are described from such new viewpoints as massively parallel processing vision, optical neuro-computing, active sensing, and sensor fusion.
ISBN (print): 9781728116013
Processing-in-memory (PIM) provides massive parallelism with high energy efficiency and has become a promising solution to the "memory wall" problem. Recently, the emerging metal-oxide resistive random access memory (RRAM) has shown its potential for building a PIM architecture. Several stateful logic operations, e.g., NOR and NAND, can be executed in parallel in an RRAM crossbar. Although previous works have designed some algorithms using the stateful logic, how to fully exploit its potential high parallelism and design an asymptotically fast algorithm for a given function is still under exploration. In this work, we theoretically analyze the parallelism in an RRAM crossbar and design several asymptotically optimal arithmetic algorithms. In detail, we first propose the Single Instruction Multiple Lines (SIML) model to unify the stateful logic families and prove three lower bounds on the time complexity of a parallel RRAM algorithm. Then, we design three algorithms for integer addition functions with the stateful logic, guided by the lower bound analysis. All of them reach the time complexity lower bound. Finally, we make two extensions of the integer addition algorithms, supporting multiplication functions by decomposing them into additions and supporting the flex-point data type by proposing an exponent and mantissa update flow. Experimental evaluation shows that our integer algorithms achieve a speedup of up to 13.79x over the previous RRAM algorithms. Our flex-point implementation achieves a 26.60x speedup and saves 73.68% energy compared to an ARM.
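Because NOR is functionally complete, integer addition can be expressed entirely in terms of it, which is what makes stateful NOR logic sufficient for arithmetic. The sketch below shows this with a plain serial ripple-carry adder built only from a software `nor` function. It is not one of the three algorithms from the abstract and does not model the SIML model or crossbar parallelism; all function names are assumptions for illustration.

```python
def nor(a, b):
    """NOR on single bits (0 or 1); the only primitive used below."""
    return 1 - (a | b)


def not_(a):
    return nor(a, a)


def or_(a, b):
    return not_(nor(a, b))


def and_(a, b):
    return nor(not_(a), not_(b))


def xor_(a, b):
    return and_(or_(a, b), not_(and_(a, b)))


def full_adder(a, b, carry_in):
    """One-bit full adder composed only of NOR-derived gates."""
    partial = xor_(a, b)
    total = xor_(partial, carry_in)
    carry_out = or_(and_(a, b), and_(partial, carry_in))
    return total, carry_out


def add(x, y, width):
    """Add two non-negative integers bit by bit; returns (sum mod 2**width, carry)."""
    carry = 0
    result = 0
    for i in range(width):
        bit, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= bit << i
    return result, carry
```

With a 4-bit width, `add(5, 3, 4)` returns `(8, 0)` and `add(15, 1, 4)` returns `(0, 1)`. The carry chain here takes time linear in the width, which is the kind of serial dependency that a parallel addition algorithm has to shorten.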
The article studies the application of various parallel programming technologies to the problem of modeling carbon nanostructure synthesis. The description of the developed algo...
ISBN (print): 9783642246494
This paper introduces a number of modifications that allow for significant improvements of parallel LLL reduction. Experiments show that these modifications increase the speed-up by a factor of more than 1.35 for SVP challenge type lattice bases when comparing the new algorithm with the state-of-the-art parallel LLL algorithm.
ISBN (print): 3540440496
The technology and application trends leading to current-day multiprocessor architectures, such as chip multiprocessors, embedded architectures, and massively parallel architectures, demand faster, more efficient, and more scalable cache coherence schemes than the existing ones. In this paper we present a new scheme that has the potential to meet such a demand. The software support for our scheme is in the form of program annotations to detect shared accesses as well as release synchronizations that represent data sharing boundaries. A small hardware structure called the Coherence Buffer (CB), with an associated controller local to each processor, forms the control unit that locally enforces cache coherence actions, which are off the critical path. Our simulation study shows that an 8-entry 4-way associative CB helps achieve a speedup of 1.07 to 4.31 over a full-map 3-hop directory scheme for five of the SPLASH-2 benchmarks (representative of migratory sharing, producer-consumer and write-many workloads), under the Release Consistency model.