Grid or mesh techniques are frequently used to approximate continuous entities that behave in a wave or fluid-like fashion. Partial Differential Equations (PDEs) are usually involved in the description of such entitie...
详细信息
Grid or mesh techniques are frequently used to approximate continuous entities that behave in a wave or fluid-like fashion. Partial Differential Equations (PDEs) are usually involved in the description of such entities or processes. Distributed parallel computation was used in various computer cluster configurations to calculate PDE solutions of electrostatic field. The study of the efficacy of the selected architecture using mesh techniques was intended. The match between the algorithm and the architecture in achieving maximum computational performance was also investigated. The developed architectures, algorithms, and findings are presented in the paper.
Reconfigurable systems, and in particular, FPGA-based custom computing machines, offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages for applicatio...
详细信息
Reconfigurable systems, and in particular, FPGA-based custom computing machines, offer a unique opportunity to define application-specific architectures. These architectures offer performance advantages for application domains such as image processing, where the use of customized pipelines exploits the inherent coarse-grain parallelism. In this paper we describe a set of program analyses and an implementation that map a sequential and un-annotated C program into a pipelined implementation running on a set of FPGAs, each with multiple external memories. Based on well-known parallel computing analysis techniques, our algorithms perform unrolling for operator parallelization, reuse and data layout for memory parallelization and precise communication analysis. We extend these techniques for FPGA-based systems to automatically partition the application data and computation into custom pipeline stages, taking into account the available FPGA and interconnect resources. We illustrate the analysis components by way of an example, a machine vision program. We present the algorithm results, derived with minimal manual intervention, which demonstrate the potential of this approach for automatically deriving pipelined designs from high-level sequential specifications.
In this paper, the parallelization aspects of the accelerated waveform relaxation algorithms for the transient simulation of semiconductor devices on parallel distributed memory computers are studied. These methods ar...
详细信息
In this paper, the parallelization aspects of the accelerated waveform relaxation algorithms for the transient simulation of semiconductor devices on parallel distributed memory computers are studied. These methods are competitive with standard pointwise methods on serial architectures, but are significantly faster on parallel computers. We make use of an improved parallel version of the conjugate gradient squared method (ICGS) combining elements of numerical stability and parallel algorithm design, for solving the resulting sequence of time-varying sparse linear differential-algebraic initial-value problems arising at each linearization step with waveform Newton. We reorganize the algorithm such that all the inner products, matrix-vector multiplications and vector updates of a single iteration step are independent and communication time required for inner product can be overlapped efficiently with computation time of vector updates. Therefore, the bottleneck of the performance, namely the cost of global communication on parallel distributed memory computers can be significantly reduced. The resulting ICGS algorithm maintains the favorable properties of the original algorithm while not increasing the computational costs.
There are several fundamental problems for which there are optimal randomized algorithms, but whose deterministic complexity remains unresolved. Among such problems are the minimum spanning tree problem, the set maxim...
ISBN:
(纸本)9780898715132
There are several fundamental problems for which there are optimal randomized algorithms, but whose deterministic complexity remains unresolved. Among such problems are the minimum spanning tree problem, the set maxima problem, the problem of computing connected components and (minimum) spanning trees in parallel and the problem of performing sensitivity analysis on shortest path trees and minimum spanning trees. For each of these problems there is an optimal randomized algorithm which uses a linear number of random bits. We propose new algorithms (or adapt old ones) for these problems which preserve optimality while saving an exponential number of random bits. In the case of computing minimum spanning trees and MST/SSSP sensitivity analysis, we reduce the dependence on randomness to log* n random *** also consider the problem of selection, for which we give two algorithms which make an expected 1.5n + o(n) comparisons; one uses O(log n) random bits and is uniform, the other uses O(log log n) random bits and is non-uniform.
A value's degree of use - the number of dynamic uses of that value $provides the most essential information needed to optimize its communication. We present simulation results demonstrating the properties of degre...
详细信息
A value's degree of use - the number of dynamic uses of that value $provides the most essential information needed to optimize its communication. We present simulation results demonstrating the properties of degree of use of values, including their predictability: most static instructions generate values with few degrees of use and these exhibit temporal locality. We use these results to guide the design of a degree of use predictor. The development and detailed characterization of this predictor is the focus of this paper. Our predictor leverages future control flow information (e.g., branch predictions) to select among different possible degrees of use. We study the effects of several optimizations and variations in the predictor's algorithms to tune the predictor for maximum performance. The resulting predictor generates correct degree of use predictions for over 92% of all dynamic values and has a misprediction rate below 2.5%. Such a predictor has a wide range of potential applications in optimizing value communication.
We introduce the XPP-VC high-level compiler, which maps C programs to the coarse-grained XPP architecture. XPP-VC's main feature is the integration of pipeline vectorization and temporal partitioning techniques. T...
详细信息
We introduce the XPP-VC high-level compiler, which maps C programs to the coarse-grained XPP architecture. XPP-VC's main feature is the integration of pipeline vectorization and temporal partitioning techniques. The former provides high-throughput inner loop computations and the later allows the compilation of large programs or the use of fewer XPP processing elements. Although the preliminary results are very encouraging, improvements on the generation of configurations are still required. The evaluation we have conducted shows that only a few seconds are required to generate, from algorithms in C, the binaries to program the XPP. To our knowledge this compilation performance is unmatched by any other compiler targeting reconfigurable architectures. Moreover the compiler still achieves high-performance implementations. Since the XPP is delivered as an IP core or device to be coupled to a host processor, a future version of XPP-VC will consider co-compilation, i.e., compilation to hybrid microprocessor/XPP architectures.
A family of deterministic asynchronous Write-All algorithms were studied to analyze the properties of the set of permutations proposed by Kanellakis and Shvartsman. The efficiency of the algorithms was measured in ter...
详细信息
A family of deterministic asynchronous Write-All algorithms were studied to analyze the properties of the set of permutations proposed by Kanellakis and Shvartsman. The efficiency of the algorithms was measured in terms of work acounted for all machine instructions executed by processors. It was found that the analytical results covered only a subset of the possible adversarial patterns of asynchrony. The analysis suggested that the proposed method yielded a faster construction of the Write-All algorithms compared to other methods.
A new block algorithm for triangularization of regular or singular matrices with dimension m × n, is proposed. Taking benefit of fast block multiplication algorithms, it achieves the best known sequential complex...
详细信息
A new block algorithm for triangularization of regular or singular matrices with dimension m × n, is proposed. Taking benefit of fast block multiplication algorithms, it achieves the best known sequential complexity O(mω-1n) for any sizes and any rank. Moreover, the block strategy enables to improve locality with respect to previous algorithms as exhibited by practical performances.
Automatic parallelization of loop programs based on space-time mapping typically aims at maximal parallelism. In order to obtain reasonable performance, the granularity of the parallelism must be coarsened, e.g., by a...
详细信息
ISBN:
(纸本)9781581134094
Automatic parallelization of loop programs based on space-time mapping typically aims at maximal parallelism. In order to obtain reasonable performance, the granularity of the parallelism must be coarsened, e.g., by applying tiling techniques. In contrast to others, we suggest to apply tiling after the computation of a space-time mapping. This extends the applicability of existing tiling methods and significantly improves efficiency of parallelized loop programs.
A parallel real-time complexity theory was presented to study infinity hierarchy for parallel real-time systems. Set of timed ω-languages was closed under intersection, union, complement, concatenation and Kleene clo...
详细信息
A parallel real-time complexity theory was presented to study infinity hierarchy for parallel real-time systems. Set of timed ω-languages was closed under intersection, union, complement, concatenation and Kleene closure. The proposed real-time algorithm consisted of an input tape containing timed ω-word and an output tape containing symbols from alphabet δ in finite control. The analysis suggested that the hierarchy of real-time algorithms with respect to number of processors was infinite.
暂无评论