Multithreading has been proposed as an architectural strategy for tolerating latency in multiprocessors and, through limited empirical studies, shown to offer promise. This paper develops an analytical model of multit...
详细信息
ISBN:
(纸本)0897913701
Multithreading has been proposed as an architectural strategy for tolerating latency in multiprocessors and, through limited empirical studies, shown to offer promise. This paper develops an analytical model of multithreaded processor behavior based on a small set of architectural and program parameters. The model gives rise to a large Markov chain, which is solved to obtain a formula for processor efficiency in terms of the number of threads per processor, the remote reference rate, the latency, and the cost of switching between threads. It is shown that a multithreaded processor exhibits three operating regimes: linear (efficiency is proportional to the number of threads), transition, and saturation (efficiency depends only on the remote reference rate and switch cost). Formulae for regime boundaries are derived. The model is embellished to reflect cache degradation due to multithreading, using an analytical model of cache behavior, demonstrating that returns diminish as the number threads becomes large. Predictions from the embellished model correlate well with published empirical measurements. Prescriptive use of the model under various scenarios indicates that multithreading is effective, but the number of useful threads per processor is fairly small.
We provide optimal algorithms for sorting, FFT, matrix transposition, standard matrix multiplication, and related problems in terms of the number of input/outputs (I/Os) required between internal memory and secondary ...
详细信息
ISBN:
(纸本)0897913612
We provide optimal algorithms for sorting, FFT, matrix transposition, standard matrix multiplication, and related problems in terms of the number of input/outputs (I/Os) required between internal memory and secondary storage. Our two-level memory model is new and gives a realistic treatment of parallel block transfer, in which during a single I/O each of the P secondary storage devices (disks) can simultaneously transfer a contiguous block of B records. We also introduce parallel variants of hierarchical memory models and give optimal algorithms. In our models there are P hierarchies, which operate in parallel. Communication among the hierarchies takes place at a base memory level. The difficulty in developing optimal algorithms in our two-level and hierarchical models is to cope with the partitioning of memory into P separate physical devices. Our sorting algorithms are randomized, but practical;the probability of using more than an optimal number of I/Os falls off exponentially.
Let P be a simple rectilinear convex polygon of size O(n) inside which lie n pairwise disjoint rectangular rectilinear obstacles. We provide parallel techniques for computing rectilinear shortest paths that avoid the ...
详细信息
ISBN:
(纸本)0897913701
Let P be a simple rectilinear convex polygon of size O(n) inside which lie n pairwise disjoint rectangular rectilinear obstacles. We provide parallel techniques for computing rectilinear shortest paths that avoid the set of obstacles in P. Specifically, we compute descriptions of shortest paths in O(log2 n) time, with O(n2/log2n) processors in the CREWPRAM model if source and destination are on the boundary of P, with O(n2/log n) processors if the source is an obstacle vertex and the destination a vertex of P, and with O(n2) processors if both source and destination are obstacle vertices. The descriptions we compute enable one processor to obtain the path length of any query pair of vertices in constant time, or O(n/log n) processors to retrieve the shortest path itself in logarithmic time. If the two query points are arbitrary rather than vertices, then one processor takes O(log n) time (instead of constant time) for finding the path length. A number of other related shortest paths problems are solved. The techniques we use involve a fast computation of separator staircases and, most importantly, a scheme for partitioning the obstacles' boundaries into families in a way that ensures that the resulting path length matrices have a monotonicity property that is apparently absent before applying our partitioning scheme.
Currently, there is a significant gap between the best sequential and parallel complexities of many fundamental problems related to digraph reachability. This complexity bottleneck essentially reflects a seemingly una...
详细信息
ISBN:
(纸本)0897913612
Currently, there is a significant gap between the best sequential and parallel complexities of many fundamental problems related to digraph reachability. This complexity bottleneck essentially reflects a seemingly unavoidable reliance on transitive closure techiques in parallelalgorithms for digraph reachability. To pinpoint the nature of the bottleneck, we develop a collection of polylog-time reductions among reachability problems. These reductions use only linear processors and work for general graphs. Furthermore, for planar digraphs, we give polylog-time algorithms for the following problems: (1) directed ear decomposition, (2) topological ordering, (3) digraph reachability, (4) descendent counting, and (5) depth-first search. These algorithms use only linear processors and therefore reduce the complexity to within a polylog factor of optimal.
A parallel computing system becomes increasingly prone to failure as the number of processing elements in it increases. In this paper, we describe a completely general strategy that takes an arbitrary step of an ideal...
详细信息
ISBN:
(纸本)0897913612
A parallel computing system becomes increasingly prone to failure as the number of processing elements in it increases. In this paper, we describe a completely general strategy that takes an arbitrary step of an ideal CRCW PRAM and automatically translates it to run efficiently and robustly on a PRAM in which processors are prone to failure. The strategy relies on efficient robust algorithms for solving a core problem, the Certified Write-All Problem. This problem characterizes the core of robustness, because, as we show, its complexity is equal to that of any general strategy for realizing robustness in the model. We analyze the expected parallel time and work of various algorithms for solving this problem. Our results are a non-trivial generalization of R.P. Brent's Lemma. We consider the case where the number of the available processors decreases dynamically over time, whereas Brent's Lemma is only applicable in the case where the processor availability pattern is static.
暂无评论