The performance of two classes of large-scale shared-memory machines that have recently emerged, cache-coherent nonuniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA), is compared. They...
详细信息
ISBN:
(纸本)0897915097
The performance of two classes of large-scale shared-memory machines that have recently emerged, cache-coherent nonuniform-memory-access machines (CC-NUMA) and cache-only memory architectures (COMA), is compared. They both have distributed main memory and use directory-based cache coherence. However, COMA machines automatically migrate and replicate data at the main-memory level in cache-line sized chunks. A qualitative model showing that the relative performance is primarily determined by the relative magnitude of capacity misses versus coherence misses and by the granularity of data partitions in the application is presented. Quantitative results using simulation studies for eight parallel applications are reported. It is shown that COMA's potential for performance improvement is limited to applications in which data accesses by different processors are finely interleaved in memory space and where capacity misses dominate over coherence misses. In other situations, COMA can actually perform worse than CC-NUMA due to increased miss latencies caused by its hierarchical directories. An architecture alternative, called COMA-F, that combines the advantages of both is proposed.
Traditionally, register files have been the primary agent for inter-operation communication in load/store architectures. As processors start issuing multiple instructions per cycle, a centralized register file can eas...
详细信息
ISBN:
(纸本)0818631759
Traditionally, register files have been the primary agent for inter-operation communication in load/store architectures. As processors start issuing multiple instructions per cycle, a centralized register file can easily become a bottleneck. This paper analyzes the register file traffic in a load/store architecture with a view to motivate the development of alternate inter-operation communication mechanisms that reduce the bandwidth demanded of a centralized register file. We first provide metrics to characterize the register traffic. These metrics deal with the degree and locality of use of the register instances created. We then present the results of a simulation study that uses the MIPS R2000 architecture and the SPEC benchmark programs. We have two major results. First, a large number of the register instances are used only once, and the average degree of use of register instances is about 2. Second, most of the register instances are used up soon after they are created (within about 30-40 instructions). This suggests that alternate inter-operation communication mechanisms that exploit the temporal locality of use of register instances are likely to be effective in reducing the traffic burden on the centralized register file. The second result was pivotal in the design of the distributed register file for the multiscalar processing paradigm.
We show that n integers in the range 1. n can be stably sorted on an EREW PRAM using O(log n)3/2(log log n)1/2) time, O(n (log n)1/2(log log n)1/2) operations and O(n) space. In addition, we are able to stably sort n ...
详细信息
ISBN:
(纸本)089791466X
We show that n integers in the range 1. n can be stably sorted on an EREW PRAM using O(log n)3/2(log log n)1/2) time, O(n (log n)1/2(log log n)1/2) operations and O(n) space. In addition, we are able to stably sort n integers in the range 1. n on a deterministic CREW PRAM in O((log n)3/2) time with O(n (log n)1/2) operations and O(n) space and to stably sort n arbitrary integers on a randomized CREW PRAM within the same complexity bounds with high probability. In each case our algorithm is closer to optimality than all previous algorithms for the stated problem in the stated model, and our third result matches the operation count of the best known sequential algorithm. We also show that n integers in the range 1. m can be sorted in O((log n)2) time with O(n) operations on an EREW PRAM using a nonstandard word length of O(lognloglognlogm) bits, thereby greatly improving the upper bound on the word length necessary to sort integers with a linear time-processor product, even sequentially. Our algorithms were inspired by, and in one case directly use, the fusion trees recently introduced by Fredman and Willard.
parallel computing teaching has an important difficulty, there are few tools to directly learn the behavior of the parallelalgorithms and the parallelarchitectures. Normally the student is formed to think in sequent...
ISBN:
(纸本)9780897914680
parallel computing teaching has an important difficulty, there are few tools to directly learn the behavior of the parallelalgorithms and the parallelarchitectures. Normally the student is formed to think in sequential algorithms running in sequential machines. We present PSEE, a tool to reduce the gap between the basic concepts and its utilization. PSEE is an integrated and interactive graphic environment which allows to simulate and evaluate the performance of parallelalgorithms in parallelarchitectures. PSEE permits to manage the main characteristic parameters involved in the system in order to show the tuning grade of the algorithm/architecture couple. PSEE includes a graphic editor for algorithms and architectures in modelled form, an interactive simulator to run (simulate) the algorithm on the architecture and a performance evaluation instrument.
The strong link between matroids and matching is used to extend the ideas that resulted in the design of Random NC algorithms for matching to obtain RNC algorithms for the well-known problems of finding an arboresence...
ISBN:
(纸本)9780897914666
The strong link between matroids and matching is used to extend the ideas that resulted in the design of Random NC algorithms for matching to obtain RNC algorithms for the well-known problems of finding an arboresence and a maximum cardinality set of edge-disjoint spanning trees in a graph. The key tools used are linear algebra and randomization.
For pt.I see Proc. 3rd Ann. acm Symp. parallel Algms. Architecture, p. 180-91 (1991). The authors show that over any field, the solution set to a system of n linear equations in n unknowns can be computed in parallel ...
详细信息
For pt.I see Proc. 3rd Ann. acm Symp. parallel Algms. Architecture, p. 180-91 (1991). The authors show that over any field, the solution set to a system of n linear equations in n unknowns can be computed in parallel with randomization simultaneously in poly-logarithmic time in n and with only as many processors as are utilized to multiply two n * n matrices. A time unit represents an arithmetic operation in the field. For singular systems the parallel timings are asymptotically as fast as those for non-singular systems, due to the avoidance of binary search in the matrix rank problem, except when the field has small positive characteristic; in that case, binary search is avoided at a somewhat higher processor count measure.< >
暂无评论