ISBN: 9780897913195 (print)
A description is given of the VMP-MC design, a distributed parallel multicomputer based on the VMP multiprocessor design that is intended to provide a set of building blocks for configuring machines from one to several thousand processors. VMP-MC uses a memory hierarchy based on shared caches, ranging from on-chip caches to board-level caches connected by buses to a high-speed fiber-optic ring. In addition to describing the building-block components of this architecture, the authors identify the key performance issues associated with the design and provide a performance evaluation of these issues using trace-driven simulation and measurements from the VMP.
Write-invalidate and write-broadcast coherency protocols have been criticized for being unable to achieve good bus performance across all cache configurations. In particular, write-invalidate performance can suffer as block size increases, and large cache sizes hurt write-broadcast. Read-broadcast and competitive-snooping extensions to the protocols have been proposed to solve each problem. The results reported here indicate that the benefits of the extensions are limited. Read-broadcast reduces the number of invalidation misses, but at a high cost in processor lockout from the cache; the net effect can be an increase in total execution cycles. Competitive snooping benefits only those programs with high per-processor locality of reference to shared data. For programs characterized by interprocessor contention for shared addresses, competitive snooping can degrade performance by causing a slight increase in bus utilization and total execution time.
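The invalidate-versus-broadcast trade-off described above can be illustrated with a toy snooping-bus model. The sketch below is hypothetical (it abstracts away block size, cache capacity, and protocol states, all of which matter in the paper's evaluation); it simply counts bus transactions and invalidation misses for a trace of per-cache references:

```python
# Toy snooping-bus model contrasting write-invalidate and write-broadcast.
# Hypothetical sketch only: no block granularity, infinite caches, every
# write and every miss costs exactly one bus transaction.

def simulate(protocol, trace, n_caches):
    """Run a trace of (cache_id, op, block) events, op in {'r', 'w'}.
    protocol is 'invalidate' or 'broadcast'.
    Returns (bus_transactions, invalidation_misses)."""
    present = [set() for _ in range(n_caches)]      # blocks held per cache
    invalidated = [set() for _ in range(n_caches)]  # blocks lost to remote writes
    bus = inval_misses = 0
    for cid, op, block in trace:
        if block not in present[cid]:
            bus += 1                                # miss: fetch over the bus
            if block in invalidated[cid]:
                inval_misses += 1                   # re-fetch caused by invalidation
                invalidated[cid].discard(block)
            present[cid].add(block)
        if op == 'w':
            bus += 1                                # write appears on the bus
            for other in range(n_caches):
                if other != cid and block in present[other]:
                    if protocol == 'invalidate':
                        present[other].discard(block)
                        invalidated[other].add(block)
                    # 'broadcast': remote copies are updated and stay valid
    return bus, inval_misses
```

On a two-cache ping-pong trace where cache 0 repeatedly writes a block that cache 1 reads, write-invalidate forces cache 1 to re-fetch (an invalidation miss), while write-broadcast keeps the remote copy valid at the cost of updating it on every write.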
Most current single-chip processors use an on-chip instruction cache to improve performance. A miss in this instruction cache causes an external memory reference that must compete with data references for access to the external memory, thus affecting the overall performance of the processor. One common way to reduce the number of off-chip instruction requests is to increase the size of the on-chip cache. An alternative approach is presented in which a combination of an instruction cache, an instruction queue, and an instruction queue buffer achieves the same effect with a much smaller instruction cache. Such an approach is significant for emerging technologies where high circuit densities are initially difficult to achieve yet a high level of performance is desired, or for more mature technologies where chip area can be used to provide more functionality. The viability of this approach is demonstrated by its implementation in an existing single-chip processor.
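As a rough illustration of why a sequential prefetch queue can substitute for cache capacity, the sketch below (hypothetical names and sizes, not the paper's design) counts off-chip instruction fetches with a tiny direct-mapped instruction cache, with and without a small queue that streams in the fall-through path after each miss:

```python
# Hypothetical model: a direct-mapped I-cache plus a sequential prefetch
# queue. A reference supplied by either structure stays on chip; a miss
# goes off chip and refills the queue with the next few addresses.

def fetches_offchip(addresses, cache_lines=4, queue_depth=4):
    cache = {}       # line index -> tag
    queue = set()    # addresses already streamed into the queue
    offchip = 0
    for a in addresses:
        line, tag = a % cache_lines, a // cache_lines
        if cache.get(line) == tag or a in queue:
            continue                                   # supplied on chip
        offchip += 1                                   # off-chip fetch
        cache[line] = tag
        queue = {a + i for i in range(1, queue_depth + 1)}  # prefetch ahead
    return offchip
```

For a purely sequential 10-instruction trace, the 4-line cache alone misses on every reference, while the same cache backed by a 4-deep queue goes off chip only twice, which is the effect the abstract attributes to the cache/queue/buffer combination.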
The question of what a von Neumann processor can borrow from dataflow to make it more suitable for a multiprocessor is explored. Starting with a simple, RISC (reduced-instruction-set-computer)-like instruction set, the author shows how to change the underlying processor organization to make it multithreaded. He then extends it with three instructions that give it a fine-grained dataflow capability, and he calls the result P-RISC, for parallel RISC. Finally, the author discusses memory support for such multiprocessors. He compares his approach to existing MIMD (multiple-instruction/multiple-data-stream) machines and to other dataflow machines.
An optimal memory system for realizing a high-performance integrated Prolog processor, the IPP, is discussed. First, the memory-access characteristics of Prolog are analyzed by a simulator, which simulates the execution of a Prolog program at the microinstruction level. The main findings from this analysis are that the write-access ratio of Prolog is larger than that of procedural languages and that performance improvement requires the memory system to process concentrated, large write accesses effectively. Prolog acceleration strategies for conventional cache memories are then discussed. Cache memories (store-swap, store-through) and a stack buffer are compared with regard not only to performance but also to reliability, complexity, and effects on procedural languages. It is concluded that the advanced store-through cache is best suited to the IPP.
The authors propose and analyze a two-level cache organization that provides high memory bandwidth. The first-level cache is accessed directly by virtual addresses. It is small, fast, and, without the burden of address translation, can easily be optimized to match the processor speed. The virtually addressed cache is backed up by a large physically addressed cache; this second-level cache provides a high hit ratio and greatly reduces memory traffic. The authors show how the second-level cache can easily be extended to solve the synonym problem resulting from the use of a virtually addressed cache at the first level. Moreover, the second-level cache can be used to shield the virtually addressed first-level cache from irrelevant cache-coherence interference. Finally, simulation results show that this organization has a performance advantage over a hierarchy of physically addressed caches in a multiprocessor environment.
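The synonym problem arises when two different virtual addresses map to the same physical block, leaving two inconsistent copies in a virtually addressed cache. One way a physically addressed second level can resolve it, sketched below under assumed data structures (not the paper's actual mechanism), is to record which virtual address currently holds each physical block in L1 and evict the stale alias when a second virtual name arrives:

```python
# Hypothetical two-level sketch: virtually addressed L1 backed by a
# physically addressed L2 that tracks the one L1 copy of each block,
# so a synonym (second virtual alias) can be detected and flushed.

class TwoLevel:
    def __init__(self, translate):
        self.translate = translate  # virtual address -> physical address
        self.l1 = {}                # virtual address -> data
        self.l2 = {}                # physical address -> (data, l1 vaddr or None)

    def read(self, vaddr, memory):
        if vaddr in self.l1:
            return self.l1[vaddr], 'l1-hit'
        paddr = self.translate(vaddr)
        if paddr in self.l2:
            data, alias = self.l2[paddr]
            if alias is not None and alias != vaddr:
                del self.l1[alias]          # synonym: evict aliased L1 copy
            self.l1[vaddr] = data
            self.l2[paddr] = (data, vaddr)  # L2 now points at the new alias
            return data, 'l2-hit'
        data = memory[paddr]                # miss: fill both levels
        self.l1[vaddr] = data
        self.l2[paddr] = (data, vaddr)
        return data, 'miss'
```

Because only L2 sees physical addresses, the same back-pointer can also filter coherence traffic: an external invalidation that misses in L2 cannot concern the L1 and need not disturb it.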
SIMP is a novel multiple-instruction-pipeline parallel architecture. It enhances the performance of SISD (single-instruction/single-data-stream) processors by exploiting both temporal and spatial parallelism while keeping program compatibility. The degree of performance enhancement achieved by SIMP depends on how continuously multiple instructions can be supplied and how effectively data and control dependencies can be resolved. Techniques for instruction fetch and dependency resolution have been devised for this purpose. The instruction-fetch mechanism uses unique schemes for prefetching multiple instructions with the help of branch prediction, squashing instructions selectively, and providing multiple conditional modes as a result. The dependency-resolution mechanism permits out-of-order execution of a sequential instruction stream. The out-of-order execution model is based on R. M. Tomasulo's algorithm (1967), which has been used in single-instruction-pipeline processors; however, it is greatly extended to accommodate multiple-instruction pipelining.
Increasing execution power requires a high instruction-issue bandwidth, while less dense instruction encoding and some code-improving techniques cause code expansion. Therefore, instruction memory hierarchy performance has become an important factor in system performance. An instruction-placement algorithm has been implemented in the IMPACT-1 (Illinois Microarchitecture Project using Advanced Compiler Technology - Stage I) C compiler to maximize sequential and spatial locality and to minimize mapping conflicts. This approach achieves low cache miss ratios and low memory traffic ratios for small, fast instruction caches with little hardware overhead. For ten realistic Unix programs, the authors report low miss ratios (average 0.5%) and low memory traffic ratios (average 8%) for a 2048-byte, direct-mapped instruction cache using 64-byte blocks. This result compares favorably with the fully associative cache results reported by other researchers. The authors also present the effect of cache size, block size, block sectoring, and partial loading on cache performance. Code performance with instruction-placement optimization is shown to be stable across architectures with different instruction-encoding densities.
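The mapping conflicts that placement minimizes are easy to see in a minimal direct-mapped cache model. The sketch below is illustrative only (block-granular trace, parameters unrelated to IMPACT-1's configuration): two frequently alternating code blocks that collide on the same cache line thrash, while a placement that separates them almost eliminates misses:

```python
# Minimal direct-mapped instruction-cache model over a trace of block
# numbers; each line holds one block. Hypothetical parameters.

def miss_ratio(block_trace, n_lines):
    lines, misses = {}, 0
    for b in block_trace:
        idx = b % n_lines          # direct mapping: block -> one line
        if lines.get(idx) != b:
            misses += 1            # conflict or cold miss
            lines[idx] = b
    return misses / len(block_trace)

# Caller and callee placed 8 blocks apart collide in an 8-line cache...
thrash = [0, 8] * 50
# ...while placing the callee in the adjacent block removes the conflict.
placed = [0, 1] * 50
```

Here `miss_ratio(thrash, 8)` is 1.0 (every reference evicts the other block) and `miss_ratio(placed, 8)` is 0.02 (two cold misses), which is the kind of conflict reduction a placement algorithm buys without any extra hardware.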
The authors explore the extent to which multiple hardware contexts per processor can help mitigate the negative effects of high memory latency. In particular, they evaluate the performance of a directory-based cache-coherent multiprocessor using memory-reference traces obtained from three parallel applications. The authors explore the case where there is a small, fixed number (two to four) of hardware contexts per processor and the context-switch overhead is low. In contrast to previously proposed approaches, they also use a very simple context-switch criterion, namely a cache miss or a write hit to shared data. The results show that the effectiveness of multiple contexts depends on the nature of the applications, the context-switch overhead, and the inherent latency of the machine architecture. Given reasonably low-overhead hardware context switches, the authors show that two or four contexts can achieve substantial performance gains over a single context. For one application, processor utilization increased by about 46% with two contexts and by about 80% with four contexts.
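A back-of-the-envelope steady-state model shows why a few contexts suffice when switch overhead is low. The formula below is a standard idealization (uniform run lengths, round-robin switching), not the paper's trace-driven methodology; `run` is the cycles a context executes between misses, `latency` the miss service time, `switch` the context-switch overhead:

```python
# Idealized utilization of a processor with n hardware contexts that
# switch on every cache miss. Assumes uniform run lengths between
# misses and round-robin scheduling; a sketch, not the paper's model.

def utilization(n, run, latency, switch=0):
    """Fraction of cycles spent executing useful instructions."""
    if n == 1:
        return run / (run + latency)        # single context just stalls
    round_busy = n * (run + switch)         # work issued per round of contexts
    round_len = max(round_busy, run + switch + latency)  # miss must be served
    return n * run / round_len
```

With `run=10`, `latency=40`, and `switch=2`, one context is busy only 20% of the time, two contexts roughly double that, and four contexts nearly cover the latency; beyond the point where `n * (run + switch)` exceeds `run + switch + latency`, extra contexts only pay switch overhead, which matches the paper's observation that a small fixed number of contexts captures most of the gain.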
A high-degree-of-parallelism (more than a thousand processors) dataflow machine called EM-4 is under development. The authors assert that it is essential to fabricate the processing element (PE) on a single chip to reduce operation time, system size, design complexity, and cost. In the EM-4, the PE, called the EMC-R, has been specially designed using a 50,000-gate gate-array chip. The authors focus on the architecture of the EMC-R. Its distinctive features are a strongly connected arc dataflow model, a direct matching scheme, and the integration of a packet-based circular pipeline and a register-based advanced control pipeline. These features are examined, and the instruction-set architecture and the configuration architecture that utilize them are described.