Recent evidence indicates that the exploitation of locality in dataflow programs can have a dramatic impact on performance. The current trend in the design of dataflow processors suggests a synthesis of traditional non-strict, fine-grain instruction execution and strict, coarse-grain execution in order to exploit locality. While an increase in instruction granularity favors the exploitation of locality within a single execution thread, the resulting grain size may increase latency among execution threads. In this paper, the latency incurred through the partitioning of fine-grain instructions into coarser-grain threads is evaluated. We define the concept of a cluster of fine-grain instructions to quantify coarse-grain input and output latencies using a set of numeric benchmarks. The results offer compelling evidence that the inner loops of a significant number of numeric codes would benefit from coarse-grain execution. Based on cluster execution times, more than 60% of the measured benchmarks favor coarse-grain execution. In 63% of the cases the input latency to the cluster is the same in coarse- and fine-grain execution modes. These results suggest that the effect of increased instruction granularity on latency is minimal for a high percentage of the measured codes, and is in large part offset by available intra-thread locality. Furthermore, simulation results indicate that strict or non-strict data-structure access does not change the basic cluster characteristics.
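The strict-versus-non-strict firing distinction underlying these latency comparisons can be sketched as follows. This is an illustration of the concept only, not the paper's benchmark methodology; the function names and arrival times are assumptions.

```python
# Illustrative sketch: a strict coarse-grain thread (cluster) starts only
# after ALL of its inputs have arrived, while non-strict fine-grain
# instructions fire individually as soon as their own operands are ready.

def strict_input_latency(cluster_arrivals):
    """Coarse grain: the cluster fires when its last input arrives."""
    return max(cluster_arrivals)

def nonstrict_first_fire(arrivals_per_instruction):
    """Fine grain: the earliest time any single instruction in the
    cluster has all of its own operands and can begin executing."""
    return min(max(operands) for operands in arrivals_per_instruction)
```

For example, a cluster whose inputs arrive at times 3, 1, and 5 cannot start before time 5 under strict execution, whereas a fine-grain instruction inside it whose two operands arrive at times 2 and 3 fires at time 3. The abstract's finding is that, for most measured inner loops, these two start times coincide.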
ISBN:
(print) 0897913949
the results of a study for the design of a 250-MHz GaAs microprocessor that uses multichip module (MCM) technology to improve performance are presented. the design study for the resulting two-level split cache starts with a baseline cache architecture and then examines primary cache size and degree of associativity;primary data-cache write policy;secondary cache size and organization;primary cache fetch size;and concurrency between instruction and data accesses. A trace-driven simulator is used to analyze each design's performance. Memory access time and page-size constraints effectively limit the size of the primary data and instruction caches to 4 kW (16 kB). For such cache sizes, a write-through policy is better than a write-back policy. three cache mechanisms contribute to improved performance. the first is a variant of the write-through policy called write-only. this write policy provides most of the performance benefits of subblock placement without extra valid bits. the second is the use of a split secondary cache. the third mechanism allows loads to pass stores without associative matching.
Recent advances in lightwave transmission systems suggest that optical fiber is going to be the enabling technology that creates high bandwidth communication capabilities required by future application areas such as m...
详细信息
In recent years, RISC manufacturers have focused attention on the enhanced performance of RISC microprocessors in meeting high-speed microprocessor requirements. FDDI has such needs at the network-to-host-computer interface. This paper presents design alternatives for the network interface and a model that can be used to examine the achievable throughput assuming different CISC and RISC processors. The model indicates that CISC processors may outperform RISC processors in FDDI and other high-speed networks where data-movement requirements are significantly larger than the protocol-processing overhead.
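The abstract's conclusion follows from the structure of such a throughput model: per-frame cost splits into fixed protocol processing plus per-byte data movement, and a processor with cheaper block moves wins once movement dominates. A hedged sketch of this style of model, with all parameter names and values as illustrative assumptions rather than the paper's figures:

```python
# Illustrative throughput model (not the paper's model): per-frame time is
# fixed protocol-processing time plus per-byte data-movement time.

def throughput_mbps(frame_bytes, protocol_us, ns_per_byte):
    """Achievable throughput in Mb/s for back-to-back frame handling."""
    move_us = frame_bytes * ns_per_byte / 1000.0   # data-movement time
    total_us = protocol_us + move_us               # total per-frame time
    return frame_bytes * 8 / total_us              # bits per microsecond = Mb/s
```

Under this model, a hypothetical CISC with efficient block-move instructions (lower `ns_per_byte`) outperforms a RISC with a higher per-byte cost whenever frames are large enough that movement, not protocol processing, dominates the per-frame time.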