Branch misprediction penalties depend on branch misprediction rates and branch penalties. Dynamic branch schemes take advantage of hardware to record and predict branch behavior at run-time for reducing branch mispred...
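As a concrete illustration of the kind of run-time scheme the abstract refers to, the sketch below implements a table of 2-bit saturating counters indexed by the branch address; the table size and the toy branch trace are illustrative assumptions, not details taken from the paper.

    # Minimal sketch of a dynamic branch predictor: a table of 2-bit saturating
    # counters indexed by the low bits of the branch address. Counter values 0-1
    # predict not-taken, 2-3 predict taken; the counter is updated after each outcome.
    TABLE_BITS = 10                      # illustrative table size (1024 entries)
    counters = [1] * (1 << TABLE_BITS)   # start in the "weakly not-taken" state

    def predict(pc):
        return counters[pc & ((1 << TABLE_BITS) - 1)] >= 2   # True means "predict taken"

    def update(pc, taken):
        i = pc & ((1 << TABLE_BITS) - 1)
        counters[i] = min(3, counters[i] + 1) if taken else max(0, counters[i] - 1)

    # Toy trace: a loop-closing branch at address 0x40, taken nine times, then not taken.
    trace = [(0x40, True)] * 9 + [(0x40, False)]
    mispredicts = 0
    for pc, taken in trace:
        if predict(pc) != taken:
            mispredicts += 1
        update(pc, taken)
    print("misprediction rate:", mispredicts / len(trace))

With this trace the predictor mispredicts only the first branch and the loop exit (2 of 10), which is the behavior that makes such schemes effective on loop-dominated code.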
ISBN: (Print) 9780818655104
The performance of a superpipelined processor relies heavily on its branch performance. Traditional branch strategies used in pipelined processors are delayed branches and branch with squashing. Delayed branches use safe instructions to fill the branch delay slots. However, for a deeply pipelined processor, the compiler may not be able to find enough safe instructions to fill all of the delay slots. Branch with squashing fills the delay slots with instructions from the predicted target basic block, but the penalty of a branch misprediction is large in superpipelined processors. In this paper, we propose a novel branch scheme, named branch with masked squashing, which combines the advantages of delayed branch and branch with squashing. The basic idea is to fill delay slots with safe instructions, which may come from before or after the branch. The remaining unfilled delay slots are filled with instructions from the predicted target path. On a misprediction, only the unsafe instructions are annulled; the safe instructions in the branch delay slots are always executed. Simulation results show that this branch strategy performs better than the traditional delayed branch and branch with squashing.
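To make the annulment behavior concrete, here is a small back-of-the-envelope model (a sketch under assumed numbers, not the simulation framework used in the paper) of how many delay-slot instructions do useful work per branch under the three strategies discussed above.

    # Expected useful delay-slot instructions per branch under three strategies.
    # Assumptions (illustrative, not from the paper): D delay slots, S of them
    # fillable with safe instructions, and misprediction probability p.
    def useful_slots(D, S, p, strategy):
        S = min(S, D)
        if strategy == "delayed":            # only safe fills; unfilled slots become nops
            return S
        if strategy == "squashing":          # all slots filled from the predicted path,
            return D * (1 - p)               # but every slot is annulled on a misprediction
        if strategy == "masked_squashing":   # safe fills always execute; only the
            return S + (D - S) * (1 - p)     # predicted-path fills are annulled
        raise ValueError(strategy)

    D, S, p = 6, 2, 0.15                     # illustrative numbers
    for s in ("delayed", "squashing", "masked_squashing"):
        print(f"{s:17s} expected useful slots per branch: {useful_slots(D, S, p, s):.2f}")

Whenever at least one safe fill exists (S > 0), the masked-squashing value dominates both alternatives, which is the intuition behind the scheme.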
Reducing switching activity would significantly reduce the power consumption of a processor chip. The authors present two novel techniques, Gray code addressing and cold scheduling, for reducing switching activity on high-performance processors. They use Gray code, in which consecutive numbers differ in only one bit, for addressing. Due to the locality of program execution, Gray code addressing can significantly reduce the number of bit switches. Experimental results show that for typical programs running on a RISC microprocessor, Gray code addressing reduces the switching activity on the address lines by 30-50% compared to normal binary code addressing. Cold scheduling is a software method that schedules instructions so that switching activity is minimized. The authors carried out experiments with cold scheduling on the VLSI-BAM. Preliminary results show that switching activity in the control path is reduced by 20-30%.
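For reference, the Gray code used here is the standard binary-reflected code, computable as n XOR (n >> 1); the short sketch below counts address-line toggles for a purely sequential fetch stream (an illustrative trace, not one of the benchmark programs).

    # Count address-bus bit toggles for a sequential address stream under
    # ordinary binary addressing and under Gray code addressing.
    def gray(n):
        return n ^ (n >> 1)            # binary-reflected Gray code

    def toggles(addresses):
        return sum(bin(a ^ b).count("1") for a, b in zip(addresses, addresses[1:]))

    stream = list(range(256))          # 256 consecutive word addresses
    print("binary switching:", toggles(stream))
    print("gray switching:  ", toggles([gray(a) for a in stream]))

On a purely sequential stream the Gray-coded bus toggles exactly one bit per fetch, which is the best case; the 30-50% figure reported above comes from traces that also contain branches and data references.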
This paper focuses on the issues of designing an optimal superpipelined ISP (instruction set processor) driven by a set of benchmark programs. Most issues discussed in this paper also apply to VLIW and superscalar processors.
This paper presents a study of cache hashing functions for micro-parallel processors (e.g., superpipelined and superscalar processors). Several novel cache hashing functions are evaluated experimentally. Our simulation results show that an unconventional cache hashing function applied to a direct-mapped cache yields hit rates as good as those of a two-way set-associative cache with traditional mapping, while the cache hit times remain as fast as those of a direct-mapped cache with traditional mapping.
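The paper's specific hashing functions are not reproduced here, but the sketch below contrasts the traditional modulo index with a hypothetical XOR-folding index on a direct-mapped cache, which is the general flavor of unconventional hashing the abstract describes; the cache geometry and the conflict-heavy trace are assumptions for illustration.

    # Direct-mapped cache with two index functions: traditional modulo indexing
    # and an XOR-folding hash that mixes higher address bits into the index.
    SETS_LOG2 = 8                        # 256 sets (illustrative)
    BLOCK_LOG2 = 5                       # 32-byte blocks (illustrative)

    def index_modulo(addr):
        return (addr >> BLOCK_LOG2) & ((1 << SETS_LOG2) - 1)

    def index_xor(addr):
        block = addr >> BLOCK_LOG2
        return (block ^ (block >> SETS_LOG2)) & ((1 << SETS_LOG2) - 1)

    def hit_rate(trace, index_fn):
        stored = [None] * (1 << SETS_LOG2)      # block number held by each set
        hits = 0
        for addr in trace:
            block, s = addr >> BLOCK_LOG2, index_fn(addr)
            if stored[s] == block:
                hits += 1
            else:
                stored[s] = block
        return hits / len(trace)

    # Two small arrays whose bases collide under the traditional index,
    # accessed alternately and repeatedly.
    trace = [base + 32 * k for _ in range(50) for k in range(4) for base in (0x00000, 0x10000)]
    print("modulo index hit rate:", hit_rate(trace, index_modulo))
    print("xor index hit rate:   ", hit_rate(trace, index_xor))

Under the modulo index the two arrays ping-pong in the same sets and nearly every access misses, while the XOR index spreads them into different sets; this is only a toy demonstration of how a better hash can remove conflict misses in a direct-mapped cache.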
We have developed a hierarchical performance bounding methodology that attempts to explain the performance of loop-dominated scientific applications on particular systems. The Kendall Square Research KSR1 is used as a running example. We model the throughput of key hardware units that are common bottlenecks in concurrent machines. The four units currently used are: memory port, floating-point, instruction issue, and a loop-carried dependence pseudo-unit. We propose a workload characterization and derive upper bounds on the performance of specific machine-workload pairs. Comparing delivered performance with the bounds focuses attention on areas for improvement and indicates how much improvement might be attainable. We delineate a comprehensive approach to modeling and improving application performance on the KSR1. Application of this approach is being automated for the KSR1 with a series of tools, including K-MA and K-MACSTAT (which enable the calculation of the MACS hierarchy of performance bounds), K-Trace (which allows parallel code to be instrumented to produce a memory reference trace), and K-Cache (which simulates inter-cache communications based on a memory reference trace).
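The core of the bound computation is simple: the run time of a loop is bounded below by its busiest unit. The sketch below illustrates that bottleneck calculation with hypothetical per-iteration operation counts and unit throughputs (they are not KSR1 measurements, and the full MACS hierarchy is not reproduced).

    # Bottleneck-unit lower bound on cycles per loop iteration.
    work = {"memory_port": 3, "floating_point": 2,          # ops issued per iteration
            "instruction_issue": 8, "loop_carried_dependence": 4}
    throughput = {"memory_port": 1, "floating_point": 1,    # ops retired per cycle
                  "instruction_issue": 2, "loop_carried_dependence": 1}

    bounds = {u: work[u] / throughput[u] for u in work}
    cycles_lb = max(bounds.values())                 # the busiest unit sets the bound
    bottleneck = max(bounds, key=bounds.get)

    print("per-unit cycle bounds:", bounds)
    print("bottleneck unit:", bottleneck)
    print("upper bound on flops/cycle:", work["floating_point"] / cycles_lb)

Comparing delivered flops/cycle against a bound of this kind is what points out which unit (or the dependence pseudo-unit) to attack first.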
ISBN: (Print) 9780897916530
We present design for balance testability (DFBT), a systematic signature-based method for enhancing the testability of logic circuits. DFBT employs balance testing and guarantees 100% coverage of single stuck-line faults, as well as many multiple stuck-line and bridging faults. The logic overhead of DFBT is modest (only one extra I/O pin and a small number of extra gates), and the original circuit need not be altered. We illustrate DFBT by applying it to representative logic circuits.
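As a rough illustration of signature-based fault detection in the same spirit (this is a plain ones-count signature over an exhaustive test set applied to a toy circuit, not the DFBT construction itself), consider the following sketch.

    # Ones-count signature testing of a tiny combinational circuit
    # out = (a AND b) OR c, with an optional stuck-at fault injected
    # on the AND gate's output line.
    from itertools import product

    def circuit(a, b, c, stuck=None):
        g = a & b
        if stuck is not None:
            g = stuck                  # model a stuck-at-0 or stuck-at-1 fault
        return g | c

    def signature(stuck=None):
        # number of input vectors producing a 1 at the output
        return sum(circuit(a, b, c, stuck) for a, b, c in product((0, 1), repeat=3))

    good = signature()
    for fault in (0, 1):
        bad = signature(fault)
        status = "detected" if bad != good else "missed"
        print(f"stuck-at-{fault}: signature {bad} vs fault-free {good} -> {status}")

In this toy version the fault-free signature has to be obtained by simulating the good circuit; the point of the DFBT logic described above is to make the expected response predictable by construction, and that construction is not reproduced here.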
Modulo scheduling is an efficient technique for exploiting instruction-level parallelism in a variety of loops, resulting in high-performance code but increased register requirements. We present a combined approach that schedules the loop operations for minimum register requirements, given a modulo reservation table. Our method determines optimal register requirements for machines with finite resources and for general dependence graphs. This method demonstrates the potential of lifetime-sensitive modulo scheduling and is useful in evaluating the performance of lifetime-sensitive modulo scheduling heuristics.
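To make the terms concrete, the sketch below computes the resource-constrained minimum initiation interval (ResMII) from a modulo reservation table and the register requirement (MaxLive) of one candidate schedule; the operation mix, resource counts, and lifetimes are illustrative, and the paper's optimal formulation is not reproduced.

    # ResMII from a modulo reservation table, and MaxLive for one candidate schedule.
    from math import ceil

    uses  = {"memory": 4, "fp_add": 2, "fp_mul": 3}   # ops needing each resource per iteration
    units = {"memory": 2, "fp_add": 1, "fp_mul": 1}   # available units of each resource

    res_mii = max(ceil(uses[r] / units[r]) for r in uses)
    print("ResMII =", res_mii)

    # Value lifetimes as (definition cycle, last-use cycle) in the flat schedule.
    lifetimes = [(0, 4), (1, 3), (2, 7), (3, 5), (4, 9)]
    II = res_mii
    overlap = max(u for _, u in lifetimes) // II + 1   # iterations a lifetime can span
    maxlive = max(
        sum(1 for d, u in lifetimes for k in range(overlap) if d <= c + k * II <= u)
        for c in range(II)
    )
    print("MaxLive (registers needed) =", maxlive)

Lifetime-sensitive scheduling chooses, among schedules that achieve the same II, one that minimizes this MaxLive figure.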