Recently, selective encoding of scan slices has been proposed to compress test data. Unlike many other compression techniques, which encode all the bits, this encoding technique only encodes the target-symbol by specifying sing...
This paper presents a methodology for high-level power modeling of cell-based processors. A flexible power model library, which can automatically generate detailed power data for the actual circuits of each part of a given processor, is developed and annotated dynamically for an architecture-level power simulator. With this method, dynamic power, leakage power, and even area and cell counts can be accurately estimated. A preliminary power validation on a MIPS microprocessor shows that the methodology is effective and highly correlated, with only small errors compared with gate-level power analysis.
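The pipeline the abstract describes, an architecture-level simulator annotated with per-part power data from a model library, can be sketched roughly as follows. The part names, units, and numbers below are hypothetical placeholders for illustration, not data from the paper's library:

```python
# Hypothetical per-part records, in the shape a power model library
# might emit: energy per access, leakage power, area, and cell count.
PART_MODELS = {
    "icache":  {"energy_per_access_nj": 0.32, "leakage_mw": 1.8, "area_mm2": 0.9, "cells": 120_000},
    "regfile": {"energy_per_access_nj": 0.05, "leakage_mw": 0.4, "area_mm2": 0.1, "cells": 15_000},
    "alu":     {"energy_per_access_nj": 0.08, "leakage_mw": 0.6, "area_mm2": 0.2, "cells": 30_000},
}

def estimate(access_counts, runtime_s):
    """Combine per-part activity counts from an architecture-level
    simulator with the library's per-part power data."""
    # Dynamic power: total switching energy (nJ -> J) over the runtime, in mW.
    dynamic_mw = sum(PART_MODELS[p]["energy_per_access_nj"] * n
                     for p, n in access_counts.items()) * 1e-9 / runtime_s * 1e3
    # Leakage, area, and cell count are simple sums over the parts.
    leakage_mw = sum(m["leakage_mw"] for m in PART_MODELS.values())
    return {
        "dynamic_mw": dynamic_mw,
        "leakage_mw": leakage_mw,
        "total_mw": dynamic_mw + leakage_mw,
        "area_mm2": sum(m["area_mm2"] for m in PART_MODELS.values()),
        "cells": sum(m["cells"] for m in PART_MODELS.values()),
    }
```

The split mirrors the abstract's claim: activity-dependent dynamic power comes from simulator counts, while leakage, area, and cell count fall out of the library annotations alone.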
ISBN:
(Print) 9781617386404
As the scale of parallel machines grows, the communication network plays a more important role than ever before. Communication affects not only execution time but also the scalability of parallel applications. A parallel interconnection network simulator is a suitable tool for studying large-scale interconnection networks. However, simulating packet-level communication on detailed cycle-accurate network models is a challenging task. We implemented a kernel-based parallel simulator, HPPNetSim, to address these problems. Because the optimistic PDES mechanism consumes a huge amount of memory saving simulation entities' states in large-scale simulations, we chose a conservative synchronization approach. The simulation kernel and network models are carefully designed. To accelerate the simulation, optimizations such as block/unblock synchronization, load balancing, and dynamic look-ahead generation are introduced. Simulation examples and performance results show that HPPNetSim achieves both high accuracy and good performance: a speedup of 19.8 on 32 processing nodes when simulating a 36-port 3-tree fat-tree network.
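Conservative synchronization of the kind HPPNetSim adopts processes an event only once no neighbour can still send anything earlier, a bound given by the neighbour's clock plus the lookahead. A toy two-LP sketch of that safety condition (my own illustration, not the HPPNetSim kernel, which handles many LPs, blocking synchronization, and load balancing):

```python
import heapq

def run_two_lp(lookahead, end_time):
    """Two logical processes ping-pong a message whose transit delay
    equals the lookahead. An LP may process its earliest pending event
    only while its timestamp does not exceed the peer's clock plus the
    lookahead (the conservative safety condition); otherwise it waits
    for the peer to advance."""
    clock = [0.0, 0.0]                 # local virtual time of each LP
    queues = [[(0.0, "ping")], []]     # pending input events per LP
    processed = []
    while True:
        progressed = False
        for lp in (0, 1):
            peer = 1 - lp
            # Safe: the peer can emit nothing earlier than clock[peer] + lookahead.
            while queues[lp] and queues[lp][0][0] <= clock[peer] + lookahead:
                t, msg = heapq.heappop(queues[lp])
                if t > end_time:
                    return processed
                clock[lp] = t          # advance local virtual time
                processed.append((lp, t))
                # Reply to the peer after the transit delay.
                heapq.heappush(queues[peer], (t + lookahead, msg))
                progressed = True
        if not progressed:
            return processed           # neither LP can safely proceed
```

A larger lookahead lets each LP run further ahead before synchronizing, which is exactly why the dynamic look-ahead generation mentioned in the abstract matters for performance.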
For a gigahertz microprocessor with multiple clock domains and a large number of embedded RAMs (Random Access Memory), generating at-speed test patterns is becoming very difficult and time-consuming. This pape...
ISBN:
(Print) 9781424437696
Bugs tend to be unavoidable in the design of complex integrated circuits, so it is imperative to identify them as soon as possible through post-silicon debug. The main challenge for post-silicon debug is the observability of internal signals. This paper exploits the fact that it is not necessary to observe error-free states. We introduce the "suspect window" and present a method for determining its boundaries. Based on the suspect window, we propose a debug approach that achieves high observability by reusing the scan chain. Since scan dumps take place only in the suspect window, debug time is greatly reduced. Experimental results demonstrate the effectiveness of the proposed approach.
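One way to picture a suspect window: assuming the failure is repeatable and the error is observed at most some known latency after it occurs, the first mismatching cycle can be located by binary search, and the window extends backwards from it by that latency. The interface below (re-running the test to a cycle and comparing a scan-dumped signature against a golden model) is an illustrative assumption, not the boundary-determination method of the paper:

```python
def locate_error_cycle(total_cycles, signature_at, golden_at):
    """Binary-search for the first cycle at which the observed signature
    diverges from the golden model, assuming the failure is repeatable.
    With a known maximum error latency L, the suspect window is then
    [first_bad - L, first_bad]: the bug must have been activated at
    most L cycles before it became observable, so scan dumps are only
    needed inside that window."""
    lo, hi = 0, total_cycles          # invariant: every cycle < lo matches
    while lo < hi:
        mid = (lo + hi) // 2
        if signature_at(mid) == golden_at(mid):
            lo = mid + 1              # still error-free at `mid`
        else:
            hi = mid                  # already diverged at `mid`
    return lo                         # first cycle with a mismatch
```

Each probe costs one re-run, so the search takes O(log total_cycles) runs rather than a full-trace dump, which matches the abstract's point that restricting observation to the suspect window saves debug time.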
Multicore architecture is becoming a promising way to keep Moore's Law alive, and it brings a revolution to both research and industry that opens new design space for software and architecture. The fast Fourier transform (FFT), which is both compute-intensive and bandwidth-intensive, is one of the most popular and important applications. Compared with the computing resources on a multicore architecture, on-chip memory is much more expensive because of the limited physical chip size. An efficient implementation of the FFT algorithm on a multicore with good scalability is a challenge for both software and hardware developers. In this paper, on the Godson-T architecture, an optimized implementation of 1-D FFT is developed with matrix-transpose concealment and computation/communication overlapping, which achieves more than 30% performance improvement as well as nearly a one-third reduction in L2 cache consumption compared with the baseline six-step FFT. The limitation of scalability is also analyzed; the conclusion is that on Godson-T, when frequent and simultaneous data accesses occur, the limited access bandwidth of the L2 cache becomes the bottleneck and results in longer on-chip network latency.
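The six-step FFT that serves as the baseline above decomposes a length-N DFT (N = n1·n2) into small row and column DFTs with a twiddle-factor multiplication in between; the transposes that the Godson-T implementation works to conceal appear below as index swaps. A minimal reference sketch with naive O(n²) small DFTs, for clarity rather than speed, and not the optimized Godson-T code:

```python
import cmath

def dft(v):
    # Naive O(n^2) DFT, used as the small-size transform in each step.
    n = len(v)
    return [sum(v[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]

def six_step_fft(x, n1, n2):
    """Length-N DFT of x (N = n1*n2) via the six-step decomposition:
    row DFTs, twiddle multiplication, column DFTs, with the transposes
    of the classic formulation expressed as index swaps."""
    N = n1 * n2
    assert len(x) == N
    # View x as an n1 x n2 matrix: a[j1][j2] = x[j1 + n1*j2].
    a = [[x[j1 + n1 * j2] for j2 in range(n2)] for j1 in range(n1)]
    # n1 independent DFTs of size n2, one per row.
    a = [dft(row) for row in a]
    # Multiply by the twiddle factors W_N^(j1*k2).
    for j1 in range(n1):
        for k2 in range(n2):
            a[j1][k2] *= cmath.exp(-2j * cmath.pi * j1 * k2 / N)
    # n2 independent DFTs of size n1, one per column.
    cols = [dft([a[j1][k2] for j1 in range(n1)]) for k2 in range(n2)]
    # Read out in transposed order: X[k2 + n2*k1] = cols[k2][k1].
    return [cols[k % n2][k // n2] for k in range(N)]
```

The two batches of independent small DFTs are what parallelize across cores, and the transposed accesses between them are the memory traffic that the paper's transpose-concealment and computation/communication overlapping target.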
Binary translation is one of the most important approaches for system migration. However, software binary translation systems often suffer from inefficiency, and traditional hardware-software co-designed virtual machines require an unavoidable redesign of the processor architecture. This paper presents a novel hardware-software co-designed method to accelerate binary translation on an existing architecture. Hardware support for source-architecture-only functions, partial decoding, and binary translation system acceleration is proposed. This support helps the binary translation system achieve high performance and simplifies the design of the binary translation software, while the hardware cost is kept at a low level. The support is implemented in Godson-3 processors to speed up x86 binary translation to the native MIPS instruction set. Performance evaluations on RTL simulation and FPGA emulation platforms show that the proposed method can speed up most benchmark programs by nearly 10 times compared with pure software-based binary translation and achieves about 70% of native execution performance. The chip is fabricated in an ST 65nm CMOS process, and physical design results show that the chip area cost is less than 5%.
In database systems, disk I/O performance is usually the bottleneck of query processing. Among many techniques, compression is one of the most important for reducing disk accesses and thereby improving system performance. RLE (run-length encoding) is a lightweight compression algorithm that incurs negligible CPU cost. A lot of work shows that although RLE is one of the most effective compression techniques in column-oriented systems, it is very hard to use in row-oriented systems, where values from multiple attributes are stored in the same page and value locality is poor. We propose CRLE (Column-based RLE), a compression algorithm that applies RLE to row-oriented data storage. On a row-oriented storage page, CRLE can exploit the value locality within each individual column and encode the values of the same column in run-length format. Experiments show that CRLE achieves very good compression ratios and performance despite the row-oriented data storage.
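The core idea of CRLE, forming runs per column across the consecutive rows of a page rather than across the page's mixed-attribute byte stream, can be sketched as follows. The function names and the page representation (a list of rows) are illustrative, not the paper's actual storage format:

```python
def crle_compress_page(rows):
    """Column-wise RLE over a row-oriented page: runs are formed per
    column across consecutive rows, so the value locality within each
    column is exploited even though the page stores whole rows.
    Returns one run list [(value, run_length), ...] per column."""
    if not rows:
        return []
    columns = []
    for c in range(len(rows[0])):
        runs = []
        for row in rows:
            v = row[c]
            if runs and runs[-1][0] == v:
                runs[-1] = (v, runs[-1][1] + 1)   # extend the current run
            else:
                runs.append((v, 1))               # start a new run
        columns.append(runs)
    return columns

def crle_decompress_page(columns):
    # Rebuild the rows by expanding each column's runs and zipping them.
    cols = [[v for v, n in runs for _ in range(n)] for runs in columns]
    return [list(t) for t in zip(*cols)]
```

Plain RLE over the same page would rarely find runs, because adjacent values within a row come from different attributes; scanning column-wise restores the repetition that RLE needs.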
A ring is a promising on-chip interconnect for CMPs: it is more scalable than a bus and much simpler than a packet-switched network, and its ordering property can be used to optimize cache coherence protocol design. Existing ring protocols resolve conflicting memory requests either with a retry-and-acknowledgement scheme (the snooping ring protocol) or by using the ordering property of the ring (the ring-order protocol). In this paper, a cache coherence protocol named SOR (Snooping and Ordering Ring) is developed for ring-connected CMPs. The protocol is based on the snooping ring protocol, but instead of the acknowledgement-and-retry scheme it uses the ordering property of the ring to resolve conflicts, avoiding unnecessary retries and thereby improving performance and power efficiency. The L1 snooping results are sent along with the requests instead of being delayed, so many useless snoops can be avoided. Simulation results show that SOR reduces probe slot transports by 47% and snoop operations by 48.9% on average, and improves performance by 3.33% on average and 6% at most.