ISBN: 9781424456581 (print)
As technology advances in both increasing bandwidth and reducing latency for I/O buses and devices, moving I/O data into and out of memory has become critical. In this paper, we observe the different characteristics of I/O and CPU memory reference behavior and identify the potential benefits of separating I/O data from CPU data. We propose a DMA cache technique that stores I/O data in dedicated on-chip storage, and we present two DMA cache designs. The first design, Decoupled DMA Cache (DDC), adopts additional on-chip storage as the DMA cache to buffer I/O data. The second design, Partition-Based DMA Cache (PBDC), requires no additional on-chip storage; instead, it dynamically uses some ways of the processor's last-level cache (LLC) as the DMA cache. We have implemented and evaluated the two designs using an FPGA-based emulation platform and memory reference traces of real-world applications. Experimental results show that, compared with the existing snooping-cache scheme, DDC reduces memory access latency (in bus cycles) by 34.8% on average (up to 58.4%), while PBDC achieves about 80% of DDC's performance improvement without any additional on-chip storage.
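The idea of reserving some LLC ways for I/O data can be illustrated with a toy model of a single cache set. This is only a sketch under stated assumptions, not the PBDC design from the paper: the class name `PartitionedSet`, the LRU policy, and the `resize` method are all hypothetical, and the abstract does not say how the partition is actually adjusted. Lookups search every way, while allocation on a miss is restricted to the requester's partition, so streaming DMA data cannot evict CPU lines.

```python
from collections import OrderedDict


class PartitionedSet:
    """Toy model of one cache set whose ways are split between CPU and DMA data.

    Hypothetical illustration only; names and policies are assumptions.
    """

    def __init__(self, total_ways: int, dma_ways: int):
        if not 0 <= dma_ways <= total_ways:
            raise ValueError("dma_ways must be between 0 and total_ways")
        self.total_ways = total_ways
        self.dma_ways = dma_ways
        # Each partition keeps its tags in LRU order (oldest first).
        self.lines = {"cpu": OrderedDict(), "dma": OrderedDict()}

    def capacity(self, kind: str) -> int:
        """Number of ways currently available to 'cpu' or 'dma' data."""
        if kind == "dma":
            return self.dma_ways
        return self.total_ways - self.dma_ways

    def access(self, tag, kind: str) -> bool:
        """Return True on a hit; on a miss, allocate in the requester's partition."""
        for part in self.lines.values():
            if tag in part:
                part.move_to_end(tag)  # mark as most recently used
                return True
        cap = self.capacity(kind)
        if cap == 0:
            return False  # no ways assigned to this kind: bypass the set
        part = self.lines[kind]
        if len(part) >= cap:
            part.popitem(last=False)  # evict the LRU line of this partition only
        part[tag] = True
        return False

    def resize(self, dma_ways: int) -> None:
        """Change the number of DMA ways, evicting LRU lines that no longer fit."""
        if not 0 <= dma_ways <= self.total_ways:
            raise ValueError("dma_ways must be between 0 and total_ways")
        self.dma_ways = dma_ways
        for kind, part in self.lines.items():
            while len(part) > self.capacity(kind):
                part.popitem(last=False)
```

With `PartitionedSet(4, 1)`, a long run of DMA accesses keeps replacing the single DMA way while the three CPU ways are left untouched.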
A coverage model is the main technique for evaluating the thoroughness of dynamic verification of a Design-under-Verification (DUV). However, rather than achieving high coverage, the essential purpose of verification is ...
ISBN: 9783981080131 (print)
In scan-based tests, power consumption in both the shift and capture phases may be significantly higher than in normal mode, which threatens circuit reliability during manufacturing test. In this paper, by analyzing the impact of X-bits on circuit switching activity, we present an X-filling technique, called iFill, that decreases both shift- and capture-power to guarantee the reliability of scan tests. Moreover, unlike prior work on X-filling for shift-power reduction, which can only reduce shift-in power, iFill decreases power consumption during both shift-in and shift-out. Experimental results on ISCAS'89 benchmark circuits show the effectiveness of the proposed technique.
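To make the link between X-bits and switching activity concrete, the sketch below uses adjacent fill, a commonly used baseline for reducing shift-in power. It is not the iFill algorithm, whose details the abstract does not give. The function names and the transition-count metric are assumptions chosen for illustration: each don't-care bit copies its nearest specified neighbour, so fewer 0/1 transitions travel down the scan chain.

```python
def adjacent_fill(cube: str) -> str:
    """Fill every 'X' in a test cube with the nearest specified bit to its left.

    Leading X's copy the first specified bit; an all-X cube becomes all zeros.
    Baseline technique for illustration, not the iFill method.
    """
    bits = list(cube)
    last = None
    for i, b in enumerate(bits):
        if b in "01":
            last = b
        elif last is not None:
            bits[i] = last
    # Any X still present is a leading X (no specified bit to its left).
    first = next((b for b in bits if b in "01"), "0")
    return "".join(first if b == "X" else b for b in bits)


def transitions(vector: str) -> int:
    """Count adjacent bit pairs that differ, a rough proxy for shift-in switching."""
    return sum(a != b for a, b in zip(vector, vector[1:]))


cube = "X1XX0X1X"
filled = adjacent_fill(cube)          # "11110011"
zero_filled = cube.replace("X", "0")  # "01000010"
print(transitions(filled), transitions(zero_filled))  # prints "2 4"
```

For this cube, adjacent fill yields two transitions against four for zero fill. Shift-out and capture power depend on the circuit's response, which a fill rule based only on the stimulus does not control.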
MPI Alltoall communication is widely used in many high performance computing (HPC) applications. In Alltoall communication, each process sends a distinct message to all other participating processes. In multicore clus...
CAM is widely used in microprocessor and SoC TLB modules, which greatly benefits software development, and TLB operations have become a bottleneck of microprocessor performance. The test cost of normal BIST approa...
Thread migration is an effective technique for fault resilience and load balancing in high performance computing. However, flexible thread migration is not easy to achieve. In this paper, we present an approach to cre...
Physical Unclonable Functions (PUFs) have emerged as a promising primitive to provide a hardware keyless security mechanism for integrated circuit applications. Public PUFs (PPUFs) address the crucial PUF vulnerabilit...
A 64-bit low-power, high-speed floating-point adder design is presented in this paper. The proposed floating-point adder is based on a dual-path architecture, and both dynamic and leakage power are reduced by exploiting architectural opportunities to minimize switching activity and maximize the stack effect of the circuits concurrently. Experimental results based on a 130 nm CMOS standard-cell design show that the average power consumption of the FP adder can be reduced by 61.4% with the proposed low-power techniques.
As the number of cores in chip multiprocessors (CMPs) increases, the network-on-chip (NoC) has come to play a major role in ensuring performance and power scalability. In this paper, we propose multiple-combinational-channel (MCC), a load-balancing and deadlock-free interconnect network for cache-coherent non-uniform memory access (CC-NUMA). Because messages transmitted over the NoC have different widths and injection rates, we combine low-usage channels and make high-usage channels independent and sufficiently wide, which balances load better and reduces power dissipation. Furthermore, based on an in-depth analysis of network traffic, we summarize four traffic patterns and establish several rules to avoid protocol-level deadlock. We implement MCC on a 16-core CMP and evaluate its power and performance using universal workloads. The experimental results show that MCC consumes nearly 21% less power than multiple-physical-channel at similar throughput. Moreover, MCC improves performance by 10% at similar area and power compared to a packet-switching architecture with virtual channels.
Multicore architecture promises to sustain Moore's Law and is bringing a revolution in both research and industry, opening a new design space for software and architecture. The fast Fourier transform (FFT), which is both compute-intensive and bandwidth-intensive, is one of the most popular and important applications in the world. Compared with the computing resources of a multicore architecture, on-chip memory is much more expensive because of the limits on physical chip size. Implementing the FFT efficiently on multicore hardware with good scalability is therefore a challenge for both software and hardware developers. In this paper, supported by the Godson-T architecture, we develop an optimized implementation of the 1-D FFT that hides the matrix transposes and overlaps computation with communication. It achieves more than 30% performance improvement and reduces L2 cache consumption by almost one third compared with the baseline six-step FFT. The limits of scalability are also analyzed; the conclusion is that on Godson-T, when frequent and simultaneous data accesses occur, the limited access bandwidth of the L2 cache is the bottleneck and results in longer on-chip network latency.
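For readers unfamiliar with the baseline, the six-step FFT factors a length-N transform with N = n1 * n2 into three transposes, two batches of short row transforms, and one twiddle multiplication. The sketch below is a plain reference version of that textbook decomposition, not the Godson-T implementation: it does not hide the transposes or overlap communication, and it uses a naive O(n^2) DFT for the row transforms to stay short. The function names are chosen for illustration.

```python
import cmath


def dft(x):
    """Naive O(n^2) DFT, used here for the short row transforms and as a reference."""
    n = len(x)
    return [
        sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
        for k in range(n)
    ]


def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(row) for row in zip(*m)]


def six_step_fft(x, n1, n2):
    """Length n1*n2 DFT of x via the six-step decomposition."""
    n = n1 * n2
    if len(x) != n:
        raise ValueError("len(x) must equal n1 * n2")
    # View x as an n1 x n2 row-major matrix: a[j1][j2] = x[j1 * n2 + j2].
    a = [list(x[r * n2:(r + 1) * n2]) for r in range(n1)]
    b = transpose(a)                       # step 1: n2 x n1
    b = [dft(row) for row in b]            # step 2: n2 transforms of length n1
    b = [                                  # step 3: twiddle factors W_N^(j2 * k1)
        [b[j2][k1] * cmath.exp(-2j * cmath.pi * j2 * k1 / n) for k1 in range(n1)]
        for j2 in range(n2)
    ]
    c = transpose(b)                       # step 4: n1 x n2
    c = [dft(row) for row in c]            # step 5: n1 transforms of length n2
    d = transpose(c)                       # step 6: n2 x n1, d[k2][k1] = X[k1 + n1 * k2]
    return [v for row in d for v in row]
```

Each row transform in steps 2 and 5 is independent, which is what makes the scheme attractive on a multicore, while the three transposes are pure data movement.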