This paper describes the design-for-testability (DFT) challenges and techniques of the Godson-3 microprocessor, a scalable multicore processor that is based on the scalable mesh of crossbar (SMOC) on-chip network and targets high-end applications. Advanced techniques are adopted to make the DFT design scalable and to achieve low-power, low-cost test with limited I/O resources. To provide scalable and flexible test access, an elaborate test access mechanism (TAM) is implemented that supports multiple test instructions and test modes. Taking advantage of the multiple identical cores embedded in the processor, scan partitioning and on-chip comparison are employed to reduce test power and test time. Test compression is also used to decrease test time. To further reduce test power, clock control logic is designed that can turn off the clocks of partitions not under test. In addition, scan collars for the caches are designed to perform functional test with low-speed ATE for speed binning, an approach that has low complexity and shows good correlation results.
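The on-chip comparison idea can be illustrated with a small software model. This is only a sketch under stated assumptions, not the Godson-3 implementation: it assumes every identical core receives the same scan stimulus, that one expected response is shared by all cores, and that a per-core pass/fail flag is the only value observed. The function name `compare_core_responses` and the bit patterns are hypothetical.

```python
from typing import List


def compare_core_responses(expected: List[int],
                           responses: List[List[int]]) -> List[bool]:
    """Return one pass/fail flag per core (True means the core passed).

    Hypothetical model: each response is the scan-out bit stream of one
    identical core, compared bit by bit against a shared expected stream.
    """
    flags = []
    for resp in responses:
        if len(resp) != len(expected):
            raise ValueError("response length does not match expected length")
        # XOR each observed bit with the expected bit; any 1 is a mismatch.
        mismatch = any(r ^ e for r, e in zip(resp, expected))
        flags.append(not mismatch)
    return flags


# Four identical cores; the second one returns a wrong bit.
expected = [1, 0, 1, 1]
responses = [
    [1, 0, 1, 1],
    [1, 0, 0, 1],
    [1, 0, 1, 1],
    [1, 0, 1, 1],
]
result = compare_core_responses(expected, responses)
print(result)
```

In this model only one flag per core has to leave the chip instead of a full response stream, which is the reason such a comparison can help when I/O resources are limited.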
Dawning Nebulae is a heterogeneous system composed of 9280 multi-core x86 CPUs and 4640 NVIDIA Fermi GPUs. With a Linpack performance of 1.271 petaFLOPS, it was ranked second in the TOP500 list released in June 2010. In this paper, key issues in the system design of Dawning Nebulae are introduced. System tuning methodologies aimed at a petaFLOPS Linpack result are presented, including algorithmic optimization and communication improvement. The design of its file I/O subsystem, including HVFS and the underlying DCFS3, is also described. Performance evaluations show that the Linpack efficiency of each node reaches 69.89%, and that the 1024-node aggregate read and write bandwidths exceed 100 GB/s and 70 GB/s, respectively. The success of Dawning Nebulae has demonstrated the viability of the CPU/GPU heterogeneous structure for future supercomputer designs.
Detecting traffic signs effectively under low-light conditions remains a significant challenge. To address this issue, we propose YOLO-LLTS, an end-to-end real-time traffic sign detection algorithm specifically design...
The Godson-3A microprocessor is a quad-core version of the scalable Godson-3 multi-core series. It is physically implemented in a 65 nm CMOS process. The 174 mm² chip consists of 425 million transistors. The maximum frequency is 1 GHz, with a maximum power consumption of 15 W. The main challenges of the Godson-3A physical implementation include its very large scale, the high frequency requirement, sub-micron technology effects, and an aggressive schedule. This paper describes the design methodology of the physical implementation of Godson-3A, with particular emphasis on design methods for high frequency, clock tree design, power management, and on-chip variation (OCV) issues.
Scan design is a widely used design-for-testability technique to improve test quality and efficiency. For a scan-designed circuit, test and diagnosis of the scan chain and the circuit is an important process for silicon debug and yield learning. However, conventional scan designs and diagnosis methods abort the subsequent diagnosis process after diagnosing the scan chain if the scan chain is faulty. In this work, we propose a design-for-diagnosis scan strategy called helix scan and a diagnosis algorithm to address this issue. Unlike previously proposed methods, helix scan can carry on the diagnosis process without losing information when the scan chain is faulty. Moreover, it simplifies scan chain diagnosis and achieves high diagnostic resolution as well as accuracy. Experimental results demonstrate the effectiveness of our design.
An increasing number of supercomputers adopt a heterogeneous architecture, consisting of both general-purpose CPUs and specialized accelerators. Such a design is beneficial for scalability and power, but heterogeneity also brings new challenges for communication systems, which must connect heterogeneous components and provide support for programming. The communication system of the Dawning 6000 connects two kinds of heterogeneous processors, Loongson and AMD, and adopts a three-layer architecture with an intranode layer between heterogeneous components. To connect heterogeneous components efficiently, the system forms a global address space and provides a mechanism for message transmission via an in-node global store; using an InfiniBand network, it also provides an OS-bypassing virtualization method to share an InfiniBand card between nodes. To facilitate programming on heterogeneous processors, it supports Unified Parallel C (UPC) with a modified compiler based on the global address space. In addition, a special collective network is implemented for collective operations. Results obtained from a prototype system show these features to be both feasible and efficient.
Low-power design is one of the most important issues in wireless sensor networks (WSNs), while reliable information transmission must be ensured as well. Transmitting power (TP) control is a simple way to reduce power consumption, but excessive interference from potentially adjacent operating links and the communication reliability between nodes must be considered. In this paper, a reliable and energy-efficient protocol is presented, which adopts adaptive rate control based on an optimal TP. A mathematical model that considers average interference and network connectivity is used to predict the optimal TP. Then, at the optimal TP, active nodes adaptively choose the data rate as the bit error rate (BER) performance changes. The efficiency of the new strategy is validated by mathematical analysis and simulations. Compared with 802.11 DCF, which uses a maximum unified TP, and the BASIC protocol, the results show that a higher average throughput can be achieved while the energy consumption per useful bit is reduced.
Floor plans can provide valuable prior information that helps enhance the accuracy of indoor positioning systems. However, existing research typically faces challenges in efficiently leveraging floor plan information ...
MapReduce is a programming framework introduced by Google for large-scale data processing. It is usually used in a scan-centric fashion where all the data are split into blocks and Maps are generated for each block to...
DRAM row buffer conflicts can increase memory access latency significantly. This paper presents a new page-allocation-based optimization that works seamlessly with some existing hardware and software optimizations to eliminate significantly more row buffer conflicts. Validation in simulation, using a set of selected scientific and engineering benchmarks against a few representative memory controller optimizations, shows that our method can reduce row buffer miss rates by up to 76% (37.4% on average). This reduction in row buffer miss rates translates into performance speedups of up to 15% (5% on average).
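Why page placement affects row buffer misses can be seen in a toy model. The sketch below is an illustration under simple assumptions, not the paper's allocator or simulator: it assumes an open-page policy with one open row per bank, and it represents each access as a hypothetical `(bank, row)` pair that a page allocator would determine through the physical address.

```python
from typing import Dict, Iterable, Tuple


def row_buffer_miss_rate(accesses: Iterable[Tuple[int, int]]) -> float:
    """Fraction of accesses that miss the row buffer.

    Toy open-page model: each bank keeps its most recently accessed row open,
    and an access to any other row in that bank counts as a miss.
    """
    open_rows: Dict[int, int] = {}
    misses = 0
    total = 0
    for bank, row in accesses:
        total += 1
        if open_rows.get(bank) != row:
            misses += 1
            open_rows[bank] = row  # the new row replaces the open one
    return misses / total if total else 0.0


# Two interleaved streams whose pages fall in the same bank but different rows:
# every access closes the row that the other stream needs.
same_bank = [(0, 10), (0, 20)] * 4
# The same streams after their pages are placed in different banks.
split_banks = [(0, 10), (1, 20)] * 4

conflict_rate = row_buffer_miss_rate(same_bank)
split_rate = row_buffer_miss_rate(split_banks)
print(conflict_rate, split_rate)
```

In this model the interleaved streams miss on every access when they share a bank, while after separation only the first access to each bank misses.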