The Godson project with an R&D history of 10 years is an independent national program of China that aims at developing advanced microprocessor technologies based on fundamental research and commercialization of the c...
详细信息
The Godson project with an R&D history of 10 years is an independent national program of China that aims at developing advanced microprocessor technologies based on fundamental research and commercialization of the chip technology. We will give a comprehensive presentation of the Godson project, including its history, technical roadmaps, and several unique technical merits.
It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to sim...
详细信息
It can be observed from looking backward that processor architecture is improved through spirally shifting from simple to complex and from complex to simple. Nowadays we are facing another shifting from complex to simple, and new innovative architecture will emerge to utilize the continuously increasing transistor budgets. The growing importance of wire delays, changing workloads, power consumption, and design/verification complexity will drive the forthcoming era of Chip Multiprocessors (CMPs). Furthermore, typical CMP projects both from industries and from academics are investigated. Through going into depths for some primary theoretical and implementation problems of CMPs, the great challenges and opportunities to future CMPs are presented and discussed. Finally, the Godson series microprocessors designed in China are introduced.
It is a well-known fact that test power consumption may exceed that during functional operation. Leakage power dissipation caused by leakage current in Complementary Metal-Oxide-Semiconductor (CMOS) circuits during ...
详细信息
It is a well-known fact that test power consumption may exceed that during functional operation. Leakage power dissipation caused by leakage current in Complementary Metal-Oxide-Semiconductor (CMOS) circuits during test has become a significant part of the total power dissipation. Hence, it is important to reduce leakage power to prolong battery life in portable systems which employ periodic self-test, to increase test reliability and to reduce test cost. This paper analyzes leakage current and presents a kind of leakage current simulator based on the transistor stacking effect. Using it, we propose techniques based on don't care bits (denoted by Xs) in test vectors to optimize leakage current in integrated circuit (IC) test by genetic algorithm. The techniques identify a set of don't care inputs in given test vectors and reassign specified logic values to the X inputs by the genetic algorithm to get minimum leakage vector (MLV). Experimental results indicate that the techniques can effectually optimize leakage current of combinational circuits and sequential circuits during test while maintaining high fault coverage,
Disturbances occur inevitably during daily operations of the metro system, leading to train delays and low service quality. Different from common deterministic reactive train rescheduling frameworks, taking the inhere...
详细信息
This paper describes the design for testability (DFT) challenges and techniques of Godson-3 microprocessor, which is a scalable multicore processor based on the scalable mesh of crossbar (SMOC) on-chip network and...
详细信息
This paper describes the design for testability (DFT) challenges and techniques of Godson-3 microprocessor, which is a scalable multicore processor based on the scalable mesh of crossbar (SMOC) on-chip network and targets high-end applications. Advanced techniques are adopted to make the DFT design scalable and achieve low-power and low-cost test with limited IO resources. To achieve a scalable and flexible test access, a highly elaborate test access mechanism (TAM) is implemented to support multiple test instructions and test modes. Taking advantage of multiple identical cores embedding in the processor, scan partition and on-chip comparisons are employed to reduce test power and test time. Test compression technique is also utilized to decrease test time. To further reduce test power, clock controlling logics are designed with ability to turn off clocks of non-testing partitions. In addition, scan collars of CACHEs are designed to perform functional test with low-speed ATE for speed-binning purposes, which poses low complexity and has good correlation results.
Dawning Nebulae is a heterogeneous system composed of 9280 multi-core x86 CPUs and 4640 NVIDIA Fermi GPUs. With a Linpack performance of 1.271 petaFLOPS, it was ranked the second in the TOP500 List released in June 20...
详细信息
Dawning Nebulae is a heterogeneous system composed of 9280 multi-core x86 CPUs and 4640 NVIDIA Fermi GPUs. With a Linpack performance of 1.271 petaFLOPS, it was ranked the second in the TOP500 List released in June 2010. In this paper, key issues in the system design of Dawning Nebulae are introduced. system tuning methodologies aiming at petaFLOPS Linpack result are presented, including algorithmic optimization and communication improvement. The design of its file I/O subsystem, including HVFS and the underlying DCFS3, is also described. Performance evaluations show that the Linpack efficiency of each node reaches 69.89%, and 1024-node aggregate read and write bandwidths exceed 100 GB/s and 70 GB/s respectively. The success of Dawning Nebulae has demonstrated the viability of CPU/GPU heterogeneous structure for future designs of supercomputers.
The combination of growing transistor counts and limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"), that is only a small fraction of chip can run at full speed...
详细信息
The combination of growing transistor counts and limited power budget within a silicon die leads to the utilization wall problem (a.k.a. "Dark Silicon"), that is only a small fraction of chip can run at full speed during a period of time. Designing accelerators for specific applications or algorithms is considered to be one of the most promising approaches to improving energy-efficiency. However, most current design methods for accelerators are dedicated for certain applications or algorithms, which greatly constrains their applicability. In this paper, we propose a novel general-purpose many-accelerator architecture. Our contributions are two-fold. Firstly, we propose to cluster dataflow graphs (DFGs) of hotspot basic blocks (BBs) in applications. The DFG clusters are then used for accelerators design. This is because a DFC is the largest program unit which is not specific to a certain application. We analyze 17 benchmarks in SPEC CPU 2006, acquire over 300 DFGs hotspots by using LLVM compiler tool, and divide them into 15 clusters based on graph similarity. Secondly, we introduce a function instruction set architecture (FISC) and illustrate how DFG accelerators can be integrated with a processor core and how they can be used by applications. Our results show that the proposed DFG clustering and FISC design can speed up SPEC benchmarks 6.2X on average.
If your computer crashes, you can revive it by a reboot, an empirical solution that usually turns out to be effective. The rationale behind this solution is that transient faults, either in hardware or software,can be...
详细信息
If your computer crashes, you can revive it by a reboot, an empirical solution that usually turns out to be effective. The rationale behind this solution is that transient faults, either in hardware or software,can be fixed by refreshing the machine state. Such a "silver bullet", however, could be futile in the future because the faults, especially those existing in the hardware such as Integrated Circuit(IC) chips, cannot be eliminated by refreshing. What we need is a more sophisticated mechanism to steer the system back to the right track. The "magic cure" is the Fault Tolerance On-Chip(FTOC) mechanism, which relies on a suite of built-in design-for-reliability logic, including fault detection, fault diagnosis, and error recovery, working in a self-supportive manner. We have exploited the FTOC to build a holistic solution ranging from on-chip fault detection to error recovery mechanisms to address faults caused by chips progressively aging. Besides fault detection, the FTOC paradigm provides attractive benefits, such as facilitating graceful performance degradation, mitigating the impact of verification blind spots, and improving the chip yield.
The many-accelerator architecture, mostly composed of general-purpose cores and accelerator-like function units (FUs), becomes a great alternative to homogeneous chip multiprocessors (CMPs) for its superior power-...
详细信息
The many-accelerator architecture, mostly composed of general-purpose cores and accelerator-like function units (FUs), becomes a great alternative to homogeneous chip multiprocessors (CMPs) for its superior power-efficiency. However, the emerging many-accelerator processor shows a much more complicated memory accessing pattern than general purpose processors (GPPs) because the abundant on-chip FUs tend to generate highly-concurrent memory streams with distinct locality and bandwidth demand. The disordered memory streams issued by diverse accelerators exhibit a mutual- interference behavior and cannot be efficiently handled by the orthodox main memory interface that provides an inflexible data fetching mode. Unlike the traditional DRAM memory, our proposed Aggregation Memory system (AMS) can function adaptively to the characterized memory streams from different FUs, because it provides the FUs with different data fetching sizes and protects their locality in memory access by intelligently interleaving their data to memory devices through sub-rank binding. Moreover, AMS can batch the requests without sub-rank conflict into a read burst with our optimized memory scheduling policy. Experimental results from trace-based simulation show both conspicuous performance boost and energy saving brought by AMS.
The decades-old synchronous memory bus interface has restricted many innovations in the memory system, which is facing various challenges (or walls) in the era of multi-core and big data. In this paper, we argue tha...
详细信息
The decades-old synchronous memory bus interface has restricted many innovations in the memory system, which is facing various challenges (or walls) in the era of multi-core and big data. In this paper, we argue that a message- based interface should be adopted to replace the traditional bus-based interface in the memory system. A novel message interface based memory system called MIMS is proposed. The key innovation of MIMS is that processors communicate with the memory system through a universal and flexible message packet interface. Each message packet is allowed to encapsulate multiple memory requests (or commands) and additional semantic information. The memory system is more intelligent and active by equipping with a local buffer scheduler, which is responsible for processing packets, scheduling memory requests, preparing responses, and executing specific commands with the help of semantic information. Under the MIMS framework, many previous innovations on memory architecture as well as new optimization opportunities such as address compression and continuous requests combination can be naturally incorporated. The experimental results on a 16-core cycle-detailed simulation system show that: with accurate granularity message, MIMS can improve system performance by 53.21% and reduce energy delay product (EDP) by 55.90%. Furthermore, it can improve effective bandwidth utilization by 62.42% and reduce memory access latency by 51% on average.
暂无评论