ISBN: 9783981080162 (print)
In this paper, we propose a binary-tree waveguide connected Optical-Network-on-Chip (ONoC) to accelerate lightpath establishment. Broadcasting the control data over the proposed power-efficient binary-tree waveguide reduces the maximum number of hops needed to establish a lightpath to two. Through extensive simulation and analysis, we demonstrate that the proposed ONoC significantly reduces the setup time and, in turn, the packet latency.
ISBN: 9783981080162 (print)
To combine the power of simulation-based and formal techniques, semi-formal methods have been widely explored. Among these methods, abstraction-guided simulation is particularly promising. In this paper, we propose an abstraction-guided simulation approach that aims to cover hard-to-reach states in the functional verification of microprocessors. A Markov model that integrates vector correlations is constructed from the high-level functional specification, i.e., the ISA. Furthermore, several strategies that use abstraction information are proposed as effective guidance for test generation. Experimental results on two complex microprocessors show that our approach covers hard-to-reach states more efficiently than similar methods. Compared with work based on other intelligent engines, our approach can guarantee a higher hit ratio of target states without loss of efficiency.
ISBN: 9783981080162 (print)
Nondeterminism in multi-clock systems often complicates system validation processes such as post-silicon debugging and at-speed testing, creating many difficulties for system designers and testers. The major source of nondeterministic behavior is clock domain crossing, because the clocks that determine the timing of events are sensitive to variations. In this paper, we propose a general method to eliminate the nondeterminism resulting from clock domain crossing. The method does not assume any specific relationship among the clocks. Instead, to adapt to various clock conditions, we propose an automatic configuration procedure and a periodic error-canceling mechanism, both derived from a theoretical analysis of the deterministic boundaries and requiring only trivial hardware support. To demonstrate the applicability of our method in practice, we implement it on an FPGA platform. Experimental results show that the performance loss of our method relative to a conventional multi-clock FIFO is less than 2%.
ISBN: 9783981080162 (print)
Faster-than-at-speed testing provides an effective way to detect and debug small delay defects in modern fabricated chips. However, using external automatic test equipment for faster-than-at-speed delay testing can be costly. In this paper, we present an on-chip clock generation scheme that facilitates faster-than-at-speed delay testing for both launch-on-capture and launch-on-shift test frameworks. The required test clock frequency can be obtained with high resolution by specifying the information in the test patterns, which is then shifted into the delay control stages to configure the launch and capture clock generation circuit (LCCG) embedded on-chip. Similarly, the control information for selecting among test frameworks and clock signals can also be embedded in the test patterns. Experimental results are presented to validate the proposed scheme.
ISBN: 9783981080162 (print)
We propose a multiple-fault diagnosis method with high diagnosability, resolution, and first-hit rate, as well as a short run time. The method makes no assumptions about fault models and can therefore diagnose arbitrary faults. To cope with multiple-fault masking and reinforcement effects, two key techniques, the construction and the scoring of fault-tuple equivalence trees, are introduced to choose and rank the final candidate locations. Experimental results show that, when the circuits have 2 arbitrary faults, the average diagnosability and resolution are 98% and 0.95, respectively, with a best case of 100% and 1.00. Moreover, even when 21 arbitrary faults exist, our method can still identify 93% of them on average with a resolution of 0.78, improvements of 41% and 39% over the latest work, whose diagnosability and resolution are 66% and 0.56. Finally, 96% of our top-ranked candidate locations are actual fault locations.
Large-scale Chip-Multiprocessors (CMPs) generally employ a Network-on-Chip (NoC) to connect the last-level cache (LLC), which is typically organized as distributed NUCA (non-uniform cache access) arrays for scalability and efficiency. At the same time, aggressive technology scaling induces severe reliability problems, causing on-chip components (e.g., cores, cache banks, routers) to fail due to manufacturing defects or online hardware faults. A typical degradable CMP should be able to work around defects by disabling faulty components. In a static NUCA architecture, however, when the cache banks attached to a computing node are disabled, certain physical address sections are no longer accessible. Prior approaches such as set reduction, introduced in the Intel Xeon processor 7100 series, allow cache banks to be turned off by masking certain set bits in the physical address, which greatly wastes cache capacity. In this paper, we propose to tackle this problem at a finer granularity to limit the capacity loss in the NUCA cache. Cache accesses to isolated nodes are redirected by a utility-driven address remapping scheme that reduces data block conflicts in the fault-tolerant shared LLC. We evaluate our technique using the GEMS simulator. Experimental results show that address remapping achieves a significant improvement over the conventional cache sizing scheme.
We present nGFSIM, a GPU-based fault simulator for stuck-at faults that can report the fault coverage of one- to n-detection for any specified integer n using only a single run of fault simulation. nGFSIM, which exploits the massive parallelism of the GPU architecture and optimizes memory access and usage, enables accelerated fault simulation without the need for fault dropping. We show that nGFSIM offers a 25x speedup over a commercial tool and enables new applications in test selection.
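To make the idea of reporting one- to n-detection coverage from a single pass concrete, here is a minimal sequential Python sketch. It is not nGFSIM's GPU implementation. It assumes the usual definition that a fault is k-detected when at least k distinct patterns detect it, and the function name and the `detected_per_pattern` input (one set of detected faults per test pattern) are hypothetical.

```python
from collections import Counter


def n_detection_coverage(detected_per_pattern, all_faults, n):
    """Return {k: coverage} for k = 1..n from one pass over the patterns.

    detected_per_pattern: iterable of collections, one per test pattern,
        each holding the faults that pattern detects (hypothetical input).
    all_faults: list of every fault in the fault list.
    """
    if not all_faults:
        raise ValueError("fault list must not be empty")
    counts = Counter()
    for detected in detected_per_pattern:
        # No fault dropping: a fault stays in the list after its first
        # detection, so every detecting pattern is counted.
        counts.update(set(detected))
    total = len(all_faults)
    return {
        k: sum(1 for fault in all_faults if counts[fault] >= k) / total
        for k in range(1, n + 1)
    }


patterns = [{"a", "b"}, {"a", "c"}, {"a", "b"}]
print(n_detection_coverage(patterns, ["a", "b", "c", "d"], 3))
```

Because the per-fault detection counts are kept rather than discarded after the first hit, the coverage for every k up to n falls out of the same run.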
ISBN: 9781605588377 (print)
Critical path selection plays an important role in testing for small delay defects (SDDs). For some timing-balanced circuits, the number of candidate critical paths may be very large, which makes statistical timing analysis based on Monte Carlo simulation very inefficient. This paper proposes a fast path selection approach based on graph partitioning. First, a critical path graph (CPG) is generated to implicitly enumerate almost all candidate critical paths, and the CPG is then partitioned, using two graph partitioning approaches, into several subgraphs that each contain a limited number of paths. Next, Monte Carlo simulation is applied to each subgraph for path selection. Finally, according to the partition topology of the CPG and the path sets selected from each subgraph, a path set for the original CPG is generated using Union and Cartesian product operations for testing SDDs. Experimental results show that, for circuits containing large numbers of candidate critical paths, the proposed approach significantly reduces CPU time and maintains a higher probability of capturing delay failures than path selection methods based on general Monte Carlo simulation.
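The final composition step can be illustrated with a small Python sketch. The abstract does not spell out how the partition topology maps to the two operations, so the mapping here is an assumption: subgraphs that offer alternative routes are combined with Union, and subgraphs traversed one after another are combined with a Cartesian product that joins one path segment from each. Paths are represented as tuples of node names, and both function names are hypothetical.

```python
from itertools import product


def combine_parallel(*path_sets):
    """Union: alternative subgraphs contribute their paths side by side."""
    combined = set()
    for paths in path_sets:
        combined |= set(paths)
    return combined


def combine_series(*path_sets):
    """Cartesian product: join one segment from each consecutive subgraph."""
    return {sum(parts, ()) for parts in product(*path_sets)}


# Two segments selected in a first subgraph, one in the subgraph after it.
first = {("a", "b"), ("a", "c")}
second = {("d", "e")}
print(sorted(combine_series(first, second)))
```

Under this assumption, Monte Carlo simulation only ever runs on the small per-subgraph path sets, and the full-graph path set is assembled by set operations alone.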
Sorting is a kernel algorithm for a wide range of applications. In this paper, we present a new algorithm, GPU-Warpsort, to perform comparison-based parallel sorting on Graphics Processing Units (GPUs). It mainly consists of a bitonic sort followed by a merge sort. Our algorithm achieves high performance by efficiently mapping the sorting tasks to GPU architectures. First, we take advantage of the synchronous execution of threads in a warp to eliminate the barriers in the bitonic sorting network. We also provide sufficient homogeneous parallel operations for all the threads within a warp to avoid branch divergence. Furthermore, we implement the merge sort efficiently by assigning each warp independent pairs of sequences to be merged and by exploiting fully coalesced global memory accesses to eliminate the bandwidth bottleneck. Our experimental results indicate that GPU-Warpsort works well on different kinds of input distributions and achieves up to 30% higher performance than previous optimized comparison-based GPU sorting algorithms on input sequences with millions of elements.
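For readers unfamiliar with bitonic sorting networks, the following is a minimal sequential Python sketch of the classic network for power-of-two input lengths. It is not the authors' GPU code. Each pass of the inner `for i` loop is one network step whose compare-exchange operations are mutually independent; on a GPU those operations would be spread across threads, and the abstract's point is that threads within a warp execute such a step synchronously, so no explicit barrier is needed between steps.

```python
def bitonic_sort(values):
    """Sort a list whose length is a power of two with a bitonic network."""
    data = list(values)
    n = len(data)
    if n & (n - 1):
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:  # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:  # compare-exchange distance for this step
            # All pairs (i, i ^ j) in this step are independent of each other.
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (data[i] > data[partner]) == ascending:
                        data[i], data[partner] = data[partner], data[i]
            j //= 2
        k *= 2
    return data


print(bitonic_sort([5, 1, 4, 2, 8, 7, 3, 6]))
```

The comparison pattern depends only on the indices, never on the data, which is what makes every thread perform the same operation at each step and avoids branch divergence.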
In this paper, we present the Godson-T Verification Engine (GVE), which we use to rapidly prototype and debug our Godson-T many-core processor design. GVE adopts a state-of-the-art hardware platform containing 6 Xilinx Virtex-5 LX330 FPGAs, which allows us to map our many-core processor and its peripheral devices onto it. Besides the hardware, our toolkit, Godson-T Studio, provides a compiler, program loader, debugger, and monitor for developing, profiling, and debugging, while the accuracy loss problem is addressed by two novel techniques presented in this paper: Check-point and ILA-Check. In our experience, GVE greatly shortens the verification cycle thanks to its high execution speed. For example, it finishes thousands of test cases in an hour, whereas the software-based approach takes a few days. With the help of the checkpoint framework, we can also easily locate faults. Because of these features, GVE has contributed greatly to the 16-tile Godson-T tape-out project.