With the growing scale of high-performance computing (HPC) systems, today and more so tomorrow, faults are the norm rather than the exception. HPC applications typically tolerate fail-stop failures under the stop-and-wait scheme, where even if only one processor fails, the whole system has to stop and wait for the recovery of the corrupted data. It is now a more-or-less accepted fact that the stop-and-wait scheme will not scale to the next generation of HPC systems. Inspired by the previous stop-and-wait algorithm-based fault tolerance (ABFT) recovery technique, we propose in this paper a nonstop fault tolerance scheme at the application level and describe its implementation. When a failure occurs during the execution of an application, we do not stop to wait for the recovery of the corrupted node; instead, we replace it with the corresponding redundant node and continue the execution. At the end of execution, the correct solution can be recovered algorithmically at a very low cost. To implement the scheme, we investigated some new fault-tolerant features of the Message Passing Interface (MPI) and used them in the MPICH implementation of MPI. We also describe a case study using High Performance Linpack (HPL) with these new features and evaluate the performance of both our new scheme and ABFT recovery. Experimental results show the advantage of our new scheme over ABFT recovery even at a small scale.
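The checksum idea that ABFT recovery generally relies on can be illustrated with a minimal sketch. It is not the paper's nonstop scheme or its MPI implementation: the function names `make_checksum` and `recover_block` are hypothetical, and plain Python lists stand in for the data blocks held by separate processes. It assumes a single redundant block holding the element-wise sum of all data blocks, so one lost block can be rebuilt from the survivors.

```python
def make_checksum(blocks):
    """Element-wise sum of all data blocks (the redundant checksum block)."""
    return [sum(col) for col in zip(*blocks)]


def recover_block(blocks, checksum, lost):
    """Rebuild blocks[lost] from the checksum and the surviving blocks.

    The content of blocks[lost] is never read, mimicking a failed node.
    Assumes at least one surviving block.
    """
    survivors = [b for i, b in enumerate(blocks) if i != lost]
    return [c - sum(col) for c, col in zip(checksum, zip(*survivors))]


# Three "processes", each holding one block of integer data (illustrative values).
blocks = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
checksum = make_checksum(blocks)

# Pretend process 1 failed and rebuild its block from the others.
recovered = recover_block(blocks, checksum, lost=1)
```

Integer data is used here so the recovery is exact; with floating-point data the rebuilt block would match only up to rounding error.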
As FPGA feature sizes shrink to the nanometer scale, SRAM-based FPGAs become more vulnerable to soft errors. During logic synthesis, the reliability of a design can be improved by introducing logic masking. In this work, we observe that many look-up tables (LUTs) are not fully occupied after logic synthesis. Hence, we propose a soft error mitigation scheme based on functional equivalent classes to exploit free LUT entries in the circuit. The proposed technique replaces LUTs that are not fully occupied with corresponding functional equivalent classes, which improves reliability while preserving the functionality of the design. Experimental results show that, compared with the baseline ABC mapper, the proposed technique reduces the soft error rate by 21%, while the critical-path delay increases by only 4.25%.
Web workloads are known to vary dynamically over time, which poses a challenge for allocating resources among applications. In this paper, we argue that existing dynamic resource allocation based on resource utilization has some drawbacks in virtualized servers. Dynamic resource allocation based directly on real-time user experience is more reasonable and also has practical significance. To address the problem, we propose a system architecture that combines real-time measurement and analysis of user experience for resource allocation. We evaluate our proposal using Webbench. The experimental results show that these techniques can judiciously allocate system resources.
An instruction set simulator (ISS) plays an important role in pre-silicon software development for ASIPs. However, traditional simulation is too slow to effectively support full-scale software development. In this paper, we propose a hybrid simulation framework that improves on previous simulation methods by aggressively utilizing the host machine's resources. This is achieved by categorizing the instructions of an ASIP application into two types, custom and basic instructions, via binary instrumentation. In hybrid simulation, only custom instructions are simulated on the ISS, while basic instructions are executed natively, and therefore fast, on the host machine. We implement this framework for an industrial ASIP to validate our approach. Experimental results show that when the implemented ISS, GS-Sim, is applied to practical multimedia decoders, it achieves an average simulation speed of up to 1058.5 MIPS, which is 34.7 times that of the state-of-the-art dynamic binary translation simulator and, to the best of our knowledge, the fastest reported.
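The split between custom and basic instructions can be pictured with a small dispatch sketch. This is only an illustration under stated assumptions, not GS-Sim's design: the opcode names in `CUSTOM_OPCODES`, the tuple-based instruction format, and the two callbacks standing in for the ISS and for native execution are all hypothetical, and the binary instrumentation step is replaced by a simple opcode lookup.

```python
from itertools import groupby

# Hypothetical custom opcodes; the abstract does not list the ASIP's instruction set.
CUSTOM_OPCODES = {"vmac", "sad8"}


def kind(instr):
    """Label an instruction 'custom' or 'basic' by its opcode (first field)."""
    return "custom" if instr[0] in CUSTOM_OPCODES else "basic"


def partition(program):
    """Split a program into maximal runs of consecutive same-kind instructions."""
    return [(k, list(run)) for k, run in groupby(program, key=kind)]


def run_hybrid(program, simulate, execute_native):
    """Send custom runs to the ISS callback and basic runs to the native callback."""
    for k, run in partition(program):
        (simulate if k == "custom" else execute_native)(run)


# Toy program: three basic instructions around one custom instruction.
program = [("add", 1, 2), ("mul", 3, 4), ("vmac", 5, 6), ("sub", 7, 8)]
iss_log, native_log = [], []
run_hybrid(program, iss_log.extend, native_log.extend)
```

Grouping consecutive instructions of the same kind keeps the number of switches between the two back ends low, which is one plausible reason a hybrid approach could pay off; the abstract itself does not describe how the switching is handled.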
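Three-dimensional integrated circuits (3D ICs) are an emerging chip integration technology in which multiple dies, fabricated in the same or different processes, are stacked and vertically integrated using through-silicon vias (TSVs). This integration offers notable advantages: a smaller form factor, higher on-chip transistor density, more functional modules on a single chip, and higher interconnect performance. However, 3D ICs also bring new challenges such as TSV electromigration. This paper proposes a design-for-reliability method to suppress TSV electromigration. First, for TSV defects such as copper-plating voids, bonding misalignment, and dust contamination at the bonding interface, we analyze the relationship between manufacturing defects and electromigration. We observe that manufacturing defects not only aggravate electromigration but also affect TSV resistance. We then propose TSV-SAFE (TSV Self-healing architecture For Electro-migration), a design-for-reliability framework that suppresses electromigration. In the experiments, we build a simulation platform for a 3D chip consisting of two circuit layers. Experimental results show that with the proposed technique, the mean time to failure (MTTF) of TSVs increases by 70 times on average, while the resulting hardware area overhead is no more than 1% of the full chip area.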
Some wafer fabrication processes are repeated processes, e.g. atomic layer deposition (ALD) process. For such processes, the wafers need to visit some processing modules for a number of times, which complicates the cy...