Single event upsets have made modern integrated circuits more susceptible to soft errors, making their operation less reliable. To detect induced computational faults, we propose a fault-tolerant, low-overhead microarchitecture that detects soft errors using temporal redundancy. Previous proposals have explored introducing an Early Execution Unit (EXU) at the processor frontend to execute dynamic instructions with short dependency chains expeditiously for performance improvement. However, we observe that the functional units in the EXU are idle for a significant portion of the program execution duration. Our proposed microarchitecture utilises these inactive frontend functional units to re-execute dynamic instructions and ascertain computational correctness. A low-overhead backend verifier leverages idle resources for redundant execution using the proposed priority-based scheduling algorithm, which interleaves program execution with re-execution for error detection. Our proposal provides exhaustive transient fault coverage while improving performance by 7.5% over an existing restricted OoO microarchitecture, Freeflow Core, delivering IPC close to that of an out-of-order baseline while being $1.78\times$ more energy-efficient than the complex, power-hungry out-of-order design.
Contemporary integrated circuits are becoming increasingly susceptible to soft errors due to single-event upsets, effectively decreasing the reliability of operation. In this paper, we propose the ERrOR microarchitecture, which detects soft errors in processor operation using temporal redundancy with minimal hardware overhead. Previous proposals have explored the idea of introducing an Early Execution Unit (EXU) at the processor frontend to expeditiously execute dynamic instructions with short dependency chains for performance improvement. However, we observe that the functional units in the EXU are idle for a significant fraction of the program execution duration. ERrOR leverages these inactive frontend functional units to re-execute dynamic instructions for the purpose of error detection. A lightweight verifier introduced at the backend makes use of idle resources for redundant execution by interleaving program execution with re-execution for error detection. ERrOR provides exhaustive transient fault coverage while improving performance by 7.5% over an existing restricted OoO microarchitecture, Freeflow Core.
ISBN: (print) 9781728192017; 9781728192024
Massively parallel processors such as graphics processing units (GPUs) often face the challenge of resource underutilization due to the varying resource proclivity of workloads. Running multiple applications on a GPU is a known and efficient way to mitigate underutilization. This paper proposes a multi-application oriented framework that carries out dynamic optimizations based on the operational intensities of various applications. Our framework analyzes applications based on their operational intensities to identify their bottleneck resources using the Roofline model. We demonstrate that the proposed optimizations improve the utilization and system-wide throughput of a GPU co-running applications with irregular resource demands. The dynamic optimizations improve performance by 14.8% on average and by up to 72.4% over a state-of-the-art spatial multitasking technique.
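The Roofline classification mentioned in this abstract can be illustrated with a minimal sketch. It uses the standard Roofline bound (attainable throughput is the smaller of peak compute and memory bandwidth times operational intensity) and is not the paper's framework; the function names and the peak figures in the usage lines are hypothetical values chosen for illustration only.

```python
def attainable_gflops(operational_intensity, peak_gflops, peak_bandwidth_gbs):
    """Standard Roofline bound in GFLOP/s.

    operational_intensity: FLOPs performed per byte moved from memory.
    peak_gflops, peak_bandwidth_gbs: device peaks (assumed inputs).
    """
    return min(peak_gflops, peak_bandwidth_gbs * operational_intensity)


def bottleneck(operational_intensity, peak_gflops, peak_bandwidth_gbs):
    """Classify a kernel as memory-bound or compute-bound.

    The ridge point is the operational intensity at which the
    bandwidth roof meets the compute roof.
    """
    ridge = peak_gflops / peak_bandwidth_gbs
    return "memory" if operational_intensity < ridge else "compute"


# Hypothetical device: 1000 GFLOP/s peak compute, 100 GB/s peak bandwidth.
low = attainable_gflops(2.0, 1000.0, 100.0)    # limited by bandwidth
high = attainable_gflops(50.0, 1000.0, 100.0)  # limited by compute
```

Under this bound, pairing a kernel on the memory side of the ridge with one on the compute side is one plausible reason co-running can raise utilization, since the two stress different resources.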
Reliable multicast protocols are an important class of protocols for reliably disseminating information from a sender to multiple receivers in the face of node and link failures. A tree-based reliable multicast protocol (TRAM) provides scalable reliable multicast by grouping receivers in hierarchical repair groups and using a selective acknowledgment mechanism. We present an improvement to TRAM, TRAM++, that minimizes resource utilization at intermediate hosts and localizes the effect of slow or malicious receivers on normal receivers. We present an evaluation of TRAM and TRAM++ on a campus-wide WAN, both without errors and with message errors. The evaluation shows that, given a constraint on buffer availability at intermediate hosts, TRAM++ can tolerate the constraint at the expense of increasing the end-to-end latency for normal receivers by only 3.2% compared to TRAM in error-free cases. When slow or faulty receivers are present, TRAM++ provides the same uninterrupted quality of service to the normal nodes while localizing the effect of the faulty ones, without incurring any additional memory overhead.
Digital twins have a major potential to form a significant part of urban management in emergency planning, as they allow more efficient designing of the escape routes, better orientation in exceptional situations, and...