ISBN: 9781509058266 (print)
Mainstream multi-core processors employ large multilevel on-chip caches, making them highly susceptible to soft errors. We demonstrate that designing a reliable cache hierarchy requires understanding the vulnerability interdependencies across different cache levels. This involves vulnerability analyses that depend on the parameters of the different cache levels (partition size, line size, etc.) and the corresponding cache access patterns of different applications. This paper presents a novel soft error-aware cache architectural space exploration methodology and a vulnerability analysis of multi-level caches that considers their vulnerability interdependencies. Our technique significantly reduces exploration time while providing reliability-efficient cache configurations. We also show its applicability and benefits for ECC-protected caches under multi-bit fault scenarios.
A widely-used quantum programming paradigm comprises both data flow and control flow. Existing quantum hardware cannot support control flow well, significantly limiting the range of quantum software executa...
Continuous shrinking of transistor size to provide high computation capability along with low power consumption has been accompanied by reliability degradation due to, e.g., aging phenomena. In this regard, with their huge number of configuration bits, Field-Programmable Gate Arrays (FPGAs) are more susceptible to aging, since aging not only degrades performance but may also corrupt configuration cells and thus cause permanent circuit malfunction. While several works have investigated aging effects in Look-Up Tables (LUTs), the routing fabric of these devices is seldom studied, even though it accounts for the majority of an FPGA's resources and configuration bits. Furthermore, errors in its state are highly likely to propagate to the device outputs. In this paper, we first investigate aging effects in the routing fabric of FPGAs with respect to performance and reliability degradation. Based on this investigation, we enhance the conventional routing algorithm to mitigate the impact of aging by increasing the recovery time (i.e., the mechanism used to heal aging-induced defects) of transistors used in the routing resources. We evaluate the proposed method in terms of the reduction in stress time and in the guardband required to protect against aging in the routing fabric, as well as the improvement in FPGA lifetime. Our experiments show that the proposed method reduces the average stress time and aging-induced delay of routing resources by 41% and 18.3%, respectively. This, in turn, improves the device lifetime by 130% compared to baseline routing. The proposed method can be applied through a simple amendment to conventional routing algorithms and thus incurs negligible delay overhead.
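The abstract does not say how the routing algorithm is amended. As a rough illustration only, one plausible way to steer nets away from heavily stressed routing resources is to add a stress penalty to the path cost. The sketch below is a plain Dijkstra search; the graph layout, the `stress` map and the weight `alpha` are hypothetical and not taken from the paper.

```python
import heapq


def route(graph, stress, src, dst, alpha=1.0):
    """Cheapest path where entering node v costs delay + alpha * stress[v].

    graph:  {node: [(neighbor, delay), ...]}  (hypothetical routing graph)
    stress: {node: accumulated stress}        (hypothetical aging metric)
    alpha:  weight of the stress penalty; 0 gives plain delay-driven routing
    """
    dist = {src: 0.0}
    prev = {}
    heap = [(0.0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == dst:
            break
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v, delay in graph.get(u, []):
            nd = d + delay + alpha * stress.get(v, 0.0)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                prev[v] = u
                heapq.heappush(heap, (nd, v))
    if dst not in dist:
        return None  # destination unreachable
    path = [dst]
    while path[-1] != src:
        path.append(prev[path[-1]])
    return path[::-1]


# Toy example: B is the faster hop but has accumulated stress.
graph = {"A": [("B", 1.0), ("C", 1.5)], "B": [("D", 1.0)], "C": [("D", 1.0)]}
stress = {"B": 2.0}
baseline_path = route(graph, stress, "A", "D", alpha=0.0)
aging_aware_path = route(graph, stress, "A", "D", alpha=1.0)
```

With `alpha=0.0` the search picks the fastest path through the stressed resource; with `alpha=1.0` it accepts a slightly slower detour, which gives the stressed resource more time to recover.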
Modern embedded technology enables a high level of compute performance at the cost of little energy. Hence, miniaturized satellite development has begun to rely upon conventional application processor architectures and FPGAs, and can nowadays offer an abundance of performance and storage capacity.
ISBN: 9781467383899 (print)
Aging effects in nano-scale CMOS circuits impair the reliability and Mean Time to Failure (MTTF) of embedded systems. Especially for FPGAs that are manufactured in the latest technology node, aging is a major concern. We introduce the first cross-layer aging-aware placement method for accelerators in FPGA-based runtime reconfigurable architectures. It optimizes stress distribution by accelerator placement at runtime, i.e., by deciding into which reconfigurable region an accelerator is reconfigured. Additionally, it optimizes logic placement at synthesis time to diversify the resource usage of individual accelerators, i.e., which CLBs of a reconfigurable region are used by an accelerator. Both layers together balance the intra- and inter-region stress induced by the application workload at negligible performance cost. Experimental results show a reduction in maximum stress of up to 64% and 35%, which leads to MTTF improvements of up to 177% and 14% relative to state-of-the-art methods w.r.t. HCI and BTI aging, respectively.
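The runtime layer described above can be pictured as a stress-balancing choice among reconfigurable regions. The following is a minimal sketch, assuming a greedy least-stressed-first policy and a single scalar stress value per region. Both are simplifications introduced here for illustration; the abstract does not state the actual policy or stress model.

```python
def place_accelerator(region_stress, accel_stress):
    """Place an accelerator into the region with the least accumulated stress.

    region_stress: accumulated stress per reconfigurable region (hypothetical
                   scalar metric), updated in place.
    accel_stress:  stress the accelerator is assumed to add to its region.
    Returns the index of the chosen region.
    """
    target = min(range(len(region_stress)), key=lambda i: region_stress[i])
    region_stress[target] += accel_stress
    return target


# Toy example: three regions, four accelerator reconfigurations.
stress = [0.0, 0.0, 0.0]
placements = [place_accelerator(stress, s) for s in (5.0, 3.0, 2.0, 4.0)]
```

In this toy run the fourth accelerator lands in the region that had accumulated the least stress so far, so no single region absorbs the whole workload.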
ISBN: 9783981537048 (print)
Dark Silicon refers to the constraint that only a fraction of on-chip resources (cores) can be simultaneously powered on (running at full performance) in order to stay within the allowable power budget and safe temperature limits, while the others remain 'dark'. In this paper, we demonstrate how these 'dark cores' can be leveraged to improve the temperature profile at run-time, thus providing opportunities to power on more cores at the nominal voltage than the number allowed when strictly obeying the conventional Thermal Design Power (TDP) constraint. We propose a computationally efficient dark silicon management technique that determines the best set of cores to keep dark and the mapping of threads to cores at run-time, while also accounting for the impact of process variations. We have developed a lightweight temperature prediction mechanism that determines the impact of different candidate solutions on the chip thermal profile. Experimental evaluation of the proposed techniques on a simulated 8×8 many-core processor, and across a range of chips to account for process variations, shows that the total instruction throughput is increased by 1.8× on average while keeping the temperature within the safe limits, when compared with state-of-the-art approaches.
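To make the idea of comparing candidate dark-core patterns concrete, here is a toy sketch. It assumes a very simple linear thermal model (ambient temperature plus self-heating plus a fixed contribution from each active 4-neighbor). The model and all coefficients are hypothetical and are not the paper's temperature predictor.

```python
def peak_temperature(active, t_amb=45.0, self_heat=10.0, coupling=2.0):
    """Predict the hottest core for an on/off map under a toy linear model.

    active:    2D grid of truthy (powered on) / falsy (dark) cores.
    t_amb:     assumed ambient temperature.
    self_heat: assumed temperature rise of a core that is powered on.
    coupling:  assumed rise contributed by each active 4-neighbor.
    """
    rows, cols = len(active), len(active[0])
    peak = t_amb
    for r in range(rows):
        for c in range(cols):
            t = t_amb + (self_heat if active[r][c] else 0.0)
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and active[nr][nc]:
                    t += coupling
            peak = max(peak, t)
    return peak


# Two candidates with the same number of active cores (8 of 16).
clustered = [[1, 1, 1, 1], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
checker = [[(r + c) % 2 == 0 for c in range(4)] for r in range(4)]
best = min((clustered, checker), key=peak_temperature)
```

Under these assumptions the checkerboard pattern has the lower predicted peak, because dark cores separate the active ones. A run-time manager could use a cheap predictor of this kind to rank candidate maps before committing to one.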
Coarse-Grained Reconfigurable Architectures (CGRAs) promise both low power and high performance coupled with flexibility; however, automatic mapping of applications to such platforms remains a great research challenge. Efficient manual mapping of the data-centric kernels of applications yields great results, but these kernels internally contain control-flow-specific tasks, which introduce mapping irregularities and execution inefficiencies on CGRAs. In this paper, we explore the analysis, design and synthesis of reconfigurable structures for efficient application-specific control-flow processing, aiming to develop a methodology for designing reconfigurable control-flow acceleration modules. Such modules can be coupled with generic CGRAs, off-loading the execution of irregular and ill-suited sequential control-flow subroutines and enabling the CGRA to exploit a clean, regular data-flow-centric mapping. Considering different architectural paradigms, we design a functional array-based design, a VLIW-style design and an automatically generated design based on graph-theoretic concepts, and compare them against the ASIC implementation of the control-flow operations for several kernels of the linear algebra domain. Such reconfigurable control-flow-specific accelerators are a first step towards automating CGRA-based accelerator design and application mapping from high-level descriptions.
ISBN: 9781479977932 (print)
Future on-chip manycore systems are expected to have hundreds of cores, and to be used for a number of applications to amortize their fabrication costs. In this paper, we examine how software pipelines, which are useful for streaming/multimedia applications, can be efficiently executed on a manycore system with shared memory. The goal is to balance the stages of the pipeline under workload and resource variations. This paper presents ADAPT, a method to quickly detect bottleneck stages and add cores (workers) to those bottleneck stages at run-time. Further, if there are no idle workers, then a shuffling of workers across stages is performed to improve/maintain throughput. ADAPT is implemented in a 48-core system which is built using a commercial core and tool suite. For a variety of applications, ADAPT takes less than 2 μs for one run-time adaptation, and achieves up to 2.1× the throughput of a state-of-the-art method (which is modified and implemented in the same system for a fair comparison). These results illustrate the applicability of ADAPT for fine-grained run-time management of manycore systems to achieve high throughput for software pipelines.
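The adaptation loop described above (find the bottleneck stage, give it an idle worker, otherwise shuffle a worker from another stage) can be sketched as follows. This is a simplified illustration rather than ADAPT itself. It assumes that per-stage service times are known and that a stage speeds up linearly with its worker count, neither of which comes from the abstract.

```python
def adapt_step(service_time, workers, idle):
    """Run one adaptation step on a software pipeline.

    service_time: time per item for each stage with one worker (assumed known).
    workers:      workers currently assigned to each stage, updated in place.
    idle:         number of unassigned workers.
    Returns the number of idle workers left.
    """
    # Effective time per item, assuming ideal linear speed-up (an assumption).
    eff = [t / w for t, w in zip(service_time, workers)]
    bottleneck = max(range(len(eff)), key=lambda i: eff[i])
    if idle > 0:
        workers[bottleneck] += 1
        return idle - 1
    # No idle worker: look for a stage that can give one up.
    donors = [i for i in range(len(workers))
              if i != bottleneck and workers[i] > 1]
    if not donors:
        return idle
    # Choose the donor that stays fastest after losing a worker.
    donor = min(donors, key=lambda i: service_time[i] / (workers[i] - 1))
    donor_time_after = service_time[donor] / (workers[donor] - 1)
    # Shuffle only if the donor does not become slower than the old bottleneck.
    if donor_time_after < eff[bottleneck]:
        workers[donor] -= 1
        workers[bottleneck] += 1
    return idle


# Toy example: stage 1 is the bottleneck; one idle worker is available.
times = [4.0, 12.0, 2.0]
workers = [2, 1, 2]
idle = adapt_step(times, workers, 1)     # uses the idle worker
idle = adapt_step(times, workers, idle)  # no idle worker left, so shuffle
```

After the two steps the bottleneck stage holds three workers, one taken from the idle pool and one moved from the lightest stage.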
ISBN: 9783981537048 (print)
On-chip many-core systems are expected to be in common use in the future. A set of homogeneous processors in a many-core system can be used to implement multiple pipelines which execute simultaneously. Pipelines of processors use varying numbers of cores when their workloads vary at run time. In this paper, we show how such a system executing multiple pipelines with varying workloads can be implemented. We further show how the system can switch cores within a pipeline (intra-elasticity) and between pipelines (inter-elasticity). The method is named E-pipeline, and is implemented and evaluated in a commercial tool suite. Compared to reference design methods with clock gating, E-pipeline achieves the same power savings, maintains the throughput required by the throughput constraints and reduces core usage by an average of 37.7%. The adaptation overhead for switching cores is approximately 2 μs.