Soft errors have emerged as a key challenge in microprocessor design. Traditional soft error tolerance techniques (such as redundant multithreading and instruction duplication) can achieve high fault coverage, but at the cost of significant performance degradation. Prior research reports that soft errors can be masked at the architecture level, and that the degree of such masking, termed the architectural vulnerability factor (AVF), can vary significantly across workloads and individual structures; strict redundant execution may therefore be unnecessary for soft error tolerance. In this work, we exploit this variation in AVF to adaptively tune reliability and performance. We present an infrastructure to compute and predict AVF online for three microprocessor structures (the IQ, ROB, and LSQ), guiding when the protection scheme should be activated to improve reliability. Experimental results show that our method can efficiently compute the AVF for different structures independently of hardware configurations. The average differences between our method and a prior offline AVF computation method are 0.10, 0.01, and 0.039 for the IQ, ROB, and LSQ, respectively.
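The core quantity in the abstract above can be made concrete: AVF is the fraction of a structure's bit-cycles that hold architecturally correct execution (ACE) state. A minimal sketch of such an estimate follows; the function and sample numbers are illustrative, not the paper's online infrastructure.

```python
# Hedged sketch: a simplified AVF estimate for a queue-like structure
# (e.g. the IQ or ROB). Names and numbers here are illustrative.

def estimate_avf(ace_entries_per_cycle, structure_size):
    """AVF ~= ACE bit-cycles / total bit-cycles: the average fraction
    of entries holding ACE (architecturally correct execution) state,
    sampled once per cycle."""
    total_cycles = len(ace_entries_per_cycle)
    ace_bit_cycles = sum(ace_entries_per_cycle)
    return ace_bit_cycles / (structure_size * total_cycles)

# Example: a 32-entry ROB sampled over 4 cycles.
samples = [8, 16, 12, 4]   # ACE-occupied entries observed each cycle
print(estimate_avf(samples, 32))  # 40 / 128 = 0.3125
```

An online scheme would maintain such a running ratio in hardware counters and use it to decide when protection is worth its cost.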
Server consolidation based on virtualization technology can simplify system administration, reduce the cost of power and physical infrastructure, and improve utilization in today's Internet-service-oriented enterprise data centers. How much power and how many servers can be saved through server consolidation in VM-based data centers is of great interest to the administrators and designers of those data centers. Different workload consolidations differ in how much power and how many physical servers they save, and the fluctuating impact of virtualization on concurrent services can strongly affect consolidation decisions. This paper proposes a utility analytic model for Internet-oriented server consolidation in VM-based data centers, based on queuing theory, which models the interaction between arriving requests with differing QoS requirements and the flow of capacity among concurrent services. According to the characteristics of those services' workloads, the model provides an upper bound on the number of consolidated physical servers needed to guarantee QoS with the same request loss probability as dedicated servers. It can also evaluate server consolidation in terms of the power and utility of physical servers. Finally, we verify the model via a case study comprising an e-book database service and an e-commerce Web service, simulated by the TPC-W and SPECweb2005 benchmarks, respectively. Our experiments show that the model is simple yet sufficiently accurate. Running on Rainbow, our virtual computing platform, VM-based server consolidation saves up to 50% of the physical infrastructure and up to 53% of the power, and improves CPU utilization by a factor of 1.7, without any degradation in the concurrent services' performance.
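The "same loss probability as dedicated servers" criterion has a classical queuing-theory shape. As a hedged sketch (the paper's model is richer), the standard Erlang B formula for an M/M/c/c loss system gives the smallest server count that meets a target loss probability for a given offered load; all numbers below are illustrative.

```python
def erlang_b(servers, offered_load):
    """Blocking (loss) probability of an M/M/c/c queue, computed with
    the standard recursive form of the Erlang B formula."""
    b = 1.0
    for n in range(1, servers + 1):
        b = offered_load * b / (n + offered_load * b)
    return b

def min_servers(offered_load, target_loss):
    """Smallest server count whose loss probability meets the target --
    an upper bound in the spirit of the paper's QoS guarantee."""
    c = 1
    while erlang_b(c, offered_load) > target_loss:
        c += 1
    return c

# 10 erlangs of offered load at a 1% loss target needs 18 servers.
print(min_servers(10.0, 0.01))  # 18
```

Consolidation can then be judged by comparing this bound against the server count of the dedicated deployment.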
Thread-level redundancy in chip multiprocessors (TLR-CMP) is an efficient approach to soft error tolerance. Process variation causes core-to-core (C2C) performance asymmetry across a chip, which should be taken into consideration for application scheduling. In this paper, two types of variation beyond C2C are introduced: inter-pair and intra-pair variation in a TLR-CMP. Intra-pair performance asymmetry affects the performance of different applications differently. Based on this observation, we first formalize variation-aware scheduling in a TLR-CMP as a 0-1 programming problem that maximizes the system's weighted throughput. An efficient scheduling algorithm, named IntraVarF&AppSen, is then proposed to tackle this problem; it can be proved optimal when the number of applications to be scheduled equals the number of core pairs. Simulation on a 64-core CMP shows a 2.8%-4% improvement in weighted throughput compared to the prior VarF&AppIPC algorithm.
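When the number of applications equals the number of core pairs, the 0-1 program reduces to a one-to-one assignment problem. The brute-force sketch below shows the objective being maximized; the throughput matrix is made up for illustration and is not the paper's IntraVarF&AppSen algorithm, which solves this efficiently rather than exhaustively.

```python
from itertools import permutations

def best_assignment(throughput):
    """throughput[i][j]: weighted throughput of application i on core
    pair j (illustrative numbers). Exhaustively solves the one-to-one
    assignment that the 0-1 program reduces to when #apps == #pairs."""
    n = len(throughput)
    best, best_perm = float("-inf"), None
    for perm in permutations(range(n)):
        total = sum(throughput[i][perm[i]] for i in range(n))
        if total > best:
            best, best_perm = total, perm
    return best, best_perm

# Three apps, three core pairs with intra-pair variation.
tput = [[1.0, 0.8, 0.9],
        [0.7, 1.1, 0.6],
        [0.9, 0.5, 1.2]]
print(best_assignment(tput))  # (3.3, (0, 1, 2))
```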
The enhanced scan delay testing approach can achieve high transition delay fault coverage with a small test pattern set, but with significant hardware overhead. Although the implementation cost of the launch-on-capture (LOC) approach is relatively low, the pattern set it generates for testing delay faults is typically very large. In this paper, we present a novel flip-flop selection method that combines the respective advantages of the two approaches by replacing a small number of selected regular scan cells with enhanced scan cells, thereby effectively reducing the overall volume of transition delay test patterns. Moreover, this approach also achieves higher fault coverage than the standard LOC approach. Experimental results on the larger ISCAS-89 and ITC-99 benchmark circuits, using a commercial test generation tool, show that the volume of test patterns can be reduced by over 70% and the transition delay fault coverage can be improved by up to 8.7%.
Conflicts can severely degrade computer performance: bank conflicts reduce the bandwidth of interleaved multibank memory systems, and conflict misses reduce effective on-chip capacity, which in turn causes further conflict misses. Conflicts can be avoided by a suitable address mapping scheme that maps the most frequently occurring access patterns conflict-free. In this paper, we present a new XOR-based mapping scheme, called XORM, which targets the multi-bank shared cache of on-chip many-core architectures. The XORM scheme can map arbitrary bits of an address to the set index, computing each set index bit as the XOR of a selected subset of the address bits. We then analyze the characteristics an address mapping scheme needs in order to avoid conflicts in a many-core architecture. Next, we give a case study designing optimal hash functions based on the XORM scheme for a skewed-associative cache. Finally, we present another case study illustrating how to design an XORM mapping scheme with low implementation cost, complexity, and computation latency for the shared cache of an on-chip many-core architecture. The evaluation results show the effectiveness of the XORM mapping scheme.
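The general shape of such a mapping is easy to sketch: each set index bit is the parity (XOR) of the address bits selected by a mask. The masks below are illustrative, not the optimized hash functions designed in the paper.

```python
def xor_index(addr, masks):
    """Compute a set index whose bit k is the XOR (parity) of the
    address bits selected by masks[k] -- the general form of an
    XOR-based mapping scheme."""
    index = 0
    for k, mask in enumerate(masks):
        parity = bin(addr & mask).count("1") & 1
        index |= parity << k
    return index

# 2-bit set index over an 8-bit address: each index bit XORs two
# address bits. A stride-16 access pattern, which plain low-order-bit
# indexing would pile onto one set, now spreads across all four sets.
masks = [0b00010001, 0b00100010]
print([xor_index(a, masks) for a in range(0, 64, 16)])  # [0, 1, 2, 3]
```

Choosing masks whose XOR combinations are linearly independent is what keeps common strided patterns conflict-free.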
The on-chip many-core architecture is an emerging and promising computation platform. High-speed on-chip communication and abundant on-chip resources are two outstanding advantages of this architecture, and they provide an opportunity to implement efficient synchronization schemes. The practical execution efficiency of a synchronization scheme is critical on this platform. However, there has been little research on systematic methods for evaluating and choosing synchronization schemes for on-chip many-core processors, or on the effect of dedicated hardware support in this context. We therefore focus on evaluation methods and criteria for synchronization schemes on this platform. First, we present several criteria suited to on-chip many-core architectures: the absolute overhead of a synchronization operation, the transition time between different synchronization operations, the overhead caused by load imbalance, and the network congestion caused by synchronization operations. Second, we illustrate how to design microbenchmarks, each dedicated to evaluating one performance criterion. Finally, we implement these microbenchmarks and synchronization schemes on an on-chip many-core processor with a shared level-two cache and on a commercial AMD Opteron chip multiprocessor, respectively, and analyze the effect of dedicated hardware support. The results show that most synchronization overhead is caused by load imbalance and by serialization at the synchronization point. They also show that a synchronization scheme supported by dedicated hardware can noticeably improve performance on an on-chip many-core processor.
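A microbenchmark for the first criterion, the absolute overhead of a synchronization operation, can be sketched in a few lines: time many barrier crossings and divide. This uses Python's `threading.Barrier` purely for illustration; software threads only approximate what the paper measures on real many-core hardware.

```python
import threading
import time

def barrier_overhead(num_threads=4, iters=1000):
    """Illustrative microbenchmark: average wall-clock cost of one
    barrier crossing across num_threads threads."""
    barrier = threading.Barrier(num_threads)

    def worker():
        for _ in range(iters):
            barrier.wait()   # the synchronization operation under test

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return (time.perf_counter() - start) / iters

print(f"{barrier_overhead() * 1e6:.1f} us per barrier crossing")
```

The remaining criteria follow the same pattern: isolate one effect (operation switching, deliberate load skew, synchronization-induced traffic) per microbenchmark.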
Synchronization between threads has a serious impact on the performance of many-core architectures. When communication is frequent, coarse-grained synchronization incurs significant overhead and is therefore unsuitable, whereas the overhead of fine-grained synchronization remains small. For many-core architectures that support fine-grained synchronization with on-chip storage, we propose fine-grained synchronization algorithms for two scientific computing applications: the 2-D wavefront and LU decomposition. First, an efficient data allocation method is proposed according to the memory access pattern. Then, the thread partitioning and synchronization schemes are discussed. Finally, we evaluate the two algorithms on the Godson-T many-core architecture. The experimental results show that the relative speedup is almost linear and that the execution time is only 53.2% of that with coarse-grained synchronization. After the global barriers are eliminated, LU decomposition achieves a 13.1% performance improvement. Moreover, the experiments demonstrate that the fine-grained mechanism improves processor performance and scales well.
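The 2-D wavefront dependency structure is what makes fine-grained synchronization pay off: each cell depends only on its north and west neighbours, so a thread may start a cell as soon as those two are ready, with no global barrier per diagonal. A sequential sketch of the recurrence (illustrative boundary values, not the paper's kernel):

```python
def wavefront(n):
    """2-D wavefront recurrence: a[i][j] = a[i-1][j] + a[i][j-1], with
    boundary cells fixed at 1. Processing by anti-diagonals makes the
    available parallelism explicit: all cells on one diagonal are
    independent, and fine-grained (per-cell) synchronization lets
    threads cross diagonals without a global barrier."""
    a = [[1] * n for _ in range(n)]
    for d in range(2, 2 * n - 1):              # anti-diagonal i + j = d
        for i in range(max(1, d - n + 1), min(d - 1, n - 1) + 1):
            j = d - i
            if 1 <= j < n:
                a[i][j] = a[i - 1][j] + a[i][j - 1]
    return a

print(wavefront(4)[3][3])  # 20 (= C(6, 3), the binomial recurrence)
```

With coarse-grained synchronization, every diagonal would end in a barrier; the fine-grained version replaces each barrier with per-cell ready flags in on-chip storage.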
To date, most many-core prototypes employ tiled topologies connected through on-chip networks. The throughput and latency of the on-chip network often become the bottleneck to achieving peak performance, especially for communication-intensive applications. Most studies focus on the on-chip network alone, such as routing algorithms or router microarchitecture, to improve these metrics. The salient aspect of our approach is that we provide a data management framework that implements highly efficient on-chip traffic from the perspective of the overall many-core system. The major contributions of this paper are: (1) a novel tiled many-core architecture that supports software-controlled on-chip data storage and movement management; and (2) the identification of asynchronous bulk data transfer as an effective mechanism for tolerating the latency of a 2-D mesh on-chip network. Finally, we evaluate a 1-D FFT algorithm on the framework; it achieves 47.6 Gflops at 24.8% computational efficiency.
ISBN:
(Print) 9781424450909; 9781424450916
For a gigahertz microprocessor with multiple clock domains and a large number of embedded RAMs (random-access memories), generating at-speed test patterns is becoming very difficult and time-consuming. This paper presents several novel techniques to improve at-speed test coverage at low cost. These techniques mainly concern preventing the propagation of X states; they include avoiding the capture of X states in registers, sequential bypass of macros, a clock control scheme for inter-clock domains, and accurate analysis of exception paths within intra-clock domains. Functional patterns are used to further improve the efficiency of at-speed testing. A novel optimized flow is presented by carefully selecting among these techniques. Using this flow, 90% transition fault coverage is achieved. In addition, both the number of patterns and the test time of the transition test are decreased by 15%. The total area overhead is only a few hundred AND cells, with little timing impact on the critical paths.
Fast evolutionary programming (FEP) introduced the Cauchy distribution into its mutation operator, significantly improving the performance of EP on a number of benchmark problems. However, the scaling parameter of the Cauchy mutation is fixed, which has become an obstacle to further improving FEP's performance. This paper proposes and analyzes a new stochastic method for controlling a variable scaling parameter for the Cauchy mutation. The method gathers information from a group of individuals randomly selected from the population. Empirical evidence validates that our method is very helpful in improving the performance of FEP.
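The mechanism can be sketched in a few lines. The Cauchy draw via the inverse-CDF method is standard; the `stochastic_scale` rule below, which derives the scaling parameter from the spread of a randomly sampled group, is an illustrative stand-in for the paper's control method, not the method itself.

```python
import math
import random

def cauchy(scale=1.0):
    """Standard Cauchy sample via the inverse-CDF (tangent) method."""
    return scale * math.tan(math.pi * (random.random() - 0.5))

def stochastic_scale(population, group_size=3):
    """Illustrative stand-in: set the Cauchy scaling parameter from
    information gathered in a randomly selected group of individuals
    (here, simply the group's spread)."""
    group = random.sample(population, group_size)
    return max(max(group) - min(group), 1e-12)

def mutate(x, population):
    """FEP-style mutation with a variable, population-derived scale
    instead of FEP's fixed scaling parameter."""
    return x + cauchy(stochastic_scale(population))

pop = [random.uniform(-5, 5) for _ in range(20)]
child = mutate(pop[0], pop)
```

The heavy Cauchy tails preserve FEP's long-jump escape behaviour, while the population-derived scale lets step sizes shrink as the population converges.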