ISBN: 9781424464425 (print)
Because their complex abstractions are implemented over shared data structures protected by locks, conventional symmetric multithreaded operating system kernels such as Linux struggle to scale on emerging multi-core architectures, which integrate more and more cores on a single die. This paper presents GenerOS, a general asymmetric operating system kernel for multi-core systems. In principle, GenerOS partitions processing cores into application cores, kernel cores and interrupt cores, each dedicated to a specific function. In implementation, we make careful modifications to the Linux kernel and provide the same interface as the Linux kernel, so that GenerOS is compatible with legacy applications. The better performance of GenerOS comes mainly from three factors: (1) applications run on their own cores with minimal interrupt and kernel support; (2) every kernel service is encapsulated into a serial process, so there is less contention than in a conventional symmetric kernel; (3) a slim scheduling policy is used on the kernel core to schedule between system calls with low overhead. Experiments with two typical workloads on a 16-core AMD machine show that GenerOS outperforms the original Linux kernel when there are more processing cores (by 19.6% for TPC-H using the Oracle database management system and by 42.8% for httperf using the Apache web server).
Binary translation is one of the most important approaches to system migration. However, software binary translation systems often suffer from inefficiency, and traditional hardware-software co-designed virtual machines require a redesign of the processor architecture. This paper presents a novel hardware-software co-designed method to accelerate binary translation on an existing architecture. Hardware support is proposed for source-architecture-only functions, partial decoding and binary translation system acceleration. This hardware support helps the binary translation system achieve high performance and simplifies the design of the binary translation software, while keeping the hardware cost low. The support is implemented in Godson-3 processors to speed up x86 binary translation to the native MIPS instruction set. Performance evaluations on RTL simulation and FPGA emulation platforms show that the proposed method speeds up most benchmark programs by nearly 10 times compared with pure software-based binary translation and achieves about 70% of the performance of native program execution. The chip is fabricated in ST 65nm CMOS technology, and the physical design results show that the chip area cost is less than 5%.
Grid system software is inherently complex and hard to build and maintain. In this paper, we propose a self-managing building block, the grid unit, which facilitates constructing grid systems with higher availability and lower management overhead. We present an agent organization as an autonomic management framework and propose a self-recovering protocol that removes most of the tough jobs from the system administrator's routine. The system has been deployed since 2004 on Dawning 4000A, the biggest node of the China grid system. We have conducted extensive experiments to evaluate the grid unit, and the collected log data show that the availability of a grid parallel process management service built on the grid unit reaches 99.997%.
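To give a feel for what an availability of 99.997% means, the short sketch below converts an availability percentage into expected downtime. The one-year observation period is an assumption made purely for illustration; the abstract does not state the period over which the logs were collected, and the function name is hypothetical.

```python
def downtime_minutes(availability_percent, period_days=365):
    """Expected downtime in minutes for a given availability.

    period_days is an illustrative assumption; the abstract does not
    give the measurement period.
    """
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_days * 24 * 60


# Downtime over an assumed one-year period at the reported availability.
yearly = downtime_minutes(99.997)
print(f"{yearly:.1f} minutes per year")
```

Under that assumption, the reported figure corresponds to roughly a quarter of an hour of unavailability per year.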
Test power consumption is becoming a major concern in low-power integrated circuits (ICs). This paper presents a revised low-power compression architecture for scan test. The variance in power consumption is used to select test patterns during scan test, and a low-power feedback MUX is added to the scan chains. Simulation results obtained by mathematical methods show that the proposed test architecture is promising for reducing power consumption.
In database systems, disk I/O performance is usually the bottleneck of query processing. Among many techniques, compression is one of the most important for reducing disk accesses and thereby improving system performance. RLE (run-length encoding) is a lightweight compression algorithm that incurs negligible CPU cost. Much prior work shows that, although RLE is one of the most effective compression techniques in column-oriented systems, it is very hard to use in row-oriented systems, where values from multiple attributes are stored in the same page and value locality is therefore poor. We propose CRLE (Column-based RLE), a compression algorithm that applies RLE to row-oriented data storage. On a row-oriented storage page, CRLE exploits value locality within each individual column and encodes values from the same column in run-length format. Experiments show that CRLE achieves a very good compression ratio and performance despite the row-oriented data storage.
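The idea of run-length encoding each column of a row-oriented page can be sketched as follows. This is a minimal illustration only, not the CRLE page format from the paper: the function names, the in-memory list-of-tuples page, and the (value, run_length) pair layout are all assumptions.

```python
from itertools import groupby


def rle_encode(values):
    """Run-length encode a sequence into (value, run_length) pairs."""
    return [(value, len(list(group))) for value, group in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into a flat list."""
    return [value for value, length in runs for _ in range(length)]


def encode_page(rows):
    """Encode a row-oriented page column by column (hypothetical layout)."""
    columns = zip(*rows)  # transpose: one tuple per column
    return [rle_encode(column) for column in columns]


def decode_page(encoded_columns):
    """Rebuild the original rows from per-column run lists."""
    columns = [rle_decode(runs) for runs in encoded_columns]
    return [tuple(row) for row in zip(*columns)]


# A small page whose first two columns have good value locality.
page = [
    ("CN", "open", 1),
    ("CN", "open", 2),
    ("CN", "closed", 3),
    ("US", "closed", 4),
]
encoded = encode_page(page)
```

In this toy page the first two columns collapse to two runs each, while the third column, which has no repeated values, gains nothing. This mirrors the point that the benefit depends on value locality within each column.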
Heterogeneous Chip Multi-Processors (heter-CMP) provide suitable resources to various applications and could get more benefits on performance than homogeneous CMP. To fully develop the performance of the heter-CMP sys...
ISBN: 9781424475162 (print)
Very large scale integrated circuits typically employ a Network-on-Chip (NoC) as the backbone for on-chip communication. As technology advances into the nanometer regime, NoCs become more and more susceptible to permanent faults such as manufacturing defects and device wear-out, which hinder the correct operation of the entire system. Therefore, effective fault-tolerant techniques are essential to improve the reliability of NoCs. Prior work mainly focuses on introducing redundancy, which cannot achieve satisfactory reliability and also incurs large hardware overhead, especially for data path components. In this paper, we propose fine-grained data path salvaging techniques that split data path components (i.e., links, input buffers and the crossbar) into slices instead of introducing redundancy. As long as there is one fault-free slice for each component, the router remains functional. Experimental results show that the proposed solution achieves high reliability with graceful performance degradation even under high fault rates.
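One way to picture slicing is a link divided into equal-width slices, where a flit is serialized over whichever slices are still fault-free. The sketch below assumes exactly that serialization model; the abstract does not describe the actual mechanism, and the function names and data layout are hypothetical.

```python
import math


def cycles_per_flit(slice_ok):
    """Cycles needed to move one full-width flit over a sliced link.

    slice_ok holds one boolean per slice, True meaning fault-free.
    Assumes each healthy slice carries one slice-wide chunk per cycle.
    """
    healthy = sum(slice_ok)
    if healthy == 0:
        raise ValueError("no fault-free slice: the component is unusable")
    return math.ceil(len(slice_ok) / healthy)


def schedule_flit(chunks, slice_ok):
    """Assign flit chunks to healthy slices, one list of pairs per cycle."""
    healthy = [index for index, ok in enumerate(slice_ok) if ok]
    if not healthy:
        raise ValueError("no fault-free slice: the component is unusable")
    schedule = []
    for start in range(0, len(chunks), len(healthy)):
        batch = chunks[start:start + len(healthy)]
        schedule.append(list(zip(healthy, batch)))
    return schedule


# Four slices with two faulty: the flit takes two cycles instead of one.
plan = schedule_flit(["a", "b", "c", "d"], [True, False, True, False])
```

Under this model a single surviving slice still delivers every flit, only more slowly, which is the graceful degradation described above.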
ISBN: 9781424475162 (print)
Three-dimensional (3D) integration and Network-on-Chip (NoC) have both been proposed to tackle on-chip interconnect scaling problems, and extensive research effort has been devoted to the design challenges of combining the two. Through-silicon via (TSV) is considered the most promising technology for 3D integration; however, TSV pads distributed across planar layers occupy significant chip area and result in routing congestion. In addition, the yield of 3D integrated circuits decreases dramatically as the number of TSVs increases. For symmetric 3D mesh NoCs, we observe that TSV utilization is quite low and that adjacent routers rarely transmit packets via their vertical channels (i.e., TSVs) at the same time. Based on this observation, we propose a novel TSV squeezing scheme that shares TSVs among neighboring routers in a time-division multiplexing mode, which greatly improves TSV utilization. Experimental results show that the proposed method saves significant TSV footprint with negligible performance overhead.
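Sharing one vertical channel among neighboring routers requires some form of arbitration. The sketch below uses a simple round-robin arbiter as a stand-in; the abstract only states that TSVs are shared in a time-division multiplexing mode, so the arbitration policy, the function name, and the request representation are all assumptions.

```python
def share_tsv(requests_per_cycle, num_routers):
    """Grant a shared TSV bundle to at most one router per cycle.

    requests_per_cycle: one set of requesting router ids per cycle.
    Returns the granted router id for each cycle, or None if idle.
    Round-robin priority is an illustrative choice, not the paper's.
    Losing requests are not queued; a router must request again.
    """
    grants = []
    next_priority = 0
    for requests in requests_per_cycle:
        granted = None
        for offset in range(num_routers):
            candidate = (next_priority + offset) % num_routers
            if candidate in requests:
                granted = candidate
                # Rotate priority so the winner goes last next time.
                next_priority = (candidate + 1) % num_routers
                break
        grants.append(granted)
    return grants


# Two neighboring routers that only occasionally collide on the channel.
grants = share_tsv([{0}, set(), {0, 1}, {0, 1}, {1}], num_routers=2)
```

When simultaneous requests are rare, as the abstract observes for adjacent routers, most cycles have at most one requester and sharing costs little.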
A ring is a promising on-chip interconnect for CMPs. It is more scalable than a bus and much simpler than packet-switched networks. The ordering property of the ring can be used to optimize cache coherence protocol design. Existing ring protocols, such as the snooping ring protocol and the ring-order protocol, resolve conflicting memory requests by using a retry and acknowledgement scheme or the ordering property of the ring, respectively. This paper develops a cache coherence protocol named SOR (Snooping and Ordering Ring) for ring-connected CMPs. The protocol is based on the snooping ring protocol, but instead of using the acknowledgement and retry scheme, it uses the ordering property of the ring to resolve conflicts, and thus avoids unnecessary retries and improves performance and power efficiency. The L1 snooping results are sent with the requests instead of being delayed, so that many useless snoops can be avoided. Simulation results show that SOR reduces probe slot transports and snoop operations by 47% and 48.9% on average, respectively. The average and maximum performance improvements from SOR are 3.33% and 6%.
ISBN: 9781605589114 (print)
This paper describes a multi-FPGA based platform for emulating the Loongson-2G micro-processor on different motherboards. The platform targets verification and evaluation of the Loongson-2G micro-processor, the next generation of the Loongson-2 family, which is composed of one four-issue, out-of-order 64-bit MIPS-compatible processor core named GS464, one 1 MB secondary cache, one HyperTransport I/O interface, one DDR2/3 memory interface and several other low-speed I/O interfaces. Most parts of this micro-processor are mapped onto the multi-FPGA based platform, which consists of two Virtex-5 330 FPGA chips. Semi-custom partitioning tactics within the entire design flow are developed to synthesize the whole design onto the multi-FPGA based platform. Architectural-level modifications are applied to the original chip architecture to make it easy to partition into two parts. The high-speed SerDes of the HyperTransport I/O link and the DDR2/3 memory interface are emulated using several clocks with different phases. To address the difficulty of debugging in an FPGA system, a software probe method assisted by hardware modules injected into the FPGA is developed and used to debug problems caused by behavior mismatches between the ASIC RAM block and the FPGA RAM block. Some performance evaluation of Loongson-2G is done on this multi-FPGA based platform as a pre-silicon test. To the authors' knowledge, there has been no previous work on verification and evaluation of such a large design.