As one of the most important enabling technologies of cloud computing, virtualization brings to HPC good manageability, online system maintenance, performance isolation and fault isolation. Furthermore, previous study...
Thread migration is an effective technique for fault resilience and load balancing in high performance computing. However, flexible thread migration is not easy to achieve. In this paper, we present an approach to cre...
Fast facial points fitting plays an important role in applications such as Human-computer Interaction, entertainment, surveillance, and is highly relevant to the techniques of facial expression analysis, face recognit...
ISBN: (print) 9781424456581
As technology advances in both increasing bandwidth and reducing latency for I/O buses and devices, moving I/O data into and out of memory has become critical. In this paper, we observe the different characteristics of I/O and CPU memory reference behavior and find potential benefits in separating I/O data from CPU data. We propose a DMA cache technique that stores I/O data in dedicated on-chip storage, and we present two DMA cache designs. The first design, Decoupled DMA Cache (DDC), adopts additional on-chip storage as the DMA cache to buffer I/O data. The second design, Partition-Based DMA Cache (PBDC), does not require additional on-chip storage but can dynamically use some ways of the processor's last-level cache (LLC) as the DMA cache. We have implemented and evaluated the two DMA cache designs using an FPGA-based emulation platform and memory reference traces of real-world applications. Experimental results show that, compared with the existing snooping-cache scheme, DDC can reduce memory access latency (in bus cycles) by 34.8% on average (up to 58.4%), while PBDC can achieve about 80% of DDC's performance improvement despite using no additional on-chip storage.
Virtual machine technology can provide high server utilization and service consolidation on an individual physical machine, and gains acceptance in diverse fields. In a growing number of contexts, many situations requ...
We propose an efficient algorithm for generating a universal path candidate set U that contains testable long paths for delay testing. Several strategies are presented to speed up the depth-first search procedure used to generate U, aiming to reduce the number of times the sensitization criteria are checked. Experimental results show that our approach achieves an 8X speedup on average compared with the traditional depth-first search approach.
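To illustrate the general idea of reducing sensitization checks during a depth-first search for long paths, here is a minimal sketch. It is not the authors' algorithm: the pruning rule (skip a branch when even its longest possible completion falls below a length threshold) is a generic technique substituted for the unspecified strategies, and `long_paths`, `best_tail`, `min_len`, and `is_sensitizable` are hypothetical names introduced for illustration.

```python
from functools import lru_cache

NEG_INF = float("-inf")

def long_paths(graph, delay, sources, sinks, min_len, is_sensitizable):
    """Enumerate sensitizable source-to-sink paths with total delay >= min_len.

    graph: dict mapping node -> list of successor nodes (must be a DAG).
    delay: dict mapping node -> delay contributed by that node.
    is_sensitizable: caller-supplied predicate on a complete path (assumed).
    Returns (paths sorted by decreasing length, number of predicate calls).
    """
    sinks = set(sinks)

    @lru_cache(maxsize=None)
    def best_tail(node):
        # Longest delay from this node (inclusive) to any reachable sink.
        options = [best_tail(s) for s in graph.get(node, [])]
        options = [t for t in options if t != NEG_INF]
        if node in sinks:
            options.append(0.0)
        if not options:
            return NEG_INF
        return delay[node] + max(options)

    found = []
    checks = 0

    def dfs(node, path, length):
        nonlocal checks
        length += delay[node]
        path.append(node)
        # Run the (expensive) check only on complete paths that are long enough.
        if node in sinks and length >= min_len:
            checks += 1
            if is_sensitizable(path):
                found.append((length, list(path)))
        for succ in graph.get(node, []):
            # Prune branches that cannot reach the length threshold.
            if length + best_tail(succ) >= min_len:
                dfs(succ, path, length)
        path.pop()

    for src in sources:
        if best_tail(src) >= min_len:
            dfs(src, [], 0.0)
    return sorted(found, reverse=True), checks
```

On a small diamond-shaped graph, the bound prunes the short branch before the sensitization predicate is ever called on it, so only one check is performed.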
ISBN: (print) 9781605589114
This paper describes a multi-FPGA-based platform for emulating the Loongson-2G microprocessor on different motherboards. The platform targets verification and evaluation of the Loongson-2G microprocessor, the next generation of the Loongson-2 family, which is composed of one four-issue, out-of-order, 64-bit MIPS-compatible processor core named GS464, one 1 MB secondary cache, one HyperTransport I/O interface, one DDR2/3 memory interface, and several other low-speed I/O interfaces. Most parts of this microprocessor are mapped onto the multi-FPGA platform, which consists of two Virtex-5 330 FPGA chips. Semi-custom partitioning tactics are developed within the design flow to synthesize the whole design onto the platform. Architectural-level modifications are applied to the original chip architecture to make it easy to partition into two parts. The high-speed SerDes of the HyperTransport I/O link and the DDR2/3 memory interface are emulated using several clocks with different phases. To address the difficulty of debugging in an FPGA system, a software-probe method assisted by hardware modules injected into the FPGA is developed and used to debug problems caused by behavioral mismatches between the ASIC RAM block and the FPGA RAM block. Some performance evaluation of Loongson-2G is carried out on this platform as a pre-silicon test. To the authors' knowledge, there has been no previous work using such a large design for verification and evaluation.
MapReduce is a programming framework introduced by Google for large-scale data processing. It is usually used in a scan-centric fashion: all the data are split into blocks, a Map is generated for each block to scan and process the data in that block, and Reduces then merge the outputs from all the Maps. When a query intends to process only a subset of the data selected by a predicate, this brute-force method may incur extra I/O overhead on irrelevant data, and the overhead of initiating so many Maps may be non-trivial given that the data actually of interest to the query is comparatively small in volume. We propose an approach that integrates an index into MapReduce execution, in which only an appropriate number of Maps are generated, each of which accesses the data through the index. This approach incurs random I/O and remote data access, so the overall performance depends on both system parameters and query characteristics. We build a cost model for both this index access execution and the traditional full scan execution. The cost model can be used to choose between the two execution modes before executing a query. Experiments show that index access execution can greatly outperform full scan execution when the selectivity of the predicate is low, and that the cost model predicts the actual execution cost very well, so it can be used to determine the execution plan for a query.
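The choice between the two execution modes can be pictured with a toy cost model. This is only a sketch under simplifying assumptions, not the model from the paper: every parameter name (`seq_mb_per_s`, `random_io_s`, `map_startup_s`, `block_mb`, `parallel_slots`) and every formula below is hypothetical, and Reduce cost and remote-access penalties are ignored.

```python
from dataclasses import dataclass
from math import ceil

@dataclass
class ClusterParams:
    # All fields are illustrative assumptions, not values from the paper.
    seq_mb_per_s: float    # sequential scan throughput of one Map
    random_io_s: float     # seconds per record fetched through the index
    map_startup_s: float   # fixed cost of launching one Map
    block_mb: float        # size of one data block
    parallel_slots: int    # number of Maps that can run concurrently

def full_scan_cost(data_mb: float, p: ClusterParams) -> float:
    """Estimated seconds for a full scan: one Map per block, run in waves."""
    maps = ceil(data_mb / p.block_mb)
    waves = ceil(maps / p.parallel_slots)
    per_map = p.map_startup_s + p.block_mb / p.seq_mb_per_s
    return waves * per_map

def index_access_cost(total_records: float, selectivity: float,
                      index_maps: int, p: ClusterParams) -> float:
    """Estimated seconds when a few Maps fetch matching records via an index."""
    matching = total_records * selectivity
    waves = ceil(index_maps / p.parallel_slots)
    per_map = p.map_startup_s + (matching / index_maps) * p.random_io_s
    return waves * per_map

def choose_plan(data_mb: float, total_records: float, selectivity: float,
                index_maps: int, p: ClusterParams) -> str:
    """Return 'index' if the index plan is estimated cheaper, else 'scan'."""
    scan = full_scan_cost(data_mb, p)
    index = index_access_cost(total_records, selectivity, index_maps, p)
    return "index" if index < scan else "scan"
```

With these assumed formulas, a highly selective predicate (few matching records) favors the index plan because random I/O stays small, while a predicate that matches a large fraction of the data favors the full scan.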
Page switching is a technique that increases the memory in microcontrollers without extending the address buses. This technique is widely used in the design of 8-bit MCUs. In this paper, we present an algorithm to red...
ISBN: (print) 9783981080162
Topology virtualization techniques have been proposed for NoC-based many-core processors with core-level redundancy to isolate hardware changes caused by defective on-chip cores. Prior work focuses on homogeneous cores with symmetric performance and optimizes on-chip communication only. However, core-to-core performance asymmetry due to manufacturing process variations poses new challenges for constructing virtual topologies. Lower-performance cores may be scattered across a virtual topology, while operating systems typically allocate tasks to contiguous cores. As a result, parallel applications are likely to be assigned to a region containing many slower cores, which become bottlenecks. To tackle this problem, we present Bubble-Up, a novel performance-asymmetry-aware reconfiguration algorithm based on a new metric called the core fragmentation factor (CFF). Bubble-Up places cores with similar performance closer together while maintaining reasonable hop distances between virtual neighbors, thus accelerating applications with a higher degree of parallelism without changing the OS's existing allocation strategies. Experimental results show its effectiveness.