Search Results - Inner Mongolia University Library

Journal of Computer Science and Technology, 2010, Vol. 25, No. 4, pp. 886-894

Authors: Cui, Huimin; Wang, Lei; Fan, Dongrui; Feng, Xiaobing (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate University of Chinese Academy of Sciences)

The advent of multi-core/many-core chip technology offers both an extraordinary opportunity and a profound challenge. In particular, computer architects and system software designers have a unique opportunity to introduce new architecture features together with adequate compiler technology; combined, the two may have a profound impact. This paper presents a case study (using the 1-D Jacobi computation) of compiler-amenable performance optimization techniques on Godson-T, a many-core architecture. Two unique features of Godson-T are chosen for this study: 1) chip-level globally addressable memory, in particular the scratchpad memories (SPM) local to the processing cores; 2) fine-grain memory-based synchronization (e.g., full-empty bits). Leveraging state-of-the-art performance optimization methods for 1-D stencil parallelization (e.g., time tiling and its variants), we developed and implemented a number of many-core-based optimizations for Godson-T. Our experimental study shows good performance in both execution-time speedup and scalability, validates the value of globally accessible SPM and of the fine-grain synchronization mechanism (full-empty bits) on Godson-T, and provides some useful guidelines for future compiler technology for many-core chip architectures.

Keywords: many-core, stencil, Jacobi, compiler, SPM, fine-grain synchronization
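The abstract above relies on time tiling for the 1-D Jacobi stencil. As a hedged illustration only (not the authors' Godson-T code, which uses SPM and full-empty-bit synchronization that are not modeled here), the Python sketch below contrasts a plain 1-D Jacobi sweep with one common variant, overlapped (ghost-zone) time tiling: each tile copies a halo of `steps` extra cells per side so it can advance `steps` time steps with no communication between tiles. The function names, the 3-point average and the fixed-boundary convention are assumptions.

```python
from typing import List


def jacobi_naive(u: List[float], steps: int) -> List[float]:
    """Reference 1-D Jacobi: each interior point becomes the average of itself
    and its two neighbours; the two boundary values stay fixed."""
    cur = list(u)
    for _ in range(steps):
        nxt = list(cur)  # copy keeps the fixed boundary values
        for i in range(1, len(cur) - 1):
            nxt[i] = (cur[i - 1] + cur[i] + cur[i + 1]) / 3.0
        cur = nxt
    return cur


def jacobi_overlapped_tiles(u: List[float], steps: int, tile: int) -> List[float]:
    """Overlapped (ghost-zone) time tiling of the same computation.

    Each output tile [start, end) is advanced `steps` time steps on a private
    copy of [start - steps, end + steps), with no communication between tiles.
    The outermost halo cell is never updated, so one more cell per side goes
    stale every step; a halo of `steps` cells keeps [start, end) exact.
    """
    n = len(u)
    out = [0.0] * n
    for start in range(0, n, tile):
        end = min(start + tile, n)
        lo = max(0, start - steps)
        hi = min(n, end + steps)
        buf = u[lo:hi]  # plays the role of a per-core scratchpad copy
        for _ in range(steps):
            nxt = list(buf)
            # buf[0] / buf[-1] are either fixed global boundaries (lo == 0,
            # hi == n) or halo cells that are allowed to go stale.
            for j in range(1, len(buf) - 1):
                nxt[j] = (buf[j - 1] + buf[j] + buf[j + 1]) / 3.0
            buf = nxt
        out[start:end] = buf[start - lo:end - lo]
    return out


if __name__ == "__main__":
    u0 = [0.0] * 10 + [9.0] + [0.0] * 9
    ref = jacobi_naive(u0, 4)
    tiled = jacobi_overlapped_tiles(u0, 4, tile=5)
    print(ref == tiled)  # True
```

The trade-off in this variant is redundant arithmetic in the halos in exchange for fully independent tiles; schemes that synchronize neighbouring tiles instead (as fine-grain synchronization hardware allows) avoid that redundancy.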

Physical Implementation of the 1GHz Godson-3 Quad-Core Microprocessor

Journal of Computer Science and Technology, 2010, Vol. 25, No. 2, pp. 192-199

Authors: Fan, Baoxia; Yang, Liang; Wang, Jiangmei; Wang, Ru; Xiao, Bin; Xu, Ying; Liu, Dong; Zhao, Jiye (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate University of Chinese Academy of Sciences; Loongson Technology Corporation Limited)

The Godson-3A microprocessor is a quad-core version of the scalable Godson-3 multi-core series. It is physically implemented in a 65 nm CMOS process. This 174 mm² chip contains 425 million transistors. Its maximum frequency is 1 GHz, with a maximum power consumption of 15 W. The main challenges of the Godson-3A physical implementation include the very large scale, the high frequency requirement, sub-micron technology effects and an aggressive schedule. This paper describes the design methodology of the physical implementation of Godson-3A, with particular emphasis on design methods for high frequency, clock tree design, power management, and on-chip variation (OCV) issues.

Keywords: physical implementation, design methodology, on-chip variation (OCV), low power, clock tree

Study on blocked LU decomposition on many-core architecture

Gaojishu Tongxin/Chinese High Technology Letters, 2011, Vol. 21, No. 3, pp. 248-253

Authors: Yu, Lei; Liu, Zhiyong; Ma, Yike; Song, Fenglong; Xu, Weizhi; Ye, Xiaochun (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100039, China)

The authors studied LU decomposition, a representative scientific application, in depth. A speedup model for LU decomposition was proposed, and an LU decomposition algorithm based on bit reverse xor (BRX) was implemented. A dynamic absolute balance policy (DABP) algorithm was then presented. To evaluate the two-dimensional (2D) scatter, BRX and DABP algorithms, two different evaluation functions were defined and used to assess the load balance of each algorithm. Both functions show that DABP achieves the best load balance. The three algorithms were simulated on the many-core architecture Godson-T, and the experiments show that DABP achieves a speedup of 46, the best of the three.

Keywords: computer architecture
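The study above targets blocked LU decomposition. The 2D scatter, BRX and DABP task-distribution schemes are not described in enough detail to reproduce, so the Python sketch below shows only the underlying computation, under the assumption that no pivoting is needed (valid, for example, for strictly diagonally dominant matrices): a right-looking blocked LU that factors a panel, solves for the block row of U, and then applies the trailing-matrix update. That trailing update is where most of the work lies and is typically what a many-core schedule has to spread over the cores; how the paper's algorithms assign it is not modeled. The names `blocked_lu`, `split_lu` and `matmul` are illustrative.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def blocked_lu(a: Matrix, b: int) -> Matrix:
    """Right-looking blocked LU factorization without pivoting, in place.

    On return the strict lower triangle holds L (unit diagonal implied) and
    the upper triangle holds U, so that A = L @ U. Assumes no zero pivots
    arise (e.g. A strictly diagonally dominant).
    """
    n = len(a)
    for k0 in range(0, n, b):
        k1 = min(k0 + b, n)
        # 1) Panel: unblocked LU of columns k0..k1-1 over rows k0..n-1.
        for k in range(k0, k1):
            for i in range(k + 1, n):
                a[i][k] /= a[k][k]
                for j in range(k + 1, k1):
                    a[i][j] -= a[i][k] * a[k][j]
        # 2) Block row of U: U12 = inv(L11) @ A12 by forward substitution.
        for k in range(k0, k1):
            for i in range(k + 1, k1):
                for j in range(k1, n):
                    a[i][j] -= a[i][k] * a[k][j]
        # 3) Trailing update A22 -= L21 @ U12: the bulk of the work, and the
        #    part a many-core schedule has to spread over the cores.
        for i in range(k1, n):
            for j in range(k1, n):
                a[i][j] -= sum(a[i][k] * a[k][j] for k in range(k0, k1))
    return a


def split_lu(f: Matrix) -> Tuple[Matrix, Matrix]:
    """Unpack the packed in-place result into separate L and U matrices."""
    n = len(f)
    lower = [[f[i][j] if j < i else (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    upper = [[f[i][j] if j >= i else 0.0 for j in range(n)] for i in range(n)]
    return lower, upper


def matmul(x: Matrix, y: Matrix) -> Matrix:
    return [[sum(x[i][k] * y[k][j] for k in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


if __name__ == "__main__":
    n = 6
    a = [[10.0 if i == j else 1.0 / (1 + i + j) for j in range(n)] for i in range(n)]
    lower, upper = split_lu(blocked_lu([row[:] for row in a], b=2))
    err = max(abs(p - q) for r1, r2 in zip(matmul(lower, upper), a)
              for p, q in zip(r1, r2))
    print(f"max |LU - A| = {err:.2e}")
```

The block size `b` only changes the order of the updates, not the result (up to rounding), which is why it can be tuned to the size of a core-local memory.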

Using index in the MapReduce framework

12th International Asia Pacific Web Conference, APWeb 2010

Authors: An, Mingyuan; Wang, Yang; Wang, Weiping (Key Laboratory of Computer System and Architecture, Graduate University of Chinese Academy of Sciences, Beijing, China; Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)

ISBN: 9780769540122 (print)

MapReduce is a programming framework introduced by Google for large-scale data processing. It is usually used in a scan-centric fashion: all the data are split into blocks, a Map is generated for each block to scan and process the data in it, and Reduces then merge the outputs of all the Maps. When a query needs only a subset of the data selected by a predicate, this brute-force method may incur extra I/O on irrelevant data, and the overhead of launching so many Maps may be nontrivial given that the data actually relevant to the query are comparatively small in volume. We propose an approach that integrates the index into MapReduce execution, in which only an appropriate number of Maps are generated, each of which accesses the data through an index. This approach incurs random I/O and remote data access, so its overall performance depends on both system parameters and query characteristics. We build a cost model for both this index access execution and the traditional full scan execution; the model can be used to choose between the two execution modes before a query is executed. Experiments show that index access execution can greatly outperform full scan execution when the selectivity of the predicate is low, and that the cost model predicts the actual execution cost well enough to determine the execution plan for a query. © 2010 IEEE.

Keywords: MapReduce
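The abstract above chooses between index access and full scan with a cost model but does not state the model's terms. The Python sketch below is a deliberately simplified, hypothetical model (every parameter name and default value is an assumption, not taken from the paper): full scan pays Map start-up and a sequential read for every block, in waves of `map_slots` concurrent Maps, while index access pays one index probe, one wave of Map start-up, and a random read per matching record spread over the Maps. Even this crude version reproduces the qualitative result that index access wins when predicate selectivity is low.

```python
import math
from dataclasses import dataclass


@dataclass
class ClusterParams:
    # Hypothetical placeholder values, not measurements from the paper.
    map_startup_s: float = 1.0        # time to launch one Map task
    seq_read_mb_per_s: float = 100.0  # sequential scan bandwidth of one Map
    random_read_s: float = 0.01       # one random (possibly remote) record fetch
    index_lookup_s: float = 0.5       # one probe of the index
    map_slots: int = 20               # Maps that can run at the same time


@dataclass
class Query:
    data_mb: float      # total input size
    block_mb: float     # block size; full scan launches one Map per block
    num_records: int
    selectivity: float  # fraction of records that satisfy the predicate


def full_scan_cost(q: Query, p: ClusterParams) -> float:
    blocks = math.ceil(q.data_mb / q.block_mb)
    waves = math.ceil(blocks / p.map_slots)  # Maps run in waves of map_slots
    return waves * (p.map_startup_s + q.block_mb / p.seq_read_mb_per_s)


def index_access_cost(q: Query, p: ClusterParams) -> float:
    # Assumption: one wave of map_slots Maps splits the matching records evenly.
    matches = q.num_records * q.selectivity
    return (p.index_lookup_s + p.map_startup_s
            + matches * p.random_read_s / p.map_slots)


def choose_plan(q: Query, p: ClusterParams) -> str:
    return "index" if index_access_cost(q, p) < full_scan_cost(q, p) else "scan"


if __name__ == "__main__":
    p = ClusterParams()
    for sel in (0.0001, 0.01):
        q = Query(data_mb=6400, block_mb=64, num_records=10_000_000, selectivity=sel)
        print(sel, round(full_scan_cost(q, p), 2),
              round(index_access_cost(q, p), 2), choose_plan(q, p))
```

With these made-up parameters the break-even point lies at roughly 13,400 matching records; a real model would calibrate the parameters against the cluster, which is the role the paper assigns to its cost model.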

Integrating DBMSs as a read-only execution layer into Hadoop

11th International Conference on Parallel and Distributed Computing, Applications and Technologies, PDCAT 2010

Authors: An, Mingyuan; Wang, Yang; Wang, Weiping; Sun, Ninghui (Key Laboratory of Computer System and Architecture, Graduate University of Chinese Academy of Sciences, Chinese Academy of Sciences, Beijing, China; Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)

ISBN: 9780769542874 (print)

To obtain the efficiency of a DBMS, HadoopDB combines Hadoop with DBMSs and claims superior performance over Hadoop. However, HadoopDB's approach simply places MapReduce on top of unmodified single-node DBMSs, which has several obvious weaknesses. In essence, HadoopDB is a fault-tolerant parallel DBMS, which incurs unnecessary overhead inherited from the DBMS legacy. Instead of augmenting a DBMS with Hadoop techniques, we propose a new system architecture that integrates modified DBMS engines into Hadoop as a read-only execution layer, where the DBMS provides efficient read-only operators rather than managing the data. Besides the efficiency gained from the DBMS engine, this design has other advantages. The modified DBMS engine can process data directly from HDFS (Hadoop Distributed File System) files at the block level, so data replication is handled naturally by HDFS and block-level parallelism is easily achieved. A global index access mechanism is added following the MapReduce paradigm. Fast data loading is also ensured by writing data directly into HDFS with simplified logic. Experiments show that our system outperforms both the original Hadoop and a HadoopDB-style system. © 2010 IEEE.

Keywords: Database systems

Extended selective encoding of scan slices for reducing test data and test power

Authors: Liu, Jun; Han, Yinhe; Li, Xiaowei (Key Laboratory of Computer System and Architecture, Institute of Computing Technology; Graduate University of Chinese Academy of Sciences, China; School of Computer and Information, Hefei University of Technology, China)

Test data volume and test power are two major concerns when testing modern large circuits. Selective encoding of scan slices has recently been proposed to compress test data. Unlike many other compression techniques that encode all the bits, this technique encodes only the target symbols, by specifying single-bit indices and copying group data. In this paper, we propose an extended selective encoding that optimizes this method with two new techniques: a flexible grouping strategy, and an X-bit exploitation and filling strategy. The flexible grouping strategy decreases the number of groups that need to be encoded and improves the test data compression ratio. The X-bit exploitation and filling strategy exploits the large number of don't-care bits to reduce test power with no loss in compression ratio. Experimental results show that, compared with selective encoding, the proposed technique needs less test data storage and reduces the average weighted switching activity by 25.6% and the peak weighted switching activity by 9.68% during scan shift. Copyright © 2010 The Institute of Electronics, Information and Communication Engineers.

Keywords: Filling
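The abstract above measures test power as weighted switching activity during scan shift and reduces it by filling don't-care (X) bits. The paper's own X-bit exploitation and filling strategy is not detailed here, so the Python sketch below uses a common stand-in: adjacent fill, which copies the nearest specified bit to the left into each X, together with a weighted transition count in which a transition between positions i and i+1 of a scan-in vector is weighted by `len(vec) - 1 - i`. Both the fill heuristic and the weighting convention are assumptions (published definitions differ slightly), and the sketch ignores the interaction with the compression scheme that the paper handles.

```python
def adjacent_fill(cube: str) -> str:
    """Fill each don't-care 'X' with the nearest specified bit to its left;
    leading Xs take the first specified bit, and an all-X cube becomes 0s.
    This keeps the number of transitions inside the vector to a minimum."""
    if any(c not in "01X" for c in cube):
        raise ValueError("test cube may contain only '0', '1' and 'X'")
    last = next((c for c in cube if c != "X"), "0")
    out = []
    for c in cube:
        if c != "X":
            last = c
        out.append(last)
    return "".join(out)


def weighted_transitions(vec: str) -> int:
    """Weighted transition count of one scan-in vector: a transition between
    positions i and i+1 (0-based) is weighted by len(vec) - 1 - i."""
    n = len(vec)
    return sum(n - 1 - i for i in range(n - 1) if vec[i] != vec[i + 1])


if __name__ == "__main__":
    cube = "1XX0X"
    for name, filled in (("zero fill", cube.replace("X", "0")),
                         ("adjacent fill", adjacent_fill(cube))):
        print(name, filled, weighted_transitions(filled))
    # zero fill 10000 4
    # adjacent fill 11100 2
```

Filling from the left pushes every unavoidable transition as late in the vector as possible, which under this weighting is also where it costs least.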

GenerOS: An asymmetric operating system kernel for multi-core systems

International Symposium on Parallel and Distributed Processing (IPDPS)

Authors: Yuan, Qingbo; Zhao, Jianbo; Chen, Mingyu; Sun, Ninghui (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China)

ISBN: 9781424464425 (print)

Because of the complex abstractions implemented over lock-protected shared data structures, conventional symmetric multithreaded operating system kernels such as Linux have difficulty achieving high scalability on emerging multi-core architectures, which integrate more and more cores on a single die. This paper presents GenerOS, a general asymmetric operating system kernel for multi-core systems. In principle, GenerOS partitions the processing cores into application cores, kernel cores and interrupt cores, each dedicated to a specific function. In the implementation, we carefully modify the Linux kernel and provide the same interface as the Linux kernel, so that GenerOS is compatible with legacy applications. The performance advantage of GenerOS comes mainly from three sources: (1) applications run on their own cores with minimal interrupt and kernel support; (2) every kernel service is encapsulated into a serial process, so there is less contention than in a conventional symmetric kernel; (3) a slim scheduling policy on the kernel core supports low-overhead scheduling between system calls. Experiments with two typical workloads on a 16-core AMD machine show that GenerOS outperforms the original Linux kernel when more processing cores are used (by 19.6% for TPC-H on the Oracle database management system and by 42.8% for httperf on the Apache web server).

Keywords: Operating systems, Kernel, Linux, Scalability, Data structures, Computer architecture, Database systems, Web server, Multicore processing
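GenerOS, described above, funnels system calls from application cores to dedicated kernel cores, where each kernel service runs as a serial process instead of contending for locks. The Python sketch below illustrates only that request-forwarding structure with a worker thread and a queue; it is a hypothetical model, not GenerOS code, and the `KernelCore` class, the `incr`/`get` operations and the in-memory "kernel state" are assumptions. Python threads are not pinned to cores, so the sketch shows the structure, not the performance.

```python
import queue
import threading
from typing import Any, Dict, Tuple


class KernelCore:
    """Hypothetical model of a dedicated kernel core: one worker thread owns the
    kernel state and serves forwarded 'system calls' strictly one at a time."""

    def __init__(self) -> None:
        self._requests: "queue.Queue[Any]" = queue.Queue()
        self._state: Dict[str, int] = {}  # touched only by the worker thread
        self._worker = threading.Thread(target=self._serve, daemon=True)
        self._worker.start()

    def _serve(self) -> None:
        while True:
            item = self._requests.get()
            if item is None:  # shutdown sentinel
                return
            op, arg, reply = item
            try:
                reply.put((True, self._dispatch(op, arg)))
            except Exception as exc:  # hand errors back to the caller
                reply.put((False, exc))

    def _dispatch(self, op: str, arg: str) -> int:
        if op == "incr":
            self._state[arg] = self._state.get(arg, 0) + 1
            return self._state[arg]
        if op == "get":
            return self._state.get(arg, 0)
        raise ValueError(f"unknown system call: {op}")

    def syscall(self, op: str, arg: str) -> int:
        """Called from 'application cores': forward the request and wait."""
        reply: "queue.Queue[Tuple[bool, Any]]" = queue.Queue(maxsize=1)
        self._requests.put((op, arg, reply))
        ok, value = reply.get()
        if not ok:
            raise value
        return value

    def shutdown(self) -> None:
        self._requests.put(None)
        self._worker.join()


if __name__ == "__main__":
    kernel = KernelCore()

    def app(n: int) -> None:
        for _ in range(n):
            kernel.syscall("incr", "counter")

    apps = [threading.Thread(target=app, args=(1000,)) for _ in range(8)]
    for t in apps:
        t.start()
    for t in apps:
        t.join()
    print(kernel.syscall("get", "counter"))  # 8000
    kernel.shutdown()
```

The kernel state needs no lock because only the worker thread ever touches it; the cost of that design is the round trip through the request queue on every call.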

P-GAS: Parallelizing a cycle-accurate event-driven many-core processor simulator using parallel discrete event simulation

24th Annual Workshop on Principles of Advanced and Distributed Simulation, PADS 2010

Authors: Lv, Huiwei; Cheng, Yuan; Bai, Lu; Chen, Mingyu; Fan, Dongrui; Sun, Ninghui (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; Graduate School, Chinese Academy of Sciences, China)

ISBN: 9781424472918 (print)

Multi-core processors are now commonly available, but most traditional computer architecture simulators still use single-threaded execution. In this paper we use parallel discrete event simulation (PDES) to speed up a cycle-accurate event-driven many-core processor simulator. Compared with the sequential version, the parallelized simulator achieves an average speedup of 10.9× (up to 13.6×) running SPLASH-2 kernels on a 16-core host machine, with differences in simulated cycle counts of less than 0.1%. Moreover, super-linear speedups are achieved when scaling from 1 to 8 threads, owing to the reduced time spent inserting events into queues and the larger total cache size available in parallel processing. We conclude that PDES can be an attractive option for fast cycle-accurate many-core processor simulation. © 2010 IEEE.

Keywords: Simulators
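The abstract above applies parallel discrete event simulation (PDES) but does not say which synchronization protocol P-GAS uses. As an illustration only, the Python sketch below implements one common conservative scheme, a synchronous time-window protocol: each logical process (LP, e.g. one simulated core) keeps its own event queue, and in each round every LP may safely process all its events earlier than the global minimum timestamp plus a lookahead, because no message between LPs is delayed by less than that lookahead. The LPs run one after another here; in a real PDES each would run on its own host thread between barriers. The token-passing event model and the constants are hypothetical.

```python
import heapq
from typing import List, Tuple

Event = Tuple[int, int, int]  # (timestamp, lp_id, hops_left)
MIN_LATENCY = 3  # hypothetical minimum cross-LP message delay (cycles)
LOOKAHEAD = MIN_LATENCY  # safe lookahead: no message arrives sooner than this


def handle(ev: Event, num_lps: int) -> List[Event]:
    """Toy event model: pass a token to the next LP until its hops run out.
    Delays vary (3 or 4 cycles) but never drop below MIN_LATENCY."""
    t, lp, hops = ev
    if hops == 0:
        return []
    return [(t + MIN_LATENCY + lp % 2, (lp + 1) % num_lps, hops - 1)]


def run_sequential(initial: List[Event], num_lps: int) -> List[Event]:
    """Baseline: one global event queue, events processed in timestamp order."""
    pq = list(initial)
    heapq.heapify(pq)
    trace: List[Event] = []
    while pq:
        ev = heapq.heappop(pq)
        trace.append(ev)
        for new in handle(ev, num_lps):
            heapq.heappush(pq, new)
    return trace


def run_windowed(initial: List[Event], num_lps: int) -> List[Event]:
    """Conservative synchronous-window PDES, executed serially for clarity."""
    queues: List[List[Event]] = [[] for _ in range(num_lps)]
    for ev in initial:
        heapq.heappush(queues[ev[1]], ev)
    trace: List[Event] = []
    while any(queues):
        # Every event below window_end is safe: nothing can still arrive there.
        window_end = min(q[0][0] for q in queues if q) + LOOKAHEAD
        outbox: List[Event] = []
        for q in queues:  # in a real PDES, each LP runs on its own thread
            while q and q[0][0] < window_end:
                ev = heapq.heappop(q)
                trace.append(ev)
                outbox.extend(handle(ev, num_lps))
        # Barrier, then message exchange for the next window.
        for new in outbox:
            assert new[0] >= window_end, "lookahead violated"
            heapq.heappush(queues[new[1]], new)
    return trace


if __name__ == "__main__":
    init = [(0, 0, 5), (1, 2, 4), (2, 1, 6)]
    seq, win = run_sequential(init, 4), run_windowed(init, 4)
    print(len(seq), sorted(seq) == sorted(win))  # 18 True
```

A larger lookahead means wider windows and fewer barriers, which is why cycle-accurate simulators try to exploit the minimum on-chip communication latency.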

A novel post-silicon debug mechanism based on Suspect Window

Authors: Gao, Jianliang; Han, Yinhe; Li, Xiaowei (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences; Graduate University of Chinese Academy of Sciences, Beijing, China)

Bugs are becoming unavoidable in complex integrated circuit design, so it is imperative to identify them as early as possible through post-silicon debug. Observability is one of the biggest challenges in post-silicon debug. Scan-based debug mechanisms provide high observability by reusing scan chains. However, taking scan dumps cycle by cycle during program execution is infeasible because of the excessive time required, and it is in fact unnecessary to scan out error-free states. In this paper, we introduce the Suspect Window, which covers the clock cycle in which the bug is triggered, and present an efficient approach to determining it. Based on the Suspect Window, we propose a novel debug mechanism that locates the bug both temporally and spatially. Since the proposed mechanism takes scan dumps only within the suspect window, the time required to locate the bug is greatly reduced. The approaches are evaluated on the ISCAS'89 and ITC'99 benchmark circuits. Experimental results show that, compared with the scan-based debug mechanism, the proposed mechanism significantly reduces overall debug time while keeping observability high. Copyright © 2010 The Institute of Electronics, Information and Communication Engineers.

Keywords: Observability
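The abstract above introduces a Suspect Window but does not describe how it is determined. One generic way to bound the cycle in which a bug is first triggered, shown here purely as an assumed illustration and not as the paper's approach, is a binary search over re-runs: compare a cheap signature of the silicon state with a golden reference at a chosen cycle, and halve the interval until it is no wider than the desired window. Scan dumps would then be taken only inside the returned window. The `state_matches` callback and all other names are hypothetical, and the sketch assumes a deterministic, reproducible failure.

```python
from typing import Callable, Tuple


def find_suspect_window(state_matches: Callable[[int], bool],
                        last_cycle: int, window: int) -> Tuple[int, int]:
    """Binary-search for an interval (lo, hi] of at most `window` cycles that
    contains the bug-triggering cycle.

    Assumes `state_matches(c)` (silicon signature at cycle c equals the golden
    signature) is True before the trigger and False from the trigger onward,
    with cycle 0 matching and `last_cycle` mismatching.
    """
    if window < 1:
        raise ValueError("window must be at least one cycle")
    lo, hi = 0, last_cycle
    if not state_matches(lo) or state_matches(hi):
        raise ValueError("need a matching start and a mismatching end")
    while hi - lo > window:
        mid = (lo + hi) // 2
        if state_matches(mid):
            lo = mid
        else:
            hi = mid
    return lo, hi


if __name__ == "__main__":
    trigger = 7342  # hypothetical; unknown to the search
    probes = []

    def matches(cycle: int) -> bool:
        probes.append(cycle)  # each probe stands for one re-run + comparison
        return cycle < trigger

    lo, hi = find_suspect_window(matches, last_cycle=100_000, window=64)
    print(f"scan-dump cycles {lo + 1}..{hi} after {len(probes)} probes")
```

The number of probes grows only logarithmically with the run length, while the number of scan dumps is bounded by the window size rather than by the whole execution.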

TMemCanal: A VM-oblivious dynamic memory optimization scheme for virtual machines in cloud computing

10th IEEE International Conference on Computer and Information Technology (CIT-2010); 7th IEEE International Conference on Embedded Software and Systems (ICESS-2010); 10th IEEE International Conference on Scalable Computing and Communications (ScalCom-2010)

Authors: Li, Yaqiong; Huang, Yongbing (Key Laboratory of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; Graduate University of Chinese Academy of Sciences, China)

ISBN: 9780769541082 (print)

In current virtualized cloud platforms, resource provisioning remains a major challenge. Provisioning for peak workload leads to low resource utilization, while provisioning for average workload sacrifices the potential revenue of cloud customers because of poor user experience. VM-based performance isolation also prevents resources from flowing on demand. For memory, this ultimately leaves both under-loaded and overloaded memory in the same data center. This paper proposes TMemCanal, a VM-oblivious dynamic memory optimization scheme that transparently and dynamically uses under-loaded memory in a data center to meet the needs of overloaded memory. TMemCanal can identify under-loaded memory located in different VMs and reuse it by letting memory flow between them, without any modification to their guest OSs. We implemented TMemCanal by extending the Xen hypervisor and evaluated it with the SPECweb2005 and Linpack benchmarks. Our evaluation shows that TMemCanal can save up to 50% of memory with an overhead of less than 7%. A server consolidation case study also shows that TMemCanal can improve the performance of memory-intensive services by up to 400%. © 2010 IEEE.

Keywords: Virtual machine
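TMemCanal, described above, lends under-used memory from some VMs to memory-hungry ones without modifying the guest OSs, but the abstract does not describe its allocation policy. The Python sketch below is a hypothetical greedy rebalancing step over per-VM demand estimates; the `VM` data model, the `reserve_mb` safety margin and the largest-deficit-first order are assumptions, not TMemCanal's design. It only illustrates the idea of memory flowing from under-loaded to overloaded VMs while the total stays constant.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class VM:
    name: str
    alloc_mb: int   # memory currently backing the VM
    demand_mb: int  # estimated working-set size (estimation is out of scope)


def rebalance(vms: List[VM], reserve_mb: int = 64) -> Dict[str, int]:
    """Greedy, hypothetical rebalancing step: VMs whose allocation exceeds
    demand + reserve_mb donate the excess to VMs whose demand exceeds their
    allocation, largest deficit first. Returns new allocations; the total
    amount of memory is conserved."""
    new_alloc = {vm.name: vm.alloc_mb for vm in vms}
    surplus = {vm.name: vm.alloc_mb - vm.demand_mb - reserve_mb
               for vm in vms if vm.alloc_mb - vm.demand_mb > reserve_mb}
    needy = sorted((vm for vm in vms if vm.demand_mb > vm.alloc_mb),
                   key=lambda vm: vm.demand_mb - vm.alloc_mb, reverse=True)
    for vm in needy:
        deficit = vm.demand_mb - vm.alloc_mb
        for donor in surplus:
            give = min(deficit, surplus[donor])
            surplus[donor] -= give
            new_alloc[donor] -= give
            new_alloc[vm.name] += give
            deficit -= give
            if deficit == 0:
                break
    return new_alloc


if __name__ == "__main__":
    vms = [VM("web", 1024, 256), VM("db", 1024, 1800), VM("cache", 512, 500)]
    print(rebalance(vms))  # {'web': 320, 'db': 1728, 'cache': 512}
```

Keeping a reserve on each donor is one simple way to avoid starving a VM whose demand estimate is slightly low; in the example, "db" receives all the surplus available and its remaining 72 MB deficit stays unmet.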
