Search Results - Inner Mongolia University Library

Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), 2013, vol. 8299 LNCS, p. VI

Authors: Cohen, Albert; Wu, Chenggang. Affiliations: INRIA, École Normale Supérieure, Département d'Informatique, 45 rue d'Ulm, 75005 Paris, France; Chinese Academy of Sciences, Institute of Computing Technology, State Key Laboratory of Computer Architecture, No. 6 Kexueyuan South Road, Haidian District, 100190 Beijing, China

Extendable pattern-oriented optimization directives

Authors: Cui, Huimin; Xue, Jingling; Wang, Lei; Yang, Yang; Feng, Xiaobing; Fan, Dongrui. Affiliations: Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences, Institute of Computing Technology, Beijing, China; School of Computer Science and Engineering, University of New South Wales, Sydney, NSW, Australia

Algorithm-specific (that is, semantic-specific) optimizations have been observed to bring significant performance gains, especially across a diverse set of multi/many-core architectures. However, current programming models and compiler technologies for state-of-the-art architectures do not exploit these performance opportunities well. In this article, we propose a pattern-making methodology that enables algorithm-specific optimizations to be encapsulated into "optimization patterns". Such optimization patterns are expressed in terms of preprocessor directives so that simple annotations can result in significant performance improvements. To validate this new methodology, a framework, named EPOD, is developed to map these directives into the underlying optimization schemes for a particular architecture. It is difficult to create an exact performance model to determine an optimal or near-optimal optimization scheme (including which optimizations to apply and in which order) for a specific application, due to the complexity of applications and architectures. However, it is tractable to build individual optimization components and let compiler developers synthesize an optimization scheme from these components. Therefore, our EPOD framework provides an Optimization Programming Interface (OPI) for compiler developers to define new optimization schemes. Thus, new patterns can be integrated into EPOD in a flexible manner. We have identified and implemented a number of optimization patterns for three representative computer platforms. Our experimental results show that a pattern-guided compiler can outperform state-of-the-art compilers and even achieve performance competitive with hand-tuned code. Therefore, such a pattern-making methodology represents an encouraging direction for integrating domain experts' experience and knowledge into general-purpose compilers. © 2012 ACM.

Keywords: Semantics

A Hybrid Circular Queue Method for Iterative Stencil Computations on GPUs

Journal of Computer Science & Technology, 2012, vol. 27, no. 1, pp. 57-74

Authors: Yang Yang; Hui-Min Cui; Xiao-Bing Feng; Jing-Ling Xue. Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Graduate University of Chinese Academy of Sciences, Beijing 100190, China; Programming Languages and Compilers Group, School of Computer Science and Engineering, University of New South Wales, Sydney, NSW 2052, Australia

In this paper, we present a hybrid circular queue method that can significantly boost the performance of stencil computations on GPUs by carefully balancing the usage of registers and shared memory. Unlike earlier methods that rely on circular queues predominantly implemented using indirectly addressable shared memory, our hybrid method exploits a new reuse pattern spanning multiple time steps in stencil computations, so that circular queues can be implemented effectively in both shared memory and registers in a balanced manner. We describe a framework that automatically finds the best placement of data in registers and shared memory in order to maximize the performance of stencil computations. Validation using four different types of stencils on three different GPU platforms shows that our hybrid method achieves speedups of up to 2.93x over methods that use circular queues implemented with shared memory only.

Keywords: stencil computation; circular queue; GPU; occupancy; register
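The circular-queue idea mentioned in the abstract above can be illustrated with a minimal CPU-side sketch. It is not the paper's hybrid register/shared-memory scheme: it only shows how a three-slot circular queue lets a 5-point stencil sweep stream over rows while holding a small window of input. The function name `stencil_rows_circular` and the averaging stencil are assumptions chosen for illustration.

```python
def stencil_rows_circular(grid):
    """One sweep of a 5-point average over a 2D grid (list of lists).

    Rows are streamed through a three-slot circular queue, so only the
    rows i-1, i and i+1 are referenced while row i is being computed.
    Boundary cells are copied unchanged.
    """
    n_rows, n_cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    if n_rows < 3 or n_cols < 3:
        return out

    queue = [grid[0], grid[1], grid[2]]  # slots for rows i-1, i, i+1
    head = 0                             # slot that currently holds row i-1
    for i in range(1, n_rows - 1):
        up = queue[head]
        mid = queue[(head + 1) % 3]
        down = queue[(head + 2) % 3]
        for j in range(1, n_cols - 1):
            out[i][j] = (up[j] + down[j] + mid[j - 1] + mid[j + 1] + mid[j]) / 5.0
        if i + 2 < n_rows:
            queue[head] = grid[i + 2]    # overwrite the oldest row in place
        head = (head + 1) % 3            # advance the window by one row
    return out
```

On a GPU the queue slots would live in registers or shared memory rather than in Python lists; the abstract's contribution concerns how to split them between the two, which this sketch does not model.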

A testability-aware low power architecture

25th IEEE International System-on-Chip Conference, SOCC 2012

Authors: Wang, Gang; Wang, Jian; Qi, Zi-Chu. Affiliations: Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences, Beijing 100190, China; Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China; Loongson Technology Corporation Limited, Beijing 100190, China

ISBN: (print) 9781467312950

Test power consumption is becoming a major concern in low-power integrated circuits (ICs). This paper presents a revised low-power compression architecture for scan test. The variance in power consumption is used to select test patterns during scan test, and a low-power feedback MUX is added to the scan chains. Simulation results obtained by mathematical methods show that the proposed test architecture is promising for reducing power consumption. © 2012 IEEE.

Keywords: Electric power utilization

A lightweight hybrid hardware/software approach for object-relative memory profiling

2012 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2012

Authors: Chen, Licheng; Cui, Zehan; Bao, Yungang; Chen, Mingyu; Huang, Y.; Tan, Guangming. Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; Graduate School of Chinese Academy of Sciences, China

ISBN: (print) 9781467311441

Memory profiling is the process of collecting memory address traces during the execution of a program, then analyzing and characterizing the memory behavior of the program offline. As more and more cores are integrated into a processor chip, the Memory Wall problem will become more serious in chip multiprocessor (CMP) systems. Accurate and effective memory profiling is therefore becoming one of the keys to identifying the source of memory system bottlenecks. A large body of work has been devoted to memory profiling; however, most of it adopts instrumentation or simulators, which suffer from heavy overhead, or hardware performance counters, which lack detailed trace information. Furthermore, correlating the raw memory address traces with object-relative information allows us to separate the regular access pattern of a particular object from the irregular mixed trace, which helps optimization. In this paper, we propose a lightweight hybrid hardware/software approach for object-relative memory profiling. We monitor physical memory addresses through hardware snooping with negligible overhead; meanwhile, we dump the Linux kernel page tables of processes, as well as object-relative memory allocation information. Our approach supports not only collecting applications' full memory traces with detailed object-relative information, but also identifying hardware-generated memory accesses, such as page memory walks due to TLB misses, at the object level. The experimental results on a real system show that our approach is highly accurate (the largest error is 2.04%) and has low overhead (the average overhead is 1.60%). Furthermore, we profile two multi-threaded applications in detail and successfully identify hot TLB-miss objects. With object-targeted optimization, we can improve applications' performance by nearly 6.86%. © 2012 IEEE.

Keywords: Computer operating systems
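The offline correlation step described in the abstract above (physical address trace, dumped page tables, allocation records) can be sketched as follows. This is a simplified illustration and not the authors' tool: the data layouts, the 4 KiB page size, and the names `build_reverse_page_table`, `find_object` and `attribute_trace` are all assumptions.

```python
import bisect

PAGE_SIZE = 4096  # assumed page size for this sketch


def build_reverse_page_table(page_table):
    """Invert a {virtual page number: physical frame number} mapping."""
    return {pfn: vpn for vpn, pfn in page_table.items()}


def find_object(allocations, vaddr):
    """Return the name of the allocation containing vaddr, or None.

    allocations is a list of (start, size, name) tuples sorted by start.
    """
    starts = [start for start, _, _ in allocations]
    i = bisect.bisect_right(starts, vaddr) - 1
    if i >= 0:
        start, size, name = allocations[i]
        if vaddr < start + size:
            return name
    return None


def attribute_trace(phys_trace, page_table, allocations):
    """Count accesses per object for a trace of physical addresses."""
    reverse = build_reverse_page_table(page_table)
    counts = {}
    for paddr in phys_trace:
        pfn, offset = divmod(paddr, PAGE_SIZE)
        vpn = reverse.get(pfn)
        name = None
        if vpn is not None:
            name = find_object(allocations, vpn * PAGE_SIZE + offset)
        counts[name] = counts.get(name, 0) + 1
    return counts
```

Accesses that map to no known page or object are counted under the key `None`, which keeps unattributed traffic visible instead of silently dropping it.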

Cache locking for network processing acceleration

2012 10th IEEE International Symposium on Parallel and Distributed Processing with Applications, ISPA 2012

Authors: Su, Wen; Gao, Xiang; Wang, Jing; You, Ruibang. Affiliations: Key Laboratory of Computer System and Architecture, Chinese Academy of Sciences, China; Institute of Computing Technology, Chinese Academy of Sciences, China; Graduate University, Chinese Academy of Sciences, China; Loongson Technology Corporation Limited, China

ISBN: (print) 9780769547015

With the dramatic increase in network speed during the past ten years, network processing efficiency has decreased significantly. In this paper, we propose a network acceleration scheme that employs a cache locking method to reduce data and instruction access latency. Interrupt handling and buffer maintenance overheads are noticeably reduced. Experimental results show that our solution increases network bandwidth by about 22% and reduces latency by 10%. © 2012 IEEE.

Keywords: Locks (fasteners)

Optimizing sparse matrix vector multiplication using cache blocking method on Fermi GPU

13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing, SNPD 2012

Authors: Xu, Weizhi; Zhang, Hao; Jiao, Shuai; Wang, Da; Song, Fenglong; Liu, Zhiyong. Affiliations: Key Lab. of Computer System and Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Graduate University, Chinese Academy of Sciences, Beijing, China

ISBN: (print) 9780769547619

Tuning the performance of sparse matrix vector multiplication (SpMV) is an important task, but also a difficult one because of the irregularity of SpMV. In this paper, we propose a cache blocking method to improve the performance of SpMV on the emerging GPU architecture. The sparse matrix is partitioned into many sub-blocks, which are stored in CSR format. With the blocking method, the corresponding part of vector x can be reused in the GPU cache, so the time spent accessing global memory for vector x is greatly reduced. Experimental results on a GeForce GTX 480 show that the SpMV kernel with the cache blocking method is 5x faster than the unblocked CSR kernel in the best case. © 2012 IEEE.

Keywords: Graphics processing unit
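The blocking idea in the abstract above can be sketched on the CPU as follows. As a simplifying assumption, the matrix is split into column blocks only, each stored in CSR with block-local column indices, so that every block touches just one contiguous slice of x. The helper names (`to_csr`, `split_column_blocks`, `spmv_blocked`) are hypothetical and the sketch says nothing about the paper's actual GPU kernel.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) to CSR arrays."""
    values, col_idx, row_ptr = [], [], [0]
    for row in dense:
        for j, v in enumerate(row):
            if v != 0:
                values.append(v)
                col_idx.append(j)
        row_ptr.append(len(values))
    return values, col_idx, row_ptr


def split_column_blocks(dense, block_cols):
    """Partition the matrix into column blocks, each stored in CSR."""
    n_cols = len(dense[0])
    blocks = []
    for start in range(0, n_cols, block_cols):
        end = min(start + block_cols, n_cols)
        sub = [row[start:end] for row in dense]
        blocks.append((start, end, to_csr(sub)))
    return blocks


def spmv_blocked(blocks, x, n_rows):
    """Compute y = A @ x one column block at a time."""
    y = [0.0] * n_rows
    for start, end, (values, col_idx, row_ptr) in blocks:
        x_part = x[start:end]  # the only slice of x this block reads
        for i in range(n_rows):
            acc = 0.0
            for k in range(row_ptr[i], row_ptr[i + 1]):
                acc += values[k] * x_part[col_idx[k]]
            y[i] += acc
    return y
```

For example, `spmv_blocked(split_column_blocks(A, 2), x, len(A))` multiplies a dense list-of-lists `A` by `x` using blocks two columns wide. The block width would be chosen so that `x_part` fits in cache; that tuning is outside the scope of this sketch.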

Micro-architectural characterization of desktop cloud workloads

2012 IEEE International Symposium on Workload Characterization, IISWC 2012

Authors: Jiang, Tao; Hou, Rui; Zhang, Lixin; Zhang, Ke; Chen, Licheng; Chen, Mingyu; Sun, Ninghui. Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, China; Graduate University of Chinese Academy of Sciences, Beijing, China

ISBN: (print) 9781457720642

Desktop cloud replaces traditional desktop computers with completely virtualized systems from the cloud. It is becoming one of the fastest growing segments in the cloud computing market. However, as far as we know, little work has been done to understand the behavior of desktop cloud. On one hand, desktop cloud workloads differ from conventional data center workloads in that they are rich in interactive operations, and they differ from traditional non-virtualized desktop workloads in that they have an extra layer in the software stack, the hypervisor. On the other hand, desktop cloud servers are mostly built with conventional commodity processors. While such processors are well optimized for traditional desktops and high performance computing workloads, their effectiveness for desktop cloud workloads remains to be studied. As an attempt to shed some light on the effectiveness of conventional general-purpose processors on desktop cloud workloads, we have studied the behavior of desktop cloud workloads and compared it with that of SPEC CPU2006, TPC-C, PARSEC, and CloudSuite. We evaluate a Xen-based virtualization platform. The performance results reveal that desktop cloud workloads have significantly different characteristics from SPEC CPU2006, TPC-C and PARSEC, but perform similarly to the data center scale-out benchmarks from CloudSuite. In particular, desktop cloud workloads have a high instruction cache miss rate (12.7% on average), a high percentage of kernel instructions (23% on average), and low IPC (0.36 on average). They also have much higher TLB miss rates and lower utilization of off-chip memory bandwidth than traditional benchmarks. Our experimental numbers indicate that the effectiveness of existing commodity processors is quite low for desktop cloud workloads. In this paper, we provide some preliminary discussions on some potential architectural and micro-architectural enhancements. We hope that the performance numbers presented i

Keywords: Personal computers

Feedback-controlled security-aware and energy-efficient scheduling for real-time embedded systems

7th International Conference on Embedded and Multimedia Computing, EMC 2012

Authors: Ma, Yue; Sang, Nan; Jiang, Wei; Zhang, Lei. Affiliations: School of Computer Science and Engineering, University of Electronic Science and Technology of China, Chengdu, China; State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China

ISBN: (print) 9789400750753

Security has become an important characteristic for many real-time systems. Because battery-powered embedded systems lack a sufficient and stable energy supply, one of the foremost challenges is the mismatch between the energy and performance requirements of security processing. In this paper, we propose a security-aware and energy-efficient scheduling algorithm that aims at reducing energy consumption while ensuring the real-time and security requirements. Based on feedback control theory, we employ a feedback unit to keep track of the CPU utilization and manage the security level dynamically. Simulation results show the effectiveness and efficiency of the proposed algorithm. © 2012 Springer Science+Business Media.

Keywords: Feedback control
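The feedback unit described in the abstract above tracks CPU utilization and adjusts the security level. A minimal sketch of that loop, assuming a simple proportional controller, could look like this. The controller form, the setpoint, the gain, the integer level range and the name `adjust_security_level` are illustrative assumptions, not the algorithm from the paper.

```python
def adjust_security_level(level, utilization, target=0.8, gain=5.0,
                          min_level=0, max_level=10):
    """Proportional feedback step for a discrete security level.

    When measured CPU utilization exceeds the target, the level is lowered
    to shed security-processing load; when there is slack, it is raised.
    The result is clamped to [min_level, max_level].
    """
    error = target - utilization          # positive means spare capacity
    new_level = level + round(gain * error)
    return max(min_level, min(max_level, new_level))
```

Calling the function once per scheduling period with the latest utilization measurement gives a closed loop. A real scheduler would also have to account for task deadlines and energy budgets, which this sketch leaves out.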

SoftPCM: Enhancing energy efficiency and lifetime of phase change memory in video applications via approximate write

2012 IEEE 21st Asian Test Symposium, ATS 2012

Authors: Fang, Yuntan; Li, Huawei; Li, Xiaowei. Affiliations: State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; Graduate University of Chinese Academy of Sciences, Beijing, China

ISBN: (print) 9780769548760

Modern video applications such as video codecs are memory-intensive. As an emerging non-volatile memory technology, phase change memory (PCM) will benefit video applications due to its high density, low leakage power and superior scalability. However, PCM consumes high write energy and can sustain only a limited number of writes. Hence it is necessary to reduce the number of PCM writes for video applications. In this paper, we propose SoftPCM to enhance both the energy efficiency and the lifetime of PCM. SoftPCM utilizes the error tolerance of video applications to relax the accuracy of write operations. Experimental results show that SoftPCM can reduce writes by 22% and thus improve the energy efficiency and lifetime of PCM with slight video quality degradation. © 2012 IEEE.

Keywords: Phase change memory
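One way to picture "relaxing the accuracy of write operations", as the abstract above puts it, is to skip a write when the new value is close enough to what is already stored. The sketch below assumes exactly that policy with a fixed per-pixel tolerance. The abstract does not spell out the actual SoftPCM mechanism, so the policy and the names `approximate_write` and `write_frame` are hypothetical.

```python
def approximate_write(memory, address, new_value, tolerance):
    """Write new_value unless the stored value is within tolerance.

    Returns True if a physical write was performed, False if it was skipped
    and the old, slightly inaccurate value was kept.
    """
    if abs(new_value - memory[address]) <= tolerance:
        return False
    memory[address] = new_value
    return True


def write_frame(memory, frame, tolerance):
    """Store a frame of pixel values and return the number of real writes."""
    return sum(
        approximate_write(memory, i, pixel, tolerance)
        for i, pixel in enumerate(frame)
    )
```

With a tolerance of 0 the function still skips writes whose value is unchanged; raising the tolerance trades stored accuracy for fewer writes.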
