A major performance limiter in modern processors is the long latency caused by data cache misses. Both compiler- and hardware-based prefetching schemes help hide these latencies and so improve performance. Compiler techniques infer memory access patterns through code analysis and insert appropriate prefetch instructions. Hardware prefetching techniques work independently of the compiler by monitoring an access stream, detecting patterns in this stream, and issuing prefetches based on these patterns. This paper examines the interplay between compiler- and hardware-based prefetching techniques: does either technique make the other unnecessary? First, we evaluate the compiler's ability to achieve good results without expert tuning by preparing binaries with no prefetching, one-flag prefetching (no tuning), and expertly tuned prefetching. From runs of SPEC CPU2006 binaries, we find that expertise avoids minor slowdowns in a few benchmarks and provides substantial speedups in others. We then compare software schemes to hardware prefetching schemes; our simulations show that software alone substantially outperforms hardware alone on about half of a selection of benchmarks. While hardware matches or exceeds software in a few cases, software is better on average. Analysis reveals that in many cases hardware fails to prefetch access patterns it is capable of recognizing, due to irregularities in the observed miss sequence. Hardware outperforms software on address sequences that the compiler would not guess. In general, while software is better at prefetching individual loads, hardware partly compensates by identifying more loads to prefetch. Using the two schemes together provides further benefit, but less than the sum of each scheme's contribution alone.
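As an illustration of the software side of this comparison (not taken from the paper), the following minimal C sketch shows the kind of prefetch a compiler might insert for a strided loop, using GCC's __builtin_prefetch; the distance PF_DIST is a hypothetical tuning knob of the sort that the expert tuning above would adjust per loop.

```c
#include <stddef.h>

/* Illustrative prefetch distance; expert tuning would pick this per
 * loop from the memory latency and the cost of one iteration. */
#define PF_DIST 16

/* Sum an array with software prefetching: each iteration prefetches
 * the element PF_DIST ahead so it is in cache when the loop gets there. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* keep */);
        s += a[i];
    }
    return s;
}
```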
Performance-enhancement techniques improve CPU speed at the cost of other valuable system resources such as power and energy. Software prefetching is one such technique, tolerating memory latency to achieve high performance. In this article, we quantitatively study this technique's impact on system performance and power/energy consumption. First, we demonstrate that software prefetching achieves an average 36% performance improvement at the cost of 8% additional energy consumption and 69% higher power consumption on six memory-intensive benchmarks. Then we combine software prefetching with an (unrealistic) static voltage scaling technique to show that this performance gain can be converted into an average 48% energy saving. This suggests that it is promising to build low-power systems with techniques traditionally known for performance enhancement. We therefore propose a practical online-profiling-based dynamic voltage scaling (DVS) algorithm. The algorithm monitors the system's performance and adapts the voltage level accordingly to save energy while maintaining the observed performance. Our online-profiling DVS algorithm achieves a 38% energy saving without any significant performance loss.
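To make the online-profiling DVS idea concrete, here is a schematic C sketch of such a control loop; read_ipc, set_level, and sleep_interval are hypothetical hooks standing in for performance counters and the platform's voltage/frequency interface, not the paper's actual API.

```c
/* Schematic online-profiling DVS loop: sample performance each interval
 * and lower the voltage/frequency level while the observed performance
 * stays at or above the profiled target, raising it back on a shortfall.
 * All three extern hooks are hypothetical placeholders. */
extern double read_ipc(void);        /* sample a performance metric      */
extern void   set_level(int level);  /* 0 = lowest V/f ... MAX = highest */
extern void   sleep_interval(void);  /* wait one profiling interval      */

#define MAX_LEVEL 7

void dvs_control_loop(double target_perf)
{
    int level = MAX_LEVEL;
    for (;;) {
        sleep_interval();
        double perf = read_ipc();
        if (perf >= target_perf && level > 0)
            set_level(--level);      /* slack available: save energy   */
        else if (perf < target_perf && level < MAX_LEVEL)
            set_level(++level);      /* shortfall: restore performance */
    }
}
```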
High power consumption has become one of the critical problems restricting the development of high-performance computers. In recent years, numerous studies have optimized execution performance while satisfying a power constraint. However, these methods mainly target homogeneous systems and do not consider the power or speed differences of heterogeneous processors, so they are difficult to apply to heterogeneous systems with an accelerator. In this paper, by abstracting the current execution model of a heterogeneous system, we propose a new framework for managing system power consumption with a three-level power control mechanism. The three levels, from top to bottom, are the system-level power controller (SPC), the group-level power controller (GPC), and the unit-level power controller (UPC). At the UPC level, we establish a power-management method for software prefetching that scales a program's frequency and voltage, selects the optimal prefetch distance, and guides the optimization process to satisfy the power-constraint boundary. At the GPC level, we put forward a strategy for dividing power based on key threads, preferentially allocating power to threads on critical paths. At the SPC level, we design a method for evaluating the performance of heterogeneous processing engines and use it to divide power so as to improve the overall execution performance of the system while sustaining fairness between concurrent applications. Finally, the proposed framework is verified on a central processing unit (CPU)-graphics processing unit (GPU) heterogeneous system. (C) 2018 Elsevier B.V. All rights reserved.
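As a sketch of what the UPC-level selection of prefetch distance and voltage/frequency level might look like, the following C fragment searches candidate configurations and keeps the fastest one whose measured power stays within the budget; measure_run and the config/sample types are illustrative assumptions, not the paper's interface.

```c
/* Hypothetical UPC-style search: measure each (prefetch distance,
 * V/f level) pair and keep the fastest configuration that respects
 * the power budget. measure_run() is an assumed measurement hook. */
typedef struct { double time_s; double power_w; } sample_t;
typedef struct { int pf_dist; int vf_level; } config_t;

extern sample_t measure_run(int pf_dist, int vf_level);

config_t pick_config(const int *dists, int ndist, int nlevels,
                     double power_budget_w)
{
    config_t best = { dists[0], 0 };
    double best_time = 1e300;                 /* effectively +infinity */
    for (int d = 0; d < ndist; d++) {
        for (int l = 0; l < nlevels; l++) {
            sample_t s = measure_run(dists[d], l);
            if (s.power_w <= power_budget_w && s.time_s < best_time) {
                best_time = s.time_s;
                best.pf_dist  = dists[d];
                best.vf_level = l;
            }
        }
    }
    return best;
}
```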
Prefetching has been shown to be one of several effective approaches for tolerating large memory latencies. Hardware-based prefetching schemes handle prefetching at run time without compiler intervention, whereas software-directed prefetching inserts prefetch instructions into the code based on static data analysis. In this paper, we consider a prefetch engine called Hare, which handles prefetches at run time and sits alongside the data pipeline of the on-chip data cache in high-performance processors. Its key design feature is that it is programmable from user code, so software-prefetching techniques can also be employed to exploit the benefits of prefetching. The engine launches prefetches ahead of the current execution point, under the control of the program counter. We evaluate the proposed scheme by trace-driven simulation and consider area and cycle-time factors in assessing cost-effectiveness. Our performance results show that the prefetch engine can significantly reduce the data-access penalty with only a small prefetching overhead. (C) 1997 Elsevier Science B.V.
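The abstract does not spell out Hare's programming interface, but a programmable, PC-triggered engine can be modeled roughly as below: software installs an entry describing a strided stream, and the engine issues a prefetch each time the trigger PC commits. The entry layout and the issue_prefetch hook are assumptions for illustration only.

```c
#include <stdint.h>

/* Rough model of a user-programmable, PC-triggered prefetch engine.
 * Software fills in an entry; conceptually the hardware then runs
 * engine_tick() on every committed instruction. */
typedef struct {
    uint64_t trigger_pc;  /* load whose execution activates the entry    */
    uint64_t next_addr;   /* next address to prefetch; initialized a few
                             iterations ahead of the load's first access */
    int64_t  stride;      /* bytes between consecutive accesses          */
} pf_entry;

extern void issue_prefetch(uint64_t addr);  /* assumed engine hook */

void engine_tick(pf_entry *e, uint64_t pc)
{
    if (pc == e->trigger_pc) {
        issue_prefetch(e->next_addr);  /* stays a fixed distance ahead */
        e->next_addr += e->stride;
    }
}
```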
ISBN (print): 9780769546759
The erratic memory access pattern of graph algorithms makes them hard to optimize on cache-based architectures. While multithreading hides memory latency, it is unclear how hardware threads combined with caches affect the performance of typical graph workloads. As modern architectures strike different balances between caching and multithreading, it remains an open question whether the benefit of optimizing locality behavior outweighs the cost. We study parallel graph algorithms on two different multi-threaded, multi-core platforms, namely IBM Power7 and Sun Niagara2. Our experiments first demonstrate their performance advantage over prior architectures. We nonetheless find that the number of hardware threads on either platform is insufficient to fully mask memory latency. Our cache-friendly scheduling of memory accesses improves performance by up to 2.6 times on Power7 and prior cache-based architectures, yet the same technique significantly degrades performance on Niagara2. Software prefetching and manipulating the storage layout of the input to improve spatial locality boost performance by up to 2.1 times and 1.3 times, respectively, on both platforms. Our study reveals an interesting interplay between architecture and algorithm.
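As an illustration of software prefetching applied to graph traversal (the exact code is not given in the abstract), the following C sketch prefetches the data of neighbors a few edge slots ahead while scanning a vertex's adjacency list in a standard CSR layout; PF_AHEAD is an assumed distance.

```c
#include <stdint.h>

/* Sketch of software prefetching in a CSR graph traversal: while
 * processing vertex v's neighbors, prefetch the per-vertex data of the
 * neighbor PF_AHEAD slots ahead so the random access is overlapped. */
#define PF_AHEAD 8

void visit_neighbors(const int64_t *row_ptr, const int32_t *col_idx,
                     const double *vertex_data, int32_t v, double *acc)
{
    int64_t end = row_ptr[v + 1];
    for (int64_t e = row_ptr[v]; e < end; e++) {
        if (e + PF_AHEAD < end)
            __builtin_prefetch(&vertex_data[col_idx[e + PF_AHEAD]], 0, 1);
        *acc += vertex_data[col_idx[e]];   /* the actual random access */
    }
}
```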
ISBN (print): 9781595938930
Garbage collection is a performance-critical feature of most modern object-oriented languages, and is characterized by poor locality since it must traverse the heap. In this paper we show that by combining two very simple ideas we can significantly improve the performance of the canonical mark-sweep collector, resulting in improvements in application performance. We make three main contributions: 1) we develop a methodology and framework for accurately and deterministically analyzing the tracing loop at the heart of the collector, 2) we offer a number of insights and improvements over conventional design choices for mark-sweep collectors, and 3) we find that two simple ideas - edge-order traversal and software prefetch - combine to greatly improve garbage collection performance although each is unproductive in isolation. We perform a thorough analysis in the context of MMTk and Jikes RVM on a wide range of benchmarks and four different architectures. Our baseline system (which includes a number of our improvements) is very competitive with highly tuned alternatives. We show a simple marking mechanism that offers modest but consistent improvements over conventional choices. Finally, we show that enqueuing the edges (pointers) of the object graph rather than the nodes (objects) significantly increases opportunities for software prefetch, despite increasing the total number of queue operations. Combining edge-ordered enqueuing with software prefetching yields average improvements across a large suite of benchmarks of 20-30% in garbage collection time and 4-6% in total application performance in moderate heaps, across four architectures.
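To convey why edge-ordered enqueuing creates prefetch opportunities, here is a simplified C sketch of a tracing loop in that style: the queue holds edge targets rather than objects, and each target is prefetched at enqueue time so its mark state is likely in cache by dequeue time. The object layout and queue helpers are illustrative, not MMTk's actual code.

```c
#include <stddef.h>

/* Simplified edge-ordered tracing loop with software prefetch.
 * Targets are prefetched when their edge is enqueued, so the cache
 * miss on the mark field is overlapped with other queue work. */
typedef struct object {
    int            marked;
    int            nfields;
    struct object *fields[]; /* outgoing edges */
} object;

extern int     queue_empty(void);
extern object *dequeue_edge(void);         /* pop one edge target  */
extern void    enqueue_edge(object *tgt);  /* push one edge target */

void trace(void)
{
    while (!queue_empty()) {
        object *obj = dequeue_edge();      /* prefetched at enqueue */
        if (obj == NULL || obj->marked)
            continue;
        obj->marked = 1;
        for (int i = 0; i < obj->nfields; i++) {
            object *child = obj->fields[i];
            if (child != NULL) {
                __builtin_prefetch(child, 1, 3); /* will write marked */
                enqueue_edge(child);
            }
        }
    }
}
```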
The variation in memory latency is hard to predict in software, especially on SMP or NUMA systems. As a corresponding hardware approach, the multi-threaded processor has been devised. However, it is diffi...