A major performance limiter in modern processors is the long latency caused by data cache misses. Both compiler- and hardware-based prefetching schemes help hide these latencies and so improve performance. Compiler techniques infer memory access patterns through code analysis and insert appropriate prefetch instructions. Hardware prefetching techniques work independently of the compiler by monitoring an access stream, detecting patterns in this stream, and issuing prefetches based on these patterns. This paper examines the interplay between compiler- and hardware-based prefetching techniques: does either technique make the other unnecessary? First, we evaluate the compiler's ability to achieve good results without expert tuning by preparing binaries with no prefetching, one-flag prefetching (no tuning), and expertly tuned prefetching. From runs of SPEC CPU2006 binaries, we find that expertise avoids minor slowdowns in a few benchmarks and provides substantial speedups in others. We then compare software schemes to hardware prefetching schemes; our simulations show that software alone substantially outperforms hardware alone on about half of a selection of benchmarks. While hardware matches or exceeds software in a few cases, software is better on average. Analysis reveals that in many cases hardware fails to prefetch access patterns it is capable of recognizing, due to irregularities in the observed miss sequence. Hardware outperforms software on address sequences that the compiler would not guess. In general, while software is better at prefetching individual loads, hardware partly compensates by identifying more loads to prefetch. Using the two schemes together provides further benefit, but less than the sum of each scheme's contribution alone.
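As an illustration of the software side of this comparison (not taken from the paper), the following minimal C sketch shows the kind of prefetch a compiler might insert for a strided loop, using GCC's __builtin_prefetch; the distance PF_DIST is a hypothetical tuning knob of the sort that the expert tuning above would adjust per loop.

```c
#include <stddef.h>

/* Illustrative prefetch distance; expert tuning would pick this per
 * loop from the memory latency and the cost of one iteration. */
#define PF_DIST 16

/* Sum an array with software prefetching: each iteration prefetches
 * the element PF_DIST ahead so it is in cache when the loop gets there. */
double sum_with_prefetch(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + PF_DIST < n)
            __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 3 /* keep */);
        s += a[i];
    }
    return s;
}
```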
Performance-enhancement techniques improve CPU speed at the cost of other valuable system resources such as power and energy. Software prefetching is one such technique, tolerating memory latency to achieve high performance. In this article, we quantitatively study this technique's impact on system performance and power/energy consumption. First, we demonstrate that software prefetching achieves an average 36% performance improvement at the cost of 8% additional energy consumption and 69% higher power consumption on six memory-intensive benchmarks. Then we combine software prefetching with an (unrealistic) static voltage scaling technique to show that this performance gain can be converted into an average 48% energy saving. This suggests that it is promising to build low-power systems with techniques traditionally known for performance enhancement. We therefore propose a practical online-profiling-based dynamic voltage scaling (DVS) algorithm. The algorithm monitors the system's performance and adapts the voltage level accordingly to save energy while maintaining the observed performance. Our online-profiling DVS algorithm achieves a 38% energy saving without any significant performance loss.
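To make the online-profiling DVS idea concrete, here is a schematic C sketch of such a control loop; read_ipc, set_level, and sleep_interval are hypothetical hooks standing in for performance counters and the platform's voltage/frequency interface, not the paper's actual API.

```c
/* Schematic online-profiling DVS loop: sample performance each interval
 * and lower the voltage/frequency level while the observed performance
 * stays at or above the profiled target, raising it back on a shortfall.
 * All three extern hooks are hypothetical placeholders. */
extern double read_ipc(void);        /* sample a performance metric      */
extern void   set_level(int level);  /* 0 = lowest V/f ... MAX = highest */
extern void   sleep_interval(void);  /* wait one profiling interval      */

#define MAX_LEVEL 7

void dvs_control_loop(double target_perf)
{
    int level = MAX_LEVEL;
    for (;;) {
        sleep_interval();
        double perf = read_ipc();
        if (perf >= target_perf && level > 0)
            set_level(--level);      /* slack available: save energy   */
        else if (perf < target_perf && level < MAX_LEVEL)
            set_level(++level);      /* shortfall: restore performance */
    }
}
```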
High power consumption has become one of the critical problems restricting the development of high-performance computers. In recent years, numerous studies have optimized execution performance while satisfying a power constraint. However, these methods mainly target homogeneous systems and do not consider the power or speed differences of heterogeneous processors, so they are difficult to apply to heterogeneous systems with an accelerator. In this paper, by abstracting the current execution model of a heterogeneous system, we propose a new framework for managing system power consumption with a three-level power control mechanism. The three levels, from top to bottom, are the system-level power controller (SPC), the group-level power controller (GPC), and the unit-level power controller (UPC). At the UPC level, we establish a power-management method for software prefetching that scales a program's frequency and voltage, selects the optimal prefetch distance, and guides the optimization process to satisfy the power-constraint boundary. At the GPC level, we put forward a strategy for dividing power based on key threads, preferentially allocating power to threads on critical paths. At the SPC level, we design a method for evaluating the performance of heterogeneous processing engines and use it to divide power so as to improve the overall execution performance of the system while sustaining fairness between concurrent applications. Finally, the proposed framework is verified on a central processing unit (CPU)-graphics processing unit (GPU) heterogeneous system. (C) 2018 Elsevier B.V. All rights reserved.
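As a sketch of what the UPC-level selection of prefetch distance and voltage/frequency level might look like, the following C fragment searches candidate configurations and keeps the fastest one whose measured power stays within the budget; measure_run and the config/sample types are illustrative assumptions, not the paper's interface.

```c
/* Hypothetical UPC-style search: measure each (prefetch distance,
 * V/f level) pair and keep the fastest configuration that respects
 * the power budget. measure_run() is an assumed measurement hook. */
typedef struct { double time_s; double power_w; } sample_t;
typedef struct { int pf_dist; int vf_level; } config_t;

extern sample_t measure_run(int pf_dist, int vf_level);

config_t pick_config(const int *dists, int ndist, int nlevels,
                     double power_budget_w)
{
    config_t best = { dists[0], 0 };
    double best_time = 1e300;                 /* effectively +infinity */
    for (int d = 0; d < ndist; d++) {
        for (int l = 0; l < nlevels; l++) {
            sample_t s = measure_run(dists[d], l);
            if (s.power_w <= power_budget_w && s.time_s < best_time) {
                best_time = s.time_s;
                best.pf_dist  = dists[d];
                best.vf_level = l;
            }
        }
    }
    return best;
}
```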
Prefetching has been shown to be one of several effective approaches for tolerating large memory latencies. Hardware-based prefetching schemes handle prefetching at run time without compiler intervention, whereas software-directed prefetching inserts prefetch instructions into the code based on static data analysis. In this paper, we consider a prefetch engine called Hare, which handles prefetches at run time and sits alongside the data pipeline of the on-chip data cache in high-performance processors. Its key design feature is that it is programmable from user code, so software-prefetching techniques can also be employed to exploit the benefits of prefetching. The engine launches prefetches ahead of the current execution point, under the control of the program counter. We evaluate the proposed scheme by trace-driven simulation and consider area and cycle-time factors in assessing cost-effectiveness. Our performance results show that the prefetch engine can significantly reduce the data-access penalty with only a small prefetching overhead. (C) 1997 Elsevier Science B.V.
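The abstract does not spell out Hare's programming interface, but a programmable, PC-triggered engine can be modeled roughly as below: software installs an entry describing a strided stream, and the engine issues a prefetch each time the trigger PC commits. The entry layout and the issue_prefetch hook are assumptions for illustration only.

```c
#include <stdint.h>

/* Rough model of a user-programmable, PC-triggered prefetch engine.
 * Software fills in an entry; conceptually the hardware then runs
 * engine_tick() on every committed instruction. */
typedef struct {
    uint64_t trigger_pc;  /* load whose execution activates the entry    */
    uint64_t next_addr;   /* next address to prefetch; initialized a few
                             iterations ahead of the load's first access */
    int64_t  stride;      /* bytes between consecutive accesses          */
} pf_entry;

extern void issue_prefetch(uint64_t addr);  /* assumed engine hook */

void engine_tick(pf_entry *e, uint64_t pc)
{
    if (pc == e->trigger_pc) {
        issue_prefetch(e->next_addr);  /* stays a fixed distance ahead */
        e->next_addr += e->stride;
    }
}
```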
ISBN (print): 9780769546759
The erratic memory access pattern of graph algorithms makes them hard to optimize on cache-based architectures. While multithreading hides memory latency, it is unclear how hardware threads combined with caches affect the performance of typical graph workloads. As modern architectures strike different balances between caching and multithreading, it remains an open question whether the benefit of optimizing locality behavior outweighs the cost. We study parallel graph algorithms on two different multi-threaded, multi-core platforms, namely IBM Power7 and Sun Niagara2. Our experiments first demonstrate their performance advantage over prior architectures. We nonetheless find that the number of hardware threads on either platform is insufficient to fully mask memory latency. Our cache-friendly scheduling of memory accesses improves performance by up to 2.6 times on Power7 and prior cache-based architectures, yet the same technique significantly degrades performance on Niagara2. Software prefetching and manipulating the storage layout of the input to improve spatial locality boost performance by up to 2.1 times and 1.3 times, respectively, on both platforms. Our study reveals an interesting interplay between architecture and algorithm.
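As an illustration of software prefetching applied to graph traversal (the exact code is not given in the abstract), the following C sketch prefetches the data of neighbors a few edge slots ahead while scanning a vertex's adjacency list in a standard CSR layout; PF_AHEAD is an assumed distance.

```c
#include <stdint.h>

/* Sketch of software prefetching in a CSR graph traversal: while
 * processing vertex v's neighbors, prefetch the per-vertex data of the
 * neighbor PF_AHEAD slots ahead so the random access is overlapped. */
#define PF_AHEAD 8

void visit_neighbors(const int64_t *row_ptr, const int32_t *col_idx,
                     const double *vertex_data, int32_t v, double *acc)
{
    int64_t end = row_ptr[v + 1];
    for (int64_t e = row_ptr[v]; e < end; e++) {
        if (e + PF_AHEAD < end)
            __builtin_prefetch(&vertex_data[col_idx[e + PF_AHEAD]], 0, 1);
        *acc += vertex_data[col_idx[e]];   /* the actual random access */
    }
}
```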
ISBN (print): 9781595938930
Garbage collection is a performance-critical feature of most modern object-oriented languages, and is characterized by poor locality since it must traverse the heap. In this paper we show that by combining two very simple ideas we can significantly improve the performance of the canonical mark-sweep collector, resulting in improvements in application performance. We make three main contributions: 1) we develop a methodology and framework for accurately and deterministically analyzing the tracing loop at the heart of the collector, 2) we offer a number of insights and improvements over conventional design choices for mark-sweep collectors, and 3) we find that two simple ideas - edge-order traversal and software prefetch - combine to greatly improve garbage collection performance although each is unproductive in isolation. We perform a thorough analysis in the context of MMTk and Jikes RVM on a wide range of benchmarks and four different architectures. Our baseline system (which includes a number of our improvements) is very competitive with highly tuned alternatives. We show a simple marking mechanism that offers modest but consistent improvements over conventional choices. Finally, we show that enqueuing the edges (pointers) of the object graph rather than the nodes (objects) significantly increases opportunities for software prefetch, despite increasing the total number of queue operations. Combining edge-ordered enqueuing with software prefetching yields average improvements across a large suite of benchmarks of 20-30% in garbage collection time and 4-6% in total application performance in moderate heaps, across four architectures.
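To convey why edge-ordered enqueuing creates prefetch opportunities, here is a simplified C sketch of a tracing loop in that style: the queue holds edge targets rather than objects, and each target is prefetched at enqueue time so its mark state is likely in cache by dequeue time. The object layout and queue helpers are illustrative, not MMTk's actual code.

```c
#include <stddef.h>

/* Simplified edge-ordered tracing loop with software prefetch.
 * Targets are prefetched when their edge is enqueued, so the cache
 * miss on the mark field is overlapped with other queue work. */
typedef struct object {
    int            marked;
    int            nfields;
    struct object *fields[]; /* outgoing edges */
} object;

extern int     queue_empty(void);
extern object *dequeue_edge(void);         /* pop one edge target  */
extern void    enqueue_edge(object *tgt);  /* push one edge target */

void trace(void)
{
    while (!queue_empty()) {
        object *obj = dequeue_edge();      /* prefetched at enqueue */
        if (obj == NULL || obj->marked)
            continue;
        obj->marked = 1;
        for (int i = 0; i < obj->nfields; i++) {
            object *child = obj->fields[i];
            if (child != NULL) {
                __builtin_prefetch(child, 1, 3); /* will write marked */
                enqueue_edge(child);
            }
        }
    }
}
```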
The variation in memory latency is hard to predict in software, especially on SMP or NUMA systems. As a corresponding hardware approach, the multi-threaded processor has been devised. However, it is diffi...