Search Results - Inner Mongolia University Library

Efficient Handling of Lock Hand-off in DSM Multiprocessors with Buffering Coherence Controllers

Journal of Computer Science & Technology, 2012, Vol. 27, No. 1, pp. 75-91

Authors: Benjamín Sahelices, Agustín de Dios, Pablo Ibáñez, Víctor Viñals-Yúfera, José María Llabería. Affiliations: Computer Science Department and HiPEAC European Network of Excellence, University of Valladolid, Valladolid, Spain; Computer Science and Systems Engineering Department, I3A Research Institute and HiPEAC European Network of Excellence, University of Zaragoza, Zaragoza, Spain; Computer Architecture Department and HiPEAC European Network of Excellence, Polytechnic University of Cataluña, Barcelona, Spain

Synchronization in parallel programs is a major performance bottleneck in multiprocessor systems. Shared data is protected by locks, and a lot of time is spent on the competition arising at the lock hand-off. In order to be serialized, requests to the same cache line can either be bounced (NACKed) or buffered in the coherence controller. In this paper, we focus mainly on systems whose coherence controllers buffer requests. In a lock hand-off, a burst of requests to the same line arrives at the coherence controller. During lock hand-off, only the requests from the winning processor contribute to progress of the computation, since the winning processor is the only one that will advance the work. This key observation leads us to propose a hardware mechanism we call request bypassing, which allows requests from the winning processor to bypass the requests buffered in the coherence controller keeping the lock line. We present an inexpensive implementation of request bypassing that reduces the time spent on all the execution phases of a critical section (acquiring the lock, accessing shared data, and releasing the lock) and which, as a consequence, speeds up the whole parallel computation. This mechanism requires neither compiler or programmer support nor ISA or coherence protocol changes. By simulating a 32-processor system, we show that using request bypassing does not degrade but rather improves performance in three applications with low synchronization rates, while in those having a large amount of synchronization activity (the remaining four), we see reductions in execution time and in lock stall time ranging from 14% to 39% and from 52% to 71%, respectively. We compare request bypassing with a previously proposed technique called read combining and with a system that bounces requests, observing a significantly lower execution time with the bypassing scheme. Finally, we analyze the sensitivity of our results to some key hardware and software parameters.

Keywords: distributed shared memory multiprocessors synchronization buffer coherence controller request bypass


Latency, occupancy, and bandwidth in DSM multiprocessors: A performance evaluation

IEEE Transactions on Computers, 2003, Vol. 52, No. 7, pp. 862-880

Authors: Chaudhuri, M; Heinrich, M; Holt, C; Singh, JP; Rothberg, E; Hennessy, J. Affiliations: Cornell Univ, Comp Syst Lab, Ithaca, NY 14853, USA; Transmeta Inc, Santa Clara, CA 95054, USA; Princeton Univ, Dept Comp Sci, Princeton, NJ 08544, USA; ILOG Inc, Mountain View, CA 94043, USA; Stanford Univ, Comp Syst Lab, Stanford, CA 94305, USA

While the desire to use commodity parts in the communication architecture of a DSM multiprocessor offers advantages in cost and design time, the impact on application performance is unclear. We study this performance impact through detailed simulation, analytical modeling, and experiments on a flexible DSM prototype, using a range of parallel applications. We adapt the logP model to characterize the communication architectures of DSM machines. The l (network latency) and o (controller occupancy) parameters are the keys to performance in these machines, with the g (node-to-network bandwidth) parameter becoming important only for the fastest controllers. We show that, of all the logP parameters, controller occupancy has the greatest impact on application performance. Of the two contributions of occupancy to performance degradation (the latency it adds and the contention it induces), it is the contention component that governs performance regardless of network latency, showing a quadratic dependence on o. As expected, techniques to reduce the impact of latency make controller occupancy a greater bottleneck. Surprisingly, the performance impact of occupancy is substantial, even for highly tuned applications and even in the absence of latency hiding techniques. Scaling the problem size is often used as a technique to overcome limitations in communication latency and bandwidth. Through experiments on a DSM prototype, we show that there are important classes of applications for which the performance lost by using higher occupancy controllers cannot be regained easily, if at all, by scaling the problem size.

Keywords: occupancy distributed shared memory multiprocessors communication controller latency bandwidth queuing model flexible node controller
