Inability to hide main memory latency has been increasingly limiting the performance of modern processors. The problem is worse in large-scale shared memory systems, where remote memory latencies are hundreds, and soon thousands, of processor cycles. To mitigate this problem, we propose an intelligent memory and cache coherence controller (AMC) that can execute Active Memory Operations (AMOs). AMOs are select operations sent to and executed on the home memory controller of data. AMOs can eliminate a significant number of coherence messages, minimize intranode and internode memory traffic, and create opportunities for parallelism. Our implementation of AMOs is cache-coherent and requires no changes to the processor core or DRAM chips. In this paper, we present the microarchitecture design of AMC, and the programming model of AMOs. We compare AMOs' performance to that of several other memory architectures on a variety of scientific and commercial benchmarks. Through simulation, we show that AMOs offer dramatic performance improvements for an important set of data-intensive operations, e.g., up to 50x faster barriers, 12x faster spinlocks, 8.5x-15x faster stream/array operations, and 3x faster database queries. We also present an analytical model that can predict the performance benefits of using AMOs with decent accuracy. The silicon cost required to support AMOs is less than 1% of the die area of a typical high performance processor, based on a standard cell implementation.
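As a rough illustration of the programming model summarized above, the C sketch below shows how an AMO-style fetch-and-add could back a sense-reversing barrier. The amo_fetch_add wrapper, which here falls back to an ordinary atomic read-modify-write, stands in for hardware that would forward the operation to the counter's home memory controller; the names and barrier layout are illustrative assumptions, not the paper's actual interface.

    /* Hypothetical AMO-style barrier: the fetch-and-add would be executed at the
     * counter's home memory controller, so the cache line holding the counter
     * never bounces between the participating processors. */
    #include <stdatomic.h>
    #include <stdint.h>

    typedef struct {
        _Atomic uint64_t count;   /* arrivals in the current episode          */
        _Atomic uint64_t sense;   /* advances each time all threads arrive    */
        uint64_t         nthreads;
    } amo_barrier_t;

    /* Stand-in for the AMO intrinsic: an ordinary atomic RMW here; on AMO
     * hardware the operation would be shipped to the home memory controller. */
    static inline uint64_t amo_fetch_add(_Atomic uint64_t *addr, uint64_t v)
    {
        return atomic_fetch_add_explicit(addr, v, memory_order_acq_rel);
    }

    void amo_barrier_wait(amo_barrier_t *b)
    {
        uint64_t my_sense = atomic_load_explicit(&b->sense, memory_order_acquire);

        /* One remote update instead of N processors contending for the line. */
        if (amo_fetch_add(&b->count, 1) == b->nthreads - 1) {
            atomic_store_explicit(&b->count, 0, memory_order_relaxed);
            atomic_fetch_add_explicit(&b->sense, 1, memory_order_release);
        } else {
            while (atomic_load_explicit(&b->sense, memory_order_acquire) == my_sense)
                ;   /* spin until the last arrival advances the sense */
        }
    }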
ISBN (print): 0780387368
In the design of a heterogeneous multiprocessor system on chip, we face a new design problem: scheduler implementation. In this paper, we present an approach to implementing a static scheduler, which controls all the task executions and communication transactions of a system according to a pre-determined schedule. For the scheduler implementation, we consider both intra-processor and inter-processor synchronization. We also consider scheduler overhead, which is often neglected. In particular, we address the issue of centralized implementation versus distributed implementation. We investigate the pros and cons of the two different scheduler implementations. Through experiments with synthetic examples and a real-world multimedia application, we show the effectiveness of our approach.
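A minimal sketch, in C, of what a distributed implementation of such a static scheduler might look like, assuming each processing element walks its own precomputed slot table and synchronizes through shared flags before entries that depend on another processor's output. The table layout and flag protocol are assumptions for illustration, not the paper's design.

    /* Distributed static-scheduler sketch: each processing element (PE) runs
     * its own pre-computed table of entries.  An entry optionally waits for a
     * flag set by another PE (inter-processor sync) before running its task,
     * then sets a flag of its own when the task's output is ready. */
    #include <stdatomic.h>
    #include <stddef.h>

    typedef void (*task_fn)(void *arg);

    typedef struct {
        _Atomic int *wait_flag;   /* NULL if the entry has no remote dependency */
        _Atomic int *done_flag;   /* NULL if no other PE consumes the result    */
        task_fn      task;
        void        *arg;
    } sched_entry;

    /* Executed once per iteration of the static schedule on one PE. */
    void run_static_schedule(const sched_entry *table, size_t n)
    {
        for (size_t i = 0; i < n; i++) {
            if (table[i].wait_flag)                    /* inter-processor sync  */
                while (!atomic_load_explicit(table[i].wait_flag,
                                             memory_order_acquire))
                    ;                                  /* busy-wait on producer */

            table[i].task(table[i].arg);               /* run task in its slot  */

            if (table[i].done_flag)                    /* publish completion    */
                atomic_store_explicit(table[i].done_flag, 1,
                                      memory_order_release);
        }
    }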
ISBN (print): 081944426X
To meet the computational requirements of mid-range and high-end programmable ultrasound systems, multiple processors are currently required. Algorithms optimized specifically for a single processor-based system may not perform well in a multiprocessor environment. They need to be efficiently remapped on multiple processors to take advantage of the increased computing power while minimizing the interprocessor data transfer and the latency between data acquisition and display. In this paper, we describe a multiprocessor-based implementation of scan conversion, a key processing task in an ultrasound system that geometrically transforms the acquired polar ultrasound data to Cartesian coordinates for display. The single processor-based scan conversion algorithm that was reported previously uses inverse mapping for geometric transformation, where the pixel values in the Cartesian display are determined from data in the polar domain. Inverse mapping requires access to a full frame of pre-scan-converted ultrasound data, which in a multiprocessor system can be located across multiple processors, thus requiring a significant amount of interprocessor data communication. Our modified scan conversion algorithm reduces the data movement by performing inverse-mapped scan conversion locally on the polar-domain data present in each processor's memory. Each processor handles a smaller amount of data, thus reducing the latency. The raster pixels generated by each processor are combined later. Interprocessor synchronization is used to ensure that each processor displays data belonging to the same frame. Data overlapping between processors avoids boundary artifacts between regions that are processed on different processors. Using four Hitachi/Equator Technologies' 300-MHz MAP-CA processors, scan conversion requires 5.6 ms for a 600x420 RGB frame, as compared to 14.6 ms using a single processor, and the latency is reduced by 33.3%. We believe that this type of parallel algorithm will
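The C sketch below illustrates the general shape of locally inverse-mapped scan conversion: each processor converts only those output pixels whose polar source samples fall within its (overlapped) slice of scan lines, using bilinear interpolation. The geometry, data layout, and overlap handling here are simplifying assumptions, not the paper's exact algorithm.

    /* Inverse-mapped polar->Cartesian scan conversion over the local slice of
     * pre-scan-converted data held by one processor.  'polar' holds scan lines
     * line_lo..line_hi (including the overlap region), each with n_samples
     * samples in 0..255; pixels whose source samples lie outside that slice
     * are skipped and produced by another processor. */
    #include <math.h>
    #include <stddef.h>

    void scan_convert_local(const float *polar, int n_samples,
                            int line_lo, int line_hi,      /* local lines, inclusive  */
                            float theta0, float dtheta,    /* angle of line 0, step   */
                            float r0, float dr,            /* range of sample 0, step */
                            unsigned char *image, int width, int height,
                            float x0, float y0, float dx, float dy)
    {
        for (int py = 0; py < height; py++) {
            for (int px = 0; px < width; px++) {
                float x  = x0 + px * dx, y = y0 + py * dy;
                float r  = sqrtf(x * x + y * y);
                float th = atan2f(x, y);                /* angle from the probe axis */

                float lf = (th - theta0) / dtheta;      /* fractional scan-line index */
                float sf = (r  - r0) / dr;              /* fractional sample index    */

                /* Convert only pixels whose four source samples are local. */
                if (lf < (float)line_lo || lf + 1.0f > (float)line_hi ||
                    sf < 0.0f || sf + 1.0f > (float)(n_samples - 1))
                    continue;

                int   li = (int)lf, si = (int)sf;
                float a  = lf - li, b  = sf - si;       /* bilinear weights */
                const float *l0 = polar + (size_t)(li - line_lo) * n_samples;
                const float *l1 = l0 + n_samples;
                float v = (1 - a) * ((1 - b) * l0[si] + b * l0[si + 1])
                        +      a  * ((1 - b) * l1[si] + b * l1[si + 1]);

                image[(size_t)py * width + px] = (unsigned char)v;
            }
        }
    }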
Both predictable interprocessor synchronization and fast interrupt response are required for real-time systems constructed using asymmetric shared-memory multiprocessors. This paper points out that conventional spin lock algorithms cannot satisfy both requirements at the same time and describes two spin lock algorithms that have been proposed to solve this problem. These algorithms, extensions of queuing spin locks modified to be preemptable for servicing interrupts, can give upper bounds on the times to acquire and release an interprocessor lock while achieving a fast response to interrupt requests. Performance measurements demonstrate that the algorithms have the required properties. To apply the algorithms to real-time kernels, we also propose an extended algorithm, which is a combination of the two algorithms.
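To make the idea concrete, here is a C sketch of an MCS-style queuing spin lock whose waiters poll for pending interrupts while spinning. The interrupt_pending and service_interrupt hooks are hypothetical, and the paper's algorithms go further, allowing a waiter to leave and rejoin the queue so that servicing an interrupt cannot delay the lock hand-off.

    /* MCS-style queuing spin lock whose waiters poll for pending interrupt
     * requests while spinning; a simplified illustration of the preemptable
     * queuing-lock idea, not the paper's algorithms. */
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        _Atomic(struct mcs_node *) next;
        _Atomic bool               locked;
    } mcs_node;

    typedef _Atomic(mcs_node *) mcs_lock;

    /* Hypothetical hooks: test for and service a pending local interrupt. */
    extern bool interrupt_pending(void);
    extern void service_interrupt(void);

    void mcs_acquire(mcs_lock *lock, mcs_node *me)
    {
        atomic_store_explicit(&me->next, NULL, memory_order_relaxed);
        atomic_store_explicit(&me->locked, true, memory_order_relaxed);

        mcs_node *prev = atomic_exchange_explicit(lock, me, memory_order_acq_rel);
        if (prev == NULL)
            return;                               /* lock was free */

        atomic_store_explicit(&prev->next, me, memory_order_release);
        while (atomic_load_explicit(&me->locked, memory_order_acquire)) {
            if (interrupt_pending())              /* stay responsive while queued */
                service_interrupt();
        }
    }

    void mcs_release(mcs_lock *lock, mcs_node *me)
    {
        mcs_node *succ = atomic_load_explicit(&me->next, memory_order_acquire);
        if (succ == NULL) {
            mcs_node *expected = me;
            if (atomic_compare_exchange_strong_explicit(lock, &expected, NULL,
                    memory_order_acq_rel, memory_order_acquire))
                return;                           /* no successor: lock is free */
            while ((succ = atomic_load_explicit(&me->next,
                                                memory_order_acquire)) == NULL)
                ;                                 /* successor is still linking in */
        }
        atomic_store_explicit(&succ->locked, false, memory_order_release);
    }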