Chip-multiprocessors (CMPs) have become the mainstream parallel architecture in recent years; for scalability reasons, designs with high core counts tend towards tiled CMPs with physically distributed shared caches. This naturally leads to a Non-Uniform Cache Access (NUCA) design, where on-chip access latencies depend on the physical distances between requesting cores and the home cores where the data is cached. Improving data locality is thus key to performance, and several studies have addressed this problem using data replication and data migration. In this paper, we consider another mechanism, hardware-level thread migration. This approach, we argue, can better exploit shared data locality for NUCA designs by effectively replacing multiple round-trip remote cache accesses with a smaller number of migrations. High migration costs, however, make it crucial to use thread migrations judiciously; we therefore propose a novel, on-line prediction scheme which decides, at the instruction level, whether to perform a remote access (as in traditional NUCA designs) or a thread migration. For a set of parallel benchmarks, our thread migration predictor improves performance by 24% on average over a shared-NUCA design that uses only remote accesses.
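The decision the abstract describes can be illustrated with a minimal sketch: a per-instruction (PC-indexed) predictor that counts consecutive accesses to the same remote home core and, past a threshold, predicts that migrating the thread is cheaper than paying repeated round trips. The class name, table layout, and threshold below are illustrative assumptions, not the paper's actual mechanism.

```python
# Hypothetical per-PC migration predictor: a run of consecutive accesses to
# the same remote home core suggests migrating the thread there instead of
# issuing more round-trip remote accesses. Threshold and state are assumptions.

from collections import defaultdict

THRESHOLD = 3  # consecutive same-remote-core accesses before predicting "migrate"

class MigrationPredictor:
    def __init__(self):
        # per-PC state: (last remote home core seen, current run length)
        self.table = defaultdict(lambda: (None, 0))

    def access(self, pc, home_core, current_core):
        """Return 'local', 'remote', or 'migrate' for one memory access."""
        if home_core == current_core:
            self.table[pc] = (None, 0)      # local hit: reset the run
            return "local"
        last_core, run = self.table[pc]
        run = run + 1 if home_core == last_core else 1
        self.table[pc] = (home_core, run)
        return "migrate" if run >= THRESHOLD else "remote"

p = MigrationPredictor()
# Three back-to-back accesses from core 0 to data homed on core 5:
decisions = [p.access(pc=0x400, home_core=5, current_core=0) for _ in range(3)]
print(decisions)  # the third access crosses the threshold and predicts migration
```

In this toy model the first two accesses stay remote and only the third, once the run is established, triggers a migration, which mirrors the abstract's goal of replacing many round trips with one move.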
For certain applications on chip multiprocessors with more than 16 cores, a directoryless architecture with fine-grained, partial-context thread migration can outperform directory-based coherence, reducing both on-chip traffic and verification complexity.
The explosive spread of broadband in recent years has increased the exchange of various types of data via the Internet, leading to an annual increase of 160% in the volume of data used in e-mail and movie content. For storage in IT systems, the issues attracting the most attention are sudden increases in data volume, management complexity, and the impact of shutdowns. This paper discusses solutions to these issues, including "iStorage D8", featuring scalability, manageability and availability, and "iStorage D1/D3", featuring high cost efficiency, easy introduction and a space-saving design.
ISBN: (print) 9781581138399
The growing dominance of wire delays at future technology points renders a microprocessor communication-bound. Clustered microarchitectures allow most dependence chains to execute without being affected by long on-chip wire latencies. They also allow faster clock speeds and reduce design complexity, thereby emerging as a popular design choice for future microprocessors. However, a centralized data cache threatens to be the primary bottleneck in highly clustered systems. This paper attempts to identify the most complexity-effective approach to alleviating this bottleneck. While decentralized cache organizations have been proposed, they introduce excessive logic and wiring complexity. The paper evaluates whether the performance gains of a decentralized cache are worth the increase in complexity. We also introduce and evaluate the behavior of Cluster Prefetch, the forwarding of data values to a cluster through accurate address prediction. Our results show that the success of this technique depends on accurate speculation across unresolved stores. The technique applies to a wide class of processor models and, most importantly, allows high performance even while employing a simple centralized data cache. We conclude that address prediction holds more promise for future wire-delay-limited processors than decentralized cache organizations.
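The address prediction underlying a Cluster Prefetch-style scheme can be sketched with a classic per-PC stride predictor: once a load's address stride repeats, the next address can be predicted and its value forwarded toward the consuming cluster ahead of time. The table layout and saturating-confidence rule below are illustrative assumptions, not the paper's exact design.

```python
# Hedged sketch of per-PC stride-based address prediction, the kind of
# mechanism an accurate-address-prediction prefetcher relies on.
# Confidence threshold and saturation limit are illustrative.

class StridePredictor:
    def __init__(self):
        self.table = {}  # pc -> [last_addr, stride, confidence]

    def observe(self, pc, addr):
        """Record an executed load; return a predicted next address or None."""
        if pc not in self.table:
            self.table[pc] = [addr, 0, 0]
            return None
        entry = self.table[pc]
        stride = addr - entry[0]
        if stride == entry[1]:
            entry[2] = min(entry[2] + 1, 3)   # saturating confidence counter
        else:
            entry[1], entry[2] = stride, 0    # new stride: restart confidence
        entry[0] = addr
        # Predict (and thus prefetch) only once the stride has repeated.
        return addr + entry[1] if entry[2] >= 2 else None

sp = StridePredictor()
pred = None
for a in (0x1000, 0x1040, 0x1080, 0x10C0):
    pred = sp.observe(pc=0x400, addr=a)
print(hex(pred))  # after a stable 0x40 stride, predicts 0x1100
```

Note that a real implementation must also respect the paper's caveat: a predicted address is only safe to prefetch across unresolved stores if speculation on those stores is accurate, which this sketch does not model.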