the register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increas...
详细信息
ISBN:
(纸本)9781581132328
the register file access time is one of the critical delays in current superscalar processors. Its impact on processor performance is likely to increase in future processor generations, as they are expected to increase the issue width (which implies more register ports) and the size of the instruction window (which implies more registers), and to use some kind of multithreading. Under this scenario, the register file access time could be a dominant delay and a pipelined implementation would be desirable to allow for high clock rates. However, a multi-stage register file has severe implications for processor performance (e.g. higher branch misprediction penalty) and complexity (more levels of bypass logic). To tackle these two problems, in this paper we propose a register file architecture composed of multiple banks. In particular we focus on a multi-level organization of the register file, which provides low latency and simple bypass logic. We propose several caching policies and prefetching strategies and demonstrate the potential of this multiple-banked organization. For instance, we show that a two-level organization degrades IPC by 10% and 2% with respect to a non-pipelined single-banked register file, for SpecInt95 and SpecFP95 respectively, but it increases performance by 87% and 92% when the register file access time is factored in.
Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symm...
详细信息
ISBN:
(纸本)9781467360661
Factorization of a dense symmetric indefinite matrix is a key computational kernel in many scientific and engineering simulations. However, there is no scalable factorization algorithm that takes advantage of the symmetry and guarantees numerical stability through pivoting at the same time. this is because such an algorithm exhibits many of the fundamental challenges in parallel programming like irregular data accesses and irregular task dependencies. In this paper, we address these challenges in a tiled implementation of a blocked Aasen's algorithm using a dynamic scheduler. To fully exploit the limited parallelism in this left-looking algorithm, we study several performance enhancing techniques; e.g., parallel reduction to update a panel, tall-skinny LU factorization algorithms to factorize the panel, and a parallel implementation of symmetric pivoting. Our performance results on up to 48 AMD Opteron processors demonstrate that our implementation obtains speedups of up to 2.8 over MKL, while losing only one or two digits in the computed residual norms.
Asynchronous variational integrators (AVIs) are used in computational mechanics and graphics to solve complex contact mechanics problems. the parallelization of AVI is difficult problem because it is not possible to b...
详细信息
ISBN:
(纸本)9781450328210
Asynchronous variational integrators (AVIs) are used in computational mechanics and graphics to solve complex contact mechanics problems. the parallelization of AVI is difficult problem because it is not possible to build a dependence graph for AVI either at compile-time or at runtime. However, we show that if the dependence graph for AVI can be updated incrementally as the computation is performed, it is possible to parallelize AVI in a systematic way. Using this approach, we are able to obtain speedups of up to 20 on 24 cores for relatively small AVI problems.
Hardware-software co-synthesis is the process of partitioning an embedded system specification into hardware and software modules to meet performance, cost and reliability goals. In this paper, we address the problem ...
详细信息
Hardware-software co-synthesis is the process of partitioning an embedded system specification into hardware and software modules to meet performance, cost and reliability goals. In this paper, we address the problem of hardware-software co-synthesis of fault-tolerant real-time heterogeneous distributed embedded systems. Fault detection capability is imparted to the embedded system by adding assertion and duplicate-and-compare tasks to the task graph specification prior to cosynthesis. the reliability and availability of the architecture are evaluated during co-synthesis. Our algorithm allows the user to specify multiple types of assertions for each task. It uses the assertion or combination of assertions which achieves the required fault coverage without incurring too much overhead. We propose new methods to: 1) perform fault tolerance based task clustering 2) derive the best error recovery topology using a small number of extra processing elements, 3) exploit multi-dimensional assertions, and 4) share assertions to reduce the fault tolerance overhead. Our algorithm can tackle multirate systems commonly found in multimedia applications. Application of the proposed algorithm to several real-life telecom transport system examples shows its efficacy.
暂无评论