Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation to memory access ratio. While these characteristics are poorly matched to conventional micropro...
详细信息
Media applications are characterized by large amounts of available parallelism, little data reuse, and a high computation to memory access ratio. While these characteristics are poorly matched to conventional microprocessor architectures, they are a good fit for modern VLSI technology with its high arithmetic capacity but limited global bandwidth. The stream programming model, in which an application is coded as streams of data records passing through computation kernels, exposes both parallelism and locality in media applications that can be exploited by VLSI architectures. The Imagine architecture supports the stream programming model by providing a bandwidth hierarchy tailored to the demands of media applications. Compared to a conventional scalar processor. Imagine reduces the global register and memory bandwidth required by typical applications by factors of 13 and 21 respectively. This bandwidth efficiency enables a single chip Imagine processor to achieve a peak performance of 16.2GFLOPS (single-precision floating point) and sustained performance of up to 8.5GFLOPS on media processing kernels.
Branch prediction is a key mechanism used to achieve high performance on multiple issue, deeply pipelined processors. By predicting the branch outcome at the instruction fetch stage of the pipeline, superscalar proces...
详细信息
Branch prediction is a key mechanism used to achieve high performance on multiple issue, deeply pipelined processors. By predicting the branch outcome at the instruction fetch stage of the pipeline, superscalar processors become able to exploit Instruction Level parallelism (ILP) by providing a larger window of instructions. However, when a branch is mispredicted, instructions from the mispredicted path must be discarded. Therefore, branch prediction accuracy is critical to achieve high performance. Existing branch prediction schemes can accurately predict the direction of conditional branches, but have difficulties predicting the correct targets of indirect branches. Indirect branches occur frequently in Object-Oriented Languages (OOL), as well as in Dynamically-Linked Libraries (DLLs), two programming environments rapidly increasing in popularity. In addition, certain language constructs such as multi-way control transfers (e.g., switches), and architectural features such as 64-bit address spaces, utilize indirect branching. In this paper, we describe a new algorithm for predicting unconditional indirect branches called Prediction by Partial Matching (PPM). We base our approach on techniques proven to work optimally in the field of data compression. We combine a viable implementation of the PPM algorithm with dynamic per-branch selection of path-based correlation and compare its prediction accuracy against a variety of predictors. Our results show that, for approximately the same hardware budget, the combined predictor can achieve a misprediction ratio of 9.47%, as compared to 11.48% for the previously published most accurate indirect branch predictor.
We prove a number of negative results about practical (i.e., numerically accurate) algorithms for certain matrix factorizations. In particular, we prove that the popular Givens' method for computing the QR decompo...
详细信息
ISBN:
(纸本)9780897918909
We prove a number of negative results about practical (i.e., numerically accurate) algorithms for certain matrix factorizations. In particular, we prove that the popular Givens' method for computing the QR decomposition is inherently sequential over the realistic model of floating point arithmetic. We also prove a number of additional results concerning Gaussian Elimination for computing the LU decomposition. Altogether, the results of this paper support the widespread belief that there is a tradeoff between parallelism and accuracy in numerical algorithms.
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP and LOGP) account for bandwidth limitations...
详细信息
Recently there has been an increasing interest in models of parallel computation that account for the bandwidth limitations in communication networks. Some models (e.g., BSP and LOGP) account for bandwidth limitations using a per-processor parameter g>1, such that each processor can send/receive at most h messages in g·h time. Other models (e.g., PRAM(m)) account for bandwidth limitations as an aggregate parameter mΩ(√lg p) separation known previously.
This paper describes the derivation of an empirically efficient parallel two-dimensional Delaunay triangulation program from a theoretically efficient CREW PRAM algorithm. Compared to previous work, the resulting impl...
详细信息
ISBN:
(纸本)9780897918909
This paper describes the derivation of an empirically efficient parallel two-dimensional Delaunay triangulation program from a theoretically efficient CREW PRAM algorithm. Compared to previous work, the resulting implementation is not limited to datasets with a uniform distribution of points, achieves significantly better speedups over good serial code, and is widely portable due to its use of MPI as a communication mechanism. Results are presented for a loosely-coupled cluster of workstations, a distributed-memory multicomputer, and a shared-memory multiprocessor. The Machiavelli toolkit used to transform the nested data parallelism inherent in the divide-and-conquer algorithm into achievable task and data parallelism is also described and compared to previous techniques.
暂无评论