Details
ISBN (Print): 9781467385435
Current general-purpose processors are augmented with vector instructions that can process many elements of matrices and vectors in parallel. Transposing a matrix in place is a key kernel operation required by many scientific and engineering applications to shuttle data before, during, or after processing. This operation increases the traffic on the memory bus, and hence clever techniques such as blocking are required to enhance performance. In this paper, we present an enhanced version of a previously published algorithm for transposing a matrix on a two-dimensional processor array. We restructured this algorithm to fit the one-dimensional vector register architecture that augments general-purpose CPUs. We implemented the new vector algorithm using the Intel SSE4 vector instruction set and compared its performance with the standard sequential algorithm as well as an existing implementation of Eklundh's algorithm. We also studied automatic compiler optimizations and their effect on the vectorization of the algorithm. The best of our implementations showed a maximum speedup of 1.6 over the sequential algorithm, and nearly equal performance compared with the Eklundh's algorithm implementation.
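To make the register-level building block concrete, the following is a minimal sketch (not the authors' implementation) of how a blocked transpose typically handles one 4x4 single-precision tile with SSE intrinsics; an in-place variant would apply this to pairs of tiles mirrored across the diagonal and swap them. The function name, strides, and test harness are illustrative assumptions.

/* Hypothetical sketch: transpose one 4x4 float tile in registers with SSE.
 * A blocked in-place transpose would load the tile at (i,j) and (j,i),
 * transpose both, and store them swapped. */
#include <stdio.h>
#include <xmmintrin.h>  /* SSE: _mm_loadu_ps, _MM_TRANSPOSE4_PS, _mm_storeu_ps */

/* Transpose the 4x4 tile at src (row stride lda floats) into dst (row stride ldb). */
static void transpose4x4_sse(const float *src, int lda, float *dst, int ldb)
{
    __m128 r0 = _mm_loadu_ps(src + 0 * lda);
    __m128 r1 = _mm_loadu_ps(src + 1 * lda);
    __m128 r2 = _mm_loadu_ps(src + 2 * lda);
    __m128 r3 = _mm_loadu_ps(src + 3 * lda);
    _MM_TRANSPOSE4_PS(r0, r1, r2, r3);      /* 4x4 transpose entirely in registers */
    _mm_storeu_ps(dst + 0 * ldb, r0);
    _mm_storeu_ps(dst + 1 * ldb, r1);
    _mm_storeu_ps(dst + 2 * ldb, r2);
    _mm_storeu_ps(dst + 3 * ldb, r3);
}

int main(void)
{
    float a[4][4], b[4][4];
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            a[i][j] = (float)(i * 4 + j);

    transpose4x4_sse(&a[0][0], 4, &b[0][0], 4);

    for (int i = 0; i < 4; ++i) {
        for (int j = 0; j < 4; ++j)
            printf("%5.1f ", b[i][j]);   /* prints the transpose of a */
        printf("\n");
    }
    return 0;
}

Working on whole tiles this way keeps memory accesses contiguous within each row of the tile, which is the main reason blocking reduces pressure on the memory bus compared with the element-by-element sequential transpose.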