ISBN: 9781479984909 (print)
In recent years, high-performance computing has benefited greatly from special accelerator cards such as GPUs. Matrix multiplication, as performed by the BLAS function DGEMM, is one of the prime examples where such accelerators excel. DGEMM is the computational hotspot of many tasks, among them the Linpack benchmark. Current GPUs achieve more than 1 TFlop/s of real performance in this task. Because GPUs are connected via PCI Express, multiple GPUs can easily be installed in a single compute node. This enables the construction of multi-TFlop/s systems out of off-the-shelf components. At such high performance, it is often difficult to feed the GPUs with enough data to keep them running at full speed. In this paper, we first analyze the scalability of our DGEMM implementation for multiple fast GPUs. We then propose a new scheme optimized for this situation and present an implementation.