ISBN:
(Print) 9781728101866
The proceedings contain 8 papers. The topics discussed include: a block-oriented, parallel and collective approach to sparse indefinite preconditioning on GPUs; software prefetching for unstructured mesh applications; there are trillions of little forks in the road. choose wisely! — estimating the cost and likelihood of success of constrained walks to optimize a graph pruning pipeline; scale-free graph processing on a NUMA machine; a fast and simple approach to merge and merge sort using wide vector instructions; impact of traditional sparse optimizations on a migratory thread architecture; mix-and-match: a model-driven runtime optimization strategy for BFS on GPUs; and high-performance GPU implementation of PageRank with reduced precision based on mantissa segmentation.
We address the acceleration of the PageRank algorithm for web information retrieval on graphics processing units (GPUs) via a modular precision framework that adapts the data format in memory to the numerical requirements as the iteration converges. In detail, we abandon the IEEE 754 single- and double-precision number representation formats, employed in the standard implementation of PageRank, and instead store the data in memory in specialized formats. Furthermore, we avoid data duplication by leveraging a data layout based on mantissa segmentation. Our evaluation on an NVIDIA V100 graphics card shows acceleration factors of up to 30% with respect to the standard algorithm operating in double precision.
Applications that exhibit regular memory access patterns usually benefit transparently from hardware prefetchers that bring data into the fast on-chip cache just before it is required, thereby avoiding expensive cache misses. Unfortunately, unstructured mesh applications contain irregular access patterns that are often more difficult to identify in hardware. An alternative for such workloads is software prefetching, where special non-blocking instructions load data into the cache hierarchy. However, there are currently few examples in the literature on how to incorporate such software prefetches into existing applications with positive results. This paper addresses these issues by demonstrating the utility and implementation of software prefetching in an unstructured finite volume CFD code of representative size and complexity to an industrial application and across a number of processors. We present the benefits of auto-tuning for finding the optimal prefetch distance values across different computational kernels and architectures and demonstrate the importance of choosing the right prefetch destination across the available cache levels for best performance. We discuss the impact of the data layout on the number of prefetch instructions required in kernels with indirect-access patterns and show how to integrate them on top of existing optimisations such as vectorisation. Through this we show significant full-application speed-ups on a range of processors, such as the Intel Xeon Skylake CPU (15%) as well as the in-order Intel Xeon Phi Knights Corner (1.99×) architecture and the out-of-order Knights Landing (33%) manycore processor.
ISBN:
(Print) 9781728101873; 9781728101866
In this presentation I will introduce the new Adaptive Compute Acceleration Platform. I will show the overall system architecture of the family of devices, including the Arm cores (Scalar Engines), the programmable logic (Adaptable Engines), and the new vector processor cores (AI Engines). I will focus on the new AI Engines in more detail and show the concepts for the programming environment, the architecture, the integration in the total device, and some application domains, including machine learning and 5G wireless applications. I will illustrate the initial design rationale and the architecture trade-offs. These platforms extend the concept of tuning the memory hierarchy to the problem.