ISBN (Print): 9781665411271
Summary form only given, as follows. The complete presentation was not made available for publication as part of the conference proceedings. The high-performance computing (HPC) needs of the US government require advances in architectures to support a wide variety of critical missions. Project 38 is a cross-agency effort between the Department of Defense and the Department of Energy exploring architectural enhancements that will provide increased performance and capabilities for future HPC systems. This talk will provide an overview of some of the explorations that have been conducted as part of this effort, their potential impact, and the path forward.
ISBN (Print): 9781665415576
Field-programmable gate arrays (FPGAs) are becoming promising heterogeneous computing components. Meanwhile, high-level synthesis has been moving FPGA-based development from register-transfer-level design to a high-level-language design flow using the OpenCL and C/C++ programming languages. The performance of binary search applications is often limited by irregular access patterns to off-chip memory. In this paper, we implement the binary search algorithms using OpenCL and evaluate their performance on an Intel Arria 10 based FPGA platform. Based on the evaluation results, we optimize the grid search in XSBench by vectorizing and replicating the binary search kernel. We identify the computational overhead in the implementations of the vectorizable binary search algorithms, and overcome it by grouping work-items into work-groups. Our optimizations improve the performance of the grid search using the classic binary search by a factor of 1.75 on the FPGA.
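The "vectorizable" form of binary search alluded to above is typically a branchless variant: a fixed number of iterations and no data-dependent control flow, so replicated lanes stay in lockstep. The following C++ sketch illustrates that shape on the host side; it is an illustrative example only, not the paper's OpenCL kernel, and the function name `lower_index` is our own.

```cpp
#include <vector>
#include <cstddef>

// Branchless binary search over a sorted grid: returns the index of the
// largest element <= key (or 0 if key precedes the grid). The loop runs
// a data-independent number of steps and replaces the usual branch with
// a conditional increment, which is the property that makes the search
// amenable to SIMD vectorization and kernel replication.
// Illustrative sketch only -- not the authors' kernel.
std::size_t lower_index(const std::vector<double>& grid, double key) {
    std::size_t lo = 0;
    std::size_t len = grid.size();
    while (len > 1) {
        std::size_t half = len / 2;
        // Conditional move instead of a divergent branch.
        if (grid[lo + half] <= key) lo += half;
        len -= half;
    }
    return lo;
}
```

In an OpenCL port, each work-item would run this loop for one query key; grouping work-items into work-groups, as the abstract describes, amortizes the per-lane overhead.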
ISBN (Print): 9781665415576
Throughput-oriented streaming applications on massive data sets are a prime candidate for parallelization on wide-SIMD platforms, especially when inputs are independent of one another. Many such applications are represented as a pipeline of compute nodes connected by directed edges. Here, we study applications with irregular data flow, i.e., those where the number of outputs produced per input to a node is data-dependent and unknown a priori. Moreover, we target these applications to architectures (GPUs) where different nodes of the pipeline execute cooperatively on a single wide-SIMD processor. To promote greater SIMD parallelism, irregular application pipelines can utilize queues to gather and compact multiple data items between nodes. However, the decision to introduce a queue between two nodes must trade off benefits to occupancy against costs associated with queue reading, writing, and management. Moreover, once queues are introduced to an application, their relative sizes impact the frequency with which the application switches between nodes, incurring scheduling and context-switching overhead. This work examines two optimization problems associated with queues. First, we consider which pairs of successive nodes in a pipeline should have queues between them to maximize overall application throughput. Second, given a fixed total budget for queue space, we consider how to choose the relative sizes of inter-node queues to minimize the frequency of switching between nodes. We formulate a dynamic programming approach to the first problem and give an empirically useful approximation to the second that allows for an analytical solution. Finally, we validate our theoretical results using real-world irregular streaming computations.
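The first optimization problem above, deciding which successive node pairs get a queue, has the classic shape of a prefix dynamic program over a linear pipeline. The sketch below illustrates that shape only; the paper's actual objective and recurrence are not given in the abstract, so the cost model here (a fixed per-queue overhead plus a quadratic penalty for long queue-free segments, standing in for lost SIMD occupancy) is entirely hypothetical, as is the function name `min_pipeline_cost`.

```cpp
#include <vector>
#include <limits>
#include <algorithm>

// Prefix DP over a linear pipeline of n nodes. best[j] is the minimum
// cost of scheduling the first j nodes, where every segment boundary
// carries a queue. HYPOTHETICAL cost model: each queue costs
// queue_overhead, and a queue-free segment of length L costs L*L
// (a placeholder for occupancy lost to uncompacted, divergent work).
double min_pipeline_cost(int n, double queue_overhead) {
    std::vector<double> best(n + 1,
                             std::numeric_limits<double>::infinity());
    best[0] = 0.0;  // empty prefix costs nothing
    for (int j = 1; j <= n; ++j) {
        for (int i = 0; i < j; ++i) {
            // Last queue-free segment covers nodes i..j-1.
            double len = static_cast<double>(j - i);
            double boundary = (i == 0) ? 0.0 : queue_overhead;
            best[j] = std::min(best[j],
                               best[i] + boundary + len * len);
        }
    }
    return best[n];
}
```

The tradeoff the abstract describes is visible in this toy model: a cheap queue favors a queue between every pair of nodes, while an expensive one favors fewer, longer queue-free segments.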
ISBN (Print): 9781665415576
AUTODOCK is a molecular docking software widely used in computational drug design. Its time-consuming executions have motivated the development of AUTODOCK-GPU, an OpenCL-accelerated version that can run on GPUs and CPUs. This work discusses the development of AUTODOCK-GPU from a programming perspective, detailing how our design addresses the irregularity of AUTODOCK while pushing towards higher performance. Details on required data transformations, re-structuring of complex functionality, as well as the performance impact of different configurations are also discussed. While AUTODOCK-GPU reaches speedup factors of 341x on a Titan V GPU and 51x on a 48-core Xeon Platinum 8175M CPU, experiments show that performance gains are highly dependent on the molecular complexity under analysis. Finally, we summarize our preliminary experiences when porting AUTODOCK onto FPGAs.
Welcome to the 2020 edition of IA3, the Workshop on Irregular Applications: Architectures and Algorithms. The 10th anniversary edition of IA3 happens at an unprecedented time for humanity, with the COVID-19 pandemic creating disruptions, changes in behavior, and issues in many aspects of society. The pandemic represents a complex data analytics challenge, where new methods are required at all levels of the high-performance system stack to provide insights and actionable knowledge on continuously changing information. Thus, the computing topics addressed by IA3 are more relevant than ever, as reflected by the programs of the workshop and the main SC conference, where analytics, and graph analytics in particular, are a central theme.
ISBN (Print): 9781509038671
The proceedings contain 14 papers. The topics discussed include: highly scalable near memory processing with migrating threads on the Emu system architecture; parallel interval stabbing on the automata processor; an optimized multicolor point-implicit solver for unstructured grid applications on graphics processing units; optimizing sparse tensor times matrix on multi-core and many-core architectures; compiler transformation to generate hybrid sparse computations; an OpenCL framework for distributed apps on a multidimensional network of FPGAs; fast parallel cosine K-nearest neighbor graph construction; performance evaluation of parallel sparse tensor decomposition implementations; implementation and evaluation of data-compression algorithms for irregular-grid iterative methods on the PEZY-SC processor; dynamic load balancing for high-performance graph processing on hybrid CPU-GPU platforms; a fast level-set segmentation algorithm for image processing designed for parallel architectures; HISC/R: an efficient hypersparse-matrix storage format for scalable graph processing; optimized distributed work-stealing; and fine-grained parallelism in probabilistic parsing with Habanero Java.