• parallelism need not be hard - much easier than traditional concurrent programming
• parallel programming, like programming, is a team effort that requires many different skills and many different tools - coarse-lev...
Nowadays, computing systems accessible to researchers with "Grand Challenge" problems consist of a hardware mixture ranging from clusters of workstations to parallel supercomputers. This hardware is availabl...
ISBN:
(print) 9798400713965
In recent years, Neural Networks (NNs) have become one of the most prevalent topics in computer science, both in research and in industry. NNs are used for data analysis, natural language processing, autonomous driving and more. As such, NNs also see increasing application and use in High-Performance Computing (HPC). At the same time, energy efficiency has become an increasingly critical topic. NNs use large amounts of energy for operation, which in turn results in large amounts of CO2 emissions. This work presents a comprehensive evaluation of current NN inference software and hardware configurations within HPC environments, with a focus on both performance metrics and energy consumption. NN quantization and accelerators such as FPGAs allow for increased inference efficiency, both in terms of throughput and energy. Therefore, this work focuses on FINN, an efficient NN inference framework for FPGAs, highlighting its current lack of support for HPC systems. We provide an in-depth analysis of FINN in order to implement extensions that optimize end-to-end execution for use in HPC environments. We thoroughly evaluate the performance and energy-efficiency gains of the newly implemented optimizations and compare them against existing NN accelerators for HPC. With our extensions of FINN, we achieve a 1847× higher throughput on an Alveo U55C FPGA, while also reducing latency to 0.9978× and EDP to 0.9979× on average. Dataflow-based NN inference accelerators on an FPGA should be used if the performance and energy footprint of the inference process are crucial and the batch sizes are small to medium. For extremely large batch sizes and a very limited network-to-accelerator time (less than a few days), GPUs are still the way to go. Our results show that with the newly developed driver, we outperform a high-end Nvidia A100 GPU by up to 7.81× in throughput, while having a 0.87× lower latency and 0.88× lower energy de
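To make the reported relative metrics concrete, here is a minimal Python sketch, assuming the factors are new-to-baseline ratios and that EDP denotes the energy-delay product (energy per inference times latency). The function name and all absolute numbers are hypothetical illustrations, not values from the paper.

def relative_metrics(thr_base, lat_base, energy_base, thr_new, lat_new, energy_new):
    """Relative accelerator metrics: throughput speedup, latency ratio, and
    EDP ratio, where EDP = energy per inference * latency (assumed definitions;
    all inputs below are made-up illustrative numbers)."""
    return {
        "throughput_x": thr_new / thr_base,                          # >1 means higher throughput
        "latency_x": lat_new / lat_base,                             # <1 means lower latency
        "edp_x": (energy_new * lat_new) / (energy_base * lat_base),  # <1 means better EDP
    }

# hypothetical baseline vs. accelerator: images/s, seconds per inference, joules per inference
print(relative_metrics(5.4, 1.002e-3, 5.2e-3,
                       1.0e4, 1.0e-3, 5.0e-3))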
ISBN:
(digital) 9798331502812
ISBN:
(print) 9798331502829
Computing electron repulsion integrals (ERIs) is the major computational bottleneck of many quantum mechanical simulation methods, requiring trillions of ERI evaluations per time step. While the computation of independent ERIs is embarrassingly parallel, the efficient computation of individual ERIs on modern processor cores is difficult due to both an insufficient cache size for intermediates of the computation and irregular memory access patterns that are difficult to vectorize. In this paper, we present how our implementation on the AI Engine (AIE) architecture addresses both of these problems. First, we have defined a flexible graph structure, which we call an ERI-Engine, that can be implemented for all 231 canonical ERI quartets from {ss|ss} to {hh|hh} by distributing the computation over 2–14 AIEs. Second, for the larger quartets, we have devised a novel vectorization scheme that leverages the advanced floating-point unit of the AIEs, while also supporting vectorization of independent ERIs for the smaller quartets. Finally, ERI-Engines are horizontally and vertically stackable to fill the entire AIE array; in particular, the vertically stacked ERI-Engines form a column that uses one or more time-shared channels to stream the results out of the AIE array, almost completely hiding the computational phases of individual ERI-Engines. In terms of absolute performance, we are competitive with recent high-performance implementations of ERI algorithms on FPGAs (SERI) and GPUs (LibintX), as well as well-established, highly optimized CPU libraries (Libint, Libcint), while being the unequivocal leader in terms of energy efficiency.
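As a quick sanity check on the 231 quartet classes, the following sketch counts canonical (ab|cd) combinations for shells s through h, assuming the usual 8-fold permutational symmetry (a >= b, c >= d, bra >= ket); the variable names are illustrative only.

# Count canonical ERI quartet classes (ab|cd) for shells s,p,d,f,g,h (l = 0..5),
# assuming the standard permutational symmetry: a >= b, c >= d, and bra >= ket.
shells = ["s", "p", "d", "f", "g", "h"]
bra_pairs = [(a, b) for a in range(len(shells)) for b in range(a + 1)]
quartets = [(bra, ket) for i, bra in enumerate(bra_pairs) for ket in bra_pairs[: i + 1]]
print(len(bra_pairs), len(quartets))  # 21 231 -> matches the 231 classes from {ss|ss} to {hh|hh}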
Prototype network-based methods have made substantial progress in few-shot relation extraction (FSRE) by enhancing relation prototypes with relation descriptions. However, the distribution of relations and instances i...
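For readers unfamiliar with prototype networks, here is a minimal sketch of the generic prototypical-network idea these FSRE methods build on (class prototype = mean of support embeddings, nearest-prototype classification). It is not the paper's model; the helper names and toy embeddings are hypothetical.

import numpy as np

def prototypes(support_emb, support_labels):
    """Class prototype = mean of the support embeddings belonging to that class."""
    classes = sorted(set(support_labels))
    labels = np.array(support_labels)
    return classes, np.stack([support_emb[labels == c].mean(axis=0) for c in classes])

def classify(query_emb, protos, classes):
    """Assign each query to the nearest prototype (Euclidean distance)."""
    d = np.linalg.norm(query_emb[:, None, :] - protos[None, :, :], axis=-1)
    return [classes[i] for i in d.argmin(axis=1)]

# toy 5-dim embeddings: 2 relations, 2 support examples each, 1 query instance
rng = np.random.default_rng(0)
sup = rng.normal(size=(4, 5))
classes, protos = prototypes(sup, ["rel_a", "rel_a", "rel_b", "rel_b"])
print(classify(rng.normal(size=(1, 5)), protos, classes))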
The breadth-first search (BFS) algorithm is a fundamental algorithm in graph theory, and its parallelization can significantly improve performance. Therefore, there have been numerous efforts to leverage the powerfu...
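For context, a minimal level-synchronous BFS in Python is sketched below; the per-level frontier expansion marked in the comments is what parallel BFS implementations typically distribute across cores or GPU threads. The function name and toy graph are illustrative only.

def bfs_levels(adj, source):
    """Level-synchronous BFS: returns the hop distance of every vertex reachable from `source`."""
    dist = {source: 0}
    frontier = [source]
    level = 0
    while frontier:
        next_frontier = []
        for u in frontier:              # parallel BFS distributes this frontier
            for v in adj.get(u, ()):    # expansion across threads or GPU warps
                if v not in dist:
                    dist[v] = level + 1
                    next_frontier.append(v)
        frontier = next_frontier
        level += 1
    return dist

# toy directed graph as an adjacency dict
graph = {0: [1, 2], 1: [3], 2: [3], 3: []}
print(bfs_levels(graph, 0))  # {0: 0, 1: 1, 2: 1, 3: 2}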
ISBN:
(print) 9798400714436
Molecular dynamics simulation has emerged as an important area in which HPC+AI helps investigate physical properties, with machine-learning interatomic potentials (MLIPs) being used. General-purpose machine-learning (ML) tools have been leveraged for MLIPs, but the two are not perfectly matched, since many optimization opportunities in MLIPs are missed by ML tools. This inefficiency arises from the fact that HPC+AI applications involve far more computational complexity than pure AI scenarios. This paper develops an MLIP, named TensorMD, independently of any ML tool. TensorMD has been evaluated on two supercomputers and scaled to 51.8 billion atoms, i.e., ~3× compared with the state of the art.
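As background on what an interatomic potential provides to an MD code, here is a minimal sketch of the positions-to-(energy, forces) interface, using a Lennard-Jones pair potential as a stand-in; this is not TensorMD's potential, and all names and numbers are illustrative.

import numpy as np

def pair_potential_energy_forces(positions, epsilon=1.0, sigma=1.0):
    """Generic interatomic-potential interface: positions -> (energy, forces).
    A Lennard-Jones pair potential stands in for a learned potential here; an
    MLIP replaces the energy model but exposes the same kind of interface."""
    n = len(positions)
    energy = 0.0
    forces = np.zeros_like(positions)
    for i in range(n):
        for j in range(i + 1, n):
            rij = positions[i] - positions[j]
            r2 = float(rij @ rij)
            inv6 = (sigma * sigma / r2) ** 3
            energy += 4.0 * epsilon * (inv6 * inv6 - inv6)
            # force on atom i from atom j, with the opposite force on j
            f = 24.0 * epsilon * (2.0 * inv6 * inv6 - inv6) / r2 * rij
            forces[i] += f
            forces[j] -= f
    return energy, forces

pos = np.array([[0.0, 0.0, 0.0], [1.12, 0.0, 0.0], [0.0, 1.12, 0.0]])
e, f = pair_potential_energy_forces(pos)
print(e, f[0])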
Industrial part surface defect detection aims to precisely locate defects in images, which is crucial for quality control in manufacturing. The traditional method needs to be designed in advance, but it has shortcomin...