Vendor libraries are tuned for a specific architecture and are not portable to others. Moreover, they lack support for heterogeneity and multi-device orchestration, which is required for efficient use of contemporary ...
ISBN (digital): 9798350355543
ISBN (print): 9798350355550
Since 2011, LUNARC has aimed to provide an interactive HPC environment for its resource users. Several different architectures have been used, but since 2013, we have been using a remote desktop environment based on Cendio’s ThinLinc [1] combined with a custom backend framework, GfxLauncher [2], supporting hardware-accelerated graphics applications and Jupyter Notebooks [3] submitted to the backend cluster.
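As a rough illustration of the workflow such a backend framework automates (this is not GfxLauncher's actual interface; the partition name, file paths, and port below are made up), the sketch submits a Jupyter Notebook server as a SLURM batch job and returns the job id the frontend would then track:

    # Illustrative sketch only: submit a Jupyter Notebook server to a SLURM
    # cluster and report the job id. Partition, paths, and port are assumptions.
    import subprocess
    import tempfile

    JOB_SCRIPT = """#!/bin/bash
    #SBATCH --job-name=jupyter-session
    #SBATCH --partition=gpu          # hypothetical partition name
    #SBATCH --gres=gpu:1
    #SBATCH --time=04:00:00
    # Record where the server runs so the frontend can connect, then start it.
    hostname > "$HOME/jupyter_host.txt"
    jupyter notebook --no-browser --ip=0.0.0.0 --port=8888
    """

    def submit_notebook_job() -> str:
        """Submit the job script with sbatch and return the SLURM job id."""
        with tempfile.NamedTemporaryFile("w", suffix=".sh", delete=False) as f:
            f.write(JOB_SCRIPT)
            script_path = f.name
        result = subprocess.run(
            ["sbatch", "--parsable", script_path],
            check=True, capture_output=True, text=True,
        )
        return result.stdout.strip()

    if __name__ == "__main__":
        print("Submitted job", submit_notebook_job())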
Enhancing and reconstructing environmental images involves refining visual data to improve quality and reconstruct scenes. In remote sensing, this aids in accurate analysis, contributing to advanced understanding an...
ISBN (print): 9781665497473
With the memory wall becoming increasingly problematic in high-performance computing, there is a steady push to improve memory architectures, mainly focusing on better bandwidth as well as latency. One result of this push is the development of High-Bandwidth Memory (HBM), an alternative to the regular DRAM typically used by accelerator cards. This work adapts an existing accelerator architecture for inference on Sum-Product Networks (SPNs) to exploit the HBM present on more recent high-performance FPGA accelerator cards. The evaluation shows that the use of HBM enables almost linear performance scaling due to the embarrassingly parallel nature of batch-wise SPN inference. It is also shown that the only hindrance to this scaling is the limited bandwidth available for data transfers between host and FPGA. Even with this bottleneck, the prior FPGA-based implementation is outperformed by up to 1.50x (geo.-mean 1.29x). Similarly, the CPU and GPU baselines are outperformed by up to 2.4x (geo.-mean 1.6x) and 8.4x (geo.-mean 6.9x), respectively. Based on the evaluation, the scaling potential of HBM-based FPGA accelerators is explored to give an outlook on what is to come with future generations of PCIe-based interfaces.
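To see why batch-wise SPN inference scales almost linearly, consider the toy sketch below: a hypothetical two-variable SPN with made-up weights, evaluated in software rather than on an FPGA. Each sample is evaluated independently, so the batch can simply be split across workers, mirroring how the accelerator splits it across HBM channels:

    # Toy SPN evaluated over a batch split into independent chunks.
    # Structure and probabilities are illustrative, not from the paper.
    from concurrent.futures import ProcessPoolExecutor
    import numpy as np

    def spn_likelihood(x):
        """Evaluate a tiny hand-built SPN for one binary sample x = (x0, x1)."""
        # Leaf nodes: Bernoulli likelihoods for x0 and x1.
        l0a, l0b = (0.8 if x[0] else 0.2), (0.3 if x[0] else 0.7)
        l1a, l1b = (0.6 if x[1] else 0.4), (0.1 if x[1] else 0.9)
        # Two product nodes over disjoint scopes, combined by a weighted sum node.
        return 0.5 * (l0a * l1a) + 0.5 * (l0b * l1b)

    def infer_chunk(chunk):
        # Each chunk is fully independent: the embarrassingly parallel part.
        return [spn_likelihood(x) for x in chunk]

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        batch = rng.integers(0, 2, size=(4096, 2))
        chunks = np.array_split(batch, 8)      # e.g. one chunk per memory channel
        with ProcessPoolExecutor(max_workers=8) as pool:
            results = np.concatenate([np.asarray(r) for r in pool.map(infer_chunk, chunks)])
        print("first likelihoods:", results[:4])

Because no data is shared between chunks, throughput grows with the number of workers until the host-to-device link saturates, which is exactly the bottleneck the abstract identifies.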
This article designs and implements a runtime library for general dataflow programming, DFCPP (Luo Q, Huang J, Li J, Du Z. Proceedings of the 52nd International Conference on Parallel Processing Workshops. ACM; 2023:145-152.), and builds upon it to design and implement a multi-machine C++ dataflow library, M-DFCPP. Compared to existing dataflow programming environments, DFCPP features a user-friendly interface and richer expressive capabilities (Luo Q, Huang J, Li J, Du Z. Proceedings of the 52nd International Conference on Parallel Processing Workshops. ACM; 2023:145-152.), enabling the representation of various types of dataflow actor tasks (static, dynamic, and conditional tasks). In addition, DFCPP addresses memory management and task scheduling for non-uniform memory access architectures, issues that other dataflow libraries largely neglect. M-DFCPP extends the capability of current dataflow runtime libraries (DFCPP, taskflow, openstream, etc.) to multi-machine computing while keeping its API compatible with DFCPP. M-DFCPP adopts the concepts of master and follower (Dean J, Ghemawat S. Commun ACM. 2008;51(1):107-113; Ghemawat S, Gobioff H, Leung ST. ACM SIGOPS Operating Systems Review. ACM; 2003:29-43.), which form a work-sharing framework as in many multi-machine systems. To shift to the M-DFCPP framework, a filtering layer is inserted into the original DFCPP, transforming it into followers that can cooperate with each other. The master consists of modules for scheduling, data processing, graph partitioning, state management, and so forth. In benchmark tests with directed-acyclic-graph workloads based on binary trees and random graphs, DFCPP demonstrated performance improvements of 20% and 8%, respectively, over the second-fastest library. M-DFCPP consistently exhibits strong performance across varying levels of concurrency and task workloads, achieving a maximum speedup of more than 20 over DFCPP when the task parallelism e...
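The sketch below is a generic illustration of the three actor kinds mentioned above (static, conditional, and dynamic tasks) driven by a toy work queue; it does not use DFCPP's or M-DFCPP's actual API, and all task names are hypothetical:

    # Generic dataflow-actor illustration, not the DFCPP interface.
    from collections import deque

    def run(root_task):
        """Very small stand-in for a dataflow scheduler: a FIFO work queue."""
        queue = deque([root_task])
        while queue:
            task = queue.popleft()
            queue.extend(task())   # each task returns the tasks it enables next

    def static_task():
        # Static actor: its successor set is fixed when the graph is built.
        print("static task done, enabling its fixed successor")
        return [conditional_task]

    def conditional_task():
        # Conditional actor: chooses which successor to enable at run time.
        take_left = sum(range(10)) % 2 == 1
        print("conditional task chose the", "left" if take_left else "right", "branch")
        return [dynamic_task] if take_left else []

    def dynamic_task():
        # Dynamic actor: spawns an amount of work only known at run time.
        print("dynamic task spawning three children")
        return [lambda i=i: print(f"child {i} done") or [] for i in range(3)]

    if __name__ == "__main__":
        run(static_task)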
Power has become a key limiting factor in supercomputing. Understanding the power signatures of current production workloads is essential to address this limit and continue to advance scientific computing at scale. Th...
ISBN (print): 9781665417303
Among the many details that users must consider when using cloud computing, taking care not to waste resources deserves more attention from administrators and new users. When an application does not fully utilize the provisioned resources, the end-of-the-month bill is unnecessarily inflated. Several studies have developed solutions to avoid wastage using predictive techniques. Nonetheless, these approaches require applications to have predictable behavior and depend on pre-executions or historical data. To circumvent these limitations, we explore how a reactive solution can be used to detect and contain wastage. More specifically, we discuss several important issues that arise when quantifying the resource wastage caused by HPC applications on the cloud and propose a reactive strategy to quantify, detect, and contain resource wastage in this context. The solution is designed so that it can be applied in environments with both expert and non-expert users, with no prior knowledge of the applications.
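A minimal sketch of such a reactive strategy, assuming a Unix host, an arbitrary 30% utilisation threshold, and a placeholder containment action, samples load periodically and flags sustained under-use without any history or pre-execution:

    # Reactive under-utilisation monitor sketch; thresholds and the
    # containment action are assumptions, not the paper's implementation.
    import os
    import time

    WASTE_THRESHOLD = 0.30   # flag if under 30% of provisioned cores are busy
    WINDOW = 5               # consecutive low samples before acting
    INTERVAL_S = 10

    def utilisation() -> float:
        """1-minute load average normalised by the number of provisioned cores."""
        return os.getloadavg()[0] / os.cpu_count()

    def monitor():
        low_samples = 0
        while True:
            if utilisation() < WASTE_THRESHOLD:
                low_samples += 1
            else:
                low_samples = 0
            if low_samples >= WINDOW:
                # A real containment step (e.g. asking the cloud API to
                # downscale the instance) would go here.
                print("sustained under-utilisation detected: consider downscaling")
                low_samples = 0
            time.sleep(INTERVAL_S)

    if __name__ == "__main__":
        monitor()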
ISBN (print): 9781665420273
Uncertainty quantification measures a neural network's prediction uncertainty when facing out-of-training-distribution samples. Bayesian Neural Networks (BNNs) can provide high-quality uncertainty quantification by introducing specific noise into the weights during inference. To accelerate BNN inference, the ReRAM processing-in-memory (PIM) architecture is a competitive solution, providing both highly efficient computing and in-situ noise generation at the same time. However, there is normally a large gap between the noise generated in PIM hardware and that required by a BNN model. We demonstrate that the quality of uncertainty quantification is substantially degraded by this gap. To solve this problem, we propose a holistic framework called W2W-PIM. We first introduce an efficient method to generate noise in the ReRAM PIM design according to the demands of the BNN model. In addition, the PIM architecture is carefully modified to enable noise generation and to evaluate uncertainty quality. Moreover, a calibration unit is introduced to reduce the noise gap caused by imperfections in the noise model. Comprehensive evaluation results demonstrate that the W2W-PIM framework achieves high-quality uncertainty quantification and high energy efficiency at the same time.
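As a software-level sketch of the underlying idea (the network size, Gaussian noise scale, and Monte Carlo sample count are illustrative assumptions, not the paper's PIM design), the example below perturbs the weights on every forward pass and reports predictive entropy as the uncertainty estimate:

    # Noise-injected BNN inference sketch: the noise stands in for the
    # device-generated noise a ReRAM PIM array would supply in hardware.
    import numpy as np

    rng = np.random.default_rng(0)
    W1, b1 = rng.normal(size=(16, 8)), np.zeros(8)
    W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
    NOISE_STD = 0.05   # assumed weight-noise scale

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    def noisy_forward(x):
        """One stochastic forward pass with perturbed weights."""
        w1 = W1 + rng.normal(scale=NOISE_STD, size=W1.shape)
        w2 = W2 + rng.normal(scale=NOISE_STD, size=W2.shape)
        h = np.maximum(x @ w1 + b1, 0.0)       # ReLU hidden layer
        return softmax(h @ w2 + b2)

    def predict_with_uncertainty(x, samples=64):
        probs = np.mean([noisy_forward(x) for _ in range(samples)], axis=0)
        entropy = -np.sum(probs * np.log(probs + 1e-12))   # predictive entropy
        return probs, entropy

    if __name__ == "__main__":
        x = rng.normal(size=16)
        probs, entropy = predict_with_uncertainty(x)
        print("mean prediction:", np.round(probs, 3), "uncertainty:", round(entropy, 3))

If the injected noise is much larger or smaller than the noise the model was trained to expect, the averaged predictions and the entropy both shift, which is the noise gap the calibration unit is meant to close.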
Alzheimer’s disease (AD) is a progressive neurodegenerative disorder with an annual global economic impact of approximately $1 trillion. Early diagnosis is crucial to mitigate disease progression, yet current detecti...