ISBN: 9798350370621 (print)
Many High Performance Computing (HPC) facilities have developed and deployed frameworks in support of continuous monitoring and operational data analytics (MODA) to help improve efficiency and throughput. Because of the complexity and scale of systems and workflows and the need for low-latency response to address dynamic circumstances, automated feedback and response have the potential to be more effective than current human-in-the-loop approaches, which are laborious and error-prone. Progress has been limited, however, by factors such as the lack of infrastructure and feedback hooks, and successful deployment is often site- and case-specific. In this position paper, we report on the outcomes and plans from a recent Dagstuhl Seminar, seeking to carve a path for community progress in the development of autonomous feedback loops for MODA, based on the established formalism of similar (MAPE-K) loops in autonomous computing and self-adaptive systems. By defining and developing such loops for significant cases experienced across HPC sites, we seek to extract commonalities and develop conventions that will facilitate interoperability and interchangeability with system hardware, software, and applications across different sites, and will motivate vendors and others to provide telemetry interfaces and feedback hooks to enable community development and pervasive deployment of MODA autonomy loops.
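For readers unfamiliar with the MAPE-K formalism named in the abstract, the following is a minimal sketch of one loop iteration (Monitor, Analyze, Plan, Execute over shared Knowledge). It is not taken from the paper: the power-cap scenario, the `Knowledge` class, and the `read_sensor` and `apply_scale` callbacks are all hypothetical stand-ins for whatever telemetry interfaces and feedback hooks a site actually provides.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Knowledge:
    """Shared state consulted by every phase (hypothetical example fields)."""
    power_cap_watts: float
    history: List[float] = field(default_factory=list)


def monitor(read_sensor: Callable[[], float], knowledge: Knowledge) -> float:
    """Monitor: collect one telemetry sample and record it."""
    sample = read_sensor()
    knowledge.history.append(sample)
    return sample


def analyze(sample: float, knowledge: Knowledge) -> bool:
    """Analyze: decide whether the sample violates the power cap."""
    return sample > knowledge.power_cap_watts


def plan(sample: float, knowledge: Knowledge) -> float:
    """Plan: choose a scaling factor that would bring power back to the cap."""
    return knowledge.power_cap_watts / sample


def execute(scale: float, apply_scale: Callable[[float], None]) -> None:
    """Execute: push the planned change through a feedback hook."""
    apply_scale(scale)


def mape_k_step(
    read_sensor: Callable[[], float],
    apply_scale: Callable[[float], None],
    knowledge: Knowledge,
) -> Optional[float]:
    """Run one Monitor-Analyze-Plan-Execute pass; return the action taken, if any."""
    sample = monitor(read_sensor, knowledge)
    if not analyze(sample, knowledge):
        return None  # within the cap, so no response is needed
    scale = plan(sample, knowledge)
    execute(scale, apply_scale)
    return scale
```

In a real deployment the two callbacks would wrap site-specific telemetry and control interfaces, which is where the abstract locates the current lack of common infrastructure.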
ISBN: 9798350383461; 9798350383454 (print)
High Performance Computing (HPC) architectures are evolving at an ever-accelerating pace in order to meet the growing diversity of demands placed on them by changing workloads (e.g., high-fidelity multi-physics simulations and Artificial Intelligence (AI)). As network, compute, and memory technologies expand to accomplish this, power densities and energy needs have also increased. Our goal is to efficiently manage the disparate workflows and workloads required to meet mission needs across our shared HPC resources while remaining within shifting power, energy, and cooling envelopes. To accomplish this, we require holistic insight into how workflow components utilize our HPC resources and their time-varying impact on power, energy, and cooling. Approaches to this vary across the U.S. Department of Energy (DOE) laboratories. At Lawrence Livermore National Laboratory (LLNL) and Sandia National Laboratories (SNL), we have independently taken similar approaches to creating HPC monitoring and analysis systems. LLNL created "Sonar" and SNL created "AppSysFusion", both of which serve as extensible HPC monitoring, analysis, and visualization frameworks (i.e., pluggable frameworks that enable new functionality to be deployed easily). This paper describes the capabilities, pluggable component architecture, and design choices, both historic and current, for the components and the tooling that enables collaborative development across institutions.