ISBN: 9781665420594 (print)
As the design space for high-performance computer (HPC) systems grows larger and more complex, modeling and simulation (MODSIM) techniques become more important for optimizing systems. Furthermore, recent extreme-scale systems and newer technologies can lead to higher system fault rates, which negatively affect system performance and other metrics. Therefore, it is important for system designers to consider the effects of faults and fault-tolerance (FT) techniques on system design through MODSIM. BE-SST is an existing MODSIM methodology and workflow that facilitates preliminary exploration and reduction of large design spaces, particularly by highlighting areas of the space for detailed study and pruning less optimal areas. This paper presents the overall methodology for adding fault-tolerance awareness (FT-awareness) into BE-SST. We present the process used to extend BE-SST, enabling the creation of models that predict the time needed to perform a checkpoint instance for the given system configuration. Additionally, this paper presents a case study where a full HPC system is simulated using BE-SST, including application, hardware, and checkpointing. We validate the models and simulation against actual system measurements, finding an average percent error of less than 17% for the instance models and about 20% for system simulation, a level of accuracy acceptable for initial exploration and pruning of the design space. Finally, we show how FT-aware simulation results are used for comparing FT levels in the design space.
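The validation metric mentioned above, average percent error between model predictions and system measurements, can be sketched as follows. This is a minimal illustration and not the BE-SST implementation: the function name and the checkpoint times are hypothetical, and the abstract does not state which exact error formula the authors used.

```python
def mean_abs_percent_error(predicted, measured):
    """Average of |predicted - measured| / |measured|, expressed in percent."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - m) / abs(m) for p, m in zip(predicted, measured))
    return 100.0 * total / len(measured)


# Hypothetical checkpoint-instance times in seconds (not data from the paper).
measured = [10.0, 20.0, 40.0]
predicted = [11.0, 18.0, 44.0]

mape = mean_abs_percent_error(predicted, measured)
print(f"average percent error: {mape:.1f}%")
```

Each hypothetical prediction here is off by 10% of its measurement, so the average is also 10%.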
Over the past decades, several computer codes have been developed for simulation and analysis of thermal-hydraulics and system response in nuclear reactors under operating, abnormal transient, and accident conditions. However, simulation errors and uncertainties still inevitably exist even though these codes have been extensively assessed and used. In this work, a data-driven framework (Optimal Mesh/Model Information System, OMIS) is formulated and demonstrated to estimate simulation error and suggest optimal selection of computational mesh size (i.e., nodalization) and constitutive correlations (e.g., wall functions and turbulence models) for low-fidelity, coarse-mesh thermal-hydraulic simulation, in order to achieve accuracy comparable to that of high-fidelity simulation. Using results from high-fidelity simulations and experimental data with many fast-running low-fidelity simulations, an error database is built and used to train a machine learning model that can determine the relationship between local simulation error and local physical features. This machine learning model is then used to generate insight and help correct low-fidelity simulations for similar physical conditions. The OMIS framework is designed as a modularized six-step procedure and accomplished with state-of-the-art methods and algorithms. A mixed-convection case study was performed to illustrate the entire framework.
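The idea of learning local simulation error from an error database and then using it to select a mesh size can be sketched as follows. This is only an illustration under stated assumptions: the abstract does not name the machine learning model, so a k-nearest-neighbor average stands in for it here, and the feature layout, database values, and function names are all hypothetical.

```python
import math


def predict_error(error_db, features, k=1):
    """Average the errors of the k database entries closest to `features`.

    A k-nearest-neighbor stand-in for the trained model (an assumption).
    Features are compared unscaled, which a real workflow would revisit.
    """
    ranked = sorted(error_db, key=lambda entry: math.dist(entry[0], features))
    nearest = ranked[:k]
    return sum(err for _, err in nearest) / len(nearest)


def coarsest_acceptable_mesh(error_db, mesh_sizes, flow_feature, tolerance, k=1):
    """Return the largest mesh size whose predicted error is within tolerance."""
    acceptable = [
        h for h in mesh_sizes
        if predict_error(error_db, (h, flow_feature), k) <= tolerance
    ]
    return max(acceptable) if acceptable else None


# Hypothetical error database: ((mesh size, local flow feature), local error).
error_db = [
    ((0.1, 1.0), 0.02),
    ((0.2, 1.0), 0.05),
    ((0.4, 1.0), 0.12),
    ((0.8, 1.0), 0.30),
    ((0.2, 3.0), 0.09),
]

best = coarsest_acceptable_mesh(error_db, [0.1, 0.2, 0.4, 0.8], 1.0, tolerance=0.06)
print(best)
```

With these made-up values the selection returns the 0.2 mesh, the coarsest one whose looked-up error stays under the 0.06 tolerance.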
ISBN: 9781424481804 (print)
The characterization of software performance (SWP) in complex, service-oriented architecture (SOA)-based system of systems (SoS) environments is an emergent study area. This report focuses on both qualitative and quantitative ways of determining the current state of SWP in terms of both test coverage (what has been tested) and confidence (degree of testing) for SOA-based SoS environments. Practical tools and methodologies are offered to aid technical and programmatic managers in the form of a stepwise methodology toward SWP selection. Included are system architecture design considerations, resource limiters of SWP, test event design considerations, organizational and process suggestions toward improved SWP management and a matrix of measurement suggestions.
High-level performance modeling and simulation have become a key ingredient of system-level design as they facilitate early architectural design space exploration. An important precondition for such high-level modeling and simulation methods is that they should yield trustworthy performance estimations. This requires validation (if possible) and calibration of the simulation models, which are two aspects that have not yet been widely addressed in the system-level community. This article presents a number of mechanisms for calibrating both isolated model components and a system-level performance model as a whole. We discuss these model calibration mechanisms in the context of our Sesame system-level simulation framework. Two illustrative case studies will also be presented to indicate the merits of model calibration.
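As a generic illustration of what calibrating a model component can mean, the sketch below fits a single scale factor so that a component's estimated latencies best match reference measurements in the least-squares sense. It is not one of the Sesame calibration mechanisms, which the abstract does not describe; the function name and all numbers are hypothetical.

```python
def fit_scale_factor(estimates, measurements):
    """Least-squares alpha minimizing sum((alpha * e - m)^2) over all pairs."""
    if len(estimates) != len(measurements) or not estimates:
        raise ValueError("inputs must be non-empty and of equal length")
    denom = sum(e * e for e in estimates)
    if denom == 0:
        raise ValueError("estimates must not all be zero")
    return sum(e * m for e, m in zip(estimates, measurements)) / denom


# Hypothetical uncalibrated cycle estimates and reference measurements.
estimates = [100.0, 200.0, 300.0]
measurements = [120.0, 240.0, 360.0]

alpha = fit_scale_factor(estimates, measurements)
calibrated = [alpha * e for e in estimates]
print(alpha, calibrated)
```

A single multiplicative factor is the simplest possible calibration knob; a real component model would typically expose several parameters and be fitted against a lower-level reference such as a cycle-accurate simulator or hardware measurements.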