Performance evaluation of predictive AI models to support medical decisions: Overview and guidance

Authors: van Calster, Ben; Collins, Gary S.; Vickers, Andrew J.; Wynants, Laure; Kerr, Kathleen F.; Barreñada, Lasai; Varoquaux, Gael; Singh, Karandeep; Moons, Karel G.M.; Hernandez-Boussard, Tina; Timmerman, Dirk; McLernon, David J.; van Smeden, Maarten; Steyerberg, Ewout W.

Affiliations: Dept of Development and Regeneration, KU Leuven, Leuven, Belgium; Dept of Biomedical Data Sciences, Leiden University Medical Center, Leiden, Netherlands; Centre for Statistics in Medicine, Nuffield Department of Orthopaedics, Rheumatology and Musculoskeletal Sciences, University of Oxford, United Kingdom; Department of Epidemiology and Biostatistics, Memorial Sloan Kettering Cancer Center, New York, NY, United States; Department of Epidemiology, CAPHRI Care and Public Health Research Institute, Maastricht University, Maastricht, Netherlands; Department of Biostatistics, University of Washington School of Public Health, Seattle, WA, United States; Parietal project team, INRIA Saclay-Île-de-France, Palaiseau, France; Division of Biomedical Informatics, Department of Medicine, University of California, San Diego, CA, United States; Julius Centre for Health Sciences and Primary Care, University Medical Centre Utrecht, Utrecht University, Utrecht, Netherlands; Department of Biomedical Data Science, Stanford University, Stanford, CA, United States; Department of Obstetrics and Gynecology, University Hospitals Leuven, Leuven, Belgium; Institute of Applied Health Sciences, University of Aberdeen, Aberdeen, United Kingdom

Publication: arXiv

Year: 2024

Abstract: A myriad of measures to illustrate the performance of predictive artificial intelligence (AI) models have been proposed in the literature. Selecting appropriate performance measures is essential for predictive AI models that are developed to be used in medical practice, because poorly performing models may harm patients and lead to increased costs. We aim to assess the merits of classic and contemporary performance measures when validating predictive AI models for use in medical practice. We focus on models with a binary outcome. We discuss 32 performance measures covering five performance domains (discrimination, calibration, overall, classification, and clinical utility) along with accompanying graphical assessments. The first four domains cover statistical performance; the fifth covers decision-analytic performance. We explain why two key characteristics are important when selecting which performance measures to assess: (1) whether the measure's expected value is optimized when it is calculated using the correct probabilities (i.e., a proper measure), and (2) whether the measure reflects purely statistical performance or decision-analytic performance by properly considering misclassification costs. Seventeen measures exhibit both characteristics, fourteen measures exhibit one characteristic, and one measure possesses neither characteristic (the F1 measure). All classification measures (such as classification accuracy and F1) are improper for clinically relevant decision thresholds other than 0.5 or the prevalence. We recommend the following measures and plots as essential to report: AUROC, a calibration plot, a clinical utility measure such as net benefit with decision curve analysis, and a plot with probability distributions per outcome category.
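Illustration (not from the paper): the snippet below is a minimal Python sketch of how two of the recommended measures could be computed for a binary-outcome model, assuming hypothetical arrays y (observed outcomes) and p (predicted risks). It uses scikit-learn only for the AUROC; the net_benefit helper is a hypothetical implementation of the standard net-benefit quantity NB(t) = TP/n - (FP/n) * t/(1 - t) plotted in decision curve analysis.

import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical data: y = observed binary outcomes, p = predicted risks in (0, 1).
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
p = np.clip(0.3 * y + rng.normal(0.4, 0.15, size=500), 0.01, 0.99)

# Discrimination: area under the ROC curve (AUROC).
auroc = roc_auc_score(y, p)

# Clinical utility: net benefit at decision threshold t,
# NB(t) = TP/n - (FP/n) * t / (1 - t).
def net_benefit(y, p, t):
    n = len(y)
    tp = np.sum((p >= t) & (y == 1))
    fp = np.sum((p >= t) & (y == 0))
    return tp / n - (fp / n) * t / (1 - t)

thresholds = np.linspace(0.05, 0.50, 10)
nb_model = [net_benefit(y, p, t) for t in thresholds]
nb_treat_all = [net_benefit(y, np.ones_like(p), t) for t in thresholds]  # "treat all" reference strategy

print(f"AUROC: {auroc:.3f}")
for t, nb_m, nb_a in zip(thresholds, nb_model, nb_treat_all):
    print(f"t={t:.2f}  net benefit (model)={nb_m:.3f}  (treat all)={nb_a:.3f}")

Plotting nb_model and nb_treat_all against the thresholds (together with the "treat none" line at zero) gives a basic decision curve; calibration plots and per-outcome probability distributions would be produced from the same y and p arrays.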
