ArXiv Preprint
Clinical prediction models estimate an individual's risk of a particular
health outcome, conditional on their values of multiple predictors. A developed
model is a consequence of the development dataset and the chosen model building
strategy, including the sample size, number of predictors and analysis method
(e.g., regression or machine learning). Here, we raise the concern that many
models are developed using small datasets that lead to instability in the model
and its predictions (estimated risks). We define four levels of model stability
in estimated risks moving from the overall mean to the individual level. Then,
through simulation and case studies of statistical and machine learning
approaches, we show that instability in a model's estimated risks is often
considerable and ultimately manifests as miscalibration of predictions in new
data. Therefore, we recommend that researchers always examine instability at
the model development stage, and we propose instability plots and measures to
do so. This entails repeating the model building steps (those used
in the development of the original prediction model) in each of multiple (e.g.,
1000) bootstrap samples, to produce multiple bootstrap models, and then
deriving (i) a prediction instability plot of bootstrap model predictions
(y-axis) versus original model predictions (x-axis); (ii) a calibration
instability plot showing calibration curves for the bootstrap models in the
original sample; and (iii) the instability index, which is the mean absolute
difference between individuals' original and bootstrap model predictions. A
case study illustrates how these instability assessments help indicate whether
model predictions are likely to be reliable (or not),
whilst also informing a model's critical appraisal (risk of bias rating),
fairness assessment and further validation requirements.
Richard D Riley, Gary S Collins
2022-11-02
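
To make the bootstrap procedure concrete, the following is a minimal Python sketch of the instability assessments described in the abstract (the prediction instability plot, the calibration instability plot and the instability index). It assumes a binary outcome, logistic regression as the model-building strategy and a simulated development dataset; the build_model() helper, n_boot = 1000 and the binned calibration curves are illustrative choices only, not the authors' implementation.

# Minimal sketch of the bootstrap instability assessment outlined above.
# Assumptions (not from the paper): binary outcome, logistic regression as
# the model-building strategy, simulated development data, binned calibration
# curves; build_model(), n_boot and plot styling are illustrative only.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2022)

# Illustrative development dataset (replace with the actual development data).
n, p = 500, 5
X = rng.normal(size=(n, p))
linear_predictor = X @ np.array([0.8, -0.5, 0.3, 0.0, 0.0]) - 0.5
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-linear_predictor)))

def build_model(X, y):
    """The chosen model-building strategy (here simply logistic regression);
    in practice this should repeat ALL steps used for the original model."""
    return LogisticRegression(max_iter=1000).fit(X, y)

# Original model and its estimated risks for the development individuals.
original_model = build_model(X, y)
p_orig = original_model.predict_proba(X)[:, 1]

# Repeat the model-building steps in each bootstrap sample, then predict for
# the SAME individuals in the original development dataset.
n_boot = 1000
boot_preds = np.empty((n_boot, n))
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)          # resample rows with replacement
    boot_model = build_model(X[idx], y[idx])
    boot_preds[b] = boot_model.predict_proba(X)[:, 1]

# (iii) Instability index: mean absolute difference between each individual's
# original prediction and their bootstrap-model predictions, averaged over
# individuals.
mape_per_person = np.mean(np.abs(boot_preds - p_orig), axis=0)
instability_index = mape_per_person.mean()
print(f"Instability index: {instability_index:.3f}")

# (i) Prediction instability plot: bootstrap model predictions (y-axis)
# versus original model predictions (x-axis).
plt.figure()
for b in range(200):                          # thin to 200 models for legibility
    plt.scatter(p_orig, boot_preds[b], s=1, alpha=0.05, color="steelblue")
plt.plot([0, 1], [0, 1], color="black")
plt.xlabel("Original model prediction")
plt.ylabel("Bootstrap model predictions")
plt.title("Prediction instability plot")
plt.savefig("prediction_instability.png", dpi=150)

# (ii) Calibration instability plot: a calibration curve for each bootstrap
# model, assessed in the original sample (binned here for simplicity).
plt.figure()
for b in range(200):
    obs, pred = calibration_curve(y, boot_preds[b], n_bins=10)
    plt.plot(pred, obs, color="steelblue", alpha=0.05)
plt.plot([0, 1], [0, 1], color="black")
plt.xlabel("Estimated risk")
plt.ylabel("Observed proportion")
plt.title("Calibration instability plot")
plt.savefig("calibration_instability.png", dpi=150)

Note that build_model() should reproduce every data-driven step of the original strategy (e.g., variable selection, tuning, imputation) in each bootstrap sample; omitting such steps would understate the instability.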