Some claim that machine learning technology has the potential to transform healthcare systems, but a study published by The BMJ finds that machine learning models have similar performance to traditional statistical models and share similar uncertainty in making risk predictions for individual patients.
The NHS has invested £250m ($323m; €275m) to embed machine learning in healthcare, but researchers say the level of consistency (stability) within and between models should be assessed before they are used to make treatment decisions for individual patients.
Risk prediction models are widely used in clinical practice. They use statistical techniques alongside information about people, such as their age and ethnicity, to identify those at high risk of developing an illness and make decisions about their care.
Previous research has found that a traditional risk prediction model such as QRISK3 has very good model performance at the population level, but has considerable uncertainty on individual risk prediction.
Some studies claim that machine learning models can outperform traditional models, while others argue that they cannot provide explainable reasons behind their predictions, potentially leading to inappropriate actions.
What’s more, machine learning models often ignore censoring, which occurs when patients are lost during a study (either by error or by being unreachable). A model that ignores censoring assumes these patients are disease free, leading to biased predictions.
To explore these issues further, researchers in the UK, China and the Netherlands set out to assess the consistency of machine learning and statistical techniques in predicting individual level and population level risks of cardiovascular disease and the effects of censoring on risk predictions.
They assessed 19 different prediction techniques (12 machine learning models and seven statistical models) using data from 3.6 million patients registered at 391 general practices in England between 1998 and 2018.
Data from general practices, hospital admission and mortality records were used to test each model’s performance against actual events.
All 19 models yielded similar population level performance. However, cardiovascular disease risk predictions for the same patients varied substantially between models, especially in patients with higher risks.
For example, a patient whose cardiovascular disease risk was 9.5-10.5% according to the traditional QRISK3 model had predicted risks of 2.9-9.2% and 2.4-7.2% under two other models.
Models that ignored censoring (including commonly used machine learning models) substantially underestimated risk of cardiovascular disease.
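A toy calculation can illustrate why ignoring censoring leads to underestimated risk. The sketch below is not the researchers' method: it compares a naive estimate, which treats censored patients as disease free, with a Kaplan-Meier estimate, one standard way of accounting for censoring. The cohort, the function names and the 10 year horizon are all hypothetical.

```python
from collections import Counter

def naive_risk(records, horizon):
    """Share of patients with an observed event by `horizon`.

    Censored patients are counted as disease free, which is the
    assumption a model makes when it ignores censoring.
    """
    events = sum(1 for time, had_event in records
                 if had_event and time <= horizon)
    return events / len(records)

def kaplan_meier_risk(records, horizon):
    """Risk by `horizon` from the Kaplan-Meier survival estimate.

    At each event time, only patients still under follow-up
    (time >= event time) are counted as being at risk.
    """
    event_counts = Counter(time for time, had_event in records
                           if had_event and time <= horizon)
    survival = 1.0
    for t in sorted(event_counts):
        at_risk = sum(1 for time, _ in records if time >= t)
        survival *= 1 - event_counts[t] / at_risk
    return 1 - survival

# Hypothetical cohort: (years of follow-up, True if a cardiovascular
# event was observed, False if the patient was censored).
cohort = [
    (2, True), (3, False), (4, False), (5, True), (6, False),
    (7, False), (8, True), (10, False), (10, False), (10, False),
]

naive = naive_risk(cohort, horizon=10)
km = kaplan_meier_risk(cohort, horizon=10)
print(f"naive 10 year risk: {naive:.3f}")
print(f"Kaplan-Meier 10 year risk: {km:.3f}")
```

In this made-up cohort the naive estimate is 0.30, while the Kaplan-Meier estimate is about 0.42, because patients who left the study early are no longer counted as if they had been followed for the full 10 years.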
The researchers explain that, of the 223,815 patients with a cardiovascular disease risk above 7.5% with QRISK3 (a model that does consider censoring), 57.8% would be reclassified below 7.5% when using another type of model.
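The reclassification figure is a simple proportion, and a minimal sketch shows how such a figure could be computed. The risk values and the function name below are hypothetical and are not taken from the study; only the 7.5% threshold comes from the text.

```python
def reclassified_below(reference_risks, other_risks, threshold=0.075):
    """Fraction of patients above `threshold` under the reference model
    whose risk falls below `threshold` under another model."""
    high_risk_pairs = [(ref, other)
                       for ref, other in zip(reference_risks, other_risks)
                       if ref > threshold]
    if not high_risk_pairs:
        return 0.0
    moved = sum(1 for _, other in high_risk_pairs if other < threshold)
    return moved / len(high_risk_pairs)

# Hypothetical predicted risks for the same five patients.
reference_model = [0.10, 0.08, 0.12, 0.05, 0.09]  # stands in for a reference model
other_model = [0.06, 0.09, 0.07, 0.04, 0.03]      # stands in for another model

share = reclassified_below(reference_model, other_model)
print(f"reclassified below threshold: {share:.0%}")
```

Here four patients are above 7.5% under the reference model and three of them fall below it under the other model, so the share is 75%.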
The researchers acknowledge some limitations in comparing the different models, such as the fact that more predictors could have been considered. However, they point out that their results remained similar after more detailed analyses, suggesting that they withstand scrutiny.
“A variety of models predicted risks for the same patients very differently despite similar model performances,” they write. “Consequently, different treatment decisions could be made by arbitrarily selecting another modelling technique.”
As such, they suggest these models “should not be directly applied to the prediction of long term risks without considering censoring” and that the level of consistency within and between models “should be routinely assessed before they are used to inform clinical decision making.”
British Medical Journal