Study: Machine Learning Used to Analyze 150,000 COVID-19 Patients, Predict Mortality

Researchers using machine learning have found that age, hypertension, insurance status, and hospital site are among the strongest predictors of COVID-19 mortality.

In a paper published in Scientific Reports, researchers detail how they used machine learning to analyze electronic health records from nearly 150,000 patients across 21 healthcare systems.

Dr. Tom Piasecki

“The typical approach to investigating COVID is to select a set of supposed risk factors and then build a statistical model testing how well those risk factors predict severe COVID outcomes,” said Dr. Thomas Piasecki, researcher on the project.

“The problem with this is that we generally don’t know the best way to build the model. For complex phenomena like COVID, we suspect that there will be important interactions between risk factors. For example, the impact of vaccination may differ as a function of patient age or whether they have certain medical conditions. You can build a more complex model to test for this kind of thing. The problem is that, when the number of risk factors you are testing is large, the total number of possible models you could build grows really fast, and it quickly becomes impossible for a human analyst to explore all the possible effects of interest.

“But computers can do the job. Machine learning (ML) represents an efficient statistical approach to sorting through all the possibilities, including higher-order interactions involving numerous risk factors, to produce a parsimonious, highly accurate final prediction model.

“Some previous studies have used ML to predict COVID outcomes with high accuracy, but they often included predictors like lab test results or ICU admission that may be effects of severe COVID rather than pre-existing risk factors. Our study focused on predictors that could be measured prior to hospitalization to identify high-risk groups in a way that has broad public-health implications.”
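The growth Piasecki describes can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes a simplified setup in which every risk factor and every interaction up to a chosen order is a candidate term that can be either included in or left out of a model. The function names are hypothetical.

```python
from math import comb


def candidate_terms(n_factors: int, max_order: int) -> int:
    """Count main effects plus all interactions up to max_order factors."""
    return sum(comb(n_factors, k) for k in range(1, max_order + 1))


def candidate_models(n_factors: int, max_order: int) -> int:
    """Each term is either in or out of a model, giving 2**terms combinations.

    Simplifying assumption: no hierarchy rule that forces main effects
    to accompany their interactions.
    """
    return 2 ** candidate_terms(n_factors, max_order)


if __name__ == "__main__":
    for n in (5, 10, 20):
        terms = candidate_terms(n, max_order=2)
        digits = len(str(candidate_models(n, max_order=2)))
        print(f"{n} factors: {terms} terms, model count has {digits} digits")
```

Even with only pairwise interactions, 20 risk factors produce 210 candidate terms under these assumptions, and the number of possible models is far beyond what an analyst could fit by hand.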

Dr. Wei-Yin Loh

Researchers used an ML algorithm called GUIDE, developed by UW-CTRI collaborator and UW Professor of Statistics Dr. Wei-Yin Loh. GUIDE builds decision trees from multiple variables and determines which variables, in which combinations, were associated with the highest mortality in this case. Each variable's contribution is summarized by an “importance score”: the higher the score, the stronger the association with mortality.
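For readers unfamiliar with tree-based methods, the general idea of scoring variables by how well they separate outcomes can be sketched in a few lines. This is not the GUIDE algorithm and does not reproduce its importance scores. It is a generic, single-split illustration that uses Gini impurity, and the eight records are made up for demonstration and have no connection to the study data. All names are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 outcomes (0.0 means perfectly pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def split_gain(rows, feature, outcome="died"):
    """Drop in impurity when rows are split on one binary feature."""
    yes = [r[outcome] for r in rows if r[feature]]
    no = [r[outcome] for r in rows if not r[feature]]
    n = len(rows)
    after = (len(yes) / n) * gini(yes) + (len(no) / n) * gini(no)
    return gini([r[outcome] for r in rows]) - after


# Made-up toy records: (older, hypertension, arbitrary_flag, died)
raw = [
    (1, 1, 1, 1), (1, 1, 0, 1), (1, 0, 1, 1), (1, 0, 0, 0),
    (0, 1, 1, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 0),
]
features = ["older", "hypertension", "arbitrary_flag"]
rows = [dict(zip(features + ["died"], rec)) for rec in raw]

# Rank features by how much a single split on each one reduces impurity.
scores = {f: split_gain(rows, f) for f in features}
ranking = sorted(scores, key=scores.get, reverse=True)

if __name__ == "__main__":
    for name in ranking:
        print(f"{name}: {scores[name]:.5f}")
```

In this toy data, the split on `older` separates outcomes far better than the other two features, so it receives the largest score. A full tree algorithm repeats this kind of comparison within each resulting subgroup, which is how combinations of variables come into play.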

Age (especially 62 years and older) had the highest importance score, followed by hypertension, insurance status, and the hospital to which a patient was admitted.

These scores reflect data from the full study period, February 1, 2020 to January 31, 2022. Researchers also separately analyzed the subset of data collected only in the second year, January 1, 2021 to January 31, 2022, which yielded more specific results.

In this subgroup, age was still the strongest predictor of mortality, followed by hypertension. Vaccination status (being unvaccinated), hospital site, and race also emerged as predictors of higher mortality.

“One surprising finding was that uncomplicated hypertension emerged as one of the most important risk factors for mortality in this study,” said Piasecki. “Previous COVID studies looked at hypertension with mixed results. The observed effect of hypertension may depend on the choices investigators make when building their models. Here, a systematic, unbiased ML-based search testing a wide range of risk factors identified uncomplicated hypertension among the most important predictors of mortality.

“Another interesting finding was that the ML model identified subgroups of patients defined by different clusters of characteristics that had dramatically different mortality rates. For example, patients aged 46 or younger without hypertension had a six-percent mortality rate whereas patients over 62 with hypertension who were unvaccinated and members of certain minority groups had a 30-percent mortality rate. This variety of death rates across subgroups is not surprising because the ML algorithm is designed to find these different subgroups—but the combinations of risk factors defining the riskiest groups are interesting and would have been difficult to hypothesize a priori.

“I hope these findings motivate specific public health outreach. For example, patients with hypertension might be especially encouraged to get vaccinated given the strong association between hypertension and death in the study. This work can also generate hypotheses that can be followed up using different research designs to try to dig into the specific causal pathways driving the statistical associations seen here. Those kinds of studies are important because they could increase our knowledge of the best ways to prevent or treat COVID in different high-risk groups.”