Using machine learning to predict COVID-19 infection and severity risk among 4510 aged adults: a UK Biobank cohort study

Study design and participants

This retrospective study involved the UK Biobank cohort.¹². UK Biobank consists of approximately 500,000 people now aged 50 to 84 years (mean age = 69.4 years). Baseline data was collected in 2006–2010 at 22 centers across the United Kingdom^13.14. Summary data are listed in Table 1. This research involved deidentified epidemiological data. All UK Biobank participants gave written, informed consent. Ethics approval for the UK Biobank study was obtained from the National Health Service Health Research Authority North West—Haydock Research Ethics Committee (16/NW/0274), in accordance with relevant guidelines and regulations from the Declarations of Helsinki. All analyzes were conducted in line with UK Biobank requirements.

Table 1 Baseline Demographics and Data Characteristics. Blood pressure (BP); high-density lipoprotein (HDL); low-density lipoprotein (LDL). A summary and comparison of data among either all participant test cases or a sub-group of test cases that also had non-COVID-19 serology. All retrospective baseline data has italics. Values are in mean ± SD, percentages, or frequency. P values less than 0.05 were considered significant and applicable predictors and indices are bolded.

The following categories of predictors were downloaded: (1) demographics; (2) health behaviors and long-term disability or illness status; (3) anthropometric and bioimpedance measures of fat, muscle, or water content; (4) pulse and blood pressure; (5) a serum panel of thirty biochemistry markers commonly collected in a clinic or hospital setting; and (6) a complete blood count with a manual differential.

demographics

These factors included participant age in years at baseline, sex, education qualifications, ethnicity, and Townsend Deprivation Index. Sex was coded as 0 for female and 1 for male. For education, higher scores roughly correspond to progressively more skilled trade/vocational or academic training. Ethnicity was coded as UK citizens who identified as White, Black/Black British, or Asian/Asian British. The Townsend Index^fifteen is a standardized score indicating relative degree of deprivation or poverty based on permanent address.

Health behaviors and conditions

This category consisted of self-reported alcohol status, smoking status, a subjective health rating on a 1–4 Likert scale (“Excellent” to “Poor”), and whether the participant had a self-described long-term medical condition. As noted in Table 1, 48.4% of participants indicated having such an ailment. We independently confirmed self-reported data with ICD-10 codes while at hospital. These conditions included all-cause dementia and other neurological disorders, various cancers, major depressive disorder, cardiovascular or cerebrovascular diseases and events, cardiometabolic diseases (eg, type 2 diabetes), renal and pulmonary diseases, and other so-called pre-existing conditions. .

vital signs

The first automated reading of pulse, diastolic and systolic blood pressure at the baseline visit were used.

Body morphometrics and compartment mass

Anthropometric measures of adiposity (Body Mass Index, waist circumference) were derived as described¹⁶. Data also included bioelectrical impedance metrics that estimate central body cavity (ie, trunk) and whole body fat mass, fat-free muscle mass, or water content¹⁷.

Blood biochemistry and immunology

Serum biomarkers were assayed from baseline samples as described¹⁸. Briefly, using immunoassay or clinical chemistry devices, spectrophotometry was used to initially quantify values for 34 biochemistry analytes. UK Biobank deemed 30 of these markers to be suitably robust. We rejected a further 4 markers due to data missingness > 70% (estradiol, rheumatoid factor), or because there was strong overlap with multicollinear variables that had more stable distributions or trait-like qualities (glucose rejected vs. glycated hemoglobin/hba1c; direct bilirubin rejected vs. total bilirubin). A complete blood count with a manual differential was separately processed for red and white blood cell counts, as well as white cell sub-types.

Serology measures for non COVID-19 infectious diseases

As described (http://biobank.ctsu.ox.ac.uk/crystal/crystal/docs/infdisease.pdf), among 9695 randomized UK Biobank participants selected from the full 500,000 participant cohort, baseline serum was thawed and pathogen-specific assays run in parallel using flow cytometry on a Luminex bead platform¹⁹.

Here, the goal of the multiplex serology panel was to measure multiple antibodies against several antigens for different pathogens, reducing noise and estimating the prevalence of prior infection and seroconversion in at least UK Biobank. All measures were initially confirmed in serum samples using gold-standard assays with median sensitivity and specificity of 97.0% and 93.7%, respectively. Antibody load for each pathogen-specific antigen was quantified using median fluorescence intensity (MFI). Because seropositivity is difficult to assess for several pathogens, we did not use pathogen prevalence as a predictor in models.

Table 2 shows the selected pathogens, their respective antigens, estimated prevalence of each pathogen based roughly on antibody titers, and assay values. This array ranges from delta-type retroviruses like human T-cell lymphotropic virus 1 that are rare (< 1%) to human herpesviruses 6 and 7 that have an estimated prevalence of more than 90%.

Table 2 Baseline characteristics of infectious disease serology from 2006 to 2010. Antibody levels are specific to each antigen and expressed in Median Fluorescence Intensity (MFI) units. Seroprevalence of at least the main UK Biobank cohort was estimated on samples from 9695 randomized participants, as described in white papers (see “Methods”). The “bold” and “italics” shading are used to distinguish between pathogens and their respective antigens. ^aCagA levels are based on roughly half of the original sample due to a technical lab error.

COVID-19 testing

Our study was based on COVID PCR test data available from March 16th to May 19th 2020. Specifically, we used the May 26th, 2020 tranche of COVID-19 polymerase chain reaction (PCR) data from Public Health England. There were 4510 unique participants who had 7539 individual tests administered, hereafter called test cases. To characterize each test case, the UK Biobank had a binary variable for test positivity (“result”) and a separate binary variable for test location (“origin”). For the positivity variable, a COVID-19 test was coded as negative (0) or positive (1). The second binary variable represented whether the COVID-19 test occurred through a setting that was out-patient (0) or in-patient at hospital (1). As a proxy for COVID-19 severity later verified by electronic health records and death certificates^twentyand as done in other UK Biobank reports^twenty-one, a test case first needed to be positive for COVID-19 (ie, the test had a ‘1’ value for the positivity variable). Next, if the positive test case occurred in an out-patient setting the infection was considered mild (ie, 0), whereas for in-patient hospitalization it was considered severe (ie, 1). Thus, two separate sets of analyzes were run to predict: (1) COVID-19 positivity; and (2) COVID-19 severity.

Statistical analyzes

For a more technical description of the specific machine learning algorithm used to predict test case outcomes, see Supplementary Text 1. Supplementary Text 2 has an in-depth description and analysis of within-subject variation for outcome measures and number of test cases done per participant . Briefly, this variability was modest and had no significant impact on classifier model performance. SPSS 27 was used for all analyzes and Alpha set at 0.05. Preliminary findings suggested that baseline serology data performed well in classifier models, despite a limited number of participants with serology. To determine if this serology sub-group was noticeably different from the full sample, Mann–Whitney U and Kruskal–Wallis tests were done (Alpha = 0.05). Hereafter, separate sets of classification analyzes were performed for: (1) the full cohort; and (2) the sub-group of participants that had serology data. In other words, due to the imbalance of sample sizes and by definition the absence or presence of serology data, classifier performance in the serology sub-group was never statistically compared to the full cohort.

Next, linear discriminant analysis (LDA) was used in two separate sets of analyzes to predict either: (1) COVID-19 diagnosis (negative vs. positive); or (2) COVID-19 infection severity (mild vs. severe). Again, for a given test case, COVID-19 severity would be examined only among participants who tested positive for COVID-19. LDA is a regression-like classification technique that finds the best linear combination of predictors that can maximally distinguish between groups of interest. To determine how useful a given predictor or related group of predictors (eg, demographics) were for classification, simple forced entry models were first done. Subsequently, to derive “best fit,” robust models of the data, stepwise entry (Wilks’ Lambda, F value entry = 3.84) was used to exclude predictors that did not significantly account for unique variance in the classification model. This data reduction step is critical because LDA can lead to model overfitting when there are too many predictors relative to observations^22.23, which are COVID-19 test cases for our purposes. Finally, because there were multiple test cases that could occur for the same participant, this would violate the assumption of independence. To guard against this problem, we used Mundry and Sommer’s permutation LDA approach. Specifically, for each LDA model, permutation testing (1000 iterations, P < 0.05) was done by randomizing participants across groupings of test cases to confirm robustness of the original model.²⁴.

LDA model overfitting can also occur when there is a sample size imbalance. Because there were many more negative vs. positive COVID-19 test cases in the full sample (5329 vs. 2210), the negative test group was undersampled. Specifically, a random number generator was used to discard 2500 negative test cases at random, such that the proportion of negative to positive tests was now 55% to 45% instead of 70.6 to 29.4%. Results without undersampling were similar (data not shown). No such imbalance was seen for COVID-19 severity in the full sample or for the serology sub-group. A typical holdout method of 70% and 30% was used for classifier training and then testing²⁵. Finally, a two-layer non-parametric approach was used to determine model significance and estimated fit of one or more predictors. First, bootstrapping²⁶ (95% Confidence Interval, 1000 iterations) was done to derive robust estimates against any violations of parametric assumptions. Next, ‘leave-one-out’ cross-validation²² was done with bootstrap-derived estimates to ensure that models themselves were robust. Collectively, the stepwise LDA models ensured that estimation bias of coefficients would be low because most predictors are “thrown out” before models are generated using the remaining predictors.

For each LDA classification model, outcome threshold metrics included: specificity (ie, true negatives correctly identified), sensitivity (ie, true positives correctly identified), and the geometric mean (ie, how well the model predicted both true negatives and positives). The area under the curve (AUC) with a 95% confidence interval (CI) was reported to show how well a given model could distinguish between a COVID-19 negative or positive test result, and separately for COVID-19 + test cases if the disease was mild or severe. Receiver operating characteristic (ROC) curves plotted sensitivity against 1-specificity to better visualize results for sets of predictors and a final stepwise model. For stepwise models, the Wilks’ Lambda statistical and standardized coefficients are reported to see how important a given predictor was for the model. A lower Wilks’ Lambda corresponds to a stronger influence on the canonical classifier.

Ethics declarations

Ethics approval for the UK Biobank study was obtained from the National Health Service Health Research Authority North West—Haydock Research Ethics Committee (16/NW/0274). All analyzes were conducted in line with UK Biobank requirements.