
Identifying COVID-19 patient profiles in the Basque Country: A clustering
approach
Lander Rodriguez1*, Daniel Fernández2,3, José M. Quintana-Lopez4,5,6,7, Julia Garcia-
Asensio8, Ane Villanueva4,5,6,7, Maria Jose Legarreta4,5,6,7, Nere Larrea4,5,6,7, Irantzu
Barrio9,1
1 Applied Statistics Group, Basque Centre for Applied Mathematics (BCAM), Bilbao,
Basque Country, Spain
2 Serra Húnter Fellow. Department of Statistics and Operations Research (DEIO).
Universitat Politècnica de Catalunya · BarcelonaTech (UPC), Barcelona, Catalonia,
Spain
3 Institute of Mathematics of UPC - BarcelonaTech (IMTech), Barcelona, Catalonia,
Spain
4 Research Unit of the Galdakao-Usansolo University Hospital, Osakidetza Basque
Health Service, Galdakao, Basque Country, Spain
5 Network for Research on Chronicity, Primary Care, and Health Promotion (RICAPPS)
6 Health Service Research Network on Chronic Diseases (REDISSEC), Bilbao, Basque
Country, Spain
7 Kronikgune Institute for Health Services Research, Barakaldo, Basque Country, Spain
8 Office of Healthcare Planning, Organization and Evaluation, Basque Government
Department of Health, Basque Country, Spain
9 Department of Mathematics, University of the Basque Country UPV/EHU, Leioa,
Basque Country, Spain
*Corresponding author: Lander Rodriguez

E-mail: lrodriguez@bcamath.org (LR)
Abstract
Classifying patients is essential at the outbreak of a pandemic to identify those with
the worst prognosis. In this research, our aim is to identify clinically useful profiles
with a novel clustering technique and to demonstrate their association with the adverse
evolution of the COVID-19 disease.
We implement a two-stage process in this retrospective cohort study: first we identify
the profiles of SARS-CoV-2 positive patients with the KAMILA clustering technique
and then we assess their association with adverse outcomes such as mortality, bad
progress (ICU or death) and hospital admission. The profiles are created for four
different periods of the pandemic through a population-based database containing
sociodemographic, comorbidities and baseline treatments data.
In general, four different groups were identified: Very low, young patients with
almost no comorbidities; Low, middle-aged patients with few comorbidities; High, old
patients with a varying number of comorbidities; and Very high, old patients with
multiple comorbidities. The variables that mainly segregate these clusters are age, the
Charlson index, diabetes, kidney disease, metastatic solid tumor and heart failure. In
addition, these profiles are strongly associated with the adverse outcomes of COVID-19.
Finally, although the identified profiles were stable throughout the pandemic, the rates
of hospital admission, bad progress and death decreased.
To the best of our knowledge, this is the first study to determine COVID-19 patient
profiles from the positive cases of a whole population and to assess their evolution over
time. Our findings suggest that clustering methods are appropriate for quickly
classifying the most vulnerable patients in new pandemics or other diseases, enabling
improved medical attention.
Introduction
It is crucial to quickly identify the patients with the worst prognosis at the outbreak
of a new virulent virus. In a context of great uncertainty, high circulation of the virus
and high hospitalization and death rates, it is critical to rapidly segregate patients so
that targeted care strategies can be developed to improve medical attention. In the
pages that follow, our purpose is to classify patients according to their clinical and
sociodemographic profiles and subsequently show that these profiles are related to the
evolution of the disease. In this case the profiles are related to the COVID-19 outcomes,
although they are independent of the virus and could be related to any disease or new
pandemic.
The severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) infection began in
December 2019 [1] and the World Health Organization declared a global pandemic in
March 2020. The disease became a threat to public health [2] due to its ease of
transmission and the number of deaths caused throughout the world [3]. This prompted
governments worldwide to take urgent action to contain the spread of the virus and
mitigate its effects.
The virulence of this pandemic precipitated the creation of a vast number of models to
further understand the disease. These models were mainly developed with Machine
Learning (ML) and advanced statistical methods. Both types of methods are able to
extract relevant variables from electronic health records [4], which could be a valuable
tool for either predicting adverse outcomes or classifying patients based on their
similarities and differences in baseline clinical and sociodemographic variables. For the
latter, clustering techniques are appropriate as they discover hidden and inherent
patterns to organize data into groups without any a priori hypothesis [5] and have
already been successfully applied in medical research [6-8]. However, these methods
received very little attention for classifying patients during the pandemic. Although
patient profiles were identified with clustering techniques in [9] and [10], those
studies were restricted to hospitalized patients.
One of the main challenges of clustering methods is their application to mixed-type data,
i.e., categorical and continuous variables. This unsupervised learning task is often
carried out with either categorical or continuous variables alone, although clinical
research usually involves both. In this regard, various techniques have been developed to
overcome the inherent difficulties of applying mathematical operations to both types of
variables. However, given a specific context, there is in general no clear guidance to
choose the most appropriate technique [11].
The novel KAMILA (KAy-means for MIxed LArge data) clustering approach [12,13] is
suitable when handling mixed-type data as suggested by different benchmarking studies
[11, 14]. Among the different methods, KAMILA in general offers superior
performance, which is emphasized when dealing with large datasets. In addition, it
provides the best performance in time efficiency thanks to the scalability of the
algorithm [11]. KAMILA overcomes several limitations of the methods employed for
clustering mixed-type data, such as the requirement of strong parametric assumptions, the
incapacity to minimize the contributions of individual variables and the need for an
arbitrary choice of weights for the relative contribution of the variables [13].
In this research work we identify the COVID-19 patient profiles from the Basque
Country for the most important periods of the pandemic. This is accomplished through
the implementation of a two-stage process. First, we identify the COVID-19 patient
profiles of the positives from the Basque Country with KAMILA and later we assess
their association with the adverse outcomes of the disease. We hypothesize that the
obtained groups will be associated with the disease severity, leading to a clinically
useful segregation of the patients. In addition, we explore the differences among clusters
and investigate their evolution over the course of the pandemic.
Materials and Methods
This is a retrospective study of a cohort based on data from the electronic database and
health records of the health service from the Basque Country.
Database
All the patients included in this study were residents of the Basque Country, a region
with a population of 2.18 million, who had SARS-CoV-2 between March 1, 2020 and
January 9, 2022. COVID-19 diagnosis was laboratory confirmed by a positive result on
the reverse transcriptase-polymerase chain reaction assay for SARS-CoV-2 or a positive
antigen test. In addition, from March 1, 2020 to July 31, 2020, positive IgM or IgG
antibody tests performed on patients who had symptoms suggestive of the disease or had
been in contact with a positive case were included in the sample. The authors did not have
access to information that could identify individual participants during or after data
collection. The need for consent was waived by the ethics committee due to the
pandemic situation. The study protocol was approved by the Ethics Committee for our
area (reference PI2020123).
Patient data was included in a unified electronic database of the Basque Country health
service. The data includes sociodemographic data (age, sex, and nursing home
residents’ indicator); vaccination dates and doses; baseline comorbidities (all those
included in Charlson Comorbidity Index [15] plus angina, arrhythmia, arterial
hypertension, dyslipidemia, asthma, bronchiectasis, cystic fibrosis, interstitial lung
disease, lymphoma, leukemia, coagulopathy, inflammatory bowel disease and
gastrointestinal bleeding); and baseline treatments based on the Anatomical
Therapeutic Chemical (ATC) classification system [16].
We grouped comorbidities in the following way: myocardial infarction; angina;
arrhythmia; congestive heart failure; peripheral vascular disease; cerebrovascular
disease; hemiplegia and/or paraplegia; arterial hypertension; dyslipidemia; dementia;
interstitial pulmonary disease; cystic fibrosis; chronic obstructive pulmonary disease
(COPD); bronchiectasis; chronic bronchial infection; asthma; liver disease (mild, or
moderate or severe); diabetes (with or without organ damage);
kidney disease; malignant tumor; metastatic solid tumor; lymphoma; rheumatic disease;
peptic ulcer; inflammatory bowel disease; and coagulopathies. Baseline treatment was
defined as any drug prescribed before the diagnosis of SARS-CoV-2 infection that had
no end date.
Vaccine doses were coded in the following manner: the first dose was counted from 14
days after the inoculation of the vaccine, whereas the second and third doses were
counted from the day of inoculation. There were no fourth doses in the period of
study. The vaccination variable was defined as a three-level categorical variable: no
dose or 1 dose, 2 doses, and 3 doses. This categorization was chosen because in this
region the first dose was a transitional dose to the second one, the moment at which a
patient was considered to be protected against the virus, and there were only three weeks
between the first two doses. Thus, we considered that having one dose was more similar
to having none than to being fully vaccinated. In addition, the third dose was
considered a booster dose, and thus we decided to separate it from the full vaccination
indicator.
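The coding rule above can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the argument names and the choice of evaluating the vaccination status at the date of diagnosis are assumptions made for illustration only.

```python
from datetime import date, timedelta
from typing import Optional

# The first dose counts only from 14 days after inoculation (rule stated in the text).
FIRST_DOSE_LAG = timedelta(days=14)


def effective_doses(diagnosis: date,
                    dose1: Optional[date] = None,
                    dose2: Optional[date] = None,
                    dose3: Optional[date] = None) -> int:
    """Number of doses counted at the diagnosis date (assumed reference date)."""
    # Second and third doses count from the day of inoculation.
    if dose3 is not None and dose3 <= diagnosis:
        return 3
    if dose2 is not None and dose2 <= diagnosis:
        return 2
    if dose1 is not None and dose1 + FIRST_DOSE_LAG <= diagnosis:
        return 1
    return 0


def vaccination_level(diagnosis: date,
                      dose1: Optional[date] = None,
                      dose2: Optional[date] = None,
                      dose3: Optional[date] = None) -> str:
    """Three-level categorical variable: '0-1 dose', '2 doses' or '3 doses'."""
    doses = effective_doses(diagnosis, dose1, dose2, dose3)
    if doses <= 1:
        return "0-1 dose"  # one dose is grouped with no dose
    return f"{doses} doses"
```

Because zero and one dose share a level, the 14-day lag does not change the final category; it is kept in the sketch only to mirror the coding described in the text.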
The outcomes of interest in the study were hospitalization, bad progress (ICU or death)
and death. They are defined as follows:
• Hospitalization: when a patient tested positive for COVID-19 before
hospitalization, the hospital admission was considered COVID-19 related if it
occurred within 15 days of the positive test. If the patient tested positive during
hospitalization, the hospital admission was considered COVID-19 related up to 21
days after the positive test. This last definition was included to account for the
lack of testing capacity at the beginning of the pandemic.
• Death: if the patient died within the three months following COVID-19
diagnosis, during a hospitalization, or within three months of hospital discharge
after a COVID-19 admission.
• Bad progress (ICU or death): when the patient died (as defined above) or had
an ICU admission during a hospital admission related to a SARS-CoV-2
infection.
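The hospitalization and bad progress definitions can be sketched in Python as follows. This is an illustrative reading, not the authors' implementation: the function names are hypothetical, a positive test on the day of admission is treated as "before hospitalization", and the 21-day rule is interpreted as a positive test obtained during the stay and no later than 21 days after admission. Both interpretations are assumptions.

```python
from datetime import date, timedelta


def is_covid_admission(positive: date, admitted: date, discharged: date) -> bool:
    """Whether a hospital admission counts as COVID-19 related."""
    if positive <= admitted:
        # Positive test before admission: admission within 15 days of the test.
        return admitted - positive <= timedelta(days=15)
    # Positive test during the stay (assumed reading of the 21-day rule).
    return positive <= discharged and positive - admitted <= timedelta(days=21)


def bad_progress(covid_death: bool, icu_during_covid_admission: bool) -> bool:
    """Composite outcome: death (as defined in the text) or ICU admission."""
    return covid_death or icu_during_covid_admission
```

The death window (three months after diagnosis or discharge) is left out of the sketch because the text does not state how a month is counted in days.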
The data of the study was collected from March 1, 2020 until April 9, 2022 and it was
accessed on April 18, 2022 for research purposes.
Statistical Analysis
The full data collection period was divided into four periods: the first
period spans from March 1, 2020 until June 30, 2020, when the first wave of the
pandemic took place; the second spans from July 1, 2020 until December 31, 2020,
when the vaccination process started; the third spans from January 1, 2021
until December 13, 2021, when the Omicron wave started; and the last period covers the
Omicron wave until January 9, 2022. On January 9, 2022 the Basque Government
modified its protocol for collecting COVID-19 positive data, which restricted the time
span of data acquisition for the Omicron wave.
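As an illustration of this division, the following Python sketch assigns a first positive test to one of the four periods. The names are hypothetical, and placing December 13, 2021 in the third period (so that the Omicron period starts on December 14) is an assumption, since the text does not state on which side the boundary date falls.

```python
from datetime import date
from typing import Optional

# (period number, first day, last day); boundaries taken from the text,
# with December 13, 2021 assumed to belong to period 3.
PERIODS = [
    (1, date(2020, 3, 1), date(2020, 6, 30)),
    (2, date(2020, 7, 1), date(2020, 12, 31)),
    (3, date(2021, 1, 1), date(2021, 12, 13)),
    (4, date(2021, 12, 14), date(2022, 1, 9)),
]


def pandemic_period(first_positive: date) -> Optional[int]:
    """Return the period (1 to 4) of a first positive test, or None if outside the study."""
    for number, start, end in PERIODS:
        if start <= first_positive <= end:
            return number
    return None
```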
Due to our interest in the first stages of the pandemic, only the first positive test of
each patient was included in the study. Additionally, only adult patients were considered.
Descriptive statistics included frequency tables for categorical variables and median and
interquartile ranges for numerical ones. Vaccination data was only available for periods
3 and 4 as the vaccination process started in the third period.
Patients were clustered with KAMILA for each period of the pandemic.
Clusters were determined with all the available variables except the disease outcomes
and the nursing home residents’ indicator, whose association with the clusters was
assessed afterwards. The numerical variables (i.e. age and the Charlson index) were
standardized to prevent their units from driving the clustering structure. The number of
clusters studied for each period ranged from two to five, as more groups were
considered excessive for a correct clinical distinction of the patients. The optimal
number of clusters was selected based on the prediction strength method [17] with a
threshold of 0.8, as suggested by its authors. This procedure was carried out with the
kamila R package [13]. All statistical analyses were performed using R (version 4.1.2)
[18].
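The two generic steps of this procedure, standardization and selection of the number of clusters by prediction strength, can be sketched in Python. This is a simplified illustration and not the kamila package's implementation: the function names are hypothetical, the sketch uses the sample standard deviation, and it assumes that the test observations have already been labeled twice (once by clustering the test set itself and once by assigning them to the clusters learned on the training set).

```python
from itertools import combinations
from statistics import mean, stdev


def standardize(values):
    """Z-score a numerical variable (sample standard deviation assumed)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def prediction_strength(test_labels, labels_from_training):
    """Minimum, over test clusters, of the share of within-cluster pairs
    that the training-based assignment also puts in the same cluster."""
    strengths = []
    for cluster in set(test_labels):
        members = [i for i, lab in enumerate(test_labels) if lab == cluster]
        pairs = list(combinations(members, 2))
        if not pairs:
            continue  # a singleton cluster has no pairs to check
        together = sum(labels_from_training[i] == labels_from_training[j]
                       for i, j in pairs)
        strengths.append(together / len(pairs))
    return min(strengths, default=0.0)


def select_k(strength_by_k, threshold=0.8):
    """Largest number of clusters whose prediction strength reaches the threshold."""
    eligible = [k for k, ps in strength_by_k.items() if ps >= threshold]
    return max(eligible) if eligible else None
```

For example, with hypothetical strengths `{2: 0.95, 3: 0.85, 4: 0.82, 5: 0.60}` and the 0.8 threshold, `select_k` returns 4. The pairwise loop is quadratic in cluster size, so it is meant to show the definition rather than to scale to cohorts of this size.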
The resulting clusters were labeled as Very low, Low, High and Very high
based on their defining characteristics. Presumably, this labeling would later indicate
the likelihood of the adverse outcomes of COVID-19, although this assumption had to
be verified after the clusters were created in an unsupervised manner.
Post-hoc Analysis
Apart from the descriptive characteristics of the different clusters, two more aspects
were examined. First, we investigated whether the clusters were significantly different
within the same period. Second, we studied whether clusters of the same risk level had
evolved during the pandemic. The variables were compared pairwise with the Pearson
Chi-squared test for the categorical variables and the two-sided Mann-Whitney U test for
continuous variables, with an alpha of 0.01 for statistical significance. The
Shapiro-Wilk test was previously applied to the continuous variables to confirm that their
distribution was not normal. Due to the large sample size, the effect size was measured
with Cramér’s V [19] for categorical variables and Vargha and Delaney’s A [20] for
continuous ones. Large effect sizes were defined as suggested by those
authors, respectively: Cramér’s V values above 0.5 for variables with 1 degree of freedom
and above 0.35 for variables with 2 degrees of freedom, and Vargha and
Delaney’s A values higher than 0.71 or lower than 0.29 for continuous variables.
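Both effect sizes have simple closed forms, shown here as a minimal Python sketch under stated assumptions: the function names are hypothetical, the chi-squared statistic is computed without continuity correction (the authors' software settings are not stated), and no row or column of the contingency table is assumed to be empty.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for row, row_total in zip(table, row_totals):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    dof = min(len(row_totals), len(col_totals)) - 1  # 1 for a 2x2 table
    return sqrt(chi2 / (n * dof))


def vargha_delaney_a(x, y):
    """Probability that a value from x exceeds a value from y, ties counted as half."""
    greater = sum(xi > yi for xi in x for yi in y)
    ties = sum(xi == yi for xi in x for yi in y)
    return (greater + 0.5 * ties) / (len(x) * len(y))


# Death in period 1, Very low (40 of 11148) versus Very high (414 of 1262),
# using the counts reported in Table 2.
death_p1 = [[40, 11148 - 40], [414, 1262 - 414]]
print(round(cramers_v(death_p1), 2))
```

With these counts the sketch gives a V slightly above the 0.5 threshold, in line with the large effect size reported for death between the Very low and Very high clusters in period 1. The pairwise loop in `vargha_delaney_a` is quadratic and only meant to show the definition.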
Results
Summary statistics of the sociodemographic variables and the background
characteristics of the whole sample for the different periods can be seen in Table 1. The
rest of the variables can be found online in the S1 Table. Significant differences exist
among the variables for the different periods, with the first period being the most
dissimilar. In general, the proportions of comorbidities and nursing home residents, and
the median and interquartile ranges of age and the Charlson index, decrease with time. Of
interest here is the reduction in the percentages of hospitalization, bad progress (ICU or
death), and death from period 1 to period 4 (Omicron).
Table 1. Descriptive characteristics of COVID-19 patients for the different pandemic
periods.
Variables | Period 1 | Period 2 | Period 3 | Omicron
TOTAL | 20,457 (5.38%) | 79,942 (21.03%) | 140,672 (37.01%) | 139,018 (36.58%)
Sociodemographic variables
Gender^2-4, N (%)
Female | 12,529 (61.25) | 42,164 (52.74) | 71,345 (50.72) | 73,522 (52.89)
Male | 7,928 (38.75) | 37,778 (47.26) | 69,327 (49.28) | 65,496 (47.11)
Age^3-4, Median [Q1,Q3] | 57 [44,75] | 47 [33,61] | 44 [30,58] | 44 [32,55]
Nursing home, N (%) | 3,523 (17.22) | 2,530 (3.16) | 1,556 (1.11) | 1,011 (0.73)
Vaccines, N (%)
0-1 dose | | | 103,867 (73.84) | 15,683 (11.28)
2 doses | | | 35,382 (25.15) | 99,866 (71.84)
3 doses | | | 1,423 (1.01) | 23,469 (16.88)
Comorbidities
Charlson index, Median [Q1,Q3] | 0 [0,2] | 0 [0,1] | 0 [0,1] | 0 [0,1]
Myocardial infarction, N (%) | 959 (4.69) | 1,800 (2.25) | 2,618 (1.86) | 2,127 (1.53)
Congestive heart failure, N (%) | 1,679 (8.21) | 2,619 (3.28) | 3,357 (2.39) | 2,281 (1.64)
Peripheral vascular disease, N (%) | 1,058 (5.17) | 2,234 (2.79) | 2,992 (2.13) | 2,363 (1.70)
Cerebrovascular disease, N (%) | 2,484 (12.14) | 4,922 (6.16) | 6,884 (4.89) | 5,809 (4.18)
Dementia, N (%) | 1,685 (8.24) | 2,042 (2.55) | 1,798 (1.28) | 1,104 (0.79)
COPD, N (%) | 3,693 (18.05) | 12,207 (15.27) | 22,251 (15.82) | 22,512 (16.19)
Rheumatic disease, N (%) | 638 (3.12) | 1,428 (1.79) | 2,114 (1.50) | 1,908 (1.37)
Peptic ulcer, N (%) | 700 (3.42) | 1,590 (1.99) | 2,400 (1.71) | 2,080 (1.50)
Liver disease, N (%)
Mild | 1,072 (5.24) | 2,800 (3.50) | 4,359 (3.10) | 3,994 (2.87)
Moderate/Severe | 140 (0.68) | 264 (0.33) | 335 (0.24) | 296 (0.21)
Diabetes, N (%)
Yes, without organ damage | 2,151 (10.51) | 4,967 (6.21) | 7,075 (5.03) | 5,312 (3.82)
Yes, with organ damage | 531 (2.60) | 924 (1.16) | 1,266 (0.90) | 853 (0.61)
Hemiplegia / Paraplegia, N (%) | 442 (2.16) | 773 (0.97) | 918 (0.65) | 694 (0.50)
Kidney, N (%) | 2,206 (10.78) | 4,606 (5.76) | 7,086 (5.04) | 6,055 (4.36)
Metastatic solid tumor, N (%) | 507 (2.48) | 1,193 (1.49) | 1,746 (1.24) | 1,381 (0.99)
Heart failure, N (%) | 1,679 (8.21) | 2,619 (3.28) | 3,357 (2.39) | 2,281 (1.64)
Angina, N (%) | 717 (3.50) | 1,402 (1.75) | 1,948 (1.38) | 1,523 (1.10)
Arterial hypertension, N (%) | 7,109 (34.75) | 17,614 (22.03) | 25,611 (18.21) | 21,390 (15.39)
Dyslipidemia, N (%) | 6,494 (31.74) | 17,367 (21.72) | 26,653 (18.95) | 24,842 (17.87)
Lymphoma, N (%) | 940 (4.60) | 4,055 (5.07) | 8,673 (6.17) | 9,712 (6.99)
Gastrointestinal bleeding, N (%) | 361 (1.76) | 605 (0.76) | 857 (0.61) | 661 (0.48)
Chronic bronchitis, N (%) | 1,441 (7.04) | 3,575 (4.47) | 5,810 (4.13) | 5,201 (3.74)
Cystic fibrosis, N (%) | 568 (2.78) | 830 (1.04) | 1,136 (0.81) | 884 (0.64)
Outcome variables
Hospitalization, N (%) | 5,486 (26.82) | 6,943 (8.69) | 10,951 (7.78) | 2,236 (1.61)
Death, N (%) | 1,678 (8.20) | 1,764 (2.21) | 1,901 (1.35) | 584 (0.42)
Bad progress, N (%) | 2,028 (9.91) | 2,334 (2.92) | 3,112 (2.21) | 761 (0.55)
Footnote: The superscripts found in the variable names indicate the periods in which that variable
satisfies the independence test with an alpha of 0.01. In case no superscripts are shown, it means there
are significant differences among all the periods. Only sociodemographic variables, comorbidities with
significant differences among all periods and the outcomes are included.
Turning now to the identification of clusters, four clusters, from Very low to Very high,
were identified in Periods 1 to 3, while three clusters were identified in the Omicron
period. In Fig 1 COVID-19 patients are plotted according to their age and the Charlson
index and are colored by the identified clusters. For each cluster, the median values of
age and the Charlson index are represented by bigger dots. In addition, tables with the
sample size of the different clusters are shown for each period. Age is the variable that
differentiates the Very low and Low clusters, while the Charlson index is similar in these
clusters for all the periods. In contrast, the High and Very high clusters have similar
ages but different Charlson index values, which are higher in the Very high group. Of note
here is that in the last period there is just one High-Very high severity cluster, with a
Charlson index in between those obtained in the previous periods.
In order to analyze the differences of the clusters in more detail, in Fig 2 the age and the
Charlson index density plots corresponding to the clusters identified in Fig 1 are shown.
The Very low and Low clusters present different age distributions and similar Charlson
index density distributions. On the other hand, the High and Very high clusters have
similar age distributions but different Charlson index distributions. In the last period the
High-Very high cluster has a similar age distribution to these clusters in the previous
periods while the Charlson index density distribution is a combination of the
distributions of these two clusters from the previous periods.
Fig 3 shows the proportions of hospitalization, bad progress and death in each cluster
for all the periods. The clusters created in an unsupervised manner present a stepped
proportion of adverse outcomes of the disease. Indeed, the proportion of each outcome
increases with the risk level of the clusters in all the periods.
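The stepped pattern can be checked directly from the period 1 counts reported in Table 2. The short Python sketch below recomputes the percentages and tests that they increase strictly with the risk level; the variable and function names are hypothetical, and only the counts come from the manuscript.

```python
ORDER = ["Very low", "Low", "High", "Very high"]

# Period 1 cluster sizes and outcome counts, as reported in Table 2.
PERIOD1_N = {"Very low": 11148, "Low": 4996, "High": 3051, "Very high": 1262}
PERIOD1_EVENTS = {
    "Hospitalization": {"Very low": 1590, "Low": 1840, "High": 1368, "Very high": 688},
    "Bad progress": {"Very low": 155, "Low": 609, "High": 828, "Very high": 436},
    "Death": {"Very low": 40, "Low": 457, "High": 767, "Very high": 414},
}


def outcome_rates(events, sizes):
    """Percentage of patients with the outcome in each cluster."""
    return {cluster: 100 * events[cluster] / sizes[cluster] for cluster in sizes}


def is_stepped(rates, order=ORDER):
    """True when the rate increases strictly from one risk level to the next."""
    values = [rates[cluster] for cluster in order]
    return all(low < high for low, high in zip(values, values[1:]))


rates = {name: outcome_rates(events, PERIOD1_N)
         for name, events in PERIOD1_EVENTS.items()}
for name, by_cluster in rates.items():
    print(name, [round(by_cluster[c], 2) for c in ORDER], is_stepped(by_cluster))
```

For hospitalization, for instance, the recomputed rates are 14.26%, 36.83%, 44.84% and 54.52%, matching the percentages printed in Table 2.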
In Table 2 the summary of the differences between clusters for the same period can be
seen. Only the variables with significant differences in the pairwise tests performed
between clusters of different levels and with at least one large effect size
in those tests are shown in Table 2. The numeric superscripts attached to
the values indicate the clusters with which that value has a
large effect size for that variable. All the outcomes
were also included. The rest of the comparisons can be found online in the S2 Table.
Table 2. Cluster descriptive characteristics by period and differences for each period by
cluster.
Variable | Very low [1] | Low [2] | High [3] | Very high [4]
Period 1 | N = 11148 | N = 4996 | N = 3051 | N = 1262
Age, Median [Q1,Q3] | 46 [36-54]^2,3,4 | 71 [60-82]^1,3 | 85 [76-90]^1,2 | 80 [70-87]^1
Nursing home, N (%) | 168 (1.51)^3,4 | 1295 (25.92) | 1557 (51.03)^1 | 503 (39.86)^1
Charlson index, Median [Q1,Q3] | 0 [0-0]^3,4 | 1 [0-2]^3,4 | 3 [2-4]^1,2,4 | 7 [6-9]^1,2,3
Congestive heart failure, N (%) | 36 (0.32)^3,4 | 23 (0.46)^4 | 1043 (34.19)^1 | 577 (45.72)^1,2
Peripheral vascular disease, N (%) | 38 (0.34)^4 | 133 (2.66) | 489 (16.03) | 398 (31.54)^1
Cerebrovascular disease, N (%) | 250 (2.24)^3,4 | 435 (8.71) | 1255 (41.13)^1 | 544 (43.11)^1
Liver disease^1-4, N (%)
Mild | 190 (1.70) | 369 (7.39) | 278 (9.11) | 235 (18.62)
Moderate/severe | 11 (0.10) | 16 (0.32) | 15 (0.49) | 98 (7.77)
Diabetes^1-3,1-4,2-4,3-4, N (%)
Yes, without organ damage | 0 (0.92) | 818 (16.37) | 872 (28.58) | 358 (28.37)
Yes, with organ damage | 2 (0.02) | 36 (0.72) | 134 (4.39) | 359 (28.45)
Kidney disease, N (%) | 237 (2.13)^4 | 305 (6.1)^4 | 910 (29.83) | 754 (59.75)^1,2
Metastatic solid tumor, N (%) | 47 (0.42)^3,4 | 22 (0.44)^3,4 | 0 (0)^1,2 | 438 (34.71)^1,2
HIV, N (%) | 5 (0.04)^3 | 6 (0.12)^3 | 0 (0)^1,2 | 20 (1.58)
Heart failure, N (%) | 36 (0.32)^3,4 | 23 (0.46)^4 | 1043 (34.19)^1 | 577 (45.72)^1,2
Arterial hypertension, N (%) | 476 (4.27)^2,3,4 | 3191 (63.87)^1 | 2456 (80.5)^1 | 986 (78.13)^1
Antidiabetics, N (%) | 73 (0.65)^4 | 690 (13.81) | 779 (25.53) | 566 (44.85)^1
Diuretics, N (%) | 35 (0.31)^3,4 | 468 (9.37) | 1281 (41.99)^1 | 568 (45.01)^1
RAAS inhibitors, N (%) | 93 (0.83)^2,3,4 | 2227 (44.58)^1 | 1480 (48.51)^1 | 502 (39.78)^1
Lipid lowering drugs/statins, N (%) | 164 (1.47)^3,4 | 1598 (31.99) | 1302 (42.67)^1 | 545 (43.19)^1
Anticoagulants, N (%) | 134 (1.2)^3,4 | 715 (14.31)^3 | 2255 (73.91)^1,2 | 799 (63.31)^1
Antiplatelets, N (%) | 59 (0.53)^3,4 | 395 (7.91) | 1333 (43.69)^1 | 438 (34.71)^1
Hospitalization, N (%) | 1590 (14.26) | 1840 (36.83) | 1368 (44.84) | 688 (54.52)
Bad progress, N (%) | 155 (1.39) | 609 (12.19) | 828 (27.14) | 436 (34.55)
Death, N (%) | 40 (0.36)^4 | 457 (9.15) | 767 (25.14) | 414 (32.81)^1
Period 2 | N = 44399 | N = 22857 | N = 9638 | N = 3048
Age, Median [Q1,Q3] | 35 [26-43]^2,3,4 | 58 [53-65]^1,3,4 | 76 [65-85]^1,2 | 77 [66-86]^1,2
Charlson index, Median [Q1,Q3] | 0 [0-0]^3,4 | 0 [0-0]^3,4 | 2 [1-3]^1,2,4 | 7 [5-8]^1,2,3
Congestive heart failure, N (%) | 63 (0.14)^4 | 50 (0.22)^4 | 1386 (14.38) | 1120 (36.75)^1,2
COPD, N (%) | 7118 (16.03) | 511 (2.24)^4 | 3255 (33.77) | 1323 (43.41)^2
Liver disease^1-4, N (%)
Mild | 430 (0.97) | 968 (4.24) | 884 (9.17) | 518 (16.99)
Moderate/severe | 20 (0.05) | 21 (0.09) | 25 (0.26) | 198 (6.50)
Diabetes^1-3,1-4,2-3,2-4, N (%)
Kidney disease, N (%) | 864 (1.95)^4 | 463 (2.03)^4 | 1705 (17.69) | 1574 (51.64)^1,2
Metastatic solid tumor, N (%) | 50 (0.11)^3,4 | 63 (0.28)^3,4 | 0 (0)^1,2 | 1080 (35.43)^1,2
HIV, N (%) | 18 (0.04)^3 | 15 (0.07)^3 | 0 (0) | 38 (1.25)
Heart failure, N (%) | 63 (0.14)^4 | 50 (0.22)^4 | 1386 (14.38) | 1120 (36.75)^1,2
Arterial hypertension, N (%) | 623 (1.4)^3,4 | 7347 (32.14) | 7372 (76.49)^1 | 2272 (74.54)^1
Dyslipidemia, N (%) | 1230 (2.77)^3,4 | 8530 (37.32) | 5773 (59.9)^1 | 1834 (60.17)^1
Antidiabetics, N (%) | 217 (0.49)^4 | 946 (4.14) | 2528 (26.23) | 1209 (39.67)^1
Diuretics, N (%) | 37 (0.08)^4 | 629 (2.75) | 2250 (23.35) | 1182 (38.78)^1
RAAS inhibitors, N (%) | 79 (0.18)^3,4 | 4581 (20.04) | 5377 (55.79)^1 | 1300 (42.65)^1
Lipid lowering drugs/statins, N (%) | 93 (0.21)^3,4 | 3355 (14.68) | 4872 (50.55)^1 | 1418 (46.52)^1
Anticoagulants, N (%) | 423 (0.95)^3,4 | 990 (4.33)^3,4 | 5002 (51.9)^1,2 | 1785 (58.56)^1,2
Antiplatelets, N (%) | 124 (0.28)^3,4 | 257 (1.12) | 3054 (31.69)^1 | 971 (31.86)^1
Bad progress, N (%) | 126 (0.28) | 497 (2.17) | 1036 (10.75) | 675 (22.15)
Death, N (%) | 18 (0.04) | 247 (1.08) | 871 (9.04) | 628 (20.60)
Period 3 | N = 69409 | N = 50401 | N = 16735 | N = 4127
Age, Median [Q1,Q3] | 30 [23-38]^2,3,4 | 53 [48-61]^1,3,4 | 72 [63-81]^1,2 | 74 [63-84]^1,2
Charlson index, Median [Q1,Q3] | 0 [0-1]^3,4 | 0 [0-0]^3,4 | 2 [1-3]^1,2,4 | 7 [5-8]^1,2,3
Congestive heart failure, N (%) | 107 (0.15)^4 | 106 (0.21)^4 | 1756 (10.49) | 1388 (33.63)^1,2
Liver disease^1-4, N (%)
Mild | 581 (0.84) | 1365 (2.71) | 1703 (10.18) | 710 (17.20)
Moderate/severe | 26 (0.04) | 32 (0.06) | 36 (0.22) | 241 (5.84)
Diabetes^1-3,1-4,2-3,2-4, N (%)
Kidney disease, N (%) | 1837 (2.65) | 861 (1.71)^4 | 2449 (14.63) | 1939 (46.98)^2
Metastatic solid tumor, N (%) | 12 (0.02)^3,4 | 78 (0.15)^3,4 | 0 (0)^1,2 | 1656 (40.13)^1,2
HIV, N (%) | 12 (0.02)^3 | 19 (0.04)^3 | 0 (0)^1,2 | 99 (2.4)
Heart failure, N (%) | 107 (0.15)^4 | 106 (0.21)^4 | 1756 (10.49) | 1388 (33.63)^1,2
Arterial hypertension, N (%) | 778 (1.12)^3,4 | 9613 (19.07)^3 | 12364 (73.88)^1,2 | 2856 (69.2)^1
Dyslipidemia, N (%) | 1749 (2.52)^3,4 | 12682 (25.16) | 9936 (59.37)^1 | 2286 (55.39)^1
Antidiabetics, N (%) | 329 (0.47)^4 | 921 (1.83) | 4262 (25.47) | 1525 (36.95)^1
Diuretics, N (%) | 41 (0.06)^4 | 646 (1.28) | 3007 (17.97) | 1411 (34.19)^1
RAAS inhibitors, N (%) | 93 (0.13)^3,4 | 5356 (10.63) | 9552 (57.08)^1 | 1770 (42.89)^1
Lipid lowering drugs/statins, N (%) | 135 (0.19)^3,4 | 3716 (7.37) | 8681 (51.87)^1 | 1801 (43.64)^1
Anticoagulants, N (%) | 784 (1.13)^3,4 | 1243 (2.47)^3,4 | 7487 (44.74)^1,2 | 2231 (54.06)^1,2
Bad progress, N (%) | 218 (0.31) | 773 (1.53) | 1352 (8.08) | 769 (18.63)
Death, N (%) | 11 (0.02) | 231 (0.46) | 970 (5.80) | 689 (16.69)
Period 4 | Very low [1] | Low [2] | High-Very high [4]
N | N = 99899 | N = 31581 | N = 7538
Age^All, Median [Q1,Q3] | 39 [28-46] | 60 [55-68] | 73 [63-84]
Vaccines^1-2,1-4, N (%)
2 doses | 83486 (83.57) | 14286 (45.24) | 2094 (27.78)
3 doses | 3035 (3.04) | 15371 (48.67) | 5063 (67.17)
Charlson index, Median [Q1,Q3] | 0 [0-0]^4 | 0 [0-1]^4 | 4 [3-5]^1,2
Diabetes^1-4,2-4, N (%)
Yes, without organ damage | 550 (0.55) | 2444 (7.74) | 2318 (30.75)
Yes, with organ damage | 60 (0.06) | 61 (0.19) | 732 (9.71)
Arterial hypertension, N (%) | 1930 (1.93)^2,4 | 14048 (44.48)^1 | 5412 (71.8)^1
Dyslipidemia, N (%) | 4958 (4.96)^4 | 15414 (48.81) | 4470 (59.3)^1
Antidiabetics, N (%) | 442 (0.44)^4 | 2265 (7.17) | 2619 (34.74)^1
RAAS inhibitors, N (%) | 225 (0.23)^4 | 9771 (30.94) | 3883 (51.51)^1
Lipid lowering drugs/statins, N (%) | 326 (0.33)^4 | 7969 (25.23) | 4051 (53.74)^1
Anticoagulants, N (%) | 1045 (1.05)^4 | 3187 (10.09) | 4237 (56.21)^1
Antiplatelets, N (%) | 283 (0.28)^4 | 2041 (6.46) | 2671 (35.43)^1
Hospitalization, N (%) | 604 (0.60) | 764 (2.42) | 868 (11.51)
Bad progress, N (%) | 81 (0.08) | 234 (0.74) | 446 (5.92)
Death, N (%) | 21 (0.02) | 156 (0.49) | 407 (5.40)
Footnote: The numeric superscripts attached to the values indicate the clusters with
which that value has a large effect size for that variable. Only
variables with large effect sizes, as suggested by [26] and [27], in the performed tests were included
together with the outcomes.
Apart from the Charlson index and age, the lower-risk and higher-risk clusters in the
first three periods are differentiated by a higher percentage, in the higher-risk
clusters, of the following comorbidities and baseline prescribed treatments:
• Period 1: congestive heart failure, peripheral vascular disease, cerebrovascular
disease, liver disease, diabetes, kidney disease, metastatic solid tumor, heart failure
and arterial hypertension, and baseline treatments of antidiabetics, diuretics,
RAAS inhibitors, statins, anticoagulants and antiplatelets.
• Period 2: congestive heart failure, COPD, liver disease, diabetes, kidney disease,
metastatic solid tumor, heart failure, arterial hypertension and dyslipidemia, and
baseline treatments of antidiabetics, diuretics, RAAS inhibitors, statins,
anticoagulants and antiplatelets.
• Period 3: congestive heart failure, liver disease, diabetes, kidney disease,
metastatic solid tumor, heart failure, arterial hypertension and dyslipidemia, and
baseline treatments of antidiabetics, diuretics, RAAS inhibitors, statins and
anticoagulants.
In the Omicron period, the proportion of vaccinated people, the presence of diabetes,
arterial hypertension and dyslipidemia, and the prescribed treatments of antidiabetics,
RAAS inhibitors, statins, anticoagulants and antiplatelets segregate the lower-risk
clusters from the High-Very high one.
Even though the presence of comorbidities is further increased in the High and Very
high clusters, there are some comorbidities that gradually increase from the Very low to
the Low cluster differentiating these profiles: diabetes, arterial hypertension and
dyslipidemia. In addition, the High and Very high clusters are mainly segregated by
their proportions in heart failure, liver disease, diabetes, kidney disease and metastatic
solid tumor, all of them included and contributing to the Charlson index.
Finally, even though there are no large effect sizes in the comparisons of COVID-19
outcomes, except for death between the Very low and Very high clusters in period 1,
for all the outcomes (hospitalization, bad progress and death) and
periods there are significant differences in their proportions among clusters, as
observed in Fig 3.
In Table 3 a summary of the differences among periods for a specific risk level is
shown. Again, the variables shown in Table 3 are the ones that, for a specific cluster
level, have at least one large effect size in the tests performed for all the possible
combinations of periods. The numeric superscripts attached to the values
indicate the periods with which that value has a large effect size for that
variable. The High-Very high cluster found in the last
period was compared with the Very high cluster from the previous periods. The
outcomes were also included in the table. The rest of the comparisons can be found
online in the S3 Table.
Table 3. Cluster descriptive characteristics by risk level and differences in time.
Variable | Period 1 | Period 2 | Period 3 | Period 4
Very low
Age, Median [Q1, Q3] | 46 [36-54]^2,3 | 35 [26-43]^1 | 30 [23-38]^1 | 39 [28-46]
Vaccines^3-4, N (%)
2 doses | | | 9788 (14.1) | 83486 (83.57)
3 doses | | | 101 (0.15) | 3035 (3.04)
Bad progress, N (%) | 155 (1.39) | 126 (0.28) | 218 (0.31) | 81 (0.08)
Death, N (%) | 40 (0.36) | 18 (0.04) | 11 (0.02) | 21 (0.02)
Low
Age, Median [Q1, Q3] | 71 [60-82]^2,3 | 58 [53-65]^1 | 53 [48-61]^1 | 60 [55-68]
Vaccines^3-4, N (%)
2 doses | | | 18161 (36.03) | 14286 (45.24)
3 doses | | | 411 (0.82) | 15371 (48.67)
Charlson index, Median [Q1, Q3] | 1 [0-2]^3 | 0 [0-0] | 0 [0-0]^1 | 0 [0-1]
Bad progress, N (%) | 609 (12.19) | 497 (2.17) | 773 (1.53) | 234 (0.74)
Death, N (%) | 457 (9.15) | 247 (1.08) | 231 (0.46) | 156 (0.49)
High
Age, Median [Q1, Q3] | 85 [76-90]^3 | 76 [65-85] | 72 [63-81]^1 |
Charlson index, Median [Q1, Q3] | 3 [2-4]^2,3 | 2 [1-3]^1 | 2 [1-3]^1 |
Hospitalization, N (%) | 1368 (44.84) | 2438 (25.30) | 3734 (22.31) |
Bad progress, N (%) | 828 (27.14) | 1036 (10.75) | 1352 (8.08) |
Death, N (%) | 767 (25.14) | 871 (9.04) | 970 (5.80) |
Very high (High-Very high in period 4)
Vaccines^3-4, N (%)
2 doses | | | 1464 (35.47) | 2094 (27.78)
3 doses | | | 211 (5.11) | 5063 (67.17)
Charlson index, Median [Q1, Q3] | 7 [6-9]^4 | 7 [5-8]^4 | 7 [5-8]^4 | 4 [3-5]^2,3,4
Bad progress, N (%) | 436 (34.55) | 675 (22.15) | 769 (18.63) | 446 (5.92)
1
Death, N (%) | 414 (32.81) | 628 (20.60) | 689 (16.69) | 407 (5.40)
Footnote: The numeric superscript on a value indicates the periods for which the comparison of that
variable with the given period has a large effect size. Only variables with large effect sizes, as
suggested by [26] and [27], in the performed tests were included together with the outcomes.
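As an illustration of how such superscripts can be derived for a continuous variable, the
following Python sketch computes the Vargha-Delaney A statistic [20] for every pair of
periods and records the pairs whose effect is large. It is a minimal sketch under stated
assumptions, not the code used in the study: the age samples are invented, the function
names are hypothetical, and the 0.71 cut-off for a large effect is the value commonly
attributed to Vargha and Delaney, which may differ from the threshold applied here.

```python
from itertools import combinations

# Assumed cut-off for a "large" effect (commonly cited for the A statistic).
LARGE_A = 0.71


def vargha_delaney_a(x, y):
    """Probability that a value drawn from x exceeds one drawn from y (ties count half)."""
    greater = sum(1 for xi in x for yi in y if xi > yi)
    ties = sum(1 for xi in x for yi in y if xi == yi)
    return (greater + 0.5 * ties) / (len(x) * len(y))


def large_effect_pairs(samples_by_period, threshold=LARGE_A):
    """For each period, list the other periods it differs from with a large effect."""
    flagged = {period: [] for period in samples_by_period}
    for p, q in combinations(sorted(samples_by_period), 2):
        a = vargha_delaney_a(samples_by_period[p], samples_by_period[q])
        # A is direction-dependent, so compare its distance from 0.5 on either side.
        if max(a, 1 - a) >= threshold:
            flagged[p].append(q)
            flagged[q].append(p)
    return flagged


# Hypothetical ages for one cluster in three periods (illustrative values only).
ages = {
    1: [70, 72, 65, 80, 68, 75],
    2: [58, 55, 60, 62, 53, 66],
    3: [56, 54, 59, 63, 52, 60],
}

superscripts = large_effect_pairs(ages)
print(superscripts)
```

With these invented samples, period 1 is flagged against periods 2 and 3, while periods 2
and 3 are flagged only against period 1, which mirrors the superscript pattern of the age
rows in Table 3.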
In this case, the only variables with at least one large effect size are age, the Charlson
index and the vaccine doses. With the exception of the Very high cluster, age decreases
over the course of the pandemic in all the clusters. In addition, leaving out the Very
low cluster, the Charlson index also decreases in the rest of the clusters. The same
trend appears in the outcome proportions: although none of the comparisons shows a large
effect size, the proportions of death, bad progress and hospital admission decrease over
the course of the pandemic.
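The absence of large effect sizes in the outcomes can be checked approximately from
Table 3. The Python sketch below computes Cramér's V for the 2x2 table of deaths in the
Very high cluster of period 1 versus the High-Very high cluster of period 4. The cluster
sizes (1262 and 7537) are not reported in the table; they are back-calculated here from
the printed counts and percentages and should be read as approximations. The function
name is hypothetical, and no continuity correction is applied.

```python
import math


def cramers_v_2x2(events_a, total_a, events_b, total_b):
    """Cramér's V for a 2x2 table (equal to |phi|), without continuity correction."""
    a, b = events_a, total_a - events_a  # group A: events, non-events
    c, d = events_b, total_b - events_b  # group B: events, non-events
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return abs(a * d - b * c) / denom


# Deaths taken from Table 3; totals are ASSUMED, back-calculated from the percentages
# (414 / 0.3281 is about 1262, 407 / 0.0540 is about 7537).
v = cramers_v_2x2(events_a=414, total_a=1262, events_b=407, total_b=7537)
print(round(v, 2))
```

Under these assumptions V is roughly 0.33, below the 0.5 threshold that the Methods
section sets for a large effect with 1 degree of freedom, even though the death
proportion falls from 32.81% to 5.40%.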
Discussion
This study utilized the KAMILA clustering technique to classify patients according to
their clinical and sociodemographic characteristics for different time periods of the
COVID-19 pandemic based on a large cohort of patients. We concentrated on the first
periods on account of the high hospitalization and death rates in the early stages of the
COVID-19 pandemic.
As we hypothesized, the identified clusters are closely associated with the adverse
outcomes of the disease, yielding a description of risk profiles. Indeed, the severity of
the outcomes increases with the risk level of the clusters for all the periods and
outcomes. While the outcomes were related to COVID-19, this procedure is
independent of the virus and could be used in new pandemics or other diseases. This
could be valuable, especially in the early stages of a pandemic, to obtain a quick
identification of risk groups and to provide targeted care to the most vulnerable patients.
As a summary of the results, we found four profiles based on age and the presence, or
not, of comorbidities as measured by the Charlson Comorbidity Index:
• Young patients with almost no comorbidities: Very low.
• Middle-aged patients with, generally, few comorbidities: Low.
• Old patients with different number of comorbidities: High.
• Old patients with multiple comorbidities: Very high.
Age and the Charlson index, as a measure of the impact of the patients’ comorbidities,
emerged as important factors discriminating the profiles: age was more relevant for
distinguishing the Very low and Low profiles, while the Charlson index was more important
for separating the High and Very high profiles. Although age, the Charlson index and
the proportions of comorbidities decreased over the course of the pandemic, there are no
large differences, which suggests that the clusters have been stable over time. In fact,
the COVID-19 adverse outcomes decreased presumably due to a combination of factors,
including the reduced virulence of the virus, the initiation of vaccination programs,
increased population immunity due to infections and reinfections, population-level
containment measures, and more effective treatments at the hospital level.
The highest risk profiles are characterized by higher ages and higher Charlson index
values, which are associated with poorer outcomes of COVID-19. On the one hand, age
has a significant impact due to its correlation with frailty and overall health condition,
which have been shown in the literature to result in poorer COVID-19 outcomes [21-23].
On the other hand, the Charlson index mainly segregates the higher risk profiles,
suggesting its influence on a worse prognosis of the disease. This is consistent with the
use of the Charlson index as an indicator of the patient’s health status. In addition,
prior studies have also associated the Charlson index with a worse COVID-19 prognosis [24].
The evolution of these variables over the course of the pandemic can be explained by the
special case of the first period. The lower testing capacity at the initial stage of the
pandemic [25] resulted in selective testing of the most severe cases, which may have
biased the results of this period. For instance, the median age of the Low cluster in
this period was 71, which is high considering its risk level. However, in the subsequent
two periods, with increased testing capacity, the median age decreased significantly.
In the last period (Omicron variant) the Very high and High clusters merge into an
intermediate cluster that we renamed High-Very high. This can be seen by looking at
Fig 1 and Fig 2 together: the median Charlson index of the High-Very high cluster in the
last period lies between the median values of both clusters from the previous period, and
its distribution is a combination of their density plots. Accordingly, the patients with
the best prognosis from the High profile of the previous periods may have mixed with the
patients from the Low profile, resulting in this case in an increase of the median age
of the Low profile for the last period.
Regarding the rest of the variables, there were no large differences among periods for
the same risk level clusters. This leads us to conclude that the clusters have been stable
throughout the pandemic and that the reduction of the adverse outcomes of the disease is
explained by the following reasons: the vaccination process starting in 2021 and the
effectiveness of the vaccines [26]; generally better self-protection of the population; an
improvement of the treatments applied to COVID-19 positives; and the appearance of
less virulent SARS-CoV-2 variants [27].
The proportions of comorbidities and baseline treatments increase in the clusters
identified as higher risk profiles. Bearing in mind that the High and Very high profiles
are distinguished by the Charlson index, it is interesting to note which comorbidities
segregate these clusters. As expected, these comorbidities contribute to the Charlson
index: diabetes, liver disease, kidney disease, metastatic solid tumor and heart failure.
These variables have already been reported in the literature as suggestive of a worse
prognosis of COVID-19 [28-31]. Furthermore, other comorbidities, such as arterial
hypertension, dyslipidemia and diabetes, gradually increase across all the clusters and
even discriminate the Very low and Low profiles. These variables have already been
associated with poorer outcomes of COVID-19 [32,33].
Regarding baseline treatments, the proportions of antidiabetics, diuretics, RAAS
inhibitors, statins, anticoagulants and antiplatelets differ between the higher risk
clusters and the lower ones. These treatments are usually employed to treat various
comorbidities simultaneously; thus, we consider that they are not that important by
themselves for creating patient profiles. Rather than being essential in creating the
profiles, baseline treatments can be considered an indirect measure of the severity of
the associated comorbidities.
The strong correspondence of our results with those found in the literature reinforces
that KAMILA can effectively classify patients. In fact, the variables of the profiles that
differ the most are the ones that have been highlighted in the literature as suggestive of a
worse COVID-19 prognosis. This also accords with earlier studies suggesting that KAMILA
is suitable when dealing with mixed-type data and large databases.

This study has several strengths, including the large sample size that encompasses all
COVID-19-positive cases in the Basque Country over a period of almost two years.
Additionally, the database used in this study contained a wide range of variables,
including sociodemographic factors, comorbidities and baseline treatments, which
allowed the identification of patient profiles. Furthermore, the study has not been
restricted to hospitalized patients, unlike some previous studies [9,10], and the profile
identification has been expanded to the COVID-19 positives of a population. However,
there are also several limitations in our study. First, there are no established standards
for the statistical validation of unsupervised clustering results, but the acceptance of the
clustering structures by clinicians with expertise in this topic helped to mitigate this
issue. Second, our study identifies statistical associations and descriptions, but does not
describe causality. Third, this study is time-limited and future research may be needed
to identify the future profiles of COVID-19 positives in new waves. Finally, to enhance
generalizability, identification of the profiles of other regions should be performed to
compare the created clusters.
Conclusions
The purpose of the present research was to classify patients according to their clinical
and sociodemographic profiles and subsequently show their association with the
evolution of the disease. Our initial hypothesis was tested and the profiles created in an
unsupervised manner could be associated with the adverse outcomes of COVID-19. In
addition, the study’s results are consistent with the literature, suggesting that the
variables that differentiate the profiles the most are those highlighted in the
literature as indicative of a worse COVID-19 prognosis. In particular, age and the
Charlson index have played a major role in the determination of the profiles jointly with
diabetes, kidney disease, metastatic solid tumor, and heart failure.
These findings suggest that this method can be used in new pandemics, with other
chronic conditions or even with populations, to segregate patients in a clinically useful
way. This could lead to a quick classification of the patients with the worst prognosis,
allowing for targeted care intervention strategies and resulting in an overall improvement
of their medical attention.
Acknowledgments
We are grateful for the support of the Basque health service, Osakidetza, and the
Department of Health of the Basque Government. We also gratefully acknowledge the
patients who participated in the study. Open access funding provided by BCAM.
References
1. Zhu N, Zhang D, Wang W, Li X, Yang B, Song J, Zhao X, et al. A Novel
Coronavirus from Patients with Pneumonia in China, 2019. New England Journal of
Medicine. 2020; 382(8): 727-733. doi: 10.1056/NEJMoa2001017.
2. McCabe R, Schmit N, Christen P, D'Aeth JC, Løchen A, Rizmie D, et al. Adapting
hospital capacity to meet changing demands during the COVID-19 pandemic. BMC
Med. 2020; 18(1): 1-12. doi: https://doi.org/10.1186/s12916-020-01781-w.
3. World Health Organization. WHO Coronavirus (COVID-19) Dashboard. [Cited 2023
February 15]. Available from: https://covid19.who.int.
4. Garg A, Mago V. Role of machine learning in medical research: A survey. Computer
Science Review. 2021; 40: 100370. doi: https://doi.org/10.1016/j.cosrev.2021.100370.
5. Landau S, Leese M, Stahl D, Everitt BS. Cluster Analysis. 5th ed. John Wiley & Sons;
2011.
6. McLachlan GJ. Cluster analysis and related techniques in medical research. Stat
Methods Med Res. 1992;1(1): 27-48. doi: 10.1177/096228029200100103.
7. Rappoport N, Shamir R. Multi-omic and multi-view clustering algorithms: review
and cancer benchmark. Nucleic Acids Res. 2018;46(20):10546-10562. doi:
10.1093/nar/gky889.
8. Grant RW, McCloskey J, Hatfield M, Uratsu C, Ralston JD, Bayliss E, et al. Use of
Latent Class Analysis and k-means Clustering to Identify Complex Patient Profiles.
JAMA Netw Open. 2020; 3(12): e2029068-e2029068.
9. Bondeelle L, Chevret S, Cassonnet S, Harei S, Denis B, de Castro N, et al. Profiles
and outcomes in patients with COVID-19 admitted to wards of a French
oncohematological hospital: A clustering approach. PLoS One. Published online
2021:e0250569-e0250569.
10. Ye W, Lu W, Tang Y, Chen G, Li X, Ji C. Identification of COVID-19 Clinical
Phenotypes by Principal Component Analysis-Based Cluster Analysis. Front Med
(Lausanne). 2020; 7: 570614. doi: 10.3389/fmed.2020.570614.
11. Preud’homme G, Duarte K, Dalleau K, Lacomblez C, Bresso E, Smail-Tabbone M,
et al. Head-to-head comparison of clustering methods for heterogeneous data: a
simulation-driven benchmark. Scientific Reports. 2021; 11: 4202. doi:
https://doi.org/10.1038/s41598-021-83340-8.
12. Foss A, Markatou M, Ray B, Heching A. A semiparametric method for clustering
mixed data. Machine Learning. 2016; 105: 419-458. doi: 10.1007/s10994-016-5575-7.
13. Foss A, Markatou M. kamila: Clustering Mixed-Type Data in R and Hadoop.
Journal of Statistical Software. 2018; 83(13): 1-44. doi: 10.18637/jss.v083.i13.
14. Costa E, Papatsouma I, Markos A. Benchmarking distance-based partitioning
methods for mixed-type data. Advances in Data Analysis and Classification. 2022. doi:
https://doi.org/10.1007/s11634-022-00521-7.
15. Charlson ME, Sax FL, MacKenzie CR, Fields SD, Braham RL, Douglas Jr RG.
Assessing illness severity: does clinical judgement work? J Chronic Dis. 1986; 39(6):
439-452. doi: 10.1016/0021-9681(86)90111-6.
16. WHO Collaborating Centre for Drug Statistics Methodology. Guidelines for ATC
classification and DDD assignment, 16th ed. Oslo; 2013.
17. Tibshirani R, Walther G. Cluster Validation by Prediction Strength. Journal of
Computational and Graphical Statistics. 2005; 14(3): 511-528. doi:
10.1198/106186005X59243.
18. R Core Team. A language and environment for statistical computing. R Foundation
for Statistical Computing, Austria. 2021. Available from: https://www.R-project.org/.
19. Kim HY. Statistical notes for clinical researchers: Chi-squared test and Fisher’s
exact test. Restorative Dentistry & Endodontics. 2017; 42(2): 152-155. doi:
10.5395/rde.2017.42.2.152.
20. Vargha A, Delaney HD. A Critique and Improvement of the CL Common Language
Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral
Statistics. 2000; 25(2): 101-132. doi: 10.3102/10769986025002101.
21. Gupta RK, Marks M, Samuels THA, Luintel A, Rampling T, Chowdhury H, et al.
Systematic evaluation and external validation of 22 prognostic models among
hospitalised adults with COVID-19: an observational cohort study. Eur Respir J. 2020;
56(6): 2003498. doi: 10.1183/13993003.03498-2020.
22. Verity R, Okell LC, Dorigatti I, Winskill P, Whittaker C, Imai N, et al. Estimates of
the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious
Diseases. 2020; 20(6): 669-677. doi: https://doi.org/10.1016/S1473-3099(20)30243-7.
23. Sousa GJB, Garces TS, Cestari VRF, Florêncio RS, Moreira TMM, Pereira MLD.
Mortality and survival of COVID-19. Epidemiology & Infection. 2020; 148:e123. doi:
10.1017/S0950268820001405.
24. Tuty Kuswardhani RA, Henrina J, Pranata R, Lim MA, Lawrensia S, Suastika K.
Charlson comorbidity index and a composite of poor outcomes in COVID-19 patients: A
systematic review and meta-analysis. Diabetes & Metabolic Syndrome: Clinical Research
& Reviews. 2020; 14(6): 2103-2109. doi: https://doi.org/10.1016/j.dsx.2020.10.022.
25. Soriano V, Ganado-Pinilla P, Sanchez-Santos M, Gómez-Gallego F, Barreiro P, de
Mendoza C, et al. Main differences between the first and second waves of COVID-19 in
Madrid, Spain. International Journal of Infectious Diseases. 2021; 105: 374-376. doi:
10.1016/j.ijid.2021.02.115.
26. Lin D, Gu Y, Wheeler B, Young H, Holloway S, Sunny SK, et al. Effectiveness of
Covid-19 Vaccines over a 9-Month Period in North Carolina. The New England Journal
of Medicine. 2022; 386: 933-941. doi: 10.1056/NEJMoa2117128.
27. Shuai H, Chan JFW, Hu B, Chai Y, Yuen TTT, Yin F, et al. Attenuated replication
and pathogenicity of SARS-CoV-2 B.1.1.529 Omicron. Nature. 2022; 603(7902): 696-
699. doi: 10.1038/s41586-022-04442-5.
28. Huang I, Lim MA, Pranata R. Diabetes mellitus is associated with an increased
mortality and severity of disease in COVID-19 pneumonia – a systematic review, meta-
analysis and meta-regression. Diabetes & Metabolic Syndrome: Clinical Research &
Reviews. 2020; 14(4): 395-403. doi: https://doi.org/10.1016/j.dsx.2020.04.018.
29. Cheng Y, Luo R, Wang K, Zhang M, Wang Z, Dong L, et al. Kidney disease is
associated with in-hospital death of patients with COVID-19. Kidney International.
2020; 97(5): 829-838. doi: https://doi.org/10.1016/j.kint.2020.03.005.
30. Yoshida Y, Chu S, Fox S, Zu Y, Lovre D, Denson JL, et al. Sex differences in
determinants of COVID-19 severe outcomes – findings from the National COVID
Cohort Collaborative (N3C). BMC Infectious Diseases. 2022; 22: 784. doi:
https://doi.org/10.1186/s12879-022-07776-7.
31. Rey JR, Caro-Codón J, Rosillo SO, Iniesta AM, Castrejón-Castrejón S, Marco-
Clement I, et al. Heart failure in COVID-19 patients: prevalence, incidence and
prognostic implications. European Journal of Heart Failure. 2020; 22: 2205-2215. doi:
https://doi.org/10.1002/ejhf.1990.
32. Pranata R, Lim MA, Huang I, Raharjo SB. Hypertension is associated with an
increased mortality and severity of disease in COVID-19 pneumonia: A systematic
review, meta-analysis and meta-regression. Journal of the Renin-Angiotensin-
Aldosterone System. 2020; 21(2): 147032032092689. doi:
10.1177/1470320320926899.
33. Hariyanto TI, Kurniawan A. Dyslipidemia is associated with severe coronavirus
disease 2019 (COVID-19) infection. Diabetes & Metabolic Syndrome: Clinical
Research & Reviews. 2020; 14(5): 1463-1465. doi:
https://doi.org/10.1016/j.dsx.2020.07.054.
Figure captions
Fig 1. Overview of the clusters according to age and the Charlson index.
Fig 2. Charlson index density plots for all the periods and clusters.
Fig 3. Description of main outcomes differences by clusters and periods.
Supporting information captions

S1 Table. Complementary table for the descriptive characteristics in Table 1.
S2 Table. Complementary table for the cluster descriptive characteristics by period
found in Table 2.
S3 Table. Complementary table to the cluster descriptive characteristics by risk level
found in Table 3.
Declarations
Ethics Statement
The study protocol was approved by the Ethics Committee of the Basque Country
(reference PI2020123). The need for consent was waived by the ethics committee due to
the pandemic situation.
Funding
This work was supported in part by the health outcomes group from Galdakao-
Barrualde Health Organization; the Kronikgune Institute for Health Service Research;
Instituto de Salud Carlos III (ISCIII) through the project "RD16/0001/0001" (Red de
Investigación en Servicios de Salud en Enfermedades Crónicas) and the project
“RD21CIII/0003/0017” (Red de Investigación en Cronicidad, Atención Primaria y
Prevención y Promoción de la Salud) and co-funded by the European Union, and the
Basque Government through BMTF “Mathematical Modeling Applied to Health”
Project. The work of IB was financially supported in part by grants from the
Departamento de Educación, Política Lingüística y Cultura del Gobierno Vasco
[IT1456-22] and by the Ministry of Science and Innovation through BCAM Severo
Ochoa accreditation [CEX2021-001142-S / MICIN / AEI /10.13039/501100011033]
and through project [PID2020-115882RB-I00 / AEI /10.13039/501100011033] funded
by Agencia Estatal de Investigación and acronym “S3M1P4R” and also by the Basque
Government through the BERC 2022-2025 program. DF has been supported by the
Ministerio de Ciencia e Innovación (Spain) [PID2019-104830RB-I00/ DOI (AEI):
10.13039/501100011033], and by grant 2021 SGR 01421 (GRBIO) administrated by
the Departament de Recerca i Universitats de la Generalitat de Catalunya (Spain). The
funders had no role in study design, data collection and analysis, decision to publish, or
preparation of the manuscript.
