approach
Barrio9,1
1 Applied Statistics Group, Basque Centre for Applied Mathematics (BCAM), Bilbao, Basque Country, Spain
2 Serra Húnter Fellow. Department of Statistics and Operations Research (DEIO). Universitat Politècnica de Catalunya · BarcelonaTech (UPC), Barcelona, Catalonia, Spain
3 Institute of Mathematics of UPC - BarcelonaTech (IMTech), Barcelona, Catalonia, Spain
4 Research Unit of the Galdakao-Usansolo University Hospital, Osakidetza Basque Health Service, Galdakao, Basque Country, Spain
5 Network for Research on Chronicity, Primary Care, and Health Promotion (RICAPPS)
6 Health Service Research Network on Chronic Diseases (REDISSEC), Bilbao, Basque Country, Spain
7 Kronikgune Institute for Health Services Research, Barakaldo, Basque Country, Spain
8 Office of Healthcare Planning, Organization and Evaluation, Basque Government Department of Health, Basque Country, Spain
9 Department of Mathematics, University of the Basque Country UPV/EHU, Leioa, Basque Country, Spain
worst prognostic patients. In this research our aim is to identify clinically useful profiles
with a novel clustering technique and to demonstrate their association with the adverse
the profiles of SARS-CoV-2 positive patients with the KAMILA clustering technique
and then we assess their association with adverse outcomes such as mortality, bad
progress (ICU or death) and hospital admission. The profiles are created for four
In general, four different groups have been identified: Very low, young patients with
almost no comorbidities; Low, middle-aged patients with few comorbidities; High, old
patients with a varying number of comorbidities; and Very high, old patients with
multiple comorbidities. The variables that mainly segregate these clusters are age, the
Charlson index, diabetes, kidney disease, metastatic solid tumor and heart failure. In
addition, these profiles strongly associate with the adverse outcomes of COVID-19.
Finally, even though the identified profiles were stable throughout the pandemic, the hospital
To the best of our knowledge, this is the first study to determine COVID-19 patient profiles
from COVID-19 positives of the population and to assess their evolution over time. Our
the most vulnerable patients in new pandemics or other diseases for an improved
medical attention.
Introduction
It is crucial to quickly identify the patients with the worst prognosis in the outbreak of a new
virulent virus. In a context of great uncertainty, high circulation of the virus and high
hospitalization and death rates, it is critical to rapidly segregate patients so that targeted
care intervention strategies can be developed for improved medical attention. In the
pages that follow, our purpose is to classify patients according to their clinical and
sociodemographic profiles and subsequently show that these profiles are related to the evolution of
the disease. In this case the profiles are related to the COVID-19 outcomes, although
they are independent of the virus and could be related to any disease or new pandemic.
December 2019 [1] and the World Health Organization declared a global pandemic in
March 2020. The disease became a threat to public health [2] due to its ease of
transmission and the number of deaths caused throughout the world [3]. This prompted
governments worldwide to take urgent action to contain the spread of the virus and
The virulence of this pandemic precipitated the creation of a vast number of models to
further understand the disease. These models were mainly developed with Machine
Learning (ML) and advanced statistical methods. Both types of methods are able to
extract relevant variables from electronic health records [4], which could be a valuable
tool for either predicting adverse outcomes or classifying patients based on their
similarities and differences in baseline clinical and sociodemographic variables. For the
latter, clustering techniques are appropriate as they discover hidden and inherent
patterns to organize data into groups without any a priori hypothesis [5] and have
already been successfully applied in medical research [6-8]. However, these methods
received very little attention for classifying patients during the pandemic. Although patient
profiles were identified with clustering techniques in [9] and [10], the studies were
One of the main challenges of clustering methods is their application to mixed-type data,
i.e., categorical and continuous variables. This unsupervised learning task is often
usually involves both of them. In this regard, various techniques have been developed to
The novel KAMILA (KAy-means for MIxed LArge data) clustering approach [12,13] is
[11, 14]. Among the different methods, KAMILA in general offers superior
provides the best performance in time efficiency thanks to the scalability of the
algorithm [11]. KAMILA overcomes several challenges of the methods employed for
clustering mixed-type data, such as the requirement of strong parametric assumptions and the
arbitrary choice of weights for the relative contribution of the variables [13].
In this research work we identify the COVID-19 patient profiles from the Basque
Country for the most important periods of the pandemic. This is accomplished through
profiles of the positives from the Basque Country with KAMILA and later we assess
their association with the adverse outcomes of the disease. We hypothesize that the
obtained groups will be associated with the disease severity, leading to a clinically
useful segregation of the patients. In addition, we explore the differences among clusters
This is a retrospective study of a cohort based on data from the electronic database and
Database
All the patients included in this study were residents of the Basque Country, a region
with a population of 2.18 million, who had SARS-CoV-2 between March 1, 2020 and
antigen test. In addition, from March 1, 2020 to July 31, 2020, positive IgM or IgG antibody
tests performed on patients who had symptoms suggestive of the disease or had been in
contact with a positive case were included in the sample. The authors did not have
access to information that could identify individual participants during or after data
collection. The need for consent was waived by the ethics committee due to the
pandemic situation. The study protocol was approved by the Ethics Committee for our
Patient data was included in a unified electronic database of the Basque Country health
service. The data includes sociodemographic data (age, sex, and nursing home
residents’ indicator); vaccination dates and doses; baseline comorbidities (all those
gastrointestinal bleeding); and baseline treatments based on the Anatomical,
(COPD); bronchiectasis; chronic bronchial infection; asthma; liver disease (mild liver,
kidney disease; malignant tumor; metastatic solid tumor; lymphoma; rheumatic disease;
peptic ulcer; inflammatory bowel disease; and coagulopathies. For baseline medication
the baseline treatment was defined as any drug prescribed before diagnosis with SARS-
Vaccine doses were coded in the following manner: the first dose was considered 14
days after the inoculation of the vaccine whereas the second and third doses were
considered the day the inoculation occurred. There were no fourth doses in the period of
dose or 1 dose, 2 doses, and 3 doses. This categorization was decided because in this
region the first dose was a transitional dose to the second one, the moment at which a patient was
considered to be protected against the virus, and there were only three weeks
between the first two doses. Thus, we considered that receiving one dose was more similar
to having none than to being fully vaccinated. In addition, the third dose was
considered a booster dose, and we therefore decided to separate it from the full vaccination
indicator.
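The dose coding described above can be illustrated with a minimal Python sketch. It is not the code used in the study: the function names are hypothetical, and treating the 14-day lag as inclusive (a first dose counts from day 14 onwards) is an assumption, since the text does not specify the boundary.

```python
from datetime import date, timedelta

# From the text: the first dose is considered 14 days after inoculation,
# whereas the second and third doses count from the day of inoculation.
FIRST_DOSE_LAG = timedelta(days=14)


def effective_doses(test_date, dose1=None, dose2=None, dose3=None):
    """Number of doses considered effective on the date of the positive test.

    Assumption: the 14-day boundary is inclusive.
    """
    count = 0
    if dose1 is not None and dose1 + FIRST_DOSE_LAG <= test_date:
        count = 1
    if dose2 is not None and dose2 <= test_date:
        count = 2
    if dose3 is not None and dose3 <= test_date:
        count = 3
    return count


def vaccine_category(n_doses):
    """Collapse the dose count into the three categories used in the study."""
    if n_doses <= 1:
        return "0-1 dose"
    return f"{n_doses} doses"  # "2 doses" or "3 doses"


# Hypothetical patient: first dose 10 days before the positive test.
doses = effective_doses(date(2021, 6, 20), dose1=date(2021, 6, 10))
print(doses, vaccine_category(doses))
```

Grouping zero and one dose into a single level mirrors the reasoning in the text: a single dose is treated as closer to no vaccination than to full vaccination.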
The outcomes of interest in the study were hospitalization, bad progress (ICU or death)
occurred within 15 days of the positive test. If the patient tested positive during
days after the positive test. This last definition was included to account for the
Death: if the patient died during the three months following COVID-19
by a COVID-19 admission.
Bad progress (ICU or death): when the patient died (as defined above) or had
infection.
The data of the study was collected from March 1, 2020 until April 9, 2022 and it was
Statistical Analysis
The full period of collection of data was divided into 4 different periods: The first
period spans from March 1, 2020 until June 30, 2020 when the first wave of the
pandemic took place; the second one spans from July 1, 2020 until December 31, 2020
when the vaccination process started; the third one takes place from January 1, 2021
until December 13, 2021 when the Omicron wave started; and the last period covers the
Omicron wave until January 9, 2022. On January 9, 2022 the Basque Government
modified its protocol for collecting COVID-19 positive data, which restricted the time
Because of our interest in the first stages of the pandemic, only the first positive test of each
patient was included in the study. Additionally, only adult patients were considered.
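The period definitions and inclusion rules above can be sketched in Python as follows. This is an illustrative sketch, not the study's code. Two points are assumptions: that December 13, 2021 belongs to the third period (so the Omicron period starts on December 14), and that "adult" means 18 years or older. The record layout and function names are hypothetical.

```python
from datetime import date

# Period boundaries taken from the text; the Period 3 / Omicron split at
# December 13-14, 2021 is an assumption about which side the boundary day falls on.
PERIODS = [
    ("Period 1", date(2020, 3, 1), date(2020, 6, 30)),
    ("Period 2", date(2020, 7, 1), date(2020, 12, 31)),
    ("Period 3", date(2021, 1, 1), date(2021, 12, 13)),
    ("Omicron", date(2021, 12, 14), date(2022, 1, 9)),
]
ADULT_AGE = 18  # assumption: the text only says "adult patients"


def assign_period(test_date):
    """Return the name of the period containing test_date, or None."""
    for name, start, end in PERIODS:
        if start <= test_date <= end:
            return name
    return None


def first_adult_positives(records):
    """Keep the first positive test per patient, then adults within a period.

    records: iterable of (patient_id, test_date, age_at_test) tuples.
    Returns a dict mapping patient_id to the period of the first positive test.
    """
    first = {}
    for patient_id, test_date, age in records:
        if patient_id not in first or test_date < first[patient_id][0]:
            first[patient_id] = (test_date, age)
    selected = {}
    for patient_id, (test_date, age) in first.items():
        period = assign_period(test_date)
        if age >= ADULT_AGE and period is not None:
            selected[patient_id] = period
    return selected


# Hypothetical records for illustration only.
records = [
    ("a", date(2020, 4, 1), 50),
    ("a", date(2021, 2, 1), 51),   # reinfection, ignored
    ("b", date(2021, 12, 20), 30),
    ("c", date(2020, 8, 1), 10),   # minor, excluded
]
print(first_adult_positives(records))
```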
Descriptive statistics included frequency tables for categorical variables and median and
interquartile ranges for numerical ones. Vaccination data was only available for periods
Patients were clustered based on KAMILA for the different periods of the pandemic.
Clusters were determined with all the available variables except the disease outcomes
and the nursing home residents’ indicator, whose association with the clusters was later
assessed. The numerical variables (i.e., age and the Charlson index) were standardized to
prevent their units from driving the clustering structure. The number of clusters studied
for each period was from a minimum of two to a maximum of five as more groups were
considered excessive for a correct clinical distinction of the patients. The optimal
number of clusters was selected based on the prediction strength method [17] with a
threshold of 0.8, as suggested by the authors. This procedure was accomplished with the
kamila R package [13]. All statistical analyses were performed using R (version 4.1.2)
[18].
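As an illustration of the two steps described above (standardization and the selection of the number of clusters by prediction strength), a minimal from-scratch Python sketch is given below. It is not the implementation of the kamila R package. The function names are hypothetical, the standard-error term of the original method is ignored, and the prediction strength values in the usage example are made up. Standardization uses the sample standard deviation, which is an assumption about the scaling applied in the study.

```python
from itertools import combinations
from statistics import mean, stdev


def standardize(values):
    """Center and scale a numerical variable (sample standard deviation)."""
    mu, sigma = mean(values), stdev(values)
    return [(v - mu) / sigma for v in values]


def prediction_strength(test_clusters, predicted_by_training):
    """Prediction strength for one train/test split and one number of clusters.

    test_clusters[i]: cluster of test point i when the test set is clustered alone.
    predicted_by_training[i]: cluster assigned to test point i by the clustering
    fitted on the training set.
    For each test cluster, compute the share of pairs of its members that the
    training clustering also puts together; return the minimum over clusters.
    Clusters with fewer than two points have no pairs and are skipped.
    """
    strengths = []
    for label in set(test_clusters):
        members = [i for i, c in enumerate(test_clusters) if c == label]
        if len(members) < 2:
            continue
        pairs = list(combinations(members, 2))
        together = sum(
            predicted_by_training[i] == predicted_by_training[j] for i, j in pairs
        )
        strengths.append(together / len(pairs))
    if not strengths:
        raise ValueError("no test cluster has at least two points")
    return min(strengths)


def select_num_clusters(strength_by_k, threshold=0.8):
    """Largest number of clusters whose prediction strength reaches the threshold."""
    eligible = [k for k, ps in strength_by_k.items() if ps >= threshold]
    return max(eligible) if eligible else None


# Made-up average prediction strengths for two to five clusters.
print(select_num_clusters({2: 0.95, 3: 0.90, 4: 0.83, 5: 0.60}))
```

In practice the prediction strength is averaged over several random train/test splits for each candidate number of clusters before the threshold is applied.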
The resulting clusters were labeled as Very low, Low, High and Very high
based on their defining characteristics. Presumably, this labeling would later indicate
the likelihood of the adverse outcomes of COVID-19, although this assumption should
Post-hoc Analysis
Apart from the descriptive characteristics of the different clusters, two more aspects
were examined. First, we investigated whether the clusters were significantly different within the
same period. Second, we studied whether clusters of the same risk level had evolved during the
pandemic. The variables were compared pairwise with the Pearson Chi-squared test for
the categorical variables and the two-sided Mann-Whitney U test for continuous
Wilk test was previously done on the continuous variables to confirm their distribution
was not normal. Due to the large sample size, the effect size was measured with
Cramér’s V [19] for categorical variables and Vargha and Delaney’s A [20] for
continuous ones. Large effect sizes were considered as suggested by the previous
variables and higher than 0.35 for 2 degrees of freedom variables, and Vargha and
Delaney’s A values higher than 0.71 or lower than 0.29 for continuous variables.
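The two effect size measures can be computed as in the following minimal Python sketch. It is an illustration under stated assumptions, not the study's code: the function names are hypothetical, the chi-squared statistic is computed without continuity correction, and all expected counts are assumed to be positive. The example table uses the Period 1 death counts of the Very low and Very high clusters reported in Table 2.

```python
from math import sqrt


def cramers_v(table):
    """Cramér's V for a contingency table given as a list of rows of counts.

    Uses the Pearson chi-squared statistic without continuity correction.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    dof = min(len(table) - 1, len(table[0]) - 1)
    return sqrt(chi2 / (n * dof))


def vargha_delaney_a(x, y):
    """Vargha and Delaney's A: P(X > Y) + 0.5 * P(X = Y)."""
    greater = sum(xi > yj for xi in x for yj in y)
    ties = sum(xi == yj for xi in x for yj in y)
    return (greater + 0.5 * ties) / (len(x) * len(y))


def is_large_a(a, upper=0.71, lower=0.29):
    """Large effect according to the thresholds stated in the text."""
    return a > upper or a < lower


# Period 1, death: Very low (40 of 11148) versus Very high (414 of 1262).
death_table = [[40, 11148 - 40], [414, 1262 - 414]]
print(round(cramers_v(death_table), 2))

# Hypothetical ages of two small groups.
print(vargha_delaney_a([30, 35, 41], [70, 82, 85]))
```

For the death table above the sketch gives a value of roughly 0.52, which is in line with the large effect reported for death between these two clusters in the first period.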
Results
characteristics of the whole sample for the different periods can be seen in Table 1. The
rest of the variables can be found online in the S1 Table. Significant differences exist
among the variables for the different periods, with the first period being the most dissimilar. In
general, the proportions of comorbidities and nursing home residents, as well as the median
and interquartile ranges of age and the Charlson index, decrease with time. Of interest
here is the reduction of hospitalization, bad progress (ICU or death), and death
periods.
| Variables | Period 1 | Period 2 | Period 3 | Omicron |
|---|---|---|---|---|
| TOTAL | 20,457 (5.38%) | 79,942 (21.03%) | 140,672 (37.01%) | 139,018 (36.58%) |
| Sociodemographic variables | | | | |
| Gender^2-4, N (%) | | | | |
| Female | 12,529 (61.25) | 42,164 (52.74) | 71,345 (50.72) | 73,522 (52.89) |
| Male | 7,928 (38.75) | 37,778 (47.26) | 69,327 (49.28) | 65,496 (47.11) |
| Age^3-4, Median [Q1,Q3] | 57 [44,75] | 47 [33,61] | 44 [30,58] | 44 [32,55] |
| Nursing home, N (%) | 3,523 (17.22) | 2,530 (3.16) | 1,556 (1.11) | 1,011 (0.73) |
| Vaccines, N (%) | | | | |
| 0-1 dose | | | 103,867 (73.84) | 15,683 (11.28) |
| 2 doses | | | 35,382 (25.15) | 99,866 (71.84) |
| 3 doses | | | 1,423 (1.01) | 23,469 (16.88) |
| Comorbidities | | | | |
| Charlson index, Median [Q1,Q3] | 0 [0,2] | 0 [0,1] | 0 [0,1] | 0 [0,1] |
| Myocardial infarction, N (%) | 959 (4.69) | 1,800 (2.25) | 2,618 (1.86) | 2,127 (1.53) |
| Congestive heart failure, N (%) | 1,679 (8.21) | 2,619 (3.28) | 3,357 (2.39) | 2,281 (1.64) |
| Peripheral vascular disease, N (%) | 1,058 (5.17) | 2,234 (2.79) | 2,992 (2.13) | 2,363 (1.70) |
| Cerebrovascular disease, N (%) | 2,484 (12.14) | 4,922 (6.16) | 6,884 (4.89) | 5,809 (4.18) |
| Dementia, N (%) | 1,685 (8.24) | 2,042 (2.55) | 1,798 (1.28) | 1,104 (0.79) |
| COPD, N (%) | 3,693 (18.05) | 12,207 (15.27) | 22,251 (15.82) | 22,512 (16.19) |
| Rheumatic disease, N (%) | 638 (3.12) | 1,428 (1.79) | 2,114 (1.50) | 1,908 (1.37) |
| Peptic ulcer, N (%) | 700 (3.42) | 1,590 (1.99) | 2,400 (1.71) | 2,080 (1.50) |
| Liver disease, N (%) | | | | |
| Mild | 1,072 (5.24) | 2,800 (3.50) | 4,359 (3.10) | 3,994 (2.87) |
| Moderate/Severe | 140 (0.68) | 264 (0.33) | 335 (0.24) | 296 (0.21) |
| Diabetes, N (%) | | | | |
| Yes, without organ damage | 2,151 (10.51) | 4,967 (6.21) | 7,075 (5.03) | 5,312 (3.82) |
| Yes, with organ damage | 531 (2.60) | 924 (1.16) | 1,266 (0.90) | 853 (0.61) |
| Hemiplegia / Paraplegia, N (%) | 442 (2.16) | 773 (0.97) | 918 (0.65) | 694 (0.50) |
| Kidney, N (%) | 2,206 (10.78) | 4,606 (5.76) | 7,086 (5.04) | 6,055 (4.36) |
| Metastatic solid tumor, N (%) | 507 (2.48) | 1,193 (1.49) | 1,746 (1.24) | 1,381 (0.99) |
| Heart failure, N (%) | 1,679 (8.21) | 2,619 (3.28) | 3,357 (2.39) | 2,281 (1.64) |
| Angina, N (%) | 717 (3.50) | 1,402 (1.75) | 1,948 (1.38) | 1,523 (1.10) |
| Arterial hypertension, N (%) | 7,109 (34.75) | 17,614 (22.03) | 25,611 (18.21) | 21,390 (15.39) |
| Dyslipidemia, N (%) | 6,494 (31.74) | 17,367 (21.72) | 26,653 (18.95) | 24,842 (17.87) |
| Lymphoma, N (%) | 940 (4.60) | 4,055 (5.07) | 8,673 (6.17) | 9,712 (6.99) |
| Gastrointestinal bleeding, N (%) | 361 (1.76) | 605 (0.76) | 857 (0.61) | 661 (0.48) |
| Chronic bronchitis, N (%) | 1,441 (7.04) | 3,575 (4.47) | 5,810 (4.13) | 5,201 (3.74) |
| Cystic fibrosis, N (%) | 568 (2.78) | 830 (1.04) | 1,136 (0.81) | 884 (0.64) |
| Outcome variables | | | | |
| Hospitalization, N (%) | 5,486 (26.82) | 6,943 (8.69) | 10,951 (7.78) | 2,236 (1.61) |
| Death, N (%) | 1,678 (8.20) | 1,764 (2.21) | 1,901 (1.35) | 584 (0.42) |
| Bad progress, N (%) | 2,028 (9.91) | 2,334 (2.92) | 3,112 (2.21) | 761 (0.55) |
Footnote: The superscripts found in the variable names indicate the periods in which that variable
satisfies the independence test with an alpha of 0.01. In case no superscripts are shown, it means there
are significant differences among all the periods. Only sociodemographic variables, comorbidities with
significant differences among all periods and the outcomes are included.
Turning now to the clusters’ identification, four clusters, from Very low to Very high,
were identified in Periods 1 to 3, while three clusters were identified in the Omicron
period. In Fig 1 COVID-19 patients are plotted according to their age and the Charlson
index and are colored by the identified clusters. For each cluster, the median values of
age and the Charlson index are represented by bigger dots. In addition, tables with the
sample size of the different clusters are shown for each period. Age is the variable that
differentiates the Very low and Low clusters while the Charlson index is similar in these
clusters for all the periods. In contrast, the High and Very high clusters have similar
ages but different Charlson index values, which are higher for the Very high group. Of note
here is that in the last period there is just one High-Very high severity cluster with a
In order to analyze the differences of the clusters in more detail, in Fig 2 the age and the
Charlson index density plots corresponding to the clusters identified in Fig 1 are shown.
The Very low and Low clusters present different age distributions and similar Charlson
index density distributions. On the other hand, the High and Very high clusters have
similar age distributions but different Charlson index distributions. In the last period the
High-Very high cluster has a similar age distribution to these clusters in the previous
Fig 3 shows the proportions of hospitalization, bad progress and death in each cluster
for all the periods. The clusters created in an unsupervised manner present a stepped
proportion of adverse outcomes of the disease: the proportion of each outcome
increases with the risk level of the clusters in every period.
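The stepped pattern can be checked with simple arithmetic. The following Python sketch uses the Period 1 cluster sizes and outcome counts reported in Table 2; the variable and function names are hypothetical and the code is only an illustration of the check, not part of the study's analysis.

```python
# Period 1 cluster sizes and outcome counts, as reported in Table 2.
RISK_ORDER = ["Very low", "Low", "High", "Very high"]
CLUSTER_SIZES = {"Very low": 11148, "Low": 4996, "High": 3051, "Very high": 1262}
OUTCOME_COUNTS = {
    "Hospitalization": {"Very low": 1590, "Low": 1840, "High": 1368, "Very high": 688},
    "Bad progress": {"Very low": 155, "Low": 609, "High": 828, "Very high": 436},
    "Death": {"Very low": 40, "Low": 457, "High": 767, "Very high": 414},
}


def outcome_rates(counts):
    """Percentage of patients with the outcome in each cluster, in risk order."""
    return [100 * counts[c] / CLUSTER_SIZES[c] for c in RISK_ORDER]


def is_stepped(rates):
    """True when each risk level has a strictly higher rate than the previous one."""
    return all(lower < higher for lower, higher in zip(rates, rates[1:]))


for outcome, counts in OUTCOME_COUNTS.items():
    rates = outcome_rates(counts)
    print(outcome, [round(r, 2) for r in rates], is_stepped(rates))
```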
In Table 2 the summary of the differences between clusters for the same period can be
seen. Only the variables with significant differences in the pairwise tests performed
between the different level clusters and the ones that have at least one large effect size
between clusters in those tests are shown in Table 2. The numeric superscripts on
the variable values indicate the indexes of the clusters with which that value
shows a large effect size for that specific variable. All the outcomes
were also included. The rest of the comparisons can be found online in the S2 Table.
Table 2. Cluster descriptive characteristics by period and differences for each period by
cluster.
| Variables | Very low [1] | Low [2] | High [3] | Very high [4] |
|---|---|---|---|---|
| Period 1 | N = 11148 | N = 4996 | N = 3051 | N = 1262 |
| Age, Median [Q1,Q3] | 46 [36-54]^2,3,4 | 71 [60-82]^1,3 | 85 [76-90]^1,2 | 80 [70-87]^1 |
| Nursing home, N (%) | 168 (1.51)^3,4 | 1295 (25.92) | 1557 (51.03)^1 | 503 (39.86)^1 |
| Charlson index, Median [Q1,Q3] | 0 [0-0]^3,4 | 1 [0-2]^3,4 | 3 [2-4]^1,2,4 | 7 [6-9]^1,2,3 |
| Congestive heart failure, N (%) | 36 (0.32)^3,4 | 23 (0.46)^4 | 1043 (34.19)^1 | 577 (45.72)^1,2 |
| Peripheral vascular disease, N (%) | 38 (0.34)^4 | 133 (2.66) | 489 (16.03) | 398 (31.54)^1 |
| Cerebrovascular disease, N (%) | 250 (2.24)^3,4 | 435 (8.71) | 1255 (41.13)^1 | 544 (43.11)^1 |
| Liver disease, N (%)^1-4 | | | | |
| Mild | 190 (1.70) | 369 (7.39) | 278 (9.11) | 235 (18.62) |
| Moderate/severe | 11 (0.10) | 16 (0.32) | 15 (0.49) | 98 (7.77) |
| Diabetes, N (%)^1-3,1-4,2-4,3-4 | | | | |
| Yes, without organ damage | 0 (0.92) | 818 (16.37) | 872 (28.58) | 358 (28.37) |
| Yes, with organ damage | 2 (0.02) | 36 (0.72) | 134 (4.39) | 359 (28.45) |
| Kidney disease, N (%) | 237 (2.13)^4 | 305 (6.1)^4 | 910 (29.83) | 754 (59.75)^1,2 |
| Metastatic solid tumor, N (%) | 47 (0.42)^3,4 | 22 (0.44)^3,4 | 0 (0)^1,2 | 438 (34.71)^1,2 |
| HIV, N (%) | 5 (0.04)^3 | 6 (0.12)^3 | 0 (0)^1,2 | 20 (1.58) |
| Heart failure, N (%) | 36 (0.32)^3,4 | 23 (0.46)^4 | 1043 (34.19)^1 | 577 (45.72)^1,2 |
| Arterial hypertension, N (%) | 476 (4.27)^2,3,4 | 3191 (63.87)^1 | 2456 (80.5)^1 | 986 (78.13)^1 |
| Antidiabetics, N (%) | 73 (0.65)^4 | 690 (13.81) | 779 (25.53) | 566 (44.85)^1 |
| Diuretics, N (%) | 35 (0.31)^3,4 | 468 (9.37) | 1281 (41.99)^1 | 568 (45.01)^1 |
| RAAS inhibitors, N (%) | 93 (0.83)^2,3,4 | 2227 (44.58)^1 | 1480 (48.51)^1 | 502 (39.78)^1 |
| Lipid lowering drugs/statins, N (%) | 164 (1.47)^3,4 | 1598 (31.99) | 1302 (42.67)^1 | 545 (43.19)^1 |
| Anticoagulants, N (%) | 134 (1.2)^3,4 | 715 (14.31)^3 | 2255 (73.91)^1,2 | 799 (63.31)^1 |
| Antiplatelets, N (%) | 59 (0.53)^3,4 | 395 (7.91) | 1333 (43.69)^1 | 438 (34.71)^1 |
| Hospitalization, N (%) | 1590 (14.26) | 1840 (36.83) | 1368 (44.84) | 688 (54.52) |
| Bad progress, N (%) | 155 (1.39) | 609 (12.19) | 828 (27.14) | 436 (34.55) |
| Death, N (%) | 40 (0.36)^4 | 457 (9.15) | 767 (25.14) | 414 (32.81)^1 |
| Period 2 | N = 44399 | N = 22857 | N = 9638 | N = 3048 |
| Age, Median [Q1,Q3] | 35 [26-43]^2,3,4 | 58 [53-65]^1,3,4 | 76 [65-85]^1,2 | 77 [66-86]^1,2 |
| Charlson index, Median [Q1,Q3] | 0 [0-0]^3,4 | 0 [0-0]^3,4 | 2 [1-3]^1,2,4 | 7 [5-8]^1,2,3 |
| Congestive heart failure, N (%) | 63 (0.14)^4 | 50 (0.22)^4 | 1386 (14.38) | 1120 (36.75)^1,2 |
| COPD, N (%) | 7118 (16.03) | 511 (2.24)^4 | 3255 (33.77) | 1323 (43.41)^2 |
| Liver disease, N (%)^1-4 | | | | |
| Mild | 430 (0.97) | 968 (4.24) | 884 (9.17) | 518 (16.99) |
| Moderate/severe | 20 (0.05) | 21 (0.09) | 25 (0.26) | 198 (6.50) |
| Diabetes, N (%)^1-3,1-4,2-3,2-4 | | | | |
| Yes, without organ damage | 252 (0.57) | 1042 (4.56) | 2779 (28.83) | 894 (29.33) |
| Yes, with organ damage | 21 (0.05) | 31 (0.14) | 235 (2.44) | 637 (20.9) |
| Kidney disease, N (%) | 864 (1.95)^4 | 463 (2.03)^4 | 1705 (17.69) | 1574 (51.64)^1,2 |
| Metastatic solid tumor, N (%) | 50 (0.11)^3,4 | 63 (0.28)^3,4 | 0 (0)^1,2 | 1080 (35.43)^1,2 |
| HIV, N (%) | 18 (0.04)^3 | 15 (0.07)^3 | 0 (0) | 38 (1.25) |
| Heart failure, N (%) | 63 (0.14)^4 | 50 (0.22)^4 | 1386 (14.38) | 1120 (36.75)^1,2 |
| Arterial hypertension, N (%) | 623 (1.4)^3,4 | 7347 (32.14) | 7372 (76.49)^1 | 2272 (74.54)^1 |
| Dyslipidemia, N (%) | 1230 (2.77)^3,4 | 8530 (37.32) | 5773 (59.9)^1 | 1834 (60.17)^1 |
| Antidiabetics, N (%) | 217 (0.49)^4 | 946 (4.14) | 2528 (26.23) | 1209 (39.67)^1 |
| Diuretics, N (%) | 37 (0.08)^4 | 629 (2.75) | 2250 (23.35) | 1182 (38.78)^1 |
| RAAS inhibitors, N (%) | 79 (0.18)^3,4 | 4581 (20.04) | 5377 (55.79)^1 | 1300 (42.65)^1 |
| Lipid lowering drugs/statins, N (%) | 93 (0.21)^3,4 | 3355 (14.68) | 4872 (50.55)^1 | 1418 (46.52)^1 |
| Anticoagulants, N (%) | 423 (0.95)^3,4 | 990 (4.33)^3,4 | 5002 (51.9)^1,2 | 1785 (58.56)^1,2 |
| Antiplatelets, N (%) | 124 (0.28)^3,4 | 257 (1.12) | 3054 (31.69)^1 | 971 (31.86)^1 |
| Hospitalization, N (%) | 1129 (2.54) | 2151 (9.41) | 2438 (25.30) | 1225 (40.19) |
| Bad progress, N (%) | 126 (0.28) | 497 (2.17) | 1036 (10.75) | 675 (22.15) |
| Death, N (%) | 18 (0.04) | 247 (1.08) | 871 (9.04) | 628 (20.60) |
| Period 3 | N = 69409 | N = 50401 | N = 16735 | N = 4127 |
| Age, Median [Q1,Q3] | 30 [23-38]^2,3,4 | 53 [48-61]^1,3,4 | 72 [63-81]^1,2 | 74 [63-84]^1,2 |
| Charlson index, Median [Q1,Q3] | 0 [0-1]^3,4 | 0 [0-0]^3,4 | 2 [1-3]^1,2,4 | 7 [5-8]^1,2,3 |
| Congestive heart failure, N (%) | 107 (0.15)^4 | 106 (0.21)^4 | 1756 (10.49) | 1388 (33.63)^1,2 |
| Liver disease, N (%)^1-4 | | | | |
| Mild | 581 (0.84) | 1365 (2.71) | 1703 (10.18) | 710 (17.20) |
| Moderate/severe | 26 (0.04) | 32 (0.06) | 36 (0.22) | 241 (5.84) |
| Diabetes, N (%)^1-3,1-4,2-3,2-4 | | | | |
| Yes, without organ damage | 398 (0.57) | 1070 (2.12) | 4523 (27.03) | 1084 (26.27) |
| Yes, with organ damage | 27 (0.04) | 45 (0.09) | 401 (2.4) | 793 (19.21) |
| Kidney disease, N (%) | 1837 (2.65) | 861 (1.71)^4 | 2449 (14.63) | 1939 (46.98)^2 |
| Metastatic solid tumor, N (%) | 12 (0.02)^3,4 | 78 (0.15)^3,4 | 0 (0)^1,2 | 1656 (40.13)^1,2 |
| HIV, N (%) | 12 (0.02)^3 | 19 (0.04)^3 | 0 (0)^1,2 | 99 (2.4) |
| Heart failure, N (%) | 107 (0.15)^4 | 106 (0.21)^4 | 1756 (10.49) | 1388 (33.63)^1,2 |
| Arterial hypertension, N (%) | 778 (1.12)^3,4 | 9613 (19.07)^3 | 12364 (73.88)^1,2 | 2856 (69.2)^1 |
| Dyslipidemia, N (%) | 1749 (2.52)^3,4 | 12682 (25.16) | 9936 (59.37)^1 | 2286 (55.39)^1 |
| Antidiabetics, N (%) | 329 (0.47)^4 | 921 (1.83) | 4262 (25.47) | 1525 (36.95)^1 |
| Diuretics, N (%) | 41 (0.06)^4 | 646 (1.28) | 3007 (17.97) | 1411 (34.19)^1 |
| RAAS inhibitors, N (%) | 93 (0.13)^3,4 | 5356 (10.63) | 9552 (57.08)^1 | 1770 (42.89)^1 |
| Lipid lowering drugs/statins, N (%) | 135 (0.19)^3,4 | 3716 (7.37) | 8681 (51.87)^1 | 1801 (43.64)^1 |
| Anticoagulants, N (%) | 784 (1.13)^3,4 | 1243 (2.47)^3,4 | 7487 (44.74)^1,2 | 2231 (54.06)^1,2 |
| Hospitalization, N (%) | 1843 (2.66) | 3831 (7.60) | 3734 (22.31) | 1543 (37.39) |
| Bad progress, N (%) | 218 (0.31) | 773 (1.53) | 1352 (8.08) | 769 (18.63) |
| Death, N (%) | 11 (0.02) | 231 (0.46) | 970 (5.80) | 689 (16.69) |
| Period 4 | N = 99899 | N = 31581 | | High-Very high [4]: N = 7538 |
| Age, Median [Q1,Q3]^All | 39 [28-46] | 60 [55-68] | | 73 [63-84] |
| Vaccines, N (%)^1-2,1-4 | | | | |
| 2 doses | | | | |
| Charlson index, Median [Q1,Q3] | | | | 4 [3-5]^1,2 |
| Yes, without organ damage | | | | |
the cluster indexes that have large effect sizes with that specific variable for that specific cluster. Only
variables with large effect sizes, as suggested by [26] and [27], in the performed tests were included
Apart from the Charlson index and age, the variables that differentiate the lower-risk
and higher-risk clusters in the first three periods are: a higher percentage in the higher
disease, liver disease, diabetes, kidney disease, metastatic solid tumor, heart failure,
failure, COPD, liver disease, diabetes, kidney disease, metastatic solid tumor, heart
congestive heart failure, liver disease, diabetes, kidney disease, metastatic solid tumor,
Omicron period the proportion of vaccinated people, the presence of diabetes, arterial
statins, anticoagulants and antiplatelets segregate the lower risk clusters with the High-
Even though the presence of comorbidities is more pronounced in the High and Very
high clusters, some comorbidities gradually increase from the Very low to
the Low cluster and differentiate these profiles: diabetes, arterial hypertension and
dyslipidemia. In addition, the High and Very high clusters are mainly segregated by
their proportions of heart failure, liver disease, diabetes, kidney disease and metastatic
solid tumor, all of which contribute to the Charlson index.
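To make the contribution of these comorbidities to the Charlson index concrete, the following Python sketch scores a patient with the weights of the original Charlson index (without age adjustment). The text does not state which version of the index was used, so the weights, the rule that the more severe form of a condition supersedes the milder one, and all the names in the code are assumptions made for illustration.

```python
# Weights of the original Charlson comorbidity index (assumption: the study
# may have used a different or updated version of the index).
CHARLSON_WEIGHTS = {
    "myocardial_infarction": 1,
    "congestive_heart_failure": 1,
    "peripheral_vascular_disease": 1,
    "cerebrovascular_disease": 1,
    "dementia": 1,
    "chronic_pulmonary_disease": 1,
    "rheumatic_disease": 1,
    "peptic_ulcer": 1,
    "mild_liver_disease": 1,
    "diabetes_without_organ_damage": 1,
    "diabetes_with_organ_damage": 2,
    "hemiplegia_paraplegia": 2,
    "kidney_disease": 2,
    "malignant_tumor": 2,
    "leukemia": 2,
    "lymphoma": 2,
    "moderate_severe_liver_disease": 3,
    "metastatic_solid_tumor": 6,
    "hiv_aids": 6,
}

# Assumption: when both forms are recorded, only the more severe one is scored.
SUPERSEDED_BY = {
    "mild_liver_disease": "moderate_severe_liver_disease",
    "diabetes_without_organ_damage": "diabetes_with_organ_damage",
    "malignant_tumor": "metastatic_solid_tumor",
}


def charlson_index(conditions):
    """Sum the weights of the recorded conditions, scoring severe forms only once."""
    present = set(conditions)
    unknown = present - set(CHARLSON_WEIGHTS)
    if unknown:
        raise ValueError(f"unknown conditions: {sorted(unknown)}")
    score = 0
    for condition in present:
        if SUPERSEDED_BY.get(condition) in present:
            continue  # the more severe form is scored instead
        score += CHARLSON_WEIGHTS[condition]
    return score


# Hypothetical patient: metastatic solid tumor and diabetes with organ damage.
print(charlson_index(["metastatic_solid_tumor", "diabetes_with_organ_damage"]))
```

With these weights a metastatic solid tumor alone contributes six points, which is consistent with this comorbidity appearing almost exclusively in the Very high cluster.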
Finally, even though there are no large effect sizes in the comparisons of the COVID-19 outcomes,
except for death between the Very low and Very high clusters in period 1,
for all the outcomes (hospitalization, bad progress and death) and
periods there are significant differences in their proportions among clusters, as
seen in Fig 3.
Table 3 shows a summary of the differences among periods for a specific risk level.
Again, the variables shown in Table 3 are those that, for a specific cluster
level, have at least one large effect size in the tests performed for all the possible
combinations of periods. The numeric superscripts on the variable values
indicate the indexes of the periods with which that value shows a large effect size
for that specific variable. The High-Very high cluster found in the last
period was compared with the Very high cluster from the previous periods. The
outcomes were also included in the table. The rest of the comparisons can be found
| Variables | Period 1 | Period 2 | Period 3 | Period 4 |
|---|---|---|---|---|
| High | | | | |
| Age, Median [Q1, Q3] | 85 [76-90]^3 | 76 [65-85] | 72 [63-81]^1 | |
| Charlson index, Median [Q1, Q3] | 3 [2-4]^2,3 | 2 [1-3]^1 | 2 [1-3]^1 | |
| Hospitalization, N (%) | 1368 (44.84) | 2438 (25.30) | 3734 (22.31) | |
| Bad progress, N (%) | 828 (27.14) | 1036 (10.75) | 1352 (8.08) | |
| Death, N (%) | 767 (25.14) | 871 (9.04) | 970 (5.80) | |
| Very high (High-Very high in period 4) | | | | |
| Vaccines, N (%)^3-4 | | | | |
| 2 doses | | | 1464 (35.47) | 2094 (27.78) |
| 3 doses | | | 211 (5.11) | 5063 (67.17) |
| Charlson index, Median [Q1, Q3] | 7 [6-9]^4 | 7 [5-8]^4 | 7 [5-8]^4 | 4 [3-5]^2,3,4 |
| Hospitalization, N (%) | 688 (54.52) | 1225 (40.19) | 1543 (37.39) | 868 (11.51) |
| Bad progress, N (%) | 436 (34.55) | 675 (22.15) | 769 (18.63) | 446 (5.92) |
| Death, N (%) | 414 (32.81)^1 | 628 (20.60) | 689 (16.69) | 407 (5.40) |
Footnote: The numeric superscript found on top of the variable values that define the clusters indicate
the period indexes that have large effect sizes with that specific variable for that specific period. Only
variables with large effect sizes, as suggested by [26] and [27], in the performed tests were included
In this case, the only variables with at least one large effect size are age, the Charlson
index and the vaccine doses. With the exception of the Very high cluster, age decreases
over the course of the pandemic in all the clusters. In addition, leaving out the Very low cluster, the
Charlson index also decreases in the rest of the clusters. The same trend can be seen in the
proportions of the outcomes. Although there are no large effect sizes in the comparisons, the
proportions of death, bad progress and hospital admission decrease over the course of the
pandemic.
Discussion
This study utilized the KAMILA clustering technique to classify patients according to
their clinical and sociodemographic characteristics for different time periods of the
periods on account of the high hospitalization and death rates in the early stages of the
COVID-19 pandemic.
As we hypothesized, the identified clusters are closely associated with the adverse
outcomes of the disease, creating a description of risk profiles. Indeed, the severity of
the outcomes increases with the risk level of the clusters for all the periods and
outcomes. While the outcomes were related to COVID-19, this procedure is
independent of the virus and could be used in new pandemics or other diseases. This
identification of risk groups and to provide targeted care to the most vulnerable patients.
As a summary of the results, we found four profiles based on age and the presence, or
Age and the Charlson index, as a measure of the impact of the patients’ comorbidities,
emerged as important factors discriminating the profiles, with age being more relevant
in discerning the Very low and Low profiles, while the Charlson index was more important
in separating the High and Very high profiles. Although age, the Charlson index and
the proportion of comorbidities decreased over the course of the pandemic, there are no large differences,
which suggests that the clusters have been stable over time. In fact, the COVID-19
The highest risk profiles are characterized by higher ages and higher Charlson index
values, which are associated with poorer outcomes of COVID-19. On the one hand, age
has a significant impact due to its correlation with frailty and overall health condition,
which have been shown in the literature to result in poorer COVID-19 outcomes [21-23].
On the other hand, the Charlson index mainly segregates the higher risk profiles,
suggesting its influence on a worse prognosis of the disease. This is consistent with the
use of the Charlson index as an indicator of the patient’s health status. In addition, prior
studies have also associated the Charlson index with a worse COVID-19 prognosis [24].
The evolution of these variables over the course of the pandemic is explained by the special case of
the first period. The lower testing capacity at the initial stage of the pandemic [25]
resulted in selective testing of the most severe cases, which may have biased the results
of this period. For instance, the median age of the Low cluster in this period was 71,
which is high considering its risk level. However, in the subsequent two periods, with
In the last period (Omicron variant) the Very high and High clusters merge,
creating an intermediate cluster that we renamed High-Very high. This can be seen
by looking at Fig 1 and Fig 2 together: the median Charlson index of the
High-Very high cluster in the last period lies between the median values of both
clusters from the previous period, and its distribution is a combination of their density
plots. Accordingly, the patients with the best prognosis in the High
profile of the previous periods may have mixed with the patients from the Low profile, resulting in
an increase of the median age of the Low profile in the last period.
Regarding the rest of the variables, there were no large differences among periods for
clusters of the same risk level. This leads us to conclude that the clusters have been stable
throughout the pandemic and that the reduction of the adverse outcomes of the disease is
explained by the following reasons: the vaccination process starting in 2021 and the
higher risk profiles. Bearing in mind the High and Very high profiles are distinct due to
the Charlson index, it is interesting to note which comorbidities segregate these clusters.
disease, kidney disease, metastatic solid tumor and heart failure. These variables have
[28-31]. Furthermore, other comorbidities gradually increase in all the clusters like
arterial hypertension, dyslipidemia and diabetes and even discriminate the Very low and
Low profiles. These variables have already been associated with poorer outcomes of
COVID-19 [32,33].
anticoagulants and antiplatelets proportions differ between the higher risk clusters and the
lower ones. These treatments are usually employed to treat various comorbidities
simultaneously; thus, we consider that they are not that important by themselves for creating
patient profiles. Rather than being essential in creating the profiles, baseline treatments
The strong correspondence of our results with those found in the literature reinforces
that KAMILA can effectively classify patients. In fact, the variables that differ the most
between the profiles are the ones that have been highlighted in the literature as suggestive of a
worse COVID-19 prognosis. This also accords with the earlier studies suggesting KAMILA
COVID-19-positive cases in the Basque Country over a period of almost two years.
Additionally, the database used in this study contained a wide range of variables,
allowed the identification of patient profiles. Furthermore, the study has not been
restricted to hospitalized patients, unlike some previous studies [9,10], and the profile
there are also several limitations in our study. First, there are no established standards
for the statistical validation of unsupervised clustering results, but the acceptance of the
clustering structures by clinicians with expertise in this topic helped to mitigate this
issue. Second, our study identifies statistical associations and descriptions, but does not
establish causality. Third, this study is time-limited and further research may be needed
to identify the profiles of COVID-19-positive patients in new waves. Finally, to enhance
Conclusions
The purpose of the present research was to classify patients according to their clinical
and sociodemographic profiles and subsequently show their association with the
evolution of the disease. Our initial hypothesis was tested and the profiles created in an
addition, the study’s results are consistent with the literature, suggesting that the variables
that differentiate the profiles the most are those highlighted in the literature as indicative of a
worse COVID-19 prognosis. In particular, age and the Charlson index have played a
major role in the determination of the profiles jointly with diabetes, kidney disease,
These findings suggest that this method can be used in new pandemics, with other
way. This could lead to a quick classification of the patients with the worst prognosis, allowing
for targeted care intervention strategies and resulting in an overall improvement of their
medical attention.
Acknowledgments
We are grateful for the support of the Basque Health Service, Osakidetza, and the
patients who participated in the study. Open access funding provided by BCAM.
References
Coronavirus from Patients with Pneumonia in China, 2019. New England Journal of
hospital capacity to meet changing demands during the COVID-19 pandemic. BMC
5. Landau S, Leese M, Stahl D, Everitt BS. Cluster Analysis. 5th ed. John Wiley & Sons;
2011.
6. McLachlan GJ. Cluster analysis and related techniques in medical research. Stat
10.1093/nar/gky889.
8. Grant RW, McCloskey J, Hatfield M, Uratsu C, Ralston JD, Bayliss E, et al. Use of
Latent Class Analysis and k-means Clustering to Identify Complex Patient Profiles.
2021:e0250569-e0250569.
https://doi.org/10.1038/s41598-021-83340-8.
methods for mixed-type data. Advances in Data Analysis and Classification. 2022. Doi:
https://doi.org/10.1007/s11634-022-00521-7.
15. Charlson ME, Sax FL, MacKenzie CR, Fields SD, Braham RL, Douglas Jr RG.
Assessing illness severity: does clinical judgement work? J Chronic Dis. 1986; 39(6):
16. WHO Collaborating Centre for Drug Statistics Methodology. Guidelines for ATC
10.1198/106186005X59243.
18. R Core Team. A language and environment for statistical computing. R Foundation
19. Kim HY. Statistical notes for clinical researchers: Chi-squared test and Fisher’s
exact test. Restorative Dentistry & Endodontics. 2017; 42(2): 152-155. Doi:
10.5395/rde.2017.42.2.152.
20. Vargha A, Delaney HD. A Critique and Improvement of the CL Common Language
Effect Size Statistics of McGraw and Wong. Journal of Educational and Behavioral
21. Gupta RK, Marks M, Samuels THA, Luintel A, Rampling T, Chowdhury H, et al.
hospitalised adults with COVID-19: an observational cohort study. Eur Respir J. 2020;
22. Verity R, Okell LC, Dorigatti I, Winskill P, Whittaker C, Imai N, et al. Estimates of
the severity of coronavirus disease 2019: a model-based analysis. The Lancet Infectious
Mortality and survival of COVID-19. Epidemiology & Infection. 2020; 148:e123. Doi:
10.1017/S0950268820001405.
24. Tuty Kuswardhani RA, Henrina J, Pranata R, Lim MA, Lawrensia S, Suastika K.
systematic review and meta-analysis. Diabetes & Metabolic Syndrome: Clinical Research
Mendoza C, et al. Main differences between the first and second waves of COVID-19 in
Madrid, Spain. International Journal of Infectious Diseases. 2021; 105: 374-376. Doi:
10.1016/j.ijid.2021.02.115.
Covid-19 Vaccines over a 9-Month Period in North Carolina. The New England Journal
27. Shuai H, Chan JFW, Hu B, Chai Y, Yuen TTT, Yin F, et al. Attenuated replication
28. Huang I, Lim MA, Pranata R. Diabetes mellitus is associated with an increased
analysis and meta-regression. Diabetes & Metabolic Syndrome: Clinical Research &
29. Cheng Y, Luo R, Wang K, Zhang M, Wang Z, Dong L, et al. Kidney disease is
Cohort Collaborative (N3C). BMC Infectious Diseases. 2022; 22: 784. Doi:
https://doi.org/10.1186/s12879-022-07776-7.
31. Rey JR, Caro-Codón J, Rosillo SO, Iniesta AM, Castrejón-Castrejón S, Marco-
prognostic implications. European Journal of Heart Failure. 2020; 22: 2205-2215. Doi:
https://doi.org/10.1002/ejhf.1990.
32. Pranata R, Lim MA, Huang I, Raharjo SB. Hypertension is associated with an
10.1177/1470320320926899.
https://doi.org/10.1016/j.dsx.2020.07.054.
Figure captions
Fig 1. Overview of the clusters according to age and the Charlson index.
Fig 2. Charlson index density plots for all the periods and clusters.
found in Table 2.
found in Table 3.
Declarations
Ethics Statement
The study protocol was approved by the Ethics Committee of the Basque Country
(reference PI2020123). The need for consent was waived by the ethics committee due to
Funding
This work was supported in part by the health outcomes group from Galdakao-
Barrualde Health Organization; the Kronikgune Institute for Health Service Research;
Instituto de Salud Carlos III (ISCIII) through the project "RD16/0001/0001" (Red de
Prevención y Promoción de la Salud) and co-funded by the European Union, and the
Project. The work of IB was financially supported in part by grants from the
[IT1456-22] and by the Ministry of Science and Innovation through BCAM Severo
Government through the BERC 2022-2025 program. DF has been supported by the
funders had no role in study design, data collection and analysis, decision to publish, or