
Multivariate Statistics:

First Challenge

Germán José Padua Pleguezuelo

Index:
1. Introduction to the problem
2. Glaucoma Research
3. Data Preprocessing
3.1. Excel
3.2. RStudio
3.2.1. Missing values
3.2.2. Outliers
4. Correlation
5. Extra Conclusions
6. Bibliography

1. Introduction to the problem
In this challenge, we are asked to analyze a dataset consisting of
121 patients who underwent glaucoma surgery using laser
technology. We ought to determine whether the pre-surgery
condition is related to the long-term progression of the patient.

I used R with RStudio to run the calculations on the data, and
KNIME because it offers a simple way to visualize the results.

2. Glaucoma Research
In order to understand the problem correctly, an initial
investigation about glaucoma should be performed.

First, I learned that glaucoma is not a single disease but a group
of eye diseases that damage the optic nerve. There is no cure for
glaucoma, but early treatment can often stop the vision loss, so it
is crucial to diagnose it in its early stages.

There are two major types:


- Open-angle glaucoma: this is the most common type. It
happens when the eye does not drain fluid as well as it
should. As a result, eye pressure builds up and starts to
damage the optic nerve.
- Angle-closure glaucoma: this type happens when the iris is
very close to the drainage angle in the eye and it can end up
blocking it. When it’s completely blocked, eye pressure rises
very quickly.

Other glaucoma types include normal-tension, congenital,
neovascular, pigmentary, exfoliation and uveitic glaucoma, among
others.

Every type has its own treatment. For example, open-angle glaucoma
can be treated with medicines, laser treatment or surgery, whereas
uveitic glaucoma is usually treated with medicines or surgery.

Laser treatment is a procedure that helps the fluid in the eye
drain, which can lower the pressure inside the eye. Side effects
include swelling or soreness. Sometimes the laser can scratch the
cornea or make it very dry, which can be painful, but the pain
usually goes away as the cornea heals. This treatment works very
well for most people, but not for everyone. It usually takes 4 to 6
weeks to know whether it has worked.

Finally, I looked up information about intraocular pressure (IOP).
Each person's eye pressure is different, and there is no single
correct pressure for everyone. The normal range is generally taken
to be between 10 and 21 mmHg. Most people who have glaucoma have an
eye pressure higher than 21 mmHg, although some people with lower
pressure also suffer from glaucoma.

3. Data Preprocessing
3.1. Excel
Looking at the Excel file, I fixed some errors:
- Many cells had the value #N/D. I cleared all of them so that they
are treated as missing values.
- Some cells of quantitative variables contained text, so I
removed it.
- In the column ENERGIA_IMPACTO, some values were given as
intervals. I replaced each interval with the mean of its two
endpoints.
- Two rows had most of their values equal to zero. I deleted them
from the dataset because I considered them useless for the study.
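
The cleaning rules above were applied by hand in Excel. As a rough illustration only, they could be expressed in Python as follows. The interval format ("low-high"), the decimal-comma handling and the 50% threshold for a "mostly zero" row are assumptions, not details taken from the original file, and the function names are hypothetical.

```python
import math

# "#N/D" is the marker named in the report; the empty string is an assumption.
MISSING_MARKERS = {"", "#N/D"}


def parse_cell(raw):
    """Convert one raw spreadsheet cell to a float, or NaN if it is missing."""
    # Assumption: decimal commas may appear and should be read as points.
    text = str(raw).strip().replace(",", ".")
    if text in MISSING_MARKERS:
        return math.nan
    # Assumption: intervals are written as "low-high", e.g. "0.8-1.0".
    low, sep, high = text.partition("-")
    if sep and low:
        return (float(low) + float(high)) / 2  # mean of the two endpoints
    return float(text)


def is_mostly_zero(row, threshold=0.5):
    """Return True when more than `threshold` of the known values are zero."""
    known = [v for v in row if not math.isnan(v)]
    if not known:
        return False
    zeros = sum(1 for v in known if v == 0)
    return zeros / len(known) > threshold


if __name__ == "__main__":
    print(parse_cell("0.8-1.0"))  # interval replaced by its midpoint
    print(parse_cell("#N/D"))     # treated as missing
    print(is_mostly_zero([0.0, 0.0, 0.0, 1.2]))
```

Keeping missing cells as NaN instead of deleting the rows leaves the decision about imputation to a later step.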

3.2. RStudio
I first noticed that the columns ENERGIA_TOTAL, ENERGIA_IMPACTO and
N_IMPACTOS should be related by:
ENERGIA_TOTAL = ENERGIA_IMPACTO * N_IMPACTOS
However, this equation does not hold in general. It holds exactly
in only 7 cases, and even with a tolerance of 30 some rows fail to
satisfy it. We can conclude that there is an error in at least one
of the three variables, but we do not know which one. Since few
rows differ by more than 25 from their expected value, I deleted
those rows from the dataset.
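
The report's check was done in R; the following is a minimal Python sketch of the same idea, not the original code. The function names and the example rows are illustrative assumptions; only the column names and the tolerance of 25 come from the text.

```python
def energy_mismatch(row):
    """Absolute gap between the recorded total and the expected product."""
    expected = row["ENERGIA_IMPACTO"] * row["N_IMPACTOS"]
    return abs(row["ENERGIA_TOTAL"] - expected)


def drop_inconsistent(rows, tolerance=25.0):
    """Keep only rows whose total is within `tolerance` of the product."""
    return [row for row in rows if energy_mismatch(row) <= tolerance]


if __name__ == "__main__":
    # Made-up rows, used only to show the behaviour of the filter.
    rows = [
        {"ENERGIA_TOTAL": 100.0, "ENERGIA_IMPACTO": 1.0, "N_IMPACTOS": 100},
        {"ENERGIA_TOTAL": 120.0, "ENERGIA_IMPACTO": 1.0, "N_IMPACTOS": 100},
        {"ENERGIA_TOTAL": 150.0, "ENERGIA_IMPACTO": 1.0, "N_IMPACTOS": 100},
    ]
    kept = drop_inconsistent(rows)
    print(len(kept), "of", len(rows), "rows kept")
```

Because the filter cannot tell which of the three columns is wrong, it removes the whole row instead of trying to correct a single value.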

3.2.1. Missing values


My next step was to deal with missing values. The attributes with
the highest rate of missing values are CIRUJIA_PREVIA, DOLOR and
SEXO. They are all discrete, so I replaced the missing values with
the median rather than the mean. A drawback of this substitution is
that many missing values are replaced by a single value: for
example, DOLOR had 60 NA values and all of them became 1. Other
approaches, such as k-nearest neighbors or regression imputation,
could be tried if time allows.
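
As an illustration of median imputation for a discrete column, here is a short Python sketch (the report itself used R). Missing values are represented as None, and `statistics.median_low` is used so that the fill value is always a code that actually occurs in the data. Both choices are assumptions of this sketch, and the sample values are made up.

```python
import statistics


def impute_median(values):
    """Replace None entries with the (low) median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.median_low(observed)
    return [fill if v is None else v for v in values]


def imputed_share(values):
    """Fraction of entries that were missing and therefore get imputed."""
    return sum(1 for v in values if v is None) / len(values)


if __name__ == "__main__":
    dolor = [1, None, 2, 1, None]  # toy values, not the real column
    print(impute_median(dolor))
    print(f"{imputed_share(dolor):.0%} of the column is imputed")
```

Reporting the imputed share next to the result makes the drawback described above visible: the larger the share, the more the column collapses onto a single value.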

The next columns I treated are OJO (4 NA) and TIPO_GLAUCOMA (2 NA).
Since only 6 rows are affected, I excluded them from the rest of
the study instead of imputing values, because these variables could
be important factors for the other attributes.

Finally, I noticed that the eye pressure columns contained zeroes,
which is an impossible value for that magnitude. They are probably
measurement errors, so I calculated the mean of each column
(ignoring the null values) and replaced the zeroes with it. This
replacement alters the distribution of the IOP columns and of the
patients' progression, so I do not expect to reach a strong
conclusion.
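
A minimal Python sketch of this replacement follows (the report used R). It assumes that "null" covers both missing entries, represented here as None, and the zeroes themselves, so the mean is taken over the remaining valid readings. The sample values are invented.

```python
import statistics


def replace_zeros_with_mean(values):
    """Replace zero readings with the mean of the valid (non-zero) ones.

    Missing entries (None) are left untouched and excluded from the mean.
    """
    valid = [v for v in values if v is not None and v != 0]
    fill = statistics.fmean(valid)
    return [fill if v == 0 else v for v in values]


if __name__ == "__main__":
    pio = [20, 0, 16, None, 0]  # toy IOP readings in mmHg
    print(replace_zeros_with_mean(pio))
```

Every replaced reading lands exactly on the column mean, which is why the spread of the column shrinks after this step.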

3.2.2. Outliers
I started by visualizing all quantitative variables with boxplots
to identify whether they have outliers. There was an extreme value
of 45019 in CUADRANTE, so I replaced it with the mode (4). Then I
decided to remove the column ENERGIA_TOTAL, because its data was
redundant and it contained many outliers.

Next, I wanted very high values to remain high, so instead of
replacing them with the mean or another summary measure I pulled
them closer to the rest of the data:
- PIO_PRE_SLT: I changed the four values greater than 35 (36 was
replaced by 30 and 46 by 33).
- N_IMPACTOS: I focused on values greater than 175 (178 was
replaced by 174, 183 by 175 and 202 by 180).
- ENERGIA_IMPACTO: I reduced the highest value, 2.4, to 2.2.
- There are many outliers in the columns that record the
progression of PIO after the treatment. I tried two approaches:
first, I winsorized the data at the 10th and 90th percentiles;
second, I applied a log transformation to the variables.
Winsorizing gave the “best” result, so I continued with that
data.
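
The two approaches tried on the post-treatment PIO columns can be sketched in Python as below (the report used R). The percentile method ("inclusive", which interpolates between data points) is an assumption, since the text does not say how the percentiles were computed, and the readings are made up.

```python
import math
import statistics


def winsorize(values, lower_pct=10, upper_pct=90):
    """Clamp values below/above the given percentiles to those percentiles."""
    # With n=100, quantiles() returns the 1st..99th percentile cut points.
    cuts = statistics.quantiles(values, n=100, method="inclusive")
    low, high = cuts[lower_pct - 1], cuts[upper_pct - 1]
    return [min(max(v, low), high) for v in values]


def log_transform(values):
    """Natural log of each value; only valid for strictly positive data."""
    return [math.log(v) for v in values]


if __name__ == "__main__":
    # Toy readings with one extreme value at the end.
    pio = [12, 14, 15, 15, 16, 17, 18, 18, 19, 20, 48]
    print(winsorize(pio))
    print([round(v, 2) for v in log_transform(pio)])
```

Winsorizing keeps the extreme observations in the dataset but caps their magnitude, whereas the log transformation keeps their order and compresses the upper tail.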

4. Correlation
I calculated the correlation matrix as learned in the practical
sessions on PCA and listed the pairs of variables with a
correlation higher than 0.5 or lower than -0.5. Some of these
correlations make sense, such as the positive correlation among the
PIO measurements after the operation or the positive correlation
among the FARMACOS variables.
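
The filtering step described above can be sketched in Python as follows (the report used R). It assumes the coefficient is Pearson's, which the text does not state for this step. The column names A, B and C and their values are placeholders, not data from the study.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def strong_pairs(columns, threshold=0.5):
    """List (name_a, name_b, r) for every pair with |r| above the threshold."""
    pairs = []
    for (name_a, a), (name_b, b) in combinations(columns.items(), 2):
        r = pearson(a, b)
        if abs(r) > threshold:
            pairs.append((name_a, name_b, r))
    return pairs


if __name__ == "__main__":
    columns = {
        "A": [1, 2, 3, 4, 5],
        "B": [2, 4, 6, 8, 10],
        "C": [3, 1, 4, 1, 5],
    }
    for name_a, name_b, r in strong_pairs(columns):
        print(f"{name_a} ~ {name_b}: r = {r:.2f}")
```

Listing only the pairs above the threshold keeps the output readable when the matrix has many columns.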

Unfortunately, PIO_PRE_SLT, which is the pressure before the
intervention, does not appear to have any correlation with the
progression of the patient. Because of that, I used another tool,
KNIME, to visualize the data and calculate other measures.

With KNIME I calculated two matrices: the rank correlation matrix,
based on Spearman’s rank correlation coefficient, and the linear
correlation matrix, which uses Pearson’s product-moment coefficient
for numeric variables and Pearson’s chi-square test on the
contingency table for nominal ones.
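
To make the difference between the two coefficients concrete, the sketch below computes both for numeric data in Python. It is not KNIME's implementation: Spearman's coefficient is obtained here as Pearson's coefficient on average ranks (a standard definition), the chi-square part for nominal variables is left out, and the sample data is made up.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation: Pearson applied to the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    x = [1, 2, 3, 4, 5]
    y = [1, 4, 9, 16, 25]  # monotone but not linear in x
    print(f"Pearson:  {pearson(x, y):.3f}")
    print(f"Spearman: {spearman(x, y):.3f}")
```

For a monotone but non-linear relationship the rank coefficient stays at its maximum while the linear one drops, so similar values in both matrices suggest roughly linear relationships.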

Rank Correlation:

Linear Correlation:

The two matrices give very similar results. Notably, IOP before the
intervention is positively correlated with the pressure afterwards.
It seems that each person’s own eye pressure is what most
influences the pressure they have for some time after the laser
treatment. In the longer term, however, the correlation coefficient
between PIO_PRE_SLT and PIO_3_MES is lower, so more than 1 month
may be needed to evaluate the results of the treatment.

5. Extra Conclusions
I want to add some visualizations of the data to see which type of
glaucoma is the most common and whether there is a relation between
the age of the patients and the type of glaucoma they suffer from.

The most common types are GPAA (Glaucoma Primario de Ángulo
Abierto, primary open-angle glaucoma) and HTO (Hipertensión Ocular,
ocular hypertension).

We would like to emphasize the large number of patients with
open-angle glaucoma and HTO and the wide age range affected by
glaucoma, which can strike individuals as early as their 40s and
through their 80s. People in this age group should undergo periodic
eye examinations so that glaucoma is identified in its initial
stages and treated correctly.

6. Bibliography
https://www.nhs.uk/conditions/glaucoma/
https://www.aao.org/salud-ocular/enfermedades/que-es-la-glaucoma
https://glaucoma.org/hypotony/
What is low eye pressure and does it cause any damage to your eyes? - American Academy of Ophthalmology
https://www.nei.nih.gov/learn-about-eye-health/eye-conditions-and-diseases/glaucoma
