An ML-based Predictive Model for Vision Disorders in Adolescents
Divya Nori
9th Grade
2017 – 2018
ABSTRACT
Over the past 50 years, the number of young people diagnosed with near-sightedness or
myopia in the U.S. has doubled. If this trend continues, by 2050, half of the world’s population,
around 5 billion people, will be myopic. This rapid increase in the number of cases diagnosed is
alarming because myopia can serve as a precursor to many other vision problems, even leading
to blindness in extreme situations. Until 5 years ago, scientists thought that almost all cases of
pathological myopia were caused by a genetic predisposition. Recent research shows that while
30% of these cases have occurred due to genetics or during early childhood years, 60% of these
cases have progressed from a treatable form of myopia, starting in adolescence. By using a
mathematical model to predict the likelihood of myopia, time and money can be saved.
Additionally, the prevalence of myopia and other related vision disorders can be reduced. This
investigation created a predictive model using machine learning to determine risk factors of
myopia using data collected through a survey. The survey was distributed to 60 adolescents, and
responses were recorded. A database was created with the data, and the data was then analyzed
by a computer program. The program was written in R. The time spent on a screen in the car,
the time spent outdoors, the race of the subject, the height of the subject, and whether the
subject’s sibling had glasses were determined to be the strongest predictors of dependence on a
vision aid.
TABLE OF CONTENTS
ABSTRACT
LIST OF GRAPHS
LIST OF TABLES
I. INTRODUCTION TO INVESTIGATION
Background Research
Important Concepts
Overview of Myopia
Overview of Hyperopia
Recent Studies and Probable Causes
Diagnosis, Prevention, and Treatment
Epidemiology and Statistics
Interview with Optometrist
Statement of the Problem
Research Question
Hypothesis
II. PROCEDURE OF INVESTIGATION
Variables
Procedure
Vision Survey
III. RESULTS
Program
Distribution Data
Statistical Analysis Data
Model Output
Discussions
Analysis
Conclusions
Future Work
IV. ACKNOWLEDGEMENTS
V. REFERENCES
VI. APPENDIX
LIST OF GRAPHS
Figure 1: Distribution of Students with Visual Aids
Figure 2: Distribution of Students by Age
Figure 3: Distribution of Students by Gender
Figure 4: Distribution of Students by Height
Figure 5: Distribution of Students by Weight
Figure 6: Distribution of Students by Race
Figure 7: Distribution of Students Reading While Traveling in a Car
Figure 8: Distribution of Students Watching a Screen While Traveling in a Car
Figure 9: Distribution of Students Reading While Lying Down
Figure 10: Distribution of Students Watching a Screen While Lying Down
Figure 11: Distribution of Students Reading in Low Light
Figure 12: Distribution of Students Watching a Screen in Low Light
Figure 13: Distribution of Students by Time Spent on a Screen for School Work
Figure 14: Distribution of Students by Time Spent Watching TV
Figure 15: Distribution of Students with Siblings Using Visual Aids
Figure 32: Decision Tree
Figure 33: Random Forest Scaled Importance Values
Figure 34: Gradient Boosted Machine Scaled Importance Values
Figure 35: GBM ROC Curve
LIST OF TABLES
Table 1: Relative Risk and Odds Ratios for Identified Variables
Table 2: Positive Pairwise Correlations
Table 3: Negative Pairwise Correlations
Table 4: Important Features Across Algorithms
I. INTRODUCTION TO INVESTIGATION
Over the past 50 years, the number of young people diagnosed with near-sightedness or
myopia in the U.S. has doubled, making nearly 40-50% of the population of young people
nearsighted [1]. If this trend continues, by 2050, half of the world’s population, around 5 billion
people, will be myopic [2]. This rapid increase in number of cases diagnosed is alarming
because myopia can serve as a precursor to many other vision problems, even leading to
extreme myopia that frequently results in cataracts and glaucoma and occurs in 20% of people
diagnosed with myopia. Until 5 years ago, scientists thought that almost all cases of pathological
myopia were caused by a genetic predisposition [3]. Recent research shows that while 30% of
these cases have occurred due to genetics or during early childhood years, 60% of these cases
have progressed from a treatable form of myopia, starting in adolescence [4]. To detect the early
onset of pathological myopia, and other serious vision problems resulting from myopia,
especially during the current "myopic epidemic", a more efficient and cost-effective solution to
Background Research
Important Concepts
What Is Machine Learning?
There are two major types of programming: traditional or rule-based programming and
machine learning. In traditional programming, the computer scientist inputs data and writes a
program to get an output. In machine learning, the data and the output are given to the computer,
and the computer produces a model. This new way of programming is useful in several ways. It
is extremely difficult to write rule-based programs in some scenarios. For example, computing
the probability that a credit card transaction is fraudulent would require several if-then
statements. Also, since this problem is always changing and there is a moving target, the
program would need to keep getting updated. Another difficult scenario is recognizing a 3D
and hard to follow or replicate. Machine Learning is used in these types of scenarios to achieve
Intelligence is the study of how we can make machines able to understand the world, make
subset of this field and enables machines to improve at tasks using several examples of scenarios.
Instead of writing a program for each specific task, lots of examples are collected that specify the
examples the model was trained with. If the model needs to be updated, it can be retrained with
Other Concepts
words, it is the odds that an outcome will occur given an exposure. This measure is typically
used in case-control studies, like this investigation. The natural log (ln) of the odds ratio is equal
The relative risk or risk ratio is the ratio of the probability of an event occurring in an
exposed group to the probability of the event occurring in the non-exposed group. This measure
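As a minimal sketch of the two measures defined above, the following Python snippet computes an odds ratio and a relative risk from a 2x2 table of counts. The investigation's own program was written in R; Python is used here only for illustration, and the function names and counts are hypothetical rather than taken from the survey.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: exposed with outcome,    b: exposed without outcome
    c: unexposed with outcome,  d: unexposed without outcome
    """
    return (a / b) / (c / d)


def relative_risk(a, b, c, d):
    """Risk in the exposed group divided by risk in the non-exposed group."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts, NOT taken from the survey data.
a, b, c, d = 20, 10, 10, 20
print(odds_ratio(a, b, c, d))     # odds 20/10 versus odds 10/20
print(relative_risk(a, b, c, d))  # risk 20/30 versus risk 10/30
```

With these made-up counts the odds ratio is larger than the relative risk, which is typical when the outcome is common in the sample.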
variables. There are several types of regression, but the two used in this investigation are logistic
regression and regularized regression. Logistic regression is used when the dependent variable is
binary. The objective function for this type of regression is called MLE or Maximum Likelihood
the logistic regression objective function to prevent overfitting. There are two types of
regularized regression: the LASSO model and Ridge Regression. The LASSO model introduces
a penalty proportional to the sum of the absolute values of the coefficients computed by the
model. Ridge regression introduces a penalty proportional to the sum of the squares of the coefficients.
This investigation uses the LASSO model because it drives the coefficients of non-predictive
variables exactly to zero, whereas ridge regression only shrinks them close to zero. This allows
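The difference between the two penalties can be seen in a simplified one-coefficient case. The sketch below assumes a squared-error loss instead of the logistic loss used in the investigation, and a single coefficient whose unpenalized estimate is z; under those assumptions the LASSO solution is a soft-threshold of z and the ridge solution is a proportional shrinkage of z. The function names and numbers are hypothetical, and Python is used only for illustration.

```python
def lasso_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + lam*abs(b): soft-thresholding."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Minimizer of 0.5*(b - z)**2 + 0.5*lam*b**2: proportional shrinkage."""
    return z / (1.0 + lam)


estimates = [2.0, 0.3, -0.1]   # hypothetical unpenalized coefficients
lam = 0.5                      # hypothetical penalty strength
lasso = [lasso_shrink(z, lam) for z in estimates]
ridge = [ridge_shrink(z, lam) for z in estimates]
print(lasso)  # the two small coefficients become exactly zero
print(ridge)  # every coefficient shrinks but stays nonzero
```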
Decision Tree Learning is a Machine Learning algorithm which uses a decision tree as a
supervised learning classification problems. In a decision tree, the observations are represented
in the branches. If the feature is present, the left branch is followed, and if the feature is not
present, the right branch is followed. The target value is represented in the leaves with the top
number (0 or 1) representing whether most individuals in the category have vision aids or not,
the middle number representing the percentage (as a decimal) of individuals in the category with
vision aids, and the bottom number representing individuals in a category as a percentage of the
total population. There are two types of decision trees, classification and regression trees,
referred to collectively as CART. A classification tree, which is used in this experiment, deals
with discrete target values. Regression trees have continuous target values that can be
considered real numbers. Decision trees are simple to understand and interpret. They also
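The three numbers shown in each leaf can be reproduced with a few lines of arithmetic. This is a minimal Python sketch, not the investigation's R program; the function name, the tie-breaking rule (a leaf with exactly half counts as class 1), and the example counts are assumptions made for illustration.

```python
def leaf_summary(leaf_labels, total_count):
    """Return the three numbers described for a leaf.

    leaf_labels: list of 0/1 values (1 = uses a vision aid) for the
                 individuals that fall into this leaf.
    total_count: number of individuals in the whole dataset.
    """
    fraction_with_aid = sum(leaf_labels) / len(leaf_labels)
    majority_class = 1 if fraction_with_aid >= 0.5 else 0   # assumed tie rule
    percent_of_total = 100.0 * len(leaf_labels) / total_count
    return majority_class, fraction_with_aid, percent_of_total


# Hypothetical leaf: 15 of 60 individuals, 12 of whom use a vision aid.
labels = [1] * 12 + [0] * 3
print(leaf_summary(labels, 60))
```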
problems. This algorithm constructs several fully-grown decision trees in parallel. Random
forest uses decision trees with low bias and high variance. This is because of the bias-variance
tradeoff, which is the problem of simultaneously reducing bias and variance. Bias is the error
from assumptions in an algorithm. High bias causes an algorithm to miss features which is
known as underfitting. Variance is an error from sensitivity to small fluctuations in the training
set. High variance causes lots of random noise and causes the model to memorize the data and
overfit. Random forest uses bootstrap aggregation or bagging to combine the individual decision
trees. Bagging is a machine learning ensemble algorithm used to improve accuracy of machine
learning algorithms. It is a special type of model averaging technique that is usually applied to
decision trees. Random forest helps prevent overfitting on the training dataset, which is common
with decision trees and GBM. It is a highly accurate classifier and runs efficiently on larger
datasets.
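Bagging can be sketched in a few lines. The Python example below is a simplification under stated assumptions: it uses one-feature threshold "stumps" instead of fully grown trees, it omits the random feature selection that a full random forest also performs, and the data are hypothetical. It only shows the two ideas named above: each model is fit on a bootstrap sample, and the predictions are combined by majority vote.

```python
import random


def fit_stump(xs, ys):
    """Pick the threshold t that minimizes errors for the rule x > t -> 1."""
    best_t, best_err = None, None
    for t in sorted(set(xs)):
        err = sum((1 if x > t else 0) != y for x, y in zip(xs, ys))
        if best_err is None or err < best_err:
            best_t, best_err = t, err
    return best_t


def bagged_stumps(xs, ys, n_models, seed=0):
    """Fit one stump per bootstrap sample (rows drawn with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    thresholds = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        thresholds.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return thresholds


def predict(thresholds, x):
    """Majority vote over the individual stumps."""
    votes = sum(1 if x > t else 0 for t in thresholds)
    return 1 if 2 * votes > len(thresholds) else 0


# Hypothetical data: hours of daily screen time vs. vision aid (0/1).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0]
aid = [0, 0, 0, 1, 0, 1, 1, 1]
model = bagged_stumps(hours, aid, n_models=25)
print(predict(model, 0.25), predict(model, 7.0))
```

Averaging many models fit on resampled data is what reduces the variance of the individual trees.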
Gradient Boosted Machines use gradient boosting, which is a machine learning technique
used in regression and classification problems. There are three elements involved in GBMs: A
loss function, a weak learner, and an additive model. A loss (objective) function needs to be
optimized or minimized. Different loss functions are used for different problems. The weak
learners have high bias and low variance, the converse of the trees used in Random Forest.
The weak learners are usually CART trees. The additive model produces an ensemble model of
weaker prediction models by adding them one at a time, rather than in parallel like Random
Forest. Different objective loss functions can be used with the same algorithm, which makes
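The three elements (loss function, weak learner, additive model) can be sketched as follows. This Python example is illustrative only and makes several assumptions: it uses a squared-error loss rather than a classification loss, one-split regression stumps as the weak learners, a hypothetical learning rate, and made-up data. Each round fits a stump to the current residuals (the negative gradient of the squared-error loss) and adds it to the running prediction.

```python
def fit_regression_stump(xs, residuals):
    """Least-squares stump: split at t, predict the mean residual per side."""
    best = None
    for t in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    _, t, lm, rm = best
    return lambda x: lm if x <= t else rm


def boost(xs, ys, n_rounds=20, learning_rate=0.5):
    """Additive model under squared-error loss; returns the loss per round."""
    preds = [sum(ys) / len(ys)] * len(xs)                # start from the mean
    history = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]   # negative gradient
        stump = fit_regression_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        history.append(sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys))
    return history


# Hypothetical data: hours of daily screen time vs. vision aid (0/1).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0]
aid = [0, 0, 0, 1, 0, 1, 1, 1]
loss_history = boost(hours, aid)
print(loss_history[0], loss_history[-1])  # the training loss falls as stumps are added
```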
Gradient Descent is the process by which the minimum value of a function is found. The
model minimizes the MLE objective function to obtain the coefficients. An analogy that is often
used to illustrate gradient descent involves a person on a mountain. The mountain is foggy, and
the person is trying to get down or find the minimum. Since the path is not visible, he must use his
local information to get down. He must look at the steepness at his current position and proceed
in the direction with the steepest descent. This is much like the process the machine learning
model follows to minimize the MLE objective function and determine the variable coefficients.
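The mountain analogy can be made concrete with a one-feature logistic regression. The sketch below is written in Python for illustration, not taken from the investigation's R program. It treats the objective as the average negative log-likelihood, which is an assumption about what the text calls the MLE objective function, and the learning rate, step count, and data are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def neg_log_likelihood(w, b, xs, ys):
    """Average negative log-likelihood of a one-feature logistic model."""
    total = 0.0
    for x, y in zip(xs, ys):
        p = sigmoid(w * x + b)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(xs)


def gradient_descent(xs, ys, learning_rate=0.1, n_steps=500):
    """Repeatedly step against the gradient (the steepest downhill direction)."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        errors = [sigmoid(w * x + b) - y for x, y in zip(xs, ys)]
        grad_w = sum(e * x for e, x in zip(errors, xs)) / n
        grad_b = sum(errors) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b


# Hypothetical data: hours of daily screen time vs. vision aid (0/1).
hours = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 6.0]
aid = [0, 0, 0, 1, 0, 1, 1, 1]
w, b = gradient_descent(hours, aid)
print(neg_log_likelihood(0.0, 0.0, hours, aid))  # objective at the starting point
print(neg_log_likelihood(w, b, hours, aid))      # lower objective after descending
```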
N-fold Cross Validation is a model validation technique used to assess how the results of
a statistical analysis will generalize to an independent dataset. It is mainly used to estimate the
accuracy of a model. The dataset is divided into four groups (4-fold), and each group is used for
training three times and for testing once. N-fold cross validation allows all the data to be used
for both training and testing. This helps produce a more reliable estimate of the AUC, even
though the sample size may be small.
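The 4-fold scheme described above can be sketched as an index split. This Python snippet is illustrative only: the function name is hypothetical, rows are assigned to folds in a fixed round-robin order without shuffling, and 60 rows are used because that is the number of survey respondents.

```python
def k_fold_splits(n_rows, k=4):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    folds = [list(range(i, n_rows, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_splits(60, k=4))
for train, test in splits:
    print(len(train), len(test))   # 45 training rows and 15 test rows per fold
```

Every row lands in the test set exactly once and in the training set three times, which matches the description above.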
Overview of Myopia
Myopia, also known as near-sightedness, is a vision condition in
which one’s eyeball is elongated, and incoming light focuses in front of the retina, instead of on
the retina. Because of this, objects that are distant appear blurry and objects that are close appear
normal. While myopia may be benign in most cases, requiring only glasses or contacts, myopia
greatly increases the risk of serious vision problems such as retinal detachment, cataracts, and
glaucoma.
The first person to recognize and distinguish between myopia and hyperopia, also known
as far-sightedness, is said to be the Greek philosopher Aristotle (384 BC – 322 BC). However, these
speculations materialized in 65 AD when the words meyin, meaning close, and opdos,
meaning eye, were used in the compilation of Roman Law called Libris Pandectorum. The
Greek word “Myopos” eventually evolved into the Latin word “myops”, meaning a condition in
which someone attempts to see clearly by partially closing their eyes. While this condition has
been recognized for over 2000 years, research on the causes of myopia began about 200 years
ago [5]. Over time, scientists have understood that myopia results from a combination of genetic
predisposition and environmental factors. The main risk factors are believed to be time spent
Myopia can occur at various scales, and the rate of progression to degenerative myopia
can also vary. Each person diagnosed with myopia can see clearly out to a certain distance that
varies from individual to individual, and everything further than that becomes blurry. Eye
examinations have shown that most myopic eyes have the same structure as non-myopic eyes,
retina, through a weak point in the eyeball. Usually, the protrusion is black and affects inner
layers of the eye like the cornea and sclera. The most significant structural difference between
myopic and normal eyes is the length of the eye. The retina needs to stretch to cover the
increased distance, and hence becomes weaker. This causes lattice degeneration and retinal
detachment. Lattice degeneration is an eye disease in which the retina develops breaks and tears,
which causes the retina to eventually detach, resulting in legal blindness [6].
Overview of Hyperopia
Hyperopia, also known as far-sightedness, is a condition in which distant objects
appear clearly but near ones appear blurry. Converse to myopia, in hyperopia one’s eyeball
is too short or the cornea has too little curvature. This causes the image to focus behind the retina,
instead of on the retina. Hyperopia is most commonly seen in middle-aged adults, progressing as
they get older, and is much less likely in adolescents [18]. The correlation between hyperopia
and cataracts is alarming; 77.78% of hyperopic subjects end up with cortical cataracts [19].
Recently, scientists have begun talking about an “epidemic of nearsightedness”. While
the prevalence of nearsightedness in the U.S. is high, there are other parts of the world with
unimaginable prevalence rates. Parts of East Asia like Singapore, China, Japan, and Korea have
myopia prevalence rates as high as 80% among middle/high school age children.
Mathematician Johannes Kepler (1571 – 1630) blamed his nearsightedness on near work, or all
the writing/calculations that he did up close. This was the hypothesis that was believed for a
long time. Smartphones/tablets are not included in the definition of near work, since they are too
modern. Since nearsightedness has been on the rise since before these electronics were very
popular, scientists do not attribute this epidemic to them. Extensive research has been done on
the effect of near work on myopia rates, and while scientists have not established a firm link,
In the 20th century, scientists learned that genetics plays a role in the progression of
myopia. While the likelihood that you will be myopic is higher if your parents are as well, the
relationship between the presence of myopia and the mutation of a specific gene is not straight-
forward. A few dozen genotypes together influence the final phenotype present in an individual.
In order to confirm several hypotheses about genetic predisposition, a study was done in an Inuit
community in 1969. At the start of the experiment, 2 out of 131 people in that community
(1.5%) were nearsighted. After 1-2 generations (children/grandchildren), prevalence rates rose
to nearly 50%. This rapid change could not have resulted from genetics alone. Scientists then
concluded that while genes may have some influence, the main cause of nearsightedness is
does not attribute itself to the skyrocketing rates, but it does give scientists a clue as to what the
cause might be. Researchers adjusted the data to account for how much education each of the
participants had. The effect diminished, meaning that the amount of education the subjects had
accounted for the difference in myopia rates. Scientists then created the “Parental Investment”
hypothesis. Parents tend to emphasize the value of education more in first-borns than in their
later children. As a result, the first-born children spend more time studying, and therefore have a
higher rate of myopia. Another study, conducted by professors from the Sun Yat Sen University
in China, investigated the link between socioeconomic status and nearsightedness by comparing
the rates of myopia in neighboring Chinese provinces. Schoolchildren from the Shaanxi
province, a middle-income province, were compared to schoolchildren from the Gansu province,
a relatively poor province. In the wealthier province, the prevalence of myopia was twice that of
the poor province. While the researchers could not find a probable reason to explain the
difference, they found that higher math scores, from the wealthier province, were correlated with
higher rates of nearsightedness. This emphasizes the link between education and
nearsightedness. This also explains the extent of the problem in Asia, since education is
conducted a study on ethnic Chinese children living in Sydney and Singapore. They made sure
that the kids’ parents had similar rates of nearsightedness (around 70% in both studies). In the
children, the difference was alarming. Only 3.3% of the kids living in Sydney were nearsighted,
while 29.1% of the kids living in Singapore were nearsighted. Surprisingly, the children in
average of 13 hours a week outside, the kids in Singapore spent an average of 3 hours outside.
Public health officials hypothesized about the effect of sunlight on myopia and its progression.
Scientists proceeded to look for rigorous evidence and a mechanism in which this link could be
established.
Within the last few years, researchers have made considerable progress towards
establishing this as a probable cause. There have been several experiments on animals that show
that light protects against myopia. In a study done by researchers in Germany, myopia was
induced in chicks through the use of special goggles. They then placed one group of chicks in
sunlight, and the other group under regular laboratory lighting. The onset of myopia was slowed
After establishing rigorous evidence, researchers then focused their attention on the
science behind sunlight and our brains. They found a substance produced by organisms’ brains
that influences eye development: Dopamine. Dopamine is a neurotransmitter that plays several
important roles in human (and other animals’) brains/bodies. This chemical is released to send
signals to other nerve cells. The brain has dopamine-pathways that influence motor control and
hormone release. In order to establish that Dopamine affects proper eye development, scientists
injected chicks with a chemical that blocks Dopamine. Without the presence of this chemical,
sunlight no longer protected the chicks from myopia. Scientists concluded that Dopamine is
released as a result of bright light. This chemical is also related to the body’s day-night rhythm.
The human body switches from low-light nighttime vision to daytime vision, and without the
presence of Dopamine, this switch cannot occur. Researchers now believe that this “Dopamine
cycle” is required for healthy eye development throughout childhood and adolescence. If this
cycle is disrupted, especially during one’s middle-school years, the eyeball tends to become
elongated, causing myopia. Scientists call this hypothesis the Light-Dopamine hypothesis.
To test this hypothesis, scientists looked at primary school children from 12 schools in
Guangzhou, China. These children were divided into two groups of 6 schools each, so about 950
children were in each group. One of the groups did not change their daily routine, while the
other group added a 40-minute outdoor activity time to their schedule. For 3 years, the
researchers tracked the children and their eye development. At the end of the trial, the incidence
rate of myopia among the children that spent time outside was 30%, while in the other group, the
rate of myopia was 39.5%. While the reduction was less than they expected, 3 years is a short
amount of time to witness significant change. In order to better establish the link, researchers are
Diagnosis, Prevention, and Treatment
assessment of the refractive status of each eye. An autorefractor is a machine that measures how
light enters the person’s eye, and how the eye changes the light. A retinoscope is an instrument
that shines light into a person’s eye to observe the retina. A phoropter, a device that contains
different lenses, is then used to determine the exact prescription. These techniques and tools
There are various forms of myopia, each differing in intensity and treatment options.
Simple myopia, which is typically less than 4 diopters, can easily be fixed with glasses/contacts.
eye opposite the lens and includes the retina, optic disc, macula, fovea, and posterior pole.
Degenerative myopia is also characterized by a very high refractive error and subnormal vision
even after correction. This form of myopia is known to continually progress and get worse over
time. Another form of myopia is Pseudo myopia, which is the blurring of distance vision
brought about by a spasm of the accommodation system. Pseudo myopia includes nocturnal
myopia, near work-induced transient myopia, and instrument myopia. The last form of myopia
is Induced myopia, also known as acquired myopia. This form of myopia is brought on by the
use of different drugs, increases in glucose levels, nuclear sclerosis, oxygen toxicity, etc. [9].
there are several other ways to prevent myopia or slow its progression. The use of glasses and
contact lenses can help alter the speed of myopic progression. The American Optometric
Association’s Clinical Practice Guidelines for Myopia refers to several studies which show that
switching among full-time, part-time, and no lens wear does not appear to slow progression. In young
people ages 18 or younger, topical medications such as anti-muscarinic agents are known to slow the
progression of myopia. A muscarinic receptor antagonist is a drug that blocks the action of the
chemical Acetylcholine, which like Dopamine influences eye development. Eyedrops containing
cyclopentolate and atropine, chemicals that influence nerve connections, also slow the
progression of myopia. These chemicals often have side effects which cause light sensitivity,
since they cause too much Dopamine to be produced. Sclera reinforcement surgery is another
method which targets lattice degeneration. It provides reinforcement to thinning posterior poles,
which is part of the fundus. By slowing or stopping the progression of the disorder, quality of
Other than targeting the rate of progression, there is no concrete and universally accepted
solution to treat myopia. There are several types of refractive surgery that can be done for up to
some corneal tissue using an excimer laser, an ultraviolet laser commonly used in eye surgery.
While PRK is relatively safe, the recovery process is usually painful [11]. Another type of
surgery is LASIK surgery, where a flap is cut on the cornea prior to the procedure. Then, using
an excimer laser, the curvature of the cornea is changed. While the process is quite like PRK,
the recovery is usually painless, but cornea stability may be sacrificed [12]. Unlike PRK and
LASIK where the corneal surface is modified, the intra-ocular lens can also be modified by
implanting another lens inside the eye. While this typically fixes the refractive problem, it can
Recently, several alternative therapies have been developed because of the lack of
scientific interventions. Vision therapy, also known as “behavioral optometry”, includes various
eye exercises and other relaxation techniques. One of the most popular forms of vision therapy
is the Bates Method. Physician William Horatio Bates (1860 – 1931) attributed all vision
disorders, including myopia, to a constant strain on the eyes. Hence, he felt that glasses harmed
the eye and trained the eye into the myopic state. His method includes palming, visualization,
movement, and exposure to sunlight. Scientists have stated that these alternative techniques
have “no clear scientific evidence” and so “they cannot be advocated for” [14]. In the late
1980s and early 1990s, biofeedback became very popular as a treatment for myopia.
Biofeedback is the process of gaining awareness of many physiological functions primarily using
instruments that provide information on the activity of those same systems. The goal was to be
able to manipulate these systems at one’s own will. Scientists maintain that biofeedback training
is not consistent [15].
The prevalence of myopia varies with age, country, race, religion, ethnicity, sex,
environment, occupation, and other factors. When comparing studies, more than one factor
differs, which makes comparisons of progression and incidence difficult. For example, in Asian
cultures, the prevalence of myopia has been as high as 70 – 90%, while in Europe and the United
States, the incidence rates are 30 – 40%. There is even more difference when comparing to
Africa, with prevalence rates between 10 – 20%. One must understand that race/ethnicity are not
the only factors that change; environment, socioeconomic factors, country, etc. also affect these
rates. This makes it hard to pinpoint a certain risk factor. Additionally, myopia is about twice
as common in Jewish communities as in non-Jewish communities. Religion may not have
The epidemiology of global refractive errors has become a popular research topic. In
North America, myopia is most common in the United States when compared to Canada and
Mexico. Research suggests that prevalence has increased dramatically over the past few
decades. In 1971 – 1972, the National Health and Nutrition Examination Survey reported the
first estimate of myopia prevalence in the U.S. The prevalence of myopia in people aged 12
– 54 was 25%. Another survey was conducted in 1999 – 2004, and the prevalence had increased
to 42% [1]. A different study was done of 2,523 students in grades 1 – 8 (ages 5 – 17) in a
diverse region of the United States. 10% of the students had at least -0.75 diopters of myopia,
but there was considerable variance when race was accounted for. Asians had the highest
prevalence (19%), then Hispanics (13%), followed by African Americans (7%), and lastly
Caucasians (4%) [17].
Most of them are nearsighted; however, kids below 10 are usually farsighted.
Most of them tend to be Asian; there are no gender trends that I know of.
Yes, serious jumps in prescription are very common and occur in 15 – 20% of myopic
adolescents.
Prolonged screen use is the main factor. Family history, medication, and nutrients
6. Are there any vision disorders that can result from myopic progression?
Refractive amblyopia is when the brain ignores one eye because it is misaligned, leading
to a “lazy eye”.
7. Currently, is there anything that can help with the early detection of vision disorders
Medical checkups are supposed to test vision, but they are not very thorough, and only
happen once a year.
To detect the early onset of pathological myopia, and other serious vision problems
resulting from myopia, especially during the current "myopic epidemic", a more efficient and
model can accurately determine primary risk factors of myopia, and using this model, a simple
test to inform people of their level of risk can be developed. Additionally, more specialized
treatment can be tailored to these risk factors, and the prevalence of pathological myopic
Research Question
Which factors will a model, trained using machine learning algorithms on data collected
Hypothesis
If a predictive model is built based on the surveys and is used to score students who are at
risk for vision problems so that they can be taken in for further testing, then the key risk factors
Variables
Procedure¹
iii. Timeline: Distribute the surveys; collect them after three weeks
¹ There are no constants, controls, or materials in this investigation
iv. Mode: Paper (includes an introductory letter, a parental consent form,
iv. Reference Period: The time frame each survey should take (less than 5
minutes)
v. Response Format: Make sure that the respondent clearly knows what
to mark to signify their answer
clearly
a. Obtain parental consent from participants that are minors (all participants are
² The vision survey used in this investigation is on the next page
PLACEHOLDER FOR
Vision Survey
3. Convert the survey to a digital format.
database
b. Give the answer choices numbers for each question (first question first choice
c. Create a row for each survey (ex. John is row 1, Amy is row 2)
d. Give each question a variable name (ex. The question “How long do you
spend watching T.V.?” will become “timeTV”). These variable names will be
e. Record the results in this format: If John picked the first answer choice for
“1”
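The encoding steps above can be sketched in R as follows. This is a minimal illustration, not the investigation's actual database: the two respondents come from the example in step c and `timeTV` from step d, while the `timeOutdoors` column and every answer value are assumed for demonstration.

```r
# Hypothetical survey responses: one row per survey, one column per question.
# Each cell stores the number of the answer choice that was marked.
survey <- data.frame(
  respondent   = c("John", "Amy"),  # row 1 = John, row 2 = Amy
  timeTV       = c(1L, 3L),         # assumed answers, for illustration only
  timeOutdoors = c(2L, 1L),         # assumed answers, for illustration only
  stringsAsFactors = FALSE
)

# John marked the first answer choice for timeTV, so his cell holds 1
john_tv <- survey$timeTV[survey$respondent == "John"]
print(survey)
```

Storing the choice number rather than the answer text keeps every column numeric, which is what the later recoding and modeling steps expect.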
4. Create summary statistics for each question.
race, etc.
b. Used to study univariate characteristics.
III. RESULTS
Program
remove(list=c('pkgs','pkgs.loaded'))
cat(sprintf('Number of labels: %d',labels.dt[,.N]), '\n\n')
labels.dt
## Number of questions: 30
## Number of labels: 97
Survey Data
#changes for Spanish version: race Hispanic and African American was swapped
data_orig.dt[, tmp:=race]
data_orig.dt[language==2 & tmp==2, race:=3]
data_orig.dt[language==2 & tmp==3, race:=2]
data_orig.dt[, tmp:=NULL]
#template
#replacing *NL* with \n
ggp.1 <- function(var, col, xl=NULL) {
  if(is.null(xl)) {xl <- paste0(toupper(substring(var,1,1)),substring(var,2))}
  xax <- labels.dt[feature==var]
  labels <- stringi::stri_replace_all_fixed(xax[['label']],
    c("<=",">=","*NL*"), c("\u2264","\u2265","\n"),
    vectorize_all=FALSE)
  ggplot(data_orig.dt, aes_string(paste0('factor(',var,')'))) +
    theme_light() +
    theme(panel.grid.minor=element_blank(),
          text=element_text(family='Segoe UI', size=15)) +
    geom_bar(aes(y=(..count..)/sum(..count..)), fill=cbPalette[col]) +
    geom_text(aes(label=..count.., y=(..count../sum(..count..))*.5),
              stat='count', color='white', size=4) +
    scale_x_discrete(breaks=xax[['value']], labels=labels) + xlab(xl) +
    scale_y_continuous(labels=percent) + ylab('Students (%)')
}
Response Variable
Physical Attributes
Food Servings
Time Spent on Activities
Other Factors
#response
model.dt[, y:=glassesContacts %in% c(1,2,4)]
#demographics
model.dt[, gender.Female:=gender %in% c(1)]
model.dt[, age.GT12:=age>12]
model.dt[, race.NonWhite:=!(race %in% c(1))]
model.dt[, height.GT54:=height %in% c(4)]
model.dt[, weight.GT130:=weight %in% c(5)]
#food
model.dt[, vegetable.GT1:=vegetable %in% c(3,4,5)]
model.dt[, fruit.GT2:=fruit %in% c(4,5)]
model.dt[, snack.GT2:=snack %in% c(4,5)]
model.dt[, protein.GT2:=protein %in% c(4,5)]
model.dt[, lunchNotHome:=lunchHome %in% c(2)]
#time on activities
model.dt[, timeScreenSchool.LE1:=timeScreenSchool %in% c(1)]
model.dt[, timeTelevision.LE1:=timeTV %in% c(1)]
model.dt[, timeCellPhone.LE1:=timeCellPhone %in% c(1)]
model.dt[, timeComputer.LE1:=timeComputer %in% c(1)]
model.dt[, timeBook.LE1:=timeBook %in% c(1)]
model.dt[, timeOutdoors.LE1:=timeOutdoors %in% c(1)]
#others
#model.dt[, transport.Car:=transportation %in% c(1)]
model.dt[, braces.Yes:=braces %in% c(1)]
model.dt[, siblingsGlasses.Yes:=!(siblings %in% c(3))]
#model.dt[, physicalActive.GT1hr:=physicalActive %in% c(3)]
model.dt <- model.dt[, setdiff(names(model.dt), vars.orig), with=F]
model.dt <- model.dt[, lapply(.SD, as.integer), .SDcols=names(model.dt)]
model.dt
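The recoding above turns each multiple-choice answer into a 0/1 indicator. The same idea can be sketched in base R on a small toy table. The values below are assumed for illustration; what each code means is defined by the survey and is not assumed here.

```r
# Toy responses (assumed values) using the survey's numeric answer codes
toy <- data.frame(
  glassesContacts = c(1, 3, 2, 4, 3),
  age             = c(11, 13, 12, 15, 14),
  race            = c(1, 2, 1, 4, 3)
)

# Same rules as the model.dt code, written without data.table
binary <- data.frame(
  y             = as.integer(toy$glassesContacts %in% c(1, 2, 4)),
  age.GT12      = as.integer(toy$age > 12),
  race.NonWhite = as.integer(!(toy$race %in% c(1)))
)

# Share of subjects with each indicator set to 1; a very small share
# leaves too few subjects to estimate that variable's effect reliably
group.share <- colMeans(binary)
print(binary)
print(group.share)
```

Checking `group.share` before modeling is one way to confirm that both groups of each binary variable contain enough subjects.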
ggp.2 <- function(dt, fpar) {
dt.p <- setnames(dt[, fpar[['cols']], with=F], c('var','value'))
dt.p <- setorder(dt.p, -value)[1:fpar[['nobs']]]
dt.p <- merge(dt.p, varcol.dt, by='var', sort=FALSE)
dt.p <- setorder(dt.p, value)[, var:=factor(.I,labels=var)]
#N <- 5
#dt <- data.table(X=sample(varcol.dt[['var']],N), Y=runif(N))
#print(dt)
#par <- list(nobs=N, cols=c('X','Y'), xylab=c('Variable','Odds Ratio'))
#ggp.2(dt, par)
univar.dt <- setorder(univar.dt[, c('codes','rel.risk','odds.ratio')],
-odds.ratio)
univar.dt
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 55555
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.3 (2017-11-30)
s <- capture.output(h2o.removeAll())
h2o.no_progress()
## 0 17 17 0.500000 =17/34
## 1 1 25 0.038462 =1/26
## Totals 18 42 0.300000 =18/60
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.265312 0.735294 15
## 2 max f2 0.265312 0.856164 15
## 3 max f0point5 0.584681 0.655738 8
## 4 max accuracy 0.584681 0.700000 8
## 5 max precision 0.764473 0.875000 2
## 6 max recall 0.148481 1.000000 17
## 7 max specificity 0.870703 0.970588 0
## 8 max absolute_mcc 0.265312 0.499083 15
## 9 max min_per_class_accuracy 0.561815 0.653846 9
## 10 max mean_per_class_accuracy 0.265312 0.730769 15
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or
`h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
coef.glm <- glm.h2o@model$coefficients
# keep non-negligible coefficients and drop the intercept
coef.glm.dt <- data.table(code=names(coef.glm), regval=unlist(coef.glm))[
  abs(regval)>1E-3 & code!='Intercept']
ggp.2(coef.glm.dt, list(nobs=5, cols=c('code','regval'),
  xylab=c('Variable','Coefficient')))
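For a binomial (logistic) GLM, exponentiating a coefficient gives the multiplicative change in the odds of the response when a binary feature is present. The sketch below assumes the model was fitted with a binomial family and hard-codes the approximate coefficient values quoted in the Analysis section. The vector name `coef.example` is hypothetical, and if the fit was regularized the coefficients are shrunk toward zero, so the resulting odds ratios should be read as rough.

```r
# Approximate GLM coefficients as quoted in the Analysis section
coef.example <- c(screenCar.Yes       = 1.373,
                  race.NonWhite       = 0.835,
                  timeOutdoors.LE1    = 0.823,
                  height.GT54         = 0.729,
                  siblingsGlasses.Yes = 0.728)

# exp(coefficient) = odds ratio for "feature present" vs "feature absent"
odds.ratio.example <- exp(coef.example)
print(round(odds.ratio.example, 2))
```

A positive coefficient always maps to an odds ratio above one, which matches the reading that each of these features raises the odds of needing a vision aid.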
## Mean Per-Class Error: 0.1549774
## AUC: 0.8891403
## Gini: 0.7782805
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal
threshold:
## 0 1 Error Rate
## 0 30 4 0.117647 =4/34
## 1 5 21 0.192308 =5/26
## Totals 35 25 0.150000 =9/60
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.470767 0.823529 23
## 2 max f2 0.352211 0.890411 38
## 3 max f0point5 0.470767 0.833333 23
## 4 max accuracy 0.470767 0.850000 23
## 5 max precision 0.666833 1.000000 0
## 6 max recall 0.352211 1.000000 38
## 7 max specificity 0.666833 1.000000 0
## 8 max absolute_mcc 0.470767 0.693585 23
## 9 max min_per_class_accuracy 0.470767 0.807692 23
## 10 max mean_per_class_accuracy 0.470767 0.845023 23
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or
`h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
varimp.gbm.dt <- data.table(gbm.h2o@model$variable_importances)[,
c('relative_importance','percentage'):=NULL]
ggp.2(varimp.gbm.dt, list(nobs=5, cols=c('variable','scaled_importance'),
xylab=c('Variable','Scaled Importance')))
#summary(rf.h2o)
h2o.performance(rf.h2o, newdata=data.h2o)
## H2OBinomialMetrics: drf
##
## MSE: 0.1891142
## RMSE: 0.4348727
## LogLoss: 0.5668165
## Mean Per-Class Error: 0.1210407
## AUC: 0.9343891
## Gini: 0.8687783
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal
threshold:
## 0 1 Error Rate
## 0 31 3 0.088235 =3/34
## 1 4 22 0.153846 =4/26
## Totals 35 25 0.116667 =7/60
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.478210 0.862745 24
## 2 max f2 0.436659 0.905797 33
## 3 max f0point5 0.478210 0.873016 24
## 4 max accuracy 0.478210 0.883333 24
## 5 max precision 0.607206 1.000000 0
## 6 max recall 0.381417 1.000000 41
## 7 max specificity 0.607206 1.000000 0
## 8 max absolute_mcc 0.478210 0.761806 24
## 9 max min_per_class_accuracy 0.463585 0.852941 27
## 10 max mean_per_class_accuracy 0.478210 0.878959 24
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or
`h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
varimp.rf.dt <- data.table(rf.h2o@model$variable_importances)[,
c('relative_importance','percentage'):=NULL]
ggp.2(varimp.rf.dt, list(nobs=5, cols=c('variable','scaled_importance'),
xylab=c('Variable','Scaled Importance')))
Distribution Data
Figure 4: Distribution of Students by Height Figure 5: Distribution of Students by Weight
Figure 7/8: Distribution of Students Reading While Traveling in a Car/Watching a Screen While Traveling in a Car
Figure 9/10: Distribution of Students Reading While Lying Down/Watching a Screen While Lying Down
Figure 11/12: Distribution of Students Reading in Low Light/Watching a Screen in Low Light
Figure 13/14: Distribution of Students by Time Spent on a Screen for School Work/Time Spent Watching TV
Figure 15: Distribution of Students with Siblings Using Visual Aids Figure 16: Distribution of Students by Physical Activity Time
Figure 17: Distribution of Students by Snack Servings Figure 18: Distribution of Students by Protein Servings
Figure 19/20: Distribution of Students by Time Spent on a Cell Phone/Time Spent Reading Books
Figure 21/22: Distribution of Students by Time Spent Using Computer/Time Spent Outdoors
Figure 23: Distribution of Students by Form of Transportation Figure 24: Distribution of Students by Use of Braces
Figure 25: Distribution of Students by Vegetable Servings Figure 26: Distribution of Students by Fruit Servings
Statistical Analysis Data
Table 1: Relative Risk and Odds Ratios for Identified Variables
Table 3: Pairwise Negative Correlations
Model Output
Figure 32: Decision Tree
Figure 33: Random Forest Scaled Importance Values
Figure 35: GBM ROC Curve
Discussions
Analysis
Figure 1 displays the percentage of students with visual aids: glasses, contacts, both, or none. Most
of the students (~57%) do not use glasses or contacts, and ~43% of the students do. This gives a
good-sized sample for each group, rather than having many students in one group and barely any
used to convert them to binary variables. While half of the features were already binary (Figures
3, 7, 8, 9, 10, 12, 14, 19, 20, 21, 22, 24), the remaining variables needed to be converted. The
binary variables were defined based on the number of subjects in each group. For example, in
Figure 6, since ~57% of the students are Caucasian, the two variables created for the “race”
feature were race.White and race.NonWhite. If the variables had been created with the “other”
category instead of Caucasian (race.Other and race.NonOther), there would only be 6 subjects
(10%) in the race.Other category. If race.Other had been chosen by the model as one of the
predictive variables, because of the low number of subjects, the coefficient would not accurately
From the summary statistics, the following binary variables and their complements were
physicalActive.GT1hr.
Table 1 and Figure 28 show the top ten most predictive variables based on relative risk
and odds ratio. The higher the relative risk or odds ratio, the more predictive power the
variable is likely to have. An odds ratio greater than one means that an individual is more
likely to have a visual aid if that feature is present. The ordering of variables based on relative
risk almost corresponds to that of odds ratios, showing that these measures are good indicators of
the predictive power. Also, the order of features by univariate odds ratios was similar to that of
other algorithms, so if the odds ratio is higher, the feature is likely to have predictive power in
other models.
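As a worked illustration of the two measures, the sketch below computes relative risk and odds ratio for a single binary feature from a 2x2 table. The counts and variable names are hypothetical and are not taken from the survey data.

```r
# Hypothetical counts for one binary feature (illustrative only)
aid.present   <- 12  # feature present, uses a visual aid
noaid.present <- 8   # feature present, no visual aid
aid.absent    <- 6   # feature absent, uses a visual aid
noaid.absent  <- 14  # feature absent, no visual aid

# Relative risk: proportion with a visual aid when the feature is present,
# divided by the proportion with a visual aid when it is absent
risk.present <- aid.present / (aid.present + noaid.present)
risk.absent  <- aid.absent / (aid.absent + noaid.absent)
rel.risk     <- risk.present / risk.absent

# Odds ratio: odds of a visual aid with the feature over odds without it
odds.ratio <- (aid.present / noaid.present) / (aid.absent / noaid.absent)

print(c(rel.risk = rel.risk, odds.ratio = odds.ratio))
```

With these made-up counts the relative risk is 2 and the odds ratio is 3.5. Both exceed one, so the two measures agree on the direction of the association even though they differ in size.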
Table 2 shows the top five pairs of positively correlated variables. These coefficients
were computed using the pairwise correlation formula. Because of the low sample size, the
coefficients are not very strong (as seen in Figure 29), but if the model picks one of these
variables, the correlations convey that the other variable could be used as well. For example, as
seen in Figure 30, height.GT54 was identified by the GLM model as having predictive power.
The positive correlation shows that weight.GT130 could also have some predictive power. The
same can be said for Figure 32 about the weight feature in correlation to height. Similarly,
Table 3 shows the top five pairs of negatively correlated variables. The variables protein.GT2 and
lunchNotHome were negatively correlated, which is interesting because it may suggest that school
lunches are not as nutritious. Other conclusions can be drawn from these correlations, and with a
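The pairwise correlations discussed here can be sketched with base R's `cor()`. This assumes the "pairwise correlation formula" refers to the Pearson correlation, which for two 0/1 variables equals the phi coefficient. The eight observations below are invented for illustration.

```r
# Invented 0/1 indicators for eight subjects (not the survey data)
height.GT54  <- c(1, 1, 0, 0, 1, 0, 1, 0)
weight.GT130 <- c(1, 0, 0, 0, 1, 0, 1, 1)

# Pearson correlation between the two binary variables
r <- cor(height.GT54, weight.GT130)

# The full pairwise matrix for several variables at once
cor.matrix <- cor(data.frame(height.GT54, weight.GT130))
print(r)
print(cor.matrix)
```

A value near +1 or -1 means the two indicators carry almost the same information, so a model that selects one of them may not need the other.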
Figure 30 displays the top five most predictive variables based on the regression
coefficients computed by the Generalized Linear Model. All the variables have positive
coefficients, meaning that if the feature is present, it is more likely that the subject has or will
need a vision aid. The intercept indicates the likelihood that the subject will require a vision aid
if none of the features are present. Since this value is negative, it shows that the subject is
unlikely to need glasses/contacts if they are not in any of the categories listed. The most
predictive variable found, with a coefficient of ~1.373, is the use of a screen in a car,
followed by whether race is non-white (~0.835), time spent outdoors being less than one hour (~0.823),
height being greater than 5.4 feet (~0.729), and whether the subject’s sibling needs glasses
(~0.728). In these
cases, if the feature is present, the subject is more likely to need a vision aid. The size of the
coefficient indicates the predictive power of the variable, so the time spent outdoors is more
predictive than
height. Figure 31 displays the ROC Curve for the Generalized Linear Model. The model has an
AUC of ~0.765. This is fairly accurate given the number of observations (60). The ROC curve
plots the sensitivity against 1 minus the specificity. Specificity (TNR or True Negative Rate) is
equal to True Negatives/All Negatives on a confusion matrix. Sensitivity (TPR or True Positive
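As a rough illustration of these two rates, the sketch below recomputes specificity and sensitivity from the counts in the first confusion matrix shown in the Program output. It assumes that matrix belongs to the GLM and that, as in the later matrices, rows are actual classes and columns are predicted classes. The variable names are hypothetical.

```r
# Counts read from the first confusion matrix in the Program output
# (assumed: rows = actual class, columns = predicted class)
true.neg  <- 17  # actual 0, predicted 0
false.pos <- 17  # actual 0, predicted 1
false.neg <- 1   # actual 1, predicted 0
true.pos  <- 25  # actual 1, predicted 1

# Specificity (TNR) = True Negatives / All Negatives
specificity <- true.neg / (true.neg + false.pos)

# Sensitivity (TPR) = True Positives / All Positives
sensitivity <- true.pos / (true.pos + false.neg)

# One point on the ROC curve: x = 1 - specificity, y = sensitivity
roc.point <- c(x = 1 - specificity, y = sensitivity)
print(roc.point)
```

Each classification threshold yields one such point, and sweeping the threshold from high to low traces out the full ROC curve.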
In Figure 32, if the leaf is green, then most individuals in the category have vision
aids. If the leaf is blue, most individuals in the category do not have vision aids. The darkness
of the color corresponds to the percentage of individuals in the category that display the response
variable indicated by the leaf. The importance of a feature corresponds to its placement on the tree.
In other words, the features on the upper branches have more predictive power than features
positioned on the lower branches. The features returned were use of a screen in a car, whether or
not the subject’s siblings had glasses, time spent outdoors, reading while lying down, reading in
low light, and weight greater than 130 pounds. The weight feature is correlated with the
height feature, as seen in Figure 29, so when looking at the important variables, they can be
paired.
The top five scaled importance values of the features returned by the Distributed Random
Forest Model are shown in Figure 33. They are use of screen in a car, time spent outdoors, height
greater than 5.4 feet, whether or not the subject’s siblings have glasses, and whether the subject’s
race is white or non-white. Scaled importance assigns the most predictive feature a value of one
and scales the other features relative to it. Since all of the scaled importance values are
positive, if the feature is present in an individual, they will likely need a vision aid. DRF
returned an AUC
of ~0.934, which is the highest across all of the models.
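The scaling described above can be sketched in a few lines of R. The relative importance numbers are invented for illustration and are not the model's actual output.

```r
# Invented relative importance values (not the model's actual output)
relative.importance <- c(screenCar.Yes    = 40,
                         timeOutdoors.LE1 = 25,
                         height.GT54      = 10)

# Scaled importance: divide by the largest value so the top feature equals 1
scaled.importance <- relative.importance / max(relative.importance)
print(scaled.importance)
```

Because every value is divided by the same maximum, the ordering of the features is unchanged and only the scale differs.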
Figure 34 shows the scaled importance values for the features returned by the Gradient
Boosted Machine. The features returned are use of screen in a car, whether or not the subject’s
siblings have glasses, time spent outdoors, height greater than 5.4 feet, and reading in low light.
Since all of the scaled importance values are positive, if the feature is present in an individual,
they will likely need a vision aid. It is interesting to note that the first two features’ scaled
importance values are almost double that of the third. The Gradient Boosted Machine ROC
Curve is shown in Figure 35. The model returned an AUC of ~0.889, which is fairly accurate.
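The AUC values reported for the models can be read as the probability that a randomly chosen subject with a vision aid receives a higher predicted score than a randomly chosen subject without one. The sketch below computes AUC that way on six invented scores. It is an illustration of the definition, not the routine H2O uses internally.

```r
# Invented predicted scores and true labels (1 = has a vision aid)
score <- c(0.9, 0.8, 0.7, 0.4, 0.35, 0.2)
label <- c(1, 1, 0, 1, 0, 0)

pos <- score[label == 1]
neg <- score[label == 0]

# Compare every positive with every negative; ties count as half a win
wins <- outer(pos, neg, ">") + 0.5 * outer(pos, neg, "==")

# AUC = share of positive/negative pairs ranked in the right order
auc <- mean(wins)
print(auc)
```

Here 8 of the 9 positive/negative pairs are ordered correctly, so the AUC is 8/9, or about 0.889. A value of 0.5 would mean the scores rank subjects no better than chance.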
Conclusions
Feature                     GLM   Decision Tree   Random Forest   GBM
screenCar.Yes                1          1               1           1
timeOutdoors.LE1             3          2               2           3
siblingsGlasses.Yes          5          3               4           2
height.GT54/weight.GT130     4          5               3           4
race.NonWhite                2          -               5           -
readLowLight.Yes             -          4               -           5
readLyingDown.Yes            -          3               -           -
The most predictive features across algorithms are shown in Table 4. The use of
screen in a car feature ranked first in all algorithms tested. It was followed by time spent
outdoors and whether or not the subject’s siblings have glasses. Since height and weight are
correlated, they are paired together. When paired, the feature came up as predictive in all
algorithms tested. Race and reading habits came up as predictive in some algorithms, so they
The models are reasonably accurate on the survey data (AUC of ~0.765 for the GLM, ~0.889 for the
GBM, and ~0.934 for the DRF) and can be used as an initial filter before a formal eye exam.
This is especially important in middle and high schools where yearly eye exams are not
conducted. The original hypothesis was found to be partially correct. While time outdoors did
Future Work
In the future, there are several things that could be done to improve model accuracy and
efficiency. For example, more survey responses could be obtained, which may yield a higher AUC. By
increasing the number of subjects, other statistics such as confidence intervals could also be
computed. Increasing sample and feature variation could also make the model more general and
reduce memorization (overfitting). The program could be made more efficient for wide-scale adoption. A
IV. ACKNOWLEDGEMENTS
There are several people that helped make this investigation possible:
The survey participants and their parents/guardians – this investigation would not have
Ms. Kindra Smith and Ms. Lynnette Lindesay – thank you so much for your cooperation and
Dr. Varsha Sonawane – thank you for being a great mentor through the entire process.
Dr. Ratidzo Macharaga – thank you for taking time out of your busy day to talk to me and
My family – this experiment would not be where it is without your support. Thank you!
V. REFERENCES
[1] - MHS, Susan Vitale PhD. “Increased Prevalence of Myopia in the United States
article/424548
[2] - Holden, Brian A. “Global Prevalence of Myopia and High Myopia and Temporal
Trends from 2000 through 2050.” AAO Journal, http://www.aaojournal.org/article/S0161-
6420(16)00025-7/abstract
https://www.nature.com/news/the-myopia-boom-1.17120
http://www.eyeassociates.com/pathological-myopia/
[6] - Holden, B, et al. “Myopia, an Underrated Global Challenge to Vision: Where the
Current Data Takes Us on Myopia Control.” Eye, Nature Publishing Group, Feb. 2014,
www.ncbi.nlm.nih.gov/pmc/articles/PMC3930268/.
[7] - Crew, Bec. “Watch: The Nearsighted Epidemic Is Real.” ScienceAlert,
www.sciencealert.com/watch-the-nearsightedness-epidemic-is-real.
[8] - “Facts About Refractive Errors.” National Eye Institute, U.S. Department of Health
[9] - Cassin, Barbara, and Melvin L. Rubin. Dictionary of Eye Terminology. Triad Pub. Co.,
2004.
[12] - “Laser in Situ Keratomileusis to Treat Myopia: Early Experience.” Journal of
[13] - Moshirfar, Majid, et al. “Incidence Rate and Occurrence of Visually Significant
www.ncbi.nlm.nih.gov/pmc/articles/PMC3986296/.
[14] - Barrett, Brendan T. “A Critical Evaluation of the Evidence Supporting the Practice of
jsessionid=E4C2719BD1E1A2728C55696B37019AD2.f04t03.
[15] - Randle, Robert J. “Responses of Myopes to Volitional Control Training of
2007, onlinelibrary.wiley.com/doi/10.1111/j.1475-1313.1988.tb01063.x/abstract.
[17] - Robert N. Kleinstein, OD, MPH, PhD. “Refractive Error and Ethnicity in
jamanetwork.com/journals/jamaophthalmology/fullarticle/415584.
www.aoa.org/patients-and-public/eye-and-vision-problems/glossary-of-eye-and-vision-
conditions/hyperopia?sso=y.
[19] - “Home - PMC - NCBI.” National Center for Biotechnology Information, U.S.
www.covd.org/?page=Focusing.
[21] - LD, Jill Corleone RDN. “Can Foods Make You Grow Taller?” LIVESTRONG.COM,
taller/.
VI. APPENDIX