An AI-Based Predictive Model For Vision Disorders in Adolescents Divya Nori 9th Grade Milton High School 2017 - 2018

An AI-Based Predictive Model for Vision Disorders in Adolescents
Divya Nori
9th Grade
Milton High School
2017 – 2018
ABSTRACT
Over the past 50 years, the number of young people diagnosed with near-sightedness or
myopia in the U.S. has doubled. If this trend continues, by 2050, half of the world’s population,
around 5 billion people, will be myopic. This rapid increase in number of cases diagnosed is
alarming because myopia can serve as a precursor to many other vision problems, even leading
to blindness in extreme situations. Until 5 years ago, scientists thought that almost all cases of
pathological myopia were caused by a genetic predisposition. Recent research shows that while
30% of these cases have occurred due to genetics or during early childhood years, 60% of these
cases have progressed from a treatable form of myopia, starting in adolescence. By using a
mathematical model to predict the likelihood of myopia, time and money can be saved.
Additionally, the prevalence of myopia and other related vision disorders can be reduced. This
investigation created a predictive model using machine learning to determine risk factors of
myopia using data collected through a survey. The survey was distributed to 60 adolescents, and
responses were recorded. A database was created from the responses and analyzed by a program
written in R. Time spent on a screen in the car, time spent outdoors, race, height, and whether
the subject’s sibling wore glasses were identified as the strongest predictors of dependence on a
vision aid.
TABLE OF CONTENTS
ABSTRACT
LIST OF GRAPHS
LIST OF TABLES
I. INTRODUCTION TO INVESTIGATION
    Background Research
        Important Concepts
        Overview of Myopia
        Overview of Hyperopia
        Recent Studies and Probable Causes
        Diagnosis, Prevention, and Treatment
        Epidemiology and Statistics
    Interview with Optometrist
    Statement of the Problem
    Research Question
    Hypothesis
II. PROCEDURE OF INVESTIGATION
    Variables
    Procedure
        Vision Survey
III. RESULTS
    Program
    Distribution Data
    Statistical Analysis Data
    Model Output
    Discussions
        Analysis
        Conclusions
        Future Work
IV. ACKNOWLEDGEMENTS
V. REFERENCES
VI. APPENDIX
LIST OF GRAPHS
Figure 1: Distribution of Students with Visual Aids
Figure 2: Distribution of Students by Age
Figure 3: Distribution of Students by Gender
Figure 4: Distribution of Students by Height
Figure 5: Distribution of Students by Weight
Figure 6: Distribution of Students by Race
Figure 7: Distribution of Students Reading While Traveling in a Car
Figure 8: Distribution of Students Watching a Screen While Traveling in a Car
Figure 9: Distribution of Students Reading While Lying Down
Figure 10: Distribution of Students Watching a Screen While Lying Down
Figure 11: Distribution of Students Reading in Low Light
Figure 12: Distribution of Students Watching a Screen in Low Light
Figure 13: Distribution of Students by Time Spent on a Screen for School Work
Figure 14: Distribution of Students by Time Spent Watching TV
Figure 15: Distribution of Students with Siblings Using Visual Aids
Figure 16: Distribution of Students by Physical Activity Time
Figure 17: Distribution of Students by Snack Servings
Figure 18: Distribution of Students by Protein Servings
Figure 19: Distribution of Students by Time Spent on a Cell Phone
Figure 20: Distribution of Students by Time Spent Reading Books
Figure 21: Distribution of Students by Time Spent Using a Computer
Figure 22: Distribution of Students by Time Spent Outdoors
Figure 23: Distribution of Students by Form of Transportation
Figure 24: Distribution of Students by Use of Braces
Figure 25: Distribution of Students by Vegetable Servings
Figure 26: Distribution of Students by Fruit Servings
Figure 27: Distribution of Students by Source of Lunch
Figure 28: Odds Ratio Graph
Figure 29: Pairwise Correlations Heat Map
Figure 30: Generalized Linear Model Regression Coefficients
Figure 31: GLM ROC Curve
Figure 32: Decision Tree
Figure 33: Random Forest Scaled Importance Values
Figure 34: Gradient Boosted Machine Scaled Importance Values
Figure 35: GBM ROC Curve
LIST OF TABLES
Table 1: Relative Risk and Odds Ratios for Identified Variables
Table 2: Positive Pairwise Correlations
Table 3: Negative Pairwise Correlations
Table 4: Important Features Across Algorithms
I. INTRODUCTION TO INVESTIGATION
Over the past 50 years, the number of young people diagnosed with near-sightedness, or
myopia, in the U.S. has doubled, so that roughly 40-50% of young people are now nearsighted [1].
If this trend continues, by 2050 half of the world’s population, around 5 billion people, will be
myopic [2]. This rapid increase in the number of diagnosed cases is alarming because myopia can
be a precursor to many other vision problems, in extreme cases even leading to blindness.
Pathological myopia, or degenerative myopia, is a form of extreme myopia that frequently results
in cataracts and glaucoma and occurs in 20% of people diagnosed with myopia. Until 5 years ago,
scientists thought that almost all cases of pathological myopia were caused by a genetic
predisposition [3]. Recent research shows that while 30% of these cases are due to genetics or
arise during early childhood, 60% have progressed from a treatable form of myopia that began in
adolescence [4]. To detect the early onset of pathological myopia and other serious vision
problems resulting from myopia, especially during the current "myopic epidemic", a more
efficient and cost-effective way to identify people at risk needs to be developed.
Background Research
Important Concepts
What Is Machine Learning?
There are two major types of programming: traditional, or rule-based, programming and
machine learning. In traditional programming, the computer scientist supplies the data and
writes a program that produces an output. In machine learning, the data and the desired output
are given to the computer, and the computer produces a model. This approach is useful where
rule-based programs are extremely difficult to write. For example, computing the probability
that a credit card transaction is fraudulent would require a very large number of if-then
statements. Also, since fraud patterns keep changing and the target is always moving, the
program would need to be updated constantly. Another difficult scenario is recognizing a 3D
object in a cluttered scene. Even if a rule-based program could be written, it would be
complicated and hard to follow or replicate. Machine Learning is used in these types of
scenarios to achieve a more efficient result.
Machine Learning is part of a broader area of study: Artificial Intelligence. Artificial
Intelligence is the study of how machines can be made to understand the world, make
predictions, choose actions, or, more broadly, exercise judgment. Machine Learning is a
subset of this field and enables machines to improve at tasks by learning from many examples.
Instead of writing a program for each specific task, many examples are collected that specify the
correct output for a given input. The Machine Learning algorithm takes these examples and
produces a program. This allows the program to work for new examples, not just for the
examples the model was trained with. If the model needs to be updated, it can be retrained with
new data, and new rules do not need to be written.
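The idea of producing a program from examples can be illustrated with a very small sketch. It is written in Python for illustration only (the investigation's own program was written in R), the function names are hypothetical, and the training data are made up. Instead of a person hand-writing the rule "predict a vision aid above some number of screen hours", the cutoff is chosen from labeled examples.

```python
def learn_threshold(examples):
    """Pick the cutoff on one numeric input that misclassifies the fewest examples.

    `examples` is a list of (value, label) pairs with label 0 or 1.
    The learned rule is: predict 1 when value >= cutoff.
    Assumes at least two distinct values are present.
    """
    values = sorted({v for v, _ in examples})
    # Candidate cutoffs: midpoints between neighbouring distinct values.
    candidates = [(a + b) / 2 for a, b in zip(values, values[1:])]
    best_cutoff, best_errors = None, len(examples) + 1
    for cutoff in candidates:
        errors = sum((v >= cutoff) != bool(label) for v, label in examples)
        if errors < best_errors:
            best_cutoff, best_errors = cutoff, errors
    return best_cutoff


def predict(cutoff, value):
    """Apply the learned rule to a new input."""
    return 1 if value >= cutoff else 0


# Hypothetical training data: (daily screen hours, wears a vision aid).
training = [(1.0, 0), (2.0, 0), (3.0, 0), (5.0, 1), (6.0, 1), (8.0, 1)]
cutoff = learn_threshold(training)
print(cutoff)                # 4.0
print(predict(cutoff, 7.0))  # 1, for an input the model never saw
```

If new examples arrive, calling `learn_threshold` again retrains the rule without anyone rewriting it by hand.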
Other Concepts
An odds ratio is a measure of association between exposure and outcome. It compares the
odds that an outcome will occur given an exposure with the odds of the outcome occurring
without that exposure. This measure is typically used in case-control studies, like this
investigation. The natural log (ln) of the odds ratio is equal to the logistic regression
coefficient produced by the machine learning model for that exposure.
The relative risk or risk ratio is the ratio of the probability of an event occurring in an
exposed group to the probability of the event occurring in the non-exposed group. This measure
of association is similar to the odds ratio, and the two are close when the outcome is rare.
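Both measures can be computed from a 2x2 table of counts. The sketch below is an illustration in Python; the function names and the counts are hypothetical and are not the survey data.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: exposed, with outcome        b: exposed, without outcome
    c: not exposed, with outcome    d: not exposed, without outcome
    """
    return (a / b) / (c / d)


def relative_risk(a, b, c, d):
    """Risk in the exposed group divided by risk in the non-exposed group."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts, for illustration only.
a, b, c, d = 20, 10, 10, 20
print(odds_ratio(a, b, c, d))                       # 4.0
print(round(relative_risk(a, b, c, d), 2))          # 2.0
print(round(math.log(odds_ratio(a, b, c, d)), 4))   # 1.3863
```

The last line is the quantity that an unpenalized logistic regression with this single binary exposure would report as its coefficient.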
Regression is a set of statistical processes used to determine the relationships among
variables. There are several types of regression, but the two used in this investigation are logistic
regression and regularized regression. Logistic regression is used when the dependent variable is
binary. Its coefficients are fit by Maximum Likelihood Estimation (MLE), so the objective
function is the likelihood of the observed data. Regularized regression (used in Generalized
Linear Models) adds a penalty to the logistic regression objective function to prevent
overfitting. There are two types of regularized regression: the LASSO model and Ridge
Regression. The LASSO model adds a penalty proportional to the sum of the absolute values of
the coefficients computed by the model. Ridge regression adds a penalty proportional to the sum
of the squares of the coefficients. This investigation uses the LASSO model because it drives
the coefficients of non-predictive variables exactly to zero, whereas ridge regression only
shrinks them close to zero. This helps the model generalize rather than memorize the training
data.
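The difference between the two penalties is easiest to see in a simplified setting. The sketch below assumes a single coefficient with squared-error loss, not the logistic loss used in this investigation, because in that case each penalized solution has a closed form. The function names and the example coefficients are hypothetical.

```python
def lasso_shrink(b, lam):
    """Soft-thresholding: the x that minimizes 0.5*(b - x)**2 + lam*abs(x)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0  # small coefficients are set exactly to zero


def ridge_shrink(b, lam):
    """The x that minimizes 0.5*(b - x)**2 + 0.5*lam*x**2."""
    return b / (1.0 + lam)  # scaled toward zero, never exactly zero unless b is zero


# Hypothetical unpenalized coefficients.
unpenalized = [2.0, -1.5, 0.3, -0.1]
lam = 0.5
print([lasso_shrink(b, lam) for b in unpenalized])            # [1.5, -1.0, 0.0, 0.0]
print([round(ridge_shrink(b, lam), 3) for b in unpenalized])  # [1.333, -1.0, 0.2, -0.067]
```

The two weak coefficients vanish under the LASSO penalty but only become smaller under the ridge penalty, which is the behavior described above.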
Decision Tree Learning is a Machine Learning algorithm that uses a decision tree as a
predictive model to go from observations about an item to a conclusion. It is commonly used in
supervised learning classification problems. In a decision tree, the observations are represented
in the branches. If the feature is present, the left branch is followed, and if the feature is not
present, the right branch is followed. The target value is represented in the leaves. In each
leaf, the top number (0 or 1) shows whether most individuals in the category have vision aids,
the middle number is the proportion (as a decimal) of individuals in the category with vision
aids, and the bottom number is the individuals in the category as a percentage of the total
population. There are two types of decision trees, classification trees and regression trees,
referred to collectively as CART. A classification tree, which is used in this experiment, deals
with discrete target values. Regression trees have continuous target values, such as real
numbers. Decision trees are simple to understand and interpret. They also require less data
preparation than other algorithms.
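The three numbers shown in each leaf can be reproduced with a few lines of code. The following Python sketch performs one split and summarizes the two resulting leaves; the rows, the field names `sibling_glasses` and `aid`, and the function names are hypothetical and are not taken from the survey.

```python
def split_on(rows, feature):
    """Send rows where the feature is present left, the rest right."""
    left = [r for r in rows if r[feature]]
    right = [r for r in rows if not r[feature]]
    return left, right


def leaf_summary(leaf, total):
    """Return (majority class, fraction with a vision aid, percent of all rows)."""
    fraction = sum(r["aid"] for r in leaf) / len(leaf)
    majority = 1 if fraction >= 0.5 else 0  # ties go to class 1 in this sketch
    return majority, round(fraction, 2), round(100 * len(leaf) / total)


# Hypothetical survey rows, not the investigation's data.
rows = (
    [{"sibling_glasses": True, "aid": 1}] * 3
    + [{"sibling_glasses": True, "aid": 0}] * 1
    + [{"sibling_glasses": False, "aid": 1}] * 1
    + [{"sibling_glasses": False, "aid": 0}] * 5
)
left, right = split_on(rows, "sibling_glasses")
print(leaf_summary(left, len(rows)))   # (1, 0.75, 40)
print(leaf_summary(right, len(rows)))  # (0, 0.17, 60)
```

A full tree-growing algorithm would also choose which feature to split on at each step; that selection is left out here to keep the leaf notation in focus.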
Distributed Random Forest is an ensemble learning method used in classification
problems. This algorithm constructs several fully-grown decision trees in parallel. Fully-grown
trees have low bias and high variance, and combining many of them reduces the variance. This
addresses the bias-variance tradeoff, which is the problem of reducing bias and variance at the
same time. Bias is the error from the assumptions built into an algorithm. High bias causes an
algorithm to miss relevant features, which is known as underfitting. Variance is the error from
sensitivity to small fluctuations in the training set. High variance causes the model to fit
random noise, memorize the data, and overfit. Random forest uses bootstrap aggregation, or
bagging, to combine the individual decision trees. Bagging is a machine learning ensemble
technique used to improve the accuracy of machine learning algorithms. It is a special type of
model averaging that is usually applied to decision trees. Random forest helps prevent
overfitting on the training dataset, which is common with decision trees and GBM. It is a highly
accurate classifier and runs efficiently on larger datasets.
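Bagging itself is simple to sketch: draw bootstrap samples with replacement, fit one learner per sample, and combine the learners by majority vote. The Python example below is a minimal illustration under stated assumptions. The base learner is a toy lookup table standing in for a fully-grown tree, the data are made up, and a real random forest additionally samples a random subset of features at each split.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(sample):
    """Toy base learner: predict the majority label seen for each feature value."""
    votes = {}
    for feature, label in sample:
        votes.setdefault(feature, Counter())[label] += 1
    # Fallback for a feature value that did not appear in this sample.
    overall = Counter(label for _, label in sample).most_common(1)[0][0]
    table = {f: c.most_common(1)[0][0] for f, c in votes.items()}
    return lambda feature: table.get(feature, overall)


def bagged_predict(models, feature):
    """Combine the individual learners by majority vote."""
    return Counter(m(feature) for m in models).most_common(1)[0][0]


# Hypothetical noisy data: (binary feature, label); 80% of labels match the feature.
rng = random.Random(0)
rows = [(1, 1)] * 8 + [(1, 0)] * 2 + [(0, 0)] * 8 + [(0, 1)] * 2
models = [fit_stump(bootstrap_sample(rows, rng)) for _ in range(25)]
print(bagged_predict(models, 1), bagged_predict(models, 0))
```

An individual learner trained on an unlucky sample can get the rule wrong, but the vote across 25 learners should recover the underlying pattern (label 1 for feature 1, label 0 for feature 0).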
Gradient Boosted Machines use gradient boosting, which is a machine learning technique
used in regression and classification problems. There are three elements involved in GBMs: a
loss function, a weak learner, and an additive model. A loss (objective) function needs to be
minimized. Different loss functions are used for different problems. The weak learners have
high bias and low variance, the converse of the trees in Random Forest. The weak learners are
usually CART trees. The additive model builds an ensemble of weak prediction models by adding
them one at a time, rather than in parallel like Random Forest. Different loss functions can be
used with the same algorithm, which makes Gradient Boosting an appealing approach.
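The three elements can be seen together in a small sketch. The Python example below assumes squared-error loss, for which the negative gradient is simply the residual, and uses one-split stumps as weak learners. The function names, the learning rate, and the data are hypothetical; this is not the GBM implementation used in the investigation.

```python
def fit_stump(xs, residuals):
    """Weak learner: one split on x, predicting the mean residual on each side.

    Assumes xs contains at least two distinct values.
    """
    best = None
    for cut in sorted(set(xs))[1:]:
        left = [r for x, r in zip(xs, residuals) if x < cut]
        right = [r for x, r in zip(xs, residuals) if x >= cut]
        lm, rm = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, cut, lm, rm)
    _, cut, lm, rm = best
    return lambda x: lm if x < cut else rm


def boost(xs, ys, rounds=50, rate=0.1):
    """Additive model: start from the mean, then add stumps one at a time."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + rate * sum(s(x) for s in stumps)

    for _ in range(rounds):
        # For squared loss the negative gradient is the residual y - prediction.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_stump(xs, residuals))
    return predict


# Hypothetical data: the label switches from 0 to 1 between x = 3 and x = 4.
model = boost([1, 2, 3, 4, 5, 6], [0, 0, 0, 1, 1, 1])
print(round(model(2), 2), round(model(5), 2))  # close to 0 and close to 1
```

Each stump on its own is a poor model, but because every new stump is fitted to what the current ensemble still gets wrong, the sum converges toward the labels.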
Gradient Descent is an iterative process for finding the minimum value of a function. The
model uses it to minimize its objective function (for maximum likelihood, the negative of the
log-likelihood) and obtain the coefficients. An analogy that is often used to illustrate gradient
descent involves a person on a mountain. The mountain is foggy, and the person is trying to get
down, that is, to find the minimum. Since the path is not visible, he must use local information
to get down. He must look at the steepness at his current position and proceed in the direction
of steepest descent. This is much like the process the machine learning model follows to
minimize its objective function and determine the variable coefficients.
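The mountain analogy corresponds to a very short loop. The Python sketch below minimizes the one-variable function f(x) = (x - 3)^2, chosen only because its minimum is known to be at x = 3; the function name, step size, and number of steps are illustrative assumptions.

```python
def gradient_descent(grad, start, rate=0.1, steps=100):
    """Repeatedly step against the gradient, i.e. in the downhill direction."""
    x = start
    for _ in range(steps):
        x -= rate * grad(x)
    return x


# f(x) = (x - 3)**2 has gradient 2*(x - 3) and its minimum at x = 3.
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
print(round(minimum, 4))  # 3.0
```

A model with many coefficients follows the same procedure, with `x` replaced by the vector of coefficients and `grad` by the gradient of the objective function.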
N-fold Cross Validation is a model validation technique used to assess how the results of
a statistical analysis will generalize to an independent dataset. It is mainly used to estimate the
accuracy of a model. The dataset is divided into four groups (4-fold), and each group is used for
training three times and for testing once. N-fold cross validation allows all the data to be used
for both training and testing. This gives a more reliable estimate of the AUC (area under the
ROC curve) even though the sample size is small.
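The fold bookkeeping can be sketched as follows. The Python example splits 60 row indices, matching the number of survey responses, into 4 folds; the function name is hypothetical, and in practice the rows are usually shuffled before splitting, which is omitted here.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


for train, test in k_fold_indices(60, 4):
    print(len(train), len(test))  # 45 15 on every fold
```

Every row appears in exactly one test fold and in three training sets, so the model is evaluated on all 60 responses without ever being tested on a row it was trained on in that fold.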
Overview of Myopia
Myopia, also known as near-sightedness or short-sightedness, is a vision disorder in
which the eyeball is elongated and incoming light focuses in front of the retina instead of on
it. Because of this, distant objects appear blurry while close objects appear normal. While
myopia is benign in most cases, requiring only glasses or contacts, it greatly increases the risk
of serious vision problems such as retinal detachment, cataracts, and glaucoma.
The first person to recognize and distinguish between myopia and hyperopia, also known
as far-sightedness, is said to be the Greek philosopher Aristotle (384 BC – 322 BC). However,
these ideas only took concrete form in 65 AD, when the words meyin, meaning close, and opdos,
meaning eye, were used in the compilation of Roman law called Libris Pandectorum. The
Greek word “Myopos” eventually evolved into the Latin word “myops”, meaning a condition in
which someone attempts to see clearly by partially closing their eyes. While this condition has
been recognized for over 2000 years, research on the causes of myopia began only about 200
years ago [5]. Over time, scientists have come to understand that myopia results from a
combination of genetic predisposition and environmental factors. The main risk factors are
believed to include time spent doing work that involves focusing on close objects, time spent
outdoors, family history of the condition, socioeconomic class, and Vitamin A deficiency.
Myopia can occur at various degrees of severity, and the rate of progression to
degenerative myopia can also vary. Each person diagnosed with myopia can see clearly out to a
certain distance that varies from individual to individual, and everything farther than that is
blurry. Eye examinations have shown that most myopic eyes have the same structure as
non-myopic eyes, although in cases of high to degenerative myopia, a staphyloma can sometimes
be seen in an examination. A staphyloma is an abnormal protrusion of uveal tissue, which
surrounds the retina, through a weak point in the eyeball. Usually, the protrusion is black and
affects layers of the eye such as the cornea and sclera. The most significant structural
difference between myopic and normal eyes is the length of the eye. In a longer eye, the retina
needs to stretch to cover the increased distance, and hence becomes weaker. This causes lattice
degeneration and retinal detachment. Lattice degeneration is an eye disease in which the retina
develops breaks and tears, which eventually cause the retina to detach, resulting in legal
blindness [6].
Overview of Hyperopia
Hyperopia, also known as far-sightedness, is a vision disorder in which distant objects
appear clear but near ones appear blurry. In contrast to myopia, in hyperopia the eyeball is too
short or the cornea has too little curvature. This causes the image to focus behind the retina
instead of on the retina. Hyperopia is most commonly seen in middle-aged adults, progressing as
they get older, and is much less likely in adolescents [18]. The correlation between hyperopia
and cataracts is alarming; 77.78% of hyperopic subjects end up with cortical cataracts [19].
Recent Studies and Probable Causes
Recently, scientists have begun talking about an “epidemic of nearsightedness”. While
the prevalence of nearsightedness in the U.S. is high, other parts of the world have far higher
prevalence rates. Parts of East Asia like Singapore, China, Japan, and Korea have myopia
prevalence rates as high as 80% among middle and high school age children.
Over the years, scientists have blamed nearsightedness on a variety of causes.
Mathematician Johannes Kepler (1571 – 1630) blamed his nearsightedness on near work, meaning
all the writing and calculations that he did up close. This hypothesis was accepted for a
long time. Smartphones and tablets are not included in the definition of near work, since they
are too modern. Because nearsightedness was already on the rise before these electronics became
popular, scientists do not attribute the epidemic to them. Extensive research has been done on
the effect of near work on myopia rates, and while scientists have not established a firm link,
they have not ruled it out completely.
In the 20th century, scientists learned that genetics plays a role in the progression of
myopia. While the likelihood that you will be myopic is higher if your parents are as well, the
relationship between the presence of myopia and the mutation of a specific gene is not
straightforward. A few dozen genes together influence the final phenotype present in an
individual. To test several hypotheses about genetic predisposition, a study was done in an
Inuit community in 1969. At the start of the study, 2 out of 131 people in that community
(1.5%) were nearsighted. After 1-2 generations (children and grandchildren), prevalence rates
rose to nearly 50%. This rapid change could not have resulted from genetics alone. Scientists
then concluded that while genes may have some influence, the main cause of nearsightedness is
present in the environment.
Scientists proceeded to investigate a link between nearsightedness and education. A
study published in October 2015 by researchers from the University of Wales Cardiff concluded
that the prevalence of myopia in first-borns was 10% higher than in children born later. This
does not by itself account for the skyrocketing rates, but it does give scientists a clue as to
what the cause might be. Researchers adjusted the data to account for how much education each
of the participants had. The effect diminished, meaning that the amount of education the
subjects had accounted for the difference in myopia rates. Scientists then proposed the
“Parental Investment” hypothesis. Parents tend to emphasize the value of education more in
first-borns than in their later children. As a result, first-born children spend more time
studying, and therefore have a higher rate of myopia. Another study, conducted by professors
from the Sun Yat Sen University in China, investigated the link between socioeconomic status
and nearsightedness by comparing the rates of myopia in neighboring Chinese provinces.
Schoolchildren from the Shaanxi province, a middle-income province, were compared to
schoolchildren from the Gansu province, a relatively poor province. In the wealthier province,
the prevalence of myopia was twice that of the poorer province. While the researchers could not
find a probable reason to explain the difference, they found that higher math scores, which
were more common in the wealthier province, were correlated with higher rates of
nearsightedness. This emphasizes the link between education and nearsightedness. It may also
help explain the extent of the problem in Asia, since education is particularly emphasized in
many East Asian cultures.
To determine whether culture, as opposed to race, influenced nearsightedness, researchers
conducted a study on ethnic Chinese children living in Sydney and Singapore. They made sure
that the kids’ parents had similar rates of nearsightedness (around 70% in both groups). In the
children, the difference was alarming. Only 3.3% of the kids living in Sydney were nearsighted,
while 29.1% of the kids living in Singapore were nearsighted. Surprisingly, the children in
Sydney did more near work than the children in Singapore. The only difference between the
groups of children was how much time they spent outside. While the kids in Sydney spent an
average of 13 hours a week outside, the kids in Singapore spent an average of 3 hours outside.
Public health officials hypothesized about the effect of sunlight on myopia and its progression.
Scientists proceeded to look for rigorous evidence and a mechanism by which this link could be
established.
Within the last few years, researchers have made considerable progress towards
establishing lack of sunlight as a probable cause. Several experiments on animals have shown
that light protects against myopia. In a study done by researchers in Germany, myopia was
induced in chicks using special goggles. The researchers then placed one group of chicks in
sunlight and the other group under regular laboratory lighting. The onset of myopia was slowed
by around 60% in the group raised under sunlight.
After establishing rigorous evidence, researchers then focused their attention on the
science behind sunlight and our brains. They found a substance produced by organisms’ brains
that influences eye development: Dopamine. Dopamine is a neurotransmitter that plays several
important roles in the brains and bodies of humans and other animals. This chemical is released
to send signals to other nerve cells. The brain has dopamine pathways that influence motor
control and hormone release. To establish that Dopamine affects proper eye development,
scientists injected chicks with a chemical that blocks Dopamine. With Dopamine blocked,
sunlight no longer protected the chicks from myopia. Scientists concluded that Dopamine is
released as a result of bright light. This chemical is also related to the body’s day-night
rhythm. The human body switches from low-light nighttime vision to daytime vision, and without
Dopamine, this switch cannot occur. Researchers now believe that this “Dopamine cycle” is
required for healthy eye development throughout childhood and adolescence. If this cycle is
disrupted, especially during one’s middle-school years, the eyeball tends to become elongated,
causing myopia. Scientists call this hypothesis the Light-Dopamine hypothesis.
To test this hypothesis, scientists looked at primary school children from 12 schools in
Guangzhou, China. The schools were divided into two groups of 6 schools each, so about 950
children were in each group. One of the groups did not change their daily routine, while the
other group added a 40-minute outdoor activity time to their schedule. For 3 years, the
researchers tracked the children and their eye development. At the end of the trial, the incidence
rate of myopia among the children that spent time outside was 30%, while in the other group, the
rate of myopia was 39.5%. While the reduction was less than they expected, 3 years is a short
amount of time in which to witness significant change. To better establish the link,
researchers are calling for more studies [7].
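The size of the effect reported in this trial can be worked out from the two incidence rates. The short Python calculation below uses only the figures quoted above; the group size of 950 is the approximate figure from the text, so the number of avoided cases is a rough estimate, and the variable names are illustrative.

```python
control_rate = 0.395   # incidence in the group that kept its usual routine
outdoor_rate = 0.30    # incidence in the group with added outdoor time
group_size = 950       # approximate children per group, as stated in the text

absolute_reduction = control_rate - outdoor_rate
relative_risk = outdoor_rate / control_rate
relative_reduction = 1 - relative_risk

print(round(absolute_reduction * 100, 1))      # 9.5 percentage points
print(round(relative_risk, 2))                 # 0.76
print(round(relative_reduction * 100, 1))      # 24.1 percent fewer cases
print(round(group_size * absolute_reduction))  # about 90 cases avoided
```

In other words, the extra outdoor time corresponded to roughly one quarter fewer new cases of myopia over the three years.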
Diagnosis, Prevention, and Treatment
Myopia is most commonly diagnosed by eye care professionals (optometrists or
ophthalmologists). Usually, an autorefractor or a retinoscope is used to give an initial
assessment of the refractive status of each eye. An autorefractor is a machine that measures how
light enters the person’s eye and how the eye changes the light. A retinoscope is an instrument
that shines light into a person’s eye so that the reflection from the retina can be observed. A
phoropter, a device that contains different lenses, is then used to determine the exact
prescription. These techniques and tools help distinguish between myopia, hyperopia,
astigmatism, and presbyopia [8].
There are various forms of myopia, each differing in intensity and treatment options.
Simple myopia, which is typically less than 4 diopters, can easily be corrected with glasses or
contacts. Degenerative myopia (malignant, pathological, or progressive myopia) is characterized
in other ways. The most common is through fundus changes. The fundus is the interior surface of
the eye opposite the lens and includes the retina, optic disc, macula, fovea, and posterior pole.
Degenerative myopia is also characterized by a very high refractive error and subnormal vision
even after correction. This form of myopia is known to continually progress and get worse over
time. Another form of myopia is Pseudo myopia, which is the blurring of distance vision
brought about by a spasm of the accommodation system. Pseudo myopia includes nocturnal
myopia, near work-induced transient myopia, and instrument myopia. The last form of myopia
is Induced myopia, also known as acquired myopia. This form of myopia is brought on by the
use of different drugs, increases in glucose levels, nuclear sclerosis, oxygen toxicity, etc. [9].
Other than spending time outdoors, as suggested by the Light-Dopamine Hypothesis,
there are several other ways to prevent myopia or slow its progression. The use of glasses and
contact lenses can help alter the speed of myopic progression. The American Optometric
Association’s Clinical Practice Guidelines for Myopia refers to several studies which show that
switching among full-time, part-time, and no lens wear does not appear to slow progression. In young
people ages 18 or younger, topical medications such as anti-muscarinics are known to slow the
progression of myopia. A muscarinic receptor antagonist is a drug that blocks the action of the
neurotransmitter acetylcholine, which, like dopamine, influences eye development. Eyedrops containing
cyclopentolate and atropine, chemicals that influence nerve connections, also slow the
progression of myopia. These chemicals often have side effects which cause light sensitivity,
since they cause too much Dopamine to be produced. Scleral reinforcement surgery is another
method, which targets lattice degeneration. It provides reinforcement to the thinning posterior pole,
which is part of the fundus. By slowing or stopping the progression of the disorder, quality of
vision and extent of myopia may be improved [10].
Other than targeting the rate of progression, there is no concrete and universally accepted
solution to treat myopia. There are several types of refractive surgery that can be done for up to
6 diopters of myopia. PRK surgery (photorefractive keratectomy) is a laser eye procedure
intended to reduce a person’s dependence on glasses/contacts. The method involves removing
some corneal tissue using an excimer laser, an ultraviolet laser commonly used in eye surgery.
While PRK is relatively safe, the recovery process is usually painful [11]. Another type of
surgery is LASIK surgery, where a flap is cut on the cornea prior to the procedure. Then, using
an excimer laser, the curvature of the cornea is changed. While the process is quite like PRK,
the recovery is usually painless, but cornea stability may be sacrificed [12]. Unlike PRK and
LASIK, where the corneal surface is modified, refraction can also be corrected by
implanting an additional lens inside the eye. While this typically fixes the refractive problem, it can
eventually lead to glaucoma and other serious eye disorders [13].
Recently, several alternative therapies have been developed because of the lack of
scientific interventions. Vision therapy, also known as “behavioral optometry”, includes various
eye exercises and other relaxation techniques. One of the most popular forms of vision therapy
is the Bates Method. Physician William Horatio Bates (1860 – 1931) attributed all vision
disorders, including myopia, to a constant strain on the eyes. Hence, he felt that glasses harmed
the eye and trained the eye into the myopic state. His method includes palming, visualization,
movement, and exposure to sunlight. Scientists have stated that these alternative techniques
have “no clear scientific evidence” and so “they cannot be advocated for” [14]. In the late
1980s and early 1990s, biofeedback became very popular as a treatment for myopia.
Biofeedback is the process of gaining awareness of many physiological functions primarily using
instruments that provide information on the activity of those same systems. The goal was to be
able to manipulate these systems at one’s own will. Scientists maintain that biofeedback training
is not consistent [15].
Epidemiology and Statistics
The prevalence of myopia varies with age, country, race, religion, ethnicity, sex,
environment, occupation, and other factors. When comparing studies, more than one factor
differs, which makes comparisons of progression and incidence difficult. For example, in Asian
cultures, the prevalence of myopia has been as high as 70 – 90%, while in Europe and the United
States, the prevalence is 30 – 40%. The difference is even greater when comparing to
Africa, with prevalence rates between 10 – 20%. One must understand that race and ethnicity are not
the only factors that change; environment, socioeconomic factors, country, etc. also affect these
rates. This makes it hard to pinpoint a certain risk factor. Additionally, myopia is about twice
as common in Jewish communities as in non-Jewish communities. Religion may not have
anything to do with this difference, but currently it is hard to tell [16].
The epidemiology of global refractive errors has become a popular research topic. In
North America, myopia is more common in the United States than in Canada and
Mexico. Research suggests that prevalence has increased dramatically over the past few
decades. In 1971 – 1972, the National Health and Nutrition Examination Survey reported the
first estimate of myopia prevalence in the U.S. The prevalence of myopia in people aged 12
– 54 was 25%. Another survey was conducted in 1999 – 2004, and the prevalence had increased
to 42% [1]. A different study was done of 2,523 students in grades 1 – 8 (ages 5 – 17) in a
diverse region of the United States. 10% of the students had at least -0.75 diopters of myopia,
but there was considerable variance when race was accounted for. Asians had the highest
prevalence (19%), then Hispanics (13%), followed by African Americans (7%), and lastly
Caucasians (4%) [17].
Interview with Optometrist
Dr. Ratidzo Macharaga, O.D., Envision Eyecare

1. On average, what percentage of your patients are 10 to 15 years old?
Around 40% of my patients are 10 to 15 years old.
2. Are most of them nearsighted or farsighted?
Most of them are nearsighted; however, kids below 10 are usually farsighted.
3. Are there any other trends in myopic adolescents?
Most of them tend to be Asian. There are no gender trends that I know of.
4. Is myopic progression or serious jumps in prescription common?
Yes, serious jumps in prescription are very common and occur in 15 – 20% of myopic
adolescents.
5. What factors might affect the progression of myopia in adolescents?
Prolonged screen use is the main factor. Family history, medication, and nutrients
(Vitamins A and D) can also influence myopic progression.
6. Are there any vision disorders that can result from myopic progression?
Yes, the most common vision disorder is refractive amblyopia.
Refractive amblyopia is when the brain ignores one eye because it is misaligned, leading
to a “lazy eye”.
7. Currently, is there anything that can help with the early detection of vision disorders
other than professional eye exams?
Medical checkups are supposed to test vision, but they are not very thorough, and only
happen once a year.
Statement of the Problem
To detect the early onset of pathological myopia, and other serious vision problems
resulting from myopia, especially during the current "myopic epidemic", a more efficient and
cost-effective solution to identify people at risk needs to be developed. A machine learning
model can accurately determine primary risk factors of myopia, and using this model, a simple
test to inform people of their level of risk can be developed. Additionally, more specialized
treatment can be tailored to these risk factors, and the prevalence of pathological myopic
progression can be greatly reduced.
Research Question
Which factors will a model, trained using machine learning algorithms on data collected
through experimentation, identify as key risks of vision disorders in adolescents?
Hypothesis
If a predictive model is built based on the surveys and is used to score students who are at
risk for vision problems so that they can be taken in for further testing, then the key risk factor
will be the one regarding time spent outdoors.
II. PROCEDURE OF INVESTIGATION
Variables

Independent Variable: Features (ex. Height, race, duration of screen time, etc.)
Dependent Variable: Whether the subject has myopia
Procedure1
1. Design the survey.
a. Think about the logistics.
i. Goal: Obtain information about the demographics, lifestyle, and eye
health of the subject
ii. Target Population: Adolescents (Ages 10 – 17)
iii. Timeline: Distribute the surveys; collect them after three weeks
(Footnote 1: There are no constants, controls, or materials in this investigation.)
iv. Mode: Paper (includes an introductory letter, a parental consent form,
and the vision survey)
b. Develop the questions.
i. Reliability: Each survey question should mean the same thing to
everyone, including those who administer the survey
ii. Question Structure: Avoid “double-barreled” questions where two
responses are required
iii. Question Type: Refrain from using open-ended questions unless
necessary (an open-ended question is necessary for age)
iv. Reference Period: The time frame each survey should take (less than 5
minutes)
v. Response Format: Make sure that the respondent clearly knows what
to mark to signify their answer
c. Present the survey in an appealing and professional way2.
i. Establish credibility: Create an introductory letter to state your goals
clearly
ii. Appeal to all audiences: Translate to different languages, if some of
the respondents require that
2. Administer the survey.
a. Obtain parental consent from participants that are minors (all participants are
minors in this investigation)
b. Collect the surveys three weeks after distributing them
(Footnote 2: The vision survey used in this investigation is on the next page.)
PLACEHOLDER FOR Vision Survey
3. Convert the survey to a digital format.
a. In this investigation, results will be recorded electronically in a secure
database
b. Give the answer choices numbers for each question (first question first choice
is 1, first question second choice is 2, second question first choice is 1, etc.)
c. Create a row for each survey (ex. John is row 1, Amy is row 2)
d. Give each question a variable name (ex. The question “How long do you
spend watching T.V.?” will become “timeTV”). These variable names will be
used as the headings for the columns
e. Record the results in this format: If John picked the first answer choice for
question 1, in Row 1, Column glassesContacts (variable name), there will be a
“1”
4. Create summary statistics for each question.
a. R will generate statistics like % of subjects with glasses, % male, % in each
race, etc.
5. Create binary explanatory variables.
a. Ensure that there are sufficient counts in each group.
b. Create binary variables from the summary statistics.
6. Create a response variable.
a. The response variable is binary.
b. Vision aids or not (0 or 1).
7. Compute odds ratios and relative risk.
a. Each variable is computed separately.
b. Used to study univariate characteristics.
8. Train using Gradient Descent in R
a. Use the H2O Library and 4-fold Cross-validation
b. Use a Generalized Linear Model, Decision Tree Model, Random Forest
Model, and Gradient Boosted Model
9. Determine the most important features across all algorithms
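The 4-fold cross-validation named in step 8 can be illustrated with a minimal base R sketch. It does not use the H2O library or the survey data: the data frame `toy`, its columns, and the probabilities used to generate it are all invented for illustration, and a plain `glm` stands in for the four models.

```r
# Sketch of 4-fold cross-validation on synthetic data (NOT the survey data).
set.seed(12345)
n <- 60   # same number of rows as the survey, chosen only for illustration
k <- 4    # number of folds

# Invented binary feature x and binary response y.
toy <- data.frame(x = rbinom(n, 1, 0.5))
toy$y <- rbinom(n, 1, ifelse(toy$x == 1, 0.6, 0.3))

# Randomly assign each row to one of k equally sized folds.
fold <- sample(rep(seq_len(k), length.out = n))

# For each fold: fit on the other three folds, score on the held-out fold.
acc <- sapply(seq_len(k), function(i) {
  train <- toy[fold != i, ]
  held  <- toy[fold == i, ]
  fit   <- glm(y ~ x, family = binomial, data = train)
  pred  <- as.integer(predict(fit, newdata = held, type = "response") > 0.5)
  mean(pred == held$y)   # held-out accuracy for this fold
})

print(table(fold))   # rows per fold
print(mean(acc))     # average held-out accuracy across the folds
```

Each subject is used for testing exactly once, which matters with only 60 surveys because no separate test set can be spared.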
III. RESULTS
Program
Vision Survey Analysis

pkgs <- c('data.table', 'extrafont', 'ggplot2', 'grid', 'gridExtra', 'h2o',
'RColorBrewer', 'rpart', 'rpart.plot', 'scales')
pkgs.loaded = sapply(pkgs, function(x)
suppressPackageStartupMessages(require(x, character.only=TRUE)))
stopifnot(length(pkgs)==sum(pkgs.loaded))
options(list(stringsAsFactors=FALSE, width=130))
cbPalette <- c("#999999", "#C01B1B", "#56B4E9", "#008080", "#FA8072",
"#0072B2")
blank <- rectGrob(gp=gpar(col="white"))
remove(list=c('pkgs','pkgs.loaded'))
Reading Input Data
Labels for Categorical Features
labels.dt <- fread('labels.csv', sep='|')

cat(sprintf('Number of questions: %d',uniqueN(labels.dt[,feature])), '\n')
cat(sprintf('Number of labels: %d',labels.dt[,.N]), '\n\n')
labels.dt
## Number of questions: 30
## Number of labels: 97
Survey Data
data_orig.dt <- fread("surveys.csv",

drop=c('glassesMS','glassesNear','glassesFar'))
cat('Number of surveys:', data_orig.dt[,.N])
data_orig.dt
## Number of surveys: 60
1-10 of 60 rows | 1-10 of 29 columns
#changes for Spanish version: race Hispanic and African American was swapped
data_orig.dt[, tmp:=race]
data_orig.dt[language==2 & tmp==2, race:=3]; data_orig.dt[language==2 &
tmp==3, race:=2]
data_orig.dt[, tmp:=NULL]
#changes for Spanish version: screens set to same as read

for(x in c('Car','LowLight','LyingDown')) {
data_orig.dt[language==2, paste0('screen',x):=get(paste0('read',x))]
}
data_orig.dt[language==2]
Distribution of Survey Responses
#template
#replacing *NL* with \n
ggp.1 <- function(var, col, xl=NULL) {
if(is.null(xl)) {xl <-
paste0(toupper(substring(var,1,1)),substring(var,2))}
xax <- labels.dt[feature==var]
labels <- stringi::stri_replace_all_fixed(xax[['label']],
c("<=",">=","*NL*"), c("\u2264","\u2265","\n"),
vectorize_all=FALSE)
ggplot(data_orig.dt, aes_string(paste0('factor(',var,')'))) +
theme_light() +
theme(panel.grid.minor=element_blank(),
text=element_text(family='Segoe UI', size=15)) +
geom_bar(aes(y=(..count..)/sum(..count..)), fill=cbPalette[col]) +
geom_text(aes(label=..count.., y=(..count../sum(..count..))*.5),
stat='count', color='white', size=4) +
scale_x_discrete(breaks=xax[['value']], labels=labels) + xlab(xl) +
scale_y_continuous(labels=percent) + ylab('Students (%)')
}
Response Variable
grid.arrange(blank, ggp.1('glassesContacts',1,'Glasses or Contacts'), blank,

widths=c(1,6,1))
Demographics
grid.arrange(ggp.1('age',5), blank, ggp.1('gender',3), widths=c(6,.5,3.5))

grid.arrange(blank, ggp.1('race',6), blank, widths=c(1,8,1))
Physical Attributes
grid.arrange(ggp.1('height',2,'Height (ft)'), blank,
ggp.1('weight',5,'Weight (lbs)'), widths=c(4.4,.5,6.2))
Food Servings
grid.arrange(ggp.1('vegetable',3,'Servings of Vegetables'), blank,

ggp.1('fruit',6,'Servings of Fruit'), widths=c(6,.5,6))
grid.arrange(ggp.1('snack',5,'Servings of Snacks'), blank,
ggp.1('protein',2,'Servings of Protein'), widths=c(5.7,.2,5))
grid.arrange(blank, ggp.1('lunchHome',1,'Lunch from Home'), blank,
widths=c(1.5,5,1.5))
Time Spent on Activities
grid.arrange(ggp.1('timeScreenSchool',1,'Screen Time for School Work'), blank,

ggp.1('timeTV',5,'Time Spent Watching TV'), widths=c(6.5,.5,5))
grid.arrange(ggp.1('timeCellPhone',3,'Time Spent on a Cell Phone'), blank,
ggp.1('timeBook',4,'Time Spent Reading Books'), widths=c(5,.5,5))
grid.arrange(ggp.1('timeComputer',2,'Time Spent Using Computer'), blank,
ggp.1('timeOutdoors',6,'Time Spent Outdoors'), widths=c(5,.5,5))
Reading or Watching a Screen
grid.arrange(ggp.1('readCar',1,'Reading While Traveling in a Car'), blank,

ggp.1('screenCar',2,'Watching a Screen While Traveling in a Car'),
widths=c(5,.5,5))
grid.arrange(ggp.1('readLowLight',4,'Reading In Low Light'), blank,
ggp.1('screenLowLight',5,'Watching a Screen In Low Light'),
widths=c(6,.5,4.5))
grid.arrange(ggp.1('readLyingDown',6,'Reading While Lying Down'), blank,
ggp.1('screenLyingDown',3,'Watching a Screen While Lying Down'),
widths=c(5,.5,5))
Other Factors
grid.arrange(ggp.1('transportation',5), blank, ggp.1('braces',2),

widths=c(6.5,.5,3.5))
grid.arrange(ggp.1('siblings',6,'Siblings With Vision Aids'), blank,
ggp.1('physicalActive',3,'Physical Activity'), widths=c(6,.5,5))
Creating Features for Model Matrix
model.dt <- copy(data_orig.dt)

vars.orig <- names(data_orig.dt)
#response
model.dt[, y:=glassesContacts %in% c(1,2,4)]
#demographics
model.dt[, gender.Female:=gender %in% c(1)]
model.dt[, age.GT12:=age>12]
model.dt[, race.NonWhite:=!(race %in% c(1))]
model.dt[, height.GT54:=height %in% c(4)]
model.dt[, weight.GT130:=weight %in% c(5)]
#food
model.dt[, vegetable.GT1:=vegetable %in% c(3,4,5)]
model.dt[, fruit.GT2:=fruit %in% c(4,5)]
model.dt[, snack.GT2:=snack %in% c(4,5)]
model.dt[, protein.GT2:=protein %in% c(4,5)]
model.dt[, lunchNotHome:=lunchHome %in% c(2)]
#time on activities
model.dt[, timeScreenSchool.LE1:=timeScreenSchool %in% c(1)]
model.dt[, timeTelevision.LE1:=timeTV %in% c(1)]
model.dt[, timeCellPhone.LE1:=timeCellPhone %in% c(1)]
model.dt[, timeComputer.LE1:=timeComputer %in% c(1)]
model.dt[, timeBook.LE1:=timeBook %in% c(1)]
model.dt[, timeOutdoors.LE1:=timeOutdoors %in% c(1)]
#reading/screen time in various situations

model.dt[, readCar.Yes:=readCar %in% c(1)]
model.dt[, readLowLight.Yes:=readLowLight %in% c(1)]
model.dt[, readLyingDown.Yes:=readLyingDown %in% c(1)]
model.dt[, screenCar.Yes:=screenCar %in% c(1)]
model.dt[, screenLowLight.Yes:=screenLowLight %in% c(1)]
model.dt[, screenLyingDown.Yes:=screenLyingDown %in% c(1)]
#others
#model.dt[, transport.Car:=transportation %in% c(1)]
model.dt[, braces.Yes:=braces %in% c(1)]
model.dt[, siblingsGlasses.Yes:=!(siblings %in% c(3))]
#model.dt[, physicalActive.GT1hr:=physicalActive %in% c(3)]
model.dt <- model.dt[, setdiff(names(model.dt), vars.orig), with=F]
model.dt <- model.dt[, lapply(.SD, as.integer), .SDcols=names(model.dt)]
model.dt
Summary Statistics: Correlations Between All Pairs of Features
cr <- cor(model.dt[, -c('y')])

cr.dt <- data.table(melt(cr, na.rm=TRUE, value.name='cor'))
ggplot(data = cr.dt, aes(Var1, Var2, fill = cor))+ geom_tile(color='white') +

theme_light() + coord_fixed() +
scale_fill_gradient2(low='red', high='blue', mid='white', midpoint=0,
limit=c(-1,1), space='Lab',
name='Correlation') +
theme(axis.text.x=element_text(angle=90, vjust=.5, size=9, hjust=1),
axis.text.y=element_text(size=9, vjust=.5),
text=element_text(family='Segoe UI'))
cr.dt <- cr.dt[as.integer(Var1)<as.integer(Var2)]
cbind(setorder(cr.dt, -cor)[1:5], data.table('|'='|'),setorder(cr.dt, cor)
[1:5])
varcol.dt <- rbind(

data.table(var=c('timeOutdoors.LE1','screenCar.Yes','siblingsGlasses.Yes'),
color=c('#C7E9B4','#7FCDBB','#41B6C4')),
data.table(var=c('height.GT54','weight.GT130'))[, color:='#1D91C0'],
data.table(var=c('lunchNotHome','race.NonWhite'))[, color:='#225EA8'],
data.table(var=c('readLowLight.Yes','readLyingDown.Yes'),
color=c('#253494','#081D58'))
)
varcol.dt <- rbind(varcol.dt,
data.table(var=setdiff(names(model.dt),c(varcol.dt[['var']],'y')))[,
color:='#000000']
)
ggp.2 <- function(dt, fpar) {
dt.p <- setnames(dt[, fpar[['cols']], with=F], c('var','value'))
dt.p <- setorder(dt.p, -value)[1:fpar[['nobs']]]
dt.p <- merge(dt.p, varcol.dt, by='var', sort=FALSE)
dt.p <- setorder(dt.p, value)[, var:=factor(.I,labels=var)]
p <- ggplot(dt.p, aes(var,value,fill=var)) + theme_light() +

geom_col(alpha=.8) +
coord_flip() + xlab(fpar[['xylab']][1]) + ylab(fpar[['xylab']][2]) +
scale_fill_manual(values=dt.p[['color']]) +
theme(legend.position='none', text=element_text(family='Segoe UI',
size=14))
return(p)
}
#N <- 5
#dt <- data.table(X=sample(varcol.dt[['var']],N), Y=runif(N))
#print(dt)
#par <- list(nobs=N, cols=c('X','Y'), xylab=c('Variable','Odds Ratio'))
#ggp.2(dt, par)
Relative Risk and Odds Ratio
tot <- model.dt[, .N, y]

tmp <- model.dt[, lapply(.SD,sum), by=y, .SDcols=2:ncol(model.dt)]
col <- names(tmp)[seq(2,ncol(tmp))]
univar.dt <- data.table(codes=col,

eve.con=unname(unlist(tmp[y==1,col,with=F])), eve=tot[y==1,N],
nev.con=unname(unlist(tmp[y==0,col,with=F])),
nev=tot[y==0,N])
univar.dt[, ':='(eve.nocon=eve-eve.con, nev.nocon=nev-nev.con)][,
c('eve','nev'):=NULL]
univar.dt[, ':='(con=eve.con+nev.con, nocon=eve.nocon+nev.nocon)]
univar.dt[, ':='(rel.risk=(eve.con*nocon)/(eve.nocon*con),
odds.ratio=(eve.con*nev.nocon)/(nev.con*eve.nocon))]
univar.dt <- setorder(univar.dt[, c('codes','rel.risk','odds.ratio')],
-odds.ratio)
univar.dt
ggp.2(univar.dt, list(nobs=5, cols=c('codes','odds.ratio'),

xylab=c('Variable','Odds Ratio')))
remove(list=c('col','tmp','tot'))
Predictive Models: Decision Tree Model
model.rp <- rpart(y~., method='class', control=rpart.control(cp=1E-5,

minsplit=10, xval=4), data=model.dt)
rpart.plot(model.rp, type=1, fallen.leaves=FALSE, tweak=1.1)
localH2O <- h2o.init(ip='localhost', port=55555, max_mem_size='8g', nthreads=-1)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\User\AppData\Local\Temp\RtmpEJ0lnF/h2o_User_started_from_r.out
## C:\Users\User\AppData\Local\Temp\RtmpEJ0lnF/h2o_User_started_from_r.err
##
##
## Starting H2O JVM and connecting: .. Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 4 seconds 163 milliseconds
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.18.0.4
## H2O cluster version age: 8 days
## H2O cluster name: H2O_started_from_R_User_jll417
## H2O cluster total nodes: 1
## H2O cluster total memory: 7.11 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8
## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 55555
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.3 (2017-11-30)
s <- capture.output(h2o.removeAll())
h2o.no_progress()
data.h2o <- as.h2o(model.dt)

data.h2o$y <- as.factor(data.h2o$y)
Generalized Linear Model (GLM)
glm.h2o <- h2o.glm(x=seq(2,ncol(model.dt)), y=1, training_frame=data.h2o,

family='binomial', seed=12345,
nfolds=4, lambda=1E-3, max_active_predictors=5,
intercept=TRUE)
h2o.performance(glm.h2o, newdata=data.h2o)
## H2OBinomialMetrics: glm
##
## MSE: 0.1943385
## RMSE: 0.4408384
## LogLoss: 0.5674731
## Mean Per-Class Error: 0.2692308
## AUC: 0.7647059
## Gini: 0.5294118
## R^2: 0.2085764
## Residual Deviance: 68.09678
## AIC: 80.09678
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal
threshold:
## 0 1 Error Rate
## 0 17 17 0.500000 =17/34
## 1 1 25 0.038462 =1/26
## Totals 18 42 0.300000 =18/60
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.265312 0.735294 15
## 2 max f2 0.265312 0.856164 15
## 3 max f0point5 0.584681 0.655738 8
## 4 max accuracy 0.584681 0.700000 8
## 5 max precision 0.764473 0.875000 2
## 6 max recall 0.148481 1.000000 17
## 7 max specificity 0.870703 0.970588 0
## 8 max absolute_mcc 0.265312 0.499083 15
## 9 max min_per_class_accuracy 0.561815 0.653846 9
## 10 max mean_per_class_accuracy 0.265312 0.730769 15
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or
`h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
coef.glm <- glm.h2o@model$coefficients
coef.glm.dt <- data.table(code=names(coef.glm), regval=unlist(coef.glm))[
abs(regval)>1E-3 & code!='Intercept']
ggp.2(coef.glm.dt, list(nobs=5, cols=c('code','regval'),
xylab=c('Variable','Coefficient')))
Gradient Boosted Model (GBM)
gbm.h2o <- h2o.gbm(x=seq(2,ncol(model.dt)), y=1, training_frame=data.h2o,

seed=12345,
nfolds=4, ntrees=100, learn_rate=1E-2)
h2o.performance(gbm.h2o, newdata=data.h2o)
## H2OBinomialMetrics: gbm
##
## MSE: 0.1795813
## RMSE: 0.4237703
## LogLoss: 0.5452771
## AUC: 0.8891403
## Gini: 0.7782805
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 30 4 0.117647 =4/34
## 1 5 21 0.192308 =5/26
## Totals 35 25 0.150000 =9/60
##
## 1 max f1 0.470767 0.823529 23
## 2 max f2 0.352211 0.890411 38
## 3 max f0point5 0.470767 0.833333 23
## 4 max accuracy 0.470767 0.850000 23
## 5 max precision 0.666833 1.000000 0
## 6 max recall 0.352211 1.000000 38
## 7 max specificity 0.666833 1.000000 0
## 8 max absolute_mcc 0.470767 0.693585 23
##
varimp.gbm.dt <- data.table(gbm.h2o@model$variable_importances)[,
c('relative_importance','percentage'):=NULL]
ggp.2(varimp.gbm.dt, list(nobs=5, cols=c('variable','scaled_importance'),
xylab=c('Variable','Scaled Importance')))
Random Forest Model (RF)
rf.h2o <- h2o.randomForest(x=seq(2,ncol(model.dt)), y=1,

training_frame=data.h2o, seed=12345,
ntrees=150, max_depth=2)
#summary(rf.h2o)
h2o.performance(rf.h2o, newdata=data.h2o)
## H2OBinomialMetrics: drf
##
## MSE: 0.1891142
## RMSE: 0.4348727
## LogLoss: 0.5668165
## AUC: 0.9343891
## Gini: 0.8687783
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
## 0 1 Error Rate
## 0 31 3 0.088235 =3/34
## 1 4 22 0.153846 =4/26
## Totals 35 25 0.116667 =7/60
##
## 1 max f1 0.478210 0.862745 24
## 2 max f2 0.436659 0.905797 33
## 3 max f0point5 0.478210 0.873016 24
## 4 max accuracy 0.478210 0.883333 24
## 5 max precision 0.607206 1.000000 0
## 6 max recall 0.381417 1.000000 41
## 7 max specificity 0.607206 1.000000 0
## 8 max absolute_mcc 0.478210 0.761806 24
##
varimp.rf.dt <- data.table(rf.h2o@model$variable_importances)[,
c('relative_importance','percentage'):=NULL]
ggp.2(varimp.rf.dt, list(nobs=5, cols=c('variable','scaled_importance'),
xylab=c('Variable','Scaled Importance')))
Distribution Data
Figure 1: Distribution of Students with Visual Aids
Figure 2: Distribution of Students by Age Figure 3: Distribution of Students by Gender
Figure 4: Distribution of Students by Height Figure 5: Distribution of Students by Weight
Figure 6: Distribution of Students by Race
Figure 7/8: Distribution of Students Reading While Traveling in a Car/Watching a Screen While Traveling in a Car
Figure 9/10: Distribution of Students Reading While Lying Down/Watching a Screen While Lying Down
Figure 11/12: Distribution of Students Reading in Low Light/Watching a Screen in Low Light
Figure 13/14: Distribution of Students by Time Spent on a Screen for School Work/Time Spent Watching TV
Figure 15: Distribution of Students with Siblings Using Visual Aids Figure 16: Distribution of Students by Physical Activity Time
Figure 17: Distribution of Students by Snack Servings Figure 18: Distribution of Students by Protein Servings
Figure 19/20: Distribution of Students by Time Spent on a Cell Phone/Time Spent Reading Books
Figure 21/22: Distribution of Students by Time Spent Using Computer/Time Spent Outdoors
Figure 23: Distribution of Students by Form of Transportation Figure 24: Distribution of Students by Use of Braces
Figure 25: Distribution of Students by Vegetable Servings Figure 26: Distribution of Students by Fruit Servings
Figure 27: Distribution of Students by Source of Lunch
Statistical Analysis Data
Table 1: Relative Risk and Odds Ratios for Identified Variables
Codes                  Relative Risk   Odds Ratio
screenCar.Yes          2.548780        4.342105
height.GT54            1.777778        3.333333
timeOutdoors.LE1       2.072072        3.333333
siblingsGlasses.Yes    1.591837        2.380952
race.NonWhite          1.525641        2.138889
lunchNotHome           1.455882        2.042017
readLowLight.Yes       1.444444        1.888889
timeBook.LE1           1.317829        1.594203
screenLowLight.Yes     1.317829        1.594203
readCar.Yes            1.235294        1.470588
Figure 28: Odds Ratio Graph
Table 2: Pairwise Positive Correlations
Variable 1              Variable 2          Correlation
height.GT54             weight.GT130        0.4900980
timeScreenSchool.LE1    timeComputer.LE1    0.4559916
snack.GT2               timeBook.LE1        0.4116251
race.NonWhite           lunchNotHome        0.3853551
weight.GT130            braces.Yes          0.3767590
Table 3: Pairwise Negative Correlations
Variable 1       Variable 2           Correlation
timeBook.LE1     readLyingDown.Yes    -0.3791612
weight.GT130     timeBook.LE1         -0.3573595
race.NonWhite    protein.GT2          -0.3435780
protein.GT2      lunchNotHome         -0.3204059
gender.Female    height.GT54          -0.3195196
Figure 29: Pairwise Correlations Heat Map
Model Output
Figure 30: Generalized Linear Model Regression Coefficients
Figure 31: GLM ROC Curve
Figure 32: Decision Tree
Figure 34: Gradient Boosted Machine Scaled Importance Values
Figure 33: Random Forest Scaled Importance Values
Figure 35: GBM ROC Curve
Discussions
Analysis
Figure 1 displays the percent of students with visual aids: glasses, contacts, both, or none. Most
of the students (~57%) do not use glasses or contacts and ~43% of the students do. This gives a
good size sample for each group, rather than having many students in one group and barely any
students in the other.

For the rest of the explanatory variables (Figures 2 – 27), the summary statistics were
used to convert them to binary variables. While half of the features were already binary (Figures
3, 7, 8, 9, 10, 12, 14, 19, 20, 21, 22, 24), the remaining variables needed to be converted. The
binary variables were defined based on the number of subjects in each group. For example, in
Figure 6, since ~57% of the students are Caucasian, the two variables created for the “race”
feature were race.White and race.NonWhite. If the variables had been created with the “other”
category instead of Caucasian (race.Other and race.NonOther), there would only be 6 subjects
(10%) in the race.Other category. If race.Other had been chosen by the model as one of the
predictive variables, because of the low number of subjects, the coefficient would not accurately
depict the predictive power.
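The group-size check described above can be sketched in R. Only the White share (~57%, taken here as 34 of 60) and the Other count (6) follow the text; the split among the remaining categories, and the category names used for them, are invented for illustration.

```r
# Hypothetical race counts for 60 respondents. Only White (34) and Other (6)
# follow the text; the remaining split and labels are assumptions.
race.counts <- c(White = 34, Asian = 10, Hispanic = 6,
                 AfricanAmerican = 4, Other = 6)
race <- rep(names(race.counts), times = race.counts)

# Two candidate binary features built from the same question.
race.NonWhite <- as.integer(race != "White")
race.Other    <- as.integer(race == "Other")

# Size of the smaller group under each definition.
min.group <- c(race.NonWhite = min(table(race.NonWhite)),
               race.Other    = min(table(race.Other)))
print(min.group)
```

Under these assumed counts the White/NonWhite split leaves 26 subjects in its smaller group, while the Other/NonOther split leaves only 6, which is why the first definition gives a coefficient more data to rest on.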
From the summary statistics, the following binary variables and their complements were
created: gender.Female, age.GT12, race.NonWhite, height.GT54, weight.GT130,
vegetable.GT1, fruit.GT2, snack.GT2, protein.GT2, lunchNotHome, timeScreenSchool.LE1,
timeTelevision.LE1, timeCellPhone.LE1, timeComputer.LE1, timeBook.LE1,
timeOutdoors.LE1, readCar.Yes, readLowLight.Yes, readLyingDown.Yes, screenCar.Yes,
screenLowLight.Yes, screenLyingDown.Yes, braces.Yes, siblingsGlasses.Yes,
physicalActive.GT1hr.
Table 1 and Figure 28 show the top ten most predictive variables based on relative risk
and odds ratio. If the relative risk/odds ratio is higher, it is more likely that the variable has more
predictive power. Similarly, if the odds ratio is greater than one, it is more likely that the
individual has a visual aid if that feature is present. The ordering of variables based on relative
risk almost corresponds to that of odds ratios, showing that these measures are good indicators of
the predictive power. Also, the order of features by univariate odds ratios was similar to that of
other algorithms, so if the odds ratio is higher, the feature is likely to have predictive power in
other models.
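As a worked illustration of the two measures, the R sketch below computes them for screenCar.Yes from a 2x2 table. The four counts are an assumption: they are not reported in the paper but were back-calculated so that they reproduce the Table 1 values with 26 vision-aid users and 34 non-users.

```r
# 2x2 counts for screenCar.Yes. ASSUMPTION: back-calculated from Table 1,
# not reported in the source.
eve.con   <- 22  # has a vision aid, feature present
eve.nocon <- 4   # has a vision aid, feature absent
nev.con   <- 19  # no vision aid, feature present
nev.nocon <- 15  # no vision aid, feature absent

con   <- eve.con + nev.con       # all subjects with the feature
nocon <- eve.nocon + nev.nocon   # all subjects without the feature

# Relative risk: share with a vision aid among those with the feature,
# divided by the share among those without it.
rel.risk <- (eve.con / con) / (eve.nocon / nocon)

# Odds ratio: odds of having the feature among vision-aid users,
# divided by the odds among non-users.
odds.ratio <- (eve.con / eve.nocon) / (nev.con / nev.nocon)

print(c(rel.risk = rel.risk, odds.ratio = odds.ratio))
```

With these counts the two values agree with the first row of Table 1 (about 2.549 and 4.342). The odds ratio is always further from one than the relative risk, which is why the two columns rank the variables similarly but not identically.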
Table 2 shows the top five pairs of positively correlated variables. These coefficients
were computed using the pairwise correlation formula. Because of the low sample size, the
coefficients are not very strong (as seen in Figure 29), but if the model picks one of these
variables, the correlations convey that the other variable could be used as well. For example, as
seen in Figure 30, height.GT54 was identified by the GLM model as having predictive power.
The positive correlation shows that weight.GT130 could also have some predictive power. The
same can be said for Figure 32 about the weight feature in correlation to height. Similarly,
Table 3 shows the top five pairs of negatively correlated variables. protein.GT2 and
lunchNotHome were negatively correlated, which is interesting because it may show that school
lunches are not as nutritious. Other conclusions can be drawn from these correlations, and with a
greater sample size, the coefficients could be estimated more reliably.
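For two binary (0/1) features, the pairwise Pearson correlation reduces to the phi coefficient of their 2x2 table. The R sketch below shows that equivalence on two invented vectors; `x` and `y` are hypothetical and are not survey columns.

```r
# Two invented 0/1 features for eight hypothetical subjects.
x <- c(1, 1, 1, 0, 0, 0, 1, 0)
y <- c(1, 1, 0, 0, 0, 1, 1, 0)

# Pearson correlation, as returned by cor().
r <- cor(x, y)

# Phi coefficient computed directly from the 2x2 counts.
n11 <- sum(x == 1 & y == 1); n10 <- sum(x == 1 & y == 0)
n01 <- sum(x == 0 & y == 1); n00 <- sum(x == 0 & y == 0)
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

print(c(pearson = r, phi = phi))
```

Both values are 0.5 here, so a coefficient such as the 0.49 between height.GT54 and weight.GT130 can be read as a statement about how often the two flags agree.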
Figure 30 displays the top five most predictive variables based on the regression
coefficients computed by the Generalized Linear Model. All the variables have positive
coefficients, meaning that if the feature is present, it is more likely that the subject has or will
need a vision aid. The intercept indicates the baseline log-odds that the subject will require a vision aid
if none of the features are present. Since this value is negative, it shows that the subject is less likely
than not to need glasses/contacts if they are not in any of the categories listed. The most
predictive variable found, with an absolute value of ~1.373, is the use of a screen in a car,
followed by if race is non-white (~0.835), time spent outdoors is less than one hour (~0.823),
height is greater than 5.4 feet (~0.729), and the subject’s sibling needs glasses (~0.728). In these
cases, if the feature is present, the subject is more likely to need a vision aid. The coefficient
indicates the predictive power of the variable, so the time spent outdoors is more predictive than
height. Figure 31 displays the ROC curve for the Generalized Linear Model. The model has an
AUC of ~0.765, where 0.5 corresponds to random guessing and 1.0 to perfect classification. This
is fairly accurate given the number of observations (60). The ROC curve
plots the sensitivity against 1 minus the specificity. Specificity (TNR or True Negative Rate) is
equal to True Negatives/All Negatives on a confusion matrix. Sensitivity (TPR or True Positive
Rate) is equal to True Positives/All Positives on a confusion matrix.
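The quantities in this paragraph can be illustrated with a short sketch. It assumes that the GLM is a logistic regression on 0/1 features, so that the exponential of a coefficient is an odds ratio, and it computes the AUC with the standard pairwise-ranking formula. The function names, the labels, and the scores are hypothetical; only the five coefficients are taken from the text above.

```python
from math import exp

def odds_ratio(coefficient):
    """Odds ratio implied by one logistic-regression coefficient."""
    return exp(coefficient)

def roc_point(tp, fn, tn, fp):
    """Return (1 - specificity, sensitivity) for one confusion matrix."""
    sensitivity = tp / (tp + fn)  # True Positives / All Positives
    specificity = tn / (tn + fp)  # True Negatives / All Negatives
    return 1 - specificity, sensitivity

def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Coefficients as reported for Figure 30.
glm_coefficients = {
    "screenCar.Yes": 1.373,
    "race.NonWhite": 0.835,
    "timeOutdoors.LE1": 0.823,
    "height.GT54": 0.729,
    "siblingsGlasses.Yes": 0.728,
}
for name, beta in glm_coefficients.items():
    print(f"{name}: odds ratio = {odds_ratio(beta):.2f}")

# Made-up labels (1 = has a vision aid) and model scores, not the survey data.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
print("toy AUC:", round(auc(toy_labels, toy_scores), 3))

# Confusion-matrix counts for the toy data at a score threshold of 0.5.
print("ROC point:", roc_point(tp=2, fn=1, tn=2, fp=1))
```

Each threshold on the model score gives one confusion matrix and therefore one point on the ROC curve; sweeping the threshold from high to low traces the whole curve.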
In Figure 32, if the leaf is green, then most individuals in the category have vision
aids. If the leaf is blue, most individuals in the category do not have vision aids. The darkness
of the color corresponds to the percentage of individuals in the category that display the response
variable indicated by the leaf. The importance of a feature corresponds to its placement on the tree.
In other words, the features on the upper branches have more predictive power than features
positioned on the lower branches. The features returned were use of a screen in a car, whether or
not the subject’s siblings had glasses, time spent outdoors, reading while lying down, reading in
low light, and weight greater than 130 pounds. The weight feature is correlated with the
height feature, as seen in Figure 29, so when looking at the important variables, they can be
paired.
The top five scaled importance values of the features returned by the Distributed Random
Forest Model are shown in Figure 33. They are use of a screen in a car, time spent outdoors, height
greater than 5.4 feet, whether or not the subject’s siblings have glasses, and whether the subject’s
race is white or non-white. Scaled importance sets the most predictive feature to one and scales
the other features down accordingly. Scaled importance measures the strength of a feature rather
than the direction of its effect, but the positive GLM coefficients for these same five features
suggest that if the feature is present in an individual, they will likely need a vision aid. DRF
returned an AUC of ~0.934, which is the highest across all of the models.
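The scaling step described above can be sketched in a few lines. This is only an illustration of dividing every importance by the largest one; the function name and the raw importance numbers are hypothetical and are not the values behind Figure 33.

```python
def scale_importance(raw):
    """Divide each raw importance by the largest one so the top feature gets 1.0."""
    top = max(raw.values())
    if top == 0:
        raise ValueError("all importances are zero, so nothing can be scaled")
    return {name: value / top for name, value in raw.items()}

# Hypothetical raw importances (not values from Figure 33).
raw = {"screenCar.Yes": 40.0, "timeOutdoors.LE1": 30.0, "height.GT54": 10.0}
scaled = scale_importance(raw)
for name, value in sorted(scaled.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.2f}")
```

Because raw importances are never negative, the scaled values always fall between zero and one, which is why they rank features without showing whether a feature raises or lowers the risk.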
Figure 34 shows the scaled importance values for the features returned by the Gradient
Boosted Machine. The features returned are use of a screen in a car, whether or not the subject’s
siblings have glasses, time spent outdoors, height greater than 5.4 feet, and reading in low light.
Scaled importance measures the strength of a feature rather than the direction of its effect. In
this data, if the feature is present in an individual,
they will likely need a vision aid. It is interesting to note that the first two features’ scaled
importance values are almost double that of the third. The Gradient Boosted Machine ROC
Curve is shown in Figure 35. The model returned an AUC of ~0.889, which is fairly accurate.
Conclusions
Feature                     GLM   Decision Tree   Random Forest   GBM
screenCar.Yes                1          1               1          1
timeOutdoors.LE1             3          2               2          3
siblingsGlasses.Yes          5          3               4          2
height.GT54/weight.GT130     4          5               3          4
race.NonWhite                2          -               5          -
readLowLight.Yes             -          4               -          5
readLyingDown.Yes            -          3               -          -
Table 4: Important Features Across Algorithms
The most predictive features across algorithms are shown in Table 4. The use of a
screen in a car ranked first in all algorithms tested. It was followed by time spent
outdoors and whether or not the subject’s siblings have glasses. Since height and weight are
correlated, they are paired together. When paired, the feature came up as predictive in all
algorithms tested. Race and reading habits came up as predictive in some algorithms, so they
could show more predictive ability in a larger dataset.
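The comparison across algorithms can be sketched as a small calculation. The ranks below are transcribed from Table 4, with the blank cells placed according to the feature lists given for each model in the text; counting the algorithms and averaging the ranks is an illustrative choice and is not a step taken from the original program.

```python
ALGORITHMS = ("GLM", "Decision Tree", "Random Forest", "GBM")

# Ranks transcribed from Table 4; None marks a feature an algorithm did not return.
RANKS = {
    "screenCar.Yes":            (1, 1, 1, 1),
    "timeOutdoors.LE1":         (3, 2, 2, 3),
    "siblingsGlasses.Yes":      (5, 3, 4, 2),
    "height.GT54/weight.GT130": (4, 5, 3, 4),
    "race.NonWhite":            (2, None, 5, None),
    "readLowLight.Yes":         (None, 4, None, 5),
    "readLyingDown.Yes":        (None, 3, None, None),
}

def summarize(ranks):
    """Return (feature, algorithms returning it, mean rank), best supported first."""
    rows = []
    for feature, values in ranks.items():
        present = [v for v in values if v is not None]
        rows.append((feature, len(present), sum(present) / len(present)))
    # Features returned by more algorithms come first; ties go to the lower mean rank.
    return sorted(rows, key=lambda row: (-row[1], row[2]))

for feature, count, mean_rank in summarize(RANKS):
    print(f"{feature}: {count} of {len(ALGORITHMS)} algorithms, mean rank {mean_rank:.2f}")
```

Sorting this way reproduces the order discussed above: the screen-in-car feature first, then time outdoors and siblings with glasses, then the paired height and weight feature.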
The GLM, DRF, and GBM models returned AUC values of ~0.765, ~0.934, and ~0.889,
respectively, so they can be used as an initial filter before a formal eye exam.
This is especially important in middle and high schools where yearly eye exams are not
conducted. The original hypothesis was partially supported. While time outdoors did
have some predictive power, it was not the most predictive feature.
Future Work
In the future, there are several things that could be done to improve model accuracy and
efficiency. For example, more survey responses could be obtained, which may raise the AUC. By
increasing the number of subjects, other statistics like confidence intervals could also be
computed. Increasing sample and feature variation could also help the model generalize and
reduce memorization. The program could be made more efficient for wide-scale adoption. A
mobile application could be developed to make this happen.
IV. ACKNOWLEDGEMENTS
There are several people that helped make this investigation possible:
The survey participants and their parents/guardians – this investigation would not have
happened without you. Thank you so much!
Ms. Kindra Smith and Ms. Lynnette Lindesay – thank you so much for your cooperation and
help with the survey administration.
Dr. Varsha Sonawane – thank you for being a great mentor through the entire process.
Dr. Ratidzo Macharaga – thank you for taking time out of your busy day to talk to me and
provide your professional viewpoint.
My family – this experiment would not be where it is without your support. Thank you!
V. REFERENCES
[1] - Vitale, Susan. “Increased Prevalence of Myopia in the United States
Between 1971-1972 and 1999-2004.” Archives of Ophthalmology, American Medical
Association, 14 Dec. 2009, https://jamanetwork.com/journals/jamaophthalmology/fullarticle/424548
[2] - Holden, Brian A. “Global Prevalence of Myopia and High Myopia and Temporal
Trends from 2000 through 2050.” AAO Journal,
http://www.aaojournal.org/article/S0161-6420(16)00025-7/abstract
[3] - “The Myopia Boom.” Nature News, Nature Publishing Group,
https://www.nature.com/news/the-myopia-boom-1.17120
[4] - “Pathological Myopia.” The Low Vision Centers of Indiana,
http://www.eyeassociates.com/pathological-myopia/
[5] - The Myopia Myth, www.myopia.org/ebook/10chapter5.htm.
[6] - Holden, B, et al. “Myopia, an Underrated Global Challenge to Vision: Where the
Current Data Takes Us on Myopia Control.” Eye, Nature Publishing Group, Feb. 2014,
www.ncbi.nlm.nih.gov/pmc/articles/PMC3930268/.
[7] - Crew, Bec. “Watch: The Nearsighted Epidemic Is Real.” ScienceAlert,
www.sciencealert.com/watch-the-nearsightedness-epidemic-is-real.
[8] - “Facts About Refractive Errors.” National Eye Institute, U.S. Department of Health
and Human Services, 1 Oct. 2010, nei.nih.gov/health/errors/errors.
[9] - Cassin, Barbara, and Melvin L. Rubin. Dictionary of Eye Terminology. Triad Pub. Co.,
2004.
[10] - Saw, S M, et al. “Myopia: Attempts to Arrest Progression.” British Journal of
Ophthalmology, BMJ Publishing Group Ltd, 1 Nov. 2002, bjo.bmj.com/content/86/11/1306.
[11] - Seiler, Theo. “Excimer Laser Keratectomy for Correction of Astigmatism.” American
Journal of Ophthalmology, www.ajo.com/article/0002-9394(88)90173-0/pdf.
[12] - “Laser in Situ Keratomileusis to Treat Myopia: Early Experience.” Journal of
Cataract and Refractive Surgery, www.jcrsjournal.org/article/S0886-3350(97)80149-6/pdf.
[13] - Moshirfar, Majid, et al. “Incidence Rate and Occurrence of Visually Significant
Cataract Formation and Corneal Decompensation after Implantation of Verisyse/Artisan Phakic
Intraocular Lens.” Clinical Ophthalmology (Auckland, N.Z.), Dove Medical Press, 2014,
www.ncbi.nlm.nih.gov/pmc/articles/PMC3986296/.
[14] - Barrett, Brendan T. “A Critical Evaluation of the Evidence Supporting the Practice of
Behavioural Vision Therapy.” Ophthalmic and Physiological Optics, Blackwell Publishing Ltd,
22 Dec. 2008, onlinelibrary.wiley.com/doi/10.1111/j.1475-1313.2008.00607.x/abstract.
[15] - Randle, Robert J. “Responses of Myopes to Volitional Control Training of
Accommodation.” Ophthalmic and Physiological Optics, Blackwell Publishing Ltd, 19 Dec.
2007, onlinelibrary.wiley.com/doi/10.1111/j.1475-1313.1988.tb01063.x/abstract.
[16] - “Home - PMC - NCBI.” National Center for Biotechnology Information, U.S.
National Library of Medicine, www.ncbi.nlm.nih.gov/pmc/.
[17] - Kleinstein, Robert N. “Refractive Error and Ethnicity in
Children.” Archives of Ophthalmology, American Medical Association, 1 Aug. 2003,
jamanetwork.com/journals/jamaophthalmology/fullarticle/415584.
[18] - “Hyperopia (Farsightedness).” American Optometric Association,
www.aoa.org/patients-and-public/eye-and-vision-problems/glossary-of-eye-and-vision-
conditions/hyperopia?sso=y.
[19] - “Home - PMC - NCBI.” National Center for Biotechnology Information, U.S.
National Library of Medicine, www.ncbi.nlm.nih.gov/pmc/.
[20] - Focusing Problems - College of Optometrists in Vision Development (COVD),
www.covd.org/?page=Focusing.
[21] - Corleone, Jill. “Can Foods Make You Grow Taller?” LIVESTRONG.COM,
Leaf Group, 18 July 2017, www.livestrong.com/article/215191-what-foods-make-you-grow-
taller/.
VI. APPENDIX