
An AI-Based Predictive Model for Vision Disorders in Adolescents

Divya Nori

9th Grade

Milton High School

2017 – 2018
ABSTRACT
Over the past 50 years, the number of young people diagnosed with near-sightedness or

myopia in the U.S. has doubled. If this trend continues, by 2050, half of the world’s population,

around 5 billion people, will be myopic. This rapid increase in the number of diagnosed cases is

alarming because myopia can serve as a precursor to many other vision problems, even leading

to blindness in extreme situations. Until 5 years ago, scientists thought that almost all cases of

pathological myopia were caused by a genetic predisposition. Recent research shows that while

30% of these cases have occurred due to genetics or during early childhood years, 60% of these

cases have progressed from a treatable form of myopia, starting in adolescence. By using a

mathematical model to predict the likelihood of myopia, time and money can be saved.

Additionally, the prevalence of myopia and other related vision disorders can be reduced. This

investigation created a predictive model using machine learning to determine risk factors of
An ML-based Predictive Model for Vision Disorders in Adolescents 2
myopia using data collected through a survey. The survey was distributed to 60 adolescents, and

responses were recorded. A database was created with the data, and the data was then analyzed

by a computer program. The program was written in R. The time spent on a screen in the car,

the time spent outdoors, the race of the subject, the height of the subject, and whether the

subject’s sibling had glasses were identified as the strongest predictors of dependence on a

vision aid.

TABLE OF CONTENTS
ABSTRACT................................................................................................................................................2
LIST OF GRAPHS......................................................................................................................................4
LIST OF TABLES......................................................................................................................................5
I. INTRODUCTION TO INVESTIGATION.........................................................................................6
Background Research..............................................................................................................................6
Important Concepts..............................................................................................................................6
Overview of Myopia..........................................................................................................................10
Overview of Hyperopia.....................................................................................................................12
Recent Studies and Probable Causes..................................................................................................12
Diagnosis, Prevention, and Treatment...............................................................................................16
Epidemiology and Statistics...............................................................................................................18
Interview with Optometrist....................................................................................................................19
Statement of the Problem.......................................................................................................................20
Research Question.................................................................................................................................21
Hypothesis.............................................................................................................................................21
II. PROCEDURE OF INVESTIGATION..............................................................................................21
Variables................................................................................................................................................21
Procedure...............................................................................................................................................21
Vision Survey....................................................................................................................................23
III. RESULTS..................................................................................................................................25
Program.................................................................................................................................................25
Distribution Data...................................................................................................................................37
Statistical Analysis Data........................................................................................................................42
Model Output........................................................................................................................................45
Discussions............................................................................................................................................48
Analysis.............................................................................................................................................48
Conclusions.......................................................................................................................................52
Future Work......................................................................................................................................53
IV. ACKNOWLEDGEMENTS...........................................................................................................53
V. REFERENCES..................................................................................................................................54
VI. APPENDIX...................................................................................................................................57

LIST OF GRAPHS
Figure 1: Distribution of Students with Visual Aids……….…………….……………………...………37
Figure 2: Distribution of Students by Age……………...………………………..…………….…..……37
Figure 3: Distribution of Students by Gender………………….…………...…………………...………37
Figure 4: Distribution of Students by Height.……………………………...……………….…...………37
Figure 5: Distribution of Students by Weight.………………...…………...…..………………..………37
Figure 6: Distribution of Students by Race.………………………...…...………………….…..….……38
Figure 7: Distribution of Students Reading While Traveling in a Car…………...……….…...……..…38
Figure 8: Distribution of Students Watching a Screen While Traveling in a Car….….................…..…38
Figure 9: Distribution of Students Reading While Lying Down…….….……………...……...…..……38
Figure 10: Distribution of Students Watching a Screen While Lying Down…………..………...………38
Figure 11: Distribution of Students Reading in Low Light………………………..…………......………39
Figure 12: Distribution of Students Watching a Screen in Low Light……………..…...……......………39
Figure 13: Distribution of Students by Time Spent on a Screen for School Work….....................………39
Figure 14: Distribution of Students by Time Spent Watching TV….............................................………39
Figure 15: Distribution of Students with Siblings Using Visual Aids…........................................………39

Figure 16: Distribution of Students by Physical Activity Time….....………….............................………39


Figure 17: Distribution of Students by Snack Servings….....….....................................................………40
Figure 18: Distribution of Students by Protein Servings…...….....................................................………40
Figure 19: Distribution of Students by Time Spent on a Cell Phone…...………….......................………40
Figure 20: Distribution of Students by Time Spent Reading Books…...…....................................………40
Figure 21: Distribution of Students by Time Spent Using a Computer…...……………...............………40
Figure 22: Distribution of Students by Time Spent Outdoors…………………………..………………..40
Figure 23: Distribution of Students by Form of Transportation…...…..........................................………41
Figure 24: Distribution of Students by Use of Braces……………..………….....….....................………41
Figure 25: Distribution of Students by Vegetable Servings……….…………….….....................………41
Figure 26: Distribution of Students by Fruit Servings……………………………...….................………41
Figure 27: Distribution of Students by Source of Lunch……………………..…………………………..41
Figure 28: Odds Ratio Graph……………..………………………………………………………………43
Figure 29: Pairwise Correlations Heat Map…..…………………………………………………………..44
Figure 30: Generalized Linear Model Regression Coefficients………………………………………..…45
Figure 31: GLM ROC Curve……………………………………………………………………………..45

Figure 32: Decision Tree………………………………………………………………………………….46
Figure 33: Random Forest Scaled Importance Values……………………………………………………46
Figure 34: Gradient Boosted Machine Scaled Importance Values…………………………………….…47
Figure 35: GBM ROC Curve……………………………………………………………………………..47

LIST OF TABLES
Table 1: Relative Risk and Odds Ratios for Identified Variables……….………...…...………42
Table 2: Positive Pairwise Correlations……………………………….....…………...…..……43
Table 3: Negative Pairwise Correlations………………..…………….....…………...…..……44
Table 4: Important Features Across Algorithms….……………………………………………52

I. INTRODUCTION TO INVESTIGATION

Over the past 50 years, the number of young people diagnosed with near-sightedness or

myopia in the U.S. has doubled, making nearly 40-50% of the population of young people

nearsighted [1]. If this trend continues, by 2050, half of the world’s population, around 5 billion

people, will be myopic [2]. This rapid increase in the number of diagnosed cases is alarming

because myopia can serve as a precursor to many other vision problems, even leading to

blindness in extreme situations. Pathological myopia, or degenerative myopia, is a form of

extreme myopia that frequently results in cataracts and glaucoma and occurs in 20% of people

diagnosed with myopia. Until 5 years ago, scientists thought that almost all cases of pathological

myopia were caused by a genetic predisposition [3]. Recent research shows that while 30% of

these cases have occurred due to genetics or during early childhood years, 60% of these cases
have progressed from a treatable form of myopia, starting in adolescence [4]. To detect the early

onset of pathological myopia, and other serious vision problems resulting from myopia,

especially during the current "myopic epidemic", a more efficient and cost-effective solution to

identify people at risk needs to be developed.

Background Research
Important Concepts
What Is Machine Learning?

There are two major types of programming: traditional or rule-based programming and

machine learning. In traditional programming, the computer scientist inputs data and writes a

program to get an output. In machine learning, the data and the output are given to the computer,

and the computer produces a model. This new way of programming is useful in several ways. It

is extremely difficult to write rule-based programs in some scenarios. For example, computing

the probability that a credit card transaction is fraudulent would require several if-then

statements. Also, since fraud patterns keep changing, the target is always moving, and the

program would need constant updating. Another difficult scenario is recognizing a 3D

object in a cluttered scene. If a rule-based program could be written, it would be complicated

and hard to follow or replicate. Machine Learning is used in these types of scenarios to achieve

a more efficient result.

Machine Learning is part of a broader area of study: Artificial Intelligence. Artificial

Intelligence is the study of how we can make machines able to understand the world, make

predictions, choose actions, or, more broadly, exercise judgment. Machine Learning is a

subset of this field and enables machines to improve at tasks by learning from many examples.

Instead of writing a program for each specific task, lots of examples are collected that specify the

correct output for a given input. The Machine Learning algorithm takes these examples and
produces a program. This allows the program to work for new examples, not just for the

examples the model was trained with. If the model needs to be updated, it can be retrained with

new data, and new rules do not need to be written.

Other Concepts

An odds ratio is a measure of association between exposure and outcome. In other

words, it is the odds that an outcome will occur given an exposure, divided by the odds that it will occur without the exposure. This measure is typically

used in case-control studies, like this investigation. The natural log (ln) of the odds ratio is equal

to the logistic regression coefficient produced by the machine learning model.

The relative risk or risk ratio is the ratio of the probability of an event occurring in an

exposed group to the probability of the event occurring in the non-exposed group. This measure

of association is similar to odds ratio.
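
The two measures can be made concrete with a 2x2 table of counts. The following is a minimal sketch in Python (the investigation's own program was written in R); the function names and the counts are hypothetical and are not taken from the survey data.

```python
import math


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: exposed, with outcome      b: exposed, without outcome
    c: unexposed, with outcome    d: unexposed, without outcome
    """
    return (a / b) / (c / d)


def relative_risk(a, b, c, d):
    """Risk ratio: P(outcome | exposed) / P(outcome | unexposed)."""
    return (a / (a + b)) / (c / (c + d))


# Hypothetical counts, not taken from the survey data.
a, b, c, d = 20, 10, 10, 20
print(odds_ratio(a, b, c, d))            # odds of 2.0 vs 0.5
print(relative_risk(a, b, c, d))         # risk of 2/3 vs 1/3
# For a single yes/no exposure, ln(odds ratio) is the logistic regression coefficient.
print(math.log(odds_ratio(a, b, c, d)))
```

With these counts the odds ratio is larger than the relative risk, which illustrates that the two measures are similar in meaning but not interchangeable when the outcome is common.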

Regression is a set of statistical processes used to determine the relationship among

variables. There are several types of regression, but the two used in this investigation are logistic

regression and regularized regression. Logistic regression is used when the dependent variable is

binary. Its coefficients are fit by Maximum Likelihood Estimation (MLE), so the objective function is

the likelihood of the observed data. Regularized regression (used in Generalized Linear Models) introduces a penalty to

the logistic regression objective function to prevent overfitting. There are two types of

regularized regression: the LASSO model and Ridge Regression. The LASSO model introduces

a penalty proportional to the sum of the absolute values of the coefficients computed by the model. Ridge

regression introduces a penalty proportional to the sum of the squares of the coefficients.

This investigation uses the LASSO model because it drives the coefficients of non-predictive variables

exactly to zero, whereas the ridge regression model only shrinks the coefficients toward zero. This allows

model generalization, not memorization.
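
The three objectives can be written side by side. The notation here is an assumption made for illustration and does not come from the report: \beta_0 is the intercept, \beta_1, ..., \beta_p are the coefficients, \lambda \ge 0 is the penalty strength, and \ell(\beta) is the log-likelihood of the n observations.

```latex
\begin{aligned}
p_i &= \frac{1}{1 + e^{-(\beta_0 + x_i^{\top}\beta)}}, \qquad
\ell(\beta) = \sum_{i=1}^{n} \bigl[\, y_i \log p_i + (1 - y_i) \log (1 - p_i) \,\bigr] \\
\text{Logistic regression:}\quad & \hat{\beta} = \arg\min_{\beta}\; -\ell(\beta) \\
\text{LASSO:}\quad & \hat{\beta} = \arg\min_{\beta}\; -\ell(\beta) + \lambda \sum_{j=1}^{p} |\beta_j| \\
\text{Ridge:}\quad & \hat{\beta} = \arg\min_{\beta}\; -\ell(\beta) + \lambda \sum_{j=1}^{p} \beta_j^{2}
\end{aligned}
```

Setting \lambda = 0 recovers ordinary logistic regression in both penalized forms; larger values of \lambda penalize large coefficients more heavily.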

Decision Tree Learning is a Machine Learning algorithm which uses a decision tree as a

predictive model to go from observations about an item to a conclusion. It is commonly used in

supervised learning classification problems. In a decision tree, the observations are represented

in the branches. If the feature is present, the left branch is followed, and if the feature is not

present, the right branch is followed. The target value is represented in the leaves with the top

number (0 or 1) representing whether most individuals in the category have vision aids or not,

the middle number representing the proportion (as a decimal) of individuals in the category with

vision aids, and the bottom number representing the individuals in the category as a percentage of the

total population. There are two types of decision trees, classification and regression trees,

referred to collectively as CART. A classification tree, which is used in this experiment, deals

with discrete target values. Regression trees have continuous target values that can be

considered real numbers. Decision trees are simple to understand and interpret. They also

require less data preparation than other algorithms.
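
The three numbers shown in each leaf can be computed directly from the labels of the individuals that reach it. This is a minimal Python sketch, not the investigation's R program; the function name, the tie-breaking rule at exactly 0.5, and the example leaf are assumptions.

```python
def leaf_summary(leaf_labels, total_count):
    """Return the three numbers shown in a leaf of the tree plot.

    leaf_labels: 0/1 labels (1 = uses a vision aid) of the individuals in the leaf
    total_count: number of individuals in the whole dataset
    """
    fraction_with_aid = sum(leaf_labels) / len(leaf_labels)
    # Assumption: a tie at exactly 0.5 is reported as class 1.
    majority_class = 1 if fraction_with_aid >= 0.5 else 0
    percent_of_total = 100 * len(leaf_labels) / total_count
    return majority_class, fraction_with_aid, percent_of_total


# Hypothetical leaf: 12 individuals, 9 with vision aids, out of 60 surveyed.
labels = [1] * 9 + [0] * 3
print(leaf_summary(labels, 60))  # (1, 0.75, 20.0)
```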

Distributed Random Forest is an ensemble learning method used in classification

problems. This algorithm constructs several fully-grown decision trees in parallel. Random

forest uses decision trees with low bias and high variance. This is because of the bias-variance

tradeoff, which is the problem of simultaneously reducing bias and variance. Bias is the error

from assumptions in an algorithm. High bias causes an algorithm to miss relevant relationships between the features and the target, which is

known as underfitting. Variance is the error from sensitivity to small fluctuations in the training

set. High variance causes the model to fit random noise in the training set, memorize the data, and
overfit. Random forest uses bootstrap aggregation or bagging to combine the individual decision

trees. Bagging is a machine learning ensemble algorithm used to improve accuracy of machine

learning algorithms. It is a special type of model averaging technique that is usually applied to

decision trees. Random forest helps prevent overfitting on the training dataset, which is common

with decision trees and GBM. It is a highly accurate classifier and runs efficiently on larger

datasets.
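
Bagging can be sketched in a few lines. The Python example below is a simplified illustration, not the Distributed Random Forest used in the investigation: it uses one-split trees (stumps) on a single feature instead of fully grown trees, it omits the random feature selection that a random forest adds, and the data are hypothetical.

```python
import random
from collections import Counter


def fit_stump(rows):
    """Fit a one-split tree on (x, y) pairs: pick the threshold with fewest errors."""
    best = None
    for threshold in sorted({x for x, _ in rows}):
        for left_label in (0, 1):
            right_label = 1 - left_label
            errors = sum(
                (left_label if x <= threshold else right_label) != y for x, y in rows
            )
            if best is None or errors < best[0]:
                best = (errors, threshold, left_label)
    _, threshold, left_label = best
    return lambda x: left_label if x <= threshold else 1 - left_label


def bagged_predict(rows, x_new, n_trees=25, seed=0):
    """Train each stump on a bootstrap sample, then take a majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        votes.append(fit_stump(sample)(x_new))
    return Counter(votes).most_common(1)[0][0]


# Hypothetical data: x = hours outdoors per week, y = 1 if the subject uses a vision aid.
data = [(1, 1), (2, 1), (3, 1), (4, 1), (9, 0), (10, 0), (12, 0), (14, 0)]
print(bagged_predict(data, 2))
print(bagged_predict(data, 13))
```

Because every tree sees a slightly different bootstrap sample, individual trees can disagree; averaging their votes is what reduces the variance of the combined prediction.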

Gradient Boosted Machines use gradient boosting, which is a machine learning technique

used in regression and classification problems. There are three elements involved in GBMs: a

loss function, a weak learner, and an additive model. A loss (objective) function needs to

be optimized or minimized. Different loss functions are used for different problems. The weak

learners used to make predictions have high bias and low variance, the converse of Random Forest.

The weak learners are usually CART trees. The additive model produces an ensemble model of

weaker prediction models by adding them one at a time, rather than in parallel like Random

Forest. Different objective loss functions can be used with the same algorithm, which makes

Gradient Boosting an appealing approach.
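
The three elements can be seen in a small sketch. The Python code below is an illustration under stated assumptions, not the GBM implementation used in the investigation: the loss is squared error (chosen because its negative gradient is simply the residual), the weak learner is a one-split regression tree, and the data, the learning rate, and the number of rounds are hypothetical.

```python
def fit_regression_stump(xs, residuals):
    """Find the split on x that best fits the residuals with two constant values."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # keep both sides of the split non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean, right_mean = sum(left) / len(left), sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left) + sum(
            (r - right_mean) ** 2 for r in right
        )
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Additive model: start from the mean, then add one stump per round."""
    base = sum(ys) / len(ys)
    stumps = []

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in stumps)

    for _ in range(n_rounds):
        # For squared-error loss, the negative gradient is the residual.
        residuals = [y - predict(x) for x, y in zip(xs, ys)]
        stumps.append(fit_regression_stump(xs, residuals))
    return predict


# Hypothetical data: x = hours outdoors per week, y = 1 if the subject uses a vision aid.
hours = [1, 2, 3, 4, 9, 10, 12, 14]
uses_aid = [1, 1, 1, 1, 0, 0, 0, 0]
model = gradient_boost(hours, uses_aid)
print(round(model(2), 3), round(model(13), 3))
```

Each stump on its own is a poor predictor, but because every new stump is fit to what the current ensemble still gets wrong, the sum moves steadily toward the targets.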

Gradient Descent is an iterative process for finding a minimum of a function. The

model uses it to minimize the MLE objective function (the negative log-likelihood) and obtain the coefficients. An analogy that is often

used to illustrate gradient descent involves a person on a mountain. The mountain is foggy, and

the person is trying to get down, or find the minimum. Since the path is not visible, he must use his

local information to get down. He must look at the steepness at his current position and proceed

in the direction with the steepest descent. This is much like the process the machine learning

model follows to minimize the MLE objective function and determine the variable coefficients.
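
The downhill walk can be written out for a logistic regression with one predictor. This is a minimal Python sketch under assumptions: the data, the learning rate, and the number of steps are hypothetical, and the libraries used in the investigation may use a different optimizer internally.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, ys, learning_rate=0.1, n_steps=5000):
    """Fit p(y=1|x) = sigmoid(b0 + b1*x) by gradient descent on the
    average negative log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(n_steps):
        # Gradient of the average negative log-likelihood (the local steepness).
        grad_b0 = sum(sigmoid(b0 + b1 * x) - y for x, y in zip(xs, ys)) / n
        grad_b1 = sum((sigmoid(b0 + b1 * x) - y) * x for x, y in zip(xs, ys)) / n
        # Step downhill: move against the gradient.
        b0 -= learning_rate * grad_b0
        b1 -= learning_rate * grad_b1
    return b0, b1


# Hypothetical data: x = hours outdoors per day, y = 1 if the subject uses a vision aid.
hours = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
uses_aid = [1, 1, 0, 1, 0, 1, 0, 0]
b0, b1 = fit_logistic(hours, uses_aid)
print(b0, b1)  # b1 is negative: more time outdoors, lower predicted probability
```

The learning rate plays the role of the step length in the mountain analogy: too large and the walker overshoots the valley, too small and the descent takes many steps.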

N-fold Cross Validation is a model validation technique used to assess how the results of

a statistical analysis will generalize to an independent dataset. It is mainly used to estimate the

accuracy of a model. In this investigation, the dataset is divided into four groups (4-fold), and each group is used for

training three times and for testing once. N-fold cross validation allows every observation to be used for both training

and testing. This gives a more reliable estimate of the AUC (area under the ROC curve), even though the sample size is small.
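
The fold bookkeeping can be sketched as follows. This Python example only shows how the indices are split; the function name is hypothetical, the rows are not shuffled here (in practice they usually are shuffled first), and model training is left out.

```python
def k_fold_splits(n_samples, k=4):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


# 60 survey responses split into 4 folds of 15.
for train, test in k_fold_splits(60, k=4):
    print(len(train), len(test))  # 45 15 on each of the four rounds
```

Every response lands in the test fold exactly once and in the training set three times, which matches the 4-fold description above.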

Overview of Myopia

Myopia, also known as near-sightedness or short-sightedness, is a vision disorder in

which one’s eyeball is elongated, and incoming light focuses in front of the retina, instead of on

the retina. Because of this, objects that are distant appear blurry and objects that are close appear

normal. While myopia may be benign in most cases, requiring only glasses or contacts, it

greatly increases the risk of serious vision problems such as retinal detachment, cataracts, and

glaucoma.

The first person to recognize and distinguish between myopia and hyperopia, also known

as far-sightedness, is said to be Greek Philosopher Aristotle (384 BC – 322 BC). However, these

speculations were formalized in 65 AD when the words meyin, meaning close, and opdos,

meaning eye, were used in the compilation of Roman Law called Libris Pandectorum. The

Greek word “Myopos” eventually evolved into the Latin word “myops”, meaning a condition in

which someone attempts to see clearly by partially closing their eyes. While this condition has

been recognized for over 2000 years, research on the causes of myopia began about 200 years

ago [5]. Over time, scientists have understood that myopia results from a combination of genetic

predisposition and environmental factors. The main risk factors are believed to be time spent

doing work and focusing on close objects, time spent outdoors, family history of the condition,
socioeconomic class, Vitamin A deficiency, etc.

Myopia can occur at various scales, and the rate of progression to degenerative myopia

can also vary. Each person diagnosed with myopia can see clearly out to a certain distance that

varies from individual to individual, and everything further than that becomes blurry. Eye

examinations have shown that most myopic eyes have the same structure as non-myopic eyes,

although in cases of high to degenerative myopia, a staphyloma can sometimes be seen in an

examination. A staphyloma is an abnormal protrusion of uveal tissue, which surrounds the

retina, through a weak point in the eyeball. Usually, the protrusion is black and affects outer

layers of the eye like the cornea and sclera. The most significant structural difference between

myopic and normal eyes is the length of the eye. The retina needs to stretch to cover the

increased distance, and hence becomes weaker. This causes lattice degeneration and retinal

detachment. Lattice degeneration is an eye disease in which the retina develops breaks and tears,

which causes the retina to eventually detach, resulting in legal blindness [6].

Overview of Hyperopia

Hyperopia, also known as far-sightedness, is a vision disorder in which distant objects

appear clear, but near ones appear blurry. In contrast to myopia, hyperopia occurs when the eyeball

is too short or the cornea has too little curvature. This causes the image to focus behind the retina,

instead of on the retina. Hyperopia is most commonly seen in middle-aged adults, progressing as

they get older, and is much less likely in adolescents [18]. The correlation between hyperopia

and cataracts is alarming; 77.78% of hyperopic subjects end up with cortical cataracts [19].

Recent Studies and Probable Causes

Recently, scientists have begun talking about an “epidemic of nearsightedness”. While
the prevalence of nearsightedness in the U.S. is high, there are other parts of the world with

far higher prevalence rates. Parts of East Asia like Singapore, China, Japan, and Korea have

myopia prevalence rates as high as 80% among middle/high school age children.

Over the years, scientists have blamed nearsightedness on a variety of things.

Mathematician Johannes Kepler (1571 – 1630) blamed his nearsightedness on near work, or all

the writing/calculations that he did up close. This was the hypothesis that was believed for a

long time. Smartphones/tablets are not included in the definition of near work, since they are too

modern. Since nearsightedness has been on the rise since before these electronics were very

popular, scientists do not attribute this epidemic to them. Extensive research has been done on

the effect of near work on myopia rates, and while scientists have not established a firm link,

they have not ruled it out completely.

In the 20th century, scientists learned that genetics plays a role in the progression of

myopia. While the likelihood that you will be myopic is higher if your parents are as well, the

relationship between the presence of myopia and the mutation of a specific gene is not straight-

forward. A few dozen genotypes together influence the final phenotype present in an individual.

In order to confirm several hypotheses about genetic predisposition, a study was done in an Inuit

community in 1969. At the start of the experiment, 2 out of 131 people in that community

(1.5%) were nearsighted. After 1-2 generations (children/grandchildren), prevalence rates rose

to nearly 50%. This rapid change could not have resulted from genetics alone. Scientists then

concluded that while genes may have some influence, the main cause of nearsightedness is

present in the environment.

Scientists proceeded to investigate a link between nearsightedness and education. A

study published in October 2015 by researchers from the University of Wales Cardiff concluded
that the prevalence of myopia in first-borns was 10% higher than in children born later. This

does not by itself explain the skyrocketing rates, but it does give scientists a clue as to what the

cause might be. Researchers adjusted the data to account for how much education each of the

participants had. The effect diminished, meaning that the amount of education the subjects had

accounted for the difference in myopia rates. Scientists then created the “Parental Investment”

hypothesis. Parents tend to emphasize the value of education more in first-borns than in their

later children. As a result, the first-born children spend more time studying, and therefore have a

higher rate of myopia. Another study, conducted by professors from Sun Yat Sen University

in China, investigated the link between socioeconomic status and nearsightedness by comparing

the rates of myopia in neighboring Chinese provinces. Schoolchildren from the Shaanxi

province, a middle-income province, were compared to schoolchildren from the Gansu province,

a relatively poor province. In the wealthier province, the prevalence of myopia was twice that of

the poor province. While the researchers could not find a probable reason to explain the

difference, they found that higher math scores, from the wealthier province, were correlated with

higher rates of nearsightedness. This emphasizes the link between education and

nearsightedness. This also explains the extent of the problem in Asia, since education is

particularly emphasized in many East Asian cultures.

To determine if culture, as opposed to race, influenced nearsightedness, researchers

conducted a study on ethnic Chinese children living in Sydney and Singapore. They made sure

that the kids’ parents had similar rates of nearsightedness (around 70% in both groups). In the

children, the difference was alarming. Only 3.3% of the kids living in Sydney were nearsighted,

while 29.1% of the kids living in Singapore were nearsighted. Surprisingly, the children in

Sydney did more near-work than the children in Singapore. The only difference between the
groups of children was how much time they spent outside. While the kids in Sydney spent an

average of 13 hours a week outside, the kids in Singapore spent an average of 3 hours outside.

Public health officials hypothesized about the effect of sunlight on myopia and its progression.

Scientists proceeded to look for rigorous evidence and a mechanism by which this link could be

established.

Within the last few years, researchers have made considerable progress towards

establishing this as a probable cause. There have been several experiments on animals that show

that light protects against myopia. In a study done by researchers in Germany, myopia was

induced in chicks through the use of special goggles. They then placed one group of chicks in

sunlight, and the other group under regular laboratory lighting. The onset of myopia was slowed

in the group raised under sunlight by around 60%.

After establishing rigorous evidence, researchers then focused their attention on the

science behind sunlight and our brains. They found a substance produced by organisms’ brains

that influences eye development: Dopamine. Dopamine is a neurotransmitter that plays several

important roles in human (and other animals’) brains/bodies. This chemical is released to send

signals to other nerve cells. The brain has dopamine-pathways that influence motor control and

hormone release. In order to establish that Dopamine affects proper eye development, scientists

injected chicks with a chemical that blocks Dopamine. With Dopamine blocked,

sunlight no longer protected the chicks from myopia. Scientists concluded that Dopamine is

released as a result of bright light. This chemical is also related to the body’s day-night rhythm.

The human body switches from low-light nighttime vision to daytime vision, and without the

presence of Dopamine, this switch cannot occur. Researchers now believe that this “Dopamine

cycle” is required for healthy eye development throughout childhood and adolescence. If this
cycle is disrupted, especially during one’s middle-school years, the eyeball tends to become

elongated, causing myopia. Scientists call this hypothesis the Light-Dopamine hypothesis.

To test this hypothesis, scientists looked at primary school children from 12 schools in

Guangzhou, China. These children were divided into two groups of 6 schools each, so about 950

children were in each group. One of the groups did not change their daily routine, while the

other group added a 40-minute outdoor activity time to their schedule. For 3 years, the

researchers tracked the children and their eye development. At the end of the trial, the incidence

rate of myopia among the children that spent time outside was 30%, while in the other group, the

rate of myopia was 39.5%. While the reduction was less than they expected, 3 years is a short

amount of time to witness significant change. In order to better establish the link, researchers are

calling for more studies [7].
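
The size of the reduction can be worked out from the two reported incidence rates. The calculation below uses only the 30% and 39.5% figures quoted above; describing the result as a relative reduction is an interpretation added here, not a statistic reported in the study.

```latex
\begin{aligned}
\text{Absolute reduction} &= 39.5\% - 30\% = 9.5 \text{ percentage points} \\
\text{Relative reduction} &= \frac{39.5 - 30}{39.5} \approx 0.24 \quad (\text{about } 24\%)
\end{aligned}
```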

Diagnosis, Prevention, and Treatment

Myopia is most commonly diagnosed by eye care professionals (optometrists or

ophthalmologists). Usually, an autorefractor or a retinoscope is used to give an initial

assessment of the refractive status of each eye. An autorefractor is a machine that measures how

light enters the person’s eye, and how the eye changes the light. A retinoscope is an instrument for

shining light into a person’s eye to observe the reflection from the retina. A phoropter, a device that contains

different lenses, is then used to determine the exact prescription. These techniques and tools

help distinguish between myopia, hyperopia, astigmatism, and presbyopia [8].

There are various forms of myopia, each differing in intensity and treatment options.

Simple myopia, which is typically less than 4 diopters, can easily be fixed with glasses/contacts.

Degenerative myopia (malignant, pathological, or progressive myopia) is characterized in other


ways. The most common is through fundus changes. The fundus is the interior surface of the

eye opposite the lens and includes the retina, optic disc, macula, fovea, and posterior pole.

Degenerative myopia is also characterized by a very high refractive error and subnormal vision

even after correction. This form of myopia is known to continually progress and get worse over

time. Another form of myopia is Pseudo myopia, which is the blurring of distance vision

brought about by a spasm of the accommodation system. Pseudo myopia includes nocturnal

myopia, near work-induced transient myopia, and instrument myopia. The last form of myopia

is Induced myopia, also known as acquired myopia. This form of myopia is brought on by the

use of different drugs, increases in glucose levels, nuclear sclerosis, oxygen toxicity, etc. [9].

Other than spending time outdoors, as suggested by the Light-Dopamine Hypothesis,

there are several other ways to prevent myopia or slow its progression. The use of glasses and

contact lenses can alter the speed of myopic progression. However, the American Optometric

Association’s Clinical Practice Guidelines for Myopia refer to several studies which show that

switching among full-time, part-time, and no lens wear does not appear to slow progression. In young

people ages 18 or younger, topical anti-muscarinic medications are known to slow the

progression of myopia. A muscarinic receptor antagonist is a drug that blocks the action of the

neurotransmitter acetylcholine, which, like dopamine, influences eye development. Eyedrops containing

cyclopentolate and atropine, chemicals that influence nerve connections, also slow the

progression of myopia. These chemicals often have side effects which cause light sensitivity,

since they cause too much Dopamine to be produced. Scleral reinforcement surgery is another

method, which targets lattice degeneration. It provides reinforcement to the thinning posterior pole,

which is part of the fundus. By slowing or stopping the progression of the disorder, quality of

vision and extent of myopia may be improved [10].

Other than targeting the rate of progression, there is no concrete and universally accepted

solution to treat myopia. There are several types of refractive surgery that can be done for up to

6 diopters of myopia. PRK surgery (photorefractive keratectomy) is a laser eye procedure

intended to reduce a person’s dependence on glasses/contacts. The method involves removing

some corneal tissue using an excimer laser, an ultraviolet laser commonly used in eye surgery.

While PRK is relatively safe, the recovery process is usually painful [11]. Another type of

surgery is LASIK surgery, where a flap is cut on the cornea prior to the procedure. Then, using

an excimer laser, the curvature of the cornea is changed. While the process is quite similar to PRK,

the recovery is usually painless, but corneal stability may be sacrificed [12]. Unlike PRK and

LASIK, where the corneal surface is modified, another approach is to implant an additional

lens inside the eye. While this typically fixes the refractive problem, it can

eventually lead to glaucoma and other serious eye disorders [13].

Recently, several alternative therapies have been developed because of the lack of

scientific interventions. Vision therapy, also known as “behavioral optometry”, includes various

eye exercises and other relaxation techniques. One of the most popular forms of vision therapy

is the Bates Method. Physician William Horatio Bates (1860 – 1931) attributed all vision

disorders, including myopia, to a constant strain on the eyes. Hence, he felt that glasses harmed

the eye and trained the eye into the myopic state. His method includes palming, visualization,

movement, and exposure to sunlight. Scientists have stated that these alternative techniques

have “no clear scientific evidence” and so “they cannot be advocated for” [14]. In the late

1980s and early 1990s, biofeedback became very popular as a treatment for myopia.

Biofeedback is the process of gaining awareness of many physiological functions primarily using

instruments that provide information on the activity of those same systems. The goal was to be

able to manipulate these systems at one’s own will. Scientists maintain that biofeedback training
is not consistent [15].

Epidemiology and Statistics

The prevalence of myopia varies with age, country, race, religion, ethnicity, sex,

environment, occupation, and other factors. When comparing studies, more than one factor

differs, which makes comparisons of progression and incidence difficult. For example, in Asian

cultures, the prevalence of myopia has been as high as 70 – 90%, while in Europe and the United

States, the prevalence is 30 – 40%. The difference is even greater when comparing to

Africa, with prevalence rates between 10 – 20%. One must understand that race/ethnicity are not

the only factors that change; environment, socioeconomic factors, country, etc. also affect these

rates. This makes it hard to pinpoint a certain risk factor. Additionally, myopia is about twice

as common in Jewish communities as in non-Jewish communities. Religion may not have

anything to do with this difference, but currently it is hard to tell [16].

The epidemiology of global refractive errors has become a popular research topic. In

North America, myopia is most common in the United States when compared to Canada and

Mexico. Research suggests that prevalence has increased dramatically over the past few

decades. In 1971 – 1972, the National Health and Nutrition Examination Survey reported the

first estimate of myopia prevalence in the U.S. The prevalence of myopia in people ages 12

– 54 was 25%. Another survey was conducted in 1999 – 2004, and the prevalence had increased

to 42% [1]. A different study was done of 2,523 students in grades 1 – 8 (ages 5 – 17) in a

diverse region of the United States. 10% of the students had at least -0.75 diopters of myopia,

but there was considerable variance when race was accounted for. Asians had the highest

prevalence (19%), then Hispanics (13%), followed by African Americans (7%), and lastly

Caucasians (4%) [17].

Interview with Optometrist

Dr. Ratidzo Macharaga, O.D., Envision Eyecare


1. On average, what percentage of your patients are 10 to 15 years old?

Around 40% of my patients are 10 to 15 years old.

2. Are most of them nearsighted or farsighted?

Most of them are nearsighted, however kids below 10 are usually farsighted.

3. Are there any other trends in myopic adolescents?

Most of them tend to be Asian, there are no gender trends that I know of.

4. Is myopic progression or serious jumps in prescription common?

Yes, serious jumps in prescription are very common and occur in 15 – 20% of myopic

adolescents.

5. What factors might affect the progression of myopia in adolescents?

Prolonged screen use is the main factor. Family history, medication, and nutrients

(Vitamins A and D) can also influence myopic progression.

6. Are there any vision disorders that can result from myopic progression?

Yes, the most common vision disorder is refractive amblyopia.

Refractive amblyopia is when the brain ignores one eye because it is misaligned, leading

to a “lazy eye”.

7. Currently, is there anything that can help with the early detection of vision disorders

other than professional eye exams?

Medical checkups are supposed to test vision, but they are not very thorough, and only
happen once a year.

Statement of the Problem

To detect the early onset of pathological myopia, and other serious vision problems

resulting from myopia, especially during the current "myopic epidemic", a more efficient and

cost-effective solution to identify people at risk needs to be developed. A machine learning

model can accurately determine primary risk factors of myopia, and using this model, a simple

test to inform people of their level of risk can be developed. Additionally, more specialized

treatment can be tailored to these risk factors, and the prevalence of pathological myopic

progression can be greatly reduced.

Research Question

Which factors will a model, trained using machine learning algorithms on data collected

through experimentation, identify as key risks of vision disorders in adolescents?

Hypothesis

If a predictive model is built based on the surveys and is used to score students who are at

risk for vision problems so that they can be taken in for further testing, then the key risk factor

will be the one regarding time spent outdoors.

II. PROCEDURE OF INVESTIGATION

Variables



Independent Variable: Features (ex. Height, race, duration of screen time, etc.)

Dependent Variable: Whether the subject has myopia

Procedure1

1. Design the survey.

a. Think about the logistics.

i. Goal: Obtain information about the demographics, lifestyle, and eye

health of the subject

ii. Target Population: Adolescents (Ages 10 – 17)

iii. Timeline: Distribute the surveys; collect them after three weeks

1 There are no constants, controls, or materials in this investigation.

iv. Mode: Paper (includes an introductory letter, a parental consent form,

and the vision survey)

b. Develop the questions.

i. Reliability: Each survey question should mean the same thing to

everyone, including those who administer the survey

ii. Question Structure: Avoid “double-barreled” questions where two

responses are required

iii. Question Type: Refrain from using open-ended questions unless

necessary (an open-ended question is necessary for age)

iv. Reference Period: The time frame each survey should take (less than 5

minutes)

v. Response Format: Make sure that the respondent clearly knows what

to mark to signify their answer

c. Present the survey in an appealing and professional way2.

i. Establish credibility: Create an introductory letter to state your goals

clearly

ii. Appeal to all audiences: Translate to different languages, if some of

the respondents require that

2. Administer the survey.

a. Obtain parental consent from participants that are minors (all participants are

minors in this investigation)

b. Collect the surveys three weeks after distributing them

2 The vision survey used in this investigation is on the next page.

[Placeholder: Vision Survey]

3. Convert the survey to a digital format.

a. In this investigation, results will be recorded electronically in a secure

database

b. Give the answer choices numbers for each question (first question first choice

is 1, first question second choice is 2, second question first choice is 1, etc.)

c. Create a row for each survey (ex. John is row 1, Amy is row 2)

d. Give each question a variable name (ex. The question “How long do you

spend watching T.V.?” will become “timeTV”). These variable names will be

used as the headings for the columns

e. Record the results in this format: If John picked the first answer choice for

question 1, in Row 1, Column glassesContacts (variable name), there will be a

“1”
4. Create summary statistics for each question.

a. R will generate statistics like % of subjects with glasses, % male, % in each

race, etc.

5. Create binary explanatory variables.

a. Ensure that there are sufficient counts in each group.

b. Create binary variables from the summary statistics.

6. Create a response variable.

a. The response variable is binary.

b. Vision aids or not (1 or 0).

7. Compute odds ratios and relative risk.

a. Each variable is computed separately.

b. Used to study univariate characteristics.

8. Train using Gradient Descent in R

a. Use the H2O Library and 4-fold Cross-validation

b. Use a Generalized Linear Model, Decision Tree Model, Random Forest

Model, and Gradient Boosted Model

9. Determine the most important features across all algorithms
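
The 4-fold cross-validation named in step 8a can be pictured with a short base-R sketch. This is only an illustration under stated assumptions: the names `n.surveys`, `k`, and `fold.id` and the seed are hypothetical, and H2O assigns folds internally through its `nfolds` argument rather than with this code.

```r
# Illustrative sketch only: H2O handles fold assignment internally via nfolds.
set.seed(1)        # hypothetical seed, used only to make the shuffle repeatable
n.surveys <- 60    # number of surveys collected in this investigation
k <- 4             # number of folds (step 8a)

# Give every survey a fold label from 1 to k, then shuffle the labels
fold.id <- sample(rep(seq_len(k), length.out = n.surveys))

# Each fold is held out once for validation; the remaining folds train the model
for (f in seq_len(k)) {
  n.valid <- sum(fold.id == f)
  n.train <- sum(fold.id != f)
  cat(sprintf('Fold %d: train on %d surveys, validate on %d\n', f, n.train, n.valid))
}
```

With 60 surveys and 4 folds, each model is trained on 45 surveys and validated on the remaining 15, which is worth keeping in mind when reading the performance figures for such a small sample.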

III. RESULTS

Program

Vision Survey Analysis


Divya Nori

pkgs <- c('data.table', 'extrafont', 'ggplot2', 'grid', 'gridExtra', 'h2o',
          'RColorBrewer', 'rpart', 'rpart.plot', 'scales')
pkgs.loaded = sapply(pkgs, function(x)
suppressPackageStartupMessages(require(x, character.only=TRUE)))
stopifnot(length(pkgs)==sum(pkgs.loaded))
options(list(stringsAsFactors=FALSE, width=130))

cbPalette <- c("#999999", "#C01B1B", "#56B4E9", "#008080", "#FA8072",
               "#0072B2")
blank <- rectGrob(gp=gpar(col="white"))

remove(list=c('pkgs','pkgs.loaded'))

Reading Input Data

Labels for Categorical Features

labels.dt <- fread('labels.csv', sep='|')


cat(sprintf('Number of questions: %d',uniqueN(labels.dt[,feature])), '\n')

cat(sprintf('Number of labels: %d',labels.dt[,.N]), '\n\n')
labels.dt
## Number of questions: 30
## Number of labels: 97

Survey Data

data_orig.dt <- fread("surveys.csv",
                      drop=c('glassesMS','glassesNear','glassesFar'))
cat('Number of surveys:', data_orig.dt[,.N])
data_orig.dt
## Number of surveys: 60


#changes for Spanish version: race codes for Hispanic and African American were swapped
data_orig.dt[, tmp:=race]
data_orig.dt[language==2 & tmp==2, race:=3]
data_orig.dt[language==2 & tmp==3, race:=2]
data_orig.dt[, tmp:=NULL]

#changes for Spanish version: screens set to same as read


for(x in c('Car','LowLight','LyingDown')) {
data_orig.dt[language==2, paste0('screen',x):=get(paste0('read',x))]
}
data_orig.dt[language==2]

Distribution of Survey Responses

#template
#replacing *NL* with \n
ggp.1 <- function(var, col, xl=NULL) {
  #default x-axis label: the variable name with its first letter capitalized
  if(is.null(xl)) {xl <- paste0(toupper(substring(var,1,1)),substring(var,2))}
  xax <- labels.dt[feature==var]
  labels <- stringi::stri_replace_all_fixed(xax[['label']],
              c("<=",">=","*NL*"), c("\u2264","\u2265","\n"),
              vectorize_all=FALSE)

  #bar chart of the share of students per answer, with counts printed inside the bars
  ggplot(data_orig.dt, aes_string(paste0('factor(',var,')'))) +
    theme_light() +
    theme(panel.grid.minor=element_blank(),
          text=element_text(family='Segoe UI', size=15)) +
    geom_bar(aes(y=(..count..)/sum(..count..)), fill=cbPalette[col]) +
    geom_text(aes(label=..count.., y=(..count../sum(..count..))*.5),
              stat='count', color='white', size=4) +
    scale_x_discrete(breaks=xax[['value']], labels=labels) + xlab(xl) +
    scale_y_continuous(labels=percent) + ylab('Students (%)')
}

Response Variable

grid.arrange(blank, ggp.1('glassesContacts',1,'Glasses or Contacts'), blank,
             widths=c(1,6,1))

Demographics

grid.arrange(ggp.1('age',5), blank, ggp.1('gender',3), widths=c(6,.5,3.5))


grid.arrange(blank, ggp.1('race',6), blank, widths=c(1,8,1))

Physical Attributes

grid.arrange(ggp.1('height',2,'Height (ft)'), blank, ggp.1('weight',5,'Weight (lbs)'),
             widths=c(4.4,.5,6.2))

Food Servings

grid.arrange(ggp.1('vegetable',3,'Servings of Vegetables'), blank,
             ggp.1('fruit',6,'Servings of Fruit'), widths=c(6,.5,6))
grid.arrange(ggp.1('snack',5,'Servings of Snacks'), blank,
ggp.1('protein',2,'Servings of Protein'), widths=c(5.7,.2,5))
grid.arrange(blank, ggp.1('lunchHome',1,'Lunch from Home'), blank,
widths=c(1.5,5,1.5))

Time Spent on Activities

grid.arrange(ggp.1('timeScreenSchool',1,'Screen Time for School Work'), blank,
             ggp.1('timeTV',5,'Time Spent Watching TV'), widths=c(6.5,.5,5))
grid.arrange(ggp.1('timeCellPhone',3,'Time Spent on a Cell Phone'), blank,
ggp.1('timeBook',4,'Time Spent Reading Books'), widths=c(5,.5,5))
grid.arrange(ggp.1('timeComputer',2,'Time Spent Using Computer'), blank,
ggp.1('timeOutdoors',6,'Time Spent Outdoors'), widths=c(5,.5,5))

Reading or Watching a Screen

grid.arrange(ggp.1('readCar',1,'Reading While Traveling in a Car'), blank,
             ggp.1('screenCar',2,'Watching a Screen While Traveling in a Car'),
             widths=c(5,.5,5))
grid.arrange(ggp.1('readLowLight',4,'Reading In Low Light'), blank,
ggp.1('screenLowLight',5,'Watching a Screen In Low Light'),
widths=c(6,.5,4.5))
grid.arrange(ggp.1('readLyingDown',6,'Reading While Lying Down'), blank,
ggp.1('screenLyingDown',3,'Watching a Screen While Lying Down'),
widths=c(5,.5,5))

Other Factors

grid.arrange(ggp.1('transportation',5), blank, ggp.1('braces',2),
             widths=c(6.5,.5,3.5))
grid.arrange(ggp.1('siblings',6,'Siblings With Vision Aids'), blank,
ggp.1('physicalActive',3,'Physical Activity'), widths=c(6,.5,5))

Creating Features for Model Matrix

model.dt <- copy(data_orig.dt)


vars.orig <- names(data_orig.dt)

#response
model.dt[, y:=glassesContacts %in% c(1,2,4)]

#demographics
model.dt[, gender.Female:=gender %in% c(1)]

model.dt[, age.GT12:=age>12]
model.dt[, race.NonWhite:=!(race %in% c(1))]
model.dt[, height.GT54:=height %in% c(4)]
model.dt[, weight.GT130:=weight %in% c(5)]

#food
model.dt[, vegetable.GT1:=vegetable %in% c(3,4,5)]
model.dt[, fruit.GT2:=fruit %in% c(4,5)]
model.dt[, snack.GT2:=snack %in% c(4,5)]
model.dt[, protein.GT2:=protein %in% c(4,5)]
model.dt[, lunchNotHome:=lunchHome %in% c(2)]

#time on activities
model.dt[, timeScreenSchool.LE1:=timeScreenSchool %in% c(1)]
model.dt[, timeTelevision.LE1:=timeTV %in% c(1)]
model.dt[, timeCellPhone.LE1:=timeCellPhone %in% c(1)]
model.dt[, timeComputer.LE1:=timeComputer %in% c(1)]
model.dt[, timeBook.LE1:=timeBook %in% c(1)]
model.dt[, timeOutdoors.LE1:=timeOutdoors %in% c(1)]

#reading/screen time in various situations


model.dt[, readCar.Yes:=readCar %in% c(1)]
model.dt[, readLowLight.Yes:=readLowLight %in% c(1)]
model.dt[, readLyingDown.Yes:=readLyingDown %in% c(1)]
model.dt[, screenCar.Yes:=screenCar %in% c(1)]
model.dt[, screenLowLight.Yes:=screenLowLight %in% c(1)]
model.dt[, screenLyingDown.Yes:=screenLyingDown %in% c(1)]

#others
#model.dt[, transport.Car:=transportation %in% c(1)]
model.dt[, braces.Yes:=braces %in% c(1)]
model.dt[, siblingsGlasses.Yes:=!(siblings %in% c(3))]
#model.dt[, physicalActive.GT1hr:=physicalActive %in% c(3)]

model.dt <- model.dt[, setdiff(names(model.dt), vars.orig), with=F]
model.dt <- model.dt[, lapply(.SD, as.integer), .SDcols=names(model.dt)]

model.dt

Summary Statistics: Correlations Between All Pairs of Features

cr <- cor(model.dt[, -c('y')])


cr.dt <- data.table(melt(cr, na.rm=TRUE, value.name='cor'))

ggplot(data = cr.dt, aes(Var1, Var2, fill = cor)) + geom_tile(color='white') +
  theme_light() + coord_fixed() +
  scale_fill_gradient2(low='red', high='blue', mid='white', midpoint=0,
                       limit=c(-1,1), space='Lab',
                       name='Correlation') +
  theme(axis.text.x=element_text(angle=90, vjust=.5, size=9, hjust=1),
        axis.text.y=element_text(size=9, vjust=.5),
        text=element_text(family='Segoe UI'))
cr.dt <- cr.dt[as.integer(Var1)<as.integer(Var2)]
cbind(setorder(cr.dt, -cor)[1:5], data.table('|'='|'), setorder(cr.dt, cor)[1:5])
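
Because every feature in the model matrix is coded 0 or 1, the Pearson correlation returned by `cor()` for a pair of features is the same number as the phi coefficient of their 2×2 table. The sketch below illustrates this in base R; the two vectors `a` and `b` are hypothetical toy data, not survey responses.

```r
# Hypothetical binary indicators for 10 students (illustration only, not survey data)
a <- c(1, 1, 1, 0, 0, 1, 0, 0, 1, 0)
b <- c(1, 1, 0, 0, 0, 1, 0, 1, 1, 0)

# Cell counts of the 2x2 table of a against b
n11 <- sum(a == 1 & b == 1); n10 <- sum(a == 1 & b == 0)
n01 <- sum(a == 0 & b == 1); n00 <- sum(a == 0 & b == 0)

# Phi coefficient computed from the cell counts
phi <- (n11 * n00 - n10 * n01) /
  sqrt((n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00))

# Pearson correlation, as computed by cor() in the program
r <- cor(a, b)
print(c(phi = phi, pearson = r))
```

Both values agree, so each correlation in the heat map can be read as a measure of how often two yes/no features occur together.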

varcol.dt <- rbind(
  data.table(var=c('timeOutdoors.LE1','screenCar.Yes','siblingsGlasses.Yes'),
             color=c('#C7E9B4','#7FCDBB','#41B6C4')),
  data.table(var=c('height.GT54','weight.GT130'))[, color:='#1D91C0'],
  data.table(var=c('lunchNotHome','race.NonWhite'))[, color:='#225EA8'],
  data.table(var=c('readLowLight.Yes','readLyingDown.Yes'),
             color=c('#253494','#081D58'))
)
varcol.dt <- rbind(varcol.dt,
  data.table(var=setdiff(names(model.dt),c(varcol.dt[['var']],'y')))[, color:='#000000']
)

ggp.2 <- function(dt, fpar) {
  dt.p <- setnames(dt[, fpar[['cols']], with=F], c('var','value'))
  dt.p <- setorder(dt.p, -value)[1:fpar[['nobs']]]
  dt.p <- merge(dt.p, varcol.dt, by='var', sort=FALSE)
  dt.p <- setorder(dt.p, value)[, var:=factor(.I,labels=var)]

  p <- ggplot(dt.p, aes(var,value,fill=var)) + theme_light() +
    geom_col(alpha=.8) +
    coord_flip() + xlab(fpar[['xylab']][1]) + ylab(fpar[['xylab']][2]) +
    scale_fill_manual(values=dt.p[['color']]) +
    theme(legend.position='none', text=element_text(family='Segoe UI', size=14))
  return(p)
}

#N <- 5
#dt <- data.table(X=sample(varcol.dt[['var']],N), Y=runif(N))
#print(dt)
#par <- list(nobs=N, cols=c('X','Y'), xylab=c('Variable','Odds Ratio'))
#ggp.2(dt, par)

Relative Risk and Odds Ratio

tot <- model.dt[, .N, y]


tmp <- model.dt[, lapply(.SD,sum), by=y, .SDcols=2:ncol(model.dt)]
col <- names(tmp)[seq(2,ncol(tmp))]

#eve/nev: students with/without vision aids; con/nocon: feature present/absent
univar.dt <- data.table(codes=col,
  eve.con=unname(unlist(tmp[y==1,col,with=F])), eve=tot[y==1,N],
  nev.con=unname(unlist(tmp[y==0,col,with=F])), nev=tot[y==0,N])
univar.dt[, ':='(eve.nocon=eve-eve.con, nev.nocon=nev-nev.con)][, c('eve','nev'):=NULL]
univar.dt[, ':='(con=eve.con+nev.con, nocon=eve.nocon+nev.nocon)]
#relative risk = (eve.con/con)/(eve.nocon/nocon); odds ratio = (eve.con/nev.con)/(eve.nocon/nev.nocon)
univar.dt[, ':='(rel.risk=(eve.con*nocon)/(eve.nocon*con),
                 odds.ratio=(eve.con*nev.nocon)/(nev.con*eve.nocon))]
univar.dt <- setorder(univar.dt[, c('codes','rel.risk','odds.ratio')], -odds.ratio)
univar.dt

ggp.2(univar.dt, list(nobs=5, cols=c('codes','odds.ratio'),
                      xylab=c('Variable','Odds Ratio')))
remove(list=c('col','tmp','tot'))

Predictive Models: Decision Tree Model

model.rp <- rpart(y~., method='class',
                  control=rpart.control(cp=1E-5, minsplit=10, xval=4), data=model.dt)
rpart.plot(model.rp, type=1, fallen.leaves=FALSE, tweak=1.1)
localH2O <- h2o.init(ip='localhost', port=55555, max_mem_size='8g', nthreads=-1)
##
## H2O is not running yet, starting it now...
##
## Note: In case of errors look at the following log files:
## C:\Users\User\AppData\Local\Temp\RtmpEJ0lnF/h2o_User_started_from_r.out
## C:\Users\User\AppData\Local\Temp\RtmpEJ0lnF/h2o_User_started_from_r.err
##
##
## Starting H2O JVM and connecting: .. Connection successful!
##
## R is connected to the H2O cluster:
## H2O cluster uptime: 4 seconds 163 milliseconds
## H2O cluster timezone: America/New_York
## H2O data parsing timezone: UTC
## H2O cluster version: 3.18.0.4
## H2O cluster version age: 8 days
## H2O cluster name: H2O_started_from_R_User_jll417
## H2O cluster total nodes: 1
## H2O cluster total memory: 7.11 GB
## H2O cluster total cores: 8
## H2O cluster allowed cores: 8

## H2O cluster healthy: TRUE
## H2O Connection ip: localhost
## H2O Connection port: 55555
## H2O Connection proxy: NA
## H2O Internal Security: FALSE
## H2O API Extensions: Algos, AutoML, Core V3, Core V4
## R Version: R version 3.4.3 (2017-11-30)
s <- capture.output(h2o.removeAll())
h2o.no_progress()

data.h2o <- as.h2o(model.dt)


data.h2o$y <- as.factor(data.h2o$y)

Generalized Linear Model (GLM)

glm.h2o <- h2o.glm(x=seq(2,ncol(model.dt)), y=1, training_frame=data.h2o,
                   family='binomial', seed=12345,
                   nfolds=4, lambda=1E-3, max_active_predictors=5,
                   intercept=TRUE)
h2o.performance(glm.h2o, newdata=data.h2o)
## H2OBinomialMetrics: glm
##
## MSE: 0.1943385
## RMSE: 0.4408384
## LogLoss: 0.5674731
## Mean Per-Class Error: 0.2692308
## AUC: 0.7647059
## Gini: 0.5294118
## R^2: 0.2085764
## Residual Deviance: 68.09678
## AIC: 80.09678
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         0  1    Error    Rate
## 0      17 17 0.500000  =17/34
## 1       1 25 0.038462   =1/26
## Totals 18 42 0.300000  =18/60
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.265312 0.735294 15
## 2 max f2 0.265312 0.856164 15
## 3 max f0point5 0.584681 0.655738 8
## 4 max accuracy 0.584681 0.700000 8
## 5 max precision 0.764473 0.875000 2
## 6 max recall 0.148481 1.000000 17
## 7 max specificity 0.870703 0.970588 0
## 8 max absolute_mcc 0.265312 0.499083 15
## 9 max min_per_class_accuracy 0.561815 0.653846 9
## 10 max mean_per_class_accuracy 0.265312 0.730769 15
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or
`h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
coef.glm <- glm.h2o@model$coefficients
#keep non-zero coefficients only, excluding the intercept
coef.glm.dt <- data.table(code=names(coef.glm),
                          regval=unlist(coef.glm))[abs(regval)>1E-3 & code!='Intercept']
ggp.2(coef.glm.dt, list(nobs=5, cols=c('code','regval'),
                        xylab=c('Variable','Coefficient')))
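
Several of the summary metrics printed for the GLM follow directly from its confusion matrix. The base-R sketch below recomputes them from the four counts reported above (17, 17, 1, 25); the variable names are chosen here for illustration and are not part of the H2O output.

```r
# Counts copied from the GLM confusion matrix above (rows: actual, columns: predicted)
cm <- matrix(c(17, 17,
                1, 25), nrow = 2, byrow = TRUE,
             dimnames = list(actual = c('0', '1'), predicted = c('0', '1')))

tn <- cm['0', '0']; fp <- cm['0', '1']
fn <- cm['1', '0']; tp <- cm['1', '1']

error.rate      <- (fp + fn) / sum(cm)                  # overall error: 18/60
per.class.error <- c(fp / (tn + fp), fn / (fn + tp))    # error within each actual class
mean.per.class  <- mean(per.class.error)                # "Mean Per-Class Error"
f1              <- 2 * tp / (2 * tp + fp + fn)          # "max f1" at this threshold

print(round(c(error.rate = error.rate,
              mean.per.class.error = mean.per.class,
              f1 = f1), 6))
```

The results reproduce the reported overall error (0.300000), Mean Per-Class Error (0.2692308), and max F1 (0.735294). The F1-optimal threshold catches 25 of the 26 students with vision aids, at the cost of flagging half of the students without them.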

Gradient Boosted Model (GBM)

gbm.h2o <- h2o.gbm(x=seq(2,ncol(model.dt)), y=1, training_frame=data.h2o,
                   seed=12345,
                   nfolds=4, ntrees=100, learn_rate=1E-2)
h2o.performance(gbm.h2o, newdata=data.h2o)
## H2OBinomialMetrics: gbm
##
## MSE: 0.1795813
## RMSE: 0.4237703
## LogLoss: 0.5452771

## Mean Per-Class Error: 0.1549774
## AUC: 0.8891403
## Gini: 0.7782805
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         0  1    Error   Rate
## 0      30  4 0.117647  =4/34
## 1       5 21 0.192308  =5/26
## Totals 35 25 0.150000  =9/60
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.470767 0.823529 23
## 2 max f2 0.352211 0.890411 38
## 3 max f0point5 0.470767 0.833333 23
## 4 max accuracy 0.470767 0.850000 23
## 5 max precision 0.666833 1.000000 0
## 6 max recall 0.352211 1.000000 38
## 7 max specificity 0.666833 1.000000 0
## 8 max absolute_mcc 0.470767 0.693585 23
## 9 max min_per_class_accuracy 0.470767 0.807692 23
## 10 max mean_per_class_accuracy 0.470767 0.845023 23
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or
`h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
varimp.gbm.dt <- data.table(gbm.h2o@model$variable_importances)[,
c('relative_importance','percentage'):=NULL]
ggp.2(varimp.gbm.dt, list(nobs=5, cols=c('variable','scaled_importance'),
xylab=c('Variable','Scaled Importance')))

Random Forest Model (RF)

rf.h2o <- h2o.randomForest(x=seq(2,ncol(model.dt)), y=1,
                           training_frame=data.h2o, seed=12345,
                           ntrees=150, max_depth=2)

#summary(rf.h2o)
h2o.performance(rf.h2o, newdata=data.h2o)
## H2OBinomialMetrics: drf
##
## MSE: 0.1891142
## RMSE: 0.4348727
## LogLoss: 0.5668165
## Mean Per-Class Error: 0.1210407
## AUC: 0.9343891
## Gini: 0.8687783
##
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##         0  1    Error   Rate
## 0      31  3 0.088235  =3/34
## 1       4 22 0.153846  =4/26
## Totals 35 25 0.116667  =7/60
##
## Maximum Metrics: Maximum metrics at their respective thresholds
## metric threshold value idx
## 1 max f1 0.478210 0.862745 24
## 2 max f2 0.436659 0.905797 33
## 3 max f0point5 0.478210 0.873016 24
## 4 max accuracy 0.478210 0.883333 24
## 5 max precision 0.607206 1.000000 0
## 6 max recall 0.381417 1.000000 41
## 7 max specificity 0.607206 1.000000 0
## 8 max absolute_mcc 0.478210 0.761806 24
## 9 max min_per_class_accuracy 0.463585 0.852941 27
## 10 max mean_per_class_accuracy 0.478210 0.878959 24
##
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or
`h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
varimp.rf.dt <- data.table(rf.h2o@model$variable_importances)[,
c('relative_importance','percentage'):=NULL]

ggp.2(varimp.rf.dt, list(nobs=5, cols=c('variable','scaled_importance'),
xylab=c('Variable','Scaled Importance')))
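
The Gini value that H2O prints beside each AUC is not an independent measurement: it is the AUC rescaled as Gini = 2 × AUC − 1. The short check below uses the AUC and Gini values copied from the three performance reports above; the vector names are chosen here for illustration.

```r
# AUC and Gini values copied from the h2o.performance() output above
auc  <- c(glm = 0.7647059, gbm = 0.8891403, rf = 0.9343891)
gini <- c(glm = 0.5294118, gbm = 0.7782805, rf = 0.8687783)

# Gini coefficient implied by each AUC
gini.from.auc <- 2 * auc - 1
print(round(cbind(auc, gini, gini.from.auc), 6))
```

The implied and reported values agree to the printed precision, so the three models can be ranked by either number. Note that these metrics were computed with `newdata=data.h2o`, the same 60 surveys used for training.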

Distribution Data

Figure 1: Distribution of Students with Visual Aids

Figure 2: Distribution of Students by Age

Figure 3: Distribution of Students by Gender

Figure 4: Distribution of Students by Height

Figure 5: Distribution of Students by Weight

Figure 6: Distribution of Students by Race

Figure 7/8: Distribution of Students Reading While Traveling in a Car/Watching a Screen While Traveling in a Car

Figure 9/10: Distribution of Students Reading While Lying Down/Watching a Screen While Lying Down

Figure 11/12: Distribution of Students Reading in Low Light/Watching a Screen in Low Light

Figure 13/14: Distribution of Students by Time Spent on a Screen for School Work/Time Spent Watching TV

Figure 15: Distribution of Students with Siblings Using Visual Aids

Figure 16: Distribution of Students by Physical Activity Time

Figure 17: Distribution of Students by Snack Servings

Figure 18: Distribution of Students by Protein Servings

Figure 19/20: Distribution of Students by Time Spent on a Cell Phone/Time Spent Reading Books

Figure 21/22: Distribution of Students by Time Spent Using Computer/Time Spent Outdoors

Figure 23: Distribution of Students by Form of Transportation

Figure 24: Distribution of Students by Use of Braces

Figure 25: Distribution of Students by Vegetable Servings

Figure 26: Distribution of Students by Fruit Servings

Figure 27: Distribution of Students by Source of Lunch

Statistical Analysis Data

Table 1: Relative Risk and Odds Ratios for Identified Variables

Codes                 Relative Risk   Odds Ratio
screenCar.Yes         2.548780        4.342105
height.GT54           1.777778        3.333333
timeOutdoors.LE1      2.072072        3.333333
siblingsGlasses.Yes   1.591837        2.380952
race.NonWhite         1.525641        2.138889
lunchNotHome          1.455882        2.042017
readLowLight.Yes      1.444444        1.888889
timeBook.LE1          1.317829        1.594203
screenLowLight.Yes    1.317829        1.594203
readCar.Yes           1.235294        1.470588

Figure 28: Odds Ratio Graph

Table 2: Pairwise Positive Correlations

Variable 1             Variable 2          Correlation
height.GT54            weight.GT130        0.4900980
timeScreenSchool.LE1   timeComputer.LE1    0.4559916
snack.GT2              timeBook.LE1        0.4116251
race.NonWhite          lunchNotHome        0.3853551
weight.GT130           braces.Yes          0.3767590

Table 3: Pairwise Negative Correlations

Variable 1             Variable 2          Correlation
timeBook.LE1           readLyingDown.Yes   -0.3791612
weight.GT130           timeBook.LE1        -0.3573595
race.NonWhite          protein.GT2         -0.3435780
protein.GT2            lunchNotHome        -0.3204059
gender.Female          height.GT54         -0.3195196

Figure 29: Pairwise Correlations Heat Map

Model Output

Figure 30: Generalized Linear Model Regression Coefficients

Figure 31: GLM ROC Curve

Figure 32: Decision Tree

Figure 33: Random Forest Scaled Importance Values

Figure 34: Gradient Boosted Machine Scaled Importance Values

Figure 35: GBM ROC Curve

Discussions

Analysis

Figure 1 displays the percentage of students with visual aids: glasses, contacts, both, or none. Most

of the students (~57%) do not use glasses or contacts and ~43% of the students do. This gives a

good size sample for each group, rather than having many students in one group and barely any

students in the other.


For the rest of the explanatory variables (Figures 2 – 27), the summary statistics were

used to convert them to binary variables. While half of the features were already binary (Figures

3, 7, 8, 9, 10, 12, 14, 19, 20, 21, 22, 24), the remaining variables needed to be converted. The

binary variables were defined based on the number of subjects in each group. For example, in

Figure 6, since ~57% of the students are Caucasian, the two variables created for the “race”

feature were race.White and race.NonWhite. If the variables had been created with the “other”

category instead of Caucasian (race.Other and race.NonOther), there would only be 6 subjects

(10%) in the race.Other category. If race.Other had been chosen by the model as one of the

predictive variables, because of the low number of subjects, the coefficient would not accurately

depict the predictive power.
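
The group-size reasoning above can be sketched in a few lines of base R. The vector `race` below is hypothetical toy data built to roughly match the proportions quoted in this section (about 57% Caucasian, 6 students in the "other" category); code 1 for Caucasian follows the program's `race %in% c(1)`, while the meaning of codes 2 to 5 is assumed for illustration only.

```r
# Hypothetical race codes for 60 students: 1 = Caucasian (as in the program);
# the meaning of codes 2-5, including 5 = "other", is an assumption for this sketch.
race <- c(rep(1, 34), rep(2, 8), rep(3, 6), rep(4, 6), rep(5, 6))

# Two candidate ways of turning the race question into a binary variable
race.NonWhite <- as.integer(!(race %in% c(1)))
race.Other    <- as.integer(race %in% c(5))

# Number of students in each group (0 or 1) under both candidate splits
counts <- rbind(race.NonWhite = table(factor(race.NonWhite, levels = 0:1)),
                race.Other    = table(factor(race.Other, levels = 0:1)))
print(counts)

# Share of all students that falls in the smaller group of each split
min.share <- apply(counts, 1, min) / length(race)
print(round(min.share, 2))
```

Under these assumed counts the race.NonWhite split leaves 26 students in its smaller group, while the race.Other split leaves only 6, which is why the more balanced split was preferred.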

From the summary statistics, the following binary variables and their complements were created: gender.Female, age.GT12, race.NonWhite, height.GT54, weight.GT130, vegetable.GT1, fruit.GT2, snack.GT2, protein.GT2, lunchNotHome, timeScreenSchool.LE1, timeTelevision.LE1, timeCellPhone.LE1, timeComputer.LE1, timeBook.LE1, timeOutdoors.LE1, readCar.Yes, readLowLight.Yes, readLyingDown.Yes, screenCar.Yes, screenLowLight.Yes, screenLyingDown.Yes, braces.Yes, siblingsGlasses.Yes, physicalActive.GT1hr.

Table 1 and Figure 28 show the top ten most predictive variables based on relative risk and odds ratio. The higher the relative risk or odds ratio, the more predictive power the variable is likely to have. Similarly, if the odds ratio is greater than one, an individual with that feature is more likely to have a visual aid. The ordering of variables based on relative risk almost corresponds to that of odds ratios, suggesting that both measures are consistent indicators of predictive power. Also, the order of features by univariate odds ratios was similar to that of the other algorithms, so if the odds ratio is higher, the feature is likely to have predictive power in other models.
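As a reminder of how the two measures discussed above are defined, here is a small Python sketch that computes both from a 2x2 table. The counts are hypothetical (chosen only so that 26 of 60 subjects have a vision aid, matching the ~43% reported earlier) and are not taken from Table 1.

```python
def relative_risk(a, b, c, d):
    """Risk of a vision aid with the feature divided by the risk without it.

    a: feature present, has aid      b: feature present, no aid
    c: feature absent, has aid       d: feature absent, no aid
    """
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """Odds of a vision aid with the feature divided by the odds without it."""
    return (a * d) / (b * c)

# Hypothetical counts for one binary feature across 60 subjects.
a, b, c, d = 12, 6, 14, 28

print(relative_risk(a, b, c, d))
print(odds_ratio(a, b, c, d))
```

Both values exceed one whenever the feature is associated with having a vision aid, which is why the two orderings tend to agree.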

Table 2 shows the top five pairs of positively correlated variables. These coefficients were computed using the pairwise correlation formula. The coefficients are not very strong (as seen in Figure 29), and with only 60 subjects they are estimated imprecisely, but if the model picks one of these variables, the correlations suggest that the other variable could be used as well. For example, as seen in Figure 30, height.GT54 was identified by the GLM as having predictive power. The positive correlation suggests that weight.GT130 could also have some predictive power. The same reasoning applies to the weight feature in Figure 32, which is correlated with height. Similarly, Table 3 shows the top five pairs of negatively correlated variables. The variables protein.GT2 and lunchNotHome were negatively correlated, which is interesting because it may suggest that school lunches are not as nutritious. Other conclusions can be drawn from these correlations, and with a greater sample size, the coefficients could be estimated more reliably.
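The text does not spell out the "pairwise correlation formula". A minimal sketch, assuming it is the Pearson correlation applied to two 0/1 indicator columns (for binary data this is the phi coefficient), could look like the following. The data vectors are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical indicators for eight subjects (1 = feature present).
height_gt54 = [1, 1, 1, 0, 0, 0, 1, 0]
weight_gt130 = [1, 1, 0, 0, 0, 1, 1, 0]

print(pearson(height_gt54, weight_gt130))                    # positive: features tend to co-occur
print(pearson(height_gt54, [1 - v for v in height_gt54]))    # a feature vs. its complement
```

A feature and its complement are perfectly negatively correlated, which is why only one variable of each binary pair needs to enter a model.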

Figure 30 displays the top five most predictive variables based on the regression coefficients computed by the Generalized Linear Model. All the variables have positive coefficients, meaning that if the feature is present, it is more likely that the subject has or will need a vision aid. The intercept indicates the likelihood that the subject will require a vision aid if none of the features are present. Since this value is negative, a subject who is in none of the listed categories is predicted to be unlikely to need glasses/contacts. The most predictive variable found, with a coefficient of ~1.373, is the use of a screen in a car, followed by whether race is non-white (~0.835), time spent outdoors is less than one hour (~0.823), height is greater than 5.4 feet (~0.729), and the subject’s sibling needs glasses (~0.728). In these cases, if the feature is present, the subject is more likely to need a vision aid. The magnitude of the coefficient indicates the predictive power of the variable, so the time spent outdoors is more predictive than height.

Figure 31 displays the ROC curve for the Generalized Linear Model. The model has an AUC of ~0.765. This is fairly accurate given the number of observations (60). The ROC curve plots the sensitivity against 1 minus the specificity. Specificity (TNR or True Negative Rate) is equal to True Negatives/All Negatives on a confusion matrix. Sensitivity (TPR or True Positive Rate) is equal to True Positives/All Positives on a confusion matrix.
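To make the reading of the GLM coefficients concrete, the sketch below assumes the GLM is a logistic (binomial) regression, which the text does not state explicitly. The five coefficients are the approximate values quoted above; the intercept is not reported in the text, so the value used here is purely hypothetical and only its negative sign follows the paper.

```python
from math import exp

# Approximate coefficients quoted in the text (Figure 30).
COEFFICIENTS = {
    "screenCar.Yes": 1.373,
    "race.NonWhite": 0.835,
    "timeOutdoors.LE1": 0.823,
    "height.GT54": 0.729,
    "siblingsGlasses.Yes": 0.728,
}

# Hypothetical value: the paper only says the intercept is negative.
HYPOTHETICAL_INTERCEPT = -2.0

def predicted_probability(present, intercept=HYPOTHETICAL_INTERCEPT):
    """Logistic-model probability of needing a vision aid given the features present."""
    log_odds = intercept + sum(COEFFICIENTS[name] for name in present)
    return 1.0 / (1.0 + exp(-log_odds))

def odds_multiplier(name):
    """Factor by which the odds change when the feature is present."""
    return exp(COEFFICIENTS[name])

print(predicted_probability([]))                       # no features present
print(predicted_probability(["screenCar.Yes"]))        # one feature present
print(predicted_probability(list(COEFFICIENTS)))       # all five features present
print(odds_multiplier("screenCar.Yes"))
```

Under this assumption, a negative intercept always gives a baseline probability below 50%, and a coefficient of ~1.373 multiplies the odds by roughly 3.9.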

In Figure 32, if the leaf is green, then most individuals in the category have vision aids. If the leaf is blue, most individuals in the category do not have vision aids. The darkness of the color corresponds to the percentage of individuals in the category that display the response indicated by the leaf. The importance of a feature corresponds to its placement on the tree. In other words, the features on the upper branches have more predictive power than features positioned on the lower branches. The features returned were use of a screen in a car, whether or not the subject’s siblings had glasses, time spent outdoors, reading while lying down, reading in low light, and weight greater than 130 pounds. The weight feature is correlated with the height feature, as seen in Figure 29, so when looking at the important variables, they can be paired.
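One way to see why more predictive features end up near the top of a tree is to compare how much each candidate split reduces impurity. The text does not say which criterion the author's tree used, so the Gini impurity below is an assumption, and the labels and features are hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels (0.0 = pure, 0.5 = evenly mixed)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def gini_gain(feature, labels):
    """Impurity reduction from splitting the labels on a binary feature."""
    with_feature = [y for f, y in zip(feature, labels) if f == 1]
    without_feature = [y for f, y in zip(feature, labels) if f == 0]
    n = len(labels)
    weighted = (len(with_feature) / n) * gini(with_feature) \
             + (len(without_feature) / n) * gini(without_feature)
    return gini(labels) - weighted

# Hypothetical data: 1 = has a vision aid.
has_aid = [1, 1, 1, 1, 0, 0, 0, 0]
strong_feature = [1, 1, 1, 0, 0, 0, 0, 0]   # mostly present among subjects with aids
weak_feature = [1, 0, 1, 0, 1, 0, 1, 0]     # unrelated to the response

print(gini_gain(strong_feature, has_aid))
print(gini_gain(weak_feature, has_aid))
```

A tree built greedily on such a criterion would split on the strong feature first, placing it on an upper branch.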

The top five scaled importance values of the features returned by the Distributed Random Forest (DRF) model are shown in Figure 33. They are use of a screen in a car, time spent outdoors, height greater than 5.4 feet, whether or not the subject’s siblings have glasses, and whether the subject’s race is white or non-white. Scaled importance assigns the most predictive feature a value of one and scales the other features down accordingly. Scaled importance values are never negative, so they show how much a feature contributes but not in which direction; the positive GLM coefficients for the same five features (Figure 30) suggest that if a feature is present in an individual, they are more likely to need a vision aid. DRF returned an AUC of ~0.934, which is the highest across all of the models.
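The scaling described above amounts to dividing every raw importance by the largest one. A minimal sketch follows; the raw values are hypothetical, since the text reports only the scaled results in Figure 33.

```python
def scale_importance(raw):
    """Divide each raw importance by the maximum so the top feature equals 1.0."""
    top = max(raw.values())
    return {name: value / top for name, value in raw.items()}

# Hypothetical raw importance values for three features.
raw_importance = {
    "screenCar.Yes": 8.0,
    "timeOutdoors.LE1": 6.0,
    "height.GT54": 4.0,
}

scaled = scale_importance(raw_importance)
for name, value in scaled.items():
    print(f"{name:18s} {value:.2f}")
```

Because raw importances are non-negative, the scaled values always fall between 0 and 1 and preserve the original ordering.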

Figure 34 shows the scaled importance values for the features returned by the Gradient Boosted Machine (GBM). The features returned are use of a screen in a car, whether or not the subject’s siblings have glasses, time spent outdoors, height greater than 5.4 feet, and reading in low light. Scaled importance values are never negative and do not show direction on their own; for the four features shared with the GLM, the positive coefficients in Figure 30 suggest that if the feature is present in an individual, they are more likely to need a vision aid. It is interesting to note that the first two features’ scaled importance values are almost double that of the third. The Gradient Boosted Machine ROC curve is shown in Figure 35. The model returned an AUC of ~0.889, which is fairly accurate.
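For readers unfamiliar with how an ROC curve and its AUC are obtained, the sketch below builds the curve from predicted scores by sweeping a threshold and then integrates it with the trapezoid rule. The labels and scores are hypothetical and are not the model outputs from this study.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))  # (1 - specificity, sensitivity)
    return points

def auc(points):
    """Area under the ROC curve using the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# Hypothetical data: 1 = has a vision aid; scores are predicted probabilities.
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(auc(curve))
```

An AUC of 0.5 corresponds to random guessing and 1.0 to perfect separation, which gives a scale for the values reported for each model.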

Conclusions

Feature                     GLM   Decision Tree   Random Forest   GBM
screenCar.Yes                1          1               1          1
timeOutdoors.LE1             3          2               2          3
siblingsGlasses.Yes          5          3               4          2
height.GT54/weight.GT130     4          5               3          4
race.NonWhite                2          -               5          -
readLowLight.Yes             -          4               -          5
readLyingDown.Yes            -          3               -          -

Table 4: Important Features Across Algorithms

The most predictive features across algorithms are shown in Table 4. The use of a screen in a car ranked first in all algorithms tested. It was followed by time spent outdoors and whether or not the subject’s siblings have glasses. Since height and weight are correlated, they are paired together. When paired, the feature came up as predictive in all algorithms tested. Race and reading habits came up as predictive in some algorithms, so they could show more predictive ability in a larger dataset.
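One simple way to summarize the ranks in Table 4 is to count how many algorithms selected each feature and average its ranks. This aggregation is only an illustration and is not described in the original analysis; the rank lists are copied from Table 4 without regard to which algorithm produced each rank.

```python
# Ranks of each feature in the algorithms that selected it (from Table 4).
RANKS = {
    "screenCar.Yes": [1, 1, 1, 1],
    "timeOutdoors.LE1": [3, 2, 2, 3],
    "siblingsGlasses.Yes": [5, 3, 4, 2],
    "height.GT54/weight.GT130": [4, 5, 3, 4],
    "race.NonWhite": [2, 5],
    "readLowLight.Yes": [4, 5],
    "readLyingDown.Yes": [3],
}

def summarize(ranks):
    """Sort features by number of selecting algorithms (descending), then by mean rank."""
    rows = [(name, len(r), sum(r) / len(r)) for name, r in ranks.items()]
    return sorted(rows, key=lambda row: (-row[1], row[2]))

summary = summarize(RANKS)
for name, n_models, mean_rank in summary:
    print(f"{name:26s} models={n_models} mean_rank={mean_rank:.2f}")
```

With this ordering, the four features selected by every algorithm come first, in the same order the text describes.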

The models achieved AUC values of ~0.765 (GLM), ~0.889 (GBM), and ~0.934 (DRF), so they can be used as an initial filter before a formal eye exam. This is especially important in middle and high schools where yearly eye exams are not conducted. The original hypothesis was partially supported. While time outdoors did have some predictive power, it was not the most predictive.

Future Work

In the future, there are several things that could be done to improve model accuracy and efficiency. For example, more survey responses could be collected, which may improve the AUC. By increasing the number of subjects, other statistics such as confidence intervals could also be computed. Increasing sample and feature variation could also help the model generalize better and reduce overfitting (memorization of the training data). The program could be made more efficient for wide-scale adoption. A mobile application could be developed to make this happen.

IV. ACKNOWLEDGEMENTS

There are several people who helped make this investigation possible:

The survey participants and their parents/guardians – this investigation would not have happened without you. Thank you so much!

Ms. Kindra Smith and Ms. Lynnette Lindesay – thank you so much for your cooperation and help with the survey administration.

Dr. Varsha Sonawane – thank you for being a great mentor through the entire process.

Dr. Ratidzo Macharaga – thank you for taking time out of your busy day to talk to me and provide your professional viewpoint.

My family – this experiment would not be where it is without your support. Thank you!

VI. APPENDIX
