A Project
MASTER OF SCIENCE
in
Computer Science
by
Siddharth Chittora
FALL
2019
© 2019
Siddharth Chittora
A Project
by
Siddharth Chittora
Approved by:
____________________________
Date
Student: Siddharth Chittora
I certify that this student has met the requirements for format contained in the
University format manual, and this project is suitable for electronic submission to the
Abstract
of
by
Siddharth Chittora
especially in the subfields of Machine Learning and Artificial Intelligence. Yet, there is
a noticeable gap in the number of qualified job applicants who can help solve some of the
Computer Science education much earlier, to K-12 students. The hope is that, with an earlier
introduction and better continuous training, students will be properly trained and
interested in studying Computer Science in college and able to obtain these coveted
Given this hope and movement, there need to be more applications and
instructional plans to support K-12 teachers as they bring more Computer Science topics
into the classroom. One such area of Computer Science is Machine Learning. Currently,
there are some tools and curricula available for broader Computer Science areas, but
there is a lack of instructional methods and tools for teaching Machine Learning to High School
Students. In particular, there is a lack of visually interactive lesson plans for high
To solve this problem, I present a visually interactive instructional tool intended
to introduce basic concepts of Machine Learning to High School Students. Studies have
shown that content delivered in a form that is visual and interactive, along with
textual information, is more likely to be retained in memory [1]. The tool is interactive
and also focuses on giving High School Students a taste of the mathematical
background required for Machine Learning and Artificial Intelligence. The datasets and
problems are chosen so that High School Students can relate to
them. For example, instead of solving problems like stock price prediction and network
intrusion detection, I solve problems like Rain or No Rain Prediction and
Temperature Prediction, which can be more interesting for High School Students.
Students are more likely to stay focused and interested if they have a feeling of
achievement after completing a topic, so they start with simpler yet interesting problems
and work their way up. Through this tool, a student will be able to brush up on the basic concepts
of mathematics required for Machine Learning and Artificial Intelligence and on classical
Machine Learning and Artificial Intelligence algorithms like K Nearest Neighbors, Basic
Linear Classification, Linear Regression, Naïve Bayes and others in more interactive
_______________________
Date
ACKNOWLEDGEMENTS
I would like to thank Prof. Anna Baynes, my project advisor, for all the knowledge
and guidance she has provided me throughout this project. She has been very supportive and
I would also like to thank Prof. Haiquan Chen for giving me his time and input
as second reader.
I would also like to thank my parents, Rajendra and Bharti, for their constant
support and love and for making me believe that I can accomplish anything when I put my
mind to it. I would like to thank my sister, Palak, for always reminding me to chase my
dreams.
I would also like to thank my friends for their support and for standing by my side
Lastly, I would like to thank the Department of Computer Science, Dr. Jinsong
Ouyang and Dr. Nikrouz Faroughi for giving me this wonderful opportunity to learn and
grow.
TABLE OF CONTENTS
Chapter
1. INTRODUCTION
2. BACKGROUND
6. CONCLUSION
Bibliography
LIST OF TABLES
LIST OF FIGURES
21. Code Snippet – Tool Tip
Chapter 1: Introduction
ubiquitous and involved in many facets of our lives, can be used to attract more students
many products and applications in our world and our future technological goals; therefore,
it would prove beneficial if High School Students had an understanding of this area.
We can find a wide variety of online platforms [2, 3] and courses [4, 5] for learning and
experimenting with Machine Learning and Artificial Intelligence concepts, but these
resources require students to have a basic understanding of programming and other
related concepts. There is no concrete learning path in the field of Machine
Learning and Artificial Intelligence that caters to High School Students [6, 7], where a student
Learning and see the steps involved in solving simple Machine Learning problems [8, 9].
There can be many reasons for the lack of learning strategies for Machine
Learning and Artificial Intelligence. One of these reasons is the lack of visually interactive
content on Machine Learning and Artificial Intelligence that a High School Student can
comprehend. Most online Machine Learning and Artificial Intelligence courses
expect students to know how to code in a language such as Python or R. Also,
these courses tend to focus almost exclusively on solving the Machine Learning
problem. Other important steps in a Data Science pipeline, like data
collection, visualization, processing and analysis, might not be incorporated into the
Machine Learning lesson [6]. Most High School Students do not know how to code, and
the courses available online can be overwhelming and intimidating. Students need to be
made familiar with the background of the field, including the required mathematical
fundamentals and the languages used, with the help of relevant examples. They need to know
the importance of data and the entire process from collection to analysis. There have been a
few attempts, for example curricula provided to K-12 students for an introduction to
computational thinking, presentations [10] and video lectures, but there is still room for
To address this problem, I present a novel visual interactive tool designed to
deliver basic concepts of Machine Learning and Artificial Intelligence. Studies have shown
that content delivered through a medium that is visual and interactive, along with
textual information, is more likely to be retained in memory [1]. The tool is interactive and
also focuses on giving High School Students a taste of solving a Machine Learning and
Artificial Intelligence problem by explaining the end-to-end process and the steps involved in
the solution. The selection of datasets plays a vital role [7] in creating a tool that caters to
school students. For example, instead of solving problems like stock price prediction and
network intrusion detection, about which an average High School Student is less likely to have
knowledge, I use problems on Weather Data Analysis, Rainfall Prediction, and
Temperature Prediction. Students are more likely to stay focused and interested if they
have a feeling of achievement after completing a topic. They should start with simpler yet
interesting problems and work their way up. Through this tool, a student will be able to
understand basic concepts of Machine Learning and can see a Machine Learning Algorithm
Chapter 2: Background
Learning and Artificial Intelligence is the future, and we need our future workforce, today's
High School Students, to understand the benefits and methods of learning this difficult
topic. In this section, I present related work and the background of attempts to make
Computer Science and Machine Learning more approachable for High School Students.
ECS [13] is an introductory curriculum for K-12 students designed to engage them
in computational thinking and practice. ECS was originally designed for the Los Angeles
Unified School District to broaden participation in computing, particularly for girls and
ECS was designed with the aim of familiarizing High School Students with the
breadth of Computer Science by exploring topics that are engaging and accessible for
students. The curriculum focuses on the concepts of Computer Science and helps students
understand why a certain tool or language is used to address a particular type of problem,
rather than building the entire course around a certain tool or
programming language.
The goal of ECS is to develop and strengthen the concepts of algorithms, problem
solving, and programming by using problems with context that is relevant to students'
lives. The curriculum was developed around Computer Science content and computational
areas where the ECS curriculum falls short is a detailed learning path for Machine Learning. The
curriculum does not offer hands-on projects based on Machine Learning for the students to
There have been a few studies attempting to introduce Data Science to High School
Students. One study [7] proposed conducting workshops in which students work on
Machine Learning examples and learn how simple algorithms work,
focusing on collecting data, processing it and drawing inferences from the predictions.
Another study proposes setting up a Machine Learning Laboratory [14]. That study
primarily focused on the K-Means algorithm, which has limited use cases.
The related work does not focus on creating a solid learning plan that
simplifies the fundamentals of Machine Learning to the level of a High School Student
A visualization in which the designer provides interactive features for
users to explore and identify insights themselves is called an Exploratory Visualization. This
allows the user to understand the data and its possibilities using different visual
representations. These types of visualizations are created with a high level of granularity
and used to explore the different stories the data hides. This form of exploratory analysis is a
convey and narrates it in the form of a story to the viewer is called Explanatory
that the designer wants the viewer to notice, by data processing and data analytics. Complex
problems are broken into smaller subproblems so that it is easier for the viewer to focus
textures along with different color, angle, slope and volume combinations is called visual
encoding. A single concept or piece of content can be expressed in many forms of visual encoding.
Hence, it is critical in the design process to use appropriate visual encodings to convey the
content accurately to the viewers. The best visual encoding allows the viewer to require
The visual design principles provide guidelines for making design choices [15]. The
simplicity of the visualization makes the users understand the content expressed in the
Before building any tool and making design choices, fundamental requirements
need to be gathered. There are three main steps involved in designing a tool: System
Design, Visual Design and Backend Design, each executed to minimize redundancy and
maximize reusability. In this section, I discuss the methodology used to design the
tool that introduces the fundamentals of Machine Learning to High School
Students with the help of interactive visualizations. The tool needs a dataset and
problems that a High School Student can understand. The tool is web-based and
requires an appropriate web framework that can support the Python script that handles
To represent the data received from the Python backend in a format that is easily
visualizations must correctly represent the data for the Machine Learning problem that is
being solved.
The tool consists of two parts: a custom Flask API that performs all the
Machine Learning related tasks and delivers data to the frontend in the form of Comma
Separated Values (CSV), and interactive dashboards that demonstrate the
process of solving Machine Learning problems using the data obtained from the Flask
API. As the tool caters to High School Students, the selection of the dataset and the Machine
Learning problem statement is very important. The tool uses a weather dataset [16] to
demonstrate the process of solving two types of Supervised Learning problems, namely
Classification and Regression. The tool also uses an interactive diagram to explain the Data
Science Pipeline, which shows the end-to-end process of solving a Data Science
problem, from getting the raw data to processing it and drawing inferences from it.
The aim of all the visualizations used in the tool is to give High School Students a
basic understanding of Machine Learning and how it can be used to solve day-to-day
problems in our lives. It is very important to use the appropriate visual encoding to correctly
represent the information conveyed by the data used in solving a problem. There are
3. Line Charts
4. Area Charts
information they are trying to convey. Different charts can be put into four categories
shown in Figure 1:
1. Relationship
2. Comparison
3. Distribution
4. Composition
The CSV files obtained from the backend are the input for the visualization and have data
visualization that is used to exhibit the process of solving a typical Data Science problem.
Unlike a typical static diagram, students can interact with different components of the
diagram and gain insight into the different steps involved in solving a Data Science problem,
shown in Figure 3.
Chart that is used to show how the data is distributed among different labels in the given
Weather dataset. A Bubble Chart has been used here because, for the dataset that I have, it
can clearly represent how the data is distributed. The charts are
accompanied by a legend that tells the student what the colors of the circles mean. The
colors of the circles are chosen so that they correctly represent the information the chart
delivers without distracting the students. The color palette consists of red, pink, light green
The Line Charts and Area Charts are used to represent the timeline of different
weather conditions in the dataset and can be interacted with in different ways. These charts
are part of the Dashboards that are used to demonstrate the steps involved in solving the
two types of Machine Learning problems that I am targeting in this tool. The design process
The initial design of the dashboard had multiple Line Charts side by side, shown
in Figure 6 and Figure 7. This allowed the students to compare the data for different
weather conditions easily. The students could interact with a chart by hovering over the
graph to see the actual weather data in a tool tip. Beneath each Line Chart was an Area
Chart that represented the same data and could be used to zoom into the graphs using
brush interaction. The Area Chart was followed by the legend for the charts. The color
scheme for the legend was carefully chosen to represent each particular weather condition.
Above the charts were three buttons that were used to select the type of data to be
plotted on the charts. Below the legends were sections that displayed the outcomes of the
predictions given by the Machine Learning Models. This design, however, had some
limitations in terms of usability. For example, having three charts alongside each other left no space
for other essential information about model performance to be displayed. Also, the
outcomes were displayed at the bottom, where they could have been misinterpreted by the students.
The reason behind redesigning the UI was space constraints, which add challenges
to the design process. If the entire webpage is filled with visualization, there will be no
space left to receive the viewer’s input or to provide essential information about
the visualization. According to the visual design principles [15], a single webpage should
have the visualization as well as a user information section with clear guidelines for the user to
interact. Considering the space constraint issue, the visual design enforces major decision
i. Pre-attentive Processing
visual depictions where the user will recognize the distinction instantly without giving a
thought.
The final design has major improvements in terms of usability, dashboard UI
and how the information is displayed on the dashboard, which is cleaner looking.
The new Classification dashboard now has three panels, shown in Figure 8. The
left panel consists of three sections: Data, Graph and Models. The Data section is used
to select what type of data is to be displayed. The Graph section is used to switch between
charts for Temperature, Humidity and Rainfall Measure. The Models section is used to
select which Machine Learning Model is to be used for performing prediction.
The center panel is where the graphs are displayed. The charts have the same
interactions as in the previous design. The right panel is where the information is displayed.
The information displayed consists of the output labels and performance metrics of the
The new Regression Dashboard now has two panels, shown in Figure 9. The left
panel has two sections: the Data section, which is used to select the type of data to be displayed,
and the Metrics section, which shows the performance metrics of the Machine Learning Model used
to perform prediction. The right panel is where the charts are placed. There are two charts
alongside each other for a better comparison between Original data and Predicted data.
The graphs in the new design preserve the same interactions as in the previous design.
The backend design involves querying the custom Flask API to fetch the data that
is required for the chart being displayed. Every type of data and Machine Learning
Model that is selected by the student has an assigned function in the API. For every query
that the frontend makes to the API, the weather dataset is accessed, the desired data
processing operations are performed on it, and the result is returned to the part of the frontend the
The resulting JSON object contains details about the weather, which include Date,
Rainfall in millimeters, Rain Today, Rain Tomorrow. All weather details are numeric
apart from Rain Today and Rain Tomorrow. I will use these details to make predictions
The Flask API is used to perform a variety of tasks on the data. These include
fetching the Weather dataset; processing and cleaning it so that the Machine Learning models use
only the part of the data that is required, which reduces the processing time taken by the models; and
normalizing the data, which helps the Machine Learning models converge and
improves their performance. The API is also responsible for training the Machine
Learning Models with appropriate data and returning the predictions to the frontend.
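The normalization step mentioned above can be illustrated with a minimal sketch. It assumes min-max scaling, which is only one common choice; the report does not state which normalization the API applies, and the function name and sample temperatures below are hypothetical.

```python
def min_max_normalize(values):
    """Rescale a list of numbers to the range [0, 1].

    Assumption: min-max scaling; the tool may use a different method.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical daily temperatures in degrees Celsius.
temperatures = [10.0, 15.0, 20.0, 30.0]
normalized = min_max_normalize(temperatures)
```

After scaling, features measured in different units (for example temperature and humidity) share the same range, so no single feature dominates a distance or gradient computation.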
As the tool is web-based, special care was taken in selecting the dataset, the Machine Learning
problems to be solved, and the models used to solve these problems, so that
the students have a great learning experience and less wait time. I have selected two easy
In this section, I discuss the coding phase and implementation details, along with the
tools and platforms used. The data visualizations, which are the main
focus of the application, are built using D3.js along with
JavaScript. The UI is created using HTML5, CSS3 and Bootstrap. The custom Flask API,
which is queried for the data used to create the interactive
visualizations, derives its power from Python Data Science libraries like Sci-Kit Learn,
The Data Visualizations are built keeping in mind the design decisions made earlier.
I have three types of charts, Bubble Chart, Line Chart and Area Chart, which are used to
represent the data. The Line Charts, Area Charts and the Data Science Pipeline diagram
are created using D3.js Path elements. The Bubble Chart uses D3.js Force simulation
4.1.1. Path
D3 Scalable Vector Graphics (SVG) Paths are used to construct the Data Science
Pipeline diagram and the Line Chart. In a Line Chart, the line paths are used to represent
the trends of weather phenomena. SVG Paths can be used to create a variety of design
elements, like rectangles, circles, ellipses, polylines, polygons, straight lines and curves.
The shape of an SVG Path element is defined by one attribute: d. The SVG Path Mini-Language
contains a series of commands and parameters that define the attribute d. These
commands and parameters, shown in Table 1, are a sequential set of instructions for how
Command | Parameters | Repeatable | Explanation
M (m) | x, y | Yes | moveto: Move the pen to a new location. No line is drawn. All path data must begin with a 'moveto' command.
L (l) | x, y | Yes | lineto: Draw a line from the current point to the point (x, y).
H (h) | x | Yes | horizontal-lineto: Draw a horizontal line from the current point to x.
V (v) | y | Yes | vertical-lineto: Draw a vertical line from the current point to y.
C (c) | x1 y1 x2 y2 x y | Yes | curveto: Draw a cubic Bézier curve from the current point to the point (x, y) using (x1, y1) as the control point at the beginning of the curve and (x2, y2) as the control point at the end of the curve.
A (a) | rx ry x-axis-rotation large-arc-flag sweep-flag x y | Yes | elliptical-arc: Draw an elliptical arc from the current point to (x, y). The size and orientation of the ellipse are defined by two radii (rx, ry) and an x-axis-rotation, which indicates how the ellipse as a whole is rotated relative to the current SVG coordinate system. The center (cx, cy) of the ellipse is calculated automatically to satisfy the constraints imposed by the other parameters. large-arc-flag and sweep-flag contribute to the automatic calculations and help determine how the arc is drawn.
The Data Science Pipeline has been constructed using the Arc command of SVG Path
The code shown in Figure 11 is an alternative way to create
a straight line.
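To make the path commands concrete, the sketch below builds the string that goes into the d attribute of a straight-line path, using only the M (moveto) and L (lineto) commands. It is written in Python purely for illustration; the helper name and coordinates are hypothetical, and it is not the code from Figure 11.

```python
def line_path(points):
    """Return an SVG path 'd' string joining points with straight lines.

    points: a sequence of (x, y) pairs in SVG pixel coordinates.
    """
    if not points:
        return ""
    (x0, y0), rest = points[0], points[1:]
    # Path data must start with a moveto command.
    commands = [f"M{x0},{y0}"]
    # Each lineto draws a segment from the current point to (x, y).
    commands.extend(f"L{x},{y}" for x, y in rest)
    return "".join(commands)


# A line from (10, 10) to (200, 10), then down to (200, 80).
d = line_path([(10, 10), (200, 10), (200, 80)])
```

The resulting string can be assigned to the d attribute of an SVG path element. A line generator automates the same idea: it turns a list of data points into one path string.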
The Line Charts are created using SVG Line Generators, shown in Figure 12. Here,
.datum is used to specify which data is used to create the lines. The Line Generator is defined
using d3.line(). The d attribute uses the line generator, which
takes x and y as parameters, to plot the line on the SVG canvas and automatically gets the x1, y1, x2, y2 values from
4.1.2. Area
To create an Area Chart, I have used the SVG Area Generator, shown in Figure 13. I
have used the same data as for the Line Chart and represented it in a different manner. An SVG area is
created using an SVG path, implemented by filling the area under the graph with a selected
color, which gives the impression of an area element. This is done by SVG Area
4.1.3. Circle
I have used circles in the Bubble Chart, where each circle represents an entry in the
weather dataset I am using. The color of each circle is decided by the Rain Today and Rain
Tomorrow values of its record, shown in Figure 15. The circles can be interacted with using D3
Force simulation, which is discussed in the next section. The center coordinates cx and cy
4.1.4. Force
To animate the movement of the circles in the Bubble Chart upon interaction, I have
used D3 Force Simulation. I used the Collide force offered by D3 Force Simulation to create
the simulation. I manipulated the x and y values of the centers of the circles using two
positioning forces of D3 Force Simulation, forceX and forceY, shown in Figure 16 and
Figure 17.
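The idea behind a positioning force can be sketched in a few lines: on every tick, each circle is nudged a fraction of the way toward a target coordinate, so circles with the same class drift to the same place. The sketch below is a simplified, hypothetical Python illustration of that idea, not D3's implementation; it omits velocity, decay, and collision handling, and the targets, strength, and labels are made up.

```python
def step_toward_targets(xs, targets, strength=0.1):
    """Move each x a fraction `strength` of the way to its target."""
    return [x + (t - x) * strength for x, t in zip(xs, targets)]


# Hypothetical class labels and per-class target x positions (pixels).
labels = ["Rain", "No Rain", "Rain"]
target_x = {"Rain": 200.0, "No Rain": 600.0}

xs = [400.0, 400.0, 400.0]  # all circles start at the center
targets = [target_x[label] for label in labels]

for _ in range(100):  # run a fixed number of ticks
    xs = step_toward_targets(xs, targets)
```

Because each tick covers only a fraction of the remaining distance, the circles slow down as they approach their targets, which is what produces the smooth separating motion.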
4.1.5. Brush
The Area Charts can be interacted with using the Brush interaction offered by D3.js. It
is used to zoom into the Line Chart and Area Chart. It is implemented by first defining the
clip path, shown in Figure 18 and Figure 19, which makes sure that nothing is plotted
outside the defined area. Then I added the brush interaction over its specified area, shown
in Figure 20.
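Brushing for zoom comes down to converting the selected pixel range back into data values and using them as the new domain of the x scale. The following sketch shows that conversion for a linear scale; the function names, pixel range, and domain are hypothetical, and the tool's charts may use a time scale instead.

```python
def make_linear_scale(domain, pixel_range):
    """Return (scale, invert) functions for a linear axis."""
    (d0, d1), (r0, r1) = domain, pixel_range

    def scale(value):
        # Data value -> pixel position.
        return r0 + (value - d0) * (r1 - r0) / (d1 - d0)

    def invert(pixel):
        # Pixel position -> data value.
        return d0 + (pixel - r0) * (d1 - d0) / (r1 - r0)

    return scale, invert


# Hypothetical axis: day 0..100 drawn across 0..500 pixels.
scale, invert = make_linear_scale((0, 100), (0, 500))

# The user brushes from pixel 100 to pixel 250.
brushed_pixels = (100, 250)
new_domain = tuple(invert(p) for p in brushed_pixels)
```

Redrawing the chart with the new domain stretches the brushed interval across the full width, and the clip path hides any part of the line that now falls outside the plotting area.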
A tool tip is used to display more information about the data when a user is
interacting with the visualizations. Here, I use a tool tip for the Line Charts when the
user hovers the mouse over the chart area, shown in Figure 21.
In this section I will discuss the backend implementation and how it is used to get
the desired results. I will discuss how the Flask API is designed and how Sci-Kit Learn
libraries. It is used in this tool to serve as the backbone. The Flask API is used in the tool to
perform all the data related tasks, like cleaning, processing and performing predictions.
Every request that the frontend can make to the API has a unique URL that triggers
the associated function in the API. The function imports the data, performs the requested
task and returns the results in JSON format. The frontend reads the JSON response from the
API and uses the received data to display the requested results.
Below are some examples of how the Flask API is used to fetch Original Data,
shown in Figure 22, and Normalized Data, shown in Figure 23, in order to be displayed in
the charts.
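The request flow described in this section can be sketched without any web framework: a table maps each URL to a function, and the function's result is serialized to JSON. This is an illustration of the pattern only, not the tool's Flask code; the URLs, function names, field names, and sample records are all hypothetical.

```python
import json

# Hypothetical stand-in for the weather dataset.
WEATHER = [
    {"date": "2019-01-01", "temperature": 12.0},
    {"date": "2019-01-02", "temperature": 18.0},
]


def original_data():
    """Return the records unchanged."""
    return WEATHER


def normalized_data():
    """Return records with temperature rescaled to [0, 1]."""
    temps = [row["temperature"] for row in WEATHER]
    lo, hi = min(temps), max(temps)
    return [
        {**row, "temperature": (row["temperature"] - lo) / (hi - lo)}
        for row in WEATHER
    ]


# Each URL triggers exactly one function.
ROUTES = {
    "/data/original": original_data,
    "/data/normalized": normalized_data,
}


def handle_request(url):
    """Look up the function for a URL and return its result as JSON."""
    handler = ROUTES.get(url)
    if handler is None:
        return json.dumps({"error": "unknown URL"})
    return json.dumps(handler())


response = handle_request("/data/normalized")
rows = json.loads(response)  # what the frontend would parse
```

In a web framework, the routing table is replaced by the framework's own URL registration, but the division of labor stays the same: one URL, one function, one JSON response.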
Sci-Kit Learn (SKLearn) is an easy-to-use Machine Learning library for Python
that features various classification, regression and clustering algorithms, including Support
Vector Machine, Naïve Bayes, Linear Regression and Logistic Regression, which have been used
in the tool. These Machine Learning algorithms are accessed through the Flask API. Along with
various Machine Learning algorithms, SKLearn also offers different metrics to assess the
performance of the algorithms used. A classification algorithm, shown in Figure 24, can be
assessed on metrics like Accuracy, Precision, Recall and Confusion Matrix, of which Accuracy
and Precision have been used, shown in Figure 25. A regression algorithm, shown in Figure
26, is mainly assessed on Root Mean Squared Error and R² score, shown in Figure 27.
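For reference, the four metrics named above can be computed by hand. The sketch below uses plain Python rather than SKLearn so that each formula is visible; the labels and values are hypothetical, and precision is computed for the positive class only.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive=1):
    """Of the samples predicted positive, the fraction that truly are."""
    predicted_pos = [t for t, p in zip(y_true, y_pred) if p == positive]
    if not predicted_pos:
        return 0.0
    return sum(t == positive for t in predicted_pos) / len(predicted_pos)


def rmse(y_true, y_pred):
    """Square root of the mean squared difference."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2_score(y_true, y_pred):
    """1 minus (residual sum of squares / total sum of squares)."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Hypothetical classification labels: 1 = rain, 0 = no rain.
rain_true = [1, 0, 1, 1, 0]
rain_pred = [1, 0, 0, 1, 1]

# Hypothetical regression values: temperatures in degrees Celsius.
temp_true = [20.0, 22.0, 24.0]
temp_pred = [21.0, 22.0, 23.0]

acc = accuracy(rain_true, rain_pred)
prec = precision(rain_true, rain_pred)
error = rmse(temp_true, temp_pred)
r2 = r2_score(temp_true, temp_pred)
```

Accuracy and precision answer different questions: accuracy counts every correct prediction, while precision looks only at the days the model predicted rain.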
In this section, I discuss how to interact with the Data Science Pipeline
Diagram, shown in Figure 28. The diagram has three main areas for interaction: the diagram
itself, an example list of areas where Machine Learning finds application, and the “Next”
button used to navigate the diagram. I start by selecting from the list of applications, which
gives context to the interaction flow. Then, by clicking “Next”, I can see the blocks that
represent data move into the pipeline. Each time the “Next” button is clicked, I am taken to the next step
in the Data Science problem-solving process. The interaction ends with the diagram showing
the Insights according to the application selected in the beginning, shown in Figure 29.
In this section, I discuss how to interact with the Data Distribution Chart, shown
in Figure 30. The Bubble Chart can be interacted with by clicking on the chart area. A single
click separates the bubbles according to the classes they belong to, shown in Figure 31, and
double-clicking brings the chart back to its original state. It is an Explanatory
Visualization, where the user can learn how the data is distributed among the
classes Rain Today and Rain Tomorrow. The chart is accompanied by a legend that helps
The Classification Dashboard, shown in Figure 32, is used to explore the process
of solving the Rain or No Rain problem. Using the weather dataset, and based on
daily weather conditions like Temperature, Humidity, Rainfall Measure and whether or
not it rained today, I predict whether or not it will rain tomorrow.
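As an illustration of how daily conditions can be turned into a rain prediction, the sketch below uses a K Nearest Neighbors vote. K Nearest Neighbors is chosen here only because it is simple to read; the dashboard itself offers Support Vector Machine, Naïve Bayes and Logistic Regression. The function name, feature values, and labels are hypothetical and already normalized to the range 0 to 1.

```python
import math
from collections import Counter


def knn_predict(train_rows, train_labels, query, k=3):
    """Predict a label by majority vote among the k nearest rows."""
    distances = [
        (math.dist(row, query), label)
        for row, label in zip(train_rows, train_labels)
    ]
    distances.sort(key=lambda pair: pair[0])
    nearest = [label for _, label in distances[:k]]
    return Counter(nearest).most_common(1)[0][0]


# Hypothetical rows: (temperature, humidity, rained today).
train_rows = [
    (0.9, 0.2, 0),
    (0.8, 0.3, 0),
    (0.3, 0.9, 1),
    (0.2, 0.8, 1),
    (0.4, 0.7, 1),
]
# Hypothetical answers to "will it rain tomorrow?" for each row.
train_labels = ["No Rain", "No Rain", "Rain", "Rain", "Rain"]

cool_humid_day = (0.25, 0.85, 1)
hot_dry_day = (0.85, 0.25, 0)

prediction_a = knn_predict(train_rows, train_labels, cool_humid_day)
prediction_b = knn_predict(train_rows, train_labels, hot_dry_day)
```

Each new day is compared with past days, and the most common outcome among the most similar past days becomes the prediction.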
The Dashboard has three panels. The left panel is used to explore different forms
of data (original, normalized), types of graph (temperature, humidity and rainfall measure)
and the algorithm used for prediction (Support Vector Machine, Naïve Bayes, Logistic
Regression). The user can perform prediction by selecting one of the Machine Learning
Models from the Models list and then clicking “Predict” in the Data section.
The center panel is used to display the charts and legends, shown in Figure 33. The
right panel is used to display the Rain Today and Rain Tomorrow labels, the prediction given by
the Machine Learning algorithms and the performance metrics of the selected Machine
Learning algorithm, shown in Figure 34. The interaction begins with clicking “Loading
Data” in the Data section in the left panel. This generates the graph in the center panel which
Temperature Prediction problem. I will be using the weather dataset and based on the daily
The dashboard is divided into two sections. The left panel is used to select the type
of data to be displayed and display information about the regression algorithm used for
performing prediction. The right panel is used to display the charts and legends, which can
Chapter 6: Conclusion
6.1. Summary
The goal of this project is to develop a concrete learning plan to introduce Machine
Learning to High School Students using an interactive visual tool. The project introduces
a novel visual and interactive tool that serves as a possible solution to the problem this
project aims to solve. This web-based tool can be used for both learning and teaching:
students can use it to explore the area of Machine Learning, and teachers can use it as
The user can have a consolidated understanding of basic Machine Learning fundamentals
The tool is complete in itself, but there is always some scope for enhancements
depending upon the needs of the user. The following is a list of some upgrades that can be
1) Increase user interaction with the tool by implementing the following features:
b) Enter different data for prediction apart from the data in the dataset. This can be
done by having different fields to collect input from the user and using it to produce
the predictions.
2) Add sections to introduce the fundamentals of Neural Networks. To add
support for Neural Networks, lightweight and efficient models should be used, as the
tool is web-based.
3) There is a need to perform a Usability Test of the tool with the target users, to gather data
Bibliography
3. S. Yee and T. Chu, “Model Tuning and the Bias Variance tradeoff,” [Online].
Available: http://www.r2d3.us/visual-intro-to-machine-learning-part-2 [Accessed:
January 2019].
10. J. B. Gordon, “Machine Learning for High School Students,” [Online]. Available:
http://www.cs.columbia.edu/~CS4HS/talks/ml_for_hs.pdf [Accessed: February
2019].
15. T. Kei, “Principles and elements of visual design: A review of the literature on visual
design of instructional materials,” Educational Studies, vol. 57, International
Christian University, pp. 167-174, April 2015.
17. R. Orban, C. Saden and J. Dinu, “Data Visualization and D3.js,” [Online]. Available:
https://www.udacity.com/course/data-visualization-and-d3js--ud507 [Accessed: June
2019].