
MACHINE LEARNING FOR HIGH SCHOOL STUDENTS

A Project

Presented to the faculty of the Department of Computer Science

California State University, Sacramento

Submitted in partial satisfaction of


the requirements for the degree of

MASTER OF SCIENCE

in

Computer Science

by

Siddharth Chittora

FALL
2019
© 2019

Siddharth Chittora

ALL RIGHTS RESERVED


MACHINE LEARNING FOR HIGH SCHOOL STUDENTS

A Project

by

Siddharth Chittora

Approved by:

__________________________________, Committee Chair


Dr. Anna Baynes

__________________________________, Second Reader


Dr. Haiquan Chen

____________________________
Date

Student: Siddharth Chittora

I certify that this student has met the requirements for format contained in the

University format manual, and this project is suitable for electronic submission to the

library and credit is to be awarded for the project.

__________________________, Graduate Coordinator ___________________


Dr. Jinsong Ouyang Date

Department of Computer Science

Abstract

of

MACHINE LEARNING FOR HIGH SCHOOL STUDENTS

by

Siddharth Chittora

We live in a time that is witnessing a boom in Computer Science, especially in the subfields of Machine Learning and Artificial Intelligence. Yet there is a noticeable shortage of qualified job applicants who can help solve the Machine Learning problems of the day. There is a current movement to introduce Computer Science education much earlier, to K-12 students. The hope is that, with an earlier introduction and better continuous training, students will be well prepared for and interested in studying Computer Science in college, and able to obtain these coveted positions after they graduate.

Given this hope and movement, there need to be more applications and instructional plans to support K-12 teachers as they bring more Computer Science topics into the classroom. One such area of Computer Science is Machine Learning. Some tools and curricula are currently available for broader Computer Science areas, but there is a lack of instructional methods and tools for teaching Machine Learning to High School Students. In particular, there is a lack of visually interactive lesson plans through which high schoolers can learn basic concepts in Machine Learning.

To solve this problem, I present a visually interactive instructional tool intended to introduce basic concepts of Machine Learning to High School Students. Studies have shown that content delivered in a form that is visual and interactive, along with textual information, is more likely to be retained in memory [1]. The tool is interactive and also gives High School Students a taste of the mathematical background required for Machine Learning and Artificial Intelligence. The datasets and problems are chosen so that High School Students can relate to them: instead of solving problems like stock price prediction and network intrusion detection, I solve problems like Rain and No Rain Prediction and Temperature Prediction, which can be more interesting for High School Students. Students are more likely to stay focused and interested if they have a feeling of achievement after completing a topic, so they start with simpler yet interesting problems and work their way up. Through this tool a student will be able to brush up on the basic mathematical concepts required for Machine Learning and Artificial Intelligence and learn classical Machine Learning and Artificial Intelligence algorithms like K Nearest Neighbors, Basic Linear Classification, Linear Regression, Naïve Bayes and others in more interactive ways through visualizations such as graphs and charts.

_______________________, Committee Chair


Dr. Anna Baynes

_______________________
Date

ACKNOWLEDGEMENTS

I would like to thank Prof. Anna Baynes, my project advisor, for all the knowledge and guidance she has provided me through this project. She has been very supportive and has led me with care and caution throughout the project.

I would also like to thank Prof. Haiquan Chen for giving me his time and input as second reader.

I would also like to thank my parents, Rajendra and Bharti, for their constant support and love and for making me believe that I can accomplish anything when I put my mind to it. I would like to thank my sister, Palak, for always reminding me to chase my dreams.

I would also like to thank my friends for their support, for standing by my side in the worst times, and for lifting me up.

Lastly, I would like to thank the Department of Computer Science, Dr. Jinsong Ouyang and Dr. Nikrouz Faroughi for giving me this wonderful opportunity to learn and grow.

TABLE OF CONTENTS

Acknowledgments

List of Tables

List of Figures

Chapter

1. INTRODUCTION

2. BACKGROUND

3. DESIGN CHOICE

4. IMPLEMENTATION AND DEVELOPMENT

5. TOOL WALKTHROUGH

6. CONCLUSION

Bibliography

LIST OF TABLES

Tables

1. SVG Path Commands and Parameters

LIST OF FIGURES

Figures

1. Types of Charts Based on Visual Encodings
2. Data Science Pipeline Diagram
3. Data Science Pipeline Diagram (After Interaction)
4. Data Distribution Chart
5. Data Distribution Chart (After Interaction)
6. Initial Classification Dashboard
7. Initial Regression Dashboard
8. Improved Classification Dashboard
9. Improved Regression Dashboard
10. Code Snippet – SVG Path
11. Code Snippet – SVG Line
12. Code Snippet – SVG Line Generator
13. Code Snippet – SVG Area Generator
14. Code Snippet – Implementing SVG Area Path
15. Code Snippet – SVG Circle
16. Code Snippet – D3 Force Simulation
17. Code Snippet – ForceX and ForceY
18. Code Snippet – Defining D3 Brush
19. Code Snippet – Defining Clip Path
20. Code Snippet – Adding Brush
21. Code Snippet – Tool Tip
22. Code Snippet – Original Data
23. Code Snippet – Normalized Data
24. Code Snippet – SKLearn Naïve Bayes
25. Code Snippet – Classification Metrics
26. Code Snippet – SKLearn Linear Regression
27. Code Snippet – Regression Metrics
28. Data Science Pipeline – Beginning – Healthcare Industry
29. Data Science Pipeline – Completed – Healthcare Industry
30. Data Distribution Chart - Initial State
31. Data Distribution Chart - After Interaction
32. Classification Dashboard - Initial State
33. Classification Dashboard - Data Loaded
34. Classification Dashboard - After Prediction
35. Regression Dashboard - Initial State
36. Regression Dashboard - After Prediction


Chapter 1: Introduction

Machine Learning is a growing sub-field of Computer Science. Because it is ubiquitous and involved in many facets of our lives, it can be used to attract more students towards Computer Science. Machine Learning and Artificial Intelligence are present in many products and applications in our world and in our future technological goals; therefore, it would be beneficial for High School Students to have an understanding of this area. A wide variety of online platforms [2, 3] and courses [4, 5] exist for learning and experimenting with Machine Learning and Artificial Intelligence concepts, but these resources require students to have a basic understanding of programming and other related concepts. There is no concrete learning path in the field of Machine Learning and Artificial Intelligence that caters to High School Students [6, 7], where a student with no prior programming experience can understand the fundamentals of Machine Learning and see the steps involved in solving a simple Machine Learning problem [8, 9].

There can be many reasons for the unavailability of learning strategies for Machine Learning and Artificial Intelligence. One of these reasons is the lack of visually interactive content on Machine Learning and Artificial Intelligence that a High School Student can comprehend. Most online Machine Learning and Artificial Intelligence courses expect students to know how to code in a required language like Python or R. Also, these courses tend to emphasize only the part of solving the Machine Learning problem. Other important steps of a Data Science pipeline, like data collection, visualization, processing and analysis, might not be incorporated into the Machine Learning lesson [6]. Most High School Students do not know how to code, and the courses available online can be overwhelming and intimidating. Students need to be made familiar with the background of the field, including the required mathematical fundamentals and the languages used, with the help of relevant examples. They need to know the importance of data and the entire process from collection to analysis. There have been a few attempts, for example curricula provided to K-12 students for an introduction to computational thinking, presentations [10] and video lectures, but there is still room for improvement [11, 12].

To address this problem, I present a novel visual interactive tool that delivers basic concepts of Machine Learning and Artificial Intelligence. Studies have shown that content delivered through a medium that is visual and interactive, along with textual information, is more likely to be retained in memory [1]. The tool is interactive and also gives High School Students a taste of solving a Machine Learning and Artificial Intelligence problem by explaining the end-to-end process and the steps involved in the solution. The selection of data sets plays a vital role [7] in creating a tool that caters to school students. For example, instead of solving problems like stock price prediction and network intrusion detection, about which an average High School Student is less likely to have knowledge, I use problems on Weather Data Analysis, Rainfall Prediction, and Temperature Prediction. Students are more likely to stay focused and interested if they have a feeling of achievement after completing a topic. They should start with simpler yet interesting problems and work their way up. Through this tool a student will be able to understand basic concepts of Machine Learning and see a Machine Learning algorithm in action by interacting with visualizations such as graphs and charts.



Chapter 2: Background

Machine Learning is a growing subfield of Computer Science (CS). Machine Learning and Artificial Intelligence are the future, and we need our future workforce, today's High Schoolers, to understand the benefits and methods of learning this difficult topic. In this section, I present related work and the background of attempts to make Computer Science and Machine Learning more approachable for High School Students.

2.1. Exploring Computer Science (ECS):

ECS [13] is an introductory curriculum for K-12 students designed to engage students in computational thinking and practice. ECS was originally designed for the Los Angeles Unified School District to broaden participation in computing, particularly for girls and students of color. After its initial success it gained national prominence.

ECS was designed to familiarize High School Students with the breadth of Computer Science by exploring topics that are engaging and accessible for students. The curriculum focuses on the concepts of Computer Science and helps students understand why a certain tool or language is used to address a particular type of problem, rather than building the entire course around a single tool or programming language.

The goal of ECS is to develop and strengthen the concepts of algorithms, problem solving, and programming by using problems whose context is relevant to students' lives. The curriculum was developed around Computer Science content and computational practices so that students get a feel for what computer scientists do.


Although it gives a sound understanding of Computer Science fundamentals, the ECS curriculum lacks a detailed learning path for Machine Learning. The curriculum does not offer hands-on Machine Learning projects in which students can try and experiment with different Machine Learning algorithms.

2.2. Related Research:

There have been a few studies attempting to introduce Data Science to High School Students. One study [7] proposed conducting workshops in which students work on Machine Learning examples and learn how simple algorithms work, focusing on collecting data, processing it and drawing inferences from the predictions. Another study proposes setting up a Machine Learning Laboratory [14]. That study primarily focused on the K-Means algorithm, which has limited use cases.

The related work I found does not focus on creating a solid learning plan that simplifies the fundamentals of Machine Learning to the level of a High School Student who has a limited background in the area.

2.3. Exploratory Visualizations:

A visualization in which the designer provides interactive features so that users can explore and identify insights themselves is called an Exploratory Visualization. It allows the user to understand the data and its possibilities using different visual representations. These visualizations are created with a high level of granularity and are used to explore the different stories the data hides. This form of exploratory analysis is a part of the quantitative analysis of the data.

2.4. Explanatory Visualization:

A visualization in which the designer has an idea of the message to convey and narrates it to the viewer in the form of a story is called an Explanatory Visualization. To create these visualizations, the designer uses data processing and data analytics to extract the information that the viewer should notice. Complex problems are broken into smaller sub-problems so that it is easier for the viewer to focus on the important information in the data.

2.5. Visual Encoding:

The process of creating visualizations by combining different shapes and textures with different colors, angles, slopes and volumes is called visual encoding. A single concept or piece of content can be expressed in many forms of visual encoding. Hence, it is critical in the design process to use appropriate visual encodings that convey the content accurately to viewers. The best visual encoding requires minimal effort from the viewer to understand and analyze the visualization.

The visual design principles provide guidelines for making design choices [15]. A simple visualization helps users better understand the content it expresses.



Chapter 3: Design Choice

Before making design choices for any tool, its fundamental requirements need to be gathered. There are three main steps involved in designing a tool: System Design, Visual Design and Backend Design, each executed to minimize redundancy and maximize reusability. In this section, I discuss the methodology used to design the interactive visualizations of the educational Machine Learning tool.

3.1. System Requirement Analysis

The problem statement for this system’s design is to build an interactive teaching tool that introduces the fundamentals of Machine Learning to High School Students with the help of interactive visualizations. The tool needs to use a dataset and solve problems that a High School Student can understand. The tool will be web-based and requires an appropriate web framework that can support the Python script that handles the Machine Learning tasks.

To represent the data received from the Python backend in a format that is easily understood by a High School Student, a novel visualization tool is required. The visualizations must correctly represent the data for the Machine Learning problem that is being solved.

3.2. System Design

The tool consists of two parts: a custom Flask API that performs all the Machine Learning related tasks and delivers data to the frontend in the form of Comma Separated Values (CSV), and interactive dashboards that demonstrate the process of solving Machine Learning problems using the data obtained from the Flask API. As the tool caters to High School Students, the selection of the dataset and the Machine Learning problem statement is very important. The tool uses a weather dataset [16] to demonstrate the process of solving two types of Supervised Learning problems, namely Classification and Regression. The tool also uses an interactive diagram to explain the Data Science Pipeline, which shows the end-to-end process of solving a Data Science problem, from getting the raw data to processing it and drawing inferences from it.

3.3. Visual Design

The aim of all the visualizations used in the tool is to give High School Students a basic understanding of Machine Learning and how it can be used to solve day-to-day problems in our lives. It is very important to use the appropriate visual encoding to correctly represent the information conveyed by the data used in solving a problem. There are four main types of visualizations in the tool:

1. Data Science Pipeline Diagram

2. Data Distribution Chart

3. Line Charts

4. Area Charts

Each of these visualizations has different visual encodings based on the information it is trying to convey. Charts can be grouped into four categories, shown in Figure 1:

1. Relationship

2. Comparison

3. Distribution

4. Composition

Figure 1: Types of Charts Based on Visual Encodings [17]



The CSV files obtained from the backend are the input for the visualizations and contain data related to comparison and distribution.

The Data Science Pipeline diagram, shown in Figure 2, is an interactive visualization that exhibits the process of solving a typical Data Science problem. Unlike a typical static diagram, students can interact with different components of the diagram and gain insight into the different steps involved in solving a Data Science problem, as shown in Figure 3.

Figure 2: Data Science Pipeline Diagram



Figure 3: Data Science Pipeline Diagram (After Interaction)

The Data Distribution Chart, shown in Figures 4 and 5, is an interactive Bubble Chart that shows how the data is distributed among different labels in the given Weather dataset. I chose a Bubble Chart because, for this dataset, it clearly represents how the data is distributed. The chart is accompanied by a legend that tells the student what the colors of the circles mean. The colors of the circles are chosen so that they correctly represent the information the chart delivers without distracting the students. The color palette consists of red, pink, light green and green, and is intended to be self-explanatory.



Figure 4: Data Distribution Chart

Figure 5: Data Distribution Chart (After Interaction)

The Line Charts and Area Charts represent the timeline of different weather conditions in the dataset and can be interacted with in different ways. These charts are part of the Dashboards that demonstrate the steps involved in solving the two types of Machine Learning problems that I am targeting in this tool. The design process of the dashboards had two stages.

The initial design of the dashboard had multiple Line Charts side by side, shown in Figure 6 and Figure 7. This allowed students to compare the data for different weather conditions easily. Students could interact with a chart by hovering over the graph to see the actual weather data in a tool tip. Beneath each Line Chart was an Area Chart that represented the same data and could be used to zoom into the graphs using brush interaction. The Area Chart was followed by the legend for the charts. The color scheme for the legend was carefully chosen to represent particular weather conditions.

Above the charts were three buttons that were used to select the type of data to be plotted on the charts. Below the legends were sections that displayed the outcomes of the predictions given by the Machine Learning Models. This design, however, had some usability limitations: having three charts alongside each other left no space to display other essential information about model performance. Also, the outcomes were displayed at the bottom, where they could have been misinterpreted by the students.

Figure 6: Initial Classification Dashboard

Figure 7: Initial Regression Dashboard

The UI was redesigned because of space constraints, which add challenges to the design process. If the entire webpage is filled with visualizations, no space is left to receive the viewer’s input or to provide essential information about the visualization. According to the visual design principles [15], a single webpage should have the visualization as well as a user information section with clear guidelines for the user to interact. Given the space constraints, the major design decisions were based on visual perception principles.

i. Pre-attentive Processing

Pre-attentive processing is a principle of visual perception that helps in designing distinct visual depictions in which the user recognizes the distinction instantly, without conscious thought.

The final design has major improvements in usability, in the dashboard UI and in how the information is displayed on the dashboard, which looks cleaner.

Figure 8: Improved Classification Dashboard

The new Classification dashboard has three panels, shown in Figure 8. The left panel consists of three sections: Data, Graph and Models. The Data section is used to select what type of data is displayed. The Graph section is used to switch between charts for Temperature, Humidity and Rainfall Measure. The Models section is used to select which Machine Learning Model is used for prediction.

The center panel is where the graphs are displayed. The charts have the same interactions as in the previous design. The right panel is where the information is displayed. This information consists of the output labels and the performance metrics of the models used for prediction.



Figure 9: Improved Regression Dashboard

The new Regression Dashboard has two panels, shown in Figure 9. The left panel has two sections: a Data section, which is used to select the type of data to be displayed, and a Metrics section, which shows the performance metrics of the Machine Learning Model used for prediction. The right panel is where the charts are placed. There are two charts alongside each other for a better comparison between Original data and Predicted data. The graphs in the new design preserve the same interactions as in the previous design.

3.4. Backend Design

The backend design involves querying the custom Flask API to fetch the data that is required for the chart being displayed. Every type of data and every Machine Learning Model that the student can select has an assigned function in the API. For every query that the frontend makes to the API, the weather dataset is accessed, the desired data processing operations are performed on it, and the result is returned as a JSON object to the part of the frontend that made the request.

The resulting JSON object contains details about the weather, which include Date, Minimum Temperature, Maximum Temperature, Humidity at 9 am, Humidity at 3 pm, Rainfall in millimeters, Rain Today and Rain Tomorrow. All weather details are numeric apart from Rain Today and Rain Tomorrow. I use these details to make predictions according to the problem being addressed.
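
To make the shape of such a response concrete, the following is a minimal sketch of how a frontend could parse it. The key names, the "Yes"/"No" encoding of the two rain fields and every value are assumptions made for illustration; the text only lists which details are present, not how they are spelled in the tool's API.

```javascript
// Hypothetical payload: key names, the "Yes"/"No" encoding and all values
// are made up for illustration and are not taken from the tool's API.
const payload = `[
  {"Date": "2017-06-01", "MinTemp": 8.5, "MaxTemp": 19.2,
   "Humidity9am": 71, "Humidity3pm": 45, "Rainfall": 0.0,
   "RainToday": "No", "RainTomorrow": "Yes"}
]`;

// The numeric details listed in the text; the two rain fields are categorical.
const NUMERIC_FIELDS = ["MinTemp", "MaxTemp", "Humidity9am", "Humidity3pm", "Rainfall"];

// Convert the raw JSON text into records that are ready for plotting.
function parseWeather(jsonText) {
  return JSON.parse(jsonText).map(row => {
    const record = { date: new Date(row.Date) };
    for (const field of NUMERIC_FIELDS) {
      record[field] = Number(row[field]);
    }
    record.rainToday = row.RainToday === "Yes";
    record.rainTomorrow = row.RainTomorrow === "Yes";
    return record;
  });
}

const records = parseWeather(payload);
console.log(records[0].MaxTemp, records[0].rainTomorrow);
```

Converting the rain fields to booleans at parse time keeps the charting code free of string comparisons, which is one possible design choice rather than the tool's documented behavior.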

The Flask API performs a variety of tasks on the data. It fetches the Weather dataset and processes and cleans it so that the Machine Learning models use only the part of the data that is required, which reduces the processing time taken by the models. It normalizes the data, which helps the Machine Learning models converge and improves their performance. The API is also responsible for training the Machine Learning Models with the appropriate data and returning the predictions to the frontend.
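
The normalization step can be illustrated with a small sketch. It assumes min-max scaling, which is only one common way to normalize and is not confirmed as the method the tool uses. It is written in JavaScript purely to match the other sketches in this report; the tool itself performs this step in its Python backend. The function name and the sample values are hypothetical.

```javascript
// Min-max scaling: map every value of a column into the range [0, 1].
// The choice of min-max scaling is an assumption made for this sketch.
function minMaxNormalize(values) {
  const min = Math.min(...values);
  const max = Math.max(...values);
  if (max === min) {
    return values.map(() => 0); // constant column: avoid dividing by zero
  }
  return values.map(v => (v - min) / (max - min));
}

const maxTemps = [10, 20, 15]; // made-up temperatures
const normalized = minMaxNormalize(maxTemps);
console.log(normalized);
```

After scaling, features measured in different units, such as temperature and humidity, share the same range, so no single feature dominates a model only because of its scale.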

As the tool is web-based, special care was taken in selecting the dataset, the Machine Learning problems to be solved and the models used to solve them, so that students have a great learning experience and less wait time. I selected two easy problems, one for each type of supervised learning problem.



Chapter 4: Implementation and Development

In this section I discuss the coding phase and the implementation details, along with the tools and platforms used. The data visualizations, which are the main focus of the application, are built using D3.js along with JavaScript. The UI is created using HTML5, CSS3 and Bootstrap. The custom Flask API, which is queried for the data used to create the interactive visualizations, derives its power from Python Data Science libraries like scikit-learn, Pandas and NumPy.

4.1. Data Visualizations

The Data Visualizations are built keeping in mind the design decisions made earlier. Three types of charts, the Bubble Chart, the Line Chart and the Area Chart, are used to represent the data. The Line Charts, the Area Charts and the Data Science Pipeline diagram are created using SVG Path elements through D3.js. The Bubble Chart uses the D3.js Force simulation for the transition effect that animates the bubbles.

4.1.1. Path

Scalable Vector Graphics (SVG) Paths, created through D3, are used to construct the Data Science Pipeline diagram and the Line Chart. In a Line Chart, the line paths represent the trends of weather phenomena. SVG Paths can be used to draw a variety of shapes, such as rectangles, circles, ellipses, polylines, polygons, straight lines and curves.

The shape of an SVG Path element is defined by one attribute: d. The SVG Path Mini-Language contains a series of commands and parameters that define the ‘d’ attribute. These commands and parameters, shown in Table 1, are a sequential set of instructions for how to draw an SVG path.

Table 1: SVG Path Commands and Parameters [18]

| Command | Parameters | Repeatable | Explanation |
|---|---|---|---|
| M (m) | x, y | Yes | moveto: Move the pen to a new location. No line is drawn. All path data must begin with a 'moveto' command. |
| L (l) | x, y | Yes | lineto: Draw a line from the current point to the point (x,y). |
| H (h) | x | Yes | horizontal-lineto: Draw a horizontal line from the current point to x. |
| V (v) | y | Yes | vertical-lineto: Draw a vertical line from the current point to y. |
| C (c) | x1 y1 x2 y2 x y | Yes | curveto: Draw a cubic Bézier curve from the current point to the point (x,y) using (x1,y1) as the control point at the beginning of the curve and (x2,y2) as the control point at the end of the curve. |
| A (a) | rx ry x-axis-rotation large-arc-flag sweep-flag x y | Yes | elliptical-arc: Draw an elliptical arc from the current point to (x, y). The size and orientation of the ellipse are defined by two radii (rx, ry) and an x-axis-rotation, which indicates how the ellipse as a whole is rotated relative to the current SVG coordinate system. The center (cx, cy) of the ellipse is calculated automatically to satisfy the constraints imposed by the other parameters. large-arc-flag and sweep-flag contribute to the automatic calculations and help determine how the arc is drawn. |
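
As an illustration of how the commands in Table 1 combine into a single ‘d’ attribute, here is a minimal sketch in plain JavaScript. The helper name buildPath and the coordinates are hypothetical and are not taken from the tool's code.

```javascript
// Hypothetical helper: join [letter, ...parameters] commands into the
// string that goes into the d attribute of an SVG <path> element.
function buildPath(commands) {
  return commands
    .map(([letter, ...params]) => letter + params.join(" "))
    .join(" ");
}

const d = buildPath([
  ["M", 10, 10],                  // move the pen to (10, 10), nothing is drawn
  ["L", 90, 10],                  // straight line to (90, 10)
  ["V", 50],                      // vertical line down to y = 50
  ["A", 40, 40, 0, 0, 1, 50, 90], // arc with radii 40 ending at (50, 90)
]);

console.log(d);
```

The resulting string could be used as the value of the d attribute of a path element, for example `<path d="..." fill="none" stroke="black" />`.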

The Data Science Pipeline diagram has been constructed using the Arc command of the SVG Path and a Line generator, shown in Figure 10.

Figure 10: Code Snippet - SVG Path

Figure 11 shows an alternative way to create a straight line.

Figure 11: Code Snippet - SVG Line

The Line Charts are created using SVG Line Generators, shown in Figure 12. Here .datum is used to bind the data from which the line is created. The Line Generator is defined using d3.line(), with accessors for the x and y values. The d attribute uses the line generator to plot the line on the SVG canvas, and the generator automatically computes the coordinates of each line segment (x1, y1, x2, y2) from the data.



Figure 12: Code Snippet - SVG Line Generator
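
The idea behind a line generator can be sketched without the library: take x and y accessors, scale each record to pixel coordinates and emit a moveto followed by lineto commands. This is a simplified stand-in written for illustration, not D3's implementation; linearScale, lineGenerator, the field names and the sample data are all hypothetical.

```javascript
// Hypothetical linear scale: maps the interval [d0, d1] onto [r0, r1].
function linearScale([d0, d1], [r0, r1]) {
  return v => r0 + ((v - d0) / (d1 - d0)) * (r1 - r0);
}

// Simplified line generator: returns a function that turns data into a
// path string, starting with M and continuing with one L per record.
function lineGenerator(x, y) {
  return data =>
    data.map((d, i) => (i === 0 ? "M" : "L") + x(d) + "," + y(d)).join("");
}

const xScale = linearScale([0, 2], [0, 200]);    // day index -> pixels
const yScale = linearScale([10, 20], [100, 0]);  // temperature -> pixels (y grows downward)

const line = lineGenerator(d => xScale(d.day), d => yScale(d.temp));
const linePath = line([
  { day: 0, temp: 10 },
  { day: 1, temp: 20 },
  { day: 2, temp: 15 },
]);

console.log(linePath);
```

Separating the accessors from the data is what lets one generator be reused for temperature, humidity and rainfall by swapping only the y accessor.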

4.1.2. Area

To create an Area Chart I used an SVG Area Generator, shown in Figure 13. It uses the same data as the Line Chart but represents it in a different manner. An SVG area is created as an SVG path, and the area under the graph is filled with a selected color, which gives the impression of an area element. This is done by the SVG Area Generator, as shown in Figure 14.

Figure 13: Code Snippet - SVG Area Generator

Figure 14: Code Snippet - Implementing SVG Area Path
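
The following plain-Python sketch illustrates how an area can be expressed as a closed path: the top edge follows the data points, then the path drops to a baseline, returns to the start and closes, so that filling it colors the region under the curve. This is only an approximation of what an area generator produces; the function name and pixel coordinates are hypothetical.

```python
def area_path(points, baseline_y):
    """Build a closed path covering the region between points and a baseline.

    points is a list of (x, y) pixel coordinates ordered by x.
    """
    if not points:
        return ""
    top_edge = "L".join(f"{x:g},{y:g}" for x, y in points)
    first_x = points[0][0]
    last_x = points[-1][0]
    # Drop to the baseline, walk back to the first x, then close with Z.
    return (
        f"M{top_edge}"
        f"L{last_x:g},{baseline_y:g}"
        f"L{first_x:g},{baseline_y:g}Z"
    )


area = area_path([(0, 75), (100, 50), (200, 0)], baseline_y=100)
print(area)  # M0,75L100,50L200,0L200,100L0,100Z
```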



4.1.3. Circle

I used circles in the Bubble Chart, where each circle represents an entry in the weather dataset I am using. The color of each circle is determined by the Rain Today and Rain Tomorrow values of its record, as shown in Figure 15. Users can interact with the circles through D3 Force Simulation, which is discussed in the next section. The center coordinates cx and cy of the circles are generated automatically by the D3 Force Simulation.

Figure 15: Code Snippet - SVG Circle



4.1.4. Force

To animate the movement of the circles in the Bubble Chart upon interaction, I used D3 Force Simulation. I used the Collide Force offered by D3 Force Simulation to create the simulation, and I manipulated the x and y values of the circle centers using two positioning forces of D3 Force Simulation, forceX and forceY, as shown in Figure 16 and Figure 17.

Figure 16: Code Snippet - D3 Force Simulation

Figure 17: Code Snippet - ForceX and ForceY
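
The effect of a positioning force can be sketched in a few lines of plain Python. The version below is a deliberately simplified model and not D3's actual implementation: it omits the collide force and the gradual cooling of the simulation, and handles only the x direction. On every tick each node's velocity is nudged toward a target position and then damped. The target positions, strength and decay values are assumptions chosen for illustration.

```python
def apply_force_x(nodes, target_x, strength=0.1, alpha=1.0):
    """Nudge each node's x velocity toward its target x position."""
    for node in nodes:
        node["vx"] += (target_x(node) - node["x"]) * strength * alpha


def tick(nodes, target_x, velocity_decay=0.6):
    """One simulation step: apply the force, damp velocity, move nodes."""
    apply_force_x(nodes, target_x)
    for node in nodes:
        node["vx"] *= velocity_decay
        node["x"] += node["vx"]


def class_target(node):
    """Hypothetical layout: separate the two classes horizontally."""
    return 300.0 if node["rain_tomorrow"] == "Yes" else 100.0


# Both bubbles start at the center of the chart.
nodes = [
    {"x": 200.0, "vx": 0.0, "rain_tomorrow": "Yes"},
    {"x": 200.0, "vx": 0.0, "rain_tomorrow": "No"},
]

for _ in range(200):
    tick(nodes, class_target)

# After enough ticks each bubble settles near its class target.
print([round(n["x"], 3) for n in nodes])
```

A forceY counterpart would apply the same update to y and vy.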



4.1.5. Brush

Users can interact with the charts through the Brush interaction offered by D3.js, which is used to zoom into the Line Chart and Area Chart. It is implemented by first defining the Clip path, shown in Figure 18 and Figure 19, which makes sure that nothing is plotted outside the defined area. Then I added the brush interaction over its specified area, as shown in Figure 20.

Figure 18: Code Snippet – Defining D3 Brush

Figure 19: Code Snippet – Defining Clip Path

Figure 20: Code Snippet – Adding Brush
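
The arithmetic behind brush-to-zoom can be illustrated independently of D3.js. In the plain-Python sketch below, the brushed pixel range is converted back to data values by inverting the x scale, and those values become the new domain, so the selected region stretches across the full chart width. The class and function names and the numeric values are hypothetical.

```python
class LinearScale:
    """Minimal linear scale with forward and inverse mapping."""

    def __init__(self, domain, range_):
        self.domain = domain
        self.range = range_

    def __call__(self, value):
        d0, d1 = self.domain
        r0, r1 = self.range
        return r0 + (value - d0) * (r1 - r0) / (d1 - d0)

    def invert(self, pixel):
        d0, d1 = self.domain
        r0, r1 = self.range
        return d0 + (pixel - r0) * (d1 - d0) / (r1 - r0)


def zoom_to_brush(scale, selection):
    """Return a new scale whose domain is the brushed pixel selection."""
    px0, px1 = selection
    new_domain = (scale.invert(px0), scale.invert(px1))
    return LinearScale(new_domain, scale.range)


x = LinearScale(domain=(0, 100), range_=(0, 500))
# The user brushes from pixel 100 to pixel 250, i.e. data values 20 to 50.
zoomed = zoom_to_brush(x, (100, 250))

print(zoomed.domain)  # (20.0, 50.0)
print(zoomed(35))     # 250.0, the middle of the chart
```

Points that map outside the pixel range after zooming are the ones a clip path would hide.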



4.1.6. Tool Tip

A Tool Tip is used to display more information about the data when a user is interacting with the visualizations. Here I use a tool tip for the Line Charts, which appears when the user hovers the mouse over the chart area, as shown in Figure 21.

Figure 21: Code Snippet – Tool Tip
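
The text does not state how the hovered data point is located. One common approach, assumed here purely for illustration, is to search the sorted x values for the entry closest to the mouse position and show that record in the tool tip. The plain-Python sketch below uses a binary search; the function name and sample values are hypothetical.

```python
from bisect import bisect_left


def nearest_index(sorted_xs, x):
    """Return the index of the value in sorted_xs closest to x."""
    i = bisect_left(sorted_xs, x)
    if i == 0:
        return 0
    if i == len(sorted_xs):
        return len(sorted_xs) - 1
    # Pick whichever neighbor is closer to x.
    before, after = sorted_xs[i - 1], sorted_xs[i]
    return i if after - x < x - before else i - 1


days = [0, 10, 20, 30]
temps = [12.5, 14.0, 19.5, 17.0]

hovered_day = 14  # data value under the mouse pointer
idx = nearest_index(days, hovered_day)
print(days[idx], temps[idx])  # 10 14.0
```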

4.2. Backend Development

In this section I will discuss the backend implementation and how it is used to get the desired results. I will discuss how the Flask API is designed and how the Sci-Kit Learn library is used to perform the Machine Learning tasks that the frontend requests.

4.2.1. Flask API

Flask is a lightweight Web Server Gateway Interface (WSGI) web application framework. It is classified as a microframework because it does not require particular tools or libraries. It serves as the backbone of this tool: the Flask API performs all the data-related tasks, such as cleaning, processing and performing predictions.

Every request that the frontend can make to the API has a unique URL that triggers the associated function in the API. The function imports the data, performs the requested task and returns the results in JSON format. The frontend reads the JSON response from the API and uses the received data to display the requested results.

Below are some examples of how the Flask API is used to fetch Original Data,

shown in Figure 22, and Normalized Data, shown in Figure 23, in order to be displayed in

the charts.

Figure 22: Code Snippet – Original Data

Figure 23: Code Snippet – Normalized Data
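
The kind of work such an API function performs can be sketched without the web framework. The plain-Python example below normalizes selected columns and serializes the result as JSON, which is the payload a route would return to the frontend. The text does not say which normalization the tool applies; min-max scaling to the range 0 to 1 is assumed here, and the column names and function names are hypothetical.

```python
import json


def min_max_normalize(values):
    """Scale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalized_payload(records, columns):
    """Normalize the given columns and return the rows as a JSON string."""
    normalized = {c: min_max_normalize([r[c] for r in records]) for c in columns}
    rows = [
        {c: normalized[c][i] for c in columns}
        for i in range(len(records))
    ]
    return json.dumps(rows)


records = [
    {"Temp": 10, "Humidity": 40},
    {"Temp": 20, "Humidity": 60},
    {"Temp": 30, "Humidity": 80},
]

payload = normalized_payload(records, ["Temp", "Humidity"])
print(payload)
```

In a Flask application, a function like normalized_payload would be called from the handler bound to the corresponding URL.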

4.2.2. Sci-Kit Learn

Sci-Kit Learn (SKLearn) is an easy-to-use Machine Learning library for Python that features various classification, regression and clustering algorithms, including Support Vector Machine, Naïve Bayes, Linear Regression and Logistic Regression, which are used in the tool. These Machine Learning algorithms are accessed through the Flask API. Along with various Machine Learning algorithms, SKLearn also offers different metrics to assess the performance of the algorithms used. A classification algorithm, shown in Figure 24, can be assessed on metrics such as Accuracy, Precision, Recall and the Confusion Matrix, of which Accuracy and Precision are used here, as shown in Figure 25. A regression algorithm, shown in Figure 26, is mainly assessed on Root Mean Squared Error and R2 score, as shown in Figure 27.

Figure 24: Code Snippet – SKLearn Naïve Bayes

Figure 25: Code Snippet – Classification Metrics

Figure 26: Code Snippet – SKLearn Linear Regression

Figure 27: Code Snippet – Regression Metrics
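
To show what these four metrics measure, the sketch below computes them by hand in plain Python from their standard definitions, without calling SKLearn. The labels and values are made-up examples, and the function names are chosen for readability rather than taken from the tool.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision(y_true, y_pred, positive="Yes"):
    """Of all records predicted positive, the fraction that truly are."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if (tp + fp) else 0.0


def rmse(y_true, y_pred):
    """Root of the mean squared difference between truth and prediction."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def r2_score(y_true, y_pred):
    """1 minus the ratio of residual variation to total variation."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Classification: will it rain tomorrow?
rain_true = ["Yes", "No", "No", "Yes", "No"]
rain_pred = ["Yes", "No", "Yes", "No", "No"]
print(accuracy(rain_true, rain_pred))   # 0.6
print(precision(rain_true, rain_pred))  # 0.5

# Regression: predicted temperature.
temp_true = [20, 22, 24]
temp_pred = [21, 22, 23]
print(round(rmse(temp_true, temp_pred), 4))  # 0.8165
print(r2_score(temp_true, temp_pred))        # 0.75
```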



Chapter 5: Tool Walkthrough

5.1. Data Science Pipeline

In this section I will discuss how to interact with the Data Science Pipeline Diagram, shown in Figure 28. The diagram has three main areas for interaction: the diagram itself, an example list of areas where Machine Learning finds application, and the “Next” button used to navigate the diagram. I start by selecting from the list of applications, which gives context to the interaction flow. Then, by clicking “Next”, I can see the blocks that represent data move into the pipeline. Each time the “Next” button is clicked, I am taken to the next step in the Data Science problem-solving process. The interaction ends with the diagram showing the Insights for the application selected at the beginning, as shown in Figure 29.

Figure 28: Data Science Pipeline - Beginning - Healthcare Industry



Figure 29: Data Science Pipeline - Completed - Healthcare Industry

5.2. Data Distribution Chart

In this section I will discuss how to interact with the Data Distribution Chart, shown in Figure 30. The user interacts with the Bubble Chart by clicking on the chart area. A single click separates the bubbles according to the classes they belong to, as shown in Figure 31, and a double click brings the chart back to its original state. It is an Explanatory Visualization, where the user can learn how the data is distributed among the classes Rain Today and Rain Tomorrow. The chart is accompanied by a legend that explains the color coding used in the chart.



Figure 30: Data Distribution Chart – Initial State

Figure 31: Data Distribution Chart – After Interaction



5.3. Classification Dashboard

The Classification Dashboard, shown in Figure 32, is used to explore the process of solving the Rain or No Rain problem. I use the weather dataset and, based on daily weather conditions such as Temperature, Humidity, Rainfall Measure and whether or not it rained today, I predict whether or not it will rain tomorrow.

The Dashboard has three panels. The left panel is used to explore different forms of data (original, normalized), types of graph (temperature, humidity and rainfall measure) and algorithms used for prediction (Support Vector Machine, Naïve Bayes, Logistic Regression). The user can perform a prediction by selecting one of the Machine Learning models from the Models list and then clicking “Predict” in the Data section.

The center panel is used to display the charts and legends, as shown in Figure 33. The right panel is used to display the Rain Today and Rain Tomorrow labels, the predictions given by the Machine Learning algorithms and the performance metrics of the selected Machine Learning algorithm, as shown in Figure 34. The interaction begins with clicking “Loading Data” in the Data section of the left panel. This generates the graph in the center panel, which the user can then interact with.



Figure 32: Classification Dashboard – Initial State

Figure 33: Classification Dashboard – Data Loaded



Figure 34: Classification Dashboard – After Prediction

5.4. Regression Dashboard

The Regression Dashboard, shown in Figure 35, is used to explore the process of solving the Temperature Prediction problem. I use the weather dataset and, based on daily weather conditions such as Minimum Temperature, Maximum Temperature, Humidity and Rainfall Measure, I predict the daily temperature.

The dashboard is divided into two sections. The left panel is used to select the type

of data to be displayed and display information about the regression algorithm used for

performing prediction. The right panel is used to display the charts and legends, which can

be used for interaction, as shown in Figure 36.



Figure 35: Regression Dashboard – Initial State

Figure 36: Regression Dashboard – After Prediction



Chapter 6: Conclusion

6.1. Summary

The goal of this project is to develop a concrete learning plan to introduce Machine Learning to High School Students using an interactive visual tool. The project introduces a novel visual and interactive tool that serves as a probable solution to the problem this paper aims to solve. This web-based tool can be used for both learning and teaching: students can use it to explore the area of Machine Learning, and teachers can use it as an educational tool. The tool is accessible at http://ml4hss.herokuapp.com/static/tut.html. Using this tool, the user can build a consolidated understanding of basic Machine Learning fundamentals.

6.2. Future Work

The tool is complete in itself, but there is always scope for enhancement depending on the needs of the user. The following upgrades can be added to the tool:

1) Increase user interaction with the tool by implementing the following features:

a) Choose between different datasets to explore different problems.

b) Enter new data for prediction, apart from the data in the dataset. This can be done by providing input fields that collect values from the user and use them to produce predictions.

2) Add sections that introduce the fundamentals of Neural Networks. Because the tool is web-based, lightweight and efficient models should be used to support Neural Networks.

3) Perform a Usability Test of the tool with the target users, to gather data that can be used to improve different areas of the tool.



Bibliography

1. K. Gutierrez, “Studies Confirm Power of Visuals in eLearning,” [Online]. Available:


https://www.shiftelearning.com/blog/bid/350326/studies-confirm-the-power-of-
visuals-in-elearning [Accessed: March 2019].

2. S. Yee and T. Chu, “A Visual Introduction to Machine Learning,” [Online]. Available: http://www.r2d3.us/visual-intro-to-machine-learning-part-1 [Accessed: January 2019].

3. S. Yee and T. Chu, “Model Tuning and the Bias Variance tradeoff,” [Online]. Available: http://www.r2d3.us/visual-intro-to-machine-learning-part-2 [Accessed: January 2019].

4. K. Malone and S. Thrun, “Intro to Machine Learning,” [Online]. Available:


https://www.udacity.com/course/intro-to-machine-learning--ud120 [Accessed:
October 2019].

5. K. Eremenko, “Machine Learning A-Z: Hands-on Python and R in Data Science,”


[Online]. Available: https://www.udemy.com/course/machinelearning/ [Accessed:
October 2019].

6. R. Gavaldà, “Machine Learning in Secondary Education?,” Universitat Politècnica de


Catalunya, 2008.

7. S. Srikant and V. Aggarwal, “Introducing Data Science to School Kids,” in


Proceedings of the 2017 ACM SIGCSE Technical Symposium on Computer Science
Education, Seattle, WA, pp. 561-566, March 2017.

8. M. Bienkowski, D. W. Rutstein, Y. Xu and K. McElhaney, “Deepening Learning in


High School Computer Science through Practices in the NGSS,” in Proceedings of
the 2016 ACM SIGCSE Technical Symposium on Computing Science Education,
Memphis, TN, pp. 694-694, March 2016.

9. S. Vandenberg, S. Small, M. Fryling, R. Flatland and M. Egan, “A Summer Program


to Attract Potential Computer Science Majors,” in Proceedings of the 2018 ACM
SIGCSE Technical Symposium on Computer Science Education, Baltimore, MD, pp.
467-472, February 2018.

10. J. B. Gordon, “Machine Learning for High School Students,” [Online]. Available:
http://www.cs.columbia.edu/~CS4HS/talks/ml_for_hs.pdf [Accessed: February
2019].

11. J. Ho, “AI Classroom Activity: Machine Learning,” [Online]. Available:


https://www.teachermagazine.com.au/articles/ai-classroom-activity-machine-learning
[Accessed: February 2019].

12. S. Wolfram, “Machine Learning for Middle Schoolers,” [Online]. Available:


https://blog.stephenwolfram.com/2017/05/machine-learning-for-middle-schoolers/
[Accessed: February 2019].

13. Exploringcs, “Exploring Computer Science,” [Online]. Available:


http://www.exploringcs.org/curriculum [Accessed: October 2019].

14. S. McGee, R. McGee-Tekula, J. Duck, C. McGee, L. Dettori, R. I. Greenberg, E.


Snow, D. Rutstein, D. Reed, B. Wilkerson, D. Yanek, A. M. Rasmussen and D.
Brylow, “Equal Outcomes 4 All: A Study of Student Learning in ECS,” in
Proceedings of the 2018 ACM SIGCSE Technical Symposium on Computer Science
Education, Baltimore, MD, pp. 50-55, February 2018.

15. T. Kei, “Principles and elements of visual design: A review of the literature on visual
design of instructional materials,” Educational Studies, vol. 57, International
Christian University, pp. 167-174, April 2015.

16. Z. Avagyan, “Weather Dataset,” [Online]. Available:


https://www.kaggle.com/zaraavagyan/weathercsv [Accessed: November 2019].

17. R. Orban, C. Saden and J. Dinu, “Data Visualization and D3.js,” [Online]. Available: https://www.udacity.com/course/data-visualization-and-d3js--ud507 [Accessed: June 2019].

18. Dashingd3js, “SVG Paths and D3.js,” [Online]. Available: https://www.dashingd3js.com/svg-paths-and-d3js [Accessed: February 2019].
