
MODULE-1:

UNIT 1/9:

Introduction

Unsurprisingly, the role of a data scientist primarily involves exploring and analyzing
data. The results of an analysis might form the basis of a report or a machine learning
model, but it all begins with data, with Python being the most popular programming
language for data scientists.

After decades of open-source development, Python provides extensive functionality
with powerful statistical and numerical libraries:

 NumPy and Pandas simplify analyzing and manipulating data
 Matplotlib provides attractive data visualizations
 Scikit-learn offers simple and effective predictive data analysis
 TensorFlow and PyTorch supply machine learning and deep learning capabilities

Usually, a data analysis project is designed to establish insights around a particular
scenario or to test a hypothesis.

For example, suppose a university professor collects data from their students, including
the number of lectures attended, the hours spent studying, and the final grade achieved
on the end of term exam. The professor could analyze the data to determine if there is a
relationship between the amount of studying a student undertakes and the final grade
they achieve. The professor might use the data to test a hypothesis that only students
who study for a minimum number of hours can expect to achieve a passing grade.
Prerequisites
 Knowledge of basic mathematics
 Some experience programming in Python

Learning objectives
In this module, you will:

 Learn about common data exploration and analysis tasks
 Learn how to use Python packages like NumPy, Pandas, and Matplotlib to analyze data
UNIT 2/9:

Explore data with NumPy and Pandas



Data scientists can use various tools and techniques to explore, visualize, and
manipulate data. One of the most common ways in which data scientists work with data
is to use the Python language and some specific packages for data processing.

What is NumPy
NumPy is a Python library that provides functionality comparable to mathematical tools
such as MATLAB and R. While NumPy significantly simplifies the user experience, it also
offers comprehensive mathematical functions.
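A minimal sketch (with hypothetical values, not the module's data) of the kind of numeric work NumPy enables:

```python
import numpy as np

# Hypothetical sample of four grades (any numeric list works)
grades = np.array([50, 47, 97, 49])

# Arithmetic is element-wise, and aggregations are built in
print(grades * 2)      # each element doubled
print(grades.mean())   # average of the sample
```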

What is Pandas
Pandas is an extremely popular Python library for data analysis and manipulation.
Pandas is like Excel for Python, providing easy-to-use functionality for data tables.
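A minimal sketch (hypothetical names and grades) of how a Pandas data table behaves:

```python
import pandas as pd

# A small, hypothetical data table
df = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro'],
                   'Grade': [50, 50, 47]})

# Column selection and aggregation read much like spreadsheet operations
print(df['Grade'].mean())
```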
Explore data in a Jupyter notebook
Jupyter notebooks are a popular way of running basic scripts using your web browser.
Typically, these notebooks are a single webpage, broken up into text sections and code
sections that are executed on the server rather than your local machine. This means you
can get started quickly without needing to install Python or other tools.

Testing hypotheses
Data exploration and analysis is typically an iterative process, in which the data scientist
takes a sample of data and performs the following kinds of task to analyze it and test
hypotheses:

 Clean data to handle errors, missing values, and other issues.
 Apply statistical techniques to better understand the data, and how the sample might be expected to represent the real-world population of data, allowing for random variation.
 Visualize data to determine relationships between variables, and in the case of a
machine learning project, identify features that are potentially predictive of
the label.
 Revise the hypothesis and repeat the process.
UNIT 3/9:

Exercise - Explore data with NumPy and Pandas

Exploring Data with Python


A significant part of a data scientist's role is to explore, analyze, and visualize data.
There's a wide range of tools and programming languages that they can use to do this,
and one of the most popular approaches is to use Jupyter notebooks (like this one) and
Python.

Python is a flexible programming language that is used in a wide range of scenarios,
from web applications to device programming. It's extremely popular in the data science
and machine learning community because of the many packages it supports for data
analysis and visualization.

In this notebook, we'll explore some of these packages, and apply basic techniques to
analyze data. This is not intended to be a comprehensive Python programming exercise;
or even a deep dive into data analysis. Rather, it's intended as a crash course in some of
the common ways in which data scientists can use Python to work with data.
Note: If you've never used the Jupyter Notebooks environment before, there are a few
things you should be aware of:
 Notebooks are made up of cells. Some cells (like this one) contain markdown text, while
others (like the one beneath this one) contain code.
 You can run each code cell by using the ► Run button, which shows up when you hover
over the cell.
 The output from each code cell will be displayed immediately below the cell.
 Even though the code cells can be run individually, some variables used in the code are
global to the notebook. That means that you should run all of the code cells in order.
There may be dependencies between code cells, so if you skip a cell, subsequent cells
might not run correctly.

Exploring data arrays with NumPy


Let's start by looking at some simple data.

Suppose a college takes a sample of student grades for a data science class.

Run the code in the cell below by clicking the ► Run button to see the data.
data = [50,50,47,97,49,3,53,42,26,74,82,62,37,15,70,27,36,35,48,52,63,64]
print(data)


The data has been loaded into a Python list structure, which is a good data type for
general data manipulation, but not optimized for numeric analysis. For that, we're going
to use the NumPy package, which includes specific data types and functions for
working with numbers in Python.

Run the cell below to load the data into a NumPy array.


import numpy as np

grades = np.array(data)
print(grades)

Just in case you're wondering about the differences between a list and a NumPy array,
let's compare how these data types behave when we use them in an expression that
multiplies them by 2.


print(type(data), 'x 2:', data * 2)
print('---')
print(type(grades), 'x 2:', grades * 2)

Note that multiplying a list by 2 creates a new list of twice the length with the original
sequence of list elements repeated. Multiplying a NumPy array on the other hand
performs an element-wise calculation in which the array behaves like a vector, so we end
up with an array of the same size in which each element has been multiplied by 2.

The key takeaway from this is that NumPy arrays are specifically designed to support
mathematical operations on numeric data - which makes them more useful for data
analysis than a generic list.

You might have spotted that the class type for the NumPy array above is
a numpy.ndarray. The nd indicates that this is a structure that can consist of
multiple dimensions (it can have n dimensions). Our specific instance has a single
dimension of student grades.

Run the cell below to view the shape of the array.


grades.shape

The shape confirms that this array has only one dimension, which contains 22 elements
(there are 22 grades in the original list). You can access the individual elements in the
array by their zero-based ordinal position. Let's get the first element (the one in position
0).
grades[0]

Alright, now you know your way around a NumPy array, it's time to perform some
analysis of the grades data.

You can apply aggregations across the elements in the array, so let's find the simple
average grade (in other words, the mean grade value).
grades.mean()

So the mean grade is just around 50 - more or less in the middle of the possible range
from 0 to 100.

Let's add a second set of data for the same students, this time recording the typical
number of hours per week they devoted to studying.
# Define an array of study hours
study_hours = [10.0,11.5,9.0,16.0,9.25,1.0,11.5,9.0,8.5,14.5,15.5,
               13.75,9.0,8.0,15.5,8.0,9.0,6.0,10.0,12.0,12.5,12.0]

# Create a 2D array (an array of arrays)
student_data = np.array([study_hours, grades])

# display the array
student_data


Now the data consists of a 2-dimensional array - an array of arrays. Let's look at its
shape.
# Show shape of 2D array
student_data.shape

The student_data array contains two elements, each of which is an array containing 22
elements.

To navigate this structure, you need to specify the position of each element in the
hierarchy. So to find the first value in the first array (which contains the study hours
data), you can use the following code.
# Show the first element of the first element
student_data[0][0]

Now you have a multidimensional array containing both the student's study time and
grade information, which you can use to compare data. For example, how does the
mean study time compare to the mean grade?
# Get the mean value of each sub-array
avg_study = student_data[0].mean()
avg_grade = student_data[1].mean()

print('Average study hours: {:.2f}\nAverage grade: {:.2f}'.format(avg_study, avg_grade))

Exploring tabular data with Pandas


While NumPy provides a lot of the functionality you need to work with numbers, and
specifically arrays of numeric values, when you start to deal with two-dimensional tables
of data, the Pandas package offers a more convenient structure to work with -
the DataFrame.

Run the following cell to import the Pandas library and create a DataFrame with three
columns. The first column is a list of student names, and the second and third columns
are the NumPy arrays containing the study time and grade data.
import pandas as pd

df_students = pd.DataFrame({'Name': ['Dan', 'Joann', 'Pedro', 'Rosie', 'Ethan', 'Vicky', 'Frederic', 'Jimmie',
                                     'Rhonda', 'Giovanni', 'Francesca', 'Rajab', 'Naiyana', 'Kian', 'Jenny',
                                     'Jakeem', 'Helena', 'Ismat', 'Anila', 'Skye', 'Daniel', 'Aisha'],
                            'StudyHours': student_data[0],
                            'Grade': student_data[1]})

df_students


Note that in addition to the columns you specified, the DataFrame includes an index to
uniquely identify each row. We could have specified the index explicitly, and assigned any
kind of appropriate value (for example, an email address); but because we didn't specify
an index, one has been created with a unique integer value for each row.
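For illustration, here is a minimal sketch of supplying an explicit index (the email addresses are hypothetical, not part of the module's data):

```python
import pandas as pd

# Hypothetical example: use email addresses as the index instead of default integers
df = pd.DataFrame({'Grade': [50, 50]},
                  index=['dan@example.com', 'joann@example.com'])

# Rows can then be located by that index value
print(df.loc['dan@example.com', 'Grade'])
```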

Finding and filtering data in a DataFrame

You can use the DataFrame's loc method to retrieve data for a specific index value, like
this.
# Get the data for index value 5
df_students.loc[5]

You can also get the data at a range of index values, like this:
# Get the rows with index values from 0 to 5
df_students.loc[0:5]

In addition to being able to use the loc method to find rows based on the index, you
can use the iloc method to find rows based on their ordinal position in the DataFrame
(regardless of the index):
# Get data in the first five rows
df_students.iloc[0:5]

Look carefully at the iloc[0:5] results, and compare them to the loc[0:5] results you
obtained previously. Can you spot the difference?

The loc method returned rows with index label in the list of values from 0 to 5 - which
includes 0, 1, 2, 3, 4, and 5 (six rows). However, the iloc method returns the rows in
the positions included in the range 0 to 5, and since integer ranges don't include the
upper-bound value, this includes positions 0, 1, 2, 3, and 4 (five rows).

iloc identifies data values in a DataFrame by position, which extends beyond rows to
columns. So for example, you can use it to find the values for the columns in positions 1
and 2 in row 0, like this:
df_students.iloc[0,[1,2]]

Let's return to the loc method, and see how it works with columns. Remember
that loc is used to locate data items based on index values rather than positions. In the
absence of an explicit index column, the rows in our dataframe are indexed as integer
values, but the columns are identified by name:
df_students.loc[0,'Grade']

Here's another useful trick. You can use the loc method to find indexed rows based on a
filtering expression that references named columns other than the index, like this:
df_students.loc[df_students['Name']=='Aisha']

Actually, you don't need to explicitly use the loc method to do this - you can simply
apply a DataFrame filtering expression, like this:
df_students[df_students['Name']=='Aisha']

And for good measure, you can achieve the same results by using the
DataFrame's query method, like this:
df_students.query('Name=="Aisha"')

The three previous examples underline an occasionally confusing truth about working
with Pandas. Often, there are multiple ways to achieve the same results. Another
example of this is the way you refer to a DataFrame column name. You can specify the
column name as a named index value (as in the df_students['Name'] examples we've
seen so far), or you can use the column as a property of the DataFrame, like this:
df_students[df_students.Name == 'Aisha']

Loading a DataFrame from a file

We constructed the DataFrame from some existing arrays. However, in many real-world
scenarios, data is loaded from sources such as files. Let's replace the student grades
DataFrame with the contents of a text file.
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv

df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')
df_students.head()

The DataFrame's read_csv method is used to load data from text files. As you can see in
the example code, you can specify options such as the column delimiter and which row
(if any) contains column headers (in this case, the delimiter is a comma and the first row
contains the column names - these are the default settings, so the parameters could
have been omitted).
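Since those are the defaults, a shorter call is equivalent. The following sketch writes a tiny stand-in file so it is self-contained (the real notebook downloads the full grades.csv instead):

```python
import pandas as pd

# Write a tiny stand-in for grades.csv so this sketch is self-contained
with open('grades.csv', 'w') as f:
    f.write('Name,StudyHours,Grade\nDan,10.0,50.0\nJoann,11.5,50.0\n')

# Comma delimiter and an inferred header row are the defaults, so this call is
# equivalent to pd.read_csv('grades.csv', delimiter=',', header='infer')
df_students = pd.read_csv('grades.csv')
print(df_students.head())
```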
Handling missing values

One of the most common issues data scientists need to deal with is incomplete or
missing data. So how would we know that the DataFrame contains missing values? You
can use the isnull method to identify which individual values are null, like this:
df_students.isnull()

Of course, with a larger DataFrame, it would be inefficient to review all of the rows and
columns individually; so we can get the sum of missing values for each column, like this:
df_students.isnull().sum()

So now we know that there's one missing StudyHours value, and two
missing Grade values.

To see them in context, we can filter the dataframe to include only rows where any of
the columns (axis 1 of the DataFrame) are null.
df_students[df_students.isnull().any(axis=1)]

When the DataFrame is retrieved, the missing numeric values show up as NaN (not a
number).

So now that we've found the null values, what can we do about them?

One common approach is to impute replacement values. For example, if the number of
study hours is missing, we could just assume that the student studied for an average
amount of time and replace the missing value with the mean study hours. To do this, we
can use the fillna method, like this:
df_students.StudyHours = df_students.StudyHours.fillna(df_students.StudyHours.mean())
df_students

Alternatively, it might be important to ensure that you only use data you know to be
absolutely correct; so you can drop rows or columns that contain null values by using
the dropna method. In this case, we'll remove rows (axis 0 of the DataFrame) where any
of the columns contain null values.
df_students = df_students.dropna(axis=0, how='any')
df_students


Explore data in the DataFrame

Now that we've cleaned up the missing values, we're ready to explore the data in the
DataFrame. Let's start by comparing the mean study hours and grades.
# Get the mean study hours using the column name as an index
mean_study = df_students['StudyHours'].mean()

# Get the mean grade using the column name as a property (just to make the point!)
mean_grade = df_students.Grade.mean()

# Print the mean study hours and mean grade
print('Average weekly study hours: {:.2f}\nAverage grade: {:.2f}'.format(mean_study, mean_grade))

OK, let's filter the DataFrame to find only the students who studied for more than the
average amount of time.
# Get students who studied for more than the mean hours
df_students[df_students.StudyHours > mean_study]

Note that the filtered result is itself a DataFrame, so you can work with its columns just
like any other DataFrame.

For example, let's find the average grade for students who undertook more than the
average amount of study time.
# What was their mean grade?
df_students[df_students.StudyHours > mean_study].Grade.mean()

Let's assume that the passing grade for the course is 60.

We can use that information to add a new column to the DataFrame, indicating whether
or not each student passed.

First, we'll create a Pandas Series containing the pass/fail indicator (True or False), and
then we'll concatenate that series as a new column (axis 1) in the DataFrame.
passes = pd.Series(df_students['Grade'] >= 60)
df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

df_students

DataFrames are designed for tabular data, and you can use them to perform many of
the kinds of data analytics operations you can do in a relational database, such as
grouping and aggregating tables of data.

For example, you can use the groupby method to group the student data into groups
based on the Pass column you added previously, and count the number of names in
each group - in other words, you can determine how many students passed and failed.
print(df_students.groupby(df_students.Pass).Name.count())

You can aggregate multiple fields in a group using any available aggregation function.
For example, you can find the mean study time and grade for the groups of students
who passed and failed the course.
print(df_students.groupby(df_students.Pass)[['StudyHours', 'Grade']].mean())

DataFrames are amazingly versatile, and make it easy to manipulate data. Many
DataFrame operations return a new copy of the DataFrame; so if you want to modify a
DataFrame but keep the existing variable, you need to assign the result of the operation
to the existing variable. For example, the following code sorts the student data into
descending order of Grade, and assigns the resulting sorted DataFrame to the
original df_students variable.
# Create a DataFrame with the data sorted by Grade (descending)
df_students = df_students.sort_values('Grade', ascending=False)

# Show the DataFrame
df_students


Summary
That's it for now!

NumPy and Pandas DataFrames are the workhorses of data science in Python. They
provide ways to load, explore, and analyze tabular data. As we will see in subsequent
modules, even advanced analysis methods typically rely on NumPy and Pandas for these
important roles.

In our next workbook, we'll take a look at how to create graphs and explore your data in
more interesting ways.
UNIT 4/9:

Visualize data

Data scientists visualize data to understand it better. This can mean looking at the raw
data, summary measures such as averages, or graphing the data. Graphs are a powerful
means of viewing data, as we can discern moderately complex patterns quickly without
needing to define mathematical summary measures.

Representing data visually


Representing data visually typically means graphing it. This is done to provide a fast
qualitative assessment of our data, which can be useful for understanding results,
finding outlier values, understanding how numbers are distributed, and so on.

While sometimes we know ahead of time what kind of graph will be most useful, other
times we use graphs in an exploratory way. To understand the power of data
visualization, consider the data below: the location (x,y) of a self-driving car. In its raw
form, it's hard to see any real patterns. The mean, or average, tells us that its path was
centered around x=0.2 and y=0.3, and the range of numbers appears to be between
about -2 and 2.

Time  Location-X  Location-Y
0      0           2
1      1.682942    1.080605
2      1.818595   -0.83229
3      0.28224    -1.97998
4     -1.5136     -1.30729
5     -1.91785     0.567324
6     -0.55883     1.920341
7      1.313973    1.507805
12     0.00001     0.00001
13     0.840334    1.814894
14     1.981215    0.273474
15     1.300576   -1.51938
16    -0.57581    -1.91532
17    -1.92279    -0.55033
18    -1.50197     1.320633
19     0.299754    1.977409
20     1.825891    0.816164

If we now plot Location-X over time, we can see that we appear to have some missing
values between times 7 and 12.
If we graph X vs Y, we end up with a map of where the car has driven. It’s instantly
obvious that the car has been driving in a circle, but at some point drove to the center
of that circle.
Graphs aren't limited to 2D scatter plots like those above. They can be used to explore
other kinds of data: proportions (shown through pie charts and stacked bar graphs), how
data are spread (with histograms and box-and-whisker plots), and how two data sets
differ. Often, when we're trying to understand raw data or results, we may experiment
with different types of graphs until we come across one that explains the data in a
visually intuitive way.
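A minimal sketch of the two plots described above, using the first few rows of the table as stand-in data (the Agg backend is assumed so the sketch runs without a display):

```python
import matplotlib
matplotlib.use('Agg')  # render to a file, no display needed
from matplotlib import pyplot as plt

# First few rows of the car-location table above
time = [0, 1, 2, 3, 4, 5, 6, 7]
loc_x = [0, 1.682942, 1.818595, 0.28224, -1.5136, -1.91785, -0.55883, 1.313973]
loc_y = [2, 1.080605, -0.83229, -1.97998, -1.30729, 0.567324, 1.920341, 1.507805]

fig, ax = plt.subplots(1, 2, figsize=(10, 4))

# Location-X over time: gaps in the time axis reveal missing samples
ax[0].plot(time, loc_x)
ax[0].set_title('Location-X over time')

# X vs Y: plotted against each other, the circular path becomes visible
ax[1].scatter(loc_x, loc_y)
ax[1].set_title('Path (X vs Y)')

fig.savefig('car_path.png')
```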
UNIT 5/9:

Exploring data with Python - visualize data


In this notebook, we'll apply basic techniques to analyze data with basic statistics and
visualize it using graphs.

Loading our data


Before we begin, let's load the same data about study hours that we analyzed in the
previous notebook. We'll also recalculate who passed in the same way as last time.

Run the code in the cell below by clicking the ► Run button to see the data.


CodeMarkdown

[ ]

import pandas as pd

# Load data from a text file

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-
machine-learning/main/Data/ml-basics/grades.csv

df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')

# Remove any rows with missing data

df_students = df_students.dropna(axis=0, how='any')

# Calculate who passed, assuming '60' is the grade needed to pass

passes  = pd.Series(df_students['Grade'] >= 60)

# Save who passed to the Pandas dataframe

df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

# Print the result out into this notebook

df_students 

Visualizing data with Matplotlib


DataFrames provide a great way to explore and analyze tabular data, but sometimes a
picture is worth a thousand rows and columns. The Matplotlib library provides the
foundation for plotting data visualizations that can greatly enhance your ability to
analyze the data.

Let's start with a simple bar chart that shows the grade of each student.

# Ensure plots are displayed inline in the notebook

%matplotlib inline

from matplotlib import pyplot as plt

# Create a bar plot of name vs grade

plt.bar(x=df_students.Name, height=df_students.Grade)

# Display the plot

plt.show()


Well, that worked; but the chart could use some improvements to make it clearer what
we're looking at.

Note that you used the pyplot class from Matplotlib to plot the chart. This class
provides a whole bunch of ways to improve the visual elements of the plot. For example,
the following code:
 Specifies the color of the bar chart.
 Adds a title to the chart (so we know what it represents)
 Adds labels to the X and Y (so we know which axis shows which data)
 Adds a grid (to make it easier to determine the values for the bars)
 Rotates the X markers (so we can read them)

# Create a bar plot of name vs grade

plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')

# Customize the chart

plt.title('Student Grades')

plt.xlabel('Student')

plt.ylabel('Grade')

plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)

plt.xticks(rotation=90)

# Display the plot

plt.show()


A plot is technically contained within a Figure. In the previous examples, the figure was
created implicitly for you; but you can create it explicitly. For example, the following
code creates a figure with a specific size.

# Create a Figure

fig = plt.figure(figsize=(8,3))

# Create a bar plot of name vs grade

plt.bar(x=df_students.Name, height=df_students.Grade, color='orange')

# Customize the chart

plt.title('Student Grades')

plt.xlabel('Student')

plt.ylabel('Grade')
plt.grid(color='#95a5a6', linestyle='--', linewidth=2, axis='y', alpha=0.7)

plt.xticks(rotation=90)

# Show the figure

plt.show()


A figure can contain multiple subplots, each on its own axis.

For example, the following code creates a figure with two subplots - one is a bar chart
showing student grades, and the other is a pie chart comparing the number of passing
grades to non-passing grades.

# Create a figure for 2 subplots (1 row, 2 columns)

fig, ax = plt.subplots(1, 2, figsize = (10,4))

# Create a bar plot of name vs grade on the first axis

ax[0].bar(x=df_students.Name, height=df_students.Grade, color='orange')

ax[0].set_title('Grades')

ax[0].set_xticklabels(df_students.Name, rotation=90)

# Create a pie chart of pass counts on the second axis

pass_counts = df_students['Pass'].value_counts()

ax[1].pie(pass_counts, labels=pass_counts)

ax[1].set_title('Passing Grades')

ax[1].legend(pass_counts.keys().tolist())

# Add a title to the Figure

fig.suptitle('Student Data')

# Show the figure

fig.show()

Until now, you've used methods of the Matplotlib.pyplot object to plot charts. However,
Matplotlib is so foundational to graphics in Python that many packages, including
Pandas, provide methods that abstract the underlying Matplotlib functions and simplify
plotting. For example, the DataFrame provides its own methods for plotting data, as
shown in the following example to plot a bar chart of study hours.

df_students.plot.bar(x='Name', y='StudyHours', color='teal', figsize=(6,4))


Getting started with statistical analysis


Now that you know how to use Python to manipulate and visualize data, you can start
analyzing it.

A lot of data science is rooted in statistics, so we'll explore some basic statistical
techniques.
Note: This is not intended to teach you statistics - that's much too big a topic for this
notebook. It will however introduce you to some statistical concepts and techniques that
data scientists use as they explore data in preparation for machine learning modeling.

Descriptive statistics and data distribution

When examining a variable (for example a sample of student grades), data scientists are
particularly interested in its distribution (in other words, how are all the different grade
values spread across the sample). The starting point for this exploration is often to
visualize the data as a histogram, and see how frequently each value for the variable
occurs.

# Get the variable to examine

var_data = df_students['Grade']

# Create a Figure

fig = plt.figure(figsize=(10,4))

# Plot a histogram

plt.hist(var_data)

# Add titles and labels

plt.title('Data Distribution')

plt.xlabel('Value')

plt.ylabel('Frequency')

# Show the figure

fig.show()


The histogram for grades is a symmetric shape, where the most frequently occurring
grades tend to be in the middle of the range (around 50), with fewer grades at the
extreme ends of the scale.

Measures of central tendency

To understand the distribution better, we can examine so-called measures of central
tendency, which is a fancy way of describing statistics that represent the "middle" of the
data. The goal of this is to try to find a "typical" value. Common ways to define the
middle of the data include:
 The mean: A simple average based on adding together all of the values in the sample set, and
then dividing the total by the number of samples.
 The median: The value in the middle of the range of all of the sample values.
 The mode: The most commonly occurring value in the sample set*.

Let's calculate these values, along with the minimum and maximum values for
comparison, and show them on the histogram.
*Of course, in some sample sets, there may be a tie for the most common value - in
which case the dataset is described as bimodal or even multimodal.

# Get the variable to examine

var = df_students['Grade']

# Get statistics

min_val = var.min()

max_val = var.max()

mean_val = var.mean()

med_val = var.median()

mod_val = var.mode()[0]

print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'
      .format(min_val, mean_val, med_val, mod_val, max_val))

# Create a Figure

fig = plt.figure(figsize=(10,4))

# Plot a histogram

plt.hist(var)

# Add lines for the statistics

plt.axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)

plt.axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)

plt.axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)

plt.axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)

plt.axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

# Add titles and labels

plt.title('Data Distribution')
plt.xlabel('Value')

plt.ylabel('Frequency')

# Show the figure

fig.show()


For the grade data, the mean, median, and mode all seem to be more or less in the
middle of the minimum and maximum, at around 50.

Another way to visualize the distribution of a variable is to use a box plot (sometimes
called a box-and-whisker plot). Let's create one for the grade data.

# Get the variable to examine

var = df_students['Grade']

# Create a Figure

fig = plt.figure(figsize=(10,4))

# Plot a box plot

plt.boxplot(var)

# Add titles and labels

plt.title('Data Distribution')

# Show the figure

fig.show()



The box plot shows the distribution of the grade values in a different format to the
histogram. The box part of the plot shows where the inner two quartiles of the data
reside - so in this case, half of the grades are between approximately 36 and 63.
The whiskers extending from the box show the outer two quartiles; so the other half of
the grades in this case are between 0 and 36 or 63 and 100. The line in the box indicates
the median value.

For learning, it can be useful to combine histograms and box plots, with the box plot's
orientation changed to align it with the histogram (in some ways, it can be helpful to
think of the histogram as a "front elevation" view of the distribution, and the box plot as
a "plan" view of the distribution from above.)

# Create a function that we can re-use

def show_distribution(var_data):

    from matplotlib import pyplot as plt

    # Get statistics

    min_val = var_data.min()

    max_val = var_data.max()

    mean_val = var_data.mean()

    med_val = var_data.median()

    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'
          .format(min_val, mean_val, med_val, mod_val, max_val))

    # Create a figure for 2 subplots (2 rows, 1 column)

    fig, ax = plt.subplots(2, 1, figsize = (10,4))

    # Plot the histogram   

    ax[0].hist(var_data)

    ax[0].set_ylabel('Frequency')
    # Add lines for the mean, median, and mode

    ax[0].axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)

    ax[0].axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)

    ax[0].axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)

    ax[0].axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)

    ax[0].axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

    # Plot the boxplot   

    ax[1].boxplot(var_data, vert=False)

    ax[1].set_xlabel('Value')

    # Add a title to the Figure

    fig.suptitle('Data Distribution')

    # Show the figure

    fig.show()

# Get the variable to examine

col = df_students['Grade']

# Call the function

show_distribution(col)


All of the measurements of central tendency are right in the middle of the data
distribution, which is symmetric with values becoming progressively lower in both
directions from the middle.

To explore this distribution in more detail, you need to understand that statistics is
fundamentally about taking samples of data and using probability functions to
extrapolate information about the full population of data.

What does this mean? Samples refer to the data we have on hand - such as information
about these 22 students' study habits and grades. The population refers to all possible
data we could collect - such as every student's grades and study habits across every
educational institution throughout the history of time. Usually we're interested in the
population but it's simply not practical to collect all of that data. Instead, we need to try
estimate what the population is like from the small amount of data (samples) that we
have.
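
To make the sample-versus-population distinction concrete, here's a small simulation sketch. The numbers are invented for illustration (not taken from the student data): we treat a large set of simulated grades as the "population" and draw a 22-observation sample from it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a "population": grades for 100,000 hypothetical students
population = rng.normal(loc=50, scale=20, size=100_000)

# Draw a small sample, like our 22 students
sample = rng.choice(population, size=22)

# The sample mean estimates the (usually unknowable) population mean,
# with an error that shrinks as the sample size grows
print('Population mean: {:.2f}'.format(population.mean()))
print('Sample mean (n=22): {:.2f}'.format(sample.mean()))
```

With only 22 observations, the sample mean can easily be a few points off the population mean; with thousands of observations it would typically land much closer.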

If we have enough samples, we can calculate something called a probability density
function, which estimates the distribution of grades for the full population.

The pyplot module from Matplotlib provides a helpful plot function to show this density.

def show_density(var_data):

    from matplotlib import pyplot as plt

    fig = plt.figure(figsize=(10,4))

    # Plot density

    var_data.plot.density()

    # Add titles and labels

    plt.title('Data Density')

    # Show the mean, median, and mode

    plt.axvline(x=var_data.mean(), color = 'cyan', linestyle='dashed', linewidth = 2)

    plt.axvline(x=var_data.median(), color = 'red', linestyle='dashed', linewidth = 2)

    plt.axvline(x=var_data.mode()[0], color = 'yellow', linestyle='dashed', linewidth = 2)

    # Show the figure

    plt.show()

# Get the density of Grade

col = df_students['Grade']

show_density(col)



As expected from the histogram of the sample, the density shows the characteristic "bell
curve" of what statisticians call a normal distribution with the mean and mode at the
center and symmetric tails.

Summary
Well done! There were a number of new concepts in here, so let's summarise.

Here we have:
1. Made graphs with matplotlib
2. Seen how to customise these graphs
3. Calculated basic statistics, such as medians
4. Looked at the spread of data using box plots and histograms
5. Learned about samples vs populations
6. Estimated what the population of grades might look like from a sample of grades.

In our next notebook we will look at spotting unusual data, and finding relationships
between data.

Further Reading
To learn more about the Python packages you explored in this notebook, see the
following documentation:
 NumPy
 Pandas
 Matplotlib
CodeMarkdown

Empty Markdown cell. Double click or press enter to add content.


UNIT 6/9:

Examine real world data


Completed100 XP
 3 minutes

Data presented in educational material is often remarkably perfect, designed to show
students how to find clear relationships between variables. ‘Real world’ data is a bit
less simple.

Because of this complexity, best practice is to inspect raw ‘real world’ data and
process it before use, which reduces errors or issues, typically by removing erroneous
data points or modifying the data into a more useful form.

Real world data issues


Real world data can contain many different issues that can affect the utility of the data,
and our interpretation of the results.

It's important to realize that most real-world data are influenced by factors that weren't
recorded at the time. For example, we might have a table of race-car track times
alongside engine sizes, but various other factors that weren't written down—such as the
weather—probably also played a role. If problematic, the influence of these factors can
often be reduced by increasing the size of the dataset.

In other situations data points that are clearly outside of what is expected—also known
as ‘outliers’—can sometimes be safely removed from analyses, though care must be
taken to not remove data points that provide real insights.

Another common issue in real-world data is bias. Bias refers to a tendency to select
certain types of values more frequently than others, in a way that misrepresents the
underlying population, or ‘real world’. Bias can sometimes be identified by exploring
data while keeping in mind basic knowledge about where the data came from.

Remember, real-world data will always have issues, but this is often a surmountable
problem. Remember to:
 Check for missing values and badly recorded data
 Consider removal of obvious outliers
 Consider what real-world factors might affect your analysis and consider if your
dataset size is large enough to handle this
 Check for biased raw data and consider your options to fix this, if found
UNIT 7/9:

Exercise - Examine real world data

Exploring data with Python - real world data


Last time, we looked at grades for our student data, and investigated this visually with
histograms and box plots. Now we will look into more complex cases, describe the data
more fully, and discuss how to make basic comparisons between data.

Real world data distributions

Last time, we looked at grades for our student data, and estimated from this sample
what the full population of grades might look like. Just to refresh, lets take a look at this
data again.

Run the code below to print out the data and make a histogram + boxplot that show
the grades for our sample of students.
CodeMarkdown

[ ]

import pandas as pd

from matplotlib import pyplot as plt

# Load data from a text file

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/grades.csv

df_students = pd.read_csv('grades.csv',delimiter=',',header='infer')

# Remove any rows with missing data

df_students = df_students.dropna(axis=0, how='any')

# Calculate who passed, assuming '60' is the grade needed to pass

passes  = pd.Series(df_students['Grade'] >= 60)

# Save who passed to the Pandas dataframe

df_students = pd.concat([df_students, passes.rename("Pass")], axis=1)

# Print the result out into this notebook

print(df_students)
# Create a function that we can re-use

def show_distribution(var_data):

    '''

    This function will make a distribution (graph) and display it

    '''

    # Get statistics

    min_val = var_data.min()

    max_val = var_data.max()

    mean_val = var_data.mean()

    med_val = var_data.median()

    mod_val = var_data.mode()[0]

    print('Minimum:{:.2f}\nMean:{:.2f}\nMedian:{:.2f}\nMode:{:.2f}\nMaximum:{:.2f}\n'
          .format(min_val, mean_val, med_val, mod_val, max_val))

    # Create a figure for 2 subplots (2 rows, 1 column)

    fig, ax = plt.subplots(2, 1, figsize = (10,4))

    # Plot the histogram   

    ax[0].hist(var_data)

    ax[0].set_ylabel('Frequency')

    # Add lines for the mean, median, and mode

    ax[0].axvline(x=min_val, color = 'gray', linestyle='dashed', linewidth = 2)

    ax[0].axvline(x=mean_val, color = 'cyan', linestyle='dashed', linewidth = 2)

    ax[0].axvline(x=med_val, color = 'red', linestyle='dashed', linewidth = 2)

    ax[0].axvline(x=mod_val, color = 'yellow', linestyle='dashed', linewidth = 2)

    ax[0].axvline(x=max_val, color = 'gray', linestyle='dashed', linewidth = 2)

    # Plot the boxplot   
    ax[1].boxplot(var_data, vert=False)

    ax[1].set_xlabel('Value')

    # Add a title to the Figure

    fig.suptitle('Data Distribution')

    # Show the figure

    fig.show()

show_distribution(df_students['Grade'])


As you might recall, our data had the mean and mode at the center, with data spread
symmetrically from there.

Now let's take a look at the distribution of the study hours data.

# Get the variable to examine

col = df_students['StudyHours']

# Call the function

show_distribution(col)


The distribution of the study time data is significantly different from that of the grades.

Note that the whiskers of the box plot only begin at around 6.0, indicating that the vast
majority of the first quarter of the data is above this value. The minimum is marked with
an o, indicating that it is statistically an outlier - a value that lies significantly outside the
range of the rest of the distribution.
Outliers can occur for many reasons. Maybe a student meant to record "10" hours of
study time, but entered "1" and missed the "0". Or maybe the student was abnormally
lazy when it comes to studying! Either way, it's a statistical anomaly that doesn't
represent a typical student. Let's see what the distribution looks like without it.

# Get the variable to examine

# We will only get students who have studied more than one hour

col = df_students[df_students.StudyHours>1]['StudyHours']

# Call the function

show_distribution(col)


For learning purposes we have just treated the value 1 as a true outlier here and
excluded it. In the real world, though, it would be unusual to exclude data at the
extremes without more justification when our sample size is so small. This is because the
smaller our sample size, the more likely it is that our sampling is a bad representation of
the whole population (here, the population means grades for all students, not just our
22). For example, if we sampled study time for another 1000 students, we might find
that it's actually quite common to not study much!

When we have more data available, our sample becomes more reliable. This makes it
easier to consider outliers as being values that fall below or above percentiles within
which most of the data lie. For example, the following code uses the
Pandas quantile function to exclude observations below the 0.01 quantile (the 1st
percentile - the value above which 99% of the data reside).

# Calculate the 0.01 quantile (the 1st percentile)

q01 = df_students.StudyHours.quantile(0.01)

# Get the variable to examine

col = df_students[df_students.StudyHours>q01]['StudyHours']

# Call the function
show_distribution(col)


Tip: You can also eliminate outliers at the upper end of the distribution by defining a
threshold at a high percentile value - for example, you could use the quantile function
to find the 0.99 quantile (the 99th percentile), below which 99% of the data reside.

With the outliers removed, the box plot shows all data within the four quartiles. Note
that the distribution is not symmetric like it is for the grade data though - there are
some students with very high study times of around 16 hours, but the bulk of the data is
between 7 and 13 hours; the few extremely high values pull the mean towards the
higher end of the scale.

Let's look at the density for this distribution.



def show_density(var_data):

    fig = plt.figure(figsize=(10,4))

    # Plot density

    var_data.plot.density()

    # Add titles and labels

    plt.title('Data Density')

    # Show the mean, median, and mode

    plt.axvline(x=var_data.mean(), color = 'cyan', linestyle='dashed', linewidth = 2)

    plt.axvline(x=var_data.median(), color = 'red', linestyle='dashed', linewidth = 2)

    plt.axvline(x=var_data.mode()[0], color = 'yellow', linestyle='dashed', linewidth = 2)

    # Show the figure

    plt.show()

# Get the density of StudyHours

show_density(col)

This kind of distribution is called right skewed. The mass of the data is on the left side of
the distribution, creating a long tail to the right because of the values at the extreme
high end, which pull the mean to the right.
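
A tiny illustration of that mean-versus-median behaviour, using made-up study-hour values rather than the real column: a few extreme values on the right drag the mean above the median.

```python
import pandas as pd

# Mostly moderate values, plus a couple of extreme high ones (invented data)
hours = pd.Series([7, 8, 8, 9, 9, 10, 10, 11, 12, 16, 16])

# In a right-skewed distribution, the mean sits to the right of the median
print('Median: {:.2f}'.format(hours.median()))
print('Mean:   {:.2f}'.format(hours.mean()))
```

Here the median is 10, but the two 16-hour values pull the mean up to roughly 10.5.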

Measures of variance

So now we have a good idea where the middle of the grade and study hours data
distributions are. However, there's another aspect of the distributions we should
examine: how much variability is there in the data?

Typical statistics that measure variability in the data include:


 Range: The difference between the maximum and minimum. There's no built-in function for
this, but it's easy to calculate using the min and max functions.
 Variance: The average of the squared difference from the mean. You can use the built-
in var function to find this.
 Standard Deviation: The square root of the variance. You can use the built-in std function to
find this.

for col_name in ['Grade','StudyHours']:

    col = df_students[col_name]

    rng = col.max() - col.min()

    var = col.var()

    std = col.std()

    print('\n{}:\n - Range: {:.2f}\n - Variance: {:.2f}\n - Std.Dev: {:.2f}'.format(col_name, rng, var, std))



Of these statistics, the standard deviation is generally the most useful. It provides a
measure of variance in the data on the same scale as the data itself (so grade points for
the Grade distribution and hours for the StudyHours distribution). The higher the
standard deviation, the more variance there is when comparing values in the distribution
to the distribution mean - in other words, the data is more spread out.

When working with a normal distribution, the standard deviation works with the
particular characteristics of a normal distribution to provide even greater insight. Run
the cell below to see the relationship between standard deviations and the data in the
normal distribution.

import scipy.stats as stats

# Get the Grade column

col = df_students['Grade']

# get the density

density = stats.gaussian_kde(col)

# Plot the density

col.plot.density()

# Get the mean and standard deviation

s = col.std()

m = col.mean()

# Annotate 1 stdev

x1 = [m-s, m+s]

y1 = density(x1)

plt.plot(x1,y1, color='magenta')

plt.annotate('1 std (68.26%)', (x1[1],y1[1]))

# Annotate 2 stdevs

x2 = [m-(s*2), m+(s*2)]

y2 = density(x2)

plt.plot(x2,y2, color='green')

plt.annotate('2 std (95.45%)', (x2[1],y2[1]))

# Annotate 3 stdevs

x3 = [m-(s*3), m+(s*3)]
y3 = density(x3)

plt.plot(x3,y3, color='orange')

plt.annotate('3 std (99.73%)', (x3[1],y3[1]))

# Show the location of the mean

plt.axvline(col.mean(), color='cyan', linestyle='dashed', linewidth=1)

plt.axis('off')

plt.show()


The horizontal lines show the percentage of data within 1, 2, and 3 standard deviations
of the mean (plus or minus).

In any normal distribution:


 Approximately 68.26% of values fall within one standard deviation from the mean.
 Approximately 95.45% of values fall within two standard deviations from the mean.
 Approximately 99.73% of values fall within three standard deviations from the mean.

So, since we know that the mean grade is 49.18, the standard deviation is 21.74, and
distribution of grades is approximately normal; we can calculate that 68.26% of students
should achieve a grade between 27.44 and 70.92.
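
That 27.44-70.92 interval is just the mean plus or minus one standard deviation; as a quick check using the statistics quoted above:

```python
# Sample statistics quoted above for the Grade column
mean_grade = 49.18
std_grade = 21.74

# For an approximately normal distribution, about 68.26% of values
# fall within one standard deviation of the mean
lower = mean_grade - std_grade
upper = mean_grade + std_grade

print('About 68.26% of grades should lie between {:.2f} and {:.2f}'.format(lower, upper))
```

This reproduces the 27.44 and 70.92 bounds mentioned in the text.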

The descriptive statistics we've used to understand the distribution of the student data
variables are the basis of statistical analysis; and because they're such an important part
of exploring your data, there's a built-in describe method of the DataFrame object that
returns the main descriptive statistics for all numeric columns.

df_students.describe()

Comparing data
Now that you know something about the statistical distribution of the data in your
dataset, you're ready to examine your data to identify any apparent relationships
between variables.

First of all, let's get rid of any rows that contain outliers so that we have a sample that is
representative of a typical class of students. We identified that the StudyHours column
contains some outliers with extremely low values, so we'll remove those rows.

df_sample = df_students[df_students['StudyHours']>1]

df_sample


Comparing numeric and categorical variables

The data includes two numeric variables (StudyHours and Grade) and two
categorical variables (Name and Pass). Let's start by comparing the
numeric StudyHours column to the categorical Pass column to see if there's an
apparent relationship between the number of hours studied and a passing grade.

To make this comparison, let's create box plots showing the distribution of StudyHours
for each possible Pass value (true and false).

df_sample.boxplot(column='StudyHours', by='Pass', figsize=(8,5))



Comparing the StudyHours distributions, it's immediately apparent (if not particularly
surprising) that students who passed the course tended to study for more hours than
students who didn't. So if you wanted to predict whether or not a student is likely to
pass the course, the amount of time they spend studying may be a good predictive
feature.

Comparing numeric variables

Now let's compare two numeric variables. We'll start by creating a bar chart that shows
both grade and study hours.

# Create a bar plot of name vs grade and study hours

df_sample.plot(x='Name', y=['Grade','StudyHours'], kind='bar', figsize=(8,5))


The chart shows bars for both grade and study hours for each student; but it's not easy
to compare because the values are on different scales. Grades are measured in grade
points, and range from 3 to 97; while study time is measured in hours and ranges from 1
to 16.

A common technique when dealing with numeric data in different scales is
to normalize the data so that the values retain their proportional distribution, but are
measured on the same scale. To accomplish this, we'll use a technique
called MinMax scaling that distributes the values proportionally on a scale of 0 to 1. You
could write the code to apply this transformation; but the Scikit-Learn library provides a
scaler to do it for you.

from sklearn.preprocessing import MinMaxScaler

# Get a scaler object

scaler = MinMaxScaler()

# Create a new dataframe for the scaled values

df_normalized = df_sample[['Name', 'Grade', 'StudyHours']].copy()
# Normalize the numeric columns

df_normalized[['Grade','StudyHours']] = scaler.fit_transform(df_normalized[['Grade','StudyHours']])

# Plot the normalized values

df_normalized.plot(x='Name', y=['Grade','StudyHours'], kind='bar', figsize=(8,5))


With the data normalized, it's easier to see an apparent relationship between grade and
study time. It's not an exact match, but it definitely seems like students with higher
grades tend to have studied more.

So there seems to be a correlation between study time and grade; and in fact, there's a
statistical correlation measurement we can use to quantify the relationship between
these columns.

df_normalized.Grade.corr(df_normalized.StudyHours)


The correlation statistic is a value between -1 and 1 that indicates the strength of a
relationship. Values above 0 indicate a positive correlation (high values of one variable
tend to coincide with high values of the other), while values below 0 indicate
a negative correlation (high values of one variable tend to coincide with low values of
the other). In this case, the correlation value is close to 1; showing a strongly positive
correlation between study time and grade.
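
To see the sign convention in isolation, here's a minimal sketch with invented series whose relationships are perfectly linear:

```python
import pandas as pd

x = pd.Series([1, 2, 3, 4, 5])

# A series that rises in lockstep with x: correlation is exactly 1
rising = pd.Series([10, 20, 30, 40, 50])

# A series that falls as x rises: correlation is exactly -1
falling = pd.Series([50, 40, 30, 20, 10])

print('rising:  {:.2f}'.format(x.corr(rising)))
print('falling: {:.2f}'.format(x.corr(falling)))
```

Real data, like the study-time and grade columns, lands somewhere between these two extremes.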
Note: Data scientists often quote the maxim "correlation is not causation". In other
words, as tempting as it might be, you shouldn't interpret the statistical correlation as
explaining why one of the values is high. In the case of the student data, the statistics
demonstrate that students with high grades tend to also have high amounts of study
time; but this is not the same as proving that they achieved high grades because they
studied a lot. The statistic could equally be used as evidence to support the nonsensical
conclusion that the students studied a lot because their grades were going to be high.

Another way to visualise the apparent correlation between two numeric columns is to
use a scatter plot.

# Create a scatter plot

df_sample.plot.scatter(title='Study Time vs Grade', x='StudyHours', y='Grade')


Again, it looks like there's a discernible pattern in which the students who studied the
most hours are also the students who got the highest grades.

We can see this more clearly by adding a regression line (or a line of best fit) to the plot
that shows the general trend in the data. To do this, we'll use a statistical technique
called least squares regression.
Warning - Math Ahead!
Cast your mind back to when you were learning how to solve linear equations in school,
and recall that the slope-intercept form of a linear equation looks like this:
y = mx + b
In this equation, y and x are the coordinate variables, m is the slope of the line, and b is
the y-intercept (where the line goes through the Y-axis).
In the case of our scatter plot for our student data, we already have our values
for x (StudyHours) and y (Grade), so we just need to calculate the intercept and slope of
the straight line that lies closest to those points. Then we can form a linear equation that
calculates a new y value on that line for each of our x (StudyHours) values - to avoid
confusion, we'll call this new y value f(x) (because it's the output from a linear
equation function based on x). The difference between the original y (Grade) value and
the f(x) value is the error between our regression line and the actual Grade achieved by
the student. Our goal is to calculate the slope and intercept for a line with the lowest
overall error.
Specifically, we define the overall error by taking the error for each point, squaring it,
and adding all the squared errors together. The line of best fit is the line that gives us
the lowest value for the sum of the squared errors - hence the name least squares
regression.
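
For a single feature, the "lowest sum of squared errors" line has a closed-form solution: the slope is the sum of the co-deviations of x and y divided by the sum of squared deviations of x, and the intercept makes the line pass through the point (mean of x, mean of y). A sketch with invented numbers (not the student data):

```python
import numpy as np

# Invented x (study hours) and y (grade) values for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([10.0, 25.0, 28.0, 41.0, 55.0])

# Slope: sum of co-deviations over sum of squared x deviations
m = ((x - x.mean()) * (y - y.mean())).sum() / ((x - x.mean()) ** 2).sum()

# Intercept: force the line through (mean of x, mean of y)
b = y.mean() - m * x.mean()

print('f(x) = {:.4f}x + {:.4f}'.format(m, b))
```

For these made-up points the slope works out to 10.6 and the intercept to 0, which you can verify by hand from the deviations.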

Fortunately, you don't need to code the regression calculation yourself -
the SciPy package includes a stats module that provides a linregress method to do the
hard work for you. This returns (among other things) the coefficients you need for the
slope equation - slope (m) and intercept (b) based on a given pair of variable samples
you want to compare.

from scipy import stats

df_regression = df_sample[['Grade', 'StudyHours']].copy()

# Get the regression slope and intercept

m, b, r, p, se = stats.linregress(df_regression['StudyHours'], df_regression['Grade'])

print('slope: {:.4f}\ny-intercept: {:.4f}'.format(m,b))

print('so...\n f(x) = {:.4f}x + {:.4f}'.format(m,b))

# Use the function (mx + b) to calculate f(x) for each x (StudyHours) value

df_regression['fx'] = (m * df_regression['StudyHours']) + b

# Calculate the error between f(x) and the actual y (Grade) value

df_regression['error'] = df_regression['fx'] - df_regression['Grade']

# Create a scatter plot of Grade vs StudyHours

df_regression.plot.scatter(x='StudyHours', y='Grade')

# Plot the regression line

plt.plot(df_regression['StudyHours'],df_regression['fx'], color='cyan')

# Display the plot

plt.show()



Note that this time, the code plotted two distinct things - the scatter plot of the sample
study hours and grades is plotted as before, and then a line of best fit based on the
least squares regression coefficients is plotted.

The slope and intercept coefficients calculated for the regression line are shown above
the plot.

The line is based on the f(x) values calculated for each StudyHours value. Run the
following cell to see a table that includes the following values:
 The StudyHours for each student.
 The Grade achieved by each student.
 The f(x) value calculated using the regression line coefficients.
 The error between the calculated f(x) value and the actual Grade value.

Some of the errors, particularly at the extreme ends, are quite large (up to over 17.5
grade points); but in general, the line is pretty close to the actual grades.

# Show the original x,y values, the f(x) value, and the error

df_regression[['StudyHours', 'Grade', 'fx', 'error']]


Using the regression coefficients for prediction

Now that you have the regression coefficients for the study time and grade relationship,
you can use them in a function to estimate the expected grade for a given amount of
study.

# Define a function based on our regression coefficients

def f(x):

    m = 6.3134

    b = -17.9164

    return m*x + b

study_time = 14
# Get f(x) for study time

prediction = f(study_time)

# Grade can't be less than 0 or more than 100

expected_grade = max(0,min(100,prediction))

# Print the estimated grade

print('Studying for {} hours per week may result in a grade of {:.0f}'.format(study_time, expected_grade))


So by applying statistics to sample data, you've determined a relationship between
study time and grade; and encapsulated that relationship in a general function that can
be used to predict a grade for a given amount of study time.

This technique is in fact the basic premise of machine learning. You can take a set of
sample data that includes one or more features (in this case, the number of hours
studied) and a known label value (in this case, the grade achieved) and use the sample
data to derive a function that calculates predicted label values for any given set of
features.

Summary
Here we've looked at:
1. What an outlier is and how to remove them
2. How data can be skewed
3. How to look at the spread of data
4. Basic ways to compare variables, such as grades and study time

Further Reading
To learn more about the Python packages you explored in this notebook, see the
following documentation:
 NumPy
 Pandas
 Matplotlib


CodeMarkdown

[ ]

Press shift + enter to run


UNIT 8/9:

Knowledge check
200 XP
 3 minutes

Answer the following questions to check your learning.

1. 

You have a NumPy array with the shape (2,20). What does this tell you about the
elements in the array?

The array is two dimensional, consisting of two arrays each with 20 elements

The array contains 2 elements, with the values 2 and 20

The array contains 20 elements, all with the value 2

ANSWER: 1
2. 

You have a Pandas DataFrame named df_sales containing daily sales data. The
DataFrame contains the following columns: year, month, day_of_month, sales_total. You
want to find the average sales_total value. Which code should you use?

df_sales['sales_total'].avg()

df_sales['sales_total'].mean()

mean(df_sales['sales_total'])

ANSWER: 2
3. You have a DataFrame containing data about daily ice cream sales. You use the corr
method to compare the avg_temp and units_sold columns, and get a result of 0.97.
What does this result indicate?

On the day with the maximum units_sold value, the avg_temp value was 0.97

Days with high avg_temp values tend to coincide with days that have high units_sold
values

The units_sold value is, on average, 97% of the avg_temp value

ANSWER: 2
UNIT 9/9:

Summary
Completed100 XP
 1 minute

In this module, you learned how to use Python to explore, visualize, and manipulate
data. Data exploration is at the core of data science, and is a key element in data
analysis and machine learning.

Machine learning is a subset of data science that deals with predictive modeling. In
other words, machine learning uses data to create predictive models, in order to
predict unknown values. You might use machine learning to predict how much food a
supermarket needs to order, or to identify plants in photographs.

Machine learning works by identifying relationships between data values that describe
characteristics of something—its features, such as the height and color of a plant—and
the value we want to predict—the label, such as the species of plant. These relationships
are built into a model through a training process.

Challenge: Analyze Flight Data


If the exercises in this module have inspired you to try exploring data for yourself, why
not take on the challenge of a real world dataset containing flight records from the US
Department of Transportation? You'll find the challenge in the 01 - Flights
Challenge.ipynb notebook!

 Note

The time to complete this optional challenge is not included in the estimated time for
this module - you can spend as little or as much time on it as you like!
MODULE 2:

Train and evaluate regression models

UNIT 1/9:

Introduction
Completed100 XP
 2 minutes

Regression is where models predict a number.

In machine learning, the goal of regression is to create a model that can predict a
numeric, quantifiable value, such as a price, amount, size, or other scalar number.

Regression is a statistical technique of fundamental importance to science because of its
ease of interpretation, robustness, and speed in calculation. Regression models provide
an excellent foundation to understanding how more complex machine learning
techniques work.

In real world situations, particularly when little data are available, regression models are
very useful for making predictions. For example, if a company that rents bicycles wants
to predict the expected number of rentals on a given day in the future, a regression
model can predict this number. A model could be created using existing data such as
the number of bicycles that were rented on days where the season, day of the week, and
so on, were also recorded.
Prerequisites
 Knowledge of basic mathematics
 Some experience programming in Python

Learning objectives
In this module, you will learn:

 When to use regression models.
 How to train and evaluate regression models using the Scikit-Learn framework.
UNIT 2/9:

What is regression?
Completed100 XP

 8 minutes

Regression works by establishing a relationship between variables in the data that
represent characteristics—known as the features—of the thing being observed, and the
variable we're trying to predict—known as the label. Recall our company that rents
bicycles and wants to predict the expected number of rentals on a given day. In this case,
features include things like the day of the week, month, and so on, while the label is the
number of bicycle rentals.

To train the model, we start with a data sample containing the features, as well as
known values for the label - so in this case we need historical data that includes dates,
weather conditions, and the number of bicycle rentals.

We'll then split this data sample into two subsets:

 A training dataset to which we'll apply an algorithm that determines a function
encapsulating the relationship between the feature values and the known label values.
 A validation or test dataset that we can use to evaluate the model by using it to generate
predictions for the label and comparing them to the actual known label values.

The use of historic data with known label values to train a model makes regression an
example of supervised machine learning.

A simple example

Let's take a simple example to see how the training and evaluation process works in
principle. Suppose we simplify the scenario so that we use a single feature—average
daily temperature—to predict the bicycle rentals label.

We start with some data that includes known values for the average daily temperature
feature and the bicycle rentals label.

Temperature Rentals

56 115

61 126

67 137

72 140

76 152

82 156

54 114

62 129

Now we'll randomly select five of these observations and use them to train a regression
model. When we're talking about ‘training a model’, what we mean is finding a function
(a mathematical equation; let’s call it f) that can use the temperature feature (which we’ll
call x) to calculate the number of rentals (which we’ll call y). In other words, we need to
define the following function: f(x) = y.

Our training dataset looks like this:

x y

56 115

61 126

67 137

72 140

76 152

Let's start by plotting the training values for x and y on a chart:


Now we need to fit these values to a function, allowing for some random variation. You
can probably see that the plotted points form an almost straight diagonal line - in other
words, there's an apparent linear relationship between x and y, so we need to find a
linear function that's the best fit for the data sample. There are various algorithms we
can use to determine this function, which will ultimately find a straight line with minimal
overall variance from the plotted points; like this:

The line represents a linear function that can be used with any value of x to apply
the slope of the line and its intercept (where the line crosses the y axis when x is 0) to
calculate y. In this case, if we extended the line to the left we'd find that when x is 0, y is
around 20, and the slope of the line is such that for each unit of x you move along to
the right, y increases by around 1.7. Our f function therefore can be calculated as 20 +
1.7x.
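A minimal sketch of this function in Python (the slope 1.7 and intercept 20 are the approximate values read from the chart, so predictions are approximate too):

```python
# Approximate linear function read from the fitted line: f(x) = 20 + 1.7x
def f(x):
    """Predict bicycle rentals (y) from average daily temperature (x)."""
    return 20 + 1.7 * x

# Apply the function to the temperatures in the training set
for temp in [56, 61, 67, 72, 76]:
    print(f"temperature {temp} -> predicted rentals {f(temp):.1f}")
```

Calling the function for a temperature that wasn't in the training set, such as `f(82)`, is how the model makes predictions for new data.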

Now that we've defined our predictive function, we can use it to predict labels for the
validation data we held back and compare the predicted values (which we typically
indicate with the symbol ŷ, or "y-hat") with the actual known y values.

x y ŷ

82 156 159.4

54 114 111.8

62 129 125.4

Let's see how the y and ŷ values compare in a plot:

The plotted points that are on the function line are the predicted ŷ values calculated by
the function, and the other plotted points are the actual y values.

There are various ways we can measure the variance between the predicted and actual
values, and we can use these metrics to evaluate how well the model predicts.

 Note
Machine learning is based in statistics and math, and it's important to be aware of
specific terms that statisticians and mathematicians (and therefore data scientists) use.
You can think of the difference between a predicted label value and the actual label
value as a measure of error. However, in practice, the "actual" values are based on
sample observations (which themselves may be subject to some random variance). To
make it clear that we're comparing a predicted value (ŷ) with an observed value (y) we
refer to the difference between them as the residuals. We can summarize the residuals
for all of the validation data predictions to calculate the overall loss in the model as a
measure of its predictive performance.

One of the most common ways to measure the loss is to square the individual residuals,
sum the squares, and calculate the mean. Squaring the residuals has the effect of basing
the calculation on absolute values (ignoring whether the difference is negative or
positive) and giving more weight to larger differences. This metric is called the Mean
Squared Error.

For our validation data, the calculation looks like this:

y ŷ y - ŷ (y - ŷ)²

156 159.4 -3.4 11.56

114 111.8 2.2 4.84

129 125.4 3.6 12.96

Sum ∑ 29.36

Mean x̄ 9.79

So the loss for our model based on the MSE metric is 9.79.

So is that any good? It's difficult to tell, because the MSE value isn't expressed in a
meaningful unit of measurement. We do know that the lower the value is, the less loss
there is in the model; and therefore, the better it is predicting. This makes it a useful
metric to compare two models and find the one that performs best.

Sometimes, it's more useful to express the loss in the same unit of measurement as the
predicted label value itself - in this case, the number of rentals. It's possible to do this by
calculating the square root of the MSE, which produces a metric known, unsurprisingly,
as the Root Mean Squared Error (RMSE).

√9.79 = 3.13
So our model's RMSE indicates that the loss is just over 3, which you can interpret
loosely as meaning that on average, incorrect predictions are wrong by around 3 rentals.
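These hand calculations can be reproduced in a few lines of plain Python (the values below are the validation observations from the tables above):

```python
# Validation observations: actual rentals (y) and predictions (y-hat)
y = [156, 114, 129]
y_hat = [159.4, 111.8, 125.4]

# Mean Squared Error: the mean of the squared residuals
residuals = [actual - predicted for actual, predicted in zip(y, y_hat)]
mse = sum(r ** 2 for r in residuals) / len(residuals)

# Root Mean Squared Error: the square root of the MSE
rmse = mse ** 0.5

print(round(mse, 2))   # 9.79
print(round(rmse, 2))  # 3.13
```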

There are many other metrics that can be used to measure loss in a regression. For
example, R² (R-squared, sometimes known as the coefficient of determination) is the
square of the correlation between x and y. This produces a value between 0 and 1 that
measures the amount of variance that can be explained by the model. Generally, the
closer this value is to 1, the better the model predicts.
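A closely related formulation, used by tools such as scikit-learn's r2_score, computes R² as 1 minus the ratio of residual variance to total variance. A sketch using the same validation values as before (with such a tiny sample the number is only illustrative):

```python
y = [156, 114, 129]
y_hat = [159.4, 111.8, 125.4]

# Total sum of squares: variation of the actual values around their mean
mean_y = sum(y) / len(y)
ss_tot = sum((actual - mean_y) ** 2 for actual in y)

# Residual sum of squares: variation left unexplained by the model
ss_res = sum((actual - predicted) ** 2 for actual, predicted in zip(y, y_hat))

r2 = 1 - ss_res / ss_tot
print(round(r2, 2))  # 0.97
```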
UNIT 3/9:

Exercise - Train and evaluate a regression model
Completed100 XP

 8 minutes

Regression
Supervised machine learning techniques involve training a model to operate on a set
of features and predict a label using a dataset that includes some already-known label
values. The training process fits the features to the known labels to define a general
function that can be applied to new features for which the labels are unknown, and
predict them. You can think of this function like this, in which y represents the label we
want to predict and x represents the features the model uses to predict it.
y = f(x)

In most cases, x is actually a vector that consists of multiple feature values, so to be a
little more precise, the function could be expressed like this:

y = f([x1, x2, x3, ...])

The goal of training the model is to find a function that performs some kind of
calculation to the x values that produces the result y. We do this by applying a machine
learning algorithm that tries to fit the x values to a calculation that
produces y reasonably accurately for all of the cases in the training dataset.

There are lots of machine learning algorithms for supervised learning, and they can be
broadly divided into two types:
 Regression algorithms: Algorithms that predict a y value that is a numeric value, such as the
price of a house or the number of sales transactions.
 Classification algorithms: Algorithms that predict to which category, or class, an observation
belongs. The y value in a classification model is a vector of probability values between 0 and 1,
one for each class, indicating the probability of the observation belonging to each class.

In this notebook, we'll focus on regression, using an example based on a real study in
which data for a bicycle sharing scheme was collected and used to predict the number
of rentals based on seasonality and weather conditions. We'll use a simplified version of
the dataset from that study.
Citation: The data used in this exercise is derived from Capital Bikeshare and is used in
accordance with the published license agreement.

Explore the Data


The first step in any machine learning project is to explore the data that you will use to
train a model. The goal of this exploration is to try to understand the relationships
between its attributes; in particular, any apparent correlation between the features and
the label your model will try to predict. This may require some work to detect and fix
issues in the data (such as dealing with missing values, errors, or outlier values), deriving
new feature columns by transforming or combining existing features (a process known
as feature engineering), normalizing numeric features (values you can measure or count)
so they're on a similar scale, and encoding categorical features (values that represent
discrete categories) as numeric indicators.
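As a minimal, hypothetical sketch of those last two steps (min-max normalization and one-hot encoding with pandas; the tiny DataFrame here is illustrative, not part of the bike dataset):

```python
import pandas as pd

# A hypothetical sample: one numeric feature and one categorical feature
df = pd.DataFrame({'temp': [5.0, 15.0, 25.0],
                   'season': ['winter', 'spring', 'summer']})

# Min-max normalization scales a numeric feature to the range 0-1
df['temp_norm'] = (df['temp'] - df['temp'].min()) / (df['temp'].max() - df['temp'].min())

# One-hot encoding turns a categorical feature into numeric indicator columns
df = pd.get_dummies(df, columns=['season'])

print(df)
```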
Let's start by loading the bicycle sharing data as a Pandas DataFrame and viewing the
first few rows.


CodeMarkdown

[ ]

import pandas as pd

# load the training dataset

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-
machine-learning/main/Data/ml-basics/daily-bike-share.csv

bike_data = pd.read_csv('daily-bike-share.csv')

bike_data.head()

Press shift + enter to run

The data consists of the following columns:


 instant: A unique row identifier
 dteday: The date on which the data was observed - in this case, the data was collected daily; so
there's one row per date.
 season: A numerically encoded value indicating the season (1:winter, 2:spring, 3:summer, 4:fall)
 yr: The year of the study in which the observation was made (the study took place over two
years - year 0 represents 2011, and year 1 represents 2012)
 mnth: The calendar month in which the observation was made (1:January ... 12:December)
 holiday: A binary value indicating whether or not the observation was made on a public holiday
 weekday: The day of the week on which the observation was made (0:Sunday ... 6:Saturday)
 workingday: A binary value indicating whether or not the day is a working day (not a weekend
or holiday)
 weathersit: A categorical value indicating the weather situation (1:clear, 2:mist/cloud, 3:light
rain/snow, 4:heavy rain/hail/snow/fog)
 temp: The temperature in celsius (normalized)
 atemp: The apparent ("feels-like") temperature in celsius (normalized)
 hum: The humidity level (normalized)
 windspeed: The windspeed (normalized)
 rentals: The number of bicycle rentals recorded.

In this dataset, rentals represents the label (the y value) our model must be trained to
predict. The other columns are potential features (x values).

As mentioned previously, you can perform some feature engineering to combine or
derive new features. For example, let's add a new column named day to the dataframe
by extracting the day component from the existing dteday column. The new column
represents the day of the month from 1 to 31.
bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day

bike_data.head(32)

OK, let's start our analysis of the data by examining a few key descriptive statistics. We
can use the dataframe's describe method to generate these for the numeric features as
well as the rentals label column.
numeric_features = ['temp', 'atemp', 'hum', 'windspeed']

bike_data[numeric_features + ['rentals']].describe()

The statistics reveal some information about the distribution of the data in each of the
numeric fields, including the number of observations (there are 731 records), the mean,
standard deviation, minimum and maximum values, and the quartile values (the
threshold values for 25%, 50% - which is also the median, and 75% of the data). From
this, we can see that the mean number of daily rentals is around 848; but there's a
comparatively large standard deviation, indicating a lot of variance in the number of
rentals per day.

We might get a clearer idea of the distribution of rentals values by visualizing the data.
Common plot types for visualizing numeric data distributions are histograms and box
plots, so let's use Python's matplotlib library to create one of each of these for
the rentals column.

import pandas as pd

import matplotlib.pyplot as plt

# This ensures plots are displayed inline in the Jupyter notebook

%matplotlib inline

# Get the label column

label = bike_data['rentals']

# Create a figure for 2 subplots (2 rows, 1 column)

fig, ax = plt.subplots(2, 1, figsize = (9,12))

# Plot the histogram   

ax[0].hist(label, bins=100)

ax[0].set_ylabel('Frequency')

# Add lines for the mean, median, and mode

ax[0].axvline(label.mean(), color='magenta', linestyle='dashed', linewidth=2)

ax[0].axvline(label.median(), color='cyan', linestyle='dashed', linewidth=2)

# Plot the boxplot   

ax[1].boxplot(label, vert=False)

ax[1].set_xlabel('Rentals')

# Add a title to the Figure

fig.suptitle('Rental Distribution')

# Show the figure

fig.show()

The plots show that the number of daily rentals ranges from 0 to just over 3,400.
However, the mean (and median) number of daily rentals is closer to the low end of that
range, with most of the data between 0 and around 2,200 rentals. The few values above
this are shown in the box plot as small circles, indicating that they are outliers - in other
words, unusually high or low values beyond the typical range of most of the data.

We can do the same kind of visual exploration of the numeric features. Let's create a
histogram for each of these.

# Plot a histogram for each numeric feature

for col in numeric_features:

    fig = plt.figure(figsize=(9, 6))

    ax = fig.gca()

    feature = bike_data[col]

    feature.hist(bins=100, ax = ax)

    ax.axvline(feature.mean(), color='magenta', linestyle='dashed', linewidth=2)

    ax.axvline(feature.median(), color='cyan', linestyle='dashed', linewidth=2)

    ax.set_title(col)

plt.show()


The numeric features seem to be more normally distributed, with the mean and median
nearer the middle of the range of values, coinciding with where the most commonly
occurring values are.
Note: The distributions are not truly normal in the statistical sense, which would result in
a smooth, symmetric "bell-curve" histogram with the mean and mode (the most
common value) in the center; but they do generally indicate that most of the
observations have a value somewhere near the middle.
We've explored the distribution of the numeric values in the dataset, but what about the
categorical features? These aren't continuous numbers on a scale, so we can't use
histograms; but we can plot a bar chart showing the count of each discrete value for
each category.

import numpy as np

# plot a bar plot for each categorical feature count

categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit', 'day']

for col in categorical_features:

    counts = bike_data[col].value_counts().sort_index()

    fig = plt.figure(figsize=(9, 6))

    ax = fig.gca()

    counts.plot.bar(ax = ax, color='steelblue')

    ax.set_title(col + ' counts')

    ax.set_xlabel(col) 

    ax.set_ylabel("Frequency")

plt.show()


Many of the categorical features show a more or less uniform distribution (meaning
there's roughly the same number of rows for each category). Exceptions to this include:
 holiday: There are many fewer days that are holidays than days that aren't.
 workingday: There are more working days than non-working days.
 weathersit: Most days are category 1 (clear), with category 2 (mist and cloud) the next most
common. There are comparatively few category 3 (light rain or snow) days, and no
category 4 (heavy rain, hail, or fog) days at all.

Now that we know something about the distribution of the data in our columns, we can
start to look for relationships between the features and the rentals label we want to be
able to predict.
For the numeric features, we can create scatter plots that show the intersection of
feature and label values. We can also calculate the correlation statistic to quantify the
apparent relationship.

for col in numeric_features:

    fig = plt.figure(figsize=(9, 6))

    ax = fig.gca()

    feature = bike_data[col]

    label = bike_data['rentals']

    correlation = feature.corr(label)

    plt.scatter(x=feature, y=label)

    plt.xlabel(col)

    plt.ylabel('Bike Rentals')

    ax.set_title('rentals vs ' + col + '- correlation: ' + str(correlation))

plt.show()


The results aren't conclusive, but if you look closely at the scatter plots
for temp and atemp, you can see a vague diagonal trend showing that higher rental
counts tend to coincide with higher temperatures; and a correlation value of just over
0.5 for both of these features supports this observation. Conversely, the plots
for hum and windspeed show a slightly negative correlation, indicating that there are
fewer rentals on days with high humidity or windspeed.

Now let's compare the categorical features to the label. We'll do this by creating box
plots that show the distribution of rental counts for each category.

# plot a boxplot for the label by each categorical feature

for col in categorical_features:

    fig = plt.figure(figsize=(9, 6))
    ax = fig.gca()

    bike_data.boxplot(column = 'rentals', by = col, ax = ax)

    ax.set_title('Label by ' + col)

    ax.set_ylabel("Bike Rentals")

plt.show()


The plots show some variance in the relationship between some category values and
rentals. For example, there's a clear difference in the distribution of rentals on weekends
(weekday 0 or 6) and those during the working week (weekday 1 to 5). Similarly, there
are notable differences for holiday and workingday categories. There's a noticeable
trend that shows different rental distributions in spring and summer months compared
to winter and fall months. The weathersit category also seems to make a difference in
rental distribution. The day feature we created for the day of the month shows little
variation, indicating that it's probably not predictive of the number of rentals.
Train a Regression Model
Now that we've explored the data, it's time to use it to train a regression model that
uses the features we've identified as potentially predictive to predict the rentals label.
The first thing we need to do is to separate the features we want to use to train the
model from the label we want it to predict.
# Separate features and labels

X, y = bike_data[['season','mnth', 'holiday','weekday','workingday','weathersit','temp', 'atemp', 'hum', 'windspeed']].values, bike_data['rentals'].values

print('Features:',X[:10], '\nLabels:', y[:10], sep='\n')

After separating the dataset, we now have numpy arrays named X containing the
features, and y containing the labels.

We could train a model using all of the data; but it's common practice in supervised
learning to split the data into two subsets; a (typically larger) set with which to train the
model, and a smaller "hold-back" set with which to validate the trained model. This
enables us to evaluate how well the model performs when used with the validation
dataset by comparing the predicted labels to the known labels. It's important to split the
data randomly (rather than say, taking the first 70% of the data for training and keeping
the rest for validation). This helps ensure that the two subsets of data are statistically
comparable (so we validate the model with data that has a similar statistical distribution
to the data on which it was trained).

To randomly split the data, we'll use the train_test_split function in the scikit-learn
library. This library is one of the most widely used machine learning packages for
Python.
from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))

Now we have the following four datasets:


 X_train: The feature values we'll use to train the model
 y_train: The corresponding labels we'll use to train the model
 X_test: The feature values we'll use to validate the model
 y_test: The corresponding labels we'll use to validate the model

Now we're ready to train a model by fitting a suitable regression algorithm to the
training data. We'll use a linear regression algorithm, a common starting point for
regression that works by trying to find a linear relationship between the X values and
the y label. The resulting model is a function that conceptually defines the line of best
fit through the X and y values.

In Scikit-Learn, training algorithms are encapsulated in estimators, and in this case we'll
use the LinearRegression estimator to train a linear regression model.

# Train the model

from sklearn.linear_model import LinearRegression

# Fit a linear regression model on the training set

model = LinearRegression().fit(X_train, y_train)

print (model)


Evaluate the Trained Model

Now that we've trained the model, we can use it to predict rental counts for the features
we held back in our validation dataset. Then we can compare these predictions to the
actual label values to evaluate how well (or not!) the model is working.

import numpy as np

predictions = model.predict(X_test)

np.set_printoptions(suppress=True)

print('Predicted labels: ', np.round(predictions)[:10])

print('Actual labels   : ' ,y_test[:10])

Comparing each prediction with its corresponding "ground truth" actual value isn't a
very efficient way to determine how well the model is predicting. Let's see if we can get
a better indication by visualizing a scatter plot that compares the predictions to the
actual labels. We'll also overlay a trend line to get a general sense for how well the
predicted labels align with the true labels.

import matplotlib.pyplot as plt

%matplotlib inline

plt.scatter(y_test, predictions)

plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')

plt.title('Daily Bike Share Predictions')

# overlay the regression line

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()


There's a definite diagonal trend, and the intersections of the predicted and actual
values are generally following the path of the trend line; but there's a fair amount of
difference between the ideal function represented by the line and the results. This
variance represents the residuals of the model - in other words, the difference between
the label predicted when the model applies the coefficients it learned during training to
the validation data, and the actual value of the validation label. When evaluated on the
validation data, these residuals indicate the expected level of error when the model is
used with new data for which the label is unknown.

You can quantify the residuals by calculating a number of commonly used evaluation
metrics. We'll focus on the following three:
 Mean Square Error (MSE): The mean of the squared differences between predicted and actual
values. This yields a relative metric in which the smaller the value, the better the fit of the model
 Root Mean Square Error (RMSE): The square root of the MSE. This yields an absolute metric in
the same unit as the label (in this case, numbers of rentals). The smaller the value, the better the
model (in a simplistic sense, it represents the average number of rentals by which the
predictions are wrong!)
 Coefficient of Determination (usually known as R-squared or R2): A relative metric in which
the higher the value, the better the fit of the model. In essence, this metric represents how much
of the variance between predicted and actual label values the model is able to explain.
Note: You can find out more about these and other metrics for evaluating regression
models in the Scikit-Learn documentation

Let's use Scikit-Learn to calculate these metrics for our model, based on the predictions
it generated for the validation data.

from sklearn.metrics import mean_squared_error, r2_score

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)

rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)


So now we've quantified the ability of our model to predict the number of rentals. It
definitely has some predictive power, but we can probably do better!

Summary
Here we've explored our data and fit a basic regression model. In the next notebook, we
will try a number of other regression algorithms to improve performance.

Further Reading
To learn more about Scikit-Learn, see the Scikit-Learn documentation.
UNIT 4/9:

Discover new regression models


Completed100 XP
 5 minutes

In Unit 2, we looked at fitting a straight line to data points. However, regression can fit
many kinds of relationships, including those with multiple factors, and those where the
importance of one factor depends on another.

Experimenting with models


Regression models are often chosen because they work with small data samples, are
robust, are easy to interpret, and come in many varieties.

Linear regression is the simplest form of regression, with no limit to the number of
features used. Linear regression comes in many forms, often named by the number of
features used and the shape of the curve that is fitted.

Decision trees take a step-by-step approach to predicting a variable. If we think of our
bicycle example, the decision tree may first split examples between those that occur
during Spring/Summer and those during Autumn/Winter, then make a prediction based
on the day of the week. A Spring/Summer Monday may have a bike rental rate of 100
per day, while an Autumn/Winter Monday may have a rental rate of 20 per day.
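The two-level tree described here can be sketched as nested conditionals (the 100 and 20 rental rates come from the example above; the values for the other branches are made-up placeholders):

```python
# A hand-written stand-in for a two-level decision tree:
# first split on season, then predict from the day of the week.
def predict_rentals(season, weekday):
    if season in ('spring', 'summer'):
        if weekday == 'Monday':
            return 100   # illustrative rate from the text
        return 110       # placeholder for the other weekday branches
    else:  # autumn/winter
        if weekday == 'Monday':
            return 20    # illustrative rate from the text
        return 30        # placeholder for the other weekday branches

print(predict_rentals('summer', 'Monday'))  # 100
print(predict_rentals('winter', 'Monday'))  # 20
```

A real decision tree algorithm learns these split points and leaf values from the training data rather than having them hand-coded.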

Ensemble algorithms construct not just one decision tree, but a large number of trees -
allowing better predictions on more complex data. Ensemble algorithms, such as
Random Forest, are widely used in machine learning and science due to their strong
prediction abilities.

Data scientists often experiment with using different models. In the following exercise,
we'll experiment with different types of models to compare how they perform on the
same data.
UNIT 5/9:

Exercise - Experiment with more powerful regression models
Regression - Experimenting with additional models
In the previous notebook, we used simple regression models to look at the relationship between
features of a bike rentals dataset. In this notebook, we'll experiment with more complex models
to improve our regression performance.

Let's start by loading the bicycle sharing data as a Pandas DataFrame and viewing the first few
rows. We'll also split our data into training and test datasets.


CodeMarkdown

[ ]

# Import modules we'll need for this notebook

import pandas as pd

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import train_test_split

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

# load the training dataset

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-
machine-learning/main/Data/ml-basics/daily-bike-share.csv

bike_data = pd.read_csv('daily-bike-share.csv')

bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day

numeric_features = ['temp', 'atemp', 'hum', 'windspeed']
categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit'
, 'day']

bike_data[numeric_features + ['rentals']].describe()

print(bike_data.head())

# Separate features and labels

# After separating the dataset, we now have numpy arrays named **X** containing the f
eatures, and **y** containing the labels.

X, y = bike_data[['season','mnth', 'holiday','weekday','workingday','weathersit','tem
p', 'atemp', 'hum', 'windspeed']].values, bike_data['rentals'].values

# Split data 70%-30% into training set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_stat
e=0)

print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape
[0]))

Press shift + enter to run

Now we have the following four datasets:


 X_train: The feature values we'll use to train the model
 y_train: The corresponding labels we'll use to train the model
 X_test: The feature values we'll use to validate the model
 y_test: The corresponding labels we'll use to validate the model

Now we're ready to train a model by fitting a suitable regression algorithm to the training data.

Experiment with Algorithms


The linear regression algorithm we used last time to train the model has some predictive
capability, but there are many kinds of regression algorithm we could try, including:
 Linear algorithms: Not just the Linear Regression algorithm we used above (which is technically
an Ordinary Least Squares algorithm), but other variants such as Lasso and Ridge.
 Tree-based algorithms: Algorithms that build a decision tree to reach a prediction.
 Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to improve
generalizability.
Note: For a full list of Scikit-Learn estimators that encapsulate algorithms for supervised
machine learning, see the Scikit-Learn documentation. There are many algorithms to choose
from, but for most real-world scenarios, the Scikit-Learn estimator cheat sheet can help you find
a suitable starting point.

Try Another Linear Algorithm

Let's try training our regression model by using a Lasso algorithm. We can do this by just
changing the estimator in the training code.

from sklearn.linear_model import Lasso

# Fit a lasso model on the training set

model = Lasso().fit(X_train, y_train)

print (model, "\n")

# Evaluate the model using the test data

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)

rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)

# Plot predicted vs actual

plt.scatter(y_test, predictions)

plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')

plt.title('Daily Bike Share Predictions')

# overlay the regression line

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()

Try a Decision Tree Algorithm

As an alternative to a linear model, there's a category of algorithms for machine learning that
uses a tree-based approach in which the features in the dataset are examined in a series of
evaluations, each of which results in a branch in a decision tree based on the feature value. At
the end of each series of branches are leaf-nodes with the predicted label value based on the
feature values.

It's easiest to see how this works with an example. Let's train a Decision Tree regression model
using the bike rental data. After training the model, the code below will print the model
definition and a text representation of the tree it uses to predict label values.
[ ]

from sklearn.tree import DecisionTreeRegressor

from sklearn.tree import export_text

# Train the model

model = DecisionTreeRegressor().fit(X_train, y_train)

print (model, "\n")

# Visualize the model tree

tree = export_text(model)

print(tree)

Press shift + enter to run

So now we have a tree-based model; but is it any good? Let's evaluate it with the test data.
[ ]

# Evaluate the model using the test data

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)
rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)

# Plot predicted vs actual

plt.scatter(y_test, predictions)

plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')

plt.title('Daily Bike Share Predictions')

# overlay the regression line

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()

Press shift + enter to run

The tree-based model doesn't seem to have improved over the linear model, so what else could
we try?

Try an Ensemble Algorithm

Ensemble algorithms work by combining multiple base estimators to produce an optimal model,
either by applying an aggregate function to a collection of base models (sometimes referred to
as bagging) or by building a sequence of models that build on one another to improve predictive
performance (referred to as boosting).

For example, let's try a Random Forest model, which applies an averaging function to multiple
Decision Tree models for a better overall model.
[ ]

from sklearn.ensemble import RandomForestRegressor

# Train the model
model = RandomForestRegressor().fit(X_train, y_train)

print (model, "\n")

# Evaluate the model using the test data

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)

rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)

# Plot predicted vs actual

plt.scatter(y_test, predictions)

plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')

plt.title('Daily Bike Share Predictions')

# overlay the regression line

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()

Press shift + enter to run

For good measure, let's also try a boosting ensemble algorithm. We'll use a Gradient Boosting
estimator, which like a Random Forest algorithm builds multiple trees, but instead of building
them all independently and taking the average result, each tree is built on the outputs of the
previous one in an attempt to incrementally reduce the loss (error) in the model.
[ ]

# Train the model
from sklearn.ensemble import GradientBoostingRegressor

# Fit a Gradient Boosting model on the training set

model = GradientBoostingRegressor().fit(X_train, y_train)

print (model, "\n")

# Evaluate the model using the test data

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)

rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)

# Plot predicted vs actual

plt.scatter(y_test, predictions)

plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')

plt.title('Daily Bike Share Predictions')

# overlay the regression line

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()

Press shift + enter to run

Summary
Here we've tried a number of new regression algorithms. In the next units, we'll look at
'tuning' these algorithms to improve their performance further.
Further Reading
To learn more about Scikit-Learn, see the Scikit-Learn documentation.




UNIT 6/9:

Improve models with hyperparameters


Completed100 XP

 5 minutes

Simple models with small datasets can often be fit in a single step, while larger datasets
and more complex models must be fit by repeatedly using the model with training data
and comparing the output with the expected label. If the prediction is accurate enough,
we consider the model trained. If not, we adjust the model slightly and loop again.

Hyperparameters are values that change the way the model is fit during these
loops. Learning rate, for example, is a hyperparameter that sets how much a model is
adjusted during each training cycle. A high learning rate means a model can be trained
faster, but if it's too high, each adjustment can be so large that the model is never
finely tuned and its performance stays suboptimal.

Preprocessing data
Preprocessing refers to changes you make to your data before it is passed to the model.
We have previously read that preprocessing can involve cleaning your dataset. While
this is important, preprocessing can also include changing the format of your data, so
it's easier for the model to use. For example, data described as ‘red’, ‘orange’, ‘yellow’,
‘lime’, and ‘green’, may work better if converted into a format more native to computers,
such as numbers stating the amount of red and the amount of green.

Scaling features

The most common preprocessing step is to scale features so they fall between zero and
one. For example, the weight of a bike and the distance a person travels on a bike may
be numbers on very different scales, but scaling both to between zero and one allows
models to learn more effectively from the data.

Using categories as features

In machine learning, you can also use categorical features such as 'bicycle', 'skateboard’
or 'car'. These features are represented by 0 or 1 values in one-hot vectors - vectors
that have a 0 or 1 for each possible value. For example, bicycle, skateboard, and car
might respectively be (1,0,0), (0,1,0), and (0,0,1).
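As a sketch with hypothetical example data, scikit-learn's OneHotEncoder produces exactly this kind of vector; note that it orders the output columns by sorted category value (bicycle, car, skateboard here):

```python
from sklearn.preprocessing import OneHotEncoder

# Hypothetical categorical feature values, one column
vehicles = [['bicycle'], ['skateboard'], ['car']]

encoder = OneHotEncoder()
encoded = encoder.fit_transform(vehicles).toarray()  # dense 0/1 matrix

print(encoder.categories_[0])  # column order: bicycle, car, skateboard
print(encoded)                 # bicycle -> (1,0,0), skateboard -> (0,0,1), car -> (0,1,0)
```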
UNIT 7/9:

Exercise - Optimize and save models


Regression - Optimize and save models
In the previous notebook, we used complex regression models to look at the
relationship between features of a bike rentals dataset. In this notebook, we'll see if we
can improve the performance of these models even further.

Let's start by loading the bicycle sharing data as a Pandas DataFrame and viewing the
first few rows. As usual, we'll also split our data into training and test datasets.


CodeMarkdown

[ ]

# Import modules we'll need for this notebook

import pandas as pd

from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.model_selection import train_test_split

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

# load the training dataset

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/daily-bike-share.csv

bike_data = pd.read_csv('daily-bike-share.csv')

bike_data['day'] = pd.DatetimeIndex(bike_data['dteday']).day

numeric_features = ['temp', 'atemp', 'hum', 'windspeed']

categorical_features = ['season','mnth','holiday','weekday','workingday','weathersit','day']

bike_data[numeric_features + ['rentals']].describe()

print(bike_data.head())

# Separate features and labels

# After separating the dataset, we now have numpy arrays named **X** containing the features, and **y** containing the labels.

X, y = bike_data[['season','mnth','holiday','weekday','workingday','weathersit','temp','atemp','hum','windspeed']].values, bike_data['rentals'].values

# Split data 70%-30% into training set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training Set: %d rows\nTest Set: %d rows' % (X_train.shape[0], X_test.shape[0]))

Press shift + enter to run

Now we have the following four datasets:


 X_train: The feature values we'll use to train the model
 y_train: The corresponding labels we'll use to train the model
 X_test: The feature values we'll use to validate the model
 y_test: The corresponding labels we'll use to validate the model

Now we're ready to train a model by fitting a boosting ensemble algorithm, as in our last
notebook. Recall that a Gradient Boosting estimator, like a Random Forest algorithm,
builds multiple trees, but instead of building them all independently and taking the
average result, each tree is built on the outputs of the previous one in an attempt to
incrementally reduce the loss (error) in the model.
[ ]

# Train the model

from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor

# Fit a Gradient Boosting model on the training set

model = GradientBoostingRegressor().fit(X_train, y_train)

print (model, "\n")

# Evaluate the model using the test data
predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)

rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)

# Plot predicted vs actual

plt.scatter(y_test, predictions)

plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')

plt.title('Daily Bike Share Predictions')

# overlay the regression line

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()

Press shift + enter to run

Optimize Hyperparameters
Take a look at the GradientBoostingRegressor estimator definition in the output
above, and note that it, like the other estimators we tried previously, includes a large
number of parameters that control the way the model is trained. In machine learning,
the term parameters refers to values that can be determined from data; values that you
specify to affect the behavior of a training algorithm are more correctly referred to
as hyperparameters.
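Every Scikit-Learn estimator exposes its hyperparameters (and their current values) through the get_params() method; a quick sketch of how to see what you could tune:

```python
from sklearn.ensemble import GradientBoostingRegressor

# Inspect the hyperparameters of an untrained estimator
params = GradientBoostingRegressor().get_params()

for name in ['learning_rate', 'n_estimators', 'max_depth']:
    print(name, '=', params[name])
```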

The specific hyperparameters for an estimator vary based on the algorithm that the
estimator encapsulates. In the case of the GradientBoostingRegressor estimator, the
algorithm is an ensemble that combines multiple decision trees to create an overall
predictive model. You can learn about the hyperparameters for this estimator in
the Scikit-Learn documentation.

We won't go into the details of each hyperparameter here, but they work together to
affect the way the algorithm trains a model. In many cases, the default values provided
by Scikit-Learn will work well; but there may be some advantage in modifying
hyperparameters to get better predictive performance or reduce training time.

So how do you know what hyperparameter values you should use? Well, in the absence
of a deep understanding of how the underlying algorithm works, you'll need to
experiment. Fortunately, Scikit-Learn provides a way to tune hyperparameters by trying
multiple combinations and finding the best result for a given performance metric.

Let's try using a grid search approach to try combinations from a grid of possible values
for the learning_rate and n_estimators hyperparameters of
the GradientBoostingRegressor estimator.
[ ]

from sklearn.model_selection import GridSearchCV

from sklearn.metrics import make_scorer, r2_score

# Use a Gradient Boosting algorithm

alg = GradientBoostingRegressor()

# Try these hyperparameter values

params = {

 'learning_rate': [0.1, 0.5, 1.0],

 'n_estimators' : [50, 100, 150]

 }

# Find the best hyperparameter combination to optimize the R2 metric

score = make_scorer(r2_score)

gridsearch = GridSearchCV(alg, params, scoring=score, cv=3, return_train_score=True)

gridsearch.fit(X_train, y_train)

print("Best parameter combination:", gridsearch.best_params_, "\n")

# Get the best model

model=gridsearch.best_estimator_

print(model, "\n")
# Evaluate the model using the test data

predictions = model.predict(X_test)

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)

rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)

# Plot predicted vs actual

plt.scatter(y_test, predictions)

plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')

plt.title('Daily Bike Share Predictions')

# overlay the regression line

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()

Press shift + enter to run

Note: The use of random values in the Gradient Boosting algorithm results in slightly
different metrics each time. In this case, the best model produced by hyperparameter
tuning is unlikely to be significantly better than one trained with the default
hyperparameter values; but it's still useful to know about the hyperparameter tuning
technique!

Preprocess the Data


We trained a model with data that was loaded straight from a source file, with only
moderately successful results.

In practice, it's common to perform some preprocessing of the data to make it easier for
the algorithm to fit a model to it. There's a huge range of preprocessing transformations
you can perform to get your data ready for modeling, but we'll limit ourselves to a few
common techniques:

Scaling numeric features

Normalizing numeric features so they're on the same scale prevents features with large
values from producing coefficients that disproportionately affect the predictions. For
example, suppose your data includes the following numeric features:

A      B      C
3      480    65

Normalizing these features to the same scale may result in the following values
(assuming A contains values from 0 to 10, B contains values from 0 to 1000, and C
contains values from 0 to 100):

A      B      C
0.3    0.48   0.65

There are multiple ways you can scale numeric data, such as calculating the minimum
and maximum values for each column and assigning a proportional value between 0
and 1, or by using the mean and standard deviation of a normally distributed variable to
maintain the same spread of values on a different scale.
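Both approaches are available in Scikit-Learn as MinMaxScaler and StandardScaler; a minimal sketch on a hypothetical column of values:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

column = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # hypothetical feature column

minmax = MinMaxScaler().fit_transform(column)      # proportional values between 0 and 1
standard = StandardScaler().fit_transform(column)  # mean 0, standard deviation 1

print(minmax.ravel())
print(standard.ravel())
```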

Encoding categorical variables

Machine learning models work best with numeric features rather than text values, so
you generally need to convert categorical features into numeric representations. For
example, suppose your data includes the following categorical feature:

Size
S
M
L

You can apply ordinal encoding to substitute a unique integer value for each category,
like this:

Size
0
1
2

Another common technique is to use one-hot encoding to create individual binary (0 or
1) features for each possible category value. For example, you could use one-hot
encoding to translate the possible categories into binary columns like this:

Size_S   Size_M   Size_L
1        0        0
0        1        0
0        0        1

To apply these preprocessing transformations to the bike rental data, we'll make use of a
Scikit-Learn feature named pipelines. These enable us to define a set of preprocessing
steps that end with an algorithm. You can then fit the entire pipeline to the data, so that
the model encapsulates all of the preprocessing steps as well as the regression
algorithm. This is useful, because when we want to use the model to predict values from
new data, we need to apply the same transformations (based on the same statistical
distributions and category encodings used with the training data).
Note: The term pipeline is used extensively in machine learning, often to mean very
different things! In this context, we're using it to refer to pipeline objects in Scikit-Learn,
but you may see it used elsewhere to mean something else.
[ ]

# Train the model

from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline

from sklearn.impute import SimpleImputer

from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.linear_model import LinearRegression

import numpy as np

# Define preprocessing for numeric columns (scale them)

numeric_features = [6,7,8,9]

numeric_transformer = Pipeline(steps=[

    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (encode them)

categorical_features = [0,1,2,3,4,5]

categorical_transformer = Pipeline(steps=[

    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps
preprocessor = ColumnTransformer(

    transformers=[

        ('num', numeric_transformer, numeric_features),

        ('cat', categorical_transformer, categorical_features)])

# Create preprocessing and training pipeline

pipeline = Pipeline(steps=[('preprocessor', preprocessor),

                           ('regressor', GradientBoostingRegressor())])

# fit the pipeline to train a Gradient Boosting model on the training set

model = pipeline.fit(X_train, y_train)

print (model)

Press shift + enter to run

OK, the model is trained, including the preprocessing steps. Let's see how it performs
with the validation data.
[ ]

# Get predictions

predictions = model.predict(X_test)

# Display metrics

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)

rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)

# Plot predicted vs actual

plt.scatter(y_test, predictions)

plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')
plt.title('Daily Bike Share Predictions')

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()

Press shift + enter to run

The pipeline is composed of the transformations and the algorithm used to train the
model. To try an alternative algorithm you can just change that step to a different kind
of estimator.
[ ]

# Use a different estimator in the pipeline

pipeline = Pipeline(steps=[('preprocessor', preprocessor),

                           ('regressor', RandomForestRegressor())])

# fit the pipeline to train a Random Forest model on the training set

model = pipeline.fit(X_train, y_train)

print (model, "\n")

# Get predictions

predictions = model.predict(X_test)

# Display metrics

mse = mean_squared_error(y_test, predictions)

print("MSE:", mse)

rmse = np.sqrt(mse)

print("RMSE:", rmse)

r2 = r2_score(y_test, predictions)

print("R2:", r2)

# Plot predicted vs actual

plt.scatter(y_test, predictions)
plt.xlabel('Actual Labels')

plt.ylabel('Predicted Labels')

plt.title('Daily Bike Share Predictions - Preprocessed')

z = np.polyfit(y_test, predictions, 1)

p = np.poly1d(z)

plt.plot(y_test,p(y_test), color='magenta')

plt.show()

Press shift + enter to run

We've now seen a number of common techniques used to train predictive models for
regression. In a real project, you'd likely try a few more algorithms, hyperparameters,
and preprocessing transformations; but by now you should have got the general idea.
Let's explore how you can use the trained model with new data.

Use the Trained Model

First, let's save the model.


[ ]

import joblib

# Save the model as a pickle file

filename = './bike-share.pkl'

joblib.dump(model, filename)

Press shift + enter to run

Now, we can load it whenever we need it, and use it to predict labels for new data. This
is often called scoring or inferencing.
[ ]

# Load the model from the file

loaded_model = joblib.load(filename)

# Create a numpy array containing a new observation (for example tomorrow's seasonal and weather forecast information)

X_new = np.array([[1,1,0,3,1,1,0.226957,0.22927,0.436957,0.1869]]).astype('float64')

print ('New sample: {}'.format(list(X_new[0])))

# Use the model to predict tomorrow's rentals

result = loaded_model.predict(X_new)

print('Prediction: {:.0f} rentals'.format(np.round(result[0])))

Press shift + enter to run

The model's predict method accepts an array of observations, so you can use it to
generate multiple predictions as a batch. For example, suppose you have a weather
forecast for the next five days; you could use the model to predict bike rentals for each
day based on the expected weather conditions.
[ ]

# An array of features based on five-day weather forecast

X_new = np.array([[0,1,1,0,0,1,0.344167,0.363625,0.805833,0.160446],

                  [0,1,0,1,0,1,0.363478,0.353739,0.696087,0.248539],

                  [0,1,0,2,0,1,0.196364,0.189405,0.437273,0.248309],

                  [0,1,0,3,0,1,0.2,0.212122,0.590435,0.160296],

                  [0,1,0,4,0,1,0.226957,0.22927,0.436957,0.1869]])

# Use the model to predict rentals

results = loaded_model.predict(X_new)

print('5-day rental predictions:')

for prediction in results:

    print(np.round(prediction))
Press shift + enter to run

Summary
That concludes the notebooks for this module on regression. In this notebook we ran a
complex regression, tuned it, saved the model, and used it to predict outcomes for the
future.

Further Reading
To learn more about Scikit-Learn, see the Scikit-Learn documentation.
UNIT 8/9:

Knowledge check
200 XP
 3 minutes

Answer the following questions to check your learning.

1. 

You are using scikit-learn to train a regression model from a dataset of sales data. You
want to be able to evaluate the model to ensure it will predict accurately with new data.
What should you do?

Use all of the data to train the model. Then use all of the data to evaluate it

Train the model using only the feature columns, and then evaluate it using only the label
column

Split the data randomly into two subsets. Use one subset to train the model, and the
other to evaluate it

ANSWER: 3 (A common way to train and evaluate models is to hold back an evaluation
dataset when training.)
2. 

You have created a model object using the scikit-learn LinearRegression class. What
should you do to train the model?

Call the predict() method of the model object, specifying the training feature and label
arrays

Call the fit() method of the model object, specifying the training feature and label arrays

Call the score() method of the model object, specifying the training feature and test
feature arrays

ANSWER: 2 (To train a model, use the fit() method.)


3. 

You train a regression model using scikit-learn. When you evaluate it with test data, you
determine that the model achieves an R-squared metric of 0.95. What does this metric
tell you about the model?

The model explains most of the variance between predicted and actual values.

The model is 95% accurate

On average, predictions are 0.95 higher than actual values

ANSWER: 1 (The R-squared metric is a measure of how much of the variance can be
explained by the model.)
UNIT 9/9:

Summary
Completed100 XP
 1 minute

In this module, you learned how regression can be used to create a machine learning
model that predicts numeric values. You then used the scikit-learn framework in Python
to train and evaluate a regression model.

While scikit-learn is a popular framework for writing code to train regression models,
you can also create machine learning solutions for regression using the graphical tools
in Microsoft Azure Machine Learning. You can learn more about no-code development
of regression models using Azure Machine Learning in the Create a Regression Model
with Azure Machine Learning designer module.

Challenge: Predict Real Estate Prices


Think you're ready to create your own regression model? Try the challenge of predicting
real estate property prices in the 02 - Real Estate Regression Challenge.ipynb notebook!

 Note

The time to complete this optional challenge is not included in the estimated time for
this module - you can spend as little or as much time on it as you like!
MODULE 3:

Train and evaluate classification models


Unit 1/9:

Introduction
Completed100 XP
 2 minutes

Classification is a form of machine learning in which you train a model to predict which
category an item belongs to. For example, a health clinic might use diagnostic data such
as a patient's height, weight, blood pressure, and blood-glucose level to predict whether
or not the patient is diabetic.

Categorical data has distinct 'classes' rather than numeric values. Some kinds of data
can be either numeric or categorical: the time to run a race could be a time in seconds,
or we could split times into the categories 'fast', 'medium', and 'slow'. Other kinds of
data can only be categorical, such as a type of shape - 'circle', 'triangle', or
'square'.

Prerequisites
 Knowledge of basic mathematics
 Some experience programming in Python
Learning objectives
In this module, you will learn:

 When to use classification
 How to train and evaluate a classification model using the Scikit-Learn framework
UNIT 2/9:

What is classification?
Completed100 XP

 5 minutes

Binary classification is classification with two categories. For example, we could label
patients as non-diabetic or diabetic.

The class prediction is made by determining the probability for each possible class as a
value between 0 (impossible) and 1 (certain). The total probability for all classes is 1,
as the patient is definitely either diabetic or non-diabetic. So, if the predicted
probability of a patient being diabetic is 0.3, then there is a corresponding probability
of 0.7 that the patient is non-diabetic.

A threshold value, usually 0.5, is used to determine the predicted class: if the positive
class (in this case, diabetic) has a predicted probability greater than the threshold,
then a classification of diabetic is predicted.
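A minimal sketch of applying that threshold to some hypothetical predicted probabilities:

```python
import numpy as np

# Hypothetical predicted probabilities of the positive (diabetic) class
probabilities = np.array([0.3, 0.5, 0.82, 0.07])

# Probabilities at or above the 0.5 threshold are classified as 1 (diabetic)
predicted_class = (probabilities >= 0.5).astype(int)

print(predicted_class)  # [0 1 1 0]
```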

Training and evaluating a classification model


Classification is an example of a supervised machine learning technique, which means it
relies on data that includes known feature values (for example, diagnostic measurements
for patients) as well as known label values (for example, a classification of non-diabetic
or diabetic). A classification algorithm is used to fit a subset of the data to a function
that can calculate the probability for each class label from the feature values. The
remaining data is used to evaluate the model by comparing the predictions it generates
from the features to the known class labels.

A simple example

Let's explore a simple example to help explain the key principles. Suppose we have the
following patient data, which consists of a single feature (blood-glucose level) and a
class label 0 for non-diabetic, 1 for diabetic.

Blood-Glucose   Diabetic
82              0
92              0
112             1
102             0
115             1
107             1
87              0
120             1
83              0
119             1
104             1
105             0
86              0
109             1

We'll use the first eight observations to train a classification model, and we'll start by
plotting the blood-glucose feature (which we'll call x) against the diabetic label (which
we'll call y).
What we need is a function that calculates a probability value for y based on x (in other
words, we need the function f(x) = y). You can see from the chart that patients with a
low blood-glucose level are all non-diabetic, while patients with a higher blood-glucose
level are diabetic. It seems like the higher the blood-glucose level, the more probable it
is that a patient is diabetic, with the inflexion point being somewhere between 100 and
110. We need to fit a function that calculates a value between 0 and 1 for y to these
values.

One such function is a logistic function, which forms a sigmoidal (S-shaped) curve, like
this:
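Numerically, the logistic function is f(x) = 1 / (1 + e^-x); a quick sketch of how it squeezes any input into the 0-1 range:

```python
import numpy as np

def logistic(x):
    # Sigmoid: maps any real number to a value between 0 and 1
    return 1 / (1 + np.exp(-x))

for x in [-5, 0, 5]:
    print(x, round(float(logistic(x)), 3))
```

Values far below zero map close to 0, zero maps to exactly 0.5, and values far above zero map close to 1.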

Now we can use the function to calculate a probability value that y is positive, meaning
the patient is diabetic, from any value of x by finding the point on the function line
for x. We can set a threshold value of 0.5 as the cut-off point for the class label
prediction.

Let's test it with the data values we held back:

Points plotted below the threshold line will yield a predicted class of 0 (non-diabetic),
and points above the line will be predicted as 1 (diabetic).

Now we can compare the label predictions based on the logistic function encapsulated
in the model (which we'll call ŷ, or "y-hat") to the actual class labels (y).

x     y   ŷ
83    0   0
119   1   1
104   1   0
105   0   1
86    0   0
109   1   1
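From the comparison above we can sketch a simple accuracy calculation: 4 of the 6 held-back predictions match the actual labels:

```python
import numpy as np

y     = np.array([0, 1, 1, 0, 0, 1])  # actual labels from the table
y_hat = np.array([0, 1, 0, 1, 0, 1])  # model predictions (ŷ)

accuracy = (y == y_hat).mean()
print(f"Accuracy: {accuracy:.2f}")  # 0.67 (4 of 6 correct)
```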
UNIT 3/9:

Exercise - Train and evaluate a


classification model
Classification
Supervised machine learning techniques involve training a model to operate on a set
of features and predict a label using a dataset that includes some already-known label
values. You can think of this function like this, in which y represents the label we want to
predict and X represents the vector of features the model uses to predict it.
y = f([x1, x2, x3, ...])

Classification is a form of supervised machine learning in which you train a model to use
the features (the x values in our function) to predict a label (y) that calculates the
probability of the observed case belonging to each of a number of possible classes, and
predicting an appropriate label. The simplest form of classification
is binary classification, in which the label is 0 or 1, representing one of two classes; for
example, "True" or "False"; "Internal" or "External"; "Profitable" or "Non-Profitable"; and
so on.
Binary Classification
In this notebook, we will focus on an example of binary classification, where the model
must predict a label that belongs to one of two classes. In this exercise, we'll train a
binary classifier to predict whether or not a patient should be tested for diabetes based
on some medical data.

Explore the data

Run the following cell to load a CSV file of patient data into a Pandas dataframe:
Citation: The diabetes dataset used in this exercise is based on data originally collected
by the National Institute of Diabetes and Digestive and Kidney Diseases.


CodeMarkdown

Empty Markdown cell. Double click or press enter to add content.


[ ]

import pandas as pd

# load the training dataset

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/diabetes.csv

diabetes = pd.read_csv('diabetes.csv')

diabetes.head()

Press shift + enter to run

This data consists of diagnostic information about some patients who have been tested
for diabetes. Scroll to the right if necessary, and note that the final column in the dataset
(Diabetic) contains the value 0 for patients who tested negative for diabetes, and 1 for
patients who tested positive. This is the label that we will train our model to predict;
most of the other columns (Pregnancies,PlasmaGlucose,DiastolicBloodPressure, and
so on) are the features we will use to predict the Diabetic label.

Let's separate the features from the labels - we'll call the features X and the label y:
[ ]

# Separate features and labels

features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
label = 'Diabetic'

X, y = diabetes[features].values, diabetes[label].values

for n in range(0,4):

    print("Patient", str(n+1), "\n  Features:",list(X[n]), "\n  Label:", y[n])

Press shift + enter to run

Now let's compare the feature distributions for each label value.
[ ]

from matplotlib import pyplot as plt

%matplotlib inline

features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']
for col in features:

    diabetes.boxplot(column=col, by='Diabetic', figsize=(6,6))

    plt.title(col)

plt.show()
Press shift + enter to run

For some of the features, there's a noticeable difference in the distribution for each label
value. In particular, Pregnancies and Age show markedly different distributions for
diabetic patients than for non-diabetic patients. These features may help predict
whether or not a patient is diabetic.

Split the data

Our dataset includes known values for the label, so we can use this to train a classifier so
that it finds a statistical relationship between the features and the label value; but how
will we know if our model is any good? How do we know it will predict correctly when
we use it with new data that it wasn't trained with? Well, we can take advantage of the
fact we have a large dataset with known label values, use only some of it to train the
model, and hold back some to test the trained model - enabling us to compare the
predicted labels with the already known labels in the test set.

In Python, the scikit-learn package contains a large number of functions we can use to
build a machine learning model - including a train_test_split function that ensures we
get a statistically random split of training and test data. We'll use that to split the
data into 70% for training and hold back 30% for testing.
[ ]

from sklearn.model_selection import train_test_split

# Split data 70%-30% into training set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training cases: %d\nTest cases: %d' % (X_train.shape[0], X_test.shape[0]))

Press shift + enter to run

Train and Evaluate a Binary Classification Model

OK, now we're ready to train our model by fitting the training features (X_train) to the
training labels (y_train). There are various algorithms we can use to train the model. In
this example, we'll use Logistic Regression, which (despite its name) is a well-established
algorithm for classification. In addition to the training features and labels, we'll need to
set a regularization parameter. This is used to counteract any bias in the sample, and
help the model generalize well by avoiding overfitting the model to the training data.
Note: Parameters for machine learning algorithms are generally referred to
as hyperparameters (to a data scientist, parameters are values in the data itself
- hyperparameters are defined externally from the data!)
[ ]

# Train the model

from sklearn.linear_model import LogisticRegression

# Set regularization rate

reg = 0.01

# train a logistic regression model on the training set

model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

print (model)

Press shift + enter to run
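To get a feel for what the regularization rate does, here's a small illustrative sketch on synthetic data (not the diabetes set): in scikit-learn, smaller values of C mean stronger regularization, which shrinks the learned coefficients.

```python
# Illustrative only: how C (the inverse of the regularization rate)
# affects the size of logistic regression coefficients.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 4))
# Labels driven mostly by the first two features
y_demo = (X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

for C in [100.0, 1.0, 0.01]:
    clf = LogisticRegression(C=C, solver="liblinear").fit(X_demo, y_demo)
    # Stronger regularization (smaller C) shrinks the coefficients
    print("C =", C, "-> coefficients:", np.round(clf.coef_[0], 2))
```

The last model printed (C=0.01) corresponds to the regularization rate of 0.01 used in the cell above.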

Now that we've trained the model using the training data, we can use the test data we held
back to evaluate how well it predicts. Again, scikit-learn can help us do this. Let's start
by using the model to predict labels for our test set, and compare the predicted labels
to the known labels:
[ ]

predictions = model.predict(X_test)

print('Predicted labels: ', predictions)

print('Actual labels:    ' ,y_test)

Press shift + enter to run


The arrays of labels are too long to be displayed in the notebook output, so we can only
compare a few values. Even if we printed out all of the predicted and actual labels, there
are too many of them to make this a sensible way to evaluate the model.
Fortunately, scikit-learn has a few more tricks up its sleeve, and it provides some
metrics that we can use to evaluate the model.

The most obvious thing you might want to do is to check the accuracy of the
predictions - in simple terms, what proportion of the labels did the model predict
correctly?
[ ]

from sklearn.metrics import accuracy_score

print('Accuracy: ', accuracy_score(y_test, predictions))

Press shift + enter to run

The accuracy is returned as a decimal value - a value of 1.0 would mean that the model
got 100% of the predictions right, while an accuracy of 0.0 is, well, pretty useless!
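As a quick sanity check, accuracy is just the proportion of matching labels, so we can verify what accuracy_score computes by hand. This sketch uses made-up labels rather than the model's output:

```python
# Sketch: verifying what accuracy_score computes, on made-up labels.
from sklearn.metrics import accuracy_score

actual    = [0, 1, 1, 0, 1, 0]
predicted = [0, 1, 0, 0, 1, 1]

manual = sum(a == p for a, p in zip(actual, predicted)) / len(actual)
print('Manual accuracy:', manual)                           # 4 of 6 correct
print('accuracy_score: ', accuracy_score(actual, predicted))  # same value
```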

Summary
Here we prepared our data by splitting it into test and train datasets, and applied
logistic regression - a way of applying binary labels to our data. Our model was able to
predict whether patients had diabetes with what appears like reasonable accuracy. But is
this good enough? In the next notebook we will look at alternatives to accuracy that can
be much more useful in machine learning.
UNIT 4/9:

Evaluate classification models


 4 minutes

The training accuracy of a classification model is much less important than how well that
model will work when given new, unseen data. After all, we train models so that they can
be used on new data we find in the real world. So, after we have trained a classification
model, we should evaluate how it performs on a set of new, unseen data.

In the previous units, we created a model that would predict whether a patient had
diabetes or not based on their blood glucose level. Now, when applied to some data
that wasn't part of the training set, we get the following predictions:

  x  y  ŷ
 83  0  0
119  1  1
104  1  0
105  0  1
 86  0  0
109  1  1
Recall that x refers to the patient's blood glucose level, y refers to whether they're actually
diabetic, and ŷ refers to the model's prediction as to whether they're diabetic or not.

Simply calculating how many predictions were correct is sometimes misleading, or too
simplistic to help us understand the kinds of errors a model will make in the real world. To
get more detailed information, we can tabulate the results in a structure called a confusion
matrix, like this:

            Predicted 0   Predicted 1
Actual 0         2             1
Actual 1         1             2

The confusion matrix shows the total number of cases where:

 The model predicted 0 and the actual label is 0 (true negatives; top left)
 The model predicted 1 and the actual label is 1 (true positives; bottom right)
 The model predicted 0 and the actual label is 1 (false negatives; bottom left)
 The model predicted 1 and the actual label is 0 (false positives; top right)

The cells in the confusion matrix are often shaded so that higher values have a deeper
shade. This makes it easier to see a strong diagonal trend from top-left to bottom-right,
highlighting the cells where the predicted value and actual value are the same.

From these core values, you can calculate a range of other metrics that can help you
evaluate the performance of the model. For example:

 Accuracy: (TP+TN)/(TP+TN+FP+FN) - out of all of the predictions, how many were
correct?
 Recall: TP/(TP+FN) - of all the cases that are positive, how many did the model
identify?
 Precision: TP/(TP+FP) - of all the cases that the model predicted to be positive,
how many actually are positive?
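Applying these formulas to the six example predictions in the table above gives a quick worked example:

```python
# Computing accuracy, recall, and precision from the six-row example
# (y = actual label, y_hat = predicted label).
y     = [0, 1, 1, 0, 0, 1]
y_hat = [0, 1, 0, 1, 0, 1]

tp = sum(a == 1 and p == 1 for a, p in zip(y, y_hat))  # true positives
tn = sum(a == 0 and p == 0 for a, p in zip(y, y_hat))  # true negatives
fp = sum(a == 0 and p == 1 for a, p in zip(y, y_hat))  # false positives
fn = sum(a == 1 and p == 0 for a, p in zip(y, y_hat))  # false negatives

print('Accuracy: ', (tp + tn) / (tp + tn + fp + fn))  # 4/6
print('Recall:   ', tp / (tp + fn))                   # 2/3
print('Precision:', tp / (tp + fp))                   # 2/3
```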
UNIT 5/9:

Exercise - Perform classification with alternative metrics

Classification Metrics
In the last notebook we fit a binary classifier to predict whether patients were diabetic or
not. We used accuracy as a measure of how well the model performed, but accuracy
isn't everything. In this notebook, we will look at alternatives to accuracy that can be
much more useful in machine learning.

Alternative metrics for binary classifiers


Accuracy seems like a sensible metric to evaluate (and to a certain extent it is), but you
need to be careful about drawing too many conclusions from the accuracy of a classifier.
Remember that it's simply a measure of how many cases were predicted correctly.
Suppose only 3% of the population is diabetic. You could create a classifier that always
just predicts 0, and it would be 97% accurate - but not terribly helpful in identifying
patients with diabetes!
Fortunately, there are some other metrics that reveal a little more about how our model
is performing. Scikit-Learn includes the ability to create a classification report that
provides more insight than raw accuracy alone.

To get started, run the next cell to load our data and train our model like last time.


CodeMarkdown

Empty Markdown cell. Double click or press enter to add content.


[ ]

import pandas as pd

from matplotlib import pyplot as plt

%matplotlib inline

from sklearn.model_selection import train_test_split

from sklearn.metrics import accuracy_score

# load the training dataset

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/diabetes.csv

diabetes = pd.read_csv('diabetes.csv')

# Separate features and labels

features = ['Pregnancies','PlasmaGlucose','DiastolicBloodPressure','TricepsThickness','SerumInsulin','BMI','DiabetesPedigree','Age']

label = 'Diabetic'

X, y = diabetes[features].values, diabetes[label].values

# Split data 70%-30% into training set and test set

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=0)

print ('Training cases: %d\nTest cases: %d' % (X_train.shape[0], X_test.shape[0]))

# Train the model

from sklearn.linear_model import LogisticRegression

# Set regularization rate
reg = 0.01

# train a logistic regression model on the training set

model = LogisticRegression(C=1/reg, solver="liblinear").fit(X_train, y_train)

predictions = model.predict(X_test)

print('Predicted labels: ', predictions)

print('Actual labels:    ' ,y_test)

print('Accuracy: ', accuracy_score(y_test, predictions))

Press shift + enter to run

One of the simplest places to start is a classification report. Run the next cell to see a
range of alternative ways to assess our model.
[ ]

from sklearn.metrics import classification_report

print(classification_report(y_test, predictions))

Press shift + enter to run

The classification report includes the following metrics for each class (0 and 1).
Note that the header row may not line up with the values!
 Precision: Of the predictions the model made for this class, what proportion were correct?
 Recall: Out of all of the instances of this class in the test dataset, how many did the model
identify?
 F1-Score: An average metric that takes both precision and recall into account.
 Support: How many instances of this class are there in the test dataset?

The classification report also includes averages for these metrics, including a weighted
average that allows for the imbalance in the number of cases of each class.
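As a sketch of how these summary figures are built (all numbers here are made up for illustration): F1 is the harmonic mean of precision and recall, and the weighted average weights each class's score by its support.

```python
# Sketch: F1 and the support-weighted average, with illustrative values only.
def f1(precision, recall):
    # Harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

print('F1:', f1(0.8, 0.6))   # pulled toward the lower of the two scores

per_class_f1 = {0: 0.85, 1: 0.60}   # hypothetical per-class F1 scores
support      = {0: 180, 1: 120}     # hypothetical cases of each class
weighted = sum(per_class_f1[c] * support[c] for c in support) / sum(support.values())
print('Weighted avg F1:', weighted)
```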
Because this is a binary classification problem, the 1 class is considered positive and its
precision and recall are particularly interesting - these in effect answer the questions:
 Of all the patients the model predicted are diabetic, how many are actually diabetic?
 Of all the patients that are actually diabetic, how many did the model identify?

You can retrieve these values on their own by using
the precision_score and recall_score metrics in scikit-learn (which by default assume a
binary classification model).
[ ]

from sklearn.metrics import precision_score, recall_score

print("Overall Precision:",precision_score(y_test, predictions))

print("Overall Recall:",recall_score(y_test, predictions))

Press shift + enter to run

The precision and recall metrics are derived from four possible prediction outcomes:
 True Positives: The predicted label and the actual label are both 1.
 False Positives: The predicted label is 1, but the actual label is 0.
 False Negatives: The predicted label is 0, but the actual label is 1.
 True Negatives: The predicted label and the actual label are both 0.

These metrics are generally tabulated for the test set and shown together as a confusion
matrix, which takes the following form:

TN  FP
FN  TP

Note that the correct (true) predictions form a diagonal line from top left to bottom
right - these figures should be significantly higher than the false predictions if the model
is any good.
In Python, you can use the sklearn.metrics.confusion_matrix function to find these
values for a trained classifier:
[ ]

from sklearn.metrics import confusion_matrix

# Print the confusion matrix

cm = confusion_matrix(y_test, predictions)

print (cm)

Press shift + enter to run

Until now, we've considered the predictions from the model as being either 1 or 0 class
labels. Actually, things are a little more complex than that. Statistical machine learning
algorithms, like logistic regression, are based on probability; so what actually gets
predicted by a binary classifier is the probability that the label is true (P(y)) and the
probability that the label is false (1 - P(y)). A threshold value of 0.5 is used to decide
whether the predicted label is a 1 (P(y) > 0.5) or a 0 (P(y) <= 0.5). You can use
the predict_proba method to see the probability pairs for each case:
[ ]

y_scores = model.predict_proba(X_test)

print(y_scores)

Press shift + enter to run
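With the probability pairs in hand, you can also apply a threshold other than 0.5 yourself. This sketch uses made-up probability pairs rather than the model's actual output:

```python
# Sketch: applying a custom threshold to predict_proba-style output.
import numpy as np

# Made-up probability pairs: [P(class 0), P(class 1)] for three cases
scores = np.array([[0.9, 0.1],
                   [0.6, 0.4],
                   [0.3, 0.7]])

default_preds = (scores[:, 1] > 0.5).astype(int)  # the standard 0.5 threshold
lenient_preds = (scores[:, 1] > 0.3).astype(int)  # lower threshold -> more 1s

print('Threshold 0.5:', default_preds)  # [0 0 1]
print('Threshold 0.3:', lenient_preds)  # [0 1 1]
```

Lowering the threshold trades false negatives for false positives, which is exactly what the ROC analysis below explores systematically.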

The decision to score a prediction as a 1 or a 0 depends on the threshold to which the
predicted probabilities are compared. If we were to change the threshold, it would affect
the predictions; and therefore change the metrics in the confusion matrix. A common
way to evaluate a classifier is to examine the true positive rate (which is another name
for recall) and the false positive rate for a range of possible thresholds. Plotting these
rates against each other produces a chart known as a receiver operating characteristic
(ROC) chart, like this:
[ ]

from sklearn.metrics import roc_curve

from sklearn.metrics import confusion_matrix

import matplotlib

import matplotlib.pyplot as plt

%matplotlib inline

# calculate ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])

# plot ROC curve

fig = plt.figure(figsize=(6, 6))

# Plot the diagonal 50% line

plt.plot([0, 1], [0, 1], 'k--')

# Plot the FPR and TPR achieved by our model

plt.plot(fpr, tpr)

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC Curve')

plt.show()

Press shift + enter to run

The ROC chart shows the curve of the true and false positive rates for different threshold
values between 0 and 1. A perfect classifier would have a curve that goes straight up the
left side and straight across the top. The diagonal line across the chart represents the
probability of predicting correctly with a 50/50 random prediction; so you obviously
want the curve to be higher than that (or your model is no better than simply guessing!).
The area under the curve (AUC) is a value between 0 and 1 that quantifies the overall
performance of the model. The closer this value is to 1, the better the model. Once
again, scikit-learn includes a function to calculate this metric.
[ ]

from sklearn.metrics import roc_auc_score

auc = roc_auc_score(y_test,y_scores[:,1])

print('AUC: ' + str(auc))

Press shift + enter to run

Perform preprocessing in a pipeline

In this case, the ROC curve and its AUC indicate that the model performs better than a
random guess, which is not bad considering we performed very little preprocessing of
the data.

In practice, it's common to perform some preprocessing of the data to make it easier for
the algorithm to fit a model to it. There's a huge range of preprocessing transformations
you can perform to get your data ready for modeling, but we'll limit ourselves to a few
common techniques:
 Scaling numeric features so they're on the same scale. This prevents features with large values
from producing coefficients that disproportionately affect the predictions.
 Encoding categorical variables. For example, by using a one hot encoding technique you can
create individual binary (true/false) features for each possible category value.
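
As a minimal illustration of these two transformations on a tiny made-up table (not the diabetes data):

```python
# Sketch: scaling a numeric column and one-hot encoding a categorical one.
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

ages = np.array([[15.0], [25.0], [35.0]])
scaled = StandardScaler().fit_transform(ages)
print(scaled.ravel())          # centered on 0 with unit variance

colors = np.array([['red'], ['green'], ['red']])
encoded = OneHotEncoder().fit_transform(colors)
print(encoded.toarray())       # one binary column per category value
```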

To apply these preprocessing transformations, we'll make use of a Scikit-Learn feature
named pipelines. These enable us to define a set of preprocessing steps that end with an
algorithm. You can then fit the entire pipeline to the data, so that the model
encapsulates all of the preprocessing steps as well as the training algorithm. This is
useful because, when we want to use the model to predict values from new data, we
need to apply the same transformations (based on the same statistical distributions and
category encodings used with the training data).
Note: The term pipeline is used extensively in machine learning, often to mean very
different things! In this context, we're using it to refer to pipeline objects in Scikit-Learn,
but you may see it used elsewhere to mean something else.
[ ]

# Train the model

from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline

from sklearn.preprocessing import StandardScaler, OneHotEncoder

from sklearn.linear_model import LogisticRegression

import numpy as np

# Define preprocessing for numeric columns (normalize them so they're on the same scale)

numeric_features = [0,1,2,3,4,5,6]

numeric_transformer = Pipeline(steps=[

    ('scaler', StandardScaler())])

# Define preprocessing for categorical features (encode the Age column)

categorical_features = [7]

categorical_transformer = Pipeline(steps=[

    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

# Combine preprocessing steps

preprocessor = ColumnTransformer(

    transformers=[

        ('num', numeric_transformer, numeric_features),

        ('cat', categorical_transformer, categorical_features)])

# Create preprocessing and training pipeline

pipeline = Pipeline(steps=[('preprocessor', preprocessor),

                           ('logregressor', LogisticRegression(C=1/reg, solver="liblinear"))])

# fit the pipeline to train a logistic regression model on the training set

model = pipeline.fit(X_train, y_train)

print (model)
Press shift + enter to run

The pipeline encapsulates the preprocessing steps as well as model training.

Let's use the model trained by this pipeline to predict labels for our test set, and
compare the performance metrics with the basic model we created previously.
[ ]

# Get predictions from test data

predictions = model.predict(X_test)

y_scores = model.predict_proba(X_test)

# Get evaluation metrics

cm = confusion_matrix(y_test, predictions)

print ('Confusion Matrix:\n',cm, '\n')

print('Accuracy:', accuracy_score(y_test, predictions))

print("Overall Precision:",precision_score(y_test, predictions))

print("Overall Recall:",recall_score(y_test, predictions))

auc = roc_auc_score(y_test,y_scores[:,1])

print('AUC: ' + str(auc))

# calculate ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])

# plot ROC curve

fig = plt.figure(figsize=(6, 6))

# Plot the diagonal 50% line

plt.plot([0, 1], [0, 1], 'k--')

# Plot the FPR and TPR achieved by our model

plt.plot(fpr, tpr)

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC Curve')

plt.show()
Press shift + enter to run

The results look a little better, so clearly preprocessing the data has made a difference.

Try a different algorithm

Now let's try a different algorithm. Previously we used a logistic regression algorithm,
which is a linear algorithm. There are many kinds of classification algorithms we could try,
including:
 Support Vector Machine algorithms: Algorithms that define a hyperplane that separates
classes.
 Tree-based algorithms: Algorithms that build a decision tree to reach a prediction.
 Ensemble algorithms: Algorithms that combine the outputs of multiple base algorithms to
improve generalizability.
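As a quick illustration of the tree-based family above, here's a sketch fitting a decision tree on a tiny made-up dataset (the classes are separable on the first feature):

```python
# Sketch: a tree-based classifier fit on a tiny made-up dataset.
# The tree learns feature-threshold rules that separate the classes.
from sklearn.tree import DecisionTreeClassifier

X_toy = [[1, 0], [2, 1], [8, 1], [9, 0]]   # two features, four cases
y_toy = [0, 0, 1, 1]                        # class depends on the first feature

tree = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
print(tree.predict([[1.5, 1], [8.5, 0]]))   # new cases on either side of the split
```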

This time, we'll use the same preprocessing steps as before, but we'll train the model
using an ensemble algorithm named Random Forest that combines the outputs of
multiple random decision trees (for more details, see the Scikit-Learn documentation).
[ ]

from sklearn.ensemble import RandomForestClassifier

# Create preprocessing and training pipeline

pipeline = Pipeline(steps=[('preprocessor', preprocessor),

                           ('logregressor', RandomForestClassifier(n_estimators=100))])

# fit the pipeline to train a random forest model on the training set

model = pipeline.fit(X_train, y_train)

print (model)
Press shift + enter to run

Let's look at the performance metrics for the new model.


[ ]

predictions = model.predict(X_test)

y_scores = model.predict_proba(X_test)

cm = confusion_matrix(y_test, predictions)

print ('Confusion Matrix:\n',cm, '\n')

print('Accuracy:', accuracy_score(y_test, predictions))

print("Overall Precision:",precision_score(y_test, predictions))

print("Overall Recall:",recall_score(y_test, predictions))

auc = roc_auc_score(y_test,y_scores[:,1])

print('\nAUC: ' + str(auc))

# calculate ROC curve

fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])

# plot ROC curve

fig = plt.figure(figsize=(6, 6))

# Plot the diagonal 50% line

plt.plot([0, 1], [0, 1], 'k--')

# Plot the FPR and TPR achieved by our model

plt.plot(fpr, tpr)

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive Rate')

plt.title('ROC Curve')

plt.show()

Press shift + enter to run

That looks better!


Use the Model for Inferencing

Now that we have a reasonably useful trained model, we can save it for use later to
predict labels for new data:
[ ]

import joblib

# Save the model as a pickle file

filename = './diabetes_model.pkl'

joblib.dump(model, filename)

Press shift + enter to run

When we have some new observations for which the label is unknown, we can load the
model and use it to predict values for the unknown label:
[ ]

# Load the model from the file

model = joblib.load(filename)

# predict on a new sample

# The model accepts an array of feature arrays (so you can predict the classes of multiple patients in a single call)

# We'll create an array with a single array of features, representing one patient

X_new = np.array([[2,180,74,24,21,23.9091702,1.488172308,22]])

print ('New sample: {}'.format(list(X_new[0])))

# Get a prediction

pred = model.predict(X_new)

# The model returns an array of predictions - one for each set of features submitted

# In our case, we only submitted one patient, so our prediction is the first one in the resulting array.

print('Predicted class is {}'.format(pred[0]))
Press shift + enter to run

Summary
In this notebook, we looked at a range of metrics for binary classification and tried a few
algorithms beyond logistic regression. We will move onto more complex classification
problems in the following notebook.

UNIT 6/9:

Create multiclass classification models


Completed100 XP
 4 minutes

It's also possible to create multiclass classification models, in which there are more than
two possible classes. For example, the health clinic might expand the diabetes model to
classify patients as:
 Non-diabetic
 Type-1 diabetic
 Type-2 diabetic

The individual class probability values would still add up to a total of 1 as the patient is
definitely in only one of the three classes, and the most probable class would be
predicted by the model.

Using Multiclass classification models


Multiclass classification can be thought of as a combination of multiple binary classifiers.
There are two ways in which you can approach the problem:

 One vs Rest (OVR), in which a classifier is created for each possible class value,
with a positive outcome for cases where the prediction is this class, and negative
predictions for cases where the prediction is any other class. For example, a
classification problem with four possible shape classes (square, circle, triangle,
hexagon) would require four classifiers that predict:
o square or not
o circle or not
o triangle or not
o hexagon or not
 One vs One (OVO), in which a classifier for each possible pair of classes is created.
The classification problem with four shape classes would require the following
binary classifiers:
o square or circle
o square or triangle
o square or hexagon
o circle or triangle
o circle or hexagon
o triangle or hexagon

In both approaches, the overall model must take into account all of these predictions to
determine which single category the item belongs to.

Fortunately, in most machine learning frameworks, including scikit-learn, implementing
a multiclass classification model is not significantly more complex than binary
classification - and in most cases, the estimators used for binary classification implicitly
support multiclass classification by abstracting an OVR algorithm, an OVO algorithm, or
by allowing a choice of either.
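
Although the estimators usually handle this for you, scikit-learn also lets you wrap any binary estimator in an explicit strategy. The following sketch (synthetic data, not part of the module's notebook) shows how the number of underlying binary classifiers differs between the two approaches:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier, OneVsOneClassifier

# Synthetic three-class dataset for illustration
X, y = make_classification(n_samples=300, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# OVR trains one binary classifier per class; OVO one per pair of classes
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
ovo = OneVsOneClassifier(LogisticRegression(max_iter=1000)).fit(X, y)

print(len(ovr.estimators_))  # 3 classifiers: one per class
print(len(ovo.estimators_))  # 3 classifiers: one per pair (3 choose 2)
```

With four classes, as in the shapes example above, OVR would train four classifiers and OVO six.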
Unit 7/9:

Exercise - Train and evaluate multiclass classification models

Multiclass Classification
In the last notebook, we looked at binary classification. This works well when the data
observations belong to one of two classes or categories, such as "True" or "False". When
the data can be categorized into more than two classes, you must use a multiclass
classification algorithm.

Multiclass classification can be thought of as a combination of multiple binary classifiers.


There are two ways in which you can approach the problem:
 One vs Rest (OVR), in which a classifier is created for each possible class value, with a positive
outcome for cases where the prediction is this class, and negative predictions for cases where
the prediction is any other class. A classification problem with four possible shape classes
(square, circle, triangle, hexagon) would require four classifiers that predict:
o square or not
o circle or not
o triangle or not
o hexagon or not
 One vs One (OVO), in which a classifier for each possible pair of classes is created. The
classification problem with four shape classes would require the following binary classifiers:
o square or circle
o square or triangle
o square or hexagon
o circle or triangle
o circle or hexagon
o triangle or hexagon

In both approaches, the overall model that combines the classifiers generates a vector
of predictions in which the probabilities generated from the individual binary classifiers
are used to determine which class to predict.

Fortunately, in most machine learning frameworks, including scikit-learn, implementing
a multiclass classification model is not significantly more complex than binary
classification - and in most cases, the estimators used for binary classification implicitly
support multiclass classification by abstracting an OVR algorithm, an OVO algorithm, or
by allowing a choice of either.
More Information: To learn more about estimator support for multiclass classification
in Scikit-Learn, see the Scikit-Learn documentation.

Explore the data

Let's start by examining a dataset that contains observations of multiple classes. We'll
use a dataset that contains observations of three different species of penguin.
Citation: The penguins dataset used in this exercise is a subset of data collected and
made available by Dr. Kristen Gorman and the Palmer Station, Antarctica LTER, a
member of the Long Term Ecological Research Network.


CodeMarkdown

[ ]

import pandas as pd

# load the training dataset

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/penguins.csv

penguins = pd.read_csv('penguins.csv')

# Display a random sample of 10 observations

sample = penguins.sample(10)

sample

Press shift + enter to run

The dataset contains the following columns:


 CulmenLength: The length in mm of the penguin's culmen (bill).
 CulmenDepth: The depth in mm of the penguin's culmen.
 FlipperLength: The length in mm of the penguin's flipper.
 BodyMass: The body mass of the penguin in grams.
 Species: An integer value that represents the species of the penguin.

The Species column is the label we want to train a model to predict. The dataset
includes three possible species, which are encoded as 0, 1, and 2. The actual species
names are revealed by the code below:
[ ]

penguin_classes = ['Adelie', 'Gentoo', 'Chinstrap']

print(sample.columns[0:5].values, 'SpeciesName')

for index, row in penguins.sample(10).iterrows():
    print('[',row[0], row[1], row[2], row[3], int(row[4]),']',penguin_classes[int(row[4])])

Press shift + enter to run

Now that we know what the features and labels in the data represent, let's explore the
dataset. First, let's see if there are any missing (null) values.
[ ]

# Count the number of null values for each column

penguins.isnull().sum()

Press shift + enter to run

It looks like there are some missing feature values, but no missing labels. Let's dig a little
deeper and see the rows that contain nulls.
[ ]

# Show rows containing nulls

penguins[penguins.isnull().any(axis=1)]

Press shift + enter to run

There are two rows that contain no feature values at all (NaN stands for "not a
number"), so these won't be useful in training a model. Let's discard them from the
dataset.
[ ]

# Drop rows containing NaN values

penguins=penguins.dropna()

#Confirm there are now no nulls

penguins.isnull().sum()

Press shift + enter to run

Now that we've dealt with the missing values, let's explore how the features relate to the
label by creating some box charts.
[ ]

from matplotlib import pyplot as plt

%matplotlib inline

penguin_features = ['CulmenLength','CulmenDepth','FlipperLength','BodyMass']

penguin_label = 'Species'

for col in penguin_features:

    penguins.boxplot(column=col, by=penguin_label, figsize=(6,6))

    plt.title(col)

plt.show()

Press shift + enter to run

From the box plots, it looks like species 0 and 2 (Adelie and Chinstrap) have similar data
profiles for culmen depth, flipper length, and body mass, but Chinstraps tend to have
longer culmens. Species 1 (Gentoo) tends to have fairly clearly differentiated features
from the others; which should help us train a good classification model.
Prepare the data

Just as for binary classification, before training the model, we need to separate the
features and label, and then split the data into subsets for training and validation. We'll
also apply a stratification technique when splitting the data to maintain the proportion
of each label value in the training and validation datasets.
[ ]

from sklearn.model_selection import train_test_split

# Separate features and labels

penguins_X, penguins_y = penguins[penguin_features].values, penguins[penguin_label].values

# Split data 70%-30% into training set and test set

x_penguin_train, x_penguin_test, y_penguin_train, y_penguin_test = train_test_split(penguins_X, penguins_y,
                                                                                    test_size=0.30,
                                                                                    random_state=0,
                                                                                    stratify=penguins_y)

print ('Training Set: %d, Test Set: %d \n' % (x_penguin_train.shape[0], x_penguin_test.shape[0]))

Press shift + enter to run

Train and evaluate a multiclass classifier

Now that we have a set of training features and corresponding training labels, we can fit
a multiclass classification algorithm to the data to create a model. Most scikit-learn
classification algorithms inherently support multiclass classification. We'll try a logistic
regression algorithm.
[ ]

from sklearn.linear_model import LogisticRegression
# Set regularization rate

reg = 0.1

# train a logistic regression model on the training set

multi_model = LogisticRegression(C=1/reg, solver='lbfgs', multi_class='auto', max_iter=10000).fit(x_penguin_train, y_penguin_train)

print (multi_model)

Press shift + enter to run

Now we can use the trained model to predict the labels for the test features, and
compare the predicted labels to the actual labels:
[ ]

penguin_predictions = multi_model.predict(x_penguin_test)

print('Predicted labels: ', penguin_predictions[:15])

print('Actual labels   : ' ,y_penguin_test[:15])

Press shift + enter to run

Let's look at a classification report.


[ ]

from sklearn.metrics import classification_report

print(classification_report(y_penguin_test, penguin_predictions))
Press shift + enter to run

As with binary classification, the report includes precision and recall metrics for each
class. However, while with binary classification we could focus on the scores for
the positive class, in this case there are multiple classes, so we need to look at an overall
metric (either the macro or weighted average) to get a sense of how well the model
performs across all three classes.

You can get the overall metrics separately from the report using the scikit-learn metrics
score classes, but with multiclass results you must specify which average metric you
want to use for precision and recall.
[ ]

from sklearn.metrics import accuracy_score, precision_score, recall_score

print("Overall Accuracy:",accuracy_score(y_penguin_test, penguin_predictions))

print("Overall Precision:",precision_score(y_penguin_test, penguin_predictions, average='macro'))

print("Overall Recall:",recall_score(y_penguin_test, penguin_predictions, average='macro'))

Press shift + enter to run

Now let's look at the confusion matrix for our model:


[ ]

from sklearn.metrics import confusion_matrix

# Print the confusion matrix

mcm = confusion_matrix(y_penguin_test, penguin_predictions)

print(mcm)

Press shift + enter to run


The confusion matrix shows the intersection of predicted and actual label values for
each class - in simple terms, the diagonal intersections from top-left to bottom-right
indicate the number of correct predictions.

When dealing with multiple classes, it's generally more intuitive to visualize this as a
heat map, like this:
[ ]

import numpy as np

import matplotlib.pyplot as plt

%matplotlib inline

plt.imshow(mcm, interpolation="nearest", cmap=plt.cm.Blues)

plt.colorbar()

tick_marks = np.arange(len(penguin_classes))

plt.xticks(tick_marks, penguin_classes, rotation=45)

plt.yticks(tick_marks, penguin_classes)

plt.xlabel("Predicted Species")

plt.ylabel("Actual Species")

plt.show()

Press shift + enter to run

The darker squares in the confusion matrix plot indicate high numbers of cases, and you
can hopefully see a diagonal line of darker squares indicating cases where the predicted
and actual label are the same.

In the case of a multiclass classification model, a single ROC curve showing true positive
rate vs false positive rate is not possible. However, you can use the rates for each class in
a One vs Rest (OVR) comparison to create a ROC chart for each class.
[ ]

from sklearn.metrics import roc_curve

from sklearn.metrics import roc_auc_score
# Get class probability scores

penguin_prob = multi_model.predict_proba(x_penguin_test)

# Get ROC metrics for each class

fpr = {}

tpr = {}

thresh ={}

for i in range(len(penguin_classes)):    

    fpr[i], tpr[i], thresh[i] = roc_curve(y_penguin_test, penguin_prob[:,i], pos_label=i)

    

# Plot the ROC chart

plt.plot(fpr[0], tpr[0], linestyle='--',color='orange', label=penguin_classes[0] + ' vs Rest')

plt.plot(fpr[1], tpr[1], linestyle='--',color='green', label=penguin_classes[1] + ' vs Rest')

plt.plot(fpr[2], tpr[2], linestyle='--',color='blue', label=penguin_classes[2] + ' vs Rest')

plt.title('Multiclass ROC curve')

plt.xlabel('False Positive Rate')

plt.ylabel('True Positive rate')

plt.legend(loc='best')

plt.show()

Press shift + enter to run

To quantify the ROC performance, you can calculate an aggregate area under the curve
score that is averaged across all of the OVR curves.
[ ]

auc = roc_auc_score(y_penguin_test,penguin_prob, multi_class='ovr')

print('Average AUC:', auc)
Press shift + enter to run

Preprocess data in a pipeline

Again, just like with binary classification, you can use a pipeline to apply preprocessing
steps to the data before fitting it to an algorithm to train a model. Let's see if we can
improve the penguin predictor by scaling the numeric features in a transformation step
before training. We'll also try a different algorithm (a support vector machine), just to
show that we can!
[ ]

from sklearn.preprocessing import StandardScaler

from sklearn.compose import ColumnTransformer

from sklearn.pipeline import Pipeline

from sklearn.svm import SVC

# Define preprocessing for numeric columns (scale them)

feature_columns = [0,1,2,3]

feature_transformer = Pipeline(steps=[

    ('scaler', StandardScaler())

    ])

# Create preprocessing steps

preprocessor = ColumnTransformer(

    transformers=[

        ('preprocess', feature_transformer, feature_columns)])

# Create training pipeline

pipeline = Pipeline(steps=[('preprocessor', preprocessor),

                           ('classifier', SVC(probability=True))])

# fit the pipeline to train a support vector machine model on the training set

multi_model = pipeline.fit(x_penguin_train, y_penguin_train)

print (multi_model)
Press shift + enter to run

Now we can evaluate the new model.


[ ]

# Get predictions from test data

penguin_predictions = multi_model.predict(x_penguin_test)

penguin_prob = multi_model.predict_proba(x_penguin_test)

# Overall metrics

print("Overall Accuracy:",accuracy_score(y_penguin_test, penguin_predictions))

print("Overall Precision:",precision_score(y_penguin_test, penguin_predictions, average='macro'))

print("Overall Recall:",recall_score(y_penguin_test, penguin_predictions, average='macro'))

print('Average AUC:', roc_auc_score(y_penguin_test,penguin_prob, multi_class='ovr'))

# Confusion matrix (recomputed for the new model's predictions)

mcm = confusion_matrix(y_penguin_test, penguin_predictions)

plt.imshow(mcm, interpolation="nearest", cmap=plt.cm.Blues)

plt.colorbar()

tick_marks = np.arange(len(penguin_classes))

plt.xticks(tick_marks, penguin_classes, rotation=45)

plt.yticks(tick_marks, penguin_classes)

plt.xlabel("Predicted Species")

plt.ylabel("Actual Species")

plt.show()

Press shift + enter to run


Use the model with new data observations

Now let's save our trained model so we can use it again later.
[ ]

import joblib

# Save the model as a pickle file

filename = './penguin_model.pkl'

joblib.dump(multi_model, filename)

Press shift + enter to run

OK, so now we have a trained model. Let's use it to predict the class of a new penguin
observation:
[ ]

# Load the model from the file

multi_model = joblib.load(filename)

# The model accepts an array of feature arrays (so you can predict the classes of multiple penguin observations in a single call)

# We'll create an array with a single array of features, representing one penguin

x_new = np.array([[50.4,15.3,224,5550]])

print ('New sample: {}'.format(x_new[0]))

# The model returns an array of predictions - one for each set of features submitted

# In our case, we only submitted one penguin, so our prediction is the first one in the resulting array.

penguin_pred = multi_model.predict(x_new)[0]

print('Predicted class is', penguin_classes[penguin_pred])
Press shift + enter to run

You can also submit a batch of penguin observations to the model, and get back a
prediction for each one.
[ ]

# This time our input is an array of two feature arrays

x_new = np.array([[49.5,18.4,195, 3600],

         [38.2,20.1,190,3900]])

print ('New samples:\n{}'.format(x_new))

# Call the web service, passing the input data

predictions = multi_model.predict(x_new)

# Get the predicted classes.

for prediction in predictions:

    print(prediction, '(' + penguin_classes[prediction] +')')

Press shift + enter to run

Summary
Classification is one of the most common forms of machine learning, and by following
the basic principles we've discussed in this notebook you should be able to train and
evaluate classification models with scikit-learn. It's worth spending some time
investigating classification algorithms in more depth, and a good starting point is
the Scikit-Learn documentation.


UNIT 8/9:

Knowledge check
200 XP
 3 minutes

Answer the following questions to check your learning.

1. 

You plan to use scikit-learn to train a model that predicts credit default risk. The model
must predict a value of 0 for loan applications that should be automatically approved,
and 1 for applications where there is a risk of default that requires human consideration.
What kind of model is required?

A binary classification model

A multi-class classification model

A linear regression model

ANSWER: 1 (A binary classification model predicts the probability for each of two classes.)
2. 

You have trained a classification model using the scikit-learn LogisticRegression class.
You want to use the model to return labels for new data in the array x_new. Which code
should you use?

model.predict(x_new)

model.fit(x_new)

model.score(x_new, y_new)

ANSWER: 1 (Use the predict method for inferencing labels for new data.)

3. 

You train a binary classification model using scikit-learn. When you evaluate it with test
data, you determine that the model achieves an overall Recall metric of 0.81. What does
this metric indicate?

The model correctly predicted 81% of the test cases

81% of the cases predicted as positive by the model were actually positive

The model correctly identified 81% of positive cases as positive


ANSWER: 3 (Recall indicates the proportion of actual positive cases that the classifier
correctly identified.)

UNIT 9/9:

Summary
Completed100 XP

 1 minute

In this module, you learned how classification can be used to create a machine learning
model that predicts categories, or classes. You then used the scikit-learn framework in
Python to train and evaluate a classification model.

While scikit-learn is a popular framework for writing code to train classification models,
you can also create machine learning solutions for classification using the graphical
tools in Microsoft Azure Machine Learning. You can learn more about no-code
development of classification models using Azure Machine Learning in the Create a
classification model with Azure Machine Learning designer module.

Challenge: Classify Wines


Feel like challenging yourself to train a classification model? Try the challenge in
the /challenges/03 - Wine Classification Challenge.ipynb notebook to see if you can
classify wines into their grape varietals!

 Note

The time to complete this optional challenge is not included in the estimated time for
this module - you can spend as little or as much time on it as you like!

Explore other modules


 Create machine learning models
 Foundations of data science for machine learning

MODULE:
Train and evaluate clustering models
UNIT 1/7:
Introduction
Completed100 XP
 2 minutes
Clustering is the process of grouping similar objects together. For example, in the
image below we have a collection of 2D coordinates that have been clustered into three
categories - top left (yellow), bottom (red), and top right (blue).
A major difference between clustering and classification models is that clustering is an
‘unsupervised’ method, where ‘training’ is done without labels. Instead, models identify
examples that have a similar collection of features. In the image above, examples that
are in a similar location are grouped together.

Clustering is common and useful for exploring new data where patterns between data
points, such as high-level categories, are not yet known. It's used in many fields that
need to automatically label complex data, including analysis of social networks, brain
connectivity, spam filtering, and so on.
Unit 2/7:

What is clustering?
Completed100 XP
 5 minutes

Clustering is a form of unsupervised machine learning in which observations are grouped
into clusters based on similarities in their data values, or features. This kind of machine
learning is considered unsupervised because it does not make use of previously
known label values to train a model; in a clustering model, the label is the cluster to
which the observation is assigned, based purely on its features.

For example, suppose a botanist observes a sample of flowers and records the number
of petals and leaves on each flower.

It may be useful to group these flowers into clusters based on similarities between their
features.

There are many ways this could be done. For example, if most flowers have the same
number of leaves, they could be grouped into those with many vs few petals.
Alternatively, if both petal and leaf counts vary considerably there may be a pattern to
discover, such as those with many leaves also having many petals. The goal of the
clustering algorithm is to find the optimal way to split the dataset into groups. What
‘optimal’ means depends on both the algorithm used and the dataset that is provided.

Although this flower example may be simple for a human to achieve with only a few
samples, as the dataset grows to thousands of samples or to more than two features,
clustering algorithms become very useful to quickly dissect a dataset into groups.
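
As a concrete illustration of the flower example (using made-up petal and leaf counts, not data from the module), here is a minimal sketch with scikit-learn's k-means algorithm, which is covered later in this module:

```python
import numpy as np
from sklearn.cluster import KMeans

# Each row is one flower: [petal count, leaf count] (illustrative values)
flowers = np.array([[3, 2],  [4, 2],  [3, 3],    # few petals, few leaves
                    [12, 2], [14, 3], [13, 2],   # many petals, few leaves
                    [13, 11],[15, 12],[14, 10]]) # many petals, many leaves

# Ask the algorithm to split the flowers into three groups
model = KMeans(n_clusters=3, n_init=10, random_state=0).fit(flowers)
print(model.labels_)  # the cluster assigned to each flower
```

No labels were supplied; the three groups emerge purely from similarity of the feature values.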
Unit 3/7:

Exercise - Train and evaluate a clustering model

Clustering - Introduction
In contrast to supervised machine learning, unsupervised learning is used when there is no
"ground truth" from which to train and validate label predictions. The most common form of
unsupervised learning is clustering, which is similar conceptually to classification, except that
the training data does not include known values for the class label to be predicted. Clustering
works by separating the training cases based on similarities that can be determined from their
feature values. Think of it this way: the numeric features of a given entity can be thought of as
vector coordinates that define the entity's position in n-dimensional space. What a clustering
model seeks to do is to identify groups, or clusters, of entities that are close to one another while
being separated from other clusters.

For example, let's take a look at a dataset that contains measurements of different species of
wheat seed.
Citation: The seeds dataset used in this exercise was originally published by the Institute of
Agrophysics of the Polish Academy of Sciences in Lublin, and can be downloaded from the UCI
dataset repository (Dua, D. and Graff, C. (2019). UCI Machine Learning
Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California, School of
Information and Computer Science).


CodeMarkdown

[ ]

import pandas as pd

# load the training dataset

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/seeds.csv
data = pd.read_csv('seeds.csv')

# Display a random sample of 10 observations (just the features)

features = data[data.columns[0:6]]

features.sample(10)

Press shift + enter to run

As you can see, the dataset contains six data points (or features) for each instance (observation)
of a seed. So you could interpret these as coordinates that describe each instance's location in
six-dimensional space.

Now, of course six-dimensional space is difficult to visualise in a three-dimensional world, or on
a two-dimensional plot; so we'll take advantage of a mathematical technique called Principal
Component Analysis (PCA) to analyze the relationships between the features and summarize
each observation as coordinates for two principal components - in other words, we'll translate the
six-dimensional feature values into two-dimensional coordinates.
[ ]

from sklearn.preprocessing import MinMaxScaler

from sklearn.decomposition import PCA

# Normalize the numeric features so they're on the same scale

scaled_features = MinMaxScaler().fit_transform(features[data.columns[0:6]])

# Get two principal components

pca = PCA(n_components=2).fit(scaled_features)

features_2d = pca.transform(scaled_features)

features_2d[0:10]

Press shift + enter to run


Now that we have the data points translated to two dimensions, we can visualize them in a plot:
[ ]

import matplotlib.pyplot as plt

%matplotlib inline

plt.scatter(features_2d[:,0],features_2d[:,1])

plt.xlabel('Dimension 1')

plt.ylabel('Dimension 2')

plt.title('Data')

plt.show()

Press shift + enter to run

Hopefully you can see at least two, arguably three, reasonably distinct groups of data points; but
herein lies one of the fundamental problems with clustering - without known class labels, how do
you know how many clusters to separate your data into?

One way we can try to find out is to use a data sample to create a series of clustering models with
an incrementing number of clusters, and measure how tightly the data points are grouped within
each cluster. A metric often used to measure this tightness is the within cluster sum of
squares (WCSS), with lower values meaning that the data points are closer. You can then plot
the WCSS for each model.
[ ]

# Import the libraries

import numpy as np

import matplotlib.pyplot as plt

from sklearn.cluster import KMeans

%matplotlib inline

# Create 10 models with 1 to 10 clusters

wcss = []

for i in range(1, 11):

    kmeans = KMeans(n_clusters = i, n_init=10)  # explicit n_init for consistent behavior across scikit-learn versions
    # Fit the data points

    kmeans.fit(features.values)

    # Get the WCSS (inertia) value

    wcss.append(kmeans.inertia_)

    

#Plot the WCSS values onto a line graph

plt.plot(range(1, 11), wcss)

plt.title('WCSS by Clusters')

plt.xlabel('Number of clusters')

plt.ylabel('WCSS')

plt.show()

Press shift + enter to run

The plot shows a large reduction in WCSS (so greater tightness) as the number of
clusters increases from one to two, and a further noticeable reduction from two to three
clusters. After that, the reduction is less pronounced, resulting in an "elbow" in the chart
at around three clusters. This is a good indication that there are two to three reasonably
well separated clusters of data points.

Summary
Here we looked at what clustering means, and how to determine whether clustering
might be appropriate for your data. In the next notebook, we will look at two ways of
labelling the data automatically.
UNIT 4/7:

Evaluate different types of clustering


Completed100 XP

 5 minutes

Training a clustering model


There are multiple algorithms you can use for clustering. One of the most commonly
used algorithms is K-Means clustering that, in its simplest form, consists of the following
steps:

1. The feature values are vectorized to define n-dimensional coordinates (where n is the
number of features). In the flower example, we have two features (number of petals and
number of leaves), so the feature vector has two coordinates that we can use to
conceptually plot the data points in two-dimensional space.
2. You decide how many clusters you want to use to group the flowers, and call this value k.
For example, to create three clusters, you would use a k value of 3. Then k points are
plotted at random coordinates. These points will ultimately be the center points for each
cluster, so they're referred to as centroids.
3. Each data point (in this case flower) is assigned to its nearest centroid.
4. Each centroid is moved to the center of the data points assigned to it based on the mean
distance between the points.
5. After moving the centroid, the data points may now be closer to a different centroid, so
the data points are reassigned to clusters based on the new closest centroid.
6. The centroid movement and cluster reallocation steps are repeated until the clusters
become stable or a pre-determined maximum number of iterations is reached.
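
The steps above can be sketched directly in NumPy (illustrative only: the 2D points are made up, and for reproducibility the starting centroids are seeded one per region rather than fully at random, as noted in the comments):

```python
import numpy as np

rng = np.random.default_rng(0)
points = np.vstack([rng.normal(0, 0.5, (20, 2)),    # group near (0, 0)
                    rng.normal(5, 0.5, (20, 2))])   # group near (5, 5)

k = 2
# Step 2: choose starting centroids. For a reproducible sketch we seed one in
# each region; the real algorithm places them at random coordinates.
centroids = points[[0, 20]].copy()

for _ in range(10):  # steps 3-6: assign, move, repeat
    # Step 3: assign each point to its nearest centroid
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # Step 4: move each centroid to the mean of the points assigned to it
    centroids = np.array([points[labels == i].mean(axis=0) for i in range(k)])

print(centroids)  # settles close to (0, 0) and (5, 5)
```

After a few iterations the centroids stop moving, which is the stopping condition described in step 6.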

The following animation shows this process:


Hierarchical Clustering
Hierarchical clustering is another type of clustering algorithm in which clusters
themselves belong to a larger group, which belong to even larger groups, and so on.
The result is that data points can be clusters in differing degrees of precision: with a
large number of very small and precise groups, or a small number of larger groups.

For example, if we apply clustering to the meanings of words, we may get a group
containing adjectives specific to emotions (‘angry’, ‘happy’, and so on), which itself
belongs to a group containing all human-related adjectives (‘happy’, ‘handsome’,
‘young’), and this belongs to an even higher group containing all adjectives (‘happy’,
‘green’, ‘handsome’, ‘hard’ etc.).
Hierarchical clustering is useful for not only breaking data into groups, but
understanding the relationships between these groups. A major advantage of
hierarchical clustering is that it does not require the number of clusters to be defined in
advance, and can sometimes provide more interpretable results than non-hierarchical
approaches. The major drawback is that these approaches can take much longer to
compute than simpler approaches and sometimes are not suitable for large datasets.
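
As a brief illustration (synthetic data, not part of the module's notebook), scikit-learn's AgglomerativeClustering builds such a hierarchy bottom-up; asking for different numbers of clusters corresponds to cutting the tree at a coarser or finer level:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Three made-up groups of 2D points centered at (0,0), (3,3), and (6,6)
rng = np.random.default_rng(1)
points = np.vstack([rng.normal(loc, 0.3, (10, 2)) for loc in (0, 3, 6)])

# Cut the hierarchy at two levels: a small number of large groups,
# and a larger number of smaller, more precise groups
coarse = AgglomerativeClustering(n_clusters=2).fit_predict(points)
fine = AgglomerativeClustering(n_clusters=3).fit_predict(points)

print(len(set(coarse)), len(set(fine)))  # 2 3
```

The same points are grouped both ways without re-specifying anything about the data, which is what makes hierarchical approaches useful for exploring relationships between groups.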
UNIT 5/7:
Exercise - Train and evaluate advanced clustering models

Clustering - K-Means and Hierarchical


In the last notebook, we learned that data can be broken into clusters and learned how
to see whether data may be compatible with such an analysis. In this notebook, we will
perform this clustering automatically.

To get started, run the cell below to load our data.


Citation: The seeds dataset used in this exercise was originally published by the
Institute of Agrophysics of the Polish Academy of Sciences in Lublin, and can be
downloaded from the UCI dataset repository (Dua, D. and Graff, C. (2019). UCI Machine
Learning Repository http://archive.ics.uci.edu/ml. Irvine, CA: University of California,
School of Information and Computer Science).


CodeMarkdown

[ ]

import pandas as pd

# load the training dataset

!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/ml-basics/seeds.csv

data = pd.read_csv('seeds.csv')

# Display a random sample of 10 observations (just the features)

features = data[data.columns[0:6]]

features.sample(10)

from sklearn.preprocessing import MinMaxScaler

from sklearn.decomposition import PCA

# Normalize the numeric features so they're on the same scale
scaled_features = MinMaxScaler().fit_transform(features[data.columns[0:6]])

# Get two principal components

pca = PCA(n_components=2).fit(scaled_features)

features_2d = pca.transform(scaled_features)

features_2d[0:10]


K-Means Clustering
The algorithm we used to create our test clusters is K-Means. This is a commonly used
clustering algorithm that separates a dataset into K clusters of equal variance. The
number of clusters, K, is user defined. The basic algorithm has the following steps:
1. A set of K centroids is randomly chosen.
2. Clusters are formed by assigning the data points to their closest centroid.
3. The means of each cluster is computed and the centroid is moved to the mean.
4. Steps 2 and 3 are repeated until a stopping criterion is met. Typically, the algorithm terminates
when each new iteration results in negligible movement of centroids and the clusters become
static.
5. When the clusters stop changing, the algorithm has converged, defining the locations of the
clusters. Note that the random starting point for the centroids means that re-running the
algorithm could result in slightly different clusters, so training usually involves multiple
iterations, reinitializing the centroids each time, and the model with the best WCSS (within-cluster sum of squares) is selected.
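The effect of multiple centroid initializations can be seen directly in scikit-learn, where the n_init parameter controls how many times the algorithm is re-run with fresh random centroids, and the inertia_ attribute exposes the WCSS of the best run. A minimal sketch, using synthetic data from make_blobs as a stand-in for real features:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with three natural groupings (a stand-in for real features)
X, _ = make_blobs(n_samples=150, centers=3, random_state=0)

# n_init re-runs K-Means with fresh random centroids and keeps the model
# with the lowest within-cluster sum of squares (WCSS, exposed as inertia_)
model = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = model.fit_predict(X)

print(len(set(clusters)))    # 3 distinct cluster labels
print(model.inertia_ > 0)    # True - the WCSS of the best run
```

Plotting inertia_ for a range of K values is a common way to choose the number of clusters (the "elbow" method).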

Let's try using K-Means on our seeds data with a K value of 3.



from sklearn.cluster import KMeans

# Create a model based on 3 centroids

model = KMeans(n_clusters=3, init='k-means++', n_init=100, max_iter=1000)

# Fit to the data and predict the cluster assignments for each data point

km_clusters = model.fit_predict(features.values)

# View the cluster assignments

km_clusters

Let's see those cluster assignments with the two-dimensional data points.

import matplotlib.pyplot as plt

%matplotlib inline

def plot_clusters(samples, clusters):

    col_dic = {0:'blue',1:'green',2:'orange'}

    mrk_dic = {0:'*',1:'x',2:'+'}

    colors = [col_dic[x] for x in clusters]

    markers = [mrk_dic[x] for x in clusters]

    for sample in range(len(clusters)):

        plt.scatter(samples[sample][0], samples[sample][1], color = colors[sample], marker=markers[sample], s=100)

    plt.xlabel('Dimension 1')

    plt.ylabel('Dimension 2')

    plt.title('Assignments')

    plt.show()

plot_clusters(features_2d, km_clusters)


Hopefully, the data has been separated into three distinct clusters.

So what's the practical use of clustering? In some cases, you may have data that you
need to group into distinct clusters without knowing how many clusters there are or what
they indicate. For example, a marketing organization might want to separate customers
into distinct segments, and then investigate how those segments exhibit different
purchasing behaviors.

Sometimes, clustering is used as an initial step towards creating a classification model.


You start by identifying distinct groups of data points, and then assign class labels to
those clusters. You can then use this labelled data to train a classification model.

In the case of the seeds data, the different species of seed are already known and
encoded as 0 (Kama), 1 (Rosa), or 2 (Canadian), so we can use these identifiers to
compare the species classifications to the clusters identified by our unsupervised
algorithm.

seed_species = data[data.columns[7]]

plot_clusters(features_2d, seed_species.values)


There may be some differences between the cluster assignments and class labels, but
the K-Means model should have done a reasonable job of clustering the observations so
that seeds of the same species are generally in the same cluster.
Hierarchical Clustering
Hierarchical clustering methods make fewer distributional assumptions when compared
to K-means methods. However, K-means methods are generally more scalable,
sometimes very much so.

Hierarchical clustering creates clusters by either a divisive method or an agglomerative
method. The divisive method is a "top down" approach starting with the entire dataset
and then finding partitions in a stepwise manner. Agglomerative clustering is a "bottom
up" approach. In this lab you will work with agglomerative clustering, which roughly
works as follows:
1. The linkage distances between each of the data points is computed.
2. Points are clustered pairwise with their nearest neighbor.
3. Linkage distances between the clusters are computed.
4. Clusters are combined pairwise into larger clusters.
5. Steps 3 and 4 are repeated until all data points are in a single cluster.

The linkage function can be computed in a number of ways:


 Ward linkage measures the increase in variance for the clusters being linked,
 Average linkage uses the mean pairwise distance between the members of the two clusters,
 Complete or Maximal linkage uses the maximum distance between the members of the two
clusters.

Several different distance metrics are used to compute linkage functions:


 Euclidean or l2 distance is the most widely used. This metric is the only choice for the Ward
linkage method.
 Manhattan or l1 distance is robust to outliers and has other interesting properties.
 Cosine similarity is the dot product between the location vectors divided by the magnitudes of
the vectors. Notice that this metric is a measure of similarity, whereas the other two metrics are
measures of difference. Similarity can be quite useful when working with data such as images or
text documents.
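In scikit-learn, the linkage method is selected with the linkage parameter of AgglomerativeClustering (recent releases also let you change the distance metric via a metric parameter; older releases called it affinity). A minimal sketch comparing the three linkage methods on synthetic data:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

# Synthetic two-group data as a stand-in for real features
X, _ = make_blobs(n_samples=60, centers=2, random_state=1)

# The linkage parameter selects how the distance between clusters is measured
results = {}
for linkage in ['ward', 'average', 'complete']:
    model = AgglomerativeClustering(n_clusters=2, linkage=linkage)
    results[linkage] = model.fit_predict(X)

print({name: len(set(labels)) for name, labels in results.items()})
```

On well-separated data like this, all three linkage methods usually agree; on elongated or noisy clusters, their assignments can differ noticeably.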

Agglomerative Clustering

Let's see an example of clustering the seeds data using an agglomerative clustering
algorithm.

from sklearn.cluster import AgglomerativeClustering

agg_model = AgglomerativeClustering(n_clusters=3)

agg_clusters = agg_model.fit_predict(features.values)

agg_clusters


So what do the agglomerative cluster assignments look like?



import matplotlib.pyplot as plt

%matplotlib inline

def plot_clusters(samples, clusters):

    col_dic = {0:'blue',1:'green',2:'orange'}
    mrk_dic = {0:'*',1:'x',2:'+'}

    colors = [col_dic[x] for x in clusters]

    markers = [mrk_dic[x] for x in clusters]

    for sample in range(len(clusters)):

        plt.scatter(samples[sample][0], samples[sample][1], color = colors[sample], marker=markers[sample], s=100)

    plt.xlabel('Dimension 1')

    plt.ylabel('Dimension 2')

    plt.title('Assignments')

    plt.show()

plot_clusters(features_2d, agg_clusters)


Summary
Here we practiced using K-means and hierarchical clustering. These unsupervised
learning techniques can take unlabelled data and identify which observations are similar
to one another.

Further Reading
To learn more about clustering with scikit-learn, see the scikit-learn documentation.
UNIT 6/7:
Knowledge check
200 XP

 3 minutes

Answer the following questions to check your learning.

1. 

K-Means clustering is an example of which kind of machine learning?

Supervised machine learning

Unsupervised machine learning

Reinforcement learning

ANSWER: 2 (Clustering is a form of unsupervised machine learning in which the
training data does not include known labels.)
2. 

You are using scikit-learn to train a K-Means clustering model that groups observations
into three clusters. How should you create the KMeans object to accomplish this goal?

model = KMeans(n_clusters=3)

model = KMeans(n_init=3)

model = KMeans(max_iter=3)

ANSWER: 1. (The n_clusters parameter determines the number of clusters.)


Summary
Completed100 XP

 1 minute

In this module, you learned how clustering can be used to create unsupervised machine
learning models that group data observations into clusters. You then used the scikit-
learn framework in Python to train a clustering model.

While scikit-learn is a popular framework for writing code to train clustering models, you
can also create machine learning solutions for clustering using the graphical tools in
Microsoft Azure Machine Learning. You can learn more about no-code development of
clustering models using Azure Machine Learning in the Create a clustering model with
Azure Machine Learning designer module.

Challenge: Cluster Unlabelled Data


Now that you've seen how to create a clustering model, why not try for yourself? You'll
find a clustering challenge in the 04 - Clustering Challenge.ipynb notebook!

 Note

The time to complete this optional challenge is not included in the estimated time for
this module - you can spend as little or as much time on it as you like!

Module complete:
MODULE 5: TRAIN AND EVALUATE DEEP
LEARNING MODELS

UNIT 1/9:
Introduction
Completed100 XP

 5 minutes

Deep learning is an advanced form of machine learning that tries to emulate the way the
human brain learns.

In your brain, you have nerve cells called neurons, which are connected to one another
by nerve extensions that pass electrochemical signals through the network.

When the first neuron in the network is stimulated, the input signal is processed, and if
it exceeds a particular threshold, the neuron is activated and passes the signal on to the
neurons to which it is connected. These neurons in turn may be activated and pass the
signal on through the rest of the network. Over time, the connections between the
neurons are strengthened by frequent use as you learn how to respond effectively. For
example, if someone throws a ball towards you, your neuron connections enable you to
process the visual information and coordinate your movements to catch the ball. If you
perform this action repeatedly, the network of neurons involved in catching a ball will
grow stronger as you learn how to be better at catching a ball.

Deep learning emulates this biological process using artificial neural networks that
process numeric inputs rather than electrochemical stimuli.

The incoming nerve connections are replaced by numeric inputs that are typically
identified as x. When there's more than one input value, x is considered a vector with
elements named x1, x2, and so on.

Associated with each x value is a weight (w), which is used to strengthen or weaken the
effect of the x value to simulate learning. Additionally, a bias (b) input is added to
enable fine-grained control over the network. During the training process,
the w and b values will be adjusted to tune the network so that it "learns" to produce
correct outputs.

The neuron itself encapsulates a function that calculates a weighted sum of x, w, and b.
This function is in turn enclosed in an activation function that constrains the result (often
to a value between 0 and 1) to determine whether or not the neuron passes an output
onto the next layer of neurons in the network.
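As a rough sketch, a single artificial neuron of this kind can be expressed in a few lines of Python, here using the sigmoid function as the activation; the input, weight, and bias values are made up for illustration:

```python
import numpy as np

def neuron(x, w, b):
    z = np.dot(w, x) + b          # weighted sum of the inputs, plus the bias
    return 1 / (1 + np.exp(-z))   # sigmoid activation: output between 0 and 1

x = np.array([0.5, -1.0, 2.0])    # input vector (illustrative values)
w = np.array([0.8, 0.2, -0.5])    # one weight per input
b = 0.1                           # bias

output = neuron(x, w, b)
print(0 < output < 1)             # True - the activation constrains the result
```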
UNIT 2/9:

Deep neural network concepts


Completed100 XP

 10 minutes

Before exploring how to train a deep neural network (DNN) machine learning model,
let's consider what we're trying to achieve. Machine learning is concerned with
predicting a label based on some features of a particular observation. In simple terms, a
machine learning model is a function that calculates y (the label) from x (the
features): f(x)=y.

A simple classification example


For example, suppose your observation consists of some measurements of a penguin.

Specifically, the measurements are:

 The length of the penguin's bill.
 The depth of the penguin's bill.
 The length of the penguin's flipper.
 The penguin's weight.

In this case, the features (x) are a vector of four values, or mathematically, x=[x1,x2,x3,x4].

Let's suppose that the label we're trying to predict (y) is the species of the penguin, and
that there are three possible species it could be:

1. Adelie
2. Gentoo
3. Chinstrap

This is an example of a classification problem, in which the machine learning model must
predict the most probable class to which the observation belongs. A classification model
accomplishes this by predicting a label that consists of the probability for each class. In
other words, y is a vector of three probability values; one for each of the possible
classes: y=[P(0),P(1),P(2)].

You train the machine learning model by using observations for which you already know
the true label. For example, you may have the following feature measurements for
an Adelie specimen:

x=[37.3, 16.8, 19.2, 30.0]

You already know that this is an example of an Adelie (class 0), so a perfect classification
function should result in a label that indicates a 100% probability for class 0, and a 0%
probability for classes 1 and 2:

y=[1, 0, 0]

A deep neural network model


So how would we use deep learning to build a model for this penguin
classification problem? Let's look at an example:
The deep neural network model for the classifier consists of multiple layers of artificial
neurons. In this case, there are four layers:

 An input layer with a neuron for each expected input (x) value.
 Two so-called hidden layers, each containing five neurons.
 An output layer containing three neurons - one for each class probability (y) value to be
predicted by the model.

Because of the layered architecture of the network, this kind of model is sometimes
referred to as a multilayer perceptron. Additionally, notice that all neurons in the input
and hidden layers are connected to all neurons in the subsequent layers - this is an
example of a fully connected network.

When you create a model like this, you must define an input layer that supports the
number of features your model will process, and an output layer that reflects the
number of outputs you expect it to produce. You can decide how many hidden layers
you want to include and how many neurons are in each of them; but you have no
control over the input and output values for these layers - these are determined by the
model training process.

Training a deep neural network


The training process for a deep neural network consists of multiple iterations,
called epochs. For the first epoch, you start by assigning random initialization values for
the weight (w) and bias (b) values. Then the process is as follows:

1. Features for data observations with known label values are submitted to the input layer.
Generally, these observations are grouped into batches (often referred to as mini-batches).
2. The neurons then apply their function, and if activated, pass the result onto the next layer
until the output layer produces a prediction.
3. The prediction is compared to the actual known value, and the amount of variance
between the predicted and true values (which we call the loss) is calculated.
4. Based on the results, revised values for the weights and bias values are calculated to
reduce the loss, and these adjustments are backpropagated to the neurons in the network
layers.
5. The next epoch repeats the batch training forward pass with the revised weight and bias
values, hopefully improving the accuracy of the model (by reducing the loss).
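The epoch loop above can be sketched for the simplest possible model: a single weight and bias fitted to toy data with gradient descent. This is an illustrative reduction, not the code used in the exercise notebooks; the data and hyperparameter values are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data generated from y = 3x + 2 with a little noise
X = rng.uniform(-1, 1, size=100)
y = 3 * X + 2 + rng.normal(0, 0.1, size=100)

w, b = rng.normal(), rng.normal()   # random initialization for the first epoch
lr = 0.5                            # learning rate

losses = []
for epoch in range(50):
    pred = w * X + b                        # forward pass
    loss = np.mean((pred - y) ** 2)         # compare predictions to true values
    losses.append(loss)
    grad_w = np.mean(2 * (pred - y) * X)    # gradients of the loss
    grad_b = np.mean(2 * (pred - y))
    w -= lr * grad_w                        # adjust weight and bias
    b -= lr * grad_b                        # to reduce the loss

print(losses[-1] < losses[0])   # True - loss falls as the epochs proceed
```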
 Note

Processing the training features as a batch improves the efficiency of the training
process by processing multiple observations simultaneously as a matrix of features with
vectors of weights and biases. Linear algebraic functions that operate with matrices and
vectors also feature in 3D graphics processing, which is why computers with graphic
processing units (GPUs) provide significantly better performance for deep learning
model training than central processing unit (CPU) only computers.

A closer look at loss functions and backpropagation


The previous description of the deep learning training process mentioned that the loss
from the model is calculated and used to adjust the weight and bias values. How exactly
does this work?

Calculating loss

Suppose one of the samples passed through the training process contains features of
an Adelie specimen (class 0). The correct output from the network would be [1, 0, 0].
Now suppose that the output produced by the network is [0.4, 0.3, 0.3]. Comparing
these, we can calculate an absolute variance for each element (in other words, how far is
each predicted value away from what it should be) as [0.6, 0.3, 0.3].

In reality, since we're actually dealing with multiple observations, we typically aggregate
the variance - for example by squaring the individual variance values and calculating the
mean, so we end up with a single, average loss value, like 0.18.
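This calculation can be checked in a couple of lines (a sketch of the mean squared error aggregation described above):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 0.0])    # correct output for an Adelie (class 0)
y_pred = np.array([0.4, 0.3, 0.3])    # output produced by the network

variance = np.abs(y_pred - y_true)    # absolute variance per element
loss = np.mean(variance ** 2)         # square the variances and take the mean

print(variance)           # [0.6 0.3 0.3]
print(round(loss, 2))     # 0.18
```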

Optimizers

Now, here's the clever bit. The loss is calculated using a function, which operates on the
results from the final layer of the network, which is also a function. The final layer of the
network operates on the outputs from the previous layers, which are also functions. So
in effect, the entire model from the input layer right through to the loss calculation is
just one big nested function. Functions have a few really useful characteristics, including:

 You can conceptualize a function as a plotted line comparing its output with each of its
variables.
 You can use differential calculus to calculate the derivative of the function at any point
with respect to its variables.

Let's take the first of these capabilities. We can plot the line of the function to show how
an individual weight value compares to loss, and mark on that line the point where the
current weight value matches the current loss value.
Now let's apply the second characteristic of a function. The derivative of a function for a
given point indicates whether the slope (or gradient) of the function output (in this case,
loss) is increasing or decreasing with respect to a function variable (in this case, the
weight value). A positive derivative indicates that the function is increasing, and a
negative derivative indicates that it is decreasing. In this case, at the plotted point for
the current weight value, the function has a downward gradient. In other words,
increasing the weight will have the effect of decreasing the loss.

We use an optimizer to apply this same trick for all of the weight and bias variables in
the model and determine in which direction we need to adjust them (up or down) to
reduce the overall amount of loss in the model. There are multiple commonly used
optimization algorithms, including stochastic gradient descent (SGD), Adaptive Learning
Rate (ADADELTA), Adaptive Momentum Estimation (Adam), and others; all of which are
designed to figure out how to adjust the weights and biases to minimize loss.

Learning rate

Now, the obvious next question is, by how much should the optimizer adjust the
weights and bias values? If you look at the plot for our weight value, you can see that
increasing the weight by a small amount will follow the function line down (reducing the
loss), but if we increase it by too much, the function line starts to go up again, so we
might actually increase the loss; and after the next epoch, we might find we need to
reduce the weight.
The size of the adjustment is controlled by a parameter that you set for training called
the learning rate. A low learning rate results in small adjustments (so it can take more
epochs to minimize the loss), while a high learning rate results in large adjustments (so
you might miss the minimum altogether).
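The trade-off is easy to demonstrate on a toy loss function. Here gradient descent is run on the loss (w - 5)**2, whose minimum is at w = 5, with a small and a deliberately too-large learning rate (both values are illustrative):

```python
def descend(lr, w=0.0, steps=30):
    for _ in range(steps):
        grad = 2 * (w - 5)   # derivative of the loss (w - 5)**2
        w -= lr * grad       # adjustment scaled by the learning rate
    return w

print(abs(descend(lr=0.1) - 5) < 0.01)   # True: small steps reach the minimum
print(abs(descend(lr=1.1) - 5) > 100)    # True: large steps overshoot and diverge
```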
UNIT 3/9:

Exercise - Train a deep neural network


Completed100 XP

 25 minutes

So far in this module, you've learned a lot about the theory and principles of deep
learning with neural networks. The best way to learn how to apply this theory is to
actually build a deep learning model, and that's what you'll do in this exercise.

There are many frameworks available for training deep neural networks, and in this
exercise you can choose to explore either (or both) of two of the most popular deep
learning frameworks for Python: PyTorch and TensorFlow.

Before you start


To complete the exercise, you'll need:

 A Microsoft Azure subscription. If you don't already have one, you can sign up for a free
trial at https://azure.microsoft.com/free.
 An Azure Machine Learning workspace with a compute instance and the ml-
basics repository cloned.
 Note

This module makes use of an Azure Machine Learning workspace. If you are completing
this module in preparation for the Azure Data Scientist certification, consider creating
the workspace once, and reusing it in other modules. After completing the exercise, be
sure to follow the Clean Up instructions to stop compute resources, and retain the
workspace if you plan to reuse it.

Create an Azure Machine Learning workspace

If you don't already have an Azure Machine Learning workspace in your Azure
subscription, follow these steps to create one:

1. Sign into the Azure portal using the Microsoft account associated with your Azure
subscription.
2. On the Azure Home page, under Azure services, select Create a resource.
The Create a resource pane appears.

3. In the Search services and marketplace search box, search for and select Machine
Learning. The Machine Learning pane appears.

4. Select Create. The Machine learning pane appears.

5. On the Basics tab, enter the following values for each setting.

Setting Value

Project Details

Subscription Select the Azure subscription you'd like to use for this exercise.

Resource group Select the Create new link, and name the new resource group with a unique name, and selec

Workspace details

Workspace name Enter a unique name for your app. For example, you could use <yourname>machinelearn.

Region From the dropdown list, select any available location.

6. Accept the remaining defaults, and select Review + create.

7. After validation passes, select Create.

Wait for your workspace resource to be created as it can take a few minutes.

8. When deployment completes, select Go to resource. Your Machine learning pane appears.

9. Select Launch studio, or go to https://ml.azure.com, and sign in using your
Microsoft account. The Microsoft Azure Machine Learning studio page appears.

10. In Azure Machine Learning studio, toggle the ☰ icon at the top left to
expand/collapse its menu pane. You can use these options to manage the
resources in your workspace.
Create a compute instance

To run the notebook used in this exercise, you will need a compute instance in your
Azure Machine Learning workspace.

1. In the left menu pane, under Manage, select Compute. The Compute pane appears.

2. On the Compute Instances tab, if you already have a compute instance, start it;
otherwise, create a new compute instance by selecting New. The Create compute
instance pane appears.

3. Enter the following values for each setting:

 Compute name: enter a unique name
 Virtual machine type: CPU
 Virtual machine size: Select from recommended options: Standard_DS11_v2

4. Select Create. The Compute pane reappears with your Compute instance listed.

5. Wait for the compute instance to start as this may take a couple of minutes. Under
the State column, your Compute instance will change to Running.

Clone the ml-basics repository

The files used in this module, and other related modules, are published in
the MicrosoftDocs/ml-basics GitHub repository. If you haven't already done so, use the
following steps to clone the repository to your Azure Machine Learning workspace:

1. Select Workspaces in the left-hand menu of Azure Machine Learning studio, then
select the workspace you created in the list.

2. Under the Author column on the left, select the Notebooks link to open Jupyter
Notebooks. The Notebooks pane appears.

3. Select the Terminal button on the right. A terminal shell appears.

4. Run the following commands to change the current directory to the Users directory,
and clone the ml-basics repository, which contains the
notebook and files you will use in this exercise.
BashCopy
cd Users
git clone https://github.com/microsoftdocs/ml-basics

5. After the command has completed and the checkout of the files is done, close the
terminal tab and view the home page in your Jupyter notebook file explorer.

6. Open the Users folder - it should contain an ml-basics folder, containing the files
you will use in this module.

 Note

We highly recommend using Jupyter in an Azure Machine Learning workspace for this
exercise. This setup ensures the correct version of Python and the various packages you
will need are installed; and after creating the workspace once, you can reuse it in other
modules. If you prefer to complete the exercise in a Python environment on your own
computer, you can do so. You'll find details for configuring a local development
environment that uses Visual Studio Code at Running the labs on your own
computer. Be aware that if you choose to do this, the instructions in the exercise may
not match your notebooks user interface.

Train a deep neural network model


After you've created a Jupyter environment, and cloned the ml-basics repository, you're
ready to explore deep learning.

1. In Jupyter, in the ml-basics folder, open either the Deep Neural Networks
(PyTorch).ipynb or Deep Neural Networks (Tensorflow).ipynb notebook,
depending on your framework preference, and follow the instructions it contains.

2. When you've finished, close and halt all notebooks.

When you've finished working through the notebook, return to this module and move
on to the next unit to learn more.
UNIT 4/9:
Convolutional neural networks
Completed100 XP

 10 minutes

While you can use deep learning models for any kind of machine learning, they're
particularly useful for dealing with data that consists of large arrays of numeric values -
such as images. Machine learning models that work with images are the foundation for
an area of artificial intelligence called computer vision, and deep learning techniques have
been responsible for driving amazing advances in this area over recent years.

At the heart of deep learning's success in this area is a kind of model called
a convolutional neural network, or CNN. A CNN typically works by extracting features
from images, and then feeding those features into a fully connected neural network to
generate a prediction. The feature extraction layers in the network have the effect of
reducing the number of features from the potentially huge array of individual pixel
values to a smaller feature set that supports label prediction.

Layers in a CNN
CNNs consist of multiple layers, each performing a specific task in extracting features or
predicting labels.

Convolution layers

One of the principal layer types is a convolutional layer that extracts important features
in images. A convolutional layer works by applying a filter to images. The filter is defined
by a kernel that consists of a matrix of weight values.

For example, a 3x3 filter might be defined like this:

Copy
1 -1 1
-1 0 -1
1 -1 1

An image is also just a matrix of pixel values. To apply the filter, you "overlay" it on an
image and calculate a weighted sum of the corresponding image pixel values under the
filter kernel. The result is then assigned to the center cell of an equivalent 3x3 patch in a
new matrix of values that is the same size as the image. For example, suppose a 6 x 6
image has the following pixel values:

Copy
255 255 255 255 255 255
255 255 100 255 255 255
255 100 100 100 255 255
100 100 100 100 100 255
255 255 255 255 255 255
255 255 255 255 255 255

Applying the filter to the top-left 3x3 patch of the image would work like this:

Copy
255 255 255 1 -1 1 (255 x 1)+(255 x -1)+(255 x 1) +
255 255 100 x -1 0 -1 = (255 x -1)+(255 x 0)+(100 x -1) + = 155
255 100 100 1 -1 1 (255 x1 )+(100 x -1)+(100 x 1)

The result is assigned to the corresponding pixel value in the new matrix like this:

Copy
? ? ? ? ? ?
? 155 ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?

Now the filter is moved along (convolved), typically using a step size of 1 (so moving
along one pixel to the right), and the value for the next pixel is calculated:

Copy
255 255 255 1 -1 1 (255 x 1)+(255 x -1)+(255 x 1) +
255 100 255 x -1 0 -1 = (255 x -1)+(100 x 0)+(255 x -1) + = -155
100 100 100 1 -1 1 (100 x1 )+(100 x -1)+(100 x 1)

So now we can fill in the next value of the new matrix.

Copy
? ? ? ? ? ?
? 155 -155 ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?
? ? ? ? ? ?

The process repeats until we've applied the filter across all of the 3x3 patches of the
image to produce a new matrix of values like this:

Copy
? ? ? ? ? ?
? 155 -155 155 -155 ?
? -155 310 -155 155 ?
? 310 155 310 0 ?
? -155 -155 -155 0 ?
? ? ? ? ? ?

Because of the size of the filter kernel, we can't calculate values for the pixels at the
edge; so we typically just apply a padding value (often 0):

Copy
0 0 0 0 0 0
0 155 -155 155 -155 0
0 -155 310 -155 155 0
0 310 155 310 0 0
0 -155 -155 -155 0 0
0 0 0 0 0 0

The output of the convolution is typically passed to an activation function, which is often
a Rectified Linear Unit (ReLU) function that ensures negative values are set to 0:
Copy
0 0 0 0 0 0
0 155 0 155 0 0
0 0 310 0 155 0
0 310 155 310 0 0
0 0 0 0 0 0
0 0 0 0 0 0

The resulting matrix is a feature map of feature values that can be used to train a
machine learning model.

Note: The values in the feature map can be greater than the maximum value for a pixel
(255), so if you wanted to visualize the feature map as an image you would need
to normalize the feature values between 0 and 255.
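The whole worked example above can be reproduced with a few lines of NumPy, sliding the kernel over every interior 3x3 patch and leaving the padded edge at 0 (a sketch of the calculation, not a production convolution routine):

```python
import numpy as np

image = np.array([[255, 255, 255, 255, 255, 255],
                  [255, 255, 100, 255, 255, 255],
                  [255, 100, 100, 100, 255, 255],
                  [100, 100, 100, 100, 100, 255],
                  [255, 255, 255, 255, 255, 255],
                  [255, 255, 255, 255, 255, 255]])

kernel = np.array([[ 1, -1,  1],
                   [-1,  0, -1],
                   [ 1, -1,  1]])

# Weighted sum of each interior 3x3 patch; edge cells keep the padding value 0
feature_map = np.zeros_like(image)
for row in range(1, 5):
    for col in range(1, 5):
        patch = image[row - 1:row + 2, col - 1:col + 2]
        feature_map[row, col] = np.sum(patch * kernel)

relu = np.maximum(feature_map, 0)   # ReLU sets the negative values to 0

print(feature_map[1, 1], feature_map[1, 2])   # 155 -155, as calculated above
print(relu[2, 2])                             # 310
```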

The convolution process is shown in the animation below.


1. An image is passed to the convolutional layer. In this case, the image is a simple geometric
shape.
2. The image is composed of an array of pixels with values between 0 and 255 (for color
images, this is usually a 3-dimensional array with values for red, green, and blue channels).
3. A filter kernel is generally initialized with random weights (in this example, we've chosen
values to highlight the effect that a filter might have on pixel values; but in a real CNN, the
initial weights would typically be generated from a random Gaussian distribution). This
filter will be used to extract a feature map from the image data.
4. The filter is convolved across the image, calculating feature values by applying a sum of
the weights multiplied by their corresponding pixel values in each position. A Rectified
Linear Unit (ReLU) activation function is applied to ensure negative values are set to 0.
5. After convolution, the feature map contains the extracted feature values, which often
emphasize key visual attributes of the image. In this case, the feature map highlights the
edges and corners of the triangle in the image.

Typically, a convolutional layer applies multiple filter kernels. Each filter produces a
different feature map, and all of the feature maps are passed onto the next layer of the
network.

Pooling layers

After extracting feature values from images, pooling (or downsampling) layers are used
to reduce the number of feature values while retaining the key differentiating features
that have been extracted.

One of the most common kinds of pooling is max pooling in which a filter is applied to
the image, and only the maximum pixel value within the filter area is retained. So for
example, applying a 2x2 pooling kernel to the following patch of an image would
produce the result 155.

0 0
0 155

Note that the effect of the 2x2 pooling filter is to reduce the number of values from 4 to
1.
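That max pooling operation can be sketched in NumPy (a minimal illustration that
reproduces the 2x2 example above):

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Apply max pooling: keep only the largest value in each
    non-overlapping size x size region of the feature map."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    blocks = trimmed.reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))

patch = np.array([[0,   0],
                  [0, 155]])
print(max_pool(patch))  # [[155]] - four values reduced to one

feature_map = np.array([[1, 3, 2, 4],
                        [5, 0, 1, 2],
                        [7, 8, 9, 6],
                        [0, 2, 3, 1]])
print(max_pool(feature_map))  # [[5 4]
                              #  [8 9]]
```

Applied to a full 4x4 feature map, the same function returns a 2x2 array, a quarter of
the original number of values.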

As with convolutional layers, pooling layers work by applying the filter across the whole
feature map. The animation below shows an example of max pooling for an image map.
1. The feature map extracted by a filter in a convolutional layer contains an array of feature
values.
2. A pooling kernel is used to reduce the number of feature values. In this case, the kernel
size is 2x2, so it produces an array with a quarter of the number of feature values.
3. The pooling kernel is convolved across the feature map, retaining only the highest feature
value in each position.

Dropping layers

One of the most difficult challenges in a CNN is the avoidance of overfitting, where the
resulting model performs well with the training data but doesn't generalize well to new
data on which it wasn't trained. One technique you can use to mitigate overfitting is to
include layers in which the training process randomly eliminates (or "drops") feature
maps. This may seem counterintuitive, but it's an effective way to ensure that the model
doesn't learn to be over-dependent on the training images.
Other techniques you can use to mitigate overfitting include randomly flipping,
mirroring, or skewing the training images to generate data that varies between training
epochs.
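The dropout idea can be sketched in NumPy. This is a conceptual illustration, not
framework code; in PyTorch or TensorFlow you would use the built-in dropout layers
instead. Scaling up the surviving values ("inverted dropout") is a common convention
so that no adjustment is needed at inference time:

```python
import numpy as np

def dropout(features, drop_prob=0.5, training=True, rng=None):
    """Randomly zero feature values during training; at inference
    time the features pass through unchanged. Surviving values are
    scaled up so their expected magnitude stays the same."""
    if not training or drop_prob == 0:
        return features
    if rng is None:
        rng = np.random.default_rng()
    keep_prob = 1 - drop_prob
    mask = rng.random(features.shape) < keep_prob
    return features * mask / keep_prob

features = np.ones((4, 4))
dropped = dropout(features, drop_prob=0.5, rng=np.random.default_rng(0))
print(dropped)  # roughly half the values are 0, the rest are 2.0
```

Because the dropped positions change on every call, the model cannot rely on any
single feature value being present, which is what discourages over-dependence on
the training images.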

Flattening layers

After using convolutional and pooling layers to extract the salient features in the
images, the resulting feature maps are multidimensional arrays of pixel values. A
flattening layer is used to flatten the feature maps into a vector of values that can be
used as input to a fully connected layer.
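Flattening is just a reshape. For example (a NumPy sketch with hypothetical sizes):

```python
import numpy as np

# Two 14x14 feature maps from a pooling layer (values hypothetical)
feature_maps = np.zeros((2, 14, 14))

# Flatten the multidimensional arrays into a single vector
# that a fully connected layer can accept as input
vector = feature_maps.reshape(-1)
print(vector.shape)  # (392,) - 2 x 14 x 14 values
```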

Fully connected layers

Usually, a CNN ends with a fully connected network in which the feature values are
passed into an input layer, through one or more hidden layers, and out through an
output layer that generates the predicted values.

A basic CNN architecture might look similar to this:

1. Images are fed into a convolutional layer. In this case, there are two filters, so each image
produces two feature maps.
2. The feature maps are passed to a pooling layer, where a 2x2 pooling kernel reduces the
size of the feature maps.
3. A dropping layer randomly drops some of the feature maps to help prevent overfitting.
4. A flattening layer takes the remaining feature map arrays and flattens them into a vector.
5. The vector elements are fed into a fully connected network, which generates the
predictions. In this case, the network is a classification model that predicts probabilities for
three possible image classes (triangle, square, and circle).
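The five steps above can be traced as a shape walk-through. This NumPy sketch uses
hypothetical sizes and random values as stand-ins for trained layers, so it shows only
how the array shapes change as data flows through the architecture:

```python
import numpy as np

rng = np.random.default_rng(42)

# 1. A 28x28 image enters the convolutional layer; two 3x3 filters
#    (no padding, stride 1) would each yield a 26x26 feature map
feature_maps = rng.random((2, 26, 26))   # stand-in for the conv output

# 2. A 2x2 pooling kernel halves each spatial dimension
pooled = feature_maps.reshape(2, 13, 2, 13, 2).max(axis=(2, 4))
print(pooled.shape)                      # (2, 13, 13)

# 3. (A dropping layer would randomly zero feature maps here
#    during training; it does not change the shapes)

# 4. The flattening layer turns the maps into one vector
vector = pooled.reshape(-1)
print(vector.shape)                      # (338,)

# 5. A fully connected layer maps the vector to 3 class scores,
#    and softmax turns the scores into probabilities
weights = rng.random((3, vector.size))   # hypothetical trained weights
scores = weights @ vector
shifted = scores - scores.max()          # numerically stable softmax
probs = np.exp(shifted) / np.exp(shifted).sum()
print(probs.shape)                       # (3,) - one per class
```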

Training a CNN model


As with any deep neural network, a CNN is trained by passing batches of training data
through it over multiple epochs, adjusting the weights and bias values based on the loss
calculated for each epoch. In the case of a CNN, backpropagation of adjusted weights
includes filter kernel weights used in convolutional layers as well as the weights used in
fully connected layers.
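That epoch-based loop can be sketched in outline. The NumPy example below trains a
single weight by gradient descent on hypothetical data; a real CNN updates its
filter-kernel and fully connected weights the same way, with the gradients supplied
by backpropagation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical training data: y = 2x plus a little noise
x = rng.random(100)
y = 2 * x + rng.normal(0, 0.05, 100)

w = 0.0                 # the weight to learn
learning_rate = 0.5

for epoch in range(20):                               # one full pass per epoch
    for batch in np.array_split(np.arange(100), 5):   # mini-batches
        pred = w * x[batch]
        error = pred - y[batch]
        loss = np.mean(error ** 2)                    # MSE loss for this batch
        grad = 2 * np.mean(error * x[batch])          # d(loss)/d(w)
        w -= learning_rate * grad                     # weight adjustment

print(round(w, 1))  # close to 2.0 - the loop has recovered the slope
```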
UNIT 5/9:

Exercise - Train a convolutional neural network
Completed100 XP

 45 minutes

PyTorch and TensorFlow both offer comprehensive support for building convolutional
neural networks as classification models for images.

In this exercise, you'll use your preferred framework to create a simple CNN-based
image classifier for images of simple geometric shapes. The same principles can be
applied to images of any kind.

To complete the exercise:

1. In Jupyter, in the ml-basics folder, open the Convolutional Neural Networks
(PyTorch).ipynb or Convolutional Neural Networks (Tensorflow).ipynb notebook,
depending on your framework preference, and follow the instructions it contains.

2. When you've finished, close and halt all notebooks.

When you've finished working through the notebook, return to this module and move
on to the next unit to learn more.
UNIT 6/9:
Transfer learning
Completed100 XP

 5 minutes

In life, it’s often easier to learn a new skill if you already have expertise in a similar,
transferrable skill. For example, it’s probably easier to teach someone how to drive a bus
if they have already learned how to drive a car. The driver can build on the driving skills
they've already learned in a car, and apply them to driving a bus.

The same principle can be applied to training deep learning models through a
technique called transfer learning.

How transfer learning works


A Convolutional Neural Network (CNN) for image classification is typically composed of
multiple layers that extract features, and then use a final fully connected layer to classify
images based on these features.

Conceptually, this neural network consists of two distinct sets of layers:


1. A set of layers from the base model that perform feature extraction.
2. A fully connected layer that takes the extracted features and uses them for class prediction.

The feature extraction layers apply convolutional filters and pooling to emphasize
edges, corners, and other patterns in the images that can be used to differentiate them,
and in theory should work for any set of images with the same dimensions as the input
layer of the network. The prediction layer maps the features to a set of outputs that
represent probabilities for each class label you want to use to classify the images.

By separating the network into these types of layers, we can take the feature extraction
layers from a model that has already been trained and append one or more layers to
use the extracted features for prediction of the appropriate class labels for your images.
This approach enables you to keep the pre-trained weights for the feature extraction
layers, which means you only need to train the prediction layers you have added.

There are many established convolutional neural network architectures for image
classification that you can use as the base model for transfer learning, so you can build
on the work someone else has already done to easily create an effective image
classification model.
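The split between frozen feature-extraction layers and a trainable prediction layer can
be sketched conceptually. In the NumPy toy below, the "base model" is a fixed random
projection standing in for pre-trained layers, and only the appended prediction layer's
weights are updated; in practice you would load an established pre-trained network
(such as a ResNet) and replace its final layer:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in for the pre-trained feature-extraction layers: these
# weights are FROZEN - training never updates them
frozen_weights = rng.normal(size=(16, 64)) / 8   # scaled for stability

def extract_features(x):
    """The base model: a frozen linear layer plus ReLU."""
    return np.maximum(frozen_weights @ x, 0)

# The new prediction layer we append - the ONLY weights we train
head = np.zeros((3, 16))                          # 3 class scores

def predict(x):
    return head @ extract_features(x)

# Train the head on one hypothetical example; frozen_weights
# stay untouched throughout
x = rng.normal(size=64)
target = np.array([1.0, 0.0, 0.0])                # one-hot class label
for step in range(200):
    features = extract_features(x)
    error = head @ features - target              # prediction error
    head -= 0.01 * np.outer(error, features)      # gradient step (head only)

print(np.argmax(predict(x)))  # 0 - the head has learned the label
```

Because only the small head is trained, far fewer weights need adjusting, which is why
transfer learning typically needs much less data and time than training from scratch.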
UNIT 7/9:

Exercise - Use transfer learning


Completed100 XP

 30 minutes

PyTorch and TensorFlow both support a library of existing models that you can use as
the basis for transfer learning.

In this exercise, you'll use your preferred framework to train a convolutional neural
network model by using transfer learning.

To complete the exercise:

1. In Jupyter, in the ml-basics folder, open either the Transfer Learning
(PyTorch).ipynb or Transfer Learning (Tensorflow).ipynb notebook, depending
on your framework preference, and follow the instructions it contains.

2. When you've finished, close and halt all notebooks.

Clean-up
If you used a compute instance in an Azure Machine Learning workspace to complete
the exercises, use these steps to clean up.

1. Close all Jupyter notebooks and the Jupyter home page.

2. In Azure Machine Learning Studio, on the Compute page, select your compute
instance, and in the top menu bar, select Stop.

If you don't intend to complete other modules that require the Azure Machine Learning
workspace, you can delete the resource group you created for it from your Azure
subscription.
UNIT 8/9:

Knowledge check
200 XP

 3 minutes

Answer the following questions to check your learning.

1. 

You are creating a deep neural network to train a classification model that predicts to
which of three classes an observation belongs based on 10 numeric features. Which of
the following statements is true of the network architecture?

The input layer should contain three nodes

The network should contain three hidden layers

The output layer should contain three nodes

ANSWER: 3 (The output layer should contain a node for each possible class value.)
2. 

You are training a deep neural network. You configure the training process to use 50
epochs. What effect does this configuration have?

The entire training dataset is passed through the network 50 times

The training data is split into 50 subsets, and each subset is passed through the network
The first 50 rows of data are used to train the model, and the remaining rows are used
to validate it
ANSWER: 1 (The number of epochs determines the number of training passes for
the full dataset.)

3. 

You are creating a deep neural network. You increase the Learning Rate parameter.
What effect does this setting have?

More records are included in each batch passed through the network

Larger adjustments are made to weight values during backpropagation

More hidden layers are added to the network

ANSWER: 2 (Increasing the learning rate causes backpropagation to make larger
weight adjustments.)

4. 

You are creating a convolutional neural network. You want to reduce the size of the
feature maps that are generated by a convolutional layer. What should you do?

Reduce the size of the filter kernel used in the convolutional layer

Increase the number of filters in the convolutional layer

Add a pooling layer after the convolutional layer


ANSWER: 3 (A pooling layer reduces the number of feature values in a feature map.)
UNIT 9/9:
Summary
Completed100 XP

 1 minute

In this module, you learned about the fundamental principles of deep learning, and how
to create deep neural network models using PyTorch or TensorFlow. You also explored
the use of convolutional neural networks to create image classification models.

Deep learning techniques are at the cutting edge of machine learning and artificial
intelligence, and are used to implement enterprise solutions. If this module has inspired
you to build machine learning solutions, you should consider learning how Azure
Machine Learning can help you train, deploy, and manage models at scale. You can
learn how to use Azure Machine Learning to manage machine learning operations in
the Build AI solutions with Azure Machine Learning service learning path.
UNIT 1/8:
Introduction
Completed100 XP

 1 minute

Machine Learning is the foundation for most artificial intelligence solutions. Creating an
intelligent solution often begins with the use of machine learning to train predictive
models using historic data that you have collected.

Azure Machine Learning is a cloud service that you can use to train and manage machine
learning models.

In this module, you'll learn to:

 Identify the machine learning process.


 Understand Azure Machine Learning capabilities.
 Use automated machine learning in Azure Machine Learning studio to train and deploy a
predictive model.

To complete this module, you'll need a Microsoft Azure subscription. If you don't
already have one, you can sign up for a free trial at https://azure.microsoft.com.
UNIT 2/8:
What is machine learning?
Completed100 XP

 5 minutes

Machine learning is a technique that uses mathematics and statistics to create a model
that can predict unknown values.

For example, suppose Adventure Works Cycles is a business that rents cycles in a city.
The business could use historic data to train a model that predicts daily rental demand
in order to make sure sufficient staff and cycles are available.

To do this, Adventure Works could create a machine learning model that takes
information about a specific day (the day of week, the anticipated weather conditions,
and so on) as an input, and predicts the expected number of rentals as an output.

Mathematically, you can think of machine learning as a way of defining a function (let's
call it f) that operates on one or more features of something (which we'll call x) to
calculate a predicted label (y) - like this:

f(x) = y
In this bicycle rental example, the details about a given day (day of the week, weather,
and so on) are the features (x), the number of rentals for that day is the label (y), and the
function (f) that calculates the number of rentals based on the information about the
day is encapsulated in a machine learning model.
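To make f(x) = y concrete, here is a minimal sketch that fits a linear f by least
squares. The numbers are hypothetical, not Adventure Works data; the features are a
day-of-week code and a temperature, and the label is the rental count:

```python
import numpy as np

# Hypothetical history: [day-of-week (0-6), temperature C] -> rentals
features = np.array([[0, 18], [1, 22], [2, 25], [3, 15],
                     [4, 20], [5, 28], [6, 30]], dtype=float)
labels = np.array([120, 150, 170, 90, 140, 210, 230], dtype=float)

# Learn a linear f(x) = w . x + b by least squares
X = np.column_stack([features, np.ones(len(features))])  # add bias column
coeffs, *_ = np.linalg.lstsq(X, labels, rcond=None)

def f(x):
    """The trained model: features in, predicted label out."""
    return np.append(x, 1) @ coeffs

print(round(f(np.array([5, 26]))))  # predicted rentals for a Saturday at 26 C
```

A real model may use a more sophisticated function than a straight line, but the
shape of the problem is the same: features in, predicted label out.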

The specific operation that the f function performs on x to calculate y depends on a
number of factors, including the type of model you're trying to create and the specific
algorithm used to train the model. Additionally, in most cases, the data used to train the
machine learning model requires some pre-processing before model training can be
performed.

Types of machine learning


There are two general approaches to machine learning, supervised and unsupervised
machine learning. In both approaches, you train a model to make predictions.

The supervised machine learning approach requires you to start with a
dataset with known label values. Two types of supervised machine learning tasks include
regression and classification.

 Regression: used to predict a continuous value; like a price, a sales total, or some other
measure.
 Classification: used to determine a binary class label; like whether a patient has diabetes
or not.

The unsupervised machine learning approach starts with a dataset without known
label values. One type of unsupervised machine learning task is clustering.

 Clustering: used to determine labels by grouping similar information into label groups;
like grouping measurements from birds into species.

The following video discusses the various kinds of machine learning model you can
create, and the process generally followed to train and use them.

Note: This unit includes a video that can't be reproduced here; watch it on the
module page.
UNIT 3/8:
What is Azure Machine Learning studio?
Completed100 XP

 5 minutes

Training and deploying an effective machine learning model involves a lot of work,
much of it time-consuming and resource-intensive. Azure Machine Learning is a cloud-
based service that helps simplify some of the tasks it takes to prepare data, train a
model, and deploy a predictive service.

Most importantly, Azure Machine Learning helps data scientists increase their efficiency
by automating many of the time-consuming tasks associated with training models; and
it enables them to use cloud-based compute resources that scale effectively to handle
large volumes of data while incurring costs only when actually used.

Azure Machine Learning workspace


To use Azure Machine Learning, you first create a workspace resource in your Azure
subscription. You can then use this workspace to manage data, compute resources,
code, models, and other artifacts related to your machine learning workloads.

After you have created an Azure Machine Learning workspace, you can develop
solutions with the Azure machine learning service either with developer tools or the
Azure Machine Learning studio web portal.

Azure Machine Learning studio


Azure Machine Learning studio is a web portal for machine learning solutions in Azure.
It includes a wide range of features and capabilities that help data scientists prepare
data, train models, publish predictive services, and monitor their usage. To begin using
the web portal, you need to assign the workspace you created in the Azure portal to
Azure Machine Learning studio
Azure Machine Learning compute
At its core, Azure Machine Learning is a service for training and managing machine
learning models, for which you need compute on which to run the training process.

Compute targets are cloud-based resources on which you can run model training and
data exploration processes.

In Azure Machine Learning studio, you can manage the compute targets for your data
science activities. There are four kinds of compute resource you can create:

 Compute Instances: Development workstations that data scientists can use to work with
data and models.
 Compute Clusters: Scalable clusters of virtual machines for on-demand processing of
experiment code.
 Inference Clusters: Deployment targets for predictive services that use your trained
models.
 Attached Compute: Links to existing Azure compute resources, such as Virtual Machines
or Azure Databricks clusters.
UNIT 4/8:

What is Azure Automated Machine Learning?
Completed100 XP

 3 minutes

Azure Machine Learning includes an automated machine learning capability that
automatically tries multiple pre-processing techniques and model-training algorithms in
parallel. These automated capabilities use the power of cloud compute to find the best
performing supervised machine learning model for your data.

Automated machine learning allows you to train models without extensive data science
or programming knowledge. For people with a data science and programming
background, it provides a way to save time and resources by automating algorithm
selection and hyperparameter tuning.

You can create an automated machine learning job in Azure Machine Learning studio.
In Azure Machine Learning, operations that you run are called jobs. You can configure
multiple settings for your job before starting an automated machine learning run. The
run configuration provides the information needed to specify your training script,
compute target, and Azure ML environment in your run configuration and run a training
job.
UNIT 5/8:
Understand the AutoML process
Completed100 XP

 5 minutes

You can think of the steps in a machine learning process as:

1. Prepare data: Identify the features and label in a dataset. Pre-process, or clean and
transform, the data as needed.
2. Train model: Split the data into two groups, a training and a validation set. Train a
machine learning model using the training data set. Test the machine learning model for
performance using the validation data set.
3. Evaluate performance: Compare how close the model's predictions are to the known
labels.
4. Deploy a predictive service: After you train a machine learning model, you can deploy
the model as an application on a server or device so that others can use it.

These are the same steps in the automated machine learning process with Azure
Machine Learning.

Prepare data
Machine learning models must be trained with existing data. Data scientists expend a lot
of effort exploring and pre-processing data, and trying various types of model-training
algorithms to produce accurate models, which is time consuming, and often makes
inefficient use of expensive compute hardware.

In Azure Machine Learning, data for model training and other operations is usually
encapsulated in an object called a dataset. You can create your own dataset in Azure
Machine Learning studio.
Train model
The automated machine learning capability in Azure Machine Learning
supports supervised machine learning models - in other words, models for which the
training data includes known label values. You can use automated machine learning to
train models for:

 Classification (predicting categories or classes)


 Regression (predicting numeric values)
 Time series forecasting (predicting numeric values at a future point in time)
In Automated Machine Learning, you can select from several types of tasks, and you can
configure settings for the primary metric, type of model used for training, exit criteria,
and concurrency limits.
Importantly, AutoML will split data into a training set and a validation set. You can
configure the details in the settings before you run the job.
Evaluate performance
After the job has finished you can review the best performing model. In this case, you
used exit criteria to stop the job. Thus the "best" model the job generated might not be
the best possible model, just the best one found within the time allowed for this
exercise.

The best model is identified based on the evaluation metric you specified, Normalized
root mean squared error.

A technique called cross-validation is used to calculate the evaluation metric. After the
model is trained using a portion of the data, the remaining portion is used to iteratively
test, or cross-validate, the trained model. The metric is calculated by comparing the
predicted value from the test with the actual known value, or label.

The difference between the predicted and actual value, known as the residual, indicates
the amount of error in the model. The performance metric root mean squared
error (RMSE) is calculated by squaring the errors across all of the test cases, finding the
mean of these squares, and then taking the square root. What all of this means is that the
smaller this value is, the more accurate the model's predictions are. The normalized root
mean squared error (NRMSE) standardizes the RMSE metric so it can be used for
comparison between models that have variables on different scales.
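The RMSE and NRMSE calculations can be sketched in NumPy. The predicted and
actual values here are hypothetical, and note that normalization conventions vary;
dividing by the range of the actual values, as below, is one common choice:

```python
import numpy as np

actual = np.array([120, 150, 170, 90, 210], dtype=float)      # true labels
predicted = np.array([130, 145, 160, 100, 200], dtype=float)  # model output

residuals = predicted - actual                 # the errors
rmse = np.sqrt(np.mean(residuals ** 2))        # root mean squared error

# Normalize so models with labels on different scales are comparable;
# dividing by the range of the actual values is one common convention
nrmse = rmse / (actual.max() - actual.min())

print(round(rmse, 2), round(nrmse, 4))  # 9.22 0.0768
```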

The Residual Histogram shows the frequency of residual value ranges. Residuals
represent variance between predicted and true values that can't be explained by the
model; in other words, errors. You should hope to see the most frequently occurring
residual values clustered around zero, with fewer errors at the extreme ends of the scale.
The Predicted vs. True chart should show a diagonal trend in which the predicted value
correlates closely to the true value. The dotted line shows how a perfect model should
perform. The closer the line of your model's average predicted value is to the dotted
line, the better its performance. A histogram below the line chart shows the distribution
of true values.

After you've used automated machine learning to train some models, you can deploy
the best performing model as a service for client applications to use.

Deploy a predictive service


In Azure Machine Learning, you can deploy a service as an Azure Container Instances
(ACI) or to an Azure Kubernetes Service (AKS) cluster. For production scenarios, an AKS
deployment is recommended, for which you must create an inference cluster compute
target. In this exercise, you'll use an ACI service, which is a suitable deployment target
for testing, and does not require you to create an inference cluster.
UNIT 6/8:

Exercise - Explore Automated Machine Learning in Azure ML
Completed100 XP

 50 minutes

In this exercise, you will use a dataset of historical bicycle rental details to train a model
that predicts the number of bicycle rentals that should be expected on a given day,
based on seasonal and meteorological features.

 Note

To complete this lab, you will need an Azure subscription in which you have
administrative access.

Launch the exercise and follow the instructions.


UNIT 7/8:
Knowledge check
200 XP

 4 minutes
1. 

An automobile dealership wants to use historic car sales data to train a machine learning
model. The model should predict the price of a pre-owned car based on its make,
model, engine size, and mileage. What kind of machine learning model should the
dealership use automated machine learning to create?

Classification

Regression

Time series forecasting

ANSWER: 2 (To predict a numeric value, use a regression model.)

2. 

A bank wants to use historic loan repayment records to categorize loan applications as
low-risk or high-risk based on characteristics like the loan amount, the income of the
borrower, and the loan period. What kind of machine learning model should the bank
use automated machine learning to create?

Classification

Regression

Time series forecasting


ANSWER: 1 (To predict a category, or class, use a classification model. )
3. 

You want to use automated machine learning to train a regression model with the best
possible R2 score. How should you configure the automated machine learning
experiment?

Set the Primary metric to R2 score

Block all algorithms other than GradientBoosting

Enable featurization

ANSWER: 1 (The primary metric determines the metric used to evaluate the best
performing model.)
Unit 8/8:
Summary
Completed100 XP

 1 minute

In this module, you learned how to:

 Identify the machine learning process.


 Understand Azure Machine Learning capabilities.
 Use automated machine learning in Azure Machine Learning studio to train and deploy a
predictive model.
Create a regression model with Azure
Machine Learning
UNIT 1/8:
Introduction
Completed100 XP

 2 minutes

You can use Microsoft Azure Machine Learning designer to create regression models by
using a drag and drop visual interface, without needing to write any code.

In this module, you'll learn how to:

 Identify regression machine learning scenarios.


 Use Azure Machine Learning designer to train a regression model.
 Use a regression model for inferencing.
 Deploy a regression model as a service.

To complete this module, you'll need a Microsoft Azure subscription. If you don't
already have one, you can sign up for a free trial at https://azure.microsoft.com.
UNIT 2/8:
Identify regression machine learning
scenarios
Completed100 XP
 3 minutes

Regression is a form of machine learning used to understand the relationships between
variables to predict a desired outcome. Regression predicts a numeric label or outcome
based on variables, or features. For example, an automobile sales company might use
the characteristics of a car (such as engine size, number of seats, mileage, and so on) to
predict its likely selling price. In this case, the characteristics of the car are the features,
and the selling price is the label.

Regression is an example of a supervised machine learning technique in which you train
a model using data that includes both the features and known values for the label, so
that the model learns to fit the feature combinations to the label. Then, after training
has been completed, you can use the trained model to predict labels for new items for
which the label is unknown.

Scenarios for regression machine learning models


Regression machine learning models are used in many industries. A few scenarios are:

 Using characteristics of houses, such as square footage and number of rooms, to
predict home prices.
 Using characteristics of farm conditions, such as weather and soil quality, to predict
crop yield.
 Using characteristics of a past campaign, such as advertising logs, to predict future
advertisement clicks.
UNIT 3/8:
What is Azure Machine Learning?
Completed100 XP

 5 minutes

Training and deploying an effective machine learning model involves a lot of work,
much of it time-consuming and resource-intensive. Azure Machine Learning is a cloud-
based service that helps simplify some of the tasks it takes to prepare data, train a
model, and deploy a predictive service. Regression machine learning models can be
built using Azure Machine Learning.

Most importantly, Azure Machine Learning helps data scientists increase their efficiency
by automating many of the time-consuming tasks associated with training models. It
enables them to use cloud-based compute resources that scale effectively to handle
large volumes of data while incurring costs only when actually used.

Azure Machine Learning workspace


To use Azure Machine Learning, you first create a workspace resource in your Azure
subscription. You can then use this workspace to manage data, compute resources,
code, models, and other artifacts related to your machine learning workloads.

After you have created an Azure Machine Learning workspace, you can develop
solutions with the Azure machine learning service either with developer tools or the
Azure Machine Learning studio web portal.

Azure Machine Learning studio


Azure Machine Learning studio is a web portal for machine learning solutions in Azure.
It includes a wide range of features and capabilities that help data scientists prepare
data, train models, publish predictive services, and monitor their usage. To begin using
the web portal, you need to assign the workspace you created in the Azure portal to
Azure Machine Learning studio.
Azure Machine Learning compute
At its core, Azure Machine Learning is a service for training and managing machine
learning models, for which you need compute resources on which to run the training
process. Compute targets are cloud-based resources on which you can run model
training and data exploration processes.

In Azure Machine Learning studio, you can manage the compute targets for your data
science activities. There are four kinds of compute resource you can create:

 Compute Instances: Development workstations that data scientists can use to work with
data and models.
 Compute Clusters: Scalable clusters of virtual machines for on-demand processing of
experiment code.
 Inference Clusters: Deployment targets for predictive services that use your trained
models.
 Attached Compute: Links to existing Azure compute resources, such as Virtual Machines
or Azure Databricks clusters.
UNIT 4/8:
What is Azure Machine Learning
designer?
100 XP
 4 minutes

In Azure Machine Learning studio, there are several ways to author regression machine
learning models. One way is to use a visual interface called designer that you can use to
train, test, and deploy machine learning models. The drag-and-drop interface makes use
of clearly defined inputs and outputs that can be shared, reused, and version controlled.

Each designer project, known as a pipeline, has a left panel for navigation and a canvas
on the right-hand side. To use designer, identify the building blocks, or components,
needed for your model, place and connect them on the canvas, and run a machine
learning job.

Pipelines
Pipelines let you organize, manage, and reuse complex machine learning workflows
across projects and users. A pipeline starts with the dataset from which you want to train
the model. Each time you run a pipeline, the configuration of the pipeline and its results
are stored in your workspace as a pipeline job.

Components
An Azure Machine Learning component encapsulates one step in a machine learning
pipeline. You can think of a component as a programming function and as a building
block for Azure Machine Learning pipelines. In a pipeline project, you can access data
assets and components from the left panel's Asset Library tab.
Datasets
You can create data assets on the Data page from local files, a datastore, web files, and
Open Datasets. These data assets will appear along with standard sample datasets
in designer's Asset Library.

Azure Machine Learning Jobs


An Azure Machine Learning (ML) job executes a task against a specified compute target.
Jobs enable systematic tracking for your machine learning experimentation and
workflows. Once a job is created, Azure ML maintains a run record for the job. All of
your jobs' run records can be viewed in Azure ML studio.

In your designer project, you can access the status of a pipeline job using
the Submitted jobs tab on the left pane.

You can find all the jobs you have run in a workspace on the Jobs page.
UNIT 5/8:
Understand steps for regression
Completed100 XP

 6 minutes

You can think of the steps to train and evaluate a regression machine learning model as:

1. Prepare data: Identify the features and label in a dataset. Pre-process, or clean and
transform, the data as needed.
2. Train model: Split the data into two groups, a training and a validation set. Train a
machine learning model using the training data set. Test the machine learning model for
performance using the validation data set.
3. Evaluate performance: Compare how close the model's predictions are to the known
labels.
4. Deploy a predictive service: After you train a machine learning model, you need to
convert the training pipeline into a real-time inference pipeline. Then you can deploy the
model as an application on a server or device so that others can use it.

Let's follow these four steps as they appear in Azure designer.

Prepare data
Azure Machine Learning designer has several pre-built components that can be used to
prepare data for training. These components enable you to clean data, normalize
features, join tables, and more.

Train model
To train a regression model, you need a dataset that includes historical features,
characteristics of the entity for which you want to make a prediction, and
known label values. The label is the quantity you want to train a model to predict.

It's common practice to train the model using a subset of the data, while holding back
some data with which to test the trained model. This enables you to compare the labels
that the model predicts with the actual known labels in the original dataset.
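In the designer, this hold-back is done with the Split Data component. To make the idea concrete, here is a rough equivalent in plain Python using scikit-learn on invented data (the feature values and model choice are illustrative only, not the exercise's pipeline):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Invented data: 100 rows, 2 numeric features, a numeric label
rng = np.random.default_rng(0)
X = rng.uniform(50, 300, size=(100, 2))
y = 80 * X[:, 0] + 40 * X[:, 1] + rng.normal(0, 500, size=100)

# Hold back 30% of the rows as a validation set
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=42)

# Train on the training set only
model = LinearRegression().fit(X_train, y_train)

# Score the held-back rows so predictions can be compared with known labels
predictions = model.predict(X_val)
```

Comparing `predictions` against `y_val` is exactly the comparison the designer performs when you connect Score Model to the validation output of Split Data.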

You will use designer's Score Model component to generate predicted label values.
Once you connect all the components, you will run an experiment, which uses the
data asset on the canvas to train and score a model.
Evaluate performance
After training a model, it is important to evaluate its performance. There are many
performance metrics and methodologies for evaluating how well a model makes
predictions. You can review evaluation metrics on the completed job page by right-
clicking the Evaluate Model component.
 Mean Absolute Error (MAE): The average difference between predicted values and true
values. This value is based on the same units as the label, in this case dollars. The lower
this value is, the better the model is predicting.
 Root Mean Squared Error (RMSE): The square root of the mean squared difference
between predicted and true values. The result is a metric based on the same unit as the
label (dollars). When compared to the MAE (above), a larger difference indicates greater
variance in the individual errors (for example, with some errors being very small, while
others are large).
 Relative Squared Error (RSE): A relative metric between 0 and 1 based on the square of
the differences between predicted and true values. The closer to 0 this metric is, the better
the model is performing. Because this metric is relative, it can be used to compare models
where the labels are in different units.
 Relative Absolute Error (RAE): A relative metric between 0 and 1 based on the absolute
differences between predicted and true values. The closer to 0 this metric is, the better the
model is performing. Like RSE, this metric can be used to compare models where the
labels are in different units.
 Coefficient of Determination (R2): This metric is more commonly referred to as R-
Squared, and summarizes how much of the variance between predicted and true values is
explained by the model. The closer to 1 this value is, the better the model is performing.
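To make these definitions concrete, the metrics above can be computed directly from predicted and true values. The sketch below uses invented dollar amounts; the relative metrics (RSE, RAE) compare the model's errors to a baseline that always predicts the mean label:

```python
import numpy as np

# Invented true and predicted prices (dollars)
y_true = np.array([10000.0, 15000.0, 22000.0, 30000.0])
y_pred = np.array([11000.0, 14000.0, 21000.0, 32000.0])

errors = y_pred - y_true
mae = np.mean(np.abs(errors))           # Mean Absolute Error (dollars)
rmse = np.sqrt(np.mean(errors ** 2))    # Root Mean Squared Error (dollars)

# Baseline errors: how far each true value is from the mean label
baseline = y_true - y_true.mean()
rse = np.sum(errors ** 2) / np.sum(baseline ** 2)        # Relative Squared Error
rae = np.sum(np.abs(errors)) / np.sum(np.abs(baseline))  # Relative Absolute Error
r2 = 1 - rse                            # Coefficient of Determination

print(f"MAE={mae:.0f} RMSE={rmse:.0f} RSE={rse:.3f} RAE={rae:.3f} R2={r2:.3f}")
```

Note that RMSE exceeds MAE here because the one large error (2000 dollars) is penalized more heavily by squaring, which is exactly the "greater variance in individual errors" the text describes.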

Deploy a predictive service

You can deploy a service that can be used in real time. To automate your model into
a service that makes continuous predictions, you need to create and deploy an
inference pipeline.

Inference pipeline

To deploy your pipeline, you must first convert the training pipeline into a real-time
inference pipeline. This process removes training components and adds web service
inputs and outputs to handle requests.

The inference pipeline performs the same data transformations as the first pipeline
for new data. Then it uses the trained model to infer, or predict, label values based on its
features. This model will form the basis for a predictive service that you can publish for
applications to use.
You can create an inference pipeline by selecting the menu above a completed job.

Deployment

After creating the inference pipeline, you can deploy it as an endpoint. In the endpoints
page, you can view deployment details, test your pipeline service with sample data, and
find credentials to connect your pipeline service to a client application.

It will take a while for your endpoint to be deployed. The Deployment state on
the Details tab will indicate Healthy when deployment is successful.
On the Test tab, you can test your deployed service with sample data in JSON format.
The Test tab is a quick way to check whether your model is behaving as expected, and
it's typically helpful to test the service before connecting it to an application.
You can find credentials for your service on the Consume tab. These credentials are
used to connect your trained machine learning model as a service to a client application.
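A client application typically calls the deployed endpoint by POSTing JSON with the credentials from the Consume tab. The sketch below only builds the request body and headers; the endpoint URL, key, and input schema are placeholders, not values from this exercise:

```python
import json

# Placeholder values -- in practice, copy these from the Consume tab
endpoint_url = "https://<your-endpoint>.azureml.net/score"  # hypothetical
api_key = "<your-primary-key>"                              # hypothetical

# Hypothetical input schema: one automobile whose price we want predicted
payload = {
    "Inputs": {
        "input1": [
            {"make": "toyota", "engine-size": 130, "horsepower": 96}
        ]
    }
}
headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {api_key}",
}

body = json.dumps(payload)
# An HTTP client (e.g. urllib.request or the requests package) would then
# POST `body` with `headers` to `endpoint_url` and parse the JSON response.
```

The exact input field names depend on the dataset your pipeline was trained on; the Consume tab shows the schema your specific service expects.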
UNIT 6/8:
Exercise - Explore regression with Azure
Machine Learning designer
Completed100 XP

 55 minutes

In this exercise, you will train a regression model that predicts the price of an
automobile based on its characteristics.

 Note

To complete this lab, you will need an Azure subscription in which you have
administrative access.

Launch the exercise and follow the instructions.


UNIT 7/8:
Knowledge check
200 XP
 3 minutes
1. 

In Azure Machine Learning studio, what can you use to author regression machine
learning pipelines using a drag-and-drop interface?

Notebooks

Automated machine learning

Designer

ANSWER: 3 (You can use Designer to author regression pipelines with a drag-and-drop
interface. )

2. 

You are creating a training pipeline for a regression model. You use a dataset that has
multiple numeric columns in which the values are on different scales. You want to
transform the numeric columns so that the values are all on a similar scale. You also
want the transformation to scale relative to the minimum and maximum values in each
column. Which module should you add to the pipeline?

Select Columns in a Dataset

Clean Missing Data

Normalize Data

ANSWER: 3 (When you transform numeric data to be on a similar scale, use a Normalize
Data module. )
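The min-max scaling this question describes can be sketched in a few lines of NumPy (the data values are invented for illustration): each column is rescaled relative to its own minimum and maximum, so columns on very different scales end up in the same 0-1 range.

```python
import numpy as np

# Two numeric columns on very different scales
data = np.array([[1.0, 100.0],
                 [2.0, 300.0],
                 [3.0, 500.0]])

# Min-max normalization: scale each column by its own min and max
col_min = data.min(axis=0)
col_max = data.max(axis=0)
scaled = (data - col_min) / (col_max - col_min)
print(scaled)  # each column now runs from 0.0 to 1.0
```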
3. 

Why do you split data into training and validation sets?

Data is split into two sets in order to create two models, one model with the training set
and a different model with the validation set.

Splitting data into two sets enables you to compare the labels that the model predicts
with the actual known labels in the original dataset.

Only split data when you use the Azure Machine Learning Designer, not in other
machine learning scenarios.

ANSWER: 2 ( You want to test the model created with training data on validation data to
see how well the model performs with data it was not trained on. )
UNIT 8/8:

Summary
Completed100 XP
 2 minutes

In this module, you learned how to:

 Identify regression machine learning scenarios.
 Use Azure Machine Learning designer to train a regression model.
 Use a regression model for inferencing.
 Deploy a regression model as a service.
