
Advanced Applied Business

Analytics
Machine Learning: Regression
Andy Oh
03 May 2021

0.1 Objectives
The course begins with a gentle introduction to linear regression, starting with correlation
analysis before moving into multiple regression models. The key assumptions of linear
regression are beyond the coverage of this course. Heuristic data transformation to address
non-normal data distributions will also be covered in this course.
Linear regression models are used to predict a continuous value, for example, predicting the
price of a house given its size, location and number of rooms as input variables. In machine
learning, linear regression is considered a supervised learning technique, where a set of
input variables and their outputs are known. The data from the input variables (X) are used
to train a model that predicts the output variable (Y). The predictions are then compared to the
known outcomes to determine the overall quality of the model.

1 Correlation Analysis
1.1 Scatter plot
The study of the relationship between two variables (interval or ratio data) often starts with
a visual representation using a scatter plot. Consider a simple dataset of advertising
spending on YouTube and sales revenue. Can you observe the relationship between the
advertising cost and the sales revenue?

Year   youtube   sales
2001    276.12   26.52
2002     53.40   12.48
2003     20.64   11.16
2004    181.80   22.20
2005    216.96   15.48
2006     10.44    8.64
2007     69.00   14.16
2008    144.24   15.84
2009     10.32    5.76
2010    239.76   12.72
2011     79.32   10.32
2012    257.64   20.88
2013     28.56   11.04
2014    117.00   11.64
2015    244.92   22.80
2016    234.48   26.88
2017     81.36   15.00
2018    337.68   29.28
2019     83.04   13.56
2020    176.76   17.52

From the scatter plot, we can observe a positive relationship between advertising spending
and sales revenue: higher advertising spending on YouTube tends to come with higher sales
revenue for the company.

There are times when the relationship between two variables is not so clear. In the
scatter plot below, we are not able to observe a clear pattern like in the previous scatter plot.
In correlation, we can generalise the relationship between two variables
as positive, negative or no correlation.

The correlation coefficient describes the quantitative strength of the relationship between two
variables (ratio or interval). It is often referred to as Pearson’s correlation or Pearson’s r.
A correlation of −1.00 or +1.00 indicates a perfect correlation.
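For readers who want to check this outside Azure, here is a minimal Python sketch (pandas assumed available) that computes Pearson’s r for the youtube/sales data tabulated above.

```python
# A minimal sketch: Pearson's r for the advertising data shown above.
import pandas as pd

ads = pd.DataFrame({
    "youtube": [276.12, 53.40, 20.64, 181.80, 216.96, 10.44, 69.00, 144.24,
                10.32, 239.76, 79.32, 257.64, 28.56, 117.00, 244.92, 234.48,
                81.36, 337.68, 83.04, 176.76],
    "sales":   [26.52, 12.48, 11.16, 22.20, 15.48, 8.64, 14.16, 15.84,
                5.76, 12.72, 10.32, 20.88, 11.04, 11.64, 22.80, 26.88,
                15.00, 29.28, 13.56, 17.52],
})

r = ads["youtube"].corr(ads["sales"])  # Pearson's correlation is the default method
print(round(r, 4))                     # a value close to +1 indicates a strong positive correlation
```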
1.2 Correlation using Azure
Use the Compute Linear Correlation module in Azure Machine Learning Studio to
compute a set of Pearson correlation coefficients for each possible pair of variables in the
input dataset.
The Pearson correlation coefficient, sometimes called Pearson’s R test, is a statistical value
that measures the linear relationship between two variables. By examining the coefficient
values, you can infer something about the strength of the relationship between the two
variables, and whether they are positively correlated or negatively correlated.
source: https://docs.microsoft.com/en-us/azure/machine-learning/studio-module-reference/compute-linear-correlation

1.3 Azure Demonstration


1.3.1 About marketing dataset
The marketing dataset contains the impact of three advertising media (youtube, facebook and
newspaper) on sales. The data are the advertising budgets in thousands of dollars, along with
the sales. All columns are numerical in nature.

• Use the marketing.csv dataset and create a new experiment.
• Set up the necessary instances in the workspace as shown in the video.
• Follow the settings and run the experiment to view the output.
• The relationships between sales and youtube, facebook and newspaper can be visualized by
plotting two columns as shown in the video.

From the correlation output, which variable has the highest correlation to sales? Is the
relationship a positive or negative one?

1.4 Correlation and Linearity


High correlation does not necessarily mean that the underlying relationship between the two
variables is a highly linear one. Here are some visual examples of high correlation
coefficients without high linearity.

1.5 Correlation and Causation


Without a domain understanding of the relationship in the data, it can be dangerous to assume
causality. In the case of a high correlation coefficient of 0.782 between sales and youtube, it
does not necessarily mean that sales revenue is caused by youtube advertising expenditure.
There could be a third variable that caused the change in sales revenue.
Correlation analysis only confirms that the given data move in tandem.

2 Simple Linear Regression


While correlation is useful for understanding the strength of the relationship between two
variables, it does not estimate how much a change in one variable (X) will affect the change
of the other variable (Y). For example, how much is sales revenue expected to increase if the
advertising spending on YouTube increases to $1000 per day?
The technique to provide such estimation is called Regression Analysis.

2.1 Ordinary Least Square Method (OLS)


The objective of OLS is to use data to position a line that best represents the relationship
between the X and Y variables.
In linear regression, the relationship between the independent variable (X) and the dependent
variable (Y) is assumed to be linear.
The equation of the regression line is Ŷ = a + bX, where

• Ŷ is the estimated value of the dependent variable
• X is any value of the independent variable
• b is the slope of the line
• a is the Y-intercept

The objective of the least squares method is to produce the line with the least total squared
distance between the predicted values (Ŷ) and the observed values of the dependent variable (Y).
In other words, OLS minimises the sum of squared differences between the observed value of
the dependent variable (Y) and its predicted value (Ŷ).
Formula to calculate β (beta) with OLS:
β = Σⁿᵢ₌₁ (xᵢ − x̄)(yᵢ − ȳ) / Σⁿᵢ₌₁ (xᵢ − x̄)²

where x̄ and ȳ are the means of x and y.


Notice that the numerator of the OLS formula is similar to that of the correlation coefficient,
while the denominator is the sum of squared deviations of x, which is proportional to its
variance, σ².

source: https://www.hackerearth.com
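A plain-numpy sketch of this formula is shown below (not the Azure implementation); the intercept follows from the standard identity a = ȳ − b·x̄, and the five data points are taken from the table in section 1.1 purely for illustration.

```python
# A sketch of the OLS formula: b = Σ(xᵢ − x̄)(yᵢ − ȳ) / Σ(xᵢ − x̄)², then a = ȳ − b·x̄.
import numpy as np

x = np.array([276.12, 53.40, 20.64, 181.80, 216.96])  # youtube (first five rows above)
y = np.array([26.52, 12.48, 11.16, 22.20, 15.48])     # sales

x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)  # slope
a = y_bar - b * x_bar                                             # Y-intercept
print(f"Ŷ = {a:.4f} + {b:.4f}X")
```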
After calculating Ŷ for each observed Y, three different sums of squares can be derived
between Yᵢ, Ȳ and Ŷ:
• Total Sum of Squares (TSS) = Σⁿᵢ₌₁ (Yᵢ − Ȳ)²
• Explained Sum of Squares (ESS) = Σⁿᵢ₌₁ (Ŷᵢ − Ȳ)²
• Residual Sum of Squares (RSS) = Σⁿᵢ₌₁ (Yᵢ − Ŷᵢ)²

Notice that TSS = ESS + RSS.
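The identity can be verified numerically; the sketch below fits the same OLS line as above and checks that TSS equals ESS plus RSS (data points again taken from the table in section 1.1 for illustration).

```python
# Verifying TSS = ESS + RSS for an OLS fit.
import numpy as np

x = np.array([276.12, 53.40, 20.64, 181.80, 216.96])
y = np.array([26.52, 12.48, 11.16, 22.20, 15.48])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained sum of squares
rss = np.sum((y - y_hat) ** 2)         # residual sum of squares
print(np.isclose(tss, ess + rss))      # True: the decomposition holds exactly for OLS
```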

2.2 Azure Demonstration


Let’s create an Azure experiment to carry out linear regression modeling.

• Use the marketing.csv dataset and create a new experiment.
• Set up the necessary instances in the workspace as shown in the video.
• Follow the settings and run the experiment to view the output.

2.2.1 Ordinary Least Square option


Note that the Linear Regression instance in Azure provides two options:

• Ordinary Least Square
• Online Gradient Descent

Throughout this course, we will select Ordinary Least Square and set the L2 regularization
weight to zero for Linear Regression modeling.
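For comparison outside Azure, scikit-learn offers an analogous pair of estimators (an assumption for illustration, not Azure’s code): LinearRegression solves ordinary least squares in closed form with no regularization term at all, which matches an L2 weight of zero.

```python
# Ordinary least squares with no regularization, analogous to Azure's
# "Ordinary Least Square" option with L2 regularization weight = 0.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[276.12], [53.40], [20.64], [181.80], [216.96]])  # youtube spend
y = np.array([26.52, 12.48, 11.16, 22.20, 15.48])               # sales

model = LinearRegression().fit(X, y)
print(model.intercept_, model.coef_)  # the intercept a and the slope b
# (Azure's "Online Gradient Descent" option corresponds to an iterative
# solver such as sklearn.linear_model.SGDRegressor.)
```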

2.2.2 Train Model Results


The results of the train model instance will show the weights as follows:
We can express the equation of the regression line
Ŷ = a + bX as: Ŷ = 8.43911 + 0.0475366 × youtube, where 0.0475366 is the weight of youtube
and 8.43911 is the weight of the intercept. In statistics, the weights are also known
as the Coefficients of the Estimates.
We can interpret the equation as:

• A dollar unit increase in youtube will result in an increase of 0.0475366 in sales revenue.

In other words, the equation allows us to predict the sales revenue given a value of YouTube
advertising.
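As a sketch of what the Score Model instance does with these weights, prediction reduces to plugging a youtube value into the fitted equation (the function name below is just illustrative):

```python
# Scoring a new observation with the trained weights reported above.
def predict_sales(youtube_spend: float) -> float:
    """Ŷ = 8.43911 + 0.0475366 × youtube."""
    return 8.43911 + 0.0475366 * youtube_spend

print(predict_sales(100.0))  # estimated sales for a youtube budget of 100 (thousand dollars)
```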

2.2.3 Evaluate Model


While the Coefficients of the Estimates inform us how much each coefficient accounts
for the rate of change in the dependent variable, the Coefficient of Determination informs us of
the proportion of the total variation in the dependent variable (Y) that is explained by the
variation in the independent variable (X).
Root Mean Square Error (RMSE), on the other hand, summarises how much error remains in the
model's estimates. Think of RMSE as the typical difference between the predicted and
actual values.
2.3 Coefficient of Determination
R² is useful when there are two or more independent variables involved in the regression
analysis: it tells us how these variables collectively explain the variation of Y.
From the results of the Evaluate Model, we can say that the independent variable (youtube)
explains 61.1875% of the variation in the dependent variable (sales).
R² ranges from 0 to 1.00 and, to express the variation accounted for, it is often converted to a
percentage. For example, an R² of 0.611875 is often reported as 61.19% instead.
The higher the value of R², the more the independent variables account for the variation
of Y. Conversely, the lower the R², the less the independent variables account for the
variation of Y.
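The sketch below shows how R² is assembled from the sums of squares, R² = 1 − RSS/TSS (equivalently ESS/TSS); it reuses the trained weights above and five observations from the table in section 1.1, so the figure it prints is illustrative rather than the Evaluate Model value.

```python
# R² from sums of squares: the share of variation in Y explained by the model.
import numpy as np

x = np.array([276.12, 53.40, 20.64, 181.80, 216.96])   # youtube
y = np.array([26.52, 12.48, 11.16, 22.20, 15.48])      # actual sales
y_hat = 8.43911 + 0.0475366 * x                        # predicted sales (trained weights)

tss = np.sum((y - y.mean()) ** 2)
rss = np.sum((y - y_hat) ** 2)
r2 = 1 - rss / tss
print(f"R² = {r2:.4f}, i.e. {r2:.2%} of the variation in sales is explained")
```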

2.4 Root Mean Square Error


Root Mean Square Error (RMSE) is the standard deviation of the residuals (prediction
errors). Residuals are a measure of how far from the regression line data points are; RMSE is
a measure of how spread out these residuals are. In other words, it tells you how concentrated
the data is around the line of best fit.
source: https://www.statisticshowto.com/probability-and-statistics/regression-analysis/rmse-root-mean-square-error/
Here is a simplified illustration of how RMSE is derived using the actual and predicted
values of each observation from the Score Model instance.
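A matching sketch of the RMSE calculation: square each residual, take the mean, then the square root (again using the trained weights and a handful of observations from section 1.1 for illustration).

```python
# RMSE: the standard deviation of the residuals (prediction errors).
import numpy as np

x = np.array([276.12, 53.40, 20.64, 181.80, 216.96])
y_actual = np.array([26.52, 12.48, 11.16, 22.20, 15.48])
y_pred = 8.43911 + 0.0475366 * x          # predictions from the trained weights

residuals = y_actual - y_pred
rmse = np.sqrt(np.mean(residuals ** 2))   # root of the mean squared error
print(f"RMSE = {rmse:.4f}")
```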
2.5 Model Building Activity
2.5.1 Learning Objectives
• Setup the instances in Azure workspace for linear regression.
• Configure each instance correctly.
• Evaluate the model performance.

2.5.2 Tasks
• Create a linear regression model using marketing.csv.
• Set up newspaper as an independent variable to predict sales in one experiment.
• Use facebook as an independent variable to predict sales in another experiment.
• Express the weights as an equation to predict sales for each experiment.
• Which variable seems to produce fewer errors and explain the variability
of sales better?

3 Multiple Linear Regression


The multiple linear regression equation is given as:
Ŷ = a + b₁X₁ + b₂X₂ + b₃X₃ + ... + bₖXₖ, where

• Ŷ is the estimated value of the dependent variable.
• X is any value of the independent variable.
• bⱼ is the amount by which Y changes when a particular X increases by one unit, with the
values of all other independent variables held constant. The subscript j denotes each
independent variable and k denotes the number of independent variables.
• a is the Y-intercept.

3.1 Azure Demonstration


Let’s create a new Azure experiment to carry out multiple linear regression modeling.

• Copy the simple linear regression experiment from 2.2.
• Set up the necessary instances in the workspace as shown in the video.
• Follow the settings and run the experiment to view the output.

3.1.1 Train Model Results


The results of the train model instance will show the weights as follows:
From the weights, we can express the equation of the regression line
as: Ŷ = 3.52667 + 0.18853 × facebook + 0.0457646 × youtube − 0.00103749 × newspaper.
We can interpret the equation as follows:

• A dollar unit increase in youtube will result in an increase of 0.0457646 in sales revenue.
• A dollar unit increase in facebook will result in an increase of 0.18853 in sales revenue.
• A dollar unit increase in newspaper will result in a decrease of 0.00103749 in sales revenue.

Which independent variable is the most important one to affect sales positively?
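The same multiple regression can be sketched in scikit-learn, assuming marketing.csv sits in the working directory with the columns described in section 1.3.1:

```python
# Multiple linear regression on all three advertising media.
import pandas as pd
from sklearn.linear_model import LinearRegression

marketing = pd.read_csv("marketing.csv")          # assumed path and column names
X = marketing[["facebook", "youtube", "newspaper"]]
y = marketing["sales"]

model = LinearRegression().fit(X, y)
print(model.intercept_)                           # a
print(dict(zip(X.columns, model.coef_)))          # b1, b2, b3 for each medium
```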

3.2 Coefficient of Determination


From the results of the Evaluate Model, we can say that the independent variables (youtube,
facebook and newspaper) explain 89.72% of the variation in the dependent variable (sales).

3.3 Root Mean Square Error


The RMSE from the Evaluate Model is 2.002284. Notice that the RMSE of the multiple linear
regression is lower than that of the simple linear regression, which used a single independent
variable to predict sales.

3.4 Model Comparison


With both the R² and the RMSE improved when all three independent variables are used to predict
sales, we can conclude that this model is better at predicting sales than a model
with one independent variable.

4 Regression with Decision Trees


It is also possible to perform regression with decision trees.
When there is a non-linear relationship between the independent variables (predictors) and
the dependent variable (target), regression using decision trees is more suitable than linear
regression.
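As a rough open-source counterpart (an assumption for illustration, not Azure’s algorithm), scikit-learn’s RandomForestRegressor also fits an ensemble of regression trees:

```python
# Tree-based regression: an ensemble of decision trees predicting sales.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

marketing = pd.read_csv("marketing.csv")          # assumed path and column names
X = marketing[["youtube", "facebook", "newspaper"]]
y = marketing["sales"]

forest = RandomForestRegressor(n_estimators=100, random_state=42).fit(X, y)
print(forest.predict(X.head()))  # point estimates of sales; no linear weights are produced
```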

4.1 Azure Demonstration


Let’s create a new Azure experiment to carry out multiple regression modeling using the
Decision Forest Regression instance.

• Copy the multiple linear regression experiment from 3.1.
• Set up the necessary instances in the workspace as shown in the video.
• Follow the settings and run the experiment to view the output.

4.1.1 Permutation Feature Importance


Unlike linear regression, the regression tree results of the train model instance do not
contain weights. The output is the nodes and leaves of the trees constructed based on the
parameters in the Decision Forest Regression instance.
We do not know the magnitude of change caused by each predictor, i.e. how much sales will
change when there is a unit increase in the predictor.
The Permutation Feature Importance instance computes a set of feature importance scores for
your dataset. You can use these scores to help determine the best features to use in a
model, i.e. which predictors are more important.
Follow the next Azure demonstration to use the Permutation Feature Importance instance.
Select Root Mean Square Error in the Metric for measuring performance dropdown, as
this is a regression model.
From the output of Permutation Feature Importance, we can see that the Decision Forest
Regression trees rank the importance of the predictors in the order of youtube, facebook and
newspaper.
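Scikit-learn offers the same idea via permutation_importance (sketched below under the same marketing.csv assumptions): each predictor is shuffled in turn, and the resulting drop in model performance becomes its importance score.

```python
# Permutation feature importance for a tree-based regression model.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

marketing = pd.read_csv("marketing.csv")          # assumed path and column names
X = marketing[["youtube", "facebook", "newspaper"]]
y = marketing["sales"]

forest = RandomForestRegressor(random_state=42).fit(X, y)
scores = permutation_importance(forest, X, y,
                                scoring="neg_root_mean_squared_error",
                                n_repeats=10, random_state=42)
for name, mean in zip(X.columns, scores.importances_mean):
    print(name, round(mean, 4))  # a larger score means a more important predictor
```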
5 Linear versus Regression Trees
We can further compare the linear and regression tree models by evaluating which model
predicts more accurate point estimates of sales, looking at the RMSE and R².

5.1 Azure Demonstration


Let’s create a new Azure experiment to compare the models using
the marketing.csv dataset.

• Set up the necessary instances in the workspace as shown in the video.
• Follow the settings and run the experiment to view the output.

From the output of the Evaluate Model instance, we can observe that the regression tree
model produces a lower RMSE and a higher R² than the linear regression model.
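A side-by-side sketch of that comparison is shown below (the 70/30 train/test split is an assumption; the exact Azure setup is defined in the video):

```python
# Comparing linear regression and a regression tree ensemble on held-out data.
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

marketing = pd.read_csv("marketing.csv")          # assumed path and column names
X = marketing[["youtube", "facebook", "newspaper"]]
y = marketing["sales"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=42)

for model in (LinearRegression(), RandomForestRegressor(random_state=42)):
    pred = model.fit(X_tr, y_tr).predict(X_te)
    rmse = mean_squared_error(y_te, pred) ** 0.5
    print(type(model).__name__, f"RMSE = {rmse:.3f}", f"R² = {r2_score(y_te, pred):.3f}")
```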
6 Categorical Data with Linear Regression
Linear regression can also be performed when the independent variables are qualitative or
categorical data.

6.1 Azure Demonstration


Let’s create an Azure experiment to carry out simple linear regression modeling with
categorical data.

• Use the heartache.csv dataset and create a new experiment.
• Set up the necessary instances in the workspace as shown in the video.
• Follow the settings and run the experiment to view the output.

6.1.1 Train Model Results


The results of the train model instance will show the weights as follows:
The equation of the regression line
becomes: Ŷ = 3.41921 + 1.8474 × gender_male_1 + 1.57181 × gender_female_0, where Ŷ is the
estimated cholesterol.
We can interpret the equation as:

• For male: the cholesterol estimate is 3.41921 + (1.8474 × 1) + (1.57181 × 0) = 5.26661
• For female: the cholesterol estimate is 3.41921 + (1.8474 × 0) + (1.57181 × 1) = 4.99102

Under the hood, the categories of gender (male and female) are encoded as 0 or 1. When
using male as an input to predict cholesterol, male takes on a value of 1 and female takes on a
value of 0. Conversely, when using female as an input to predict cholesterol, female takes
on a value of 1 and male takes on a value of 0. This method is commonly known as one-hot
encoding.
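A sketch of that encoding with pandas, assuming heartache.csv has a text gender column (the file layout is an assumption; Azure performs the encoding internally):

```python
# One-hot encoding a categorical predictor before linear regression.
import pandas as pd
from sklearn.linear_model import LinearRegression

heart = pd.read_csv("heartache.csv")              # assumed path and column names
dummies = pd.get_dummies(heart["gender"])         # one 0/1 indicator column per category
print(dummies.head())                             # e.g. a male row -> male=1, female=0

# Keeping both indicator columns plus an intercept mirrors the Azure output above.
model = LinearRegression().fit(dummies, heart["cholesterol"])
print(model.intercept_, dict(zip(dummies.columns, model.coef_)))
```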

6.2 Multiple Categorical Data


Let’s add the rest of the categorical columns to predict cholesterol by modifying the earlier
experiment.

The results of the train model instance will show more weights as more categorical data are
added to the regression.
