
EM0442 Lab 4

Singapore Polytechnic
EM0442 - Artificial Intelligence and Data Analytics for Aerospace
Lab 4: Multiple Linear Regression

1 Learning Objectives

When you have completed this lab, you should be able to:
1. Understand how to perform multiple linear regression (MLR) using least squares.
2. Understand how to display the results of MLR.
3. Apply the method of p-values to determine whether the independent variables included in the
regression are significant.
4. Examine the correlation matrix for our data.

2 Multiple Linear Regression


Multiple Linear Regression (MLR) is the process of finding the plane or hyperplane that best fits
a set of data points. The complete listing of the Python script to do this can be found in the Appendix.
We will look at the four main sections of this program once again, as we will want to modify it later.

2.1 Data input and preparation

Temp = [14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2]
Temp = np.array(Temp)
Income = [2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100]
Income = np.array(Income)
Sales = [215,325,185,332,406,522,412,614,544,421,445,408]
Sales = np.array(Sales)

Here the Temperature data is entered as a list, as are the Sales data and the new Income data.
Each list is then converted to a numpy ndarray.

If needed, a scatterplot can be drawn at this point. But for our lab it is not shown till the end of the
program.

2.2 Computation

In a similar fashion to our Simple Linear Regression example, we use a least squares method to
obtain the plane of best fit for the given data. As before, we create a matrix
that contains the independent variables. The leftmost column of this matrix has to be a column of ones,
and this time we stack on two sets of variables. The stacking can be repeated for more variables.

Singapore Polytechnic, School of EEE v1.1 Page 1



def MLRegress(Temp,Income,Sales):
    Z = np.ones(Temp.shape)      # row of ones for the intercept term
    Z = np.vstack((Z, Temp))     # add Temp by vertical stacking
    Z = np.vstack((Z, Income))   # add Income by vertical stacking
    Z = Z.T                      # transpose: (number of data points) x (1 + number of variables)
    b,resid,rank,sgl = np.linalg.lstsq(Z,Sales,rcond=None)
    return b
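
As an aside, the same matrix can be built in a single step. The sketch below is an illustration only and is not part of the lab program: it uses np.column_stack (which the original listing does not use) and checks that the result matches the stack-and-transpose approach of MLRegress. The names Z_lab and Z_alt are made up for this comparison.

```python
import numpy as np

Temp = np.array([14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2])
Income = np.array([2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100])

# Lab approach: stack the variables as rows, then transpose
Z_lab = np.vstack((np.ones(Temp.shape), Temp, Income)).T

# Alternative (assumption, not in the lab listing): stack the variables as columns
Z_alt = np.column_stack((np.ones(Temp.shape), Temp, Income))

print(Z_lab.shape)                    # rows = data points, columns = coefficients
print(np.array_equal(Z_lab, Z_alt))   # do both approaches give the same matrix?
```

Either way, each row of the matrix holds [1, Temp, Income] for one data point, which is the layout np.linalg.lstsq expects.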

This time, the b vector is returned such that b[0] + b[1] * x1 + b[2] * x2 is the plane in
three-dimensional (3D) space that gives the best fit, in the least squares error (LSE) sense, to the data.
Here x1 is Temp and x2 is Income.
Also note that we can only visualize the results for two independent variables and one
dependent variable in 3D space.
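
To make the role of the b vector concrete, the following sketch fits the plane and then evaluates it. It is a minimal illustration, not the lab's own code: predict_sales is a hypothetical helper, and the query point (20 deg C, income 5000) is an arbitrary value chosen for the example.

```python
import numpy as np

Temp = np.array([14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2])
Income = np.array([2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100])
Sales = np.array([215,325,185,332,406,522,412,614,544,421,445,408])

def MLRegress(Temp, Income, Sales):
    # Design matrix: column of ones, then Temp, then Income
    Z = np.column_stack((np.ones(Temp.shape), Temp, Income))
    b, resid, rank, sgl = np.linalg.lstsq(Z, Sales, rcond=None)
    return b

def predict_sales(b, temp, income):
    # Hypothetical helper: evaluate the fitted plane b[0] + b[1]*x1 + b[2]*x2
    return b[0] + b[1]*temp + b[2]*income

b = MLRegress(Temp, Income, Sales)
fitted = predict_sales(b, Temp, Income)   # plane evaluated at the sample points
residuals = Sales - fitted                # actual minus predicted

print("coefficients b:", b)
print("largest absolute residual:", np.max(np.abs(residuals)))
print("prediction at 20 deg C, income 5000:", predict_sales(b, 20.0, 5000.0))
```

Because the model includes an intercept, the least squares residuals should sum to (numerically) zero, which is a quick sanity check on the fit.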

2.3 Post processing

In order to show some meaningful data, we need to plot the predicted values obtained by the least
squares procedure in a 2D representation of a 3D plot, as a sort of isometric view.

3D surface (plane)

In this case, we cannot just have two 1D sets of data and expect them to be plotted for us. We are
plotting a surface, and the coordinates of all the points on that surface need to be generated.

1. The values of all x and y coordinates need to be generated first using the arange() function.
These values are purely for plotting, so they need not be in the original set of data. These data
vectors are xmesh and ymesh.
2. These vectors are passed to the meshgrid() function, which generates a grid of data.
In other words, the xmesh (Temp) vector is repeated once for each element of
the ymesh (Income) vector, and the ymesh vector is repeated once for each element of xmesh.
3. Now all the z axis data needs to be generated in zmesh, in this case by the line:


zmesh = (b[0] + b[1]*xmesh + b[2]*ymesh)

zmesh then contains the predicted Sales values, so that a surface can be plotted.

# Prepare the 3D plot


xmesh = np.arange(min(Temp),max(Temp)) # generate range of values
ymesh = np.arange(min(Income),max(Income),100)
xmesh, ymesh = np.meshgrid(xmesh, ymesh)
zmesh = (b[0] + b[1]*xmesh + b[2]*ymesh) # compute z values
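
The effect of meshgrid is easier to see on a tiny example. The sketch below uses made-up axis values and a made-up coefficient vector b_demo (neither comes from the lab data) purely to show the shapes involved.

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0])   # 3 values along the x axis (made up)
y = np.array([1.0, 2.0])           # 2 values along the y axis (made up)

xg, yg = np.meshgrid(x, y)         # both grids have shape (len(y), len(x))
print(xg)   # [[10. 20. 30.]
            #  [10. 20. 30.]]
print(yg)   # [[1. 1. 1.]
            #  [2. 2. 2.]]

b_demo = [1.0, 0.5, 2.0]           # hypothetical plane coefficients
zg = b_demo[0] + b_demo[1]*xg + b_demo[2]*yg   # one z value per grid point
print(zg)   # [[ 8. 13. 18.]
            #  [10. 15. 20.]]
```

Each (xg[i, j], yg[i, j], zg[i, j]) triple is one point on the plane, which is the form plot_surface takes.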

2.4 Display output


The facilities for all the simulated 3D plotting come from the mpl_toolkits.mplot3d
library, from which we import the Axes3D class. This class has its own plot command. In the
upper part of the figure, we construct a special 3D stem plot which shows whether the actual data
points are above or below the fitted plane.

from mpl_toolkits.mplot3d import Axes3D


. . .
ax.plot([xs[i], xs[i]], [ys[i], ys[i]], [0, zs[i]], ':', linewidth=2, color=colr, alpha=.5)
# plot up arrow if actual Sales is above the computed value, else down arrow
ax.plot([xs[i]], [ys[i]], [zs[i]], markertype, markersize=8,
        markerfacecolor='none', color= colr,label='ib')
. . .
ax.plot_surface(xmesh, ymesh, zmesh, alpha=0.2) # plot it

The marker type and colour are chosen by comparing the computed value
(zsb = b[0] + b[1]*xs[i] + b[2]*ys[i]) with the actual value (zs[i]), as shown below.

xs = Temp; ys = Income; zs = Sales


for i in range(len(xs)):
    # check if computed z is < actual value
    zsb = b[0] + b[1]*xs[i] + b[2]*ys[i]
    # use differing colours to indicate if actual z value is above computed
    if zsb < zs[i]:
        markertype = '^'; colr = 'b' # computed is below actual
    else:
        markertype = 'v'; colr = 'g' # computed is at or above actual
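
The same above/below decision can also be made for all points at once. The sketch below is an optional alternative to the loop, assuming the same data and the same marker convention; the array names (predicted, above, markers, colours) are introduced here for illustration.

```python
import numpy as np

Temp = np.array([14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2])
Income = np.array([2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100])
Sales = np.array([215,325,185,332,406,522,412,614,544,421,445,408])

Z = np.column_stack((np.ones(Temp.shape), Temp, Income))
b, *_ = np.linalg.lstsq(Z, Sales, rcond=None)

predicted = Z @ b                      # computed Sales for every sample
above = predicted < Sales              # True where the actual value lies above the plane

markers = np.where(above, '^', 'v')    # same convention as the loop
colours = np.where(above, 'b', 'g')

for t, inc, m in zip(Temp, Income, markers):
    print(t, inc, m)
```

The loop version is kept in the lab program because each point is plotted individually; the vectorised form is mainly handy for counting or inspecting the points on each side of the plane.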

2.5 Try it out!


Copy the code and run it and view the results. You may wish to modify the outputs and add in
various kinds of annotations. Save the program and get ready for a modification.


3 Backward elimination
We will now use the p-value to identify the significant independent variables. We will use the
statsmodels.api library and use only a few of its functions.

3.1 Bringing in a statistics library – try it out!

Modify the previous program by importing the library. You may either modify the code in
the MLRegress function or copy it to another function, MLRegressOLS, to easily compare results. Add in the
lines that call sm.OLS(), fit() and summary(), which will give you the summary statistics. Also, add:
import statsmodels.api as sm.

    b,resid,rank,sgl = np.linalg.lstsq(Z,Sales,rcond=None)
    est = sm.OLS(Sales, Z)     # set up the ordinary least squares model
    est2 = est.fit()           # fit the model
    print(est2.summary())      # print the summary statistics
    return b

Run the program again and look at the output of the OLS function.

Parameters    p-values for x1, x2    Actual variable name
x1            ________               ________
x2            ________               ________
R-squared     ________

Which variable is not considered significant?
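
To see where the p-values in the summary come from, the sketch below reproduces the usual OLS calculation with numpy only. It is a hedged illustration, not the statsmodels implementation: it assumes the summary reports two-sided p-values from Student's t distribution with n - k degrees of freedom, and two_sided_p is a hypothetical helper that integrates the t density numerically instead of calling a statistics library.

```python
import math
import numpy as np

Temp = np.array([14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2])
Income = np.array([2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100])
Sales = np.array([215,325,185,332,406,522,412,614,544,421,445,408])

Z = np.column_stack((np.ones(Temp.shape), Temp, Income))
n, k = Z.shape                           # n data points, k coefficients
b, *_ = np.linalg.lstsq(Z, Sales, rcond=None)

resid = Sales - Z @ b
dof = n - k                              # residual degrees of freedom
sigma2 = (resid @ resid) / dof           # estimate of the error variance
cov_b = sigma2 * np.linalg.inv(Z.T @ Z)  # covariance matrix of the coefficients
se = np.sqrt(np.diag(cov_b))             # standard error of each coefficient
t_stats = b / se

def two_sided_p(t, dof, steps=200001):
    # Hypothetical helper: P(|T| > |t|) for Student's t with dof degrees of
    # freedom, using trapezoidal integration of the density over [-|t|, |t|].
    x = np.linspace(-abs(t), abs(t), steps)
    c = math.gamma((dof + 1) / 2) / (math.sqrt(dof * math.pi) * math.gamma(dof / 2))
    pdf = c * (1 + x**2 / dof) ** (-(dof + 1) / 2)
    area = np.sum(0.5 * (pdf[1:] + pdf[:-1]) * np.diff(x))
    return max(0.0, 1.0 - area)

p_values = np.array([two_sided_p(t, dof) for t in t_stats])
r2 = 1 - (resid @ resid) / np.sum((Sales - Sales.mean())**2)

for name, bi, ti, pi in zip(("b[0] (intercept)", "b[1] (Temp)", "b[2] (Income)"),
                            b, t_stats, p_values):
    print(f"{name:18s} coef={bi:10.4f}  t={ti:8.3f}  p={pi:.4f}")
print(f"R-squared = {r2:.4f}")
```

A coefficient whose p-value is above the chosen significance level (0.05 is a common choice) is a candidate for removal in backward elimination. The values printed here should be compared with the statsmodels summary rather than taken as a replacement for it.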

4 Looking at the correlation matrix


In the final part of this lab, we will examine the correlation matrix for our dataset. It is placed
here to emphasise that this matrix should be a final tool to gain insights into the multiple
linear regression analysis. Note the following:
1. The corrcoef function from numpy is used to compute the correlation matrix.
2. The matrix is displayed using the imshow function from matplotlib. Essentially the
matrix is treated as an image, in this case with a size of 3 x 3 pixels.
The colorbar shows the colour associated with the values in the image.

Compare your result with that in the lecture slides. Note that the correlation coefficients are all
above 0.95 and yet one of the independent variables is not useful.
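
Before plotting, the coefficients can also be inspected as plain numbers. The following is a minimal sketch that prints each pairwise correlation; the names tuple and the loop layout are choices made for this example and are not part of the lab listing.

```python
import numpy as np

Temp = np.array([14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2])
Income = np.array([2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100])
Sales = np.array([215,325,185,332,406,522,412,614,544,421,445,408])

# One row per variable, one column per sample (the layout np.corrcoef expects by default)
corr = np.corrcoef(np.vstack((Temp, Income, Sales)))

names = ('Temp', 'Income', 'Sales')
for i in range(3):
    for j in range(i + 1, 3):          # upper triangle only; the matrix is symmetric
        print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:.3f}")
```

The Temp vs Income entry is the one that describes the relationship between the two independent variables, which is relevant when asking why one of them adds little to the regression.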

5 Trying your own dataset

A delivery company is trying to predict the travel time for its drivers. To conduct the analysis,
10 random samples collected from past trips are listed below.

Miles Travelled   Num of Deliveries   Gas Price   Travel Time
89                4                   3.84        7
66                1                   3.19        5.4
78                3                   3.78        6.6
111               6                   3.89        7.4
44                1                   3.57        4.8
77                3                   3.57        6.4
80                3                   3.03        7
66                2                   3.51        5.6
109               5                   3.54        7.3
76                3                   3.25        6.4

i) Check the correlation between the dependent variable and each independent variable separately.
ii) Investigate the independent variables: is there any collinearity between them?
iii) Drop the variable with weak correlation and conduct an MLR analysis. What are your
conclusions?
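
As a possible starting point for this exercise, the sketch below enters the table as numpy arrays and computes the full correlation matrix. The variable names (miles, deliveries, gas_price, travel_time) are chosen here for illustration, and the interpretation of the numbers is left to you.

```python
import numpy as np

# Data copied from the table above
miles = np.array([89, 66, 78, 111, 44, 77, 80, 66, 109, 76], dtype=float)
deliveries = np.array([4, 1, 3, 6, 1, 3, 3, 2, 5, 3], dtype=float)
gas_price = np.array([3.84, 3.19, 3.78, 3.89, 3.57, 3.57, 3.03, 3.51, 3.54, 3.25])
travel_time = np.array([7, 5.4, 6.6, 7.4, 4.8, 6.4, 7, 5.6, 7.3, 6.4])

names = ('Miles', 'Deliveries', 'GasPrice', 'TravelTime')
corr = np.corrcoef(np.vstack((miles, deliveries, gas_price, travel_time)))

# Part i): correlation of each independent variable with the dependent variable
for i in range(3):
    print(f"{names[i]} vs TravelTime: r = {corr[i, 3]:.3f}")

# Part ii): correlations among the independent variables themselves
for i in range(3):
    for j in range(i + 1, 3):
        print(f"{names[i]} vs {names[j]}: r = {corr[i, j]:.3f}")
```

For part iii), the design matrix can be built as in MLRegress, using only the variables you decide to keep.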


APPENDIX – Program Listing – functions – 1 / 3 (append to 2 / 3)

# Multiple Linear Regression


import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pdb

def MLRegress(Temp,Income,Sales):
    Z = np.ones(Temp.shape)      # row of ones for the intercept term
    Z = np.vstack((Z, Temp))     # add Temp by vertical stacking
    Z = np.vstack((Z, Income))   # add Income by vertical stacking
    Z = Z.T                      # transpose: (number of data points) x (1 + number of variables)
    b,resid,rank,sgl = np.linalg.lstsq(Z,Sales,rcond=None)
    return b

def plot3Dstem(Temp,Income,Sales,b,ax):
    # for a 3D plot of a plane, we want to see sample data points that are
    # above or below the plane, in the z (vertical) axis
    xs = Temp; ys = Income; zs = Sales
    for i in range(len(xs)):
        # check if computed z is < actual value
        zsb = b[0] + b[1]*xs[i] + b[2]*ys[i]
        # use differing colours to indicate if actual z value is above computed
        if zsb < zs[i]:
            markertype = '^'; colr = 'b' # computed is below actual
        else:
            markertype = 'v'; colr = 'g' # computed is at or above actual
        # special 3D stem plot
        ax.plot([xs[i], xs[i]], [ys[i], ys[i]], [0, zs[i]], ':', linewidth=2, color=colr, alpha=.5)
        # plot up arrow if actual Sales is above the computed value, else down arrow
        ax.plot([xs[i]], [ys[i]], [zs[i]], markertype, markersize=8,
                markerfacecolor='none', color= colr,label='ib')
    return # nothing to return!


APPENDIX – Program Listing – functions – 2/3


# Main program
Temp = [14.2,16.4,11.9,15.2,18.5,22.1,19.4,25.1,23.4,18.1,22.6,17.2]
Temp = np.array(Temp)
Sales = [215,325,185,332,406,522,412,614,544,421,445,408]
Sales = np.array(Sales)
Income = [2680,4030,2170,4030,4900,6270,4850,7270,6500,4940,5460,5100]
Income = np.array(Income)
# compute
b = MLRegress(Temp,Income,Sales)
# prepare output - set the basic plot first
fig = plt.figure(figsize=(8,6), dpi=100) # 800 x 600 pixels
ax = fig.add_subplot(111, projection='3d') # recommended approach
plt.title('Ice-cream sales vs Temperature and Income')
ax.set_xlabel('Temperature (deg C)')
ax.set_ylabel('Income $')
ax.set_zlabel('Sales $')
# plot the stems
plot3Dstem(Temp,Income,Sales,b,ax)

# Plot the plane


# Prepare the 3D plot
xmesh = np.arange(min(Temp),max(Temp)) # generate range of values
ymesh = np.arange(min(Income),max(Income),100)
xmesh, ymesh = np.meshgrid(xmesh, ymesh)
zmesh = (b[0] + b[1]*xmesh + b[2]*ymesh) # compute z values
ax.plot_surface(xmesh, ymesh, zmesh, alpha=0.2) # plot it
plt.savefig('MLR.svg')
plt.show()

APPENDIX – Program Listing – Correlation plot – 3/3


# correlation coeff
xyz = np.vstack((Temp,Income,Sales)) # vertical stack: (number of variables) x (number of samples)
corrcoef = np.corrcoef(xyz)
fig, ax = plt.subplots()
im=ax.imshow(corrcoef, cmap='coolwarm')
ax.figure.colorbar(im)
ax.xaxis.set(ticks=(0, 1, 2), ticklabels=('Temp', 'Income', 'Sales'))
ax.yaxis.set(ticks=(0, 1, 2), ticklabels=('Temp', 'Income', 'Sales'))
ax.set_title('Correlation coefficients')
ax.set_ylim(2.5, -0.5) # make sure the coloured squares fit inside the boundary (may need adjusting)
for i in range(3):
    for j in range(3):
        ax.text(j, i, (int(corrcoef[i, j]*100))/100, ha='center', va='center', color='y')
plt.savefig('CorrCoef.svg')
plt.show()