
Introduction

This report presents the outcomes of a project investigating how the sales of a website selling online products are affected by a number of factors relating to its individual buyers, amongst other variables. For each online buyer the website records a number of factors, including age, salary, gender and geographical region, as well as whether the purchase was made on a weekday or during the Christmas period.

Within this report I will investigate which of these factors influence the amount of money individuals spend on the website and create a multiple linear regression (MLR) model using SPSS. This report will also look at the relationship between the number of visits to the website and the number of items sold through the website and, using a Time Series Forecast in Excel, predict the number of visits to the website during the month of December.

Methodology

Multiple Linear Regression

I have used an MLR analysis model as we are dealing with multiple independent variables. TotalSpent is the dependent variable, as we are trying to find which factors influence the amount spent on the online shop.

Modelling assumptions I will be considering:

- There is no multicollinearity between independent variables

- The values of the residuals are normally distributed

- The value of residuals is independent

- The variance of the residuals is constant

As I am using MLR, I will assume that the dependent variable can be expressed as a linear function of the independent variables:

y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ϵ

y - the dependent variable.

β0 - the constant: the expected value of y when all independent variables are equal to 0.

x1, …, xk - the different independent variables.

β1, …, βk - the coefficients of the independent variables; each shows how much the dependent variable is expected to increase when its independent variable increases by one unit.

ϵ - the random error, which is normally distributed with a mean of 0.
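As a sketch of what SPSS computes behind the scenes, ordinary least squares can be reproduced on hypothetical data (the variable names, coefficients and sample size below are illustrative, not the report's):

```python
import numpy as np

# Illustrative data standing in for the real data set (assumed, not the report's)
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(18, 65, n)          # e.g. age
x2 = rng.uniform(15000, 60000, n)    # e.g. salary
eps = rng.normal(0, 10, n)           # random error with mean 0
y = 5.0 - 0.5 * x1 + 0.002 * x2 + eps

# Ordinary least squares: prepend a column of ones so beta[0] is the constant
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates close to the true coefficients 5.0, -0.5, 0.002
```

The fitted `beta` vector is the analogue of SPSS's unstandardized B column.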

The unknown coefficients are estimated through SPSS using ordinary least squares. In SPSS they are found in the unstandardized B column of the Coefficients table.

I will conduct a hypothesis test on the model, asking if any of my independent

variables are significant:

H0: β1 = β2 = … = βk = 0

H1: at least one βi ≠ 0

If I find evidence to reject the null hypothesis H0 (p-value < 0.05), this means at least one of the independent variables is significant, but not which one.

However, if a βi is not significant (p-value > 0.05), this means that the accompanying independent variable xi does not provide a statistical improvement to the model in the presence of the other variables.

The R² value measures how well our model fits the data. When increasing the number of variables, the R² value will increase or stay the same; it is therefore important to note whether an extra variable is significant enough to stay in the model. I will also use the adjusted R² value, as it takes account of the sample size and the number of variables in the model when assessing the fit.
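The adjustment can be sketched directly from its definition; the sample size in the example below is an assumption for illustration, since the report does not state n:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 penalises extra variables: n is the sample size,
    k is the number of independent variables in the model."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With an R^2 of 0.477 and k = 4 variables, a large (assumed) sample
# of n = 2000 barely changes the value
print(round(adjusted_r_squared(0.477, 2000, 4), 3))  # 0.476
```

For a fixed R², a smaller sample or more variables pulls the adjusted value further down.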

We have variables which are categorical (they do not take numerical values): Sex, Weekend, Christmas and Region. Sex, Weekend and Christmas are binary, taking one of two values: male or female, weekend or weekday, Christmas or not Christmas. SPSS will convert these into dummy variables it can read.

The base cases are:

Sex: Female

Weekend: Weekday

Christmas: Not Christmas

Region: I have set the base case to be North.
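The dummy coding SPSS performs can be sketched on a few hypothetical rows (the column names and category labels below are assumptions for illustration):

```python
import pandas as pd

# A few hypothetical rows standing in for the real data set
df = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "Weekend": ["Weekday", "Weekend", "Weekend"],
    "Christmas": ["No", "Yes", "No"],
    "Region": ["North", "South", "North"],
})

# Binary variables: the dummy is 1 whenever the row differs from the base case
df["Sex_Male"] = (df["Sex"] == "Male").astype(int)              # base: Female
df["Weekend_Dummy"] = (df["Weekend"] == "Weekend").astype(int)  # base: Weekday
df["Christmas_Dummy"] = (df["Christmas"] == "Yes").astype(int)  # base: Not Christmas

# Region: one dummy per category, then drop the base case (North)
region = pd.get_dummies(df["Region"], prefix="Region").drop(columns="Region_North")
df = pd.concat([df, region], axis=1)
print(df.filter(like="_").to_string(index=False))
```

Each base case contributes no dummy column of its own; its effect is absorbed into the constant.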

I will be using a Backward Reduction Method (Backward Stepwise Regression, n.d.): a method that considers all possible variables and then removes the variable with the highest non-significant p-value, repeating until all remaining variables are significant. This can be done in SPSS by changing the method to Backward under Analyze, Regression, Linear.

By using this reduction method we cannot guarantee the best model, as we do not consider every possible combination of variables; in particular, once a variable has been removed, it cannot be added back. As such, we cannot test whether it would be significant on its own or with another selection of variables; we only test whether it is significant in the presence of the other variables at that time.
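The backward procedure can be sketched in Python. This is a hand-rolled illustration of the idea, not SPSS's exact implementation:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """OLS fit plus two-sided t-test p-values (constant first)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, 2 * stats.t.sf(np.abs(beta / se), df=n - k)

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the variable with the largest non-significant p-value,
    refit, and repeat until every remaining variable is significant."""
    keep = list(range(X.shape[1]))
    while keep:
        Xc = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, p = ols_pvalues(Xc, y)
        worst = int(np.argmax(p[1:]))      # ignore the constant's p-value
        if p[1:][worst] <= alpha:
            break
        keep.pop(worst)                    # removed variables never return
    return [names[i] for i in keep]
```

Note, as above, that a variable removed early is never reconsidered, so the search is not exhaustive.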

Methodology

Time Series

A time series analysis will be used as we are dealing with time-dependent variables collected on one subject over many periods of time. As it is very unlikely that time-dependent variables can be modelled linearly, and because forecasting requires us to extrapolate the data, we cannot use simple linear regression.

I will therefore use a decomposition model, which allows me to forecast for the month of December. The steps involved in modelling this data are as follows:

1) Identifying the trend: a centred moving average (CMA) with a period of 24 is used, as this equals the length of one season (the hours 01:00 to 00:00). I have chosen this over a simple moving average because a CMA removes the lag from the moving average. The CMA removes the seasonal effects and identifies the long-run trend used to make a forecast. As our k is even, we need to take an average of the averages.

2) Identifying the seasonality: this is the tendency of time series data to repeat itself after every period k. I will be using a multiplicative model, as the seasonality effects increase with an increasing trend and the variation is proportional to the trend. After graphing the average number of visits per hour, we can see that the seasonality increases over time, evident from the non-parallel lines at the peaks and troughs. We want the constant of proportionality to remain the same for every period, and so we use the adjusted seasonality.

3) Identifying the cyclical behaviour: as our data set spans only one year and cyclical behaviour is typically seen in time series longer than ten years, I will exclude it from my model.

4) Assess model error: I will find the Root Mean Square Error (RMSE), Mean Absolute Deviation (MAD) and Mean Absolute Percentage Error (MAPE) to measure how well my model fits the data.

5) Use the model to make forecasts: the assumption I am making is that future values will behave in a similar way to past values.
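The trend and seasonality steps above can be sketched in Python (the report uses Excel; the synthetic hourly series in the test is illustrative only):

```python
import numpy as np

def centred_moving_average(y, k=24):
    """CMA for an even period k: take k-term moving averages, then
    average adjacent pairs so the result lines up with the data points."""
    y = np.asarray(y, dtype=float)
    ma = np.convolve(y, np.ones(k) / k, mode="valid")
    cma = (ma[:-1] + ma[1:]) / 2           # the "average of the averages"
    out = np.full(len(y), np.nan)          # NaN where the CMA is undefined
    out[k // 2 : k // 2 + len(cma)] = cma
    return out

def adjusted_seasonality(y, k=24):
    """Multiplicative seasonal indices S = y / T, averaged per hour of the
    day and rescaled so that the k indices average to exactly 1."""
    ratio = y / centred_moving_average(y, k)
    idx = np.array([np.nanmean(ratio[h::k]) for h in range(k)])
    return idx / idx.mean()
```

For a series whose hourly pattern really is multiplicative, the recovered indices track the true seasonal pattern closely.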

Steps taken in Excel:

1. Create a period column to record how many data points there are, then plot
the data.

Codifying the time period lets me easily see where each data point is.

2. Create a centred moving average with k= 24

This is done to smooth out the seasonal effects in the data, making the trend
easier to identify and use for forecasting future values. We make k equal to
the length of one season, to effectively remove the seasonal effects.

3. Create a trend using the FORECAST.LINEAR function

This is our simple linear regression, done in Excel.

4. Create Seasonality and Adjusted Seasonality

This is the original value divided by the trend value: St = yt / Tt

The adjusted seasonality shows how much higher or lower each period is relative to the trend line. To calculate it, we take the average seasonality value of the period and divide it by the average of the averages.

6
We want the average of the adjusted seasonality to be 1, so that the trend line sits in the centre of our data points.

5. Create a fitted model

This is where we multiply our trend (T) values by the adjusted seasonality (S), creating a fitted value, T × S, for the average number of visits per hour.

6. Deal with error metrics

This is done to quantify how well the model fits the observed data.

RMSE = √( Σt (Ot − Pt)² / n ) is the average distance between predicted values and observed values.

MAD = Σt |Ot − Pt| / n is the average absolute distance between observed and fitted values.

MAPE = ( Σt |(Ot − Pt) / Ot| / n ) × 100% is our percentage error.
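As a sketch, the three error metrics can be written directly from these definitions:

```python
import math

def rmse(obs, pred):
    """Root Mean Square Error: typical distance of predictions from observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mad(obs, pred):
    """Mean Absolute Deviation between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def mape(obs, pred):
    """Mean Absolute Percentage Error."""
    return 100 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)

print(rmse([100, 200], [110, 190]))  # 10.0
print(mad([100, 200], [110, 190]))   # 10.0
```

All three compare the same observed/predicted pairs; only the way errors are averaged differs.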

To make a forecast, we use the equation:

yt ≈ Trendt × Seasonalityt × Randomnesst

yt ≈ Tt × St × ϵt
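The forecasting step can be sketched as a small function; the slope, intercept and seasonal indices in the example are illustrative stand-ins for the Excel model's values:

```python
def forecast(t, slope, intercept, seasonal):
    """y_t ~ T_t * S_t: linear trend at period t times the adjusted
    seasonal index for the corresponding hour (randomness omitted)."""
    return (slope * t + intercept) * seasonal[t % len(seasonal)]

# Illustrative: flat seasonality except a 20% bump in hour 1
seasonal = [1.0] * 24
seasonal[1] = 1.2
print(forecast(25, 2.0, 10.0, seasonal))  # trend 60.0 scaled by 1.2 -> 72.0
```

The period index is mapped back to an hour of the day with `t % 24`, mirroring how the adjusted seasonality is looked up per hour.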

Results & Discussion

Multiple Linear Regression

To test for multicollinearity, the variance inflation factor (VIF) should be below 10, and ideally below 5. This can be seen below, where the VIF values are within this range and the tolerance levels are above 0.2. No assumption has been violated.
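The two quantities are directly related, as a small sketch shows:

```python
def vif(r2_j):
    """VIF for variable j, where r2_j is the R^2 from regressing
    x_j on the other independent variables."""
    return 1.0 / (1.0 - r2_j)

def tolerance(r2_j):
    """Tolerance is simply the reciprocal view: 1 - R^2."""
    return 1.0 - r2_j

# The usual cut-offs line up: tolerance 0.2 corresponds to VIF 5,
# and tolerance 0.1 corresponds to VIF 10
print(vif(0.8), tolerance(0.8))
```

So checking tolerance above 0.2 and VIF below 5 is the same check seen from both sides.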

The residuals should be normally distributed. To test this, we examine the normal predicted probability (P-P) plot: the closer the data points are to the diagonal line, the more normally distributed the residuals are. As we can see, the data points are very close to the line, so no assumption has been violated.

To see if the errors are independent of each other and of other variables, we look at the Durbin-Watson statistic. For this to hold, the Durbin-Watson value should be close to 2; it can be found in the Model Summary box. We can see here that the value is 2.072, so no assumption has been violated.
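The statistic itself is simple to compute from the residuals; a minimal sketch on toy numbers:

```python
def durbin_watson(res):
    """Durbin-Watson statistic: near 2 suggests independent residuals,
    near 0 positive autocorrelation, near 4 negative autocorrelation."""
    num = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    return num / sum(e ** 2 for e in res)

print(durbin_watson([1, -1, -1, 1]))  # 2.0 -- no consistent lag-1 pattern
```

A strictly alternating series like [1, -1, 1, -1] instead pushes the statistic towards 4, signalling negative autocorrelation.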

The quality of fit for the model can be inferred from the Model Summary box. The R² value of the third model is 0.477, meaning the model explains 47.7% of the variation in the data; as this is lower than 50%, it may not produce an accurate predictive model. Our adjusted R² remained at 0.476, which indicates that the variables removed did not add much to the model.

Homoscedasticity is where the residuals have constant variance, showing no trend. To check this, we plot a scatterplot with ZPRED on the x-axis and ZRESID on the y-axis. What can be seen here is a random spread of data points, meaning no assumption has been violated.

As we have used a Backward Reduction Model, shown above are the three different models produced by this method of reduction, which removes what is not significant. In the first model, Region_Dummy is removed, as its significance level of 0.632 is higher than the 0.05 p-value threshold. For our second model, Sex is removed, as its significance level of 0.246 is also higher than 0.05. It is important to note that our (Constant) also shows as non-significant, with a significance level of 0.459; however, we will not remove it, as it is fundamental to the modelling and prediction of data.

The third model, with all coefficients significant, gives us values that can be expressed as a linear function:

y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ϵ

with (Constant) being β0, Age β1, Salary β2, Weekend β3 and Christmas β4. This will therefore look like:

y = −13.720 − 15.728x1 + 0.035x2 + 27.856x3 + 50.161x4 + ϵ

where x1 is the age of the person, x2 is the salary, x3 is whether it is the weekend or not and x4 is whether it is Christmas or not.

We can now use this equation to predict how much a 35-year-old female who lives in the South and earns a salary of £27,000 would spend on a weekend day during Christmas. As Region is not included in our model, we ignore it.

y = (−13.720 + 27.856 + 50.161) − 15.728(35) + 0.035(27,000) + ϵ

The predicted value for TotalSpent is therefore 458.817 + ϵ.
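A small helper reproduces this point prediction; x3 and x4 stand for the two dummy variables in the final equation, both set to 1 in this worked example:

```python
# Coefficients of the final (third) model reported above
B = {"const": -13.720, "age": -15.728, "salary": 0.035, "x3": 27.856, "x4": 50.161}

def predict_spend(age, salary, x3, x4):
    """Point prediction of TotalSpent; the error term is omitted,
    and the dummy arguments x3 and x4 take the value 0 or 1."""
    return (B["const"] + B["age"] * age + B["salary"] * salary
            + B["x3"] * x3 + B["x4"] * x4)

# The worked example: age 35, salary 27,000, both dummies equal to 1
print(round(predict_spend(35, 27000, 1, 1), 3))  # 458.817
```

Setting either dummy to 0 shows how much of the prediction each categorical effect contributes.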

Results & Discussion

Time Series

The graph above shows the average number of visits per hour in blue. The fitted model, in red dashes, is almost indistinguishable from the original series, which means the error is very small. This is reflected in our R² value of 0.9992: our model fits the data very well, as the fitted values are close to the observed values.

The orange CMA-24 line is the centred moving average with k equal to 24. The yellow December line shows the forecast for December, which is shown below.

Time period and predicted fitted value:

From the equation yt ≈ Tt × St × ϵt, we ignore the randomness term when forecasting, but keep it in the model to acknowledge that some degree of randomness remains.

Our December period ranges from 265 to 288.

For example: y265 ≈ T265 × S265

Our T will be β0 + β1t, which is the trend equation shown on the graph:

y = 7.4665x + 2308.3

Our S value is the adjusted average for the corresponding time of day; for example, t = 265 corresponds to 01:00, and the adjusted value for 01:00 is 1.07168463. Therefore, our final calculation is

y265 = (7.4665(265) + 2308.3) × 1.07168463 = 4594.22, which is close to the Excel value circled above.
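The same calculation in code, using the trend equation and adjusted seasonal indices quoted in this section (small rounding differences from the Excel figures are expected):

```python
slope, intercept = 7.4665, 2308.3   # trend equation from the graph
s_0100 = 1.07168463                 # adjusted seasonality for 01:00
s_2000 = 1.43230122                 # adjusted seasonality for 20:00

def trend(t):
    """Trend value T_t from the fitted line y = 7.4665x + 2308.3."""
    return slope * t + intercept

y_265 = trend(265) * s_0100   # first December period, at 01:00
y_284 = trend(284) * s_2000   # the busiest hour, 20:00
print(round(y_265, 2), round(y_284, 2))
```

Any other December period from 265 to 288 is forecast the same way, pairing the trend value with the adjusted index for its hour.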
Our RMSE value is 65.24. As our data ranges from about 2,000 to over 5,000 visits, an RMSE of 65.24 tells us our model can predict our December values accurately.

Our MAD value is 50.7. This is the average absolute distance between the fitted and observed values. As with the RMSE, this value is low compared to our data, which can be interpreted as the model having a good level of fit.

Our MAPE is 1.5328%. This percentage error is low, meaning our model closely fits the data.

The busiest hour of the day is 20:00; this can be seen in the graph, where every peak (the highest value) falls on the 20th, 44th, 68th, 92nd, 116th, 140th, 164th, 188th, 212th, 236th and 260th values, all of which correspond to the time 20:00.

The expected value for 20:00 in December is:

y284 = (7.4665(284) + 2308.3) × 1.43230122 = 6,343.35, or 6,343.40 from our Excel prediction circled above.

Conclusion

Multiple Linear Regression

In our model, Age, Salary, Christmas, and Weekend are the significant factors which

influence the amount of money individuals spend on the online store.

Our final model had a significance of less than 0.05, meaning that we reject the null hypothesis; the variables remaining in the model are significant. The factors which are not significant are Region and Sex.

It would be beneficial to use a stepwise selection model, as variables can be added or removed at each step. A model that includes interactions may also help the company see and understand relationships between the independent variables.

Time series

We can see from this that at 8pm the website receives the greatest number of visits, and we have predicted values for the month of December. As the trend is increasing, and the predicted December values are higher than those of previous months, it would be reasonable to assume that the volume of stock will need to be higher than in previous months.

Our model is well made, as it closely mirrors the original values. Adding the number of items sold through the months may create a better prediction model.

References

https://www.statisticssolutions.com/testing-assumptions-of-linear-regression-in-spss/

https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

https://www.open.ac.uk/socialsciences/spsstutorial/files/tutorials/assumptions.pdf

https://www.analystsoft.com/en/products/statplus/content/help/pdf/analysis_regression_backward_stepwise_elimination_regression_model.pd

https://statisticsbyjim.com/regression/interpret-r-squared-regression/

https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php

https://corporatefinanceinstitute.com/resources/knowledge/other/r-squared/

https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/adjusted-r2/

https://www.statisticshowto.com/wp-content/uploads/2016/02/SPSS_lab2Regression.pdf

https://stattrek.com/multiple-regression/dummy-variables.aspx

https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php

https://corporatefinanceinstitute.com/resources/knowledge/other/adjusted-r-squared/

https://en.wikibooks.org/wiki/Using_SPSS_and_PASW/Confidence_Intervals

