This report examines the outcomes of a project investigating how the sales of a
website selling online products are affected by a number of factors relating to its
individual buyers, amongst other variables. The website is able to record several
factors about each online buyer, including age, gender, salary, geographical
region, and whether the purchase was made on a weekday or during the Christmas
period. Within this report I will investigate which of these factors influence the
amount of money individuals spend on the website and create a multiple linear
regression (MLR) model using SPSS. This report will also look at the relationship
between the number of visits to the website and the number of items sold through
the website and, using Excel, apply a time series forecast to predict the number of
visits to the website for the month of December.
Methodology
I have used an MLR analysis model as we are dealing with multiple independent
variables. MLR relies on several assumptions, including:
- The values of the residuals are independent
As I am using MLR, I will assume that the dependent variable can be expressed as:

y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
y - the dependent variable
β0 - the intercept; the expected value of y when all independent variables are zero
β1, …, βk - the coefficients of the independent variables; each shows how much the
dependent variable changes when the corresponding xi increases by one unit
ε - the random error term
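SPSS estimates these coefficients internally, but the calculation can be illustrated with a short sketch. The function name and the toy, noise-free data below are my own and are not from the report; the point is only to show ordinary least squares recovering the coefficients via the normal equations (X'X)b = X'y.

```python
def ols_fit(X, y):
    """Ordinary least squares via the normal equations (X'X)b = X'y.
    X: rows of predictor values (without intercept); y: responses.
    Returns [b0, b1, ..., bk] with b0 the intercept."""
    rows = [[1.0] + list(r) for r in X]          # prepend an intercept column
    k = len(rows[0])
    # Build X'X and X'y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    # Solve the k x k system by Gaussian elimination with partial pivoting
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[piv] = xtx[piv], xtx[col]
        xty[col], xty[piv] = xty[piv], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    beta = [0.0] * k
    for r in range(k - 1, -1, -1):
        beta[r] = (xty[r] - sum(xtx[r][c] * beta[c]
                                for c in range(r + 1, k))) / xtx[r][r]
    return beta

# Noise-free illustration: the true relationship is y = 10 + 2*x1 - 0.5*x2
X = [(1, 2), (2, 1), (3, 5), (4, 3), (5, 8), (6, 2)]
y = [10 + 2 * x1 - 0.5 * x2 for x1, x2 in X]
beta = ols_fit(X, y)
print([round(b, 4) for b in beta])   # -> [10.0, 2.0, -0.5]
```

With real, noisy data the recovered coefficients would only approximate the true values; this is what SPSS reports in the unstandardized B column.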
The unknown coefficients are estimated in SPSS using ordinary least squares. In
SPSS the estimates are found in the unstandardized B column of the Coefficients table.
H0: β1 = β2 = … = βk = 0
H1: at least one βi ≠ 0
If I find evidence to reject the null hypothesis H0 (p-value < 0.05), this means
at least one of the independent variables is significant, but not which one.
However, if a βi is not significant (p-value > 0.05), this means that the accompanying
independent variable xi does not provide a statistical improvement to the model in the
presence of the other variables.
The R² value measures how well our model fits the data. When increasing the
number of variables, the R² value will increase or stay the same. I will therefore
also use the adjusted R² value, as it accounts for the sample size and the number
of predictors in the model.
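The adjustment is a simple formula. In the sketch below, the R² of 0.477 is taken from the report's final model, but n (observations) and k (predictors) are hypothetical values chosen purely for illustration — the real ones come from the SPSS output.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2: penalises R^2 for the number of predictors k,
    given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# R^2 = 0.477 is the report's final-model value; n = 200, k = 4 are
# hypothetical stand-ins for the real sample size and predictor count.
print(round(adjusted_r2(0.477, 200, 4), 4))   # -> 0.4663
```

Because the penalty grows with k, adding a useless predictor can lower adjusted R² even though plain R² never decreases.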
We have variables which are categorical (they do not take numerical values):
Sex, Weekend, Christmas, and Region. Sex, Weekend, and Christmas are binary,
taking one of two values: male or female, weekend or weekday, Christmas or not
Christmas. SPSS will convert these into dummy variables, using the following
reference categories:
Sex: Female
Weekend: Weekday
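Dummy coding can be sketched in a few lines. The helper name below is my own; the behaviour mirrors what SPSS does, with the baseline level represented by all-zero dummies.

```python
def dummy_code(values, baseline):
    """One dummy column per non-baseline level; the baseline level is
    represented by all-zero dummies (as SPSS does)."""
    levels = [v for v in sorted(set(values)) if v != baseline]
    return levels, [[1 if v == lvl else 0 for lvl in levels] for v in values]

levels, coded = dummy_code(["Male", "Female", "Female", "Male"], baseline="Female")
print(levels)   # -> ['Male']
print(coded)    # -> [[1], [0], [0], [1]]
```

For a binary variable this yields a single 0/1 column; a variable with m levels would yield m − 1 columns.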
I will be using a Backward Reduction Method[i] (Backward Stepwise Regression, n.d.[ii]),
a method that considers all possible variables and then removes the variable with the
highest non-significant p-value. This is repeated until all remaining variables are
significant. This can be done in SPSS by setting the regression method to Backward.
By using this reduction method, we cannot guarantee the result is the best model:
once a variable has been removed, it cannot be added back, and as such we cannot
test whether it would be significant on its own or with another selection of variables;
we only test whether that variable is significant in the presence of all the other
variables.
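The procedure can be sketched as follows. This is a simplified stand-in: the p-values here are a static mock dictionary, whereas a real run (as SPSS performs) refits the model and recomputes every p-value after each removal. The mock values echo the significance levels quoted later in the report (0.632 for Region, 0.246 for Sex) purely for illustration; the remaining values are invented.

```python
ALPHA = 0.05

def backward_eliminate(variables, fit_pvalues):
    """fit_pvalues(vars) -> {var: p-value} for a model on those vars.
    Repeatedly drop the variable with the highest non-significant
    p-value until every remaining variable is significant."""
    vars_ = list(variables)
    while vars_:
        pvals = fit_pvalues(vars_)
        worst = max(vars_, key=lambda v: pvals[v])
        if pvals[worst] <= ALPHA:
            break                      # everything left is significant
        vars_.remove(worst)
    return vars_

# Mock p-values (static stand-in for refitted SPSS output)
mock = {"Region": 0.632, "Sex": 0.246, "Age": 0.001,
        "Salary": 0.0, "Christmas": 0.01, "Weekend": 0.02}
print(backward_eliminate(mock, lambda vs: {v: mock[v] for v in vs}))
# -> ['Age', 'Salary', 'Christmas', 'Weekend']
```

Note how Region (0.632) is dropped first, then Sex (0.246), matching the order of removals described later in the report.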
Methodology
Time Series
A time series analysis will be used as we are dealing with time-dependent variables
collected on one subject over many periods of time. As it is very unlikely that
time-dependent data can be modelled linearly, and as forecasting requires
extrapolating the data, we cannot use simple linear regression.
I will use the model to produce a forecast for the month of December. The steps
involved in modelling this data are as follows:
1) Identifying the trend: a centred moving average (CMA) with a period of 24 is
used, as this is equal to the length of one season (the hours 01:00 to 00:00). I
have chosen this over a simple moving average as a CMA removes the lag from the
moving average. The CMA will remove the seasonal effects and identify the
long-run trend used to make a forecast. As our k is even, we will average two
consecutive moving averages to centre the result.
2) Identifying the seasonality: we assume the seasonal pattern will repeat itself
after every period, k. I will be using a multiplicative model, as the seasonal
variation is proportional to the trend. After graphing the average number of
visits per hour, we can see that the seasonality increases over time, which is
evident through the non-parallel lines at the peaks and troughs. We want the
constant of proportionality to remain the same for every period.
3) Identifying the cyclical behaviour: as our data set only spans one year, and
cyclical behaviour can only be seen in time series longer than ten years, I will
not model it.
4) Assess model error: I will be finding the Root Mean Square Error (RMSE), Mean
Absolute Deviation (MAD), and Mean Absolute Percentage Error (MAPE).
5) Use the model to make forecasts: the assumption I am making is that the future
will behave in the same way as the past.
1. Create a period column to record how many data points there are, then plot
the data.
By codifying the time period, I can easily view where each data point is.
The centred moving average is then calculated. This is done to smooth out the
seasonal effects in the data, making the trend easier to identify and use for
forecasting future values. We make k equal to the length of one season to
effectively remove the seasonal effects.
The seasonality is the original value divided by the fitted trend value: St = yt / Tt
The adjusted seasonality shows how much higher or lower each period is relative to
the trend line. To find it, we take the average value for each period and divide
it by the average of the averages.
We want the average of the adjusted seasonality to be 1, so that the trend line
sits in the centre of our data points.
Finally, we multiply our trend (T) data points by the adjusted seasonality (S) to
create a formula for the average number of visits per hour: the fitted value is T × S.
yt ≈ Tt × St × εt
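The decomposition steps above can be sketched in Python. This is a toy illustration: a synthetic series with season length 4 stands in for the report's hourly data with k = 24, and the function names are my own. The seasonal factors (0.8, 1.1, 1.2, 0.9) are invented so that the code can recover them.

```python
def centred_ma(y, k):
    """Centred moving average for an even period k: the mean of two
    consecutive k-point moving averages, so each estimate lands on a
    data point rather than between two."""
    half = k // 2
    cma = [None] * len(y)
    for t in range(half, len(y) - half):
        first = sum(y[t - half: t + half]) / k
        second = sum(y[t - half + 1: t + half + 1]) / k
        cma[t] = (first + second) / 2
    return cma

def seasonal_indices(y, cma, k):
    """Average the ratios y_t / T_t per position in the season, then
    rescale so the k indices average to exactly 1."""
    buckets = [[] for _ in range(k)]
    for t, trend in enumerate(cma):
        if trend is not None:
            buckets[t % k].append(y[t] / trend)
    raw = [sum(b) / len(b) for b in buckets]
    scale = sum(raw) / k
    return [s / scale for s in raw]

# Toy multiplicative series: trend 100 + t, seasonal factors below.
factors = [0.8, 1.1, 1.2, 0.9]
y = [(100 + t) * factors[t % 4] for t in range(16)]
S = seasonal_indices(y, centred_ma(y, 4), 4)
print([round(s, 2) for s in S])   # -> [0.8, 1.1, 1.2, 0.9]
```

Dividing out the CMA strips the trend, leaving ratios that cluster around the true seasonal factors; rescaling forces them to average to 1, exactly as described above.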
To test for multicollinearity, the variance inflation factor (VIF) should be below
10, and ideally below 5. This can be seen in the output below, where the VIF
values are within this range.
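The VIF itself is a one-line formula. The r2_j value below is hypothetical; the report's actual VIFs come from the SPSS Coefficients output.

```python
def vif(r2_j):
    """Variance inflation factor for predictor j, where r2_j is the R^2
    from regressing x_j on all the other predictors."""
    return 1 / (1 - r2_j)

# Hypothetical value: R^2_j = 0.8 gives VIF = 5, the "ideally below" level.
print(round(vif(0.8), 2))   # -> 5.0
```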
7
The residuals should be normally distributed. To test this, we examine the normal
predicted probability plot: the closer the data points are to the diagonal line,
the more normally distributed the residuals are.
As we can see, the data points are very close to the line, so this assumption has
not been violated.
To see if all errors are independent of each other and of other variables, we look
at the Durbin-Watson statistic.
We can see here that the value is 2.072, which is close to 2, so this assumption
has not been violated.
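The statistic is straightforward to compute from the residuals. The residual values below are hypothetical (the report's 2.072 comes from SPSS); the sketch just shows the calculation and that strongly alternating residuals push the value above 2.

```python
def durbin_watson(residuals):
    """DW = (sum of squared successive differences) / (sum of squared
    residuals). Values near 2 indicate no first-order autocorrelation;
    values near 0 or 4 indicate positive or negative autocorrelation."""
    num = sum((residuals[t] - residuals[t - 1]) ** 2
              for t in range(1, len(residuals)))
    den = sum(e * e for e in residuals)
    return num / den

# Hypothetical residuals with a strong alternating (negative) pattern.
dw = durbin_watson([0.5, -0.3, 0.8, -0.6, 0.2, 0.1])
print(round(dw, 3))   # -> 3.209
```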
The quality of fit for the model can be inferred from the Model Summary box. The
R² value of the 3rd model is 0.477, meaning that 47.7% of the variation in the
data is explained by the regression model; as this is lower than 50%, the model
may not fit the data especially well. The small change in R² between models
indicates that the variables removed did not add much to the model.
To test for homoscedasticity, we plot a scatterplot with ZPRED on the x-axis and
ZRESID on the y-axis. What can be seen here is that there is a random spread of
data points, meaning this assumption has not been violated.
As we have used a Backward Reduction Method, the three models produced by this
reduction are shown above; each step removes what is not significant. In the first
model, Region_Dummy is removed, as its significance level of 0.632 is higher than
the 0.05 threshold. For our second model, Sex is removed, as its significance
level of 0.246 is also higher than the 0.05 threshold. It is important to note
that our (Constant) also shows as non-significant, with a significance level of 0.459;
however, we will not remove it, as it is fundamental to the modelling and
prediction of data.
The third model, in which all coefficients are significant, gives us the values
that can be substituted into the equation:
y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ε
We can now use this equation to predict how much a 35-year-old female who lives in
the South and earns a salary of £27,000 would spend on a weekday during Christmas:
y = (−13.720 + 27.856 + 50.161) − 15.728(35) + 0.035(27,000) + ε
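Evaluating the equation is simple arithmetic. The coefficients below are read directly off the report's equation (the first bracket bundles the intercept with the two dummy coefficients that apply to this scenario; their mapping to specific dummies is not restated here):

```python
# Coefficients taken from the report's fitted equation.
constant_terms = -13.720 + 27.856 + 50.161   # intercept + applicable dummies
age_coef, salary_coef = -15.728, 0.035

spend = constant_terms + age_coef * 35 + salary_coef * 27_000
print(round(spend, 2))   # -> 458.82
```

So the model predicts a spend of roughly £458.82 for this buyer, before the random error term ε.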
Time Series
The graph above shows the Average Number of Visits per Hour in blue.
The fitted model, in red dashes, is almost indistinguishable from the original
series. This means that the error is very small, which is reflected in our R²
value of 0.9992: our model fits the data very well, as the fitted values are close
to the original values.
The orange CMA-24 line is the centred moving average with k equal to 24.
The yellow December line shows the forecast values for December, which are shown below.
Time Period Predicted Fitted value:
From the equation yt ≈ Tt × St × εt, we ignore the randomness term when
forecasting, but keep it in the model.
y = 7.4665x + 2308.3
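A December forecast combines this trend line with the appropriate seasonal index. The trend coefficients are from the report's Excel trend line; the seasonal index S_0100 below is a hypothetical stand-in, since the real adjusted-seasonality value for 01:00 lives in the spreadsheet.

```python
def trend(t):
    # Excel trend line fitted to the CMA-24 values (from the report)
    return 7.4665 * t + 2308.3

# Hypothetical adjusted seasonality for 01:00; the real index comes from
# the spreadsheet's adjusted-seasonality column.
S_0100 = 0.95
t = 265   # the first December period, corresponding to 01:00

forecast = trend(t) * S_0100
print(round(forecast, 1))   # -> 4072.6
```

The same pattern (trend at period t, multiplied by the index for that hour) produces every value in the December forecast column.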
Our S value is the adjusted seasonality for the time of day being forecast; for
example, when t is 265 the period corresponds to 01:00, so we use the adjusted
value for 01:00, circled above.
Our RMSE value is 65.24. As our data ranges from about 2,000 to over 5,000, an
RMSE of 65.24 tells us the model can predict our December values accurately.
Our MAD value is 50.7. This tells us how far, on average, the fitted values are
from the actual values. As with our RMSE, this value is low compared to our data,
which indicates a good fit.
Our MAPE is 1.5328%. The error of the model is low, meaning our model closely fits
the data.
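The three error measures can be sketched as follows. The visit counts below are hypothetical; the report's actual values (RMSE 65.24, MAD 50.7, MAPE 1.5328%) come from the full spreadsheet.

```python
from math import sqrt

def rmse(actual, fitted):
    """Root mean square error: penalises large errors more heavily."""
    return sqrt(sum((a - f) ** 2 for a, f in zip(actual, fitted)) / len(actual))

def mad(actual, fitted):
    """Mean absolute deviation of fitted values from actual values."""
    return sum(abs(a - f) for a, f in zip(actual, fitted)) / len(actual)

def mape(actual, fitted):
    """Mean absolute percentage error, as a percentage."""
    return 100 * sum(abs(a - f) / a for a, f in zip(actual, fitted)) / len(actual)

# Hypothetical hourly visit counts and fitted values, for illustration only.
actual = [3200, 4100, 2600, 4800]
fitted = [3150, 4180, 2550, 4760]
print(round(rmse(actual, fitted), 2),
      round(mad(actual, fitted), 2),
      round(mape(actual, fitted), 4))   # -> 57.01 55.0 1.5675
```

RMSE exceeds MAD whenever the errors vary in size, since squaring weights the larger misses more heavily.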
The hour of the day which is busiest is 20:00. This can be seen in the graph, as
at every peak the highest value falls on the 20th, 44th, 68th, 92nd, 116th, 140th,
164th, 188th, 212th, 236th, and 260th values, which all correspond to the time 20:00.
Conclusion
In our model, Age, Salary, Christmas, and Weekend are the significant factors
which influence the amount of money individuals spend on the website.
Our final model had significance of less than 0.05, meaning that we reject the
null hypothesis, with the variables still included in the model all being significant.
A limitation of the backward reduction method is that variables cannot be added
back or removed at will. A model that includes interactions may also be beneficial
for the accuracy of the predictions.
Time Series
We can see from this that at 8pm the website gains the greatest number of visits,
and we have predicted values for the month of December. As the trend is
increasing, and the predicted December values are higher than those of previous
months, it would be reasonable to assume that the volume of stock will have to be
increased. Our model is well made, as it closely mirrors the original values. The
addition of the number of items sold through the months may create a better
prediction model.[iii]
References
https://www.statisticssolutions.com/testing-assumptions-of-linear-regression-in-spss/
https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/
https://www.open.ac.uk/socialsciences/spsstutorial/files/tutorials/assumptions.pdf
https://www.analystsoft.com/en/products/statplus/content/help/pdf/analysis_regression_backward_stepwise_elimination_regression_model.pd
https://statisticsbyjim.com/regression/interpret-r-squared-regression/
https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php
https://corporatefinanceinstitute.com/resources/knowledge/other/r-squared/
https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/adjusted-r2/
https://www.statisticshowto.com/wp-content/uploads/2016/02/SPSS_lab2Regression.pdf
https://stattrek.com/multiple-regression/dummy-variables.aspx
https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php
https://corporatefinanceinstitute.com/resources/knowledge/other/adjusted-r-squared/
https://en.wikibooks.org/wiki/Using_SPSS_and_PASW/Confidence_Intervals
i. Website insert here
ii.
iii. https://www.statisticssolutions.com/testing-assumptions-of-linear-regression-in-spss/