
Introduction

This report presents the outcomes of a project investigating how the sales of a website selling online products are affected by a number of factors relating to its individual buyers, amongst other variables. For each online buyer the website records a number of factors, including age, salary, gender and geographical region, as well as whether the purchase was made on a weekday or during the Christmas period.

Within this report I will investigate which of these factors influence the amount of money individuals spend on the website and create a multiple linear regression (MLR) model using SPSS. This report will also look at the relationship between the number of visits to the website and the number of items sold through the website and, using a Time Series Forecast in Excel, predict the number of visits to the website during the month of December.

Methodology

Multiple Linear Regression

I have used an MLR analysis model as we are dealing with multiple independent variables. TotalSpent is the dependent variable, as we are trying to find which factors influence the amount spent on the online shop.

Modelling assumptions I will be considering:

- There is no multicollinearity between independent variables

- The values of the residuals are normally distributed

- The value of residuals is independent

- The variance of the residuals is constant

As I am using MLR, I will assume that the dependent variable can be expressed as a linear function of the independent variables:

y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ϵ

y - the dependent variable.

β0 - the constant: the expected value of y when all independent variables are equal to 0.

x1, …, xk - the different independent variables.

β1, …, βk - the coefficients of the independent variables; each shows how much the dependent variable is expected to increase when its independent variable increases by one unit.

ϵ - the random error, which is normally distributed with a mean of 0.
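As a sketch of what SPSS computes behind the scenes, ordinary least squares can be reproduced on hypothetical data (the variable names, coefficients and sample size below are illustrative, not the report's):

```python
import numpy as np

# Illustrative data standing in for the real data set (assumed, not the report's)
rng = np.random.default_rng(0)
n = 200
x1 = rng.uniform(18, 65, n)          # e.g. age
x2 = rng.uniform(15000, 60000, n)    # e.g. salary
eps = rng.normal(0, 10, n)           # random error with mean 0
y = 5.0 - 0.5 * x1 + 0.002 * x2 + eps

# Ordinary least squares: prepend a column of ones so beta[0] is the constant
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # estimates close to the true coefficients 5.0, -0.5, 0.002
```

The fitted `beta` vector is the analogue of SPSS's unstandardized B column.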

The unknown coefficients are estimated through SPSS using ordinary least squares. In SPSS they are found in the unstandardized B column of the Coefficients table.

I will conduct a hypothesis test on the model, asking if any of my independent

variables are significant:

H0: β1 = β2 = … = βk = 0

H1: at least one βi ≠ 0

If I find evidence to reject the null hypothesis H0 (p-value < 0.05), this means at least one of the independent variables is significant, but not which one.

However, if a βi is not significant (p-value > 0.05), this means that the accompanying independent variable xi does not provide a statistical improvement to the model in the presence of the other variables.

The R² value measures how well our model fits the data. When increasing the number of variables, the R² value will increase or stay the same; it is therefore important to note whether an extra variable is significant enough to stay in the model. I will also use the adjusted R² value, as it takes account of the sample size and the number of variables in the model when assessing the fit.
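The adjustment can be sketched directly from its definition; the sample size in the example below is an assumption for illustration, since the report does not state n:

```python
def adjusted_r_squared(r2, n, k):
    """Adjusted R^2 penalises extra variables: n is the sample size,
    k is the number of independent variables in the model."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# With an R^2 of 0.477 and k = 4 variables, a large (assumed) sample
# of n = 2000 barely changes the value
print(round(adjusted_r_squared(0.477, 2000, 4), 3))  # 0.476
```

For a fixed R², a smaller sample or more variables pulls the adjusted value further down.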

We have variables which are categorical (they do not take numerical values): Sex, Weekend, Christmas and Region. Sex, Weekend and Christmas are binary, taking one of two values: male or female, weekend or weekday, Christmas or not Christmas. SPSS will convert these into dummy variables it can read.

The base cases are:

Sex: Female

Weekend: Weekday

Christmas: Not Christmas

Region: I have set the base case to be North.
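The dummy coding SPSS performs can be sketched on a few hypothetical rows (the column names and category labels below are assumptions for illustration):

```python
import pandas as pd

# A few hypothetical rows standing in for the real data set
df = pd.DataFrame({
    "Sex": ["Female", "Male", "Female"],
    "Weekend": ["Weekday", "Weekend", "Weekend"],
    "Christmas": ["No", "Yes", "No"],
    "Region": ["North", "South", "North"],
})

# Binary variables: the dummy is 1 whenever the row differs from the base case
df["Sex_Male"] = (df["Sex"] == "Male").astype(int)              # base: Female
df["Weekend_Dummy"] = (df["Weekend"] == "Weekend").astype(int)  # base: Weekday
df["Christmas_Dummy"] = (df["Christmas"] == "Yes").astype(int)  # base: Not Christmas

# Region: one dummy per category, then drop the base case (North)
region = pd.get_dummies(df["Region"], prefix="Region").drop(columns="Region_North")
df = pd.concat([df, region], axis=1)
print(df.filter(like="_").to_string(index=False))
```

Each base case contributes no dummy column of its own; its effect is absorbed into the constant.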

I will be using a Backward Reduction Method (Backward Stepwise Regression, n.d.): a method that considers all possible variables and then removes the variable with the highest non-significant p-value, repeating until all remaining variables are significant. This can be done in SPSS by changing the method to Backward under Analyze, Regression, Linear.

By using this reduction method we cannot guarantee the best model, as we do not consider every possible combination of variables; in particular, once a variable has been removed, it cannot be added back. As such, we cannot test whether it would be significant on its own or with another selection of variables; we only test whether it is significant in the presence of the other variables at that time.
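The backward procedure can be sketched in Python. This is a hand-rolled illustration of the idea, not SPSS's exact implementation:

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """OLS fit plus two-sided t-test p-values (constant first)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - k)
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, 2 * stats.t.sf(np.abs(beta / se), df=n - k)

def backward_eliminate(X, y, names, alpha=0.05):
    """Drop the variable with the largest non-significant p-value,
    refit, and repeat until every remaining variable is significant."""
    keep = list(range(X.shape[1]))
    while keep:
        Xc = np.column_stack([np.ones(len(y)), X[:, keep]])
        beta, p = ols_pvalues(Xc, y)
        worst = int(np.argmax(p[1:]))      # ignore the constant's p-value
        if p[1:][worst] <= alpha:
            break
        keep.pop(worst)                    # removed variables never return
    return [names[i] for i in keep]
```

Note, as above, that a variable removed early is never reconsidered, so the search is not exhaustive.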

Methodology

Time Series

A time series analysis will be used as we are dealing with time-dependent variables collected on one subject over many periods of time. As it is very unlikely that time-dependent variables can be modelled linearly, and because forecasting requires us to extrapolate the data, we cannot use simple linear regression.

I will therefore use a decomposition model, which allows me to forecast for the month of December. The steps involved in modelling this data are as follows:

1) Identifying the trend: a centred moving average (CMA) with a period of 24 is used, as this equals the length of one season (the hours 01:00 to 00:00). I have chosen this over a simple moving average because a CMA removes the lag from the moving average. The CMA removes the seasonal effects and identifies the long-run trend used to make a forecast. As our k is even, we need to take an average of the averages.

2) Identifying the seasonality: this is the tendency of time series data to repeat itself after every period k. I will be using a multiplicative model, as the seasonality effects increase with an increasing trend and the variation is proportional to the trend. After graphing the average number of visits per hour, we can see that the seasonality increases over time, evident from the non-parallel lines at the peaks and troughs. We want the constant of proportionality to remain the same for every period, and so we use the adjusted seasonality.

3) Identifying the cyclical behaviour: as our data set spans only one year and cyclical behaviour is typically seen in time series longer than ten years, I will exclude it from my model.

4) Assess model error: I will find the Root Mean Square Error (RMSE), Mean Absolute Deviation (MAD) and Mean Absolute Percentage Error (MAPE) to measure how well my model fits the data.

5) Use the model to make forecasts: the assumption I am making is that future values will behave in a similar way to past values.
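The trend and seasonality steps above can be sketched in Python (the report uses Excel; the synthetic hourly series in the test is illustrative only):

```python
import numpy as np

def centred_moving_average(y, k=24):
    """CMA for an even period k: take k-term moving averages, then
    average adjacent pairs so the result lines up with the data points."""
    y = np.asarray(y, dtype=float)
    ma = np.convolve(y, np.ones(k) / k, mode="valid")
    cma = (ma[:-1] + ma[1:]) / 2           # the "average of the averages"
    out = np.full(len(y), np.nan)          # NaN where the CMA is undefined
    out[k // 2 : k // 2 + len(cma)] = cma
    return out

def adjusted_seasonality(y, k=24):
    """Multiplicative seasonal indices S = y / T, averaged per hour of the
    day and rescaled so that the k indices average to exactly 1."""
    ratio = y / centred_moving_average(y, k)
    idx = np.array([np.nanmean(ratio[h::k]) for h in range(k)])
    return idx / idx.mean()
```

For a series whose hourly pattern really is multiplicative, the recovered indices track the true seasonal pattern closely.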

Steps taken in Excel:

1. Create a period column to record how many data points there are, then plot
the data.

Codifying the time period lets me easily see where each data point is.

2. Create a centred moving average with k= 24

This is done to smooth out the seasonal effects in the data, making the trend
easier to identify and use for forecasting future values. We make k equal to
the length of one season, to effectively remove the seasonal effects.

3. Create a trend using the FORECAST.LINEAR function

This is our simple linear regression, done in Excel.

4. Create Seasonality and Adjusted Seasonality

This is the original value divided by the trend value: St = yt / Tt

The adjusted seasonality shows how much higher or lower each period is relative to the trend line. To calculate it, we take the average seasonality value of the period and divide it by the average of the averages.

6
We want the average of the adjusted seasonality to be 1, so that the trend line sits in the centre of our data points.

5. Create a fitted model

This is where we multiply our trend (T) values by the adjusted seasonality (S), creating a fitted value, T × S, for the average number of visits per hour.

6. Deal with error metrics

This is done to quantify how well the model fits the observed data.

RMSE = √( Σt (Ot − Pt)² / n ) is the average distance between predicted values and observed values.

MAD = Σt |Ot − Pt| / n is the average absolute distance between observed and fitted values.

MAPE = ( Σt |(Ot − Pt) / Ot| / n ) × 100% is our percentage error.
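As a sketch, the three error metrics can be written directly from these definitions:

```python
import math

def rmse(obs, pred):
    """Root Mean Square Error: typical distance of predictions from observations."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(obs, pred)) / len(obs))

def mad(obs, pred):
    """Mean Absolute Deviation between observed and predicted values."""
    return sum(abs(o - p) for o, p in zip(obs, pred)) / len(obs)

def mape(obs, pred):
    """Mean Absolute Percentage Error."""
    return 100 * sum(abs((o - p) / o) for o, p in zip(obs, pred)) / len(obs)

print(rmse([100, 200], [110, 190]))  # 10.0
print(mad([100, 200], [110, 190]))   # 10.0
```

All three compare the same observed/predicted pairs; only the way errors are averaged differs.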

To make a forecast, we use the equation:

yt ≈ Trendt × Seasonalityt × Randomnesst

yt ≈ Tt × St × ϵt
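The forecasting step can be sketched as a small function; the slope, intercept and seasonal indices in the example are illustrative stand-ins for the Excel model's values:

```python
def forecast(t, slope, intercept, seasonal):
    """y_t ~ T_t * S_t: linear trend at period t times the adjusted
    seasonal index for the corresponding hour (randomness omitted)."""
    return (slope * t + intercept) * seasonal[t % len(seasonal)]

# Illustrative: flat seasonality except a 20% bump in hour 1
seasonal = [1.0] * 24
seasonal[1] = 1.2
print(forecast(25, 2.0, 10.0, seasonal))  # trend 60.0 scaled by 1.2 -> 72.0
```

The period index is mapped back to an hour of the day with `t % 24`, mirroring how the adjusted seasonality is looked up per hour.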

Results & Discussion

Multiple Linear Regression

To test for multicollinearity, the variance inflation factor (VIF) should be below 10, and ideally below 5. This can be seen below, where the VIF values are within this range and the tolerance levels are above 0.2. No assumption has been violated.
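The two quantities are directly related, as a small sketch shows:

```python
def vif(r2_j):
    """VIF for variable j, where r2_j is the R^2 from regressing
    x_j on the other independent variables."""
    return 1.0 / (1.0 - r2_j)

def tolerance(r2_j):
    """Tolerance is simply the reciprocal view: 1 - R^2."""
    return 1.0 - r2_j

# The usual cut-offs line up: tolerance 0.2 corresponds to VIF 5,
# and tolerance 0.1 corresponds to VIF 10
print(vif(0.8), tolerance(0.8))
```

So checking tolerance above 0.2 and VIF below 5 is the same check seen from both sides.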

The residuals should be normally distributed. To test this, we examine the normal predicted probability (P-P) plot: the closer the data points are to the diagonal line, the more normally distributed the residuals are. As we can see, the data points are very close to the line, so no assumption has been violated.

To see if the errors are independent of each other and of other variables, we look at the Durbin-Watson statistic. For this to hold, the Durbin-Watson value should be close to 2; it can be found in the Model Summary box. We can see here that the value is 2.072, so no assumption has been violated.
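The statistic itself is simple to compute from the residuals; a minimal sketch on toy numbers:

```python
def durbin_watson(res):
    """Durbin-Watson statistic: near 2 suggests independent residuals,
    near 0 positive autocorrelation, near 4 negative autocorrelation."""
    num = sum((res[t] - res[t - 1]) ** 2 for t in range(1, len(res)))
    return num / sum(e ** 2 for e in res)

print(durbin_watson([1, -1, -1, 1]))  # 2.0 -- no consistent lag-1 pattern
```

A strictly alternating series like [1, -1, 1, -1] instead pushes the statistic towards 4, signalling negative autocorrelation.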

The quality of fit for the model can be inferred from the Model Summary box. The R² value of the third model is 0.477, meaning the model explains 47.7% of the variation in the data; as this is lower than 50%, it may not produce an accurate predictive model. Our adjusted R² remained at 0.476, which indicates that the variables removed did not add much to the model.

Homoscedasticity is where the residuals have constant variance, showing no trend. To check this, we plot a scatterplot with ZPRED on the x-axis and ZRESID on the y-axis. What can be seen here is a random spread of data points, meaning no assumption has been violated.

As we have used a Backward Reduction Model, shown above are the three different models produced by this method of reduction, which removes what is not significant. In the first model, Region_Dummy is removed, as its significance level of 0.632 is higher than the 0.05 p-value threshold. For our second model, Sex is removed, as its significance level of 0.246 is also higher than 0.05. It is important to note that our (Constant) also shows as non-significant, with a significance level of 0.459; however, we will not remove it, as it is fundamental to the modelling and prediction of data.

The third model, with all coefficients significant, gives us values that can be expressed as a linear function:

y = β0 + β1x1 + β2x2 + ⋯ + βkxk + ϵ

with (Constant) being β0, Age β1, Salary β2, Weekend β3 and Christmas β4. This will therefore look like:

y = −13.720 − 15.728x1 + 0.035x2 + 27.856x3 + 50.161x4 + ϵ

where x1 is the age of the person, x2 is the salary, x3 is whether it is the weekend or not and x4 is whether it is Christmas or not.

We can now use this equation to predict how much a 35-year-old female who lives in the South and earns a salary of £27,000 would spend on a weekend day during Christmas. As Region is not included in our model, we ignore it.

y = (−13.720 + 27.856 + 50.161) − 15.728(35) + 0.035(27,000) + ϵ

The predicted value for TotalSpent is therefore 458.817 + ϵ.
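A small helper reproduces this point prediction; x3 and x4 stand for the two dummy variables in the final equation, both set to 1 in this worked example:

```python
# Coefficients of the final (third) model reported above
B = {"const": -13.720, "age": -15.728, "salary": 0.035, "x3": 27.856, "x4": 50.161}

def predict_spend(age, salary, x3, x4):
    """Point prediction of TotalSpent; the error term is omitted,
    and the dummy arguments x3 and x4 take the value 0 or 1."""
    return (B["const"] + B["age"] * age + B["salary"] * salary
            + B["x3"] * x3 + B["x4"] * x4)

# The worked example: age 35, salary 27,000, both dummies equal to 1
print(round(predict_spend(35, 27000, 1, 1), 3))  # 458.817
```

Setting either dummy to 0 shows how much of the prediction each categorical effect contributes.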

Results & Discussion

Time Series

The graph above shows the average number of visits per hour in blue. The fitted model, in red dashes, is almost indistinguishable from the original series, which means the error is very small. This is reflected in our R² value of 0.9992: our model fits the data very well, as the fitted values are close to the observed values.

The orange CMA-24 line is the centred moving average with k equal to 24. The yellow December line shows the forecast for December, which is shown below.

Time period and predicted fitted value:

From the equation yt ≈ Tt × St × ϵt, we ignore the randomness term when forecasting, but keep it in the model to acknowledge that some degree of randomness remains.

Our December period ranges from 265 to 288.

For example: y265 ≈ T265 × S265

Our T will be β0 + β1t, which is the trend equation shown on the graph:

y = 7.4665x + 2308.3

Our S value is the adjusted average for the corresponding time of day; for example, t = 265 corresponds to 01:00, and the adjusted value for 01:00 is 1.07168463. Therefore, our final calculation is

y265 = (7.4665(265) + 2308.3) × 1.07168463 = 4594.22, which is close to the Excel value circled above.
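The same calculation in code, using the trend equation and adjusted seasonal indices quoted in this section (small rounding differences from the Excel figures are expected):

```python
slope, intercept = 7.4665, 2308.3   # trend equation from the graph
s_0100 = 1.07168463                 # adjusted seasonality for 01:00
s_2000 = 1.43230122                 # adjusted seasonality for 20:00

def trend(t):
    """Trend value T_t from the fitted line y = 7.4665x + 2308.3."""
    return slope * t + intercept

y_265 = trend(265) * s_0100   # first December period, at 01:00
y_284 = trend(284) * s_2000   # the busiest hour, 20:00
print(round(y_265, 2), round(y_284, 2))
```

Any other December period from 265 to 288 is forecast the same way, pairing the trend value with the adjusted index for its hour.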
Our RMSE value is 65.24. As our data ranges from about 2,000 to over 5,000 visits, an RMSE of 65.24 tells us our model can predict our December values accurately.

Our MAD value is 50.7. This is the average absolute distance between the fitted and observed values. As with the RMSE, this value is low compared to our data, which can be interpreted as the model having a good level of fit.

Our MAPE is 1.5328%. This percentage error is low, meaning our model closely fits the data.

The busiest hour of the day is 20:00; this can be seen in the graph, where every peak (the highest value) falls on the 20th, 44th, 68th, 92nd, 116th, 140th, 164th, 188th, 212th, 236th and 260th values, all of which correspond to the time 20:00.

The expected value for 20:00 in December is:

y284 = (7.4665(284) + 2308.3) × 1.43230122 = 6,343.35, or 6,343.40 from our Excel prediction circled above.

Conclusion

Multiple Linear Regression

In our model, Age, Salary, Christmas, and Weekend are the significant factors which

influence the amount of money individuals spend on the online store.

Our final model had a significance of less than 0.05, meaning that we reject the null hypothesis; the variables remaining in the model are significant. The factors which are not significant are Region and Sex.

It would be beneficial to use a stepwise selection model, as variables can be added or removed at each step. A model that includes interactions may also help the company see and understand relationships between the independent variables.

Time series

We can see from this that at 8pm the website receives the greatest number of visits, and we have predicted values for the month of December. As the trend is increasing, and the predicted December values are higher than those of previous months, it would be reasonable to assume that the volume of stock will need to be higher than in previous months.

Our model is well made, as it closely mirrors the original values. Adding the number of items sold through the months may create a better prediction model.

References

https://www.statisticssolutions.com/testing-assumptions-of-linear-regression-in-spss/

https://statisticsbyjim.com/regression/multicollinearity-in-regression-analysis/

https://www.open.ac.uk/socialsciences/spsstutorial/files/tutorials/assumptions.pdf

https://www.analystsoft.com/en/products/statplus/content/help/pdf/analysis_regression_backward_stepwise_elimination_regression_model.pd

https://statisticsbyjim.com/regression/interpret-r-squared-regression/

https://statistics.laerd.com/spss-tutorials/multiple-regression-using-spss-statistics.php

https://corporatefinanceinstitute.com/resources/knowledge/other/r-squared/

https://www.statisticshowto.com/probability-and-statistics/statistics-definitions/adjusted-r2/

https://www.statisticshowto.com/wp-content/uploads/2016/02/SPSS_lab2Regression.pdf

https://stattrek.com/multiple-regression/dummy-variables.aspx

https://statistics.laerd.com/spss-tutorials/linear-regression-using-spss-statistics.php

https://corporatefinanceinstitute.com/resources/knowledge/other/adjusted-r-squared/

https://en.wikibooks.org/wiki/Using_SPSS_and_PASW/Confidence_Intervals

