
Business Decision Making

For Management

UMCDN3-15-1
Illustrative & descriptive statistics
Presentation by

Dr Ana Sendova-Franks

Notes: page 3
25th January, 2018
• Introduction to evidence-based decision-making in business
• Extensive use of the statistics package SPSS
• 3 in-class online tests using SPSS to analyse data
UWE Software for home: SPSS 24
http://www1.uwe.ac.uk/its/itresources/software.aspx

Free versions for Windows & Mac

Instructions are included in the download

The download file is a .zip folder, so please 'extract' before running the setup file

AUTHORISATION CODE IS ON DOWNLOAD PAGE


Illustrative and descriptive statistics
Example: Profiling coffee shop customers

As part of a marketing exercise, a coffee shop wishes to find out more about who its customers
are. In order to do this a dataset was created that comprises a random sample of 80 customers
who have a loyalty card. The following information was obtained from their customer account
records and data generated by swiping their loyalty cards when purchasing coffee at the shop:
• Gender – Female/Male
• Age (years)
• Income – Less than £20,000/£20,000-£29,999/£30,000-£39,999/£40,000 or more
• Favourite coffee (most frequently purchased) – Americano/Cappuccino/Espresso/Latte
• Expenditure on coffee in last month (£)
These measurements are called variables because they will vary from one individual to
another and none of the measurements is likely to be the same (constant) for every
individual in the study.

For example, the first five participants in the study might produce the following data for
the five variables:

Even for these five people the data vary quite considerably, for example age varies from
28 years to 50 years, and expenditure from £28.50 to £60.75.
Types of variable
• Age and Expenditure are numerical
• Gender, Income and Favourite form categories which are denoted by letters or
words

It is very important to be able to recognise which type of variable you are working
with, as the appropriate methods for graphical display, summary and analysis of
the variable depend on its type.
In SPSS the types of variable are classified using the following three levels of measurement:
• Nominal – categorical data with no inherent order (Gender and Favourite);
• Ordinal – categorical data with an inherent order (Income);
• Scale – numeric data measured on a scale (Age and Expenditure).
Note that when using SPSS (and other statistical packages) the categorical variables are
usually coded numerically.
For example, the coffee data above appear as:

One of your tasks in this week's computer labs is coding this into SPSS!
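The numeric coding mentioned above can also be sketched outside SPSS. The following Python fragment is only an illustration: the code numbers and the example customer are assumptions, not the coding scheme used in the lab.

```python
# Hypothetical numeric codes for the categorical variables.
# The actual codes used in the SPSS lab may differ.
GENDER_CODES = {"Female": 1, "Male": 2}
INCOME_CODES = {
    "Less than £20,000": 1,
    "£20,000-£29,999": 2,
    "£30,000-£39,999": 3,
    "£40,000 or more": 4,
}
FAVOURITE_CODES = {"Americano": 1, "Cappuccino": 2, "Espresso": 3, "Latte": 4}


def encode(record):
    """Return a copy of a customer record with its categories replaced by codes."""
    return {
        "gender": GENDER_CODES[record["gender"]],
        "age": record["age"],
        "income": INCOME_CODES[record["income"]],
        "favourite": FAVOURITE_CODES[record["favourite"]],
        "expenditure": record["expenditure"],
    }


# An invented customer, used only to show the mapping.
example = {
    "gender": "Female",
    "age": 28,
    "income": "£20,000-£29,999",
    "favourite": "Latte",
    "expenditure": 28.50,
}
coded = encode(example)
print(coded)
```

For the ordinal variable Income the codes follow the category order, so the numbers keep the ranking; for the nominal variables the numbers are labels only.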
Exploratory Data Analysis (EDA)
EDA is a preliminary study of the data, looking for interesting patterns.

EDA is often characterised by a consideration of four aspects:


• centre
• spread
• shape
• outliers (unusual observations)

Some of the ways of carrying out EDA include:


• tabulation
• graphical displays
• calculation of summary statistics
Frequency tables & bar/pie charts

From the above table we can see that 61.3% of the customers in the sample were female.

It is hoped that the sample is representative of the population of all customers who may
visit the coffee shop, and thus we would infer that the majority of all customers are
female.
So what is the distribution of income of the customers?

From the above we can see that the modal class (most frequently occurring)
is for customers earning in the £20 thousands.

When we have ordinal data it can be appropriate to use the cumulative frequency column in the table.
For instance we could say that 87.5% of customers earn below £40,000.
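A cumulative percentage column of this kind is just a running total of the class percentages. In the sketch below the class counts are hypothetical: they were chosen only to agree with the two facts quoted above (the £20,000-£29,999 class is modal, and 87.5% of the 80 customers earn below £40,000), so the real frequency table may differ.

```python
from itertools import accumulate

# Income classes in their natural (ordinal) order.
classes = [
    "Less than £20,000",
    "£20,000-£29,999",
    "£30,000-£39,999",
    "£40,000 or more",
]
# HYPOTHETICAL counts for illustration only (they sum to 80).
counts = [20, 30, 20, 10]

total = sum(counts)
percents = [100 * c / total for c in counts]
# Running total of the percentages gives the cumulative percent column.
cumulative = list(accumulate(percents))

for name, count, pct, cum in zip(classes, counts, percents, cumulative):
    print(f"{name:<18} {count:>3} {pct:>6.1f} {cum:>6.1f}")
```

The cumulative column only makes sense because the classes have an order; for a nominal variable such as Favourite a running total has no natural meaning.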
So what is the most popular type of coffee?

It appears that Cappuccino is the most popular with 43.8% of customers having it
as their favourite type.
Latte too is quite popular (32.5%).
Americano is the least popular with just 7.5% of customers having it as their
favourite.
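Because the sample size is known to be 80, the percentages above can be turned back into counts. The counts in this sketch are back-calculated from the quoted percentages rather than read from the SPSS output, and the Espresso count is assumed to be the remainder.

```python
N = 80  # sample size stated in the example

# Counts back-calculated from the quoted percentages (7.5%, 43.8%, 32.5%).
counts = {
    "Americano": 6,
    "Cappuccino": 35,
    "Latte": 26,
}
# Espresso is not quoted in the text; it is assumed to be the remainder.
counts["Espresso"] = N - sum(counts.values())

# Percentage of the sample choosing each coffee.
percent = {coffee: 100 * n / N for coffee, n in counts.items()}
most_popular = max(counts, key=counts.get)

for coffee in sorted(counts):
    print(f"{coffee:<11} {counts[coffee]:>3} {percent[coffee]:>5.1f}%")
print("Most popular:", most_popular)
```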
Summary statistics

The most common numerical summary statistics are measures of:

• central tendency (mean, median and mode) – “averages” or typical values in the centre of the data values;

• dispersion (range and standard deviation) – how spread out or variable the data is.
So what do the summary statistics of Age tell us?

Statisticians reserve N to represent the sample size.


Here we see as expected that we have a sample size of 80 and it is reported that
there were no missing values.

The (arithmetic) mean is calculated by adding up all the data values and then dividing
by the number of values.
Here we see that the average age of the customers is 32.90 years.
Whilst SPSS has given the mean to two decimal places, a common reporting
convention is to round such statistics to one more decimal place than the data
were recorded to, i.e. 32.9 years.

The median is the value in the middle of a data set when the values are ordered by size.
Here we see that the median age is 31.00 years.

The mode is the most frequently occurring value in the data set.
In this case 35-year-olds appear most often (if you create a frequency table of the
data you will find that 9 out of the 80 customers are 35).

The mean is considered to be the first choice for reporting an average as it is the
most powerful and it uses all of the data values in its calculation.
However, it can be distorted in the presence of outliers (unusually small or large
values). In that case the median is deemed to be a more robust measure to report,
because it is far less affected by outliers.
The mode is more applicable to report when there are not many distinct values in
the data set.
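The effect of an outlier on the mean and the median can be checked with Python's standard `statistics` module. The ages below are invented for illustration and are not the coffee shop data.

```python
import statistics

# Invented ages, for illustration only.
ages = [22, 25, 25, 28, 31, 35, 35, 35, 40]
# The same sample with one unusually large value added.
ages_with_outlier = ages + [95]

for label, data in [("original", ages), ("with outlier", ages_with_outlier)]:
    print(
        label,
        "mean =", round(statistics.mean(data), 1),
        "median =", statistics.median(data),
        "mode =", statistics.mode(data),
    )
```

In this invented sample, adding the single value 95 moves the mean by more than six years, while the median moves by only two and the mode does not change.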

The minimum and maximum values give us an idea of the two extremes of the data.
Here we see that the age of our customers varies from the youngest at 20 up to
the oldest at 68.

The range is the difference between the maximum and minimum values and is a measure
of how dispersed the data is; the larger the range the more spread out a data set is.
If the range was zero then the data would not vary at all.
Here we have a range of 68-20=48 years.

The superior measure of dispersion is the standard deviation, which in a layman’s
sense can be interpreted as the average amount a value is above or below the mean.
The standard deviation cannot be negative; its smallest possible value, zero, indicates
that the data does not vary.
The larger the standard deviation value the more varied the data is.
Here we have a standard deviation of 8.965 years.
Another way to interpret the standard deviation is to use either of the following rules
of thumb, which hold for roughly bell-shaped (normal) distributions:
• Approximately 95% of a population lies between mean ± 2 × standard deviation
• Almost all (99.7%) of a population lies between mean ± 3 × standard deviation

Applying the two standard deviation rule we would expect 95% of customers to be
approximately between 32.90 ± (2×8.965) years, i.e. between 14.97 and 50.83 years.
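The arithmetic behind these intervals is shown below, using the mean and standard deviation reported above. The function name `sd_interval` is an illustrative choice.

```python
def sd_interval(mean, sd, k):
    """Return (lower, upper) for mean ± k × sd."""
    return mean - k * sd, mean + k * sd


mean_age = 32.90  # reported mean (years)
sd_age = 8.965    # reported standard deviation (years)

low2, high2 = sd_interval(mean_age, sd_age, 2)
low3, high3 = sd_interval(mean_age, sd_age, 3)

print(f"mean ± 2 SD: {low2:.2f} to {high2:.2f} years")
print(f"mean ± 3 SD: {low3:.2f} to {high3:.2f} years")
```

The reported maximum age of 68 lies above the upper three standard deviation limit, which suggests the rule is only a rough guide for these data.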
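Histogram
[Figure: histogram shown alongside the summary statistics. Annotations: shape is skewed; typical value; unusually high value.]
Boxplot
[Figure: boxplot. Labels from top to bottom: mild outlier (51st observation), ‘Max’, upper quartile, median, lower quartile, ‘Min’.]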
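The quantities marked on a boxplot can be computed directly. The sketch below uses invented ages and the common 1.5 × IQR rule for flagging outliers; quartile definitions differ between packages, so SPSS may give slightly different quartiles for the same data.

```python
import statistics

# Invented ages, for illustration only.
ages = [20, 24, 26, 28, 29, 31, 31, 33, 35, 35, 38, 41, 44, 60]

# Quartiles; method="inclusive" is one of several common definitions.
q1, median, q3 = statistics.quantiles(ages, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: the height of the box

# Values more than 1.5 × IQR beyond the box are flagged as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

# The whiskers ('Min' and 'Max') stop at the most extreme non-outliers.
inside = [a for a in ages if lower_fence <= a <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

print("Q1, median, Q3:", q1, median, q3)
print("outliers:", outliers)
print("whiskers:", whisker_low, whisker_high)
```

This is why ‘Min’ and ‘Max’ appear in quotation marks on the plot: when an outlier is present, the whisker ends at the largest value that is not an outlier rather than at the true maximum.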
Formal report writing

It is common to formally report means and medians to one more decimal place
than the data were recorded to, and standard deviations to two more decimal
places.

It is also common to use the following abbreviations:

• sample size: N
• mean: M
• median: Mdn
• standard deviation: SD

For example:
The data set comprised the age (years) of 80 customers (M = 32.9, SD = 8.97).
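The rounding convention can be applied programmatically. This sketch uses Python's `decimal` module with half-up rounding, since binary floating-point formatting can round a value such as 8.965 downwards; the helper names `round_half_up` and `report` are illustrative.

```python
from decimal import Decimal, ROUND_HALF_UP


def round_half_up(value, places):
    """Round a decimal string to the given number of places, halves upwards."""
    quantum = Decimal(1).scaleb(-places)  # e.g. places=2 gives Decimal('0.01')
    return Decimal(value).quantize(quantum, rounding=ROUND_HALF_UP)


def report(n, mean, sd, data_places=0):
    """Format N, M and SD: mean to one more place than the data, SD to two more."""
    m = round_half_up(mean, data_places + 1)
    s = round_half_up(sd, data_places + 2)
    return f"N = {n}, M = {m}, SD = {s}"


# Age was recorded in whole years, so data_places is 0.
summary = report(80, "32.90", "8.965")
print(summary)
```

Passing the statistics as strings keeps the digits exactly as SPSS printed them, so the half-up rule is applied to 8.965 itself rather than to its nearest binary approximation.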
Summarising Expenditure

• How many people were in the sample? 80
• What was the least expenditure? £2.85
• What was the biggest expenditure? £66.20
• What was the mean expenditure? £33.42
• Were there any unusual expenditure amounts? No outliers


Relationships between dependent and independent variables

A dependent variable (or response variable) is an outcome of interest whose
value is believed to depend upon an independent variable (or explanatory
variable).

Does Favourite coffee depend on Gender, or does Gender depend on Favourite coffee?

Does Income explain Expenditure, or does Expenditure explain Income?

For the coffee data we have the following:


• Dependent variables - Favourite and Expenditure
• Independent variables – Gender, Age and Income
• The scatterplot above clearly shows a relationship between expenditure and age;
there is a tendency for expenditure to increase with age.
• Note that there is a graphing convention to put the dependent variable on the y-axis
and the independent variable on the x-axis.
• Cross-tabulation and clustered bar charts above reveal that females clearly favour
latte (44.9%) and to a lesser extent cappuccino (34.7%) whereas males favour
mainly cappuccino (58%).
• For both genders Americano is the least popular coffee (female 8.2%, male 6.5%).
• The multiple boxplots above show that the high earners (£40,000 or more) appear,
on average, to spend more than the others.
• The expenditure for this income group also appears to vary over a smaller range
than the others.
• Interestingly the customer who spent the most comes from the lowest income group.
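The row percentages quoted for the cross-tabulation can be reproduced from cell counts. The counts below are not taken from the SPSS output: they are back-calculated from the percentages quoted in this lecture and the sample size of 80, so treat them as an assumption.

```python
# Cell counts back-calculated from the quoted percentages (an assumption).
crosstab = {
    "Female": {"Americano": 4, "Cappuccino": 17, "Espresso": 6, "Latte": 22},
    "Male": {"Americano": 2, "Cappuccino": 18, "Espresso": 7, "Latte": 4},
}


def row_percentages(table):
    """Convert each row of counts into percentages of that row's total."""
    result = {}
    for group, cells in table.items():
        total = sum(cells.values())
        result[group] = {k: 100 * v / total for k, v in cells.items()}
    return result


percentages = row_percentages(crosstab)
for group, cells in percentages.items():
    line = ", ".join(f"{coffee} {pct:.1f}%" for coffee, pct in cells.items())
    print(f"{group}: {line}")
```

Percentages are taken within each gender (the independent variable), which is what allows the female and male preferences to be compared even though the two groups differ in size.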
Computer labs
Two of this week's computer lab tasks relate to what you have seen in this lecture.
This week will give you an introduction to statistics using SPSS.

In week 5 you have the first in-class online test that uses SPSS.
