For Management
UMCDN3-15-1
Illustrative & descriptive statistics
Presentation by
Dr Ana Sendova-Franks
Notes: page 3
25th January, 2018
• Introduction to evidence-based decision-making in business
• Extensive use of the statistics package SPSS
• 3 in-class online tests using SPSS to analyse data
UWE Software for home: SPSS 24
http://www1.uwe.ac.uk/its/itresources/software.aspx
The download file is a .zip folder, so please 'extract' before running the setup file
As part of a marketing exercise, a coffee shop wishes to find out more about who its customers
are. In order to do this a dataset was created that comprises a random sample of 80 customers
who have a loyalty card. The following information was obtained from their customer account
records and data generated by swiping their loyalty cards when purchasing coffee at the shop:
Gender – Female/Male
Age (years)
Income - Less than £20,000/£20,000-£29,999/£30,000-£39,999/£40,000 or more
Favourite coffee (most frequently purchased) - Americano/Cappuccino/Espresso/Latte
Expenditure on coffee in last month (£)
These measurements are called variables because they will vary from one individual to
another and none of the measurements is likely to be the same (constant) for every
individual in the study.
For example, the first five participants in the study might produce the following data for
the five variables:
Even for these five people the data vary quite considerably: for example, age varies from
28 years to 50 years, and expenditure from £28.50 to £60.75.
Types of variable
• Age and Expenditure are numerical
• Gender, Income and Favourite are categorical: their values form categories which are
denoted by letters or words
It is very important to be able to recognise which type of variable you are working
with, as the appropriate methods for graphical display, summary and analysis of
the variable depend on its type.
In SPSS the types of variable are classified using the following three levels of
measurement:
Nominal – categorical data with no inherent order (Gender and Favourite);
Ordinal – categorical data with an inherent order (Income);
Scale – numeric data measured on a scale (Age and Expenditure).
Note that when using SPSS (and other statistical packages) the categorical variables are
usually coded numerically.
For example, the coffee data above appear as:
One of your tasks in this week's computer labs is coding this into SPSS!
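The idea of coding categorical variables numerically can be sketched outside SPSS as well. The short Python example below is only an illustration: the code numbers (1, 2, 3, ...) and the customer record are hypothetical, since the coding scheme actually used in the lab data file is not given here.

```python
# Hypothetical coding schemes; the codes used in the real SPSS file may differ.
GENDER_CODES = {"Female": 1, "Male": 2}                      # nominal
INCOME_CODES = {                                             # ordinal: codes keep the order
    "Less than £20,000": 1,
    "£20,000-£29,999": 2,
    "£30,000-£39,999": 3,
    "£40,000 or more": 4,
}
FAVOURITE_CODES = {"Americano": 1, "Cappuccino": 2, "Espresso": 3, "Latte": 4}  # nominal


def encode(record):
    """Return a copy of one customer record with categories replaced by codes."""
    return {
        "gender": GENDER_CODES[record["gender"]],
        "age": record["age"],                    # scale variables stay as numbers
        "income": INCOME_CODES[record["income"]],
        "favourite": FAVOURITE_CODES[record["favourite"]],
        "expenditure": record["expenditure"],
    }


# A made-up customer, not taken from the real dataset.
customer = {
    "gender": "Female",
    "age": 28,
    "income": "£20,000-£29,999",
    "favourite": "Latte",
    "expenditure": 28.50,
}
coded = encode(customer)
print(coded)
```

For the nominal variables the numbers are only labels, so their size carries no meaning; for the ordinal variable Income the codes are chosen to follow the order of the bands.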
Exploratory Data Analysis (EDA)
EDA is a preliminary study of the data, looking for interesting patterns.
From the above table we can see that 61.3% of the customers in the sample were female.
It is hoped that the sample is representative of the population of all customers who may
visit the coffee shop, and thus we would infer that the majority of all customers are
female.
So what is the distribution of income of the customers?
From the above we can see that the modal class (the most frequently occurring category)
is £20,000-£29,999, i.e. customers earning in the £20 thousands.
It appears that Cappuccino is the most popular with 43.8% of customers having it
as their favourite type.
Latte too is quite popular (32.5%).
Americano is the least popular with just 7.5% of customers having it as their
favourite.
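A frequency table like the one behind these percentages can be sketched in Python. The counts below are back-calculated from the percentages quoted above for a sample of 80 (35, 26 and 6 customers), with Espresso taken as the remainder; they are an inference for illustration, not the raw data.

```python
from collections import Counter
from decimal import Decimal, ROUND_HALF_UP

# Counts inferred from the reported percentages of 80 customers (assumption);
# Espresso is whatever is left over.
favourites = (["Cappuccino"] * 35 + ["Latte"] * 26
              + ["Espresso"] * 13 + ["Americano"] * 6)

counts = Counter(favourites)
n = len(favourites)


def percent(count, total):
    """Percentage to one decimal place, rounding halves up."""
    value = Decimal(count) * 100 / Decimal(total)
    return value.quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Most popular coffee first.
table = {coffee: percent(count, n) for coffee, count in counts.most_common()}
for coffee, pct in table.items():
    print(coffee, counts[coffee], f"{pct}%")
```

Decimal arithmetic is used here because Python's built-in round() resolves exact halves to the even digit, which would not reproduce a figure such as 43.8% from 43.75%.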
Summary statistics
The (arithmetic) mean is calculated by adding up all the data values and then dividing
by the number of values.
Here we see that the average age of the customers is 32.90 years.
Whilst SPSS has given the mean to two decimal places, a common reporting
convention is to round such statistics to one more decimal place than the data
was recorded to. Age was recorded in whole years, so we report 32.9 years.
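As a small illustration of the calculation and the rounding convention, here is a Python sketch. The five ages are hypothetical and are not taken from the coffee shop dataset.

```python
import statistics

# Hypothetical ages (whole years) for five customers; not the real dataset.
ages = [28, 50, 31, 35, 42]

# Add up all the data values, then divide by the number of values.
mean_by_hand = sum(ages) / len(ages)

# The standard library gives the same answer.
mean_age = statistics.mean(ages)

# Ages were recorded to 0 decimal places, so report the mean to 1.
print(f"Mean age: {mean_age:.1f} years")  # prints "Mean age: 37.2 years"
```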
So what do the summary statistics of Age tell us?
The median is the value in the middle of a data set when the values are ordered by size.
Here we see that the median age is 31.00 years.
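The sketch below shows how the median is found for an odd and an even number of values. The ages are hypothetical; with an even count there is no single middle value, so the usual convention (which Python's statistics.median follows) is to average the two middle values.

```python
import statistics

# Hypothetical ages; not the real dataset.
ages_odd = [28, 50, 31, 35, 42]        # ordered: 28 31 35 42 50
ages_even = [28, 50, 31, 35, 42, 33]   # ordered: 28 31 33 35 42 50

# Odd count: the single middle value of the ordered data.
median_odd = statistics.median(ages_odd)

# Even count: the average of the two middle values, (33 + 35) / 2.
median_even = statistics.median(ages_even)

print(median_odd, median_even)
```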
The mode is the most frequently occurring value in the data set.
In this case 35-year-olds appear the most often (if you create a frequency table of the
data you will in fact find that 9 out of the 80 customers are 35).
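Finding the mode from a frequency table can be sketched as follows. The ages are hypothetical. The last step shows that a data set can have more than one mode, which is worth checking before reporting a single value.

```python
import statistics
from collections import Counter

# Hypothetical ages; not the real dataset.
ages = [35, 28, 35, 31, 42, 35, 28]

# Frequency table: how many times each age occurs.
frequency = Counter(ages)

# The mode is the age with the highest frequency.
mode_age = statistics.mode(ages)
print(mode_age, frequency[mode_age])

# Adding one more 28-year-old creates a tie between 28 and 35.
modes = statistics.multimode(ages + [28])
print(sorted(modes))
```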
The mean is considered to be the first choice for reporting an average as it is the
most powerful and it uses all of the data values in its calculation.
However, it can be distorted in the presence of outliers (unusually small or large
values). In that case the median is deemed to be a more robust measure to report;
it is far less affected by outliers.
The mode is more applicable to report when there are not many distinct values in
the data set.
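The effect of an outlier on the mean and the median can be seen in a small Python sketch. Both sets of ages are hypothetical; the second simply replaces the largest value with an unusually large one.

```python
import statistics

# Hypothetical ages; not the real dataset.
ages = [28, 30, 31, 33, 35]
ages_with_outlier = [28, 30, 31, 33, 95]  # largest value replaced by an outlier

mean_before = statistics.mean(ages)
mean_after = statistics.mean(ages_with_outlier)

median_before = statistics.median(ages)
median_after = statistics.median(ages_with_outlier)

# The mean is pulled towards the outlier; the median stays where it was.
print(f"Mean:   {mean_before:.1f} -> {mean_after:.1f}")
print(f"Median: {median_before} -> {median_after}")
```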
The minimum and maximum values give us an idea of the two extremes of the data.
Here we see that the age of our customers varies from the youngest at 20 up to
the oldest at 68.
The range is the difference between the maximum and minimum values and is a measure
of how dispersed the data is; the larger the range the more spread out a data set is.
If the range was zero then the data would not vary at all.
Here we have a range of 68-20=48 years.
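The same calculation as a Python sketch. The list of ages is hypothetical; only its smallest and largest values (20 and 68) match the figures reported above.

```python
# Hypothetical ages; only the extremes (20 and 68) match the reported values.
ages = [20, 27, 31, 35, 68]

youngest = min(ages)
oldest = max(ages)
age_range = oldest - youngest
print(f"Range: {oldest} - {youngest} = {age_range} years")

# A data set that does not vary at all has a range of zero.
constant = [30, 30, 30]
constant_range = max(constant) - min(constant)
print(constant_range)
```

Because the range uses only the two most extreme values, a single unusual customer can change it a great deal.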
Applying the two standard deviation rule (mean ± 2 standard deviations, with a standard
deviation of 8.965 years), we would expect about 95% of customers to be aged
between 32.90 ± (2×8.965) years, i.e. between 14.97 and 50.83 years.
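The arithmetic behind this interval can be checked with a few lines of Python. The mean, standard deviation and maximum are the values reported for Age; the function name is just an illustrative choice.

```python
# Values reported for Age in the summary statistics.
mean_age = 32.90
sd_age = 8.965


def two_sd_interval(mean, sd):
    """Return (lower, upper) for mean ± 2 standard deviations."""
    margin = 2 * sd
    return mean - margin, mean + margin


lower, upper = two_sd_interval(mean_age, sd_age)
print(f"About 95% of ages expected between {lower:.2f} and {upper:.2f} years")

# The oldest customer (68) lies above the upper limit of the interval.
oldest = 68
print(oldest > upper)
```

The rule is an approximation that works best for roughly symmetric, bell-shaped data, so the limits should be read as a guide rather than exact cut-offs.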
Histogram
[Figure: histogram shown alongside the summary statistics. Shape? Skewed. Annotations mark a typical value and an unusually high value.]
Boxplot
[Figure: boxplot labelled, from top to bottom: 'Max', upper quartile, median, lower quartile, 'Min'.]
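The five values a boxplot is built from can be sketched in Python. Several things here are assumptions: the ages are hypothetical, the quartiles use the default method of Python's statistics.quantiles (SPSS may use a different quartile convention and so give slightly different values), and the 1.5 × IQR rule for flagging outliers is a common convention that the slide itself does not state. The quotation marks around 'Max' and 'Min' are taken to mean the most extreme values that are not outliers.

```python
import statistics

# Hypothetical ages, listed in order for readability; not the real dataset.
ages = [20, 25, 28, 31, 33, 35, 40, 68]

# Three cut points that split the ordered data into four equal-sized groups.
lower_q, median, upper_q = statistics.quantiles(ages, n=4)
iqr = upper_q - lower_q  # interquartile range: the length of the box

# Assumed convention: values more than 1.5 x IQR beyond the box are outliers.
lower_fence = lower_q - 1.5 * iqr
upper_fence = upper_q + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]

# Whiskers stop at the most extreme values that are not outliers.
whisker_min = min(a for a in ages if a >= lower_fence)
whisker_max = max(a for a in ages if a <= upper_fence)

print("'Min':", whisker_min)
print("Lower quartile:", lower_q)
print("Median:", median)
print("Upper quartile:", upper_q)
print("'Max':", whisker_max)
print("Outliers:", outliers)
```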
Formal report writing
In formal reports it is common to give means and medians to one more decimal place
than the data was recorded to, and standard deviations to two more decimal
places.
How many people were in the sample? 80
What was the least expenditure? £2.85
What was the biggest expenditure? £66.20
What was the mean expenditure? £33.42
In week 5 you have the first in-class online test that uses SPSS.