A BDM Statistics Lecture 1 Summary Stats

Business Decision Making
For Management
UMCDN3-15-1
Illustrative & descriptive statistics
Presentation by
Dr Ana Sendova-Franks
Notes: page 3
25th January, 2018
• Introduction to evidence-based decision-making in business
• Extensive use of the statistics package SPSS
• 3 in-class online tests using SPSS to analyse data
UWE Software for home: SPSS 24
http://www1.uwe.ac.uk/its/itresources/software.aspx
Free versions for Windows & Mac
Instructions are included in the download
The download file is a .zip folder, so please 'extract' before running the setup file
AUTHORISATION CODE IS ON DOWNLOAD PAGE

Illustrative and descriptive statistics
Example: Profiling coffee shop customers
As part of a marketing exercise, a coffee shop wishes to find out more about who its customers
are. In order to do this a dataset was created that comprises a random sample of 80 customers
who have a loyalty card. The following information was obtained from their customer account
records and data generated by swiping their loyalty cards when purchasing coffee at the shop:

• Gender – Female/Male
• Age (years)
• Income – Less than £20,000/£20,000-£29,999/£30,000-£39,999/£40,000 or more
• Favourite coffee (most frequently purchased) – Americano/Cappuccino/Espresso/Latte
• Expenditure on coffee in last month (£)
These measurements are called variables because they will vary from one individual to
another and none of the measurements is likely to be the same (constant) for every
individual in the study.
For example, the first five participants in the study might produce the following data for
the five variables:
Even for these five people the data vary quite considerably, for example age varies from
28 years to 50 years, and expenditure from £28.50 to £60.75.
Types of variable
• Age and Expenditure are numerical
• Gender, Income and Favourite form categories which are denoted by letters or
words
It is very important to be able to recognise which type of variable you are working
with, as the appropriate methods for graphical display, summary and analysis of
the variable depend on its type.

In SPSS the type of each variable is classified using one of the following three levels of
measurement:
• Nominal – categorical data with no inherent order (Gender and Favourite);
• Ordinal – categorical data with an inherent order (Income);
• Scale – numeric data measured on a scale (Age and Expenditure).
Note that when using SPSS (and other statistical packages) the categorical variables are
usually coded numerically.
For example, the coffee data above appear as:
One of your tasks in this week's computer labs is coding this into SPSS!
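The idea of numeric coding can be sketched outside SPSS as well. The short Python example below is only an illustration: the code numbers (1, 2, 3, 4) and the example customer are assumptions made for this sketch, not the codes or data used in the lecture's SPSS file.

```python
# Hypothetical numeric codes for the categorical variables.
# These particular numbers are assumed for illustration only.
GENDER_CODES = {"Female": 1, "Male": 2}
INCOME_CODES = {
    "Less than £20,000": 1,
    "£20,000-£29,999": 2,
    "£30,000-£39,999": 3,
    "£40,000 or more": 4,
}
FAVOURITE_CODES = {"Americano": 1, "Cappuccino": 2, "Espresso": 3, "Latte": 4}


def encode(record):
    """Replace the category labels in one customer record with numeric codes."""
    return {
        "gender": GENDER_CODES[record["gender"]],
        "age": record["age"],                  # already numeric (scale)
        "income": INCOME_CODES[record["income"]],
        "favourite": FAVOURITE_CODES[record["favourite"]],
        "expenditure": record["expenditure"],  # already numeric (scale)
    }


# An invented customer, not taken from the coffee shop dataset.
customer = {
    "gender": "Female",
    "age": 30,
    "income": "£20,000-£29,999",
    "favourite": "Latte",
    "expenditure": 25.00,
}
coded = encode(customer)
print(coded)
```

Note that the income codes preserve the order of the bands (ordinal), whereas the gender and favourite codes are just labels (nominal): the numbers carry no order.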
Exploratory Data Analysis (EDA)
EDA is a preliminary study of the data, looking for interesting patterns.
EDA is often characterised by a consideration of four aspects:

• centre
• spread
• shape
• outliers (unusual observations)
Some of the ways of carrying out EDA include:

• tabulation
• graphical displays
• calculation of summary statistics
Frequency tables & bar/pie charts
From the above table we can see that 61.3% of the customers in the sample were female.
It is hoped that the sample is representative of the population of all customers who may
visit the coffee shop, and thus we would infer that the majority of all customers are
female.
So what is the distribution of income of the customers?
From the above we can see that the modal class (the most frequently occurring
category) is customers earning £20,000-£29,999.
When we have ordinal data it can be appropriate to use the cumulative
frequency column in the table.
For instance we could say that 87.5% of customers earn below £40,000.
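As a rough sketch of how the percent and cumulative percent columns of a frequency table are built, the Python example below counts each category in order. The function name frequency_table and the ten income values are invented for illustration; they are not the coffee shop data or SPSS output.

```python
from collections import Counter


def frequency_table(values, ordered_categories):
    """Return rows of (category, count, percent, cumulative percent)."""
    counts = Counter(values)
    total = len(values)
    rows = []
    running_count = 0
    for category in ordered_categories:
        count = counts.get(category, 0)
        running_count += count
        percent = round(100 * count / total, 1)
        cumulative = round(100 * running_count / total, 1)
        rows.append((category, count, percent, cumulative))
    return rows


# The order matters for ordinal data: lowest band first.
bands = ["Less than £20,000", "£20,000-£29,999", "£30,000-£39,999", "£40,000 or more"]

# Ten invented observations, for illustration only.
incomes = (
    ["Less than £20,000"] * 2
    + ["£20,000-£29,999"] * 4
    + ["£30,000-£39,999"] * 3
    + ["£40,000 or more"] * 1
)

table = frequency_table(incomes, bands)
for row in table:
    print(row)
```

In this made-up sample the cumulative percent for the third band is 90.0, which would be read as "90% earn below £40,000". The cumulative column is only meaningful because the categories have an inherent order.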
So what is the most popular type of coffee?
It appears that Cappuccino is the most popular with 43.8% of customers having it
as their favourite type.
Latte too is quite popular (32.5%).
Americano is the least popular with just 7.5% of customers having it as their
favourite.
Summary statistics
The most common numerical summary statistics are measures of:
• central tendency (mean, median and mode) - “averages” or typical
values in the centre of the data values;
• dispersion (range and standard deviation) - how spread out or
variable the data is.
So what do the summary statistics of Age tell us?
Statisticians reserve N to represent the sample size.

Here we see as expected that we have a sample size of 80 and it is reported that
there were no missing values.
The (arithmetic) mean is calculated by adding up all the data values and then dividing
by the number of values.
Here we see that the average age of the customers is 32.90 years.
Whilst SPSS has given the mean to two decimal places, a common reporting
convention is to round such statistics to one more decimal place than the data
were recorded to, i.e. 32.9 years.
The median is the middle value of a data set when the values are ordered by size.
Here we see that the median age is 31.00 years.
The mode is the most frequently occurring value in the data set.
In this case 35 year-olds appear the most (if you create a frequency table of the
data you will in fact find that 9 out of the 80 customers are 35).
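The three averages can also be checked by hand or in code. Below is a minimal Python sketch using the standard statistics module. The eight ages are invented for illustration and are not the 80 customers in the sample.

```python
import statistics

# Eight invented ages (years), for illustration only.
ages = [22, 25, 25, 31, 35, 35, 35, 40]

mean_age = statistics.mean(ages)      # sum of the values divided by how many there are
median_age = statistics.median(ages)  # middle value; with an even count, the mean of the two middle values
mode_age = statistics.mode(ages)      # most frequently occurring value

print(mean_age, median_age, mode_age)
```

Here the sum is 248, so the mean is 248 / 8 = 31. The two middle values are 31 and 35, so the median is 33, and 35 occurs three times, so it is the mode.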
The mean is usually the first choice for reporting an average because it uses all
of the data values in its calculation.
However, it can be distorted by outliers (unusually small or large values). In that
case the median is a more robust measure to report, as it is far less affected by
outliers.
The mode is more applicable to report when there are not many distinct values in
the data set.
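To see why the median is preferred when outliers are present, the small Python sketch below compares the mean and median before and after adding one unusually large value. All the numbers are invented for illustration.

```python
import statistics

# Invented values, for illustration only.
values = [20, 22, 24, 26, 28]
with_outlier = values + [90]  # add one unusually large value

mean_before = statistics.mean(values)
median_before = statistics.median(values)
mean_after = statistics.mean(with_outlier)
median_after = statistics.median(with_outlier)

print("mean:  ", mean_before, "->", mean_after)
print("median:", median_before, "->", median_after)
```

The single outlier pulls the mean from 24 up to 35, while the median only moves from 24 to 25.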
The minimum and maximum values show the two extremes of the data.
Here we see that the age of our customers varies from the youngest at 20 up to
the oldest at 68.
The range is the difference between the maximum and minimum values and is a measure
of how dispersed the data is; the larger the range the more spread out a data set is.
If the range were zero then the data would not vary at all.
Here we have a range of 68-20=48 years.
The superior measure of dispersion is the standard deviation, which in a layman’s
sense can be interpreted as the average amount by which a value lies above or below
the mean.
The standard deviation cannot be negative; its smallest possible value, zero,
indicates that the data does not vary.
The larger the standard deviation value the more varied the data is.
Here we have a standard deviation of 8.965 years.
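The range and standard deviation can be sketched in Python as follows. The data values are invented for illustration. The example uses statistics.stdev, which divides by n - 1 (the sample standard deviation); this is assumed here to correspond to the figure SPSS reports.

```python
import statistics

# Eight invented values, for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)  # maximum minus minimum
sd = statistics.stdev(data)         # sample standard deviation (divides by n - 1)

# Data that never varies has a standard deviation of zero.
constant_sd = statistics.stdev([3, 3, 3])

print(data_range, round(sd, 3), constant_sd)
```

For these values the mean is 5 and the squared deviations sum to 32, so the sample standard deviation is the square root of 32 / 7.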
Another way to interpret the standard deviation is to use either of the following rules
of thumb, which work best when the data are roughly bell-shaped:
• Approximately 95% of a population lies between mean ± 2 × standard deviation
• Almost all (99.7%) of a population lies between mean ± 3 × standard deviation
Applying the two standard deviation rule we would expect 95% of customers to be
approximately between 32.90 ± (2×8.965) years, i.e. between 14.97 and 50.83 years.
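The arithmetic behind this interval can be reproduced with a few lines of Python. The function name sd_interval is made up for this sketch; the mean (32.90) and standard deviation (8.965) are the values reported above for Age.

```python
def sd_interval(mean, sd, k=2):
    """Return (lower, upper) for mean ± k × standard deviation."""
    margin = k * sd
    return mean - margin, mean + margin


# Values reported for Age in the lecture.
low, high = sd_interval(32.90, 8.965, k=2)
print(round(low, 2), round(high, 2))  # 14.97 50.83
```

Calling the same function with k=3 gives the wider "almost all" interval.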
[Figure: histogram with summary statistics. Annotations: shape is skewed; typical value; unusually high value.]
[Figure: boxplot. Annotations, top to bottom: mild outlier (51st observation); ‘Max’; upper quartile; median; lower quartile; ‘Min’.]
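The quantities marked on a boxplot can be sketched in Python as below. This is an illustration under stated assumptions: it flags as outliers any values more than 1.5 × the interquartile range beyond the quartiles (a widely used boxplot convention), and it uses the "inclusive" quartile method from Python's statistics module, which may differ slightly from the method SPSS uses. The function name boxplot_summary and the ten data values are invented.

```python
import statistics


def boxplot_summary(values):
    """Quartiles, whisker ends and outliers using the 1.5 x IQR convention."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1                      # interquartile range (the box length)
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    inside = [x for x in data if lower_fence <= x <= upper_fence]
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {
        "whisker_low": min(inside),    # 'Min': smallest value that is not an outlier
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_high": max(inside),   # 'Max': largest value that is not an outlier
        "outliers": outliers,
    }


# Ten invented ages, for illustration only.
summary = boxplot_summary([20, 22, 25, 28, 30, 31, 33, 35, 38, 55])
print(summary)
```

In this made-up sample the value 55 lies above the upper fence, so it is plotted separately as an outlier and the upper whisker stops at 38. This is why 'Min' and 'Max' on a boxplot are not always the smallest and largest values in the data.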
Formal report writing
It is common to formally report means and medians to one more decimal place
than the data was recorded to and for standard deviations to two more decimal
places.
It is also common to use the following abbreviations:
• sample size: N
• mean: M
• median: Mdn
• standard deviation: SD

For example:
The data set comprised the age (years) of 80 customers (M = 32.9, SD = 8.97).
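The rounding convention can be applied mechanically. The Python sketch below is one possible way to do it: the function name format_report is invented, and it uses the decimal module with round-half-up so that 8.965 rounds to 8.97 as in the example above. The inputs are the Age statistics reported in the lecture, passed as strings to avoid floating-point rounding surprises.

```python
from decimal import Decimal, ROUND_HALF_UP


def format_report(mean, sd, data_decimals=0):
    """Round the mean to one more, and the SD to two more, decimal places than the data."""
    mean_step = Decimal(10) ** -(data_decimals + 1)  # e.g. 0.1 for whole-number data
    sd_step = Decimal(10) ** -(data_decimals + 2)    # e.g. 0.01 for whole-number data
    m = Decimal(mean).quantize(mean_step, rounding=ROUND_HALF_UP)
    s = Decimal(sd).quantize(sd_step, rounding=ROUND_HALF_UP)
    return f"M = {m}, SD = {s}"


# Age was recorded in whole years, so data_decimals is 0.
text = format_report("32.90", "8.965", data_decimals=0)
print(text)  # M = 32.9, SD = 8.97
```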
Summarising Expenditure
How many people were in the sample? 80
What was the smallest expenditure? £2.85
What was the largest expenditure? £66.20
What was the mean expenditure? £33.42
Were there any unusual expenditure amounts? No outliers

Relationships between dependent and independent variables
A dependent variable (or response variable) is an outcome of interest whose
value is believed to depend upon an independent variable (or explanatory
variable).
Does Favourite coffee depend on Gender, or does Gender depend on Favourite coffee?
Does Income explain Expenditure, or does Expenditure explain Income?
For the coffee data we have the following:

• Dependent variables - Favourite and Expenditure
• Independent variables – Gender, Age and Income
• The scatterplot above clearly shows a relationship between expenditure and age;
there is a tendency for expenditure to increase with age.
• Note that there is a graphing convention to put the dependent variable on the y-axis
and the independent variable on the x-axis.
• Cross-tabulation and clustered bar charts above reveal that females clearly favour
latte (44.9%) and to a lesser extent cappuccino (34.7%) whereas males favour
mainly cappuccino (58%).
• For both genders Americano is the least popular coffee (female 8.2%, male 6.5%).
• The multiple boxplots above show that the high earners (£40,000 or more) appear,
on average, to spend more than the others.
• The expenditure for this income group also appears to vary over a smaller range
than the others.
• Interestingly the customer who spent the most comes from the lowest income group.
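A cross-tabulation with row percentages, like the one summarised in the bullets above, can be sketched in Python as follows. The function name row_percentages and the eight (gender, favourite) pairs are invented for illustration; they are not the coffee shop data, so the percentages differ from those quoted.

```python
from collections import Counter, defaultdict


def row_percentages(pairs):
    """Cross-tabulate (row, column) pairs and express each row as percentages."""
    counts = defaultdict(Counter)
    for row_value, column_value in pairs:
        counts[row_value][column_value] += 1
    table = {}
    for row_value, row_counts in counts.items():
        row_total = sum(row_counts.values())
        table[row_value] = {
            column: round(100 * count / row_total, 1)
            for column, count in row_counts.items()
        }
    return table


# Eight invented customers: (independent variable, dependent variable).
customers = [
    ("Female", "Latte"), ("Female", "Latte"),
    ("Female", "Cappuccino"), ("Female", "Americano"),
    ("Male", "Cappuccino"), ("Male", "Cappuccino"),
    ("Male", "Cappuccino"), ("Male", "Latte"),
]

crosstab = row_percentages(customers)
print(crosstab)
```

Percentages are taken within each category of the independent variable (Gender), so that the distribution of the dependent variable (Favourite) can be compared fairly between groups of different sizes.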
Computer labs
Two of this week's computer lab tasks relate to what you have seen in this lecture.
This week will give you an introduction to statistics using SPSS.
In week 5 you have the first in-class online test that uses SPSS.
