
Section 2

Descriptive Statistics
Part 2: Descriptive Measures

Learning Objectives
Measures of centre
Measures of dispersion (spread)
Standardized (Z) scores
Identifying potential outliers
Box & whisker plots (boxplots)

Goal 2 of Descriptive Statistics
In part 1, we looked at grouping and
graphing data.
Methods depended on data type.

In part 2, we'll look at deriving
summary statistics for data:
Measures of centre
Measures of spread

Again, methods depend on data type.

Measures of Centre
Most common measure of centre is the
MEAN (average).

Mean =
(sum of numbers)/(# of numbers)

Aside: Sigma Notation


Mathematical short-hand notation for
writing long sums.
If you have, say, n data points, called

$x_1, x_2, x_3, \ldots, x_n$
Then:

$\sum x_i = x_1 + x_2 + \cdots + x_n$

Sigma Notation: Example


Suppose you have 4 data points (n =
4):
5, 9, 13.5, 18.2

Then:

$x_1 = 5,\ x_2 = 9,\ x_3 = 13.5,\ x_4 = 18.2$

$\sum x_i = x_1 + x_2 + x_3 + x_4 = 5 + 9 + 13.5 + 18.2 = 45.7$

Sigma Notation
In other words, sigma notation just
means add them up.

Formula for the Mean


Words:
mean = sum of numbers/# of numbers
Sigma Notation:
$\bar{x} = \dfrac{\sum x_i}{n}$
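As a minimal sketch of the mean formula in Python, using the four data points from the sigma notation example (the variable names are illustrative only and are not part of the course material):

```python
# Data points from the sigma notation example.
data = [5, 9, 13.5, 18.2]

total = sum(data)   # sigma notation: add them up
n = len(data)       # number of data points
mean = total / n    # mean = (sum of numbers) / (# of numbers)

print("sum  =", round(total, 4))
print("mean =", round(mean, 4))
```

The values are rounded only for display, since floating-point sums can differ from the hand calculation in the last few digits.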

Median
The value that separates the top and
bottom halves of ORDERED data.
To find the median ($\tilde{x}$):
Arrange data into increasing order.
Odd # of data points: Take middle number.
Even # of data points: Take average of the
two middle numbers.
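The procedure above can be sketched in Python as follows. The function name `median` and the small datasets are hypothetical, chosen only for illustration:

```python
def median(values):
    """Median by the rule above: sort, then take the middle value(s)."""
    if not values:
        raise ValueError("median of an empty data set is undefined")
    ordered = sorted(values)          # arrange into increasing order
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd: take the middle number
        return ordered[mid]
    # even: average of the two middle numbers
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))      # 2
print(median([4, 1, 3, 2]))   # 2.5
```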

Median: Example
Suppose you have the following data:
4, 6, 0, 3.1, 5.5, 7, 4

Step 1: Arrange into increasing order:


0, 3.1, 4, 4, 5.5, 6, 7

Step 2: Cross a number off the left and
right simultaneously until you get down
to only 1 or 2 numbers.
$\tilde{x} = 4$

Median: Example
Suppose you now have this data:
5, 1, 8, 9, 0, 7, 2, 1

Step 1: Arrange.
0, 1, 1, 2, 5, 7, 8, 9

Step 2: Cross off.

$\tilde{x} = \dfrac{2 + 5}{2} = 3.5$

Mode
Most frequent data point.
A data set can have any number of
modes.
0 modes: All data points occur once.
1 mode: One observation occurs more
than the others.
2 modes: Two observations occur equally
more often than the others.
Etc.
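A rough sketch of mode-finding in Python, assuming the convention above that a data set in which every value occurs once has no mode. The function name `modes` and the sample data are hypothetical:

```python
from collections import Counter

def modes(values):
    """Return a sorted list of the most frequent values (possibly empty)."""
    if not values:
        return []
    counts = Counter(values)          # value -> number of occurrences
    top = max(counts.values())        # highest frequency seen
    if top == 1:
        return []                     # all data points occur once: 0 modes
    return sorted(v for v, c in counts.items() if c == top)

print(modes([2, 3, 3, 7]))        # [3]
print(modes([2, 2, 3, 3, 7]))     # [2, 3]
print(modes([1, 2, 3]))           # []
```

Note that when every value occurs the same number of times (more than once), this sketch reports all of them as modes; that edge case is a choice made here, not something the definition above settles.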

Mode
Example: For the following data sets,
find the mode.
(a) 5, 9, 1, 4, 9, 4, 9, 6.
(b) 5, 9, 1, 4, 9, 4, 9, 6, 4.
(c) 1, 2, 3, 4, 5, 6, 7.


In Class Exercise 2.2.1:


Sensitivity to Outliers
Consider the following data:
20, 21, 25, 25, 26, 27, 28, 29.

Answer the following questions.


Calculate the mean.
Calculate the median.
Find the mode.


In Class Exercise 2.2.1:


Sensitivity to Outliers
Now consider the same data set, but
with an outlier:
20, 21, 25, 25, 26, 27, 28, 29, 331.

Recalculate the mean, median, and mode.
Which of the mean, median, or mode is
most sensitive to (most affected by)
the outlier?

Robustness
Measures that are NOT sensitive to
outliers are called Robust.
Therefore, the following measures of
centre are robust:
Median
Mode

Which Measure of Centre is Best?
Depends on your data.
If you have a few outliers:
Generally use the MEDIAN.

If the data is CONTINUOUS with NO
outliers, then the MEAN is best.
Qualitative data:
Only the MODE is possible.

Mode NOT normally used for
continuous data.

Measures of Centre:
Graphically
Where are the measures of centre on
the following distributions?


Using Minitab
Minitab quickly calculates measures of
centre for you (seen in section 1 for the
average of the circles).


Measures of Spread
Does the mean or median tell you how
your data is spread out (dispersed)?
NO! For example:
Consider the following data:
49, 50, 51
Mean = 50, median = 50.

Now, suppose this is your data:


0, 50, 100
Mean = 50, median = 50.


Measures of Spread
It is very important in statistical
analyses to be able to describe how the
data is spread out.
Three main ways of doing this:
Range
Standard Deviation
Interquartile Range

Range
Range: Max value - Min value of the
data.
Example: 34, 10, 49, 28, 51, 19.
Range = 51 - 10 = 41.
Is it sensitive to outliers?

Standard Deviation
Range only involves the maximum and
minimum observations.
It therefore ignores ALMOST ALL of
your data!
Standard deviation takes ALL
observations into account.
Measures how much, on average, each
value differs from the mean.

Calculating Sample Standard Deviation
1. Find the mean of the data.
2. Find the DIFFERENCE between each
data point and the mean. These are
called RESIDUALS.
3. Square ALL residuals from step 2.
4. ADD all the numbers from step 3.
5. DIVIDE by n - 1. This gives SAMPLE
VARIANCE.
6. Finally, take the SQUARE ROOT to get
the sample standard deviation.

$S = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$
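The six steps can be sketched in Python as below. This is only an illustration of the hand calculation: the function names `sample_variance` and `sample_std` and the small dataset are hypothetical.

```python
import math

def sample_variance(values):
    """Sample variance, following steps 1 to 5 of the hand calculation."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(values) / n                    # step 1: mean
    residuals = [x - mean for x in values]    # step 2: residuals
    squared = [r ** 2 for r in residuals]     # step 3: square them
    total = sum(squared)                      # step 4: add them up
    return total / (n - 1)                    # step 5: divide by n - 1

def sample_std(values):
    """Sample standard deviation: step 6, the square root of the variance."""
    return math.sqrt(sample_variance(values))

data = [2, 4, 4, 6]   # hypothetical data, for illustration only
print(round(sample_variance(data), 2))
print(round(sample_std(data), 2))
```

As in the hand method, nothing is rounded until the final answer is displayed.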

Example Calculation
A company conducted a survey to
determine how long it takes their
employees to get to work. The data is
recorded in minutes. Find the variance
and standard deviation of the data.
Include units.
13.0, 17.5, 24.6, 18.0, 20.4, 17.7.


Example Calculation
Step 1: Find the mean. DO NOT
round intermediate calculations, but
round your final answer (in this case,
the standard deviation) to two decimal
places.

$\bar{x} = \dfrac{13 + 17.5 + 24.6 + 18 + 20.4 + 17.7}{6} = 18.53333333$ minutes

Example Calculation
(Data was 13, 17.5, 24.6, 18, 20.4, 17.7 with
mean 18.53333333).
For the rest of the calculation, use a table:

$x_i$     $x_i - \bar{x}$     $(x_i - \bar{x})^2$
13        -5.53333333         30.61777774
17.5      -1.03333333          1.06777777
24.6       6.06666667         36.80444449
18        -0.53333333          0.28444444
20.4       1.86666667          3.48444446
17.7      -0.83333333          0.69444444

TOTAL = 72.95333334

Example Calculation
Divide that total by n - 1. This gives
variance.
Variance: $S^2 = \dfrac{72.95333334}{5} = 14.59066667$ minutes$^2$

Take the square root to get standard
deviation. (As specified earlier, round to
2 decimal places).
Standard Deviation: $S = \sqrt{14.59066667} \approx 3.82$ minutes

In Class Exercise 2.2.2


The height data from 5 UPEI students
are (in inches): 65, 75, 71, 68, 66.
Calculate the variance and standard
deviation. Include units in each.

$S = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

Standard Deviation:
Graphically
The mean measures where the
CENTRE of your data is.
Standard deviation measures how
SPREAD OUT your data is.
Large S => lots of spread, and vice
versa.


Standard Deviation:
Graphically
Example: The datasets graphed on the
next slide have the same mean, and
are graphed on the same scales.
Which one has the larger standard
deviation?

[Figure: the two datasets, with the same mean, graphed on the same scales.]

Standard Deviation: Why Square?
Recall that to calculate S, we have to
square the residuals:

$(x_i - \bar{x})^2$

To see why, determine what would
happen if we DIDN'T square them.

Standard Deviation: Why Square?
Consider the data: 2, 5, 10, 12, 17.
Mean = 9.2

$x_i$     $x_i - \bar{x}$
2         -7.2
5         -4.2
10         0.8
12         2.8
17         7.8

Total = 0 (Always Happens!)

Remember for Later


The total (and therefore, the
MEAN) of a set of RESIDUALS
is ALWAYS 0!
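A short derivation of why this always happens, using only the definition of the mean, $\bar{x} = \frac{\sum x_i}{n}$ (the sums run over all n data points):

```latex
\begin{align*}
\sum (x_i - \bar{x})
  &= \sum x_i - \sum \bar{x} \\
  &= \sum x_i - n\,\bar{x} \\
  &= \sum x_i - n \cdot \frac{\sum x_i}{n} \\
  &= 0
\end{align*}
```

The positive and negative residuals cancel exactly, which is why they are squared before being added.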


Sample Standard Deviation


As you can see, calculating S is fairly
tedious by hand.
Minitab can do this quickly!
It's one of the calculations that are
done using the Stat > Basic Statistics menu
(the same way we found the mean in
section 1).

Statistics Vs. Parameters


We use descriptive measures (mean,
standard deviation, etc.) of samples to
ESTIMATE the descriptive measure of a
population.
Statistic: A descriptive measure for a
SAMPLE.
Parameter: A descriptive measure for
a POPULATION.

Statistics Vs. Parameters: Notation
The SAMPLE mean is $\bar{x}$.

We use it to estimate the POPULATION
mean: $\mu$ (Greek letter "mu").

Statistics Vs. Parameters: Notation
The SAMPLE variance and standard
deviation are the statistics
$S^2$ and $S$.

Use them to estimate the
POPULATION variance and standard
deviation, the parameters
$\sigma^2$ and $\sigma$ (Greek letter "sigma").

Population Parameters
The MEAN of a POPULATION is
calculated in the same way as for a
SAMPLE.
The STANDARD DEVIATION of a
POPULATION is calculated slightly
differently from that of a sample.

Standard Deviation Formulas

Sample Standard Deviation:
$S = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n - 1}}$

Population Standard Deviation:
$\sigma = \sqrt{\dfrac{\sum (x_i - \mu)^2}{n}}$

Slight Difference: sample mean vs.
population mean.
Main Difference: n - 1 vs. n.
Reason is given in section 4.
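To make the n - 1 vs. n difference concrete, here is a small Python sketch. The function `std_dev` and its `population` flag are hypothetical names introduced for illustration, and the four data values are made up:

```python
import math

def std_dev(values, population=False):
    """Standard deviation; divides by n for a population, n - 1 for a sample."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two data points")
    mean = sum(values) / n
    sum_sq = sum((x - mean) ** 2 for x in values)   # sum of squared residuals
    divisor = n if population else n - 1            # the main difference
    return math.sqrt(sum_sq / divisor)

data = [3, 5, 8, 12]   # hypothetical data
print(round(std_dev(data), 2))                    # treated as a sample
print(round(std_dev(data, population=True), 2))   # treated as a population
```

For the same data, the sample version is always a little larger than the population version, because it divides the same total by a smaller number.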

Standardized (Z) Scores


Suppose you somehow have
population data, with the following
parameters:

$\mu = 17$, $\sigma = 2$

Suppose you have an observation, say,
x = 25. How many standard deviations
away from the mean is this value?

Standardized (Z) Scores
To do this, first think about the
DISTANCE between the observation
and the mean: $x - \mu = 25 - 17 = 8$.
Next, figure out how many standard
deviations this is: $8 / 2 = 4$.

Standardized (Z) Scores


A Z score gives a formula to
determine this information.
Z = Number of standard deviations an
observation is from the mean.
From our thought process,
Z = (distance from mean)/(St.Dev.), or:

$Z = \dfrac{x - \mu}{\sigma}$

Example

Suppose you have the following data
for a population.

$\mu = 28.1$, $\sigma = 5.83$
Calculate the Z scores of the data
points: x = 39.4, x = 13.6, x = 28.1
What is the significance of the SIGN
(+, -, or zero) of your Z score?

Z Scores
If a Z score is POSITIVE, then the
observation it came from was ABOVE
the mean.
If a Z score is NEGATIVE, then the
observation was BELOW the mean.
If a Z score is ZERO, then the
observation EQUALLED the mean.
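A minimal Python sketch of the Z score calculation and the sign rule above. The helper names `z_score` and `describe` are hypothetical; the numbers are the earlier population with mean 17, standard deviation 2, and observation x = 25:

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies from the mean mu."""
    if sigma <= 0:
        raise ValueError("standard deviation must be positive")
    return (x - mu) / sigma

def describe(z):
    """Interpret the sign of a Z score."""
    if z > 0:
        return "above the mean"
    if z < 0:
        return "below the mean"
    return "equal to the mean"

z = z_score(25, mu=17, sigma=2)
print(z, describe(z))   # 4.0 above the mean
```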

In Class Exercise 2.2.3


For the following population data:
6.6, 10.4, 11.7, 15.3

It's easily calculated that

$\mu = 11$, $\sigma = 3.1$
(a) Calculate the Z scores of ALL the
observations (round each to 1 decimal).
(b) Find the mean and population standard
deviation of these Z scores (1 decimal).

Remember for Later


The MEAN of a set of Z
scores is 0.
The STANDARD DEVIATION
of a set of Z scores is 1.
This information will be used in
section 4 and for the rest of the
course!

Interquartile Range
Standard deviation uses the MEAN to
determine spread.
Interquartile Range uses the MEDIAN.
Also provides a method of determining
potential outliers.


Quartiles: Definitions
A common practice in statistics is to
divide your data into QUARTERS.
First Quartile (Q1): The observation
below which is the bottom 25% of the
data.
Second Quartile (Q2): Same, but 50%.
Note that Q2 = the median.

Third Quartile (Q3): Same, but 75%.

How to find Quartiles


Find Q2 (the median) first.
Q1 = median of data to the LEFT of Q2.
Ignore Q2 and everything to its right.

Q3 = median of data to the RIGHT of Q2.
Ignore Q2 and everything to its left.
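The by-hand rule above (median of each half, leaving Q2 itself out when it is an actual data point) can be sketched in Python like this. The function names and the sample data are hypothetical, and statistical software may use a different quartile rule and give slightly different values:

```python
def median(ordered):
    """Median of a list that is already in increasing order."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves hand method."""
    if len(values) < 2:
        raise ValueError("need at least two data points")
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]                 # data to the LEFT of Q2
    # Odd n: Q2 is a data point, so skip it. Even n: split down the middle.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)

print(quartiles([7, 1, 3, 9, 5]))      # (2.0, 5, 8.0)
print(quartiles([1, 3, 5, 7, 9, 11]))  # (3, 6.0, 9)
```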


Quartiles: Example
Find the quartiles for the following data
sets:
(a) 12, 34, 21, 9, 20, 16, 80, 45, 32.
(b) 105, 51, 142, 88, 100, 97.

First: Arrange
(a) 9, 12, 16, 20, 21, 32, 34, 45, 80.
(b) 51, 88, 97, 100, 105, 142.

Now, just find medians (Q2 first)!



5 Number Summary
The following five values of a data set
make up its 5 number summary:
Minimum
Q1
Q2
Q3
Maximum


5 Number Summary
Example: Find the 5 number
summary for the data set (a) of our
quartile example.
You must arrange the data first, which was:
9, 12, 16, 20, 21, 32, 34, 45, 80.

Min=9, Q1=14, Q2=21, Q3=39.5,
max=80.
5 number summary written in curly
brackets: {9, 14, 21, 39.5, 80}.

Interquartile Range (IQR)


IQR = Q3 - Q1
Gives an idea of the spread of the inner
half of the data.
Example: For our dataset (a) of the
quartile example,
IQR = 39.5 - 14 = 25.5

Upper and Lower Limits


The upper and lower limits use the IQR
to give a method of determining
potential outliers.
Lower Limit (LL) = Q1 - 1.5(IQR)
Upper Limit (UL) = Q3 + 1.5(IQR)

Potential Outliers
Data that falls WITHIN the upper and
lower limits is considered OK.
The following data points are
considered POTENTIAL OUTLIERS:
Higher than UL.
Lower than LL.
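A small Python sketch of the limit rule, assuming Q1 and Q3 have already been found by hand. The function names are hypothetical; the quartiles and data come from dataset (a) of the quartile example:

```python
def outlier_limits(q1, q3):
    """Return (lower limit, upper limit) from the quartiles."""
    iqr = q3 - q1                        # interquartile range
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def potential_outliers(values, q1, q3):
    """Data points lower than LL or higher than UL."""
    ll, ul = outlier_limits(q1, q3)
    return [x for x in values if x < ll or x > ul]

data = [9, 12, 16, 20, 21, 32, 34, 45, 80]
print(outlier_limits(14, 39.5))             # (-24.25, 77.75)
print(potential_outliers(data, 14, 39.5))   # [80]
```

A point exactly equal to a limit is treated as within the limits here, matching the "higher than" and "lower than" wording above.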


Example: Potential Outliers


For dataset (a) of the quartile example,
find the potential outliers, if any.
(Ordered) data was:
9, 12, 16, 20, 21, 32, 34, 45, 80.


Example: Potential Outliers


(Continued)
5 number summary was
{9, 14, 21, 39.5, 80}
IQR = Q3 - Q1 = 25.5.
LL = Q1 - 1.5(IQR) = 14 - 1.5(25.5)
= -24.25
UL = Q3 + 1.5(IQR) = 39.5 + 1.5(25.5)
= 77.75

Example: Potential Outliers


(Continued)
Therefore, all data points BETWEEN
-24.25 and 77.75 are OK:
[Number line: potential outliers | LL = -24.25 | data in here is OK | UL = 77.75 | potential outliers]

Example: Potential Outliers


(Continued)
Therefore, the only potential outlier for
that dataset is 80.

Adjacent Values
Adjacent values are the two values
WITHIN the LL and UL, but CLOSEST
to them.
Example: In our quartile example:
9, 12, 16, 20, 21, 32, 34, 45, 80:
LL was -24.25 and UL was 77.75.

Thus, the adjacent values are 9 and 45.
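The definition of adjacent values can be sketched in Python as below. The function name `adjacent_values` is hypothetical, and the limits are passed in rather than computed:

```python
def adjacent_values(values, lower_limit, upper_limit):
    """Smallest and largest data points that lie within the limits."""
    inside = [x for x in values if lower_limit <= x <= upper_limit]
    if not inside:
        raise ValueError("no data points fall within the limits")
    return min(inside), max(inside)

data = [9, 12, 16, 20, 21, 32, 34, 45, 80]
print(adjacent_values(data, -24.25, 77.75))   # (9, 45)
```

The whiskers of a modified boxplot would end at these two values.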

Modified Boxplots
A way to picture the 5 number
summary, adjacent values, and
potential outliers.
Modified boxplot:
Make a box from the three quartiles.
The whiskers (lines) are drawn from the
box to the ADJACENT values.
Mark the potential outliers as *.

Modified Boxplots
[Figure: modified boxplot drawn on a vertical scale from 0 to 80.]
Data (arranged):
9, 12, 16, 20, 21, 32, 34, 45, 80.
5 number summary was {9,
14, 21, 39.5, 80}
Adjacent values: 9, 45
Potential outlier: 80

1. Make a suitable scale.
2. Mark a small horizontal line for
each quartile and connect to
make a box.
3. Mark the adjacent values and
connect.
4. Mark a * for each potential
outlier (don't connect).

In Class Exercise 2.2.4


For the following dataset:
105, 10, 205, 88, 100, 97, 60, 127

Find:
5 number summary.
Lower limit and upper limit.
Adjacent values.
Potential outliers.

Make a modified boxplot.



Boxplots in Minitab
Minitab makes modified boxplots.
Warning: Its mechanism for finding Q1
and Q3 is a bit different from the way
we do it by hand.
Its results will be fairly close to what you
would find by hand.