
Math and Statistics 1

Unit 3: Describing positions in data sets


Instructors:
David Leonard, Ph.D. (david.leonard@modul.ac.at)
Daniel Dan, Ph.D. (daniel.dan@modul.ac.at)
Joanne Yu, MSc. (joanne.yu@modul.ac.at)

In the last session, we:
…used descriptive statistics to describe data sets in terms of:
• how large the values are (measures of location/central tendency):
  mean, median, mode
• how spread out the values are (measures of dispersion/variability):
  range, interquartile range, mean absolute deviation,
  variance, standard deviation, coefficient of variation
…noted that different notation is used to distinguish whether we
are using data coming from a sample or the whole population
…saw that the formulas for sample statistics are often different to
those for population parameters (so that we obtain an “unbiased
estimate” when extrapolating from samples to populations)

We continue using the same sample data today to explore how to describe
particular positions within data sets.

Example: concert tickets
We want to know the average amount that female students at MU spent on
concert tickets last year.
The population: all those about whom we want to make a general statement…
all female students at MU

If we asked them all, we would be conducting a census, but this takes too long

Instead, we ask a random sample (e.g. every 10th name from an alphabetical list)
and get the following data from 20 students:
60, 20, 40, 80, 30, 70, 100, 50, 50, 100, 60, 40, 50, 60, 30, 60, 70, 80, 40, 110

Because the sample was randomly selected from the population, we hope that
the sample is representative of the population. If that is the case, the statistics
calculated for the sample will approximate the parameters for the population.

We have described the data set using:
• Minimum, maximum, range
• Mean, median, mode
• Q1, Q3, IQR

[Figure: frequency histogram of the concert ticket data]

Class ($)      20   30   40   50   60   70   80   90   100   110
Frequency ƒ     1    2    3    3    4    2    2    0     2     1

$MAD = \frac{\sum_{i=1}^{n} |x_i - \bar{x}|}{n}$

$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

$CV = \frac{s}{\bar{x}} (100\%)$
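
As a quick check of these formulas, the following is a minimal Python sketch that applies them to the 20 concert ticket values listed on the slides. The variable names are chosen here for illustration only and do not come from the course material.

```python
import math

# The 20 sample values (in $), in the ordered form used on the slides
data = [20, 30, 30, 40, 40, 40, 50, 50, 50, 60,
        60, 60, 60, 70, 70, 80, 80, 100, 100, 110]

n = len(data)
mean = sum(data) / n

# Mean absolute deviation: average distance from the mean
mad = sum(abs(x - mean) for x in data) / n

# Sample variance and standard deviation (divide by n - 1)
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
std_dev = math.sqrt(variance)

# Coefficient of variation: standard deviation as a percentage of the mean
cv = std_dev / mean * 100

print(f"mean = {mean}, MAD = {mad}")
print(f"s^2 = {variance:.2f}, s = {std_dev:.2f}, CV = {cv:.1f}%")
```

The sample standard deviation comes out at a little under 25, which agrees with the rounded value s ≈ 25 used later in the z score slides.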

Next we describe other positions in the data set, starting with a new
visualization: the box plot.

Box plots
A box plot is another way of graphically displaying a distribution. It provides a
summary of the distribution by using five numbers: the minimum value, Q1, Q2,
Q3, and the maximum value, plus any outliers.

[Figure: box plot with the labels min., Q1, Q2, Q3, max. and outlier]

On a number line, the box extends from Q1 to Q3
(50% of the values will fall within this range).

A line within the box marks the median (Q2).

From each end of the box, lines (called whiskers) extend
to the min. and max. values (not including outliers).

Beyond the ends of the whiskers, outliers are marked as dots/asterisks.
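
The five numbers behind a box plot can be sketched in Python as follows. This is only an illustration: the function name `five_number_summary` is hypothetical, and it assumes the quartiles are taken as the medians of the lower and upper halves of the ordered data, which is one common convention and reproduces the Q1, Q2 and Q3 values shown for this sample. It does not attempt to identify outliers.

```python
from statistics import median

def five_number_summary(values):
    """Return min, Q1, Q2, Q3 and max, using medians of the two halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of values, leave the middle value out of both halves
    upper = ordered[half:] if n % 2 == 0 else ordered[half + 1:]
    return {
        "min": ordered[0],
        "Q1": median(lower),
        "Q2": median(ordered),
        "Q3": median(upper),
        "max": ordered[-1],
    }

data = [20, 30, 30, 40, 40, 40, 50, 50, 50, 60,
        60, 60, 60, 70, 70, 80, 80, 100, 100, 110]

summary = five_number_summary(data)
print(summary)
```

For the concert ticket sample, the box would run from 40 to 75 with the median line at 60, and the whiskers would reach 20 and 110.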


Info. obtained from box plots (1)
Signs of symmetrical distributions:
Median is close to the center of the box
Whiskers are approximately the same length

Info. obtained from box plots (2)
Signs of positively (right) skewed distributions:
Median lies towards the left of the box
The right whisker is longer than the left

Info. obtained from box plots (3)
Signs of negatively (left) skewed distributions:
Median lies towards the right of the box
The left whisker is longer than the right

Quartiles: special Percentiles
We have seen how to divide our ordered set of values into quartiles.
Each quartile cuts off ¼ (25%) of the total observations.

20, 30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 60, 70, 70, 80, 80, 100, 100, 110

Q1 = P25 = 40
Q2 = P50 = 60 (median)
Q3 = P75 = 75

We can refer to Q1 as the 25th percentile (P25) because it cuts off 25%
of the values on the left of the distribution.
We can also calculate other percentiles within the data set.

Calculating Percentiles (1)

We might start with a given data value, and want to establish what
percentile it represents: let’s consider the value 80

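20, 30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 60, 70, 70, 80, 80, 100, 100, 110

15 of the 20 values lie below 80, so:

$\text{Percentile} = \frac{(\text{number of values below } x) + 0.5}{n} = \frac{15 + 0.5}{20} = 0.775 = 77.5\%$

The value 80 is at the 77.5th percentile (P77.5).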
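
The calculation for the value 80 can be sketched in Python as below. The function name `percentile_rank` is hypothetical and used only for illustration; it counts the values strictly below the given value, adds 0.5, and divides by the number of observations.

```python
def percentile_rank(values, x):
    """Percentile (in %) of x: (number of values below x + 0.5) / n * 100."""
    below = sum(1 for v in values if v < x)
    return (below + 0.5) * 100 / len(values)

data = [20, 30, 30, 40, 40, 40, 50, 50, 50, 60,
        60, 60, 60, 70, 70, 80, 80, 100, 100, 110]

rank_80 = percentile_rank(data, 80)
print(rank_80)  # 77.5
```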

Calculating Percentiles (2)

Alternatively, we might want to find the value in a data set that
corresponds to a certain percentile (Pk): let’s find P12, the 12th
percentile.

$\text{position} = \frac{n \cdot k}{100} = \frac{20 \cdot 12}{100} = 2.4$

Because the position (2.4) is not an integer, we round up to the next
whole number (3). The third value in the data set is the 12th percentile.

20, 30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 60, 70, 70, 80, 80, 100, 100, 110

P12 = 30 (the third value)

Calculating Percentiles (3)

Perhaps when we find the position corresponding to a certain percentile,
the result is a whole number: for example, let’s find P10.

$c = \frac{n \cdot k}{100} = \frac{20 \cdot 10}{100} = 2$

When the position (c) is an integer, such as 2, we identify the required
percentile as the average of the values in the c and c+1 positions
(here, the 2nd and 3rd values).

20, 30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 60, 70, 70, 80, 80, 100, 100, 110

P10 = (30 + 30) / 2 = 30

Just as P25, P50, and P75 are special percentiles (QUARTILES),
so P10, P20, P30…P90 are special percentiles (DECILES).
P50 = D5 = Q2 = median
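
Both cases (non-integer position and whole-number position) can be combined in one short Python sketch. The function name `percentile_value` is hypothetical, and the sketch assumes k is a whole number between 1 and 99; it follows the rule described on these slides, not the interpolation rules that some software packages use.

```python
def percentile_value(values, k):
    """Value at the k-th percentile, using the position rule c = n * k / 100."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    ordered = sorted(values)
    n = len(ordered)
    # whole = integer part of the position c, remainder tells us if c is whole
    whole, remainder = divmod(n * k, 100)
    if remainder == 0:
        # c is an integer: average the values in positions c and c + 1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # c is not an integer: round up to the next position
    return ordered[whole]

data = [20, 30, 30, 40, 40, 40, 50, 50, 50, 60,
        60, 60, 60, 70, 70, 80, 80, 100, 100, 110]

print(percentile_value(data, 12))  # position 2.4, rounded up to the 3rd value
print(percentile_value(data, 10))  # position 2, average of 2nd and 3rd values
```

Applied to k = 25, 50 and 75, the same rule gives 40, 60 and 75, the quartiles already found for this sample.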
z score or Standardized Score
A different way we can describe the position of a certain value in our
data set (student concert expenditures) is by allocating it a z-score

20, 30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 60, 70, 70, 80, 80, 100, 100, 110

The z-score tells us how many standard deviations a particular value is
above or below the mean.

SAMPLE z-score: $z = \frac{x - \bar{x}}{s}$

POPULATION z-score: $z = \frac{x - \mu}{\sigma}$

To calculate the z score for a particular value, subtract the mean of the
data set from the value and divide the result by the standard deviation.

z score calculation
We already know the sample mean and sample standard deviation for
our data set.
What is the z score for the value of 100?

20, 30, 30, 40, 40, 40, 50, 50, 50, 60, 60, 60, 60, 70, 70, 80, 80, 100, 100, 110

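SAMPLE (with $\bar{x} = 60$ and $s \approx 25$):

$z = \frac{x - \bar{x}}{s} = \frac{100 - 60}{25} \approx 1.6$

The value of 100 is located about 1.6 standard deviations above the mean (we
know it is above the mean, because the z score is positive).

The value of 20 is located about 1.6 standard deviations below the
mean (z ≈ −1.6).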

z scores are affected by outliers, because outliers directly affect both
the mean and the standard deviation.
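
The z scores for 100 and 20 can be reproduced with a short Python sketch. The function name `z_score` is hypothetical; it relies on `statistics.stdev`, which computes the sample standard deviation (dividing by n − 1), so it corresponds to the sample z score.

```python
from statistics import mean, stdev

def z_score(x, values):
    """Sample z score: distance of x from the mean, in standard deviations."""
    return (x - mean(values)) / stdev(values)

data = [20, 30, 30, 40, 40, 40, 50, 50, 50, 60,
        60, 60, 60, 70, 70, 80, 80, 100, 100, 110]

print(round(z_score(100, data), 1))  # 1.6
print(round(z_score(20, data), 1))   # -1.6
```

A value equal to the mean (60) has a z score of 0.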

z score visualisation

The z score tells us the position of the data value relative to the mean…
how far the value is above (+) or below (-) the mean

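[Figure: number line showing z scores relative to the mean]

z score = −1: $\bar{x} - s \approx 35$
z score = 0: $\bar{x} = 60$
z score = 1: $\bar{x} + s \approx 85$
z score = 1.6: x = 100, because 100 − 60 = 40 = 1.6 ∗ s (with s ≈ 25)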

Practice questions

For z scores and percentiles, use Jaisingh Chapter 4, pages 86 to 96.
