Statistics

Descriptive Statistics

Shaheena Bashir

FALL, 2019
Outline

Introduction

Measures of Location
Mean
Median

Measures of Dispersion
Standard Deviation
Quantiles
Five Number Summary
Chebyshev’s Rule

Introduction

Background

- The graphical techniques are useful in capturing a general
  sense of the important features in a one-variable data
  collection.
- None of them is designed to provide us with more numerically
  descriptive information about the data.
- We need better ways of condensing or summarizing data
  collections so that we can interpret and compare them more
  effectively. This often requires a number of different
  numerical data summaries.

Measures of Location

Mean

If we interpret the visual center of a data collection to be the
balance point where data values larger than the center are equally
balanced by those that are smaller than the center, the numerical
average or mean is a natural statistic for identifying and measuring
the center.

This is the case, for example, when our interpretation of the visual
center corresponds to a value for which the numerical contribution
from data points that are greater than the ‘center’ is equally
balanced by the numerical contribution from those that are less
than it. In such settings, the appropriate statistic to measure this
‘visual center’ is naturally the average, or mean, of the collected
observations.


The mean of the observations in a data collection is their numerical
average, that is, the sum of the data values divided by the
number of observations in the data collection. So, if x_1, x_2, ..., x_n
are the n observations in our data collection, their mean is

\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\,(x_1 + x_2 + \cdots + x_n)
\]

The population mean is also computed the same way but is
denoted as µ.

Mean: Example

Consider the data collection [1.9, 2.5, 3.6, 3.8, 3.2]. The mean for
this data collection is

\[
\bar{x} = \frac{1.9 + 2.5 + 3.6 + 3.8 + 3.2}{5} = \frac{15.0}{5} = 3
\]


Mean: Limitations

- For many data collections, the mean adequately locates the
  dominant visual center of the data.
- The mean, being a numerical average of all the observations, is
  sensitive to unusually large or unusually small observations in
  the data collection, especially if the total number of
  observations is not large.

For example, consider the data collection [1.9, 2.5, 3.6, 3.8, 18.2].
The mean for this data collection is

\[
\bar{x} = \frac{1.9 + 2.5 + 3.6 + 3.8 + 18.2}{5} = \frac{30.0}{5} = 6
\]

The single large value 18.2 pulls the mean above four of the five
observations.
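
The two mean calculations can be checked with a short Python sketch. The function name `mean` and the variable names are illustrative choices, not part of the original slides.

```python
def mean(values):
    """Arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("mean of an empty collection is undefined")
    return sum(values) / len(values)


# Data collections taken from the two examples in the slides.
typical = [1.9, 2.5, 3.6, 3.8, 3.2]
with_outlier = [1.9, 2.5, 3.6, 3.8, 18.2]

mean_typical = mean(typical)
mean_outlier = mean(with_outlier)

# Rounded for display; floating-point sums may differ in the last digits.
print(round(mean_typical, 2), round(mean_outlier, 2))
```

Replacing a single observation (3.2 by 18.2) doubles the mean, which is the sensitivity described above.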

Median
A measure of the center that is less sensitive to unusually large or
small observations than the mean is provided by the median.
The median divides the ordered data values into two equal-sized halves.
To find the median, x̃, of a data collection x_1, ..., x_n:
1. Sort the n data values in order from smallest to largest, i.e.,
   x_(1), x_(2), ..., x_(n).
2. If n is odd, the median, x̃, is the single value in the middle of
   this ordered list, i.e.,
   \[ \tilde{x} = x_{\left(\frac{n+1}{2}\right)} \]
3. If n is even, there are two ‘middle values’, and the median, x̃,
   is their average, i.e.,
   \[ \tilde{x} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2} \]
Since the median is the midpoint of the data, 50% of the values
are below it. Hence, the median is also the 50th percentile.

Median: Example

Consider the following data collections:

- [1.9, 2.5, 3.6, 3.8, 3.2]. The sorted values are
  [1.9, 2.5, 3.2, 3.6, 3.8]. Since n = 5 is odd, the median is the
  middle value, x̃ = 3.2.
- [1.9, 2.5, 3.6, 3.8, 3.2, 4.2]. The sorted values are
  [1.9, 2.5, 3.2, 3.6, 3.8, 4.2]. Since n = 6 is even, the median is
  the average of the two middle values, x̃ = (3.2 + 3.6)/2 = 3.4.
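
The sort-then-pick-the-middle procedure can be sketched in Python as follows. The function name `median` and the variable names are illustrative. The third data collection is the outlier example from the mean slides, reused here on the assumption that it makes a helpful comparison.

```python
def median(values):
    """Median via the odd/even rule on the sorted values."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty collection is undefined")
    mid = n // 2
    if n % 2 == 1:
        # n odd: the (n+1)/2-th ordered value, which is 0-based index n // 2.
        return ordered[mid]
    # n even: average of the (n/2)-th and (n/2 + 1)-th ordered values.
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([1.9, 2.5, 3.6, 3.8, 3.2])
even_median = median([1.9, 2.5, 3.6, 3.8, 3.2, 4.2])
outlier_median = median([1.9, 2.5, 3.6, 3.8, 18.2])

print(odd_median, round(even_median, 2), outlier_median)
```

For the collection containing 18.2 the median is 3.6, while its mean is 6, illustrating that the median is less sensitive to an unusually large observation.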

Measures of Dispersion

Variability/dispersion is what the field of statistics is all about


- Measures of location can be used to measure the center of a
  data collection. Consider, for example, the two data
  collections of n = 10 observations:

    Collection A: 3, 4, 5, 6, 7, 9, 10, 11, 12, 13;  x̄_A = 8
    Collection B: 7.5, 7.6, 7.7, 7.8, 7.9, 8.1, 8.2, 8.3, 8.4, 8.5;  x̄_B = 8

- Clearly, such measurements provide only partial information
  about the nature of the data collection: both collections have
  the same mean, yet the values in Collection A are far more
  spread out than those in Collection B.

Standard Deviation

[Figure: Collection A and Collection B plotted on a common axis (ticks at 4, 6, 8, 10, 12), showing the spread of the values about the mean.]



Standard Deviation (SD)


- The standard deviation of a data set, denoted by s, represents
  the typical distance from any point in the data set to the
  center.
- It is roughly the average distance from the center, and in this
  case, the center is the mean.



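\[
\mathrm{SD} = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n-1}}
\]

\[
\mathrm{Var} = \frac{\sum (x_i - \bar{x})^2}{n-1}
\]

Short Formula:

\[
\mathrm{Var} = \frac{1}{n-1}\left[\sum x_i^2 - \frac{\left(\sum x_i\right)^2}{n}\right]
\]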

- Calculation of the standard deviation of the population is
  similar, and it is denoted by σ
- The standard deviation can never be negative
- The standard deviation has the same units as the original
  data, while the variance is in squared units
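
As a rough check that the definitional and short formulas agree, here is a minimal Python sketch. The function names are illustrative, and the data are Collections A and B from the earlier dispersion slide.

```python
import math


def sample_variance(values):
    """Definitional formula: sum of squared deviations over (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sum((x - x_bar) ** 2 for x in values) / (n - 1)


def sample_variance_short(values):
    """Short formula: (sum of x^2 - (sum of x)^2 / n) over (n - 1)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    total = sum(values)
    total_sq = sum(x * x for x in values)
    return (total_sq - total ** 2 / n) / (n - 1)


def sample_sd(values):
    """Sample standard deviation: square root of the sample variance."""
    return math.sqrt(sample_variance(values))


collection_a = [3, 4, 5, 6, 7, 9, 10, 11, 12, 13]
collection_b = [7.5, 7.6, 7.7, 7.8, 7.9, 8.1, 8.2, 8.3, 8.4, 8.5]

sd_a = sample_sd(collection_a)
sd_b = sample_sd(collection_b)
print(round(sd_a, 3), round(sd_b, 3))
```

Both collections have mean 8, but the standard deviation of Collection A comes out about ten times that of Collection B. The short formula can lose precision in floating point when the values are large relative to their spread, so the definitional form is usually the safer choice in code.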

Standard Deviation (SD): Example

Consider the data set [1.9, 2.5, 3.6, 3.8, 3.2].

- The mean for the data set is x̄ = (1.9 + 2.5 + 3.6 + 3.8 + 3.2)/5 = 3
- Find the distance of the data points from the mean, i.e., x_i − x̄
- Square the distances, i.e., (x_i − x̄)²
- Sum the squared distances, i.e., Σ(x_i − x̄)²
- Divide the sum of the squared distances, Σ(x_i − x̄)², by
  n − 1 (the number of observations minus 1) to calculate the
  variance
- Take the square root of the variance to get the SD
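
The slide lists the steps without the numbers. Carrying them through for the data set [1.9, 2.5, 3.6, 3.8, 3.2] with x̄ = 3 gives the following worked arithmetic (the final value is rounded to two decimals):

```latex
\begin{align*}
x_i - \bar{x} &:\ -1.1,\ -0.5,\ 0.6,\ 0.8,\ 0.2 \\
(x_i - \bar{x})^2 &:\ 1.21,\ 0.25,\ 0.36,\ 0.64,\ 0.04 \\
\sum (x_i - \bar{x})^2 &= 1.21 + 0.25 + 0.36 + 0.64 + 0.04 = 2.50 \\
\mathrm{Var} &= \frac{2.50}{5 - 1} = 0.625 \\
\mathrm{SD} &= \sqrt{0.625} \approx 0.79
\end{align*}
```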


Quantiles

Quantiles are cut points dividing the ordered observations in a
sample into equal parts. There is one fewer quantile than the
number of groups created.
- The median splits the data into two equal-sized halves
- The quartiles split the data into quarters
- The deciles split the data into tenths
- The percentiles split the data into 100 equal parts. A value’s
  percentile tells what percent of the data is below that value.


Quantiles: Example

The following quantiles are from an R output based on the scores of a
class of 200 students:

  0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
41.0  49.0  51.0  52.0  53.0  54.0  55.0  57.0  58.2  62.0  70.0

Identify the 30th percentile value and interpret it.


Five Number Summary


Min   minimum value in data set
Q1    1st quartile = 25th percentile
Q2    2nd quartile = 50th percentile = median
Q3    3rd quartile = 75th percentile
Max   maximum value in data set

The five number summary is a set of five descriptive statistics that
divide the data set into four sections, each containing roughly a
quarter of the observations.
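
A minimal Python sketch of the five number summary, using the standard library's `statistics.quantiles`. The choice of `method="inclusive"` (linear interpolation between order statistics) is an assumption; the slides do not say which quartile convention is intended, and other conventions can give slightly different Q1 and Q3. The function name and the use of Collection A are illustrative.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a data collection."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    # n=4 asks for the three cut points that split the data into quarters.
    q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return (min(values), q1, q2, q3, max(values))


# Collection A from the dispersion slide.
collection_a = [3, 4, 5, 6, 7, 9, 10, 11, 12, 13]
summary = five_number_summary(collection_a)
print(summary)
```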


Graphical Presentation: Multiple Groups


- It is usually hard to compare multiple histograms by eye

[Figure: "Cabbage Head Weight Data", histograms for cultivars c39 and c52; x-axis: Head Weight (KG), 0 to 10; y-axis: Frequency.]


Box-and-Whisker Plot
- A box plot is a way to plot data that simplifies comparison
  between groups.
- There is a vertical box whose height corresponds to the
  interquartile range of the data (the width is chosen only to
  make the figure easy to read).
- A horizontal line inside the box marks the median.
- The behavior of the rest of the data is indicated with whiskers
  that extend from the ends of the box to the maximum and
  minimum data values.
- Outliers, if any, are identified by shortening the whisker to
  the most extreme observation that is not a potential outlier.
- Each data set is represented by a vertical structure, making it
  easy to show multiple data sets on one plot and to interpret
  the plot.
Box Plot

[Figure: box plots of the cabbage data, one per cultivar; x-axis: Cultivar Type (c39, c52); y-axis: Head Weight (KG), 1.0 to 4.0.]


Range & Interquartile Range (IQR)

- Range: max − min
- IQR: For a long-tailed distribution, the IQR is taken as the
  measure of spread.

    IQR = Q3 − Q1

- The IQR is the spread of the middle 50% of the data.
- It is helpful in the identification of outliers: any observation
  that is more than 1.5 × IQR above Q3 or below Q1 is a
  suspected outlier.
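
The 1.5 × IQR rule can be sketched in Python as below. The function name `suspected_outliers`, the parameter `k`, and the quartile convention (`method="inclusive"` in `statistics.quantiles`) are assumptions made for illustration; the data are the two five-value collections from the mean slides.

```python
import statistics


def suspected_outliers(values, k=1.5):
    """Values more than k * IQR above Q3 or below Q1."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


outliers = suspected_outliers([1.9, 2.5, 3.6, 3.8, 18.2])
no_outliers = suspected_outliers([1.9, 2.5, 3.6, 3.8, 3.2])
print(outliers, no_outliers)
```

With this convention the first collection has Q1 = 2.5 and Q3 = 3.8, so the upper fence is 3.8 + 1.5 × 1.3 = 5.75 and the value 18.2 is flagged.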

Chebyshev’s Rule

Background

- The sample standard deviation is a measure of the dispersion
  of the sample data around the sample mean.
- A small standard deviation indicates less dispersion of sample
  data.
- A larger standard deviation indicates more dispersion of
  sample data.
- This understanding is also true for the range; however, the
  standard deviation provides more information about the data
  than the range.
- The standard deviation permits the formation of intervals that
  indicate the proportion of the data within those intervals.


The 68-95-99.7 Rule for the Normal Curve


- Approximately 68% of observations fall within 1 standard
  deviation of the mean
- Approximately 95% of observations fall within 2 standard
  deviations of the mean
- Approximately 99.7% of observations fall within 3 standard
  deviations of the mean
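
These percentages come from the normal distribution itself. For a standard normal variable Z, P(|Z| ≤ k) equals erf(k/√2), so they can be reproduced with Python's standard library; the function and variable names below are illustrative.

```python
import math


def normal_fraction_within(k):
    """P(|Z| <= k) for a standard normal Z, via the error function."""
    return math.erf(k / math.sqrt(2))


fractions = {k: normal_fraction_within(k) for k in (1, 2, 3)}
for k, p in fractions.items():
    print(f"within {k} SD of the mean: {p:.4f}")
```

The values are close to, but not exactly, 68%, 95% and 99.7%, which is why the rule is stated with "approximately".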


Example

If 100 students took a mathematics test with a mean of 75 and a
standard deviation of 5, then, assuming a normal distribution,
- 68% of the scores would fall between a score of 70 and 80,
- 95% of the student scores would fall between a score of · · ·
  and · · · ?
- · · · % of the student scores would fall between a score of 60
  and 90?


Chebyshev Rule

- We generally assume our data are normally distributed.
- However, in some cases, the data distribution takes on a
  different shape.
- When this occurs, the Chebyshev Rule is helpful in
  determining the percentage of data within such intervals.
- Chebyshev gives bounds that quantify both ‘how close’ the
  data values are to the mean and ‘how much of the time’.


Chebyshev Rule Cont’d


If X is a random variable with finite mean µ and variance σ², then,
for any value k > 0,

\[
P(|X - \mu| \le k\sigma) \ge 1 - \frac{1}{k^2}
\]

[Figure: relative frequency curve with the interval from µ − kσ to µ + kσ labelled 1 − 1/k².]

At least 1 − 1/k² of observations fall within µ ± kσ:
- At least 1 − 1/2² = 75% of observations fall within k = 2
  standard deviations of the mean
- At least 1 − 1/3² ≈ 89% of observations fall within k = 3
  standard deviations of the mean
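
A small Python sketch of the bound, with an empirical check on the skewed five-value collection from the mean slides. The function names are illustrative, and the check treats the data as its own population (using `statistics.pstdev`), which is an assumption made so that the inequality applies directly to the observed values.

```python
import statistics


def chebyshev_lower_bound(k):
    """Lower bound 1 - 1/k^2 on the fraction within k SDs of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the formula is non-positive, so the bound is trivially 0.
    return max(0.0, 1.0 - 1.0 / k ** 2)


def observed_fraction_within(values, k):
    """Fraction of values within k population SDs of their mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    inside = [x for x in values if abs(x - mu) <= k * sigma]
    return len(inside) / len(values)


skewed = [1.9, 2.5, 3.6, 3.8, 18.2]
bound_2 = chebyshev_lower_bound(2)
bound_3 = chebyshev_lower_bound(3)
observed_2 = observed_fraction_within(skewed, 2)
print(bound_2, round(bound_3, 3), observed_2)
```

The observed fraction is never below the bound, but it can be well above it: Chebyshev's rule guarantees a minimum, not an approximation.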


Example

Given a set of test scores with a mean of 80 and a standard
deviation of 5, use the Chebyshev inequality to find the interval that
contains at least 75% of the scores. Why should we, or should we not,
use the empirical rule?
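
One possible way to set up the interval part of this exercise, offered as a sketch rather than as the slide author's solution, is to solve the Chebyshev bound for k and then form µ ± kσ:

```latex
\begin{align*}
1 - \frac{1}{k^2} = 0.75 \;&\implies\; k^2 = 4 \;\implies\; k = 2 \\
\mu \pm k\sigma = 80 \pm 2(5) \;&\implies\; [70,\ 90]
\end{align*}
```

Whether the empirical rule may be used instead depends on whether the scores are approximately normally distributed, which the problem statement does not say.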
