Lect4 Math231

Statistics
Descriptive Statistics
Shaheena Bashir
FALL, 2019
Outline
Introduction
Measures of Location
Mean
Median
Measures of Dispersion
Standard Deviation
Quantiles
Five Number Summary
Chebyshev’s Rule
Introduction
Background
- The graphical techniques are useful in capturing a general sense of the important features in a one-variable data collection.
- None of them is designed to provide us with more numerically descriptive information about the data.
- Need better ways of condensing or summarizing data collections so that we can interpret and compare them more effectively.
  This often requires a number of different numerical data summaries.
Mean
If we interpret the visual center of a data collection to be the balance point where data values larger than the center are equally balanced by those that are smaller than the center, the numerical average or mean is a natural statistic for identifying and measuring the center.
This is the case, for example, when our interpretation of the visual
center corresponds to a value for which the numerical contribution
from data points that are greater than the ‘center’ is equally
balanced by the numerical contribution from those that are less
than it. In such settings, the appropriate statistic to measure this
‘visual center’ is naturally the average, or mean, of the collected
observations.
The mean of the observations in a data collection is their numerical average, that is, the sum of the data values divided by the number of observations in the data collection. So, if $x_1, x_2, \ldots, x_n$ are the n observations in our data collection, their mean is
\[
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i = \frac{1}{n}\,(x_1 + x_2 + \cdots + x_n)
\]
The population mean is also computed the same way but is denoted as µ.
Mean: Example
Consider the data collection [1.9, 2.5, 3.6, 3.8, 3.2]. The mean for this data collection is
\[
\bar{x} = \frac{1.9 + 2.5 + 3.6 + 3.8 + 3.2}{5} = 3
\]
Mean: Limitations
- For many data collections, the mean adequately locates the dominant visual center of the data.
- The mean, being a numerical average of all the observations, is sensitive to unusually large or unusually small observations in the collection of data, especially if the total number of observations is not large.
For example, consider the data collection [1.9, 2.5, 3.6, 3.8, 18.2]. The mean for this data collection is
\[
\bar{x} = \frac{1.9 + 2.5 + 3.6 + 3.8 + 18.2}{5} = 6
\]
The single large observation 18.2 pulls the mean above four of the five data values.
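The effect of one unusually large observation can be checked numerically. The following is a minimal Python sketch using the two data collections from the mean examples; the helper name `sample_mean` is an illustrative assumption, not something defined in the lecture.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    return sum(values) / len(values)


original = [1.9, 2.5, 3.6, 3.8, 3.2]
with_outlier = [1.9, 2.5, 3.6, 3.8, 18.2]  # same data, but 3.2 is replaced by 18.2

mean_original = sample_mean(original)
mean_outlier = sample_mean(with_outlier)

# Rounding hides tiny floating-point error in the sums.
print(round(mean_original, 2))  # 3.0
print(round(mean_outlier, 2))   # 6.0
```

Changing one value out of five doubles the mean here, which is the sensitivity described in the second bullet.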
Median
A measure of the center that is less sensitive to unusually large or small observations than the mean is provided by the median. The median divides the set of ordered data values into two equal-sized halves.
To find the median, $\tilde{x}$, of a data collection $x_1, \ldots, x_n$:
1. Sort the n data values in order from smallest to largest, i.e., $x_{(1)}, x_{(2)}, \ldots, x_{(n)}$.
2. If n is odd, the median, $\tilde{x}$, is the single value in the middle of this ordered list, i.e., $\tilde{x} = x_{\left(\frac{n+1}{2}\right)}$.
3. If n is even, there are two 'middle values', and the median, $\tilde{x}$, is their average, i.e., $\tilde{x} = \frac{x_{\left(\frac{n}{2}\right)} + x_{\left(\frac{n}{2}+1\right)}}{2}$.
Since the median is the midpoint of the data, 50% of the values are below it. Hence, the median is also the 50th percentile.
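The three steps above can be sketched in Python as follows. This is an illustrative implementation only; the function name `median` and the small data collections are assumptions chosen for the example (Python's standard library also offers `statistics.median`).

```python
def median(values):
    """Median by the three steps above: sort, then take the middle value(s)."""
    ordered = sorted(values)  # step 1: x(1) <= x(2) <= ... <= x(n)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty collection is undefined")
    mid = n // 2
    if n % 2 == 1:
        # step 2: n odd, so the median is x((n+1)/2);
        # 0-based index n // 2 is the 1-based position (n + 1) / 2
        return ordered[mid]
    # step 3: n even, so average x(n/2) and x(n/2 + 1)
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_result = median([1.9, 2.5, 3.6, 3.8, 3.2])         # middle of 5 sorted values
even_result = median([1.9, 2.5, 3.6, 3.8, 3.2, 4.2])   # average of the 3rd and 4th
robust_result = median([1.9, 2.5, 3.6, 3.8, 18.2])     # one unusually large value

print(odd_result)             # 3.2
print(round(even_result, 2))  # 3.4
print(robust_result)          # 3.6
```

In the last call the unusually large value 18.2 leaves the median at 3.6, which illustrates why the median is described as less sensitive to extreme observations.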
Median: Example
Consider the data collections

- [1.9, 2.5, 3.6, 3.8, 3.2]. Sorted, the data are [1.9, 2.5, 3.2, 3.6, 3.8]; n = 5 is odd, so the median is the middle value, $\tilde{x} = 3.2$.
- [1.9, 2.5, 3.6, 3.8, 3.2, 4.2]. Sorted, the data are [1.9, 2.5, 3.2, 3.6, 3.8, 4.2]; n = 6 is even, so the median is the average of the two middle values, i.e., $\tilde{x} = \frac{3.2 + 3.6}{2} = 3.4$.
Variability/dispersion is what the field of statistics is all about

- Measures of location can be used to measure the center of a data collection. Consider, for example, the two data collections of n = 10 observations:
  Collection A: 3, 4, 5, 6, 7, 9, 10, 11, 12, 13; $\bar{x}_A = 8$
  Collection B: 7.5, 7.6, 7.7, 7.8, 7.9, 8.1, 8.2, 8.3, 8.4, 8.5; $\bar{x}_B = 8$
- Clearly, such measurements only provide partial information about the nature of the data collection: both collections have the same mean, yet the values in A are far more spread out than those in B.
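A quick numerical check makes the point concrete. The sketch below computes the mean of each collection together with the simplest measure of spread, max minus min; the helper name `mean_and_range` is an illustrative assumption.

```python
collection_a = [3, 4, 5, 6, 7, 9, 10, 11, 12, 13]
collection_b = [7.5, 7.6, 7.7, 7.8, 7.9, 8.1, 8.2, 8.3, 8.4, 8.5]


def mean_and_range(values):
    """Return (mean, max - min) for a non-empty collection."""
    return sum(values) / len(values), max(values) - min(values)


mean_a, range_a = mean_and_range(collection_a)
mean_b, range_b = mean_and_range(collection_b)

print(f"A: mean={mean_a:.1f}, range={range_a:.1f}")  # A: mean=8.0, range=10.0
print(f"B: mean={mean_b:.1f}, range={range_b:.1f}")  # B: mean=8.0, range=1.0
```

The means agree, but the values in A cover an interval ten times as wide as those in B.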
Standard Deviation
[Figure: Collection A and Collection B plotted on a common axis (ticks at 4, 6, 8, 10, 12), showing the spread of the values about the mean.]

Standard Deviation (SD)

- The standard deviation of a data set, denoted by s, represents the typical distance from any point in the data set to the center.
- It's roughly the average distance from the center, and in this case, the center is the average.

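\[
\mathrm{SD} = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
\qquad
\mathrm{Var} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}
\]
Short Formula:
\[
\mathrm{Var} = \frac{1}{n-1}\left[\sum_{i=1}^{n} x_i^2 - \frac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]
\]
- Calculation of the standard deviation of the population is similar, and denoted by σ
- The standard deviation can never be negative
- The standard deviation has the same units as the original data, while variance is in square units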
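As a check that the definitional and short formulas agree, here is a minimal Python sketch applied to Collection A from earlier in the lecture. The function names `variance_definition` and `variance_shortcut` are illustrative assumptions.

```python
import math


def variance_definition(values):
    """Sample variance: sum of squared deviations from the mean, over n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def variance_shortcut(values):
    """Sample variance via the short formula: (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n = len(values)
    total = sum(values)
    total_of_squares = sum(x * x for x in values)
    return (total_of_squares - total ** 2 / n) / (n - 1)


collection_a = [3, 4, 5, 6, 7, 9, 10, 11, 12, 13]

var_def = variance_definition(collection_a)   # 110 / 9
var_short = variance_shortcut(collection_a)   # (750 - 80**2 / 10) / 9 = 110 / 9
sd = math.sqrt(var_def)

print(round(var_def, 3), round(var_short, 3))  # 12.222 12.222
print(round(sd, 3))                            # 3.496
```

The short formula is convenient for hand calculation. In floating-point arithmetic it can lose precision when the mean is large relative to the spread, so the definitional form is usually the safer choice in code.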
Standard Deviation (SD): Example
Consider the data set [1.9, 2.5, 3.6, 3.8, 3.2].

- The mean for the data set is $\bar{x} = \frac{1.9 + 2.5 + 3.6 + 3.8 + 3.2}{5} = 3$
- Find the distance of the data points from the mean, i.e., $x_i - \bar{x}$
- Square the distances, i.e., $(x_i - \bar{x})^2$
- Sum the squared distances, i.e., $\sum (x_i - \bar{x})^2$
- Divide the sum of the squared distances, $\sum (x_i - \bar{x})^2$, by n − 1 (the number of observations minus 1) to calculate the variance
- Take the square root of the variance to get the SD
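Carrying the steps out for this data set gives the following worked calculation. The slide leaves the arithmetic to the reader; the final value is rounded to three decimals.

```latex
\begin{align*}
x_i - \bar{x} &: \; -1.1,\; -0.5,\; 0.6,\; 0.8,\; 0.2 \\
(x_i - \bar{x})^2 &: \; 1.21,\; 0.25,\; 0.36,\; 0.64,\; 0.04 \\
\sum (x_i - \bar{x})^2 &= 1.21 + 0.25 + 0.36 + 0.64 + 0.04 = 2.50 \\
\mathrm{Var} &= \frac{2.50}{5 - 1} = 0.625 \\
\mathrm{SD} &= \sqrt{0.625} \approx 0.791
\end{align*}
```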
Quantiles
Quantiles are cut points dividing the ordered observations in a sample into equal parts. There is one fewer quantile than the number of groups created.
- The median splits the data into equal sized halves
- The quartiles split the data into quarters
- The deciles split the data into tenths
- The percentiles split the data into 100 equal parts. A value's percentile tells what percent of the data is below that value.
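The "one fewer cut point than groups" property can be seen with Python's standard `statistics.quantiles` function (Python 3.8+). The data below, the integers 1 to 100, are hypothetical and chosen only for illustration.

```python
from statistics import median, quantiles

scores = list(range(1, 101))  # hypothetical data: the integers 1..100

quartiles = quantiles(scores, n=4)      # 3 cut points split the data into 4 groups
deciles = quantiles(scores, n=10)       # 9 cut points split the data into 10 groups
percentiles = quantiles(scores, n=100)  # 99 cut points split the data into 100 groups

print(len(quartiles), len(deciles), len(percentiles))  # 3 9 99
print(quartiles[1] == median(scores))                  # True: Q2 is the median
```

Software packages use slightly different interpolation conventions for quantiles, so values computed by hand or in another tool may differ a little from these.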
Quantiles: Example
The following quantiles are from an R output based on scores of a class of 200 students

  0%   10%   20%   30%   40%   50%   60%   70%   80%   90%  100%
41.0  49.0  51.0  52.0  53.0  54.0  55.0  57.0  58.2  62.0  70.0

Identify the 30th percentile value and interpret it.
Five Number Summary

Min   minimum value in data set
Q1    1st quartile = 25th percentile
Q2    2nd quartile = 50th percentile = median
Q3    3rd quartile = 75th percentile
Max   maximum value in data set
A set of five descriptive statistics that divide the data set into four
equal sections
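A five number summary can be computed along these lines. This sketch uses `statistics.quantiles` with its "inclusive" interpolation method, which is one of several quartile conventions and is an assumption of this example rather than something the lecture specifies; the helper name `five_number_summary` is likewise illustrative. The data are Collection A from earlier in the lecture.

```python
from statistics import quantiles


def five_number_summary(values, method="inclusive"):
    """Return (min, Q1, Q2, Q3, max); `method` selects the quartile convention."""
    ordered = sorted(values)
    q1, q2, q3 = quantiles(ordered, n=4, method=method)
    return ordered[0], q1, q2, q3, ordered[-1]


collection_a = [3, 4, 5, 6, 7, 9, 10, 11, 12, 13]

summary = five_number_summary(collection_a)
print(summary)  # (3, 5.25, 8.0, 10.75, 13)
```

With `method="exclusive"` the same data give Q1 = 4.75 and Q3 = 11.25, so it is worth stating which convention was used when reporting quartiles.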
Graphical Presentation: Multiple Groups

- It is usually hard to compare multiple histograms by eye
[Figure: histograms of the Cabbage Head Weight Data for groups c39 and c52; x-axis: Head Weight (KG), y-axis: Frequency.]
Box-and-Whisker Plot
- A box plot is a way to plot data that simplifies comparison between groups.
- There is a vertical box whose height corresponds to the interquartile range of the data (the width is just to make the figure easy to interpret).
- Then there is a horizontal line for the median; and
- The behavior of the rest of the data is indicated with whiskers that extend from the ends of the box to the maximum and minimum data values.
- Outliers, if any, are identified by shortening the whisker so that it ends at the most extreme observation that is not a potential outlier.
- Each data set is represented by a vertical structure, making it easy to show multiple data sets on one plot and interpret the plot.
Box Plot
[Figure: box plots of the Cabbage Data; x-axis: Cultivar Type (c39, c52), y-axis: Head Weight (KG), with ticks from 1.0 to 4.0.]
Range & Interquartile Range (IQR)
- Range: max − min
- IQR: for long-tailed distributions, the IQR is used as the measure of spread.
  IQR = Q3 − Q1
- The IQR is the spread of the middle 50% of the data
- Helpful in identifying outliers: any observation that is more than 1.5 × IQR above Q3 or below Q1 is a suspected outlier.
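The 1.5 × IQR rule can be sketched in Python as follows. The quartiles are computed with `statistics.quantiles` using its "inclusive" method, which is an assumed convention (other quartile definitions may flag slightly different points); the helper name `suspected_outliers` is illustrative. The data are the collection used earlier to show the mean's sensitivity.

```python
from statistics import quantiles


def suspected_outliers(values, k=1.5):
    """Return observations more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    return [x for x in values if x < lower_fence or x > upper_fence]


data = [1.9, 2.5, 3.6, 3.8, 18.2]

flagged = suspected_outliers(data)
print(flagged)  # [18.2]
```

Here Q1 = 2.5 and Q3 = 3.8, so the IQR is 1.3 and the fences are 0.55 and 5.75; only 18.2 falls outside them.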
Chebyshev’s Rule
Background
- The sample standard deviation is a measure of the dispersion of the sample data around the sample mean.
- A small standard deviation indicates less dispersion of sample data.
- A larger standard deviation indicates more dispersion of sample data.
- This understanding is also true for the range; however, the standard deviation provides more information about the data than the range.
- The standard deviation permits the formation of intervals that indicate the proportion of the data within those intervals.
The 68-95-99.7 Rule for the Normal Curve

- Approximately 68% of observations fall within 1 standard deviation of the mean
- Approximately 95% of observations fall within 2 standard deviations of the mean
- Approximately 99.7% of observations fall within 3 standard deviations of the mean
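These percentages can be verified from the normal distribution itself. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `prob_within` is an illustrative assumption.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0.0, sigma=1.0)


def prob_within(k):
    """Probability that a normal observation lies within k SDs of its mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)


coverage = {k: prob_within(k) for k in (1, 2, 3)}
for k, p in coverage.items():
    print(f"within {k} SD: {p:.4f}")
# within 1 SD: 0.6827
# within 2 SD: 0.9545
# within 3 SD: 0.9973
```

The rule's 68%, 95% and 99.7% are rounded versions of these values, and they apply only when the data are approximately normal.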
Example
If 100 students took a mathematics test with a mean of 75 and a standard deviation of 5, assuming a normal distribution
- then 68% of the scores would fall between a score of 70 and 80,
- 95% of the student scores would fall between a score of · · · and · · · ?
- · · · % of the student scores would fall between a score of 60 and 90?
Chebyshev Rule
- We generally assume our data is normally distributed
- However, in some cases, the data distribution takes on a different shape.
- When this occurs, the Chebyshev Rule is helpful in determining the percentage of data between the intervals.
- Chebyshev gives bounds that quantify both 'how close' the data values are to the mean and 'how much of the time'
Chebyshev Rule Cont’d

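If X is a random variable with finite mean µ and variance $\sigma^2$, then, for any value k > 0,
\[
P(|X - \mu| \le k\sigma) \ge 1 - \frac{1}{k^2}
\]
[Figure: relative frequency curve; the region between µ − kσ and µ + kσ contains at least 1 − 1/k² of the distribution.]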
At least $1 - \frac{1}{k^2}$ of observations fall within µ ± kσ
- At least $1 - \frac{1}{2^2} = 75\%$ of observations fall within k = 2 standard deviations of the mean
- At least $1 - \frac{1}{3^2} \approx 89\%$ of observations fall within k = 3 standard deviations of the mean
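The bound and the corresponding interval are easy to compute. Below is a minimal Python sketch; the function names `chebyshev_bound` and `chebyshev_interval`, and the mean of 50 and standard deviation of 10 in the usage lines, are hypothetical values chosen only for illustration.

```python
def chebyshev_bound(k):
    """Lower bound 1 - 1/k^2 on the proportion within k SDs of the mean."""
    if k <= 0:
        raise ValueError("k must be positive")
    # For k <= 1 the formula is zero or negative, so the bound is reported as 0.
    return max(0.0, 1.0 - 1.0 / k ** 2)


def chebyshev_interval(mean, sd, k):
    """Interval mean +/- k * sd to which the bound applies."""
    return mean - k * sd, mean + k * sd


for k in (2, 3):
    print(k, round(chebyshev_bound(k), 3))
# 2 0.75
# 3 0.889

# Hypothetical data summary: mean 50, standard deviation 10
print(chebyshev_interval(50, 10, 2))  # (30, 70)
```

Compared with the 95% and 99.7% of the normal curve, the Chebyshev bounds of 75% and about 89% are much weaker, which is the price of making no assumption about the shape of the distribution.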
Example
Given a set of test scores with a mean of 80 and a standard deviation of 5, use the Chebyshev inequality to find the interval that contains at least 75% of the scores. Why should we, or should we not, use the empirical rule?
