
IE 228

Engineering Statistics
Lecture 2

Summary Statistics
(Section 1.2)

Topics to learn
1. Mean, standard deviation, variance
2. Outliers
3. Median, quartile, percentile, trimmed mean
4. Mode, range
5. Frequency, sample proportion
6. Difference between a "statistic" and a "parameter"

McGraw-Hill ©2014 by The McGraw-Hill Companies, Inc. All rights reserved.



Section 1.2: Summary Statistics


• A sample is often a long list of numbers.
• To help make the important features of a sample stand
out, we compute summary statistics.
• The two most commonly used summary statistics are
– the sample mean, and
– the sample standard deviation.
• The mean gives an indication of the center of the
data.
• The standard deviation gives an indication of how
spread out the data are.


Let X 1 , , X n be a sample.
Sample Mean:
- The sample mean is also called the “arithmetic
mean,” or the “average.”
- It is the sum of the numbers in the sample, divided
by how many there are.

\[ \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i \tag{1} \]


Standard Deviation
• Consider the two lists of numbers:
List 1: 28, 29, 30, 31, 32
List 2: 10, 20, 30, 40, 50.
• Both lists have the same mean of 30.
• But clearly the lists differ in an important way that is
not captured by the mean:
– the second list is much more spread out than the first.
• The standard deviation is a quantity that measures
the degree of spread in a sample.
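
A short check of the two lists, sketched with Python's standard `statistics` module. Here `statistics.stdev` is the sample standard deviation (divisor n − 1) that the following slides define.

```python
import statistics

list_1 = [28, 29, 30, 31, 32]
list_2 = [10, 20, 30, 40, 50]

for name, values in (("List 1", list_1), ("List 2", list_2)):
    mean = statistics.mean(values)
    # statistics.stdev divides the sum of squared deviations by n - 1.
    spread = statistics.stdev(values)
    print(f"{name}: mean = {mean}, standard deviation = {spread:.2f}")
```

Both means are 30, while the standard deviation of List 2 is ten times that of List 1.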


Standard Deviation (cont.)


• Let $X_1, \ldots, X_n$ be a sample. The basic idea behind the
standard deviation is that:
– when the spread is large, the sample values will tend to
be far from their mean, but
– when the spread is small, the values will tend to be
close to their mean.
• So the first step in calculating the standard deviation
is to compute the differences (also called deviations)
between each sample value and the sample mean.
The deviations are $(X_1 - \bar{X}), \ldots, (X_n - \bar{X})$.


Standard Deviation (cont.)


• Some of these deviations are positive and some are negative.
– Large negative deviations are just as indicative of spread as
large positive deviations are.
• To make all the deviations positive we square them,
obtaining the squared deviations:
$(X_1 - \bar{X})^2, \ldots, (X_n - \bar{X})^2$.
• From the squared deviations we can compute a measure of
spread called the sample variance:
The sample variance is the average of the squared
deviations, except that we divide by n − 1 instead of n.
(It is customary to denote the sample variance by $s^2$.)


Standard Deviation (cont.)


• While the sample variance is an important quantity, it has a
serious drawback as a measure of spread:
– Its units are not the same as the units of the sample values;
– Instead they are the squared units.
• To obtain a measure of spread whose units are the same as
those of the sample values, we simply take the square root of
the variance.
• This quantity is known as the sample standard deviation.
• It is customary to denote the sample standard deviation by s
(the square root of $s^2$).
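
The steps described so far (deviations, squared deviations, division by n − 1, square root) can be sketched as below. The function names are illustrative assumptions; the data are List 2 from the earlier slide.

```python
import math


def sample_variance(xs):
    """Average of the squared deviations, dividing by n - 1 instead of n."""
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(xs) / n
    deviations = [x - mean for x in xs]       # X_i minus the sample mean
    squared = [d ** 2 for d in deviations]    # squaring makes every term nonnegative
    return sum(squared) / (n - 1)


def sample_std_dev(xs):
    """Square root of the sample variance, in the same units as the data."""
    return math.sqrt(sample_variance(xs))


list_2 = [10, 20, 30, 40, 50]
print(sample_variance(list_2))   # 250.0
print(sample_std_dev(list_2))
```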


Let X 1 , , X n be a sample.
Sample Variance:

(2.1)

• An equivalent formula, which can be easier to


compute, is:

(2.2)
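
A small check that the definition (sum of squared deviations over n − 1) and the computational form (sum of squares minus n times the squared mean, over n − 1) agree. This is a sketch: the function names and data are illustrative assumptions, and exact fractions are used so the comparison is not blurred by rounding.

```python
from fractions import Fraction


def variance_definition(xs):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / (n - 1)


def variance_shortcut(xs):
    """(Sum of squares minus n times the squared mean), divided by n - 1."""
    n = len(xs)
    mean = sum(xs) / n
    return (sum(x ** 2 for x in xs) - n * mean ** 2) / (n - 1)


# Illustrative data, stored as exact fractions.
data = [Fraction(v) for v in (4, 7, 13, 16)]
print(variance_definition(data), variance_shortcut(data))
```

With ordinary floating-point numbers the shortcut form subtracts two large, nearly equal quantities, so it can lose precision when the mean is large relative to the spread.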


Let X 1 , , X n be a sample.
• Sample standard deviation is the square root of the
sample variance.

(3.1)

• or, using equivalent formula

(3.2)


Note:
Why is the sum of the squared deviations divided by n − 1 rather
than n?
• Ideally, we would compute deviations from the mean of all the
items in the population, rather than the deviations from the
sample mean.
• However, the population mean is in general unknown, so the
sample mean is used in its place.
• It is a mathematical fact that
– the deviations around the sample mean tend to be a bit smaller than
the deviations around the population mean, and that
– dividing by n − 1 rather than n provides exactly the right correction.
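
The claim can be checked exactly on a tiny case. The sketch below assumes a hypothetical three-value population and samples of size 2 drawn with replacement, each sample equally likely; it enumerates every possible sample and averages the two candidate estimates.

```python
from fractions import Fraction
from itertools import product

# Hypothetical population (an assumption for illustration only).
population = [Fraction(v) for v in (2, 4, 6)]
N = len(population)
pop_mean = sum(population) / N
# Population variance: average squared deviation from the population mean.
pop_variance = sum((x - pop_mean) ** 2 for x in population) / N

n = 2  # sample size
samples = list(product(population, repeat=n))  # all 9 equally likely samples


def sum_sq_dev(sample):
    """Sum of squared deviations around the sample's own mean."""
    mean = sum(sample) / len(sample)
    return sum((x - mean) ** 2 for x in sample)


avg_div_n_minus_1 = sum(sum_sq_dev(s) / (n - 1) for s in samples) / len(samples)
avg_div_n = sum(sum_sq_dev(s) / n for s in samples) / len(samples)

print("population variance:      ", pop_variance)
print("average, divisor n - 1:   ", avg_div_n_minus_1)
print("average, divisor n:       ", avg_div_n)
```

Under these assumptions, the divisor n − 1 reproduces the population variance on average, while the divisor n comes out too small.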



In-class exercise

(The five heights (in inches) are: 65.51, 72.30, 68.31, 67.05, 70.68.)
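
The exercise statement itself is not reproduced here. Assuming it asks for the sample mean, variance, and standard deviation of the five heights, one way to check a hand calculation is with Python's standard `statistics` module:

```python
import statistics

heights_in = [65.51, 72.30, 68.31, 67.05, 70.68]  # inches

mean = statistics.mean(heights_in)
variance = statistics.variance(heights_in)  # divisor n - 1
std_dev = statistics.stdev(heights_in)      # square root of the variance

print(f"mean               = {mean:.2f} in")
print(f"variance           = {variance:.4f} in^2")
print(f"standard deviation = {std_dev:.4f} in")
```

Note the units: the variance is in square inches, the standard deviation in inches.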


More on Summary Statistics


• If $X_1, \ldots, X_n$ is a sample, and $Y_i = a + bX_i$, where a
and b are constants, then

\[ \bar{Y} = a + b\bar{X}, \]
and
\[ s_y^2 = b^2 s_x^2, \quad \text{and} \quad s_y = |b| \, s_x. \]



In-class exercise

Example
In Example 1.9, if the heights were measured in
centimeters rather than inches what would happen to the
sample mean, variance, and standard deviation?
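
A numerical sketch of this question, assuming the heights of Example 1.9 are the five values listed earlier (65.51, 72.30, 68.31, 67.05, 70.68 inches). Converting to centimeters is the linear transformation Y = a + bX with a = 0 and b = 2.54.

```python
import math
import statistics

heights_in = [65.51, 72.30, 68.31, 67.05, 70.68]  # inches
a, b = 0.0, 2.54                                  # 1 inch = 2.54 cm
heights_cm = [a + b * x for x in heights_in]

mean_in, mean_cm = statistics.mean(heights_in), statistics.mean(heights_cm)
var_in, var_cm = statistics.variance(heights_in), statistics.variance(heights_cm)
sd_in, sd_cm = statistics.stdev(heights_in), statistics.stdev(heights_cm)

# Compare the directly computed values with the transformation rules.
print(mean_cm, a + b * mean_in)   # mean is multiplied by b (and shifted by a)
print(var_cm, b ** 2 * var_in)    # variance is multiplied by b squared
print(sd_cm, abs(b) * sd_in)      # standard deviation is multiplied by |b|
```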


Outliers
• Outliers are points that are much larger or smaller
than the rest of the sample points.
• Outliers may be data entry errors or they may be
points that really are different from the rest.
• Outliers should not be deleted without considerable
thought—sometimes calculations and analyses will
be done with and without outliers and then compared.
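
As a sketch of such a comparison, the snippet below computes the mean and standard deviation with and without a suspected outlier. The data are hypothetical (the value 300 stands in for a possible data entry error), and the snippet does not decide whether a point is an outlier; it only compares the results.

```python
import statistics

# Hypothetical measurements; 300 is the suspected outlier.
sample = [28, 29, 30, 31, 32, 300]
suspect = 300
without = [x for x in sample if x != suspect]

for label, values in (("with outlier", sample), ("without outlier", without)):
    print(f"{label}: mean = {statistics.mean(values):.2f}, "
          f"standard deviation = {statistics.stdev(values):.2f}")
```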


Outliers
• Outliers are a real problem for data analysts.
– For this reason, when people see outliers in their
data, they sometimes try to find a reason, or an
excuse, to delete them.
• An outlier should not be deleted, however, unless
there is reasonable certainty that it results from an
error.
• If a population truly contains outliers, but they are
deleted from the sample, the sample will not
characterize the population correctly.


Definition of a Median
The median is another measure of center, like the
mean.
Order the n data points from smallest to largest. Then
• If n is odd, the sample median is the number in
position $(n + 1)/2$.
• If n is even, the sample median is the average
of the numbers in positions $n/2$ and $n/2 + 1$.
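
The two cases can be sketched in Python as follows. The function name `sample_median` is an illustrative assumption; positions in the definition are 1-based, while Python lists are 0-based, hence the `- 1` adjustments.

```python
def sample_median(xs):
    """Median by position: (n + 1)/2 if n is odd, average of n/2 and n/2 + 1 if even."""
    ordered = sorted(xs)
    n = len(ordered)
    if n == 0:
        raise ValueError("the sample must contain at least one value")
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                # position (n + 1)/2
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # positions n/2 and n/2 + 1


# Odd n: the five heights used earlier in the lecture.
print(sample_median([65.51, 72.30, 68.31, 67.05, 70.68]))
# Even n: illustrative data.
print(sample_median([7, 1, 5, 3]))
```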



In-class exercise

(Recall: the five heights are 65.51, 72.30, 68.31, 67.05, 70.68.)


Trimmed Mean
• Like the median, the trimmed mean is a measure of center
that is designed to be unaffected by outliers.
• The trimmed mean is computed by
– arranging the sample values in order,
– “trimming” an equal number of them from each end, and
– computing the mean of those remaining.
• If p% of the data are trimmed from each end, the resulting
trimmed mean is called the “p% trimmed mean.”
• There are no hard-and-fast rules on how many values to trim.
The most commonly used are the 5%, 10%, and 20% trimmed
means.
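
The procedure above can be sketched as follows. The function name and data are illustrative assumptions. The slides do not say what to do when p% of n is not a whole number; this sketch rounds the number of trimmed values down, which is an assumption.

```python
def trimmed_mean(xs, percent):
    """Mean after trimming `percent`% of the ordered values from each end."""
    ordered = sorted(xs)
    n = len(ordered)
    # Number of values trimmed from EACH end (rounded down: an assumption).
    k = int((n * percent) // 100)
    if 2 * k >= n:
        raise ValueError("trimming would remove every value")
    kept = ordered[k:n - k]
    return sum(kept) / len(kept)


# Hypothetical sample with one very small and one very large value.
data = [28, 29, 30, 31, 32, 300, 1, 30, 29, 31]
print("mean:            ", sum(data) / len(data))
print("10% trimmed mean:", trimmed_mean(data, 10))
```

With n = 10, the 10% trimmed mean drops one value from each end, so the two extreme values no longer affect the result.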


Mode and Range


• The mode and the range are summary statistics that are of
limited use but are occasionally seen.
• The sample mode is the most frequently occurring value in a
sample.
– If several values occur with equal frequency, each one is a
mode.
• The range is the difference between the largest and smallest
values in a sample.
– It is a measure of spread, but it is rarely used, because it
depends only on the two extreme values and provides no
information about the rest of the sample.
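
A minimal sketch of both statistics in Python. The function names and data are illustrative assumptions; `sample_modes` returns a list because, as noted above, several values can tie for most frequent.

```python
from collections import Counter


def sample_modes(xs):
    """All values that occur with the highest frequency, in increasing order."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)


def sample_range(xs):
    """Difference between the largest and smallest values."""
    return max(xs) - min(xs)


# Illustrative data: 3 and 7 each occur twice.
data = [2, 3, 3, 5, 7, 7, 9]
print(sample_modes(data))
print(sample_range(data))
```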


Quartiles
• Quartiles divide the data as nearly as possible
into quarters.
• The first quartile is the median of the lower
half of the data.
To find the first quartile, compute 0.25(n + 1):
- If this is an integer, then the sample value in that
position is the first quartile.
- If not, take the average of the sample values on
either side of this position.


Quartiles
• The third quartile is the median of the upper
half of the data.
To find the third quartile, compute 0.75(n + 1):
- If this is an integer, then the sample value in that
position is the third quartile.
- If not, take the average of the sample values on
either side of this position.
• Note: The computation we used for the location of
the median is equivalent to 0.5(n + 1).
• The median is the second quartile.

Definition of Percentile
• The pth percentile of a sample, for a number
p between 0 and 100, divides the sample so
that as nearly as possible p% of the sample
values are less than the pth percentile, and
(100 − p)% are greater.
• The computation of the location of the pth
percentile is analogous to what we did for the
quartiles.


To Find Percentiles
• Order the n sample values from smallest to
largest.
• Compute the quantity (p/100)(n + 1), where n
is the sample size.
• If this quantity is an integer, the sample value
in this position is the pth percentile.
Otherwise, average the two sample values on
either side.
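
The rule above can be sketched in Python as follows. The function name and data are illustrative assumptions. The position (p/100)(n + 1) is computed with exact fractions so that the "is it an integer?" test is not affected by rounding. The slides do not cover positions below 1 or above n, so this sketch raises an error there (an assumption).

```python
import math
from fractions import Fraction


def sample_percentile(xs, p):
    """pth percentile using the (p/100)(n + 1) position rule (1-based positions)."""
    ordered = sorted(xs)
    n = len(ordered)
    position = Fraction(p) * (n + 1) / 100
    if position < 1 or position > n:
        raise ValueError("position falls outside the sample; rule not defined here")
    lower = math.floor(position)
    if position == lower:
        return ordered[lower - 1]                   # integer position
    return (ordered[lower - 1] + ordered[lower]) / 2  # average the two neighbours


# Illustrative data with n = 9, so n + 1 = 10.
data = [10, 20, 30, 40, 50, 60, 70, 80, 90]
print(sample_percentile(data, 25))  # position 2.5
print(sample_percentile(data, 50))  # position 5
print(sample_percentile(data, 90))  # position 9
```

Statistical software often uses other interpolation conventions, so library percentile functions may return slightly different values on small samples.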


Note on Percentiles
• The first quartile is the 25th percentile.

• The median is the 50th percentile.

• The third quartile is the 75th percentile.



In-class exercise

Example 4
• Suppose we have the following data:
2, 3, 5, 6, 7, 9, 9, 11, 12, 15
• What is the mean of these data?
• What is the median?
• What is the first quartile?
• What is the third quartile?
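
One way to check hand answers to these questions, sketched in Python using the position rules 0.5(n + 1), 0.25(n + 1), and 0.75(n + 1). The helper `value_at` is an illustrative name, not from the lecture.

```python
import math

data = [2, 3, 5, 6, 7, 9, 9, 11, 12, 15]
ordered = sorted(data)
n = len(ordered)


def value_at(position):
    """Value at a 1-based position; average the two neighbours if it is not whole."""
    lower = math.floor(position)
    if position == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2


mean = sum(data) / n
median = value_at(0.5 * (n + 1))   # position 5.5
q1 = value_at(0.25 * (n + 1))      # position 2.75
q3 = value_at(0.75 * (n + 1))      # position 8.25

print(f"mean = {mean}, median = {median}, Q1 = {q1}, Q3 = {q3}")
# mean = 7.9, median = 8.0, Q1 = 4.0, Q3 = 11.5
```

Other quartile conventions (for example, taking the median of each half of the data) can give different values on a sample this small.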


Summary Statistics for Categorical Data


• The two most commonly used numerical summaries
for categorical data are the frequencies and the
sample proportion (sometimes called relative
frequencies).

• Example: 100 rivets are checked for their breaking
strength. If 4 of the rivets fail (i.e., do not hold up to
the standard), find the sample proportion of rivets that
fail.

Answer: Sample proportion = 4 / 100 = 0.04
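
The rivet example can be sketched in Python as follows. The labels "fail" and "pass" are illustrative assumptions for how the categorical outcomes might be recorded.

```python
from collections import Counter

# 100 rivets: 4 fail, the remaining 96 pass (labels are illustrative).
outcomes = ["fail"] * 4 + ["pass"] * 96

frequencies = Counter(outcomes)  # count of each category
proportions = {category: count / len(outcomes)
               for category, count in frequencies.items()}

print(frequencies["fail"], proportions["fail"])
```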



Sample Statistics and Population Parameters
• A numerical summary of a sample is called a
statistic.
• A numerical summary of a population is called
a parameter.
• Statistics are often used to estimate
parameters.
End of Lecture
