SEE5211 Chapter2 p2017

Data Analysis in Envir Application
(SEE5211/SEE8212)
Numerical Methods for Describing Data
Chapter 2
Population characteristic
• Fixed value about a population
• Typically unknown
Is this a value that is known? Can we find it out?
Statistic
• Value calculated from a sample

Measures of Central Tendency
• Mode – the observation that occurs the most often
• Can be more than one mode
• If all values occur only once, there is no mode
• Not used as often as mean & median

Median - the middle value of the data; it divides the observations in half
To find: list the observations in numerical order
sample median = the single middle value if n is odd
sample median = the average of the two middle values if n is even
where n = sample size

Example
Suppose we catch a sample of 5 fish from the lake. The lengths of the fish (in inches) are listed below. Find the median length of fish.
3 4 5 8 10
The numbers are in order & n is odd, so find the middle observation.
The median length of fish is 5 inches.
Example
Suppose we caught a sample of 6 fish from the lake. The median length is …
3 4 5 6 8 10
The numbers are in order & n is even, so find the middle two observations (5 and 6) and average them.
The median length is 5.5 inches.
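The odd/even rule for the median can be sketched in Python. This is a minimal illustration, not part of the original slides; the function name `sample_median` is a hypothetical helper chosen for this example.

```python
def sample_median(values):
    """Middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(values)          # list the observations in numerical order
    n = len(ordered)                  # n = sample size
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]           # single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # average of the two middle values


five_fish = [3, 4, 5, 8, 10]
six_fish = [3, 4, 5, 6, 8, 10]
print(sample_median(five_fish))  # 5
print(sample_median(six_fish))   # 5.5
```

Sorting inside the function means the data do not have to be entered in order.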
Mean is the arithmetic average.
• Use $\mu$ to represent a population mean
• Use $\bar{x}$ to represent a sample mean
Formula:
\[ \bar{x} = \frac{\sum x}{n} \]
Example
Suppose we caught a sample of 6 fish from the lake. Find the mean length of the fish.
3 4 5 6 8 10
\[ \bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 10}{6} = 6 \]
Example
Now find how each observation deviates from the mean ($\bar{x} = 6$).
x      (x - $\bar{x}$)
3      -3   (3 - 6)
4      -2
5      -1
6       0
8       2
10      4
Sum     0
Each value (x - $\bar{x}$) is a deviation from the mean.
Will this sum always equal zero? YES
The mean is considered the balance point of the distribution because it “balances” the positive and negative deviations.
Imagine a ruler with pennies placed at 3”, 4”, 5”, 6”, 8” and 10”.
To balance the ruler on your finger, you would need to place your finger at the mean of 6.
The mean is the balance point of a distribution.
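The balance-point idea can be checked numerically. The short sketch below (an illustration added here, using the six fish lengths from the example) computes each deviation from the mean and confirms that they sum to zero.

```python
lengths = [3, 4, 5, 6, 8, 10]           # fish lengths in inches

mean = sum(lengths) / len(lengths)       # arithmetic average
deviations = [x - mean for x in lengths]  # x minus the sample mean

print(mean)             # 6.0
print(deviations)       # [-3.0, -2.0, -1.0, 0.0, 2.0, 4.0]
print(sum(deviations))  # 0.0
```

The negative deviations (-3, -2, -1) total -6 and the positive deviations (2, 4) total +6, which is why the ruler balances at 6.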
Example
What happens to the median & mean if the length of 10 inches was 15 inches?
3 4 5 6 8 15
The median is . . . 5.5
The mean is . . . 6.833
\[ \bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 15}{6} = 6.833 \]
What happens to the median & mean if the 15 inches was 20?
3 4 5 6 8 20
The median is . . . 5.5
The mean is . . . 7.667
\[ \bar{x} = \frac{3 + 4 + 5 + 6 + 8 + 20}{6} = 7.667 \]
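A small sketch of the same experiment, using Python's standard `statistics` module. The dictionary name `results` is just an illustrative choice; the data are the fish lengths above with the largest value changed.

```python
from statistics import mean, median

results = {}
for largest in (10, 15, 20):
    lengths = [3, 4, 5, 6, 8, largest]   # only the largest fish changes
    results[largest] = (median(lengths), round(mean(lengths), 3))

for largest, (med, avg) in results.items():
    print(f"largest = {largest}: median = {med}, mean = {avg}")
```

The median stays at 5.5 in all three cases while the mean moves from 6 to 6.833 to 7.667.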
Median & Mean
Some statistics are resistant: they are not affected by extreme values . . .
Is the median resistant? YES
Is the mean affected by extreme values? YES, so the mean is not resistant.
Example
Suppose we caught a sample of 20 fish with the following lengths. Create a histogram for the lengths of fish.
3 5 6 10 6 7 7 8 4 5
6 4 7 5 9 9 8 7 6 8
Mean = 6.5
Median = 6.5
Look at the placement of the mean and median in this symmetrical distribution.
Example
Create a histogram for the lengths of fish.
3 5 6 10 15 7 3 3 4 5
6 4 12 5 3 4 8 13 11 9
Mean = 6.8
Median = 5.5
Look at the placement of the mean and median in this skewed distribution.
Example
Create a histogram for the lengths of fish.
3 5 6 10 10 7 10 8 9 5
6 4 9 10 9 9 10 7 10 8
Mean = 7.75
Median = 8.5
Look at the placement of the mean and median in this skewed distribution.
Distribution
• In a symmetrical distribution, the mean and median are equal.

• In a skewed distribution, the mean is pulled in the direction of the
skewness.
• In a symmetrical distribution, you should report the mean!

• In a skewed distribution, the median should be reported as the
measure of center!
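The relationship between skewness and the mean can be checked on the three samples of 20 fish from the histogram examples. This is a rough sketch; the keys `first`, `second`, `third` and the helper `compare_center` are hypothetical names introduced for this illustration.

```python
from statistics import mean, median

samples = {
    "first":  [3, 5, 6, 10, 6, 7, 7, 8, 4, 5, 6, 4, 7, 5, 9, 9, 8, 7, 6, 8],
    "second": [3, 5, 6, 10, 15, 7, 3, 3, 4, 5, 6, 4, 12, 5, 3, 4, 8, 13, 11, 9],
    "third":  [3, 5, 6, 10, 10, 7, 10, 8, 9, 5, 6, 4, 9, 10, 9, 9, 10, 7, 10, 8],
}


def compare_center(values):
    """Return (mean, median, note on which way the mean is pulled)."""
    m, med = mean(values), median(values)
    if m > med:
        note = "mean pulled toward larger values"
    elif m < med:
        note = "mean pulled toward smaller values"
    else:
        note = "mean equals median"
    return m, med, note


for name, values in samples.items():
    print(name, compare_center(values))
```

In the second sample a few long fish (12, 13, 15) pull the mean above the median; in the third a few short fish pull it below.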
Trimmed mean:
Purpose is to remove outliers from a data set
To calculate a trimmed mean:

• Multiply the percent to trim by n
• Truncate that many observations from BOTH ends of the
distribution (when listed in order)
• Calculate the mean with the shortened data set
Example
Find the mean of the following set of data.
12 14 19 20 22 24 25 26 26 50
Mean = 23.8
Find the 10% trimmed mean.
10%(10) = 1, so remove one observation from each side!
\[ \bar{x}_T = \frac{14 + 19 + 20 + 22 + 24 + 25 + 26 + 26}{8} = 22 \]
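The three trimming steps can be sketched as a small function. This is one possible implementation under stated assumptions: the helper name `trimmed_mean` is hypothetical, the trim amount is passed as a whole-number percent, and a fractional count is truncated as the slide describes.

```python
def trimmed_mean(values, trim_percent):
    """Mean after dropping trim_percent% of the observations from EACH end."""
    ordered = sorted(values)
    n = len(ordered)
    k = trim_percent * n // 100        # multiply the percent by n, then truncate
    if 2 * k >= n:
        raise ValueError("trimming would remove every observation")
    kept = ordered[k:n - k]            # drop k values from both ends
    return sum(kept) / len(kept)


data = [12, 14, 19, 20, 22, 24, 25, 26, 26, 50]
print(sum(data) / len(data))   # 23.8
print(trimmed_mean(data, 10))  # 22.0
```

Dropping 12 and 50 moves the center from 23.8 to 22, mostly because the large value 50 no longer inflates the average.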
What values are used to describe categorical data?
Suppose that each person in a sample of 15 cell phone users is asked if he or she is satisfied with the cell phone service.
Here are the responses:
Y N Y Y Y N N Y Y
N Y Y Y N N
Sample proportion:
\[ \hat{p} = \frac{\text{number of successes}}{n} \]
\[ \hat{p} = \frac{9}{15} = 0.6 \]
60% of the sample was satisfied with their cell phone service.
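Counting successes for a categorical variable is a one-liner in Python. The sketch below uses the 15 responses from the example and treats "Y" as a success; the variable names are illustrative.

```python
responses = "Y N Y Y Y N N Y Y N Y Y Y N N".split()

n = len(responses)                  # sample size
successes = responses.count("Y")    # number of satisfied users
p_hat = successes / n               # sample proportion

print(successes, n, p_hat)  # 9 15 0.6
```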
Why is the study of variability important?
• There is variability in virtually everything
• Allows us to distinguish between usual & unusual values
• Reporting only a measure of center doesn’t provide a complete picture of the distribution.
Does this can of soda contain exactly 12 ounces?

Example
[Figure: dotplots of three data sets, each on an axis running from 20 to 70]
Notice that these three data sets all have the same mean and median (at 45), but they have very different amounts of variability.
Measures of Variability
The simplest numeric measure of variability is range.
Range = largest observation – smallest observation
[Figure: dotplots of the three data sets, each on an axis running from 20 to 70]
The first two data sets have a range of 50 (70-20) but the third data set has a much smaller range of 10.
Another measure of the variability in a data set uses the deviations from the mean (x – $\bar{x}$).
Remember the sample of 6 fish that we caught from the lake . . .
They were the following lengths:
3”, 4”, 5”, 6”, 8”, 10”
The mean length was 6 inches. Recall that we calculated the deviations from the mean. What was the sum of these deviations?
The estimated average of the deviations squared is called the variance.
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} \]
When calculating sample variance, we use degrees of freedom (n – 1) in the denominator instead of n because this tends to produce better estimates.
Example
Remember the sample of 6 fish that we caught from the lake . . .
Find the variance of the length of fish ($\bar{x} = 6$).
Finding the average of the deviations would always equal 0! So first square the deviations.
x      (x - $\bar{x}$)   (x - $\bar{x}$)^2
3      -3      9
4      -2      4
5      -1      1
6       0      0
8       2      4
10      4      16
Sum     0      34
Divide this sum by n – 1 = 5.
\[ s^2 = \frac{\sum (x - \bar{x})^2}{n - 1} = \frac{34}{5} = 6.8 \]
The square root of variance is called standard deviation.
A typical deviation from the mean is the standard deviation.
$s^2 = 6.8$ inches$^2$, so $s = \sqrt{6.8} = 2.608$ inches
The fish in our sample deviate from the mean of 6 by an average of 2.608 inches.
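The variance and standard deviation of the six fish lengths can be reproduced step by step. This sketch follows the n – 1 formula by hand and then, as a cross-check, compares with `statistics.variance` and `statistics.stdev`, which also use n – 1 in the denominator.

```python
import math
from statistics import stdev, variance

lengths = [3, 4, 5, 6, 8, 10]
n = len(lengths)
mean = sum(lengths) / n

squared_deviations = [(x - mean) ** 2 for x in lengths]
s2 = sum(squared_deviations) / (n - 1)   # 34 / 5
s = math.sqrt(s2)

print(sum(squared_deviations))  # 34.0
print(round(s2, 3), round(s, 3))  # 6.8 2.608
print(math.isclose(s2, variance(lengths)), math.isclose(s, stdev(lengths)))  # True True
```

Dividing by n = 6 instead would give 34/6, about 5.667, which is the value `statistics.pvariance` returns for a full population.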
Calculation of standard deviation of a sample
The most commonly used measures of center and variability are the mean and standard deviation, respectively.
\[ s = \sqrt{\frac{\sum (x - \bar{x})^2}{n - 1}} \]
Population standard deviation is denoted by $\sigma$ (where n is used in the denominator).
Interquartile range (iqr) is the range of the middle half of the
data.
Lower quartile (Q1) is the median of the lower half of the data
Upper quartile (Q3) is the median of the upper half of the data
iqr = Q3 – Q1
What advantage does the interquartile range have over the standard
deviation?
The iqr is resistant to extreme values

Example
The Chronicle of Higher Education (2009-2010 issue) published the
accompanying data on the percentage of the population with a
bachelor’s or higher degree in 2007 for each of the 50 states and the
District of Columbia.
21 27 26 19 30 35 35 26 47 26
27 30 24 29 22 24 29 20 20 27
35 38 25 31 19 24 27 27 23 34
25 32 26 24 22 28 26 30 23 25
22 25 29 33 34 30 17 25 23 34
26
Find the interquartile range for this set of data.
Example
First put the data in order & find the median.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
With n = 51, the median is the 26th ordered value: 26.
Lower quartile (Q1) = median of the lower 25 values = 24
Upper quartile (Q3) = median of the upper 25 values = 30
iqr = 30 – 24 = 6
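The quartile calculation can be sketched using the "median of each half" definition given above. One assumption is made explicit here: when n is odd, the overall median is left out of both halves, which is consistent with the result iqr = 6. The function name `quartiles_by_halves` is hypothetical. Other software may use different quartile conventions and return slightly different values.

```python
from statistics import median

percent_with_degree = [
    21, 27, 26, 19, 30, 35, 35, 26, 47, 26,
    27, 30, 24, 29, 22, 24, 29, 20, 20, 27,
    35, 38, 25, 31, 19, 24, 27, 27, 23, 34,
    25, 32, 26, 24, 22, 28, 26, 30, 23, 25,
    22, 25, 29, 33, 34, 30, 17, 25, 23, 34,
    26,
]


def quartiles_by_halves(values):
    """Return (Q1, median, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # Assumption: for odd n the middle observation belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median(lower), median(ordered), median(upper)


q1, med, q3 = quartiles_by_halves(percent_with_degree)
iqr = q3 - q1
print(q1, med, q3, iqr)  # 24 26 30 6
```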
Boxplots
What are some advantages of boxplots?
• ease of construction
• convenient handling of outliers
• construction is not subjective (like histograms)
• Used with medium or large size data sets (n > 10)
• useful for comparative displays
Boxplots
When to use: univariate numerical data
How to construct a boxplot
• Calculate the five-number summary (smallest observation, Q1, median, Q3, largest observation)
• Draw a horizontal (or vertical) scale
• Construct a rectangular box from the lower quartile (Q1) to the upper quartile (Q3)
• Draw a line inside the box at the median
• Draw lines from the lower quartile to the smallest observation and from the upper quartile to the largest observation
To describe
– comment on the center, spread, and shape of the distribution and whether there are any unusual features
Example
Remember the data on the percentage of the population with a bachelor’s or
higher degree in 2007 for each of the 50 states and the District of Columbia.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
[Figure: boxplot of the data on a horizontal axis labeled Percentages, scaled from 10 to 50]
Modified boxplots
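To display outliers:
• Identify mild & extreme outliers
An observation is an outlier if it is more than 1.5(iqr) away from the nearest quartile, that is, outside the interval between
\[ Q_1 - 1.5(\text{iqr}) \quad \text{and} \quad Q_3 + 1.5(\text{iqr}) \]
An outlier is extreme if it is more than 3(iqr) away from the nearest quartile, that is, outside the interval between
\[ Q_1 - 3(\text{iqr}) \quad \text{and} \quad Q_3 + 3(\text{iqr}) \]
Otherwise the outlier is mild.
• Whiskers extend to the largest (or smallest) data observation that is not an outlier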
Example
Remember the data on the percentage of the population with a
bachelor’s or higher degree in 2007 for each of the 50 states and the
District of Columbia.
17 19 19 20 20 21 22 22 22 23
23 23 24 24 24 24 25 25 25 25
25 26 26 26 26 26 26 27 27 27
27 27 28 29 29 29 30 30 30 30
31 32 33 34 34 34 35 35 35 38
47
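Mild outlier boundaries: 24 – 1.5(6) = 15 and 30 + 1.5(6) = 39
Upper extreme outlier boundary: 30 + 3(6) = 48
The only observation outside 15 to 39 is 47; it is below 48, so it is a mild outlier.
[Figure: modified boxplot of the data on a horizontal axis labeled Percentages, scaled from 10 to 50]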
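The outlier rule for a modified boxplot can be sketched as follows, using Q1 = 24 and Q3 = 30 from the example. The function name `classify` and the labels it returns are illustrative choices for this sketch, not terminology fixed by the slides beyond "mild" and "extreme".

```python
ordered = [
    17, 19, 19, 20, 20, 21, 22, 22, 22, 23,
    23, 23, 24, 24, 24, 24, 25, 25, 25, 25,
    25, 26, 26, 26, 26, 26, 26, 27, 27, 27,
    27, 27, 28, 29, 29, 29, 30, 30, 30, 30,
    31, 32, 33, 34, 34, 34, 35, 35, 35, 38,
    47,
]
q1, q3 = 24, 30
iqr = q3 - q1

mild_low, mild_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr      # 15.0 and 39.0
extreme_low, extreme_high = q1 - 3 * iqr, q3 + 3 * iqr    # 6 and 48


def classify(x):
    """Label x as 'extreme', 'mild', or 'not an outlier'."""
    if x < extreme_low or x > extreme_high:
        return "extreme"
    if x < mild_low or x > mild_high:
        return "mild"
    return "not an outlier"


outliers = {x: classify(x) for x in ordered if classify(x) != "not an outlier"}
inside = [x for x in ordered if classify(x) == "not an outlier"]
whiskers = (min(inside), max(inside))   # whisker ends for the modified boxplot

print(outliers)   # {47: 'mild'}
print(whiskers)   # (17, 38)
```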
[Figures: symmetrical boxplots; an approximately symmetrical boxplot; a skewed boxplot]
The 2009-2010 salaries of NBA players were used to construct the comparative boxplot of salary data for five teams.
Discuss the similarities and differences.
Interpreting Center & Variability
Chebyshev’s Rule –
The percentage of observations that are within k standard deviations of the mean is at least
\[ 100\left(1 - \frac{1}{k^2}\right)\% \quad \text{where } k > 1 \]
If k = 2, then at least 75% of the observations are within 2 standard deviations of the mean:
\[ 100\left(1 - \frac{1}{2^2}\right)\% = 75\% \]
Example
For a sample of families with one preschool child, it was reported
that the mean child care time per week was approximately 36 hours
with a standard deviation of approximately 12 hours.
Using Chebyshev’s rule, at least 75% of the sample observations must be between 12 and 60 hours (within 2 standard deviations of the mean).
At most, what percent of the observations are greater than 72 hours?
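One way to work through the child care question is sketched below. The helper name `chebyshev_at_least` is hypothetical. The reasoning assumes that 72 hours is treated as k = (72 – 36)/12 = 3 standard deviations above the mean; Chebyshev's rule then bounds everything outside 3 standard deviations, so the figure printed is an upper bound on the percent above 72 hours, not an exact value.

```python
def chebyshev_at_least(k):
    """Minimum percent of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's rule requires k > 1")
    return 100 * (1 - 1 / k ** 2)


mean_hours, sd_hours = 36, 12

print(chebyshev_at_least(2))              # 75.0 (between 12 and 60 hours)

k = (72 - mean_hours) / sd_hours          # 3.0 standard deviations
within = chebyshev_at_least(k)            # at least this percent between 0 and 72
at_most_outside = 100 - within            # at most this percent beyond 3 SDs

print(round(within, 1), round(at_most_outside, 1))  # 88.9 11.1
```

So at most about 11.1% of the observations can be greater than 72 hours.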
Interpreting Center & Variability
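Empirical Rule - for distributions that are approximately normal:
• Approximately 68% of the observations are within 1 standard deviation of the mean
• Approximately 95% of the observations are within 2 standard deviations of the mean
• Approximately 99.7% of the observations are within 3 standard deviations of the mean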
The height of male students at PWSH is approximately normally distributed with a mean of 71 inches and standard deviation of 2.5 inches.
a) What percent of the male students are shorter than 66 inches? About 2.5%
b) Taller than 73.5 inches? About 16%
c) Between 66 & 73.5 inches? About 81.5%
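The Empirical Rule answers can be reproduced with simple arithmetic and compared with an exact normal model. This sketch assumes the heights follow a normal distribution exactly, which the example only states approximately; `statistics.NormalDist` is from the Python standard library.

```python
from statistics import NormalDist

# Empirical Rule arithmetic: 66 is 2 SDs below the mean, 73.5 is 1 SD above.
rule_below_66 = (100 - 95) / 2                          # 2.5
rule_above_73_5 = (100 - 68) / 2                        # 16.0
rule_between = 100 - rule_below_66 - rule_above_73_5    # 81.5

# Exact areas under a normal curve with mean 71 and SD 2.5.
heights = NormalDist(mu=71, sigma=2.5)
exact_below_66 = 100 * heights.cdf(66)
exact_above_73_5 = 100 * (1 - heights.cdf(73.5))
exact_between = 100 * (heights.cdf(73.5) - heights.cdf(66))

print(rule_below_66, rule_above_73_5, rule_between)
print(round(exact_below_66, 1), round(exact_above_73_5, 1), round(exact_between, 1))
```

The exact percentages come out close to, but not identical with, the rule-of-thumb values, which is why the answers are stated as "about".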
Measures of Relative Standing
Z-score
A z-score tells us how many standard deviations the value is from the mean.
\[ z\text{-score} = \frac{\text{value} - \text{mean}}{\text{standard deviation}} \]
It is one example of a standardized score.

Example
Sally is taking two different math achievement tests with different
means and standard deviations. The mean score on test A was 56
with a standard deviation of 3.5, while the mean score on test B
was 65 with a standard deviation of 2.8. Sally scored a 62 on test
A and a 69 on test B. On which test did Sally score the best?
Z-score on test A:
\[ z = \frac{62 - 56}{3.5} = 1.714 \]
Z-score on test B:
\[ z = \frac{69 - 65}{2.8} = 1.429 \]
She did better on test A.
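The comparison of Sally's two scores can be sketched with a small z-score helper. The function name `z_score` is an illustrative choice.

```python
def z_score(value, mean, sd):
    """Number of standard deviations that value lies from the mean."""
    return (value - mean) / sd


z_a = z_score(62, mean=56, sd=3.5)
z_b = z_score(69, mean=65, sd=2.8)

print(round(z_a, 3), round(z_b, 3))  # 1.714 1.429
print("A" if z_a > z_b else "B")     # A
```

Although 69 is the higher raw score, 62 on test A is further above its own mean when measured in standard deviations.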

Measures of Relative Standing
Percentiles
A percentile is a value in the data set where r percent of the
observations fall AT or BELOW that value
Example
In addition to weight and length, head circumference is another
measure of health in newborn babies. The National Center for
Health Statistics reports the following summary values for head
circumference (in cm) at birth for boys.
Head circumference (cm)   32.2   33.2   34.5   35.8   37.0   38.2   38.6
Percentile                5      10     25     50     75     90     95
What percent of newborn boys had head circumferences greater than 37.0 cm? 25%
10% of newborn boys have head circumferences bigger than what value? 38.2 cm
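Both questions are lookups in the percentile table, read in opposite directions. The sketch below stores the table as a dictionary; the helper names `percent_above` and `value_exceeded_by` are hypothetical, and both only handle values that appear in the table.

```python
# percentile -> head circumference (cm), from the summary table
percentile_to_cm = {5: 32.2, 10: 33.2, 25: 34.5, 50: 35.8, 75: 37.0, 90: 38.2, 95: 38.6}


def percent_above(cm):
    """Percent of observations greater than a tabulated value."""
    for pct, value in percentile_to_cm.items():
        if value == cm:
            return 100 - pct       # pct% are at or below, the rest are above
    raise KeyError(f"{cm} cm is not in the table")


def value_exceeded_by(percent):
    """Tabulated value that the top `percent`% of observations exceed."""
    return percentile_to_cm[100 - percent]


print(percent_above(37.0))     # 25
print(value_exceeded_by(10))   # 38.2
```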
EPD air pollutant dataset
• http://www.epd.gov.hk/epd/epic/english/data_air_data.html
• http://www.aqhi.gov.hk/en/annual-aqi/latest-annual-aqi.html
• http://www.aqhi.gov.hk/en/annual-aqi/annual-aqi-trend.html
Visit the HK EPD website and investigate Hong Kong's air pollution problems. Do your research on the internet and write a short summary of your findings. You may consider points such as how the air pollution level has changed in recent years and what is being done to try to improve air quality.
