Descriptive Statistics: Numerical Methods

Descriptive Statistics
3.1 Describing Central Tendency
3.2 Measures of Variation
3.3 Percentiles, Quartiles
3.5 Grouped Data

Describing Central Tendency


In addition to describing the shape of a
distribution, we want to describe the data
set's central tendency
A measure of central tendency represents the
center or middle of the data

Parameters and Statistics


A population parameter is a number
calculated from all the population
measurements that describes some
aspect of the population
A sample statistic is a number calculated
using the sample measurements that
describes some aspect of the sample

Measures of Central Tendency


Mean, $\mu$: The average or expected value
Median, Md: The value of the middle point of the ordered measurements
Mode, Mo: The most frequent value

The Mean
Population: $X_1, X_2, \ldots, X_N$
Sample: $x_1, x_2, \ldots, x_n$

Population Mean
$$\mu = \frac{\sum_{i=1}^{N} X_i}{N}$$

Sample Mean
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

The Sample Mean


For a sample of size n, the sample mean is defined as
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
and is a point estimate of the population mean $\mu$
It is the value to expect, on average and in the long run

Example 3.1: The Car Mileage Case


Example 3.1: Sample mean for the first five car
mileages from Table 3.1:
30.8, 31.7, 30.1, 31.6, 32.1
$$\bar{x} = \frac{\sum_{i=1}^{5} x_i}{5} = \frac{x_1 + x_2 + x_3 + x_4 + x_5}{5}$$
$$\bar{x} = \frac{30.8 + 31.7 + 30.1 + 31.6 + 32.1}{5} = \frac{156.3}{5} = 31.26$$
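The sample mean calculation can be sketched in a few lines of Python. The function name `sample_mean` is illustrative, not something from the slides.

```python
# First five car mileages from Table 3.1
mileages = [30.8, 31.7, 30.1, 31.6, 32.1]

def sample_mean(xs):
    """Sum of the measurements divided by the sample size n."""
    return sum(xs) / len(xs)

x_bar = sample_mean(mileages)
print(round(x_bar, 2))  # 31.26
```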

The Median
The median Md is a value such that 50% of all
measurements, after having been arranged in
numerical order, lie above (or below) it
1. If the number of measurements is odd, the median
is the middlemost measurement in the ordering
2. If the number of measurements is even, the median
is the average of the two middlemost
measurements in the ordering

Example: Car Mileage Case


Example 3.1: First five observations
from Table 3.1:
30.8, 31.7, 30.1, 31.6, 32.1
In order: 30.1, 30.8, 31.6, 31.7, 32.1
There is an odd number of measurements, so the
median is the middle one, 31.6
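The odd/even rule for the median can be sketched as below. The helper name `median_of` is hypothetical, and the even-sized sample in the last line is made up for illustration by dropping one mileage.

```python
def median_of(xs):
    """Median: middle value if n is odd, average of the two middle values if n is even."""
    ordered = sorted(xs)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median_of([30.8, 31.7, 30.1, 31.6, 32.1]))  # 31.6
# Even case (illustrative): averages the two middle values, 30.8 and 31.6
print(median_of([30.8, 31.7, 30.1, 31.6]))
```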

The Mode
The mode Mo of a population or sample of
measurements is the measurement that occurs
most frequently
Modes are the values that are observed most
typically
Sometimes higher frequencies at two or more values
If there are two modes, the data is bimodal
If more than two modes, the data is multimodal

When data are in classes, the class with the highest
frequency is the modal class
The tallest box in the histogram
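A minimal sketch of finding the mode or modes of ungrouped data, assuming that every value tied for the highest frequency counts as a mode. The function name `modes` and the two small data sets are illustrative only.

```python
from collections import Counter

def modes(xs):
    """Return every value that occurs with the highest frequency."""
    counts = Counter(xs)
    top = max(counts.values())
    return sorted(value for value, count in counts.items() if count == top)

print(modes([3, 4, 4, 5]))     # [4]
print(modes([1, 1, 2, 3, 3]))  # [1, 3], so this data set is bimodal
```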

Histogram Describing the 50 Mileages

Relationships Among Mean, Median, and Mode

Measures of Variation
Knowing the measures of central tendency is not
enough
Both of the distributions below have identical
measures of central tendency

Measures of Variation
Range: Largest minus the smallest measurement
Variance: The average of the squared deviations of all the population
measurements from the population mean
Standard Deviation: The square root of the variance

The Range
Largest minus smallest
Measures the interval spanned by all the
data
For Figure 3.13, the largest repair time is 5 days
and the smallest is 3 days
Range is $5 - 3 = 2$ days

Population Variance and Standard Deviation
The population variance ($\sigma^2$) is the average
of the squared deviations of the individual
population measurements from the
population mean ($\mu$)
The population standard deviation ($\sigma$) is the
positive square root of the population
variance

Variance

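For a population of size N, the population variance $\sigma^2$ is:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N}$$

For a sample of size n, the sample variance $s^2$ is:
$$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}$$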

Standard Deviation
Population standard deviation ($\sigma$): $\sigma = \sqrt{\sigma^2}$

Sample standard deviation (s): $s = \sqrt{s^2}$

Example: Chris's Class Sizes This Semester
Data points are: 60, 41, 15, 30, 34
Mean is 36
Variance is:
$$\sigma^2 = \frac{(60 - 36)^2 + (41 - 36)^2 + (15 - 36)^2 + (30 - 36)^2 + (34 - 36)^2}{5} = \frac{576 + 25 + 441 + 36 + 4}{5} = \frac{1082}{5} = 216.4$$
Standard deviation is:
$$\sigma = \sqrt{216.4} = 14.71$$
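The class-size calculation can be checked with a short sketch that divides by N, treating the five classes as the whole population as the slide does. The function name `population_variance` is illustrative.

```python
import math

def population_variance(xs):
    """Average squared deviation from the population mean (divides by N)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** 2 for x in xs) / len(xs)

class_sizes = [60, 41, 15, 30, 34]
var = population_variance(class_sizes)
sd = math.sqrt(var)
print(var, round(sd, 2))  # 216.4 14.71
```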

Example: Sample Variance and Standard Deviation
Example 3.7: data for the first five car
mileages from Table 3.1 are 30.8, 31.7, 30.1,
31.6, 32.1
The sample mean is 31.26
$$s^2 = \frac{\sum_{i=1}^{5} (x_i - \bar{x})^2}{5 - 1}$$
$$= \frac{(30.8 - 31.26)^2 + (31.7 - 31.26)^2 + (30.1 - 31.26)^2 + (31.6 - 31.26)^2 + (32.1 - 31.26)^2}{4} = \frac{2.572}{4} = 0.643$$

$$s = \sqrt{s^2} = \sqrt{0.643} = 0.8019$$
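For comparison with the population formula, here is a minimal sketch of the sample version, which divides by n - 1. The function name `sample_variance` is illustrative.

```python
import math

def sample_variance(xs):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sum((x - x_bar) ** 2 for x in xs) / (n - 1)

mileages = [30.8, 31.7, 30.1, 31.6, 32.1]
s2 = sample_variance(mileages)
s = math.sqrt(s2)
print(round(s2, 3), round(s, 4))  # 0.643 0.8019
```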

The Empirical Rule for Normal Populations
If a population has mean $\mu$ and standard
deviation $\sigma$ and is described by a normal curve,
then
68.26% of the population measurements lie within
one standard deviation of the mean: $[\mu - \sigma, \mu + \sigma]$
95.44% of the population measurements lie within
two standard deviations of the mean: $[\mu - 2\sigma, \mu + 2\sigma]$
99.73% of the population measurements lie within
three standard deviations of the mean: $[\mu - 3\sigma, \mu + 3\sigma]$

The Empirical Rule and Tolerance Intervals

Example 3.9: The Car Mileage Case Continued

68.26% of all individual cars will have
mileages in the range
$[\bar{x} \pm s] = [31.6 \pm 0.8] = [30.8, 32.4]$ mpg
95.44% of all individual cars will have
mileages in the range
$[\bar{x} \pm 2s] = [31.6 \pm 1.6] = [30.0, 33.2]$ mpg
99.73% of all individual cars will have
mileages in the range
$[\bar{x} \pm 3s] = [31.6 \pm 2.4] = [29.2, 34.0]$ mpg
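The three tolerance intervals can be reproduced with a small sketch, using the mean 31.6 and standard deviation 0.8 from the example. The function name `tolerance_interval` is illustrative, and results are rounded to one decimal place to match the slide.

```python
def tolerance_interval(mean, sd, k):
    """Interval [mean - k*sd, mean + k*sd], rounded to one decimal place."""
    return (round(mean - k * sd, 1), round(mean + k * sd, 1))

for k in (1, 2, 3):
    print(k, tolerance_interval(31.6, 0.8, k))
# 1 (30.8, 32.4)
# 2 (30.0, 33.2)
# 3 (29.2, 34.0)
```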

Chebyshev's Theorem

Let $\mu$ and $\sigma$ be a population's mean and
standard deviation. Then for any value k > 1,
at least $100(1 - 1/k^2)\%$ of the population
measurements lie in the interval $[\mu - k\sigma, \mu + k\sigma]$
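To get a feel for the bound, the sketch below evaluates $100(1 - 1/k^2)$ for a couple of values of k. The function name `chebyshev_lower_bound` is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of measurements within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 100 * (1 - 1 / k**2)

print(chebyshev_lower_bound(2))            # 75.0
print(round(chebyshev_lower_bound(3), 2))  # 88.89
```

Unlike the Empirical Rule, this bound holds for any population shape, which is why it is looser: at least 75% within two standard deviations versus 95.44% for a normal population.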

Coefficient of Variation
Measures the size of the standard deviation
relative to the size of the mean
Coefficient of variation = (standard deviation / mean) × 100%
Used to:
Compare the relative variabilities of values about the
mean
Compare the relative variability of populations or
samples with different means and different standard
deviations
Measure risk
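As a quick illustration, the sketch below applies the definition to the car mileage figures used in Example 3.9 (mean 31.6, standard deviation 0.8). The function name is illustrative.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv = coefficient_of_variation(0.8, 31.6)
print(round(cv, 2))  # 2.53
```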

Percentiles, Quartiles, and Box-and-Whiskers Displays
For a set of measurements arranged in
increasing order, the pth percentile is a value
such that p percent of the measurements fall at
or below the value and (100-p) percent of the
measurements fall at or above the value
The first quartile Q1 is the 25th percentile
The second quartile (or median) is the 50th
percentile
The third quartile Q3 is the 75th percentile
The interquartile range IQR is Q3 - Q1

Quartiles
Quartiles split the ranked data into 4 segments with an
equal number of values per segment
25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

The first quartile, Q1, is the value for which 25% of the
observations are smaller and 75% are larger
Q2 is the same as the median (50% are smaller, 50% are
larger)
Only 25% of the observations are greater than the third
quartile

Quartile Formulas
Find a quartile by determining the value in the
appropriate position in the ranked data, where
First quartile position: Q1 = (n+1)/4
Second quartile position: Q2 = (n+1)/2 (the median position)
Third quartile position: Q3 = 3(n+1)/4
where n is the number of observed values

Quartiles
Example: Find the first quartile
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values,
so Q1 = 12.5

Q1 and Q3 are measures of noncentral location
Q2 = median, a measure of central tendency

Quartiles (continued)

Example:
Sample Data in Ordered Array: 11 12 13 16 16 17 18 21 22
(n = 9)
Q1 is in the (9+1)/4 = 2.5 position of the ranked data, so Q1 = 12.5
Q2 is in the (9+1)/2 = 5th position of the ranked data, so Q2 = median = 16
Q3 is in the 3(9+1)/4 = 7.5 position of the ranked data, so Q3 = 19.5
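The position formulas can be sketched as below. The slides only show the halfway case (positions ending in .5); treating other fractional positions by linear interpolation is an assumption of this sketch, and the helper name `value_at_position` is hypothetical. It expects a position between 1 and n.

```python
def value_at_position(ordered, pos):
    """Value at a 1-based position in ranked data.

    Fractional positions are interpolated linearly between neighbors
    (an assumption; the slides only show the halfway case).
    """
    lower = int(pos)   # 1-based index of the value at or just below pos
    frac = pos - lower
    if frac == 0:
        return ordered[lower - 1]
    return ordered[lower - 1] + frac * (ordered[lower] - ordered[lower - 1])

data = [11, 12, 13, 16, 16, 17, 18, 21, 22]
n = len(data)
q1 = value_at_position(data, (n + 1) / 4)
q2 = value_at_position(data, (n + 1) / 2)
q3 = value_at_position(data, 3 * (n + 1) / 4)
print(q1, q2, q3)  # 12.5 16 19.5
```

Statistical software often uses slightly different quartile conventions, so library results may not match these hand calculations exactly.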

Five Number Summary


1. The smallest measurement
2. The first quartile, Q1
3. The median, Md
4. The third quartile, Q3
5. The largest measurement
Displayed visually using a box-and-whiskers plot

Five Number Summary


Example:
Minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, Maximum = 70
Each of the four segments between these values contains 25% of the data
Interquartile range = 57 - 30 = 27

Weighted Means
Sometimes, some measurements are more
important than others
Assign numerical weights to the data
Weights measure relative importance of the value

Calculate weighted mean as
$$\frac{\sum w_i x_i}{\sum w_i}$$
where $w_i$ is the weight assigned to the ith
measurement $x_i$

Example: Weighted Mean


June 2001 unemployment rates by census
region
Northeast, 26.9 million in civilian labor force,
4.1% unemployment rate
South, 50.6 million, 4.7% unemployment
Midwest, 34.7 million, 4.4% unemployment
West, 32.5 million, 5.0% unemployment

Want the mean unemployment rate for the U.S.

Example: Weighted Mean (Continued)

Want the mean unemployment rate for the U.S.
Calculate it as a weighted mean
So that the bigger the region, the more heavily
it counts in the mean
The data values are the regional unemployment rates
The weights are the sizes of the regional
labor forces

Example: Weighted Mean (Continued)

$$\text{Weighted mean} = \frac{(26.9)(4.1) + (50.6)(4.7) + (34.7)(4.4) + (32.5)(5.0)}{26.9 + 50.6 + 34.7 + 32.5} = \frac{663.29}{144.7} = 4.58\%$$

Note that the unweighted mean is 4.55%,
which underestimates the true rate by 0.03 percentage points
That is, 0.0003 × 144.7 million = 43,410
workers
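The regional calculation can be sketched as follows, with rates in percent and labor forces in millions taken from the example. The function name `weighted_mean` is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Northeast, South, Midwest, West
rates = [4.1, 4.7, 4.4, 5.0]            # unemployment rate, %
labor_force = [26.9, 50.6, 34.7, 32.5]  # civilian labor force, millions

weighted = weighted_mean(rates, labor_force)
unweighted = sum(rates) / len(rates)
print(round(weighted, 2), round(unweighted, 2))  # 4.58 4.55
```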

Descriptive Statistics for Grouped Data


Data already categorized into a frequency
distribution or a histogram is called
grouped data
Can calculate the mean and variance even
when the raw data is not available
Calculations are slightly different for data
from a sample and data from a population

Mean for Grouped Data


Example
Find the arithmetic mean for the following continuous
frequency distribution:

Class    Frequency
0-1      1
1-2      4
2-3      8
3-4      7
4-5      3
5-6      2

Solution for the Example

Class     X (midpoint)    f     fX
0-1       0.5             1     0.5
1-2       1.5             4     6.0
2-3       2.5             8     20.0
3-4       3.5             7     24.5
4-5       4.5             3     13.5
5-6       5.5             2     11.0
Totals                    25    75.5
Mean                            3.02

Applying the formula
$$\bar{X} = \frac{\sum fX}{n} = \frac{75.5}{25} = 3.02$$
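The spreadsheet calculation can be sketched in Python: each class is represented by its midpoint X, and the mean is the frequency-weighted average of the midpoints. The variable names are illustrative.

```python
# (lower limit, upper limit) of each class and its frequency
classes = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (5, 6)]
freqs = [1, 4, 8, 7, 3, 2]

midpoints = [(lo + hi) / 2 for lo, hi in classes]
n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
grouped_mean = sum_fx / n
print(n, sum_fx, grouped_mean)  # 25 75.5 3.02
```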

Median for Grouped Data


The formula for the median is given by
$$\text{Median} = L + \frac{(n/2) - m}{f} \times c$$
where
L = Lower limit of the median class
n = Total number of observations = $\sum f$
m = Cumulative frequency preceding the median class
f = Frequency of the median class
c = Class interval of the median class

Median for Grouped Data


Example
Find the median for the following continuous
frequency distribution:

Class    Frequency
0-1      1
1-2      4
2-3      8
3-4      7
4-5      3
5-6      2

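Solution for the Example

Class    Frequency    Cumulative Frequency
0-1      1            1
1-2      4            5
2-3      8            13
3-4      7            20
4-5      3            23
5-6      2            25
Total    25

The median class is 2-3, the first class whose cumulative frequency
reaches n/2 = 12.5. Substituting the relevant values in the formula, we have
$$\text{Median} = L + \frac{(n/2) - m}{f} \times c = 2 + \frac{(25/2) - 5}{8} \times 1 = 2.9375$$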
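A minimal sketch of the grouped-median formula, assuming equal-width classes listed in increasing order. The median class is taken to be the first class whose cumulative frequency reaches n/2; the function name `grouped_median` is illustrative.

```python
def grouped_median(lower_limits, freqs, width):
    """Median = L + ((n/2 - m) / f) * c for the median class."""
    n = sum(freqs)
    m = 0  # cumulative frequency preceding the current class
    for L, f in zip(lower_limits, freqs):
        if m + f >= n / 2:  # this is the median class
            return L + (n / 2 - m) / f * width
        m += f

print(grouped_median([0, 1, 2, 3, 4, 5], [1, 4, 8, 7, 3, 2], 1))  # 2.9375
```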

Mode for Grouped Data


$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times c$$
where
L = Lower limit of the modal class
$d_1 = f_1 - f_0$
$d_2 = f_1 - f_2$
$f_1$ = Frequency of the modal class
$f_0$ = Frequency preceding the modal class
$f_2$ = Frequency succeeding the modal class
c = Class interval of the modal class

Mode for Grouped Data


Example
Find the mode for the following
continuous frequency distribution:

Class    Frequency
0-1      1
1-2      4
2-3      8
3-4      7
4-5      3
5-6      2

Solution for the Example

Class    Frequency
0-1      1
1-2      4
2-3      8
3-4      7
4-5      3
5-6      2
Total    25

$$\text{Mode} = L + \frac{d_1}{d_1 + d_2} \times c$$
L = 2
$d_1 = f_1 - f_0 = 8 - 4 = 4$
$d_2 = f_1 - f_2 = 8 - 7 = 1$
c = 1
Hence
$$\text{Mode} = 2 + \frac{4}{5} \times 1 = 2.8$$
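The grouped-mode formula can be sketched as below, assuming equal-width classes and a single modal class. If the modal class is the first or last class, this sketch treats the missing neighboring frequency as 0, which is an assumption rather than something the slides state. The function name `grouped_mode` is illustrative.

```python
def grouped_mode(lower_limits, freqs, width):
    """Mode = L + d1 / (d1 + d2) * c for the modal class."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])  # index of modal class
    f1 = freqs[i]
    f0 = freqs[i - 1] if i > 0 else 0               # frequency preceding
    f2 = freqs[i + 1] if i < len(freqs) - 1 else 0  # frequency succeeding
    d1, d2 = f1 - f0, f1 - f2
    return lower_limits[i] + d1 / (d1 + d2) * width

mode = grouped_mode([0, 1, 2, 3, 4, 5], [1, 4, 8, 7, 3, 2], 1)
print(round(mode, 2))  # 2.8
```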

Standard Deviation for Grouped Data
The standard deviation for sample data, based on a frequency
distribution, is given by
$$S = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}}$$
which is used to estimate the population standard deviation $\sigma$.
Here $\bar{X} = \frac{\sum fX}{n}$, n is the sample size = $\sum f$,
and X = midpoint of each class

Standard Deviation for Grouped Data: Example
Frequency Distribution of Return on Investment of Mutual Funds

Return on Investment    Number of Mutual Funds
5-10                    10
10-15                   12
15-20                   16
20-25                   14
25-30                   8
Total                   60

Solution for the Example

Solution for the Example


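From the Microsoft Excel spreadsheet in the previous
slide, it is easy to see that
$$\text{Mean} = \bar{X} = \frac{\sum fX}{n} = \frac{1040}{60} = 17.333 \quad \text{(cell F10)}$$
$$\text{Standard Deviation} = S = \sqrt{\frac{\sum f (X - \bar{X})^2}{n - 1}} = \sqrt{\frac{2448.33}{59}} = 6.44 \quad \text{(cell H12)}$$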
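Since the spreadsheet itself is not reproduced here, the figures can be rechecked with a short sketch that uses the class midpoints (7.5, 12.5, ..., 27.5) and the frequencies from the table. The function name `grouped_mean_sd` is illustrative.

```python
import math

def grouped_mean_sd(midpoints, freqs):
    """Sample mean and standard deviation from class midpoints and frequencies."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    ss = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))
    return mean, ss, math.sqrt(ss / (n - 1))

midpoints = [7.5, 12.5, 17.5, 22.5, 27.5]
freqs = [10, 12, 16, 14, 8]
mean, ss, sd = grouped_mean_sd(midpoints, freqs)
print(round(mean, 3), round(ss, 2), round(sd, 2))  # 17.333 2448.33 6.44
```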

Practice Problem
Q. 3.47, p. 154: calculate the sample mean, variance, and standard deviation.

Age (Years)    Frequency
28-32          1
33-37          3
38-42          3
43-47          13
48-52          14
53-57          12
58-62          9
63-67          1
68-72          3
73-77          1

Solution

Midpoint (M)    Freq (f)    M*f    (M - mean)^2    f*(M - mean)^2
30              1           30     462.25          462.25
35              3           105    272.25          816.75
40              3           120    132.25          396.75
45              13          585    42.25           549.25
50              14          700    2.25            31.5
55              12          660    12.25           147
60              9           540    72.25           650.25
65              1           65     182.25          182.25
70              3           210    342.25          1026.75
75              1           75     552.25          552.25

sample mean = 51.5
variance = 81.61017
s = 9.033835
