
# Elementary Statistics

## (for Math 104 classes)

Dante V. Partosa
Mathematics Department
College of Science and Information Technology
Ateneo de Zamboanga University
Preliminaries

Statistics consists of conducting
studies to collect, organize,
summarize, and analyze data and to
draw conclusions from them.
Data are the values (measurements
or observations) that the variables
can assume.
Variables whose values are
determined by chance are called
random variables.
A collection of data values forms a
data set.
Each value in the data set is called
a data value or a datum.
Descriptive statistics consists of the
collection, organization, summarization,
and presentation of data.
A population consists of all subjects
(human or otherwise) that are being
studied.
A sample is a subgroup of the
population.
Inferential statistics consists of
generalizing from samples to
populations, performing hypothesis
testing, determining relationships
among variables, and making
predictions.
Variables and Types of Data
Qualitative variables are variables
that can be placed into distinct
categories, according to some
characteristic or attribute. For
example, gender (male or female).
Quantitative variables are numerical
in nature and can be ordered or
ranked. Example: age is numerical
and the values can be ranked.
Variables and Types of Data

Discrete variables assume values
that can be counted.
Continuous variables can assume all
values between any two specific
values. They are obtained by
measuring.
Variables and Types of Data

The nominal level of measurement
classifies data into mutually exclusive
(nonoverlapping), exhaustive categories
in which no order or ranking can be
imposed on the data.
Variables and Types of Data

The ordinal level of measurement
classifies data into categories that can
be ranked; precise differences between
the ranks do not exist.
Variables and Types of Data

The interval level of measurement ranks
data; precise differences between units
of measure do exist; there is no
meaningful zero.
Variables and Types of Data

The ratio level of measurement
possesses all the characteristics of
interval measurement, and there exists
a true zero. In addition, true ratios exist
for the same variable.
Data Collection and Sampling
Techniques

Data can be collected in a variety of ways.
One of the most common methods is through
the use of surveys.
Surveys can be done by using a variety of
methods.
Examples are telephone surveys, mail questionnaires,
personal interviews, surveying records, and
direct observations.
Data Collection and Sampling
Techniques

To obtain samples that are unbiased,
statisticians use four methods of
sampling.
Random samples are selected by using
chance methods or random numbers.
Data Collection and Sampling
Techniques

Systematic samples are obtained by
numbering each value in the population
and then selecting every kth value.
Data Collection and Sampling
Techniques

Stratified samples are selected by
dividing the population into groups
(strata) according to some characteristic
and then taking samples from each
group.
Data Collection and Sampling
Techniques

Cluster samples are selected by
dividing the population into groups and
then taking samples of the groups.
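
The four sampling methods can be sketched in a few lines of Python. This is only an illustration: the population of subjects numbered 1 to 20, the strata names, and the function names are hypothetical, and the cluster version assumes that every subject in a chosen group is kept.

```python
import random

def simple_random_sample(population, n, seed=0):
    """Random sample: pick n subjects by chance."""
    return random.Random(seed).sample(population, n)

def systematic_sample(population, k, start=0):
    """Systematic sample: every k-th subject, beginning at index `start`."""
    return population[start::k]

def stratified_sample(strata, n_per_stratum, seed=0):
    """Stratified sample: a random sample from each group (stratum)."""
    rng = random.Random(seed)
    return {name: rng.sample(members, n_per_stratum)
            for name, members in strata.items()}

def cluster_sample(clusters, n_clusters, seed=0):
    """Cluster sample: choose whole groups at random and keep their members."""
    chosen = random.Random(seed).sample(sorted(clusters), n_clusters)
    return {name: clusters[name] for name in chosen}

population = list(range(1, 21))  # hypothetical subjects numbered 1..20
groups = {"low": population[:10], "high": population[10:]}  # hypothetical groups

print(systematic_sample(population, k=5, start=2))  # [3, 8, 13, 18]
print(stratified_sample(groups, 2))
print(cluster_sample(groups, 1))
```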
Computers and Calculators

Computers and calculators make
numerical computation easier.
Many statistical packages are available.
Examples are SPSS, MINITAB,
PHStat, and Excel. The TI-83 calculator can
also be used to do statistical calculations.
Data must still be understood and
interpreted.
Organizing Data

When data are collected in original
form, they are called raw data.
When the raw data are organized into a
frequency distribution, the frequency
is the number of values in a
specific class of the distribution.
Organizing Data

A frequency distribution is the
organization of raw data in table form,
using classes and frequencies.
The following slide shows an example
of a frequency distribution.
Three Types of Frequency
Distributions

Categorical frequency distributions can
be used for data that can be placed in
specific categories, such as nominal- or
ordinal-level data.
Examples: political affiliation, religious
affiliation, blood type, etc.
Blood Type Frequency Distribution -
Example

| Class | Frequency | Percent |
|-------|-----------|---------|
| A     | 5         | 20      |
| B     | 7         | 28      |
| O     | 9         | 36      |
| AB    | 4         | 16      |
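
As a quick check, the percent column can be recomputed from the frequencies. The sketch below only assumes the four counts from the blood type example; the variable names are illustrative.

```python
# Frequencies from the blood type example
counts = {"A": 5, "B": 7, "O": 9, "AB": 4}

n = sum(counts.values())  # total number of observations
percents = {blood_type: 100 * f / n for blood_type, f in counts.items()}

for blood_type, f in counts.items():
    print(f"{blood_type:>2}  {f:>2}  {percents[blood_type]:.0f}%")
```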
Ungrouped Frequency
Distributions
Ungrouped frequency distributions - can
be used for data that can be enumerated
and when the range of values in the data
set is not large.
Examples - number of miles your
instructors have to travel from home to
campus, number of girls in a 4-child family
etc.
Number of Miles Traveled -
Example

| Class | Frequency |
|-------|-----------|
| 5     | 24        |
| 10    | 16        |
| 15    | 10        |
Grouped Frequency Distributions

Grouped frequency distributions can be
used when the range of values in the data
set is very large. The data must be
grouped into classes that are more than
one unit in width.
Example: the life of boat batteries in
hours.
Example

| Class limits | Class boundaries | Frequency | Cumulative frequency |
|--------------|------------------|-----------|----------------------|
| 24 - 30      | 23.5 - 37.5      | 4         | 4                    |
| 38 - 51      | 37.5 - 51.5      | 14        | 18                   |
| 52 - 65      | 51.5 - 65.5      | 7         | 25                   |
Terms Associated with a Grouped
Frequency Distribution

Class limits represent the smallest and
largest data values that can be included in
a class.
In the lifetimes of boat batteries example,
the values 24 and 30 of the first class are
the class limits.
The lower class limit is 24 and the upper
class limit is 30.
Terms Associated with a Grouped
Frequency Distribution

The class boundaries are used to
separate the classes so that there are
no gaps in the frequency distribution.
Terms Associated with a Grouped
Frequency Distribution

The class width for a class in a
frequency distribution is found by
subtracting the lower (or upper) class
limit of one class from the lower (or
upper) class limit of the next
class.
Guidelines for Constructing a
Frequency Distribution

There should be between 5 and 20
classes.
The class width should be an odd
number.
The classes must be mutually
exclusive.
Guidelines for Constructing a
Frequency Distribution

The classes must be continuous.
The classes must be exhaustive.
The classes must be equal in width.
Procedure for Constructing a Grouped
Frequency Distribution

Find the highest and lowest values.
Find the range.
Select the number of classes desired.
Find the width by dividing the range by
the number of classes and rounding up.
Procedure for Constructing a Grouped
Frequency Distribution

Select a starting point (usually the lowest
value); add the width to get the lower
limits.
Find the upper class limits.
Find the boundaries.
Tally the data, find the frequencies, and
find the cumulative frequency.
Grouped Frequency Distribution -
Example

10 8 6 14
22 13 17 19
11 9 18 14
13 12 15 15
5 11 16 11
Grouped Frequency Distribution -
Example

Step 1: Find the highest and lowest
values: H = 22 and L = 5.
Step 2: Find the range:
R = H - L = 22 - 5 = 17.
Step 3: Select the number of classes
desired. In this case it is
equal to 6.
Grouped Frequency Distribution -
Example

Step 4: Find the class width by
dividing the range by the number of
classes. Width = 17/6 = 2.83. This
value is rounded up to 3.
Grouped Frequency Distribution -
Example

Step 5: Select a starting point for the
lowest class limit. For convenience,
this value is chosen to be 5, the
smallest data value. The lower class
limits will be 5, 8, 11, 14, 17, and 20.
Grouped Frequency Distribution -
Example

Step 6: The upper class limits will be
7, 10, 13, 16, 19, and 22. For
example, the upper limit for the first
class is computed as 8 - 1 = 7, and so on.
Grouped Frequency Distribution -
Example

Step 7: Find the class boundaries by
subtracting 0.5 from each lower class
limit and adding 0.5 to each upper
class limit.
Grouped Frequency Distribution -
Example

Step 8: Tally the data, write the
numerical values for the tallies in the
frequency column, and find the
cumulative frequencies.
The grouped frequency distribution is
shown on the next slide.
Note: The dash "-" represents "to".

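| Class Limits | Class Boundaries | Frequency | Cumulative Frequency |
|--------------|------------------|-----------|----------------------|
| 05 to 07     | 4.5 - 7.5        | 2         | 2                    |
| 08 to 10     | 7.5 - 10.5       | 3         | 5                    |
| 11 to 13     | 10.5 - 13.5      | 6         | 11                   |
| 14 to 16     | 13.5 - 16.5      | 5         | 16                   |
| 17 to 19     | 16.5 - 19.5      | 3         | 19                   |
| 20 to 22     | 19.5 - 22.5      | 1         | 20                   |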
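
The eight steps can be sketched as a short Python function. This is a minimal illustration, not a general-purpose routine: it assumes whole-number data (so limits step by 1 and boundaries by 0.5) and a range that does not divide evenly by the number of classes; the function name `grouped_frequency` is hypothetical.

```python
import math

# Raw data from the example
data = [10, 8, 6, 14, 22, 13, 17, 19, 11, 9,
        18, 14, 13, 12, 15, 15, 5, 11, 16, 11]

def grouped_frequency(values, num_classes):
    """Return rows of (lower, upper, lower_bdry, upper_bdry, f, cumulative f)."""
    low, high = min(values), max(values)            # Step 1
    width = math.ceil((high - low) / num_classes)   # Steps 2-4: range / classes, rounded up
    rows, cumulative = [], 0
    for i in range(num_classes):
        lower = low + i * width                     # Step 5: lower class limits
        upper = lower + width - 1                   # Step 6: upper class limits
        freq = sum(lower <= v <= upper for v in values)  # Step 8: tally
        cumulative += freq
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))  # Step 7
    return rows

for row in grouped_frequency(data, 6):
    print(row)
```

The first row printed is `(5, 7, 4.5, 7.5, 2, 2)`, matching the first class of the table.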
Histograms, Frequency Polygons,
and Ogives

The three most commonly used
graphs in research are:
The histogram.
The frequency polygon.
The cumulative frequency graph, or
ogive (pronounced o-jive).
Histograms, Frequency Polygons,
and Ogives

The histogram is a graph that
displays the data by using vertical
bars of various heights to represent
the frequencies.
Example of a Histogram

[Figure: histogram. Vertical axis: frequency; horizontal axis: number of cigarettes smoked per day, marked at 5, 8, 11, 14, 17, and 20.]
Histograms, Frequency Polygons,
and Ogives

A frequency polygon is a graph that
displays the data by using lines that
connect points plotted for frequencies
at the midpoints of classes. The
frequencies represent the heights of
the midpoints.
Example of a Frequency Polygon

[Figure: frequency polygon. Vertical axis: frequency; horizontal axis: number of cigarettes smoked per day, marked from 2 to 26 in steps of 3.]

Histograms, Frequency Polygons,
and Ogives

A cumulative frequency graph or
ogive is a graph that represents the
cumulative frequencies for the
classes in a frequency distribution.
Example of an Ogive
[Figure: ogive. Vertical axis: cumulative frequency (up to 20); horizontal axis: number of cigarettes smoked per day, marked from 2 to 26 in steps of 3.]

Other Types of Graphs

Pareto charts: a Pareto chart is
used to represent a frequency
distribution for a categorical variable.
Other Types of Graphs-
Pareto Chart

When constructing a Pareto chart:
Make the bars the same width.
Arrange the data from largest to
smallest according to frequencies.
Make the units that are used for the
frequency equal in size.
Example of a Pareto Chart

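[Figure: Pareto chart for the number of crimes investigated by law enforcement officers in U.S. national parks during 1995. Left axis: count (0 to 250); right axis: percent (0 to 100).]

| Crime    | Count | Percent | Cumulative % |
|----------|-------|---------|--------------|
| Assaults | 164   | 68.3    | 68.3         |
| Rape     | 34    | 14.2    | 82.5         |
| Robbery  | 29    | 12.1    | 94.6         |
| Homicide | 13    | 5.4     | 100.0        |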
Other Types of Graphs

Time series graph: a time series
graph represents data that occur over
a specific period of time.
Other Types of Graphs -
Time Series Graph

[Figure: time series graph titled "Port Authority Transit Ridership". Vertical axis: ridership (in millions), 75 to 89; horizontal axis: year, 1990 to 1994.]
Other Types of Graphs

Pie graph: a pie graph is a circle that
is divided into sections or wedges
according to the percentage of
frequencies in each category of the
distribution.
Other Types of Graphs -
Pie Graph
[Figure: pie chart of the number of crimes investigated by law enforcement officers in U.S. national parks during 1995.]

| Crime    | Count | Percent |
|----------|-------|---------|
| Assaults | 164   | 68.3%   |
| Rape     | 34    | 14.2%   |
| Robbery  | 29    | 12.1%   |
| Homicide | 13    | 5.4%    |
Organizing Data
Describing Data
Measures of Central Tendency
A statistic is a characteristic or
measure obtained by using the data
values from a sample.
A parameter is a characteristic or
measure obtained by using the data
values from a specific population.
The Mean (arithmetic average)
The mean is defined to be the sum
of the data values divided by the
total number of values.
We will compute two means: one
for the sample and one for a finite
population of values.
The mean, in most cases, is not an
actual data value.
The Sample Mean

The symbol $\bar{X}$ represents the sample mean.
$\bar{X}$ is read as "X-bar". The Greek symbol $\Sigma$
is read as "sigma" and it means "to sum".

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n}.$$
The Sample Mean - Example

The ages in weeks of a random sample
of six kittens at an animal shelter are
3, 8, 5, 12, 14, and 12. Find the
average age of this sample.
The sample mean is

$$\bar{X} = \frac{\sum X}{n} = \frac{3 + 8 + 5 + 12 + 14 + 12}{6} = \frac{54}{6} = 9 \text{ weeks}.$$
The Population Mean

The Greek symbol $\mu$ represents the population
mean. The symbol $\mu$ is read as "mu".
$N$ is the size of the finite population.

$$\mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N}.$$
The Population Mean - Example

A small company consists of the owner, the manager,
the salesperson, and two technicians. The salaries are
listed as \$50,000, 20,000, 12,000, 9,000 and 9,000
respectively. (Assume this is the population.)
Then the population mean will be

$$\mu = \frac{\sum X}{N} = \frac{50{,}000 + 20{,}000 + 12{,}000 + 9{,}000 + 9{,}000}{5} = \$20{,}000.$$
The Sample Mean for an Ungrouped
Frequency Distribution

The mean for an ungrouped frequency
distribution is given by

$$\bar{X} = \frac{\sum (f \cdot X)}{n}.$$

Here $f$ is the frequency for the
corresponding value of $X$, and $n = \sum f$.
The Sample Mean for an Ungrouped
Frequency Distribution - Example

The scores for 25 students on a 4-point quiz
are given in the table. Find the mean score.

| Score, X | Frequency, f |
|----------|--------------|
| 0        | 2            |
| 1        | 4            |
| 2        | 12           |
| 3        | 4            |
| 4        | 3            |
The Sample Mean for an Ungrouped
Frequency Distribution - Example

| Score, X | Frequency, f | f · X |
|----------|--------------|-------|
| 0        | 2            | 0     |
| 1        | 4            | 4     |
| 2        | 12           | 24    |
| 3        | 4            | 12    |
| 4        | 3            | 12    |

$$\bar{X} = \frac{\sum f \cdot X}{n} = \frac{52}{25} = 2.08.$$
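
The quiz-score calculation can be reproduced with a small Python sketch. The dictionary maps each score X to its frequency f; the variable names are illustrative only.

```python
# Score X -> frequency f, from the quiz example
scores = {0: 2, 1: 4, 2: 12, 3: 4, 4: 3}

n = sum(scores.values())                         # n = sum of f
total = sum(x * f for x, f in scores.items())    # sum of f * X
mean = total / n

print(n, total, mean)  # 25 52 2.08
```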
The Sample Mean for a Grouped
Frequency Distribution

The mean for a grouped frequency
distribution is given by

$$\bar{X} = \frac{\sum (f \cdot X_m)}{n}.$$

Here $X_m$ is the corresponding
class midpoint.
The Sample Mean for a Grouped
Frequency Distribution - Example

Given the table below, find the mean.

| Class       | Frequency, f |
|-------------|--------------|
| 15.5 - 20.5 | 3            |
| 20.5 - 25.5 | 5            |
| 25.5 - 30.5 | 4            |
| 30.5 - 35.5 | 3            |
| 35.5 - 40.5 | 2            |
The Sample Mean for a Grouped
Frequency Distribution - Example

Table with class midpoints, $X_m$.

| Class       | Frequency, f | $X_m$ | f · $X_m$ |
|-------------|--------------|-------|-----------|
| 15.5 - 20.5 | 3            | 18    | 54        |
| 20.5 - 25.5 | 5            | 23    | 115       |
| 25.5 - 30.5 | 4            | 28    | 112       |
| 30.5 - 35.5 | 3            | 33    | 99        |
| 35.5 - 40.5 | 2            | 38    | 76        |
The Sample Mean for a Grouped
Frequency Distribution - Example

$$\sum f \cdot X_m = 54 + 115 + 112 + 99 + 76 = 456$$

and $n = 17$. So

$$\bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{456}{17} = 26.82.$$
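
The grouped-mean calculation can be sketched the same way, with each class midpoint standing in for the values in its class. The tuple layout `(lower boundary, upper boundary, frequency)` is an assumption made for this illustration.

```python
# (lower boundary, upper boundary, frequency) for each class in the example
classes = [(15.5, 20.5, 3), (20.5, 25.5, 5), (25.5, 30.5, 4),
           (30.5, 35.5, 3), (35.5, 40.5, 2)]

midpoints = [(lo + hi) / 2 for lo, hi, _ in classes]        # X_m
n = sum(f for _, _, f in classes)                           # n = sum of f
total = sum(f * xm for (_, _, f), xm in zip(classes, midpoints))  # sum of f * X_m
mean = total / n

print(midpoints)               # [18.0, 23.0, 28.0, 33.0, 38.0]
print(total, n, round(mean, 2))  # 456.0 17 26.82
```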
The Median

When a data set is ordered, it is
called a data array.
The median is defined to be the
midpoint of the data array.
The symbol used to denote the
median is MD.
The Median - Example

The weights (in pounds) of seven
army recruits are 180, 201, 220,
191, 219, 209, and 186. Find the
median.
Arrange the data in order and
select the middle point.
The Median - Example

Data array: 180, 186, 191, 201,
209, 219, 220.
The median, MD = 201.
The Median

In the previous example, there was
an odd number of values in the
data set. In this case it is easy to
select the middle number in the
data array.
The Median

When there is an even number of
values in the data set, the median
is obtained by taking the average of
the two middle numbers.
The Median - Example

Six customers purchased the following
numbers of magazines: 1, 7, 3, 2, 3, 4.
Find the median.
Arrange the data in order and compute
the middle point.
Data array: 1, 2, 3, 3, 4, 7.
The median, MD = (3 + 3)/2 = 3.
The Median - Example

The ages of 10 college students
are: 18, 24, 20, 35, 19, 23, 26, 23,
19, 20. Find the median.
Arrange the data in order and
compute the middle point.
The Median - Example

Data array: 18, 19, 19, 20, 20, 23,
23, 24, 26, 35.
The median,
MD = (20 + 23)/2 = 21.5.
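
The odd and even cases can be checked with a short Python sketch. The function name `median` is illustrative; Python's standard `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the data array; average of the two middle values if n is even."""
    ordered = sorted(values)          # the data array
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:                    # odd n: a single middle value
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2   # even n: average the two middle values

print(median([180, 201, 220, 191, 219, 209, 186]))          # 201
print(median([1, 7, 3, 2, 3, 4]))                           # 3.0
print(median([18, 24, 20, 35, 19, 23, 26, 23, 19, 20]))     # 21.5
```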
The Median-Ungrouped Frequency
Distribution

For an ungrouped frequency
distribution, find the median by
examining the cumulative
frequencies to locate the middle
value.
The Median-Ungrouped Frequency
Distribution

If n is the sample size, compute
n/2. Locate the data point where
n/2 values fall below and n/2
values fall above.
The Median-Ungrouped Frequency
Distribution - Example

LRJ Appliance recorded the number of
VCRs sold per week over a one-year
period. The data are given below.

| No. Sets Sold | Frequency |
|---------------|-----------|
| 1             | 4         |
| 2             | 9         |
| 3             | 6         |
| 4             | 2         |
| 5             | 3         |
The Median-Ungrouped Frequency
Distribution - Example

To locate the middle point, divide n by 2;
24/2 = 12.
Locate the point where 12 values fall
below and 12 values fall above.
Consider the cumulative distribution.
The 12th and 13th values fall in class 2.
Hence MD = 2.
The Median-Ungrouped Frequency
Distribution - Example

| No. Sets Sold | Frequency | Cumulative Frequency |
|---------------|-----------|----------------------|
| 1             | 4         | 4                    |
| 2             | 9         | 13                   |
| 3             | 6         | 19                   |
| 4             | 2         | 21                   |
| 5             | 3         | 24                   |

Class 2 contains the 5th through the
13th values.
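
Locating the middle value through the cumulative frequencies can be sketched as follows. The helper names `value_at` and `median_from_table` are hypothetical; positions are counted from 1, as in the example.

```python
# Number of sets sold -> frequency, from the LRJ Appliance example
sales = {1: 4, 2: 9, 3: 6, 4: 2, 5: 3}

def value_at(position, freq_table):
    """Return the data value at a 1-based position, using cumulative frequencies."""
    cumulative = 0
    for value in sorted(freq_table):
        cumulative += freq_table[value]
        if position <= cumulative:
            return value
    raise ValueError("position exceeds the number of values")

def median_from_table(freq_table):
    n = sum(freq_table.values())
    if n % 2 == 1:                                   # odd n: one middle value
        return value_at((n + 1) // 2, freq_table)
    # even n: average the (n/2)th and (n/2 + 1)th values
    return (value_at(n // 2, freq_table) + value_at(n // 2 + 1, freq_table)) / 2

print(median_from_table(sales))  # 2.0
```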
The Median for a Grouped
Frequency Distribution

The median can be computed from:

$$MD = \frac{(n/2) - cf}{f}\,(w) + L_m$$

where
n = sum of the frequencies
cf = cumulative frequency of the class
immediately preceding the median class
f = frequency of the median class
w = width of the median class
$L_m$ = lower boundary of the median class
The Median for a Grouped
Frequency Distribution - Example

Given the table below, find the median.

| Class       | Frequency, f |
|-------------|--------------|
| 15.5 - 20.5 | 3            |
| 20.5 - 25.5 | 5            |
| 25.5 - 30.5 | 4            |
| 30.5 - 35.5 | 3            |
| 35.5 - 40.5 | 2            |
The Median for a Grouped
Frequency Distribution - Example

Table with cumulative frequencies.

| Class       | Frequency, f | Cumulative Frequency |
|-------------|--------------|----------------------|
| 15.5 - 20.5 | 3            | 3                    |
| 20.5 - 25.5 | 5            | 8                    |
| 25.5 - 30.5 | 4            | 12                   |
| 30.5 - 35.5 | 3            | 15                   |
| 35.5 - 40.5 | 2            | 17                   |
The Median for a Grouped Frequency
Distribution - Example

To locate the halfway point, divide n by 2;
17/2 = 8.5 ≈ 9.
Find the class that contains the 9th value.
This will be the median class.
Consider the cumulative distribution.
The median class will then be 25.5 -
30.5.
The Median for a Grouped
Frequency Distribution

n = 17
cf = 8
f = 4
w = 30.5 - 25.5 = 5
$L_m$ = 25.5

$$MD = \frac{(n/2) - cf}{f}\,(w) + L_m = \frac{(17/2) - 8}{4}\,(5) + 25.5 = 26.125.$$
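
The grouped-median formula can be sketched in Python. The function name `grouped_median` and the tuple layout `(lower boundary, upper boundary, frequency)` are assumptions for this illustration; the median class is taken to be the first class whose cumulative frequency reaches n/2.

```python
# (lower boundary, upper boundary, frequency) for each class in the example
classes = [(15.5, 20.5, 3), (20.5, 25.5, 5), (25.5, 30.5, 4),
           (30.5, 35.5, 3), (35.5, 40.5, 2)]

def grouped_median(classes):
    n = sum(f for _, _, f in classes)
    cf = 0                                  # cumulative frequency before the current class
    for lower, upper, f in classes:
        if cf + f >= n / 2:                 # this is the median class
            w = upper - lower               # class width
            return (n / 2 - cf) / f * w + lower
        cf += f
    raise ValueError("empty frequency distribution")

print(grouped_median(classes))  # 26.125
```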
The Mode

The mode is defined to be the value
that occurs most often in a data set.
A data set can have more than one
mode.
A data set is said to have no mode if
all values occur with equal frequency.
The Mode - Examples

The following data represent the duration (in
days) of U.S. space shuttle voyages for the
years 1992-94. Find the mode.
Data set: 8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10,
14, 11, 8, 14, 11.
Ordered set: 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10,
10, 11, 11, 14, 14, 14. Mode = 8.
The Mode - Examples

Six strains of bacteria were tested to see how
long they could remain alive outside their
normal environment. The time, in minutes, is
given below. Find the mode.
Data set: 2, 3, 5, 7, 8, 10.
There is no mode since each data value
occurs equally with a frequency of one.
The Mode - Examples

Eleven different automobiles were tested at a
speed of 15 mph for stopping distances. The
distance, in feet, is given below. Find the
mode.
Data set: 15, 18, 18, 18, 20, 22, 24, 24, 24,
26, 26.
There are two modes (bimodal). The values
are 18 and 24, since each occurs three times,
more often than any other value.
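
The three mode examples can be checked with a short sketch. The function name `modes` is illustrative; it returns an empty list when all values occur with equal frequency, following the "no mode" convention stated above.

```python
from collections import Counter

def modes(values):
    """Return the list of modes; empty if every value occurs equally often."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []                                   # no mode
    return sorted(v for v, c in counts.items() if c == top)

shuttle = [8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11]
bacteria = [2, 3, 5, 7, 8, 10]
stopping = [15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]

print(modes(shuttle))   # [8]
print(modes(bacteria))  # []
print(modes(stopping))  # [18, 24]
```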
The Mode for an Ungrouped
Frequency Distribution - Example

Given the table below, find the mode.

| Values | Frequency, f |
|--------|--------------|
| 15     | 3            |
| 20     | 5            |
| 25     | 8            |
| 30     | 3            |
| 35     | 2            |

The mode is 25, the value with the largest frequency.
The Mode - Grouped Frequency
Distribution
The mode for grouped data is the
modal class.
The modal class is the class with the
largest frequency.
Sometimes the midpoint of the class
is used rather than the boundaries.
The Mode for a Grouped Frequency
Distribution - Example

Given the table below, find the mode.

| Class       | Frequency, f |
|-------------|--------------|
| 15.5 - 20.5 | 3            |
| 20.5 - 25.5 | 5            |
| 25.5 - 30.5 | 7            |
| 30.5 - 35.5 | 3            |
| 35.5 - 40.5 | 2            |

The modal class is 25.5 - 30.5, the class with the largest frequency.
The Midrange

The midrange is found by adding the
lowest and highest values in the data
set and dividing by 2.
The midrange is a rough estimate of the
middle value of the data.
The symbol that is used to represent the
midrange is MR.
The Midrange - Example

Last winter, the city of Brownsville,
Minnesota, reported the following numbers of
water-line breaks per month: 2, 3, 6, 8, 4, 1.
Find the midrange.
MR = (1 + 8)/2 = 4.5.
Note: Extreme values influence the midrange,
and thus it may not be a typical description of
the middle.
The Weighted Mean

The weighted mean is used when the
values in a data set are not all equally
represented.
The weighted mean of a variable X is
found by multiplying each value by its
corresponding weight and dividing the sum
of the products by the sum of the weights.
The Weighted Mean

The weighted mean

$$\bar{X} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w}$$

where $w_1, w_2, \ldots, w_n$ are the weights
for the values $X_1, X_2, \ldots, X_n$.
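
A minimal sketch of the weighted mean follows. The grades and credit units are hypothetical values chosen only to illustrate the formula; they do not come from the slides.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)

# Hypothetical data: course grades weighted by credit units
grades = [4.0, 3.0, 2.0]
units = [3, 3, 4]

print(weighted_mean(grades, units))  # (12 + 9 + 8) / 10 = 2.9
```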

Distribution Shapes

Frequency distributions can assume
many shapes.
The three most important shapes are
positively skewed, symmetrical, and
negatively skewed.
Positively Skewed

[Figure: positively skewed distribution curve.]
Mode < Median < Mean
Symmetrical

[Figure: symmetrical distribution curve.]
Mean = Median = Mode
Negatively Skewed

[Figure: negatively skewed distribution curve.]
Mean < Median < Mode
Measures of Variation - Range

The range is defined to be the highest
value minus the lowest value. The
symbol R is used for the range.
R = highest value - lowest value.
Extremely large or extremely small data
values can drastically affect the range.
Measures of Variation - Population
Variance

The variance is the average of the squares of the
distance each value is from the mean.
The symbol for the population variance is
$\sigma^2$ ($\sigma$ is the Greek lowercase letter sigma):

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N},$$

where
X = individual value
$\mu$ = population mean
N = population size
Measures of Variation - Population
Standard Deviation

The standard deviation is the square
root of the variance.

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}.$$
Measures of Variation - Example

Consider the following data to constitute
the population: 10, 60, 50, 30, 40, 20.
Find the mean and variance.
The mean $\mu$ = (10 + 60 + 50 + 30 + 40 +
20)/6 = 210/6 = 35.
The variance $\sigma^2$ = 1750/6 = 291.67. See
next slide for computations.
Measures of Variation - Example

| X         | X - μ | (X - μ)²   |
|-----------|-------|------------|
| 10        | -25   | 625        |
| 60        | +25   | 625        |
| 50        | +15   | 225        |
| 30        | -5    | 25         |
| 40        | +5    | 25         |
| 20        | -15   | 225        |
| Sum = 210 |       | Sum = 1750 |
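
The table of deviations can be reproduced with a short sketch that applies the population formulas directly. The variable names are illustrative.

```python
import math

population = [10, 60, 50, 30, 40, 20]

N = len(population)
mu = sum(population) / N                                  # population mean
squared_deviations = [(x - mu) ** 2 for x in population]  # (X - mu)^2 column
variance = sum(squared_deviations) / N                    # divide by N for a population
sigma = math.sqrt(variance)                               # population standard deviation

print(mu)                      # 35.0
print(sum(squared_deviations)) # 1750.0
print(round(variance, 2))      # 291.67
print(round(sigma, 2))         # 17.08
```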
Measures of Variation - Sample
Variance

The unbiased estimator of the population
variance, or the sample variance, is a
statistic whose value approximates the
expected value of a population variance.
It is denoted by $s^2$, where

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1},$$

and
$\bar{X}$ = sample mean
n = sample size
Measures of Variation - Sample
Standard Deviation

The sample standard deviation is the square
root of the sample variance.

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}.$$
Shortcut Formula for the Sample
Variance and the Standard Deviation

$$s^2 = \frac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1}$$

$$s = \sqrt{\frac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1}}$$
Sample Variance - Example

Find the variance and standard
deviation for the following sample: 16,
19, 15, 15, 14.
$\sum X$ = 16 + 19 + 15 + 15 + 14 = 79.
$\sum X^2$ = 16² + 19² + 15² + 15² + 14²
= 1263.
Sample Variance - Example

$$s^2 = \frac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1} = \frac{1263 - (79)^2 / 5}{4} = 3.7$$

$$s = \sqrt{3.7} = 1.9.$$
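
The shortcut formula can be checked against the definition. The sketch below computes the sample variance both ways; `statistics.variance` is Python's standard-library implementation with the n - 1 divisor, used here only as a cross-check.

```python
import math
import statistics

sample = [16, 19, 15, 15, 14]

n = len(sample)
sum_x = sum(sample)                      # sum of X   = 79
sum_x2 = sum(x ** 2 for x in sample)     # sum of X^2 = 1263

s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)   # shortcut formula
s = math.sqrt(s2)

print(sum_x, sum_x2)                  # 79 1263
print(round(s2, 1), round(s, 1))      # 3.7 1.9
print(statistics.variance(sample))    # definition-based value for comparison
```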
Sample Variance for Grouped and
Ungrouped Data

For grouped data, use the class
midpoints for the observed value in the
different classes.
For ungrouped data, use the same
formula (see next slide) with the class
midpoints, $X_m$, replaced with the actual
observed X value.
Sample Variance for Grouped and
Ungrouped Data

The sample variance for grouped data:

$$s^2 = \frac{\sum f \cdot X_m^2 - \left[\left(\sum f \cdot X_m\right)^2 / n\right]}{n - 1}.$$

For ungrouped data, replace $X_m$
with the observed X value.
Sample Variance for Grouped Data
- Example

| X  | f | f · X | f · X² |
|----|---|-------|--------|
| 5  | 2 | 10    | 50     |
| 6  | 3 | 18    | 108    |
| 7  | 8 | 56    | 392    |
| 8  | 1 | 8     | 64     |
| 9  | 6 | 54    | 486    |
| 10 | 4 | 40    | 400    |

n = 24, $\sum f \cdot X$ = 186, $\sum f \cdot X^2$ = 1500
Sample Variance for Ungrouped
Data - Example

The sample variance and standard deviation:

$$s^2 = \frac{\sum f \cdot X^2 - \left[\left(\sum f \cdot X\right)^2 / n\right]}{n - 1} = \frac{1500 - \left[(186)^2 / 24\right]}{23} = 2.54.$$

$$s = \sqrt{2.54} = 1.6.$$
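
The frequency-table version of the shortcut formula can be sketched as follows, using the X and f columns of the example; the variable names are illustrative.

```python
import math

# Value X -> frequency f, from the example table
table = {5: 2, 6: 3, 7: 8, 8: 1, 9: 6, 10: 4}

n = sum(table.values())                              # sum of f
sum_fx = sum(f * x for x, f in table.items())        # sum of f * X
sum_fx2 = sum(f * x ** 2 for x, f in table.items())  # sum of f * X^2

s2 = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
s = math.sqrt(s2)

print(n, sum_fx, sum_fx2)            # 24 186 1500
print(round(s2, 2), round(s, 1))     # 2.54 1.6
```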
Coefficient of Variation

The coefficient of variation is defined to
be the standard deviation divided by the
mean. The result is expressed as a
percentage.

$$CVar = \frac{s}{\bar{X}} \cdot 100\% \quad \text{or} \quad CVar = \frac{\sigma}{\mu} \cdot 100\%.$$
Chebyshev's Theorem

The proportion of values from a data set that
will fall within k standard deviations of the
mean will be at least $1 - 1/k^2$, where k is any
number greater than 1.
For k = 2, at least 75% of the values will lie within 2
standard deviations of the mean. For k = 3,
at least 88.89% (approximately 89%) will lie within 3 standard
deviations.
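
The bound is easy to tabulate. The sketch below evaluates $1 - 1/k^2$ for a few values of k; the function name `chebyshev_bound` is illustrative.

```python
def chebyshev_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

for k in (1.5, 2, 3):
    print(k, f"{chebyshev_bound(k):.2%}")
```

For k = 2 and k = 3 this prints 75.00% and 88.89%, the figures quoted above.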
The Empirical (Normal) Rule

For any bell-shaped distribution:
Approximately 68% of the data values will fall
within one standard deviation of the mean.
Approximately 95% will fall within two
standard deviations of the mean.
Approximately 99.7% will fall within three
standard deviations of the mean.
The Empirical (Normal) Rule

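[Figure: bell-shaped curve with the horizontal axis marked at $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$, and $\mu + 3\sigma$; the interval $\mu \pm 2\sigma$ is labeled 95%.]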
Measures of Position - z score

The standard score or z score for a
value is obtained by subtracting the
mean from the value and dividing the
result by the standard deviation.
The symbol z is used for the z
score.
Measures of Position - z score

The z score represents the number of
standard deviations a data value falls above
or below the mean.
For samples:

$$z = \frac{X - \bar{X}}{s}.$$

For populations:

$$z = \frac{X - \mu}{\sigma}.$$
z-score - Example

A student scored 65 on a statistics exam that
had a mean of 50 and a standard deviation of
10. Compute the z-score.
z = (65 - 50)/10 = 1.5.
That is, the score of 65 is 1.5 standard
deviations above the mean.
It is above the mean because the z-score is positive.
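
As a small illustration, the z-score calculation can be wrapped in a function. The name `z_score` and its parameter names are illustrative.

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd

z = z_score(65, mean=50, sd=10)
print(z)  # 1.5
print("above the mean" if z > 0 else "at or below the mean")
```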
Measures of Position - Percentiles

Percentiles divide the distribution into 100
groups.
The $P_k$ percentile is defined to be that
numerical value such that at most k% of
the values are smaller than $P_k$ and at most
(100 - k)% are larger than $P_k$ in an ordered
data set.
Percentile Formula

The percentile corresponding to a given
value (X) is computed by using the
formula:

$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$$
Percentiles - Example

A teacher gives a 20-point test to 10 students.
Find the percentile rank of a score of 12.
Scores: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10.
Ordered set: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
Percentile = [(6 + 0.5)/10](100%) = 65th
percentile. The student did better than 65% of the
class.
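
The percentile rank formula can be sketched as a short function. The name `percentile_rank` is illustrative; it counts the values strictly below X, as in the formula.

```python
def percentile_rank(values, x):
    """Percentile rank of x: (number of values below x + 0.5) / n * 100."""
    below = sum(v < x for v in values)
    return 100 * (below + 0.5) / len(values)

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]
print(percentile_rank(scores, 12))  # 65.0
```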
Percentiles - Finding the value
Corresponding to a Given
Percentile

Procedure: Let p be the percentile and n the
sample size.
Step 1: Arrange the data in order.
Step 2: Compute c = (n · p)/100.
Step 3: If c is not a whole number, round up
to the next whole number. If c is a whole
number, use the value halfway between the cth
and (c + 1)th values.
Percentiles - Finding the value
Corresponding to a Given
Percentile

Step 4: The value of c is the position value of
the required percentile.
Example: Find the value of the 25th
percentile for the following data set: 2, 3, 5, 6,
8, 10, 12, 15, 18, 20.
Note: the data set is already ordered.
n = 10, p = 25, so c = (10 · 25)/100 = 2.5.
Hence round up to c = 3.
Percentiles - Finding the value
Corresponding to a Given
Percentile

Thus, the value of the 25th percentile is the
value X = 5.
Find the 80th percentile.
c = (10 · 80)/100 = 8. Thus the value of the
80th percentile is the average of the 8th and
9th values. Thus, the 80th percentile for the
data set is (15 + 18)/2 = 16.5.
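
The four-step procedure can be sketched in Python. The function name `percentile_value` is hypothetical, and the sketch assumes 0 < p < 100 so that the position c always falls inside the data set. Other software may use different interpolation rules and return slightly different values.

```python
import math

def percentile_value(values, p):
    """Value at the p-th percentile, following the procedure above (0 < p < 100)."""
    ordered = sorted(values)              # Step 1
    n = len(ordered)
    c = n * p / 100                       # Step 2
    if c != int(c):                       # Step 3: not a whole number, round up
        position = math.ceil(c)
        return ordered[position - 1]      # positions are 1-based
    c = int(c)                            # whole number: halfway between c-th and (c+1)-th
    return (ordered[c - 1] + ordered[c]) / 2

data = [2, 3, 5, 6, 8, 10, 12, 15, 18, 20]
print(percentile_value(data, 25))  # 5
print(percentile_value(data, 80))  # 16.5
```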
Special Percentiles - Deciles and
Quartiles

Deciles divide the data set into 10
groups.
Deciles are denoted by $D_1, D_2, \ldots, D_9$
with the corresponding percentiles
being $P_{10}, P_{20}, \ldots, P_{90}$.
Quartiles divide the data set into 4
groups.
Special Percentiles - Deciles and
Quartiles

Quartiles are denoted by $Q_1$, $Q_2$, and
$Q_3$ with the corresponding percentiles
being $P_{25}$, $P_{50}$, and $P_{75}$.
The median is the same as $P_{50}$ or $Q_2$.
Outliers and the Interquartile
Range (IQR)

An outlier is an extremely high or an
extremely low data value when
compared with the rest of the data
values.
The interquartile range is IQR
= $Q_3 - Q_1$.
Outliers and the Interquartile
Range (IQR)

To determine whether a data value can be
considered as an outlier:
Step 1: Compute $Q_1$ and $Q_3$.
Step 2: Find the IQR = $Q_3 - Q_1$.
Step 3: Compute (1.5)(IQR).
Step 4: Compute $Q_1$ - (1.5)(IQR) and
$Q_3$ + (1.5)(IQR).
Outliers and the Interquartile
Range (IQR)

To determine whether a data value can be
considered as an outlier:
Step 5: Compare the data value (say X) with
$Q_1$ - (1.5)(IQR) and $Q_3$ + (1.5)(IQR).
If X < $Q_1$ - (1.5)(IQR) or
if X > $Q_3$ + (1.5)(IQR), then X is considered
an outlier.
Outliers and the Interquartile
Range (IQR) - Example

Given the data set 5, 6, 12, 13, 15, 18, 22, 50,
can the value of 50 be considered as an
outlier?
$Q_1$ = 9, $Q_3$ = 20, IQR = 11. Verify.
(1.5)(IQR) = (1.5)(11) = 16.5.
9 - 16.5 = -7.5 and 20 + 16.5 = 36.5.
The value of 50 is outside the range -7.5 to
36.5, hence 50 is an outlier.
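
The five steps can be verified with a short sketch. The quartiles are computed with the percentile procedure from the earlier slides (c = n·p/100, rounding up or averaging); the helper names `percentile_value` and `find_outliers` are hypothetical.

```python
import math

def percentile_value(values, p):
    """Percentile by position c = n*p/100 (round up, or average c-th and (c+1)-th)."""
    ordered = sorted(values)
    c = len(ordered) * p / 100
    if c != int(c):
        return ordered[math.ceil(c) - 1]
    c = int(c)
    return (ordered[c - 1] + ordered[c]) / 2

def find_outliers(values):
    q1 = percentile_value(values, 25)          # Step 1
    q3 = percentile_value(values, 75)
    iqr = q3 - q1                              # Step 2
    low = q1 - 1.5 * iqr                       # Steps 3-4: the two fences
    high = q3 + 1.5 * iqr
    outliers = [x for x in values if x < low or x > high]   # Step 5
    return q1, q3, iqr, low, high, outliers

data = [5, 6, 12, 13, 15, 18, 22, 50]
print(find_outliers(data))  # (9.0, 20.0, 11.0, -7.5, 36.5, [50])
```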
Exploratory Data Analysis - Stem
and Leaf Plot

A stem and leaf plot is a data plot
that uses part of a data value as the
stem and part of the data value as
the leaf to form groups or classes.
Exploratory Data Analysis - Stem
and Leaf Plot - Example

At an outpatient testing center, a
sample of 20 days showed the following
numbers of cardiograms done each day:
25, 31, 20, 32, 13, 14, 43, 02, 57, 23,
36, 32, 33, 32, 44, 32, 52, 44, 51, 45.
Construct a stem and leaf plot for the
data.
Exploratory Data Analysis - Stem
and Leaf Plot - Example

| Leading Digit (Stem) | Trailing Digit (Leaf) |
|----------------------|-----------------------|
| 0                    | 2                     |
| 1                    | 3 4                   |
| 2                    | 0 3 5                 |
| 3                    | 1 2 2 2 2 3 6         |
| 4                    | 3 4 4 5               |
| 5                    | 1 2 7                 |
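
A stem and leaf plot for the cardiogram data can be built with a few lines of Python. The sketch assumes values between 0 and 99, so the tens digit is the stem and the units digit is the leaf.

```python
cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

stems = {}
for value in sorted(cardiograms):
    # tens digit is the stem, units digit is the leaf
    stems.setdefault(value // 10, []).append(value % 10)

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```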
Exploratory Data Analysis
Box Plot

When the data set contains a small
number of values, a box plot is used to
graphically represent the data set.
These plots involve five values: the
minimum value, the lower hinge, the
median, the upper hinge, and the
maximum value.
Exploratory Data Analysis
Box Plot

The lower hinge is the median of all values
less than or equal to the median when the
data set has an odd number of values, or
the median of all values less than the
median when the data set has an even
number of values. The symbol for the
lower hinge is LH.
Exploratory Data Analysis
Box Plot

The upper hinge is the median of all
values greater than or equal to the
median when the data set has an odd
number of values, or the median of all
values greater than the median when the
data set has an even number of values.
The symbol for the upper hinge is UH.
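
The five values used in a box plot can be computed with a short sketch. It interprets the hinge definitions by position in the ordered data (the lower and upper halves, sharing the median when n is odd); the function names are hypothetical, and the data are the cardiogram counts from the stem and leaf example.

```python
def median(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    return ordered[mid] if n % 2 == 1 else (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (minimum, LH, median, UH, maximum)."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    if n % 2 == 1:      # odd n: both halves include the median
        lower, upper = ordered[:half + 1], ordered[half:]
    else:               # even n: halves split cleanly
        lower, upper = ordered[:half], ordered[half:]
    return ordered[0], median(lower), median(ordered), median(upper), ordered[-1]

cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]
print(five_number_summary(cardiograms))  # (2, 24.0, 32.0, 44.0, 57)
```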
Exploratory Data Analysis - Box
Plot - Example (Cardiograms data)

[Figure: box plot of the cardiograms data on a horizontal scale from 0 to 60, with the minimum, LH, median, UH, and maximum labeled.]
Information Obtained from a
Box Plot

If the median is near the center of the box,
the distribution is approximately symmetric.
If the median falls to the left of the center of
the box, the distribution is positively skewed.
If the median falls to the right of the center of
the box, the distribution is negatively skewed.
Information Obtained from a
Box Plot

If the lines are about the same length, the
distribution is approximately symmetric.
If the right line is longer than the left line, the
distribution is positively skewed.
If the left line is longer than the right line, the
distribution is negatively skewed.