
Elementary Statistics

(for Math 104 classes)

Dante V. Partosa
Mathematics Department
College of Science and Information Technology
Ateneo de Zamboanga University
Preliminaries

Statistics is the science of conducting
studies to collect, organize,
summarize, analyze, and draw
conclusions from data.
Data are the values (measurements
or observations) that the variables
can assume.
Variables whose values are
determined by chance are called
random variables.
A collection of data values forms a
data set.
Each value in the data set is called
a data value or a datum.
Descriptive statistics consists of the
collection, organization, summarization,
and presentation of data.
A population consists of all subjects
(human or otherwise) that are being
studied.
A sample is a subgroup of the
population.
Inferential statistics consists of
generalizing from samples to
populations, performing hypothesis
testing, determining relationships
among variables, and making
predictions.
Variables and Types of Data
Qualitative variables are variables
that can be placed into distinct
categories, according to some
characteristic or attribute. For
example, gender (male or female).
Quantitative variables are numerical
in nature and can be ordered or
ranked. Example: age is numerical
and the values can be ranked.
Variables and Types of Data

Discrete variables assume values


that can be counted.
Continuous variables can assume all
values between any two specific
values. They are obtained by
measuring.
Variables and Types of Data

The nominal level of measurement


classifies data into mutually exclusive
(nonoverlapping), exhaustive categories
in which no order or ranking can be
imposed on the data.
Variables and Types of Data

The ordinal level of measurement


classifies data into categories that can
be ranked; precise differences between
the ranks do not exist.
Variables and Types of Data

The interval level of measurement ranks


data; precise differences between units
of measure do exist; there is no
meaningful zero.
Variables and Types of Data

The ratio level of measurement


possesses all the characteristics of
interval measurement, and there exists
a true zero. In addition, true ratios exist
for the same variable.
Data Collection and Sampling
Techniques

Data can be collected in a variety of ways.


One of the most common methods is through
the use of surveys.
Surveys can be conducted in a variety of
ways.
Examples are telephone surveys, mailed questionnaires,
personal interviews, surveys of records, and
direct observation.
Data Collection and Sampling
Techniques

To obtain samples that are unbiased,


statisticians use four methods of
sampling.
Random samples are selected by using
chance methods or random numbers.
Data Collection and Sampling
Techniques

Systematic samples are obtained by


numbering each value in the population
and then selecting every kth value.
Data Collection and Sampling
Techniques

Stratified samples are selected by


dividing the population into groups
(strata) according to some characteristic
and then taking samples from each
group.
Data Collection and Sampling
Techniques

Cluster samples are selected by


dividing the population into groups
(clusters), randomly selecting some of those
groups, and using their members as the sample.
Computers and Calculators

Computers and calculators make


numerical computation easier.
Many statistical packages are available.
Examples include SPSS, MINITAB,
PHStat, and Excel. The TI-83 calculator can
also be used to do statistical calculations.
Data must still be understood and
interpreted.
Organizing Data

When data are collected in original


form, they are called raw data.
When the raw data are organized into a
frequency distribution, the frequency
is the number of values in a
specific class of the distribution.
Organizing Data

A frequency distribution is the


organizing of raw data in table form,
using classes and frequencies.
The following slide shows an example
of a frequency distribution.
Three Types of Frequency
Distributions

Categorical frequency distributions - can


be used for data that can be placed in
specific categories, such as nominal- or
ordinal-level data.
Examples - political affiliation, religious
affiliation, blood type etc.
Blood Type Frequency Distribution -
Example

Class   Frequency   Percent
A       5           20
B       7           28
O       9           36
AB      4           16
Ungrouped Frequency
Distributions
Ungrouped frequency distributions - can
be used for data that can be enumerated
and when the range of values in the data
set is not large.
Examples - number of miles your
instructors have to travel from home to
campus, number of girls in a 4-child family
etc.
Number of Miles Traveled -
Example

Class Frequency

5 24

10 16

15 10
Grouped Frequency Distributions

Grouped frequency distributions - can be


used when the range of values in the data
set is very large. The data must be
grouped into classes that are more than
one unit in width.
Examples - the life of boat batteries in
hours.
Lifetimes of Boat Batteries -
Example

Class limits   Class boundaries   Frequency   Cumulative frequency
24 - 37        23.5 - 37.5        4           4
38 - 51        37.5 - 51.5        14          18
52 - 65        51.5 - 65.5        7           25
Terms Associated with a Grouped
Frequency Distribution

Class limits represent the smallest and


largest data values that can be included in
a class.
In the lifetimes of boat batteries example,
the values 24 and 37 of the first class are
the class limits.
The lower class limit is 24 and the upper
class limit is 37.
Terms Associated with a Grouped
Frequency Distribution

The class boundaries are used to


separate the classes so that there are
no gaps in the frequency distribution.
Terms Associated with a Grouped
Frequency Distribution

The class width for a class in a


frequency distribution is found by
subtracting the lower (or upper) class
limit of one class from the lower (or
upper) class limit of the next
class.
Guidelines for Constructing a
Frequency Distribution

There should be between 5 and 20


classes.
The class width should be an odd
number.
The classes must be mutually
exclusive.
Guidelines for Constructing a
Frequency Distribution

The classes must be continuous.


The classes must be exhaustive.
The classes must be equal in width.
Procedure for Constructing a Grouped
Frequency Distribution

Find the highest and lowest value.


Find the range.
Select the number of classes desired.
Find the width by dividing the range by
the number of classes and rounding up.
Procedure for Constructing a Grouped
Frequency Distribution

Select a starting point (usually the lowest


value); add the width to get the lower
limits.
Find the upper class limits.
Find the boundaries.
Tally the data, find the frequencies, and
find the cumulative frequency.
Grouped Frequency Distribution -
Example

10 8 6 14
22 13 17 19
11 9 18 14
13 12 15 15
5 11 16 11
Grouped Frequency Distribution -
Example

Step 1: Find the highest and lowest


values: H = 22 and L = 5.
Step 2: Find the range:
R = H - L = 22 - 5 = 17.
Step 3: Select the number of classes
desired. In this case it is
equal to 6.
Grouped Frequency Distribution -
Example

Step 4: Find the class width by


dividing the range by the number of
classes. Width = 17/6 = 2.83. This
value is rounded up to 3.
Grouped Frequency Distribution -
Example

Step 5: Select a starting point for the


lowest class limit. For convenience,
this value is chosen to be 5, the
smallest data value. The lower class
limits will be 5, 8, 11, 14, 17, and 20.
Grouped Frequency Distribution -
Example

Step 6: The upper class limits will be


7, 10, 13, 16, 19, and 22. For
example, the upper limit for the first
class is computed as 8 - 1, etc.
Grouped Frequency Distribution -
Example

Step 7: Find the class boundaries by


subtracting 0.5 from each lower class
limit and adding 0.5 to the upper
class limit.
Grouped Frequency Distribution -
Example

Step 8: Tally the data, write the


numerical values for the tallies in the
frequency column, and find the
cumulative frequencies.
The grouped frequency distribution is
shown on the next slide.
Note: The dash "-" represents "to".

Class Limits   Class Boundaries   Frequency   Cumulative Frequency
05 to 07       4.5 - 7.5          2           2
08 to 10       7.5 - 10.5         3           5
11 to 13       10.5 - 13.5        6           11
14 to 16       13.5 - 16.5        5           16
17 to 19       16.5 - 19.5        3           19
20 to 22       19.5 - 22.5        1           20
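The eight steps of the worked example can be sketched in Python. This is a minimal illustration, not part of the original slides: the function name `grouped_frequency_distribution` is hypothetical, and the sketch assumes whole-number data and a range that is not an exact multiple of the number of classes.

```python
import math

# Data from the worked example.
data = [10, 8, 6, 14, 22, 13, 17, 19, 11, 9,
        18, 14, 13, 12, 15, 15, 5, 11, 16, 11]

def grouped_frequency_distribution(values, num_classes):
    """Return rows of (lower limit, upper limit, lower boundary,
    upper boundary, frequency, cumulative frequency)."""
    low, high = min(values), max(values)            # Step 1
    width = math.ceil((high - low) / num_classes)   # Steps 2-4: range / classes, rounded up
    rows, cumulative = [], 0
    for i in range(num_classes):
        lower = low + i * width                     # Step 5: lower class limits
        upper = lower + width - 1                   # Step 6: upper class limits
        freq = sum(lower <= v <= upper for v in values)  # Step 8: tally
        cumulative += freq
        # Step 7: boundaries are the limits moved outward by 0.5
        rows.append((lower, upper, lower - 0.5, upper + 0.5, freq, cumulative))
    return rows

table = grouped_frequency_distribution(data, 6)
for row in table:
    print(row)
```

The printed rows should match the frequency and cumulative frequency columns of the table for this example.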
Histograms, Frequency Polygons,
and Ogives

The three most commonly used


graphs in research are:
The histogram.
The frequency polygon.
The cumulative frequency graph, or
ogive (pronounced o-jive).
Histograms, Frequency Polygons,
and Ogives

The histogram is a graph that


displays the data by using vertical
bars of various heights to represent
the frequencies.
Example of a Histogram

[Figure: histogram. Vertical axis: Frequency; horizontal axis: Number of Cigarettes Smoked per Day, with tick marks at 5, 8, 11, 14, 17, and 20.]
Histograms, Frequency Polygons,
and Ogives

A frequency polygon is a graph that


displays the data by using lines that
connect points plotted for frequencies
at the midpoint of classes. The
frequencies represent the heights of
the midpoints.
Example of a Frequency Polygon

[Figure: frequency polygon. Vertical axis: Frequency; horizontal axis: Number of Cigarettes Smoked per Day, with tick marks from 2 to 26 in steps of 3.]


Histograms, Frequency Polygons,
and Ogives

A cumulative frequency graph or


ogive is a graph that represents the
cumulative frequencies for the
classes in a frequency distribution.
Example of an Ogive
[Figure: ogive. Vertical axis: Cumulative Frequency (up to 20); horizontal axis: Number of Cigarettes Smoked per Day, with tick marks from 2 to 26 in steps of 3.]


Other Types of Graphs

Pareto charts - a Pareto chart is


used to represent a frequency
distribution for a categorical variable.
Other Types of Graphs-
Pareto Chart

When constructing a Pareto chart -


Make the bars the same width.
Arrange the data from largest to
smallest according to frequencies.
Make the units that are used for the
frequency equal in size.
Example of a Pareto Chart

[Figure: Pareto chart for the number of crimes investigated by law
enforcement officers in U.S. national parks during 1995. Left axis:
Count (0 to 250); right axis: Percent (0 to 100).]

Crime      Count   Percent   Cum %
Assaults   164     68.3      68.3
Rape       34      14.2      82.5
Robbery    29      12.1      94.6
Homicide   13      5.4       100.0
Other Types of Graphs

Time series graph - A time series


graph represents data that occur over
a specific period of time.
Other Types of Graphs -
Time Series Graph

[Figure: time series graph titled "Port Authority Transit Ridership". Vertical axis: Ridership (in millions), 75 to 89; horizontal axis: Year, 1990 to 1994.]
Other Types of Graphs

Pie graph - A pie graph is a circle that


is divided into sections or wedges
according to the percentage of
frequencies in each category of the
distribution.
Other Types of Graphs -
Pie Graph
Pie chart of the number of crimes investigated by law enforcement
officers in U.S. national parks during 1995:
Assaults (164, 68.3%)
Rape (34, 14.2%)
Robbery (29, 12.1%)
Homicide (13, 5.4%)
Organizing Data
Describing Data
Measures of Central Tendency
A statistic is a characteristic or
measure obtained by using the data
values from a sample.
A parameter is a characteristic or
measure obtained by using the data
values from a specific population.
The Mean (arithmetic average)
The mean is defined to be the sum
of the data values divided by the
total number of values.
We will compute two means: one
for the sample and one for a finite
population of values.
The mean, in most cases, is not an
actual data value.
The Sample Mean

The symbol $\bar{X}$ represents the sample mean.
$\bar{X}$ is read as "X-bar". The Greek symbol $\Sigma$
is read as "sigma" and it means "to sum".

$$\bar{X} = \frac{X_1 + X_2 + \cdots + X_n}{n} = \frac{\sum X}{n}.$$
The Sample Mean - Example

The ages in weeks of a random sample
of six kittens at an animal shelter are
3, 8, 5, 12, 14, and 12. Find the
average age of this sample.
The sample mean is

$$\bar{X} = \frac{\sum X}{n} = \frac{3 + 8 + 5 + 12 + 14 + 12}{6} = \frac{54}{6} = 9 \text{ weeks}.$$
The Population Mean

The Greek symbol $\mu$ represents the population
mean. The symbol $\mu$ is read as "mu".
$N$ is the size of the finite population.

$$\mu = \frac{X_1 + X_2 + \cdots + X_N}{N} = \frac{\sum X}{N}.$$
The Population Mean - Example

A small company consists of the owner, the manager,
the salesperson, and two technicians. The salaries are
listed as 50,000, 20,000, 12,000, 9,000, and 9,000 dollars,
respectively. (Assume this is the population.)
Then the population mean will be

$$\mu = \frac{\sum X}{N} = \frac{50{,}000 + 20{,}000 + 12{,}000 + 9{,}000 + 9{,}000}{5} = \$20{,}000.$$
The Sample Mean for an Ungrouped
Frequency Distribution

The mean for an ungrouped frequency
distribution is given by

$$\bar{X} = \frac{\sum (f \cdot X)}{n}.$$

Here $f$ is the frequency for the
corresponding value of $X$, and $n = \sum f$.
The Sample Mean for an Ungrouped
Frequency Distribution - Example

The scores for 25 students on a 4 point quiz


are given in the table. Find the mean score

Score, X   Frequency, f
0          2
1          4
2          12
3          4
4          3
The Sample Mean for an Ungrouped
Frequency Distribution - Example

Score, X   Frequency, f   f·X
0          2              0
1          4              4
2          12             24
3          4              12
4          3              12

$$\bar{X} = \frac{\sum f \cdot X}{n} = \frac{52}{25} = 2.08.$$
The Sample Mean for a Grouped
Frequency Distribution

The mean for a grouped frequency
distribution is given by

$$\bar{X} = \frac{\sum (f \cdot X_m)}{n}.$$

Here $X_m$ is the corresponding
class midpoint.
The Sample Mean for a Grouped
Frequency Distribution - Example

Given the table below, find the mean.

Class         Frequency, f
15.5 - 20.5   3
20.5 - 25.5   5
25.5 - 30.5   4
30.5 - 35.5   3
35.5 - 40.5   2
The Sample Mean for a Grouped
Frequency Distribution - Example

Table with class midpoints, Xm.


Class         Frequency, f   Xm   f·Xm
15.5 - 20.5   3              18   54
20.5 - 25.5   5              23   115
25.5 - 30.5   4              28   112
30.5 - 35.5   3              33   99
35.5 - 40.5   2              38   76
The Sample Mean for a Grouped
Frequency Distribution - Example

$$\sum f \cdot X_m = 54 + 115 + 112 + 99 + 76 = 456$$

and $n = 17$. So

$$\bar{X} = \frac{\sum f \cdot X_m}{n} = \frac{456}{17} = 26.82.$$
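The two frequency-distribution means can be checked with a short Python sketch. The helper name `frequency_mean` is an assumption made for illustration; it simply evaluates the sum of f·X divided by the sum of f, using class midpoints for the grouped case.

```python
def frequency_mean(values, freqs):
    """Mean of a frequency distribution: sum(f * X) / sum(f)."""
    n = sum(freqs)
    return sum(f * x for x, f in zip(values, freqs)) / n

# Ungrouped example: quiz scores and their frequencies.
quiz_mean = frequency_mean([0, 1, 2, 3, 4], [2, 4, 12, 4, 3])

# Grouped example: class midpoints are computed from the class boundaries.
boundaries = [(15.5, 20.5), (20.5, 25.5), (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
midpoints = [(lo + hi) / 2 for lo, hi in boundaries]
grouped_mean = frequency_mean(midpoints, [3, 5, 4, 3, 2])

print(quiz_mean)               # 2.08
print(round(grouped_mean, 2))  # 26.82
```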
The Median

When a data set is ordered, it is


called a data array.
The median is defined to be the
midpoint of the data array.
The symbol used to denote the
median is MD.
The Median - Example

The weights (in pounds) of seven


army recruits are 180, 201, 220,
191, 219, 209, and 186. Find the
median.
Arrange the data in order and
select the middle point.
The Median - Example

Data array: 180, 186, 191, 201,


209, 219, 220.
The median, MD = 201.
The Median

In the previous example, there was


an odd number of values in the
data set. In this case it is easy to
select the middle number in the
data array.
The Median

When there is an even number of


values in the data set, the median
is obtained by taking the average of
the two middle numbers.
The Median - Example

Six customers purchased the following


number of magazines: 1, 7, 3, 2, 3, 4.
Find the median.
Arrange the data in order and compute
the middle point.
Data array: 1, 2, 3, 3, 4, 7.
The median, MD = (3 + 3)/2 = 3.
The Median - Example

The ages of 10 college students


are: 18, 24, 20, 35, 19, 23, 26, 23,
19, 20. Find the median.
Arrange the data in order and
compute the middle point.
The Median - Example

Data array: 18, 19, 19, 20, 20, 23,


23, 24, 26, 35.
The median,
MD = (20 + 23)/2 = 21.5.
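As a quick check of the three median examples, the following sketch uses `median` from Python's standard `statistics` module, which sorts the data and averages the two middle values when the count is even. The variable names are illustrative only.

```python
from statistics import median

weights = [180, 201, 220, 191, 219, 209, 186]     # odd number of values
magazines = [1, 7, 3, 2, 3, 4]                    # even number of values
ages = [18, 24, 20, 35, 19, 23, 26, 23, 19, 20]   # even number of values

print(median(weights))    # 201
print(median(magazines))  # 3.0
print(median(ages))       # 21.5
```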
The Median-Ungrouped Frequency
Distribution

For an ungrouped frequency


distribution, find the median by
examining the cumulative
frequencies to locate the middle
value.
The Median-Ungrouped Frequency
Distribution

If n is the sample size, compute


n/2. Locate the data point where
n/2 values fall below and n/2
values fall above.
The Median-Ungrouped Frequency
Distribution - Example

LRJ Appliance recorded the number of


VCRs sold per week over a one-year
period. The data is given below.
No. Sets Sold   Frequency
1               4
2               9
3               6
4               2
5               3
The Median-Ungrouped Frequency
Distribution - Example

To locate the middle point, divide n by 2;


24/2 = 12.
Locate the point where 12 values would fall
below and 12 values will fall above.
Consider the cumulative distribution.
The 12th and 13th values fall in class 2.
Hence MD = 2.
The Median-Ungrouped Frequency
Distribution - Example

No. Sets Sold   Frequency   Cumulative Frequency
1               4           4
2               9           13
3               6           19
4               2           21
5               3           24

Class 2 contains the 5th through the
13th values.
The Median for a Grouped
Frequency Distribution

The median can be computed from:

$$MD = \frac{(n/2) - cf}{f}\,(w) + L_m$$

where
n = sum of the frequencies
cf = cumulative frequency of the class
immediately preceding the median class
f = frequency of the median class
w = width of the median class
$L_m$ = lower boundary of the median class
The Median for a Grouped
Frequency Distribution - Example

Given the table below, find the median.


Class         Frequency, f
15.5 - 20.5   3
20.5 - 25.5   5
25.5 - 30.5   4
30.5 - 35.5   3
35.5 - 40.5   2
The Median for a Grouped
Frequency Distribution - Example

Table with cumulative frequencies.


Class         Frequency, f   Cumulative Frequency
15.5 - 20.5   3              3
20.5 - 25.5   5              8
25.5 - 30.5   4              12
30.5 - 35.5   3              15
35.5 - 40.5   2              17
The Median for a Grouped Frequency
Distribution - Example

To locate the halfway point, divide n by 2;
17/2 = 8.5 ≈ 9.
Find the class that contains the 9th value.
This will be the median class.
Consider the cumulative distribution.
The median class will then be 25.5 -
30.5.
The Median for a Grouped
Frequency Distribution

n = 17
cf = 8
f = 4
w = 25.5 - 20.5 = 5
$L_m$ = 25.5

$$MD = \frac{(n/2) - cf}{f}\,(w) + L_m = \frac{(17/2) - 8}{4}\,(5) + 25.5 = 26.125.$$
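The grouped-median formula can be sketched in Python as follows. The function name `grouped_median` is hypothetical; the sketch walks the cumulative frequencies to find the median class and then applies the formula with cf, f, w, and the lower boundary of that class.

```python
def grouped_median(boundaries, freqs):
    """Median of a grouped frequency distribution.

    boundaries: list of (lower boundary, upper boundary) per class
    freqs:      list of class frequencies, in the same order
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("the distribution has no values")
    cf = 0  # cumulative frequency of the classes before the current one
    for (lower, upper), f in zip(boundaries, freqs):
        if cf + f >= n / 2:        # this is the median class
            w = upper - lower      # class width
            return (n / 2 - cf) / f * w + lower
        cf += f

boundaries = [(15.5, 20.5), (20.5, 25.5), (25.5, 30.5), (30.5, 35.5), (35.5, 40.5)]
freqs = [3, 5, 4, 3, 2]
md = grouped_median(boundaries, freqs)
print(md)  # 26.125
```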
The Mode

The mode is defined to be the value


that occurs most often in a data set.
A data set can have more than one
mode.
A data set is said to have no mode if
all values occur with equal frequency.
The Mode - Examples

The following data represent the duration (in


days) of U.S. space shuttle voyages for the
years 1992-94. Find the mode.
Data set: 8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10,
14, 11, 8, 14, 11.
Ordered set: 6, 7, 7, 8, 8, 8, 8, 8, 9, 9, 9, 10,
10, 11, 11, 14, 14, 14. Mode = 8.
The Mode - Examples

Six strains of bacteria were tested to see how


long they could remain alive outside their
normal environment. The time, in minutes, is
given below. Find the mode.
Data set: 2, 3, 5, 7, 8, 10.
There is no mode since each data value
occurs equally with a frequency of one.
The Mode - Examples

Eleven different automobiles were tested at a


speed of 15 mph for stopping distances. The
distance, in feet, is given below. Find the
mode.
Data set: 15, 18, 18, 18, 20, 22, 24, 24, 24,
26, 26.
There are two modes (bimodal). The values
are 18 and 24. Why?
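The three mode examples can be reproduced with a small Python sketch. The helper `modes` is an illustrative name; it follows the convention on these slides that a data set has no mode when every value occurs equally often.

```python
from collections import Counter

def modes(values):
    """Return a sorted list of modes, or [] if all values are equally frequent."""
    counts = Counter(values)
    top = max(counts.values())
    if len(counts) > 1 and all(c == top for c in counts.values()):
        return []  # no mode
    return sorted(v for v, c in counts.items() if c == top)

shuttle = [8, 9, 9, 14, 8, 8, 10, 7, 6, 9, 7, 8, 10, 14, 11, 8, 14, 11]
bacteria = [2, 3, 5, 7, 8, 10]
stopping = [15, 18, 18, 18, 20, 22, 24, 24, 24, 26, 26]

print(modes(shuttle))   # [8]
print(modes(bacteria))  # []
print(modes(stopping))  # [18, 24]
```

In the stopping-distance data, 18 and 24 each occur three times and no other value occurs more often, which is why the set is bimodal.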
The Mode for an Ungrouped
Frequency Distribution - Example

Given the table below, find the mode.


Values   Frequency, f
15       3
20       5
25       8    <- Mode
30       3
35       2
The Mode - Grouped Frequency
Distribution
The mode for grouped data is the
modal class.
The modal class is the class with the
largest frequency.
Sometimes the midpoint of the class
is used rather than the boundaries.
The Mode for a Grouped Frequency
Distribution - Example

Given the table below, find the mode.

Class         Frequency, f
15.5 - 20.5   3
20.5 - 25.5   5
25.5 - 30.5   7    <- Modal class
30.5 - 35.5   3
35.5 - 40.5   2
The Midrange

The midrange is found by adding the


lowest and highest values in the data
set and dividing by 2.
The midrange is a rough estimate of the
middle value of the data.
The symbol that is used to represent the
midrange is MR.
The Midrange - Example

Last winter, the city of Brownsville,


Minnesota, reported the following number of
water-line breaks per month. The data is as
follows: 2, 3, 6, 8, 4, 1. Find the midrange.
MR = (1 + 8)/2 = 4.5.
Note: Extreme values strongly influence the midrange,
so it may not be a typical description of
the middle.
The Weighted Mean

The weighted mean is used when the


values in a data set are not all equally
represented.
The weighted mean of a variable X is
found by multiplying each value by its
corresponding weight and dividing the sum
of the products by the sum of the weights.
The Weighted Mean

The weighted mean

$$\bar{X} = \frac{w_1 X_1 + w_2 X_2 + \cdots + w_n X_n}{w_1 + w_2 + \cdots + w_n} = \frac{\sum wX}{\sum w}$$

where $w_1, w_2, \ldots, w_n$ are the weights
for the values $X_1, X_2, \ldots, X_n$.
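A minimal Python sketch of the weighted mean follows. The grade points and credit hours are hypothetical illustration data, not taken from the slides, and `weighted_mean` is an assumed helper name.

```python
def weighted_mean(values, weights):
    """Sum of w * X divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical data: grade points weighted by credit hours.
grade_points = [4.0, 2.0, 3.0]
credits = [3, 3, 4]

gpa = weighted_mean(grade_points, credits)
print(gpa)  # 3.0
```

Here the 4-credit course counts more than each 3-credit course, which is the situation the weighted mean is meant for.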


Distribution Shapes

Frequency distributions can assume


many shapes.
The three most important shapes are
positively skewed, symmetrical, and
negatively skewed.
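Positively Skewed

[Figure: positively skewed distribution curve, with the longer tail to the right.]
Mode < Median < Mean
Symmetrical

[Figure: symmetrical distribution curve.]
Mean = Median = Mode
Negatively Skewed

[Figure: negatively skewed distribution curve, with the longer tail to the left.]
Mean < Median < Mode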
Measures of Variation - Range

The range is defined to be the highest


value minus the lowest value. The
symbol R is used for the range.
R = highest value - lowest value.
Extremely large or extremely small data
values can drastically affect the range.
Measures of Variation - Population
Variance

The variance is the average of the squares of the
distance each value is from the mean.
The symbol for the population variance is
$\sigma^2$ ($\sigma$ is the Greek lowercase letter sigma).

$$\sigma^2 = \frac{\sum (X - \mu)^2}{N},$$

where
$X$ = individual value
$\mu$ = population mean
$N$ = population size
Measures of Variation - Population
Standard Deviation

The standard deviation is the square
root of the variance.

$$\sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (X - \mu)^2}{N}}.$$
Measures of Variation - Example

Consider the following data to constitute


the population: 10, 60, 50, 30, 40, 20.
Find the mean and variance.
The mean $\mu$ = (10 + 60 + 50 + 30 + 40 +
20)/6 = 210/6 = 35.
The variance $\sigma^2$ = 1750/6 = 291.67. See
next slide for computations.
Measures of Variation - Example

X     $X - \mu$   $(X - \mu)^2$
10    -25         625
60    +25         625
50    +15         225
30    -5          25
40    +5          25
20    -15         225
210               1750   (column totals)
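The population variance computation can be sketched in Python directly from the definition. The variable names are illustrative; `pvariance` from the standard `statistics` module is used only as a cross-check.

```python
import math
from statistics import pvariance

population = [10, 60, 50, 30, 40, 20]
N = len(population)

mu = sum(population) / N                       # population mean
squared_devs = [(x - mu) ** 2 for x in population]
sigma2 = sum(squared_devs) / N                 # population variance
sigma = math.sqrt(sigma2)                      # population standard deviation

print(mu)                 # 35.0
print(sum(squared_devs))  # 1750.0
print(round(sigma2, 2))   # 291.67
```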
Measures of Variation - Sample
Variance

The unbiased estimator of the population
variance, or the sample variance, is a
statistic whose value approximates the
expected value of a population variance.
It is denoted by $s^2$, where

$$s^2 = \frac{\sum (X - \bar{X})^2}{n - 1},$$

and
$\bar{X}$ = sample mean
$n$ = sample size
Measures of Variation - Sample
Standard Deviation

The sample standard deviation is the square
root of the sample variance.

$$s = \sqrt{s^2} = \sqrt{\frac{\sum (X - \bar{X})^2}{n - 1}}.$$
Shortcut Formula for the Sample
Variance and the Standard Deviation

$$s^2 = \frac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1}$$

$$s = \sqrt{\frac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1}}$$
Sample Variance - Example

Find the variance and standard


deviation for the following sample: 16,
19, 15, 15, 14.
$\sum X$ = 16 + 19 + 15 + 15 + 14 = 79.
$\sum X^2$ = 16² + 19² + 15² + 15² + 14²
= 1263.
Sample Variance - Example

$$s^2 = \frac{\sum X^2 - \left(\sum X\right)^2 / n}{n - 1} = \frac{1263 - (79)^2 / 5}{4} = 3.7$$

$$s = \sqrt{3.7} = 1.9.$$
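The shortcut formula can be verified with a few lines of Python. This is a sketch with illustrative variable names; `variance` from the standard `statistics` module uses the n - 1 denominator and serves as a cross-check on the shortcut result.

```python
import math
from statistics import variance

sample = [16, 19, 15, 15, 14]
n = len(sample)

sum_x = sum(sample)                   # 79
sum_x2 = sum(x * x for x in sample)   # 1263

# Shortcut formula: (sum of X^2 - (sum of X)^2 / n) / (n - 1)
s2 = (sum_x2 - sum_x ** 2 / n) / (n - 1)
s = math.sqrt(s2)

print(round(s2, 1))  # 3.7
print(round(s, 1))   # 1.9
```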
Sample Variance for Grouped and
Ungrouped Data

For grouped data, use the class


midpoints for the observed value in the
different classes.
For ungrouped data, use the same
formula (see next slide) with the class
midpoints, Xm, replaced with the actual
observed X value.
Sample Variance for Grouped and
Ungrouped Data

The sample variance for grouped data:

$$s^2 = \frac{\sum f \cdot X_m^2 - \left[\left(\sum f \cdot X_m\right)^2 / n\right]}{n - 1}.$$

For ungrouped data, replace $X_m$
with the observed X value.
Sample Variance for Ungrouped Data
- Example

X    f    f·X   f·X²
5    2    10    50
6    3    18    108
7    8    56    392
8    1    8     64
9    6    54    486
10   4    40    400

n = 24, $\sum f \cdot X$ = 186, $\sum f \cdot X^2$ = 1500
Sample Variance for Ungrouped
Data - Example

The sample variance and standard deviation:

$$s^2 = \frac{\sum f \cdot X^2 - \left[\left(\sum f \cdot X\right)^2 / n\right]}{n - 1} = \frac{1500 - \left[(186)^2 / 24\right]}{23} = 2.54.$$

$$s = \sqrt{2.54} = 1.6.$$
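The same computation for a frequency distribution can be sketched in Python. The variable names are illustrative; for grouped data the list `values` would hold the class midpoints instead of the observed values.

```python
import math

values = [5, 6, 7, 8, 9, 10]   # observed X values (class midpoints for grouped data)
freqs = [2, 3, 8, 1, 6, 4]     # corresponding frequencies f

n = sum(freqs)
sum_fx = sum(f * x for x, f in zip(values, freqs))
sum_fx2 = sum(f * x * x for x, f in zip(values, freqs))

s2 = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
s = math.sqrt(s2)

print(n, sum_fx, sum_fx2)  # 24 186 1500
print(round(s2, 2))        # 2.54
print(round(s, 1))         # 1.6
```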
Coefficient of Variation

The coefficient of variation is defined to


be the standard deviation divided by the
mean. The result is expressed as a
percentage.
$$CVar = \frac{s}{\bar{X}} \cdot 100\% \quad \text{or} \quad CVar = \frac{\sigma}{\mu} \cdot 100\%.$$
Chebyshev's Theorem

The proportion of values from a data set that
will fall within k standard deviations of the
mean will be at least $1 - 1/k^2$, where k is any
number greater than 1.
For k = 2, at least 75% of the values will lie within 2
standard deviations of the mean. For k = 3,
at least 89% (approximately) will lie within 3 standard
deviations.
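The bound is easy to tabulate. The sketch below uses a hypothetical helper name, `chebyshev_lower_bound`, and simply evaluates 1 - 1/k² for a given k greater than 1.

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k) * 100, 1))
# 2 75.0
# 3 88.9
```

Unlike the empirical rule, this bound holds for any distribution shape, which is why its percentages are lower.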
The Empirical (Normal) Rule

For any bell shaped distribution:


Approximately 68% of the data values will fall
within one standard deviation of the mean.
Approximately 95% will fall within two
standard deviations of the mean.
Approximately 99.7% will fall within three
standard deviations of the mean.
The Empirical (Normal) Rule

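[Figure: bell-shaped curve with the horizontal axis marked at $\mu - 3\sigma$, $\mu - 2\sigma$, $\mu - \sigma$, $\mu$, $\mu + \sigma$, $\mu + 2\sigma$, $\mu + 3\sigma$. About 68% of the values lie within $\mu \pm \sigma$, about 95% within $\mu \pm 2\sigma$, and about 99.7% within $\mu \pm 3\sigma$.]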
Measures of Position - z score

The standard score or z score for a


value is obtained by subtracting the
mean from the value and dividing the
result by the standard deviation.
The symbol z is used for the z
score.
Measures of Position - z score

The z score represents the number of


standard deviations a data value falls above
or below the mean.
For samples:

$$z = \frac{X - \bar{X}}{s}.$$

For populations:

$$z = \frac{X - \mu}{\sigma}.$$
z-score - Example

A student scored 65 on a statistics exam that


had a mean of 50 and a standard deviation of
10. Compute the z-score.
z = (65 - 50)/10 = 1.5.
That is, the score of 65 is 1.5 standard
deviations above the mean.
Above - since the z-score is positive.
Measures of Position - Percentiles

Percentiles divide the distribution into 100


groups.
The $P_k$ percentile is defined to be that
numerical value such that at most k% of
the values are smaller than $P_k$ and at most
(100 - k)% are larger than $P_k$ in an ordered
data set.
Percentile Formula

The percentile corresponding to a given


value (X) is computed by using the
formula:
$$\text{Percentile} = \frac{(\text{number of values below } X) + 0.5}{\text{total number of values}} \cdot 100\%$$
Percentiles - Example

A teacher gives a 20-point test to 10 students.


Find the percentile rank of a score of 12.
Scores: 18, 15, 12, 6, 8, 2, 3, 5, 20, 10.
Ordered set: 2, 3, 5, 6, 8, 10, 12, 15, 18, 20.
Percentile = [(6 + 0.5)/10](100%) = 65th
percentile. Student did better than 65% of the
class.
Percentiles - Finding the value
Corresponding to a Given
Percentile

Procedure: Let p be the percentile and n the


sample size.
Step 1: Arrange the data in order.
Step 2: Compute c = (np)/100.
Step 3: If c is not a whole number, round up
to the next whole number. If c is a whole
number, use the value halfway between the cth
and (c+1)th values.
Percentiles - Finding the value
Corresponding to a Given
Percentile

Step 4: The value of c is the position value of


the required percentile.
Example: Find the value of the 25th
percentile for the following data set: 2, 3, 5, 6,
8, 10, 12, 15, 18, 20.
Note: the data set is already ordered.
n = 10, p = 25, so c = (10 × 25)/100 = 2.5.
Hence round up to c = 3.
Percentiles - Finding the value
Corresponding to a Given
Percentile

Thus, the value of the 25th percentile is the


value X = 5.
Find the 80th percentile.
c = (10 × 80)/100 = 8. Thus the value of the
80th percentile is the average of the 8th and
9th values. Thus, the 80th percentile for the
data set is (15 + 18)/2 = 16.5.
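Both percentile procedures can be sketched in Python. The function names `percentile_rank` and `percentile_value` are hypothetical, and the second function assumes p is a whole number strictly between 0 and 100.

```python
def percentile_rank(values, x):
    """Percentile rank of x: (number of values below x + 0.5) / n * 100."""
    below = sum(v < x for v in values)
    return (below + 0.5) * 100 / len(values)

def percentile_value(values, p):
    """Value at the p-th percentile, following Steps 1-4."""
    ordered = sorted(values)                # Step 1
    c, remainder = divmod(len(ordered) * p, 100)   # Step 2: c = n * p / 100
    if remainder:                           # c is not whole: round up
        return ordered[c]                   # position c + 1, zero-based index c
    # c is whole: halfway between the c-th and (c+1)-th values
    return (ordered[c - 1] + ordered[c]) / 2

scores = [18, 15, 12, 6, 8, 2, 3, 5, 20, 10]

print(percentile_rank(scores, 12))   # 65.0
print(percentile_value(scores, 25))  # 5
print(percentile_value(scores, 80))  # 16.5
```

Integer arithmetic with `divmod` is used for Step 2 so that the whole-number test does not depend on floating-point rounding.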
Special Percentiles - Deciles and
Quartiles

Deciles divide the data set into 10


groups.
Deciles are denoted by D1, D2, ..., D9
with the corresponding percentiles
being P10, P20, ..., P90.
Quartiles divide the data set into 4
groups.
Special Percentiles - Deciles and
Quartiles

Quartiles are denoted by Q1, Q2, and


Q3 with the corresponding percentiles
being P25, P50, and P75.
The median is the same as P50 or Q2.
Outliers and the Interquartile
Range (IQR)

An outlier is an extremely high or an


extremely low data value when
compared with the rest of the data
values.
The Interquartile Range, IQR
= Q3 - Q1.
Outliers and the Interquartile
Range (IQR)

To determine whether a data value can be


considered as an outlier:
Step 1: Compute Q1 and Q3.
Step 2: Find the IQR = Q3 - Q1.
Step 3: Compute (1.5)(IQR).
Step 4: Compute Q1 - (1.5)(IQR) and
Q3 + (1.5)(IQR).
Outliers and the Interquartile
Range (IQR)

To determine whether a data value can be


considered as an outlier:
Step 5: Compare the data value (say X) with
Q1 - (1.5)(IQR) and Q3 + (1.5)(IQR).
If X < Q1 - (1.5)(IQR) or
if X > Q3 + (1.5)(IQR), then X is considered
an outlier.
Outliers and the Interquartile
Range (IQR) - Example

Given the data set 5, 6, 12, 13, 15, 18, 22, 50,
can the value of 50 be considered as an
outlier?
Q1 = 9, Q3 = 20, IQR = 11. Verify.
(1.5)(IQR) = (1.5)(11) = 16.5.
9 - 16.5 = -7.5 and 20 + 16.5 = 36.5.
The value of 50 is outside the range -7.5 to
36.5, hence 50 is an outlier.
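The outlier check can be sketched in Python. The helper names are hypothetical; quartiles are computed with the percentile procedure from the earlier slides (c = np/100, round up if c is not whole, otherwise average the cth and (c+1)th values), which is only one of several quartile conventions.

```python
def percentile_value(values, p):
    """Value at the p-th percentile (p a whole number between 0 and 100)."""
    ordered = sorted(values)
    c, remainder = divmod(len(ordered) * p, 100)
    if remainder:
        return ordered[c]
    return (ordered[c - 1] + ordered[c]) / 2

def outlier_fences(values):
    """Return (Q1 - 1.5 * IQR, Q3 + 1.5 * IQR)."""
    q1 = percentile_value(values, 25)
    q3 = percentile_value(values, 75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

data = [5, 6, 12, 13, 15, 18, 22, 50]
low, high = outlier_fences(data)
outliers = [x for x in data if x < low or x > high]

print(low, high)  # -7.5 36.5
print(outliers)   # [50]
```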
Exploratory Data Analysis - Stem
and Leaf Plot

A stem and leaf plot is a data plot


that uses part of a data value as the
stem and part of the data value as
the leaf to form groups or classes.
Exploratory Data Analysis - Stem
and Leaf Plot - Example

At an outpatient testing center, a


sample of 20 days showed the following
number of cardiograms done each day:
25, 31, 20, 32, 13, 14, 43, 02, 57, 23,
36, 32, 33, 32, 44, 32, 52, 44, 51, 45.
Construct a stem and leaf plot for the
data.
Exploratory Data Analysis - Stem
and Leaf Plot - Example

Leading Digit (Stem) Trailing Digit (Leaf)

0 2
1 3 4
2 0 3 5
3 1 2 2 2 2 3 6
4 3 4 4 5
5 1 2 7
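A stem and leaf plot for two-digit data can be built with a few lines of Python. This sketch assumes whole-number values below 100, so that the tens digit is the stem and the units digit is the leaf; the variable names are illustrative.

```python
from collections import defaultdict

cardiograms = [25, 31, 20, 32, 13, 14, 43, 2, 57, 23,
               36, 32, 33, 32, 44, 32, 52, 44, 51, 45]

stems = defaultdict(list)
for value in sorted(cardiograms):
    stems[value // 10].append(value % 10)   # tens digit -> stem, units digit -> leaf

for stem in sorted(stems):
    print(stem, "|", " ".join(str(leaf) for leaf in stems[stem]))
```

Sorting the data first keeps the leaves in increasing order within each stem.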
Exploratory Data Analysis
Box Plot

When the data set contains a small


number of values, a box plot is used to
graphically represent the data set.
These plots involve five values: the
minimum value, the lower hinge, the
median, the upper hinge, and the
maximum value.
Exploratory Data Analysis
Box Plot

The lower hinge is the median of all values
less than or equal to the median when the
data set has an odd number of values, or
the median of all values less than the
median when the data set has an even
number of values. The symbol for the
lower hinge is LH.
Exploratory Data Analysis
Box Plot

The upper hinge is the median of all
values greater than or equal to the
median when the data set has an odd
number of values, or the median of all
values greater than the median when the
data set has an even number of values.
The symbol for the upper hinge is UH.
Exploratory Data Analysis - Box
Plot - Example (Cardiograms data)

[Figure: box plot of the cardiograms data on a horizontal axis from 0 to 60, with the minimum, LH, median, UH, and maximum labeled.]
Information Obtained from a
Box Plot

If the median is near the center of the box,


the distribution is approximately symmetric.
If the median falls to the left of the center of
the box, the distribution is positively skewed.
If the median falls to the right of the center of
the box, the distribution is negatively skewed.
Information Obtained from a
Box Plot

If the lines are about the same length, the


distribution is approximately symmetric.
If the right line is larger than the left line, the
distribution is positively skewed.
If the left line is larger than the right line, the
distribution is negatively skewed.