0.biostat All in One

PART-I: DESCRIPTIVE STATISTICS
Lecture Note organized by

Gurmesa Tura (MPH)
PhD Candidate, AAU, SPH
April, 2011
Addis Ababa University
Introduction to
Biostatistics
Lecturer:
Gurmesa Tura (MPH)
March 2011
AAU
Objectives
• At the end of this lecture the students will be
able to:
– Define statistics & biostatistics
– Explain the roles of statistics in medicine
– Describe the types of data and scales of measurement
– Identify different methods of data collection
Contents
• Definition
• Types of statistics
• Roles of statistics
• Types of data & scales of measurement
• Data collection methods
What is statistics?
• The scientific study of numerical data based
on variation in nature. (Sokal and Rohlf)
• A set of procedures and rules for reducing
large masses of data into manageable
proportions, allowing us to draw conclusions
from those data. (McCarthy)
Statistics…
• Statistics is the art and science of making
decisions in the face of uncertainty
• Statistics is the science of collecting, summarizing,
presenting and interpreting data, and of using them
to test hypotheses.
• Biostatistics is statistics applied to biological and
health problems
What are statistical data?
• Observation: information obtained from a
single person
• Data: information gathered from a group of
people
• Statistical data: the raw material or facts of any
statistical observation, arising whenever
measurements are made or observations are
classified
Types of statistics
• Descriptive Statistics
– Collection,
– organization,
– summarization, and
– presentation of data.
• Inferential Statistics
– Generalizing from samples to populations using
probabilities.
– Performing hypothesis testing,
– Determining relationships between variables,
– Making predictions.
Why study statistics in medicine?
• Medicine and epidemiology are becoming
increasingly quantitative
• Knowledge of statistics is required to design,
conduct and analyse medical research
• Helps in better understanding the medical
literature
Roles of statistics
• In clinical medicine
– Making clinical diagnosis
– Determining Rx and prognosis
– Handling variations (defining normal values and normal
ranges)
• In public health
– Community diagnosis
• In Research
– Designing and undertaking clinical & public health research
Uses of statistics
1. Collecting data in the best possible way
2. Describing the characteristics of a group or
population
3. Analyzing and interpreting data
4. Making generalizations about populations based on
studies of samples
Limitations of statistics
1. Statistics doesn’t deal with a single (individual) value.
– It deals only with aggregate values
2. Statistics can’t deal directly with qualitative characteristics
– It deals with data which can be quantified
3. Statistical conclusions are not universally true
– They are context specific
4. Statistical interpretation requires a high degree of skill
& understanding of the subject.
Types of data
• Based on source :
– Primary & secondary data
1. Primary data
• Data collected by the investigator for the
purpose of specific study
• Original in character
• Mostly generated by surveys
• Complete, reliable and more accurate
Types of data…
2. Secondary data
• When the investigator uses data which have been collected by
others for another purpose
• Obtained from journals, reports, government publications etc.
• Less expensive (less money & time)
• May be incomplete, of lower quality, or less valid
Scales of measurement
• A variable is any aspect of an individual or
thing that is measured and can take different
values for different individuals or cases
• Variables are divided into two types
1. Qualitative (categorical) variables &
2. Quantitative (numerical) variables
Qualitative (categorical) variable
• A variable which cannot be measured in
quantitative (numerical) form but can only be
identified by name or category.
• It has two forms based on scales of
measurement
– Nominal
– Ordinal
Nominal data
• Represent categories or names
• There is no order among the categories
• It has two forms:
– Dichotomous: has 2 categories
• E.g. Sex: male or female
» Immunization: yes or no
» Disease outcome: died or survived
– Multichotomous: more than two categories
• E.g.
– Blood group: A, B, AB or O
– Marital status: single, married, divorced or widowed
Ordinal data
• Have order in the response categories
• But the distance or interval between categories is not
necessarily equal
– E.g. Immunization status:
» Not immunized,
» Partially immunized
» Fully immunized
• Disease state
» Mild
» Moderate
» Severe
• Agreement questions
» Strongly agree
» Agree
» Indifferent
» Disagree
» Strongly disagree
Quantitative (numerical) variables
• Variables which assume numerical values.
• variables to which a number is assigned as a
quantitative value
• Has two forms
– Discrete Variables
• Variables which assume a finite or countable number of
possible values.
• Usually obtained by counting; no decimals
• E.g. household size, number of children
– Continuous Variables
• Variables which assume an infinite number of possible
values.
• Usually obtained by measurement.
• Can have decimals
• Eg. Age, weight, height
Quantitative …..
• Continuous variables…
• Have two scales of measurement
• Interval scale:
– Order and distance implied. Differences can be compared;
– no true zero.
– Ratios cannot be compared.
– E.g. temperature in Celsius.
» 0 °C does not mean there is no temperature
» 40 °C is not twice as hot as 20 °C
• Ratio scale:
– Order and distance implied.
– Differences can be compared;
– has a true zero.
– Ratios can be compared.
– Examples: height, weight, blood pressure
• 40 cm is twice as long as 20 cm
• 0 cm is a true zero: it means there is no length at all
Data collection
• The process of obtaining statistical data
• Before any statistical work can be done data must be
collected
• Collecting Primary data
– Observation
– Interview
– Use of self administered questionnaire
• Collecting secondary data
– Use of documentary sources
Observation
• Systematically selecting, watching and recording
the behaviour of people or other phenomena and
aspects of the settings in which they occur
• Done for the purpose of obtaining specified information
• Includes
– Visual observation
– Radiographic (X-ray), biomedical, microscopic and
clinical examinations, etc.
Observation…
• It can also be used for observing the behaviour
of people, culture etc.
• It could be
– Participant observation or
– Non-participant observation
Observation…
• Advantage
– More accurate data on behaviour or activity
• Disadvantages
– Observer bias
– Prejudice
– Desirability bias
– Needs skilled personnel when sophisticated
equipment is used
Interviews
• Face to face interview
• Telephone interview
• Group interview or Focused Group Discussion (FGD)
• Self administered questionnaire
• Mailed questionnaire
• Computer interview
Face to face interview
• Advantage
– Permits detailed & in-depth questions & responses
– Minimizes non-response
• Disadvantage
– Costly
– Interviewer bias
– Investigator bias
– Interviewer cheating
Telephone interview
• Advantage
– Convenient
– Saves time
– Relatively inexpensive
– Less interviewer & investigator bias than personal
interview
• Disadvantage
– Non-coverage
– Limited length & depth of questions and responses
Self-administered Questionnaire
• Advantage
– Cost effective for large areas
– Minimizes interviewer bias
– Promotes accurate answers
– Information on sensitive issues can be gathered
• Disadvantage
– Low response rates
– Unanswered questions
– Incorrect answers
Mailed questionnaire
• Advantage
– Allows collecting data without personal presence
• Disadvantage
– Low response rate
– Not applicable for respondents who cannot read and write
– Low coverage in rural areas
Use of documentary sources
• These include
– Clinical & other personal records
– Vital statistics
– Census data
• Sources
– Official publications of CSA
– Publications of MOH & other ministries
– Newspapers & journals
– International publications (WHO, UNICEF, etc.)
– Health facilities’ records
Choosing method of data collection
• The choice of data collection method(s)
depends on:
– Type of data we need
– Resources (time, personnel & facility)
– Accuracy & strength of the method
– Acceptability of the method to the subjects
– Background of the study subjects
– Etc.
Data organization & Presentation
Lecture 2
By Gurmesa Tura (MPH)
March 2011
AAU
Learning objectives
• At the end of this lecture the students will be
able to:
– Identify different ways of data organization &
presentation
– Be familiar with constructing different forms of
data organization and presentation
Methods of data organization
• The data collected in a survey are called raw data
• Information is not immediately evident from the mass of unsorted
raw data
• Raw data need to be organized so as to condense the information and
show patterns and variations
• Techniques of data organization & presentation
– Ordered array
– Tables &
– Graphs
Ordered array
• A serial arrangement of numerical data in ascending
or descending order
• Tells us the range of the data and their general distribution
• Appropriate only for small data sets (fewer than 20 values)
• For larger data sets we need to use frequency distributions
or tables
Frequency distributions
• A table that shows data classified into a number of
classes, with the corresponding number of observations falling in
each class (the frequency)
• Frequency is the number of times a certain value of the
variable is repeated in a given class.
• Two types
– Categorical frequency distribution
– Numerical frequency distribution
Categorical frequency distribution
• Used for data that can be placed in specific categories
• Used for nominal & ordinal data
– E.g. blood type, marital status etc.
• Example: A health worker collected data on the blood type of 30
individuals and recorded them as follows (hypothetical)
• O, A, AB, B, O, O, O, A, B, O, AB, B, B, A, AB, O, O, O, B, AB, O, A, AB, B,
O, O, O, A, B, O
Procedures to construct the frequency distribution
• There are 4 blood groups, so we have four classes
• Step 1: Make a table
• Step 2: Tally the data & place the result in the Tally column
• Step 3: Count the tallies and place the result in the Frequency
column
• Step 4: Calculate the % for each class
% = f/n × 100
where f = frequency of the class &
n = total number of values
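The four steps above can be sketched in Python. This is only an illustration using the 30 hypothetical blood types from the example; the function name `categorical_frequency` is made up for this sketch.

```python
from collections import Counter

# Blood types of the 30 individuals in the example (hypothetical data).
blood_types = (
    "O A AB B O O O A B O AB B B A AB O O O B AB O A AB B O O O A B O"
).split()

def categorical_frequency(values):
    """Return {category: (frequency, percent)} with percent = f / n * 100."""
    n = len(values)
    counts = Counter(values)          # steps 2 and 3: tally and count
    return {cat: (f, round(f / n * 100, 1)) for cat, f in counts.items()}

table = categorical_frequency(blood_types)
for cat in ("A", "B", "AB", "O"):
    f, pct = table[cat]
    print(f"{cat:<4}{f:>4}{pct:>8.1f}")
```

The printed rows should match the blood type table used later for the pie chart (A 5, B 7, AB 5, O 13).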
Numerical frequency distribution
• Here the classification criterion is quantitative
• It has two forms
– Ungrouped frequency distribution
• For discrete quantitative data
– Grouped frequency distribution
• For continuous quantitative data
Ungrouped frequency distribution
• A table of all the potential raw score values
that could possibly occur in the data, along
with the number of times each actually
occurred
• Often used for small sets of data on discrete
variables
Constructing ungrouped freq. distri.
• First find the smallest & the largest values in the data
• Arrange the data in order of magnitude and count the frequency
• To facilitate counting one may include a column of tallies.
• Steps in constructing
– Step 1: Make the table
– Step 2: Tally the data
– Step 3: Count the frequency
– Step 4: Compute the percentage
• E.g. the following hypothetical data represent the family size of 50 households.
4, 6, 4, 3, 5, 2, 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6, 2, 6, 4, 3, 5, 2, 8, 10, 4, 4, 5, 3, 5, 8, 4, 4,
6, 2, 5, 2, 8, 10, 4, 4, 5, 3, 10, 4, 5, 6, 3, 5, 6
Grouped frequency distribution (GFD)
• A frequency distribution in which several values are
grouped into one class
• Usually used when the range of the data is large
• Two types
– Inclusive
• the upper limit of one class does not coincide with the lower limit
of the next class; both limits belong to the class (e.g. 12-17, 18-23)
– Exclusive
• the upper limit of one class coincides with the lower limit
of the next class; the upper limit is excluded from the class (e.g. 10-20, 20-30)
Grouped freq. distr….
• Example: Consider the following ungrouped marks of 30
students (out of 50)
24 30 36 35 42 40 26 23
36 36 12 45 29 21 34 40
16 47 28 32 33 44 19 34
30 36 35 47 20 14
• Construct a grouped frequency distribution for the above data
Guidelines for creating classes
1. There should be between 6 and 20 classes
2. The classes must be mutually exclusive,
i.e. no data value falls into two different classes
3. The classes must be all-inclusive or exhaustive, i.e. all data
values must be included
4. The classes must be continuous, i.e. there are no
gaps in the frequency distribution
5. The classes must be equal in width.
• The exception is the first or the last class
• It is possible to have a ‘Below…’ or ‘…and above’ class.
• Often used for ages.
Steps in constructing Grouped freq. distr.
1. Find the largest & smallest values
2. Compute the range (R) = Maximum - Minimum
• From the above example R = 47 - 12 = 35
3. Select the number of classes (usually 6-20) or use Sturges' rule
k = 1 + 3.322 log n (logarithm to base 10)
where k is the desired number of classes &
n is the total number of observations
k is rounded up if it is not a whole number
From the example above (n = 30)
k = 1 + 3.322 log 30 (log 30 = 1.48)
k = 1 + 3.322(1.48) = 5.9, rounded up to 6
So we need 6 classes
Steps…
4. Find the class width (w) by dividing the range by the
number of classes, always rounding up rather than to the nearest value.
From the example above, w = R/k = 35/6 = 5.8, rounded up to 6
5. Choose a suitable starting point, here equal to the
minimum value.
– The starting point is the lower limit of the 1st class
– Keep adding the class width to this lower limit to get the
remaining lower limits.
Steps..
• From the above example the lower class limits
(LCL) will be:
• The starting point is the smallest value = 12, so
• 1st lower limit = 12
• 2nd lower limit = 12 + 6 = 18
• 3rd lower limit = 18 + 6 = 24
• 4th lower limit = 24 + 6 = 30
• 5th lower limit = 30 + 6 = 36
• 6th lower limit = 36 + 6 = 42
Steps…
6. Find the upper class limit (UCL),
UCL = LCL + (w - 1)
From the above example w = 6, so w - 1 = 5
• 1st UCL = 12 + 5 = 17
• 2nd UCL = 18 + 5 = 23
• 3rd UCL = 24 + 5 = 29
• 4th UCL = 30 + 5 = 35
• 5th UCL = 36 + 5 = 41
• 6th UCL = 42 + 5 = 47
The classes are entered in the first column of the table:
Classes  Tally  Freq  %
12-17
18-23
24-29
30-35
36-41
42-47
Total
Steps …
7. Make the tally
8. Count the tally & fill in the frequency
9. Calculate & fill in the percentages
10. Find the relative frequency (rf)
rf = f/n
11. Find the cumulative frequency (cf):
– Lcf: less-than cumulative frequency (number of values < UCB)
– Gcf: greater-than cumulative frequency (number of values > LCB)
By combining all the steps
Classes  Tally      Freq  %      rf    Cf (less than)  Cf (greater than)
12-17    ///        3     10.0   0.10  3               30
18-23    ////       4     13.3   0.13  7               27
24-29    ////       4     13.3   0.13  11              23
30-35    ///// ///  8     26.7   0.27  19              19
36-41    ///// /    6     20.0   0.20  25              11
42-47    /////      5     16.7   0.17  30              5
Total               30    100.0  1.00
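As a rough check, steps 1 to 10 can be reproduced in Python for the marks of the 30 students. This is a sketch under two assumptions that are not stated on the slides: the data are whole numbers (unit of measurement 1), and `grouped_frequency` is an illustrative name. For other data sets the last class may need to be checked by hand to make sure it covers the maximum.

```python
import math

# Marks of the 30 students from the example (out of 50).
marks = [24, 30, 36, 35, 42, 40, 26, 23,
         36, 36, 12, 45, 29, 21, 34, 40,
         16, 47, 28, 32, 33, 44, 19, 34,
         30, 36, 35, 47, 20, 14]

def grouped_frequency(data):
    """Return rows of (LCL, UCL, frequency, relative frequency)."""
    n = len(data)
    k = math.ceil(1 + 3.322 * math.log10(n))        # Sturges' rule, rounded up
    width = math.ceil((max(data) - min(data)) / k)  # class width, rounded up
    rows = []
    lower = min(data)                               # starting point = minimum
    for _ in range(k):
        upper = lower + width - 1                   # UCL = LCL + (w - 1)
        f = sum(lower <= x <= upper for x in data)  # count values in the class
        rows.append((lower, upper, f, round(f / n, 2)))
        lower += width
    return rows

rows = grouped_frequency(marks)
for lcl, ucl, f, rf in rows:
    print(f"{lcl}-{ucl}  {f:>2}  {rf:.2f}")
```

With these data the sketch gives 6 classes of width 6, starting at 12, with the same frequencies as the combined table.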
Common terms used in grouped freq. distr. (GFD)
• Class interval: the range of scores grouped together in a GFD
• Class limits: the first & the last values in the given class
interval
• Unit of measurement (U): the distance between two
consecutive measures
– U = LCL of the (n+1)th class - UCL of the nth class
– E.g. 12-17, 18-23: U = 18 - 17 = 1
– U is usually taken as 1, 0.1, 0.01, 0.001, …
Terms….
• Class boundaries: separate one class in a GFD from another
• The boundaries have one more decimal place than the raw data and
therefore do not appear in the data
• There is no gap between the upper boundary of one class and the lower
boundary of the next class
• LCB = LCL - U/2
• UCB = UCL + U/2
– E.g. 12-17, 18-23: U = 18 - 17 = 1
• LCB for 18-23 = 18 - 1/2 = 18 - 0.5 = 17.5
• UCB for 18-23 = 23 + 1/2 = 23 + 0.5 = 23.5
Terms…
Classes  Class boundaries  Freq  %
12-17    11.5-17.5
18-23    17.5-23.5
24-29    23.5-29.5
30-35    29.5-35.5
36-41    35.5-41.5
42-47    41.5-47.5
Total
• Class width (w) = UCB - LCB
Terms…
• Class mark (Xc)
– The midpoint of the class
– The average of LCL & UCL, or the
average of LCB & UCB
– Xc = (LCL + UCL)/2
E.g. Xc1 = (12 + 17)/2 = 29/2 = 14.5
Classes  Class mark (Xc)
12-17    14.5
18-23    20.5
24-29    26.5
30-35    32.5
36-41    38.5
42-47    44.5
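The class boundaries, class width and class marks can be derived from the class limits alone. A minimal sketch, assuming the classes are given as (LCL, UCL) pairs in ascending order; `boundaries_and_marks` is a hypothetical helper name.

```python
# Class limits of the grouped frequency distribution of the marks.
classes = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41), (42, 47)]

def boundaries_and_marks(class_limits):
    """Return (LCB, UCB, width, class mark) for each class."""
    # U: gap between the UCL of the first class and the LCL of the second.
    unit = class_limits[1][0] - class_limits[0][1]
    out = []
    for lcl, ucl in class_limits:
        lcb = lcl - unit / 2          # LCB = LCL - U/2
        ucb = ucl + unit / 2          # UCB = UCL + U/2
        mark = (lcl + ucl) / 2        # Xc = (LCL + UCL) / 2
        out.append((lcb, ucb, ucb - lcb, mark))
    return out

result = boundaries_and_marks(classes)
for (lcl, ucl), (lcb, ucb, w, xc) in zip(classes, result):
    print(f"{lcl}-{ucl}: boundaries {lcb}-{ucb}, width {w}, class mark {xc}")
```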
Rules in constructing tables
1. Tables should be as simple as possible (6-20 categories)
2. Tables should be self-explanatory
• The title should be clear and to the point (it answers: what, when, where, how
classified)
e.g. Table 1: Marks of 30 Medical students of AAU, March 2011, AA, Ethiopia.
• The title is placed above the table
3. Each row & column should be labelled
4. Numerical entries of zero should be written explicitly rather than indicated
by a dash, as dashes are reserved for missing or unobserved data.
5. Totals should be indicated (last row, last column)
6. If the data are not original, their source should be given in footnotes.
Types of tables
• We have three different types of tables based on the number of
variables included
1. Simple or one-way table
– Single variable involved
2. Two-way table
– Two variables cross-tabulated
3. Higher-order table
– Three or more variables involved
Eg. One way
• Table 2: Immunization status of children in xxx woreda,
2010 (hypothetical)
Immunization status  Number  Percent
Immunized            135     64.3
Not immunized        75      35.7
Total                210     100.0
Eg. Two way table
• Table 3: Immunization status by sex of children in xxx
woreda, 2010 (hypothetical)
                 Immunization status
Sex of children  Immunized      Not immunized  Total
                 N      %       N      %       N      %
Male             85     65.4    45     34.6    130    100.0
Female           50     62.5    30     37.5    80     100.0
Total            135    64.3    75     35.7    210    100.0
Eg. Higher ordered table
• Table 4: Immunization status by sex and residence of children in xxx
                    Immunization status
Sex     Residence   Immunized      Not immunized  Total
                    N      %       N      %       N      %
Male    Urban       55     68.7    25     31.3    80     100.0
        Rural       30     60.0    20     40.0    50     100.0
Female  Urban       40     66.7    20     33.3    60     100.0
        Rural       10     50.0    10     50.0    20     100.0
Total               135    64.3    75     35.7    210    100.0
Diagrammatic/Graphical
presentation of data
Lecture 3
By: Gurmesa Tura (MPH)
March 2011
AAU
Objectives
• At the end of the class the students will be
able to:
– Identify the different types of graphs
– Choose among the graphs based on the data
– Be familiar with constructing the different types of
graphs
– Identify the importance and limitations of using graphs
Graphical presentation of data
• Techniques for presenting data in visual
displays using geometric figures and pictures.
• Importance
– Greater attraction
– Easily understandable
– Facilitate comparison
– May reveal unsuspected patterns in a complex set of
data
– Greater memorizing value
Limitations
• Used only for the purpose of comparison
• Not an alternative to tabulation
• Can give only an approximate idea
• They fail to reveal very small differences
Types of graphs
• For qualitative & quantitative discrete data
– Bar chart
– Pie chart
• For quantitative continuous data
– Histograms
– Frequency polygon
– Cumulative frequency polygon (Ogive)
Bar chart
• A series of equally spaced bars of equal width
(base), where the height of each bar represents the
frequency (or amount) associated with each category.
• It could be either vertical or horizontal
• Three types based on the number of variables
involved
– Simple bar chart
– Multiple bar chart
– Component bar chart
• Simple bar chart
From our previous example
Table 2: Immunization status of children in xxx
Immunization status  Freq  %
Immunized            135   64.3
Not immunized        75    35.7
Total                210   100.0
[Bar chart: number of children (y-axis) by immunization status (x-axis: Immunized, Not immunized)]
Figure 1: Immunization status of children in xxx woreda, 2010 (hypothetical)
Multiple bar chart
• From the previous example
Table 3: Immunization status by sex of children in xxx
                 Immunization status
Sex of children  Immunized      Not immunized  Total
                 N      %       N      %       N      %
Male             85     65.4    45     34.6    130    100.0
Female           50     62.5    30     37.5    80     100.0
Total            135    64.3    75     35.7    210    100.0
Multiple bar chart…
[Two multiple bar charts: number of children and % of children (y-axes) by immunization status (x-axis: Immunized, Not immunized), with separate bars for Male and Female]
Figure 2: Immunization status by sex of children in xxx woreda, 2010 (hypothetical)
Component bar chart
We can also construct a component bar chart for the above table
[Two component bar charts: number of children and % of children (y-axes) by sex (x-axis: Male, Female)]
Figure 3: Immunization status by sex of children in xxx woreda, 2010 (hypothetical)
Pie chart
• A circle divided into sectors so that the areas
of the sectors are proportional to the
frequencies.
• The 360° of the circle are shared out according to
each frequency’s share of
the total number of observations.
• Angle of a sector = fi/n × 360°, or equivalently (% of the class) × 3.6°
Pie chart…
Example:
Table 4: Blood type of 30 individuals in xxx
Blood type  Freq.  %
A           5      16.7
B           7      23.3
AB          5      16.7
O           13     43.3
Total       30     100
A = 5/30 × 360° = 60°
B = 7/30 × 360° = 84°
AB = 5/30 × 360° = 60°
O = 13/30 × 360° = 156°
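The sector angles follow directly from the frequencies. A small sketch with the blood type frequencies of the example; `pie_angles` is an illustrative name, and the multiplication is done before the division only to keep the floating-point results exact.

```python
# Blood type frequencies from the example (hypothetical data).
freqs = {"A": 5, "B": 7, "AB": 5, "O": 13}

def pie_angles(frequencies):
    """Return the sector angle in degrees for each category: f / n * 360."""
    n = sum(frequencies.values())
    return {cat: f * 360 / n for cat, f in frequencies.items()}

angles = pie_angles(freqs)
print(angles)
```

The angles should add up to 360°, which is a quick way to check the arithmetic.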
Pie chart
[Pie chart with sectors A 17%, B 23%, AB 17%, O 43%]
Figure 4: Blood type of 30 individuals in XXXX Woreda, 2010 (hypothetical)
Histograms
• A graph consisting of a series of rectangles whose bases are equal to
the class width of the corresponding class & whose heights are
proportional to the class frequencies
• Used for quantitative continuous data
1. The horizontal axis is a continuous scale running from one
extreme end of the distribution to the other
• It should be labelled with the name of the variable & the units of
measurement
2. For each class in the distribution, a vertical rectangle is
drawn such that:
• There are never gaps between the histogram rectangles
• The base of each rectangle is determined by the class width
E.g. Consider the data on student marks
Table 5: Marks of 30 students, AAU, Ethiopia, 2011 (hypothetical data)
Classes  Class marks (Xc)  Frequency
12-17    14.5              3
18-23    20.5              4
24-29    26.5              4
30-35    32.5              8
36-41    38.5              6
42-47    44.5              5
Total                      30
Histograms
[Histogram of the students’ marks]
Figure 4: Histogram showing students’ marks, AAU, 2010 (hypothetical data)
Frequency Polygon
• Join the midpoints of the tops of the adjacent rectangles
of the histogram with line segments
• When the polygon is closed onto the x-axis, the area under the polygon is
equal to the area under the histogram.
• The horizontal scale should be marked with the numerical values of
the midpoints (Xc)
• The length of each ordinate represents the class frequency.
[Frequency polygon of the students’ marks]
Figure 5: Frequency polygon showing marks of 30 students, AAU, 2010 (hypothetical data)
Cumulative frequency polygon (Ogive)
• A line graph obtained by plotting the cumulative
frequency (y-axis) against the class
boundaries (x-axis)
• Two types
– Cumulative frequency less than the UCB (Lcf), or
– Cumulative frequency more than the LCB (Mcf)
– We can also plot the two together and use their intersection.
Construct an Ogive by using the table from the
above example
Classes  Class boundaries  Freq  Less than cf  More than cf
12-17    11.5-17.5         3     3             30
18-23    17.5-23.5         4     7             27
24-29    23.5-29.5         4     11            23
30-35    29.5-35.5         8     19            19
36-41    35.5-41.5         6     25            11
42-47    41.5-47.5         5     30            5
Total                      30
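The two cumulative frequency columns are running totals taken from opposite ends of the frequency column. A minimal sketch using the standard library:

```python
from itertools import accumulate

# Class frequencies of the marks, from the lowest class to the highest.
freqs = [3, 4, 4, 8, 6, 5]

# Less than cf: running total from the first class downwards in the table.
less_than = list(accumulate(freqs))

# More than cf: running total from the last class upwards in the table.
more_than = list(accumulate(freqs[::-1]))[::-1]

print(less_than)   # plotted against the upper class boundaries
print(more_than)   # plotted against the lower class boundaries
```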
Less than Ogive
Figure 6: Less than Ogive showing marks of 30 students, AAU, 2010 (hypothetical data)
More than Ogive
Figure 7: More than Ogive showing marks of 30 students, AAU, 2010 (hypothetical data)
Less than & More than Ogive
Figure 8: More than & less than Ogive with their intersection, showing marks
of 30 students, AAU, 2010 (hypothetical data)
Data summarization
Lecture 4-6
March 2011
AAU
Learning objectives
At the end of this lecture, the students will
be able to:
– Identify the different parameters for data
summarization
– Differentiate between measures of central
tendency and dispersion
– Calculate the commonly used measures of
central tendency and measures of dispersion
– Interpret the final results of the measures
Data summarization
Although tables and graphs serve useful
purposes, there are many situations that require
other types of data summarization.
It is important to summarize data by means of just a
few numerical measures before inferences or
generalizations are drawn from the data.
This can be done by determining
– Measures of central tendency
– Measures of variation or dispersion
Measures of central Tendency
Numbers that tell us where the majority
of values in the distribution are located.
The center of the probability distribution
from which the data were sampled
Also called measures of location
– Arithmetic mean
– Median
– Mode
– Geometric mean, and
– Harmonic mean.
The arithmetic mean, median and mode are commonly used and are the focus of this lecture
The Arithmetic Mean
Arithmetic mean = average
The arithmetic mean is the sum of the
individual values in a data set divided by
the number of values in the data set.
We can compute the mean of both a finite
population and a sample
Mean…
{8, 5, 4, 12, 15, 5, 7}
What is the mean of these data?
Mean = (8 + 5 + 4 + 12 + 15 + 5 + 7)/7
= 56/7 = 8
But what if the data set is large?
Population mean
$\mu = \frac{\sum_{i=1}^{N} x_i}{N}$, where N is the number of values in the population
Mean for large Discrete data set
with frequency distribution
When we have a large data set it is
difficult to add the values manually.
In that case multiply each
value by its respective
frequency and divide by the total
frequency.
HH size  Freq
2        5
3        6
4        14
5        10
6        6
8        5
10       4
Total    50
Example: determine the mean HH size from the
following table
HH size (xi)  Freq (fi)  fi·xi
2             5          10
3             6          18
4             14         56
5             10         50
6             6          36
8             5          40
10            4          40
Total         50         250
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i} = \frac{250}{50} = 5$
According to these data, on average 5 people
live in a household
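The same calculation can be written as a short Python sketch. The dictionary maps each household size to its frequency; `frequency_mean` is an illustrative name.

```python
# Household size -> number of households (hypothetical data from the table).
hh_freq = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}

def frequency_mean(table):
    """Mean from a frequency table: sum(f * x) / sum(f)."""
    n = sum(table.values())
    total = sum(x * f for x, f in table.items())
    return total / n

mean_hh = frequency_mean(hh_freq)
print(mean_hh)
```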
Mean for Grouped data
From our previous example, determine the mean
mark of the students presented in the following table
Classes  Freq (fi)
12-17    3
18-23    4
24-29    4
30-35    8
36-41    6
42-47    5
Total    30
In this case we need to determine the
midpoint (class mark) of each
class (xi) to represent the group:
$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$, where xi is the class mark
Grouped mean….
Classes  Freq (fi)  xi    fi·xi
12-17    3          14.5  43.5
18-23    4          20.5  82
24-29    4          26.5  106
30-35    8          32.5  260
36-41    6          38.5  231
42-47    5          44.5  222.5
Total    30               945
Mean = 945/30
= 31.5
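For grouped data the only extra step is replacing each class by its class mark. A minimal sketch, assuming each row is written as ((LCL, UCL), frequency); `grouped_mean` is a hypothetical helper name.

```python
# ((LCL, UCL), frequency) for the marks of the 30 students.
grouped = [((12, 17), 3), ((18, 23), 4), ((24, 29), 4),
           ((30, 35), 8), ((36, 41), 6), ((42, 47), 5)]

def grouped_mean(rows):
    """Mean of grouped data: sum(f * class mark) / sum(f)."""
    n = sum(f for _, f in rows)
    total = sum((lcl + ucl) / 2 * f for (lcl, ucl), f in rows)
    return total / n

mean_mark = grouped_mean(grouped)
print(mean_mark)
```

Because every value in a class is represented by the class mark, this is an approximation to the mean of the raw marks.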
Characteristics of Arithmetic mean
1. Determined by every item in the series
2. Greatly affected by extreme values
3. The sum of the deviations about it is zero
4. The sum of the squares of the deviations
from the arithmetic mean is less than
that computed from any other point
Arithmetic mean…
Advantages
– Based on all values given in the distributions.
– Most amenable to mathematical treatment.
– It is most easily understood
Disadvantage
– Affected by extreme values in the distribution
– When the distribution has open-ended classes,
its computation has to be based on assumptions
and therefore may not be valid
Reading assignment
Geometric mean
Harmonic Mean
Median
Median = middle value
The median is defined as the “middle
most” observation.
The median is the observation such that half
the observations are above it and half are
below it.
It is the 50th percentile point
Median for ungrouped data
To determine the median, the first step is to put the
values in ascending order
E.g. consider the following data
{8, 5, 4, 12, 15, 5, 7} not ordered
4, 5, 5, 7, 8, 12, 15 ordered
Median = 7 (the 4th of the 7 ordered values)
What if the data set is too large to be
listed?
Median for grouped data
Class  Freq  cf
12-17  3     3
18-23  4     7
24-29  4     11
30-35  8     19
36-41  6     25
42-47  5     30
Total  30
It is possible to identify the median class from the position of the
middle observation(s) in the cumulative frequencies.
But this does not give us the exact
median value.
n = 30, so the median class is the
class that contains the 15th & 16th
observations,
i.e. class 30-35
Median for grouped data…
To get the exact value within 30-35, we need
another formula.
The formula to determine the median for grouped
frequency distributions:
$\text{Median} = L_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - C\right)$
Where:
Lmed = LCB of the median class
w = width of the median class
fmed = the frequency of the median class
n = total number of observations
C = cumulative frequency of the class preceding the median class
Example: from the previous table…
Class  Freq  cf
12-17  3     3
18-23  4     7
24-29  4     11
30-35  8     19
36-41  6     25
42-47  5     30
Total  30
Median class = 30-35
Lmed = 30 - 0.5 = 29.5
w = 35.5 - 29.5 = 6
fmed = 8
C = 11
n = 30
Median = 29.5 + 6/8 (30/2 - 11)
= 29.5 + 6/8 (15 - 11)
= 29.5 + 3
= 32.5
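The grouped median formula can be checked with a short sketch. It assumes rows of ((LCL, UCL), frequency) in ascending order and a unit of measurement of 1 (whole-number marks); `grouped_median` is an illustrative name.

```python
grouped = [((12, 17), 3), ((18, 23), 4), ((24, 29), 4),
           ((30, 35), 8), ((36, 41), 6), ((42, 47), 5)]

def grouped_median(rows, unit=1):
    """Median = L_med + (w / f_med) * (n/2 - C)."""
    n = sum(f for _, f in rows)
    cum = 0                                   # C: cf before the current class
    for (lcl, ucl), f in rows:
        if cum + f >= n / 2:                  # first class whose cf reaches n/2
            lcb = lcl - unit / 2              # lower class boundary
            width = ucl - lcl + unit          # UCB - LCB
            return lcb + width / f * (n / 2 - cum)
        cum += f

median_mark = grouped_median(grouped)
print(median_mark)
```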
Median…
Characteristics
– A positional average
– Affected more by the number of items than by extreme values
Advantages
– Easy to calculate and more typical of the series
– The median may be located even when the data are
incomplete,
e.g. when the class intervals are irregular or the final
classes are open-ended
– Not affected by extreme observations
Disadvantages
– Not well suited to mathematical treatment
– Not as familiar as the arithmetic mean
Mode
Mode: the value that occurs most frequently
A given data set may have
– One mode = unimodal
E.g. 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 7, 8: the mode is 5
– Two modes = bimodal
E.g. 10, 11, 12, 12, 12, 13, 14, 15, 15, 15, 17:
the modes are 12 & 15
– More than two modes = multimodal
– No mode at all = non-modal
E.g. 3, 4, 5, 7, 8, 10
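The three examples can be reproduced with `statistics.multimode` from the Python standard library (Python 3.8 or later). The wrapper `modes` is a made-up helper that applies the convention used above: when every distinct value occurs equally often, the data set is treated as having no mode.

```python
from statistics import multimode

def modes(values):
    """Return the list of modes, or [] when all distinct values tie."""
    found = multimode(values)                 # most frequent value(s)
    if len(found) > 1 and len(found) == len(set(values)):
        return []                             # every value ties: no mode
    return found

unimodal = modes([3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 7, 8])
bimodal = modes([10, 11, 12, 12, 12, 13, 14, 15, 15, 15, 17])
no_mode = modes([3, 4, 5, 7, 8, 10])
print(unimodal, bimodal, no_mode)
```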
Mode for ungrouped data
HH size  Freq
2        5
3        6
4        14
5        10
6        6
8        5
10       4
Total    50
The mode can be identified simply by selecting the
observation with the largest
frequency.
In these data the
greatest frequency is 14,
so the mode is 4
Mode for grouped data
Class  Freq
12-17  3
18-23  4
24-29  4
30-35  8
36-41  6
42-47  5
Total  30
Here the modal class, the
class with the highest
frequency, is 30-35.
We need to determine the
exact value between 30 & 35 that
represents the mode of the
data
It is determined by the
formula given below.
Mode for grouped data…
$\text{Mode} = L_{mo} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$
Where:
– Lmo = LCB of the modal class
– w = the width of the modal class
– ∆1 = frequency of the modal class - frequency of the
class preceding the modal class
– ∆2 = frequency of the modal class - frequency of the
class following the modal class
Example
Class  Freq
12-17  3
18-23  4
24-29  4
30-35  8
36-41  6
42-47  5
Total  30
Modal class = 30-35
Lmo = 29.5
w = 35.5 - 29.5 = 6
Frequency of the modal class = 8
Frequency of the class preceding the modal class = 4
Frequency of the class following the modal class = 6
∆1 = 8 - 4 = 4 & ∆2 = 8 - 6 = 2
Mode = 29.5 + 6(4/(4 + 2))
= 29.5 + 6(4/6)
= 33.5
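A sketch of the grouped mode formula follows. It assumes a single modal class, a unit of measurement of 1, and treats a missing neighbouring class as having frequency 0 (an assumption of this sketch, not something stated on the slides); `grouped_mode` is an illustrative name.

```python
grouped = [((12, 17), 3), ((18, 23), 4), ((24, 29), 4),
           ((30, 35), 8), ((36, 41), 6), ((42, 47), 5)]

def grouped_mode(rows, unit=1):
    """Mode = L_mo + w * d1 / (d1 + d2) for the class with the highest frequency."""
    freqs = [f for _, f in rows]
    i = freqs.index(max(freqs))                        # index of the modal class
    (lcl, ucl), f_mo = rows[i]
    prev_f = freqs[i - 1] if i > 0 else 0              # assumed 0 if no class before
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0 # assumed 0 if no class after
    d1 = f_mo - prev_f
    d2 = f_mo - next_f
    lcb = lcl - unit / 2                               # lower class boundary
    width = ucl - lcl + unit                           # UCB - LCB
    return lcb + width * d1 / (d1 + d2)

mode_mark = grouped_mode(grouped)
print(mode_mark)
```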
Characteristics of mode
Is an average position
Not affected by extreme values
The most typical value of the distribution
Advantage & disadvantage of mode
Advantages
– Since it is the most typical value, it is the most descriptive
average
– Since the mode is usually an actual value, it indicates
the precise value of an important part of the series.
– It is not affected by extreme values
Disadvantages
– It is not capable of mathematical treatment
– It has little significance for small samples
– With a small number of items the mode may not exist
Measures of Dispersion
Dispersion
In order to utilize the information provided
by a set of data, knowing just the location or
average value of the data is not
adequate.
We also need to know the dispersion or
the variability.
Common measures of variability:
– Range,
– Inter-quartile range,
– Variance,
– Standard deviation and
– Coefficient of variation
Range
The range is defined as the difference between
the largest and the smallest observations in
the data set.
Range (R) = Largest (L) - Smallest (S)
observation
R = L - S
Range…
E.g. From the HH size data
4, 6, 4, 3, 5, 2, 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6,
2, 6, 4, 3, 5, 2, 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6,
2, 5, 2, 8, 10, 4, 4, 5, 3, 10, 4, 5, 6, 3, 5, 6
Largest (L) = 10 & Smallest (S) = 2
R = 10 - 2 = 8
Example from the students’ marks
Classes  Freq
12-17    3
18-23    4
24-29    4
30-35    8
36-41    6
42-47    5
Total    30
R = L - S
R = 47 - 12
= 35
Range..
Advantages
– Computation is simple
– Easy to understand
Disadvantages
– It does not consider all values
– A poor measure of dispersion
Interpretation of the range depends on the
number of observations:
when the number of observations increases, the
range can get larger
Quantiles
When a distribution is arranged in order of
magnitude, the median is the value of the
middle term.
Other measures that depend on position in the ordered
distribution, such as quartiles, deciles &
percentiles, are collectively called quantiles.
Quartiles
Quartiles are values which divide the
distribution into four parts such that there is an
equal number of observations in each part.
– First Quartile (Q1) = the value at which 25% of observations are less
than or equal to it.
– Second Quartile (Q2) = the value at which 50% of observations are
less than or equal to it. This is the median.
– Third Quartile (Q3) = the value at which 75% of observations are
less than or equal to it.
Calculating quartiles for ungrouped data
First arrange the data in increasing order

Then, use the following formula:
$Q_i = \frac{i}{4}(n+1)^{\text{th}}$ value, $i = 1, 2, 3$, so that
$Q_1 = \frac{1}{4}(n+1)^{\text{th}}$ value
$Q_2 = \frac{2}{4}(n+1)^{\text{th}}$ value
$Q_3 = \frac{3}{4}(n+1)^{\text{th}}$ value
126
Example
HH size   Freq   Cf
2         5      5
3         6      11
4         14     25
5         10     35
6         6      41
8         5      46
10        4      50
Total     50
Q1 = ¼(50+1)th value = ¼(51) = 12.75th value
= 12th value + 0.75 x (13th value – 12th value)
= 4 + 0.75(4 – 4) = 4
Q2 = 2/4(50+1)th value = 2/4(51) = 25.5th value
= 25th value + 0.5 x (26th value – 25th value)
= 4 + 0.5(5 – 4) = 4.5
Q3 = 3/4(50+1)th value = 3/4(51) = 38.25th value
= 38th value + 0.25 x (39th value – 38th value)
= 6 + 0.25(6 – 6) = 6
127
Inter-quartile range (IQR)
Inter-quartile range is the difference between the third
and the first quartiles.
IQR = Q3 – Q1
= 6 – 4 = 2
This tells us how far apart the third and the first
quartiles are
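The (n + 1) positional rule with interpolation can be sketched in Python as follows. This is a minimal illustration, assuming the data are rebuilt from the household-size frequency table; the function name `quartile` is hypothetical, and other software may use slightly different quartile conventions.

```python
def quartile(sorted_values, i):
    """i-th quartile using the i/4 * (n + 1)th-value rule with linear interpolation."""
    n = len(sorted_values)
    pos = i * (n + 1) / 4          # 1-based position, e.g. 12.75
    lower = int(pos)               # whole part of the position
    frac = pos - lower             # fractional part of the position
    if lower < 1:
        return sorted_values[0]
    if lower >= n:
        return sorted_values[-1]
    lo_val = sorted_values[lower - 1]   # the lower-th value
    hi_val = sorted_values[lower]       # the (lower + 1)-th value
    return lo_val + frac * (hi_val - lo_val)


# Rebuild the 50 ordered household sizes from the frequency table.
freq = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}
data = sorted(v for v, f in freq.items() for _ in range(f))

q1, q2, q3 = (quartile(data, i) for i in (1, 2, 3))
print(q1, q2, q3, q3 - q1)  # 4.0 4.5 6.0 2.0
```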
128
Calculating quartiles for grouped data
First find the class in which Qi lies.
This can be obtained by counting i(n)/4 observations
beginning from the lowest class.
Classes   Freq   cf
12-17     3      3
18-23     4      7
24-29     4      11
30-35     8      19
36-41     6      25
42-47     5      30
Total     30
Class for Q1 = 1(30)/4 = 7.5th value, so the class is 24-29
Class for Q2 = 2(30)/4 = 15th value, so the class is 30-35
Class for Q3 = 3(30)/4 = 22.5th value, so the class is 36-41
129
Then, we use the formula
$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{i\,n}{4} - C\right), \quad i = 1, 2, 3$
Where:
– LQi = Lower class boundary of the Quartile class
– w = width of the quartile class
– n = total number of observations
– fQi = frequency of the quartile class
– C = cumulative frequency preceding the quartile
class
130
Solution
$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{i\,n}{4} - C\right), \quad i = 1, 2, 3$
Q1 = 23.5 + 6/4(1(30)/4 – 7)
= 23.5 + 1.5(7.5 – 7) = 23.5 + 0.75 = 24.25
Q2 = 29.5 + 6/8(2(30)/4 – 11)
= 29.5 + 6/8(15 – 11) = 29.5 + 3 = 32.5
Q3 = 35.5 + 6/6(3(30)/4 – 19)
= 35.5 + (22.5 – 19) = 35.5 + 3.5 = 39
IQR = Q3 – Q1
= 39 – 24.25 = 14.75
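The grouped-data formula can be applied mechanically, as in the Python sketch below. It assumes integer class limits, so each class boundary is taken 0.5 below the lower limit; the function name `grouped_quartile` and the tuple layout are illustrative choices, not part of the lecture.

```python
def grouped_quartile(classes, i):
    """i-th quartile of a grouped frequency table.

    classes: list of (lower_limit, upper_limit, frequency) with integer limits,
    so the lower class boundary is assumed to be lower_limit - 0.5.
    """
    n = sum(f for _, _, f in classes)
    target = i * n / 4                     # position i*n/4
    cum = 0                                # C: cumulative frequency before the class
    for lower, upper, f in classes:
        if cum + f >= target:
            boundary = lower - 0.5         # L: lower class boundary
            width = upper - lower + 1      # w: class width
            return boundary + width / f * (target - cum)
        cum += f
    raise ValueError("quartile position exceeds total frequency")


marks = [(12, 17, 3), (18, 23, 4), (24, 29, 4), (30, 35, 8), (36, 41, 6), (42, 47, 5)]
q1, q2, q3 = (grouped_quartile(marks, i) for i in (1, 2, 3))
print(q1, q2, q3, q3 - q1)
```

Working through Q1 by hand gives 23.5 + (6/4)(7.5 – 7) = 23.5 + 0.75, which is what the function returns.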
131
Box plot
a graphical display that involves a five-number
summary of a distribution of values, consisting
of
– the minimum value,
– the first quartile,
– the median,
– the third quartile, and
– the maximum value
132
Box plots
• It could be vertical or horizontal
• A vertical box-plot is constructed by drawing

a box between the quartiles Q1 and Q3.
• Vertical lines are then drawn from the middle
of the top and bottom edges of the box to the
maximum and minimum values.
133
Box plots…
• These lines are called
whiskers.
• In a vertical box plot, a horizontal line inside the box marks the
median.
• Outliers are usually indicated by a dot

or an asterisk.
134
Box plot...
Putting IQR in diagrammatic form
Maximum = 47
Q3 =39
Q2= median =32.5
Q1 = 24.25
Minimum =12
135
Purpose of box plot
Shows the center of the distribution (median)
Shows the spread of the distribution
Shows the shape of the distribution (symmetry or skewness)
136
IQR..
Advantages
– It is a simple and versatile measure
– It encloses the central 50% of the observations
– Less prone to distortion by a single large or
small value
Disadvantage
– It is not based on all observations but only on
two specific values
137
Reading assignments
Deciles
Percentiles
138
Variance & standard deviation
Variance and standard deviation are other
measures of dispersion
They measure how far the data are spread around
the mean
They are denoted as:
σ² = Population variance
S² = Sample variance
σ: Population standard deviation
S: Sample standard deviation
139
Variance
Population variance (σ²) is computed by squaring the
deviation of each observation from the mean, adding
them, and dividing their sum by N:
$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$
Sample variance (S²) is computed by squaring each
deviation, adding them, and dividing their sum by one
less than n:
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$
140
Standard deviation
Standard deviation is the square root
of the variance
$S = \sqrt{S^2}$
$\sigma = \sqrt{\sigma^2}$
141
142
For frequency distribution
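Population variance:
$\sigma^2 = \frac{\sum_{i=1}^{N} f_i (X_i - \mu)^2}{N}$
Population standard deviation:
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} f_i (X_i - \mu)^2}{N}}$
Sample variance:
$S^2 = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n - 1}$
Sample standard deviation:
$S = \sqrt{\frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n - 1}}$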
143
Example
Calculate the variance & standard
deviation for the following table
HH size   Freq
2         5
3         6
4         14
5         10
6         6
8         5
10        4
Total     50
First step is calculating the
mean by using the formula.
As calculated above,
the mean is 5
144
Solution
Mean = 5
HH size   Freq   (Xi - X̄)   (Xi - X̄)²   f(Xi - X̄)²
2         5      -3         9           45
3         6      -2         4           24
4         14     -1         1           14
5         10     0          0           0
6         6      1          1           6
8         5      3          9           45
10        4      5          25          100
Total     50                            234
Variance
$S^2 = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n - 1}$
S² = 234/(50 - 1)
= 234/49
= 4.8
Standard deviation
$S = \sqrt{4.8} = 2.2$
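The frequency-table calculation can be reproduced with a few lines of Python. This is a minimal sketch; the dictionary layout and the name `freq_variance` are assumptions made for illustration.

```python
import math


def freq_variance(freq):
    """Sample variance of a frequency table {value: frequency}, dividing by n - 1."""
    n = sum(freq.values())
    mean = sum(x * f for x, f in freq.items()) / n
    ss = sum(f * (x - mean) ** 2 for x, f in freq.items())  # sum of f * squared deviations
    return ss / (n - 1)


hh = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}
var = freq_variance(hh)
sd = math.sqrt(var)
print(round(var, 1), round(sd, 1))  # 4.8 2.2
```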

145
Variance & SD for
grouped frequency
distribution
146
For grouped frequency distribution
We use the class mark (mid point) of each class
to represent the class
E.g. calculate the variance & standard
deviation for the 30 students' marks
First step is determining the mean
Determined above (31.5)
147
 
$\text{Variance} = S^2 = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n - 1}$
= 2598/(30 - 1) = 2598/29 = 89.6
$SD = S = \sqrt{89.6} = 9.5$
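For grouped data the same steps can be sketched in Python with class marks standing in for the observations. The tuple layout below is an illustrative assumption; the class limits and frequencies are those of the 30 students' marks.

```python
import math

# (lower limit, upper limit, frequency) for the 30 students' marks.
classes = [(12, 17, 3), (18, 23, 4), (24, 29, 4), (30, 35, 8), (36, 41, 6), (42, 47, 5)]

n = sum(f for _, _, f in classes)
mids = [((lo + hi) / 2, f) for lo, hi, f in classes]   # class marks (mid points)
mean = sum(m * f for m, f in mids) / n
ss = sum(f * (m - mean) ** 2 for m, f in mids)         # sum of f * squared deviations
variance = ss / (n - 1)
sd = math.sqrt(variance)
print(mean, ss, round(variance, 1), round(sd, 1))  # 31.5 2598.0 89.6 9.5
```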
148
Importance of variance & Standard deviation
The greater the variation in the data set, the

larger the magnitude of these deviations will
tend to be.
The basic question being asked is how much do

the scores deviate around the Mean?
The more “bunched up” around the mean the

better your ability to make accurate predictions.
149
Coefficient of Variation (CV)
CV is the ratio of the standard deviation to
the absolute value of the mean.
CV = (Standard deviation / Mean) x 100%
$\text{Population CV} = \frac{\sigma}{\mu} \times 100\%$
$\text{Sample CV} = \frac{S}{\bar{X}} \times 100\%$
150
CV….
Expresses the size of the variation relative to the
mean
CV is a good measure of relative variation
The higher the CV, the higher the variability in
the data set and the lower the precision of the
data, and vice versa.
Commonly used for comparison of different data
sets (different samples)
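As an illustration of comparing two data sets, the sketch below computes the CV for the two worked examples in these notes (household size: mean 5, SD 2.2; students' marks: mean 31.5, SD 9.5). The function name is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """Return the CV as a percentage: SD / |mean| * 100."""
    return sd / abs(mean) * 100


cv_hh = coefficient_of_variation(2.2, 5)        # household sizes
cv_marks = coefficient_of_variation(9.5, 31.5)  # students' marks
print(round(cv_hh, 1), round(cv_marks, 1))      # 44.0 30.2
```

On these figures the household sizes vary more, relative to their mean, than the marks do, even though the marks have the larger standard deviation.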
151
Skewness
152
Skewness
Skewness is the measure of asymmetry of the
distribution
If extremely low or extremely high observations

exist in distribution, then the mean tends to shift
towards those scores.
Based on the type of skewness, the distribution
can be
– Negatively skewed
– Positively skewed
– Symmetrical (e.g. normally distributed)
153
Negatively skewed distribution;
– occurs when the majority of scores are at the right end of the
curve and a few small scores are scattered at the left
end.
Negatively skewed
In unimodal negatively skewed distribution,
Mean, median and mode occur in alphabetic
order 154
Positively skewed distribution
Occurs when the majority of scores are at the
left end of the curve and a few extreme large
scores are scattered at the right end.
Positively skewed
In unimodal positively skewed distribution,
Mean, median & Mode occur in reverse
alphabetical order 155
Symmetrical distribution
It is neither positively nor negatively skewed.
– A curve is symmetric if one half of the curve
is the mirror image of the other half.
– The normal distribution is the best-known example
In unimodal symmetric distribution

Mean, median and mode are identical. 156
Example
1. Data on birthweight were collected from 1000
neonates in Woreda “A” and summarized as:
– Mean = 3kg, Median = 2.5kg & Mode = 2kg
2. Body weights were measured for 600 adults aged
18 years and above in Woreda “A” and summarized
as;
– Mean = 60Kg, Median = 70Kg & Mode = 80Kg
3. Body weights were measured for 600 adults aged
18 years and above in Woreda “C” and summarized
as;
– Mean = 65.0Kg, Median = 65.5Kg & Mode = 65.2Kg
Q1. What is the type of distribution for the three

cases?
Q2. What do you understand from the three data sets?
157
Solution
Case 1:
Mode (2kg) < Median (2.5kg) < Mean (3kg)
Reverse alphabetic order, So, positively skewed
Case 2:
Mean (60Kg) < Median (70Kg) < Mode(80Kg)
Alphabetic order, So, negatively skewed
Case 3:
Mean (65.0Kg) ≈ Median (65.5Kg) ≈ Mode(65.2Kg)
Almost at similar position
So, symmetrical (Normally distributed)
Q2: The data tell us that using the mean may mislead when
describing skewed data.
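The ordering rule used in the three cases can be written as a small Python function. This is only a sketch: the function name and the `rel_tol` threshold (how close the three values must be, relative to the mean, to count as "almost at similar position") are assumptions introduced here, not part of the lecture.

```python
def skew_type(mean, median, mode, rel_tol=0.02):
    """Classify skewness from the order of mean, median and mode.

    rel_tol is an assumed threshold: if the three values differ by no more
    than rel_tol * |mean|, the distribution is treated as symmetrical.
    """
    spread = max(mean, median, mode) - min(mean, median, mode)
    if spread <= rel_tol * abs(mean):
        return "symmetrical"
    if mode < median < mean:
        return "positively skewed"   # reverse alphabetical order
    if mean < median < mode:
        return "negatively skewed"   # alphabetical order
    return "undetermined"


print(skew_type(3, 2.5, 2))          # case 1
print(skew_type(60, 70, 80))         # case 2
print(skew_type(65.0, 65.5, 65.2))   # case 3
```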
158
Choice of Central tendency
The choice of which measure to use depends on:
The shape of the distribution (whether normal or
skewed)
If the distribution is symmetrical, mean is

the best measure of central tendency
If the distribution is skewed, median is

appropriate measure
159
Z-Score
(Relative Position)
160
Z-score
The z-score is the number of standard deviations
the data value falls either above or below the mean
for the data set.
– If above: positive z-score
– If below: negative z-score
It tells us the relative position of each value in
reference to mean
When computing the value of the z-score, the data

values can be population values or sample values.
Hence we can compute either a population z-score

or a sample z-score
161
Z-score…
The z-score for a value in a data set is
obtained by subtracting the mean of the
data set from the value and dividing the
result by the standard deviation of the
data set.
162
Sample Z-score
• The sample z-score for a value x is given
by the following formula:
$z\text{-score} = \frac{x - \bar{x}}{s}$
• Where $\bar{x}$ is the sample mean and s is
the sample standard deviation.
163
Population Z-score
• The population z-score for a value x is
given by the following formula:
$z\text{-score} = \frac{x - \mu}{\sigma}$
• Where $\mu$ is the population mean and $\sigma$ is
the population standard deviation.
164
Z-score…
The z-score is affected by an outlying value
in the data set,
Because the outlier directly affects the

value of the mean and the standard
deviation.
Outlying value is very small or very large

value relative to the size of the other values
in the data set
So it is usually used for symmetrical or normally
distributed data sets. 165
Why use Z-score?
• The z-score gives us an idea of how far
away the data value is from the mean,
and so it gives us an idea of the
position of the data value relative to the
mean.
166
Example
• What is the z-score for the value of

14 in the following sample values?
3 8 6 14 4 12 7 10
167
Solution
1st step is determining the mean & standard
deviation.
$\bar{X} = 8$ & $S = 3.82$
$Z = \frac{X - \bar{X}}{S} = \frac{14 - 8}{3.82} = 1.57$
(rounded to two decimal places)
Thus, the data value of 14 is 1.57 standard
deviations above the mean of 8, since the z-score
is positive
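The sample z-score can be checked with Python's standard `statistics` module, whose `stdev` uses the n - 1 divisor. The helper name `z_score` is illustrative only.

```python
import statistics

sample = [3, 8, 6, 14, 4, 12, 7, 10]


def z_score(x, values):
    """Sample z-score: (x - mean) / sample standard deviation (n - 1 divisor)."""
    return (x - statistics.mean(values)) / statistics.stdev(values)


print(round(statistics.stdev(sample), 2))  # 3.82
print(round(z_score(14, sample), 2))       # 1.57
print(round(z_score(6, sample), 2))        # -0.52
```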
168
This can be presented as
169
Z-score…
What is the z-score for the value of 6
in the above sample values?
$Z = \frac{X - \bar{X}}{S} = \frac{6 - 8}{3.82} = \frac{-2}{3.82} = -0.52$
Thus, the data value of 6 is 0.52 standard
deviations below the mean of 8, since the z-score
is negative
170
Example 2
What are the z-scores
for the values of 24 and
44 in the students' marks
given in the following
table?
Classes   Freq
12-17     3
18-23     4
24-29     4
30-35     8
36-41     6
42-47     5
Total     30
1st determine the mean &
standard deviation:
$\bar{X} = 31.5$ & $S = 9.50$

171
Solution
For the value 24,
$Z = \frac{X - \bar{X}}{S} = \frac{24 - 31.5}{9.5} = \frac{-7.5}{9.50} = -0.79$
Thus, the data value of 24 is 0.79 standard
deviations below the mean of 31.5, since the z-score
is negative
For the value 44,
$Z = \frac{X - \bar{X}}{S} = \frac{44 - 31.5}{9.5} = \frac{12.5}{9.50} = 1.32$
Thus, the data value of 44 is 1.32 standard deviations
above the mean of 31.5, since the z-score is positive
172
Transformed Z-score
If the Z-score is given, it is possible to get
the value corresponding to that Z.
$Z = \frac{X - \bar{X}}{S}$
$X - \bar{X} = ZS$
$X = \bar{X} + ZS$
173
Example
From the above students’ marks, mean is
31.5 & standard deviation is 9.5. find the
value that corresponds to z - score of -1.50
and Z- score of 1.50
174
Solution
For Z = -1.50
$X = \bar{X} + ZS = 31.5 + (-1.50)(9.5) = 31.5 - 14.25 = 17.25$
For Z = 1.50
$X = \bar{X} + ZS = 31.5 + 1.50(9.5) = 31.5 + 14.25 = 45.75$
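Inverting the z-score formula is a one-line calculation; the sketch below applies it to the students' marks (mean 31.5, SD 9.5). The function name `value_from_z` is illustrative.

```python
def value_from_z(z, mean, sd):
    """Invert the z-score formula: X = mean + z * sd."""
    return mean + z * sd


print(value_from_z(-1.50, 31.5, 9.5))  # 17.25
print(value_from_z(1.50, 31.5, 9.5))   # 45.75
```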
175
Normal Values
Normal values are values regarded as being
within the usual range of variation in a given
population or a set of data
The range of such values is called normal

ranges.
The normal range for most biological & natural
distributions is defined as the interval within 2 standard
deviations on either side of the mean
In a normal distribution, this covers about 95% of
the total area (observations)
176
The Rule of Thumb
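For data that approximate a normal distribution, about 68% of observations lie within 1 standard deviation of the mean, about 95% within 2, and about 99.7% within 3.
This is also called the Standard Deviation 68-95-99.7 Rule 177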

178
Rule of Thumb
[Figure: normal curve (S = standard deviation) spanning µ-3σ to µ+3σ, with the central 68%, 95% and 99.7% of the area marked]
The entire area under the curve = 100%

179
Normal value
[Figure: normal curve with the central 68% and 95% of the area marked]
180
Example
Students sat a Biostatistics exam marked out of 100%
Mean = 75
SD = 5
Minimum = 50
Maximum = 95
Assuming that the results are normally distributed,
What are the values within which 68% of students are

encompassed?
What are the values within which 95% of students are

encompassed (the ranges for normal values)?
181
Solution
68% = within 1 SD, using $X = \bar{X} \pm ZS$
X = 75 ± 1(5) = (75 - 5, 75 + 5) = (70, 80)
95% = within 2 SD
X = 75 ± 2(5) = (75 - 10, 75 + 10) = (65, 85)
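The intervals above, and the share of a normal distribution that each one covers, can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). This is a sketch under the example's assumption that marks are normally distributed with mean 75 and SD 5; the helper name `interval` is illustrative.

```python
from statistics import NormalDist

mean, sd = 75, 5
exam = NormalDist(mu=mean, sigma=sd)


def interval(k):
    """Range of values within k standard deviations of the mean."""
    return (mean - k * sd, mean + k * sd)


for k in (1, 2, 3):
    low, high = interval(k)
    share = exam.cdf(high) - exam.cdf(low)   # area under the curve between low and high
    print(k, (low, high), round(share * 100, 1))
```

The exact shares are about 68.3%, 95.4% and 99.7%, which the rule of thumb rounds to 68, 95 and 99.7.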
182
Can be presented by using the standard
normal curve
[Figure: normal curve of Marks (%) with regions labelled F, D, C, B and A along the horizontal axis]
183
Probability
Lecture-7-8
April 2011,
AAU
184
Probability
• Deterministic vs probabilistic explanation of
occurrences
• Since there is little in life that occurs with
absolute certainty, probability theory has
found application in virtually every field of
human endeavor.
185
Why Probability Theory?
• As we observe the universe around us, wonderful
Craftsmanship can be seen.
• As we examine the elements of this creation we discover that

there is incredible order, but also variation therein.
• Probability theory seeks to describe the variation or

randomness within order so that underlying order may be
better understood.
• Once understood, strategies can be more effectively

formulated and their risks evaluated.
186
What is Probability?
• Probability is a branch of mathematics concerned with
the analysis of random phenomena (chance)
• is the mathematical framework for describing
(modelling) uncertainty
• Is a numerical measure of the likelihood that a specific
event will occur
• Probability theory provides a way to find and express our
uncertainty in making decisions about a population from
sample information
• A measure of the degree of chance or likelihood of

occurrence of an uncertain event
187
Probability…
• Probability theory began in the 16th and 17th
centuries
• European mathematicians began to analyze simple
games of cards and dice.
• One of the first attempts to use ideas of relative
frequency to study human populations was by John Graunt.
• Now applied to analyze data in astronomy, mortality

data, traffic flow, telephone interchange, genetics,
epidemics, investment...etc
188
Common terms in probability
1. Experiment:
• In statistics, anything that results in a count or
measurement is called an experiment.
• E.g. tossing a coin, rolling a die, etc.
2. Sample Space (S):
• The set of all possible outcomes of an experiment
• E.g. in tossing a coin: {H, T}
• In rolling a die: {1, 2, 3, 4, 5, 6}
3. Event (E): a set of outcomes of a random phenomenon
(experiment)
• Any subset of the sample space
• E.g. getting even numbers {2, 4, 6}
Getting odd numbers {1, 3, 5}
189
Properties of Probability
1. Probabilities always lie between 0 and 1.
2. Zero probability implies that something is impossible.
3. A probability of 1 means something is certain.
4. The sum of all probabilities of a distribution is equal to
1.
190
Probability…
• Example: suppose the probability of getting sick
for a person is 0.25
• A probability of 0.25 (also expressed as 1/4, or 25%)
implies that we think that it is 3 times as likely not to
get sick as it is to get sick.
• This is because
– P(no sickness) = 1 - P(sickness) = 0.75
– 0.75/0.25 = 3.
191
Probability..
• Let A denote an event . Then,
• The probability of that event is usually written as

P(A) or Pr(A)
• The complement of an event (Ac) is everything not in that

event .
• The probability of the complement of an event or

probability of non occurrence is written as
P(Ac) = 1 - P(A)
192
Probability theories
 Two views:
1. Objectivist (Frequentist) &
2. Subjectivist (Bayesian)
193
1. Frequentist (or Objectivist):
• Probabilities are real aspects of the world that can be
measured by relative frequencies of outcomes of
experiments
 based on equally-likely events
 based on long-run relative frequency of events
 not based on personal beliefs
 is the same for all observers (objective)
 examples: toss a coin, throw a die, pick a card
 Well accepted in statistics as compared to the
Bayesian (or Subjectivist)
194
2. Bayesian (or Subjectivist):
• Probabilities are descriptions of an observer's
degree of belief or uncertainty rather than
having any external significance
– based on personal beliefs, experiences, prejudices,
intuition - personal judgment
– different for all observers (subjective)
– examples: elections, new product introduction,
snowfall
(Thomas Bayes, c. 1706 - 1761)

195
Example of subjective
• If some one says that he is 95% certain that
a cure for AIDS will be discovered within 5
years, then
– He means that Pr(discovery of cure of AIDS
within 5 years) = 95%.
• Although the subjective view of probability
has enjoyed increased attention over the
years, it has not been fully accepted by
scientists.
196
Classical definition of probability
• Classical Probability (theoretical):
– With equally likely outcomes, the probability of an event can be stated
in advance, and it agrees with the event’s long run relative
frequency in repeated trials under similar conditions. OR,
– The probability of any outcome of a random phenomenon
is the proportion of times the outcome would occur in a
very long series of repetitions.
– Examples:
• The probability of getting a head when tossing a
coin is 0.5, so in 100 tosses we expect about 50 heads.
• The probability that a fetus is male is 50% per
pregnancy, so in 8 pregnancies we expect about 4 males.
197
Relative frequency probability (empirical):
• If some process is repeated a large number of times (n),
and some resulting event E occurs m times, the relative
frequency of E (m/n) will be approximately equal to the
probability of E.
– Symbolically, Pr(E) = m/n
– E.g. Suppose that of 158 people who attended a

dinner party, 99 were ill due to food poisoning.
– Thus, the probability of illness for a person

selected at random is given as
• Pr (illness) = 99/158 = 0.63 or 63%
198
In general,
• If there are “n” equally likely possibilities, of
which one must occur, and “S” of them are regarded as
favourable outcomes or successes, then the
probability of success is given by S/n
• i.e.
$P(\text{success}) = \frac{\text{\# of successes}}{\text{total \# of outcomes}}$
$P(A) = \frac{\text{\# of ways A can occur}}{\text{total \# of outcomes}}$
199
Random Phenomena
We call a phenomenon random if:-
 The exact outcome is not predictable in advance.
 Nonetheless, there is a predictable long term pattern that

can be described by the distribution of outcomes of very
many trials.
• Thus,
• A phenomenon is random, if individual outcomes are
uncertain but there is a regular distribution of outcomes in a
large number of repetitions.
 E.g. tossing coin 100 times, approximates 50%H & 50%T

 A woman giving 8 births, approximates 50% male &50%
female
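The long-run pattern of a random phenomenon can be illustrated by simulation. The Python sketch below tosses a simulated fair coin; the function name and the fixed seed are assumptions made so the run is repeatable, and individual results for small numbers of tosses may sit well away from 50%.

```python
import random


def relative_frequency(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times and return the proportion of heads."""
    rng = random.Random(seed)          # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


for n in (10, 100, 10_000):
    print(n, relative_frequency(n))
```

As n grows, the printed proportion tends to settle near 0.5, even though no single toss can be predicted.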
200
e.g.
Coin tossing 100 times
201
Common terms in relation of events
• Set: a collection of elements or objects of interest
• Empty set (denoted by ∅)
– a set containing no elements
• Universal set (denoted by S) = Sample space
– a set containing all possible elements
• Complement (Not). The complement of A, written Ac or A’, is
– a set containing all elements of S not in A
• Intersection
• Union
• Mutually exclusive
• Partition
202
Elements of Set A
Venn Diagram illustrating the elements of an event

203
Complement of a Set A =AC = A’
A’
A
Venn Diagram illustrating the Complement of an event

204
Intersection of sets
Intersection (And)
– a set containing all elements in both A and
B, written A ∩ B
[Figure: Venn diagram of A ∩ B] 205
Union of sets
Union (Or)
a set containing all elements in A or B
or both
206
Mutually exclusive or disjoint sets
 sets having no elements in common, having no
intersection, whose intersection is empty set
207
Partition
• a collection of mutually exclusive sets which
together include all possible elements, whose union
is the universal set
208
Rules of probability
1. For any event A, P(A) ranges from 0 to 1
P(A): 0 ≤ P(A) ≤ 1.
2. If A and B can never both occur at a time (they are
mutually exclusive), then
P(A and B) = P(A ∩ B) = 0
3. For any event A and event B,
P(A or B) = P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
4. If A and B are mutually exclusive events, then
P(A or B) = P(A ∪ B) = P(A) + P(B).
5. For event A, the probability that it does not occur is
P(Ac) = 1 - P(A).
6. If A and B are independent events, then
P(A and B) = P(A ∩ B) = P(A) × P(B).
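These rules can be verified on a small sample space with Python sets. The sketch below uses one roll of a fair die; the events A, B, C and D are chosen here purely for illustration and do not come from the lecture.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}            # sample space for one roll of a fair die


def prob(event):
    """Probability of an event (a subset of S) when outcomes are equally likely."""
    return Fraction(len(event & S), len(S))


A = {2, 4, 6}                     # even number
B = {4, 5, 6}                     # number greater than 3
C = {1, 3, 5}                     # odd number (mutually exclusive with A)
D = {1, 2, 3, 4}                  # number of at most 4 (independent of A)

assert prob(A | B) == prob(A) + prob(B) - prob(A & B)   # rule 3: addition rule
assert prob(A & C) == 0                                 # rule 2: mutually exclusive
assert prob(A | C) == prob(A) + prob(C)                 # rule 4
assert prob(S - A) == 1 - prob(A)                       # rule 5: complement
assert prob(A & D) == prob(A) * prob(D)                 # rule 6: independence
print(prob(A | B))  # 2/3
```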
209
Conditional Probability
• For non-independent events
• The probability that event B occurs given that
event A has occurred is called a conditional
probability.
• It is denoted by the symbol P(B | A), which is

read “the probability of B given A.”
• We call A the given event.

• It is computed from the joint probability P(A and B)
210
Conditional…..
• If A and B are any two events, and the occurrence of
event B depends on the occurrence of event A, then
$P(B \mid A) = \frac{P(A \,\&\, B)}{P(A)} = \frac{P(A \cap B)}{P(A)}, \quad P(A) \neq 0$
• In words, for any two events, the conditional

probability that one event occurs given that the
other event has occurred equals the joint probability
of the two events divided by the probability of the
given event.
211
Example
• In a dinner party, 100 people participated. 60 of them ate
“Kitfo” and 40 of them ate roasted meat (“Tibs”). A day
later, 40 people developed food poisoning, 36 of whom were
among eaters of “Kitfo”.
• Q1. what is the probability of occurrence of food poisoning

among people who ate “kitfo”.
• Q2. what is the probability of occurrence of food poisoning

among people who ate roasted meat.
• Two approaches can be used

– The contingency table = use frequency
– The joint probability table = use probabilities
212
Draw contingency table
Type of food eaten (A)   Food poisoning (B)   Total
                         Yes      No
“Kitfo”                  36       24          60
Roasted meat             4        36          40
P(poisoning/kitfo)
= 36/60 = 0.60 = 60%
P(poisoning/roasted meat)
= 4/40 = 0.10 = 10%
Here food poisoning was 6 times
more likely to occur for the
“Kitfo” as compared to the
roasted meat.
- The kitfo might have been
spoiled; this needs intervention. 213
Joint probability Table
• A joint probability table is similar to a contingency
table, except that it has probabilities in place of
frequencies.
• Pi = fi/n,
• e.g. 36/100 = .36
Type of food eaten   Food poisoning    Total
                     Yes      No
“Kitfo”              .36      .24      .60
Roasted meat         .04      .36      .40
Total                .40      .60      1.00
• The row totals and column totals are called
marginal probabilities.
P(poisoning/kitfo)
= .36/.60 = 0.60 = 60%
P(poisoning/roasted meat)
= .04/.40 = 0.10 = 10%
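Both approaches lead to the same conditional probabilities, which the following Python sketch makes explicit. The dictionary layout and the function name `p_ill_given` are illustrative assumptions; the counts are those of the dinner-party example.

```python
# Contingency table from the example: rows are foods, columns are (ill, not ill).
table = {"kitfo": (36, 24), "roasted meat": (4, 36)}
n = sum(ill + well for ill, well in table.values())


def p_ill_given(food):
    """P(poisoning | food) = P(food and poisoning) / P(food)."""
    ill, well = table[food]
    p_joint = ill / n              # joint probability, e.g. 36/100
    p_food = (ill + well) / n      # marginal probability of the food, e.g. 60/100
    return p_joint / p_food


for food in table:
    print(food, round(p_ill_given(food), 2))
```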
214
Example 2
• Suppose in country “X” the chance that an infant lives to
age 25 is .95, whereas the chance that he lives to age 60
is .65. For the latter, it is understood that to survive to age
60 means to survive both from birth to age 25 and from
age 25 to 60.
• Q1. What is the probability that a person of 25 years

survives to age 60?
215
Solution
Notation   Event                                        Probability
A          Survive birth to age 25                      .95
A&B        Survive birth to age 25 & age 25 to 60       .65
B/A        Survive age 25-60 given survived to age 25   ?
• P(B/A) = P(A&B)/P(A) = .65/.95 = .684
• That is, a person aged 25 has a 68.4% chance of
living to age 60.
216
Independent Events
• Two events are called independent when the
occurrence or non-occurrence of one does not in any way affect the
occurrence or non-occurrence of the other.
• Two events A and B are said to be independent if the fact that A has
occurred or not does not affect your assessment of the probability of
B occurring.
• Conversely, the fact that B has occurred or not does not affect your
assessment of the probability of A occurring.
– P(A | B) = P(A), P(B | A) = P(B)

• If A and B are independent events, then
P(A and B) = P(A ∩ B) = P(A) × P(B)
217
Example
• What is the probability that a pregnant
woman gives birth to a female child after having had a
female child before?
• Answer:
• The sex of the foetus is independent of the sex
of the previous child.
• So, P (female fetus) =1/2 =0.5 =50%

218
Counting of possible outcomes
219
Counting of possible outcomes
• According to classical definition of probability, outcomes are
equally likely to occur.
• In this case the probability is determined as,
$P(A) = \frac{\text{\# of ways A can occur}}{\text{total \# of outcomes}}$
• To know the # of ways A can occur and the total # of outcomes
we have to count
• These are called “counting methods”
• What if large number of trials?

220
Counting…
• If the number of possible outcomes in an experiment is small, it is
relatively easy to list and count all possible events.
• When there are large number of possible outcomes, an

enumeration of cases is often difficult , tedious or both
• To overcome such problems one can use various counting

techniques.
• Such as:
• Powers
• Permutations &
• Combinations
221
Counting ….
• We can have two approaches in determining
the number of possible outcomes
• If Order is considered
– With replacement = powers
– Without replacement = permutations
• If Order is not considered

– Without replacement = combinations
222
Counting …
Counting methods for computing probabilities
Permutations (order matters!)
– With replacement
– Without replacement
Combinations (order doesn’t matter)
– Without replacement
223
Counting with replacement
• With replacement: once an event occurs, it can
occur again (after you roll a 6, you can roll a 6
again on the same die).
• Example
– Assume you toss a coin 3 times. What is the
probability that all 3 tosses are heads?
224
With replacement…
• Solution:
– Determine the total number of possible outcomes.
– As the number of trials is small, we can use a probability tree
225
Replacement…
• What if 100 tosses? It is difficult to list and count all possible
outcomes. In this case we use the rule of powers.
General rule:
When order matters and with replacement
For n possible outcomes per trial and r trials,
the total possible number of outcomes is given by
n to the power of r:
$n^r = (\text{\# possible outcomes per event})^{\text{\# of events}}$
226
Example:
• What is the total possible number of outcomes for tossing a coin 3 times?
– Solution:
• Possible outcomes per trial (H or T) = 2
• Number of trials = 3
• Total possible number of outcomes (sample space)
• $S = n^r = 2^3 = 8$
• The probability of getting heads in all 3 trials is 1/8
• What is the total possible number of outcomes for rolling a die 3 times?
• Solution:
• Possible outcomes per trial (1,2,3,4,5,6) = 6
• Number of trials = 3
• Total possible number of outcomes
• $S = n^r = 6^3 = 216$
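For small cases the rule of powers can be confirmed by listing every outcome. The sketch below uses `itertools.product` from the Python standard library; the helper name `outcomes` is illustrative.

```python
from itertools import product


def outcomes(per_trial, trials):
    """All ordered outcomes when each trial can repeat any value (with replacement)."""
    return list(product(per_trial, repeat=trials))


coin = outcomes("HT", 3)
die = outcomes(range(1, 7), 3)
print(len(coin), len(die))                       # 8 216
p_all_heads = coin.count(("H", "H", "H")) / len(coin)
print(p_all_heads)                               # 0.125
```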
227
Without replacement
• Without replacement: an event cannot repeat
once it has been selected
• E.g. after you draw an ace of spades out of a
deck, there is 0 probability of getting it again
• Example:
• What is the total number of possible ways of picking 5 cards
from a deck of 52?
228
With replacement…
• If it is with replacement, we have 52 choices for
each of the five trials
• i.e. 52 x 52 x 52 x 52 x 52 = $52^5$
= 380,204,032 different possible outcomes
- What if without replacement?
• 52 x 51 x 50 x 49 x 48 = 311,875,200 different possible
outcomes
What general formula applies for this? The answer is permutation

229
Permutations
• Permutations are the possible ordered selections of r objects out of
a total of n objects without replacement.
• General rule for events without replacement:
• The number of permutations of n objects taken r at a time is denoted
by nPr, where
${}_nP_r = \frac{n!}{(n - r)!}$
• For the above example, picking 5 cards from a deck of 52:
• n = 52, r = 5, 52P5 = 52!/(52-5)! = 52!/47!
= (52 x 51 x 50 x 49 x 48 x 47!)/47! = 311,875,200 ways
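The permutation formula can be sketched in Python and compared with the standard library's `math.perm` (available from Python 3.8). The function name `n_permutations` is illustrative.

```python
import math
from itertools import permutations


def n_permutations(n, r):
    """nPr = n! / (n - r)!"""
    return math.factorial(n) // math.factorial(n - r)


print(n_permutations(52, 5))                       # 311875200
print(n_permutations(52, 5) == math.perm(52, 5))   # True
print(len(list(permutations("ABCDEF", 3))))        # 120
```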
230
When order is not considered
• Suppose that we picked 3 letters out of the 6 letters A, B, C, D,
E, and F without replacement.
• Total ways = 6!/(6-3)! = 120
• From this, for example, the letters B, C & D
• can be ordered in 3! = 6 ways
• i.e. BCD, or BDC, or CBD, or CDB, or DBC, or DCB.
• But these are orderings of the same combination of 3 letters.

• If we avoid order, how many combinations of 6 different
letters, taking 3 at a time, are there?
• To do this we use the rules of combination
231
232
Example above
• If we avoid order, how many combinations of
6 different letters, taking 3 at a time, are
possible?
$\binom{n}{r} = {}_nC_r = \frac{n!}{r!\,(n - r)!}, \quad n = 6 \;\&\; r = 3$
$\binom{6}{3} = {}_6C_3 = \frac{6!}{3!\,(6 - 3)!} = \frac{6 \times 5 \times 4 \times 3!}{3!\,3!} = \frac{6 \times 5 \times 4}{3 \times 2 \times 1} = 20$
While considering order we had 6P3 = 120 ways, but
without order we have 6C3 = 20 ways 233
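The link between permutations and combinations can be checked with a short Python sketch using `itertools.combinations` and `math.comb` (Python 3.8 or later). The variable names are illustrative.

```python
import math
from itertools import combinations

letters = "ABCDEF"
combos = list(combinations(letters, 3))      # unordered selections, no repeats
print(len(combos), math.comb(6, 3))          # 20 20
print(math.perm(6, 3) // math.factorial(3))  # 120 orderings / 3! per combination = 20
```

Each combination of 3 letters corresponds to 3! = 6 orderings, which is why dividing the 120 permutations by 6 gives 20.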

Summary of counting techniques
234
Exercise 1
• Suppose the department head wants to form a committee of
6 students from among 200 medical students by listing their ID numbers.
Q1. In how many possible ways can he do this, considering
order, with replacement?
Q2. In how many possible ways can he do this, considering
order, without replacement?
Q3. In how many possible ways can he do this without
considering order?
Q4. Which one do you think is the best way for him to form the
committee? Why?
235
Exercise 2
• Suppose there are 100 2nd year medical students. 60 of
them are males and 40 females. 10 students are planned
to be selected for scholarship abroad to continue their
education. In how many ways can this be done if:
a. There is no restriction?
b. Two particular females should be included?
c. Five particular females can be included?
236
Random variable and
Probability distribution
237
Random variable
• A random variable is a numerical description of the outcomes
of an experiment, or a numerically valued function defined on the
sample space.
• Usually denoted by capital letters.
• It takes a possible outcome and assigns a number to it.
• Example. Toss a coin three times and let X be the number of heads
in three tosses
• S = {(HHH),(HHT),(HTH),(HTT),(THH),(THT),(TTH),(TTT)}
– X(HHH) = 3
– X(HHT) = X(HTH) = X(THH) = 2
– X(HTT) = X(THT) = X(TTH) = 1
– X(TTT) = 0
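The mapping from outcomes to values of X can be written out in Python, together with how often each value occurs among the 8 equally likely outcomes. This is a sketch; the function name `X` mirrors the slide's notation and the other names are illustrative.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

sample_space = ["".join(t) for t in product("HT", repeat=3)]   # HHH, HHT, ..., TTT


def X(outcome):
    """Random variable: number of heads in the outcome."""
    return outcome.count("H")


counts = Counter(X(o) for o in sample_space)
distribution = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}
for x, p in distribution.items():
    print(x, p)      # 0 1/8, then 1 3/8, 2 3/8, 3 1/8
```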
238
Random variable
• Random variables are of two types.
– Discrete random variable &
– Continuous random variables
239
Discrete random variables
• Are variables which can assume only a specific number of
values
• They have values that can be counted
• Example:
– Toss a coin n times and count the number of heads
– Number of children in a family
– Number of car accidents per week
– Number of malaria cases per month
– Etc….
240
Continuous random variables
• Are variables that can assume all values between any
two given values.
• A continuous random variable X can take on an

uncountably infinite number of values
• Example:
– Height of students at a certain college
– Mark of students
– Weight of individuals in a certain community
– Etc…
241
Probability distribution
• The term probability distribution refers to the way the values of a random variable are
distributed, which lets us draw conclusions about a set of data.
• A probability distribution consists of the values a random variable
can assume and the corresponding probabilities of those values
• Every random variable has a corresponding probability

distribution.
• A probability distribution applies the theory of probability to

describe the behavior of the random variable.
242
Probability distribution…
• A probability distribution of a random variable can be
displayed by a table or a graph or a mathematical
formula.
• With categorical variables, we obtain the frequency
distribution of each variable.
• With numeric variables, the aim is to determine whether
or not normality may be assumed.
• If not we may wish to consider transforming the variable,
or may wish to categorize the variable for analysis (e.g.
age groups).
243
Models of probability distribution
• For discrete random variables
– Binomial distribution
– Poisson distribution
• For continuous random variables

– Standard normal distribution
244
Binomial Distribution
245
Binomial distribution
• A binomial experiment is a probability experiment that
satisfies the following four assumptions
1. The experiment has a fixed number, n, of identical trials
2. Each trial has only one of two possible mutually
exclusive outcomes (success or failure)
3. The probability of each outcome does not change from
trial to trial &
4. The trials are independent, thus we must sample with
replacement
246
Binomial dist….
• Suppose that n independent experiments, or trials, are
performed, where n is a fixed number, and that each
experiment results in a “success” with probability p and a
“failure” with probability 1-p.
• Then the total number of successes, X, is a binomial

random variable with parameters n and p.
• We write: X ~ Bin (n, p) {reads: “X is distributed
binomially with parameters n and p”}
247
Binomial dist…
• The probability that X = r (i.e., that there are exactly r
successes) is:
$$P(X = r) = \binom{n}{r} p^{r} (1-p)^{n-r}$$
Where: n = number of trials
r = number of successes
p = probability of success
1-p = probability of failure
248
Binomial dist…
Bernoulli trial:
• If there is only 1 trial with probability of
success p and probability of failure 1-p, this is
called a Bernoulli distribution.
• Special case of the binomial with n = 1
Probability of success: $$P(X = 1) = \binom{1}{1} p^{1} (1-p)^{1-1} = p$$
Probability of failure: $$P(X = 0) = \binom{1}{0} p^{0} (1-p)^{1-0} = 1 - p$$
249
Example
• Assume a woman plans to have 6 children and the
probability of having a male child at each birth is 50%.
a) What is the probability that exactly 3 of them are male
children?
b) What is the probability that at least 3 of them are male
children?
c) What is the probability that at most 2 of them are male
children?
250
Solution
• Number of trials = n = 6
• Probability of success (male child) in a single
trial = p = 0.5
a) For exactly 3 males, r = 3
$$P(X = 3) = \binom{6}{3} 0.5^{3} (1-0.5)^{6-3} = \frac{6!}{3!\,(6-3)!} \times 0.5^{3} \times 0.5^{3}$$
$$= 20 \times 0.125 \times 0.125 = 0.3125$$
The probability of getting exactly 3 male children in 6 pregnancies is 0.3125
251
b) Probability that at least 3 of them
are male children
• When we say at least 3 males, it could be 3, 4, 5 or 6
• i.e. P(X ≥ 3) = P(X=3) + P(X=4) + P(X=5) + P(X=6)
$$P(X = 3) = \binom{6}{3} 0.5^{3} (1-0.5)^{3} = 0.313$$
$$P(X = 4) = \binom{6}{4} 0.5^{4} (1-0.5)^{2} = 0.234$$
$$P(X = 5) = \binom{6}{5} 0.5^{5} (1-0.5)^{1} = 0.094$$
$$P(X = 6) = \binom{6}{6} 0.5^{6} (1-0.5)^{0} = 0.016$$
$$P(X \ge 3) = 0.313 + 0.234 + 0.094 + 0.016 = 0.657$$
The probability of getting at least 3 male children in 6
pregnancies is 0.657 = 65.7% (0.656 if the terms are not rounded before adding)
252
c) Probability that at most 2 of them
are male children
• When we say at most 2 males, it could be 0, 1 or 2
• i.e. P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2)
$$P(X = 0) = \binom{6}{0} 0.5^{0} (1-0.5)^{6} = 0.016$$
$$P(X = 1) = \binom{6}{1} 0.5^{1} (1-0.5)^{5} = 0.094$$
$$P(X = 2) = \binom{6}{2} 0.5^{2} (1-0.5)^{4} = 0.234$$
$$P(X \le 2) = 0.016 + 0.094 + 0.234 = 0.344$$
The probability of getting at most 2 male children in 6
pregnancies is 0.344 = 34.4%
253
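The three answers to this example can be checked numerically. The following is a minimal Python sketch that assumes only the binomial formula given earlier; the helper name `binomial_pmf` is illustrative and not part of the lecture.

```python
from math import comb

def binomial_pmf(r: int, n: int, p: float) -> float:
    """P(X = r) for X ~ Bin(n, p): C(n, r) * p^r * (1 - p)^(n - r)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 6, 0.5  # six births, probability of a male child 0.5

exactly_3 = binomial_pmf(3, n, p)
at_least_3 = sum(binomial_pmf(r, n, p) for r in range(3, n + 1))  # r = 3, 4, 5, 6
at_most_2 = sum(binomial_pmf(r, n, p) for r in range(0, 3))       # r = 0, 1, 2

print(round(exactly_3, 4))   # 0.3125
print(round(at_least_3, 3))  # 0.656
print(round(at_most_2, 3))   # 0.344
```

Summing unrounded terms gives 0.656 for part (b); adding the individually rounded terms, as on the slide, gives 0.657. Parts (b) and (c) are complementary events, so their probabilities add up to 1.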
Expected value and variance of the binomial
distribution
• All probability distributions are characterized by an expected value and a
variance:
• If X follows a binomial distribution with
parameters n and p: X ~ Bin (n, p)
Then:
$$\mu_x = E(X) = np$$
$$\sigma_x^{2} = Var(X) = np(1-p)$$
$$\sigma_x = SD(X) = \sqrt{np(1-p)}$$
Note: the variance will always lie between 0 and 0.25n, because
p(1-p) reaches its maximum at p = 0.5, where p(1-p) = 0.25
E(X) = expected number to have the condition
254

Things that follow a binomial distribution
• Cohort study (or cross-sectional):
– The number of exposed individuals in your sample
that develop the disease
– The number of unexposed individuals in your
sample that develop the disease
• Case-control study:
– The number of cases that have had the exposure
– The number of controls that have had the
exposure
255
Example
Suppose you are performing a cohort study. The probability of
developing disease in the exposed group is 0.05 for the study
duration, and you randomly sample 500 exposed people.
Q1. How many do you expect to develop the disease? Give a
margin of error (+/- 1 standard deviation) for your estimate.
Q2. What’s the probability that at most 10 exposed people
develop the disease?
256
Solution for Q 1
Given:
• n = 500, p = 0.05, margin = +/- 1 SD
• µx = E(X) = ?
• Expected cases within +/- 1 SD = ?
i.e. X ~ binomial (500, 0.05)
– µx = E(X) = np
– E(X) = 500 (0.05) = 25
Var(X) = np(1-p) = 500 (0.05) (0.95) = 23.75
StdDev(X) = square root (23.75) = 4.87
25 ± 4.87: between 20.13 and 29.87 people are expected to develop the disease
257
Solution 2
Given:
• n = 500, p = 0.05
• P(X ≤ 10) = ?
• P(X ≤ 10) = P(X=0) + P(X=1) + P(X=2) + … + P(X=10)
$$= \binom{500}{0}(0.05)^{0}(0.95)^{500} + \binom{500}{1}(0.05)^{1}(0.95)^{499} + \binom{500}{2}(0.05)^{2}(0.95)^{498} + \dots + \binom{500}{10}(0.05)^{10}(0.95)^{490} < 0.01$$
The probability that at most 10 of them develop the disease is < 0.01
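Summing eleven binomial terms by hand is tedious, so a short script is a convenient check for both questions. This is a minimal Python sketch under the example's assumptions (n = 500, p = 0.05); the variable names are illustrative.

```python
from math import comb, sqrt

n, p = 500, 0.05  # exposed people sampled, probability of disease

# Q1: expected number of cases and its standard deviation
mean = n * p
variance = n * p * (1 - p)
sd = sqrt(variance)

# Q2: exact P(X <= 10), summing the binomial probabilities for 0..10 cases
p_at_most_10 = sum(comb(n, r) * p**r * (1 - p) ** (n - r) for r in range(11))

print(mean, round(variance, 2), round(sd, 2))  # 25.0 23.75 4.87
print(p_at_most_10 < 0.01)                     # True
```

The exact sum confirms that the probability is well below 0.01, which is expected because 10 cases lies about three standard deviations below the mean of 25.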
258
Exercise
Suppose you are conducting a case-control study. Assume
the probability of being a smoker among a group of cases
with lung cancer is 0.6, and you sample 10 cases for your
study.
1. What is the expected number of smokers?

2. What is the variance & SD for the number of smokers?
3. Give +/-2SD margin for the expected number of smokers.
4. What is the probability that more than 5 of the cases are
smokers?
259
Poisson Distribution
260
Poisson distribution
• The Poisson distribution is used to model discrete events
that occur infrequently in time and space
– i.e. rare events that occur at a constant rate.
– Examples: death rates, accident rates, incidence
rate of rare diseases.
• Our random variable will be the “number of occurrences
of the event over the region of opportunity for
occurrence in a given time”.
• Poisson distribution is for counts
261
Poisson…
• If events happen at a constant rate over time, the
Poisson distribution gives the probability of X number of
events occurring in time T.
• For a Poisson random variable, the variance and mean
are the same and represented by λ
$$\text{Mean} = \mu = \lambda$$
$$\text{Variance} = \sigma^{2} = \lambda$$
$$\text{Standard deviation} = \sigma = \sqrt{\lambda}$$
where λ = expected number of events of interest in a
given time period
262
Poisson…
• If X is a random variable representing a Poisson
distribution, then the probability of k occurrences is
given by
$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
– Where:
• k = number of occurrences
• λ = the mean number of occurrences in the given interval
• e ≈ 2.718 (the base of the natural logarithm)
– The Poisson distribution can be approximated by a normal distribution when λ is large.
263
Example
• Suppose X is a random variable representing the number of
individuals involved in a road accident each year in Ethiopia.
Assume the mean number of individuals involved in road accidents in
Ethiopia is 2.4 per 1,000 population per year.
Q1. What is the probability that exactly 5 accidents occur in this
population in the coming year?
Q2. What is the probability that at most 3 accidents occur in this
population in the coming year?
264
Solution
Q1. n = 1,000, λ = 2.4 per 1,000, e = 2.71, k = 5
• P(X=5) = ?
$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
$$P(X = 5) = \frac{(2.4)^{5} (2.71)^{-2.4}}{5!} = \frac{(79.63)(0.09)}{120} = 0.06 = 6\%$$
265
Solution to Q2
Q2. At most 3 accidents = P(X ≤ 3) = ?
$$P(X \le 3) = P(X=0) + P(X=1) + P(X=2) + P(X=3)$$
$$P(X \le 3) = \frac{2.4^{0}\, 2.71^{-2.4}}{0!} + \frac{2.4^{1}\, 2.71^{-2.4}}{1!} + \frac{2.4^{2}\, 2.71^{-2.4}}{2!} + \frac{2.4^{3}\, 2.71^{-2.4}}{3!}$$
$$= 0.09 + 0.22 + 0.26 + 0.21 = 0.78$$
The probability that three or fewer accidents occur per 1,000 population is
0.78 = 78%
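Both Poisson answers can be reproduced with a few lines of code. The sketch below is a minimal Python illustration that assumes the Poisson formula given earlier with λ = 2.4; the helper name `poisson_pmf` is hypothetical.

```python
from math import exp, factorial

def poisson_pmf(k: int, lam: float) -> float:
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam**k * exp(-lam) / factorial(k)

lam = 2.4  # mean number of accidents per 1,000 population per year

p_exactly_5 = poisson_pmf(5, lam)
p_at_most_3 = sum(poisson_pmf(k, lam) for k in range(4))  # k = 0, 1, 2, 3

print(round(p_exactly_5, 2))  # 0.06
print(round(p_at_most_3, 2))  # 0.78
```

Using `exp` instead of the rounded constant 2.71 changes the results only in the third decimal place.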
266
“Poisson Process”
• Note that the Poisson parameter λ can be given as the
mean number of events that occur in a defined time
period OR,
• equivalently, λ can be given as a rate per unit of time,
so that we can multiply it by the required time t
• This is called a “Poisson Process” and given as,
$$P(X = k) = \frac{(\lambda t)^{k} e^{-\lambda t}}{k!}$$
E(X) = λt
Var(X) = λt
267
Example
• Suppose new cases of measles are occurring at a
rate of about 2 per month per 100,000 under-five
population in Ethiopia.
1) What’s the probability that exactly 4 cases of
measles will occur in the next 3 months in the
same population?
2) What’s the expected number of measles cases in
1,000,000 under-five population in one year?
3) Give +/-2SD margin for the expected number of
cases.
268
Solution to Q1
1. Given λ = 2 per 100,000 per month & t = 3 months
P(X=4) = ?
$$P(X = 4 \text{ in 3 months}) = \frac{(2 \times 3)^{4}\, 2.71^{-(2 \times 3)}}{4!} = \frac{(6)^{4}\, 2.71^{-6}}{24}$$
$$= \frac{(1296)(0.0025)}{24} = 0.135 = 13.5\%$$
So, the probability that 4 new cases of measles occur in
3 months in 100,000 population is 0.135 = 13.5%
269
Solution to Q2 & Q3
Q2. Given λ = 2 per month per 100,000
= (2/100,000) × 1,000,000
= 20 per month per 1,000,000
t = 1 year = 12 months
– E(X) = λt
– E(X in 12 months) = 20 × 12 = 240 cases
Q3. +/-2SD margin for 240 = ?
– Var(X) = λt = 240,
– SD(X) = square root of 240 = 15.49
– 240 +/- 30.98 = (209.02, 270.98)
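The measles example can be reproduced with the Poisson process formula, where the mean is the rate multiplied by the time. This is a minimal Python sketch under the example's assumptions; the function name `poisson_process_pmf` is illustrative.

```python
from math import exp, factorial, sqrt

def poisson_process_pmf(k: int, rate: float, t: float) -> float:
    """P(X = k) when events occur at `rate` per unit time over `t` units."""
    mean = rate * t
    return mean**k * exp(-mean) / factorial(k)

# Q1: 2 cases per month per 100,000; exactly 4 cases in 3 months
p_four_cases = poisson_process_pmf(4, rate=2, t=3)

# Q2: rescale the rate to a population of 1,000,000, then take 12 months
rate_per_million = 2 * 1_000_000 / 100_000  # 20 cases per month
expected = rate_per_million * 12            # E(X) = lambda * t

# Q3: for a Poisson variable the variance equals the mean
sd = sqrt(expected)
low, high = expected - 2 * sd, expected + 2 * sd

print(round(p_four_cases, 3))         # 0.134
print(round(expected), round(sd, 2))  # 240 15.49
print(round(low, 2), round(high, 2))  # 209.02 270.98
```

The first result is 0.134 rather than 0.135 because the script does not round e to 2.71 or the intermediate term to 0.0025.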
270
Normal Distribution
271
Normal distribution
• Normal distributions are symmetric, single-peaked, bell-shaped
curves described by their mean (µ) and standard deviation (σ).
• Used for continuous random variables.
• The “normal” or “Gaussian” distribution is the most

commonly used of all probability models.
• It is also foundational to the development of numerous

commonly used statistical methods
272
Normal dist…
• Under different circumstances, the outcome of a random
variable may not be limited to categories or counts.
– E.g. Suppose X represents the continuous variable ‘Height’;
rarely is an individual exactly 170cm tall
– X can assume an infinite number of intermediate values 170.1,
170.2, 170.3 etc.
• Because a continuous random variable X can take on an
uncountably infinite number of values, the probability
associated with any particular single value is zero
• However the probability that X will assume some value in the
interval enclosed by two values, say x1 and x2, can be
determined
273
Normal dist…
• As a continuous variable can take an infinite number of values,
it helps to visualize the probability distribution as a curve and
probabilities as ‘area under the curve’.
• The normal distribution is completely described by two

parameters (μ & σ )
• The mean μ can be any number (negative, positive or zero): -∞ < µ < +∞
• The standard deviation σ must be a positive number.
• The outcome of the measurement (x) can range over -∞ < x < +∞
• To model this we use the normal probability density function
274
275
276
277
Normal distr…
Bell-shaped and symmetric distributions.

Because the distribution is symmetric, one-half
(.50 or 50%) lies on either side of the mean.
Example
Find the following probability for the standard normal
distribution: P(0 ≤ Z ≤ 1.56)
Procedure:
• Look in the row labeled 1.5 and the column labeled .06 to find P(0 ≤
Z ≤ 1.56) = 0.4406
278
Standard Normal Probabilities
AREA UNDER THE

STANDARD NORMAL CURVE
279
Example
• Let X be systolic blood pressure (for US population
aged 18-74 males) with μ = 129 mmHg and σ =
19.8 mmHg.
Q1. What level encompasses the middle 95%?
Q2. What proportion of men in the population

have SBP greater than 150mmHg?
Q3 .What level cuts the lower 10% of SBP?

280
Solution Q1
• Given μ = 129 mmHg and σ = 19.8 mmHg
• Level encompassing the middle 95% = ?
• Read from the Z table, i.e. the standard normal distribution (SND)
• As the table is one-sided, look up an area of 0.95/2 = 0.4750,
which corresponds to Z = 1.96
X = μ ± Z σ
= 129 ± 1.96(19.8) = 129 ± 38.8 = (129 - 38.8, 129 + 38.8)
• Level for 95%: (90.2, 167.8)
• Interpretation:
• The systolic blood pressure of 95% of US males aged 18-74
lies between 90.2 and 167.8 mmHg.
281
Solution to Q2
• Given: μ = 129 mmHg and σ = 19.8 mmHg
– Proportion with SBP > 150mmHg = ?
• To get the proportion, find Z corresponding to 150
• Z = (X – μ)/σ = (150-129)/19.8 = 1.06
• P(Z > 1.06) = ?
• Go to the Z table and find P(0 ≤ Z ≤ 1.06) = 0.3554
• P(Z > 1.06) = 0.50 - P(0 ≤ Z ≤ 1.06)
• P(Z > 1.06) = 0.50 - 0.3554 = 0.1446
• The proportion of males aged 18-74 having SBP >
150mmHg is 0.1446 = 14.46%
282
Solution to Q3
• Lower 10% of SBP, 10% = 0.10
• Find Z from the table corresponding to 0.1
• To read from the table, 0.5 - 0.1 = 0.4
• Find the Z corresponding approximately to 0.4 from the table.
• 0.3997 corresponds to P(0 ≤ Z ≤ 1.28)
• 0.1 corresponds to P(Z > 1.28)
• As the lowest 0.1 is required, Z will be negative
• i.e. the lowest 0.1: P(Z < -1.28)
• To get the cut-off point (X) corresponding to P(Z < -1.28):
• X = μ + Z σ = 129 + (-1.28)(19.8) = 103.7
• 10% of them have SBP < 103.7 mmHg
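The table look-ups in the three SBP solutions can be cross-checked in software. The sketch below uses `NormalDist` from Python's standard `statistics` module (available from Python 3.8); the variable names are illustrative. Results may differ slightly from the table-based values because the Z table rounds Z to two decimals.

```python
from statistics import NormalDist

# SBP model from the example: mean 129 mmHg, standard deviation 19.8 mmHg
sbp = NormalDist(mu=129, sigma=19.8)

# Q1: limits enclosing the middle 95% (2.5% in each tail)
low, high = sbp.inv_cdf(0.025), sbp.inv_cdf(0.975)

# Q2: proportion with SBP above 150 mmHg
p_above_150 = 1 - sbp.cdf(150)

# Q3: value that cuts off the lowest 10%
cutoff_10 = sbp.inv_cdf(0.10)

print(round(low, 1), round(high, 1))  # 90.2 167.8
print(round(p_above_150, 3))          # 0.144
print(round(cutoff_10, 1))            # 103.6
```

`cdf` plays the role of reading an area from the table, and `inv_cdf` plays the role of finding the value that corresponds to a given area.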
283
Exercise: try the following exercises and
compare your findings with the answers given
1. Find Probabilities of the Standard Normal
Distribution: P(Z < -2.47)
answer = 0.0068
2. Find Probabilities of the Standard Normal
Distribution: P(1≤ Z ≤ 2)
answer = 0.1359
3. Find the value z of the Standard Normal Random
Variable such that P(0 < Z < z) = 0.40
answer: z = 1.28,
i.e. X = µ + 1.28σ
284
Sampling Methods
Lecture By Gurmesa Tura (MPH)

April 2011
AAU
285
Learning objectives…
At the end of this lecture the students will be
able to:
– Define common terms used in sampling
– Distinguish the difference between probability and
non probability sampling
– Identify the different methods of probability and

non probability sampling techniques
– Explain the advantages and disadvantages of each

technique
286
Sampling
• Sampling is a process of choosing a section of the
population for observation and study.
• It means taking a representative subgroup of the reference
population
• Sample should reflect all the qualities found in the

population
287
Common terms used in sampling
• Reference population (target population)
– The population of interest, to which the
investigator would like to generalize the results of
the study
• Source population
– From which the representative sample is to be
drawn
288
Common terms…
• Study or sample population
– The population included in the sample
• Sampling unit
– The unit of selection in the sampling process
• Study unit
– The unit on which information is collected
289
Common terms…
• Sampling frame
– The list of all the units in the reference population,
from which a sample is to be picked
• Sampling fraction/sampling interval
– Sampling fraction: the ratio of the number of units in the sample to
the number of units in the reference population (n/N)
– Sampling interval: the inverse of the sampling fraction (N/n)
290
Hierarchy of Sampling
[Figure: hierarchy of sampling; surviving labels: AA, WRA]
291
Why sampling?
• Feasibility: Sampling may be the only feasible
method of collecting the information.
• Reduced cost: Sampling reduces demands on
resource such as finance, personnel, and material.
• Greater accuracy: Sampling may lead to better
accuracy of collecting data
• Sampling error: Precise allowance can be made for

sampling error
• Greater speed: Data can be collected and
summarized more quickly
292
Limitations of sampling…
• There is always a sampling error
• Sampling may create a feeling of discrimination

with in the population
• Sampling may be inadvisable where every unit in

the population is legally required to have a record
293
Types of sampling
A. Probability sampling
– Subjects of the sample are chosen based on known, non-zero
probabilities.
– Guarantees that every element in the population of
interest has a known, non-zero probability of being chosen for the
sample (in simple random sampling this probability is the same for
all elements); “random” selection.
B. Non-probability sampling
– we do not know the probability that each population
element will be chosen, and/or
– we cannot be sure that each population element has a
non-zero chance of being chosen.
294
Main differences
Probability sampling
• Every item has a chance of being selected.
• Randomization is a feature of the selection process.
• Elements are chosen randomly with a (non-zero) probability
• Produces representative data
Non-Probability sampling
• Not every item has a chance of being selected
• An assumption is made that there is an even distribution of
characteristics within the population
• Elements are chosen arbitrarily
• Produces non-representative data
295
Types of Sampling Methods
Sampling
• Probability Sampling
– Simple Random
– Systematic
– Stratified
– Cluster
– Multistage
• Non-Probability Sampling
– Convenience
– Volunteer
– Purposive
– Quota
– Snowball
296
I. Probability Sampling
• A probability sampling method is any method of
sampling that utilizes some form of random selection.
• Is more complex, more time-consuming and usually

more costly than non-probability sampling
• Inferences can be made about the population
297
Probability Sampling…
• The population of interest is clear (because it
must be identified before sampling from it.)
• Possible sources of bias are removed, such as

self-selection and interviewer selection
effects.
• The general size of the sampling error can be

estimated
298
Probability Sampling…
• Includes
1. Simple Random Sampling (SRS)
2. Systematic Sampling
3. Stratified Random Sampling
4. Cluster Sampling
5. Multistage Sampling
299
1. Simple random sampling
• Each sampling unit in the population has an equal chance of
being included in the sample.
• Steps
1. Define the population
2. Determine the desired sample size
3. List all members of the population or the potential
subjects (sampling frame)-we can use codes
4. Select the desired samples by simple random methods
 we can apply methods like
 Lottery method (sample drawn from box)
 Table of random numbers (show the table)
 Computer generated random numbers
300
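The last option in the steps for simple random sampling, computer generated random numbers, can be sketched in a few lines. This is a minimal Python illustration, assuming the sampling frame is a list of coded IDs; the function name `simple_random_sample` and the frame of 100 units are hypothetical.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the frame, each with equal chance."""
    rng = random.Random(seed)    # a seed makes the draw reproducible
    return rng.sample(frame, n)  # sampling without replacement

frame = list(range(1, 101))      # coded IDs 1..100 (hypothetical frame)
sample = simple_random_sample(frame, 20, seed=1)
print(sorted(sample))
```

Recording the seed along with the frame lets another person reproduce and audit the selection.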
Advantages of SRS
• Each unit in the sampling frame has an equal
chance of being selected
• The formulas are easy to use.
• Easy to apply to small populations.
301
Disadvantages of SRS
• Can be expensive and unfeasible for large
populations –need complete list.
• Minority subgroups may not be present in the

sample in sufficient numbers for the study
302
2. Systematic random sampling
• Individuals are chosen at regular intervals from the sampling
frame
Steps :
1. Number the units on your frame from 1 to N
2. Determine the sampling interval (K) by dividing N by n. Example:
N=100, n=20, then K=N/n=100/20=5
3. Select a number between 1 and K at random. This number is
called the random start.
4. Using the example above, you would select a number between 1
and 5.
5. Select every Kth (in this case, every fifth) unit after the first
number.
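The steps for systematic sampling translate directly into code. The sketch below is a minimal Python illustration, assuming N is a multiple of n as in the example (N = 100, n = 20); the function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(N, n, seed=None):
    """Select n unit numbers from 1..N at a fixed interval K = N // n."""
    k = N // n                     # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)      # random start between 1 and K inclusive
    return [start + i * k for i in range(n)]  # every Kth unit after the start

units = systematic_sample(N=100, n=20, seed=3)
print(units)
```

With N = 100 and n = 20 the interval is 5, so the random start is one of 1 to 5 and the last selected unit is never beyond 100.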
303
Systematic random sampling…
304
Advantages of Systematic sampling
– Require no sampling frame
– Easier to perform
– Require less time than SRS
– Very good when the population from which
sample is to be drawn is homogeneously
distributed.
Disadvantage:
– Patterns/periodicity in which case it may be non representative
305

3. Stratified Sampling
The population is first divided into groups of elements having similar
characteristics called strata.
 Each element in the population belongs to one and only one stratum.
 It is appropriate when the distribution of the characteristic to be

studied is heterogeneous
 Best results are obtained when the elements within each stratum are
homogeneous group
 Maximum homogeneity within the group and max. heterogeneity

among the groups contribute for the accuracy of the estimates.
 A simple random sample is taken from each stratum
306
Stratified Sampling…
 A separate sample is then taken from each stratum by random
sampling
• The sampling method can vary from one stratum to another.
• Proportionate allocation
– The same sampling fraction is used for each stratum
• Non-proportionate allocation
– Different sampling fraction is used or
– Though the strata are unequal in size, a fixed number of
units is selected from each stratum
307
Advantages Stratified Sampling
• If strata are homogeneous, this method is as
“precise” as simple random sampling but with
a smaller total sample size
• Good representation of the minorities in non-proportional

allocation
• This will increase the adequacy of the sample of each stratum

to equate the statistical power of tests of differences between
strata.
308
Disadvantages Stratified Sampling
• Can be difficult to select relevant stratification variables
• Not useful when there are no homogeneous subgroups
• Can be expensive
• Requires accurate information about the population, if

not it introduces bias.
309
Example
• Suppose that a company (e.g. AAU) has 1800 (N) staff, from
which 400 (n) are to be selected proportionally:
– Male academic staff = 900
– Male administrative staff = 180
– Female academic staff = 90
– Female administrative staff = 630
• To take a sample of 400 staff, stratified according to the
above categories, use the formula for proportional
allocation:
$$n_i = \frac{N_i}{N} \times n$$
310
Example…
By using the formula
– Male academic staff = (900 / 1800) x 400 = 200
– Male administrative staff = (180 / 1800) x 400 = 40
– Female academic staff = (90 / 1800) x 400 = 20
– Female administrative staff = (630 / 1800) x 400 = 140
• Final = 200 + 40 + 20 + 140 = 400
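The proportional allocation in this example can be computed for any set of strata. This is a minimal Python sketch assuming the stratum sizes from the example; the function name `proportional_allocation` is illustrative.

```python
def proportional_allocation(strata_sizes, n):
    """Allocate a total sample of n to strata in proportion to their size."""
    N = sum(strata_sizes.values())
    # n_i = (N_i / N) * n, rounded to whole people; in general the rounded
    # values may need a small adjustment so that they add up to n
    return {name: round(N_i * n / N) for name, N_i in strata_sizes.items()}

staff = {
    "male academic": 900,
    "male administrative": 180,
    "female academic": 90,
    "female administrative": 630,
}
allocation = proportional_allocation(staff, 400)
print(allocation)
print(sum(allocation.values()))  # 400
```

Every stratum here is sampled with the same sampling fraction, 400/1800.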
311
4.Cluster sampling
• Is a sampling technique used when "natural" groupings are
evident in a statistical population.
• If not, the population is first divided into separate groups of

elements called clusters
• Reference population (homogeneous) is divided into clusters –

often geographical units
• A simple random sample of the clusters is then taken
• All the units in the selected cluster are studied
312
Cluster sampling…
Cluster samples are generally used if:
• No list of the population exists.
• Well-defined clusters, which will often be geographic areas

exist.
• A reasonable estimate of the number of elements in each

level of clustering can be made.
• Often the total sample size must be fairly large to enable

cluster sampling to be used effectively.
313
Cluster sampling…
Advantages:
• Sampling frame of the reference population is not required
(Sufficient to have a list of clusters)
• Cost effective
Disadvantage:
• Based on the assumption that the study units are uniformly
distributed throughout the reference population, which may
not always be the case.
• We do not have total control over the final sample size
314
5. Multistage sampling
• Used when the reference population is large and widely
scattered.
• Selection is done in stages until the final sampling units are
arrived at.
– Primary sampling units –from the first sampling stage
– Secondary sampling units- from the second sampling
stage etc..
• Finally study subjects will be selected by SRS
• No need of sampling frame for the reference population.
315
Multistage …
Advantage
• Cuts the cost of preparing the sample frame
Disadvantage
• Sampling error is high compared with simple random
sampling (so we need to use design effect)
• Less precise estimation than SRS for the same sample size, but the
reduction in cost outweighs this and allows for a larger sample
size
316
Example Multistage …
• Suppose a researcher wants to study the risk of
HIV/AIDS among AAU students and wants to
include 1500 students. How can he go about it?
• Multi stage
– Primary sampling unit: Campus/college
– Secondary sampling unit: Departments
– Tertiary sampling unit: students
317
Multistage …
318
II. Non-Probability Sampling
Advantage
• Used when a sampling frame does not exist
• They are quick, inexpensive and Convenient
• Useful when descriptive comments about the sample itself

are desired
• Good for pretests, pilot studies, In-depth interviews
• Used when Precise representativeness is not necessary.
319
Non-Probability Sampling…
Disadvantages
• No random selection (non-representative)
• Reliability cannot be measured
• No way to measure the precision of the resulting
sample.
• Inappropriate for generalizing findings obtained from
a sample to the population.
320
Types Non-Probability Sampling
1. Convenience/ opportunity/haphazard/accidental sampling.
2. Volunteer sampling
3. Purposive/ judgemental sampling
4. Quota sampling
5. Snowball sampling
321
1.Convenience/opportunity/accidental
sampling
• Selection of a sample based on easy accessibility and
convenience
• Is not representative of the target population
• it may deliver accurate results when the population is

homogeneous
322
2.Volunteer sampling
• As the term implies, this type of sampling occurs when people
volunteer their services for the study
• The sample is taken from a group of volunteers
• Sometimes, the researcher offers payment to entice
respondents
• Commonly used in psychological experiments or
pharmaceutical trials (drug testing),
• Rationale: it would be difficult and unethical to enlist
random participants from the general public, so volunteers are used.
323
3.Purposive/Judgemental sampling
• The selection of a sample based on judgment and knowledge
of the subject
• It is subject to the researcher's biases - more biased than

haphazard sampling
• Can be used in pre-testing of questionnaires
• Focus groups or in-depth interviews

• Example
– In laboratory settings choice of experimental subjects (i.e., animal,
vegetable etc..)
• Reflects the investigator's pre-existing beliefs about the

population.
324
4.Quota sampling
• Is one of the most common forms of non-probability sampling
• The population is first segmented into mutually exclusive

sub-groups
• A quota is given to select the subjects or units from each

segment based on a specified proportion.
• In quota sampling the selection of the sample is non-

random.
325
Quota ….
• Advantages
– Quota sampling is generally less expensive than random sampling.
– Easy to administer
– It is an effective sampling method when information is urgently
required and can be carried out independent of existing sampling
frames.
• Disadvantages
– It does not meet the basic requirement of randomness.
– Some units may have no chance of selection or the chance of
selection may be unknown. Therefore, the sample may be
biased.
326
5.Snowball sampling
• Snowball sampling is a special non-probability method used
when the desired sample characteristic is rare.
• lower cost
• But, biased
Involves two main steps.
1. Identify a few key individuals
2. Ask these individuals to volunteer to distribute the questionnaire
to people they know who fit the characteristics of the desired
sample
327
Errors In sampling
Sampling error (random error)
Non sampling error (bias)
328
Sampling error
• A sample is a subset of a population.
• Because of this property of samples, results obtained
from them cannot reflect the full range of variation
found in the population; the resulting error arises from the sampling
process itself.
• Can be reduced by increasing the size of the sample
• When n=N sampling error is 0
329
Non sampling error
 It is a type of systematic error in the design or
conduct of a sampling procedure which results
in distortion of the sample
• How to reduce/avoid
– careful design of the sampling procedure and not
by increasing of the sample size and
– Testing the data collection tool
330
Thank You!
331
Sampling distribution and
sample size Determination
Lecture by Gurmesa Tura

(MPH)
AAU, SPH
332
May 2011
Learning objectives
At the end of this class the students will be able to:
Explain the concepts of sampling distribution
Be familiar with different approaches to
determining sample size and be able to calculate
sample size for different study objectives.
333
334
Sampling distribution….
• The sampling distribution of a statistic is the
probability distribution of all possible values the
statistic may assume, when computed from random
samples of the same size, drawn from a specified
population.
• The sampling distribution of the sample mean (X̄) is the probability
distribution of all possible values the sample mean
may assume when a sample of size n is
taken from a specified population
335
Sampling distribution….
Suppose that we calculate a sample mean (X̄) as an
estimate of the population mean (μ).
It is possible to select many samples of size n from a
population.
The value of this sample estimate of the parameter would

differ from one random sample to the next.
By determining the distribution of these estimates, a
statistician is then able to draw an inference based on the
distribution of sample statistics.
This distribution that is so important to us is called the
sampling distribution for the estimate
336
THE CENTRAL LIMIT
THEOREM
Suppose we have taken a random sample of size n
(usually >30) from a population
We assume the population has a mean (μ) and a
standard deviation (σ).
 We then can assert the following:
337
338
CENTRAL LIMIT THEOREM…
2. The mean of the distribution of sample means is equal to
the mean of the population distribution
$$\mu_{\bar{x}} = \mu$$
where $\mu_{\bar{x}}$ = the mean of the distribution of
the sample means
This statement signifies that the sample mean is an
unbiased estimate of the population mean
339
CENTRAL LIMIT THEOREM…
3. The standard deviation of the distribution of sample
means is equal to the standard deviation of the
population divided by the square root of the sample
size
$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$
where $\sigma_{\bar{x}}$ = the standard deviation of the
distribution of sample means based on n observations
• We call $\sigma_{\bar{x}}$ the standard error of the mean.

340
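Statements 2 and 3 of the central limit theorem can be illustrated by simulation. The sketch below is a minimal Python example under assumptions that are not in the lecture: the population is uniform on (0, 1), so μ = 0.5 and σ = √(1/12), and 5,000 samples of size n = 36 are drawn.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(0)   # fixed seed so the run is reproducible
n = 36                   # size of each sample
num_samples = 5000       # number of repeated samples

mu = 0.5                 # mean of the uniform(0, 1) population
sigma = sqrt(1 / 12)     # standard deviation of the uniform(0, 1) population

# Draw many samples and keep the mean of each one
sample_means = [
    sum(rng.random() for _ in range(n)) / n for _ in range(num_samples)
]

mean_of_means = mean(sample_means)  # should be close to mu
sd_of_means = stdev(sample_means)   # should be close to sigma / sqrt(n)

print(round(mean_of_means, 3), mu)
print(round(sd_of_means, 4), round(sigma / sqrt(n), 4))
```

The observed standard deviation of the sample means approximates the standard error σ/√n, even though the underlying population is not normal.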
Sample size determination
 In planning any investigation we must decide
how many people need to be studied in
order to answer the study objectives.
 Too small
 We may fail to detect important effects or
may estimate effects too imprecisely
 False conclusion
 Too large:
 Unnecessary involvement of extra subjects
 High cost
 Time constraints
341
Sample size…
The main determinant of the sample size is, how
accurate the results need to be.
It is much better to increase the accuracy of data

collection than to increase sample size after a
certain point.
It is better to make extra efforts to get a

representative sample rather than to get a very large
sample
A compromise between what is desirable and what is

feasible.
It is important to consider the available resources.

342
Things to consider while determining
sample size
1. The study design
 Cross-sectional, cohort, case control, RCT etc
2. The parameter to be estimated
 Continuous outcomes
 Single mean
 Comparison of two means
 Categorical outcomes
 Single proportion
 Comparison of two proportions
3. Level of confidence (usually 95%) i.e. level of

significance 5%
4. Power of the study (needed for comparisons; usually 80%)
5. Margin of error (the distance from the true value within which the
investigator wants the estimate to fall at a given level of
confidence)
343
1. For Single mean
Used when the outcome variable is continuous
$$n = \frac{(Z_{\alpha/2})^{2}\, \sigma^{2}}{d^{2}}$$
Where:
• n = minimum required sample size
• Z = critical value of the standard normal distribution at the
1-α confidence level
• d = margin of error
• σ = population standard deviation
Finally we need to add 10% of n for non-response
344
Finite population correction
If the source population (N) is <10,000 or n>10% of
N, we need finite population correction
$$n_f = \frac{n}{1 + \dfrac{n}{N}}$$
345
Example
Assume a physician wants to study the systolic blood
pressure (SBP) of people 20-39 years of age in a certain
country.
The normal values are μ = 120mmHg & σ = 10mmHg
How many people should he include in the study if he
wants the estimate to be within 2mmHg of the mean (i.e. not
above 122mmHg) with 95% confidence?
a. From source population of 50,000
b. From source population of 6,000.
346
Solution (a)
Given
Z = 1.96 as confidence level is 95%
σ = 10mmHg
d = 122mmHg - 120mmHg = 2mmHg
$$n = \frac{(1.96)^{2} \times (10\,\text{mmHg})^{2}}{(2\,\text{mmHg})^{2}} \approx 96$$
Adding 10% for non-response: 96 + 0.1 × 96 = 105.6 ≈ 106 people
347
Solution for b.
As N = 6000 < 10,000, we need finite population correction
$$n_f = \frac{106}{1 + \dfrac{106}{6000}} \approx 104 \text{ people}$$
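The single-mean calculation, the 10% allowance and the finite population correction can be chained in a short script. This is a minimal Python sketch that follows the rounding used on the slides (round to the nearest whole person at each step; many texts round up instead); the function names are illustrative.

```python
def sample_size_mean(z: float, sigma: float, d: float) -> float:
    """n = z^2 * sigma^2 / d^2 for estimating a single mean."""
    return (z * sigma / d) ** 2

def finite_population_correction(n: float, N: int) -> float:
    """n_f = n / (1 + n / N), used when the source population is small."""
    return n / (1 + n / N)

n0 = round(sample_size_mean(z=1.96, sigma=10, d=2))   # base sample size
n_a = round(n0 * 1.10)                                # add 10% for non-response
n_b = round(finite_population_correction(n_a, 6000))  # source population of 6,000

print(n0, n_a, n_b)  # 96 106 104
```

For part (a) the source population of 50,000 is above 10,000, so no correction is applied and 106 people are needed.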
348
2. For single proportion
Used when the outcome variable is categorical
$$n = \frac{(Z_{\alpha/2})^{2}\, p\, q}{d^{2}}$$
Where:
• n = minimum required sample size
• Z = critical value of the standard normal distribution at the
1-α confidence level
• d = margin of error
• p = expected proportion of the population
with the event of outcome (prevalence)
• q = 1-p: the probability of non-occurrence
of the event of interest
Finally we need to add 10% of n for non-response
349
Single proportion…
We need also to use finite population correction here if
the source population is <10,000.
Example.
A survey is needed to estimate prevalence of influenza virus infection
in school children
Suppose the available evidence suggests that approximately 20% of
the children will have antibodies to the virus.
Assume the investigator wants to estimate the prevalence within 5% of
the true value.
a. Calculate the sample size assuming source population of 40,000
b. Calculate the sample size assuming source population of 4,000
350
Solution (a)
$$n = \frac{(1.96)^{2} (0.2 \times 0.8)}{(0.05)^{2}} = 245.9 \approx 246$$
Adding 10% for non-response: 246 + 0.1 × 246 = 270.6 ≈ 271 children
351
Solution (b)
As the source population is <10,000, we need finite
population correction
$$n_f = \frac{271}{1 + \dfrac{271}{4000}} = 253.8 \approx 254 \text{ children}$$
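The influenza example can be reproduced in the same way. The sketch below is a minimal Python illustration that assumes the single-proportion formula and the slides' rounding to the nearest whole person; the function name `sample_size_proportion` is hypothetical.

```python
def sample_size_proportion(z: float, p: float, d: float) -> float:
    """n = z^2 * p * (1 - p) / d^2 for estimating a single proportion."""
    return z**2 * p * (1 - p) / d**2

n0 = round(sample_size_proportion(z=1.96, p=0.2, d=0.05))  # base sample size
n_a = round(n0 * 1.10)                # add 10% for non-response
n_b = round(n_a / (1 + n_a / 4000))   # finite population correction, N = 4,000

# With no prior estimate, p = 0.5 gives the largest possible base sample size
n_max = round(sample_size_proportion(z=1.96, p=0.5, d=0.05))

print(n0, n_a, n_b)  # 246 271 254
print(n_max)         # 384
```

The last line shows why p = 0.5 is the conservative choice: p(1-p) is largest there, so the required sample size is at its maximum for a given margin of error.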
352
Single proportion…
Usually we obtain ‘p’ from previous
similar studies or pilot test
If we don’t have previous similar study

and pilot test is impossible, we use 50%
(p=0.5) to get the maximum sample size
with margin of error 5%(d=0.05).
353
Exercise
Suppose that a study is to be conducted to estimate the
smoking rate among adult males in Addis Ababa.
Assume that the current smoking rate among adult
males in Addis Ababa in general is about 27%. It is
desired that the estimated smoking rate be within 3% of the
true rate with 95% confidence.
a. Determine the required number of adult male to be
included in the study based on the above data.
b. What will be the required number of adult males to be

included in the study if the rate of smoking is
unknown?
354
3. For comparing two means
In this case we need power in addition to the
significance level.
Power is the probability that the test statistic falls in the rejection region
(that is, the null hypothesis is rejected) when the alternative hypothesis is true.
355
Sample size formula for difference in means…
$$n_1 = \frac{(r+1)}{r}\,\frac{\sigma^2\,(Z_{1-\beta} + Z_{1-\alpha/2})^2}{\text{difference}^2} = \frac{(r+1)}{r}\,\frac{\sigma^2\,(Z_{1-\beta} + Z_{1-\alpha/2})^2}{(\bar{X}_1 - \bar{X}_2)^2}$$
where:
$n_1$ = size of smaller group
$r$ = ratio of larger group to smaller group
$\sigma$ = standard deviation of the characteristic
difference = clinically meaningful difference in means of the outcome
$Z_{1-\beta}$ corresponds to power (80% power, Z = 0.84)
$Z_{1-\alpha/2}$ corresponds to the two-tailed significance level
(95% level of confidence, Z = 1.96 for α = 0.05)
$n_2 = r\,n_1$
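The values 1.96 and 0.84 quoted above are quantiles of the standard normal distribution. The following minimal sketch recovers them with Python's standard library. The helper names are hypothetical and chosen only for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1


def z_for_confidence(confidence):
    """Two-tailed critical value: the (1 - alpha/2) quantile."""
    alpha = 1 - confidence
    return standard_normal.inv_cdf(1 - alpha / 2)


def z_for_power(power):
    """Value corresponding to power: the (1 - beta) quantile."""
    return standard_normal.inv_cdf(power)


z_alpha = z_for_confidence(0.95)  # approximately 1.96
z_beta = z_for_power(0.80)        # approximately 0.84

print(round(z_alpha, 2), round(z_beta, 2))
```

Other confidence levels or powers (for example 99% confidence or 90% power) can be obtained the same way by changing the arguments.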
356
Difference in means…
If r = 1 (equal groups), then
$$n_1 = \frac{2\,\sigma^2\,(Z_{1-\beta} + Z_{1-\alpha/2})^2}{\text{difference}^2}$$
$$n_2 = n_1$$
357
Example
• Suppose an investigator wants to compare the mean hemoglobin level
between adult males and adult females. From a previous study, the mean
hemoglobin level for normal adult males is 15 g/100 ml and that of normal
females is 13 g/100 ml. The standard deviation is about 3 g/100 ml.
a. Calculate the sample size needed to have an equal number of males and
females in the sample.
b. Calculate the sample size needed if the number of males is
to be 3 times that of females.
Use 95% confidence level and 80% power.
358
Solution (a)
Equal groups, r = 1
$$n_1 = \frac{2\,(3)^2\,(0.84 + 1.96)^2}{(15 - 13)^2} = \frac{18\,(7.84)}{4} = 35.28 \approx 35$$
$$n_2 = n_1 = 35$$
We need to add 10% for non-response.
So, we need 39 females and 39 males, a total of 78 study subjects.
359
Solution (b)
The male to female ratio is required to be 3:1, so r = 3
$$n_1 = \frac{(3 + 1)\,(3)^2\,(0.84 + 1.96)^2}{3\,(15 - 13)^2} = \frac{4\,(9)\,(7.84)}{12} = 23.52 \approx 24$$
$$n_2 = r \times n_1 = 3 \times 24 = 72$$
Add 10% for non-response.
So, we need 27 females and 79 males, a total of 106 study subjects.
360
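The hemoglobin example can be checked with a short Python sketch of the difference-in-means formula. The function name is hypothetical, and the defaults 1.96 and 0.84 assume 95% confidence and 80% power as in the lecture. The function returns unrounded values so that the rounding step stays visible.

```python
def two_means_n(sigma, difference, r=1.0, z_alpha=1.96, z_beta=0.84):
    """Unrounded group sizes for comparing two means.

    n1 is the smaller group, n2 = r * n1 is the larger group.
    """
    n1 = (r + 1) / r * sigma**2 * (z_beta + z_alpha) ** 2 / difference**2
    return n1, r * n1


# Equal groups (r = 1): sigma = 3, difference = 15 - 13
equal_n1, equal_n2 = two_means_n(sigma=3, difference=15 - 13)

# Males three times as many as females (r = 3)
small_n1, large_n2 = two_means_n(sigma=3, difference=15 - 13, r=3)

print(round(equal_n1, 2), round(small_n1, 2), round(large_n2, 2))
```

Note that the unrounded larger group for r = 3 is 70.56, whereas the worked solution first rounds the smaller group to 24 and then multiplies by 3 to get 72.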
4. For comparing two proportions
To compare two proportions we use the following formula:
$$n = \frac{(Z_{1-\alpha/2} + Z_{1-\beta})^2\,\left[\,p_1(1 - p_1) + p_2(1 - p_2)\,\right]}{(p_1 - p_2)^2}$$
• n = sample size in each group (for 1:1 ratio)
• $Z_{1-\beta}$ = value corresponding to the power of the study, usually 80% ($Z_{1-\beta}$ = 0.84)
• $Z_{1-\alpha/2}$ = value corresponding to the confidence level of the study, usually 95% ($Z_{1-\alpha/2}$ = 1.96)
• $p_1$ = proportion of outcome of interest in group 1
• $p_2$ = proportion of outcome of interest in group 2
361
Example
• Suppose the investigator wants to conduct a study to see
whether there is a difference in the rate of malnutrition
between infants who are exclusively breastfed and infants on
mixed feeding.
• Assume that, from a previous study in a similar population, the
prevalence of malnutrition among mixed-feeding infants is
20% and among those on exclusive breastfeeding is 15%.
• Determine the minimum sample size required for both
groups of infants. Use 95% confidence level and 80%
power.
362
Solution
Given
p1 = 20% = 0.2 (among infants on mixed feeding)
p2 = 15% = 0.15 (among infants on exclusive breastfeeding)
Confidence level = 95%, $Z_{1-\alpha/2}$ = 1.96
Power = 80%, $Z_{1-\beta}$ = 0.84
$$n_1 = \frac{(0.84 + 1.96)^2\,\left(0.2(0.8) + 0.15(0.85)\right)}{(0.2 - 0.15)^2} = \frac{7.84 \times 0.2875}{0.0025} = 901.6 \approx 902$$
$$n_2 = n_1 = 902$$
• Adding 10% for non-response gives 902 + 0.1 × 902 = 992.2, rounded up to 993 infants
in each group, so a total of 1,986 infants will be included in the study.
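As a check on the arithmetic, the two-proportion formula can be evaluated with a few lines of Python. This is a sketch with a hypothetical function name. It applies the formula as stated, with the difference p1 − p2 squared in the denominator, and assumes 95% confidence and 80% power.

```python
def two_proportions_n(p1, p2, z_alpha=1.96, z_beta=0.84):
    """Unrounded per-group sample size for comparing two proportions (1:1 ratio)."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_alpha + z_beta) ** 2 * variance_sum / (p1 - p2) ** 2


# Malnutrition example: 20% (mixed feeding) versus 15% (exclusive breastfeeding)
n_per_group = two_proportions_n(0.20, 0.15)

print(round(n_per_group, 1))  # about 901.6 per group, before rounding and non-response
```

Small differences between proportions are expensive to detect: the denominator here is (0.05)² = 0.0025, which is what drives the per-group size into the hundreds.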
363
Thank you!
364
