April, 2011
Addis Ababa University
1
Introduction to
Biostatistics
Lecturer:
Gurmesa Tura (MPH)
March 2011
AAU
Objectives
• At the end of this lecture the students will be
able to:
Define statistics & Biostatistics
3
Contents
• Definition
• Types of statistics
• Roles of statistics
4
What is statistics?
• The scientific study of numerical data based
on variation in nature. (Sokal and Rohlf)
5
Statistics…
• Statistics is the art and science of making
decisions in the face of uncertainty
6
What are statistical data?
• Observation: information obtained from a
single person
7
Types of statistics
• Descriptive Statistics
– Collection,
– organization,
– summarization, and
– presentation of data.
• Inferential Statistics
– Generalizing from samples to populations using
probabilities.
– Performing hypothesis testing,
– Determining relationships between variables,
– Making predictions.
8
Why study statistics in medicine?
9
Roles of statistics
• In clinical medicine
– Making clinical diagnosis
– Determining Rx and prognosis
– Handling variations (defining normal values and normal
ranges)
• In public health
– Community diagnosis
• In Research
– Designing and undertaking clinical & public health research
10
Uses of statistics
1. Collecting data in the best possible way
11
Limitations of statistics
1. Statistics does not deal with single (individual) values.
– It deals only with aggregate values
12
Types of data
• Based on source :
– Primary & secondary data
1. Primary data
• Data collected by the investigator for the
purpose of specific study
• Original in character
• Mostly generated by surveys
• Complete, reliable and more accurate
13
Types of data…
2. Secondary data
• When the investigator uses data which have been collected by
others for another purpose
14
Scales of measurement
• A variable is any aspect of an individual or
thing that is measured and can take different
values for different individuals or cases
• Variables are divided into two types
1. Qualitative (categorical) variable &
2. Quantitative (numerical) variables
15
Qualitative (categorical) variable
• A variable which cannot be measured in
quantitative (numerical) form but can only be
identified by names.
16
Nominal data
• Represent categories or names
• There is no order among the categories
• It has two forms:
– Dichotomous: has 2 categories
• E.g. Sex: Male or Female
• Immunization: Yes or No
• Disease outcome: Died or Survived
– Multichotomous: more than two categories
• E.g. Blood group: A, B, AB or O
• Marital status: single, married, divorced or widowed
17
Ordinal data
• Have order in the response categories
18
Quantitative (numerical) variables
• Variables which assume numerical values.
• variables to which a number is assigned as a
quantitative value
• Has two forms
– Discrete Variables
• Variables which assume a finite or countable number of
possible values.
• Usually obtained by counting; no decimals
• E.g. household size, number of children
– Continuous Variables
• Variables which assume an infinite number of possible
values.
• Usually obtained by measurement.
• Can have decimals
• Eg. Age, weight, height
19
Quantitative…
• Continuous variables…
• Have two scales of measurement
• Interval scale:
– Order and distance implied. Differences can be compared;
– no true zero.
– Ratios can not be compared.
E.g. Temperature in Celsius.
» 0 °C does not mean there is no temperature
» 40 °C is not twice as hot as 20 °C
• Ratio scale:
– Order and distance implied.
– Differences can be compared;
– has a true zero.
– Ratios can be compared.
– Examples: Height, weight, blood pressure
• 40 cm is twice as long as 20 cm
• 0 cm is a true zero: it means there is no height at all
20
Discrete
21
Data collection
• The process of obtaining statistical data
• Before any statistical work can be done data must be
collected
• Collecting Primary data
– Observation
– Interview
– Use of self administered questionnaire
22
Observation
• Systematically selecting , watching and recording
behaviours of people or other phenomena and
aspects of the settings in which they occur
• Includes
– Visual observation
– Radiographic, Biomedical, x-ray, microscope,
clinical examinations, etc
23
Observation…
• It can also be used In observing behaviour
of people, culture etc.
• It could be
– Participant observation or
– Non-participant observation
24
Observation…
• Advantage
– More accurate data on behaviour or activity
• Disadvantages
– Observer bias
– Prejudice
– Desirability bias
– Requires skilled personnel when sophisticated
instruments are used
25
Interviews
• Face to face interview
• Telephone interview
• Mailed questionnaire
• Computer interview
26
Face to face interview
• Advantage
– Permits detailed & in-depth questions & responses
– Minimizes non-response
• Disadvantage
– Costly
– Interviewer bias
– Investigator bias
– Interviewer cheating
27
Telephone interview
• Advantage
– Convenient
– Saves time
– Relatively inexpensive
– Less interviewer & investigator bias than personal
interview
• Disadvantage
– Non-coverage
– Limited length & depth of questions and responses
28
Self-administered Questionnaire
• Advantage
– Cost effective for large areas
– Minimizes interviewer bias
– Promotes accurate answers
– Data on sensitive issues can be gathered
• Disadvantage
– Low response rates
– Unanswered questions
– Incorrect answers
29
Mailed questionnaire
• Advantage
– Allows collecting data without personal presence
• Disadvantage
– Low response rate
– Not applicable for illiterates
– Low coverage in rural areas
30
Use of documentary sources
• These include
– Clinical & other personal records
– Vital statistics
– Census data
• Sources
– Official publications of CSA
– Publications of MOH & other ministries
– Newspapers & journals
– International publications (WHO, UNICEF, etc)
– Health facilities’ records
31
Choosing method of data collection
• The choice of data collection method(s)
depends on:
– Type of data we need
– Resources (time, personnel & facility)
– Accuracy & strength of the method
– Acceptability of the method by the subjects
– Background of study subjects
– Etc
Learning objectives
• At the end of this lecture the students will be
able to:
34
34
Methods of data organization
• The data collected in a survey is called raw data
36
36
Frequency distributions
• A table that shows data classified into a number of
classes, together with the number of observations falling in
each class (the frequency)
• Two types
– Categorical frequency distribution
– Numerical frequency distribution
37
37
Categorical frequency distribution
38
38
Procedures to construct the frequency distribution
• There are 4 types of blood group, so we have four classes
• Step 1: Make a table
• Step 2: Tally the data & place the result in Tally column
• Step 3: count the tally and Place the result in frequency
column
• Step 4: calculate the % for each class
% = f/n*100
Where f= frequency of the class &
n= total number of values
39
39
40
40
Numerical frequency distribution
41
41
Ungrouped frequency distribution
42
42
Constructing ungrouped freq. distri.
• 1st find the smallest & the largest values in the data
• Arrange the data in order of magnitude and count the frequency
• To facilitate counting one may include column of tallies.
• Steps in constructing
• Step 1: make the table
• Step 2: Tally the data
• Step 3: Count the frequency
• Step 4: compute the percentage
4, 6, 4, 3, 5, 2 , 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6, 2, 6, 4, 3, 5, 2 , 8, 10, 4, 4, 5, 3, 5, 8, 4, 4,
6, 2, 5, 2 , 8, 10, 4, 4, 5, 3, 10, 4, 5, 6, 3, 5, 6
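The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `ungrouped_frequency` is hypothetical, and `collections.Counter` stands in for the manual tally column. The 50 values are the ones listed above.

```python
from collections import Counter

# The 50 observations listed above.
data = [
    4, 6, 4, 3, 5, 2, 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6, 2, 6, 4, 3, 5, 2, 8, 10,
    4, 4, 5, 3, 5, 8, 4, 4, 6, 2, 5, 2, 8, 10, 4, 4, 5, 3, 10, 4, 5, 6, 3, 5, 6,
]

def ungrouped_frequency(values):
    """Return rows of (value, frequency, percentage), ordered by value."""
    counts = Counter(values)              # Steps 2-3: tally and count
    n = len(values)
    # Step 4: percentage = f / n * 100
    return [(v, f, round(f / n * 100, 1)) for v, f in sorted(counts.items())]

for value, freq, pct in ungrouped_frequency(data):
    print(f"{value:>5} {freq:>5} {pct:>6}")
```

Sorting the counted values puts the table in order of magnitude, which is what the second bullet asks for.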
43
43
44
44
Grouped frequency distribution (GFD)
• A frequency distribution in which several values are
grouped into one class
• Two types
– Inclusive
• both class limits belong to the class, so the upper limit of one class does
not coincide with the lower limit of the next class (e.g. 12-17, 18-23)
– Exclusive
• the upper limit of one class coincides with the lower limit
of the next class (e.g. 12-18, 18-24)
45
45
Grouped freq. distr….
• Example: Consider the following ungrouped marks of 30
students (out of 50)
24 30 36 35 42 40 26 23
36 36 12 45 29 21 34 40
16 47 28 32 33 44 19 34
30 36 35 47 20 14
47
47
Steps in constructing Grouped freq. distr.
1. Find the largest & smallest value
2. Compute range (R) = Maximum – Minimum
• From the above example R = 47 – 12 = 35
3. Select the number of classes (usually 6-20) or use Sturges' rule
k = 1 + 3.322 log10(n)
Where k is the desired number of classes &
n is the total number of observations
k is rounded up if it is not a whole number
From the example above (n = 30)
k = 1 + 3.322 log10(30) (log10(30) ≈ 1.48)
k = 1 + 3.322(1.48) ≈ 5.9, rounded up to 6
So we need to have 6 classes
48
48
Steps…
4. Find the class width (w) by dividing the range by the
number of classes and always rounding up (not rounding to the nearest value).
From the example above w = R/k = 35/6 ≈ 5.8, rounded up to 6
49
49
Steps..
5. Find the lower class limits (LCL) by starting from the smallest
value and adding the class width each time.
• From the above example the starting point is the smallest value = 12, so,
• 1st lower limit = 12
• 2nd lower limit = 12 +6 =18
• 3rd lower limit = 18+6 = 24
• 4th lower limit = 24+6 = 30
• 5th lower limit= 30+6 = 36
• 6th lower limit = 36+6 = 42
50
50
Steps…
6. Find the upper class limit (UCL),
UCL = LCL + (w – 1)
From the above example w = 6,
so w – 1 = 5
52
52
By combining all the steps
Classes   Tally      Freq   %       rf     Cf (less than)   Cf (greater than)
12-17     ///        3      10.0    0.10   3                30
18-23     ////       4      13.3    0.13   7                27
24-29     ////       4      13.3    0.13   11               23
30-35     //// ///   8      26.7    0.27   19               19
36-41     //// /     6      20.0    0.20   25               11
42-47     ////       5      16.7    0.17   30               5
Total                30     100.0   1.00
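The construction steps can be sketched in Python as a check on the table. The function name `grouped_frequency` is hypothetical, and the sketch assumes whole-number data and inclusive classes, as in the marks example. If the range is an exact multiple of the number of classes, the largest value may need one extra class; that case is not handled here.

```python
import math

# The 30 marks from the example.
marks = [24, 30, 36, 35, 42, 40, 26, 23, 36, 36, 12, 45, 29, 21, 34,
         40, 16, 47, 28, 32, 33, 44, 19, 34, 30, 36, 35, 47, 20, 14]

def grouped_frequency(values):
    """Return rows of (LCL, UCL, frequency, percentage, cumulative frequency)."""
    n = len(values)
    smallest, largest = min(values), max(values)
    k = math.ceil(1 + 3.322 * math.log10(n))   # Sturges' rule, rounded up
    w = math.ceil((largest - smallest) / k)    # class width, rounded up
    rows, cumulative = [], 0
    for i in range(k):
        lcl = smallest + i * w                 # lower class limit
        ucl = lcl + (w - 1)                    # upper class limit
        f = sum(lcl <= v <= ucl for v in values)
        cumulative += f
        rows.append((lcl, ucl, f, round(f / n * 100, 1), cumulative))
    return rows

for row in grouped_frequency(marks):
    print(row)
```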
53
53
Common terms used in grouped freq. distr. (GFD)
• Class interval: range of scores grouped together in a GFD
• Class limits: the first & the last elements in the given class
interval
54
54
Terms….
• Class boundaries: separate one class in a GFD from another
• The boundaries have one more decimal place than the raw data and
therefore do not appear in the data
• There is no gap between the upper boundary of one class and the lower
boundary of the next class
• LCB = LCL – U/2
• UCB = UCL + U/2
• where U is the unit of measurement (U = 1 for whole-number marks)
55
55
Terms…
Classes Boundaries
57
57
Rules in constructing tables
1. Table should be as simple as possible (6-20 categories)
2. Tables should be self explanatory
• Title should be clear and to the point (answers: What, when, where, how
classified)
e.g. Table 1: Marks of 30 Medical students of AAU, March 2011, AA, Ethiopia.
• Placed above the table
3. Each row & column should be labelled
4. Numerical entries of zero should be explicitly written rather than indicated
by a dash, as dashes are reserved for missing or unobserved data.
5. Totals should be indicated (last row and last column)
6. If the data are not original, their source should be given in foot notes.
58
58
Types of tables
• We have three different types of tables based on the number of
variables included
60
60
Eg. Two way table
• Table 3: Immunization status by sex of children in xxx
woreda, 2010 (hypothetical)
Sex of children Immunization status Total
61
61
Eg. Higher ordered table
•Table 4: Immunization status by sex and residence of children in xxx
woreda, 2010 (hypothetical)
Immunization status Total
Sex & residence of children
N % N %
Lecture 3
By: Gurmesa Tura (MPH)
March 2011
AAU
63
Objectives
• At the end of the class the students will be
able to:
– Identify the different types of graphs
– Choose among the graphs based on the data
– Construct the different types of
graphs
– Identify importance and limitation of using graphs
64
Graphical presentation of data
• Techniques for presenting data in visual
displays using geometric figures and pictures.
• Importance
• Greater attraction
• Easily understandable
• Facilitate comparison
• May reveal unsuspected patterns in complex set of
data
• Greater memorizing value
65
Limitations
• Used only for purpose of comparison
• Not an alternative to tabulation
• Can give only an approximate idea
• They fail to bring to light too small differences
66
Types of graphs
• For qualitative & quantitative discrete data
• Bar chart
• Pie chart
67
Bar chart
• A series of equally spaced bars having equal width
(base) where the height of each bar represents the
frequency (amount) associated with each category.
Immunization status   Freq   %
Immunized             135    64.3
Not immunized         75     35.7
[Figure: bar chart of number of children (y-axis, 0-160) by immunization status (x-axis: Immunized, Not immunized)]
69
Multiple bar chart
• From the previous example
70
Multiple bar chart…
[Figure: two multiple bar charts, one showing the number and one the percentage of children, by immunization status (Immunized, Not immunized) with separate bars for Male and Female]
71
Component bar chart
We can also construct component bar chart for the above table
[Figure: two component bar charts, one showing the number and one the percentage of children, with one bar per sex (Male, Female)]
72
Pie chart
• A circle divided in to sectors so that the areas
of the sectors are proportional to the
frequencies.
Blood Type   Freq.   %       Angle of sector
A            5       16.7    5/30 × 360° = 60°
B            7       23.3    7/30 × 360° = 84°
AB           5       16.7    5/30 × 360° = 60°
O            13      43.3    13/30 × 360° = 156°
Total        30      100     360°
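The sector angles can be checked with a short Python sketch. The function name `sector_angle` is hypothetical. The frequencies for A, AB and O are taken from the slide; the frequency for B is not shown and is inferred here as the remainder of the total of 30.

```python
# Frequencies of blood types; B is inferred as 30 - (5 + 5 + 13).
blood_types = {"A": 5, "B": 7, "AB": 5, "O": 13}
total = sum(blood_types.values())

def sector_angle(freq, total):
    """Angle of a pie sector in degrees: frequency / total * 360."""
    return freq * 360 / total

for blood_type, freq in blood_types.items():
    print(blood_type, sector_angle(freq, total))
```

The angles of all sectors add up to 360 degrees, which is a quick check that no category was missed.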
74
Pie chart
[Figure: pie chart of blood types A, B, AB and O with sectors of 17%, 23%, 17% and 43%]
75
Histograms
• A graph consisting of a series of rectangles whose bases are equal to
the class width of the corresponding class & whose heights are
proportional to the class frequencies
76
E.g. Consider the data on student marks
Table 5: Marks of 30 students, AAU, Ethiopia, 2011 (hypothetical data)
Classes   Class midpoint   Freq
12-17     14.5             3
18-23     20.5             4
24-29     26.5             4
30-35     32.5             8
36-41     38.5             6
42-47     44.5             5
Total                      30
77
Histograms
Figure 5: Frequency polygon showing marks of 30 students, AAU, 2010
(Hypothetical data)
Cumulative frequency polygon (Ogive)
• Line graph obtained by plotting the cumulative
frequency distribution (Y-axis) against class
boundaries (x-axis)
• Two types
– Cumulative frequency Less than the UCB (Lcf)or
– Cumulative frequency More than the LCB (Mcf)
– We can also use the intersection of the two.
82
Construct Ogive by using the table from the
above Example
Classes   Class boundaries   Freq   Less than cf   More than cf
12-17     11.5-17.5          3      3              30
18-23     17.5-23.5          4      7              27
24-29     23.5-29.5          4      11             23
30-35     29.5-35.5          8      19             19
36-41     35.5-41.5          6      25             11
42-47     41.5-47.5          5      30             5
Total                        30
83
Less than Ogive
Figure 8: More than & less than Ogive with their intersection showing marks
of 30 students, AAU, 2010, (Hypothetical data)
Data summarization
Lecture 4-6
By Gurmesa Tura (MPH)
March 2011
AAU
87
Learning objectives
At the end of this lecture, the students will
be able to:
– Identify the different parameters for data
summarizations
– Differentiate between measures of central
tendency and dispersion
– Calculate the commonly used measures of
central tendency and measures of dispersion
– Interpret the final results of the measures
88
Data summarization
Although tables and graphs serve useful
purposes, there are many situations that require
other types of data summarization.
91
Mean…
{8, 5, 4, 12, 15, 5, 7}
What is the mean of these data?
Mean = (8 + 5 + 4 + 12 + 15 + 5 + 7) / 7
= 56/7 = 8
But what if the data set is large?
92
93
Population mean
94
Mean for large Discrete data set
with frequency distribution
When we have a large data set it is difficult to add the values manually.
In that case, multiply each value by its respective frequency and divide
the sum by the total frequency.
HH size   Freq
2         5
3         6
4         14
5         10
6         6
8         5
10        4
Total     50
95
Example: determine the mean HH size from the
following table
HH size (xi)   Freq (fi)   fi·xi
2              5           10
3              6           18
4              14          56
5              10          50
6              6           36
8              5           40
10             4           40
Total          50          250
Mean = 250/50
= 5
According to this data, on average 5 people
live in a household
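The same calculation can be sketched in Python. The function name `frequency_mean` is hypothetical; the table is stored as a dictionary mapping each household size to its frequency.

```python
# Household size -> frequency, as in the table above.
hh_table = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}

def frequency_mean(table):
    """Mean of a frequency distribution: sum(f * x) / sum(f)."""
    total_f = sum(table.values())
    total_fx = sum(x * f for x, f in table.items())
    return total_fx / total_f

print(frequency_mean(hh_table))  # 250 / 50 = 5.0
```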
Mean for Grouped data
From our previous example determine the mean
Mark of the students presented in the following table
Classes   Freq (fi)   xi   fi·xi
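As a worked sketch of this calculation, take xi to be the class midpoints of the marks data (14.5, 20.5, 26.5, 32.5, 38.5, 44.5) and fi the class frequencies (3, 4, 4, 8, 6, 5). Treating every mark in a class as if it sat at the midpoint is the usual assumption for grouped data, so the result is an approximation of the mean of the raw marks.

$$\bar{X} = \frac{\sum f_i x_i}{\sum f_i} = \frac{3(14.5) + 4(20.5) + 4(26.5) + 8(32.5) + 6(38.5) + 5(44.5)}{30} = \frac{945}{30} = 31.5$$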
Disadvantage
– Affected by extreme values in the distribution
– When the distribution has an open end classes
its computation would be based on assumption
and therefore may not be valid
100
Reading assignment
Geometric mean
Harmonic Mean
101
Median
Median = middle value
Median
What if the data set is too large for the values
to be listed?
103
104
105
Median for grouped data
It is possible to know the median class by the above formula.
But it doesn’t tell us the exact median value.
Class   Freq   cf
12-17   3      3
18-23   4      7
24-29   4      11
30-35   8      19
36-41   6      25
42-47   5      30
Total   30
n = 30, so the median class is the class that contains the 15th & 16th
observations, i.e. class 30-35
106
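Median for grouped data…
To get the exact value within 30-35, we need
another formula:
$$\text{Median} = L_{med} + \frac{W}{f_{med}}\left(\frac{n}{2} - cf\right)$$
Where:
Lmed = LCB of median class
W = width of median class
fmed = frequency of median class
n = total number of observations
cf = cumulative frequency of the class preceding the median class
Class   Freq   cf
30-35   8      19
36-41   6      25
42-47   5      30
Total   30
Median = 29.5 + 6/8 (30/2 – 11)
= 29.5 + 6/8 (15 – 11)
= 29.5 + 3
= 32.5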
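A minimal Python sketch of this interpolation follows. The function name `grouped_median` is hypothetical; each class is given as (lower boundary, upper boundary, frequency), using the class boundaries of the marks data.

```python
# (LCB, UCB, frequency) for each class of the marks data.
classes = [(11.5, 17.5, 3), (17.5, 23.5, 4), (23.5, 29.5, 4),
           (29.5, 35.5, 8), (35.5, 41.5, 6), (41.5, 47.5, 5)]

def grouped_median(classes):
    """Median = LCB + (width / f) * (n/2 - cf before the median class)."""
    n = sum(f for _, _, f in classes)
    cf = 0                                  # cumulative frequency so far
    for lcb, ucb, f in classes:
        if cf + f >= n / 2:                 # this is the median class
            return lcb + (ucb - lcb) / f * (n / 2 - cf)
        cf += f

print(grouped_median(classes))  # 29.5 + 6/8 * (15 - 11) = 32.5
```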
Median…
Characteristics
– An average position
– Affected more by the number of items than by extreme values
Advantages
– Easy to calculate and more typical of the series
– The median may be located even when the data is
incomplete.
E.g. when the class intervals are irregular and the final
classes are open-ended
– Not affected by extreme observation
Disadvantages
– Not well suited to mathematical treatment
– Not so familiar as the arithmetic mean
109
Mode
Mode - The value that occurs most frequently
The given data set may have
– One mode = unimodal
E.g.. 3,3,4,4,4, 5,5,5,5,6,7,8 mode is 5
– Two mode = bimodal
E.g.. 10, 11, 12, 12, 12, 13, 14, 15,15,15, 17
modes are 12 & 15
– More than two modes = multimodal
– No mode at all =non-modal
E.g. 3,4,5,7,8,10
110
111
Mode for ungrouped data
The mode can simply be identified by selecting the
observation with the largest frequency.
HH size   Freq
2         5
3         6
4         14
5         10
6         6
8         5
10        4
Total     50
From this data the greatest frequency is 14,
so the mode is 4
112
Mode for grouped data
Class   Freq
12-17   3
18-23   4
24-29   4
30-35   8
36-41   6
42-47   5
Total   30
Here the modal class, the class with the highest frequency, is 30-35.
We need to determine the exact value between 30 & 35 that
represents the mode of the data.
It is determined by the formula given below.
113
Mode for grouped data…
$$\text{Mode} = L_{mo} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$$
Where:
– Lmo = LCB of modal class
– w = the width of modal class
– ∆1 = frequency of modal class – frequency of
class preceding modal class
– ∆2 = frequency of modal class – frequency of
class following the modal class
Example
Class   Freq
12-17   3
18-23   4
24-29   4
30-35   8
36-41   6
42-47   5
Total   30
Modal class = 30-35
Lmo = 29.5
w = 35.5 – 29.5 = 6
Frequency of modal class = 8
Frequency of the class preceding modal class = 4
Frequency of the class following modal class = 6
∆1 = 8 – 4 = 4 & ∆2 = 8 – 6 = 2
Mode = Lmo + w(∆1/(∆1 + ∆2))
= 29.5 + 6(4/(4+2))
= 29.5 + 6(4/6)
= 33.5
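The same calculation as a small Python sketch. The function name `grouped_mode` and its parameter names are hypothetical; the inputs are the values identified in the example.

```python
def grouped_mode(lcb, width, f_modal, f_before, f_after):
    """Mode = LCB + width * d1 / (d1 + d2) for the modal class."""
    d1 = f_modal - f_before    # difference from the preceding class
    d2 = f_modal - f_after     # difference from the following class
    return lcb + width * d1 / (d1 + d2)

# Modal class 30-35: LCB 29.5, width 6, frequencies 4, 8, 6.
print(grouped_mode(29.5, 6, 8, 4, 6))  # 33.5
```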
115
Characteristics of mode
Is an average position
116
Advantage & disadvantage of mode
Advantages
– Since it is most typical value it is the most descriptive
average
– Since the mode is usually an actual value it indicates
the precise value of an important part of the series.
– It is not affected by extreme values
Disadvantages
– It is not capable of mathematical treatment
– Has no significance for small samples
– In small number of items the mode may not exist
117
Measures of Dispersion
118
Dispersion
In order to utilize the information provided
by a set of data, knowing just a location or
average value of the data alone is not
adequate,
R = L – S, where L is the largest value and S is the smallest value
120
Range…
Eg. From the HH size
4, 6, 4, 3, 5, 2 , 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6,
2, 6, 4, 3, 5, 2 , 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6,
2, 5, 2 , 8, 10, 4, 4, 5, 3, 10, 4, 5, 6, 3, 5, 6
R = 10 -2 = 8
121
Example from students Mark
122
Range..
Advantages
– Computation is simple
– Easy to understand
Disadvantages
– It does not consider all values
– A poor measure of dispersion
124
Quartiles
Quartiles are sets of values which divide the
distribution into four parts such that there are an
equal number of observations in each part.
125
Calculating quartiles for ungrouped data
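$$Q_i = \frac{i}{4}(n+1)^{\text{th}} \text{ value}, \quad i = 1, 2, 3$$
then
$$Q_1 = \frac{1}{4}(n+1)^{\text{th}} \text{ value}$$
$$Q_2 = \frac{2}{4}(n+1)^{\text{th}} \text{ value}$$
$$Q_3 = \frac{3}{4}(n+1)^{\text{th}} \text{ value}$$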
Example
HH size   Freq   Cf
2         5      5
3         6      11
4         14     25
5         10     35
6         6      41
8         5      46
10        4      50
Total     50
Q1 = 1/4(50+1)th value
= 1/4(51) = 12.75th value
= 12th value + 0.75 × (13th value – 12th value)
= 4 + 0.75(4 – 4) = 4
Q2 = 2/4(50+1)th value
= 2/4(51) = 25.5th value
= 25th value + 0.5 × (26th value – 25th value)
= 4 + 0.5(5 – 4) = 4.5
Q3 = 3/4(50+1)th value
= 3/4(51) = 38.25th value
= 38th value + 0.25 × (39th value – 38th value)
= 6 + 0.25(6 – 6) = 6
Inter-quartile range (IQR)
Inter-quartile range is the difference b/n the third
and the first quartiles.
IQR = Q3-Q1
= 6-4 = 2
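The quartile positions and the interpolation between neighbouring values can be sketched in Python. The function name `quartile` is hypothetical, and the sketch follows the (n + 1) position rule used in these slides; other software may use different quartile conventions and give slightly different values.

```python
# Rebuild the 50 household sizes from the frequency table.
hh_table = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}
hh_sizes = [x for x, f in hh_table.items() for _ in range(f)]

def quartile(values, i):
    """i-th quartile as the i/4 * (n + 1)-th ordered value."""
    data = sorted(values)
    position = i * (len(data) + 1) / 4     # 1-based, may be fractional
    whole = int(position)
    fraction = position - whole
    if whole < 1:
        return data[0]
    if whole >= len(data):
        return data[-1]
    lower, upper = data[whole - 1], data[whole]
    return lower + fraction * (upper - lower)

q1, q2, q3 = (quartile(hh_sizes, i) for i in (1, 2, 3))
print(q1, q2, q3, q3 - q1)  # 4.0 4.5 6.0 2.0
```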
128
Calculating quartiles for Grouped data
First find the class in which Qi lies.
This can be obtained by counting up to the (i × n/4)th observation,
beginning from the lowest class.
Classes   Freq   cf
12-17     3      3
129
Then, we use the formula
$$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{in}{4} - C\right), \quad i = 1, 2, 3$$
Where:
– LQi = Lower class boundary of the Quartile class
– W = width of quartile class
– n = total number of observations
– fQi = frequency of the quartile class
– C = cumulative frequency preceding the quartile
class
130
Solution
$$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{in}{4} - C\right), \quad i = 1, 2, 3$$
IQR = Q3 - Q1
= 39-24 = 15
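A Python sketch of the grouped-data quartile formula, applied to the marks data. The function name `grouped_quartile` is hypothetical. Under this sketch Q1 comes out as 24.25, which the slide rounds to 24, and Q3 as 39.

```python
# (LCB, UCB, frequency) for each class of the marks data.
classes = [(11.5, 17.5, 3), (17.5, 23.5, 4), (23.5, 29.5, 4),
           (29.5, 35.5, 8), (35.5, 41.5, 6), (41.5, 47.5, 5)]

def grouped_quartile(classes, i):
    """Q_i = LCB + (w / f) * (i*n/4 - C) for the class containing Q_i."""
    n = sum(f for _, _, f in classes)
    target = i * n / 4
    c = 0                                  # cumulative frequency before the class
    for lcb, ucb, f in classes:
        if c + f >= target:                # quartile class found
            return lcb + (ucb - lcb) / f * (target - c)
        c += f

q1 = grouped_quartile(classes, 1)
q3 = grouped_quartile(classes, 3)
print(q1, q3, q3 - q1)  # 24.25 39.0 14.75
```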
131
Box plot
a graphical display that involves a five-number
summary of a distribution of values, consisting
of
– the minimum value,
– the first quartile,
– the median,
– the third quartile, and
– the maximum value
132
Box plots
• It could be vertical or horizontal
133
Box plots…
• These horizontal lines are called
whiskers.
134
Box plot...
Putting IQR in diagrammatic form
Maximum = 47
Q3 =39
Q2= median =32.5
Q1 =24
Minimum =12
135
Purpose of box plot
Shows center of distribution (median)
136
IQR..
Advantages
– It is simple and versatile measure
– It encloses the central 50% of the observation
– Less prone to distortion by a single large or
small value
Disadvantage
– It is not based on all observations but only on
two specific values
137
Reading assignments
Deciles
Percentiles
138
Variance & standard deviation
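Variance and standard deviation are other
measures of dispersion
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$
Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}$$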
Standard deviation
The standard deviation is the square root
of the variance
$$S = \sqrt{S^2}$$
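For frequency distribution
Population variance and population standard deviation:
$$\sigma^2 = \frac{\sum_i f_i (X_i - \mu)^2}{N}, \qquad \sigma = \sqrt{\frac{\sum_i f_i (X_i - \mu)^2}{N}}$$
Sample variance and sample standard deviation:
$$S^2 = \frac{\sum_i f_i (X_i - \bar{X})^2}{n - 1}, \qquad S = \sqrt{\frac{\sum_i f_i (X_i - \bar{X})^2}{n - 1}}$$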
Example
Calculate the variance & Standard
deviation for the following Table
HH size   Freq
2         5
3         6
4         14
5         10
6         6
8         5
10        4
Total     50
The first step is calculating the mean by using the formula
for the mean of a frequency distribution.
As calculated above, the mean is 5
144
Solution
Mean = 5
HH size   Freq   (Xi – X̄)   (Xi – X̄)²   f(Xi – X̄)²
2         5      -3          9            45
3         6      -2          4            24
4         14     -1          1            14
5         10     0           0            0
6         6      1           1            6
8         5      3           9            45
10        4      5           25           100
Total     50                              234
Variance
$$S^2 = \frac{\sum_i f_i (X_i - \bar{X})^2}{n - 1} = \frac{234}{50 - 1} = \frac{234}{49} \approx 4.8$$
Standard deviation
$$S = \sqrt{4.8} \approx 2.19$$
146
For grouped frequency distribution
147
$$\text{Variance} = S^2 = \frac{\sum_i f_i (X_i - \bar{X})^2}{n - 1} = \frac{2598}{30 - 1} = \frac{2598}{29} \approx 89.6$$
$$SD = S = \sqrt{89.6} \approx 9.5$$
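Both variance examples (household sizes and grouped marks) can be checked with one Python sketch. The function name `frequency_variance` is hypothetical. For the grouped marks, the class midpoints stand in for the individual values, which is the usual assumption for grouped data.

```python
import math

def frequency_variance(table):
    """Sample variance of a frequency distribution {value: frequency}."""
    n = sum(table.values())
    mean = sum(x * f for x, f in table.items()) / n
    sum_sq = sum(f * (x - mean) ** 2 for x, f in table.items())
    return sum_sq / (n - 1)

# Household size -> frequency.
hh_table = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}
# Class midpoint -> frequency for the marks data.
marks_table = {14.5: 3, 20.5: 4, 26.5: 4, 32.5: 8, 38.5: 6, 44.5: 5}

for name, table in (("HH size", hh_table), ("Marks", marks_table)):
    variance = frequency_variance(table)
    print(name, round(variance, 1), round(math.sqrt(variance), 2))
```

Dividing by n - 1 rather than n gives the sample variance, matching the sample formula used in the worked solutions.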
148
Importance of variance & Standard deviation
149
Coefficient of Variation (CV)
CV is the ratio of the standard deviation to
the absolute value of the mean.
$$\text{Sample } CV = \frac{S}{\bar{X}} \times 100\%$$
150
CV….
Shows the size of the variation relative to the
mean
152
Skewness
Skewness is the measure of asymmetry of the
distribution
153
Negatively skewed distribution;
– occurs when majority of scores at the right end of the
curve and a few small scores are scattered at the left
end.
Negatively skewed
In a unimodal negatively skewed distribution, the
mean, median and mode occur in alphabetical
order (mean < median < mode)
Positively skewed distribution
Occurs when the majority of scores are at the
left end of the curve and a few extreme large
scores are scattered at the right end.
Positively skewed
In a unimodal positively skewed distribution, the
mean, median & mode occur in reverse
alphabetical order (mode < median < mean)
Symmetrical distribution
It is neither positively nor negatively skewed.
– A curve is symmetric if one half of the curve
is the mirror image of the other half.
– The normal distribution is an example of a symmetrical distribution
158
Choice of Central tendency
The choice of which measure to use depends on:
The shape of the distribution (whether normal or
skewed)
159
Z-Score
(Relative Position)
160
Z-score
The z-score is the number of standard deviations
the data value falls either above or below the mean
for the data set.
– If above: positive z-score
– If below: negative z-score
It tells us the relative position of each value in
reference to mean
162
Sample Z-score
• The Sample z-score for a value x is given
by the following formula:
$$z = \frac{x - \bar{x}}{s}$$
• Where x̄ is the sample mean and s is
the sample standard deviation.
163
Population Z-score
• The Population z-score for a value x is
given by the following formula:
$$z = \frac{x - \mu}{\sigma}$$
Example
3 8 6 14 4 12 7 10
167
Solution
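1st step is determining the mean & standard
deviation.
$$\bar{X} = 8 \quad \& \quad S = 3.82$$
The z-score for the value 14 is then:
$$Z = \frac{X - \bar{X}}{S} = \frac{14 - 8}{3.82} = 1.57$$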
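A short Python sketch of the sample z-score, using the eight sample values from the example. The function name `z_score` is hypothetical; `statistics.stdev` computes the sample standard deviation (dividing by n - 1), which is the S used here.

```python
import statistics

# Sample values from the example.
sample = [3, 8, 6, 14, 4, 12, 7, 10]

def z_score(x, values):
    """Number of sample standard deviations x lies from the sample mean."""
    return (x - statistics.mean(values)) / statistics.stdev(values)

print(round(statistics.stdev(sample), 2))  # 3.82
print(round(z_score(14, sample), 2))       # 1.57
print(round(z_score(6, sample), 2))        # -0.52
```

A value below the mean, such as 6, gives a negative z-score.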
169
Z-score…
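What is the z-score for the value of 6
in the above sample values?
For the students' marks (mean 31.5, standard deviation 9.5), the z-score of a mark of 24 is:
$$Z = \frac{X - \bar{X}}{S} = \frac{24 - 31.5}{9.5} = \frac{-7.5}{9.50} = -0.79$$
The formula can be rearranged to find the value that corresponds to a given z-score:
$$Z = \frac{X - \bar{X}}{S} \;\Rightarrow\; X - \bar{X} = ZS \;\Rightarrow\; X = \bar{X} + ZS$$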
Example
From the above students’ marks, the mean is
31.5 & the standard deviation is 9.5. Find the
values that correspond to z-scores of -1.50
and 1.50
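As a worked sketch for the z-scores stated in this example, the relation X = X̄ + ZS (the z-score formula solved for X) gives:

$$X = 31.5 + (-1.50)(9.5) = 31.5 - 14.25 = 17.25$$
$$X = 31.5 + (1.50)(9.5) = 31.5 + 14.25 = 45.75$$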
174
Solution
For Z=-2.00
For Z=2.00
175
Normal Values
Normal values are values regarded as being
within the usual range of variation in a given
population or a set of data
[Figure: normal curve showing that about 68%, 95% and 99.7% of values lie within µ±1σ, µ±2σ and µ±3σ of the mean]
180
Example
Students have Biostatistics exam out of 100%
Mean = 75
SD = 5
Minimum = 50
Maximum = 95
181
Solution
68% = 1 SD: X = X̄ ± ZS
X = 75 ± 1(5) = (75 – 5, 75 + 5) = (70, 80)
95% = 2 SD
182
Can be presented by using the standard
normal curve
[Figure: normal curve of marks (%) divided into regions labelled F, D, C, B and A]
183
Probability
Lecture-7-8
By Gurmesa Tura (MPH)
April 2011,
AAU
184
Probability
Deterministic Vs Probabilistic explanation of
occurrences
185
Why Probability Theory?
• As we observe the universe around us, wonderful
Craftsmanship can be seen.
186
What is Probability?
• Probability is a branch of mathematics concerned with
the analysis of random phenomena (chance)
• is the mathematical framework for describing
(modelling) uncertainty
• Is a numerical measure of the likelihood that a specific
event will occur
• Probability theory provides a way to find and express our
uncertainty in making decisions about a population from
sample information
187
Probability…
• Probability theory began in the 16th and 17th
centuries
• European mathematicians began to analyze simple
games of cards and dice.
• One of the first attempts to use ideas of relative
frequency to study human populations was made by J. Graunt.
188
Common terms in probability
1. Experiment:
In statistics, anything that results in a count or
measurement is called an experiment.
E.g. tossing a coin, rolling a die etc
2. Sample Space (S):
The set of all possible outcomes of an experiment
e.g. in tossing a coin {H, T}
In rolling a die {1,2,3,4,5,6}
3. Event (E): a set of outcomes of a random phenomenon
(experiment),
i.e. any subset of the sample space
E.g. Getting even numbers {2,4,6}
Getting odd numbers {1,3,5}
189
Properties of Probability
1. Probabilities always lie between 0 and 1.
2. Zero probability implies that something is impossible.
3. A probability of 1 means something is certain.
4. The sum of all probabilities of a distribution is equal to
1.
190
Probability…
• Example if we say that the probability of getting sick
for a person is 0.25
• A probability of 0.25 (also expressed as 1/4, or 25%)
implies that we think that it is 3 times as likely not to
get sick as it is to get sick.
• This is because
– P(no sickness) = 1 - P(sickness) = 0.75
– 0.75/0.25 = 3.
191
Probability..
• Let A denote an event . Then,
192
Probability theories
Two views:
1. Objectivist (Frequentist) &
2. Subjectivist (Bayesian)
193
1. Frequentist (or Objectivist):
• Probabilities are real aspects of the world that can be
measured by relative frequencies of outcomes of
experiments
based on equally-likely events
based on long-run relative frequency of events
not based on personal beliefs
is the same for all observers (objective)
examples: toss a coin, throw a die, pick a card
Well accepted in statistics as compared to the
Bayesian (or Subjectivist)
194
2. Bayesian (or Subjectivist):
• Probabilities are descriptions of an observer's
degree of belief or uncertainty rather than
having any external significance
– based on personal beliefs, experiences, prejudices,
intuition - personal judgment
– different for all observers (subjective)
– examples: elections, new product introduction,
snowfall
197
Relative frequency probability (empirical):
• If some process is repeated a large number of times (n),
and some resulting event E occurs m times, the relative
frequency of E (m/n) will be approximately equal to the
probability of E.
198
In general,
• If there are “n” equally likely possibilities of
which one must occur and “S” are regarded as
favourable outcomes or successes, then the
probability of success is given by S/n
• i.e.
$$P(\text{success}) = \frac{\#\text{ of successes}}{\text{total }\#\text{ of outcomes}}$$
$$P(A) = \frac{\#\text{ of ways } A \text{ can occur}}{\text{total }\#\text{ of outcomes}}$$
199
Random Phenomena
We call a phenomenon random if:-
The exact outcome is not predictable in advance.
• Thus,
• A phenomenon is random, if individual outcomes are
uncertain but there is a regular distribution of outcomes in a
large number of repetitions.
200
e.g.
Coin tossing 100 times
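The idea of long-run relative frequency can be illustrated with a small simulation. This is a sketch under stated assumptions: a fair coin is assumed (probability of heads 0.5), the function name `relative_frequency_of_heads` is hypothetical, and Python's pseudo-random generator stands in for real tosses.

```python
import random

def relative_frequency_of_heads(n_tosses, seed=0):
    """Simulate n_tosses of a fair coin and return the proportion of heads."""
    rng = random.Random(seed)              # fixed seed for repeatable runs
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# Individual tosses are unpredictable, but the proportion of heads
# settles near 0.5 as the number of tosses grows.
for n in (100, 1_000, 100_000):
    print(n, relative_frequency_of_heads(n))
```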
201
Common terms in Relation of events
Set - a collection of elements or objects of interest
Empty set (denoted by ∅)
a set containing no elements
Universal set (denoted by S) = Sample space
a set containing all possible elements
Complement (Not). The complement of A, denoted A’, is
a set containing all elements of S not in A
• Intersection
• Union
• Mutually exclusive
• Partition
202
[Figure: Venn diagrams showing the elements of set A, its complement A’, and two sets A and B]
Union of sets
Union (Or)
a set containing all elements in A or B
or both
206
Mutually exclusive or disjoint sets
sets having no elements in common, having no
intersection, whose intersection is empty set
207
Partition
• a collection of mutually exclusive sets which
together include all possible elements, whose union
is the universal set
208
Rules of probability
1. For any event A, P(A) ranges from 0 to 1
P(A): 0 ≤ P(A) ≤ 1.
2. If A and B can never both occur at a time (they are
mutually exclusive), then
P(A and B) = P(A ∩ B) = 0
3. For any event A and event B,
P(A or B) = P(A ∪ B) = P(A) + P(B) – P(A ∩ B).
4. If A and B are mutually exclusive events, then
P(A or B) = P(A ∪ B) = P(A) + P(B).
5. For event A, the probability that it does not occur is
P(A’) = 1 – P(A).
6. If A and B are independent events, then
P(A and B) = P(A ∩ B) = P(A) P(B).
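These rules can be verified on the die-rolling sample space with Python sets and exact fractions. The helper `P` and the event `low` (rolling 1 or 2) are hypothetical choices made only for this illustration; `even` and `odd` are the events used earlier.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}                     # sample space for one die

def P(event):
    """Classical probability: favourable outcomes / total outcomes."""
    return Fraction(len(event & S), len(S))

even, odd, low = {2, 4, 6}, {1, 3, 5}, {1, 2}

# Rule 3: general addition rule.
assert P(even | low) == P(even) + P(low) - P(even & low)
# Rules 2 and 4: even and odd are mutually exclusive.
assert P(even & odd) == 0
assert P(even | odd) == P(even) + P(odd)
# Rule 5: complement.
assert P(S - even) == 1 - P(even)
# Rule 6: even and low happen to be independent on a fair die.
assert P(even & low) == P(even) * P(low)
print("all rules hold for these events")
```

In the code, `|` is set union and `&` is set intersection, matching ∪ and ∩ in the rules.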
209
Conditional Probability
• For non-independent events
• The probability that event B occurs given that
event A has occurred is called a conditional
probability.
$$P(B \mid A) = \frac{P(A \,\&\, B)}{P(A)} = \frac{P(A \cap B)}{P(A)}, \quad P(A) \neq 0$$
211
Example
• At a dinner party, 100 people participated. 60 of them ate
“Kitfo” and 40 of them ate roasted meat (“Tibs”). A day
later, 40 people developed food poisoning, 36 of whom were
among those who ate “Kitfo”.
212
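Draw contingency table
Type of food eaten (B)   Food poisoning (A)   Total
                         Yes      No
Kitfo                    36       24          60
Roasted meat (Tibs)      4        36          40
Total                    40       60          100
P(poisoning | Kitfo)
= 36/60 = 0.60 = 60%
P(poisoning | roasted meat)
= 4/40 = 0.10 = 10%
Equivalently, using proportions of all 100 participants:
P(poisoning | roasted meat)
= .04/.40 = 0.10 = 10%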
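The conditional probabilities can be computed directly from the cell counts of the example. In this sketch the dictionary layout and the function name `p_poisoning_given` are hypothetical; the counts follow from the example (60 ate Kitfo, 40 ate Tibs, 36 of the 40 cases ate Kitfo).

```python
# (food, outcome) -> number of people.
counts = {
    ("kitfo", "ill"): 36, ("kitfo", "well"): 24,
    ("tibs", "ill"): 4,   ("tibs", "well"): 36,
}

def p_poisoning_given(food):
    """P(poisoning | food) = people ill after that food / people who ate it."""
    ill = counts[(food, "ill")]
    ate = ill + counts[(food, "well")]
    return ill / ate

print(p_poisoning_given("kitfo"))  # 36 / 60
print(p_poisoning_given("tibs"))   # 4 / 40
```

Conditioning on the food eaten simply restricts the denominator to the row of the table for that food.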
214
Example 2
• Suppose in country “X” the chance that an infant lives to
age 25 is .95, whereas the chance that he lives to age 60
is .65. For the latter, it is understood that to survive to age
60 means to survive both from birth to age 25 and from
age 25 to 60.
215
Solution
Notation   Event                                          Probability
A          Survive birth to age 25                        .95
A & B      Survive birth to age 25 & age 25 to 60         .65
B/A        Survive age 25-60 given survived to age 25     ?
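As a worked sketch, the unknown entry follows from the definition of conditional probability, with P(A) = .95 and P(A & B) = .65 taken from the table:

$$P(B \mid A) = \frac{P(A \,\&\, B)}{P(A)} = \frac{0.65}{0.95} \approx 0.684$$

So a person who has survived to age 25 has roughly a 68% chance of surviving to age 60.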
• Two events A and B are said to be independent if the fact that A has
occurred or not does not affect your assessment of the probability of
B occurring.
• Conversely, the fact that B has occurred or not does not affect your
assessment of the probability of A occurring.
217
Example
• What is the probability that a pregnant
woman gives birth to a female child after having had a
female child before?
• Answer:
• The sex of the foetus is independent of the sex
of the previous child.
219
Counting of possible outcomes
• According to classical definition of probability, outcomes are
equally likely to occur.
• In this case the probability is determined as,
• Such as:
• Powers
• Permutations &
• Combinations
221
Counting ….
• We can have two approaches in determining
the number of possible outcomes
• If Order is considered
– With replacement = powers
– Without replacement = permutations
Counting …
Counting methods for computing probabilities
• Permutations (order matters!)
– With replacement
– Without replacement
• Combinations (order doesn’t matter)
– Without replacement
Counting with replacement
• With replacement: once an event occurs, it can
occur again (after you roll a 6, you can roll a 6
again on the same die).
• Example
– Assume you tossed a coin 3 times; what is the
probability that all 3 tosses are heads?
With replacement…
• Solution:
– Determine the total number of possible outcomes.
– As this is a small number of trials, we can use a probability tree
Replacement…
• What if there are 100 tosses? It is difficult to list and count all possible
outcomes. In this case we use the rule of powers.
General rule:
When order matters and sampling is with replacement,
for n possible outcomes per trial and r trials,
the total possible number of outcomes is given by
n to the power of r:
$n^r = (\text{\# possible outcomes per event})^{\text{\# of events}}$
Example:
• What is the total possible number of outcomes for tossing a coin 3 times?
– Solution:
• Possible outcomes per trial (H or T) = 2
• Number of trials = 3
• Total possible number of outcomes (sample space)
• S = n^r = 2^3 = 8
• The probability of getting heads in all 3 trials is 1/8
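For a small number of trials the sample space can also be listed by machine instead of drawing a probability tree. The sketch below uses Python's standard `itertools.product` to enumerate every ordered sequence of three tosses; it is an illustration of the rule of powers, not part of the original slides.

```python
from itertools import product

# Every ordered sequence of 3 tosses: n = 2 outcomes per toss, r = 3 tosses.
outcomes = list(product("HT", repeat=3))
all_heads = [o for o in outcomes if o == ("H", "H", "H")]

total = len(outcomes)                 # 2 ** 3 = 8
p_all_heads = len(all_heads) / total  # 1/8
print(total, p_all_heads)
```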
Without replacement
• Without replacement: once selected, an item cannot
be selected again
Example:
What is the total number of possible ways of picking 5 cards
from a deck of 52?
With replacement…
If it is with replacement, we have 52 choices in each of
the five trials,
i.e. 52 x 52 x 52 x 52 x 52 = 52^5
= 380,204,032 different possible outcomes
When order is not considered
• Suppose that we picked 3 letters out of the 6 letters A, B, C, D,
E, and F without replacement.
• Total ways = 6!/(6-3)! = 120
• From these, for example, the letters B, C & D
• can be ordered in 3! = 6 ways,
• i.e. BCD, or BDC, or CBD, or CDB, or DBC, or DCB.
Example above
• If we avoid order, how many combinations of
6 different letters, taking 3 at a time, are
possible?
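$$_nC_r = \frac{n!}{r!\,(n-r)!}, \quad n = 6,\ r = 3$$
$$_6C_3 = \frac{6!}{3!\,(6-3)!} = \frac{6 \times 5 \times 4 \times 3!}{3!\,3!} = \frac{6 \times 5 \times 4}{3 \times 2 \times 1} = 20$$
While considering order we had 6P3 = 120 ways, but when order is ignored there are only 6C3 = 20 combinations.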
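The three counting rules (powers, permutations, combinations) can be compared side by side. This sketch uses `math.perm` and `math.comb` from the Python standard library (available from Python 3.8); the variable names are illustrative.

```python
import math

ordered_with_replacement = 52 ** 5   # powers: 5 cards drawn with replacement
permutations_6_3 = math.perm(6, 3)   # order matters, no replacement: 6!/(6-3)!
combinations_6_3 = math.comb(6, 3)   # order ignored, no replacement: 6!/(3!3!)

# Each unordered group of 3 letters can be arranged in 3! ways.
assert permutations_6_3 == combinations_6_3 * math.factorial(3)
print(ordered_with_replacement, permutations_6_3, combinations_6_3)
```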
Exercise 1
• Suppose the department head tried to form a committee having a
group of 6 students among 200 medical students by listing their ID.NO.
Q4. Which one do you think is the best way for him to form the
committee? Why?
Exercise 2
• Suppose there are 100 2nd year medical students. 60 of
them are males and 40 females. 10 students were planned
to be selected for scholarship abroad to continue their
education. In how many ways can this be done if:
a. There is no restriction?
b. Two particular females should be included?
c. Five particular females can be included?
Random variable and
Probability distribution
Random variable
• A random variable is a numerical description of the outcomes
of the experiment or a numerical valued function defined on
sample space.
• Usually denoted by capital letters.
• It takes each possible outcome and assigns a number to it.
Random variable
• Random variables are of two types.
– Discrete random variable &
– Continuous random variable
Discrete random variables
• Are variables which can assume only a specific number of
values
• Example:
– Toss a coin n times and count the number of heads
– Number of children in a family
– Number of car accidents per week
– Number of malaria cases per month
– Etc….
Continuous random variables
• Are variables that can assume all values between any
two given values.
• Example:
– Height of students at a certain college
– Mark of students
– Weight of individuals in a certain community
– Etc…
Probability distribution
• The term probability distribution refers to the way data are
distributed, in order to draw conclusions about a set of data.
Probability distribution…
• A probability distribution of a random variable can be
displayed by a table or a graph or a mathematical
formula.
• With categorical variables, we obtain the frequency
distribution of each variable.
• With numeric variables, the aim is to determine whether
or not normality may be assumed.
• If not we may wish to consider transforming the variable,
or may wish to categorize the variable for analysis (e.g.
age groups).
Models of probability distribution
• For discrete random variables
– Binomial distribution
– Poisson distribution
Binomial Distribution
Binomial distribution
• A binomial distribution is a probability experiment that
satisfies the following four assumptions
Binomial dist….
• Suppose that n independent experiments, or trials, are
performed, where n is a fixed number, and that each
experiment results in a “success” with probability p and a
“failure” with probability 1-p.
Binomial dist…
• The probability that X=r (i.e., that there are exactly r
successes) is:
$$P(X = r) = \binom{n}{r}\, p^{r} (1-p)^{n-r}$$
Binomial dist…
Bernoulli trial:
• If there is only 1 trial with probability of
success p and probability of failure 1-p, this is
called a Bernoulli distribution.
• Special case of the binomial with n = 1
Probability of success: $P(X = 1) = \binom{1}{1}\, p^{1} (1-p)^{1-1} = p$
Probability of failure: $P(X = 0) = \binom{1}{0}\, p^{0} (1-p)^{1-0} = 1-p$
Example
• Assume a woman planned to have 6 children and the
probability of getting a male child is 50%.
$$P(X = 3) = \binom{6}{3}\, 0.5^{3} (1-0.5)^{6-3} = \frac{6!}{3!\,(6-3)!} \times 0.5^{3} \times 0.5^{3} = 20\,(0.5)^{3} (0.5)^{3} = 20 \times 0.125 \times 0.125 = 0.3125$$
The probability of getting exactly 3 male children in 6 pregnancies is .3125
b) Probability that at least 3 of them
are male children
• When we say at least 3 males, it could be 3, 4, 5 or 6
• i.e P(X≥3) =P(x=3)+P(X=4) + P(X=5)+P(X=6)
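$$P(X = 3) = \binom{6}{3}\, 0.5^{3} (1-0.5)^{3} = 0.313$$
$$P(X = 4) = \binom{6}{4}\, 0.5^{4} (1-0.5)^{2} = 0.234$$
$$P(X = 5) = \binom{6}{5}\, 0.5^{5} (1-0.5)^{1} = 0.094$$
$$P(X = 6) = \binom{6}{6}\, 0.5^{6} (1-0.5)^{0} = 0.016$$
$$P(X \geq 3) = 0.313 + 0.234 + 0.094 + 0.016 = 0.657$$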
$$P(X = 0) = \binom{6}{0}\, 0.5^{0} (1-0.5)^{6} = 0.016$$
$$P(X = 1) = \binom{6}{1}\, 0.5^{1} (1-0.5)^{5} = 0.094$$
$$P(X = 2) = \binom{6}{2}\, 0.5^{2} (1-0.5)^{4} = 0.234$$
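The binomial probabilities for the six-children example can be reproduced with a few lines of Python. The helper `binom_pmf` is a hypothetical name introduced here; it simply evaluates the binomial formula with `math.comb`. Summing the unrounded terms gives 0.65625 for "at least 3", which differs slightly from the hand total because the hand calculation adds terms already rounded to three decimals.

```python
from math import comb

def binom_pmf(r, n, p):
    """P(X = r) for X ~ Binomial(n, p)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)

n, p = 6, 0.5
pmf = [binom_pmf(r, n, p) for r in range(n + 1)]

p_exactly_3 = pmf[3]         # 20 * 0.125 * 0.125 = 0.3125
p_at_least_3 = sum(pmf[3:])  # P(3) + P(4) + P(5) + P(6)
print(round(p_exactly_3, 4), round(p_at_least_3, 3))
```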
• Case-control study:
– The number of cases that have had the exposure
– The number of controls that have had the
exposure
Example
Suppose you are performing a cohort study. The probability of
developing disease in the exposed group is 0.05 for the study
duration, and you randomly sample 500 exposed people.
Solution for Q 1
Given:
• n=500, p=0.05, Z=+/-1SD
• µx = E(X) = ?
• Expected cases within +/-1SD?
i.e. X ~ binomial (500, .05)
– µx = E(X) = np
– E(X) = 500 (.05) = 25
Solution 2
Given:
• N=500, p=0.05
• P(X≤10) =?
• P(X≤10) = P(X=0) + P(X=1) + P(X=2) + P(X=3) + P(X=4)+….+ P(X=10)
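Adding eleven binomial terms by hand is tedious, so the two solutions can be sketched in Python. The standard deviation uses the standard binomial result SD = sqrt(np(1-p)), which does not appear in the surviving slide text and is stated here as an assumption about what the "+/-1SD" question intends.

```python
from math import comb, sqrt

n, p = 500, 0.05  # X ~ Binomial(500, 0.05), as in the cohort example

mean = n * p                # E(X) = np = 25
sd = sqrt(n * p * (1 - p))  # standard binomial result: SD = sqrt(np(1-p))

# P(X <= 10) = P(X=0) + P(X=1) + ... + P(X=10)
p_at_most_10 = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(11))

print(mean, round(sd, 2), (round(mean - sd, 1), round(mean + sd, 1)))
print(p_at_most_10)
```

With an expected count of 25, observing 10 or fewer cases is a very unlikely outcome, which is what the small value of `p_at_most_10` reflects.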
Exercise
Suppose you are conducting a case-control study. Assume
the probability of being a smoker among a group of cases
with lung cancer is .6, and you sampled 10 cases for your
study.
Poisson Distribution
Poisson distribution
• The Poisson distribution is used to model discrete events
that occur infrequently in time and space
– i.e. rare events that occur at a constant rate.
– Examples: death rates, accident rates, incidence
rates of rare diseases.
• Our random variable will be the “number of occurrences
of the event over the region of opportunity for
occurrence in a given time”.
• Poisson distribution is for counts
Poisson…
• If events happen at a constant rate over time, the
Poisson distribution gives the probability of X number of
events occurring in time T.
Variance: $\sigma^{2} = \lambda$
Standard deviation: $\sigma = \sqrt{\lambda}$
Example
• Suppose X is a random variable representing the number of
individuals involved in a road accident each year in Ethiopia.
Assume the mean number of occurrence of road accident in
Ethiopia is 2.4 individuals per 1,000 population per year.
Solution
1. n=1,000, λ=2.4 per 1000, e ≈ 2.71, k = 5
• P(X=5)=?
$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$
$$P(X = 5) = \frac{(2.4)^{5} (2.71)^{-2.4}}{5!} = \frac{(79.63)(0.09)}{120} = 0.06 = 6\%$$
Solution to Q2
2. At most 3 accidents= P(X≤3)= ?
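$$P(X \leq 3) = \frac{2.4^{0}\, 2.71^{-2.4}}{0!} + \frac{2.4^{1}\, 2.71^{-2.4}}{1!} + \frac{2.4^{2}\, 2.71^{-2.4}}{2!} + \frac{2.4^{3}\, 2.71^{-2.4}}{3!}$$
$$= 0.09 + 0.22 + 0.26 + 0.21 = 0.78$$
The probability of three or fewer road accident cases per 1000 population is
0.78 = 78%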
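Both Poisson answers can be reproduced with a short Python sketch. The helper name `poisson_pmf` is introduced here for illustration; it uses `math.exp` for e rather than the rounded value 2.71, so the unrounded results (about 0.0602 and 0.779) agree with the hand calculation to two decimals.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for X ~ Poisson(lam)."""
    return lam**k * exp(-lam) / factorial(k)

lam = 2.4  # mean number per 1,000 population per year, from the example

p_exactly_5 = poisson_pmf(5, lam)
p_at_most_3 = sum(poisson_pmf(k, lam) for k in range(4))
print(round(p_exactly_5, 2), round(p_at_most_3, 2))
```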
“Poisson Process”
• Note that the Poisson parameter can be given as the
mean number of events that occur in a defined time
period OR,
• equivalently, can be given as a rate, in a given time
period so that we can multiply it by the required time =t
• This is called a “Poisson Process” and given as,
$$P(X = k) = \frac{(\lambda t)^{k} e^{-\lambda t}}{k!}$$
E(X) = λt
Var(X) = λt
Example
• Suppose new cases of measles are occurring at a
rate of about 2 per month per 100,000 under five
population in Ethiopia,
1) what’s the probability that exactly 4 cases of
measles will occur in the next 3 months in the
same population?
2) what’s the expected number of measles cases in
1,000,000 under five population in one year?
3) Give +/-2SD margin for the expected number of
cases.
Solution to Q1
1.Given λ=2 per 100,000 per month & t=3 months
P(X=4)=?
$$P(X = 4 \text{ in 3 months}) = \frac{(2 \times 3)^{4}\, 2.71^{-(2 \times 3)}}{4!} = \frac{(6)^{4}\, 2.71^{-6}}{24} = \frac{(1296)(0.0025)}{24} = 0.135 = 13.5\%$$
So, the probability that 4 new cases of measles occur in
3 months in 100,000 population is 0.135 =13.5%
Solution to Q2 & Q3
Q2. Given λ = 2 per month per 100,000
= (2/100,000) x 1,000,000
= 20 per month per 1,000,000
t = 1 year = 12 months
– E(X) = λt
– E(X in 12 months) = 20 x 12 = 240 cases
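The three measles questions can be sketched together in Python. The function name `poisson_process_pmf` is hypothetical. The +/-2SD margin for Q3 is not worked out in the surviving slide text; the sketch assumes SD = sqrt(λt), which follows from Var(X) = λt as stated for the Poisson process. Using the exact value of e gives about 0.134 for Q1, against 0.135 when e^-6 is rounded to 0.0025.

```python
from math import exp, factorial, sqrt

def poisson_process_pmf(k, rate, t):
    """P(X = k) when events occur at `rate` per unit time over `t` units."""
    mu = rate * t
    return mu**k * exp(-mu) / factorial(k)

# Q1: 2 cases per month per 100,000, over 3 months
p_four_cases = poisson_process_pmf(4, rate=2, t=3)

# Q2: 20 cases per month per 1,000,000, over 12 months
expected = 20 * 12   # E(X) = rate * t = 240
sd = sqrt(expected)  # Var(X) = rate * t, so SD = sqrt(rate * t)

# Q3: +/- 2 SD margin around the expected count
margin = (expected - 2 * sd, expected + 2 * sd)
print(round(p_four_cases, 3), expected, tuple(round(m, 1) for m in margin))
```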
Normal Distribution
Normal distribution
• Normal distributions are symmetric, single-peaked, bell-shaped
curves described by their mean (µ) and standard deviation (σ).
Normal dist…
• Under different circumstances, the outcome of a random
variable may not be limited to categories or counts.
– E.g. suppose X represents the continuous variable ‘Height’;
rarely is an individual exactly 170 cm tall
– X can assume an infinite number of intermediate values 170.1,
170.2, 170.3 etc.
• Because a continuous random variable X can take on an
uncountably infinite number of values, the probability
associated with any one particular value is effectively zero
Normal dist…
• As a continuous variable can take an infinite number of values,
it helps to visualize the probability distribution as a curve and
probabilities as ‘area under the curve’.
Normal distr…
Example
Finding probabilities of the standard normal
distribution: P(0 ≤ Z ≤ 1.56)
Procedure:
Look in the row labeled 1.5 and the column labeled .06 of the standard
normal table to find P(0 ≤ Z ≤ 1.56) = 0.4406
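The table lookup can be cross-checked in Python with `statistics.NormalDist` from the standard library (Python 3.8 or later). This is only a check of the table value, under the assumption that the table lists areas between 0 and z, as the example implies.

```python
from statistics import NormalDist

z = NormalDist()  # standard normal: mean 0, standard deviation 1

# The table gives the area between 0 and z, i.e. cdf(z) - cdf(0) = cdf(z) - 0.5
area_0_to_156 = z.cdf(1.56) - 0.5
print(round(area_0_to_156, 4))
```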
Standard Normal Probabilities
Example
• Let X be systolic blood pressure (for US population
aged 18-74 males) with μ = 129 mmHg and σ =
19.8 mmHg.
• Interpretation:
• The systolic blood pressure of 95% of US males aged 18-74
lies between 90.2 and 167.8 mmHg.
Solution to Q2
• Given: μ = 129 mmHg and σ = 19.8 mmHg
– % for SBP > 150mmHg
• To get %, find Z corresponding to 150
• Z = ( X – μ)/ σ = (150-129)/19.8 = 1.06
• P(Z>1.06)
Solution to Q3
• Lower 10% of SBP, 10% = 0.10
• Find Z from the table corresponding to 0.1
• To read from the table, 0.5-0.1=0.4
• Find the Z corresponding approximately to 0.4 from the table.
• 0.3997 corresponds to P(0≤Z ≤ 1.28)
• 0.1 corresponds to P(Z>1.28)
• As the lowest 0.1 is required, Z will be negative
• i.e. the lowest 0.1: P(Z<-1.28)
• X = μ + Zσ = 129 + (-1.28)(19.8) ≈ 103.7
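The blood pressure questions can be checked without a table using `statistics.NormalDist`. The sketch assumes SBP is exactly normal with the stated mean and standard deviation. Small differences from the hand calculation come from rounding z to two decimals: the unrounded 10th percentile is about 103.6 mmHg.

```python
from statistics import NormalDist

sbp = NormalDist(mu=129, sigma=19.8)  # SBP model from the example

# Q2: proportion with SBP above 150 mmHg, i.e. P(Z > (150 - 129) / 19.8)
p_above_150 = 1 - sbp.cdf(150)

# Q3: the value below which the lowest 10% of SBP readings fall
sbp_10th_percentile = sbp.inv_cdf(0.10)

# Central 95% interval: mu +/- 1.96 * sigma
low, high = sbp.mean - 1.96 * sbp.stdev, sbp.mean + 1.96 * sbp.stdev
print(round(p_above_150, 3), round(sbp_10th_percentile, 1), round(low, 1), round(high, 1))
```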
Exercise: try the following exercises and
compare your findings with the answers given
1. Find Probabilities of the Standard Normal
Distribution: P(Z < -2.47)
answer = 0.0068
2. Find Probabilities of the Standard Normal
Distribution: P(1≤ Z ≤ 2)
answer = 0.1359
3. Find Values of the Standard Normal Random
Variable: P(0 < Z < z) = 0.40
answer: the corresponding value is z = 1.28,
i.e. X = µ+1.28σ
Sampling Methods
Learning objectives…
• At the end of this lecture the students will be
able to:
– Define common terms used in sampling
– Distinguish the difference between probability and
non probability sampling
Common terms used in sampling
• Reference population (target population)
– The population of interest, to which the
investigator would like to generalize the results of
the study
• Source population
– From which the representative sample is to be
drawn
Common terms…
• Study or sample population
– The population included in the sample
• Sampling unit
– The unit of selection in the sampling process
• Study unit
– The unit on which information is collected
Common terms…
• Sampling frame
– The list of all the units in the reference population,
from which a sample is to be picked
Hierarchy of Sampling
[Figure: hierarchy of sampling; surviving labels: AA, WRA]
Why sampling?
• Feasibility: Sampling may be the only feasible
method of collecting the information.
• Reduced cost: Sampling reduces demands on
resources such as finance, personnel, and material.
• Greater accuracy: Sampling may lead to better
accuracy in data collection
Limitations of sampling…
• There is always a sampling error
Types of sampling
A. Probability sampling
– Subjects of the sample are chosen based on known (non-
zero chance) probabilities.
– Guarantees that every element in the population of
interest has the same probability of being chosen for the
sample as all other elements in the population; “random”
selection.
B. Non-probability sampling
– we do not know the probability that each population
element will be chosen, and/or
– we cannot be sure that each population element has a
non-zero chance of being chosen.
Main differences
Probability sampling | Non-probability sampling
• Every item has a chance of being selected. | • Not every item has a chance of being selected.
Types of Sampling Methods
Sampling
• Probability sampling
– Simple random
– Systematic
– Stratified
– Cluster
– Multistage
• Non-probability sampling
– Convenience
– Volunteer
– Purposive
– Quota
– Snowball
I. Probability Sampling
• A probability sampling method is any method of
sampling that utilizes some form of random selection.
Probability Sampling…
• The population of interest is clear (because it
must be identified before sampling from it.)
1. Simple random sampling
• Each sampling unit in the population has an equal chance of
being included in the sample.
• Steps
1. Define the population
2. Determine the desired sample size
3. List all members of the population or the potential
subjects (sampling frame)-we can use codes
4. Select the desired samples by simple random methods
we can apply methods like
Lottery method (sample drawn from box)
Table of random numbers (show the table)
Computer generated random numbers
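The "computer generated random numbers" option in the steps above can be sketched in Python with the standard `random` module. The frame of 200 coded IDs, the sample size of 20, and the function name `simple_random_sample` are all hypothetical values chosen for illustration.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units from the sampling frame without replacement, equal chance each."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

# Hypothetical sampling frame: coded IDs 1..200 stand in for the listed members.
frame = list(range(1, 201))
sample = simple_random_sample(frame, n=20, seed=2011)
print(sorted(sample))
```

Fixing the seed makes the draw reproducible, which is useful when the selection has to be documented.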
Advantages of SRS
• Each unit in the sampling frame has an equal
chance of being selected
Disadvantages of SRS
• Can be expensive and unfeasible for large
populations –need complete list.
2. Systematic random sampling
• Individuals are chosen at regular intervals from the sampling
frame
Steps:
1. Number the units on your frame from 1 to N
2. Determine the sampling interval (K) by dividing N by n. Example:
N=100, n=20, then K=N/n=100/20=5
3. Select a number between 1 and K at random. This number is
called the random start. Using the example above, you would
select a number between 1 and 5.
4. Select every Kth (in this case, every fifth) unit after the first
number.
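The systematic sampling steps translate into a few lines of Python. This is a sketch that assumes N is an exact multiple of n, as in the N=100, n=20 example; the function name `systematic_sample` and the numbered frame are hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Pick a random start in 1..K, then every K-th unit, where K = N // n."""
    N = len(frame)
    k = N // n                 # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)  # random start between 1 and K inclusive
    # Units are numbered from 1, so unit number u sits at index u - 1.
    return [frame[u - 1] for u in range(start, N + 1, k)][:n]

frame = list(range(1, 101))    # N = 100 numbered units (hypothetical frame)
sample = systematic_sample(frame, n=20, seed=1)
print(sample)
```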
Systematic random sampling…
Advantages of Systematic sampling
– Require no sampling frame
– Easier to perform
– Require less time than SRS
– Very good when the population from which
sample is to be drawn is homogeneously
distributed.
Disadvantage:
– If the frame has patterns or periodicity, the sample may be non-representative
3. Stratified Sampling
The population is first divided into groups of elements having similar
characteristics called strata.
Each element in the population belongs to one and only one stratum.
Best results are obtained when the elements within each stratum are
homogeneous.
Stratified Sampling…
A separate sample is then taken from each stratum by random
sampling
• Proportionate allocation
– The same sampling fraction is used for each stratum
• Non-proportionate allocation
– Different sampling fraction is used or
– Though the strata are unequal in size, a fixed number of
units is selected from each stratum
Advantages Stratified Sampling
• If strata are homogeneous, this method is as
“precise” as simple random sampling but with
a smaller total sample size
Disadvantages Stratified Sampling
• Can be difficult to select relevant stratification variables
• Can be expensive
Example
• Suppose that an organization (e.g. AAU) has 1800 (N) staff, from
which 400 (n) are to be selected proportionally:
– Male academic staff = 900
– Male administrative staff = 180
– Female academic staff = 90
– Female administrative staff = 630
Example…
By using the formula n_h = (N_h / N) x n, where N_h is the size of stratum h:
– Male academic staff = (900 / 1800) x 400 = 200
– Male administrative staff = (180 / 1800) x 400 = 40
– Female academic staff = (90 / 1800) x 400 = 20
– Female administrative staff = (630 / 1800) x 400 = 140
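The proportional allocation above can be written as a small Python function. The function name `proportional_allocation` and the stratum labels are illustrative; the stratum sizes and total sample size are those of the example.

```python
def proportional_allocation(strata_sizes, n):
    """Allocate a total sample n across strata in proportion to stratum size."""
    N = sum(strata_sizes.values())
    return {name: round(size / N * n) for name, size in strata_sizes.items()}

strata = {
    "male academic": 900,
    "male administrative": 180,
    "female academic": 90,
    "female administrative": 630,
}
allocation = proportional_allocation(strata, n=400)
print(allocation, sum(allocation.values()))
```

In this example the allocations are whole numbers and add up to 400. With other stratum sizes the rounded allocations may not sum exactly to n, so a final adjustment of one or two units can be needed.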
4.Cluster sampling
• Is a sampling technique used when "natural" groupings are
evident in a statistical population.
Cluster sampling…
Cluster samples are generally used if:
Cluster sampling…
Advantages:
• Sampling frame of the reference population is not required
(Sufficient to have a list of clusters)
• Cost effective
Disadvantage:
• Based on the assumption that the study units are uniformly
distributed throughout the reference population, which may
not always be the case.
• We do not have total control over the final sample size
5. Multistage sampling
• Used when the reference population is large and widely
scattered.
• Selection is done in stages until the final sampling units are
reached.
– Primary sampling units –from the first sampling stage
– Secondary sampling units- from the second sampling
stage etc..
• Finally study subjects will be selected by SRS
• No need of sampling frame for the reference population.
Multistage …
Advantage
• Cuts the cost of preparing the sample frame
Disadvantage
• Sampling error is high compared with simple random
sampling (so we need to use a design effect)
• Less precise estimation than SRS for the same sample size, but the
reduction in cost outweighs this and allows for a larger sample
size
Example Multistage …
• Suppose a researcher wanted to study the risk of
HIV/AIDS among AAU students and wanted to
include 1500 students. How can he go about it?
• Multi stage
– Primary sampling unit: Campus/college
– Secondary sampling unit: Departments
– Tertiary sampling unit: students
Multistage …
II. Non-Probability Sampling
Advantage
Non-Probability Sampling…
Disadvantages
• No random selection (non-representative)
• Reliability cannot be measured
• No way to measure the precision of the resulting
sample.
• Inappropriate for generalizing findings obtained from
a sample to the population.
Types of Non-Probability Sampling
1. Convenience sampling
2. Volunteer sampling
3. Purposive sampling
4. Quota sampling
5. Snowball sampling
1.Convenience/opportunity/accidental
sampling
• Selection of a sample based on easy accessibility and
convenience
2.Volunteer sampling
• As the term implies, this type of sampling occurs when people
volunteer their services for the study
• The sample is taken from a group of volunteers
• Sometimes, the researcher offers payment to entice
respondents
• Commonly used in psychological experiments or
pharmaceutical trials (drug testing),
• It is used because it would be difficult and unethical to enlist
random participants from the general public.
3.Purposive/Judgemental sampling
• The selection of a sample based on judgment and knowledge
of the subject
Quota ….
• Advantages
– Quota sampling is generally less expensive than random sampling.
– Easy to administer
– It is an effective sampling method when information is urgently
required and can be carried out independent of existing sampling
frames.
• Disadvantages
– It does not meet the basic requirement of randomness.
– Some units may have no chance of selection or the chance of
selection may be unknown. Therefore, the sample may be
biased.
5.Snowball sampling
• Snowball sampling is a special non-probability method used
when the desired sample characteristic is rare.
• lower cost
• But, biased
Involves two main steps.
1. Identify a few key individuals
2. Ask these individuals to volunteer to distribute the questionnaire
to people they know who fit the characteristics of the desired
sample
Errors In sampling
Sampling error
• A sample is a subset of a population.
Non sampling error
It is a type of systematic error in the design or
conduct of a sampling procedure which results
in distortion of the sample
• How to reduce/avoid
– Careful design of the sampling procedure, not
an increase in the sample size, and
– Testing the data collection tool
Thank You!
Sampling distribution and
sample size Determination
Sampling distribution….
• The sampling distribution of a statistic is the
probability distribution of all possible values the
statistic may assume, when computed from random
samples of the same size, drawn from a specified
population.
Sampling distribution….
Suppose that we calculate a sample mean (x̄) as an
estimate of the population mean (μ).
It is possible to select many samples of size n from a
population.
CENTRAL LIMIT THEOREM…
2. The mean of the distribution of sample means is equal to
the mean of the population distribution:
$\mu_{\bar{x}} = \mu$
CENTRAL LIMIT THEOREM…
3. The standard deviation of the distribution of sample
means is equal to the standard deviation of the
population divided by the square root of the sample
size:
$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$
where $\sigma_{\bar{x}}$ is the standard deviation of the
distribution of $\bar{x}$
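These two properties of the sampling distribution can be illustrated by simulation. The sketch below is not from the lecture: it assumes a hypothetical skewed population (exponential with mean 1 and standard deviation 1), draws 5,000 samples of size 30, and compares the mean and standard deviation of the sample means with µ and σ/√n.

```python
import random
from math import sqrt
from statistics import mean, stdev

rng = random.Random(42)
n, n_samples = 30, 5000

# Hypothetical skewed population: exponential with mean 1 and SD 1.
pop_mu, pop_sigma = 1.0, 1.0
sample_means = [
    mean(rng.expovariate(1.0) for _ in range(n)) for _ in range(n_samples)
]

mean_of_means = mean(sample_means)  # should be close to the population mean
sd_of_means = stdev(sample_means)   # should be close to sigma / sqrt(n)
print(round(mean_of_means, 3), round(sd_of_means, 3), round(pop_sigma / sqrt(n), 3))
```

Even though the population is far from normal, the simulated values settle near 1 and near 1/√30 ≈ 0.183.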
Too small:
– We may fail to detect important effects or
may estimate effects too imprecisely
– False conclusion
Too large:
– Unnecessary involvement of extra subjects
– High cost
– Time constraints
Sample size…
The main determinant of the sample size is, how
accurate the results need to be.
$$n = \frac{(Z_{1-\alpha/2})^{2}\, \sigma^{2}}{d^{2}}$$
Where:
n = minimum required sample size
Z = upper critical value of the distribution at the
1-alpha confidence level
d = margin of error
σ = population standard deviation
Finally we need to add 10% of n for non-response
Finite population correction
If the source population (N) is <10,000 or n>10% of
N, we need finite population correction
$$n_f = \frac{n}{1 + \frac{n}{N}}$$
Example
Assume a physician wants to study the systolic blood
pressure (SBP) of 20-39 years of age in a certain
country.
The normal values are μ = 120 mmHg & σ = 10 mmHg
How many people should he include in the study if he
wants the patients' SBP not to rise above
122 mmHg 95% of the time?
a. From source population of 50,000
b. From source population of 6,000.
Solution (a)
Given
Z=1.96 as confidence level is 95%
σ = 10 mmHg
d = 122 mmHg - 120 mmHg = 2 mmHg
$$n = \frac{(1.96)^{2} \times (10\ \text{mmHg})^{2}}{(2\ \text{mmHg})^{2}} \approx 96$$
Adding 10% for non-response: 96 + 0.1 x 96 = 105.6 ≈ 106 people
Solution for b.
As N=6000<10,000, we need population correction
$$n_f = \frac{106}{1 + \frac{106}{6000}} \approx 104 \text{ people}$$
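The single-mean calculation, the 10% non-response allowance and the finite population correction can be chained in a short Python sketch. The function names are hypothetical, and the rounding steps are chosen to mirror the worked solution (96.04 taken as 96, then 105.6 rounded up to 106).

```python
from math import ceil

def sample_size_mean(z, sigma, d):
    """n = z^2 * sigma^2 / d^2 for estimating a single mean."""
    return z**2 * sigma**2 / d**2

def finite_population_correction(n, N):
    """n_f = n / (1 + n / N)."""
    return n / (1 + n / N)

n0 = round(sample_size_mean(z=1.96, sigma=10, d=2))  # 96.04 -> 96
n_with_nonresponse = ceil(n0 * 1.10)                  # 105.6 -> 106
n_corrected = round(finite_population_correction(n_with_nonresponse, N=6000))
print(n0, n_with_nonresponse, n_corrected)
```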
2. For single proportion
Used when the outcome variable is categorical
$$n = \frac{(Z_{\alpha/2})^{2}\, p\, q}{d^{2}}$$
Where:
n = minimum required
sample size
Z = upper critical value of the
distribution at the 1-alpha confidence level
d = margin of error
p = expected proportion of the population
with the event of outcome (prevalence)
q = 1-p: the probability of non-occurrence
of the event of interest
Finally we need to add 10% of n for non-response
Single proportion…
We need also to use finite population correction here if
the source population is <10,000.
Example.
A survey is needed to estimate prevalence of influenza virus infection
in school children
Suppose the available evidence suggests that approximately 20% of
the children will have antibodies to the virus.
Assume the investigator wants to estimate the prevalence within 5% of
the true value.
a. Calculate the sample size assuming source population of 40,000
b. Calculate the sample size assuming source population of 4,000
Solution (a)
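$$n = \frac{(1.96)^{2} (0.2 \times 0.8)}{(0.05)^{2}} = 245.8 \approx 246$$
Adding 10% for non-response: 246 + 0.1 x 246 = 270.6 ≈ 271 children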
Solution (b)
As the source population is <10,000, we need
population correction
$$n_f = \frac{271}{1 + \frac{271}{4000}} = 253.8 \approx 254 \text{ children}$$
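The influenza example can be reproduced the same way. The sketch assumes the 271 used in solution (b) is the result of adding 10% for non-response to the 246 from solution (a); the function name `sample_size_proportion` is hypothetical.

```python
from math import ceil

def sample_size_proportion(z, p, d):
    """n = z^2 * p * (1 - p) / d^2 for estimating a single proportion."""
    return z**2 * p * (1 - p) / d**2

n0 = ceil(sample_size_proportion(z=1.96, p=0.20, d=0.05))  # 245.9 -> 246
n_with_nonresponse = ceil(n0 * 1.10)                        # 270.6 -> 271
n_corrected = round(n_with_nonresponse / (1 + n_with_nonresponse / 4000))
print(n0, n_with_nonresponse, n_corrected)
```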
Single proportion…
Usually we obtain ‘p’ from previous
similar studies or pilot test
Exercise
Suppose that a study is to be conducted to estimate the
smoking rate among adult males in Addis Ababa.
Assume that the current smoking rate among adult
males in Addis Ababa in general is about 27%. It was
desired that the estimated rate of smoking be within 3% of
that of the general population with 95% confidence.
a. Determine the required number of adult males to be
included in the study based on the above data.
Sample size formula for difference in means…
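$$n_1 = \frac{(r+1)}{r} \cdot \frac{\sigma^{2} (Z_{1-\beta} + Z_{1-\alpha/2})^{2}}{\text{difference}^{2}} = \frac{(r+1)}{r} \cdot \frac{\sigma^{2} (Z_{1-\beta} + Z_{1-\alpha/2})^{2}}{(\bar{X}_1 - \bar{X}_2)^{2}}$$
where:
n_1 = size of smaller group
r = ratio of larger group to smaller group
σ = standard deviation of the characteristic
difference = clinically meaningful difference in means of the outcome
Z_{1-β} corresponds to power (80% power, Z_{1-β} = 0.84)
For equal group sizes (r = 1):
$$n_1 = \frac{2 \sigma^{2} (Z_{1-\beta} + Z_{1-\alpha/2})^{2}}{\text{difference}^{2}}, \quad n_2 = n_1$$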
Example
Suppose the investigator wanted to compare the
mean hemoglobin level between adult males and adult
females. From a previous study, the mean hemoglobin level for
normal adult males is 15 g/100ml and that of normal females
is 13 g/100ml. The standard deviation is about 3 g/100ml.
Solutions (b)
$n_2 = n_1 = 35$
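The value of 35 per group can be reproduced from the difference-in-means formula. The sketch assumes 95% confidence (Z = 1.96), 80% power (Z = 0.84) and equal group sizes (r = 1), since the question text for this part did not survive; the function name `sample_size_two_means` is hypothetical.

```python
def sample_size_two_means(sigma, difference, z_alpha=1.96, z_beta=0.84, r=1):
    """n1 = ((r + 1) / r) * sigma^2 * (z_beta + z_alpha)^2 / difference^2."""
    return (r + 1) / r * sigma**2 * (z_beta + z_alpha) ** 2 / difference**2

# Hemoglobin example: means 15 vs 13 g/100ml, SD about 3 g/100ml, equal groups.
n1 = sample_size_two_means(sigma=3, difference=15 - 13)
print(round(n1, 2), round(n1))
```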
males a total of 106 study
4. Comparing two proportion
To compare two proportions we use the following formula
$$n = \frac{(Z_{1-\beta} + Z_{1-\alpha/2})^{2} \left( p_1 (1-p_1) + p_2 (1-p_2) \right)}{(p_1 - p_2)^{2}}$$
Solution
Given
P1=20% =0.2 (among on mixed feeding)
P2=15%=0.15 (among on exclusive breast feeding)
Confidence level=95%, Z1-ɑ=1.96
Power =80%, Z1-β=0.84
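$$n_1 = \frac{(0.84 + 1.96)^{2} \left( 0.2(0.8) + 0.15(0.85) \right)}{(0.2 - 0.15)^{2}} = \frac{7.84 \times 0.2875}{0.0025} = 901.6 \approx 902$$
$n_2 = n_1 = 902$
By adding 10% for non-response we need 992 infants in each
group, so a total of 1,984 infants will be included in the study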
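The two-proportion formula is easy to mis-evaluate by hand, so a short Python check is useful. The function name `sample_size_two_proportions` is hypothetical; the inputs are the proportions and Z values given in the solution. Note that with the squared difference in the denominator, (0.2 - 0.15)² = 0.0025, the formula evaluates to roughly 902 per group before any non-response allowance.

```python
def sample_size_two_proportions(p1, p2, z_alpha=1.96, z_beta=0.84):
    """n per group = (z_beta + z_alpha)^2 * (p1(1-p1) + p2(1-p2)) / (p1 - p2)^2."""
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_beta + z_alpha) ** 2 * variance_sum / (p1 - p2) ** 2

n_per_group = sample_size_two_proportions(p1=0.20, p2=0.15)
print(round(n_per_group, 1))
```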
Thank you!