
PART-I: DESCRIPTIVE STATISTICS

Lecture Note organized by


Gurmesa Tura (MPH)
PhD Candidate, AAU, SPH

April, 2011
Addis Ababa University
Introduction to Biostatistics
Lecturer:
Gurmesa Tura (MPH)

March 2011
AAU
Objectives
• At the end of this lecture, students will be able to:
– Define statistics and biostatistics
– Explain the roles of statistics in medicine
– Describe the types of data and scales of measurement
– Identify different methods of data collection

Contents

• Definition

• Types of statistics

• Roles of statistics

• Types of data & scales of measurement

• Data collection methods

4
What is statistics?
• The scientific study of numerical data based
on variation in nature. (Sokal and Rohlf)

• A set of procedures and rules for reducing large masses of data into manageable proportions, allowing us to draw conclusions from those data. (McCarthy)

Statistics…
• Statistics is the art and science of making decisions in the face of uncertainty

• Statistics is the science of collecting, summarizing, presenting and interpreting data, and of using them to test hypotheses.

• Biostatistics is statistics applied to biological and health problems

What are statistical data?
• Observation: information obtained from a single person

• Data: information gathered from a group of people

• Statistical data: the raw material or facts of any statistical observation, arising whenever measurements are made or observations are classified

Types of statistics
• Descriptive Statistics
– Collection,
– organization,
– summarization, and
– presentation of data.

• Inferential Statistics
– Generalizing from samples to populations using
probabilities.
– Performing hypothesis testing,
– Determining relationships between variables,
– Making predictions.
8
Why study statistics in medicine?

• Medicine and epidemiology are becoming increasingly quantitative

• Knowledge of statistics is required to design, conduct and analyse medical research

• It helps in better understanding the medical literature

Roles of statistics
• In clinical medicine
– Making clinical diagnosis
– Determining Rx and prognosis
– Handling variations (defining normal values and normal
ranges)
• In public health
– Community diagnosis
• In Research
– Designing and undertaking clinical & public health research

10
Uses of statistics
1. Collecting data in the best possible way

2. Describing the characteristics of a group or population

3. Analyzing and interpreting data

4. Making generalizations about populations based on studies of samples

Limitations of statistics
1. Statistics doesn’t deal with single (individual) values
– It deals only with aggregate values

2. Statistics can’t deal with qualitative characteristics
– It deals with data which can be quantified

3. Statistical conclusions are not universally true
– They are context specific

4. Statistical interpretations require a high degree of skill and understanding of the subject

Types of data
• Based on source :
– Primary & secondary data
1. Primary data
• Data collected by the investigator for the
purpose of specific study
• Original in character
• Mostly generated by surveys
• Complete, reliable and more accurate

13
Types of data…
2. Secondary data
• Data that the investigator uses but that were collected by others for another purpose

• Obtained from journals, reports, government publications, etc.

• Less expensive (less money & time)

• May be incomplete, of lower quality and less valid

Scales of measurement
• A variable is any aspect of an individual or thing that is measured and can take different values for different individuals or cases

• Variables are divided into two types
1. Qualitative (categorical) variables &
2. Quantitative (numerical) variables

Qualitative (categorical) variable
• A variable which cannot be measured in quantitative (numerical) form but can only be identified by names.

• It has two forms based on scales of measurement
– Nominal
– Ordinal

Nominal data
• Represent categories or names
• There is no order among the categories
• It has two forms:
– Dichotomous: has two categories
• E.g. Sex: male or female
» Immunization: yes or no
» Disease outcome: died or survived
– Multichotomous: has more than two categories
• E.g.
– Blood group: A, B, AB or O
– Marital status: single, married, divorced or widowed

Ordinal data
• Have order in the response categories
• But the distance or interval between categories is not necessarily equal
• E.g. Immunization status
» Not immunized
» Partially immunized
» Fully immunized
• Disease state
» Mild
» Moderate
» Severe
• Agreement questions
» Strongly agree
» Agree
» Indifferent
» Disagree
» Strongly disagree

Quantitative (numerical) variables
• Variables which assume numerical values, i.e. variables to which a number is assigned as a quantitative value
• Have two forms
– Discrete variables
• Variables which assume a finite or countable number of possible values
• Usually obtained by counting; no decimals
• E.g. household size, number of children

– Continuous variables
• Variables which can assume any value within an interval, so the number of possible values is infinite
• Usually obtained by measurement
• Can have decimals
• E.g. age, weight, height

Quantitative…
• Continuous variables…
• Have two scales of measurement

• Interval scale:
– Order and distance implied. Differences can be compared;
– no true zero.
– Ratios cannot be compared.
E.g. Temperature in Celsius.
» 0 °C does not mean that there is no temperature
» 40 °C is not twice as hot as 20 °C

• Ratio scale:
– Order and distance implied.
– Differences can be compared;
– has a true zero.
– Ratios can be compared.
– Examples: height, weight, blood pressure
• 40 cm is twice as long as 20 cm
• 0 cm is a true zero: it means no length at all

Discrete

21
Data collection
• The process of obtaining statistical data
• Before any statistical work can be done data must be
collected
• Collecting Primary data
– Observation
– Interview
– Use of self administered questionnaire

• Collecting secondary data
– Use of documentary sources

Observation
• Systematically selecting, watching and recording the behaviour of people or other phenomena, and aspects of the settings in which they occur

• Done for the purpose of obtaining specified observations

• Includes
– Visual observation
– Radiographic (x-ray), biomedical, microscopic and clinical examinations, etc.

Observation…
• It can also be used In observing behaviour
of people, culture etc.

• It could be
– Participant observation or
– Non-participant observation

24
Observation…
• Advantage
– More accurate data on behaviour or activity

• Disadvantages
– Observer bias
– Prejudice
– Desirability bias
– Needs skilled personnel when sophisticated equipment is used

25
Interviews
• Face to face interview

• Telephone interview

• Group interview or Focused Group Discussion (FGD)

• Self administered questionnaire

• Mailed questionnaire

• Computer interview

26
Face to face interview
• Advantage
– Permits detailed & in-depth questions & responses
– Minimizes non-response

• Disadvantage
– Costly
– Interviewer bias
– Investigator bias
– Interviewer cheating

27
Telephone interview
• Advantage
– Convenient
– Saves time
– Relatively inexpensive
– Less interviewer & investigator bias than personal
interview

• Disadvantage
– Non-coverage
– Limited length & depth of questions and responses

28
Self-administered Questionnaire
• Advantage
– Cost effective for large areas
– Minimizes interviewer bias
– Promotes accurate answers
– Data on sensitive issues can be gathered

• Disadvantage
– Low response rates
– Unanswered questions
– Incorrect answers
29
Mailed questionnaire
• Advantage
– Allows collecting data with out personal presence

• Disadvantage
– Low response rate
– Not applicable for illiterates
– Low coverage in rural areas

30
Use of documentary sources
• These include
– Clinical & other personal records
– Vital statistics
– Census data

• Sources
– Official publications of CSA
– Publications of MOH & other ministries
– News papers & journals
– International publications (WHO, UNICEF, etc)
– Health facilities’ records

31
Choosing method of data collection
• The choice of data collection method(s) depends on:
– Type of data we need
– Resources (time, personnel & facility)
– Accuracy & strength of the method
– Acceptability of the method to the subjects
– Background of the study subjects
– Etc.


Data organization & Presentation
Lecture 2
By Gurmesa Tura (MPH)
March 2011
AAU

Learning objectives
• At the end of this lecture, students will be able to:

– Identify different ways of data organization & presentation

– Construct the different forms of data organization and presentation

Methods of data organization
• The data collected in a survey are called raw data

• Information is not immediately evident from the mass of unsorted raw data

• The data need to be organized in such a way as to condense the information and show patterns and variations

• Techniques of data organization & presentation
• Ordered array
• Tables &
• Graphs

Ordered array
• A serial arrangement of numerical data in ascending or descending order

• Tells us the range of the data and their general distribution

• Appropriate only for small data sets (<20)

• Beyond 20 values we need to use frequency distributions or tables

Frequency distributions
• A table that shows data classified into a number of classes, with the corresponding number of observations falling in each class (frequency)

• Frequency is the number of times a certain value of the variable occurs in a given class.

• Two types
– Categorical frequency distribution
– Numerical frequency distribution

Categorical frequency distribution

• Used for data that can be placed in specific categories

• Used for nominal & ordinal data
– E.g. blood type, marital status etc.

• Example: A health worker collected data on the blood type of 30 individuals and recorded them as follows (hypothetical)

• O, A, AB, B, O, O, O, A, B, O, AB, B, B, A, AB, O, O, O, B, AB, O, A, AB, B, O, O, O, A, B, O

Procedures to construct the frequency distribution
• There are 4 types of blood group, so we have four classes
• Step 1: Make a table
• Step 2: Tally the data & place the result in the Tally column
• Step 3: Count the tallies & place the result in the Frequency column
• Step 4: Calculate the % for each class
% = f/n × 100
where f = frequency of the class &
n = total number of values
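
The four steps above can be sketched in Python. This is only an illustrative sketch: the variable names (`blood_types`, `table`) are assumptions made here, and `collections.Counter` stands in for the manual tally column.

```python
from collections import Counter

# Blood types of the 30 individuals listed in the example above.
blood_types = (
    "O A AB B O O O A B O AB B B A AB "
    "O O O B AB O A AB B O O O A B O"
).split()

counts = Counter(blood_types)   # Steps 2-3: tally the data and count the tallies
n = len(blood_types)            # total number of values

# Step 4: percentage for each class, % = f / n * 100
table = {
    group: (counts[group], round(counts[group] / n * 100, 1))
    for group in ("A", "B", "AB", "O")
}

for group, (f, pct) in table.items():
    print(f"{group:<6}{f:>4}{pct:>8.1f}")
print(f"{'Total':<6}{n:>4}{100.0:>8.1f}")
```

The class order A, B, AB, O is fixed by hand so the printed table follows the usual blood-group order rather than the order of first appearance.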

39
39
40
40
Numerical frequency distribution

• Here the classification criterion is quantitative

• It has two forms
– Ungrouped frequency distribution
• For discrete quantitative data

– Grouped frequency distribution
• For continuous quantitative data

Ungrouped frequency distribution

• A table of all the raw score values that could possibly occur in the data, along with the number of times each actually occurred

• Often used for small sets of data on discrete variables

Constructing an ungrouped frequency distribution
• First find the smallest & the largest values in the data
• Arrange the data in order of magnitude and count the frequency
• To facilitate counting one may include a column of tallies.

• Steps in constructing
• Step 1: Make the table
• Step 2: Tally the data
• Step 3: Count the frequency
• Step 4: Compute the percentage

• E.g. the following hypothetical data represent the family size of 50 households.

4, 6, 4, 3, 5, 2 , 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6, 2, 6, 4, 3, 5, 2 , 8, 10, 4, 4, 5, 3, 5, 8, 4, 4,
6, 2, 5, 2 , 8, 10, 4, 4, 5, 3, 10, 4, 5, 6, 3, 5, 6
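
As a rough sketch of the same steps, the ungrouped frequency distribution for these 50 family sizes can be built as follows. The names `family_sizes` and `freq` are illustrative assumptions, not part of the lecture.

```python
from collections import Counter

# Family size of the 50 households from the example above.
family_sizes = [
    4, 6, 4, 3, 5, 2, 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6, 2, 6, 4, 3, 5, 2, 8, 10,
    4, 4, 5, 3, 5, 8, 4, 4, 6, 2, 5, 2, 8, 10, 4, 4, 5, 3, 10, 4, 5, 6, 3, 5, 6,
]

n = len(family_sizes)
freq = dict(sorted(Counter(family_sizes).items()))  # value -> frequency, in order of magnitude

print("Size  Freq     %")
for size, f in freq.items():
    print(f"{size:>4}{f:>6}{f / n * 100:>6.1f}")
print(f"Total{n:>5}{100.0:>6.1f}")
```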

43
43
44
44
Grouped frequency distribution (GFD)
• A frequency distribution in which several values are grouped into one class

• Usually used when the range of the data is large

• Two types
– Inclusive
• Both class limits belong to the class, so the upper limit of one class does not coincide with the lower limit of the next class (e.g. 12-17, 18-23)
– Exclusive
• The upper limit of one class coincides with the lower limit of the next class (e.g. 10-20, 20-30), and a value equal to the upper limit is counted in the next class

Grouped freq. distr….
• Example: Consider the following ungrouped marks of 30 students (out of 50)

24 30 36 35 42 40 26 23

36 36 12 45 29 21 34 40

16 47 28 32 33 44 19 34

30 36 35 47 20 14

• Construct grouped frequency distribution for the above data


46
46
Guidelines for creating classes
1. There should be between 6 and 20 classes
2. The classes must be mutually exclusive, i.e. no data value falls into two different classes
3. The classes must be all-inclusive or exhaustive, i.e. all data values must be included
4. The classes must be continuous, i.e. there are no gaps in the frequency distribution
5. The classes must be equal in width
• The exception is the first or the last class
• It is possible to have a ‘Below…’ or ‘…and above’ class
• This is often used for ages

Steps in constructing a grouped frequency distribution
1. Find the largest & smallest values
2. Compute the range (R) = Maximum - Minimum
• From the above example, R = 47 - 12 = 35
3. Select the number of classes (usually 6-20) or use Sturges' rule
$k = 1 + 3.322 \log_{10} n$
where k is the desired number of classes &
n is the total number of observations
• k is rounded up if it is not a whole number
• From the example above (n = 30):
k = 1 + 3.322 log 30 (log 30 = 1.48)
k = 1 + 3.322(1.48) = 5.9, rounded up to 6
• So we need 6 classes
Steps…
4. Find the class width (w) by dividing the range by the number of classes, and round up (do not round off)
• From the example above, w = R/k = 35/6 = 5.8, rounded up to 6

5. Choose a suitable starting point, which is equal to the minimum value
– The starting point is called the lower limit of the 1st class
– Keep adding the class width to this lower limit to get the remaining lower limits

Steps..
• From the above example the lower class limits (LCL) will be:
• The starting point is the smallest value, 12, so:
• 1st lower limit = 12
• 2nd lower limit = 12 + 6 = 18
• 3rd lower limit = 18 + 6 = 24
• 4th lower limit = 24 + 6 = 30
• 5th lower limit = 30 + 6 = 36
• 6th lower limit = 36 + 6 = 42
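
Steps 1 to 5 can be checked with a short script. This is a minimal sketch assuming base-10 logarithms in Sturges' rule and rounding up with `math.ceil`; the variable names are illustrative.

```python
import math

# Marks of the 30 students from the example above.
marks = [24, 30, 36, 35, 42, 40, 26, 23,
         36, 36, 12, 45, 29, 21, 34, 40,
         16, 47, 28, 32, 33, 44, 19, 34,
         30, 36, 35, 47, 20, 14]

n = len(marks)
data_range = max(marks) - min(marks)        # Step 2: R = 47 - 12
k = math.ceil(1 + 3.322 * math.log10(n))    # Step 3: Sturges' rule, rounded up
w = math.ceil(data_range / k)               # Step 4: class width, rounded up

# Step 5: start at the minimum and keep adding the class width.
lower_limits = [min(marks) + i * w for i in range(k)]

print(data_range, k, w, lower_limits)
```

Note that when R/k is already a whole number, rounding up leaves w unchanged and the largest value may fall outside the last class, so the result should always be checked against the data.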

50
50
Steps…
6. Find the upper class limits (UCL):
UCL = LCL + (w - 1)
From the above example w = 6, so w - 1 = 5
– 1st UCL = 12 + 5 = 17
– 2nd UCL = 18 + 5 = 23
– 3rd UCL = 24 + 5 = 29
– 4th UCL = 30 + 5 = 35
– 5th UCL = 36 + 5 = 41
– 6th UCL = 42 + 5 = 47

The resulting classes, ready for tallying:

Classes   Tally   Freq   %
12-17
18-23
24-29
30-35
36-41
42-47
Total

Steps …
7. Make tally
8. Count the tally & fill frequency
9. Calculate & fill percentages
10. Find relative frequency (rf)
Rf=f/n
11. Find cumulative frequency (cf):
– Lcf : Less than cumulative frequency (<UCB)
– Gcf: Greater than cumulative frequency (>LCB)

52
52
By combining all the steps
Classes   Tally      Freq   %       rf     Cf (less than)   Cf (greater than)
12-17     ///        3      10.0    0.10   3                30
18-23     ////       4      13.3    0.13   7                27
24-29     ////       4      13.3    0.13   11               23
30-35     //// ///   8      26.7    0.27   19               19
36-41     //// /     6      20.0    0.20   25               11
42-47     ////       5      16.7    0.17   30               5
Total                30     100.0   1.00
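
The combined table can be reproduced with a few lines of Python. This is a sketch under the assumption that the six inclusive classes derived above are used; `itertools.accumulate` builds the less-than cumulative frequencies, and all names are illustrative.

```python
from itertools import accumulate

marks = [24, 30, 36, 35, 42, 40, 26, 23,
         36, 36, 12, 45, 29, 21, 34, 40,
         16, 47, 28, 32, 33, 44, 19, 34,
         30, 36, 35, 47, 20, 14]
classes = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41), (42, 47)]

n = len(marks)
# Steps 7-8: count how many marks fall in each inclusive class.
freqs = [sum(lo <= m <= hi for m in marks) for lo, hi in classes]
# Steps 9-10: percentage and relative frequency.
percents = [round(f / n * 100, 1) for f in freqs]
rel_freqs = [round(f / n, 2) for f in freqs]
# Step 11: cumulative frequencies.
less_than_cf = list(accumulate(freqs))
greater_than_cf = [n - c + f for c, f in zip(less_than_cf, freqs)]

for (lo, hi), f, p, rf, lcf, gcf in zip(
        classes, freqs, percents, rel_freqs, less_than_cf, greater_than_cf):
    print(f"{lo}-{hi}  {f:>2}  {p:>5.1f}  {rf:.2f}  {lcf:>2}  {gcf:>2}")
```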

53
53
Common terms used in grouped freq. distr. (GFD)
• Class interval: the range of scores grouped together in a GFD
• Class limits: the first & the last elements in a given class interval

• Unit of measurement (U): the distance between two consecutive measures
– U = LCL of the (n+1)th class - UCL of the nth class
– E.g. for 12-17 and 18-23, U = 18 - 17 = 1
– U is usually taken as 1, 0.1, 0.01, 0.001, …

Terms….
• Class boundaries: the values that separate one class in a GFD from another
• The boundaries have one more decimal place than the raw data and therefore do not appear in the data
• There is no gap between the upper boundary of one class and the lower boundary of the next class
• LCB = LCL - U/2
• UCB = UCL + U/2

– E.g. for 12-17 and 18-23, U = 18 - 17 = 1
• LCB for 18-23: 18 - 1/2 = 18 - 0.5 = 17.5
• UCB for 18-23: 23 + 1/2 = 23 + 0.5 = 23.5

Terms…
Class boundaries

Classes   Class boundaries   Freq   %
12-17     11.5-17.5
18-23     17.5-23.5
24-29     23.5-29.5
30-35     29.5-35.5
36-41     35.5-41.5
42-47     41.5-47.5
Total

• Class width (w) = UCB - LCB

Terms…
• Class mark (Xc)
– The midpoint of the class
– The average of LCL & UCL, or the average of LCB & UCB
– Xc = (LCL + UCL) / 2
E.g. Xc1 = (12 + 17)/2 = 29/2 = 14.5

Classes   Class mark (Xc)
12-17     14.5
18-23     20.5
24-29     26.5
30-35     32.5
36-41     38.5
42-47     44.5
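
The unit of measurement, class boundaries, class widths and class marks defined above can be computed directly from the class limits. The following is an illustrative sketch; the names `u`, `boundaries`, `widths` and `class_marks` are assumptions made here.

```python
# Class limits (LCL, UCL) from the grouped frequency distribution above.
classes = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41), (42, 47)]

# Unit of measurement: LCL of the second class minus UCL of the first class.
u = classes[1][0] - classes[0][1]

# LCB = LCL - U/2 and UCB = UCL + U/2
boundaries = [(lcl - u / 2, ucl + u / 2) for lcl, ucl in classes]
# Class width = UCB - LCB
widths = [ucb - lcb for lcb, ucb in boundaries]
# Class mark = (LCL + UCL) / 2
class_marks = [(lcl + ucl) / 2 for lcl, ucl in classes]

for (lcl, ucl), (lcb, ucb), xc in zip(classes, boundaries, class_marks):
    print(f"{lcl}-{ucl}  {lcb}-{ucb}  {xc}")
```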

57
57
Rules in constructing tables
1. Tables should be as simple as possible (6-20 categories)
2. Tables should be self-explanatory
• The title should be clear and to the point (it answers: what, when, where, how classified)
e.g. Table 1: Marks of 30 Medical students of AAU, March 2011, AA, Ethiopia.
• The title is placed above the table
3. Each row & column should be labelled
4. Numerical entries of zero should be written explicitly rather than indicated by a dash, as dashes are reserved for missing or unobserved data.
5. Totals should be indicated (last row, last column)
6. If the data are not original, their source should be given in a footnote.

Types of tables
• There are three different types of tables, based on the number of variables included

1. Simple or one-way table
– A single variable involved

2. Two-way table
– Two variables cross-tabulated

3. Higher-order table
– Three or more variables involved

Eg. One way
• Table 2: Immunization status of children in xxx woreda, 2010 (hypothetical)

Immunization status   Number   Percent
Immunized             135      64.3
Not immunized         75       35.7
Total                 210      100.0

Eg. Two way table
• Table 3: Immunization status by sex of children in xxx woreda, 2010 (hypothetical)

Sex of children   Immunized      Not immunized   Total
                  N      %       N      %        N      %
Male              85     65.4    45     34.6     130    100.0
Female            50     62.5    30     37.5     80     100.0
Total             135    64.3    75     35.7     210    100.0

Eg. Higher ordered table
• Table 4: Immunization status by sex and residence of children in xxx woreda, 2010 (hypothetical)

Sex      Residence   Immunized      Not immunized   Total
                     N      %       N      %        N      %
Male     Urban       55     68.7    25     31.3     80     100.0
Male     Rural       30     60.0    20     40.0     50     100.0
Female   Urban       40     66.7    20     33.3     60     100.0
Female   Rural       10     50.0    10     50.0     20     100.0
Total                135    64.3    75     35.7     210    100.0

Diagrammatic/Graphical
presentation of data

Lecture 3
By: Gurmesa Tura (MPH)

March 2011
AAU
63
Objectives
• At the end of the class, students will be able to:
– Identify the different types of graphs
– Choose among the graphs based on the data
– Construct the different types of graphs
– Identify the importance and limitations of using graphs

Graphical presentation of data
• Techniques for presenting data in visual displays using geometric figures and pictures.
• Importance
• Greater attraction
• Easily understandable
• Facilitate comparison
• May reveal unsuspected patterns in complex sets of data
• Greater memorizing value

Limitations
• Used only for the purpose of comparison
• Not an alternative to tabulation
• Can give only an approximate idea
• They fail to reveal very small differences

Types of graphs
• For qualitative & quantitative discrete data
• Bar chart
• Pie chart

• For quantitative continuous data
• Histograms
• Frequency polygon
• Cumulative frequency polygon (Ogive)

Bar chart
• A series of equally spaced bars of equal width (base), where the height of each bar represents the frequency (or amount) associated with each category.

• It can be either vertical or horizontal

• Three types based on the number of variables involved
– Simple bar chart
– Multiple bar chart
– Component bar chart

• Simple bar chart
From our previous example

Table 2: Immunization status of children in xxx woreda, 2010 (hypothetical)

Immunization status   Freq   %
Immunized             135    64.3
Not immunized         75     35.7
Total                 210    100.0

[Figure: vertical bar chart; x-axis: immunization status (immunized, not immunized); y-axis: number of children, 0 to 160]
Figure 1: Immunization status of children in xxx woreda, 2010 (hypothetical)

Multiple bar chart
• From the previous example

Table 3: Immunization status by sex of children in xxx woreda, 2010 (hypothetical)

Sex of children   Immunized      Not immunized   Total
                  N      %       N      %        N      %
Male              85     65.4    45     34.6     130    100.0
Female            50     62.5    30     37.5     80     100.0
Total             135    64.3    75     35.7     210    100.0

Multiple bar chart…

[Figure: two multiple bar charts, one showing the number of children and one the % of children by immunization status (immunized, not immunized), with separate bars for males and females]
Figure 2: Immunization status by sex of children in xxx woreda, 2010 (hypothetical)

Component bar chart
We can also construct a component bar chart for the above table

[Figure: two component bar charts, one showing the number of children and one the % of children, with one bar per sex (male, female)]
Figure 3: Immunization status by sex of children in xxx woreda, 2010 (hypothetical)

Pie chart
• A circle divided into sectors so that the areas of the sectors are proportional to the frequencies.

• The 360° of the circle are shared out according to each frequency's proportion of the total number of observations.

• Angle for class i = (fi/n) × 360°, or equivalently (% of the class / 100) × 360°

Pie chart…
Example:
Table 4: Blood type of 30 individuals in xxx woreda, 2010 (hypothetical)

Blood type   Freq.   %       Angle
A            5       16.7    5/30 × 360° = 60°
B            7       23.3    7/30 × 360° = 84°
AB           5       16.7    5/30 × 360° = 60°
O            13      43.3    13/30 × 360° = 156°
Total        30      100
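
The sector angles in the table follow directly from the rule (fi/n) × 360°. A minimal sketch, with illustrative names:

```python
# Blood type frequencies from the table above.
freqs = {"A": 5, "B": 7, "AB": 5, "O": 13}
n = sum(freqs.values())

# Angle of each sector in degrees: f * 360 / n
angles = {group: f * 360 / n for group, f in freqs.items()}

for group, angle in angles.items():
    print(f"{group:<3}{angle:>6.0f} degrees")
```

The angles always add up to 360, which is a quick check on the arithmetic before drawing the chart.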

74
Pie chart

[Figure: pie chart of blood type with sectors A 17%, B 23%, AB 17%, O 43%]
Figure 4: Blood type of 30 individuals in XXXX Woreda, 2010 (hypothetical)


Histograms
• A graph consisting of a series of rectangles whose bases equal the class width of the corresponding class and whose heights are proportional to the class frequencies

• Used for quantitative continuous data

1. The horizontal axis is a continuous scale running from one extreme end of the distribution to the other
• It should be labelled with the name of the variable & the units of measurement
2. For each class in the distribution, a vertical rectangle is drawn:
• The base of each rectangle is determined by the class width
• There are never gaps between the rectangles of the histogram

E.g. Consider the data on student marks
Table 5: Marks of 30 students, AAU, Ethiopia, 2011 (hypothetical data)

Classes   Class marks (Xc)   Frequency
12-17     14.5               3
18-23     20.5               4
24-29     26.5               4
30-35     32.5               8
36-41     38.5               6
42-47     44.5               5
Total                        30

Histograms

Figure 4: Histogram showing students’ marks, AAU, 2010 (hypothetical data)

Frequency Polygon
• Join the midpoints of the tops of the adjacent rectangles of the histogram with line segments

• When the polygon is closed on the x-axis, the area under the polygon is equal to the area under the histogram.

• The horizontal scale should be marked with the numerical values of the class midpoints (Xc)

• The length of each ordinate represents the class frequency.

Figure 5: Frequency polygon showing marks of 30 students, AAU, 2010 (hypothetical data)

Cumulative frequency polygon (Ogive)
• A line graph obtained by plotting the cumulative frequency distribution (y-axis) against the class boundaries (x-axis)

• Two types
– Cumulative frequency less than the UCB (Lcf), or
– Cumulative frequency more than the LCB (Mcf)
– We can also use the intersection of the two.

Construct the Ogive using the table from the above example

Classes   Class boundaries   Freq   Less than cf   More than cf
12-17     11.5-17.5          3      3              30
18-23     17.5-23.5          4      7              27
24-29     23.5-29.5          4      11             23
30-35     29.5-35.5          8      19             19
36-41     35.5-41.5          6      25             11
42-47     41.5-47.5          5      30             5
Total                        30

Less than Ogive

Figure 6: Less than Ogive showing marks of 30 students, AAU, 2010 (hypothetical data)

More than Ogive

Figure 7: More than Ogive showing marks of 30 students, AAU, 2010 (hypothetical data)

Less than & More than Ogive

Figure 8: More than & less than Ogives with their intersection, showing marks of 30 students, AAU, 2010 (hypothetical data)

Data summarization

Lecture 4-6
By Gurmesa Tura (MPH)
March 2011
AAU

87
Learning objectives
At the end of this lecture, students will be able to:
– Identify the different parameters for data summarization
– Differentiate between measures of central tendency and dispersion
– Calculate the commonly used measures of central tendency and measures of dispersion
– Interpret the final results of the measures

Data summarization
Although tables and graphs serve useful purposes, there are many situations that require other types of data summarization.

It is important to summarize data by means of just a few numerical measures before inferences or generalizations are drawn from the data.

This can be done by determining
– Measures of central tendency
– Measures of variation or dispersion

Measures of central Tendency
Numbers that tell us where the majority of values in the distribution are located.
The center of the probability distribution from which the data were sampled.
Also called measures of location
– Arithmetic mean
– Median
– Mode
– Geometric mean, and
– Harmonic mean.
The arithmetic mean, median and mode are commonly used and are the focus of this lecture.

The Arithmetic Mean
Arithmetic Mean = average

The arithmetic mean is the sum of the individual values in a data set divided by the number of values in the data set.

We can compute the mean of both a finite population and a sample

Mean…
{8, 5, 4, 12, 15, 5, 7}
What is the mean of these data?
Mean = (8 + 5 + 4 + 12 + 15 + 5 + 7) / 7
= 56/7 = 8
But what if the data set is large?

93
Population mean

94
Mean for a large discrete data set with a frequency distribution
When we have a large data set it is difficult to add the values manually.
In that case, multiply each value by its frequency, sum the products, and divide by the total frequency.

HH size   Freq
2         5
3         6
4         14
5         10
6         6
8         5
10        4
Total     50

Example: determine the mean HH size from the following table

HH size (xi)   Freq (fi)   fixi
2              5           10
3              6           18
4              14          56
5              10          50
6              6           36
8              5           40
10             4           40
Total          50          250

Mean = 250/50 = 5
According to these data, on average 5 people live in a household
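
The same calculation can be sketched in Python, assuming the frequency table is held in a dictionary (the name `hh_freq` is illustrative):

```python
# Household size (xi) -> frequency (fi), from the table above.
hh_freq = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}

total_f = sum(hh_freq.values())                     # sum of fi
total_fx = sum(x * f for x, f in hh_freq.items())   # sum of fi * xi
mean_hh_size = total_fx / total_f

print(total_f, total_fx, mean_hh_size)
```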
Mean for Grouped data
From our previous example, determine the mean mark of the students presented in the following table

Classes   Freq (fi)
12-17     3
18-23     4
24-29     4
30-35     8
36-41     6
42-47     5
Total     30

In this case we need to determine the midpoint (class mark, xi) of each class to represent the group, and then compute
$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$
where xi is the class mark

Grouped mean….

Classes   Freq (fi)   xi     fixi
12-17     3           14.5   43.5
18-23     4           20.5   82
24-29     4           26.5   106
30-35     8           32.5   260
36-41     6           38.5   231
42-47     5           44.5   222.5
Total     30                 945

Mean = 945/30 = 31.5
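
A short sketch of the grouped mean, using the class mark of each class as its representative value. The names are illustrative, and the result is only an approximation to the mean of the raw marks because each mark is replaced by its class midpoint.

```python
# Class limits and frequencies from the table above.
classes = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41), (42, 47)]
freqs = [3, 4, 4, 8, 6, 5]

class_marks = [(lo + hi) / 2 for lo, hi in classes]       # xi
fx = [f * x for f, x in zip(freqs, class_marks)]          # fi * xi
grouped_mean = sum(fx) / sum(freqs)

print(fx, sum(fx), grouped_mean)
```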
98
Characteristics of Arithmetic mean
1. Determined by every item in the series

2. Greatly affected by extreme values

3. The sum of the deviations about it is zero

4. The sum of the squares of the deviations from the arithmetic mean is less than that computed about any other point

Arithmetic mean…
Advantages
– Based on all values given in the distributions.
– Most amenable to mathematical treatment.
– It is most easily understood

Disadvantage
– Affected by extreme values in the distribution
– When the distribution has open-ended classes, its computation is based on an assumption and therefore may not be valid

Reading assignment
Geometric mean

Harmonic Mean

101
Median
Median = middle value

The median is defined as the “middle most” observation.

The median is the observation such that half the observations are above it and half are below it.
It is the 50th percentile point

Median for ungrouped data
To determine the median, the first step is to put the values in ascending order

E.g. consider the following data
{8, 5, 4, 12, 15, 5, 7} (not ordered)
4, 5, 5, 7, 8, 12, 15 (ordered)
Median = 7, the middle (4th) value

What if the data set is too large to be listed?

Median for grouped data
Class   Freq   cf
12-17   3      3
18-23   4      7
24-29   4      11
30-35   8      19
36-41   6      25
42-47   5      30
Total   30

The cumulative frequencies identify the median class, but they do not tell us the exact median value.
n = 30, so the median class is the class that contains the 15th & 16th observations, i.e. class 30-35

Median for grouped data…
To get the exact value within 30-35, we need another formula.

The formula for the median of a grouped frequency distribution:

$$\text{Median} = L_{med} + \frac{w}{f_{med}}\left(\frac{n}{2} - C\right)$$

Where:
Lmed = LCB of the median class
w = width of the median class
fmed = the frequency of the median class
n = total number of observations
C = cumulative frequency of the class preceding the median class

Example: from the previous table…

Class   Freq   cf
12-17   3      3
18-23   4      7
24-29   4      11
30-35   8      19
36-41   6      25
42-47   5      30
Total   30

Median class = 30-35
Lmed = 30 - 0.5 = 29.5
w = 35.5 - 29.5 = 6
fmed = 8
C = 11
n = 30

Median = Lmed + (w / fmed)(n/2 - C)
= 29.5 + 6/8 (30/2 - 11)
= 29.5 + 6/8 (15 - 11)
= 29.5 + 3
= 32.5
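
The grouped median formula can be written as a small function. This is a hedged sketch: `grouped_median` is a hypothetical helper written for this example, it assumes inclusive class limits with a unit of measurement of 1, and it takes the median class to be the first class whose cumulative frequency reaches n/2.

```python
def grouped_median(classes, freqs, unit=1):
    """Median = Lmed + (w / fmed) * (n/2 - C) for a grouped frequency distribution."""
    n = sum(freqs)
    cum = 0  # C: cumulative frequency of the classes before the current one
    for (lcl, ucl), f in zip(classes, freqs):
        if cum + f >= n / 2:             # first class reaching n/2 is the median class
            lcb = lcl - unit / 2         # Lmed: lower class boundary
            w = (ucl + unit / 2) - lcb   # class width = UCB - LCB
            return lcb + (w / f) * (n / 2 - cum)
        cum += f
    raise ValueError("empty frequency distribution")


classes = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41), (42, 47)]
freqs = [3, 4, 4, 8, 6, 5]
median = grouped_median(classes, freqs)
print(median)
```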
Median…
Characteristics
– An average of position
– Affected more by the number of items than by extreme values

Advantages
– Easy to calculate and more typical of the series
– The median may be located even when the data are incomplete,
e.g. when the class intervals are irregular and the final classes are open-ended
– Not affected by extreme observations

Disadvantages
– Not well suited to mathematical treatment
– Not as familiar as the arithmetic mean

Mode
Mode: the value that occurs most frequently
A given data set may have
– One mode = unimodal
E.g. 3, 3, 4, 4, 4, 5, 5, 5, 5, 6, 7, 8: the mode is 5
– Two modes = bimodal
E.g. 10, 11, 12, 12, 12, 13, 14, 15, 15, 15, 17: the modes are 12 & 15
– More than two modes = multimodal
– No mode at all = non-modal
E.g. 3, 4, 5, 7, 8, 10

Mode for ungrouped data
HH size   Freq
2         5
3         6
4         14
5         10
6         6
8         5
10        4
Total     50

The mode can be identified simply by selecting the observation with the largest frequency.
From these data the greatest frequency is 14, so the mode is 4

Mode for grouped data
Class   Freq
12-17   3
18-23   4
24-29   4
30-35   8
36-41   6
42-47   5
Total   30

Here the modal class, the class with the highest frequency, is 30-35.
We need to determine the exact value between 30 & 35 that represents the mode of the data.
It is determined by the formula given below.

Mode for grouped data…

$$\text{Mode} = L_{mo} + w\left(\frac{\Delta_1}{\Delta_1 + \Delta_2}\right)$$

Where:
– Lmo = LCB of the modal class
– w = the width of the modal class
– ∆1 = frequency of the modal class - frequency of the class preceding the modal class
– ∆2 = frequency of the modal class - frequency of the class following the modal class

Example

Class   Freq
12-17   3
18-23   4
24-29   4
30-35   8
36-41   6
42-47   5
Total   30

Modal class = 30-35
Lmo = 29.5
w = 35.5 - 29.5 = 6
Frequency of the modal class = 8
Frequency of the class preceding the modal class = 4
Frequency of the class following the modal class = 6
∆1 = 8 - 4 = 4 & ∆2 = 8 - 6 = 2

Mode = Lmo + w(∆1 / (∆1 + ∆2))
= 29.5 + 6(4/(4 + 2))
= 29.5 + 6(4/6)
= 33.5
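
The grouped mode formula can be sketched the same way. `grouped_mode` is a hypothetical helper written for this example; it assumes inclusive class limits with a unit of 1, picks the first class with the highest frequency, and treats a missing neighbouring class as having frequency 0 (an assumption not stated in the lecture).

```python
def grouped_mode(classes, freqs, unit=1):
    """Mode = Lmo + w * d1 / (d1 + d2) for a grouped frequency distribution."""
    i = max(range(len(freqs)), key=lambda j: freqs[j])   # index of the modal class
    prev_f = freqs[i - 1] if i > 0 else 0                # assumption: 0 if no class before
    next_f = freqs[i + 1] if i < len(freqs) - 1 else 0   # assumption: 0 if no class after
    d1 = freqs[i] - prev_f
    d2 = freqs[i] - next_f
    lcl, ucl = classes[i]
    lcb = lcl - unit / 2          # Lmo: lower class boundary of the modal class
    w = ucl - lcl + unit          # class width = UCB - LCB
    return lcb + w * d1 / (d1 + d2)


classes = [(12, 17), (18, 23), (24, 29), (30, 35), (36, 41), (42, 47)]
freqs = [3, 4, 4, 8, 6, 5]
mode = grouped_mode(classes, freqs)
print(mode)
```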
115
Characteristics of mode
Is an average position

Not affected by extreme values

The most typical value of the distribution

116
Advantage & disadvantage of mode
Advantages
– Since it is the most typical value, it is the most descriptive average
– Since the mode is usually an actual value, it indicates the precise value of an important part of the series.
– It is not affected by extreme values

Disadvantages
– It is not capable of mathematical treatment
– It has no significance for small samples
– With a small number of items the mode may not exist

Measures of Dispersion

118
Dispersion
To make use of the information provided by a set of data, knowing only a location or average value of the data is not adequate.

We also need to know the dispersion, or variability, of the data.

Common measures of variability:
– Range,
– Inter-quartile range,
– Variance,
– Standard deviation and
– Coefficient of variation

Range
The range is defined as the difference between the largest and the smallest observations in the data set.

Range (R) = Largest observation (L) - Smallest observation (S)

R = L - S

Range…
Eg. From the HH size
4, 6, 4, 3, 5, 2 , 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6,
2, 6, 4, 3, 5, 2 , 8, 10, 4, 4, 5, 3, 5, 8, 4, 4, 6,
2, 5, 2 , 8, 10, 4, 4, 5, 3, 10, 4, 5, 6, 3, 5, 6

Largest (L)= 10 & Smallest (s) = 2

R = 10 -2 = 8

121
Example from students' marks

Classes   Freq
12-17     3
18-23     4
24-29     4
30-35     8
36-41     6
42-47     5
Total     30

R = L - S
R = 47 - 12 = 35

Range..
Advantages
– Computation is simple
– Easy to understand

Disadvantages
– It does not consider all values
– A poor measure of dispersion

Interpretation of the range depends on the number of observations: as the number of observations increases, the range can get larger

Quantiles
When a distribution is arranged in order of magnitude, the median is the value of the middle term.

Other measures that depend on position in the ordered distribution, such as quartiles, deciles & percentiles, are collectively called quantiles.

Quartiles
Quartiles are values which divide the
distribution into four parts such that there are an
equal number of observations in each part.

– First Quartile (Q1) = a value at which 25% are less
than or equal to it.

– Second Quartile (Q2) = a value at which 50% are
less than or equal to it. This is the median.

– Third Quartile (Q3) = a value at which 75% are
less than or equal to it.

125
Calculating quartiles for ungrouped data

First arrange the data in increasing order


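Then, use the following formula:

$$Q_i = \frac{i}{4}(n+1)^{\text{th}} \text{ value}, \quad i = 1, 2, 3$$

then

$$Q_1 = \frac{1}{4}(n+1)^{\text{th}} \text{ value}$$

$$Q_2 = \frac{2}{4}(n+1)^{\text{th}} \text{ value}$$

$$Q_3 = \frac{3}{4}(n+1)^{\text{th}} \text{ value}$$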

126
Example

$$Q_i = \frac{i}{4}(n+1)^{\text{th}} \text{ value}$$

HH size Frq. Cf
2 5 5
3 6 11
4 14 25
5 10 35
6 6 41
8 5 46
10 4 50
Total 50

Q1 = 1/4(50+1)th value
= 1/4(51) = 12.75th value
= 12th value + 0.75 x (13th value - 12th value)
= 4 + 0.75(4-4) = 4

Q2 = 2/4(50+1)th value
= 2/4(51) = 25.5th value
= 25th value + 0.5 x (26th value - 25th value)
= 4 + 0.5(5-4) = 4.5

Q3 = 3/4(50+1)th value
= 3/4(51) = 38.25th value
= 38th value + 0.25 x (39th value - 38th value)
= 6 + 0.25(6-6) = 6

127
Inter-quartile range (IQR)
Inter-quartile range is the difference between the third
and the first quartiles.

IQR = Q3-Q1
= 6-4 = 2

This tells us how far apart or how close to each other
the third and the first quartiles are
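The (i/4)(n+1)th-value rule and the IQR for the household-size data can be sketched as follows. This is a minimal illustration assuming linear interpolation between neighbouring ordered values, as in the worked example; the function name `quartile` is hypothetical.

```python
def quartile(values, i):
    """Return the i-th quartile (i = 1, 2, 3) as the (i/4)(n+1)-th ordered value."""
    data = sorted(values)
    n = len(data)
    position = i * (n + 1) / 4      # 1-based position; may be fractional
    whole = int(position)           # e.g. 12 for the 12.75th value
    fraction = position - whole     # e.g. 0.75
    if whole < 1:
        return data[0]
    if whole >= n:
        return data[-1]
    lower, upper = data[whole - 1], data[whole]
    return lower + fraction * (upper - lower)


# Rebuild the 50 household sizes from the frequency table (HH size: frequency).
freq = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}
hh_sizes = [size for size, f in freq.items() for _ in range(f)]

q1, q2, q3 = (quartile(hh_sizes, i) for i in (1, 2, 3))
print(q1, q2, q3, q3 - q1)  # 4.0 4.5 6.0 2.0
```

Other software may use slightly different interpolation conventions, so quartiles from a statistics package can differ a little from this hand method.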

128
Calculating quartiles for Grouped data
First find the class in which Qi lies.
This can be obtained by counting to the (i x n/4)th value,
beginning from the lowest class.

Classes Freq cf
12-17 3 3
18-23 4 7
24-29 4 11
30-35 8 19
36-41 6 25
42-47 5 30
Total 30

Class for Q1 = 1(30)/4 = 7.5th
= class of 7.5th value
= 24-29

Class for Q2 = 2(30)/4 = 15th
= class of 15th value
= 30-35

Class for Q3 = 3(30)/4 = 22.5th
= class of 22.5th value
= 36-41

129
Then, we use the formula
$$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{in}{4} - C\right), \quad i = 1, 2, 3$$

Where:
– LQi = Lower class boundary of the quartile class
– w = width of the quartile class
– n = total number of observations
– fQi = frequency of the quartile class
– C = cumulative frequency preceding the quartile
class
130
Solution
$$Q_i = L_{Q_i} + \frac{w}{f_{Q_i}}\left(\frac{in}{4} - C\right), \quad i = 1, 2, 3$$

Q1 = 23.5 + 6/4(1(30)/4 – 7)
= 23.5 + 1.5(7.5-7) = 23.5 + 0.75 = 24.25

Q2 = 29.5 + 6/8(2(30)/4 – 11)
= 29.5 + 6/8(15-11) = 29.5 + 3 = 32.5

Q3 = 35.5 + 6/6(3(30)/4 – 19)
= 35.5 + (22.5-19) = 35.5 + 3.5 = 39

IQR = Q3 - Q1
= 39 - 24.25 = 14.75
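The grouped-data quartile formula can be checked with a short Python sketch. The representation of each class as a `(lower class boundary, frequency)` pair and the function name `grouped_quartile` are assumptions made for this illustration. Note that for Q1 the factor w/f is 6/4 = 1.5, which gives 24.25.

```python
def grouped_quartile(classes, i, width):
    """Q_i = L + (w / f) * (i*n/4 - C) for grouped data.

    classes: list of (lower_class_boundary, frequency), lowest class first.
    """
    n = sum(f for _, f in classes)
    target = i * n / 4               # position of the quartile, e.g. 7.5
    cumulative = 0                   # C: cumulative frequency before the class
    for lower_boundary, f in classes:
        if cumulative + f >= target:
            return lower_boundary + width / f * (target - cumulative)
        cumulative += f
    raise ValueError("quartile position lies beyond the last class")


# Students' marks: classes 12-17, 18-23, ..., 42-47 (class width 6).
marks = [(11.5, 3), (17.5, 4), (23.5, 4), (29.5, 8), (35.5, 6), (41.5, 5)]

q1, q2, q3 = (grouped_quartile(marks, i, width=6) for i in (1, 2, 3))
print(q1, q2, q3, q3 - q1)  # 24.25 32.5 39.0 14.75
```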
131
Box plot
A graphical display that involves a five-number
summary of a distribution of values, consisting
of
– the minimum value,
– the first quartile,
– the median,
– the third quartile, and
– the maximum value

132
Box plots
• It could be vertical or horizontal

• A vertical box-plot is constructed by drawing
a box between the quartiles Q1 and Q3.

• Vertical lines are then drawn from the middle
of the top and bottom sides of the box to the
maximum and minimum values.

133
Box plots…
• These lines are called
whiskers.

• A horizontal line inside the box marks the
median.

• Outliers are usually indicated by a dot
or an asterisk.

134
Box plot...
Putting IQR in diagrammatic form

Maximum = 47

Q3 =39
Q2= median =32.5

Q1 = 24.25

Minimum =12

135
Purpose of box plot
Shows center of distribution (median)

Tells the spread of the distribution

Tells the shape (symmetry or skewness) of the distribution

136
IQR..
Advantages
– It is simple and versatile measure
– It encloses the central 50% of the observation
– Less prone to distortion by a single large or
small value

Disadvantage
– It is not based on all observations but only on
two specific values

137
Reading assignments

Deciles

Percentiles

138
Variance & standard deviation
Variance and standard deviation are other
measures of dispersion

They measure how close the data values are to
each other (how far they spread around the mean)

They are denoted as:
σ² = Population variance
S² = Sample variance
σ: Population standard deviation
s: Sample standard deviation
139
Variance
Population variance (σ²) is computed by squaring the
deviation of each observation from the mean, adding
them, and dividing their sum by N:

$$\sigma^2 = \frac{\sum_{i=1}^{N}(X_i - \mu)^2}{N}$$

Sample variance (S²) is computed by squaring each
deviation, adding them, and dividing their sum by one
less than n:

$$S^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$$
140
Standard deviation
Standard deviation is the square root
of the variance

$$S = \sqrt{S^2}$$

$$\sigma = \sqrt{\sigma^2}$$

141
142
For frequency distribution
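Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N} f_i (X_i - \mu)^2}{N}$$

Population standard deviation:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} f_i (X_i - \mu)^2}{N}}$$

Sample variance:
$$S^2 = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n-1}$$

Sample standard deviation:
$$S = \sqrt{\frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n-1}}$$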
143
Example
Calculate the variance & Standard
deviation for the following Table
HH size Freq
2 5
3 6
4 14
5 10
6 6
8 5
10 4
Total 50

First step is calculating the mean.
As calculated above, the mean is 5.
144
Solution
Mean = 5

HH size  Freq (f)  (Xi − X̄)  (Xi − X̄)²  f(Xi − X̄)²
2 5 -3 9 45
3 6 -2 4 24
4 14 -1 1 14
5 10 0 0 0
6 6 1 1 6
8 5 3 9 45
10 4 5 25 100
Total 50 234

Variance
$$S^2 = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n-1}$$
S² = 234/(50-1)
= 234/49
= 4.8

Standard deviation
S = √4.8 = 2.2
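The table above can be reproduced with a short Python sketch. It is an illustration only: the dictionary `freq` (HH size mapped to frequency) is typed in from the slide, and the variable names are hypothetical.

```python
import math

# HH size -> frequency, copied from the table above.
freq = {2: 5, 3: 6, 4: 14, 5: 10, 6: 6, 8: 5, 10: 4}

n = sum(freq.values())                                       # total observations
mean = sum(x * f for x, f in freq.items()) / n               # weighted mean
sum_sq = sum(f * (x - mean) ** 2 for x, f in freq.items())   # sum of f(Xi - mean)^2
variance = sum_sq / (n - 1)                                  # sample variance
sd = math.sqrt(variance)                                     # sample standard deviation

print(n, mean, sum_sq, round(variance, 1), round(sd, 1))  # 50 5.0 234.0 4.8 2.2
```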


145
Variance & SD for
grouped frequency
distribution

146
For grouped frequency distribution

We use the class mark (mid point) of the class
to represent the class

E.g. calculate variance & standard
deviation for the 30 students' marks

First step is determining the mean
Determined above (31.5)

147
$$\text{Variance} = S^2 = \frac{\sum_{i=1}^{n} f_i (X_i - \bar{X})^2}{n-1}$$
= 2598/(30 - 1) = 2598/29 = 89.6

SD = S = √89.6 = 9.5
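For the grouped marks, the same calculation uses the class mark of each class in place of the individual values. The sketch below assumes each class is stored as a `(lower limit, upper limit, frequency)` triple; that layout is chosen for the illustration and does not come from the lecture.

```python
import math

# (lower limit, upper limit, frequency) for the 30 students' marks.
classes = [(12, 17, 3), (18, 23, 4), (24, 29, 4),
           (30, 35, 8), (36, 41, 6), (42, 47, 5)]

# Represent every class by its class mark (mid point).
marks = [((low + high) / 2, f) for low, high, f in classes]

n = sum(f for _, f in marks)
mean = sum(x * f for x, f in marks) / n
sum_sq = sum(f * (x - mean) ** 2 for x, f in marks)
variance = sum_sq / (n - 1)
sd = math.sqrt(variance)

print(mean, sum_sq, round(variance, 1), round(sd, 1))  # 31.5 2598.0 89.6 9.5
```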
148
Importance of variance & Standard deviation

The greater the variation in the data set, the


larger the magnitude of these deviations will
tend to be.

The basic question being asked is how much do


the scores deviate around the Mean?

The more “bunched up” around the mean the


better your ability to make accurate predictions.

149
Coefficient of Variation (CV)
CV is the ratio of the standard deviation to
the absolute value of the mean.

$$CV = \frac{\text{Standard deviation}}{\text{Mean}} \times 100\%$$

$$\text{Population } CV = \frac{\sigma}{\mu} \times 100\%$$

$$\text{Sample } CV = \frac{S}{\bar{X}} \times 100\%$$
150
CV….
Shows the size of the variation relative to the
mean

CV is a good measure of relative variation

The higher the CV, the higher the variability in
the data set and the lower the precision of the
data, and vice versa.

Commonly used for comparison of different data
sets (different samples)
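As an illustration of comparing two data sets, the sketch below applies the CV formula to the rounded results obtained earlier in this lecture (household size: mean 5, S = 2.2; students' marks: mean 31.5, S = 9.5). The function name `coefficient_of_variation` is hypothetical.

```python
def coefficient_of_variation(sd, mean):
    """CV (%) = standard deviation / |mean| x 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return sd / abs(mean) * 100


cv_hh = coefficient_of_variation(2.2, 5)        # household size
cv_marks = coefficient_of_variation(9.5, 31.5)  # students' marks
print(round(cv_hh, 1), round(cv_marks, 1))  # 44.0 30.2
```

Although the marks have the larger standard deviation, their CV is smaller, so relative to its mean the household-size data are the more variable of the two.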
151
Skewness

152
Skewness
Skewness is the measure of asymmetry of the
distribution

If extremely low or extremely high observations


exist in distribution, then the mean tends to shift
towards those scores.

Based on the type of skewness the distribution
can be
– Negatively skewed
– Positively skewed
– Symmetrical (e.g. normally distributed)

153
Negatively skewed distribution;
– occurs when the majority of scores are at the right end of the
curve and a few small scores are scattered at the left
end.

Negatively skewed
In unimodal negatively skewed distribution,
Mean, median and mode occur in alphabetic
order (mean < median < mode)
154
Positively skewed distribution
Occurs when the majority of scores are at the
left end of the curve and a few extreme large
scores are scattered at the right end.

Positively skewed
In unimodal positively skewed distribution,
Mean, median & mode occur in reverse
alphabetical order (mode < median < mean)
155
Symmetrical distribution
It is neither positively nor negatively skewed.
– A curve is symmetric if one half of the curve
is the mirror image of the other half.
– The normal distribution is the best-known example

In unimodal symmetric distribution
Mean, median and mode are identical.
156
Example
1. Data on birthweight were collected from 1000
neonates in Woreda “A” and summarized as:
– Mean = 3kg, Median = 2.5kg & Mode = 2kg
2. Body weights were measured for 600 adults aged
18 years and above in Woreda “A” and summarized
as:
– Mean = 60Kg, Median = 70Kg & Mode = 80Kg
3. Body weights were measured for 600 adults aged
18 years and above in Woreda “C” and summarized
as:
– Mean = 65.0Kg, Median = 65.5Kg & Mode = 65.2Kg

Q1. What is the type of distribution for the three
cases?
Q2. What do you understand from the three data sets?
157
Solution
Case 1:
Mode (2kg) < Median (2.5kg) < Mean (3kg)
Reverse alphabetic order, So, positively skewed
Case 2:
Mean (60Kg) < Median (70Kg) < Mode(80Kg)
Alphabetic order, So, negatively skewed
Case 3:
Mean (65.0Kg) ≈ Median (65.5Kg) ≈ Mode(65.2Kg)
Almost at similar position
So, symmetrical (Normally distributed)
Q2: The data tell us that using the mean may mislead when
describing skewed data.
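The ordering rule used in the three cases can be written as a small Python helper. This is a rough sketch, not a formal test of skewness: the function name `skew_type` and the 2% tolerance used to decide that the three measures are "almost at a similar position" are assumptions made for this illustration.

```python
def skew_type(mean, median, mode, rel_tol=0.02):
    """Classify a unimodal distribution from the order of mean, median and mode."""
    spread = max(mean, median, mode) - min(mean, median, mode)
    if spread <= rel_tol * abs(mean):   # the three values nearly coincide
        return "symmetrical"
    if mode < median < mean:            # reverse alphabetic order
        return "positively skewed"
    if mean < median < mode:            # alphabetic order
        return "negatively skewed"
    return "unclear"


print(skew_type(3, 2.5, 2))           # positively skewed
print(skew_type(60, 70, 80))          # negatively skewed
print(skew_type(65.0, 65.5, 65.2))    # symmetrical
```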

158
Choice of Central tendency
The choice of which measure to use depends on:
The shape of the distribution (whether normal or
skewed)

If the distribution is symmetrical, mean is


the best measure of central tendency

If the distribution is skewed, median is


appropriate measure

159
Z-Score
(Relative Position)

160
Z-score
The z-score is the number of standard deviations
the data value falls either above or below the mean
for the data set.
– If above: positive z-score
– If below: negative z-score
It tells us the relative position of each value in
reference to mean

When computing the value of the z-score, the data


values can be population values or sample values.

Hence we can compute either a population z-score


or a sample z-score
161
Z-score…
The z-score for a value in a data set is
obtained by subtracting the mean of the
data set from the value and dividing the
result by the standard deviation of the
data set.

162
Sample Z-score
• The Sample z-score for a value x is given
by the following formula:

$$z\text{-score} = \frac{x - \bar{x}}{s}$$
• Where x̄ is the sample mean and s is
the sample standard deviation.
163
Population Z-score
• The Population z-score for a value x is
given by the following formula:

$$z\text{-score} = \frac{x - \mu}{\sigma}$$
• Where μ is the population mean and σ is
the population standard deviation.
164
Z-score…
The z-score is affected by an outlying value
in the data set,
because the outlier directly affects the
value of the mean and the standard
deviation.

An outlying value is a very small or very large
value relative to the size of the other values
in the data set

So z-scores are usually used for symmetrical or normally
distributed data sets.
165
Why use Z-score?
• The z-score gives us an idea of how far
away the data value is from the mean,
and so it gives us an idea of the
position of the data value relative to the
mean.

166
Example

• What is the z-score for the value of


14 in the following sample values?

3 8 6 14 4 12 7 10

167
Solution
1st step is determining Mean & standard
deviation.

X̄ = 8 & S = 3.82

$$Z = \frac{X - \bar{X}}{S} = \frac{14 - 8}{3.82} = 1.57$$

(Given to two decimal places)

Thus, the data value of 14 is 1.57 standard
deviations above the mean of 8, since the
z-score is positive
168
This can be presented as

169
Z-score…
What is the z-score for the value of 6
in the above sample values?

$$Z = \frac{X - \bar{X}}{S} = \frac{6 - 8}{3.82} = \frac{-2}{3.82} = -0.52$$

Thus, the data value of 6 is 0.52 standard
deviations below the mean of 8, since the z-score
is negative
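Both z-scores for this sample can be checked with Python's standard `statistics` module. The sketch below is illustrative; `statistics.stdev` uses the n - 1 (sample) formula, and the helper name `z_score` is hypothetical.

```python
import statistics

sample = [3, 8, 6, 14, 4, 12, 7, 10]

mean = statistics.mean(sample)   # sample mean
s = statistics.stdev(sample)     # sample standard deviation (divides by n - 1)


def z_score(x):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / s


print(mean, round(s, 2))         # 8 3.82
print(round(z_score(14), 2))     # 1.57
print(round(z_score(6), 2))      # -0.52
```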
170
Example 2
What are the z-scores for the values of 24 and
44 in the students' marks given in the following
table?

Classes Freq
12-17 3
18-23 4
24-29 4
30-35 8
36-41 6
42-47 5
Total 30

1st determine the mean & standard deviation:

X̄ = 31.5 & S = 9.50


171
Solution
For the value 24,
$$Z = \frac{X - \bar{X}}{S} = \frac{24 - 31.5}{9.5} = \frac{-7.5}{9.50} = -0.79$$
Thus, the data value of 24 is 0.79 standard
deviations below the mean of 31.5, since the z-score
is negative

For the value 44,
$$Z = \frac{X - \bar{X}}{S} = \frac{44 - 31.5}{9.5} = \frac{12.5}{9.50} = 1.32$$
Thus, the data value of 44 is 1.32 standard deviations
above the mean of 31.5, since the z-score is positive
172
Transformed Z-score
If the Z-score is given, it is possible to get
the value corresponding to that Z.

$$Z = \frac{X - \bar{X}}{S}$$

$$X - \bar{X} = ZS$$

$$X = \bar{X} + ZS$$
173
Example
From the above students' marks, the mean is
31.5 & the standard deviation is 9.5. Find the
values that correspond to a z-score of -1.50
and a z-score of 1.50

174
Solution
For Z = -1.50

X = X̄ + ZS = 31.5 + (-1.50)(9.5) = 31.5 - 14.25 = 17.25

For Z = 1.50

X = X̄ + ZS = 31.5 + 1.50(9.5) = 31.5 + 14.25 = 45.75
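Recovering a value from its z-score is a one-line rearrangement of the z-score formula. The sketch below applies it to the students' marks (mean 31.5, S = 9.5) for z = -1.50 and z = 1.50; the function name `value_from_z` is hypothetical.

```python
def value_from_z(z, mean, sd):
    """Invert the z-score formula: X = mean + Z * S."""
    return mean + z * sd


print(value_from_z(-1.50, 31.5, 9.5))  # 17.25
print(value_from_z(1.50, 31.5, 9.5))   # 45.75
```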

175
Normal Values
Normal values are values regarded as being
within the usual range of variation in a given
population or a set of data

The range of such values is called the normal
range.

The normal range for most biological & natural
distributions is defined as the interval within 2 standard
deviation units on either side of the mean

In a normal distribution, this makes up about 95% of
the total area (observations)
176
The Rule of Thumb
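For data that approximate a normal distribution:
– about 68% of the observations lie within 1 standard deviation of the mean
– about 95% lie within 2 standard deviations of the mean
– about 99.7% lie within 3 standard deviations of the mean

This is also called the Standard Deviation 68-95-99.7 Rule
177

178
Rule of Thumb
[Figure: normal curve centred on the mean (X̄), with S = standard deviation; 68%, 95% and 99.7% of the area lie within 1, 2 and 3 standard deviations of the mean, out to µ-3σ and µ+3σ]

The entire area under the curve = 100%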


179
Normal value
[Figure: normal curve with the central 68% and 95% areas marked]

180
Example
Students sat a Biostatistics exam marked out of 100%
Mean = 75
SD = 5
Minimum = 50
Maximum = 95

Assuming that the results are normally distributed,

What are the values within which 68% of students are
encompassed?

What are the values within which 95% of students are
encompassed (the range of normal values)?

181
Solution

X = X̄ ± ZS

68% = ±1 SD
X = 75 ± 1(5) = (75-5, 75+5) = (70, 80)

95% = ±2 SD
X = 75 ± 2(5) = (75-10, 75+10) = (65, 85)
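The two ranges, and the share of a normal distribution that they cover, can be checked with `statistics.NormalDist` from the Python standard library (Python 3.8 or later). The helper names `central_range` and `area_between` are hypothetical and used only for this sketch.

```python
from statistics import NormalDist

marks = NormalDist(mu=75, sigma=5)   # exam marks, assumed normally distributed


def central_range(dist, k):
    """Interval from k standard deviations below to k above the mean."""
    return (dist.mean - k * dist.stdev, dist.mean + k * dist.stdev)


def area_between(dist, low, high):
    """Proportion of the distribution lying between low and high."""
    return dist.cdf(high) - dist.cdf(low)


for k in (1, 2):
    low, high = central_range(marks, k)
    print(k, (low, high), round(area_between(marks, low, high), 4))
```

The exact areas are about 0.6827 and 0.9545, which the rule of thumb rounds to 68% and 95%.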

182
Can be presented by using standard
normal curve

[Figure: normal curve of marks (%) divided into grade bands F, D, C, B and A]

183
Probability
Lecture-7-8
By Gurmesa Tura (MPH)
April 2011,
AAU

184
Probability
• Deterministic vs probabilistic explanation of
occurrences

• Since there is little in life that occurs with
absolute certainty, probability theory has
found application in virtually every field of
human endeavor.

185
Why Probability Theory?
• As we observe the universe around us, wonderful
Craftsmanship can be seen.

• As we examine the elements of this creation we discover that


there is incredible order, but also variation therein.

• Probability theory seeks to describe the variation or


randomness within order so that underlying order may be
better understood.

• Once understood, strategies can be more effectively


formulated and their risks evaluated.

186
What is Probability?
• Probability is a branch of mathematics concerned with
the analysis of random phenomena (chance)
• is the mathematical framework for describing
(modelling) uncertainty
• Is a numerical measure of the likelihood that a specific
event will occur
• Probability theory provides a way to find and express our
uncertainty in making decisions about a population from
sample information

• A measure of the degree of chance or likelihood of


occurrence of an uncertain event

187
Probability…
• Probability theory began in the 16th and 17th
centuries
• European mathematicians began to analyze simple
games of cards and dice.
• One of the first attempts to use ideas of relative
frequency to study human populations was made by John Graunt.

• It is now applied to analyze data in astronomy, mortality
data, traffic flow, telephone interchange, genetics,
epidemics, investment, etc.

188
Common terms in probability
1. Experiment:
• In statistics, anything that results in a count or
measurement is called an experiment.
• E.g. tossing a coin, rolling a die etc.
2. Sample Space (S):
• The set of all possible outcomes of an experiment
• e.g. in tossing a coin {H, T}
• In rolling a die {1,2,3,4,5,6}
3. Event (E): is a set of outcomes of a random phenomenon
(experiment)
• any subset of the sample space
• E.g. Getting even numbers {2,4,6}
Getting odd numbers {1,3,5}

189
Properties of Probability
1. Probabilities always lie between 0 and 1.
2. Zero probability implies that something is impossible.
3. A probability of 1 means something is certain.
4. The sum of all probabilities of a distribution is equal to
1.

190
Probability…
• Example if we say that the probability of getting sick
for a person is 0.25
• A probability of 0.25 (also expressed as 1/4, or 25%)
implies that we think that it is 3 times as likely not to
get sick as it is to get sick.
• This is because
– P(no sickness) = 1 - P(sickness) = 0.75
– 0.75/0.25 = 3.

191
Probability..
• Let A denote an event . Then,

• The probability of that event is usually written as


P(A) or Pr(A)

• The complement of an event (Ac) is everything not in that


event .

• The probability of the complement of an event or


probability of non occurrence is written as
P(Ac) = 1 - P(A)

192
Probability theories
 Two views:
1. Objectivist (Frequentist) &

2. Subjectivist (Bayesian)

193
1. Frequentist (or Objectivist):
• Probabilities are real aspects of the world that can be
measured by relative frequencies of outcomes of
experiments
 based on equally-likely events
 based on long-run relative frequency of events
 not based on personal beliefs
 is the same for all observers (objective)
 examples: toss a coin, throw a die, pick a card
 Well accepted in statistics as compared to the
Bayesian (or Subjectivist)

194
2. Bayesian (or Subjectivist):
• Probabilities are descriptions of an observer's
degree of belief or uncertainty rather than
having any external significance
– based on personal beliefs, experiences, prejudices,
intuition - personal judgment
– different for all observers (subjective)
– examples: elections, new product introduction,
snowfall

(Thomas Bayes, c. 1706 - 1761)


195
Example of subjective
• If some one says that he is 95% certain that
a cure for AIDS will be discovered within 5
years, then
– He means that Pr(discovery of cure of AIDS
within 5 years) = 95%.
• Although the subjective view of probability
has enjoyed increased attention over the
years, it has not been fully accepted by
scientists.
196
Classical definition of probability
• Classical Probability (theoretical):
– If an experiment has a finite number of equally likely
outcomes, the probability of an event is the number of outcomes
favourable to the event divided by the total number of outcomes.
– In a very long series of repetitions under similar conditions,
the proportion of times the event occurs approaches this value.
– Examples:
• The probability of a head when tossing a fair
coin is 0.5, so if it is tossed 100 times, we expect about 50 heads.
• The probability that a fetus is male is 50% per
pregnancy, so in 8 pregnancies we expect about 4 males.

197
Relative frequency probability (empirical):
• If some process is repeated a large number of times, n,
and some resulting event E occurs m times, the relative
frequency of E (m/n) will be approximately equal to the
probability of E.

– Symbolically, Pr(E) = m/n

– E.g. Suppose that of 158 people who attended a
dinner party, 99 were ill due to food poisoning.

– Thus, the probability of illness for a person
selected at random is given as
• Pr (illness) = 99/158 = 0.63 or 63%

198
In general,
• If there are “n” equally likely possibilities of
which one must occur and “S” are regarded as
favourable outcomes or success, then the
probability of the success is given by S/n
• i.e.

$$P(\text{success}) = \frac{\text{\# of successes}}{\text{total \# of outcomes}}$$

$$P(A) = \frac{\text{\# of ways A can occur}}{\text{total \# of outcomes}}$$

199
Random Phenomena
We call a phenomenon random if:-
 The exact outcome is not predictable in advance.

 Nonetheless, there is a predictable long term pattern that


can be described by the distribution of outcomes of very
many trials.

• Thus,
• A phenomenon is random, if individual outcomes are
uncertain but there is a regular distribution of outcomes in a
large number of repetitions.

 E.g. tossing coin 100 times, approximates 50%H & 50%T


 A woman giving 8 births, approximates 50% male &50%
female

200
e.g.
Coin tossing 100 times
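The long-run regularity of a random phenomenon can be illustrated by simulating coin tosses. The sketch below is an assumption-laden illustration, not part of the lecture: it uses Python's pseudo-random generator with an arbitrary seed, and the function name `proportion_heads` is hypothetical. Individual tosses are unpredictable, but the proportion of heads settles near 0.5 as the number of tosses grows.

```python
import random


def proportion_heads(n_tosses, seed=2011):
    """Simulate n_tosses of a fair coin and return the proportion of heads."""
    rng = random.Random(seed)                 # seeded so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


for n in (10, 100, 1000, 100000):
    print(n, proportion_heads(n))
```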

201
Common terms in relation of events
• Set - a collection of elements or objects of interest
• Empty set (denoted by ∅)
– a set containing no elements
• Universal set (denoted by S) = Sample space
– a set containing all possible elements
• Complement (Not). The complement of A is written Ac or A’
– a set containing all elements of S not in A
• Intersection
• Union
• Mutually exclusive
• Partition

202
Elements of Set A

Venn Diagram illustrating the elements of an event


203
Complement of a Set A =AC = A’

A’
A

Venn Diagram illustrating the Complement of an event


204
Intersection of sets
Intersection (And), A ∩ B

a set containing all elements in both A and
B

[Venn diagram illustrating A ∩ B]
205
Union of sets
Union (Or), A ∪ B
a set containing all elements in A or B
or both

206
Mutually exclusive or disjoint sets
 sets having no elements in common, having no
intersection, whose intersection is empty set

207
Partition
• a collection of mutually exclusive sets which
together include all possible elements, whose union
is the universal set

208
Rules of probability
1. For any event A, P(A) ranges from 0 to 1
0 ≤ P(A) ≤ 1.
2. If A and B can never both occur at a time (they are
mutually exclusive), then
P(A and B) = P(A ∩ B) = 0
3. For any event A and event B,
P(A or B) = P(A ∪ B) = P(A) + P(B) - P(A ∩ B).
4. If A and B are mutually exclusive events, then
P(A or B) = P(A ∪ B) = P(A) + P(B).
5. For event A, the probability that it does not occur
P(Ac) = 1 - P(A).
6. If A and B are independent events, then
P(A and B) = P(A ∩ B) = P(A) × P(B).
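Several of these rules can be checked on the sample space of one roll of a fair die. The sketch below is only an illustration: the events `even`, `odd` and `low` and the helper `prob` are chosen for the example, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction

sample_space = {1, 2, 3, 4, 5, 6}    # one roll of a fair die


def prob(event):
    """Probability of an event when all outcomes are equally likely."""
    return Fraction(len(event & sample_space), len(sample_space))


even = {2, 4, 6}
odd = {1, 3, 5}
low = {1, 2, 3}                      # hypothetical event: a roll of at most 3

# Rule 3: P(A or B) = P(A) + P(B) - P(A and B)
print(prob(even | low), prob(even) + prob(low) - prob(even & low))  # 5/6 5/6

# Rules 2 and 4: even and odd are mutually exclusive
print(prob(even & odd), prob(even | odd))                           # 0 1

# Rule 5: complement
print(prob(sample_space - even), 1 - prob(even))                    # 1/2 1/2
```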
209
Conditional Probability
• Used for non-independent (dependent) events
• The probability that event B occurs given that
event A has occurred is called a conditional
probability.

• It is denoted by the symbol P(B | A), which is
read “the probability of B given A.”

• We call A the given event.

• It is computed from the joint probability P(A and B)
and the probability of the given event
210
Conditional…..
• If A and B are any two events, and the occurrence of
event B depends on the occurrence of event A, then

$$P(B \mid A) = \frac{P(A \,\&\, B)}{P(A)} = \frac{P(A \cap B)}{P(A)}, \quad P(A) \neq 0$$

• In words, for any two events, the conditional


probability that one event occurs given that the
other event has occurred equals the joint probability
of the two events divided by the probability of the
given event.

211
Example
• At a dinner party, 100 people participated. 60 of them ate
“Kitfo” and 40 of them ate roasted meat (“Tibs”). A day
later, 40 people developed food poisoning, 36 of whom were
among those who ate “Kitfo”.

• Q1. What is the probability of occurrence of food poisoning
among people who ate “Kitfo”?

• Q2. What is the probability of occurrence of food poisoning
among people who ate roasted meat?

• Two approaches can be used


– The contingency table = use frequency
– The joint probability table = use probabilities

212
Draw contingency table

Type of food eaten (A)   Food poisoning (B)   Total
                         Yes   No
“Kitfo”                  36    24    60
Roasted meat             4     36    40

P(poisoning/kitfo)
= 36/60 = 0.60 = 60%

P(poisoning/roasted meat)
= 4/40 = 0.10 = 10%

Here food poisoning was 6 times
more likely to occur among those who ate
“Kitfo” than among those who ate
roasted meat.
- The “Kitfo” might have been
spoiled; this needs intervention.
213
Joint probability Table
• A joint probability table is similar to a contingency
table, except that it has probabilities in place of
frequencies.
• Pi = fi/n, e.g. 36/100 = .36

Type of food eaten   Food poisoning   Total
                     Yes    No
“Kitfo”              .36    .24    .60
Roasted meat         .04    .36    .40
Total                .40    .60    1.00

• The row totals and column totals are called
marginal probabilities.

P(poisoning/kitfo)
= .36/.60 = 0.60 = 60%

P(poisoning/roasted meat)
= .04/.40 = 0.10 = 10%
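Both approaches give the same conditional probabilities, which can be confirmed with a short Python sketch. The nested dictionary `counts` is an assumed way of storing the contingency table for this illustration, and exact fractions are used so the two routes can be compared without rounding.

```python
from fractions import Fraction

# Contingency table: food eaten -> {"ill": ..., "well": ...}
counts = {
    "kitfo": {"ill": 36, "well": 24},
    "roasted meat": {"ill": 4, "well": 36},
}
n = sum(sum(row.values()) for row in counts.values())   # 100 people


def p_ill_given(food):
    """Contingency-table approach: frequency of ill / row total."""
    row = counts[food]
    return Fraction(row["ill"], row["ill"] + row["well"])


def p_ill_given_joint(food):
    """Joint-probability approach: P(food and ill) / P(food)."""
    row = counts[food]
    joint = Fraction(row["ill"], n)
    marginal = Fraction(row["ill"] + row["well"], n)
    return joint / marginal


print(p_ill_given("kitfo"), p_ill_given("roasted meat"))        # 3/5 1/10
print(p_ill_given("kitfo") / p_ill_given("roasted meat"))       # 6
```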
214
Example 2
• Suppose in country “X” the chance that an infant lives to
age 25 is .95, whereas the chance that he lives to age 60
is .65. For the latter, it is understood that to survive to age
60 means to survive both from birth to age 25 and from
age 25 to 60.

• Q1. What is the probability that a person of 25 years


survives to age 60?

215
Solution
Notation Event Probability
A Survive birth to age 25 .95
A&B Survive birth to age 25 & age 25 to 60 .65
B/A Survive age 25-60 given survived to age 25 ?

• P(B/A) = P(A&B)/P(A) = .65/.95 = .684

• That is, a person aged 25 has a 68.4% chance of
living to age 60.
216
Independent Events
• We call Independent events when there are two events such that the
occurrence or non-occurrence of one does not in any way affect the
occurrence or non occurrence of the other.

• Two events A and B are said to be independent if the fact that A has
occurred or not does not affect your assessment of the probability of
B occurring.

• Conversely, the fact that B has occurred or not does not affect your
assessment of the probability of A occurring.

– P(A | B) = P(A), P(B | A) = P(B)


• If A and B are independent events, then
P(A and B) = P(A ∩ B) = P(A) × P(B)

217
Example
• What is the probability that a pregnant
woman gives a female child after having a
female child before?
• Answer:
• The sex of the foetus is independent of the sex
of the previous child.

• So, P (female fetus) =1/2 =0.5 =50%


218
Counting of possible outcomes

219
Counting of possible outcomes
• According to classical definition of probability, outcomes are
equally likely to occur.
• In this case the probability is determined as,

$$P(A) = \frac{\text{\# of ways A can occur}}{\text{total \# of outcomes}}$$
• To know the # of ways A can occur and the total # of outcomes
we have to count

• These are called “counting methods”

• What if large number of trials?


220
Counting…
• If the number of possible outcomes in an experiment is small, it is
relatively easy to list and count all possible events.

• When there are large number of possible outcomes, an


enumeration of cases is often difficult , tedious or both

• To overcome such problems one can use various counting


techniques.

• Such as:
• Powers
• Permutations &
• Combinations

221
Counting ….
• We can have two approaches in determining
the number of possible outcomes

• If Order is considered
– With replacement = powers
– Without replacement = permutations

• If Order is not considered


– Without replacement = combinations

222
Counting …
Counting methods for computing probabilities

Permutations (order matters!)
– With replacement
– Without replacement

Combinations (order doesn't matter)
– Without replacement

223
Counting with replacement
• With replacement: once an event occurs, it can
occur again (after you roll a 6, you can roll a 6
again on the same die).

• Example
– Assume you tossed a coin 3 times; what is the
probability that all 3 are heads?

224
With replacement…
• Solution:
– Determine the total number of possible outcomes.
– As this is small trial we can use probability tree

225
Replacement…
• What if 100 tosses? Difficult to list and count all possible out
comes. In this case we use the rules of powers.

General rule:
When order matters and with replacement,
for n possible outcomes per trial and r trials,
the total possible number of outcomes is given by
n to the power of r:

$$n^r = (\text{\# possible outcomes per event})^{\text{the \# of events}}$$

226
Example:
• What is the total possible number of outcomes for tossing a coin 3 times?
– Solution:
• Possible outcomes per trial (H or T) = 2
• Number of trials = 3
• Total possible number of outcomes (sample space)
• S = n^r = 2^3 = 8
• The probability of getting heads in all 3 trials is 1/8

• What is the total possible number of outcomes for rolling a die 3 times?
• Solution:
• Possible outcomes per trial (1,2,3,4,5,6) = 6
• Number of trials = 3
• Total possible number of outcomes
• S = n^r = 6^3 = 216

227
Without replacement
• Without replacement: an event cannot repeat
once it has been selected

• E.g. after you draw an ace of spades out of a
deck, there is 0 probability of getting it again

• Example:
– What is the total number of possible ways of picking 5 cards
from a deck of 52?
228
With replacement…
• If it is with replacement, we have 52 possibilities for
each of the five trials
• i.e. 52 x 52 x 52 x 52 x 52 = 52^5
= 380,204,032 different possible outcomes

• What if without replacement?
• 52 x 51 x 50 x 49 x 48 = 311,875,200 different possible
outcomes

What general formula applies for this? The answer is permutation


229

Permutations
Permutations are the possible ordered selections of r objects out of
a total of n objects without replacement.

• General rule for events without replacement:


• The number of permutations of n objects taken r at a time is denoted
by nPr, where
$$_nP_r = \frac{n!}{(n-r)!}$$
• For the above example, picking 5 cards from a deck of 52:
• n = 52, r = 5, 52P5 = 52!/(52-5)! = 52!/47!
= (52 x 51 x 50 x 49 x 48 x 47!)/47! = 311,875,200 ways

230
When order is not considered
• Suppose that we picked 3 letters out of the 6 letters A, B, C, D,
E, and F without replacement.
• Total ways = 6!/(6-3)! = 120
• Of these, for example, the letters B, C & D
• can be ordered in 3! = 6 ways
• i.e. BCD, or BDC, or CBD, or CDB, or DBC, or DCB.

• But these are orderings of the same combination of 3 letters.


• If we avoid order, how many combinations of 6 different
letters, taking 3 at a time, are there?
• To do this we use the rules of combination

231
232
Example above
• If we avoid order, how many combinations of
6 different letters, taking 3 at a time, are
possible?

$$\binom{n}{r} = {}_nC_r = \frac{n!}{r!\,(n-r)!}, \quad n = 6 \;\&\; r = 3$$

$$\binom{6}{3} = {}_6C_3 = \frac{6!}{3!\,(6-3)!} = \frac{6 \times 5 \times 4 \times 3!}{3!\,3!} = \frac{6 \times 5 \times 4}{3 \times 2 \times 1} = 20$$

While considering order we had 6P3 = 120 ways, but
without order we have 6C3 = 20 ways
233
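The three counting rules (powers, permutations and combinations) are available directly in Python's standard library (`math.perm` and `math.comb` need Python 3.8 or later). The sketch below simply re-computes the numbers from the worked examples in this section.

```python
import math

# Order matters, with replacement: n ** r
print(2 ** 3)            # coin tossed 3 times            -> 8
print(6 ** 3)            # die rolled 3 times             -> 216
print(52 ** 5)           # 5 cards drawn with replacement -> 380204032

# Order matters, without replacement: permutations nPr = n! / (n - r)!
print(math.perm(52, 5))  # -> 311875200
print(math.perm(6, 3))   # -> 120

# Order does not matter, without replacement: combinations nCr = n! / (r!(n - r)!)
print(math.comb(6, 3))   # -> 20

# Each combination of 3 letters can be ordered in 3! = 6 ways.
print(math.perm(6, 3) // math.factorial(3))  # -> 20
```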


Summary of counting techniques

234
Exercise 1
• Suppose the department head wants to form a committee of
6 students from among 200 medical students by listing their ID numbers.

Q1. In how many possible ways can he do this if order is
considered, with replacement?

Q2. In how many possible ways can he do this if order is
considered, without replacement?

Q3. In how many possible ways can he do this if order is
not considered?

Q4. Which one do you think is the best way for him to form the
committee? Why?

235
Exercise 2
• Suppose there are 100 2nd year medical students, 60 of
them male and 40 female. 10 students are to be
selected for a scholarship abroad to continue their
education. In how many ways can this be done if:

a. There is no restriction?
b. Two particular females should be included?
c. Five particular females can be included?

236
Random variable and
Probability distribution

237
Random variable
• A random variable is a numerical description of the outcomes
of an experiment, or a numerically valued function defined on
the sample space.
• Usually denoted by capital letters.
• It takes each possible outcome and assigns a number to it.

• Example. Toss a coin three times and let X be the number of heads
in three tosses
• S = {(HHH),(HHT),(HTH),(HTT),(THH),(THT),(TTH),(TTT)}
– X(HHH) = 3
– X(HHT) = X(HTH) = X(THH) = 2
– X(HTT) = X(THT) = X(TTH) = 1
– X(TTT) = 0
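The mapping from outcomes to values of X can be enumerated in Python. This sketch assumes a fair coin, so that all 8 outcomes are equally likely; the variable names are chosen for the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space of three tosses: HHH, HHT, ..., TTT (8 equally likely outcomes).
sample_space = ["".join(tosses) for tosses in product("HT", repeat=3)]

# The random variable X assigns to each outcome its number of heads.
X = {outcome: outcome.count("H") for outcome in sample_space}

# Count how many outcomes give each value of X, then convert to probabilities.
counts = Counter(X.values())
distribution = {x: Fraction(c, len(sample_space)) for x, c in sorted(counts.items())}

print(X["HHT"], X["TTT"])  # 2 0
for x, p in distribution.items():
    print(x, p)            # 0 1/8, 1 3/8, 2 3/8, 3 1/8
```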

238
Random variable
• Random variables are of two types.
– Discrete random variable &

– Continuous random variables

239
Discrete random variables
• Are variables which can assume only a specific number of
values

• They have values that can be counted

• Example:
– Toss a coin n times and count the number of heads
– Number of children in a family
– Number of car accidents per week
– Number of malaria cases per month
– Etc….

240
Continuous random variables
• Are variables that can assume all values between any
two given values.

• A continuous random variable X can take on an


uncountably infinite number of values

• Example:
– Height of students at a certain college
– Mark of students
– Weight of individuals in a certain community
– Etc…

241
Probability distribution
• The term probability distribution refers to the way the values of a
random variable are distributed; it is used to draw conclusions about a set of data.

• A probability distribution consists of the values a random variable
can assume and the corresponding probabilities of those values

• Every random variable has a corresponding probability


distribution.

• A probability distribution applies the theory of probability to


describe the behavior of the random variable.

242
Probability distribution…
• A probability distribution of a random variable can be
displayed by a table or a graph or a mathematical
formula.
• With categorical variables, we obtain the frequency
distribution of each variable.
• With numeric variables, the aim is to determine whether
or not normality may be assumed.
• If not we may wish to consider transforming the variable,
or may wish to categorize the variable for analysis (e.g.
age groups).

243
Models of probability distribution
• For discrete random variables
– Binomial distribution
– Poisson distribution

• For continuous random variables
– Normal distribution (standard normal distribution)

244
Binomial Distribution

245
Binomial distribution
• A binomial experiment is a probability experiment that
satisfies the following four assumptions

1. The experiment consists of a fixed number, n, of identical trials
2. Each trial has only two possible mutually
exclusive outcomes (success or failure)
3. The probability of each outcome does not change from
trial to trial &
4. The trials are independent; when sampling from a finite population this means sampling with
replacement

246
Binomial dist….
• Suppose that n independent experiments, or trials, are
performed, where n is a fixed number, and that each
experiment results in a “success” with probability p and a
“failure” with probability 1-p.

• Then the total number of successes, X, is a binomial


random variable with parameters n and p.

• We write: X ~ Bin(n, p), read as “X is distributed binomially with parameters n and p”

247
Binomial dist…
• The probability that X = r (i.e., that there are exactly r
successes) is:

$$P(X = r) = \binom{n}{r} p^{r} (1 - p)^{n - r}$$

Where: n = number of trials
r = number of successes
p = probability of success
1-p = probability of failure

248
Binomial dist…
Bernoulli trial:
• If there is only 1 trial with probability of
success p and probability of failure 1-p, this is
called a Bernoulli distribution.
• Special case of the binomial with n = 1

Probability of success: $P(X = 1) = \binom{1}{1} p^{1} (1 - p)^{1-1} = p$

Probability of failure: $P(X = 0) = \binom{1}{0} p^{0} (1 - p)^{1-0} = 1 - p$

249
Example
• Assume a woman plans to have 6 children and the
probability of having a male child at each birth is 50%.

a) What is the probability that exactly 3 of them are male children?

b) What is the probability that at least 3 of them are male children?

c) What is the probability that at most 2 of them are male children?
250
Solution
• Number of trials = n = 6
• Probability of success (male child) on a single
trial = p = 0.5
a) For exactly 3 males, r = 3

$$P(X = 3) = \binom{6}{3} (0.5)^{3} (1 - 0.5)^{6-3} = \frac{6!}{3!\,(6-3)!} (0.5)^{3} (0.5)^{3} = 20 \times 0.125 \times 0.125 = 0.3125$$

The probability of getting exactly 3 male children in 6 pregnancies is 0.3125

251
b) Probability that at least 3 of them
are male children
• When we say at least 3 males, it could be 3, 4, 5 or 6
• i.e. P(X ≥ 3) = P(X=3) + P(X=4) + P(X=5) + P(X=6)

$$P(X = 3) = \binom{6}{3} (0.5)^{3} (1 - 0.5)^{3} = 0.313$$
$$P(X = 4) = \binom{6}{4} (0.5)^{4} (1 - 0.5)^{2} = 0.234$$
$$P(X = 5) = \binom{6}{5} (0.5)^{5} (1 - 0.5)^{1} = 0.094$$
$$P(X = 6) = \binom{6}{6} (0.5)^{6} (1 - 0.5)^{0} = 0.016$$
$$P(X \ge 3) = 0.313 + 0.234 + 0.094 + 0.016 = 0.657$$

The probability of getting at least 3 male children in 6 pregnancies is 0.657 = 65.7%
252
c) Probability that at most 2 of them
are male children
• When we say at most 2 males, it could be 0, 1 or 2
• i.e. P(X ≤ 2) = P(X=0) + P(X=1) + P(X=2)

$$P(X = 0) = \binom{6}{0} (0.5)^{0} (1 - 0.5)^{6} = 0.016$$
$$P(X = 1) = \binom{6}{1} (0.5)^{1} (1 - 0.5)^{5} = 0.094$$
$$P(X = 2) = \binom{6}{2} (0.5)^{2} (1 - 0.5)^{4} = 0.234$$
$$P(X \le 2) = 0.016 + 0.094 + 0.234 = 0.344$$

The probability of getting at most 2 male children in 6 pregnancies is 0.344 = 34.4%
253
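
The three parts of this example can be checked numerically. The following is a minimal sketch using only the Python standard library; the helper name `binomial_pmf` is introduced here for illustration and is not part of the lecture.

```python
from math import comb

def binomial_pmf(r, n, p):
    """P(X = r) for X ~ Bin(n, p)."""
    return comb(n, r) * p**r * (1 - p)**(n - r)

n, p = 6, 0.5  # 6 births, probability of a male child 0.5

exactly_3 = binomial_pmf(3, n, p)                                  # part (a)
at_least_3 = sum(binomial_pmf(r, n, p) for r in range(3, n + 1))   # part (b)
at_most_2 = sum(binomial_pmf(r, n, p) for r in range(0, 3))        # part (c)

print(exactly_3, at_least_3, at_most_2)
```

Parts (b) and (c) are complementary events, so their probabilities add up to 1. The unrounded value for part (b) is 0.65625; the 0.657 in the worked solution comes from adding terms that were already rounded to three decimals.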
Expected value and variance of a binomial
distribution
• All probability distributions are characterized by an expected value and a
variance:

• If X follows a binomial distribution with parameters n and p: X ~ Bin(n, p)
Then:

$\mu_x = E(X) = np$

$\sigma_x^{2} = Var(X) = np(1-p)$

$\sigma_x = SD(X) = \sqrt{np(1-p)}$

Note: the variance will always lie between 0 and 0.25n, because p(1-p) reaches its
maximum at p = 0.5, where p(1-p) = 0.25

E(X) = expected number to have the condition
254


Things that follow a binomial distribution
• Cohort study (or cross-sectional):
– The number of exposed individuals in your sample
that develop the disease
– The number of unexposed individuals in your
sample that develop the disease

• Case-control study:
– The number of cases that have had the exposure
– The number of controls that have had the
exposure

255
Example
Suppose you are performing a cohort study. The probability of
developing disease in the exposed group is 0.05 for the study
duration, and you randomly sample 500 exposed people.

Q1. How many do you expect to develop the disease? Give a margin of error (+/- 1 standard deviation) for your estimate.

Q2. What’s the probability that at most 10 exposed people develop the disease?

256
Solution for Q 1
Given:
• n = 500, p = 0.05, margin = +/- 1 SD

• µx = E(X) = ?
• Expected cases within +/- 1 SD = ?
i.e. X ~ Bin(500, 0.05)
– µx = E(X) = np
– E(X) = 500 (0.05) = 25

Var(X) = np(1-p) = 500 (0.05) (0.95) = 23.75

SD(X) = √23.75 = 4.87
25 ± 4.87: between about 20.13 and 29.87 people are expected to develop the disease

257
Solution 2
Given:

• n = 500, p = 0.05
• P(X ≤ 10) = ?
• P(X ≤ 10) = P(X=0) + P(X=1) + P(X=2) + P(X=3) + P(X=4) + … + P(X=10)

$$P(X \le 10) = \binom{500}{0}(0.05)^{0}(0.95)^{500} + \binom{500}{1}(0.05)^{1}(0.95)^{499} + \binom{500}{2}(0.05)^{2}(0.95)^{498} + \dots + \binom{500}{10}(0.05)^{10}(0.95)^{490} < 0.01$$

The probability that at most 10 of them develop the disease is < 0.01
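
Summing eleven binomial terms by hand is tedious, so a short script is a natural check for both questions. This sketch uses only the standard library; the variable names are chosen here for illustration.

```python
from math import comb, sqrt

n, p = 500, 0.05  # 500 exposed people, risk 0.05 over the study duration

# Q1: expected number of cases and a +/- 1 SD margin.
mean = n * p
variance = n * p * (1 - p)
sd = sqrt(variance)
print(f"E(X) = {mean}, Var(X) = {variance}, SD(X) = {sd:.2f}")
print(f"+/- 1 SD: ({mean - sd:.2f}, {mean + sd:.2f})")

# Q2: P(X <= 10) as the sum of the binomial probabilities for k = 0..10.
p_at_most_10 = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(11))
print(f"P(X <= 10) = {p_at_most_10:.5f}")
```

The computed tail probability is well below 0.01, in line with the statement in the solution.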

258
Exercise
Suppose you are conducting case control study. Assume
the probability of being a smoker among a group of cases
with lung cancer is .6, and you sampled 10 cases for your
study.

1. What is the expected number of smokers?


2. What is the variance & SD for the number of smokers?
3. Give +/-2SD margin for the expected number of smokers.
4. What is the probability that more than 5 of the cases are
smokers?

259
Poisson Distribution

260
Poisson distribution
• The Poisson distribution is used to model discrete events
that occur infrequently in time and space
– i.e. rare events that occur at a constant rate.
– Examples: death rates, accident rates, incidence
rates of rare diseases.
• Our random variable will be the “number of occurrences
of the event over the region of opportunity for
occurrence in a given time”.
• Poisson distribution is for counts

261
Poisson…
• If events happen at a constant rate over time, the
Poisson distribution gives the probability of X number of
events occurring in time T.

• For a Poisson random variable, the variance and mean are the same and represented by λ

Mean: $\mu = \lambda$

Variance: $\sigma^{2} = \lambda$

Standard deviation: $\sigma = \sqrt{\lambda}$

where λ = expected number of events of interest in a given time period
262
Poisson…
• If X is a random variable following a Poisson
distribution, then the probability of k occurrences is
given by

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

– Where:
• k = number of occurrences
• λ = the mean number of occurrences in the given interval
• e ≈ 2.718

– For large λ, the Poisson distribution is approximately normal.

263
Example
• Suppose X is a random variable representing the number of
individuals involved in a road accident each year in Ethiopia.
Assume the mean number of road accidents in
Ethiopia is 2.4 individuals per 1,000 population per year.

Q1. What is the probability that exactly 5 accidents occur in this population in the coming year?

Q2. What is the probability that at most 3 accidents occur in this population in the coming year?

264
Solution to Q1
1. n = 1,000, λ = 2.4 per 1,000, e ≈ 2.718, k = 5
• P(X=5) = ?

$$P(X = k) = \frac{\lambda^{k} e^{-\lambda}}{k!}$$

$$P(X = 5) = \frac{(2.4)^{5} e^{-2.4}}{5!} = \frac{(79.63)(0.09)}{120} = 0.06 = 6\%$$

265
Solution to Q2
2. At most 3 accidents = P(X ≤ 3) = ?

$$P(X \le 3) = P(X = 0) + P(X = 1) + P(X = 2) + P(X = 3)$$

$$P(X \le 3) = \frac{2.4^{0} e^{-2.4}}{0!} + \frac{2.4^{1} e^{-2.4}}{1!} + \frac{2.4^{2} e^{-2.4}}{2!} + \frac{2.4^{3} e^{-2.4}}{3!} = 0.09 + 0.22 + 0.26 + 0.21 = 0.78$$

The probability that three or fewer accidents occur per 1,000 population is
0.78 = 78%
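
Both answers can be reproduced with a small script. This is a sketch using the standard library only; the helper name `poisson_pmf` is introduced here for illustration.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(X = k) for a Poisson random variable with mean lam."""
    return lam**k * exp(-lam) / factorial(k)

lam = 2.4  # mean number of accidents per 1,000 population per year

p_exactly_5 = poisson_pmf(5, lam)                         # Q1
p_at_most_3 = sum(poisson_pmf(k, lam) for k in range(4))  # Q2: k = 0, 1, 2, 3

print(f"P(X = 5)  = {p_exactly_5:.4f}")
print(f"P(X <= 3) = {p_at_most_3:.4f}")
```

Using `exp` avoids the small rounding error that comes from taking e as 2.71 in hand calculations.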

266
“Poisson Process”
• Note that the Poisson parameter λ can be given as the
mean number of events that occur in a defined time
period OR,
• equivalently, λ can be given as a rate per unit time, so that we multiply it by the required time t
• This is called a “Poisson Process” and is given as,

$$P(X = k) = \frac{(\lambda t)^{k} e^{-\lambda t}}{k!}$$

E(X) = λt
Var(X) = λt
267
Example
• Suppose new cases of measles are occurring at a
rate of about 2 per month per 100,000 under-five
population in Ethiopia.
1) What’s the probability that exactly 4 cases of
measles will occur in the next 3 months in the
same population?
2) What’s the expected number of measles cases in a
1,000,000 under-five population in one year?
3) Give a +/-2SD margin for the expected number of
cases.

268
Solution to Q1
1. Given λ = 2 per 100,000 per month & t = 3 months
P(X=4) = ?

$$P(X = 4 \text{ in 3 months}) = \frac{(2 \times 3)^{4} e^{-(2 \times 3)}}{4!} = \frac{(6)^{4} e^{-6}}{24} = \frac{(1296)(0.00248)}{24} = 0.134 = 13.4\%$$

So, the probability that exactly 4 new cases of measles occur in
3 months in a 100,000 population is 0.134 = 13.4%
269
Solution to Q2 & Q3
Q2. Given λ = 2 per month per 100,000
= (2/100,000) × 1,000,000
= 20 per month per 1,000,000
t = 1 year = 12 months
– E(X) = λt
– E(X in 12 months) = 20 × 12 = 240 cases

Q3. +/-2SD margin for 240 = ?

– Var(X) = λt = 240,
– SD(X) = √240 = 15.49
– 240 +/- 30.98 = (209.02, 270.98)
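
The measles example combines a rate, a time period and a change of population size, which is easy to get wrong by hand. The sketch below recomputes all three answers with the standard library; the function name `poisson_process_pmf` is introduced here for illustration.

```python
from math import exp, factorial, sqrt

def poisson_process_pmf(k, rate, t):
    """P(X = k) when events occur at `rate` per unit time over `t` time units."""
    mu = rate * t  # expected number of events, lambda * t
    return mu**k * exp(-mu) / factorial(k)

# Q1: rate of 2 per month per 100,000, over 3 months, exactly 4 cases.
p_4_cases = poisson_process_pmf(4, rate=2, t=3)
print(f"P(X = 4 in 3 months) = {p_4_cases:.3f}")

# Q2: scale the rate to a population of 1,000,000 and take 12 months.
rate_per_million = 2 * (1_000_000 / 100_000)  # 20 cases per month
expected = rate_per_million * 12

# Q3: for a Poisson count the variance equals the mean.
sd = sqrt(expected)
margin = (expected - 2 * sd, expected + 2 * sd)
print(f"E(X) = {expected}, SD = {sd:.2f}, +/- 2 SD = ({margin[0]:.2f}, {margin[1]:.2f})")
```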

270
Normal Distribution

271
Normal distribution
• Normal distributions are symmetric, single-peaked, bell-shaped
curves described by their mean (µ) and standard deviation (σ).

• Used for continuous random variables.

• The “normal” or “Gaussian” distribution is the most commonly used of all probability models.

• It is also foundational to the development of numerous commonly used statistical methods

272

Normal dist…
In many circumstances, the outcome of a random
variable is not limited to categories or counts.
– E.g. Suppose X represents the continuous variable ‘Height’;
rarely is an individual exactly 170 cm tall
– X can assume an infinite number of intermediate values 170.1,
170.2, 170.3 etc.
• Because a continuous random variable X can take on an
uncountably infinite number of values, the probability
associated with any one particular value is zero

• However, the probability that X will assume some value in the interval between two values, say x1 and x2, can be
determined

273
Normal dist…
• As a continuous variable can take an infinite number of values,
it helps to visualize the probability distribution as a curve and
probabilities as ‘area under the curve’.

• The normal distribution is completely described by two parameters (μ & σ)

• The mean μ can be any number (negative, positive or zero): -∞ < µ < +∞

• The standard deviation σ must be a positive number.

• The outcome of the measurement (x) can range over -∞ < X < +∞

• To model this we use the normal probability density function

274
275
276
277
Normal distr…

Bell-shaped and symmetric distributions.


Because the distribution is symmetric, one-half
(.50 or 50%) lies on either side of the mean.

Example
Finding probabilities of the standard normal
distribution: P(0 ≤ Z ≤ 1.56)
Procedure:
• Look in the row labeled 1.5 and the column labeled .06 to find P(0 ≤
Z ≤ 1.56) = 0.4406

278
Standard Normal Probabilities

AREA UNDER THE


STANDARD NORMAL CURVE

279
Example
• Let X be systolic blood pressure (SBP) for US males
aged 18-74, with μ = 129 mmHg and σ =
19.8 mmHg.

Q1. What levels encompass the middle 95%?

Q2. What proportion of men in the population have SBP greater than 150 mmHg?

Q3. What level cuts off the lower 10% of SBP?


280
Solution Q1
• Given μ = 129 mmHg and σ = 19.8 mmHg
• Levels encompassing the middle 95% = ?
• Read from the Z table, i.e. the standard normal distribution (SND)
• The middle 95% corresponds to an area of 0.95
• As the table is one-sided, look up 0.95/2 = 0.4750, which gives Z = 1.96
X = μ ± Zσ
= 129 ± 1.96(19.8) = 129 ± 38.8 = (129 − 38.8, 129 + 38.8)
• Levels for 95%: (90.2, 167.8)

• Interpretation:
• The systolic blood pressure of 95% of US males aged 18-74
lies between 90.2 and 167.8 mmHg.

281
Solution to Q2
• Given: μ = 129 mmHg and σ = 19.8 mmHg
– Proportion with SBP > 150 mmHg = ?
• To get the proportion, find the Z value corresponding to 150
• Z = (X – μ)/σ = (150 − 129)/19.8 = 1.06
• We need P(Z > 1.06)

• From the Z table, P(0 ≤ Z ≤ 1.06) = 0.3554
• P(Z > 1.06) = 0.50 − P(0 ≤ Z ≤ 1.06)
• P(Z > 1.06) = 0.50 − 0.3554 = 0.1446

• The proportion of males aged 18-74 having SBP > 150 mmHg is 0.1446 = 14.46%

282

Solution to Q3
• Lower 10% of SBP, 10% = 0.10
• Find Z from the table corresponding to 0.1
• To read from the table, 0.5 − 0.1 = 0.4
• Find the Z corresponding approximately to 0.4 from the table.
• 0.3997 corresponds to P(0 ≤ Z ≤ 1.28)
• 0.1 corresponds to P(Z > 1.28)
• As the lowest 0.1 is required, Z will be negative
• i.e. the lowest 0.1: P(Z < −1.28)

• To get the cut-off point (X) corresponding to P(Z < −1.28):

• X = μ + Zσ = 129 + (−1.28)(19.8) = 103.7

• 10% of them have SBP < 103.7 mmHg
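
The three table look-ups in this example can be cross-checked in software. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the variable names are chosen here for illustration.

```python
from statistics import NormalDist

sbp = NormalDist(mu=129, sigma=19.8)  # SBP distribution from the example
z = NormalDist()                      # standard normal distribution

# Q1: limits enclosing the middle 95% (2.5% in each tail).
z_975 = z.inv_cdf(0.975)
middle_95 = (sbp.mean - z_975 * sbp.stdev, sbp.mean + z_975 * sbp.stdev)

# Q2: proportion with SBP above 150 mmHg.
p_above_150 = 1 - sbp.cdf(150)

# Q3: value that cuts off the lowest 10%.
lower_10_cutoff = sbp.inv_cdf(0.10)

print(f"Middle 95%: ({middle_95[0]:.1f}, {middle_95[1]:.1f})")
print(f"P(SBP > 150) = {p_above_150:.4f}")
print(f"Lower 10% cut-off = {lower_10_cutoff:.1f}")
```

Software works with unrounded Z values, so its answers can differ in the last digit from answers read off a printed table, where Z is rounded to two decimals.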

283
Exercise: try the following exercises and
compare your findings with the answers given
1. Find Probabilities of the Standard Normal
Distribution: P(Z < -2.47)
answer = 0.0068
2. Find Probabilities of the Standard Normal
Distribution: P(1≤ Z ≤ 2)
answer = 0.1359
3. Find the value z of the standard normal random
variable such that P(0 < Z < z) = 0.40
answer: z = 1.28
i.e. X = µ + 1.28σ

284
Sampling Methods

Lecture By Gurmesa Tura (MPH)


April 2011
AAU

285
Learning objectives…
• At the end of this lecture the students will be
able to:
– Define common terms used in sampling
– Distinguish the difference between probability and
non probability sampling

– Identify the different methods of probability and


non probability sampling techniques

– Explain the advantages and disadvantages of each


technique
286
Sampling
• Sampling is the process of choosing a section of the
population for observation and study.

• It means taking a representative subgroup of the reference population

• The sample should reflect all the qualities found in the population

287
Common terms used in sampling
• Reference population (target population)
– The population of interest, to which the
investigator would like to generalize the results of
the study

• Source population
– From which the representative sample is to be
drawn

288
Common terms…
• Study or sample population
– The population included in the sample

• Sampling unit
– The unit of selection in the sampling process

• Study unit
– The unit on which information is collected

289
Common terms…
• Sampling frame
– The list of all the units in the reference population,
from which a sample is to be picked

• Sampling fraction

– The ratio of the number of units in the sample to the
number of units in the reference population (n/N); its inverse (N/n) is the sampling interval

290
Hierarchy of Sampling

AA

WRA
291
Why sampling?
• Feasibility: Sampling may be the only feasible
method of collecting the information.
• Reduced cost: Sampling reduces demands on
resource such as finance, personnel, and material.
• Greater accuracy: Sampling may lead to better
accuracy of data collection

• Sampling error: Precise allowance can be made for sampling error
• Greater speed: Data can be collected and
summarized more quickly

292
Limitations of sampling…
• There is always a sampling error

• Sampling may create a feeling of discrimination within the population

• Sampling may be inadvisable where every unit in the population is legally required to have a record

293
Types of sampling
A. Probability sampling
– Subjects of the sample are chosen based on known (non-zero)
probabilities.
– Guarantees that every element in the population of
interest has a known, non-zero probability of being chosen for the
sample (an equal probability in simple random sampling); “random”
selection.

B. Non-probability sampling
– we do not know the probability that each population
element will be chosen, and/or
– we cannot be sure that each population element has a
non-zero chance of being chosen.
294
Main differences
Probability sampling
• Every item has a chance of being selected.
• Randomization is a feature of the selection process.
• Elements are chosen randomly with a (non-zero) probability
• Produces representative data

Non-probability sampling
• Not every item has a chance of being selected
• Relies on the assumption that there is an even distribution of characteristics within
the population
• Elements are chosen arbitrarily
• Produces non-representative data

295
Types of Sampling Methods

Sampling
• Probability sampling
– Simple random
– Systematic
– Stratified
– Cluster
– Multistage
• Non-probability sampling
– Convenience
– Volunteer
– Purposive
– Quota
– Snowball

296
I. Probability Sampling
• A probability sampling method is any method of
sampling that utilizes some form of random selection.

• Is more complex, more time-consuming and usually


more costly than non-probability sampling

• Inferences can be made about the population

297
Probability Sampling…
• The population of interest is clear (because it
must be identified before sampling from it.)

• Possible sources of bias, such as self-selection and interviewer selection
effects, are removed.

• The general size of the sampling error can be estimated
298
Probability Sampling…
• Includes
1. Simple Random Sampling (SRS)
2. Systematic Sampling
3. Stratified Random Sampling
4. Cluster Sampling
5. Multistage Sampling

299
1. Simple random sampling
• Each sampling unit in the population has an equal chance of
being included in the sample.
• Steps
1. Define the population
2. Determine the desired sample size
3. List all members of the population or the potential
subjects (sampling frame)-we can use codes
4. Select the desired samples by simple random methods
– We can apply methods like
• Lottery method (sample drawn from a box)
• Table of random numbers
• Computer-generated random numbers
300
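
The last of these methods, computer-generated random numbers, can be sketched as follows. The frame of 100 coded units and the sample size of 20 are illustrative assumptions, and the function name `simple_random_sample` is introduced here for the example.

```python
import random

def simple_random_sample(frame, n, seed=None):
    """Draw n units from the sampling frame without replacement.

    Every unit in the frame has the same chance of being selected.
    """
    rng = random.Random(seed)  # a seed makes the draw reproducible
    return rng.sample(frame, n)

# Hypothetical sampling frame: units coded 1..100.
frame = list(range(1, 101))
sample = simple_random_sample(frame, 20, seed=42)
print(sorted(sample))
```

Fixing the seed is useful for documenting how a sample was drawn; leaving it as `None` gives a different sample on each run.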
Advantages of SRS
• Each unit in the sampling frame has an equal
chance of being selected

• The formulas are easy to use.

• Easy to apply to small populations.

301
Disadvantages of SRS
• Can be expensive and unfeasible for large
populations –need complete list.

• Minority subgroups may not be present in the


sample in sufficient numbers for the study

302

2. Systematic random sampling
Individuals are chosen at regular intervals from the sampling
frame

Steps :
1. Number the units on your frame from 1 to N
2. Determine the sampling interval (K) by dividing N by n. Example:
N=100, n=20, then K=N/n=100/20=5
3. Select a number between 1 and K at random. This number is
called the random start.
4. Using the example above, you would select a number between 1
and 5.
5. Select every Kth (in this case, every fifth) unit after the first
number.
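
The steps above can be sketched in code. This is a minimal illustration that assumes a numbered frame is available and that N is at least n; the function name `systematic_sample` is introduced here for the example.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Select n units at a regular interval K = N // n after a random start."""
    N = len(frame)
    k = N // n                    # sampling interval K
    rng = random.Random(seed)
    start = rng.randint(1, k)     # random start between 1 and K, inclusive
    # Positions start, start + K, start + 2K, ... (1-based numbering).
    positions = range(start, N + 1, k)
    return [frame[pos - 1] for pos in positions][:n]

frame = list(range(1, 101))       # units numbered 1..100
sample = systematic_sample(frame, 20, seed=1)
print(sample)
```

With N = 100 and n = 20 the interval is 5, so the sample consists of the random start and every fifth unit after it.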

303
Systematic random sampling…

304
Advantages of Systematic sampling
– Does not necessarily require a sampling frame
– Easier to perform
– Requires less time than SRS
– Very good when the population from which the
sample is to be drawn is homogeneously
distributed.
Disadvantage:
– If the list has a pattern or periodicity, the sample may be non-representative

305

3. Stratified Sampling
• The population is first divided into groups of elements having similar
characteristics, called strata.

• Each element in the population belongs to one and only one stratum.

• It is appropriate when the distribution of the characteristic to be studied is heterogeneous

• Best results are obtained when the elements within each stratum form a
homogeneous group

• Maximum homogeneity within the groups and maximum heterogeneity among the groups contribute to the accuracy of the estimates.

• A simple random sample is taken from each stratum

306
Stratified Sampling…
• A separate sample is then taken from each stratum by random
sampling

• The sampling method can vary from one stratum to another.

• Proportionate allocation
– The same sampling fraction is used for each stratum

• Non-proportionate allocation
– Different sampling fraction is used or
– Though the strata are unequal in size, a fixed number of
units is selected from each stratum

307
Advantages Stratified Sampling
• If strata are homogeneous, this method is as
“precise” as simple random sampling but with
a smaller total sample size

• Good representation of minorities under non-proportional allocation

• This increases the adequacy of the sample from each stratum and helps
equalize the statistical power of tests of differences between
strata.

308
Disadvantages Stratified Sampling
• Can be difficult to select relevant stratification variables

• Not useful when there are no homogeneous subgroups

• Can be expensive

• Requires accurate information about the population, if


not it introduces bias.

309
Example
• Suppose that an institution (e.g. AAU) has 1800 (N) staff, from
which 400 (n) are to be selected proportionally:
– Male academic staff = 900
– Male administrative staff = 180
– Female academic staff = 90
– Female administrative staff = 630

• To take a sample of 400 staff, stratified according to the above categories, we use the formula for proportional
allocation:

$$n_i = \frac{N_i}{N} \times n$$

310
Example…
By using the formula
– Male academic staff = (900 / 1800) x 400 = 200
– Male administrative staff = (180 / 1800) x 400 = 40
– Female academic staff = (90 / 1800) x 400 = 20
– Female administrative staff = (630 / 1800) x 400 = 140

• Final = 200 + 40 + 20 + 140 = 400
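
Proportional allocation is a one-line calculation per stratum, shown here as a sketch with the staff numbers from the example. The dictionary keys are labels chosen for illustration.

```python
# Stratum sizes N_i from the example.
strata = {
    "male academic": 900,
    "male administrative": 180,
    "female academic": 90,
    "female administrative": 630,
}
n = 400                      # total sample size
N = sum(strata.values())     # total population size, 1800

# n_i = (N_i / N) * n, rounded to a whole number of people.
allocation = {name: round(size * n / N) for name, size in strata.items()}

for name, n_i in allocation.items():
    print(f"{name}: {n_i}")
print("total:", sum(allocation.values()))
```

In this example every n_i is a whole number. In general the rounded stratum sizes may not add up exactly to n, and a small manual adjustment is then needed.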

311
4.Cluster sampling
• Is a sampling technique used when "natural" groupings are
evident in a statistical population.

• If not, the population is first divided into separate groups of elements called clusters

• The reference population (assumed homogeneous) is divided into clusters, often geographical units

• A simple random sample of the clusters is then taken

• All the units in the selected cluster are studied

312
Cluster sampling…
Cluster samples are generally used if:

• No list of the population exists.

• Well-defined clusters, which will often be geographic areas, exist.

• A reasonable estimate of the number of elements in each level of clustering can be made.

• Often the total sample size must be fairly large to enable cluster sampling to be used effectively.

313
Cluster sampling…
Advantages:
• Sampling frame of the reference population is not required
(Sufficient to have a list of clusters)
• Cost effective

Disadvantage:
• Based on the assumption that the study units are uniformly
distributed throughout the reference population, which may
not always be the case.
• We do not have total control over the final sample size

314
5. Multistage sampling
• Used when the reference population is large and widely
scattered.
• Selection is done in stages until the final sampling units are
reached.
– Primary sampling units –from the first sampling stage
– Secondary sampling units- from the second sampling
stage etc..
• Finally study subjects will be selected by SRS
• No need of sampling frame for the reference population.

315
Multistage …
Advantage
• Cuts the cost of preparing the sample frame

Disadvantage
• Sampling error is high compared with simple random
sampling (so we need to use a design effect)
• Less precise estimation than SRS for the same sample size, but the
reduction in cost outweighs this and allows for a larger sample
size

316
Example Multistage …
• Suppose a researcher wanted to study the risk of
HIV/AIDS among AAU students and wanted to
include 1500 students. How can he go about it?
• Multi stage
– Primary sampling unit: Campus/college
– Secondary sampling unit: Departments
– Tertiary sampling unit: students

317
Multistage …

318
II. Non-Probability Sampling
Advantage

• Used when a sampling frame does not exist

• They are quick, inexpensive and Convenient

• Useful when descriptive comments about the sample itself


are desired

• Good for pretests, pilot studies, In-depth interviews

• Used when Precise representativeness is not necessary.

319
Non-Probability Sampling…
Disadvantages
• No random selection (non-representative)
• Reliability cannot be measured
• No way to measure the precision of the resulting
sample.
• Inappropriate for generalizing findings obtained from
a sample to the population.

320
Types Non-Probability Sampling

1. Convenience/ opportunity/haphazard/accidental sampling.

2. Volunteer sampling

3. Purposive/ judgemental sampling

4. Quota sampling

5. Snowball sampling

321
1.Convenience/opportunity/accidental
sampling
• Selection of a sample based on easy accessibility and
convenience

• Is not representative of the target population

• it may deliver accurate results when the population is


homogeneous

322
2.Volunteer sampling
• As the term implies, this type of sampling occurs when people
volunteer their services for the study
• The sample is taken from a group of volunteers
• Sometimes, the researcher offers payment to entice
respondents
• Commonly used in psychological experiments or
pharmaceutical trials (drug testing),
• Used because it would be difficult and unethical to enlist
random participants from the general public; its limitation is that the sample consists only of volunteers.

323
3.Purposive/Judgemental sampling
• The selection of a sample based on judgment and knowledge
of the subject

• It is subject to the researcher's biases and is more biased than haphazard sampling
• Can be used in pre-testing of questionnaires

• Also used for focus groups or in-depth interviews

• Example
– In laboratory settings, choice of experimental subjects (i.e., animal,
vegetable etc.)

• Reflects the investigator's pre-existing beliefs about the population.
324
4.Quota sampling
• Is one of the most common forms of non-probability sampling

• The population is first segmented into mutually exclusive sub-groups

• A quota is given to select the subjects or units from each segment based on a specified proportion.

• In quota sampling the selection of the sample is non-random.

325
Quota ….
• Advantages
– Quota sampling is generally less expensive than random sampling.
– Easy to administer
– It is an effective sampling method when information is urgently
required and can be carried out independent of existing sampling
frames.

• Disadvantages
– It does not meet the basic requirement of randomness.
– Some units may have no chance of selection or the chance of
selection may be unknown. Therefore, the sample may be
biased.

326
5.Snowball sampling
• Snowball sampling is a special non-probability method used
when the desired sample characteristic is rare.
• Lower cost
• But biased
Involves two main steps.
1. Identify a few key individuals
2. Ask these individuals to volunteer to distribute the questionnaire
to people they know who fit the characteristics of the desired
sample

327
Errors In sampling

Sampling error (random error)

Non sampling error (bias)

328
Sampling error
• A sample is a subset of a population.

• Because of this, results obtained from a sample cannot reflect the full range of variation
found in the population; the resulting error arises from the sampling
process itself.

• Can be reduced by increasing the size of the sample

• When n = N, sampling error is 0

329
Non sampling error
• It is a type of systematic error in the design or
conduct of a sampling procedure which results
in distortion of the sample

• How to reduce/avoid it
– By careful design of the sampling procedure, not
by increasing the sample size, and
– By testing the data collection tool

330
Thank You!

331
Sampling distribution and
sample size Determination

Lecture by Gurmesa Tura


(MPH)
AAU, SPH
332
May 2011
Learning objectives
At the end of this class the students will be able to:

• Explain the concepts of sampling distribution

• Become familiar with different approaches to determining sample size and be able to calculate
sample size for different study objectives.

333
334
Sampling distribution….
• The sampling distribution of a statistic is the
probability distribution of all possible values the
statistic may assume, when computed from random
samples of the same size, drawn from a specified
population.

• The sampling distribution of the sample mean X̄ is the probability
distribution of all possible values the random
variable X̄ may assume when a sample of size n is
taken from a specified population

335
Sampling distribution….
Suppose that we calculate a sample mean (X̄) as an
estimate of the population mean (μ).
It is possible to select many samples of size n from a
population.

The value of this sample estimate of the parameter would differ from one random sample to the next.
By determining the distribution of these estimates, a
statistician is then able to draw an inference based on the
distribution of sample statistics.
This distribution that is so important to us is called the
sampling distribution of the estimate
336
THE CENTRAL LIMIT
THEOREM
Suppose we have taken a random sample of size n
(usually n > 30) from a population

We assume the population has a mean (μ) and a standard deviation (σ).

We then can assert the following:

337
338
CENTRAL LIMIT THEOREM…
2. The mean of the distribution of sample means is equal to
the mean of the population distribution

$$\mu_{\bar{x}} = \mu$$

where $\mu_{\bar{x}}$ = the mean of the distribution of the sample means

This statement signifies that the sample mean is an unbiased estimate of the population mean

339
CENTRAL LIMIT THEOREM…
3. The standard deviation of the distribution of sample
means is equal to the standard deviation of the
population divided by the square root of the sample
size

$$\sigma_{\bar{x}} = \frac{\sigma}{\sqrt{n}}$$

where $\sigma_{\bar{x}}$ = the standard deviation of the distribution of
sample means based on n observations

• We call $\sigma_{\bar{x}}$ the standard error of the mean.


340
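
The two statements about the mean and standard deviation of the sample means can be illustrated by simulation. The sketch below is an assumption-laden example, not part of the lecture: it builds an artificial skewed population (exponential values), draws 2,000 samples of size 50 with replacement, and compares the results with μ and σ/√n.

```python
import random
from math import sqrt
from statistics import mean, pstdev

rng = random.Random(0)  # fixed seed so the run is reproducible

# Hypothetical, clearly non-normal population of 20,000 values.
population = [rng.expovariate(1.0) for _ in range(20_000)]
mu = mean(population)
sigma = pstdev(population)

n = 50  # sample size
sample_means = [mean(rng.choices(population, k=n)) for _ in range(2_000)]

print(f"population mean       = {mu:.3f}")
print(f"mean of sample means  = {mean(sample_means):.3f}")
print(f"sigma / sqrt(n)       = {sigma / sqrt(n):.3f}")
print(f"SD of sample means    = {pstdev(sample_means):.3f}")
```

The mean of the sample means should be close to the population mean, and their standard deviation close to the standard error σ/√n, even though the population itself is skewed.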
Sample size determination
• In planning any investigation we must decide
how many people need to be studied in
order to answer the study objectives.

• Too small:
– We may fail to detect important effects or
may estimate effects too imprecisely
– False conclusions

• Too large:
– Unnecessary involvement of extra subjects
– High cost
– Time constraints
341
Sample size…
The main determinant of the sample size is how
accurate the results need to be.

Beyond a certain point, it is much better to increase the accuracy of data
collection than to increase the sample size.

It is better to make extra efforts to get a representative sample than to get a very large
sample

The sample size is a compromise between what is desirable and what is feasible.

It is important to consider the available resources.


342
Things to consider while determining
sample size
1. The study design
– Cross-sectional, cohort, case-control, RCT etc.
2. The parameter to be estimated
– Continuous outcomes
• Single mean
• Comparison of two means
– Categorical outcomes
• Single proportion
• Comparison of two proportions

3. Level of confidence (usually 95%), i.e. level of significance 5%
4. Power of the study (for comparisons; usually 80%)
5. Margin of error (the accuracy within which the investigator
desires the estimate to be of the true value, at a given level of
confidence)
343
1. For Single mean
Used when the outcome variable is continuous

$$n = \frac{Z_{1-\alpha/2}^{2}\,\sigma^{2}}{d^{2}}$$

Where:
• n = minimum required sample size
• Z = upper critical value of the standard normal distribution for the
1 − α confidence level
• d = margin of error
• σ = population standard deviation
344
Finally we need to add 10% of n for the non
Finite population correction
If the source population (N) is < 10,000 or n > 10% of
N, we need a finite population correction

$$n_f = \frac{n}{1 + \frac{n}{N}}$$

345
Example
Assume a physician wants to study the systolic blood
pressure (SBP) of people 20-39 years of age in a certain
country.
The normal values are μ = 120 mmHg & σ = 10 mmHg
How many people should he include in the study if he
wants the estimated mean SBP to be within 2 mmHg of the true mean (i.e. not above
122 mmHg) with 95% confidence?
a. From a source population of 50,000
b. From a source population of 6,000.

346
Solution (a)
Given
Z = 1.96 as the confidence level is 95%
σ = 10 mmHg
d = 122 mmHg − 120 mmHg = 2 mmHg

$$n = \frac{(1.96)^{2} \times (10\ \text{mmHg})^{2}}{(2\ \text{mmHg})^{2}} = 96.04 \approx 96$$

Adding 10%: 96 + 0.1 × 96 = 105.6 ≈ 106 people

347
Solution for b.
As N = 6,000 < 10,000, we need the finite population correction

$$n_f = \frac{106}{1 + \frac{106}{6000}} \approx 104 \text{ people}$$
348
2. For single proportion
Used when the outcome variable is categorical

$$n = \frac{Z_{1-\alpha/2}^{2}\; p\, q}{d^{2}}$$

Where:
• n = minimum required sample size
• Z = upper critical value of the standard normal distribution for the
1 − α confidence level
• d = margin of error
• p = expected proportion of the population
with the outcome of interest (prevalence)
• q = 1 − p: the probability of non-occurrence
of the event of interest
349 Finally we need to add 10% of n for the non
Single proportion…
We need also to use finite population correction here if
the source population is <10,000.
Example.
A survey is needed to estimate prevalence of influenza virus infection
in school children
Suppose the available evidence suggests that approximately 20% of
the children will have antibodies to the virus.
Assume the investigator wants to estimate the prevalence within 5% of
the true value.
a. Calculate the sample size assuming source population of 40,000
b. Calculate the sample size assuming source population of 4,000

Solution (a)

$$ n = \frac{(1.96)^{2}\,(0.2 \times 0.8)}{(0.05)^{2}} = 245.8 \approx 246 $$

Adding 10% for non-response: 246 + 0.1 x 246 = 270.6 ≈ 271 children
Solution (b)
As the source population is <10,000, we need the finite
population correction

$$ n_f = \frac{271}{1 + \frac{271}{4000}} = 253.8 \approx 254 \text{ children} $$
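The influenza example can be reproduced the same way. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and values are rounded to the nearest whole number as in the worked solution.

```python
def sample_size_single_proportion(z, p, d):
    """n = z^2 * p * (1 - p) / d^2 for estimating a single proportion."""
    return (z ** 2) * p * (1 - p) / (d ** 2)


def finite_population_correction(n, N):
    """n_f = n / (1 + n / N)."""
    return n / (1 + n / N)


# Influenza example: expected prevalence 20%, margin of error 5%, 95% confidence
n = round(sample_size_single_proportion(1.96, 0.20, 0.05))  # 245.9 -> 246
n_adj = round(n * 1.10)                                     # 270.6 -> 271 (10% non-response)
n_f = round(finite_population_correction(n_adj, 4000))      # 253.8 -> 254

print(n, n_adj, n_f)
```

For the source population of 40,000 the uncorrected 271 is used; for 4,000 the corrected value is 254.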
Single proportion…
Usually we obtain ‘p’ from previous
similar studies or a pilot test.

If we don’t have a previous similar study
and a pilot test is impossible, we use 50%
(p=0.5) to get the maximum sample size
with margin of error 5% (d=0.05).
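Why p = 0.5 gives the maximum sample size can be seen numerically: the product p(1-p) in the numerator is largest at p = 0.5. The following is a small illustrative sketch; the grid of p values and the variable names are assumptions made for the demonstration.

```python
# Evaluate p * (1 - p) over a grid of candidate proportions.
candidates = [i / 10 for i in range(1, 10)]   # 0.1, 0.2, ..., 0.9
variances = {p: p * (1 - p) for p in candidates}

# The proportion that maximizes p(1 - p), and hence the required n.
p_worst = max(variances, key=variances.get)

# Sample size at p = 0.5 with d = 0.05 and Z = 1.96 (before adding non-response).
z, d = 1.96, 0.05
n_max = (z ** 2) * p_worst * (1 - p_worst) / (d ** 2)

print(p_worst, n_max)
```

Any other value of p gives a smaller p(1-p), so using 0.5 is the conservative choice when no estimate is available.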
Exercise
Suppose that a study is to be conducted to estimate the
smoking rate among adult males in Addis Ababa.
Assume that the current smoking rate among adult
males in Addis Ababa in general is about 27%. The
estimate is desired to be within 3% of the true rate in the
general population with 95% confidence.
a. Determine the required number of adult males to be
included in the study based on the above data.

b. What will be the required number of adult males to be
included in the study if the rate of smoking is
unknown?
3. For comparing two means
In this case we need power in addition to the
significance level.

Power is the chance of being in the rejection region if
the alternative hypothesis is true.
Sample size formula for difference in means…

$$ n_1 = \frac{(r+1)}{r}\,\frac{\sigma^{2}\,(Z_{1-\beta} + Z_{1-\alpha/2})^{2}}{\text{difference}^{2}} = \frac{(r+1)}{r}\,\frac{\sigma^{2}\,(Z_{1-\beta} + Z_{1-\alpha/2})^{2}}{(\bar{X}_1 - \bar{X}_2)^{2}} $$

where:
• $n_1$ = size of smaller group
• r = ratio of larger group to smaller group
• σ = standard deviation of the characteristic
• difference = clinically meaningful difference in means of the outcome
• $Z_{1-\beta}$ corresponds to power (80% power, Z = 0.84)
• $Z_{1-\alpha/2}$ corresponds to the two-tailed significance level
(95% level of confidence, Z = 1.96 for α = .05)
• $n_2 = r\,n_1$
Difference in means…
If r = 1 (equal groups), then

$$ n_1 = \frac{2\,\sigma^{2}\,(Z_{1-\beta} + Z_{1-\alpha/2})^{2}}{\text{difference}^{2}} $$

$$ n_2 = n_1 $$
Example
Suppose the investigator wanted to compare the difference in
mean hemoglobin level between adult males and adult
females. From a previous study, the mean hemoglobin level for
normal adult males is 15 g/100 ml and that of normal females
is 13 g/100 ml. The standard deviation is about 3 g/100 ml.

a. Calculate the sample size to have an equal number of males and
females in your sample.
b. Calculate the sample size to have the number of males in the sample
be 3 times that of females.
Use 95% confidence level and 80% power.
Solution (a)
Equal groups, so r = 1

$$ n_1 = \frac{2(3)^{2}(0.84 + 1.96)^{2}}{(15 - 13)^{2}} = \frac{18(7.84)}{4} \approx 35 $$

$$ n_2 = n_1 = 35 $$

We need to add 10% for non-response.
So, we need 39 females and 39
males, a total of 78 study subjects.

Solution (b)
The male to female ratio is
required to be 3:1, so r = 3

$$ n_1 = \frac{(3+1)}{3}\,\frac{(3)^{2}(0.84 + 1.96)^{2}}{(15 - 13)^{2}} = \frac{4(9)(7.84)}{12} \approx 24 $$

$$ n_2 = r \times n_1 = 3 \times 24 = 72 $$

Add 10% for non-response.
So, we need 27 females and 79
males, a total of 106 study subjects.
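The hemoglobin example can be verified with a short Python sketch of the difference-in-means formula. The function name and argument names are illustrative assumptions, not taken from the lecture; results are shown before the 10% non-response allowance is added.

```python
def sample_size_two_means(sigma, difference, r=1.0, z_beta=0.84, z_alpha=1.96):
    """Size of the smaller group:
    n1 = (r + 1) / r * sigma^2 * (z_beta + z_alpha)^2 / difference^2
    """
    return (r + 1) / r * sigma ** 2 * (z_beta + z_alpha) ** 2 / difference ** 2


# Means 15 and 13 g/100 ml, standard deviation 3 g/100 ml
diff = 15 - 13

# Equal groups (r = 1)
n1_equal = round(sample_size_two_means(3, diff, r=1))   # 35.28 -> 35

# Males three times the number of females (r = 3)
n1_ratio = round(sample_size_two_means(3, diff, r=3))   # 23.52 -> 24
n2_ratio = 3 * n1_ratio                                 # 72

print(n1_equal, n1_ratio, n2_ratio)
```

Note that the unequal design needs more subjects in total (24 + 72 = 96) than the equal design (35 + 35 = 70) for the same power.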
4. Comparing two proportions
To compare two proportions we use the following formula

$$ n = \frac{(Z_{1-\beta} + Z_{1-\alpha/2})^{2}\,\big(p_1(1 - p_1) + p_2(1 - p_2)\big)}{(p_1 - p_2)^{2}} $$

• n = sample size in each group (for 1:1 ratio)
• $Z_{1-\beta}$ = power of the study, usually 80% ($Z_{1-\beta}$ = 0.84)
• $Z_{1-\alpha/2}$ = confidence level of the study, usually 95% ($Z_{1-\alpha/2}$ = 1.96)
• $p_1$ = proportion of outcome of interest in group 1
• $p_2$ = proportion of outcome of interest in group 2
Example

Suppose the investigator wants to conduct a study to see
whether there is a difference in the rate of malnutrition
between infants who are exclusively breastfed and those who use
mixed feeding.

Assume that from a previous study in a similar population, the
prevalence of malnutrition among mixed feeding infants is
20% and among those on exclusive breastfeeding is 15%.

Determine the minimum sample size required for both
groups of infants. Use 95% confidence level and 80%
power.
Solution
Given
P1 = 20% = 0.2 (among those on mixed feeding)
P2 = 15% = 0.15 (among those on exclusive breastfeeding)
Confidence level = 95%, Z1-ɑ/2 = 1.96
Power = 80%, Z1-β = 0.84

$$ n_1 = \frac{(0.84 + 1.96)^{2}\,\big(0.2(0.8) + 0.15(0.85)\big)}{(0.2 - 0.15)^{2}} = \frac{7.84 \times 0.2875}{0.0025} = 901.6 \approx 902 $$

$$ n_2 = n_1 = 902 $$

By adding 10% for non-response, we need about 992 infants in each
group, so a total of about 1,984 infants will be included in the study.
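As a cross-check, the two-proportion formula can be evaluated directly in Python. This is a hedged sketch with a hypothetical function name; it evaluates the formula with the squared difference in the denominator, as the formula is stated, and leaves the 10% non-response allowance to be added afterwards.

```python
def sample_size_two_proportions(p1, p2, z_beta=0.84, z_alpha=1.96):
    """Per-group n for a 1:1 design:
    n = (z_beta + z_alpha)^2 * (p1(1 - p1) + p2(1 - p2)) / (p1 - p2)^2
    """
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return (z_beta + z_alpha) ** 2 * variance_sum / (p1 - p2) ** 2


# Malnutrition example: 20% with mixed feeding, 15% with exclusive breastfeeding
n_per_group = round(sample_size_two_proportions(0.20, 0.15))  # 901.6 -> 902

print(n_per_group)
```

Because the difference between the two proportions is small (0.05) and is squared in the denominator, the required sample size is large compared with the earlier examples.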
Thank you!
