1
Biostatistics
Samson G/Medhin, MPH
Course Objective
General Objectives:
• To acquaint students with the basic and intermediate
statistical concepts and tools for collecting, analyzing,
presenting and drawing conclusions from data.
2
Course Objective
Specific objectives:
At the end of the course students will be able to:
• Describe the scope and application of statistics;
• Identify the types of variables and scales of
measurement;
• Describe data with appropriate diagrammatic and
numeric summary techniques;
• Understand the basic rules of probability and their
statistical application in health sciences;
• Comprehend different sources of health and
demographic data and appreciate their respective
advantages and disadvantages;
3
Course Objective Cont…
• Understand the basic sampling techniques;
• Calculate optimal sample size for different types of
studies;
• Calculate and interpret confidence intervals;
• Carry out hypothesis testing about different statistical
parameters;
• Understand and apply intermediate statistical methods
including correlation, linear regression, logistic
regression and ANOVA;
• Carry out exploratory data analysis using SPSS;
• Understand and interpret statements in published
articles pertaining to statistics.
4
3/1/2010
2
Time Schedule
• Time Schedule.doc
5
Mode of Evaluation
• Mid 35%
• Final 40%
• Assignments/Quiz 10%
• Term paper 15%
6
References
1. M. Pagano and K. Gauvreau. Principles of Biostatistics, 2nd ed.,
Duxbury Thomson Learning, 2000.
2. T. Colton. Statistics in Medicine, Lippincott Williams & Wilkins
Publisher, 1974.
3. B. Rosner. Fundamentals of Biostatistics, 6th ed., Thomson
Books, 2006.
4. M. Bland. An Introduction to Medical Statistics, 5th ed., Oxford
Medical Publications, 1993.
5. W. Daniel. Biostatistics: A Foundation for Analysis in Health
Sciences, 8th ed., John Wiley and Sons Inc, 2005.
6. Landau S and Everitt BS. Handbook of Statistical Analyses
using SPSS, Chapman & Hall/CRC, 2004.
7
Introduction
What is Statistics?
• Statistics is a field of study concerned with the collection,
organization and summarization of data, and drawing of
inferences about a body of data when only part of the data
is observed.
• It is concerned with:
– Designing experiments and data collection,
– Summarizing information to aid understanding,
– Drawing conclusions from data,
– Estimating the present and predicting the future based
on Statistical evidence.
9
What is Statistics?
• Mathematical statistics: Is concerned with the development
of new methods of statistical inference and requires
detailed knowledge of abstract mathematics.
• Applied statistics: Involves applying the methods of
mathematical statistics to specific subject areas.
• Biostatistics is the application of statistical methods to
biological phenomena.
10
What is Statistics cont…
• In clinical medicine and public health (PH), statistics can be
applied to:
– Determine the accuracy of measurement,
– Compare measurement techniques,
– Assess diagnostic tests,
– Determine normal values,
– Estimate prognosis,
– Compare the efficacy of treatment techniques,
– Determine the prevalence of an event,
– Identify determinants of health problems,
– Compute adequate sample sizes for studies,
– Etc.
11
Statistical Data
• Refers to the numerical description of things in the
form of counts or measurements.
• Though statistical data always involve numeric
description, not all numeric descriptions are statistical
data.
• Statistical data should have the following characteristics:
– They must be in aggregate,
– They must be affected to a marked extent by multiple causes,
– They must be collected in a systematic manner,
– They must be estimated with reasonable accuracy,
– They must be placed in relation to each other.
12
Classification of Statistics
• Descriptive Statistics: Is the methodology of effectively
collecting, organizing and describing data.
• Inferential Statistics: Includes:
• Inductive Statistics: The process of drawing
conclusions about unknown characteristics of a
population, based on a study of a sample.
• Predictive Statistics: The process of predicting the future
based on historical data.
13
Classification Cont..
• During analysis based on the underlying assumptions,
statistics (statistical methods) can be classified as:
• Parametric statistics: is a branch of statistics that
assumes data come from a type of probability
distribution and makes inferences about the data based
on the distribution.
• Nonparametric statistics: Interpretation does not depend
on the population fitting any particular distribution.
14
Rationale of Studying
Statistics
• It enables us to organize information in a formal manner.
• Issues in science are becoming more and more
quantitative.
• Statistics is extensively used in the medical literature.
• The planning, conducting and implementing of medical
and public health research are highly reliant on statistical
methods.
• There is a great deal of intrinsic variation in most
biological processes.
15
Possible Limitations of
Statistics
• It mainly deals with variables which can be quantified.
• It deals with aggregates of facts; it may not give information
about individuals.
• Highly reliant on cutoff points.
• Analysis is done based on multiple assumptions.
• Errors are possible in statistical decisions.
16
Types of Variables
• A variable is any characteristic of a study unit (for example,
an individual) that is measurable and/or classifiable,
and that can take different values for different units.
• Depending on their quantifiability, variables can be classified as
Qualitative and Quantitative variables.
• Qualitative (Categorical) Variable: is a characteristic
which cannot be measured in quantitative form but can
be identified by names or categories. For example:
religion, ethnicity, illness status (well or ill), treatment
outcome (improved or not improved), stage of breast
cancer (I, II, III, IV), etc.
17
Types of Variables Cont…
• Quantitative Variable: is a characteristic that can be
measured and expressed numerically.
• This can be of two types:
• Discrete Quantitative Variable:
– Can only take a countable number of distinct values (usually
whole numbers).
– Example: number of children, number of episode of illness.
• Continuous Quantitative Variable:
– Measured on continuous scale.
– It can assume infinite number of values between two given
values.
– Example: height, weight, age, blood sugar level.
18
Scale of Measurement
• In clinical medicine and public health as in many other
areas of science, we typically assign numbers to various
attributes of people, objects, or concepts.
• This process is known as measurement.
• The process of measurement involves assigning
numbers to observations according to rules.
• The way that the numbers are assigned determines the
scale of measurement.
• Four scales of measurement are typically discussed
here.
19
Scale of Measurement Cont…
Nominal Scale:
• Is the lowest scale of measurement.
• Numbers are assigned to categories as "names"
arbitrarily.
• Therefore, the only number property of the nominal scale
of measurement is “identity”.
• For example classifying people according to gender is a
common application of a nominal scale. We may assign
number "1" to "male" and number "2" to "female" or the
opposite. The only mathematical operation we can
perform with nominal data is to count.
20
Scale of Measurement Cont…
Ordinal Scale:
• Ordinal scale has the property of magnitude.
• It assigns each measurement to one of a limited number
of categories that are ranked in terms of graded order.
• However the interval between the categories is not
necessarily equal.
• Example: Cancer stage, rank in a race.
21
Scale of Measurement Cont…
Interval Scale:
• Interval scale has the property of equal intervals between values.
• It doesn’t have a true zero point; the number "0" is
arbitrary.
• Similarly the ratio between two values on interval scale
doesn’t have meaningful interpretation.
• E.g., in measuring temperature on the °C scale, we can
always be confident that the distance between 25 °C and
35 °C is the same as the distance between 65 °C and 75 °C.
• However, 0 °C doesn’t mean there is no temperature.
Similarly, it would be inappropriate to say that 60 °C
is twice as hot as 30 °C.
22
Scale of Measurement Cont…
Ratio Scale:
• Ratio scale of measurement has the property of equal
interval between values and absolute/true zero.
• These properties allow us to apply all mathematical
operations (addition, subtraction, multiplication, and
division) in data analysis.
• The absolute/true zero allows us to know how many
times greater one case is than another.
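The difference between interval and ratio scales can be checked numerically. The following is a minimal sketch, assuming the Kelvin scale as the ratio-scale counterpart of °C (the Kelvin conversion is not part of the slides and is used here only for illustration): differences agree on the interval scale, but the "twice as hot" ratio does not survive a change to a scale with a true zero.

```python
def celsius_to_kelvin(celsius):
    """Convert degrees Celsius (interval scale) to kelvin (ratio scale)."""
    return celsius + 273.15

# Equal intervals are meaningful on the interval scale.
diff_low = 35 - 25
diff_high = 75 - 65

# The ratio computed on the Celsius scale is not meaningful...
ratio_celsius = 60 / 30

# ...because it changes once the scale has a true zero.
ratio_kelvin = celsius_to_kelvin(60) / celsius_to_kelvin(30)

print(diff_low == diff_high)      # the two distances are equal
print(ratio_celsius)              # 2.0 on the Celsius scale
print(round(ratio_kelvin, 3))     # about 1.099 on the Kelvin scale
```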
23
Data Collection Method
• In order to generate valid conclusions from data,
information has to be collected in a systematic manner.
• A haphazardly collected dataset is less likely to produce
valuable and generalizable information.
• Data may be derived from several sources.
• Depending on the source, it can be classified as Primary
or Secondary data.
• Primary data is gathered for the first time by the
researcher for a given purpose; while,
• Secondary data is data already collected by others, for
purposes other than the question of the research at hand.
24
Data Collection Method
Cont…
Survey through interview:
• A quantitative approach in which a standardized
questionnaire, to be administered through interview, is
used to collect information.
• Advantage
– Quick and inexpensive,
– Responses from different respondents are comparable,
– Easy to quantify and analyze,
– Useful in describing quantifiable characteristics of a
large population,
25
Data Collection Method
Cont…
– Very large and representative samples are feasible,
– Standardized questions make measurement more
precise,
– Participants do not need to be able to read and write
to respond,
• Disadvantage:
– Doesn’t give qualitative information,
– Doesn’t give opportunity to probe and explore,
– Relatively inflexible,
– Less reliable to assess behavior and attitude of
respondents,
26
Data Collection Method
Cont…
Survey through self administered questionnaire:
• A quantitative method in which a standardized
questionnaire, to be filled by the respondents
themselves, is used.
Advantage:
• Quick and inexpensive,
• Responses from different respondents are comparable,
• Useful in describing quantifiable characteristics of a large
population,
• Very large and representative samples are feasible,
• Standardized questions make measurement more
precise.
27
Data Collection Method
Cont…
• Disadvantage:
– Participants need to be able to read and write to
respond,
– High nonresponse rate,
– Doesn’t give qualitative information,
– Doesn’t give opportunity to probe and explore,
– Less reliable to assess behavior and attitude of
respondents,
– Relatively inflexible,
28
Data Collection Method
Cont…
Secondary data:
• A quantitative approach which utilizes data already
collected by others.
• Advantage:
– Less resource and time consuming,
• Disadvantage:
– May not give in depth information,
– No knowledge on the accuracy of data collection,
– Can be outdated,
– Limited control on the sampling method and size,
– Less likely to give qualitative information.
29
Data Collection Method
Cont…
Focus Group Discussion (FGD):
• A qualitative method to obtain in-depth information on
concepts and perceptions about a certain topic through
spontaneous group discussion of approximately 6–12
persons, guided by a facilitator.
• Advantage:
– Excellent approach to gather in-depth information on
the attitudes and beliefs of a group,
– Group dynamics might generate more ideas than
individual interviews,
– Provides an excellent opportunity to probe & explore,
– Participants are not required to read or write,
30
Data Collection Method
Cont…
– Unearth sensitive issues which are not commonly raised
by individuals.
– It facilitates the exploration of collective memories.
Disadvantage:
– Requires strong facilitator to guide discussion and
ensure participation by all members,
– Doesn’t give quantitative information,
– It is difficult to organize the discussion,
– Analysis is relatively difficult.
31
Data Collection Method
Cont…
In-depth interview:
• A qualitative method that relies on person to person
discussion.
• Advantage:
– Good approach to gather in-depth attitudes and
beliefs from individual respondents,
– Provides an excellent opportunity to probe and
explore,
– Participants don’t need to be able to read and write to
respond,
– Assures privacy,
32
Data Collection Method
Cont…
• Disadvantage:
– Doesn’t give quantitative information,
– It is time-consuming,
– The respondent may feel like ‘a bug under a
microscope’,
– The analysis is relatively difficult,
33
Data Collection Method
Cont…
Observation:
• A qualitative method that involves critical observation
and recording the practice (behavior, culture…) of
individuals or a group.
• Excellent approach to discover behaviors,
• Usually takes longer time,
• Liable to “Observational bias”
34
Designing Questionnaire
• Most of the data collection techniques utilize
questionnaires.
• Hence, the quality of the data is dependent on how well
the questionnaire is designed.
• There are two main objectives in designing a
questionnaire:
• To obtain accurate relevant information for the study,
• To maximize the response rate.
35
Designing Questionnaire
Cont…
• A questionnaire can be classified based on different issues:
• Structured Vs Nonstructured Questionnaire:
– The structured one is mainly designed for surveys.
– A series of questions are arranged in a logical order and
sequence and divided into subtopics.
– Skip patterns are important in a structured questionnaire.
– The data collector is expected to smoothly go through the
sequence.
– The nonstructured one is commonly used for qualitative
studies.
– It doesn’t have strict sequence of questions.
– The data collector may rearrange the questions depending
on the response of the subject.
36
Designing Questionnaire
Cont…
• Open ended Vs Close ended Questionnaire
(Question):
• Open ended questions permit free response that should
be recorded in respondent’s own word.
• Allows exploration of the range of possible themes.
• Close ended questions offer a list of possible options or
answers from which the respondents must choose.
• It is relatively easy and quick to fill, code, analyze and
report.
37
Designing Questionnaire
Cont…
Standardized Vs Nonstandardized Questionnaire:
• Standard questionnaire is developed by a well known
body and considered to be “standard” to assess a given
research question.
• A nonstandard one is developed by the researcher to
address the research question.
• What are the advantages and disadvantages of using
standardized questionnaire?
38
Steps in Designing a
Questionnaire
1. Developing Individual Questions:
– Use short and simple sentences.
– Ask for only one piece of information at a time.
– Ask precise questions to address the objective of the
study.
– Give extra attention to sensitive questions.
– Avoid leading questions.
2. Format of responses: Questions should be formatted
into open or closed formats depending on the need.
39
Steps Cont…
3. Arranging the Questions:
• Go from general to particular.
• Go from easy to difficult.
• Go from factual to abstract.
• Start with closed questions.
• Start with demographic and personal questions.
4. Piloting and Evaluation of Questionnaire.
• Given the complexity of designing a questionnaire, it is
impossible even for the experts to get it right the first
time round.
• Questionnaires must be pretested (piloted) on a small
sample of people characteristic of those in the survey.
40
Diagrammatic
Summarization
Introduction
• Data collection yields a set of data called Raw Data.
• The size of the data can range from a few hundred to
many thousands of observations.
• Raw data however will not necessarily provide
information that can easily be interpreted.
• Data presentation is a mechanism which enables easier
understanding of a given set of data through the use of
tables and graphs.
• In data summarization some of the detail in the data is
lost, but this is compensated for by a gain in
understanding of the data.
42
Tables
• Simplest means of data presentation which can be used
for all type of data.
Frequency Distribution
• One type of information that is commonly used to organize
data in tables is the Frequency Distribution.
• For nominal or ordinal data, the frequency distribution
consists of a set of categories along with numeric
counts that correspond to each one.
• Example:
43
Tables Cont…
Table 2.1: Ethnic Composition of Women of Reproductive Age in
Awassa Town, Jan 2006.
Ethnic Group    Frequency
Wolita          377
Amhara          355
Sidama          163
Oromo           144
Guragae         138
Kenbata         82
Tigray          47
Hadya           20
Others          50
Total           1376
44
Tables Cont…
• In displaying numeric data using frequency distribution
we should note the following:
• The range of values must be broken down into a series
of distinct and non-overlapping intervals.
• The intervals should cover all data points.
• Intervals are often constructed, though not necessarily,
so that all have equal width. This facilitates comparison
among classes.
• Open ended intervals should be avoided.
• The limits for each class must agree with the accuracy of
the raw data.
45
Tables Cont…
• An appropriate number of intervals should be chosen:
too many intervals summarize little, and too
few intervals lose a great deal of information.
• The rule of thumb states the number of classes should
be between 10 and 20.
• When we don’t have any evidence to decide the number of
classes, we can use Sturges’ formula:
• No. of classes = 1 + [3.322 × log10(no. of observations)]
• The width of each class can also be calculated as:
$$\text{Width of the class} = \frac{\text{Max value} - \text{Min value}}{\text{No. of classes}}$$
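The two formulas above can be sketched in a few lines. This is an illustration only: rounding the number of classes up to a whole number is an assumption (the slides do not say how to round), and the sample size and the minimum and maximum values are hypothetical.

```python
import math

def sturges_classes(n_observations):
    """Number of classes from Sturges' formula, rounded up (rounding is an assumption)."""
    return math.ceil(1 + 3.322 * math.log10(n_observations))

def class_width(min_value, max_value, n_classes):
    """Width of each class: (max value - min value) / number of classes."""
    return (max_value - min_value) / n_classes

# Hypothetical example: 100 observations ranging from 15 to 49.
k = sturges_classes(100)
width = class_width(15, 49, k)
print(k, width)
```

In practice the width is usually rounded to a convenient value that agrees with the accuracy of the raw data.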
46
Tables Cont…
Relative and Cumulative Frequency
• In addition to counts, it is useful to know the proportion of
values that fall into a given class.
• Relative frequency of a class is the proportion or
percentage of total number of observations that fall in a
given class.
• Cumulative relative frequency of a class is the proportion
(percentage) of total number of observations that have a
value less than or equal to the upper limit of a given
interval.
• If such information is given in the form of counts it is
simply called Cumulative frequency.
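As a rough sketch of these definitions, the snippet below derives relative, cumulative and cumulative relative frequencies from class counts. The counts are hypothetical values chosen for easy arithmetic, not data from the course.

```python
from itertools import accumulate

# Hypothetical class counts (not taken from the slides).
counts = {"15-19": 40, "20-24": 30, "25-29": 20, "30-34": 10}

total = sum(counts.values())

# Relative frequency: share of all observations falling in each class (%).
relative = {group: 100 * c / total for group, c in counts.items()}

# Cumulative frequency: running total of counts up to and including each class.
cumulative = dict(zip(counts, accumulate(counts.values())))

# Cumulative relative frequency: the running total expressed as a percentage.
cumulative_relative = {group: 100 * c / total for group, c in cumulative.items()}

for group in counts:
    print(group, counts[group], relative[group], cumulative[group],
          cumulative_relative[group])
```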
47
Tables Cont…
Table 2.2: Cumulative and Relative Frequency of Age Structure of Women of
Reproductive Age in Awassa Town, Jan 2006.
Age Group   Number of women   Relative Frequency (%)   Cumulative Relative Frequency (%)
15–19       399               28.9                     28.9
20–24       341               24.7                     53.6
25–29       281               20.4                     74.0
30–34       143               10.4                     84.3
35–39       116               8.4                      92.8
40–44       54                3.9                      96.7
45–49       42                3.0                      100.0
Total       1380              100.0
48
Tables Cont…
• Depending on the number of variables they represent,
tables can be classified as one-way, two-way and higher
order tables.
• One-way Table: Only one variable is summarized in the
table.
• Two-way Table (Cross tabulation): Two variables are
organized simultaneously in a combined manner in a table.
• Higher Order Table: Three or more variables are
presented simultaneously in a table. The higher the order
of the table, the more complicated the interpretation.
49
Tables Cont…
                              Child Ever Born
Educational status of women   >=5     < 5
Illiterates                   42      68
Read and Write                9       19
1st–4th grade                 32      60
5th–8th grade                 46      211
9th–12th grade                42      239
> 12th grade                  7       68
Total                         175     665
What type of table is this?
50
Tables Cont…
                             History of illness in the preceding 2 weeks
Child’s Age   Child’s Sex    Yes    No     Total
0–11 mo       Male           15     86     101
              Female         18     84     102
12–23 mo      Male           13     80     93
              Female         12     78     90
24–35 mo      Male           10     76     86
              Female         11     77     88
36–47 mo      Male           9      74     83
              Female         9      73     82
48–59 mo      Male           6      69     75
              Female         7      70     77
51
Tables Cont…
• In constructing tables, the following standards should be
followed:
– Tables should be simple and self explanatory,
– Every table should have a title (usually at the top of the table)
which indicates who, what, when, where of the data presented,
– Rows and columns should be labeled,
– Totals should be indicated,
– Numeric entries of zero should be written as “0” while missing or
unobserved data should be represented by “–”,
– If the data are not original, their source should be given as a
footnote,
– Complicated tables should be avoided.
52
Diagrammatic Representation
• A second way to present data is through the use of graphs
or pictures. (Diagrammatic Representations).
• Though diagrammatic representations are easier to read than
tables, they supply a lesser degree of detail.
• However, the lesser detail can be compensated by a gain
in understanding of the data.
• Diagrammatic representation has the following advantages:
– They are easier to understand and memorize,
– They are more attractive,
– They facilitate comparison among groups,
– They may show pattern within the data set.
53
Bar Charts (Bar Graphs)
• Bar graphs are a popular type of graph used to display a
frequency distribution for Nominal or Ordinal data.
• In the most common type, the Vertical Bar Graph
(Column Graph), the categories into which the
observations fall are presented along the horizontal axis.
• A vertical bar is drawn above each category so that
the height of the bar represents either the frequency or
relative frequency of observations within that class.
• The bars should have equal width, and be separated from
one another so as not to imply continuity.
• In the case of the Horizontal Bar Graph, the axes are
swapped: categories run along the vertical axis.
54
Bar Charts Cont…
Bar graph has different types:
• Simple Bar Graph:
– Depicts the frequency /relative frequency of classes of a variable.
– The intention is to compare the frequency of different classes of a
variable.
[Figure: simple bar graph. X axis: the time breastfeeding was initiated (within an hr, 1–24 hr, after the first day); Y axis: percentage of children aged 0–11 months, scale 0 to 70.]
55
Bar Charts Cont…
• Multiple Bar Graph:
– Depicts the frequency or relative frequency of classes of a
variable at two or more situations.
– This type enables comparison between the levels of classes of
the variable at different situations.
[Figure: multiple bar graph. X axis: the time breastfeeding was initiated (within an hr, within a day, after the first day); Y axis: %, scale 0 to 70; series: Baseline and End line.]
56
Bar Charts Cont…
• Component Bar Graph:
– Similar to a simple bar graph, except that the bars are divided into
components.
– The graph shows the relative contribution of the components to
the bar (category).
[Figure: component bar graph. X axis: the time breastfeeding was initiated (within an hr, 1–24 hr, after the first day); Y axis: percentage of children aged 0–11 months, scale 0 to 70; components: Female and Male.]
57
Bar Charts Cont…
• 100% Component Bar Graph:
– Similar to the component bar graph.
– But the height of all the bars is set at 100% so that comparison
on the relative contribution of the components can easily be
made.
[Figure: 100% component bar graph. X axis: the time breastfeeding was initiated (within an hr, within a day, after the first day); Y axis: 0% to 100%; components: Females and Males.]
58
Pie Chart
• A Pie Chart is a circular chart divided into sectors,
illustrating relative magnitudes or frequencies of classes
of a given variable.
• Pie chart usually represents categorical data but it is also
possible to use it for discrete quantitative data.
• The angle of each sector has to be proportional to the
relative frequency of a given class.
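The proportionality rule for sector angles can be illustrated as follows. This is a small sketch with hypothetical category counts; each angle is the class's relative frequency multiplied by 360 degrees.

```python
def sector_angles(frequencies):
    """Angle (in degrees) of each pie sector, proportional to relative frequency."""
    total = sum(frequencies.values())
    return {category: 360 * f / total for category, f in frequencies.items()}

# Hypothetical counts for three categories.
angles = sector_angles({"A": 50, "B": 30, "C": 20})
print(angles)
```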
59
Pie Chart Cont….
60
Histogram
• Whereas a bar chart is a representation of a frequency
distribution for either nominal or ordinal data, a Histogram
depicts a frequency distribution for continuous data.
• The horizontal axis displays the true limits of the intervals;
the vertical axis represents the frequency or relative
frequency of each interval.
• If the widths of the intervals are equal, the frequency
associated with each interval can be represented by the
height of the respective bar.
• However, if the bars have different widths, the histogram
should be drawn in such a way that the Y axis represents
the frequency density and the X axis the interval.
61
Histogram Cont…
• Then the respective frequency of the interval is
represented by the area of the bar.
• Frequency density of an interval = frequency of the
interval /true class width.
• Unlike a bar graph, in a Histogram the
categories (bars) must be adjacent. Hence, in order to
construct a Histogram, true class boundaries should be
used rather than class intervals.
• For example the following table summarizes the
Biostatistics mid exam score of 38 students out of 35
marks.
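Separately from the exam-score example, the frequency density rule for unequal class widths can be sketched as below. The class boundaries and frequencies are hypothetical; the point is only that bar height equals frequency divided by true class width, so that bar area recovers the frequency.

```python
# Hypothetical classes: (true lower boundary, true upper boundary, frequency).
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

# Frequency density = frequency / true class width.
densities = [freq / (upper - lower) for lower, upper, freq in classes]

# Area of each bar (height x width) gives back the class frequency.
areas = [d * (upper - lower) for d, (lower, upper, _) in zip(densities, classes)]

print(densities)
print(areas)
```

Note that the widest class has the largest frequency but not the tallest bar.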
62
63
Frequency Polygon
• A Frequency Polygon depicts a frequency distribution of
continuous numeric data.
• Frequency polygons are a graphical device for
understanding the shapes of distributions.
• A Histogram can easily be changed to Frequency
Polygon by joining the mid points of the top of the
adjacent rectangles of the Histogram with a line.
• It is also possible to draw Frequency Polygon without
drawing Histogram. The procedure is as follows:
64
Frequency Polygon Cont…
1. Identify the midpoints of all the class intervals
of the given data,
2. Plot the midpoints (on the X axis) against the respective
frequency or relative frequency of each class
(on the Y axis),
3. Connect adjacent points with straight lines.
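The first two steps can be sketched as follows, using hypothetical class limits and frequencies. The snippet only computes the (midpoint, frequency) pairs; drawing the connecting lines (step 3) is left to whichever plotting tool is used.

```python
# Hypothetical classes: (lower limit, upper limit, frequency).
classes = [(10, 19, 4), (20, 29, 10), (30, 39, 6)]

# Step 1: the midpoint of each class interval.
midpoints = [(lower + upper) / 2 for lower, upper, _ in classes]

# Step 2: pair each midpoint (X) with the class frequency (Y).
points = list(zip(midpoints, (freq for _, _, freq in classes)))

print(points)
```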
65
Frequency Polygon Cont…
• For example the following Frequency Distribution
represents the ages (in years) of 60 patients at a
psychiatric counseling centre.
66
Frequency Polygon Cont…
• First we have to identify the mid points of each interval.
67
Frequency Polygon Cont…
• Finally we have to plot the midpoints (as X axis) with respective
frequency of each class (as Y axis) and connect adjacent plots with
a straight line.
68
Scatter Plot (Scatter Graph)
• A scatter plot is used to show the relation between two
different continuous measurements.
• The scale for one quantity is marked on the X axis and
the scale for the other on the Y axis.
• Each point on the graph represents a pair of values for
the two measurements.
• For each value on the X axis, it is possible to have
multiple Y values.
• The following scatter plot shows the relation between
age and blood glucose level among diabetic patients
aged 50–70 years.
69
Scatter Plot Cont…
[Figure: scatter plot. X axis: age in years, 50 to 70; Y axis: blood glucose level in mg/dl, scale 120 to 200.]
70
Line Graph
• A line graph is similar to a scatter plot in that it shows the
relation between two different continuous measurements.
• Once again each point on the graph represents a pair of
values.
• However, unlike a scatter plot, each value on the X axis
has a single corresponding measurement on the Y axis.
• As the name indicates, points on the graph are connected
to the adjacent points with straight lines.
• Most commonly the scale along the X axis represents time.
Consequently we are able to trace the chronological
changes.
71
Line Graph Cont…
Figure 2.8: Mean Number of Children Ever Born to Women at the Age of
25 years, Awassa Town (1980–2005)
[Figure: line graph. X axis: year (GC), 1980 to 2005; Y axis: mean children ever born among women at the age of 25, scale 1 to 3.75.]
72
Cumulative Line Graph
• Also known as Ogive Graph.
• It is best used when you want to display the total at any
given time.
• The relative slopes from point to point will indicate
greater or lesser increases.
• For example, a steeper slope means a greater increase
than a more gradual slope.
• For example, if you saved $300 in both January and
April and $100 in each of February, March, May, and
June, the Ogive would look as follows.
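The values plotted on the Ogive for this savings example are the running totals, which can be worked out with a short sketch (the month abbreviations are just labels chosen here):

```python
from itertools import accumulate

# Monthly savings from the example above, in dollars.
monthly_savings = {"Jan": 300, "Feb": 100, "Mar": 100,
                   "Apr": 300, "May": 100, "Jun": 100}

# Running total at the end of each month: the heights plotted on the Ogive.
cumulative_total = dict(zip(monthly_savings, accumulate(monthly_savings.values())))

print(cumulative_total)
```

The steeper segments of the line end at January and April, the months with the larger deposits.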
73
Cumulative Line Graph
Cont…
74
Box and Whisker Plot
• In descriptive statistics, a box-and-whisker plot is a
convenient way of pictorially depicting groups of
numerical data through their five-number summaries:
• The smallest observation, 1st quartile, median, 3rd
quartile, and largest observation.
75
Box and Whisker Plot Cont…
• However in some cases the ends of the whiskers can
represent several possible alternative values.
• For example, in SPSS:
– The ends of the whiskers represent the lowest datum
still within 1.5 IQR of the lower quartile,
and the highest datum still within 1.5 IQR of the upper
quartile.
– Values more than 3 IQRs from the end of the box
are labeled as extreme, denoted with an asterisk (*).
Values more than 1.5 IQRs but less than 3 IQRs
from the end of the box are labeled as outliers (o).
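The whisker and outlier rules above can be sketched in code. This is an approximation, not a reproduction of SPSS: the quartiles here come from Python's `statistics.quantiles` with the "inclusive" method, which is an assumption, and SPSS may compute quartiles differently, so boundary cases can differ. The dataset is hypothetical.

```python
import statistics

def classify_for_boxplot(values):
    """Apply the 1.5 IQR and 3 IQR rules to find whisker ends, outliers, extremes."""
    data = sorted(values)
    # Quartile method is an assumption; other methods give slightly different values.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    inner_low, inner_high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    outer_low, outer_high = q1 - 3 * iqr, q3 + 3 * iqr
    inside = [x for x in data if inner_low <= x <= inner_high]
    extremes = [x for x in data if x < outer_low or x > outer_high]
    outliers = [x for x in data
                if (outer_low <= x < inner_low) or (inner_high < x <= outer_high)]
    return {
        "quartiles": (q1, median, q3),
        "whiskers": (min(inside), max(inside)),
        "outliers": outliers,
        "extremes": extremes,
    }

# Hypothetical data with one very large value.
result = classify_for_boxplot([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(result)
```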
76
Stem and Leaf Plot
• Is a display that organizes data to show its shape and
distribution.
• Each data value is split into a "stem" and a "leaf" portion.
• The "leaf" is the last digit of the number and the other
digits to the left of the "leaf" form the "stem".
• For example, the number 42 would be split apart, with
the stem becoming the 4 and the leaf becoming the 2.
• Consider the following dataset, sorted in ascending
order: 8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44,
47, 49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95.
77
Stem and Leaf Plot Cont…
0 | 8
1 | 3 6
2 | 5 6 9
3 | 0 2 7 8
4 | 0 1 4 7 9
5 | 1 4 5 8
6 | 1 3 7
7 | 5 8
8 | 2 6
9 | 5
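The same display can be produced programmatically. The sketch below assumes non-negative whole numbers and takes the last digit as the leaf, as described earlier; the function name is chosen here for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its remaining digits (stem)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)
        stems[stem].append(leaf)
    return dict(stems)

data = [8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44,
        47, 49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95]

plot = stem_and_leaf(data)
for stem, leaves in plot.items():
    print(stem, "|", " ".join(str(leaf) for leaf in leaves))
```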
78
Pictogram
• Pictogram is a graph which uses pictures or symbols to
present a certain data.
• Usually presents the frequency of one or more
categorical or discrete numeric variables in the form of
symbols.
• The magnitude can be shown either by the size of
the picture or the number of pictures.
• For example the following pictogram represents the
number of passengers per year across four airports of
UK.
79
Pictogram Cont…
80
Issues to be considered in
diagrammatic representation
• Depending on the type of the data, the right type of
diagrammatic representation should be selected.
• It is not common to use two or more types of
diagrammatic representation simultaneously for a
specific data. The best should be selected and used.
• Each graph and diagram should be labeled (usually the
title is given below the figure).
• The title should indicate “Who”, “What”, “When” and
“Where” of the data presented.
• If the representation is taken from another source the
primary source should be indicated.
81
Issues to be considered
Cont…
• In graphs, the X and Y axes should be labeled clearly
with their units of measurement.
• In graphs, the scales of the X and Y axes should be drawn
proportionally.
• Pictorial representations usually require “Key” to facilitate
easier interpretation.
• When colors are employed, contrasting colors should be
selected.
82
Diagrammatic Representation
Using SPSS
• In order to develop graphs using SPSS, the following
steps should be followed:
• Graphs > Legacy Dialogs > select the appropriate graph
• Available types are Bar graph, Pie chart, Histogram, Line
graph, Scatter plot and Box plot.
• Other rarely used types are also available.
• Most of the graphs can also be found under the “Analyze >
Descriptive Statistics” menu.
83
Numeric Summarization
Introduction
• Even though diagrammatic representation greatly
enhances understanding of the data, it does not give
mathematically amenable outputs.
• This gap is addressed by numeric summarization.
• In summarizing a dataset using numeric indicators, we
often focus on describing the data with two summary
figures. These are:
– Central Tendency (Location)
– Variation (Spread)
85
Measures of Central Tendency
• One of the most commonly used measures to
summarize a set of data is its center.
• The center is a value (usually a single value), chosen in
such a way that it gives a reasonable approximation of
the whole dataset.
• In statistics the number which tends to approximate the
center of a set of data is called Measure of Central
Tendency or Average.
• The Arithmetic Mean, Median and Mode are the most
commonly used measures of central tendency.
86
Measures of Central
Tendency Cont…
Attributes of good measure of central tendency are:
• It should be based on all observations.
• It should not be affected by extreme values.
• It should have a definite value.
• It should not be subjected to complicated computation.
• It should be capable of further algebraic treatment.
• It should be close to the location where the majority of the
observations lie.
87
Arithmetic Mean
• The Arithmetic Mean is usually called the Mean.
• It is the most familiar measure of central tendency.
• It is calculated by adding all of the individual values and
dividing the sum by the number of individual values.
• In statistics, two separate symbols are used for the mean.
• The Greek letter $\mu$ (mu) is used to denote the population
mean.
• The symbol $\bar{x}$ (read as "x bar") is used to denote the
sample mean.
88
Arithmetic Mean Cont…
• When n is the total number of observations and $x_i$ is the value
of X for the i-th observation, the formula for the arithmetic mean is
given as:
$$\text{Mean} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• In calculating the mean from grouped data we assume all
values falling into a particular class interval are located at the
midpoint of the interval.
• The formula is given as:
$$\text{Mean} = \frac{\sum_{i=1}^{k} f_i m_i}{n}$$
89
Arithmetic Mean Cont…
Where k is the number of class intervals,
$m_i$ is the midpoint of the i-th class interval,
$f_i$ is the frequency of the i-th class interval,
n is the total number of observations.
• The formula simply means each value within the interval
is represented by the midpoint of the true class interval.
Then we can calculate the mean as usual.
90
Arithmetic Mean Cont…
Example 3.1: Consider the time taken by 30 students to do
a Biostatistics quiz.

Minutes spent   Number of      True class    Midpoint   m_i f_i
on quiz         students (f)   interval      (m)
1–5             2              0.5–5.5       3          6
6–10            12             5.5–10.5      8          96
11–20           16             10.5–20.5     15.5       248
Total           30                                      350

Thus the mean of the data is 350/30 = 11.7 minutes.
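The calculation in Example 3.1 can be checked with a short Python sketch. It is only an illustration; the names `classes` and `grouped_mean` are made up for this example and are not part of the course material.

```python
# Each tuple: (lower true class limit, upper true class limit, frequency)
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

def grouped_mean(classes):
    """Mean of grouped data: sum(f_i * m_i) / n, with m_i the class midpoint."""
    n = sum(f for _, _, f in classes)
    total = sum(f * (lo + hi) / 2 for lo, hi, f in classes)
    return total / n

print(round(grouped_mean(classes), 1))  # 350 / 30, about 11.7 minutes
```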
91
Arithmetic Mean Cont…
• The major advantages of mean are:
– It is calculated based on all observations.
– Its mathematical computation is not complicated.
– It accommodates further mathematical applications.
– It can only have one value.
• The major disadvantages of mean are:
– It is affected by extreme values.
– It shouldn’t be used when the dataset is not normally
distributed.
92
Median
• The Median is the value which divides the data into two equal
halves, with half of the values being lower than the Median
and half higher than the median.
• When n is the number of observations in a dataset, the median
is calculated as follows:
– Sort the values into ascending order.
– If you have an odd number of observations, the median is
the middle observation, i.e. (n+1)/2 position of your data.
– If you have an even number of observations, the median is
the arithmetic mean of the two middle observations, i.e.
pick the numbers at positions n/2 and (n/2) + 1 and find the
mean of those two observations.
93
Median Cont…
Example 3.2: Compute the median for {1, 2, 3, 4, 5}
• The numbers are already sorted, so that it is easy to see
that the median is 3 (two numbers are less than 3 and
two are bigger).
Example 3.3: Compute the median for {1, 2, 3, 4, 5, 6}
• The median would be 3.5 since that is the middle
between 3 and 4, computed as (3 + 4)/ 2.
• Note that three numbers are less than 3.5, and three are
bigger, as the definition of the median requires.
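The positional rule for the median can be sketched in Python as follows. The function name `median` is illustrative; Python's standard `statistics.median` gives the same results.

```python
def median(values):
    """Median by position: the middle value, or the mean of the two middle values."""
    data = sorted(values)                     # step 1: ascending order
    n = len(data)
    mid = n // 2
    if n % 2 == 1:                            # odd n: the (n+1)/2-th observation
        return data[mid]
    return (data[mid - 1] + data[mid]) / 2    # even n: mean of the n/2-th and (n/2 + 1)-th

print(median([1, 2, 3, 4, 5]))     # 3
print(median([1, 2, 3, 4, 5, 6]))  # 3.5
```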
94
Median Cont…
• When we are dealing with grouped data, the median can be
calculated as:
$$\tilde{X} = L_m + \left(\frac{\frac{n}{2} - F_c}{F_m}\right) w$$
• Where:
– $L_m$ is the lower true class boundary of the interval
containing the median,
– $F_c$ is the cumulative frequency up to the interval just before the
median class interval,
– $F_m$ is the frequency of the interval containing the median,
– $w$ is the width of the median class interval,
– $n$ is the total number of observations.
95
Median Cont…
• The major advantages of the median are:
– It is not affected by extreme values,
– It can be used with skewed distributions,
– It is easy to calculate,
– It can have only one value,
– It can be calculated when there is an open-ended interval.
• The major limitations of the median are:
– It may not be a good representative value if the number of
observations is too small,
– It does not accommodate further mathematical
applications (in parametric statistics),
– It is calculated based on one or two observations.
96
Mode
• Mode is by far the simplest, but the least widely used
measure of central tendency.
• It is simply the score that occurs most frequently.
• When the distribution has only one value with the highest
frequency it is called Unimodal. If it has two values with
equal and highest frequency it is called Bimodal.
Similarly, it is possible to have a multimodal distribution.
• Example: {1, 2, 2, 3, 3, 4, 4, 4, 5}
• The mode is 4.
• In grouped data the mid point of the interval with highest
frequency is considered as the mode of the distribution.
97
Mode Cont…
For example, the following table displays the salary of 20
factory workers in factory X.

Salary in Br    Number of Factory Workers
500–600         3
600–700         6
700–800         5
800–900         5
900–1000        0
1000–1100       1

The interval 600–700 has the highest frequency, so the
midpoint of this interval, i.e. 650, is taken as the mode of the
distribution.
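As a rough illustration, the mode of raw data and the modal class of the salary table can be found in Python as sketched here. The helper `modes` and the variable `salary_classes` are hypothetical names; the function returns a list because a dataset can have more than one mode.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top)

print(modes([1, 2, 2, 3, 3, 4, 4, 4, 5]))   # [4]

# Grouped data: (lower limit, upper limit, frequency) from the salary table
salary_classes = [(500, 600, 3), (600, 700, 6), (700, 800, 5),
                  (800, 900, 5), (900, 1000, 0), (1000, 1100, 1)]
lo, hi, _ = max(salary_classes, key=lambda c: c[2])   # class with the highest frequency
print((lo + hi) / 2)                                  # 650.0
```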
99
Mode Cont…
• The major advantages of the mode are:
– It can be used when the variable is ordinal or nominal,
– It is very easy to compute,
– It is less likely to be affected by extreme values,
– It can be calculated for distributions with an open-ended class
interval.
• The major disadvantages of mode are:
– It may not perfectly denote what central tendency implies,
– It does not accommodate further mathematical application,
– It is calculated based on few observations,
– It may have more than one value for a dataset,
– At times a mode value may not exist in a dataset.
100
Skewness and the Measures
of Central Tendency
• The normal distribution is one that is bell shaped, unimodal
and symmetric.
• Skewness measures the lack of symmetry of a distribution.
• If the distribution is not symmetric, (one side does not reflect
the other), then it is skewed.
• Skewness is indicated by the “tail” or trailing frequencies of
the distribution.
• If the tail is to the right it is a positive skew. If the tail is to the
left then it is a negatively skewed distribution.
• In normal distribution, the mean, median and mode are equal.
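• Skewness affects the arrangement of the three measures of
central tendency: in a positively skewed distribution the order is
typically mode < median < mean, while in a negatively skewed
distribution it is typically mean < median < mode.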
101
Skewness and the Measures
of Central Tendency Cont…
102
Weighted Mean
• The weighted mean is similar to an arithmetic mean except it
is a mean where there is some variation in the relative
contribution of individual data values to the mean.
• Each data value ($x_i$) has a weight assigned to it ($w_i$).
• Data values with larger weights contribute more to the
weighted mean and data values with smaller weights
contribute less to the weighted mean.
• The formula is:
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
103
Weighted Mean Cont…
• If all the weights are equal, then the weighted mean is
the same as the arithmetic mean.
• The best example for the application of weighted mean
is the calculation of GPA.
• Scoring an “A” grade has larger weight than scoring a
“B” grade.
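A minimal Python sketch of the weighted mean applied to a GPA follows. It assumes the common convention that grade points are the values and credit hours are the weights; the courses, grade points and credit hours are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted mean: sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical semester: grade points (A = 4, B = 3, C = 2) and credit hours
grade_points = [4, 3, 4, 2]
credit_hours = [3, 4, 2, 1]
gpa = weighted_mean(grade_points, credit_hours)
print(round(gpa, 2))  # (12 + 12 + 8 + 2) / 10 = 3.4
```

With equal weights the function reduces to the ordinary arithmetic mean, e.g. `weighted_mean([1, 2, 3], [1, 1, 1])` gives 2.0.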
104
Geometric Mean
• The geometric mean is an average calculated by
multiplying a set of n numbers and taking the n-th root of the product:
$$G = \sqrt[n]{x_1 x_2 \cdots x_n}$$
• Geometric mean is related to the lognormal distribution.
• The lognormal distribution is a distribution which is
normal for the logarithm transformed values.
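The link with logarithms can be illustrated in Python: the geometric mean equals the exponential of the arithmetic mean of the log-transformed values. The titre values here are hypothetical and the function name `geometric_mean` is illustrative.

```python
import math

def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    if any(v <= 0 for v in values):
        raise ValueError("the geometric mean requires positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical antibody titres, which vary on a multiplicative scale
titres = [10, 20, 40, 80, 160]
print(round(geometric_mean(titres), 1))   # 40.0
print(sum(titres) / len(titres))          # 62.0, the arithmetic mean is pulled up by large values
```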
105
Harmonic Mean
• The harmonic mean (H) of n positive values is defined by the
formula:
$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
• It is the reciprocal of the arithmetic mean of the reciprocals.
• It applies more accurately to situations involving rates.
• For example: A blood donor fills a 250mL blood bag at
70mL/min on the first visit, and 90mL/min the second visit.
What is the average rate at which the donor fills a bag?
• Given:
– 250mL at 70mL/min = 3.571 mins total
– 250mL at 90mL/min = 2.778 mins total
106
Harmonic Mean Cont…
• So 500mL total in (3.571+2.778) mins total = 500/6.349 =
78.753 mL/min
• The harmonic mean of 2/[1/70+1/90] = 78.750 gives a more
accurate description of average rate, than the arithmetic mean
(80mL/min).
• Source: http://wiki.answers.com/Q/What_is_the_application_of_harmonic_mean_in_medicine
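The blood donor example can be reproduced with the following Python sketch (illustrative names only). Without intermediate rounding, the overall rate computed from total volume and total time agrees exactly with the harmonic mean.

```python
def harmonic_mean(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

rates = [70, 90]      # mL/min on the two visits
volume = 250          # mL per bag
total_time = sum(volume / r for r in rates)

print(round(2 * volume / total_time, 2))   # 78.75, total volume / total time
print(round(harmonic_mean(rates), 2))      # 78.75
print(sum(rates) / len(rates))             # 80.0, the arithmetic mean overstates the rate
```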
107
Measures of Dispersion
• While measures of central tendency are used to estimate
"center" value of a dataset, measures of dispersion are
important for describing the spread of the data, or its
variation around a central value.
• Two distinct samples may have the same mean or
median, but completely different levels of variability, or
vice versa.
– Set 1: 30, 40, 40, 50, 60, 60, 70 (Mean = 50)
– Set 2: 48, 49, 49, 50, 50, 51, 53 (Mean = 50)
108
Range
• Defined as the difference between the largest and
smallest sample values ($x_{\max} - x_{\min}$).
• Major advantage: It is simple to calculate.
• Major disadvantages:
– It depends only on extreme values and provides no
information about how the remaining data is distributed.
– The range value can not be used when the units of
measurements are different.
– The extreme values are the most unreliable parts of the
data.
– It doesn’t accommodate further mathematical
application.
109
Standard Deviation and
Variance
• Standard deviation is the most common and useful
measure of dispersion.
• It is the average distance of each score from the mean.
• The formula for sample standard deviation is given as:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$$
• The formula for population standard deviation is given as:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}}$$
• What might be the reason for the difference?
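The two formulas can be compared on Set 1 and Set 2 from the Measures of Dispersion slide with this Python sketch. The function name `std_dev` and its `sample` flag are illustrative choices.

```python
import math

def std_dev(values, sample=True):
    """Standard deviation: divides by n - 1 for a sample and by n for a population."""
    n = len(values)
    mean = sum(values) / n
    ss = sum((x - mean) ** 2 for x in values)   # sum of squared deviations
    return math.sqrt(ss / (n - 1 if sample else n))

set1 = [30, 40, 40, 50, 60, 60, 70]
set2 = [48, 49, 49, 50, 50, 51, 53]
for name, data in (("Set 1", set1), ("Set 2", set2)):
    # Both sets have mean 50, but very different spread
    print(name, round(std_dev(data), 2), round(std_dev(data, sample=False), 2))
```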
110
Standard Deviation and
Variance Cont…
• Variance is just the square of the standard deviation.
• The formulas for sample and population variance are
given as follows:
$$S^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} \qquad \sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \mu)^2}{n}$$
• NB: Occasionally, the abbreviations SD for standard
deviation and Var for variance are used.
• Standard deviation for grouped data is calculated as:
$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{n - 1}}$$
111
Standard Deviation and
Variance Cont…
• Advantages:
– They accommodate further mathematical
applications.
– They are calculated from the whole observations.
• Disadvantages:
– They must always be understood in the context of the
mean of the data.
– They are measured in the unit of measurement of the
observed data. Thus it is difficult to compare the
standard deviation/variance of two datasets
measured in two different units.
112
Coefficient of Variation (CV)
• The standard formulation of the CV is the ratio of the
standard deviation to the mean of a given dataset:
$$CV = \frac{S}{\bar{x}} \times 100\%$$
• The coefficient of variation is a dimensionless number.
• So when comparing datasets with different
units one should use CV instead of SD.
• The CV is useful in comparing the variability of several
different samples, each with a different arithmetic mean, as
higher variability is expected when the mean increases.
• CV is also important for comparing the reproducibility of
variables.
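A short Python sketch of the CV, using the standard `statistics` module, is given here. The weight and height measurements are hypothetical and serve only to show a comparison across different units.

```python
import statistics

def coefficient_of_variation(values):
    """CV = (sample SD / mean) * 100, expressed as a percentage."""
    return statistics.stdev(values) / statistics.mean(values) * 100

# Hypothetical measurements on the same five people, in different units
weights_kg = [52, 60, 65, 71, 80]
heights_cm = [155, 160, 165, 172, 180]
print(round(coefficient_of_variation(weights_kg), 1))   # relative spread of weight
print(round(coefficient_of_variation(heights_cm), 1))   # relative spread of height
```

Because the CV is unit-free, the two percentages can be compared directly even though kilograms and centimetres cannot.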
113
Example on Grouped Data
Example 3.4:
• Consider the time taken by 30 students to do a
Biostatistics quiz. Their time is summarized in the
following table.
Minutes spent on quiz    Number of students (f)
1–5                      2
6–10                     12
11–20                    16
Total                    30
114
Example Cont…
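Minutes spent   Number of      True class    Midpoint   f_i m_i   f_i m_i^2
on quiz         students (f)   interval      (m)
1–5             2              0.5–5.5       3          6         18
6–10            12             5.5–10.5      8          96        768
11–20           16             10.5–20.5     15.5       248       3844
Total           30                                      350       4630

$$\text{Mean} = \frac{\sum_{i=1}^{k} f_i m_i}{n} = \frac{350}{30} = 11.7 \text{ minutes}$$
$$\tilde{X} = L_m + \left(\frac{\frac{n}{2} - F_c}{F_m}\right) w = 10.5 + \left(\frac{15 - 14}{16}\right)(10) = 11.1 \text{ minutes}$$
$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{n - 1}} = \sqrt{\frac{4630 - 350^2/30}{29}} = 4.34 \text{ minutes}$$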
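The grouped median and standard deviation for the quiz data can be cross-checked with this Python sketch. The function names are illustrative, every observation is assumed to sit at its class midpoint for the SD, and the width used for the median is that of the median class itself (10 minutes here).

```python
import math

# (lower true limit, upper true limit, frequency) from the quiz-time table
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

def grouped_median(classes):
    """L_m + ((n/2 - F_c) / F_m) * w, with w the width of the median class."""
    n = sum(f for _, _, f in classes)
    cumulative = 0                      # F_c: cumulative frequency before the current class
    for lo, hi, f in classes:
        if cumulative + f >= n / 2:     # first class whose cumulative frequency reaches n/2
            return lo + (n / 2 - cumulative) / f * (hi - lo)
        cumulative += f

def grouped_sd(classes):
    """Sample SD with every observation placed at its class midpoint."""
    n = sum(f for _, _, f in classes)
    mean = sum(f * (lo + hi) / 2 for lo, hi, f in classes) / n
    ss = sum(f * ((lo + hi) / 2 - mean) ** 2 for lo, hi, f in classes)
    return math.sqrt(ss / (n - 1))

print(round(grouped_median(classes), 1))  # 11.1
print(round(grouped_sd(classes), 2))      # 4.34
```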
115
Measures of Position (Fractiles)
• In addition to measures of central tendency and
dispersion, measures of position give additional
information about a given dataset.
• Fractiles (Quantiles) are numbers that partition, or divide,
an ordered dataset into equal parts.
• For instance, the median is a fractile because it divides
an ordered dataset into two equal parts.
• The commonly used measures of position are Quartiles
(that divide the data into 4 parts), Deciles (that divide the
data into 10 parts), and Percentiles (that divide the data
into 100 parts).
116
Quartiles
• Quartiles divide a data set into four equal parts.
• The three quartiles $Q_1$, $Q_2$, and $Q_3$ divide an ordered
dataset into four equal parts.
– About ¼ of the data falls on or below the first quartile $Q_1$.
– About ½ of the data falls on or below the second quartile
$Q_2$ (equivalent to the median).
– About ¾ of the data falls on or below the third quartile $Q_3$.
– About ¼ of the data falls above the third quartile $Q_3$.
117
Quartiles Cont…
• In order to identify the quartiles of a given dataset:
• Sort the values in increasing order.
• Identify the quartiles accordingly:
– $Q_1$ is the {0.25(n+1)}-th observation
– $Q_2$ is the median observation, or the {0.5(n+1)}-th observation
– $Q_3$ is the {0.75(n+1)}-th observation
• NB: if the identified position is not a whole number
then the value should be determined by interpolation between the
observations on either side.
118
Quartiles Cont…
• Example: Let’s assume the following dataset presents the
age of 8 factory workers. Identify the first and the third
quartiles.
{18, 21, 23, 24, 24, 32, 42, 59}
• First make sure that the data is sorted in increasing order.
• $Q_1$ is the {0.25(n+1)}-th observation
= {0.25(8+1)}-th observation
= {0.25(9)}-th observation
= {2.25}-th observation
119
Quartiles Cont…
• i.e. $Q_1$ lies a quarter of the distance between 21 and 23; this
can be interpolated as:
21 + (23 − 21)(0.25) = 21.5
• The interpretation is that one fourth of the observations are
below or equal to the value 21.5.
• $Q_3$ is the {0.75(n+1)}-th observation
= {6.75}-th observation
32 + (42 − 32)(0.75) = 39.5
• The interpretation is that three fourths of the observations are
below or equal to the value 39.5.
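The (n+1) positional rule with interpolation can be sketched in Python as follows. The function name `fractile` is illustrative. Statistical packages may use other interpolation rules, so their quartiles can differ slightly from these.

```python
def fractile(values, fraction):
    """Value at position fraction * (n + 1) in the sorted data, with linear interpolation."""
    data = sorted(values)
    n = len(data)
    position = fraction * (n + 1)     # 1-based position, may be fractional
    lower = int(position)             # whole part of the position
    weight = position - lower         # fractional part used for interpolation
    if lower < 1:                     # position falls before the first observation
        return data[0]
    if lower >= n:                    # position falls at or beyond the last observation
        return data[-1]
    return data[lower - 1] + weight * (data[lower] - data[lower - 1])

ages = [18, 21, 23, 24, 24, 32, 42, 59]
print(fractile(ages, 0.25))   # 21.5
print(fractile(ages, 0.75))   # 39.5
```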
120
Quartiles Cont…
Additional use of the quartiles:
• The interquartile range ($Q_3 - Q_1$) can be used as a
measure of dispersion (like the range). The interquartile
range overcomes one of the limitations of the range (i.e.
being affected by extreme values).
• Quartile deviation [($Q_3 - Q_1$)/2] and coefficient of quartile
deviation [($Q_3 - Q_1$)/($Q_3 + Q_1$)] are also used, though rarely, as
measures of dispersion.
• A dataset can be summarized using the so-called “five
number summary” (this is sometimes represented
graphically as a box-and-whisker plot). The five numbers
are: the first and third quartiles, the median, and the
maximum and minimum values.
121
Deciles
• Deciles serve to partition data into 10 equal parts.
• They are not as commonly used as percentiles and
quartiles.
• There are 9 deciles dividing the data into 10 parts.
• The deciles are termed $D_1$ through $D_9$.
• The interpretation of deciles is as follows:
– About one tenth of the data falls on or below $D_1$.
– About two tenths of the data falls on or below $D_2$.
– The same meaning applies to the other deciles.
• Note that $D_5$ has the same meaning as the median or
the second quartile.
122
Deciles Cont…
A given decile is determined in the following manner:
1. Arrange the data in ascending order.
2. Compute the decile using the formula:
$$k^{\text{th}} \text{ decile} = \left[\left(\frac{k}{10}\right)(n+1)\right]^{\text{th}} \text{ observation}$$
3. NB: if the identified position is not a whole number
then the value should be determined by interpolation between the
observations on either side.
123
Percentiles
• Percentiles are also like quartiles, but divide the data set
into 100 equal parts.
• Each group represents 1% of the data set.
• There are 99 percentiles termed $P_1$ through $P_{99}$.
• $P_{50}$ is yet another term for the median.
• Other equivalents, such as $P_{25} = Q_1$, $P_{75} = Q_3$, $P_{10} = D_1$, etc.,
should also be obvious.
• The interpretation of percentiles is as follows:
– 1% of the data falls on or below $P_1$.
– 2% of the data falls on or below $P_2$.
– The same for other values.
124
Percentiles Cont…
A given percentile is determined in the following manner:
1. Arrange the data in ascending order.
2. Compute the percentile using the formula:
$$k^{\text{th}} \text{ percentile} = \left[\left(\frac{k}{100}\right)(n+1)\right]^{\text{th}} \text{ observation}$$
3. If the identified position is not a whole number then
the value should be determined by interpolation between the
observations on either side.
125
Example
• The following data represents the Biostatistics results of
18 students out of 100 marks. Calculate the 4th decile
and the 70th percentile.
{72, 51, 59, 80, 84, 71, 82, 71, 51, 48, 66, 81, 78, 69, 75,
67, 76, 75}
• Computing the 4th decile
• Before starting the computation, arrange the observations
in increasing order, i.e.
{48, 51, 51, 59, 66, 67, 69, 71, 71, 72, 75, 75, 76, 78, 80,
81, 82, 84}
126
Example Cont…
• Compute the 4th decile using the formula:
$$4^{\text{th}} \text{ decile} = \left[\left(\frac{4}{10}\right)(n+1)\right]^{\text{th}} \text{ observation} = [(0.4)(19)]^{\text{th}} \text{ observation} = (7.6)^{\text{th}} \text{ observation}$$
The 4th decile is between the 7th and 8th observations (i.e. between 69 and 71).
In order to get the exact value we have to interpolate:
69 + (71 − 69)(0.6) = 70.2
About four tenths of the data falls on or below 70.2.
127
Example Cont…
• Compute the 70th percentile
• The data is already sorted
• Compute the 70th percentile using the formula:
$$70^{\text{th}} \text{ percentile} = \left[\left(\frac{70}{100}\right)(n+1)\right]^{\text{th}} \text{ observation} = (13.3)^{\text{th}} \text{ observation}$$
The 70th percentile is between the 13th and 14th observations (i.e. between
76 and 78).
In order to get the exact value we have to interpolate:
76 + (78 − 76)(0.3) = 76.6
About 70% of the data falls on or below the value 76.6.
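The decile and percentile calculations for the 18 marks can be reproduced with one small Python function. This is a sketch of the (k/parts)(n+1) rule only; `kth_fractile` and its parameters are hypothetical names.

```python
def kth_fractile(values, k, parts):
    """k-th fractile when the data are split into `parts` groups (10 = deciles, 100 = percentiles)."""
    data = sorted(values)
    n = len(data)
    position = k * (n + 1) / parts    # 1-based position, may be fractional
    lower = int(position)
    weight = position - lower         # fractional part used for interpolation
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    return data[lower - 1] + weight * (data[lower] - data[lower - 1])

marks = [72, 51, 59, 80, 84, 71, 82, 71, 51, 48, 66, 81, 78, 69, 75, 67, 76, 75]
print(round(kth_fractile(marks, 4, 10), 1))     # 70.2, the 4th decile
print(round(kth_fractile(marks, 70, 100), 1))   # 76.6, the 70th percentile
```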
128
Rate, Ratio and Proportion
• In addition to measures of central tendency, measures of
dispersion, and measures of position, a dataset can be
mathematically summarized by the use of Rate, Ratio
and Proportion.
129
Rate
• In mathematics rate is a numeric presentation which is
given in the form of fraction by which the numerator
measures one variable and the denominator another.
• Usually the denominator of rate is a time measure.
• In epidemiology we use rates to measure the occurrence
of events over time.
• If time element is directly reflected into the denominator
it is called real rate. (Example: Incidence density).
• If the fraction measures number of events per population
at risk in a given period of time it is called operational
rate (Example: Incidence proportion).
130
Ratio
• Mathematically a ratio is the comparison of two
quantities that have the same units (usually classes of a
variable).
• A ratio can be written in three different ways:
– As two numbers separated by a colon (a:b)
– As a fraction (a/b)
– As two numbers separated by the word to (a to b)
• In epidemiology a ratio presents two variables (as
numerator and denominator) where one is not included
in the other.
131
Proportion
• A proportion is usually presented as a fraction, decimal or
percentage.
• Unlike a ratio, the numerator is a subset of the denominator,
hence the value indicates the overall contribution of the
numerator to the denominator.
132
Numeric Summarization
Using SPSS
• In SPSS, numeric summaries are available under many
menus. Commonly used are:
– Analyze > Descriptive Statistics > Frequencies >
Statistics.
– Analyze > Descriptive Statistics > Descriptives >
Options.
– Analyze > Descriptive Statistics > Crosstabs >
Statistics.
– Analyze > Descriptive Statistics > Explore > Statistics.
– Analyze > Reports > OLAP Cubes > Statistics.
133
Basic Probability
What is Probability
• Probability is the chance that an event will occur given
the trial has been conducted nearly infinitely under the
same condition. OR
• The probability of an event is the relative frequency of
set of outcomes over indefinitely large (or infinite)
number of trials.
• A sample space is the set of all possible outcomes of a
trial or experiment.
• An event is a subset of the sample space.
• An event can be simple or composite. A composite event
contains more than one simple event.
135
Concept of Union, Intersection and
Complement
136
Mutually Exclusive Events and The
Additive Law
• Events are said to be mutually exclusive if they have no
outcome in common.
• Examples:
• The Additive Law when applied to two mutually exclusive
events states that the probability of either of the two
events occurring is obtained by adding the probability of
each event.
• p(A or B) = p(A) + p(B)
137
Mutually Exclusive Cont..
Example 4.1:
• Roll a six-sided die. There are six possible outcomes (the sample
space): 1, 2, 3, 4, 5, 6. Each outcome has equal
probability of occurrence (i.e. 1/6). The probability of rolling
an even number would be:
• p(even) = p(2) + p(4) + p(6)
• = (1/6) + (1/6) + (1/6) = 1/2
138
Mutually Exclusive Cont..
Example 4.2:
• The natural history of tuberculosis indicates that, for TB
patients without any treatment, at the end of the 5th year
of illness ½ of them would die, ¼ would develop
permanent disability and ¼ would recover. What is the
probability that an untreated TB patient either recovers or
develops permanent disability (in other words, avoids
death) after 5 years of illness?
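Examples 4.1 and 4.2 can both be worked with the additive law. The Python sketch here uses exact fractions; the helper name `p_either` is illustrative, and it is only valid when the events are mutually exclusive.

```python
from fractions import Fraction

def p_either(*probabilities):
    """Additive law for mutually exclusive events: add the individual probabilities."""
    total = sum(probabilities, Fraction(0))
    if total > 1:
        raise ValueError("mutually exclusive events cannot have probabilities summing above 1")
    return total

# Example 4.1: an even number on one roll of a fair die
print(p_either(Fraction(1, 6), Fraction(1, 6), Fraction(1, 6)))   # 1/2

# Example 4.2: recovery or permanent disability after 5 years of untreated TB
print(p_either(Fraction(1, 4), Fraction(1, 4)))                   # 1/2
```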
139
Conditional Probability and the
Multiplicative Law
• Conditional probability is defined as the probability that a
certain event will occur given that another event has
also occurred.
• p(A|B) or "probability of A given B":
$$p(A \mid B) = \frac{p(A \cap B)}{p(B)}$$
• This formula is conveniently rewritten as the following,
which is commonly referred to as the Multiplicative Rule:
$$p(A \cap B) = p(A \mid B) \times p(B)$$
140
Conditional Probability Cont..
Example 4.3:
• What is the probability that the outcome of a roll of a die
is 2 ($A_2$) given that the outcome is even?
Example 4.4:
• A medical practitioner measured the CD4 count of AIDS
patients on ART twice within a month. About 25% of
the patients had a normal value in both tests and 42% of
them had a normal result in the first test. What percent of
those who had a normal value in the first test also had a
normal value in the second test?
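One way to work Examples 4.3 and 4.4 is to apply the conditional probability formula directly, as in this Python sketch (the function name `conditional` is illustrative).

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """p(A | B) = p(A and B) / p(B); undefined when p(B) is 0."""
    if p_b == 0:
        raise ValueError("p(B) must be greater than 0")
    return p_a_and_b / p_b

# Example 4.3: A = "roll is 2", B = "roll is even"; A and B together is just "roll is 2"
print(conditional(Fraction(1, 6), Fraction(3, 6)))   # 1/3

# Example 4.4: normal in both tests (0.25) given normal in the first test (0.42)
print(round(conditional(0.25, 0.42), 3))             # 0.595, i.e. about 60%
```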
141
Independent Events and the
Multiplicative Law
• For two given events, if the occurrence or nonoccurrence
of one doesn’t affect in any way the occurrence or
nonoccurrence of the other, the events are called
independent events.
• With independent events the multiplicative law becomes:
p(A and B) = p (A)p(B)
142
Independent Events Cont..
Example 4.5:
• Assume we have rolled a die twice. What is the
probability to get 6 in both rolls?
Example 4.6:
• The probability of getting a normal birth weight baby at the 33rd
week of gestational age is 1/5. If two pregnant women at
this gestational age gave birth in Bethel
Hospital yesterday, what is the probability that both
babies have normal birth weight?
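Examples 4.5 and 4.6 follow from the multiplicative law for independent events. The sketch assumes that the two rolls, and likewise the two births, are independent of each other; `p_all` is an illustrative name.

```python
from fractions import Fraction

def p_all(*probabilities):
    """Multiplicative law for independent events: multiply the individual probabilities."""
    result = Fraction(1)
    for p in probabilities:
        result *= p
    return result

print(p_all(Fraction(1, 6), Fraction(1, 6)))   # 1/36, a six on both rolls
print(p_all(Fraction(1, 5), Fraction(1, 5)))   # 1/25, both babies of normal birth weight
```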
143
Bayes' Theorem
• Bayes' theorem was published in the eighteenth century
by Thomas Bayes.
• It says that you can use conditional probability to make
predictions in reverse.
• Sometimes called the inverse probability law:
• P(B|A) = P(A and B)/P(A) ………………………………1
P(A|B) = P(A and B)/P(B) ………………………………2
• Solving [1] for P(A and B) and substituting into [2] gives
Bayes' Theorem:
P(A|B) = [P(B|A)][P(A)]/P(B)
• The general formula for Bayes' Theorem, for mutually exclusive
and exhaustive events $A_1, \dots, A_k$, is:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j=1}^{k} P(B \mid A_j)\,P(A_j)}$$
144
Bayes' Theorem Cont…
Example 4.7:
• Suppose there is a certain disease randomly found in
0.5% (.005) of the general population. A certain clinical blood
test is 99% effective in detecting the presence of the
disease among persons with the disease. But it also
yields false-positive results in 5% of individuals without
the disease. The following tables show the probabilities
that are stipulated in the example and the probabilities
that can be inferred from the stipulated information:
• (Source: http://faculty.vassar.edu/lowry/bayes.html)
145
Bayes' Theorem Cont…
Given:
P(A) = .005: The probability that the disease will be present in any
particular person
P(~A) = 1 − .005 = .995: The probability that the disease will not be present in
any particular person
P(B|A) = .99: The probability that the test will yield a positive result
[B] if the disease is present [A]
P(~B|A) = 1 − .99 = .01: The probability that the test will yield a negative result
[~B] if the disease is present [A]
P(B|~A) = .05: The probability that the test will yield a positive result
[B] if the disease is not present [~A]
P(~B|~A) = 1 − .05 = .95: The probability that the test will yield a negative result
[~B] if the disease is not present [~A]
146
Bayes' Theorem Cont…
• Given this information, the derivation of two simple
probabilities is possible using the conditional probability
formula.
P(B) = [P(B|A) × P(A)] + [P(B|~A) × P(~A)]
= [.99 × .005] + [.05 × .995] = .0547
The probability of a positive test result [B], irrespective of whether the disease
is present [A] or not present [~A]
P(~B) = [P(~B|A) × P(A)] + [P(~B|~A) × P(~A)]
= [.01 × .005] + [.95 × .995] = .9453
The probability of a negative test result [~B], irrespective of whether the
disease is present [A] or not present [~A]
147
Bayes' Theorem Cont…
• Then it is possible to calculate the remaining
probabilities.
P(A|B) = [P(B|A) × P(A)] / P(B)
= [.99 × .005] / .0547 = .0905
The probability that the disease is present [A] if
the test result is positive [B]
P(~A|B) = [P(B|~A) × P(~A)] / P(B)
= [.05 × .995] / .0547 = .9095
The probability that the disease is not present
[~A] if the test result is positive [B]
P(~A|~B) = [P(~B|~A) × P(~A)] / P(~B)
= [.95 × .995] / .9453 = .99995
The probability that the disease is absent [~A] if
the test result is negative [~B]
P(A|~B) = [P(~B|A) × P(A)] / P(~B)
= [.01 × .005] / .9453 = .00005
The probability that the disease is present [A] if
the test result is negative [~B]
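The whole chain of calculations in Example 4.7 can be sketched in a few lines of Python. The function and parameter names (`posterior`, `prior`, `sensitivity`, `false_positive_rate`) are labels chosen for this illustration, not terms used in the slides.

```python
def posterior(prior, sensitivity, false_positive_rate):
    """Return P(B), P(A|B) and P(~A|~B) for a diagnostic test via Bayes' theorem."""
    p_pos = sensitivity * prior + false_positive_rate * (1 - prior)          # P(B)
    p_neg = 1 - p_pos                                                        # P(~B)
    p_disease_given_pos = sensitivity * prior / p_pos                        # P(A|B)
    p_healthy_given_neg = (1 - false_positive_rate) * (1 - prior) / p_neg    # P(~A|~B)
    return p_pos, p_disease_given_pos, p_healthy_given_neg

p_pos, p_a_given_b, p_not_a_given_not_b = posterior(
    prior=0.005, sensitivity=0.99, false_positive_rate=0.05)
print(round(p_pos, 4))                 # 0.0547
print(round(p_a_given_b, 4))           # 0.0905
print(round(p_not_a_given_not_b, 5))   # 0.99995
```

Even with a 99% detection rate, only about 9% of positive results correspond to true disease, because the disease is rare.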
148
Summary of the Basic Properties of
Probability
1. The value of a probability can only be 0 ≤ p ≤ 1.
2. If an event is certain to occur, its probability is 1 and if an
event is certain not to occur, its probability is 0.
3. If two events are mutually exclusive (disjoint), the
probability that one or the other will occur equals the
sum of the probabilities: p(A or B) = p(A) + p(B)
4. If A and B are two events, not necessarily disjoint, then
p(A or B) = p(A) + p(B) − p(A and B)
5. The sum of the probabilities that an event will occur and
that it will not occur is equal to 1.
6. If A and B are two independent events then p(A and B) =
p(A)p(B)
7. p(A|B) = p(A∩B)/p(B)
149
Random Variable and Probability
Distribution
Random Variable
• Any characteristic that can be measured or categorized
is called Variable.
• If a variable can assume a number of different values so
that any particular outcome is determined by chance, it is
called a Random Variable.
• A Random Variable is a function, which assigns unique
numerical values to all possible outcomes of a random
experiment under fixed conditions.
151
Random Variable Cont…
Example 4.8
• Three students are taken at random from this classroom.
Suppose our interest is the number of female students
among the three sampled. The possible outcomes, with the
number of females in each, are:

Outcome   No. of females
MMM       0
MMF       1
MFM       1
FMM       1
MFF       2
FMF       2
FFM       2
FFF       3
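The idea of a random variable as a function from outcomes to numbers can be illustrated by enumerating Example 4.8 in Python. The variable names are illustrative only.

```python
from itertools import product
from collections import Counter

# All ordered outcomes of sampling three students, each M or F
outcomes = ["".join(o) for o in product("MF", repeat=3)]

# The random variable X assigns to each outcome its number of females
x = {outcome: outcome.count("F") for outcome in outcomes}
print(x["MFF"])                                # 2

# Number of outcomes giving each value of X
print(sorted(Counter(x.values()).items()))     # [(0, 1), (1, 3), (2, 3), (3, 1)]
```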
152
Random Variable Cont…
• There are two types of random variables.
– A Continuous Random Variable is one that can take any value
within an interval, and so has infinitely many possible values; and,
– A Discrete Random Variable is one that takes a finite
(or countable) number of distinct values.
• Example 4.9:
– A coin is tossed 10 times. The random variable X is the
number of tails that are noted. X can only take the values
0, 1, ..., 10, so X is a Discrete Random Variable.
– A light bulb is burned until it burns out. The random
variable Y is its lifetime in hours. Y can take any positive
real value, so Y is a Continuous Random Variable.
153
Probability Distributions
• Every Random Variable has a corresponding Probability
Distribution.
• A Probability Distribution applies the theory of probability
to describe the behavior of the random variable.
• In the discrete case, it specifies all possible outcomes of
the random variable along with the probability that each
will occur.
• In the continuous case, it allows us to determine the
probabilities associated with specified ranges of values.
154
Discrete Probability Distribution
• Usually represented by a table.
Example 4.10:
• Table 4.1: Probability distribution of a random
variable X representing the birth order of children
born in the US.

x       P(X = x)
1       0.416
2       0.330
3       0.158
4       0.058
5       0.021
6       0.009
7       0.004
8+      0.004
Total   1.000
155
Continuous Probability
Distributions
• Since a continuous random variable assumes infinite
number of outcomes, it cannot be expressed in tabular
form. Instead, an equation or graph describes it.
• The equation used to describe a continuous probability
distribution is called a Probability Density Function
(PDF).
• PDF has the following properties:
156
Continuous Probability
Distributions Cont..
• The area bounded by the curve of the density function
and the x-axis is equal to 1, when computed over the
domain of the variable.
• The probability that a random variable assumes a value
between a and b is equal to the area under the density
function bounded by a and b.
• The probability that a continuous random variable will
equal a specific value is always zero.
157
Binomial Distribution
• A discrete probability distribution.
• It handles a dichotomous (binary, Bernoulli) random
variable, i.e. a variable which has only two outcomes
(success and failure).
• Each trial is called a Bernoulli trial. A binomial experiment
has the following properties:
– The experiment consists of n repeated trials.
– Each trial can result in just two possible outcomes.
– The probability of success, denoted by P, is the
same on every trial.
– The trials are independent.
158
Binomial Distribution Cont..
• b(x; n, P): The probability that an n-trial binomial
experiment results in exactly x successes, when the
probability of success on an individual trial is P.
$$b(x; n, P) = {}^{n}C_{x} \, P^{x} \, (1 - P)^{n - x}$$
159
Binomial Distribution Cont..
Example 4.11:
• Suppose a die is tossed 5 times. What is the probability
of getting exactly 2 fours?
• Suppose in Addis Ababa the probability that a commercial
sex worker is HIV positive is 0.15. If we consider 5
randomly selected commercial sex workers in the city,
what is the probability that exactly 2 of them will be HIV
positive?
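Both questions in Example 4.11 can be answered with the binomial formula. A minimal Python sketch, using `math.comb` for the number of combinations (the function name `binomial_pmf` is illustrative):

```python
from math import comb

def binomial_pmf(x, n, p):
    """b(x; n, p) = nCx * p**x * (1 - p)**(n - x)."""
    return comb(n, x) * p ** x * (1 - p) ** (n - x)

print(round(binomial_pmf(2, 5, 1 / 6), 3))   # 0.161, exactly two fours in five tosses
print(round(binomial_pmf(2, 5, 0.15), 3))    # 0.138, exactly two HIV positive out of five
```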
160
Binomial Distribution Cont..
Cumulative Binomial Probability:
• Refers to the probability that the binomial random
variable falls within a specified range (e.g., is greater
than or equal to a stated lower limit and less than or
equal to a stated upper limit).
161
Binomial Distribution Cont…
Example 4.12:
• The probability that a student is accepted to a
prestigious college is 0.3. If 5 students from the same
school apply, what is the probability that at most 2 are
accepted?
• What is the probability of getting 4 or more HIV positives
among 5 randomly selected sex workers given that the
probability of a commercial sex worker to be HIV positive
is 0.15?
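A cumulative binomial probability is simply a sum of individual binomial probabilities over the stated range. The following Python sketch works both parts of Example 4.12; `binomial_range` is a hypothetical helper name.

```python
from math import comb

def binomial_range(lo, hi, n, p):
    """P(lo <= X <= hi) for X ~ Binomial(n, p), by summing individual probabilities."""
    return sum(comb(n, x) * p ** x * (1 - p) ** (n - x) for x in range(lo, hi + 1))

print(round(binomial_range(0, 2, 5, 0.3), 4))    # 0.8369, at most 2 of 5 students accepted
print(round(binomial_range(4, 5, 5, 0.15), 4))   # 0.0022, 4 or more HIV positive out of 5
```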
162
Poisson Distribution
• A discrete probability distribution.
• First introduced by Siméon-Denis Poisson (1781–1840)
• It expresses the probability of a number of random
events occurring in a fixed period of time if these events
occur with a known average rate.
• A Poisson experiment is a statistical experiment that has
the following properties:
163
Poisson Distribution Cont…
– The experiment results in outcomes that can be
classified as successes or failures.
– The average number of successes (λ) that occurs in a
specified period is known.
– The probability that a success will occur is
proportional to the duration of the time.
– The probability that a success will occur in an
extremely small time is virtually zero.
• Note that the distribution can also be used to quantify the
probability of occurrence of an event in a length, an area,
a volume, etc.
164
Poisson Distribution Cont…
• The following notation is important:
– e: A constant equal to approximately 2.71828.
– λ: The mean number of successes (occurrences
of an event) in a specified period of time.
– x: The actual number of successes that occur in
a specified period of time.
– P(x; λ): The Poisson probability that exactly x
successes occur in a Poisson experiment,
when the mean number of successes is λ.
165
Poisson Distribution Cont…
• Given the mean number of successes (λ) that occur in a
specified period of time, we can compute the Poisson
probability based on the following formula:
$$P(x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$
Example 4.13:
• Let's assume the average number of deaths from breast
cancer is 2 per day. What is the probability that
exactly 3 patients will die tomorrow?
• λ = 2; since 2 patients die per day, on average.
• x = 3; i.e. we want the probability that exactly 3 will die tomorrow.
• e = 2.71828;
166
Poisson Distribution Cont…
• We put these values into the formula as follows:
$$P(x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$
$$P(3; 2) = \frac{(2.71828^{-2}) (2^{3})}{3!}$$
$$P(3; 2) = \frac{(0.13534)(8)}{6}$$
$$P(3; 2) = 0.180$$
• Thus, the probability of exactly 3 deaths tomorrow is
0.180.
167
Poisson Distribution Cont…
Example 4.14:
• In a study of suicides, a researcher found that the
monthly distribution of adolescent suicides in the US follows
a Poisson distribution with parameter λ = 2.75. Find the
probability that a randomly selected month will be one in
which exactly three adolescent suicides occur.
$$P(x; \lambda) = \frac{e^{-\lambda} \lambda^{x}}{x!}$$
$$P(3; 2.75) = \frac{(e^{-2.75}) (2.75^{3})}{3!}$$
$$P(3; 2.75) = 0.222$$
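The arithmetic in Examples 4.13 and 4.14 can be reproduced with a few lines of code. This is a minimal sketch of the Poisson formula as stated on the slides; the function name `poisson_pmf` and the parameter name `lam` (for the mean number of successes) are chosen here for illustration.

```python
from math import exp, factorial

def poisson_pmf(x, lam):
    """Probability of exactly x successes when the mean number of successes is lam."""
    return exp(-lam) * lam**x / factorial(x)

# Example 4.13: mean of 2 deaths per day, exactly 3 deaths tomorrow
print(round(poisson_pmf(3, 2), 3))     # 0.18
# Example 4.14: mean of 2.75 suicides per month, exactly 3 in a month
print(round(poisson_pmf(3, 2.75), 3))  # 0.222
```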
168
Poisson Distribution Cont…
• If the number of admissions in a hospital is 10 per hour
on average, determine the probability that, in any hour
there will be:
0 admissions;
6 admissions;
Less than 2 admissions.
169
Normal Distribution
• Is the most important probability distribution function.
• It is also known as the Gaussian Distribution.
• Named after Carl Friedrich Gauss (1777–1855).
• Given by the formula:
$$Y = \frac{1}{\sigma \sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$
• The formula is affected by two main factors: the mean (µ) and
the SD (σ).
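To see how the mean and SD enter the formula, the density can be evaluated directly. The sketch below simply transcribes the formula above; the function name `normal_pdf` is an assumption made for this illustration.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coeff = 1.0 / (sigma * sqrt(2.0 * pi))
    return coeff * exp(-((x - mu) ** 2) / (2.0 * sigma**2))

# The curve peaks at the mean and is symmetric around it.
print(round(normal_pdf(0.0, 0.0, 1.0), 4))  # 0.3989
print(normal_pdf(-1.0, 0.0, 1.0) == normal_pdf(1.0, 0.0, 1.0))  # True
```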
170
Normal Distribution Cont…
Normal distribution has the following characteristics:
1. Bell shaped
2. Symmetrical at the mean
3. Unimodal
4. Mean, median and mode are equal
5. Area under the curve is 1
6. Extends from negative infinity to positive infinity
• The normal distribution can be used to describe, at
least approximately, any variable that tends to cluster
around the mean (mainly as a result of the central limit
theorem).
171
Skewness, Kurtosis, and
Normal Curve
• Skewness and kurtosis are used to measure normality.
• Significant skewness and kurtosis indicate that data are
not normal.
• Skewness is a measure of asymmetry.
• For univariate data Y_1, Y_2, ..., Y_N, the formula for
skewness is:
• Where Y bar is the mean, S is the standard deviation,
and N is the number of data points.
• The skewness for a normal distribution is zero, and any
symmetric data should have a skewness near zero.
172
Skewness, Kurtosis Cont…
• Kurtosis is a measure of whether the data are peaked or
flat relative to a normal distribution.
• For univariate data Y_1, Y_2, ..., Y_N, the formula for kurtosis
is:
• The kurtosis for a normal distribution is three.
• For this reason, some use the following definition of
kurtosis (often referred to as "excess kurtosis"):
excess kurtosis = kurtosis - 3
• Positive excess kurtosis indicates a "peaked" distribution and
negative excess kurtosis indicates a "flat" distribution.
173
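The skewness and kurtosis formulas themselves did not survive on these slides, so the sketch below uses one common convention as an assumption: the moment coefficients, with skewness = m3 / m2^1.5 and kurtosis = m4 / m2^2, where m2, m3 and m4 are the second, third and fourth central moments computed with N in the denominator. Some textbooks and packages use N - 1 or add small-sample corrections, so results may differ slightly from SPSS output.

```python
def skewness_kurtosis(data):
    """Return (skewness, kurtosis, excess kurtosis) using moment coefficients.

    Assumption: central moments are divided by N (not N - 1).
    """
    n = len(data)
    mean = sum(data) / n
    m2 = sum((y - mean) ** 2 for y in data) / n  # variance (N in denominator)
    m3 = sum((y - mean) ** 3 for y in data) / n
    m4 = sum((y - mean) ** 4 for y in data) / n
    skew = m3 / m2**1.5
    kurt = m4 / m2**2
    return skew, kurt, kurt - 3

# Symmetric data: skewness is zero; flat data: excess kurtosis is negative.
skew, kurt, excess = skewness_kurtosis([1, 2, 3, 4, 5])
print(skew, round(kurt, 2), round(excess, 2))  # 0.0 1.7 -1.3
```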
Normality Test
• Normality tests assess the likelihood that a given data
set comes from a normal distribution.
• This is an important aspect of statistics, as many procedures
assume normality.
• Typically the null hypothesis H_0 is that the observations
are distributed normally with unspecified mean µ and
variance σ^2.
• The alternative H_a is that the distribution is arbitrary.
• A great number of tests (over 40) have been devised for
this problem; the more prominent of them are outlined
below:
174
Normality Test Cont…
• The simplest method of assessing normality is to look at
the frequency distribution histogram. (symmetry,
peakiness of the curve, modality of the distribution).
• The other option is the use of probability plots.
• Probability plot: a graphical technique for comparing
two datasets, either two sets of empirical observations,
one empirical set against a theoretical set, or two
theoretical sets against each other.
– It is a common way of assessing normality, i.e. by
comparing the given data against a normal distribution.
– Has two variants: the Q-Q plot and the P-P plot.
175
Normality Test Cont…
• Quantile-Quantile plot (Q-Q plot):
– Compares two probability distributions by plotting
their quantiles against each other.
– If the two distributions being compared are similar,
the points in the Q-Q plot will approximately lie on the
line y = x.
• Probability-Probability plot (P-P plot):
– Compares two probability distributions by plotting
their cumulative distribution functions against each
other.
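As an illustration of how a normal Q-Q plot is built, the sketch below pairs each ordered observation with the quantile expected under a normal distribution that has the sample's own mean and SD. The plotting position (i - 0.5) / n is one common choice and is an assumption here; statistical packages such as SPSS may use a slightly different formula. The function name and the sample data are hypothetical.

```python
from statistics import NormalDist, mean, stdev

def normal_qq_points(data):
    """Return (theoretical quantile, observed value) pairs for a normal Q-Q plot."""
    n = len(data)
    fitted = NormalDist(mean(data), stdev(data))
    ordered = sorted(data)
    # Plotting position (i - 0.5) / n for the i-th smallest observation (assumed convention).
    return [(fitted.inv_cdf((i - 0.5) / n), y) for i, y in enumerate(ordered, start=1)]

sample = [4.1, 5.0, 5.2, 4.8, 6.3, 5.5, 4.9]  # hypothetical measurements
points = normal_qq_points(sample)
for theoretical, observed in points:
    print(round(theoretical, 2), observed)
```

If the data are approximately normal, the printed pairs fall close to the line y = x when plotted.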
176
Normality Test Cont…
• It is possible to assess the normality of data objectively
using statistical tests (for example, the Kolmogorov-Smirnov
test and the Shapiro-Wilk test).
• In SPSS:
• Analyze > Descriptive Statistics > Explore > enter the
variable under Dependent List > open Plots and check
“Normality plots with tests” > Continue > OK.
• But such tests have serious limitations:
– Small samples almost always pass a normality test,
– With large samples, minor deviations from normality
may be flagged as statistically significant.
177
Normal Distribution Cont…
Application of Normal distribution to calculate probability:
1. Area under the curve is 1,
2. Probability of x > a is the area between a and positive
infinity,
3. Probability of x < a is the area between a and negative
infinity,
4. Probability of b<x<a is the area between a and b,
5. Probability of x = a is zero,
6. The empirical rule: 68%, 95% and 99.7% of the area lie within
1, 2 and 3 standard deviations of the mean, respectively.
But how can we compute the area?
178
Standard Normal Distribution
• Is a normal distribution with a mean of 0 and a standard
deviation of 1.
• Any point (x) from a normal distribution can be converted
to the standard normal distribution (Z) with the formula:
Z = (x - mean) / standard deviation.
• The corresponding area can be read from a standard
normal table.
179
Standard Normal Distribution
Cont..
Example 4.15:
• Suppose a student is 1.4 m tall, where the mean height for
students of his age and sex is 1.2 m with a standard
deviation of 0.4 m.
– What is the corresponding Z value for the student?
– What is the probability that a student is taller than
1.4 m?
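Instead of a printed table, the area under the standard normal curve can be obtained from Python's standard library. The sketch below works through Example 4.15; `z_score` is a helper name introduced here for illustration.

```python
from statistics import NormalDist

def z_score(x, mean, sd):
    """Convert a value x to the standard normal scale."""
    return (x - mean) / sd

z = z_score(1.4, mean=1.2, sd=0.4)
# P(height > 1.4) is the area to the right of z under the standard normal curve.
p_taller = 1 - NormalDist().cdf(z)

print(round(z, 2))         # 0.5
print(round(p_taller, 4))  # 0.3085
```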
180
Standard Normal Distribution
Cont..
Example 4.16:
• Assume blood glucose level among medical students is
normally distributed with a mean of 90 mg/dl and an SD of
6 mg/dl. Student X has a mean glucose level of 100 mg/dl.
Another student Y has a mean glucose level of 80 mg/dl.
– What is the Z score for student X?
– What is the Z score for student Y?
– What is the probability of getting a mean glucose level
less than 100 mg/dl?
– What is the probability of getting a mean glucose level
less than 80 mg/dl?
181
Standard Normal Distribution
Cont..
– What range around the mean encompasses
68% of the observations?
– What is the probability that a student has a blood
glucose level between 100 and 105 mg/dl?
182
Standard Normal Distribution
Cont..
Example 4.17:
• Among pregnant women having ANC follow-up in a
hospital, WBC count follows a normal distribution with a
mean of 8,000 and a standard deviation of 800.
– What is the probability of a WBC count above 10,000
in those pregnant women?
– What is the probability of a WBC count between
7,500 and 10,000?
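Probabilities for an interval are differences between two cumulative areas. As a hedged illustration, the sketch below answers Example 4.17 with `statistics.NormalDist`, which does the Z conversion internally; the variable names are chosen here for the example.

```python
from statistics import NormalDist

# WBC count among pregnant women: mean 8,000 and SD 800 (from Example 4.17)
wbc = NormalDist(mu=8000, sigma=800)

p_above_10000 = 1 - wbc.cdf(10000)          # area to the right of 10,000
p_between = wbc.cdf(10000) - wbc.cdf(7500)  # area between 7,500 and 10,000

print(round(p_above_10000, 4))  # 0.0062
print(round(p_between, 4))      # 0.7278
```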
183
Standard Normal Distribution
Cont..
1. Suppose in BL Hospital the probability that a donated blood unit is
positive for Hepatitis B is 0.2. If we consider 4 randomly selected
donated blood units, what is the probability that exactly 2 of the
samples will be positive for Hepatitis B?
2. Suppose that systolic blood pressures follow a normal distribution
with a mean of 108 and an SD of 14. Using this information,
attempt the following questions.
– About 95% of the blood pressures are between ____ & ____.
– About ______% of the blood pressures are between 66 & 150.
– What is the probability that a patient’s BP is > 120?
– What is the probability that the patient’s BP is between 110 & 130?
– What is the probability that a patient’s BP is < 108?
184
Introduction to Demographic
Methods and Health Service
Statistics
What is Demography?
• “Demos” + “graphy”
• Is a discipline that studies human populations with respect to
size, composition, distribution and mobility; the variation in
these features and its causes; and the effect of all these on
health, environmental, social, ethical and economic conditions.
• Demography as a “method” and “data”.
• Demography studies a population in “static” and “dynamic”
aspects.
• Static aspects include characteristics at a point in time such
as composition by Age, Sex, Race, Marital status etc.
• Dynamic aspects are Fertility, Mortality, Nuptiality, Migration
and Growth.
186
Source of Demographic Data
• Demographic data can be acquired through three
methods:
– Census
– Survey
– Vital Registration
187
Census
• Refers to the total process of collecting, compiling,
analyzing, and publishing or otherwise disseminating
demographic, economic, and social data pertaining to all
persons in a country or in a well-delineated part of a
country at a specified time.
• Census has the following characteristics:
– Universality
– Simultaneity
– Individual enumeration
– Regular interval
188
Census Cont..
• The first real census was conducted in the UK in 1841.
• However, there is evidence of large-scale population
counts dating back to antiquity.
• Content of Census
– Demographic data
– Economic data
– Social data
– Mortality and Birth
189
Approaches to Census
De jure:
• The enumeration is according to the legal or customary
place of residence.
• i.e. people are registered where they usually reside.
• This type of counting gives information relatively
unaffected by seasonal and temporary movements.
• However, it might not be accurate when a person’s
legal or customary residence is not known.
• It also creates a risk of omission and double counting.
• Information collected from a person away from his/her
usual residence can also be incomplete.
190
Approaches Cont…
De facto:
• The enumeration is according to physical residence at
the time of the census.
• i.e. people are registered where they are currently
staying/residing at the time of the census.
• This method is advantageous in the sense that it has
less chance of double counting or omission.
• However, if it is applied in areas with a high level
of migration and mobility, the result can be distorted.
191
Advantage and Disadvantage of
Census
• Advantage
– It represents the whole population,
– Serves as sampling frame for further studies,
– Provides population denominators,
– Provides small area data.
• Disadvantage
– Size limits content and quality control efforts,
– Cost limits frequency,
– Delay between field work and results,
– Sometimes politicized.
192
Vital Registration (Civil Registration)
• Vital Registration is continuous registration of vital
events as they happen.
• What are the vital events?
• Vital Registration is a relatively modern concept in its
present form.
• The major purpose of vital registration is primarily
administrative.
• Vital Registration has the following features:
– Continuity
– Universality
193
Advantages of Vital Registration
• Continuously monitors vital rates,
• May provide both numerator and denominator for
some rates,
• Small area data available,
• Can be used as base for testing the accuracy of
censuses and surveys,
• Once a system is established, it would be
cost-effective.
194
Disadvantages of Vital Registration
• Uncertain coverage,
• It is difficult to establish the system,
• Information may come from third party,
• It can easily be disrupted by political/economic events.
195
Survey
• Refers to the process of obtaining information from a
sample representative of some population at a given
point in time.
• How can we make it representative?
• Survey can be of two types:
– Single-round retrospective survey
– Multi-round follow-up survey
• The content of surveys varies widely.
• Features of Survey:
– Representativeness,
– Smaller size
– More in-depth information.
196
Advantage and Disadvantage of
Survey
• Advantages:
– Quick and inexpensive,
– Gives detailed data,
– Follow up can be achieved
• Limitations:
– Small area data might not be available,
– Perfect representativeness is difficult to achieve,
– A survey can only be focused on few thematic areas.
197
Demographic Transition
• Conceptual framework to explain population change over
time.
• Developed by American demographer Warren
Thompson, 1929.
• Observed changes in birth and death rates in
industrialized societies over the past two hundred years.
• Demographic change has three stages.
• Developed countries started the second stage at the
beginning of the eighteenth century. Less developed
countries began the transition later.
198
Demographic Transition Cont…
199
Demographic Transition
Cont…
• Stage I: Characterized by high and fluctuating mortality,
high fertility and low population growth.
• Stage II: Characterized by beginning of mortality decline
followed by fertility decline. This is the period of rapid
population growth.
• Stage III: Characterized by low mortality and low and
fluctuating fertility; growth slows down and eventually
reaches a no-growth stage.
200
Important Indicators of Composition
of a Population
1. Sex Ratio: The number of males per 1000 females in a
population. It can be expressed as Y to 1000, Y:1 or Y/X,
where Y is the number of males and X is the
number of females.
2. Child to Women Ratio: The ratio of the number of
children under five to the number of women of reproductive
age in a given place and time. It can also be used as a
measure of fertility.
3. Dependency Ratio: Describes the ratio between the
non-productive (ages 0-14 and 65+) and productive (ages 15-64)
age groups in a given place and time.
4. Population Pyramid:
201
Population Pyramid
• A graphical illustration that shows the distribution of
various age groups in a population.
• Normally forms the shape of a pyramid.
• Consists of two back-to-back bar graphs, with the
population plotted on the X-axis and age on the Y-axis,
• One showing the number of males and one showing
females in a particular population in five-year age
groups.
• Males are shown on the left and females on the right.
202
Population Pyramid
203
Population Pyramid
204
Vital Statistics
• Among the topics studied in demography, some are
more important and applicable in public health.
• In particular, the measures of mortality and fertility are vital
inputs to the health system, so they are called Vital
Statistics.
205
Measures of Fertility
• Crude Birth Rate (CBR): The number of live births in a
year per 1000 mid-year population in the same year.
$$\text{CBR} = \frac{\text{Total number of live births in a year}}{\text{Mid-year population in the same year}} \times 1000$$
206
Measures of Fertility Cont..
• General Fertility Rate (GFR): The number of live births
in a year per 1000 mid-year women of reproductive age.
$$\text{GFR} = \frac{\text{Total number of live births in a year}}{\text{Mid-year female population aged 15-49 yrs in the same year}} \times 1000$$
207
Measures of Fertility Cont..
• Age Specific Fertility Rate (ASFR): Refers to the
number of live births in a year per 1000 women of
reproductive age in a given age or age group.
• Usually ASFR is calculated for the following 7 five-year age
groups: 15-19, 20-24, 25-29, 30-34, 35-39, 40-44 and
45-49 years.
$$\text{ASFR} = \frac{\text{Total no. of live births to women of a given age group during a year}}{\text{Mid-year female population for the same age group in the same year}} \times 1000$$
208
Measures of Fertility Cont..
Age category    ASFR
15-19           104
20-24           228
25-29           241
30-34           231
35-39           160
40-44           84
45-49           34
209
Measures of Fertility Cont..
• Total Fertility Rate (TFR): The number of children a
woman is expected to have by the end of her reproductive
age, given that the current ASFRs are maintained.
• Mathematically, it is the sum of all ASFRs from 15-49
yrs.
• TFR for data given in the usual 5-year age categories is
provided as:
$$\text{TFR} = 5 \times \sum_{i=1}^{7} \text{ASFR}_i$$
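Applied to the ASFR table shown earlier (104, 228, ..., 34 per 1000 women), the TFR formula can be sketched as follows. Because the ASFRs are expressed per 1000 women, the sum is divided by 1000 here to express TFR as children per woman; that scaling step and the function name are assumptions made for this illustration.

```python
# ASFR per 1000 women, taken from the table above
asfr_per_1000 = {
    "15-19": 104, "20-24": 228, "25-29": 241, "30-34": 231,
    "35-39": 160, "40-44": 84, "45-49": 34,
}

def total_fertility_rate(asfr, width=5):
    """TFR = width x sum of ASFRs, converted from per 1000 women to per woman."""
    return width * sum(asfr.values()) / 1000

tfr = total_fertility_rate(asfr_per_1000)
print(tfr)  # 5.41
```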
210
Measures of Fertility Cont..
• Gross Reproduction Rate (GRR): Is the total fertility
rate restricted to female births only.
$$\text{GRR} = \text{TFR} \times \text{Proportion of female births} \times 1000$$
211
Measures of Fertility Cont..
• Children Ever Born (CEB):
• For an individual, the total number of children a woman
has ever given birth to.
• For a study area, it is reported as the average number of
children per woman.
212
Measures of Fertility Cont..
Example 5.1:
• Calculate ASFR, TFR, GFR, CBR from the following
data.
213
Measures of Fertility Cont..
Age category    Women of reproductive age    Live births    ASFR
15-19           15,600                       1,596
20-24           14,400                       3,300
25-29           13,300                       3,210
30-34           12,200                       2,830
35-39           11,600                       1,860
40-44           10,100                       850
45-49           9,200                        320
Total           86,400                       13,966
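One way to check hand calculations for Example 5.1 is sketched below. It computes ASFR, TFR (per woman) and GFR from the table; CBR is not computed because it needs the total mid-year population, which the table does not give. Variable names are chosen for this illustration.

```python
# (age group, mid-year women, live births) from the Example 5.1 table
rows = [
    ("15-19", 15600, 1596), ("20-24", 14400, 3300), ("25-29", 13300, 3210),
    ("30-34", 12200, 2830), ("35-39", 11600, 1860), ("40-44", 10100, 850),
    ("45-49", 9200, 320),
]

# ASFR per 1000 women in each age group
asfr = {age: births / women * 1000 for age, women, births in rows}

# TFR per woman: 5-year width x sum of ASFRs, divided by 1000
tfr = 5 * sum(asfr.values()) / 1000

# GFR per 1000 women aged 15-49
total_women = sum(women for _, women, _ in rows)
total_births = sum(births for _, _, births in rows)
gfr = total_births / total_women * 1000

for age, rate in asfr.items():
    print(age, round(rate, 1))
print("TFR:", round(tfr, 2))  # TFR: 5.42
print("GFR:", round(gfr, 1))  # GFR: 161.6
```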
214
Measures of Mortality
• Crude Death Rate (CDR): Refers to the total number of
deaths in a given area, usually in a year, per 1000
mid-year population.
$$\text{CDR} = \frac{\text{Total number of deaths per year}}{\text{Mid-year population}} \times 1000$$
215
Measures of Mortality
• Age Specific Death Rate (ASDR): Quantifies deaths
occurring in a defined age category in a given area per
1000 mid-year population of the same age category.
$$\text{ASDR} = \frac{\text{No. of deaths in a given age category in a year}}{\text{Mid-year population of that age category in the same year}} \times 1000$$
216
Measures of Mortality
• Neonatal Mortality Rate (NMR): The number of
deaths before the age of 28 days (neonatal period) in a
year per 1000 live births in the same year.
• Infant Mortality Rate (IMR): The number of deaths
before the age of 1 year (infancy period) in a year per
1000 live births in the same year.
• Under Five Mortality Rate (U5MR): Quantifies the
probability of dying between birth and age five per 1000
live births in a given year.
• Child Death Rate (ChDR): Quantifies the probability of
dying between age of one and five years per 1000 live
births in a given year.
217
Measures of Mortality
• Cause Specific Mortality Rate (CSMR):
$$\text{CSMR} = \frac{\text{No. of deaths secondary to a given cause in a year}}{\text{Population at risk}} \times 1000$$
• Cause Specific Death Ratio (Proportionate
Mortality Ratio):
$$\text{Proportionate Mortality Ratio} = \frac{\text{No. of deaths secondary to a cause in a year}}{\text{Total no. of deaths in the same year}} \times 1000$$
218
Measures of Mortality
• Maternal Mortality Ratio:
$$\text{MMR}_o = \frac{\text{Number of maternal deaths in a given year}}{\text{Total number of live births in the same year}} \times 100000$$
• Maternal Mortality Rate:
$$\text{MMR}_a = \frac{\text{Number of maternal deaths in a given year}}{\text{Total number of women of reproductive age in the same year}} \times 100000$$
219
Measures of Migration
• Crude In-Migration Rate: Number of in-migrants (I)
per 1,000 population in a given year.
• Crude Out-Migration Rate: Number of out-migrants
(O) per 1,000 population in a given year.
• Crude Net Migration Rate: Difference between the
number of in-migrants (I) and the number of out-migrants
(O) per 1,000 population in a given year.
220
Measures of Marriage
• Crude Marriage Rate: Number of marriages (M) per
1000 population in a given year.
• General Marriage Rate: Number of marriages (M) per
1000 population aged 15 and older in a given year.
221
Measure of Population Growth
and Projection
• Crude Rate of Natural Increase (r):
$$r = \text{CBR} - \text{CDR}$$
• Population Projection:
$$P_t = P_o (1 + r)^t$$
• Population Doubling Time:
$$t = \frac{\log 2}{\log (1 + r)}$$
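The three formulas above chain together naturally, as the sketch below suggests. The input figures (CBR of 35 and CDR of 10 per 1000, a base population of 1,000,000) are hypothetical, and converting r from "per 1000" to a proportion before using it in the projection formula is an assumption made here so that (1 + r) is a sensible growth factor.

```python
from math import log

def natural_increase(cbr, cdr):
    """Crude rate of natural increase as a proportion (CBR and CDR given per 1000)."""
    return (cbr - cdr) / 1000

def project_population(p0, r, t):
    """Projected population after t years: P_t = P_o (1 + r)^t."""
    return p0 * (1 + r) ** t

def doubling_time(r):
    """Years needed for the population to double at growth rate r."""
    return log(2) / log(1 + r)

r = natural_increase(cbr=35, cdr=10)              # hypothetical rates
print(r)                                          # 0.025
print(round(project_population(1_000_000, r, 10)))  # 1280085
print(round(doubling_time(r), 1))                 # 28.1
```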
222
Health Service Statistics
• Data generated from the health system itself.
• Advantages:
– Gives morbidity information.
– Identifies priority health problems in the area.
– Determines met and unmet health needs.
– Determines the success or failure of a specific
health care program.
– Assesses utilization of health services.
223
Health Service Statistics Cont..
• Limitations
– Lack of completeness
– Lack of representativeness to the general
community
– Lack of denominators
– Lack of uniformity
– Lack of quality
– Lack of compliance with reporting
224
Health Service Statistics Cont..
1. Relative Frequency of a Disease:
$$\text{Relative Frequency of a given disease} = \frac{\text{No. of patients diagnosed with a specific disease}}{\text{Total number of health institution visits}} \times 100\%$$
2. Cure Rate:
• Quantifies the proportion of patients who have been cured
of a disease condition using a treatment modality out of
100 patients who received a similar type of treatment.
• The term “Success Rate” can be used if the measured
parameter is a procedure.
$$\text{Cure Rate} = \frac{\text{No. of cured patients of a given disease using a treatment modality}}{\text{Number of patients who received the treatment}} \times 100\%$$
225
Health Service Statistics
Cont..
3. Admission Rate:
• Quantifies the proportion of admissions among
patients who visited the health institution in a given
period of time.
$$\text{Admission Rate} = \frac{\text{No. of patients admitted to a health institution}}{\text{Total number of patients who visited the institution}} \times 100\%$$
4. Hospital Death Rate:
• Quantifies the proportion of deaths among hospitalized
patients in a given period of time.
$$\text{Hospital Death Rate} = \frac{\text{No. of deaths among hospitalized patients}}{\text{Total no. of admissions}} \times 100\%$$
226
Health Service Statistics
Cont..
5. Bed Occupancy Rate:
• Quantifies the percentage occupancy of hospital beds in a
year.
$$\text{BOR} = \frac{\text{Annual number of hospitalized patient days}}{\text{Total number of beds} \times 365} \times 100\%$$
6. Average Length of Stay:
• Quantifies the average duration (in days) of stay of hospitalized
patients.
$$\text{ALS} = \frac{\text{Annual number of hospitalized patient days}}{\text{Number of discharges or deaths}}$$
227
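The bed occupancy rate and average length of stay both start from the annual number of patient days, as the short sketch below illustrates. The hospital figures (29,200 patient days, 100 beds, 5,840 discharges or deaths) are hypothetical and the function names are assumptions made for this example.

```python
def bed_occupancy_rate(patient_days, beds, days_in_year=365):
    """BOR (%) = annual patient days / (beds x 365) x 100."""
    return patient_days / (beds * days_in_year) * 100

def average_length_of_stay(patient_days, discharges_or_deaths):
    """ALS (days) = annual patient days / number of discharges or deaths."""
    return patient_days / discharges_or_deaths

# Hypothetical hospital figures for one year
bor = bed_occupancy_rate(patient_days=29_200, beds=100)
als = average_length_of_stay(patient_days=29_200, discharges_or_deaths=5_840)

print(round(bor, 1))  # 80.0
print(round(als, 1))  # 5.0
```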
Sampling Method
Why Sampling?
• Sampling is that part of statistical practice concerned with
the selection of individual observations intended to yield
reasonable knowledge about a population of concern,
especially for the purposes of statistical inference.
• Study population Vs Target (Source) (Reference)
Population.
• Parameter: A descriptive measure computed from the data
of the source population,
• Statistic: A descriptive measure computed from the data of
a sample.
• The issues of adequate sample size and representative
sampling technique are important for correct estimation of
the parameter using a statistic.
229
Why Sampling?
230
Why Sampling?
• Researchers rarely survey the entire population for two
reasons
(1) The cost is too high and
(2) The population is dynamic.
• Main advantages of sampling:
(1) The cost is lower,
(2) Data collection is faster, and
(3) It is possible to ensure accuracy and quality of
the data because the dataset is smaller.
• Main disadvantage of sampling
– Non representativeness (sampling error)
231
Sampling
Important terms:
• Sampling Unit: Is the unit of selection in the sampling
process.
• Study Unit: The unit on which information is collected.
• Sampling Frame: The list of all the units in the source
population from which a sample is to be taken.
• Sampling Fraction: The ratio of the number of units in
the sample to the number of units in the source
population (n/N). Its reciprocal (N/n) is the Sampling Interval.
232
Types of Sampling
• Probability Sampling: Every unit in the population has
a known, non-zero probability of being sampled and the
process involves random selection.
• Non-probability Sampling: Any sampling method
where some elements of the
population have no chance of selection or where the
probability of selection can't be accurately determined.
233
Probability Sampling
– Simple Random Sampling (SRS)
– Systematic Random Sampling
– Stratified Sampling
– Cluster Sampling
– Multistage Sampling
234
A. Simple Random Sampling (SRS)
• Is the purest (the most representative) form.
• Each member of the population has an equal, nonzero
and known chance of being selected.
• This could be accomplished by writing each study unit's
name on a slip of paper and selecting an adequate
number of them using the Lottery Method.
• It can also be done by assigning a number to each
sampling unit and then selecting the sample using a Table
of Random Numbers or computer packages.
235
How to use table of random
numbers
1. Number each member of the population.
2. Determine population size (N).
3. Determine sample size (n).
4. Determine starting point in table by randomly picking a
page and dropping your finger on the page with your
eyes closed.
5. Choose a direction to read. (to the left, right, down or up)
6. Select the first n numbers read from the table whose last
digits are between 0 and N.
7. Once a number is chosen, do not use it again.
8. If you reach the end of the table before obtaining your n
numbers, pick another starting point, read in a different
direction, and continue until done.
236
Simple Random Sampling
Cont…
• When a large dataset is available in a database, statistical
packages can randomly select a sample of a given size.
• In SPSS:
– Data > Select Cases > Random > complete the
dialogue box accordingly.
• In Excel:
– Tools > Data Analysis > Sampling > Complete the
dialogue box accordingly.
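Outside SPSS and Excel, the same idea can be sketched in a general-purpose language. The example below assumes the sampling frame is simply the units numbered 1 to N; the function name and the optional `seed` argument (used only to make the draw repeatable) are introduced here for illustration.

```python
import random

def simple_random_sample(population_size, n, seed=None):
    """Draw n distinct unit numbers from a frame numbered 1..population_size."""
    rng = random.Random(seed)
    frame = range(1, population_size + 1)
    # random.sample selects without replacement, so no unit is chosen twice.
    return sorted(rng.sample(frame, n))

selected = simple_random_sample(population_size=500, n=20, seed=2010)
print(selected)
```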
237
Simple Random Sampling Cont…
Limitation of SRS
• Requires sampling frame,
• Takes longer time.
238
B. Systematic Random Sampling
• Selects units at a fixed interval throughout the sampling
frame after a random start.
• The steps are:
– Number the units in the population from 1 to N,
– Decide on the n (sample size) that you need,
– Calculate the sampling interval k (k = N/n),
– Randomly select an integer between 1 and k,
– Then take every k-th unit.
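The steps above can be sketched as a short routine. It assumes units are numbered 1 to N and rounds the interval k down to a whole number when N is not an exact multiple of n; the function name and `seed` argument are hypothetical.

```python
import random

def systematic_sample(population_size, n, seed=None):
    """Select n unit numbers at a fixed interval k after a random start."""
    k = population_size // n                        # sampling interval (rounded down)
    start = random.Random(seed).randint(1, k)       # random start between 1 and k
    return [start + i * k for i in range(n)]

units = systematic_sample(population_size=100, n=10, seed=7)
print(units)  # ten unit numbers, each 10 apart
```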
239
Systematic Random Sampling
Cont...
• Advantage:
– It is easier and less time consuming to perform.
– Occasionally it can be conducted without a sampling frame.
• Disadvantage:
– Can be biased when there is a cyclic pattern in the order
of the subjects.
240
C. Stratified Sampling
• Applied when the source population is heterogeneous
on a variable of interest.
• The population is first divided into classes (strata).
• Then a separate sample is taken from each stratum
using the Simple or Systematic Random Sampling technique.
• The number taken from each stratum might be equal
(Non Proportional Stratified Sampling) or the number is
determined based on the proportion of each class in
the source population (Proportional Stratified
Sampling).
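Proportional allocation can be illustrated with a small sketch: each stratum receives a share of the total sample equal to its share of the source population, and a simple random sample is then drawn within each stratum. The strata names and sizes below are hypothetical, and with other numbers the rounded allocations may not add up exactly to n.

```python
import random

def proportional_allocation(strata_sizes, n):
    """Number of units to draw from each stratum, proportional to stratum size."""
    total = sum(strata_sizes.values())
    return {name: round(n * size / total) for name, size in strata_sizes.items()}

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample within each stratum using proportional allocation."""
    rng = random.Random(seed)
    sizes = {name: len(units) for name, units in strata.items()}
    allocation = proportional_allocation(sizes, n)
    return {name: rng.sample(units, allocation[name]) for name, units in strata.items()}

# Hypothetical sampling frames for three strata (unit ID numbers)
strata = {
    "Kolla": list(range(1, 501)),         # 500 units
    "Dega": list(range(501, 801)),        # 300 units
    "Woynadega": list(range(801, 1001)),  # 200 units
}
sample = stratified_sample(strata, n=100, seed=1)
print({name: len(units) for name, units in sample.items()})
# {'Kolla': 50, 'Dega': 30, 'Woynadega': 20}
```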
241
Stratified Sampling Cont…
• Advantage: improves representativeness of the sample
(Proportional Stratified Sampling) or it creates
reasonable comparison among strata (Non Proportional
Stratified Sampling).
• Limitation: Requires separate sampling frame for each
stratum.
242
D. Cluster Sampling
• Is a sampling method applied when the source
population is composed of “natural” groups.
• Assuming the groups are similar to one another,
Cluster sampling selects a few groups (clusters)
from the population as Primary Sampling Units (PSU).
• Then the required information is collected from all
elements, Secondary Sampling Units (SSU), within
each selected group.
243
Cluster Sampling Cont..
• Advantage:
– It doesn’t require the sampling frame of the SSU.
– Requires less time and resource.
• Disadvantage:
– Relies on the assumption of homogeneity among
clusters.
– Less control on sample size.
244
E. Multistage Sampling
• Is like cluster sampling, but involves selecting a sample
within each chosen cluster, rather than including all units
in the cluster.
• Thus, multistage sampling involves selecting a sample
in at least two stages.
• The advantage is that it is simpler to carry out than SRS.
• The disadvantage is that sampling error inflates as the
number of stages increases.
245
Probability Proportional to Size
Sampling Technique
• PPS is a variant of cluster sampling technique.
• Useful when the sampling units vary considerably in
size.
• Probability of selecting a sampling unit (e.g., village,
zone, district, health center) is proportional to the size of
its population.
Involves the following procedures
• List all clusters with their respective source population
size and cumulative frequency.
• Decide the number of clusters (a) which will be included
in the study.
246
PPS Cont…
• Decide the number of individuals which will be studied
per one selection of a cluster (b).
• Divide the total population by the number of clusters to be
studied. This will give you the sampling interval (SI).
• Choose a number between 1 and the SI at random. This
is the Random Start (RS) point.
• Calculate the following series: RS; RS + SI; RS + 2SI;
.....RS + (a - 1)SI.
• Based on the cumulative frequency, identify the
clusters in which the selected numbers fall.
• For every selection of a cluster, select b individuals at
random from it. Note that if a cluster is selected twice, 2b
individuals should be selected at random.
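The PPS procedure above can be sketched in code as follows. The cluster names and population sizes are hypothetical, the interval SI is rounded down to a whole number, and the optional `rs` argument is added here only so that the random start can be fixed for demonstration.

```python
import random
from bisect import bisect_left
from itertools import accumulate

def pps_select(cluster_sizes, a, rs=None, seed=None):
    """Select a clusters with probability proportional to size.

    Returns one cluster name per selection; a name may appear more than once.
    """
    names = list(cluster_sizes)
    cumulative = list(accumulate(cluster_sizes[name] for name in names))
    si = cumulative[-1] // a                  # sampling interval (rounded down)
    if rs is None:
        rs = random.Random(seed).randint(1, si)  # random start between 1 and SI
    points = [rs + i * si for i in range(a)]     # RS, RS + SI, ..., RS + (a - 1)SI
    # A point falls in the first cluster whose cumulative frequency reaches it.
    return [names[bisect_left(cumulative, p)] for p in points]

# Hypothetical clusters and their population sizes
kebeles = {"A": 1000, "B": 3000, "C": 500, "D": 5500}
print(pps_select(kebeles, a=4, rs=1200))  # ['B', 'B', 'D', 'D']
```

In this illustration clusters B and D are each selected twice, so 2b individuals would be drawn from each of them.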
247
2. Non-probability Sampling
• Here, the sample is less likely to be representative of
the population, thus it is difficult to extrapolate from the
sample to the population.
• Is used when there is no sampling frame or when it is
impossible to conduct probability sampling due to
economic and feasibility factors.
248
Non-probability Sampling Cont..
• Judgmental or Purposive Sampling: The researcher
chooses the sample based on who he/she thinks would be
appropriate for the study.
• Convenience Sampling: The selection of units from the
population is based on availability and/or accessibility.
• Quota Sampling: It starts with systematically setting a
“Quota” to represent subgroups of a population. Then
data are collected to meet the predefined Quota.
• Snowball Sampling: The researcher begins by identifying
someone who meets the inclusion criteria of the study.
Then the study subject is asked to recommend
others he/she knows who also meet the criteria.
249
Sampling Error
• Sampling error or estimation error is the part of the total
error or uncertainty caused by observing a sample
instead of the whole population.
• Non-sampling errors such as non-response and
reporting errors may also affect the outcome of a
sample-based study.
• Theoretically, sampling error is the estimate from a sample
minus the population value.
• Unlike bias, sampling error can be predicted, calculated,
and accounted for.
• There are several measures of sampling error.
250
Sampling Error Cont…
1. Standard error
• Is a measure of the variability of an estimate due to
sampling.
• It indicates the extent to which an estimate derived from
a sample survey can be expected to deviate from the
population value.
• Depends upon the underlying variability in the population
for the characteristic as well as the sample size used for
the survey.
• The standard error is a foundational measure from which
other sampling error measures are derived.
251
Sampling Error Cont…
2. Confidence intervals:
• A range that is expected to contain the population value
of the characteristic with a known probability.
3. Margin of error:
• Is a measure of the precision of an estimate at a given
level of confidence.
4. Coefficient of variation:
• The relative amount of sampling error in comparison
with a sample estimate.
• CV = (SE / Estimate) × 100%
• There are no hard and fast rules to define an acceptable level.
• The smaller the CV, the more reliable the estimate.
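The first four measures can be computed together for a sample mean, as in the sketch below. It is an illustration under stated assumptions: the standard error is taken as SD / sqrt(n), the margin of error uses the normal (Z) multiplier for the chosen confidence level (a t multiplier would be more appropriate for small samples), and the glucose readings are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev

def sampling_error_measures(sample, confidence=0.95):
    """Standard error, margin of error, confidence interval and CV for a sample mean."""
    n = len(sample)
    estimate = mean(sample)
    se = stdev(sample) / sqrt(n)                         # standard error of the mean
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)   # about 1.96 for 95%
    moe = z * se                                         # margin of error
    return {
        "estimate": estimate,
        "standard_error": se,
        "margin_of_error": moe,
        "confidence_interval": (estimate - moe, estimate + moe),
        "cv_percent": se / estimate * 100,
    }

glucose = [90, 96, 84, 100, 80, 92, 88, 94]  # hypothetical readings in mg/dl
measures = sampling_error_measures(glucose)
for name, value in measures.items():
    print(name, value)
```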
252
Sampling Error Cont…
5. P values:
• The p value is the probability of obtaining a test statistic at least as
extreme as the one that was actually observed,
assuming that the null hypothesis is true.
Importance of such measures:
• To indicate the statistical reliability and usability of
estimates.
• To make comparisons between estimates.
• To conduct tests of statistical significance.
• To help users draw appropriate conclusions about data.
253
Exercise 1
• A medical practitioner wanted to assess the quality of
family planning service offered in a hospital. Accordingly,
he conducted exit interviews with those women whose ID
numbers were multiples of five. What sampling method is
employed?
254
Exercise 2
• A medical practitioner wanted to assess the prevalence
of malnutrition among under five children in a woreda.
Assuming all kebeles in the woreda are similar, he
included all under five children in two randomly selected
kebeles.
– What sampling method is employed?
– What possible limitation do you expect?
255
Exercise 3
• A medical practitioner wanted to assess the prevalence
of malnutrition among under five children in a woreda.
Assuming the problem is different across the three
agro-ecological zones in the woreda, he included children from
2 kebeles each from Kolla, Dega and Woynadega.
– What sampling method is employed?
– What possible limitation do you expect?
256
Exercise 4
• A researcher wanted to study the prevalence of drug
addiction among adolescents in Addis Ababa. First he
randomly selected Bole sub city. Then he selected woreda
17 at random from all woredas in Bole sub city. Finally
he conducted his study in Kebele 19 (after random
selection).
– What sampling method is employed?
– What possible limitation do you expect?
– If woreda 17 had been selected because of its proximity to
the researcher's organization, what would the
sampling method have been?
257
Sampling Distribution and
Estimation
Estimation
• Estimation refers to the process by which one makes
inferences about a population, based on information
obtained from a sample.
• Can be of two types:
– Point Estimation
– Interval Estimation
259
Point Estimate
• Point Estimate: A point estimate of a population
parameter is a single value of a statistic.
• The following table gives commonly used point
estimators.
260
Interval Estimate
• An interval estimate is defined by two numbers, between
which a population parameter is said to lie.
• For example, $a < \mu < b$ is an interval estimate of the
population mean $\mu$,
• i.e. the population mean is greater than a but less than b.
• An interval estimate has three components
(concepts).
261
Interval Estimate Cont….
• An interval estimate has got three components (concepts)
– A statistic: (the point estimator)
– A margin of error: (the measure of precision)
– A confidence level: (the measure of uncertainty)
• The interval estimate of a given confidence level is
defined by the sample statistic ± margin of error.
• An interval estimate is preferred to a point estimate as it
considers the precision and uncertainty of estimation.
262
Interval Estimate Cont….
Margin of Error
• In a confidence interval, the range of values above and
below the sample statistic is called the margin of error.
• It measures the precision of a sampling method.
• It is a function of the confidence level and of another
quantity called the standard error.
263
Interval Estimate Cont….
• Confidence Level
– The probability part of the interval.
– It describes how strongly we believe that a particular
sampling method will produce an interval that
includes the true population parameter.
– 90%, 95% and 99% confidence levels are commonly used.
– For example, 95% CI means: If we used the same
sampling method to select different samples and
compute different interval estimates, the true
population mean would fall within the range defined by
the sample statistic ± margin of error 95% of the
time.
264
Interval Estimate Cont….
• Example 6.1:
– A local newspaper conducts an election survey and
reports that the independent candidate will receive
30% of the vote. The newspaper states that the
survey had a 5% margin of error and a confidence
level of 95%.
– Meaning: We are 95% confident that the independent
candidate will receive between 25% and 35% of the
vote.
265
CI for a single mean
• Background Concept: Sampling Distribution of Means.
– One can generate sampling distribution of means in the
following manner:
– Obtain a sample of n observations selected completely
at random from a large population. Determine their
mean and then replace the observations in the
population.
– Repeat the sampling procedure indefinitely.
– The result is a series of means of sample size n.
– If each mean in the series is now treated as individual
observation and arranged in a frequency distribution,
one comes up with the sampling distribution of means of
samples of size n.
266
CI for a single mean cont..
• The sampling distribution of means has the following
properties:
1. The mean of the sampling distribution of means is the
same as the population mean.
2. The SD of the sampling distribution of means (which is
called the standard error of the mean) is:
$$\sigma_{\bar{x}} = \sigma / \sqrt{n}$$
3. The sampling distribution of means is approximately a
normal distribution, regardless of the original distribution,
provided n is large. (Central Limit Theorem)
267
CI for a single mean cont..
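• The general formula is
• CI = Sample statistic ± Z value × SE
$$\Pr\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95$$
$$\Rightarrow \Pr\left[\bar{X} - 1.96(\sigma/\sqrt{n}) \le \mu \le \bar{X} + 1.96(\sigma/\sqrt{n})\right] = 0.95$$
$$95\%\ \text{CI for } \mu = \bar{X} \pm 1.96(\sigma/\sqrt{n})$$
$$\text{CI for } \mu = \bar{X} \pm Z_{\alpha/2}(\sigma/\sqrt{n})$$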
268
CI for a single mean cont..
• However when the population variance is unknown and
the sample size is less than 30:
– The sample variance should replace the population variance.
– Student's t distribution should be used in place of the
standard normal distribution.
– Hence the formula would be:
$$\text{CI for } \mu = \bar{X} \pm t_{\alpha/2,\,(n-1)}\,(s/\sqrt{n})$$
where s is the sample standard deviation.
269
CI for a single mean cont..
Example 6.2:
• The mean blood glucose level of 100 randomly selected
healthy adults is 85 mg/dl. Find the 95% CI for the mean
blood glucose level of all healthy adults (µ), given that the
standard deviation for the population is 15 mg/dl.
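One possible way to work Example 6.2 is sketched below in Python. It applies the formula CI = X̄ ± Z(σ/√n) with the values stated in the example; the function name `mean_ci` is only illustrative.

```python
from math import sqrt
from statistics import NormalDist

def mean_ci(xbar, sigma, n, level=0.95):
    """CI for a mean when the population SD (sigma) is known."""
    # Z value for the chosen confidence level (about 1.96 for 95%)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    margin = z * sigma / sqrt(n)  # margin of error = Z x SE
    return xbar - margin, xbar + margin

lower, upper = mean_ci(xbar=85, sigma=15, n=100)
print(f"95% CI for the mean: {lower:.2f} to {upper:.2f} mg/dl")
```

Under these inputs the standard error is 15/√100 = 1.5 mg/dl, so the interval is roughly 85 ± 2.94 mg/dl.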
270
CI for difference between two
means
• Background Concept: The Sampling distribution of
Difference of Means.
– Consider two different populations X and Y.
– The first population has mean of $\mu_x$ and standard
deviation of $\sigma_x$.
– The second population has mean of $\mu_y$ and standard
deviation of $\sigma_y$.
– From the first population take a sample of size $n_x$ and
compute its mean $\bar{X}$.
– From the second population take a sample of size $n_y$
and compute its mean $\bar{Y}$.
– Then determine $\bar{X} - \bar{Y}$.
271
CI for difference between two
means cont…
• Do the same for all pairs of samples that can be chosen
independently from the two populations.
• The differences $\bar{X} - \bar{Y}$ are a new set of scores which form
the sampling distribution of differences of means.
272
CI for difference between two
means cont…
• Properties of the sampling distribution of differences of
means.
1. The mean of the sampling distribution of differences of
means equals the difference of the population means
($\mu_1 - \mu_2$).
2. The SD of the sampling distribution of differences of
means (SE) is equal to:
$$\sigma_{(\bar{X} - \bar{Y})} = \sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
3. The distribution is approximately normal.
273
CI for difference between two
means cont…
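$$\Pr\left[-1.96 < \frac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} < 1.96\right] = 0.95$$
$$\Pr\left[(\bar{X} - \bar{Y}) - 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} \le \mu_1 - \mu_2 \le (\bar{X} - \bar{Y}) + 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}\right] = 0.95$$
$$95\%\ \text{CI of } \mu_1 - \mu_2 = (\bar{X} - \bar{Y}) \pm 1.96\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$\mu_1 - \mu_2 = (\bar{X} - \bar{Y}) \pm Z_{\alpha/2}\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$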
274
CI for difference between two
means cont….
Example 6.3:
• A randomly selected 120 HIV patients who were on ART
lived, on average, for 25 years (SD of 5 years) after
their diagnosis with the virus was made. Similarly, a
randomly selected 140 HIV patients who were not on
ART lived, on average, for 14 years (SD of 4 years).
• Calculate the point estimate for the difference between
the population means.
• Find the 95% CI for the difference between the means.
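A minimal Python sketch of Example 6.3 follows. It assumes that, because both samples are large, the sample SDs can stand in for the population SDs in the standard error formula; the function name `diff_means_ci` is illustrative only.

```python
from math import sqrt
from statistics import NormalDist

def diff_means_ci(x1, s1, n1, x2, s2, n2, level=0.95):
    """Point estimate and CI for the difference between two means."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    diff = x1 - x2                       # point estimate
    se = sqrt(s1**2 / n1 + s2**2 / n2)   # SE of the difference
    return diff, diff - z * se, diff + z * se

# ART group: mean 25, SD 5, n 120; non-ART group: mean 14, SD 4, n 140
diff, lower, upper = diff_means_ci(25, 5, 120, 14, 4, 140)
print(f"Point estimate: {diff} years")
print(f"95% CI: {lower:.2f} to {upper:.2f} years")
```

With these inputs the point estimate is 11 years and the interval is roughly 9.9 to 12.1 years.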
275
CI for single proportion
• Background Concept: The Sampling distribution of
Proportions
• Here we are interested in the proportion of the
population that has a certain characteristic, represented
by P or $\pi$.
• If we repeatedly take random samples of n observations
and calculate p for each sample, the resulting values form
the sampling distribution of proportions.
• The sampling distribution of proportions has the following
characteristics:
276
CI for single proportion cont…
• The sampling distribution of proportions has the
following properties:
1. The mean of the sampling distribution of proportions = $\pi$.
2. The SD (SE) of the sampling distribution of proportions:
$$\sigma_P = \sqrt{\frac{P(1 - P)}{n}}$$
3. The distribution is approximately normal.
277
CI for single proportion cont..
$$\Pr\left[-1.96 < \frac{p - \pi}{\sqrt{\frac{P(1 - P)}{n}}} < 1.96\right] = 0.95$$
$$\Rightarrow \Pr\left[p - 1.96\sqrt{\frac{P(1 - P)}{n}} \le \pi \le p + 1.96\sqrt{\frac{P(1 - P)}{n}}\right] = 0.95$$
$$95\%\ \text{CI for } \pi = p \pm 1.96\sqrt{\frac{P(1 - P)}{n}}$$
$$\pi = p \pm Z_{\alpha/2}\sqrt{\frac{P(1 - P)}{n}}$$
278
CI for single proportion cont..
Example 6.4:
• In Addis Ababa, blood tests of 120 randomly selected
commercial sex workers revealed that 30 of them were
HIV positive. What is the 99% confidence interval for the
HIV prevalence among all commercial sex workers
in the city?
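Example 6.4 could be worked along the following lines. This sketch assumes the sample proportion p is used in place of P when estimating the standard error; `proportion_ci` is an illustrative name.

```python
from math import sqrt
from statistics import NormalDist

def proportion_ci(successes, n, level=0.95):
    """Point estimate and CI for a single proportion (normal approximation)."""
    p = successes / n
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)  # about 2.58 for 99%
    se = sqrt(p * (1 - p) / n)
    return p, p - z * se, p + z * se

p, lower, upper = proportion_ci(30, 120, level=0.99)
print(f"p = {p:.2f}, 99% CI: {lower:.3f} to {upper:.3f}")
```

Here p = 0.25 and the 99% interval runs from roughly 15% to 35%.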
279
CI for difference between two
proportions
• Consider two different populations X and Y.
• The first population has a proportion of $\pi_x$ and the second
population has a proportion of $\pi_y$.
• From the first population take a sample of size $n_x$ and
compute its sample proportion $p_x$. From the second
population take a sample of size $n_y$ and compute its
sample proportion $p_y$.
• Then determine $p_x - p_y$.
• Do this for all pairs of samples that can be chosen
independently from the two populations.
• The differences $p_x - p_y$ are a new set of scores which form
the sampling distribution of differences of proportions.
280
CI for difference between two
proportions cont…
• The sampling distribution of differences of proportions
has the following properties:
1. The mean of the sampling distribution of differences of
proportions equals the difference of the population
proportions ($\pi_1 - \pi_2$).
2. The SD (SE) is given as:
$$\sigma_{(p_1 - p_2)} = \sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
3. The distribution is approximately normal.
281
CI for difference between two
proportions cont…
$$\Pr\left[-1.96 < \frac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}} < 1.96\right] = 0.95$$
$$\Pr\left[(p_1 - p_2) - 1.96\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}} \le \pi_1 - \pi_2 \le (p_1 - p_2) + 1.96\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}\right] = 0.95$$
$$95\%\ \text{CI for } \pi_1 - \pi_2 = (p_1 - p_2) \pm 1.96\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
$$\pi_1 - \pi_2 = (p_1 - p_2) \pm Z_{\alpha/2}\sqrt{\frac{p_1(1 - p_1)}{n_1} + \frac{p_2(1 - p_2)}{n_2}}$$
282
CI for difference between two
proportions cont…
• Example 6.5:
• Among 200 randomly selected illiterate married women,
50 use contraceptives. Similarly, among 300 randomly
selected married women who can read and write,
150 use contraceptives.
• Calculate the point estimate for the difference between
the population proportions.
• Find the 95% CI for the difference between the
proportions.
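A short Python sketch of Example 6.5, using the SE formula for a difference of proportions given earlier. Taking the illiterate group as group 1 is an arbitrary choice made here; reversing the groups only flips the signs.

```python
from math import sqrt
from statistics import NormalDist

def diff_proportions_ci(x1, n1, x2, n2, level=0.95):
    """Point estimate and CI for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    diff = p1 - p2
    return diff, diff - z * se, diff + z * se

# Group 1: 50 of 200 illiterate women; group 2: 150 of 300 literate women
diff, lower, upper = diff_proportions_ci(50, 200, 150, 300)
print(f"Point estimate: {diff:.2f}")
print(f"95% CI: {lower:.3f} to {upper:.3f}")
```

The point estimate is −0.25, and because the interval (roughly −0.33 to −0.17) excludes 0, the data suggest a real difference in contraceptive use between the two groups.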
283
CI for OR and RR
• When the intention of measurement of association is to
have inference about a population parameter, CI for OR
or RR can be calculated using the following formulas.
$$\text{CI for OR} = \exp\left[\ln(\text{OR}) \pm Z_{\alpha/2}\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}\right]$$
$$\text{CI for RR} = \exp\left[\ln(\text{RR}) \pm Z_{\alpha/2}\sqrt{\frac{1 - \frac{a}{a+b}}{a} + \frac{1 - \frac{c}{c+d}}{c}}\right]$$
• Why do we need natural logarithm here?
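The two formulas can be sketched in Python as below. The 2×2 counts are hypothetical (they do not come from the slides), and the cell labels assume the usual layout: a = exposed with outcome, b = exposed without outcome, c = unexposed with outcome, d = unexposed without outcome. On the question of the logarithm: the sampling distributions of OR and RR are skewed, whereas those of ln(OR) and ln(RR) are closer to normal, so the interval is built on the log scale and then transformed back.

```python
from math import exp, log, sqrt
from statistics import NormalDist

def or_ci(a, b, c, d, level=0.95):
    """Odds ratio with CI computed on the log scale."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)   # SE of ln(OR)
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)

def rr_ci(a, b, c, d, level=0.95):
    """Relative risk with CI computed on the log scale."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    risk1, risk0 = a / (a + b), c / (c + d)
    rr = risk1 / risk0
    se = sqrt((1 - risk1) / a + (1 - risk0) / c)  # SE of ln(RR)
    return rr, exp(log(rr) - z * se), exp(log(rr) + z * se)

# Hypothetical counts, for illustration only
a, b, c, d = 20, 80, 10, 90
or_est, or_lo, or_hi = or_ci(a, b, c, d)
rr_est, rr_lo, rr_hi = rr_ci(a, b, c, d)
print(f"OR = {or_est:.2f} ({or_lo:.2f} to {or_hi:.2f})")
print(f"RR = {rr_est:.2f} ({rr_lo:.2f} to {rr_hi:.2f})")
```

Note that the resulting intervals are not symmetric around the point estimate, which is expected when the limits are back-transformed from the log scale.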
284
CI for OR and RR Cont..
• SPSS can compute OR and RR with their confidence intervals
if the data are entered in the following manner.
• Create 3 variables in the variable view page:
– Frequency (for the four cells),
– Exposure (0 as Yes, 1 as No) and
– Outcome (0 as Yes, 1 as No)
• Enter the values into the data view page as mentioned above.
• Weight cases based on the “frequency” variable.
• Do the analysis in the following manner:
– Descriptive statistics > Cross tabs > Put “exposure” as row
and “outcome” as column > Statistics > Check “risk” >
Continue > Ok
– OR is given as “Odds ratio for exposure (yes/no)”
– RR is given as “For cohort disease = yes”
285
Unbiased and Biased Estimators
• A statistic is called an unbiased estimator of a population
parameter if the mean of the sampling distribution of the
statistic is equal to the value of the parameter.
• The sample mean is an unbiased estimator of the population
mean, because the mean of the sampling distribution of means
equals the population mean.
• If the mean value of an estimator is either less than or
greater than the true value of the quantity it estimates, then
the estimator is called biased.
• A case of biased estimation occurs when the sample
variance is used to estimate the population variance using
the following formula:
$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n}$$
286
Unbiased and Biased
Estimators Cont…
• The sample variance calculated using this formula is, on
average, less than the true population variance.
• This is because sample observations are closer to their own
sample mean than to the population mean.
• To compensate for this, n − 1 is used as the denominator.
• It is important to note that, even using n − 1 as the
denominator, the sample standard deviation still remains a
biased estimator of the population standard deviation, but
for large sample sizes this bias is negligible.
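The bias can be checked exactly on a small case. The sketch below uses a hypothetical four-value population and lists every possible sample of size 2 drawn with replacement, then averages the two versions of the sample variance over all samples.

```python
from itertools import product
from statistics import mean, pvariance, variance

# Hypothetical population, for illustration only
population = [2.0, 4.0, 6.0, 8.0]
true_variance = pvariance(population)

# All 16 possible samples of size 2 drawn with replacement
samples = list(product(population, repeat=2))

# Average of the sample variance over all samples
mean_var_n = mean(pvariance(s) for s in samples)        # denominator n
mean_var_n_minus_1 = mean(variance(s) for s in samples)  # denominator n - 1

print("Population variance:        ", true_variance)
print("Mean of variances (n):      ", mean_var_n)
print("Mean of variances (n - 1):  ", mean_var_n_minus_1)
```

In this small case the n-denominator version averages to half the population variance, while the (n − 1) version averages to the population variance itself.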
287
Estimation of Sample Size for
Cross Sectional Studies
Why we need to calculate sample size:
• Representativeness Vs Cost
• Estimation can be made based on a given confidence
level and standard error.
288
Sample Size to Estimate a Single
Population Proportion
• If the main objective of the study is to estimate a single
population proportion, then the sample size can be
determined using the formula:
$$n = \frac{Z_{\alpha/2}^2\, P(1 - P)}{d^2}$$
Where;
n is the minimum sample size required for a very large
population (≥100,000)
Z is the critical value for a given confidence level
P is the expected proportion of the event to be studied (to
be estimated based on findings of previous studies)
d is the margin of error
289
Sample Size to Estimate a Single
Population Proportion Cont…
NB:
• If P is not known it has to be taken as 0.5. (Why?)
• Depending on the nature of the study, a 10–15%
contingency should be added.
• If the size of the population is less than 100,000 the
sample size should be corrected using the formula:
$$\text{Corrected sample size} = \frac{n \times N}{n + N}$$
• Where:
– n is the non-corrected sample size
– N is the size of the source population
290
Sample Size to Estimate a Single
Population Proportion Cont…
Example 6.6:
• A researcher is interested to determine the prevalence of
family planning use in Addis Ababa city. A previous
study indicates the prevalence is around 55%. If the
researcher is interested to determine the sample size
with 95% CI and 5% of margin of error, what number of
women of reproductive age should be included into his
study?
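A possible working of Example 6.6 in Python is shown below, using Z = 1.96 as in the slides and rounding up to the next whole person. The finite-population example at the end uses a hypothetical source population of 5,000, which is not part of the exercise.

```python
from math import ceil

def sample_size_proportion(p, d, z=1.96):
    """Uncorrected sample size for estimating a single proportion."""
    return z**2 * p * (1 - p) / d**2

def corrected_sample_size(n, N):
    """Correction for a source population smaller than 100,000."""
    return n * N / (n + N)

n_raw = sample_size_proportion(p=0.55, d=0.05)
n_required = ceil(n_raw)  # always round up
print("Minimum sample size:", n_required)

# Hypothetical source population of 5,000 (illustration only)
n_corrected = ceil(corrected_sample_size(n_raw, N=5000))
print("Corrected for N = 5000:", n_corrected)
```

A contingency of 10–15% would then be added to whichever figure applies.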
291
Sample Size to Estimate Single
Population Mean
• If the main objective of the study is to estimate a single
population mean, then the sample size can be determined
using the formula:
$$n = \left(\frac{Z_{\alpha/2}\,\sigma}{d}\right)^2$$
• Where:
– n is the minimum sample size required for a large
population
– Z is the critical value for a given confidence level
– $\sigma$ is the expected SD of the variable to be studied
– d is the margin of error
292
Sample Size to Estimate Single
Population Mean
Example 6.7:
• A researcher is interested to determine the mean blood
glucose level among high school students. A previous
study indicates the mean is 85mg/dl with standard
deviation of 15mg/dl. If the researcher is interested to
determine the sample size with 95% CI and tolerates 2
mg/dl margin of error, what number of students should
be included into his study?
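Example 6.7 can be sketched the same way, again assuming Z = 1.96 for 95% confidence and rounding up.

```python
from math import ceil

def sample_size_mean(sigma, d, z=1.96):
    """Sample size for estimating a single mean."""
    return (z * sigma / d) ** 2

n_raw = sample_size_mean(sigma=15, d=2)
n_required = ceil(n_raw)  # always round up
print("Minimum number of students:", n_required)
```

Note that the previous study's mean (85 mg/dl) does not enter the calculation; only the SD and the tolerated margin of error do.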
293
Hypothesis Testing
What is a Hypothesis
• A statistical hypothesis is an assumption or a statement,
which may or may not be true, concerning one or more
populations.
• Setting up and testing hypotheses is an essential part of
statistical inference.
• Examples of statistical hypothesis:
– The mean pulse rate among AAUHI students is 72/min.
– The prevalence of HIV in AA is 12%.
– The mean blood glucose level among Chinese and
Indians is the same.
– The prevalence of Hypertension in US and UK is the
same.
– The mean blood cholesterol level is the same before
and after taking a drug.
295
Steps in Hypothesis Testing
Hypothesis testing involves the following steps:
1. Choose the hypothesis to be tested,
2. Choose an alternative hypothesis which would be
accepted if the first hypothesis is rejected.
3. Decide on the appropriate test statistic for the
hypothesis (Z, t, $\chi^2$)
4. Decide the level of significance and corresponding
critical value.
5. Obtain the value of the test statistic.
6. Make a decision and interpret it.
296
The Null and Alternative
Hypothesis
• In hypothesis testing two hypotheses are involved: The Null
Hypothesis and the Alternative Hypothesis.
• Every hypothesis test requires the analyst to state a null
hypothesis and an alternative hypothesis.
• They are mutually exclusive and complementary statements.
• Both hypotheses are about the parameter not about the
statistic.
• The null hypothesis ($H_0$ or $H_N$):
– The first hypothesis to be set by the researcher.
– It commonly implies the meaning of “equals to”, “no
effect” or “no difference”, “no association” conclusions.
297
The Null and Alternative
Hypothesis Cont..
Example;
• The mean pulse rate among AAUHI students is 72/min.
• Drug A has no effect on the blood glucose level of
diabetic patients.
• There is no difference in the prevalence of malaria in
region A and Region B.
• There is no association between smoking and lung
cancer.
298
The Null and Alternative
Hypothesis Cont..
• The alternative hypothesis ($H_A$ or $H_1$):
• The hypothesis that will be accepted if $H_0$ is rejected.
• Implies conclusions like “is not equal”, “has effect”, “there
is difference” and “there is association”.
Example:
• The mean pulse rate among AAUHI students is not
equal to 72/min.
• Drug A has effect on the blood glucose level of diabetic
patients.
• There is difference in the prevalence of malaria in region
A and B.
• There is association between smoking and lung cancer.
299
Test Statistic
• In hypothesis testing we accept or reject the hypothesis
by calculating the probability of obtaining a sample value
at least as extreme as the one observed, given that the
hypothesized population value is true.
• If the probability is very low we reject the null hypothesis.
• The probability is calculated using a test statistic.
• The most commonly used test statistics are Z, Student’s t
and $\chi^2$.
• The general formula to calculate a test statistic is:
$$\text{test statistic} = \frac{(\text{estimate}) - (\text{hypothesized value})}{SE}$$
300
Test Statistic
Student’s t Distribution:
• The use of the Z test requires knowledge of the variance of
the population from which the sample is taken.
• It is somewhat strange that one can have knowledge of
the population variance and not know the value of the
population mean.
• In statistics, as long as the sample size is large enough, the
sampling distribution of the mean can be approximated by the
normal distribution.
• But when the sample size is small and the population SD is
not known, statisticians rely on the distribution of the t
statistic.
301
Test Statistic Cont…
• Student’s t distribution was developed by William Gosset
(1876–1937) under the pseudonym “Student”.
• There are many different t distributions. (t distribution is a
family of distributions)
• The particular form of the t distribution is determined by
its Degrees of Freedom (df).
• The degrees of freedom (df) refers to the number of
independent observations in a dataset after some
restriction is made.
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
302
Test Statistic Cont…
• The t distribution has the following properties:
– The mean of the distribution is equal to 0.
– Symmetrical about the mean.
– The variance is equal to $\nu / (\nu - 2)$, where $\nu$ is the df
(i.e. $\nu > 2$). In general the variance is greater than 1,
but approaches 1 as the sample size becomes large.
– Extends from −∞ to +∞.
– Compared to the normal distribution, the t distribution is
less peaked in the center and has higher tails.
– The t distribution approaches the normal distribution
as n − 1 approaches infinity.
303
Test Statistic Cont…
304
Test Statistic Cont…
• For the t distribution to apply strictly we need the
following two assumptions:
1. The observations are selected at random from the
population.
2. The population distribution is normal.
• The second assumption need not be met exactly, as
the t test is robust to moderate departures from the normal
distribution.
• That means even when assumption 2 is not satisfied, the
probabilities calculated from the t table are still
approximately correct.
305
Test Statistic Cont…
Chi Square Distribution ($\chi^2$):
• Mainly developed by Karl Pearson (1857–1936)
• A type of probability distribution like Z or t.
• Represented by the Greek letter Chi ($\chi$)
• It is the distribution of the sum of the squared values of
the observations drawn from the N(0,1) distribution.
• Let {$X_1, X_2, ..., X_n$} be n independent random variables,
all ~ N(0,1).
• Then $\chi^2_n$ is defined as the distribution of the sum
$X_1^2 + X_2^2 + ... + X_n^2$.
306
Test Statistic Cont…
• Mainly used to check association between two
categorical variables.
• It is the most frequently used statistical technique for
analysis of count or frequency data.
• It is not a single distribution but rather a family of
distributions, indexed by the df.
• The mathematical formula of the $\chi^2$ distribution is given as
(where x > 0 and k is the df):
$$Y = \frac{1}{\left(\frac{k}{2} - 1\right)!}\,\frac{1}{2^{k/2}}\,x^{(k/2) - 1}\,e^{-(x/2)}$$
307
Test Statistic Cont…
• The graph is given as:
308
Test Statistic Cont…
• The formula for the test statistic which approximates the
$\chi^2$ distribution is (where O is the observed frequency and E
is the expected frequency):
$$\chi^2 = \sum \frac{(O - E)^2}{E}$$
• It has the following characteristics:
– Extends indefinitely to the right from 0.
– Has only one tail.
– As the df increases, the chi-square curve approaches
a normal distribution.
309
Test Statistic Cont…
310
Errors in Hypothesis Testing
• In testing hypotheses, two types of errors can be
committed: Type I and Type II errors.
• The probability of committing a type I error is denoted as
α. It is also called the level of significance (1 −
confidence level).
• The probability of committing a type II error is denoted
as β (1 − power of the study).

                     Decision of the hypothesis testing
Null hypothesis      Accept H0          Reject H0
H0 true              Correct            Type I error
H0 false             Type II error      Correct
311
One and Two Tailed Hypothesis
• Some hypotheses test whether one value is different
from another or not, without additionally predicting which
will be higher: non-directional or two-tailed test.
• At times some hypotheses not only test the difference of one
value from the other but also the direction of the difference,
i.e. whether it would be lower or higher: directional or
one-tailed test.
312
Level of Significance, Critical
Values and Critical Area
• In practice, the level of significance (α) is chosen arbitrarily.
• Three levels are commonly used: 0.01, 0.05 or 0.10
(depending on the confidence level).
• The smaller the level of significance, the stronger the
evidence required to reject the null hypothesis.
• The level of significance determines the values of the test
statistic that would cause us to reject the hypothesis.
• The corresponding test statistic values for the level of
significance are called the Critical Values.
• In a probability distribution the area to the
extreme right and/or left of the critical value is called the
Critical area (Rejection area).
• The area between the two critical values is called the
Acceptance Area.
313
Level of Significance, Critical
Values and Critical Area
• A level of significance has different critical values for one
and two tailed tests.
• A level of significance of 0.05 has critical values of ±1.96 if
the test is two tailed.
• However if the test is one tailed the critical value would
be +1.645 or −1.645, depending on the tail being tested.
• Note that critical values for a given level of significance
differ depending on the test statistic intended to be used.
315
Level of Significance, Critical
Values and Critical Area
α (level of       Two tailed    One tailed    One tailed
significance)     test          test, <       test, >
0.10              ±1.645        −1.28         +1.28
0.05              ±1.96         −1.645        +1.645
0.01              ±2.58         −2.33         +2.33
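The Z critical values in the table can be reproduced from the standard normal distribution. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `critical_values` is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, SD 1

def critical_values(alpha):
    """Return (two-tailed, one-tailed) Z critical values as positive numbers."""
    two_tailed = std_normal.inv_cdf(1 - alpha / 2)  # alpha split over both tails
    one_tailed = std_normal.inv_cdf(1 - alpha)      # alpha in a single tail
    return two_tailed, one_tailed

for alpha in (0.10, 0.05, 0.01):
    two, one = critical_values(alpha)
    print(f"alpha = {alpha:.2f}: two-tailed ±{two:.3f}, one-tailed {one:.3f}")
```

For a lower-tailed test the one-tailed value is used with a negative sign. Critical values for the t distribution depend on the df and are read from a t table instead.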
319
Interpretation and Conclusion
• Interpretation is made based on comparisons between:
– Test Statistic Calculated Vs Critical Value.
– P value Vs significance level.
• Conclusion (i.e. accepting and rejecting the null
hypothesis) should be made at the given level of
confidence.
320
Test of Hypothesis about Single
Population Mean
• Shows how to test the null hypothesis that the population
mean is equal to some hypothesized value.
• One begins with a statement that claims a particular
value for the unknown population mean.
• The hypothesis testing for single population mean either
accepts or rejects this statement.
• The Z test and the t test are used.
– Sample size ≥ 30: Z test
– Sample size < 30 and population SD known: Z test
– Sample size < 30 and population SD unknown: t test
321
Test of Hypothesis about Single
Population Mean Cont..
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$$
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}}$$
322
Test of Hypothesis about Single
Population Mean Cont..
Example 7.1:
• Researchers are interested in the mean level of an
enzyme in a certain population. They take a sample of
36 individuals, determine the level of enzyme in each
and compute a sample mean of 22. It is known that the
variable of interest is approximately normally distributed
with a standard deviation of 10. Let’s say that they are
asking the following question: Can we conclude that the
mean enzyme level in this population is different from
25?
323
Test of Hypothesis about Single
Population Mean Cont..
• Step 1 and 2: Define the $H_0$ and $H_1$:
$$H_0: \mu = 25 \qquad H_1: \mu \ne 25$$
• Step 3: Decide appropriate test statistic:
– Z test
• Step 4: Decide the level of significance and critical value:
– α value of 0.05.
– ±1.96 is the critical value.
• Step 5: Obtain the value of the test statistic:
324
Test of Hypothesis about Single
Population Mean Cont..
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{22 - 25}{10/\sqrt{36}} = \frac{-3}{1.67} = -1.80$$
325
Test of Hypothesis about Single
Population Mean Cont..
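• Step 6: Make a decision and interpret it.
– Accept the $H_0$ at 95% confidence level:
– −1.80 is within the acceptance region (between −1.96
and +1.96).
– The one-tail P value of 0.036 is > α/2 value of 0.025
(equivalently, the two-tailed P value of 0.072 is > α value of 0.05).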
326
Test of Hypothesis about Single
Population Mean Cont..
Example 7.2:
• The researchers mentioned in example 7.1, instead of
asking if they could conclude that µ≠25, they asked: Can
we conclude that the mean enzyme level in this
population is less than 25?
Solution:
• Step 1 and 2: Define the $H_0$ and $H_1$:
$$H_0: \mu \ge 25 \qquad H_1: \mu < 25$$
327
Test of Hypothesis about Single
Population Mean Cont..
• Step 3: Decide appropriate test statistic:
– Z test
• Step 4: Decide the level of significance and critical
value:
– α value of 0.05.
– −1.645 is the critical value (one-tailed test, lower tail).
• Step 5: Obtain the value of the test statistic:
$$Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} = \frac{22 - 25}{10/\sqrt{36}} = \frac{-3}{1.67} = -1.80$$
328
Test of Hypothesis about Single
Population Mean Cont..
• Step 6: Make a decision and interpret it.
– Reject the $H_0$ ($\mu \ge 25$) at 95% confidence level.
– The test statistic −1.80 is less than the critical value
−1.645, so it is within the rejection region.
– P value of 0.036 is less than the α value of 0.05.
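Examples 7.1 and 7.2 share the same test statistic and differ only in how the P value is read. A minimal Python sketch, assuming the normal distribution from the standard library:

```python
from math import sqrt
from statistics import NormalDist

def z_statistic(xbar, mu0, sigma, n):
    """Z test statistic for a single mean with known population SD."""
    return (xbar - mu0) / (sigma / sqrt(n))

z = z_statistic(xbar=22, mu0=25, sigma=10, n=36)

# Example 7.2 (H1: mu < 25): area in the lower tail only
p_one_tailed = NormalDist().cdf(z)
# Example 7.1 (H1: mu != 25): area in both tails
p_two_tailed = 2 * NormalDist().cdf(-abs(z))

alpha = 0.05
print(f"Z = {z:.2f}")
print(f"Two-tailed P = {p_two_tailed:.3f}, reject H0: {p_two_tailed < alpha}")
print(f"One-tailed P = {p_one_tailed:.3f}, reject H0: {p_one_tailed < alpha}")
```

The same data therefore lead to different decisions: the two-tailed test does not reject the null hypothesis at α = 0.05, while the one-tailed test does.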
329
Test of Hypothesis about Single
Population Mean Cont..
Example 7.3:
• Serum amylase level determination was made on a
sample of 15 apparently healthy subjects. The sample
yielded a mean of 96 units/100 ml and a standard
deviation of 35 units/100 ml. The variance of the
population was unknown. We want to know whether we
can conclude that the mean of the population is different
from 120 units/100 ml.
330
Test of Hypothesis about Single
Population Mean Cont..
• Step 1 and 2: Define the $H_0$ and $H_1$.
$$H_0: \mu = 120 \qquad H_1: \mu \ne 120$$
• Step 3: Decide appropriate test statistic.
– t test
• Step 4: Decide level of significance and critical value.
– α value of 0.05.
– t value for α/2 of 0.025 at df of 14: ±2.145
• Step 5: Obtain the value of the test statistic.
$$t = \frac{\bar{X} - \mu}{S/\sqrt{n}} = \frac{96 - 120}{35/\sqrt{15}} = -2.65$$
331
Test of Hypothesis about Single
Population Mean Cont..
• Step 6: Make a decision and interpret it.
• We reject the null hypothesis because:
– The calculated test statistic −2.65 is in the rejection area.
– The corresponding one-tail P value for −2.65 (between 0.01
and 0.005) is less than the α/2 value of 0.025.
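The calculation in Example 7.3 can be sketched as follows. The Python standard library has no t distribution, so the critical value 2.145 is simply taken from the slide (t table, df = 14, two-tailed α = 0.05) rather than computed.

```python
from math import sqrt

def t_statistic(xbar, mu0, s, n):
    """One-sample t statistic using the sample SD."""
    return (xbar - mu0) / (s / sqrt(n))

t = t_statistic(xbar=96, mu0=120, s=35, n=15)
df = 15 - 1

# Critical value taken from a t table (df = 14, two-tailed alpha = 0.05)
t_critical = 2.145
reject_h0 = abs(t) > t_critical

print(f"t = {t:.3f} with df = {df}")
print("Reject H0:", reject_h0)
```

The unrounded statistic is about −2.656; the slide's −2.65 is the same value truncated.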
332
Testing of Hypothesis about Two
Population Means
• Compare the difference between two population means.
• $H_0$: there is no difference between the two means.
• $H_1$: there is a difference between the two means.
• Z or t test can be employed.
• Sum up the sample sizes of the two groups; if the total is
greater than 30 use the Z test, if less than 30 use the t test.
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}}$$
333
Testing of Hypothesis about Two
Population Means Cont..
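• t test is carried out with df of $n_1 + n_2 - 2$
$$t = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{S^2}{n_1} + \frac{S^2}{n_2}}}$$
where S is the pooled standard deviation:
$$S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}}$$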
334
Testing of Hypothesis about Two
Population Means Cont..
Example 7.4:
• A researcher wants to check whether the systolic blood
pressure (SBP) among males is different from females or not.
Among 50 male samples the mean SBP was 100 mmHg
with standard deviation of 5 mmHg. Among 60 females,
the mean SBP was 104 mmHg with standard deviation of
10 mmHg. Is there a significant difference between the two
means?
335
Testing of Hypothesis about Two
Population Means Cont..
• Step 1 and 2: Define the $H_0$ and $H_1$
$$H_0: \mu_m = \mu_f \qquad H_1: \mu_m \ne \mu_f$$
• Step 3: Decide appropriate test statistic:
– Z test
• Step 4: Decide the level of significance and critical
value:
– α value of 0.05.
– ±1.96 is the critical value.
• Step 5: Obtain the value of the test statistic:
$$Z = \frac{\bar{X}_1 - \bar{X}_2}{\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}} = \frac{100 - 104}{\sqrt{\frac{5^2}{50} + \frac{10^2}{60}}} = \frac{-4}{\sqrt{0.5 + 1.67}} = \frac{-4}{1.47} = -2.72$$
336
Testing of Hypothesis about Two
Population Means Cont..
• Step 6: Make a decision and interpret it.
– We reject the $H_0$ and accept the $H_1$ ($\mu_m \ne \mu_f$) (at 95%
confidence level) because:
– The calculated test statistic −2.72 is in the rejection region.
– The corresponding one-tail P value for −2.72 (0.0033) is less
than the α/2 value of 0.025.
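Example 7.4 can be checked with a few lines of Python. As in the slide, the sample SDs are used in place of the population SDs, which is an approximation that relies on the samples being large.

```python
from math import sqrt
from statistics import NormalDist

def two_sample_z(x1, s1, n1, x2, s2, n2):
    """Z statistic for the difference between two independent means."""
    return (x1 - x2) / sqrt(s1**2 / n1 + s2**2 / n2)

# Males: mean 100, SD 5, n 50; females: mean 104, SD 10, n 60
z = two_sample_z(100, 5, 50, 104, 10, 60)
p_two_tailed = 2 * NormalDist().cdf(-abs(z))

alpha = 0.05
print(f"Z = {z:.2f}, two-tailed P = {p_two_tailed:.4f}")
print("Reject H0:", p_two_tailed < alpha)
```

Comparing the two-tailed P value with α is equivalent to comparing the one-tail area with α/2, as done on the slide.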
337
Testing of Hypothesis about Two
Population Means Cont..
Example 7.5:
• Serum amylase determination was made on a sample of
15 apparently healthy subjects and 12 hospitalized
subjects. Among healthy subjects, the mean was 96
units/100 ml with standard deviation of 35 units/100 ml.
Among hospitalized patients, the mean was 120
units/100 ml with standard deviation of 40 units/100 ml. Is
there a significant difference between the two mean
values?
338
Testing of Hypothesis about Two
Population Means Cont…
• Step 1 and 2: Define the $H_0$ and $H_1$
$$H_0: \mu_1 = \mu_2 \qquad H_1: \mu_1 \ne \mu_2$$
• Step 3: Decide appropriate test statistic.
– t test
• Step 4: Decide level of significance and critical value.
– α value of 0.01.
– t value for α/2 of 0.005 at df of 25: ±2.787
• Step 5: Obtain the value of the test statistic
$$S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(14)(35^2) + (11)(40^2)}{25}} = \sqrt{\frac{17150 + 17600}{25}} = \sqrt{1390} = 37.3$$
339
Testing of Hypothesis about Two
Population Means Cont…
$$t = \frac{96 - 120}{\sqrt{\frac{37.3^2}{15} + \frac{37.3^2}{12}}} = \frac{-24}{14.4} = -1.67$$
• Step 6: Make a decision and interpret it.
• We accept the null hypothesis (at 99% confidence level)
because:
• The calculated test statistic −1.67 is in the acceptance
region.
• The corresponding one-tail P value for −1.67 (which is between
0.1 and 0.05) is greater than the α/2 value of 0.005.
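The pooled-variance calculation of Example 7.5 is sketched below. The critical value 2.787 is taken from the slide (t table, df = 25, two-tailed α = 0.01), since the Python standard library does not provide the t distribution.

```python
from math import sqrt

def pooled_sd(s1, n1, s2, n2):
    """Pooled standard deviation of two independent samples."""
    return sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / (n1 + n2 - 2))

def pooled_t(x1, s1, n1, x2, s2, n2):
    """Two-sample t statistic using the pooled SD."""
    s = pooled_sd(s1, n1, s2, n2)
    return (x1 - x2) / sqrt(s**2 / n1 + s**2 / n2)

s = pooled_sd(35, 15, 40, 12)
t = pooled_t(96, 35, 15, 120, 40, 12)
df = 15 + 12 - 2

# Critical value taken from a t table (df = 25, two-tailed alpha = 0.01)
t_critical = 2.787
reject_h0 = abs(t) > t_critical

print(f"Pooled SD = {s:.1f}, t = {t:.2f}, df = {df}")
print("Reject H0:", reject_h0)
```

Without intermediate rounding the statistic is about −1.66, which leads to the same decision as the slide's −1.67.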
340
Testing of Hypothesis about Two
Population Means Cont…
Paired t test for difference between two means:
• Every observation in one sample has one matching
observation in the second sample.
• Commonly used in evaluation of interventions like new
treatment modalities.
• Hence pre and post intervention (treatment) results are
compared.
• Usually t test is used since individuals involved in the
trial are few.
• The null hypothesis: there is no difference between the
two sets of paired measurements (the mean difference is zero).
341
Testing of Hypothesis about Two
Population Means Cont…
• Procedures of hypothesis testing are the same, except
for the formula used to calculate the test statistic:
$$t = \frac{\bar{d}}{SD/\sqrt{n}}$$
– $\bar{d}$ = mean of the differences between the two samples.
– SD = the standard deviation of the differences
between the two samples.
– n = the number of paired cases.
• Note that the calculated test statistic is compared at
n − 1 degrees of freedom.
342
Testing of Hypothesis about Two
Population Means Cont…
Example 7.6:
• A random sample of 10 young men was taken and the
pulse rate (PR) was measured before and after taking a cup
of coffee. The result is given as follows. Does coffee
have any effect on the heart rate? (Perform the hypothesis
test at the 95% confidence level.)
343
Testing of Hypothesis Cont…
Subject   PR before   PR after   Difference
1         68          74         +6
2         64          68         +4
3         52          60         +8
4         76          72         −4
5         78          76         −2
6         62          68         +6
7         66          72         +6
8         76          76         0
9         78          80         +2
10        60          64         +4
Mean      68          71         +3
344
3/1/2010
87
Testing of Hypothesis about Two
Population Means Cont…
• H0: Coffee intake has no effect on PR
• H1: Coffee intake has an effect on PR
• Test statistic: t test (paired)
• Critical value: ±2.262 (α = 0.05, df = 9)
• First calculate the SD, then the test statistic:
$$SD = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n-1}} = 3.92 \qquad t = \frac{3}{3.92/\sqrt{10}} = 2.4$$
• Since 2.4 > 2.262, reject the null hypothesis (at 95% confidence level)
• Coffee intake has an effect on PR.
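As a cross-check of Example 7.6, the paired t statistic can be computed directly from the ten before/after readings in the table. This is a minimal sketch using only the Python standard library; the variable names are illustrative.

```python
import math
import statistics

# Pulse rates from the table in Example 7.6
before = [68, 64, 52, 76, 78, 62, 66, 76, 78, 60]
after = [74, 68, 60, 72, 76, 68, 72, 76, 80, 64]

# Differences (after minus before), one per subject
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
mean_d = statistics.mean(diffs)
sd_d = statistics.stdev(diffs)           # sample SD, divisor n - 1
t_stat = mean_d / (sd_d / math.sqrt(n))  # compare with t at n - 1 df

print(mean_d, round(sd_d, 2), round(t_stat, 2))
```

The statistic is then compared with the critical value ±2.262 at 9 degrees of freedom, as on the slide.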
345
Test of Hypothesis About Single
Population Proportion
• The null hypothesis states that the population proportion is
equal to some hypothesized value.
• One begins with a statement that claims a particular
value for the unknown population proportion.
• The hypothesis test for a single population proportion
either accepts or rejects this statement.
• Here the Z test statistic is used, where p is the sample proportion,
π the hypothesized population proportion and n the sample size.
The formula is given as:
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}}$$
346
Test of Hypothesis on Means
Using SPSS
• In SPSS the one-sample T test, independent-samples T test and
paired-samples T test are available under:
• Analyze > Compare Means > One-Sample T Test, Independent-Samples
T Test or Paired-Samples T Test
347
Test of Hypothesis About Single
Population Proportion
Example 7.7:
• A survey was conducted to determine the prevalence of
protein energy malnutrition in a rural kebele. Of 300
under five children assessed, 123 were stunted. Can we
conclude that the prevalence of PEM in the population is
50%?
348
Test of Hypothesis About Single
Population Proportion
• Step 1 and 2: Define the H0 and H1:
$$H_0: \pi = 0.5 \qquad H_1: \pi \neq 0.5$$
• Step 3: Appropriate test statistic:
– Z statistic
• Step 4: Decide the level of significance and the
corresponding critical value:
– Let’s take α value of 0.1. Hence ±1.645 is the critical
value.
• Step 5: Obtain the value of the test statistic (p = 123/300 = 0.41):
$$Z = \frac{p - \pi}{\sqrt{\dfrac{\pi(1-\pi)}{n}}} = \frac{0.41 - 0.5}{\sqrt{\dfrac{0.5(0.5)}{300}}} = \frac{-0.09}{\sqrt{\dfrac{0.25}{300}}} \approx -3.11$$
349
Test of Hypothesis About Single
Population Proportion
• Step 6: Make a decision and interpret it.
• At 90% confidence level we reject the null hypothesis
that π = 0.5.
– The calculated test statistic −3.11 is in the rejection
region.
– The corresponding one-tailed P value of −3.11 (i.e. 0.0009) is
less than the α/2 value of 0.05.
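The single-proportion Z test of Example 7.7 can be sketched as follows. The function name `one_proportion_z` is illustrative, and the two-sided P value is computed from the standard normal distribution. Without intermediate rounding the statistic is about −3.12; the slide's −3.11 reflects rounding of the standard error.

```python
import math
from statistics import NormalDist

def one_proportion_z(successes, n, pi0):
    """Z test of H0: population proportion equals pi0."""
    p_hat = successes / n
    se = math.sqrt(pi0 * (1 - pi0) / n)        # SE under the null hypothesis
    z = (p_hat - pi0) / se
    p_two_sided = 2 * NormalDist().cdf(-abs(z))
    return p_hat, z, p_two_sided

# Example 7.7: 123 stunted children out of 300, hypothesized prevalence 50%
p_hat, z, p_two_sided = one_proportion_z(123, 300, 0.5)
print(round(p_hat, 2), round(z, 2), round(p_two_sided, 4))
```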
350
Testing of Hypothesis About
Two Population Proportions
• The null hypothesis states that a population proportion is equal
to another population proportion.
• The hypothesis test for two population proportions
either accepts or rejects this statement.
• Here the Z test statistic is used. The formula is given as:
$$Z = \frac{p_1 - p_2}{\sqrt{P(1-P)\left(\dfrac{1}{n_1} + \dfrac{1}{n_2}\right)}}$$
• where P is the pooled proportion:
$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$$
351
Testing of Hypothesis About
Two Population Proportions
Example 7.8:
• The prevalence of malaria in two malaria endemic
kebeles, X and Y, was compared. In kebele X, among 120
samples 15 were positive. In kebele Y, among 100
samples 20 were positive. Is there any significant
difference between the prevalence of malaria in kebeles X
and Y?
352
Testing of Hypothesis About
Two Population Proportions
• Step 1 and 2: Define the H0 and H1:
$$H_0: P_1 = P_2 \qquad H_1: P_1 \neq P_2$$
• Step 3: Decide appropriate test statistic:
– Z statistic
• Step 4: Decide the α value & the critical value:
– Let’s take α value of 0.05. Hence ±1.96 is the critical
value.
• Step 5: Obtain the value of the test statistic:
– First calculate the proportions & the pooled proportion
– P1 = 15/120 = 0.125, P2 = 20/100 = 0.2
353
Testing of Hypothesis About
Two Population Proportions
$$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{120(0.125) + 100(0.2)}{120 + 100} = \frac{15 + 20}{220} = 0.159$$
• Then we calculate the test statistic:
$$Z = \frac{0.125 - 0.2}{\sqrt{0.159(1 - 0.159)\left(\dfrac{1}{120} + \dfrac{1}{100}\right)}} = \frac{-0.075}{\sqrt{(0.1337)(0.0183)}} = -1.51$$
• Step 6: Make a decision and interpret it.
At 95% confidence level we accept the H0 (P1 = P2) b/c:
– The calculated test statistic −1.51 is in the acceptance region.
– The corresponding one-tailed P value (0.0655) is greater than the α/2 value of 0.025.
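The two-proportion Z test of Example 7.8 can be reproduced with a few lines of Python. This is a sketch only; the function name `two_proportion_z` is illustrative, and the one-tailed P value is taken from the standard normal distribution to match the slide's comparison with α/2.

```python
import math
from statistics import NormalDist

def two_proportion_z(x1, n1, x2, n2):
    """Z test of H0: P1 = P2 using the pooled proportion."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    p_one_tailed = NormalDist().cdf(-abs(z))
    return pooled, z, p_one_tailed

# Example 7.8: 15 of 120 positive in kebele X, 20 of 100 in kebele Y
pooled, z, p_one_tailed = two_proportion_z(15, 120, 20, 100)
print(round(pooled, 3), round(z, 2), round(p_one_tailed, 3))
```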
354
Test of Hypothesis on
Proportions Using SPSS
• There is no “point and click” option in SPSS to do such
hypothesis testing on proportions.
• Syntax based analysis can be done.
355
Test of Hypothesis about
Categorical Data
• It is also possible to apply hypothesis testing on
categorical data.
• The Chi-square (χ²) test statistic is commonly used.
• This test is usually applied to tabulated data.
• The table contains two variables called the row and
column variables.
• The test measures the discrepancy between the K observed
frequencies (O) and the corresponding K expected
frequencies (e), i.e. over all cells of the tabulation.
• Expected frequencies are the frequencies that would occur
if there were no association between the row and
column variables.
356
Test of Hypothesis about
Categorical Data
• The H0 of the Chi-square test is that there is no association
between the row and column variables.
• While the H1 is that there is an association between the row and
column variables.
• The closer the observed frequencies are to the expected
frequencies, the more likely the H0 is true.
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}$$
$$e = \frac{\text{row total for the cell} \times \text{column total for the cell}}{\text{grand total}}$$
357
Test of Hypothesis about
Categorical Data
• Assumptions of Chi-square test:
– No cell of the table has an expected frequency less than
1,
– No more than 20% of the expected frequencies
should be less than 5.
• The Chi-square statistic should be compared with the Chi-square
distribution with df of (R−1)(C−1), where R and C are the numbers of rows and columns.
• Though the rejection region lies only in the upper tail of the
Chi-square distribution, the test is always two tailed (non-directional).
358
Test of Hypothesis about
Categorical Data
Example 7.9:
• A researcher is interested in assessing the effect of literacy
on family planning (FP) use. Accordingly he collected data
and tabulated the findings in the following manner. Can
we say there is an association between educational status
and family planning use?
             Educational Status
FP use     Illiterate   Literate   Total
Yes        63           49         112
No         15           33         48
Total      78           82         160
359
Test of Hypothesis about
Categorical Data
• Step 1 and 2: Define the H0 and H1:
– H0: There is no association between literacy and
family planning use.
– H1: There is an association between literacy and family
planning use.
• Step 3: Decide appropriate test statistic:
– χ² test.
• Step 4: Decide α and the corresponding critical value:
– Let’s take α value of 0.01.
– At df of 1 the critical value is 6.635.
– Acceptance area is 0 to 6.635; rejection area is χ² > 6.635.
360
Test of Hypothesis about
Categorical Data
• Step 5: Obtain the value of the test statistic:
– First the expected frequency should be calculated:
• Expected frequency for cell a: 78 x 112/160 = 54.6
• Expected frequency for cell b: 82 x 112/160 = 57.4
• Expected frequency for cell c: 78 x 48/160 = 23.4
• Expected frequency for cell d: 82 x 48/160 = 24.6
– Assumptions of the χ² test are fulfilled (no expected frequency is less than 5).
– Then we calculate the Chi-square statistic.
$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}$$
361
Test of Hypothesis about
Categorical Data
$$\chi^2 = \frac{(63 - 54.6)^2}{54.6} + \frac{(49 - 57.4)^2}{57.4} + \frac{(15 - 23.4)^2}{23.4} + \frac{(33 - 24.6)^2}{24.6}$$
$$\chi^2 = 1.29 + 1.23 + 3.02 + 2.87 = 8.41$$
• Step 6: Make a decision and interpret it.
• At 99% confidence level we accept the H1 that the two
variables are associated due to the following reasons:
– The calculated test statistic 8.41 is in the rejection area.
– The corresponding P value of 8.41 (between 0.002 and
0.005) is less than the α value (0.01).
• But what is the direction of the association?
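The expected frequencies and the Chi-square statistic of Example 7.9 can be checked with a short Python sketch. The P value line relies on the identity P(χ² > x) = 2[1 − Φ(√x)], which holds only for 1 degree of freedom; for larger tables a Chi-square table or statistical software would be needed.

```python
import math
from statistics import NormalDist

# Observed 2 x 2 table from Example 7.9 (rows: FP use yes/no;
# columns: illiterate/literate)
observed = [[63, 49],
            [15, 33]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = sum(row_totals)

# Expected frequency = row total x column total / grand total
expected = [[r * c / grand_total for c in col_totals] for r in row_totals]

chi2 = sum((o - e) ** 2 / e
           for obs_row, exp_row in zip(observed, expected)
           for o, e in zip(obs_row, exp_row))
df = (len(observed) - 1) * (len(observed[0]) - 1)

# Valid for df = 1 only: chi-square with 1 df is the square of a standard normal
p_value = 2 * (1 - NormalDist().cdf(math.sqrt(chi2)))

print([[round(e, 1) for e in row] for row in expected], round(chi2, 2), df)
```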
362
Test of Hypothesis about
Categorical Data Using SPSS
• In order to do the Chi-square test using SPSS, follow the
following steps.
• Analyze > Descriptive Statistics > Crosstabs > Put the two
categorical variables as column and row > Statistics >
Check “Chi-square” > OK.
• The Chi-square test is given in a table as “Pearson Chi-Square”.
363
Fisher's exact test
• Fisher's exact test is a statistical significance test used in the
analysis of contingency tables when sample size is small.
(when assumption of chi square test are not fulfilled)
• It is named after its inventor, R. Fisher.
• For hand calculations, the test is only feasible in the case of a
2 x 2 contingency table.
• Its application to higher order tables is controversial.
• H
0
: there is no association between the two variables
• H
1
: there is association between the two variables
• The hypothesis is tested by comparing the probability of
observing the given or more extreme tables with the level of
significance, given the null hypothesis is true.
364
Fisher's exact test
• The exact probability of observing a given table is given as:
• P = [(a+b)!(c+d)!(a+c)!(b+d)!]/[N!a!b!c!d!]
a        b        (a+b)
c        d        (c+d)
(a+c)    (b+d)    N
365
Fisher's exact test
• Hypothesis testing using fisher’s exact test involves the
following steps:
1. Calculate the probability of the observed table itself,
2. List all possible extreme tables manually (given the
marginal totals are maintained),
3. Calculate their respective exact probability,
4. Calculate the probability of getting observed or more
extreme tables,
5. Multiply the total by 2 (to get 2 tailed value)
6. Compare the resulting P value with the value of α.
366
Fisher's exact test
Example 7.10:
• In the following tabulated data, Is there any
association between the treatment type and survival
rate of patients? (Test the hypothesis at 95%
confidence level)
Treatment type Survived Died Total
A 7 2 9
B 5 6 11
Total 12 8 20
367
Fisher's exact test
• H
0
: No association between the treatment modalities and
survival rate.
• H
1
: There is association between the treatment
modalities and survival rate.
• Test statistic: Fisher’s exact test, b/c two of the expected
frequencies have values less than 5.
• Level of significance: 5%
• Calculate the probability of getting the given or more
extreme tables.
368
Fisher's exact test
• Observed table:
• Probability of observing this table = (9!11!12!8!)/(20!7!2!5!6!)
= 0.132
Treatment type Survived Died Total
A 7 2 9
B 5 6 11
Total 12 8 20
369
Fisher's exact test
• First possible extreme table:
• Probability of observing this table = (9!11!12!8!)/(20!8!1!4!7!)
= 0.024
Treatment type Survived Died Total
A 8 1 9
B 4 7 11
Total 12 8 20
370
Fisher's exact test
• Second possible extreme table:
• Probability of observing this table = (9!11!12!8!)/(20!9!0!3!8!)
= 0.001
Treatment type Survived Died Total
A 9 0 9
B 3 8 11
Total 12 8 20
371
Fisher's exact test
• Probability of getting the observed or more extreme
tables:
– 0.132 + 0.024 + 0.001 = 0.157 (one tailed)
– Two tailed 2 x 0.157 = 0.314
• Conclusion and interpretation:
– Accept the null hypothesis at 95% confidence level, since 0.314 > 0.05
– There is no evidence of an association between the treatment
modalities and survival rate.
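The hand calculation in Example 7.10 can be sketched in Python using binomial coefficients, which give the same value as the factorial formula. The helper name `table_probability` is illustrative. Doubling the one-tailed value follows the procedure on these slides; statistical packages may define the two-sided P value differently, so their output need not match 0.314 exactly.

```python
from math import comb

def table_probability(a, b, c, d):
    """Exact probability of a 2 x 2 table with fixed marginal totals.

    Equivalent to (a+b)!(c+d)!(a+c)!(b+d)! / (N! a! b! c! d!).
    """
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)

# Observed table and the two more extreme tables with the same margins
tables = [(7, 2, 5, 6), (8, 1, 4, 7), (9, 0, 3, 8)]
probabilities = [table_probability(*t) for t in tables]

one_tailed = sum(probabilities)
two_tailed = 2 * one_tailed   # doubling rule used on the slides

print([round(p, 3) for p in probabilities], round(one_tailed, 3), round(two_tailed, 3))
```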
372
Fisher's exact test using
SPSS
• In order to do Fisher’s exact test using SPSS, follow the
following steps.
• Analyze > Descriptive Statistics > Crosstabs > Put the
two categorical variables as column and row > Statistics
> Check “Chi-square” > OK.
• Fisher’s exact test is given in a table titled “Chi-Square
Tests”.
• NB: SPSS doesn’t do Fisher’s exact test for higher order
tables.
373
Summary
• The interpretation of the hypothesis test is dependent on the
confidence level at which the test is conducted.
• A hypothesis which is accepted at a lower level of confidence
can not be rejected at a higher level of confidence.
• A hypothesis which is rejected at a lower level of confidence
can be accepted at a higher level of confidence.
• A hypothesis which is rejected at a higher level of confidence
can not be accepted at a lower level of confidence.
• A hypothesis which is accepted at a higher level of confidence
can be rejected at lower level of confidence.
374
Sample Size Calculation for
Comparative Studies.
• The concepts discussed in this chapter can be applied to
the calculation of sample size for comparative studies.
• For comparative studies like case control, cohort and
interventional studies, the optimal size for the two groups is
calculated using the formula:
$$n_1 = \frac{\left[Z_{\alpha}\sqrt{\left(1 + \dfrac{1}{r}\right)P(1-P)} + Z_{\beta}\sqrt{P_1(1-P_1) + \dfrac{P_2(1-P_2)}{r}}\right]^2}{(P_1 - P_2)^2}$$
• Where
$$P = \frac{P_1 + rP_2}{1 + r}$$
375
Sample Size Calculation
Cont..
• Where:
P is the pooled proportion
P1 is the expected 1st proportion
P2 is the expected 2nd proportion
r is the number of controls per case
Alpha is the probability of type I error
Beta is the probability of type II error
n1 is the sample size for the first group
NB: n2 is calculated by multiplying n1 by r.
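A minimal sketch of the sample size formula follows. The input values (P1 = 0.4, P2 = 0.2, r = 1, α = 0.05, power = 80%) are hypothetical and chosen only for illustration, and the function name is an assumption. The sketch also assumes a two-sided test, so the Z value for α is taken at α/2; the result is rounded up to the next whole subject.

```python
import math
from statistics import NormalDist

def sample_size_group1(p1, p2, r=1.0, alpha=0.05, power=0.80):
    """Sample size of the first group for comparing two proportions."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # assumption: two-sided test
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - beta
    p_bar = (p1 + r * p2) / (1 + r)                # pooled proportion
    term1 = z_alpha * math.sqrt((1 + 1 / r) * p_bar * (1 - p_bar))
    term2 = z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2) / r)
    return math.ceil((term1 + term2) ** 2 / (p1 - p2) ** 2)

# Hypothetical inputs, for illustration only
n1 = sample_size_group1(0.4, 0.2, r=1)
n2 = math.ceil(1 * n1)   # n2 = r x n1
print(n1, n2)
```

A smaller expected difference between P1 and P2 increases the required sample size, which is a quick sanity check on any implementation of the formula.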
376
Correlation and Linear
Regression
Regression and Correlation
• Many medical investigations are concerned with:
– Establishment of relationship between two variables.
– The strength of a relationship.
– Predicting one variable on the basis of another.
– Controlling the effect of unwanted variables.
• Such intentions can be addressed either by using
correlation or regression analysis.
378
Correlation Analysis
• Initially developed by Sir Francis Galton (1888) and Karl
Pearson (1896)
• Correlation is the quantification of the degree to which two
random quantitative variables are related provided the
relationship is linear.
• Both of the variables should be measured on the same
set of study units.
• Strength of relationship measurement: Correlation
Coefficient.
• Most commonly used coefficient: Product Moment
Correlation or Pearson Correlation Coefficient (r).
• The symbol rho (ρ) is used to represent the population
correlation coefficient.
• Unitless measure.
379
Correlation Analysis
• Does not imply cause and effect relationship.
• The value of r ranges from −1 to +1.
• If the correlation coefficient is greater than 0, the
variables are said to be positively correlated (i.e. as X
increases, Y tends to increase).
• If the correlation coefficient is less than 0, the variables
are said to be negatively correlated (i.e. as X increases,
Y tends to decrease).
• If the correlation coefficient is 0 then the variables are
said to be uncorrelated.
380
Correlation Analysis Cont…
• The formula for computing the sample correlation coefficient
(r) for two variables X and Y is given as:
$$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$$
• Or
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
• Before computing r, a scatter plot of the two
variables should be drawn. Why?
381
Correlation Analysis Cont…
[Figure: scatter plots of y against x showing linear relationships and curvilinear relationships]
382
Correlation Analysis Cont…
[Figure (continued): scatter plots of y against x showing strong relationships and weak relationships]
383
Correlation Analysis Cont…
[Figure (continued): scatter plots of y against x showing no relationship]
384
Correlation Analysis Cont…
• Assumptions of correlation analysis:
– Independent random samples are taken
– Both variables are on interval/ratio scale
– Linear association between X and Y
– Paired measures for X and Y
– Normal distribution for X and Y
– Homogeneity of variance (Homoscedasticity)
• In situations where its assumptions are violated,
correlation becomes inadequate to explain a given
relationship.
385
Correlation Analysis Cont…
Example 8.1:
• The data of a random sample of 20 countries are shown
in the following table. X represents the percentage of
children immunized by age one year and Y represents
the under five year mortality rate. Determine the strength
of association between the two variables.
386
Correlation Analysis Cont…
387
Country   % Immunized (X)   CMR/1000LB (Y)   XY   Y²   X²
Bolivia 77 118 9086 13924 5929
Brazil 69 65 4485 4225 4761
Cambodia 32 184 5888 33856 1024
Canada 85 8 680 64 7225
China 94 43 4042 1849 8836
Czech 99 12 1188 144 9801
Egypt 89 55 4895 3025 7921
Ethiopia 13 208 2704 43264 169
Finland 95 7 665 49 9025
France 95 9 855 81 9025
Greece 54 9 486 81 2916
India 89 124 11036 15376 7921
Italy 95 10 950 100 9025
Japan 87 6 522 36 7569
Mexico 91 33 3003 1089 8281
Poland 98 16 1568 256 9604
Russia 73 32 2336 1024 5329
Senegal 47 145 6815 21025 2209
Turkey 76 87 6612 7569 5776
UK 90 9 810 81 8100
Total 1548 1180 68626 147118 130446
Correlation Analysis Cont…
$$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - (\sum x)^2\right]\left[n\sum y^2 - (\sum y)^2\right]}}$$
$$r = \frac{20(68626) - (1548 \times 1180)}{\sqrt{\left[20(130446) - (1548)^2\right]\left[20(147118) - (1180)^2\right]}}$$
$$r = -0.79$$
• There is a strong negative linear relationship between the two
variables.
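The computation in Example 8.1 can be reproduced from the raw columns of the table. This is a minimal sketch of the computational formula; the function name `pearson_r` is illustrative.

```python
import math

# Data from the table in Example 8.1 (20 countries, in table order)
immunized = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95,
             54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
mortality = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9,
             9, 124, 10, 6, 33, 16, 32, 145, 87, 9]

def pearson_r(x, y):
    """Pearson correlation coefficient via the computational formula."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator

r = pearson_r(immunized, mortality)
print(round(r, 2))
```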
388
Correlation Analysis Cont…
• Interpretation option:
– 100 × r² (%):
• Shows the proportion of variation of a variable
explained by the other.
– Rule of thumb:
Size of Coefficient   General Interpretation
0.8 to 1.0            Very strong relationship
0.6 to 0.8            Strong relationship
0.4 to 0.6            Moderate relationship
0.2 to 0.4            Weak relationship
0.0 to 0.2            Very weak or no relationship
389
Correlation Analysis Cont…
• Hypothesis Testing for a Correlation Coefficient
• As with means and proportions, it is also possible to
test the significance of a population correlation coefficient.
• For a two tailed test
– H0: ρ is 0
– H1: ρ is different from 0
• The t test statistic is given as (with n − 2 df):
$$t = r\sqrt{\frac{n-2}{1-r^2}}$$
390
Correlation Analysis Cont…
Example 8.2:
• At the 0.05 level of significance, can we claim the
correlation coefficient in example 8.1 indicates significant
negative relationship between immunization coverage
and child mortality?
391
Correlation Analysis Cont..
• The critical t value for the 0.05 level of significance (one tailed) at 18
degrees of freedom is −1.734. Then we calculate the test
statistic.
$$t = r\sqrt{\frac{n-2}{1-r^2}} = -0.79\sqrt{\frac{20-2}{1-(-0.79)^2}} = -0.79\sqrt{\frac{18}{0.3759}} = -5.47$$
• Since −5.47 < −1.734, we accept the H1 that r indicates a significant
negative relationship between immunization coverage
and child mortality.
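The test statistic in Example 8.2 can be sketched as a small function. The function name is illustrative; the inputs r = −0.79 and n = 20 come from Example 8.1.

```python
import math

def correlation_t(r, n):
    """t statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

t_stat = correlation_t(-0.79, 20)
print(round(t_stat, 2))
```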
392
Correlation Analysis Cont..
Limitations:
• Applied only to a linear relationship.
• One must not extrapolate an observed correlation
beyond observed ranges of the x and y value.
• Does not differentiate dependent and independent
variable.
• Confounding by a third variable.
393
Correlation Analysis Cont..
Spearman’s Rank Correlation
• It is a nonparametric (distribution-free) rank statistic
proposed by Charles Spearman in 1904 as a measure of
the strength of the association between two variables.
• Denoted as r_s
• Is applied when:
– The normality assumption is not satisfied or cannot be
tested,
– At least one of the variables is given in ordinal scale.
• In the calculation of the coefficient, actual values of both
variables should be changed into ranks.
394
Correlation Analysis Cont..
• The formula for the Spearman Correlation Coefficient is
(given that there is no tied rank):
$$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$$
• Where;
– 6 is a constant,
– D is the difference between a subject's ranks on the
two variables,
– n is the number of subjects.
• Consider the following example.
395
Correlation Analysis Cont..
The following table presents the MMR level and delivery service
coverage in 10 developing countries (D = MMR rank − coverage rank).
Country   MMR (per 100,000 LB)   MMR Rank   Delivery Service Coverage (%)   Coverage Rank   D    D²
1         315                    4          55                              6               −2   4
2         450                    6          40                              5               1    1
3         200                    1          70                              8               −7   49
4         250                    3          79                              10              −7   49
5         243                    2          75                              9               −7   49
6         830                    9          25                              3               6    36
7         850                    10         20                              2               8    64
8         656                    7          20                              1               6    36
9         701                    8          30                              4               4    16
10        410                    5          60                              7               −2   4
Total                                                                                            308
$$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 308}{10(100 - 1)} = 1 - \frac{1848}{990} = 1 - 1.87 = -0.87$$
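The Spearman calculation can be sketched from the two rank columns of the table. The ranks are taken exactly as given on the slide (including the ranks 2 and 1 assigned to the two countries with 20% coverage); the function name is illustrative.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's r_s from two rank lists, using the no-ties formula."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1)), sum_d2

# Ranks as listed in the table (countries 1 to 10)
mmr_rank = [4, 6, 1, 3, 2, 9, 10, 7, 8, 5]
coverage_rank = [6, 5, 8, 10, 9, 3, 2, 1, 4, 7]

r_s, sum_d2 = spearman_from_ranks(mmr_rank, coverage_rank)
print(sum_d2, round(r_s, 2))
```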
396
Correlation Analysis Cont..
• Inference about r_s
• For hypothesis testing a t score can be calculated (at df of
n − 2) using the formula:
$$t = \frac{r_s}{\sqrt{\dfrac{1 - r_s^2}{n - 2}}}$$
• For the previous example the t score would be:
$$t = \frac{-0.87}{\sqrt{\dfrac{1 - (-0.87)^2}{10 - 2}}} = -5$$
• If the hypothesis test is a two tailed test at the 0.05 level of
significance, we reject the H0 as |−5| = 5 > 2.306.
397
Correlation Analysis Cont..
Partial Correlation
• A method used to describe the relationship between two
variables while taking away the effects of another
variable, or several other variables, on this relationship.
• Still requires meeting all the usual assumptions of
Pearsonian correlation.
• But the covariate may not necessarily be numeric.
398
Correlation Analysis Using
SPSS
• In order to do correlation analysis using SPSS follow the
following steps:
• Analyze > Correlate > Bivariate > Put the
two variables in the variable box > Select Pearson or
Spearman (Kendall's tau-b is also available) > OK.
• Partial correlation can also be done.
• Analyze > Correlate > Partial.
• But before that, don’t forget the scatter plot.
399
Regression Analysis
• In correlation analysis the interest is to show how two
numeric variables are related.
• However in regression analysis, we are interested in
explaining or modeling a dependent variable (Y) as a
function of one or more independent variables (X).
• Regression analysis is used to:
– Assess association between two variables.
– Predict/explain the value of a dependent variable
based on the value of at least one independent
variable. (i.e. Mathematical modeling)
– Control for confounding factors.
– Show possible effect of interaction among variables.
400
Regression Analysis Cont..
• The general regression equation is given as:
$$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$$
Where: Y is the value of the dependent variable,
X is the independent variable,
α is the intercept,
β is the coefficient of the independent variable
• If the equation has only one independent variable the
regression is called Simple Regression
• If multiple independent variables are involved it is called
Multiple Regression.
• In public health the most commonly used types of
regression analysis are: Linear and Logistic Regression
401
Linear Regression
• Also known as linear least squares regression.
• It is by far the most widely used modeling method.
• The dependent variable is assumed to be a linear
function of one or more independent variables plus an
error term introduced to account for all other factors.
$$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$$
• Where Y is the dependent variable, the x's are the
independent variables and ε is the random error term.
• The DV (Y) is given in continuous numeric scale while
the IV/s (X) can be of any type (mostly numeric variables).
402
Linear Regression Cont..
• The equation provides what value the DV would have for
a given value/s of the IV/s.
• For example if we develop a linear model with the DV of
body height and the IV of serum growth hormone, we
can predict height for a person with a given value of
serum GH.
• Can be simple or multiple regression.
• It attempts to model the relationship between the
dependent and independent variables by fitting a linear
equation to observed data.
403
Linear Regression Cont..
• A scatter plot is helpful to assess the presence of a
linear trend of association.
• Consider the following data showing the number of
households in China with TV.
Year (X)                Households with TV
(0 represents 2000)     (millions)
0                       68
1                       72
2                       80
3                       83
404
Linear Regression Cont..
• If we plot these data, we get the following graph.
405
Linear Regression Cont..
• Although no straight line passes exactly through these
points, there are many straight lines that pass close to
them. Here is one of them.
406
Linear Regression Cont..
• How would you draw a line through the points? How do
you determine which line ‘fits best’?
• The most common method for fitting a regression line is
the method of least squares.
• This method calculates the best-fitting line for the
observed data by minimizing the sum of the squares of
the vertical deviations from each data point to the line.
• “Best fit” means the squared differences between actual Y values &
predicted Y values are at a minimum.
• Hence, linear regression is a method of finding the linear
equation that comes closest to fitting a collection of data
points.
407
Linear Regression Cont..
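[Figure: least-squares fit of a line to four data points in the X-Y plane, with the residuals $\hat{\varepsilon}_1$ to $\hat{\varepsilon}_4$ drawn as vertical distances from the points to the fitted line $\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_i$; for example $Y_2 = \hat{\beta}_0 + \hat{\beta}_1 X_2 + \hat{\varepsilon}_2$.]
LS minimizes
$$\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$$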
408
Linear Regression Cont..
• Suppose that we used the line rather than the data
points to estimate the number of households with TV.
• Then we would get slightly different values from the
original observed values shown above. These values are
called predicted values.
Households with TV (millions)
Year (X)                Observed   Predicted   Residual
(0 represents 2000)     Values     Values
0                       68         62          6
1                       72         70          2
2                       80         78          2
3                       83         86          −3
409
Linear Regression Cont..
• The better our choice of line, the closer the predicted
values will be to the observed values.
• The difference between the observed value and the
predicted value is called the residual.
• Residual = Observed Value − Predicted Value
• The best line is the line with the smallest sum of squares
of error (SSE) (i.e. least squares estimation).
• SSE = Sum of squares of residuals = $\sum (y_{\text{observed}} - y_{\text{predicted}})^2$
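The residuals and SSE for the TV households example can be computed as follows. The line Y = 62 + 8X is not stated on the slides; it is inferred here from the predicted values 62, 70, 78 and 86, so treat it as an assumption.

```python
# Observed households with TV (millions) for years 0 to 3
years = [0, 1, 2, 3]
observed = [68, 72, 80, 83]

def predict(x, intercept=62, slope=8):
    """Candidate line inferred from the slide's predicted values (assumption)."""
    return intercept + slope * x

predicted = [predict(x) for x in years]
residuals = [obs - pred for obs, pred in zip(observed, predicted)]
sse = sum(res ** 2 for res in residuals)   # sum of squared residuals

print(predicted, residuals, sse)
```

Any other candidate line can be compared by passing a different intercept and slope: the line with the smaller SSE fits better in the least squares sense.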
410
Linear Regression Cont..
• The manual calculation of the coefficients of linear
regression is possible when we have one independent
variable, i.e.:
Y = α + βX
• As with correlation analysis, here we should have a
set of paired DV and IV values for all study units.
• The line which represents the dataset (Y = α + βX) is
calculated using the formulas:
$$\beta = \frac{\sum xy - \dfrac{\sum x \sum y}{n}}{\sum x^2 - \dfrac{(\sum x)^2}{n}}$$
$$\alpha = \bar{y} - \beta\bar{x}$$
411
Linear Regression Cont..
• Consider the following data.
X   Y
1   1
2   1
3   2
4   2
5   4
• First we should plot a scatter diagram.
412
Linear Regression Cont..
[Figure: scatter plot of Y (axis 0 to 4) against X (axis 0 to 6) for the five data points]
413
Linear Regression Cont…
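$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} X_i Y_i - \dfrac{\left(\sum_{i=1}^{n} X_i\right)\left(\sum_{i=1}^{n} Y_i\right)}{n}}{\sum_{i=1}^{n} X_i^2 - \dfrac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}} = \frac{37 - \dfrac{(15)(10)}{5}}{55 - \dfrac{(15)^2}{5}} = 0.70$$
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1\bar{X} = 2 - (0.70)(3) = -0.10$$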
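The slope and intercept above can be verified with a short sketch of the least squares formulas. The five data points are the ones implied by the sums in the worked calculation (ΣX = 15, ΣY = 10, ΣXY = 37, ΣX² = 55); the function name is illustrative.

```python
def least_squares_line(x, y):
    """Slope and intercept of the simple linear regression of y on x."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
    intercept = sum_y / n - slope * sum_x / n
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

slope, intercept = least_squares_line(x, y)
print(round(slope, 2), round(intercept, 2))
```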
414
Linear Regression Cont…
• One of the indices to measure model goodness of fit for
simple linear regression is R-squared, or the coefficient of
determination.
• It is the proportion of variation explained by the best line
model.
• It depends on the ratio of the sum of squared errors from the
regression model (SSE) to the sum of squared differences
around the mean (SST = sum of squares total):
$$R^2 = 1 - \frac{SSE}{SST}$$
• Where:
$$SSE = \sum (y_i - \hat{y}_i)^2, \qquad SST = \sum (y_i - \bar{y})^2$$
415
Linear Regression Cont…
• For multiple linear regression adjusted R-squared is used.
• As a general rule of thumb, the R-squared or adjusted
R-squared should be higher than 0.80 to produce a good
linear model.
• If your R-squared is less than 0.5, it is recommended
that you consider another type of model rather than a linear
model.
416
Linear Regression Cont…
Interpretation of linear regression coefficient:
• Let’s consider the following simple linear regression equation:
• Y = α + βX
• β represents the slope, and α represents the y-intercept.
• The slope represents the estimated average change in Y
when X increases by one unit.
• The intercept represents the estimated average value of Y
when X equals zero. (Practically less important)
• When we have a binary independent variable (coded
as 0/1), the slope represents the estimated average
change in Y when you switch from 0 to 1.
417
Linear Regression Cont…
Example 8.3:
• Assume that the duration of breast feeding in weeks (Y)
was found to be positively correlated with maternal age
in years (X). A linear regression model was developed to
explain the association. The equation is given as Y =
5.92 + 0.389X. How would you interpret the
equation?
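One way to explore the equation in Example 8.3 is to evaluate it at a few maternal ages. The ages used below (25 and 26 years) are hypothetical and chosen only to show that a one-year increase in age changes the predicted duration by the slope, 0.389 weeks.

```python
def predicted_breastfeeding_weeks(maternal_age_years):
    """Fitted line from Example 8.3: Y = 5.92 + 0.389 X."""
    return 5.92 + 0.389 * maternal_age_years

# Hypothetical ages, for illustration only
at_25 = predicted_breastfeeding_weeks(25)
at_26 = predicted_breastfeeding_weeks(26)

print(round(at_25, 3), round(at_26, 3), round(at_26 - at_25, 3))
```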
418
Linear Regression Cont…
Assumptions:
• Normal distribution: Regression assumes that variables
have normal distributions.
• Homoscedasticity: The variance of the error terms is
constant for each value of x.
• Linearity: The relationship between each x and y is linear.
• Normally distributed error terms: The error terms follow the
normal distribution.
• Independence of error terms: Successive residuals are not
correlated.
• No multicollinearity: The independent variables are not
correlated with each other.
419
Linear Regression Cont…
Hypothesis testing in linear regression:
• Questions to be answered through the hypothesis testing
are:
– Does the entire set of independent variables contribute
significantly to the prediction of y?
– Does the addition of one particular variable of interest
add significantly to the prediction of y achieved by the
other independent variables already in the model?
• The null and alternative hypothesis are given as:
– H0: β1 = β2 = · · · = βp = 0
– H1: βj ≠ 0 for at least one j.
420
Linear Regression Cont…
• F test and t test are used to test the hypothesis.
• F is a test for statistical significance of the regression
equation as a whole. It is obtained by dividing the
explained variance by the unexplained variance.
(Given as ANOVA table)
• The t test is used to see whether a specific variable is
significant in explaining the dependent variable or not.
421
Linear Regression Using SPSS
• Analyze > Regression > Linear Regression > Put the
dependent and independent variables > Select
appropriate statistics > Ok.
422
Logistic Regression
424
Introduction
• Logistic Regression is a model used for predicting the
probability of occurrence of a categorical event by fitting data
to a Logistic Curve.
• Common dichotomous dependent variables include
disease status (healthy or ill), clinical outcome (alive or
dead), treatment outcome (success or failure), utilization of
health commodities (utilization or non-utilization) etc.
• Application:
– Modeling for risk prediction, identification of
determinants and health programming,
– Controlling confounding and interacting factors.
425
Introduction Cont……
• Comparative advantage of Logistic Regression
– Fewer assumptions,
– Mathematically amenable,
– Easier interpretation.
• Classification of Logistic Regression (LR):
– Binomial LR: Dependent variable is dichotomous.
– Multinomial LR: Dependent variable with more than
two classes.
– Ordinal LR: Dependent variable with multiple and
ranked classes.
426
Logistic Regression Function
• Binary dependent variables are coded as 0 or 1.
• The probability of the outcome is equal to the proportion
of 1s in the distribution (P).
• The logistic function associates the Independent Variable
(IV) X with the probability of occurrence of the Dependent
Variable (DV) Y.
• The function is given as:
$$P = \frac{1}{1 + e^{-(\alpha + \beta x)}}$$
427
LR Function Cont…
• The function is represented by S shaped “Sigmoid graph”
which is called the Logistic Curve.
• Examples:
428
LR Function Cont…
• Derivation of the function can be demonstrated with an example.
• Suppose we want to predict a person’s sex based on the
person's height.
• Let's say the probability of being male at a given height is 0.9
• Odds (P/(1−P)) of being male = 0.9/0.1 = 9
• Odds of being female = 0.1/0.9 = 0.11
• However the values look asymmetrical.
• This can be corrected by the application of ln.
• ln(9) = 2.197 and ln(0.11) = −2.197 (using the unrounded odds 1/9)
• The overall transformation is the Logit Transformation.
• The log of odds is abbreviated as the Logit.
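The odds and logit values in this example can be checked numerically. This is a minimal sketch; the function names `odds` and `logit` are illustrative.

```python
import math

def odds(p):
    """Odds corresponding to probability p."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds (the logit transformation)."""
    return math.log(odds(p))

p_male = 0.9
print(round(odds(p_male), 2), round(odds(1 - p_male), 2))
print(round(logit(p_male), 3), round(logit(1 - p_male), 3))
```

The two logits are equal in size and opposite in sign, which is the symmetry that the raw odds (9 versus 0.11) lack.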
429
LR Function Cont…
Mathematically:
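$$\ln\left[\frac{p}{1-p}\right] = \alpha + \beta x$$
$$\Rightarrow \frac{p}{1-p} = e^{\alpha + \beta x}$$
$$\Rightarrow P = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
$$\Rightarrow P = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$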
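The derivation can be illustrated with a small sketch that converts a linear predictor z into a probability and back. The coefficient values (α = −3, β = 0.5) and the x values are hypothetical, chosen only to show the S-shaped behaviour of the function.

```python
import math

def logistic(z):
    """Probability corresponding to the linear predictor z."""
    return 1 / (1 + math.exp(-z))

def logit(p):
    """Inverse of the logistic function: the log odds."""
    return math.log(p / (1 - p))

# Hypothetical coefficients, for illustration only
alpha, beta = -3.0, 0.5

for x in [0, 6, 12]:
    z = alpha + beta * x
    print(x, round(logistic(z), 3))
```

At x = 6 the linear predictor is 0, so the predicted probability is exactly 0.5; below and above that point it moves towards 0 and 1 respectively.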
430
LR Function Cont…
• One of the advantages of Logistic Regression: it is
possible to compute the odds ratio (OR) from its coefficient.
• Let’s assume a researcher is interested in studying the effect
of smoking as the predictor variable (X) on the dependent
variable lung cancer (Y).
– X can be present (X=1) or absent (X=0),
– Y can be present (Y=1) or absent (Y=0),
$$\log\left[\frac{P(Y=1)}{1 - P(Y=1)}\right] = \alpha + \beta X$$
431
LR Function Cont…
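• Hence;
$$\log\left[\text{odds}(Y=1 \mid X=1)\right] = \alpha + \beta(1)$$
$$\log\left[\text{odds}(Y=1 \mid X=0)\right] = \alpha + \beta(0)$$
• The OR = Odds of smokers ÷ Odds of nonsmokers
$$\Rightarrow OR = \frac{e^{\alpha + \beta(1)}}{e^{\alpha}} \quad \Rightarrow \quad OR = e^{\beta}$$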
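The result OR = e^β can be verified numerically: the ratio of the odds at X = 1 to the odds at X = 0 equals the exponentiated coefficient, whatever the intercept. The coefficient values below (α = −2, β = 1.2) are hypothetical.

```python
import math

def probability(alpha, beta, x):
    """P(Y = 1 | X = x) under the logistic model."""
    return 1 / (1 + math.exp(-(alpha + beta * x)))

def odds(p):
    return p / (1 - p)

# Hypothetical coefficients, for illustration only
alpha, beta = -2.0, 1.2

odds_exposed = odds(probability(alpha, beta, 1))     # e.g. smokers
odds_unexposed = odds(probability(alpha, beta, 0))   # e.g. nonsmokers
odds_ratio = odds_exposed / odds_unexposed

print(round(odds_ratio, 3), round(math.exp(beta), 3))
```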
432
Assumptions of Logistic Regression
• Logistic Regression has fewer assumptions than Linear
Regression:
– The DV need not be normally distributed.
– Normally distributed error terms are not assumed.
– Error terms need not be homoscedastic for each
level of the IVs.
433
Assumptions of LR Cont…
But it has the following assumptions:
1. Data type: A dichotomous or polytomous DV.
2. Inclusion of all relevant variables and exclusion of the
irrelevant ones: i.e. Based on scientific framework or
statistical cutoff point (P=0.3).
3. No interaction: LR doesn’t consider interaction effects
except when interactions are created as a variable.
4. No outliers and influential cases: Such cases can affect the
model significantly.
Assumptions of LR Cont…
5. No multicollinearity: As the IVs increase in correlation with each other, the standard errors become inflated. It can be checked by:
– Looking for a standard error > 2.0,
– Examining the correlations and associations b/n IVs,
– Tolerance and VIF.
6. No outliers and influential cases: Such cases can affect
the model significantly.
7. Large samples:
– The minimum Ratio of Valid Cases to Variables
should be at least 10:1. The preferred ratio is 20:1.
Assumptions of LR Cont…
8. Linearity:
– A linear relationship b/n numeric IVs & the logit of the DV is assumed.
– If not, the model underestimates the association and lacks power.
– Box-Tidwell Test: If there is nonlinearity for a numeric IV X, the [X*ln(X)] interaction term becomes significant in the model.
Fitting Logistic Model to a Dataset
• In Linear Regression, the model is fitted to the dataset through Least Squares Estimation (LSE).
• In Logistic Regression LSE can’t be used.
• In its place Maximum Likelihood Estimation (MLE) is
used.
• MLE relies on the concept of Likelihood.
• The likelihood of a set of data is the probability of obtaining
that particular set of data, using a given model.
Fitting Logistic Model Cont…
For example:
• Dataset B has five cases. Observed values for Y are
(1,0,1,0,1)
• The model predicts the probability of occurrence of Y is 0.7
(i.e. Probability of Y=1 is 0.7, and Y=0 is 0.3)
• Likelihood of B is the joint probability of predicting the
correct observed value of Y for every case using the model.
• i.e. L (B)=(0.7)(0.3)(0.7)(0.3)(0.7)=0.03087
$$L(B) = \prod_{i=1}^{n} P^{y_i} (1 - P)^{1 - y_i}$$
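The likelihood of dataset B can be reproduced with a few lines of code. This is a sketch of the product formula only; the function name `likelihood` is an illustrative choice.

```python
def likelihood(y, p):
    """Joint probability of the observed 0/1 outcomes when P(Y=1) = p."""
    result = 1.0
    for y_i in y:
        # contributes p when y_i = 1 and (1 - p) when y_i = 0
        result *= (p ** y_i) * ((1 - p) ** (1 - y_i))
    return result

B = [1, 0, 1, 0, 1]          # observed values of Y in dataset B
L_B = likelihood(B, 0.7)     # model predicts P(Y=1) = 0.7
print(round(L_B, 5))
```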
Fitting Logistic Model Cont…
• Mathematically it is easier to work with the Log likelihood.
• Maximum Likelihood picks the values of the model
parameters that make the data "more likely" than any
other values of the parameters would make them.
• The MLE of the parameter P is that value of P that
maximizes L or ln L.
$$\ln L(B) = \sum_{i=1}^{n} \left[ y_i \ln(P) + (1 - y_i) \ln(1 - P) \right]$$
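To illustrate what "the value of P that maximizes ln L" means, the sketch below evaluates the log likelihood of the five-case dataset B over a grid of candidate values. The grid search is a simplification assumed here for illustration; statistical packages use iterative numerical methods instead.

```python
import math

def log_likelihood(y, p):
    """ln L = sum of y_i*ln(p) + (1 - y_i)*ln(1 - p)."""
    return sum(y_i * math.log(p) + (1 - y_i) * math.log(1 - p) for y_i in y)

B = [1, 0, 1, 0, 1]

# Candidate values 0.01, 0.02, ..., 0.99 for P
candidates = [i / 100 for i in range(1, 100)]
p_hat = max(candidates, key=lambda p: log_likelihood(B, p))

print(p_hat, round(log_likelihood(B, p_hat), 4))
```

The maximizing value equals the observed proportion of cases with Y=1 (3 out of 5), which is what one would expect for a model with a single probability parameter.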
Fitting Logistic Model Cont…
• Iteration: Repeated testing of the data and tuning of the
model parameter to provide the best fitting equation.
• Once P is determined, then α and β are estimated.
Interpretation of Reg. Coefficients
• α is called the Intercept and β1, β2, and so on, are called the Regression Coefficients of x1, x2, …, respectively.
• α is the value of z when the value of all risk factors is zero.
• A +ve coefficient means the risk factor increases the probability of the outcome, while a −ve coefficient means the opposite.
• A large coefficient means that the risk factor strongly influences the probability of the outcome, while a near-zero coefficient means the opposite.
$$P = \frac{1}{1 + e^{-z}}, \quad \text{where } z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$
Hypothesis Testing in Logistic Reg.
• In Logistic Regression the t or F test statistic cannot be used for hypothesis testing, since the DV has a Bernoulli distribution.
• Options:
– The (log) Likelihood Ratio Statistic (−2LL),
– The Wald Test,
• Both test either of the following null hypotheses:
– Ho: β1 = β2 = β3 = … = βn = 0
– Ho: Removing an IV from the model doesn’t change its predictive ability.
Hypothesis Testing LR Cont….
A. Likelihood Ratio Test Statistic (−2LL):
• Usually two nested models (the Full and Reduced Models) are presented.
• Reduced model means a model from which a variable is purposely omitted.
• Ho: The removed variable is not significant in the model.
• −2 Log L = −2 [log L Reduced model − log L Full model]
$$\text{LR statistic} = -2 \log\left(\frac{L \text{ of the reduced model}}{L \text{ of the full model}}\right)$$
Hypothesis Testing LR Cont….
• If the full model explains the data "much better" than the reduced model, the difference will be "large":
Reject the Ho that the removed variable is nonsignificant.
• If the reduced model explains the data as well as the full model, the difference will be close to 0:
Fail to reject the Ho that the removed variable is nonsignificant.
• LRT ~ χ² with df = number of removed variables.
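A small numeric sketch of the likelihood ratio test follows. The two log likelihood values are invented for illustration, and the p-value helper covers only the case of one removed variable (df = 1), where the chi-square tail probability can be written with the standard library's `math.erfc`.

```python
import math

def likelihood_ratio_statistic(loglik_reduced, loglik_full):
    """-2 * [log L(reduced) - log L(full)]."""
    return -2.0 * (loglik_reduced - loglik_full)

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

loglik_reduced = -120.5   # hypothetical log likelihood, one variable removed
loglik_full = -115.2      # hypothetical log likelihood, full model

lr_stat = likelihood_ratio_statistic(loglik_reduced, loglik_full)
p_value = chi2_sf_df1(lr_stat)

print(round(lr_stat, 2), p_value < 0.05)
```

With more than one removed variable the df changes, and a general chi-square routine from a statistics package would be needed instead of this df = 1 shortcut.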
Hypothesis Testing LR Cont….
B. Wald Statistic:
• Commonly used to test the significance of coefficients for each independent variable.
• Ho: A particular coefficient is zero.
• W ~ χ² with df of 1.
• For a particular IV, if W is significant, then the parameter associated with this variable is not zero, so that it should be included in the model.
$$\text{Wald test} = \frac{\beta^2}{\text{Variance of } \beta}$$
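The Wald statistic can be sketched in the same way. The coefficient and standard error below are hypothetical, and the variance is taken as the square of the standard error.

```python
import math

def wald_statistic(beta, standard_error):
    """W = beta^2 / Var(beta), with Var(beta) = SE^2."""
    return beta ** 2 / standard_error ** 2

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2.0))

beta = 0.7             # hypothetical estimated coefficient
standard_error = 0.25  # hypothetical standard error of the coefficient

W = wald_statistic(beta, standard_error)
p_value = chi2_sf_df1(W)

print(round(W, 2), p_value < 0.05)
```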
Pseudo R-Squares
• In Linear Regression, R² measures the proportion of variance of the DV explained by the predictors.
• Ranges from 0 to 1.
• Logistic Regression doesn’t have an equivalent to the R².
• However, there are varieties of Pseudo R² which are designed to simulate the real R².
• Commonly used: Cox & Snell R² and Nagelkerke R².
• Pseudo R² doesn’t mean exactly what R² means in Linear Regression: interpretation should be made with caution.
Pseudo R-Squares Cont….
A. Cox and Snell’s Pseudo R²
$$R^2 = 1 - \left\{ \frac{L(M_{Intercept})}{L(M_{Full})} \right\}^{2/N}$$
B. Nagelkerke Pseudo R²
$$R^2 = \frac{1 - \left\{ \dfrac{L(M_{Intercept})}{L(M_{Full})} \right\}^{2/N}}{1 - L(M_{Intercept})^{2/N}}$$
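Both pseudo R² formulas can be evaluated from the log likelihoods of the intercept-only and full models. The sketch below assumes made-up log likelihood values and works on the log scale, since raw likelihoods of realistic datasets are too small to store directly.

```python
import math

def cox_snell_r2(loglik_intercept, loglik_full, n):
    """1 - {L(intercept) / L(full)}^(2/N), computed from log likelihoods."""
    return 1.0 - math.exp((2.0 / n) * (loglik_intercept - loglik_full))

def nagelkerke_r2(loglik_intercept, loglik_full, n):
    """Cox & Snell R2 divided by its maximum, 1 - L(intercept)^(2/N)."""
    max_r2 = 1.0 - math.exp((2.0 / n) * loglik_intercept)
    return cox_snell_r2(loglik_intercept, loglik_full, n) / max_r2

loglik_intercept = -68.0  # hypothetical log likelihood, intercept-only model
loglik_full = -55.0       # hypothetical log likelihood, full model
n = 100                   # hypothetical number of cases

cs = cox_snell_r2(loglik_intercept, loglik_full, n)
nk = nagelkerke_r2(loglik_intercept, loglik_full, n)
print(round(cs, 3), round(nk, 3))
```

The Nagelkerke value is always at least as large as the Cox and Snell value because the denominator is below 1.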
Goodness of Fit Analysis
A. Hosmer-Lemeshow Statistic
• The recommended test for the overall fit of a Logistic Regression model.
• A type of chi-square test, but considered stronger than the traditional chi-square test, particularly if continuous covariates are in the model or the sample size is small.
• The HL statistic first sorts observations in increasing order of their estimated event probability and divides the observations into deciles based on the predicted probabilities.
• HL statistic ~ χ² with df of 8.
Goodness of Fit Analysis Cont…
$$G^2_{HL} = \sum_{j=1}^{10} \frac{(O_j - E_j)^2}{E_j \left(1 - E_j / n_j\right)} \approx \chi^2 \text{ with df of 8}$$
• Where
– n_j is the number of observations in the j-th group
– O_j is the observed number of cases in the j-th group
– E_j is the expected number of cases in the j-th group
• Nonsignificance means the model adequately fits the data.
• A P value of 0.05 is considered as the level of significance.
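The HL formula can be sketched as a short function. The single-group check at the end uses invented numbers purely to show the arithmetic; a real application would pass ten decile groups.

```python
def hosmer_lemeshow(n, observed, expected):
    """Sum over groups of (O_j - E_j)^2 / (E_j * (1 - E_j / n_j))."""
    total = 0.0
    for n_j, o_j, e_j in zip(n, observed, expected):
        total += (o_j - e_j) ** 2 / (e_j * (1 - e_j / n_j))
    return total

# One hypothetical group: 10 observations, 3 observed cases, 2 expected cases
single_group = hosmer_lemeshow([10], [3], [2.0])
print(single_group)

# When observed and expected counts agree in every group the statistic is zero
perfect_fit = hosmer_lemeshow([10, 10], [2, 5], [2.0, 5.0])
print(perfect_fit)
```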
Goodness of Fit Analysis Cont…
B. Log-likelihood Statistics
• A good model is the one that results in a high likelihood of the observed results.
• This translates into a small value for −2LL.
• If a model fits perfectly, the −2LL would be 0.
• Since there is no acceptable upper cutoff point for the −2LL test, it is difficult to interpret the meaning of the score.
• Less commonly used.
Logistic Regression Using SPSS
• Analyze > Regression > Binary Logistic >Put the
dependent and independent variables > Mark categorical
independent variables > check for the options > Ok.
Or
• Analyze > Regression > Multinomial Logistic > Put the
dependent variable > Put the independent variables as
factors or covariates depending on their nature > check
for available options > Ok.
Analysis of Variance
(ANOVA)
ANOVA
• Used to compare the mean of a quantitative variable across different categories of a categorical variable.
• This specific type is called One-way ANOVA.
• If two categorical variables (factors) are involved it is called Two-way ANOVA.
• If the categorical variable has only 2 values, the two-sample t-test can be used.
• ANOVA allows for comparison among 3 or more groups.
• ANOVA is helpful because it has an advantage over repeated two-sample t-tests.
• Doing multiple two-sample t-tests would result in a largely increased chance of committing a type I error.
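The inflation of the type I error can be illustrated with a short calculation. The sketch assumes that every pairwise t-test is run at the same α and, as a simplification, that the tests are independent, which is not strictly true when the same groups are reused.

```python
def number_of_pairwise_tests(k):
    """Number of two-sample comparisons among k groups: k(k-1)/2."""
    return k * (k - 1) // 2

def familywise_error(m, alpha=0.05):
    """Chance of at least one type I error in m independent tests."""
    return 1 - (1 - alpha) ** m

for k in (3, 4, 5):
    m = number_of_pairwise_tests(k)
    print(k, m, round(familywise_error(m), 3))
```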
ANOVA Cont…
• Whether the differences between the groups are significant depends on:
– The difference in the means
– The standard deviations of each group
– The sample sizes
• ANOVA determines the P-value from the F statistic.
• Hypothesis:
– H0: The means of all the groups are equal.
– H1: Not all the means are equal.
• It doesn’t explain which ones differ.
• Once a global difference is detected, it should be followed up with “multiple comparisons” (Post hoc test) to identify specific differences.
ANOVA Cont…
Assumptions of ANOVA:
• Each group is approximately normally distributed,
• Observed data constitute independent random samples
from the respective population,
• Standard deviations of each group are approximately
equal
– Rule of thumb: ratio of largest to smallest sample
standard deviation must be less than 2:1
ANOVA Cont…
• ANOVA is a technique whereby the total variation
present in a dataset is segregated into several
components.
• Variation is the sum of the squares of the deviations between each value and the mean of the values.
• Sum of square (SS) is another name for variation.
• ANOVA measures two sources of variation in the data
and compares their relative sizes.
– Between group variation
– Within group variation
ANOVA Cont…
Between group variation:
• Is there some variation between the groups?
• Sometimes called the variation due to the factor.
• Denoted SS(B) for Sum of Squares (variation) between
the groups.
• Calculated as follows (given x double bar is the grand mean and k is the number of groups):
$$SS(B) = \sum_{i=1}^{k} n_i (\bar{x}_i - \bar{\bar{x}})^2$$
$$SS(B) = n_1 (\bar{x}_1 - \bar{\bar{x}})^2 + n_2 (\bar{x}_2 - \bar{\bar{x}})^2 + \dots + n_k (\bar{x}_k - \bar{\bar{x}})^2$$
ANOVA Cont…
Within group variation:
• Is there some variation within the groups?
• Sometimes called the error variation as it is the variation
that can’t be explained by the factor.
• Denoted SS(W) for Sum of Squares (variation) within
the groups.
• Calculated as follows, given n_i is the sample size and s_i is the standard deviation of the i-th group:
$$SS(W) = \sum_{i=1}^{k} (n_i - 1) s_i^2$$
$$SS(W) = (n_1 - 1) s_1^2 + (n_2 - 1) s_2^2 + \dots + (n_k - 1) s_k^2$$
ANOVA Cont…
Variance:
• Based on the variation (SS), variance is calculated for
both categories.
• The variance is also called the Mean of the Squares and
abbreviated by MS, often with an accompanying variable
MS(B) or MS(W).
• Calculated by dividing the variation by the df
• MS = SS / df
• The between group df is one less than the number of groups (k − 1)
• The within group df is the sum of the individual dfs of each group. In other words it is (n − k), where n is the total number of observations.
ANOVA Cont…
The F distribution:
• Used as test of significance in ANOVA.
• The F distribution is defined as the distribution of (Z/n1)/(W/n2), where Z has a chi-square distribution with n1 df, W has a chi-square distribution with n2 df, and Z and W are statistically independent.
• In ANOVA the F test statistic is the ratio of two sample variances (MSB/MSW).
• The df for the numerator are the df for the between group (k − 1) and the df for the denominator are the df for the within group (n − k).
• A large F is evidence against H0, since it indicates that there is more difference b/n groups than within groups.
ANOVA Cont…
Example:
• Suppose we have three groups:
– Group 1: 5.3, 6.0, 6.7
– Group 2: 5.5, 6.2, 6.4, 5.7
– Group 3: 7.5, 7.2, 7.9
• Then we compute the ANOVA F statistic in the following manner.
ANOVA Cont…
                          WITHIN difference:      BETWEEN difference:
                          data − group mean       group mean − overall mean
data   group  group mean  plain     squared       plain     squared
5.3    1      6.00        −0.70     0.490         −0.44     0.194
6.0    1      6.00         0.00     0.000         −0.44     0.194
6.7    1      6.00         0.70     0.490         −0.44     0.194
5.5    2      5.95        −0.45     0.203         −0.49     0.240
6.2    2      5.95         0.25     0.063         −0.49     0.240
6.4    2      5.95         0.45     0.203         −0.49     0.240
5.7    2      5.95        −0.25     0.063         −0.49     0.240
7.5    3      7.53        −0.03     0.001          1.09     1.195
7.2    3      7.53        −0.33     0.111          1.09     1.195
7.9    3      7.53         0.37     0.134          1.09     1.195
TOTAL                               1.757                   5.127
TOTAL/df                            0.2510                  2.5637
overall mean: 6.44        F = 2.5637/0.2510 = 10.216
ANOVA Cont…
ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9
• Between groups df: 1 less than the number of groups.
• Within groups df: number of data values − number of groups (equals the df for each group added together).
• Total df: 1 less than the number of individuals (just like other situations).
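The worked example can be reproduced from the raw data with the formulas for SS(B), SS(W), MS and F given earlier. This is a sketch using only the Python standard library; the variable names are illustrative, and the P-value is not computed because that would need an F distribution routine from a statistics package.

```python
from statistics import mean, variance

groups = [
    [5.3, 6.0, 6.7],        # Group 1
    [5.5, 6.2, 6.4, 5.7],   # Group 2
    [7.5, 7.2, 7.9],        # Group 3
]

k = len(groups)                               # number of groups
n_total = sum(len(g) for g in groups)         # total number of observations
grand_mean = sum(sum(g) for g in groups) / n_total

# SS(B): group size times squared distance of group mean from grand mean
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
# SS(W): (n_i - 1) times the sample variance of each group
ss_within = sum((len(g) - 1) * variance(g) for g in groups)

ms_between = ss_between / (k - 1)
ms_within = ss_within / (n_total - k)
f_stat = ms_between / ms_within

print(round(ss_between, 4), round(ss_within, 4), round(f_stat, 3))
```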
ANOVA Using SPSS
• Analyze > Compare means > One way ANOVA > Put
the continuous variable under “Dependent list” > Put the
categorical variable under “Factor” > Select “Post hoc”
tests > Ok.
Thank You
Introduction
What is Statistics?
• Statistics is a field of study concerned with the collection, organization and summarization of data, and drawing of inferences about a body of data when only part of the data is observed.
• It is concerned with:
– Designing experiments and data collection,
– Summarizing information to aid understanding,
– Drawing conclusions from data,
– Estimating the present and predicting the future based on statistical evidence.
What is Statistics?
• Mathematical statistics: Concerned with the development of new methods of statistical inference and requires detailed knowledge of abstract mathematics.
• Applied statistics: Involves applying the methods of mathematical statistics to specific subject areas.
• Biostatistics is an application of statistical methods to biological phenomena.
What is Statistics cont…
• In clinical medicine and public health, Statistics can be applied to:
– Determine the accuracy of measurement,
– Compare measurement techniques,
– Assess diagnostic tests,
– Determine normal values,
– Estimate prognosis,
– Compare efficacy of treatment techniques,
– Determine prevalence of an event,
– Identify determinants of health problems,
– Compute adequate sample size for studies,
– Etc.
Statistical Data
• Refers to numerical description of things in the form of counts or measurements.
• Though statistical data always involve numeric description, not all numeric descriptions are statistical data.
• Statistical data should have the following characteristics:
– They must be in aggregate,
– They must be affected to a marked extent by multiple causes,
– They must be collected in a systematic manner,
– They must be estimated with reasonable accuracy,
– They must be placed in relation to each other.
Classification of Statistics
• Descriptive Statistics: Is the methodology of effectively collecting, organizing and describing data.
• Inferential Statistics: Includes:
– Inductive Statistics: The process of drawing conclusions about unknown characteristics of a population, based on a sample based study.
– Predictive Statistics: The process of predicting the future based on historical data.
Classification Cont…
• During analysis, based on the underlying assumptions, statistics (statistical methods) can be classified as:
– Parametric statistics: is a branch of statistics that assumes data come from a type of probability distribution and makes inferences about the data based on the distribution.
– Nonparametric statistics: Interpretation does not depend on the population fitting any distribution.
Rationale of Studying Statistics
• Enables one to organize information in a formal manner.
• Statistics is extensively used in medical literature.
• The planning, conducting and implementing of medical and public health research are highly reliant on statistical methods.
• There is a great deal of intrinsic variation in most biological processes.
• Issues in science are becoming more and more quantitative.
Possible Limitations of Statistics
• It mainly deals with variables which can be quantified.
• It deals with aggregates of facts; it may not give individual information.
• Analysis is done based on multiple assumptions.
• Highly reliant on cutoff points.
• Errors are possible in statistical decisions.
Types of Variables
• A variable is any characteristic of a study unit (for example an individual) that is measurable and/or classifiable, and can take any value for different units.
• Depending on their quantifiability, variables can be classified as Qualitative and Quantitative variables.
• Qualitative (Categorical) Variable: is a characteristic which cannot be measured in quantitative form but can be identified by names or categories. For example religion, ethnicity, illness status (well or ill), treatment outcome (improved or not improved), stage of breast cancer (I, II, III, IV) etc.
Types of Variables Cont…
• Quantitative Variable: is a characteristic that can be measured and expressed numerically.
• This can be of two types:
• Discrete Quantitative Variable:
– Can only take on a finite number of values (usually whole numbers).
– Example: number of children, number of episodes of illness.
• Continuous Quantitative Variable:
– Measured on a continuous scale.
– It can assume an infinite number of values between two given values.
– Example: height, weight, age, blood sugar level.
Scale of Measurement
• In clinical medicine and public health, as in many other areas of science, we typically assign numbers to various attributes of people, objects, or concepts.
• This process is known as measurement.
• The process of measurement involves assigning numbers to observations according to rules.
• The way that the numbers are assigned determines the scale of measurement.
• Four scales of measurement are typically discussed here.
Scale of Measurement Cont…
Nominal Scale:
• Is the lowest scale of measurement.
• Numbers are assigned to categories as "names" arbitrarily.
• For example classifying people according to gender is a common application of a nominal scale. We may assign number "1" to "male" and number "2" to "female" or the opposite.
• Therefore, the only number property of the nominal scale of measurement is “identity”. The only mathematical operation we can perform with nominal data is to count.
Scale of Measurement Cont…
Ordinal Scale:
• Ordinal scale has the property of magnitude.
• It assigns each measurement to one of a limited number of categories that are ranked in terms of graded order.
• However the interval between the categories is not necessarily equal.
• Example: Cancer stage, rank in a race.
Scale of Measurement Cont…
Interval Scale:
• Interval scale has the property of equal interval b/n values.
• Eg: in measuring temperature using the °C scale, we can always be confident that the distance between 25°C and 35°C is the same as the distance b/n 65°C and 75°C.
• It doesn’t have a true zero point; the number "0" is arbitrary. 0°C doesn’t mean there is no temperature.
• Similarly the ratio between two values on an interval scale doesn’t have a meaningful interpretation: it would be inappropriate to say that 60°C is twice as hot as 30°C.
Scale of Measurement Cont…
Ratio Scale:
• Ratio scale of measurement has the property of equal interval between values and absolute/true zero.
• These properties allow us to apply all mathematical operations (addition, subtraction, multiplication, and division) in data analysis.
• The absolute/true zero allows us to know how many times greater one case is than another.
Data Collection Method
• In order to generate valid conclusions from data, information has to be collected in a systematic manner.
• A haphazardly collected dataset is less likely to produce valuable and generalizable information.
• Data may be derived from several sources.
• Depending on the source, it can be classified as Primary or Secondary data.
• Primary data is gathered for the first time by the researcher for a given purpose.
• Secondary data is data already collected by others, for purposes other than the question of the research at hand.
Data Collection Method Cont…
Survey through interview:
• A quantitative approach in which a standardized questionnaire, to be administered through interview, is used to collect information.
• Advantage:
– Quick and inexpensive.
– Easy to quantify and analyze.
– Participants do not need to be able to read and write to respond.
– Useful in describing quantifiable characteristics of a large population.
Data Collection Method Cont…
– Very large and representative samples are feasible.
– Standardized questions make measurement more precise.
– Responses from different respondents are comparable.
• Disadvantage:
– Doesn’t give qualitative information.
– Doesn’t give opportunity to probe and explore.
– Relatively inflexible.
– Less reliable to assess behavior and attitude of respondents.
Data Collection Method Cont…
Survey through self administered questionnaire:
• A quantitative method in which a standardized questionnaire, to be filled by the respondents themselves, is used.
Advantage:
• Quick and inexpensive.
• Useful in describing quantifiable characteristics of a large population.
• Very large and representative samples are feasible.
• Standardized questions make measurement more precise.
• Responses from different respondents are comparable.
Data Collection Method Cont…
• Disadvantage:
– Participants need to be able to read and write to respond.
– High nonresponse rate.
– Doesn’t give qualitative information.
– Doesn’t give opportunity to probe and explore.
– Relatively inflexible.
– Less reliable to assess behavior and attitude of respondents.
Data Collection Method Cont…
Secondary data:
• A quantitative approach which utilizes data already collected by others.
• Advantage:
– Less resource and time consuming.
• Disadvantage:
– May not give in-depth information.
– Can be outdated.
– No knowledge on the accuracy of data collection.
– Limited control on the sampling method and size.
– Less likely to give qualitative information.
Data Collection Method Cont…
Focus Group Discussion (FGD):
• A qualitative method to obtain in-depth information on concepts and perceptions about a certain topic through spontaneous group discussion of approximately 6–12 persons, guided by a facilitator.
• Advantage:
– Excellent approach to gather information on in-depth attitudes and beliefs of a group.
– Group dynamics might generate more ideas than individual interviews.
– It facilitates the exploration of collective memories.
– Provides an excellent opportunity to probe & explore.
– Participants are not required to read or write.
Data Collection Method Cont…
– Unearths sensitive issues which are not commonly raised by individuals.
Disadvantage:
– Requires a strong facilitator to guide discussion and ensure participation by all members.
– It is difficult to organize the discussion.
– Doesn’t give quantitative information.
– Analysis is relatively difficult.
Data Collection Method Cont…
In-depth interview:
• A qualitative method that relies on person to person discussion.
• Advantage:
– Good approach to gather in-depth attitudes and beliefs from individual respondents.
– Provides an excellent opportunity to probe and explore.
– Assures privacy.
– Participants don’t need to be able to read and write to respond.
Data Collection Method Cont…
• Disadvantage:
– Doesn’t give quantitative information.
– The analysis is relatively difficult.
– It is time taking.
Data Collection Method Cont…
Observation:
• A qualitative method that involves critical observation and recording of the practice (behavior, culture…) of individuals or a group.
• Excellent approach to discover behaviors.
• Usually takes longer time.
• Liable to “Observational bias”:
– The respondent may feel like ‘a bug under a microscope’.
Designing Questionnaire
• Most of the data collection techniques utilize questionnaires.
• Hence, the quality of the data is dependent on how well the questionnaire is designed.
• There are two main objectives in designing a questionnaire:
• To obtain accurate relevant information for the study.
• To maximize the response rate.
Designing Questionnaire Cont…
• A questionnaire can be classified based on different issues:
• Structured Vs Nonstructured Questionnaire:
– The structured one is mainly designed for surveys.
– A series of questions are arranged in a logical order and sequence and divided into subtopics.
– The data collector is expected to smoothly go through the sequence.
– Skip pattern is important for a structured questionnaire.
– The nonstructured one is commonly used for qualitative studies.
– It doesn’t have a strict sequence of questions.
– The data collector may rearrange the questions depending on the response of the subject.
Designing Questionnaire Cont…
• Open ended Vs Close ended Questionnaire (Question):
• Open ended questions permit free response that should be recorded in the respondent’s own words.
• Allows exploration of the range of possible themes.
• Close ended questions offer a list of possible options or answers from which the respondents must choose.
• It is relatively easy and quick to fill, code, analyze and report.
Designing Questionnaire Cont…
Standardized Vs Nonstandardized Questionnaire:
• A standard questionnaire is developed by a well known body and considered to be “standard” to assess a given research question.
• A nonstandard one is developed by the researcher to address the research question.
• What are the advantages and disadvantages of using a standardized questionnaire?
Steps in Designing a Questionnaire
1. Developing Individual Questions:
– Use short and simple sentences.
– Ask for only one piece of information at a time.
– Ask precise questions to address the objective of the study.
– Avoid leading questions.
– Give extra attention to sensitive questions.
2. Format of responses: Questions should be formatted into open or closed formats depending on the need.
Steps Cont…
3. Arranging the Questions:
• Go from general to particular.
• Go from easy to difficult.
• Go from factual to abstract.
• Start with closed questions.
• Start with demographic and personal questions.
4. Piloting and Evaluation of Questionnaire.
• Given the complexity of designing a questionnaire, it is impossible even for the experts to get it right the first time round.
• Questionnaires must be pretested (piloted) on a small sample of people characteristic of those in the survey.
Introduction
• Data collection yields a set of data called Raw Data. The size of the data can range from a few hundreds to many thousands of observations.
• Raw data however will not necessarily provide information that can easily be interpreted.
Diagrammatic Summarization
• Data presentation is a mechanism which enables easier understanding of a given set of data through the use of tables and graphs.
• In data summarization the detail of the data is compromised but this is compensated by a gain in knowledge of the data.
Tables
• Simplest means of data presentation which can be used for all types of data.
Frequency Distribution
• One type of information that is commonly used to organize data in tables is the Frequency Distribution.
• For nominal or ordinal data, the frequency distribution consists of a set of categories along with numeric counts that correspond to each one.
• Example:
Tables Cont…
Table 2.1: Ethnicity Composition of Women of Reproductive age in Awassa Town, Jan 2006.
Ethnic Group   Frequency
Wolita         377
Amhara         355
Sidama         163
Oromo          144
Guragae        138
Kenbata        82
Tigray         47
Hadya          20
Others         50
Total          1376
Tables Cont…
• In displaying numeric data using a frequency distribution we should note the following:
• The range of values must be broken down into a series of distinct and nonoverlapping intervals.
• The intervals should cover all data points.
• Intervals are often constructed, though not necessarily, so that all have equal width. This facilitates comparison among classes.
• Open ended intervals should be avoided.
• The limits for each class must agree with the accuracy of the raw data.
Tables Cont…
• An appropriate number of intervals should be considered, as too many intervals won’t be much explanatory and too few intervals lose a great deal of information.
• The rule of thumb states the number of classes should be between 10 and 20.
• When we don’t have any evidence to decide the number of classes, we can use Sturges' Formula:
• No of classes = 1 + [3.322 x log (no of observations)]
• The width of each class can also be calculated as:
Width of the class = (Max value − Min value) / No of classes
Tables Cont…
Relative and Cumulative Frequency
• In addition to counts, it is useful to know the proportion of values that fall into a given class.
• Relative frequency of a class is the proportion or percentage of the total number of observations that fall in a given class.
• Cumulative relative frequency of a class is the proportion (percentage) of the total number of observations that have a value less than or equal to the upper limit of a given interval.
• If such information is given in the form of counts it is simply called Cumulative frequency.
Tables Cont…
Table 2.2: Cumulative and Relative Frequency of Age Structure of Women of Reproductive age in Awassa Town, Jan 2006.
Age Group   Number of women   Relative Frequency (%)   Cumulative Relative Frequency (%)
15–19       399               28.9                     28.9
20–24       341               24.7                     53.6
25–29       281               20.4                     74.0
30–34       143               10.4                     84.3
35–39       116               8.4                      92.8
40–44       54                3.9                      96.7
45–49       42                3.0                      100.0
Total       1380              100.0
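Sturges' formula, the class width and the relative and cumulative frequencies described above can be sketched as follows. The sample size, the minimum and maximum values and the small list of counts are hypothetical and used only to show the arithmetic; the logarithm is taken to base 10.

```python
import math

def sturges_classes(n_observations):
    """No of classes = 1 + 3.322 * log10(no of observations)."""
    return 1 + 3.322 * math.log10(n_observations)

def class_width(max_value, min_value, n_classes):
    """Width of the class = (Max value - Min value) / No of classes."""
    return (max_value - min_value) / n_classes

def relative_and_cumulative(counts):
    """Relative and cumulative relative frequencies, in percent."""
    total = sum(counts)
    relative, cumulative, running = [], [], 0
    for c in counts:
        running += c
        relative.append(100 * c / total)
        cumulative.append(100 * running / total)
    return relative, cumulative

n_classes = sturges_classes(100)            # hypothetical 100 observations
width = class_width(49, 15, round(n_classes))
rel, cum = relative_and_cumulative([2, 3, 5])

print(round(n_classes, 3), width)
print(rel, cum)
```

In practice the number of classes from the formula is rounded to a whole number before the class width is computed, as done here with `round`.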
Tables Cont…
• Depending on the number of variables represented, tables can be classified as one-way, two-way and higher order tables.
• One-way Table: Only one variable is summarized in the table.
• Two-way Table (Cross tabulation): Two variables are organized simultaneously in a combined manner in a table.
• Higher Order Table: Three or more variables are presented simultaneously in a table. The higher the order of the table, the more complicated the interpretation.
Tables Cont…
What type of table is this?
                               Child Ever Born
Educational status of women    >=5     <5
Illiterates                    42      68
Read and Write                 9       19
1st–4th grade                  32      60
5th–8th grade                  46      211
9th–12th grade                 42      239
> 12th grade                   7       68
Total                          175     665
Tables Cont…
                              History of illness in the preceding 2 weeks
Child’s Age    Child’s Sex    Yes    No    Total
0–11 mo        Male           15     86    101
               Female         18     84    102
12–23 mo       Male           13     80    93
               Female         12     78    90
24–35 mo       Male           10     76    86
               Female         11     77    88
36–47 mo       Male           9      74    83
               Female         9      73    82
48–59 mo       Male           6      69    75
               Female         7      70    77
Tables Cont…
• In constructing tables, the following standards should be followed:
– Tables should be simple and self explanatory.
– Complicated tables should be avoided.
– Every table should have a title (usually at the top of the table) which indicates the who, what, when and where of the data presented.
– Rows and columns should be labeled.
– Totals should be indicated.
– Numeric entities of zero should be written as “0” while missed or unobserved data should be represented by “-”.
– If the data are not original, their source should be given as a footnote.
Diagrammatic Representation
• A second way to present data is through the use of graphs or pictures (Diagrammatic Representations).
• Diagrammatic representation has the following advantages:
– They are easier to understand and memorize.
– They facilitate comparison among groups.
– They may show patterns within the data set.
– They are more attractive.
• Though diagrammatic representation is easier to read than tables, it supplies a lesser degree of detail.
• However, the lesser detail can be compensated by a gain in understanding of the data.
Bar Charts (Bar Graphs)
• Bar graphs are a popular type of graph used to display a frequency distribution for Nominal or Ordinal data.
• In the case of the commonest Vertical Bar Graph (Column Graph), the various categories into which the observations fall are presented along the horizontal axis.
• A vertical bar is drawn above each category so that the height of the bar represents either the frequency or relative frequency of observations within that class.
• The bars should have equal width, and be separated from one another so as not to imply continuity.
• In the case of the Horizontal Bar Graph, the vice versa holds true.
Bar Charts Cont…
Bar graph has different types:
• Simple Bar Graph:
– Depicts the frequency/relative frequency of classes of a variable.
– The intention is to compare the frequency of different classes of a variable.
[Figure: simple bar graph. Y axis: Percentage of children aged 0–11 months. X axis: The time breastfeeding was initiated (Within an hr, 1–24 hr, After the first day).]
Bar Charts Cont…
• Multiple Bar Graph:
– Depicts the frequency or relative frequency of classes of a variable at two or more situations.
– This type enables comparison between the levels of classes of the variable at different situations.
[Figure: multiple bar graph. Y axis: %. X axis: The time breastfeeding was initiated (Within an hr, Within a day, After the first day). Series: Baseline, End line.]
Bar Charts Cont…
• Component Bar Graph:
– Similar to a simple bar graph, except that the bars are divided into components.
– The graph shows the relative contribution of the components to the bar (category).
[Figure: component bar graph of the percentage of children aged 0-11 months by the time breastfeeding was initiated (within an hour, 1-24 hr, after the first day), divided into female and male]
• 100% Component Bar Graph:
– Similar to a component bar graph.
– But the height of all the bars is set at 100%, so that the relative contribution of the components can easily be compared.
[Figure: 100% component bar graph of the time breastfeeding was initiated (within an hour, within a day, after the first day), divided into females and males]

Pie Chart
• A pie chart is a circular chart divided into sectors, illustrating the relative magnitudes or frequencies of the classes of a given variable.
• The angle of each sector has to be proportional to the relative frequency of the class it represents.
• A pie chart usually represents categorical data, but it is also possible to use it for discrete quantitative data.
Histogram
• Whereas a bar chart represents a frequency distribution for either nominal or ordinal data, a histogram depicts a frequency distribution for continuous data.
• Unlike a bar graph, in a histogram the categories (bars) must be adjacent. Hence, in order to construct a histogram, true class boundaries should be used rather than class intervals.
• The horizontal axis displays the true limits of the intervals.
• If the intervals of the bars are equal, the vertical axis represents the frequency or relative frequency of the interval, and the frequency associated with each interval can be represented by the height of the respective bar.

Histogram Cont…
• However, if the bars have different widths, the histogram should be drawn in such a way that the Y axis represents the frequency density and the X axis the interval.
• Frequency density of an interval = frequency of the interval / true class width.
• Then the frequency of each interval is represented by the area of its bar.
• For example, the following table summarizes the Biostatistics mid exam scores (out of 35 marks) of 38 students.
[Table and histogram]

Frequency Polygon
• A frequency polygon depicts a frequency distribution for continuous numeric data.
• Frequency polygons are a graphical device for understanding the shapes of distributions.
• A histogram can easily be changed into a frequency polygon by joining the midpoints of the tops of adjacent rectangles with a line.
• It is also possible to draw a frequency polygon without drawing a histogram. The procedure is as follows:
1. Identify the midpoints of all the class intervals of the given data.
2. Plot the midpoints (on the X axis) against the respective frequency or relative frequency of each class (on the Y axis).
3. Connect adjacent points with straight lines.

Frequency Polygon Cont…
• For example, the following frequency distribution represents the ages (in years) of 60 patients at a psychiatric counseling centre.
[Table]
• First we have to identify the midpoint of each interval.
• Finally we have to plot the midpoints (on the X axis) against the respective frequency of each class (on the Y axis) and connect adjacent points with straight lines.
[Figure: frequency polygon]
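The frequency-density rule for histograms and the midpoint step for frequency polygons can be sketched in a few lines of Python. Because the exam-score and patient-age tables are not reproduced here, this sketch borrows the quiz-time classes from Example 3.1 (later in the course), which happen to have unequal widths; the function names are illustrative only.

```python
# Quiz-time classes borrowed from Example 3.1:
# (lower true limit, upper true limit, frequency)
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

def frequency_density(lower, upper, freq):
    """Frequency of the interval divided by its true class width."""
    return freq / (upper - lower)

def midpoint(lower, upper):
    """Midpoint of a class interval: the X value used in a frequency polygon."""
    return (lower + upper) / 2

densities = [frequency_density(lo, hi, f) for lo, hi, f in classes]
midpoints = [midpoint(lo, hi) for lo, hi, _ in classes]

for (lo, hi, f), d, m in zip(classes, densities, midpoints):
    print(f"{lo}-{hi}: frequency={f}, density={d}, midpoint={m}")
```

The last class has the highest frequency (16) but not the highest density, which is why bar height must follow density, not frequency, when class widths differ.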
Scatter Plot (Scatter Graph)
• A scatter plot is used to show the relation between two different continuous measurements.
• Each point on the graph represents a pair of values for the two measurements.
• The scale for one quantity is marked on the X axis and the scale for the other on the Y axis.
• For each value on the X axis, it is possible to have multiple Y values.

Scatter Plot Cont…
• The following scatter plot shows the relation between age and blood glucose level among diabetic patients aged 50-70 years.
[Figure: scatter plot; X axis: age in years (50-70); Y axis: blood glucose level in mg/dl (120-200)]

Line Graph
• A line graph is similar to a scatter plot, as it shows the relation between two different continuous measurements.
• Once again, each point on the graph represents a pair of values.
• However, unlike a scatter plot, each value on the X axis has a single corresponding measurement on the Y axis.
• As the name indicates, points on the graph are connected to the adjacent points with straight lines.
• Most commonly the scale along the X axis represents time. Consequently we are able to trace chronological changes.

Line Graph Cont…
[Figure 2.8: Mean Number of Children Ever Born to Women at the Age of 25 Years, Awassa Town (1980-2005); X axis: year (GC), 1980-2005; Y axis: mean children ever born among women at the age of 25]
Cumulative Line Graph
• Also known as an ogive.
• It is best used when you want to display the total at any given time.
• The relative slopes from point to point indicate greater or lesser increases. For example, a steeper slope means a greater increase than a more gradual slope.
• For example, if you saved $300 in both January and April and $100 in each of February, March, May, and June, the ogive would look as follows.
[Figure: ogive of cumulative savings, January to June]

Box and Whisker Plot
• In descriptive statistics, a box-and-whisker plot is a convenient way of pictorially depicting groups of numerical data through their five-number summaries:
• The smallest observation, 1st quartile, median, 3rd quartile, and largest observation.

Box and Whisker Plot Cont…
• However, in some cases the ends of the whiskers can represent several possible alternative values.
• For example, in SPSS:
– The ends of the whiskers represent the lowest datum still within 1.5 times the IQR of the lower quartile, and the highest datum still within 1.5 times the IQR of the upper quartile.
– Values more than 1.5 IQRs but less than 3 IQRs from the end of the box are labeled as outliers (o).
– Values more than three IQRs from the end of the box are labeled as extreme, denoted with an asterisk (*).
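The SPSS labelling rule for box plots described above can be expressed as a small function. This is a minimal sketch, not SPSS's own code: the function name, the quartile values and the treatment of a value lying exactly 3 IQRs from the box (counted here as an outlier) are assumptions made for illustration.

```python
def classify(value, q1, q3):
    """Label a value by its distance from the box, following the rule above."""
    iqr = q3 - q1
    # Distance beyond the nearer end of the box (0 when the value is inside the box).
    distance = max(q1 - value, value - q3, 0)
    if distance > 3 * iqr:
        return "extreme (*)"
    if distance > 1.5 * iqr:
        return "outlier (o)"
    return "within whiskers"

# Hypothetical quartiles: Q1 = 20 and Q3 = 40, so the IQR is 20.
labels = [classify(v, 20, 40) for v in (35, 65, 75, 105)]
print(labels)
```

With these assumed quartiles, 65 lies 25 units above the box (within 1.5 IQRs), 75 lies 35 units above it (an outlier), and 105 lies 65 units above it (extreme).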
Stem and Leaf Plot
• A display that organizes data to show its shape and distribution.
• Each data value is split into a "stem" and a "leaf" portion.
• The "leaf" is the last digit of the number, and the other digits to the left of the "leaf" form the "stem".
• For example, the number 42 would be split apart, with the stem becoming the 4 and the leaf becoming the 2.

Stem and Leaf Plot Cont…
• Consider the following dataset, sorted in ascending order: 8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44, 47, 49, 51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95.

Stem   Leaf
0      8
1      3 6
2      5 6 9
3      0 2 7 8
4      0 1 4 7 9
5      1 4 5 8
6      1 3 7
7      5 8
8      2 6
9      5

Pictogram
• A pictogram is a graph which uses pictures or symbols to present data.
• It usually presents the frequency of one or more categorical or discrete numeric variables in the form of symbols.
• The magnitude can be shown either by the size of the picture or by the number of pictures.
• For example, the following pictogram represents the number of passengers per year across four airports of the UK.
[Figure: pictogram of passengers per year at four UK airports]
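The stem-and-leaf display for the dataset given on the stem and leaf slide can be rebuilt programmatically. The sketch below assumes non-negative whole numbers and a one-digit leaf; the function name is illustrative.

```python
from collections import defaultdict

data = [8, 13, 16, 25, 26, 29, 30, 32, 37, 38, 40, 41, 44, 47, 49,
        51, 54, 55, 58, 61, 63, 67, 75, 78, 82, 86, 95]

def stem_and_leaf(values):
    """Group each value's last digit (leaf) under its leading digits (stem)."""
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 42 -> stem 4, leaf 2
        plot[stem].append(leaf)
    return dict(plot)

plot = stem_and_leaf(data)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Each printed row is one stem followed by its leaves, so the longest rows show where the data are most concentrated.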
Issues to be considered in diagrammatic representation
• Depending on the type of the data, the right type of diagrammatic representation should be selected.
• It is not common to use two or more types of diagrammatic representation simultaneously for the same data. The best should be selected and used.
• Each graph and diagram should be labeled (usually the title is given below the figure).
• The title should indicate the “Who”, “What”, “When” and “Where” of the data presented.
• If the representation is taken from another source, the primary source should be indicated.

Issues to be considered Cont…
• In graphs, the X and Y axes should be indicated clearly with their units of measurement.
• In graphs, the scales of the X and Y axes should be drawn proportionally.
• Pictorial representations usually require a “Key” to facilitate easier interpretation.
• When colors are employed, contrasting colors should be selected.

Diagrammatic Representation Using SPSS
• In order to develop graphs using SPSS, the following steps should be followed.
• Graphs > Legacy Dialogs > select the appropriate graph
• Available types are bar graph, line graph, pie chart, histogram, scatter plot and box plot.
• Other rarely used types are also available.
• Most of the graphs can also be found under the “Analyze > Descriptive Statistics” menu.

Numeric Summarization
Introduction
• Even though diagrammatic representation greatly enhances understanding of the data, it does not give mathematically amenable outputs.
• This gap is addressed by numeric summarization.
• In summarizing a dataset using numeric indicators, we often focus on describing the data with two summary figures. These are:
– Central Tendency (Location)
– Variation (Spread)

Measures of Central Tendency
• One of the most commonly used measures to summarize a set of data is its center.
• The center is a value (usually a single value) chosen in such a way that it gives a reasonable approximation of the whole dataset.
• In statistics, the number which tends to approximate the center of a set of data is called a Measure of Central Tendency or Average.
• The Arithmetic Mean, Median and Mode are the most commonly used measures of central tendency.

Measures of Central Tendency Cont…
Attributes of a good measure of central tendency are:
• It should be based on all observations.
• It should be close to the location where the majority of the observations are located.
• It should have a definite value.
• It should not be affected by extreme values.
• It should be capable of further algebraic treatment.
• It should not be subject to complicated computation.

Arithmetic Mean
• The Arithmetic Mean is usually called the Mean.
• It is the most familiar measure of central tendency.
• It is calculated by adding all of the individual values and dividing the sum by the number of individual values.
• In statistics, two separate symbols are used for the mean.
• The Greek letter $\mu$ (mu) is used to denote the population mean.
• The symbol $\bar{x}$ (read as "x bar") is used to denote the sample mean.
Arithmetic Mean Cont…
• When n is the total number of observations and $x_i$ is the value of X for the ith observation, the formula of the arithmetic mean is given as:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$
• In calculating the mean from grouped data, we assume all values falling into a particular class interval are located at the midpoint of the interval. Then we can calculate the mean as usual.
• The formula is given as:
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{n}$$
Where k is the number of class intervals, $m_i$ is the midpoint of the ith class interval, $f_i$ is the frequency of the ith class interval, and n is the total number of observations.
• The formula simply means each value within the interval is represented by the midpoint of the true class interval.

Arithmetic Mean Cont…
Example 3.1: Consider the time taken by 30 students to do a Biostatistics quiz.

Minutes spent on quiz   Number of students (f)   True class interval   Midpoint (m)   m_i f_i
1-5                     2                        0.5-5.5               3              6
6-10                    12                       5.5-10.5              8              96
11-20                   16                       10.5-20.5             15.5           248
Total                   30                                                            350

Thus the mean of the data is 350/30 = 11.7 minutes.

Arithmetic Mean Cont…
• The major advantages of the mean are:
– It is calculated based on all observations.
– Its mathematical computation is not complicated.
– It can only have one value.
– It accommodates further mathematical applications.
• The major disadvantages of the mean are:
– It is affected by extreme values.
– It can be a misleading summary when the data are markedly skewed (not approximately normally distributed).
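Example 3.1 can be checked with a short Python sketch of the grouped-mean formula. The data are the quiz-time classes above; the function name and the tuple layout are illustrative choices, not part of the course material.

```python
# (lower true limit, upper true limit, frequency) for each class in Example 3.1
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

def grouped_mean(classes):
    """Sum of midpoint * frequency, divided by the total number of observations."""
    n = sum(f for _, _, f in classes)
    total = sum(((lo + hi) / 2) * f for lo, hi, f in classes)
    return total / n

mean = grouped_mean(classes)
print(round(mean, 1))  # 11.7
```

Using the true class limits to derive the midpoints gives the same values as the table (3, 8 and 15.5).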
Median
• The median is the value which divides the data into two equal halves, with half of the values being lower than the median and half higher than the median.
• When n is the number of observations in a dataset, the median is calculated in the following way:
– Sort the values into ascending order.
– If you have an odd number of observations, the median is the middle observation, i.e. the observation at the (n+1)/2 position of your data.
– If you have an even number of observations, the median is the arithmetic mean of the two middle observations, i.e. pick the numbers at positions n/2 and (n/2) + 1 and find the mean of those two observations.

Median Cont…
Example 3.2: Compute the median for {1, 2, 3, 4, 5}
• The numbers are already sorted, so it is easy to see that the median is 3 (two numbers are less than 3 and two are bigger).
Example 3.3: Compute the median for {1, 2, 3, 4, 5, 6}
• The median would be 3.5, since that is the middle between 3 and 4, computed as (3 + 4)/2.
• Note that three numbers are less than 3.5 and three are bigger, as the definition of the median requires.

Median Cont…
• When we are dealing with grouped data, the median can be calculated as:
$$\tilde{X} = L_m + \left(\frac{\frac{n}{2} - F_c}{F_m}\right) w$$
• Where:
– $L_m$ is the lower true class boundary of the interval containing the median.
– $F_c$ is the cumulative frequency of the interval just before the median class interval.
– $F_m$ is the frequency of the interval containing the median.
– $w$ is the width of the interval containing the median.
– n is the total number of observations.

Median Cont…
• The major advantages of the median are:
– It is not affected by extreme values.
– It can be calculated when there is an open-ended interval.
– It can be used in skewed distributions.
– It is easy to calculate.
– It can only have one value.
• The major limitations of the median are:
– It may not be a good representative if the number of observations is too small.
– It is calculated based on one or two observations.
– It does not accommodate further mathematical applications (in parametric statistics).
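Both median rules can be sketched in Python. The raw-data function follows the odd/even steps above and is checked against Examples 3.2 and 3.3. For the grouped formula, this sketch borrows the salary table from the Mode slides that follow (treating 500-600, 600-700, ... as true class boundaries, which is an assumption); the function names are illustrative.

```python
def median_raw(values):
    """Median of ungrouped data, following the odd/even rule."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # position (n + 1) / 2, counting from 1
    return (ordered[mid - 1] + ordered[mid]) / 2  # positions n/2 and n/2 + 1

def median_grouped(classes):
    """classes: list of (lower true boundary, upper true boundary, frequency)."""
    n = sum(f for _, _, f in classes)
    cumulative = 0  # F_c: cumulative frequency before the current class
    for lower, upper, freq in classes:
        if cumulative + freq >= n / 2:
            # L_m + ((n/2 - F_c) / F_m) * w
            return lower + ((n / 2 - cumulative) / freq) * (upper - lower)
        cumulative += freq

# Salary (Br) of 20 factory workers, borrowed from the Mode slides.
salary_classes = [(500, 600, 3), (600, 700, 6), (700, 800, 5),
                  (800, 900, 5), (900, 1000, 0), (1000, 1100, 1)]

print(median_raw([1, 2, 3, 4, 5]))       # 3
print(median_raw([1, 2, 3, 4, 5, 6]))    # 3.5
print(median_grouped(salary_classes))
```

For the salary table, n/2 = 10 falls in the 700-800 class (cumulative frequency 9 before it), so the formula gives 700 + ((10 - 9)/5) x 100 = 720 Br.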
Mode
• The mode is by far the simplest, but the least widely used, measure of central tendency.
• It is simply the score that occurs most frequently.
• Example: {1, 2, 2, 3, 3, 4, 4, 4, 5}
• The mode is 4.
• When the distribution has only one value with the highest frequency it is called unimodal. If it has two values with equal and highest frequency it is called bimodal. Similarly, it is possible to have a multimodal distribution.

Mode Cont…
• In grouped data, the midpoint of the interval with the highest frequency is considered as the mode of the distribution.
For example, the following table displays the salary of 20 factory workers in factory X.

Salary in Br   Number of factory workers
500-600        3
600-700        6
700-800        5
800-900        5
900-1000       0
1000-1100      1

The interval 600-700 has the highest frequency, so the midpoint of this interval, i.e. 650, is taken as the mode of the distribution.

Mode Cont…
• The major advantages of the mode are:
– It can be used when the variable is ordinal or nominal.
– It is very easy to compute.
– It is less likely to be affected by extreme values.
– It can be calculated for distributions with open-ended class intervals.
• The major disadvantages of the mode are:
– It may not perfectly denote what central tendency implies.
– It is calculated based on few observations.
– It may have more than one value for a dataset.
– At times a mode value may not exist in a dataset.
– It does not accommodate further mathematical application.
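The two mode rules can be illustrated in Python. `statistics.multimode` (Python 3.8+) returns every most-frequent value, which covers the unimodal and bimodal cases; `grouped_mode` is an illustrative helper, and the raw-data list is the example set read as {1, 2, 2, 3, 3, 4, 4, 4, 5}, which is an assumption about the original slide.

```python
from statistics import multimode

scores = [1, 2, 2, 3, 3, 4, 4, 4, 5]
print(multimode(scores))          # [4]
print(multimode([1, 1, 2, 2, 3])) # [1, 2], a bimodal set

# Salary classes (Br) and number of factory workers from the table above.
salary_classes = [((500, 600), 3), ((600, 700), 6), ((700, 800), 5),
                  ((800, 900), 5), ((900, 1000), 0), ((1000, 1100), 1)]

def grouped_mode(classes):
    """Midpoint(s) of the class(es) with the highest frequency."""
    top = max(freq for _, freq in classes)
    return [(lo + hi) / 2 for (lo, hi), freq in classes if freq == top]

print(grouped_mode(salary_classes))  # [650.0]
```

Returning a list makes it explicit that a dataset may have more than one mode.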
Skewness and the Measures of Central Tendency
• The normal distribution is one that is bell shaped, unimodal and symmetric.
• In a normal distribution, the mean, median and mode are equal.
• Skewness measures the asymmetry of a distribution.
• If the distribution is not symmetric (one side does not reflect the other), then it is skewed.
• Skewness is indicated by the “tail”, or trailing frequencies, of the distribution.
• If the tail is to the right it is a positive skew. If the tail is to the left then it is a negatively skewed distribution.

Skewness and the Measures of Central Tendency Cont…
• Skewness affects the arrangement of the three measures of central tendency in the following way.
[Figure: relative positions of the mean, median and mode in negatively skewed, symmetric and positively skewed distributions]

Weighted Mean
• The weighted mean is similar to an arithmetic mean, except that the individual data values contribute to the mean in differing degrees.
• Each data value ($x_i$) has a weight assigned to it ($w_i$).
• Data values with larger weights contribute more to the weighted mean and data values with smaller weights contribute less to the weighted mean.
• The formula is
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Weighted Mean Cont…
• If all the weights are equal, then the weighted mean is the same as the arithmetic mean.
• The best example for the application of the weighted mean is the calculation of GPA.
• A grade earned in a course with more credit hours carries a larger weight than a grade earned in a course with fewer credit hours.
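The GPA illustration of the weighted mean can be sketched as follows. The grade points and credit hours are hypothetical, and treating credit hours as the weights is an assumption about how the GPA is computed; the function name is illustrative.

```python
def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical semester: grade points (A = 4, B = 3, C = 2) weighted by credit hours.
grade_points = [4, 3, 2]
credit_hours = [3, 4, 2]

gpa = weighted_mean(grade_points, credit_hours)
unweighted = weighted_mean(grade_points, [1, 1, 1])  # equal weights

print(round(gpa, 2), unweighted)
```

With equal weights the function reduces to the ordinary arithmetic mean (3.0 here), while the credit-hour weights pull the GPA toward the grades earned in the larger courses.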
Geometric Mean
• The geometric mean is an average calculated by multiplying a set of numbers and taking the nth root, where n is the number of numbers.
• The geometric mean is related to the lognormal distribution.
• The lognormal distribution is a distribution which is normal for the logarithm-transformed values.

Harmonic Mean
• The harmonic mean (H) of n positive values is defined by the formula
$$H = \frac{n}{\sum_{i=1}^{n} \frac{1}{x_i}}$$
• It is the reciprocal of the arithmetic mean of the reciprocals.
• It applies more accurately to situations involving rates.
• For example: A blood donor fills a 250 mL blood bag at 70 mL/min on the first visit, and at 90 mL/min on the second visit. What is the average rate at which the donor fills a bag?
• Given:
– 250 mL at 70 mL/min = 3.571 mins total
– 250 mL at 90 mL/min = 2.778 mins total

Harmonic Mean Cont…
• So 500 mL total in (3.571 + 2.778) mins total = 500/6.349 = 78.75 mL/min
• The harmonic mean, 2/[1/70 + 1/90] = 78.75, gives a more accurate description of the average rate than the arithmetic mean (80 mL/min).
• Source: http://wiki.answers.com/Q/What_is_the_application_of_harmonic_mean_in_medicine

Measures of Dispersion
• While measures of central tendency are used to estimate the "center" value of a dataset, measures of dispersion are important for describing the spread of the data, or its variation around a central value.
• Two distinct samples may have the same mean or median, but completely different levels of variability, or vice versa.
– Set 1: 30, 40, 40, 50, 60, 60, 70 (Mean = 50)
– Set 2: 48, 49, 49, 50, 50, 51, 53 (Mean = 50)
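The contrast between Set 1 and Set 2 can be made concrete with Python's standard `statistics` module. This sketch uses the range and the sample standard deviation (both defined in the slides that follow) and reads the two sets as listed above.

```python
from statistics import mean, stdev

set_1 = [30, 40, 40, 50, 60, 60, 70]
set_2 = [48, 49, 49, 50, 50, 51, 53]

for name, data in (("Set 1", set_1), ("Set 2", set_2)):
    data_range = max(data) - min(data)  # largest value minus smallest value
    print(name, "mean:", mean(data), "range:", data_range,
          "SD:", round(stdev(data), 2))
```

Both sets have a mean of 50, yet Set 1 spans 40 units and Set 2 only 5, so the mean alone cannot distinguish them.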
Range
• Defined as the difference between the largest and smallest sample values ($x_{max} - x_{min}$).
• Major advantage: It is simple to calculate.
• Major disadvantages:
– It depends only on extreme values and provides no information about how the remaining data are distributed.
– The extreme values are the most unreliable parts of the data.
– It doesn't accommodate further mathematical application.
– The range value cannot be used for comparison when the units of measurement are different.

Standard Deviation and Variance
• Standard deviation is the most common and useful measure of dispersion.
• It is, roughly, the average distance of each score from the mean.
• The formula for the sample standard deviation is given as:
$$S = \sqrt{\frac{\sum_{i=1}^{n} (\bar{x} - x_i)^2}{n - 1}}$$
• The formula for the population standard deviation is given as:
$$\sigma = \sqrt{\frac{\sum_{i=1}^{n} (\mu - x_i)^2}{n}}$$
• What might be the reason for the difference?

Standard Deviation and Variance Cont…
• Variance is just the square of the standard deviation.
• The formulas for sample and population variance are given as follows:
$$S^2 = \frac{\sum_{i=1}^{n} (\bar{x} - x_i)^2}{n - 1} \qquad \sigma^2 = \frac{\sum_{i=1}^{n} (\mu - x_i)^2}{n}$$
• NB: Occasionally, the abbreviations SD for standard deviation and Var for variance are used.
• Standard deviation for grouped data is calculated as:
$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{n - 1}}$$

Standard Deviation and Variance Cont…
• Advantages:
– They accommodate further mathematical applications.
– They are calculated from the whole set of observations.
• Disadvantages:
– They must always be understood in the context of the mean of the data.
– They are expressed in the unit of measurement of the observed data (squared, in the case of the variance). Thus it is difficult to compare the standard deviation or variance of two datasets measured in two different units.

Coefficient of Variation (CV)
• The standard formulation of the CV is the ratio of the standard deviation to the mean of a given dataset.
$$CV = \frac{S}{\bar{x}} \times 100\%$$
• The coefficient of variation is a dimensionless number.
• So when comparing datasets with different units one should use the CV instead of the SD.
• The CV is useful in comparing the variability of several different samples, each with a different arithmetic mean, as higher variability is expected when the mean increases.
• CV is also important for comparing the reproducibility of variables.

Example on Grouped Data
Example 3.4:
• Consider the time taken by 30 students to do a Biostatistics quiz. Their time is summarized in the following table.

Minutes spent on quiz   Number of students (f)   True class interval   Midpoint (m)   f_i m_i   f_i m_i^2
1-5                     2                        0.5-5.5               3              6         18
6-10                    12                       5.5-10.5              8              96        768
11-20                   16                       10.5-20.5             15.5           248       3844
Total                   30                                                            350       4630

Example Cont…
$$\bar{x} = \frac{\sum_{i=1}^{k} m_i f_i}{n} = \frac{350}{30} = 11.7 \text{ minutes}$$
The median class is 10.5-20.5 (n/2 = 15, cumulative frequency before it 14, frequency 16, width 10):
$$\tilde{X} = L_m + \left(\frac{\frac{n}{2} - F_c}{F_m}\right) w = 10.5 + \left(\frac{15 - 14}{16}\right) \times 10 = 11.1 \text{ min}$$
$$S = \sqrt{\frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{x}^2}{n - 1}} = \sqrt{\frac{4630 - 30(11.67)^2}{29}} = \sqrt{18.85} = 4.3 \text{ min}$$

Measures of Position (Fractiles)
• In addition to measures of central tendency and dispersion, measures of position give additional information about a given dataset.
• Fractiles (quantiles) are numbers that partition, or divide, an ordered dataset into equal parts.
• For instance, the median is a fractile because it divides an ordered dataset into two equal parts.
• The commonly used measures of position are Quartiles (that divide the data into 4 parts), Deciles (that divide the data into 10 parts), and Percentiles (that divide the data into 100 parts).
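The grouped-data calculations of Example 3.4 can be reproduced directly from the quiz-time table. This is a sketch under stated assumptions: it uses the sample-variance form, the sum of f(m - mean)^2 divided by (n - 1), for the standard deviation, and the variable names are illustrative.

```python
from math import sqrt

# (lower true limit, upper true limit, frequency) from the Example 3.4 table
classes = [(0.5, 5.5, 2), (5.5, 10.5, 12), (10.5, 20.5, 16)]

n = sum(f for _, _, f in classes)
mids = [(lo + hi) / 2 for lo, hi, _ in classes]
freqs = [f for _, _, f in classes]

mean = sum(m * f for m, f in zip(mids, freqs)) / n
# Sample variance of grouped data: sum of f * (m - mean)^2 over (n - 1).
variance = sum(f * (m - mean) ** 2 for m, f in zip(mids, freqs)) / (n - 1)
sd = sqrt(variance)
cv = sd / mean * 100  # coefficient of variation, in percent

print(f"mean={mean:.2f} min, SD={sd:.2f} min, CV={cv:.1f}%")
```

Because the CV divides the standard deviation by the mean, it carries no unit and could be compared with the CV of a variable measured on a different scale.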
Quartiles
• Quartiles divide a dataset into four equal parts.
• The three quartiles Q1, Q2, and Q3 divide an ordered dataset into four equal parts.
– About ¼ of the data falls on or below the first quartile Q1.
– About ½ of the data falls on or below the second quartile Q2 (equivalent to the median).
– About ¾ of the data falls on or below the third quartile Q3.
– About ¼ of the data falls above the third quartile Q3.

Quartiles Cont…
• In order to identify the quartiles of a given dataset:
• Sort the values in increasing order.
• Identify the quartiles accordingly:
– Q1 is the {0.25(n+1)}th observation
– Q2 is the median observation, or the {0.5(n+1)}th observation
– Q3 is the {0.75(n+1)}th observation
• NB: if the identified position is not a whole number, then the value should be determined by interpolation between the observations on either side.

Quartiles Cont…
• Example: Let's assume the following dataset presents the ages of 8 factory workers. Identify the first and the third quartiles.
{18, 21, 23, 24, 24, 32, 42, 59}
• First make sure that the data are sorted in increasing order.
• Q1 is the {0.25(n+1)}th observation = {0.25(8+1)}th observation = {0.25(9)}th observation = {2.25}th observation

Quartiles Cont…
• i.e., Q1 is a quarter of the distance between 21 and 23. This can be interpolated as: 21 + (23 - 21)0.25 = 21.5
• The interpretation is that one fourth of the observations are below or equal to the value 21.5.
• Q3 is the {0.75(n+1)}th observation = {6.75}th observation, interpolated as 32 + (42 - 32)0.75 = 39.5
• The interpretation is that three fourths of the observations are below or equal to the value 39.5.
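The positional rule with interpolation used in the factory-worker example can be sketched as a small Python function. The function name is illustrative, and returning the smallest or largest value when the position falls outside the data is an assumption not covered by the slides.

```python
def fractile(sorted_values, fraction):
    """Value at position fraction * (n + 1), counting from 1, with linear interpolation."""
    n = len(sorted_values)
    position = fraction * (n + 1)
    lower_rank = int(position)      # whole part of the position
    weight = position - lower_rank  # fractional part, used for interpolation
    if lower_rank == 0:
        return sorted_values[0]     # assumption: clamp to the smallest value
    if lower_rank >= n:
        return sorted_values[-1]    # assumption: clamp to the largest value
    below = sorted_values[lower_rank - 1]
    above = sorted_values[lower_rank]
    return below + (above - below) * weight

ages = sorted([18, 21, 23, 24, 24, 32, 42, 59])
q1 = fractile(ages, 0.25)
q2 = fractile(ages, 0.50)
q3 = fractile(ages, 0.75)
print(q1, q2, q3)  # 21.5 24.0 39.5
```

Note that statistical packages use several slightly different interpolation conventions, so their quartiles may not match this (n + 1) rule exactly.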
Quartiles Cont…
Additional uses of the quartiles:
• The interquartile range (Q3 - Q1) can be used as a measure of dispersion (like the range). The interquartile range can overcome one of the limitations of the range (i.e. being affected by extreme values).
• The quartile deviation [(Q3 - Q1)/2] and the coefficient of quartile deviation [(Q3 - Q1)/(Q3 + Q1)] are also, though rarely, used as measures of dispersion.
• A dataset can be summarized using the so-called “five-number summary” (this is sometimes represented graphically as a box-and-whisker plot). The five numbers are: the first and third quartiles, the median, and the maximum and minimum values.

Deciles
• Deciles serve to partition data into 10 equal parts.
• There are 9 deciles dividing the population into 10 parts.
• The deciles are termed D1 through D9.
• They are not as commonly used as percentiles and quartiles.
• The interpretation of deciles is as follows:
– About one tenth of the data falls on or below D1.
– About two tenths of the data falls on or below D2.
– The same meaning holds for the other deciles.
• Note that D5 has the same meaning as the median, or the second quartile.

Deciles Cont…
A given decile is determined in the following manner.
1. Arrange the data in ascending order.
2. Compute the decile using the formula:
$$k^{th} \text{ decile} = \left\{\left(\frac{k}{10}\right)(n + 1)\right\}^{th} \text{ observation}$$
3. NB: if the identified position is not a whole number, then the value should be determined by interpolation between the observations on either side.

Percentiles
• Percentiles are also like quartiles, but divide the dataset into 100 equal parts.
• Each group represents 1% of the dataset.
• There are 99 percentiles, termed P1 through P99.
• The interpretation of percentiles is as follows:
– 1% of the data falls on or below P1.
– 2% of the data falls on or below P2.
– The same holds for other values.
• P50 is yet another term for the median.
• Other equivalents, such as P25 = Q1, P75 = Q3, P10 = D1, etc., should also be obvious.
Percentiles Cont…
A given percentile is determined in the following manner.
1. Arrange the data in ascending order.
2. Compute the percentile using the formula:
$$k^{th} \text{ percentile} = \left\{\left(\frac{k}{100}\right)(n + 1)\right\}^{th} \text{ observation}$$
3. If the identified position is not a whole number, then the value should be determined by interpolation between the observations on either side.

Example
• The following data represent the Biostatistics results (out of 100 marks) of 18 students. Calculate the 4th decile and the 70th percentile.
• Computing the 4th decile
• Before starting the computation, arrange the observations in increasing order:
{48, 51, 51, 59, 66, 67, 69, 71, 71, 72, 75, 75, 76, 78, 80, 81, 82, 84}

Example Cont…
• Compute the 4th decile using the formula:
$$4^{th} \text{ decile} = \left\{\left(\frac{4}{10}\right)(n + 1)\right\}^{th} \text{ observation} = \{(0.4)(19)\}^{th} \text{ observation} = \{7.6\}^{th} \text{ observation}$$
• The 4th decile is between the 7th and 8th observations (i.e. between 69 and 71). In order to get the exact value we have to interpolate: 69 + (71 - 69)0.6 = 70.2
• About four tenths of the data falls on or below 70.2.

Example Cont…
• Compute the 70th percentile
• The data are already sorted.
• Compute the 70th percentile using the formula:
$$70^{th} \text{ percentile} = \left\{\left(\frac{70}{100}\right)(n + 1)\right\}^{th} \text{ observation} = \{13.3\}^{th} \text{ observation}$$
• The 70th percentile is between the 13th and 14th observations (i.e. between 76 and 78). In order to get the exact value we have to interpolate: 76 + (78 - 76)0.3 = 76.6
• About 70% of the data falls on or below the value 76.6.
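The decile and percentile steps share one positional rule: the kth fractile of a division into `parts` equal parts sits at position k(n + 1)/parts. A minimal sketch follows, using the 18 exam results in sorted order; the function name and the clamping at the ends of the data are assumptions made for illustration.

```python
def kth_fractile(values, k, parts):
    """k-th of `parts` fractiles: parts=10 for deciles, parts=100 for percentiles."""
    ordered = sorted(values)
    n = len(ordered)
    position = k * (n + 1) / parts
    rank = int(position)        # whole part of the position
    weight = position - rank    # fractional part, used for interpolation
    if rank == 0:
        return ordered[0]       # assumption: clamp to the smallest value
    if rank >= n:
        return ordered[-1]      # assumption: clamp to the largest value
    return ordered[rank - 1] + (ordered[rank] - ordered[rank - 1]) * weight

scores = [48, 51, 51, 59, 66, 67, 69, 71, 71, 72, 75, 75, 76, 78, 80, 81, 82, 84]
d4 = kth_fractile(scores, 4, 10)     # 4th decile
p70 = kth_fractile(scores, 70, 100)  # 70th percentile
print(round(d4, 1), round(p70, 1))   # 70.2 76.6
```

Calling `kth_fractile(scores, 1, 4)` would give the first quartile by the same rule, which reflects the equivalences P25 = Q1 and P10 = D1.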
3/1/2010 Rate. 130 129 Ratio • Mathematically a ratio is the comparison of two quantities that have the same units (usually classes of a variable). Ratio and Proportion • In addition to measures of central tendency. a dataset can be mathematically summarized by the use of Rate. measures of dispersion. • If the fraction measures number of events per population at risk in a given period of time it is called operational rate (Example: Incidence proportion). hence the value indicates the overall contribution of the numerator to the denominator. • A ratio can be written in three different ways: – As two numbers separated by a colon (a:b) – As a fraction (a/b) – As two numbers separated by the word to (a to b) • In epidemiology ratio present two variables (as numerator and denominator) where one is not included in the other. Rate • In mathematics rate is a numeric presentation which is given in the form of fraction by which the numerator measures one variable and the denominator another. • In epidemiology we use rates to measure the occurrence of events over time. • Usually the denominator of rate is a time measure. and measures of position. decimal or percentage. Ratio and Proportion. • If time element is directly reflected into the denominator it is called real rate. 131 Proportion • A proportion is usually presented in fraction. • Unlike ratio numerator is the subset of the denominator. 132 33 . (Example: Incidence density).
Numeric Summarization Using SPSS
• In SPSS, numeric summaries are available under many alternatives. Commonly used are:
– Analyze > Descriptive Statistics > Frequencies > Statistics.
– Analyze > Descriptive Statistics > Descriptives > Statistics.
– Analyze > Descriptive Statistics > Explore > Statistics.
– Analyze > Descriptive Statistics > Crosstabs > Statistics.
– Analyze > Reports > OLAP Cubes > Statistics.

Basic Probability

What is Probability
• Probability is the chance that an event will occur when a trial is repeated a very large (nearly infinite) number of times under the same conditions. OR
• The probability of an event is the relative frequency of a set of outcomes over an indefinitely large (or infinite) number of trials.
• A sample space is the set of all possible outcomes of a trial or experiment.
• An event is a subset of the sample space.
• An event can be simple or composite. A composite event contains more than one simple event.

Concept of Union, Intersection and Complement
Mutually Exclusive Events and the Additive Law
• Events are said to be mutually exclusive if they have no outcome in common.
• The Additive Law, when applied to two mutually exclusive events, states that the probability of either of the two events occurring is obtained by adding the probabilities of the two events.
• p(A or B) = p(A) + p(B)
• Examples:

Mutually Exclusive Cont…
Example 4.1:
• Roll a six-sided die. The possible outcomes (sample space) are six: {1, 2, 3, 4, 5, 6}. Each outcome has an equal probability of occurrence (i.e. 1/6). The probability of rolling an even number would be:
• p(even) = p(2) + p(4) + p(6)
• = (1/6) + (1/6) + (1/6) = 1/2

Mutually Exclusive Cont…
Example 4.2:
• The natural history of tuberculosis indicates that, for TB patients without any treatment, at the end of the 5th year of illness ½ of them would die, ¼ would develop permanent disability and ¼ would recover. What is the probability that an untreated TB patient will either recover or develop permanent disability (in other words, avoid death) after 5 years of illness?

Conditional Probability and the Multiplicative Law
• Conditional probability is defined as the probability that a certain event will occur given that another event has also occurred.
• p(A|B), or "probability of A given B":
$$p(A \mid B) = \frac{p(A \cap B)}{p(B)}$$
• This formula is conveniently rewritten as the following, which is commonly referred to as the Multiplicative Rule:
$$p(A \cap B) = p(A \mid B) \times p(B)$$
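The additive law in Examples 4.1 and 4.2 amounts to summing the probabilities of mutually exclusive outcomes. The sketch below uses exact fractions from Python's standard library; the dictionary layout and variable names are illustrative.

```python
from fractions import Fraction

# Example 4.1: a fair six-sided die, each face with probability 1/6.
die = {face: Fraction(1, 6) for face in range(1, 7)}
p_even = sum(p for face, p in die.items() if face % 2 == 0)

# Example 4.2: mutually exclusive outcomes for untreated TB patients after 5 years.
tb = {"death": Fraction(1, 2),
      "disability": Fraction(1, 4),
      "recovery": Fraction(1, 4)}
p_avoid_death = tb["disability"] + tb["recovery"]

print(p_even, p_avoid_death)  # 1/2 1/2
```

In both cases the outcomes being added cannot occur together, which is the condition under which simple addition is valid.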
Conditional Probability Cont…
Example 4.3:
• What is the probability that the outcome of a roll of a die is 2 (A2) given that the outcome is even?
Example 4.4:
• A medical practitioner measured the CD4 count of AIDS patients on ART twice within a month. About 25% of the patients had a normal value in both tests and 42% of them had a normal result in the first test. What percent of those who had a normal value in the first test also had a normal value in the second test?

Independent Events and the Multiplicative Law
• Two given events are called independent events if the occurrence or non-occurrence of one doesn't affect in any way the occurrence or non-occurrence of the other.
• With independent events the multiplicative law becomes: p(A and B) = p(A) p(B)

Independent Events Cont…
Example 4.5:
• Assume we have rolled a die twice. What is the probability of getting 6 in both rolls?
Example 4.6:
• The probability of getting a normal birth weight baby at the 33rd week of gestational age is 1/5. If two pregnant women at the aforementioned gestational age gave birth in Bethel Hospital yesterday, what is the probability that both babies have normal birth weight?

Bayes' Theorem
• Bayes' theorem was published in the eighteenth century by Thomas Bayes.
• It says that you can use conditional probability to make predictions in reverse.
• It is sometimes called the inverse probability law:
• P(B|A) = P(A and B)/P(A) ……… [1]
  P(A|B) = P(A and B)/P(B) ……… [2]
• Solving [1] for P(A and B) and substituting into [2] gives Bayes' Theorem: P(A|B) = [P(B|A)][P(A)]/P(B)
• The general formula for Bayes' Theorem is:
$$P(A_i \mid B) = \frac{P(B \mid A_i)\,P(A_i)}{\sum_{j} P(B \mid A_j)\,P(A_j)}$$
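Examples 4.3 to 4.6 can be worked with the conditional-probability formula and the multiplicative law for independent events. This sketch offers one way to compute them; Examples 4.5 and 4.6 assume the two trials are independent, as the slide heading suggests, and the function and variable names are illustrative.

```python
from fractions import Fraction

def conditional(p_a_and_b, p_b):
    """p(A | B) = p(A and B) / p(B)."""
    return p_a_and_b / p_b

# Example 4.3: die shows 2, given that the outcome is even.
p_two_given_even = conditional(Fraction(1, 6), Fraction(1, 2))

# Example 4.4: normal CD4 value on the second test, given a normal first test.
p_second_given_first = conditional(0.25, 0.42)

# Example 4.5: two independent rolls both showing 6.
p_double_six = Fraction(1, 6) * Fraction(1, 6)

# Example 4.6: two independent births, each normal with probability 1/5.
p_both_normal = Fraction(1, 5) ** 2

print(p_two_given_even, round(p_second_given_first, 3), p_double_six, p_both_normal)
```

For Example 4.4 this gives roughly 0.595, i.e. about 60% of those with a normal first result also had a normal second result.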
Bayes' Theorem Cont…
Example 4.7:
• Suppose there is a certain disease randomly found in 0.5% (.005) of the general population. A certain clinical blood test is 99% effective in detecting the presence of the disease among persons with the disease. But it also yields false-positive results in 5% of individuals without the disease. The following tables show the probabilities that are stipulated in the example and the probabilities that can be inferred from the stipulated information:
• (Source: http://faculty.vassar.edu/lowry/bayes.html)

Bayes' Theorem Cont…
Given:
P(A) = .005                  The probability that the disease will be present in any particular person
P(~A) = 1 − .005 = .995      The probability that the disease will not be present in any particular person
P(B | A) = .99               The probability that the test will yield a positive result [B] if the disease is present [A]
P(~B | A) = 1 − .99 = .01    The probability that the test will yield a negative result [~B] if the disease is present [A]
P(B | ~A) = .05              The probability that the test will yield a positive result [B] if the disease is not present [~A]
P(~B | ~A) = 1 − .05 = .95   The probability that the test will yield a negative result [~B] if the disease is not present [~A]

Bayes' Theorem Cont…
• Given this information, the derivation of two simple probabilities is possible using the conditional probability formula.
The probability of a positive test result [B], irrespective of whether the disease is present [A] or not present [~A]:
P(B) = [P(B | A) x P(A)] + [P(B | ~A) x P(~A)] = [.99 x .005] + [.05 x .995] = .0547
The probability of a negative test result [~B], irrespective of whether the disease is present [A] or not present [~A]:
P(~B) = [P(~B | A) x P(A)] + [P(~B | ~A) x P(~A)] = [.01 x .005] + [.95 x .995] = .9453

Bayes' Theorem Cont…
• Then it is possible to calculate the remaining probabilities.
The probability that the disease is present [A] if the test result is positive [B]:
P(A | B) = [P(B | A) x P(A)] / P(B) = [.99 x .005] / .0547 = .0905
The probability that the disease is not present [~A] if the test result is positive [B]:
P(~A | B) = [P(B | ~A) x P(~A)] / P(B) = [.05 x .995] / .0547 = .9095
The probability that the disease is absent [~A] if the test result is negative [~B]:
P(~A | ~B) = [P(~B | ~A) x P(~A)] / P(~B) = [.95 x .995] / .9453 = .99995
The probability that the disease is present [A] if the test result is negative [~B]:
P(A | ~B) = [P(~B | A) x P(A)] / P(~B) = [.01 x .005] / .9453 = .00005
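The arithmetic of Example 4.7 can be reproduced with a short function. This is a sketch only: the function name and its parameter names (`prevalence`, `sensitivity`, `false_positive_rate`) are labels introduced here for P(A), P(B | A) and P(B | ~A), not terms used in the slides.

```python
def bayes_posteriors(prevalence, sensitivity, false_positive_rate):
    """Return the simple and posterior probabilities for a two-outcome test."""
    p_a = prevalence
    p_not_a = 1 - p_a
    p_b_given_a = sensitivity
    p_b_given_not_a = false_positive_rate

    # Total probability of a positive and of a negative result.
    p_b = p_b_given_a * p_a + p_b_given_not_a * p_not_a
    p_not_b = (1 - p_b_given_a) * p_a + (1 - p_b_given_not_a) * p_not_a

    # Bayes' theorem for each combination of disease status and test result.
    return {
        "P(B)": p_b,
        "P(~B)": p_not_b,
        "P(A|B)": p_b_given_a * p_a / p_b,
        "P(~A|B)": p_b_given_not_a * p_not_a / p_b,
        "P(~A|~B)": (1 - p_b_given_not_a) * p_not_a / p_not_b,
        "P(A|~B)": (1 - p_b_given_a) * p_a / p_not_b,
    }

result = bayes_posteriors(prevalence=0.005, sensitivity=0.99, false_positive_rate=0.05)
for name, value in result.items():
    print(f"{name} = {value:.5f}")
```

Even with a 99% detection rate, only about 9% of positive results correspond to real disease, because the disease is rare in the general population.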
Summary of the Basic Properties of Probability
1. The value of a probability can only be 0 ≤ p ≤ 1.
2. If an event is certain to occur, its probability is 1, and if an event is certain not to occur, its probability is 0.
3. If two events are mutually exclusive (disjoint), the probability that one or the other will occur equals the sum of the probabilities: p(A or B) = p(A) + p(B)
4. If A and B are two events, not necessarily disjoint, then p(A or B) = p(A) + p(B) − p(A and B)
5. The sum of the probabilities that an event will occur and that it will not occur is equal to 1.
6. If A and B are two independent events then p(A and B) = p(A) p(B)
7. If A and B are two events, p(A | B) = P(A ∩ B) / P(B)

Random Variable and Probability Distribution

Random Variable
• Any characteristic that can be measured or categorized is called a Variable.
• If a variable can assume a number of different values so that any particular outcome is determined by chance, it is called a Random Variable.
• A Random Variable is a function which assigns unique numerical values to all possible outcomes of a random experiment under fixed conditions.

Random Variable Cont…
Example 4.8
• Three students are taken at random from this classroom. Suppose our interest is the number of female students that we will get out of the three sampled. The possible list of outcomes with number of females is:

Outcome         MMM  MMF  MFM  FMM  MFF  FMF  FFM  FFF
No of Females    0    1    1    1    2    2    2    3
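The outcome list in Example 4.8 can be generated mechanically, which also shows how a random variable maps each outcome to a number. The sketch below additionally assumes that male and female are equally likely and that the three draws are independent; that assumption is not stated in the example and is made here only to attach probabilities to the counts.

```python
from collections import Counter
from itertools import product

# All ordered outcomes for three students, each either M or F.
outcomes = ["".join(combo) for combo in product("MF", repeat=3)]

# The random variable X assigns to each outcome its number of females.
females_per_outcome = {outcome: outcome.count("F") for outcome in outcomes}

# Assumption (not in the slides): all 8 outcomes are equally likely.
counts = Counter(females_per_outcome.values())
distribution = {x: counts[x] / len(outcomes) for x in sorted(counts)}

for outcome, x in females_per_outcome.items():
    print(outcome, x)
print(distribution)
```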
Random Variable Cont…
• There are two types of random variables.
– A Discrete Random Variable is one that takes finite distinct values.
– A Continuous Random Variable is one that takes an infinite number of possible values.
• Example 4.9:
– A coin is tossed 10 times. The random variable X is the number of tails that are noted. X can only take the values 0, 1, ..., 10, so X is a Discrete Random Variable.
– A light bulb is burned until it burns out. The random variable Y is its lifetime in hours. Y can take any positive real value, so Y is a Continuous Random Variable.

Probability Distributions
• Every Random Variable has a corresponding Probability Distribution.
• A Probability Distribution applies the theory of probability to describe the behavior of the random variable.
• In the discrete case, it specifies all possible outcomes of the random variable along with the probability that each will occur.
• In the continuous case, it allows us to determine the probabilities associated with specified ranges of values.

Discrete Probability Distribution
• Usually represented by a table.
Example 4.10:
• Table 4.1: Probability distribution of a random variable X representing the birth order of children born in the US.

x        1      2      3      4      5      6      7      8+     Total
P(X=x)   0.416  0.330  0.158  0.058  0.021  0.009  0.004  0.004  1.000

Continuous Probability Distributions
• Since a continuous random variable assumes an infinite number of outcomes, it cannot be expressed in tabular form. Instead, an equation or graph describes it.
• The equation used to describe a continuous probability distribution is called a Probability Density Function (PDF).
• PDF has the following properties:
Continuous Probability Distributions Cont..
• The area bounded by the curve of the density function and the x-axis is equal to 1, when computed over the domain of the variable.
• The probability that a random variable assumes a value between a and b is equal to the area under the density function bounded by a and b.
• The probability that a continuous random variable will equal a specific value is always zero.

Binomial Distribution
• A discrete probability distribution.
• It handles a dichotomous/binary/Bernoulli random variable.
• A variable which has only two outcomes (success and failure).
• The trial is called a Bernoulli trial.
– The experiment consists of n repeated trials.
– Each trial can result in just two possible outcomes.
– The probability of success, denoted by P, is the same on every trial.
– The trials are independent.

Binomial Distribution Cont..
• b(x, n, P): The probability that an n-trial binomial experiment results in exactly x successes, when the probability of success on an individual trial is P.
• b(x, n, P) = nCx * P^x * (1 − P)^(n − x)

Binomial Distribution Cont..
Example 4.11:
• Suppose a die is tossed 5 times. What is the probability of getting exactly 2 fours?
• Suppose in Addis Ababa the probability of a commercial sex worker to be HIV positive is 0.15. If we consider 5 randomly selected commercial sex workers in the city, what is the probability that exactly 2 of them will be positive?
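The binomial formula b(x, n, P) can be evaluated directly with Python's standard library. The sketch below applies it to the two questions in Example 4.11; the function name `binomial_pmf` is chosen here for illustration.

```python
from math import comb

def binomial_pmf(x, n, p):
    """b(x, n, P) = nCx * P^x * (1 - P)^(n - x)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Exactly 2 fours in 5 tosses of a fair die (P = 1/6).
p_two_fours = binomial_pmf(2, 5, 1 / 6)

# Exactly 2 HIV-positive results among 5 workers (P = 0.15).
p_two_positive = binomial_pmf(2, 5, 0.15)

print(f"p(2 fours in 5 tosses)   = {p_two_fours:.4f}")
print(f"p(2 positives out of 5) = {p_two_positive:.4f}")
```

Summing `binomial_pmf` over a range of x values gives a cumulative binomial probability.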
Binomial Distribution Cont..
Cumulative Binomial Probability:
• Refers to the probability that the binomial random variable falls within a specified range (e.g., is greater than or equal to a stated lower limit and less than or equal to a stated upper limit).

Binomial Distribution Cont…
Example 4.12:
• The probability that a student is accepted to a prestigious college is 0.3. If 5 students from the same school apply, what is the probability that at most 2 are accepted?
• What is the probability of getting 4 or more HIV positives among 5 randomly selected sex workers, given that the probability of a commercial sex worker to be HIV positive is 0.15?

Poisson Distribution
• A discrete probability distribution.
• First introduced by Siméon-Denis Poisson (1781–1840).
• It expresses the probability of a number of random events occurring in a fixed period of time if these events occur with a known average rate.
• Note that the distribution can also be used to quantify the probability of occurrence of an event in a length, an area, a volume, etc.

Poisson Distribution Cont…
• A Poisson experiment is a statistical experiment that has the following properties:
– The experiment results in outcomes that can be classified as successes or failures.
– The average number of successes (λ) that occurs in a specified period is known.
– The probability that a success will occur is proportional to the duration of the time.
– The probability that a success will occur in an extremely small time is virtually zero.
Poisson Distribution Cont…
• The following notations are important.
– e: A constant equal to approximately 2.71828.
– λ: The mean number of successes (occurrences of an event) that occur in a specified period of time.
– x: The actual number of successes that occur in a specified period of time.
– P(x, λ): The Poisson probability that exactly x successes occur in a Poisson experiment, when the mean number of successes is λ.

Poisson Distribution Cont…
• Given the mean number of successes (λ) that occur in a specified period of time, we can compute the Poisson probability based on the following formula:
  P(x, λ) = (e^−λ)(λ^x) / x!

Poisson Distribution Cont…
Example 4.13:
• Let's assume the average number of breast cancer deaths is 2 per day. What is the probability that exactly 3 will die tomorrow?
• λ = 2, since 2 patients die per day, on average.
• x = 3, i.e. the likelihood that 3 will die tomorrow.
• e = 2.71828.

Poisson Distribution Cont…
• We put these values into the formula as follows.
• P(x, λ) = (e^−λ)(λ^x) / x!
• P(3, 2) = (2.71828^−2)(2^3) / 3!
• P(3, 2) = (0.13534)(8) / 6
• P(3, 2) = 0.180
• Thus, the probability of getting 3 deaths by tomorrow is 0.180.

Poisson Distribution Cont…
Example 4.14:
• In a study of suicides, a researcher found that the monthly distribution of adolescent suicides in the US follows a Poisson distribution with parameter λ = 2.75. Find the probability that a randomly selected month will be one in which three adolescent suicides occur.
• P(x, λ) = (e^−λ)(λ^x) / x!
• P(3, 2.75) = (e^−2.75)(2.75^3) / 3!
• P(3, 2.75) = 0.222
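The two worked Poisson examples can be verified with a small function. This is an illustrative sketch; `poisson_pmf` and its parameter name `mean` are chosen here, with `mean` standing for the average number of successes in the period.

```python
from math import exp, factorial

def poisson_pmf(x, mean):
    """Poisson probability of exactly x successes given the mean count."""
    return exp(-mean) * mean**x / factorial(x)

# Example 4.13: on average 2 deaths per day, exactly 3 tomorrow.
p_three_deaths = poisson_pmf(3, 2)

# Example 4.14: on average 2.75 suicides per month, exactly 3 in a month.
p_three_suicides = poisson_pmf(3, 2.75)

print(f"P(3, 2)    = {p_three_deaths:.3f}")
print(f"P(3, 2.75) = {p_three_suicides:.3f}")
```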
Poisson Distribution Cont…
• If the number of admissions in a hospital is 10 per hour on average, determine the probability that, in any hour, there will be:
– 0 admissions.
– 6 admissions.
– Less than 2 admissions.

Normal Distribution
• Is the most important probability distribution function.
• It is also known as the Gaussian Distribution.
• Named after Carl Friedrich Gauss (1777–1855).
• Given by the formula:
  Y = [1 / (σ √(2π))] * e^(−(x − µ)² / (2σ²))
• The formula is affected by two main factors: mean and SD.

Normal Distribution Cont…
Normal distribution has the following characteristics:
1. Bell shaped
2. Symmetrical at the mean
3. Unimodal
4. Mean, median and mode are equal
5. Area under the curve is 1
6. Extends from negative infinity to positive infinity
• The normal distribution can be used to describe, at least approximately, any variable that tends to cluster around the mean. (Mainly as a result of the central limit theorem)

Skewness, Kurtosis, and Normal Curve
• Skewness and kurtosis are used to measure normality.
• Significant skewness and kurtosis indicate that data are not normal.
• Skewness is a measure of asymmetry.
• For univariate data Y1, Y2, ..., YN, the formula for skewness is:
  skewness = Σ (Yi − Ȳ)³ / [(N − 1) S³]
• Where Ȳ is the mean, S is the standard deviation, and N is the number of data points.
• The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.
Skewness, Kurtosis Cont…
• Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution.
• For univariate data Y1, Y2, ..., YN, the formula for kurtosis is:
  kurtosis = Σ (Yi − Ȳ)⁴ / [(N − 1) S⁴]
• The kurtosis for a normal distribution is three.
• For this reason, some use the following definition of kurtosis (often referred to as "excess kurtosis"):
  excess kurtosis = Σ (Yi − Ȳ)⁴ / [(N − 1) S⁴] − 3
• Positive kurtosis indicates a "peaked" distribution and negative kurtosis indicates a "flat" distribution.

Normality Test
• Normality tests assess the likelihood that the given data set comes from a normal distribution.
• It is an important aspect of statistics, as many procedures assume normality.
• Typically the null hypothesis H0 is that the observations are distributed normally with unspecified mean µ and variance σ².
• The alternative Ha is that the distribution is arbitrary.
• A great number of tests (over 40) have been devised for this problem; the more prominent of them are outlined below:

Normality Test Cont…
• The simplest method of assessing normality is to look at the frequency distribution histogram (i.e. symmetry, peakiness of the curve, modality of the distribution).
• The other option is the use of probability plots.
• Probability Plot is a graphical technique for comparing two datasets: either two sets of empirical observations, one empirical set against a theoretical set, or two theoretical sets against each other.
– Has two variants, QQ plot and PP plot.
– It is a common way of assessing normality, by comparing a given data set against the normal distribution.

Normality Test Cont…
• Quantile-Quantile Plot (QQ plot):
– Compares two probability distributions by plotting their quantiles against each other.
– If the two distributions being compared are similar, the points in the QQ plot will approximately lie on the line y = x.
• Probability-Probability plot (PP plot):
– Compares two probability distributions by plotting their cumulative distribution functions against each other.
Normality Test Cont…
• It is possible to assess normality of data objectively using statistical techniques. (Example: Kolmogorov-Smirnov test, Shapiro-Wilk test).
• In SPSS:
• Analyze > Descriptive Statistics > Explore > enter the variable under Dependent List > open Plots and check "Normality plots with tests" > Continue > OK.
• But such tests have serious limitations:
– Small samples almost always pass a normality test.
– With large samples, minor deviations from normality may be flagged as statistically significant.

Normal Distribution Cont…
Application of Normal distribution to calculate probability:
1. Area under the curve is 1.
2. Probability of x = a is zero.
3. Probability of x > a is the area between a and positive infinity.
4. Probability of x < a is the area between a and negative infinity.
5. Probability of b < x < a is the area between a and b.
6. The empirical rule of 68%, 95% and 99.7%.
But how can we compute the area???

Standard Normal Distribution
• Is a normal distribution with a mean of 0 and a standard deviation of 1.
• Any point (x) from a normal distribution can be converted to the standard normal distribution (Z) with the formula: Z = (x − mean) / standard deviation.
• The corresponding area can be calculated from a standard table.

Standard Normal Distribution Cont..
Example 4.15:
• If 1.4m is the height of a student where the mean for students of his age and sex is 1.2m with a standard deviation of 0.4,
– What is the corresponding Z value for the student?
– What is the probability to have a student with a height of more than 1.4?
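Instead of a printed standard normal table, the area under the curve can be computed from the error function in Python's `math` module. The sketch below applies the Z formula to Example 4.15; the function names are chosen here for illustration.

```python
from math import erf, sqrt

def z_score(x, mean, sd):
    """Convert a value from a normal distribution to the standard normal scale."""
    return (x - mean) / sd

def standard_normal_cdf(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * (1 + erf(z / sqrt(2)))

# Example 4.15: height 1.4 m, mean 1.2 m, standard deviation 0.4.
z = z_score(1.4, 1.2, 0.4)
p_taller = 1 - standard_normal_cdf(z)  # area to the right of z

print(f"Z = {z:.2f}")
print(f"P(height > 1.4) = {p_taller:.4f}")
```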
Standard Normal Distribution Cont..
Example 4.16:
• Assume the distribution of blood glucose level among medical students is normal with a mean of 90mg/dl and SD of 6mg/dl. Student X has a mean glucose level of 100mg/dl. Another student Y has a mean glucose level of 80mg/dl. According to this information attempt the following questions.
– What is the Z score for student X?
– What is the Z score for student Y?
– What is the probability of getting a mean glucose level less than 100mg/dl?
– What is the probability of getting a mean glucose level less than 80mg/dl?
– What range around the mean encompasses 68% of the observations?
– What is the probability for a student to have a blood glucose level between 100 and 105 mg/dl?

Standard Normal Distribution Cont..
Example 4.17:
• Among pregnant women having ANC follow-up in a hospital, WBC count follows a normal distribution with a mean of 8,000 and standard deviation of 800.
– What is the probability to get a WBC count of more than 10,000 in those pregnant women?
– What is the probability to get a WBC count between 7,500 and 10,000?

1. Suppose in BL Hospital the probability of a donated blood to be positive to Hepatitis B is 0.2. If we consider 4 randomly selected donated bloods, what is the probability that exactly 2 of the samples will be positive for Hepatitis B?
2. Suppose that systolic blood pressures follow a normal distribution with a mean of 108 and a SD of 14.
– About 95% of the blood pressures are between ____ & ____.
– About ______% of the blood pressures are between 66 & 150.
– What is the probability that a patient's BP is > 120?
– What is the probability that the patient's BP is b/n 110 & 130?
– What is the probability that a patient's BP is < 108?
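Questions like those in Examples 4.16 and 4.17 can also be answered with `statistics.NormalDist` from the Python standard library (Python 3.8 or later), which avoids converting to Z scores by hand. This is a sketch of one possible approach; the variable names are chosen here.

```python
from statistics import NormalDist

# Example 4.16: blood glucose, mean 90 mg/dl, SD 6 mg/dl.
glucose = NormalDist(mu=90, sigma=6)
p_below_100 = glucose.cdf(100)
p_100_to_105 = glucose.cdf(105) - glucose.cdf(100)

# Example 4.17: WBC count, mean 8,000, SD 800.
wbc = NormalDist(mu=8000, sigma=800)
p_wbc_above_10000 = 1 - wbc.cdf(10000)
p_wbc_7500_to_10000 = wbc.cdf(10000) - wbc.cdf(7500)

print(f"P(glucose < 100)        = {p_below_100:.4f}")
print(f"P(100 < glucose < 105)  = {p_100_to_105:.4f}")
print(f"P(WBC > 10,000)         = {p_wbc_above_10000:.4f}")
print(f"P(7,500 < WBC < 10,000) = {p_wbc_7500_to_10000:.4f}")
```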
Introduction to Demographic Methods and Health Service Statistics

What is Demography?
• "Demos" + "graphy"
• Is a discipline that studies human population with respect to size, composition, distribution, mobility and its variation with respect to all the above features and the causes of such variations and the effect of all these on health, social, ethical and economic conditions.
• Demography as a "method" and "data".
• Demography studies a population in "static" and "dynamic" aspects.
• Static aspects include characteristics at a point in time such as composition by Age, Sex, Race, Marital status etc.
• Dynamic aspects are Fertility, Mortality, Nuptiality, Migration and Growth.

Source of Demographic Data
• Demographic data can be acquired through three methods:
– Census
– Survey
– Vital Registration

Census
• Refers to the total process of collecting, compiling, analyzing, and publishing or otherwise disseminating demographic, economic, environmental, and social data pertaining to all persons in a country or in a well-delineated part of a country at a specified time.
• Census has the following characteristics:
– Universality
– Simultaneity
– Individual enumeration
– Regular interval
Census Cont.
• The first real census was conducted in the UK in 1841.
• However, there is evidence of large scale counting of population starting from the prehistoric period.
• Content of Census
– Demographic data
– Economic data
– Social data
– Mortality and Birth

Advantage and Disadvantage of Census
• Advantage
– It represents the whole population.
– Provides small area data.
– Provides population denominators.
– Serves as sampling frame for further studies.
• Disadvantage
– Size limits content and quality control efforts.
– Cost limits frequency.
– Delay between field work and results.
– Sometimes politicized.

Approaches to Census
De jure:
• The enumeration is according to the legal or customary place of residence.
• i.e. people are registered where they usually reside.
• Such type of counting gives information relatively unaffected by seasonal and temporary movements.
• However, this might not be accurate when a person's legal or customary residence is not known.
• Information collected from a person away from his/her usual residence can also be incomplete.
• It also creates risk of omission and double counting.

Approaches Cont…
De facto:
• The enumeration is according to physical residence at the time of the census.
• i.e. people are registered where they are currently staying/residing at the time of the census.
• This method is advantageous in the sense that it has less chance of double counting or omission.
• However, if it is applied in areas where there is a high level of migration and mobility, the result can be distorted.
Vital Registration (Civil Registration)
• Vital Registration is continuous registration of vital events as they happen.
• What are the vital events?
• Vital Registration is a relatively modern concept in its present format.
• The major purpose of vital registration is primarily administrative.
• Vital Registration has the following features:
– Continuity
– Universality

Advantages of Vital Registration
• Continuously monitors vital rates.
• Small area data available.
• Once a system is established, it would be cost effective.
• Can be used as a base for testing the accuracy of censuses and surveys.
• May provide both numerator and denominator for some rates.

Disadvantages of Vital Registration
• Uncertain coverage.
• Information may come from a third party.
• It is difficult to establish the system.
• It can easily be disrupted by political/economic events.

Survey
• Refers to the process of obtaining information from a sample representative of some population at a given point in time.
• Features of Survey:
– Representativeness.
– Smaller size.
– More in-depth information.
• How can we make it representative?
• Survey can be of two types:
– Single round retrospective survey
– Multi-round follow up survey
• The content of surveys varies widely.
Advantage and Disadvantage of Survey
• Advantages:
– Quick and inexpensive.
– Gives detailed data.
– Follow up can be achieved.
• Limitations:
– Small area data might not be available.
– Perfect representativeness is difficult to achieve.
– A survey can only be focused on a few thematic areas.

Demographic Transition
• Conceptual framework to explain population change over time.
• Developed by American demographer Warren Thompson, 1929.
• Based on observed changes in birth and death rates in industrialized societies over the past two hundred years.
• Demographic change has three stages.

Demographic Transition Cont…
• Stage I: Characterized by high and fluctuating mortality, high fertility and low population growth.
• Stage II: Characterized by the beginning of mortality decline followed by fertility decline. This is the period of rapid population growth.
• Stage III: Characterized by low mortality and low and fluctuating fertility; growth slows down and eventually reaches a no-growth stage.
• Developed countries started the second stage at the beginning of the eighteenth century. Less developed countries began the transition later.
Important Indicators of Composition of a Population
1. Sex Ratio: Is the total number of male population per 1000 female population. This can be expressed as Y to 1000, Y:1 or Y/X, where Y is the number of males and X is the number of females.
2. Dependency Ratio: Describes the ratio between non productive (age 0-14 and 65+) and productive (15-64) age groups in a given place and time.
3. Child to Women Ratio: This is the ratio of the number of children under five to the number of women of reproductive age in a given place and time. It can also be used as a measure of fertility.
4. Population Pyramid:

Population Pyramid
• A graphical illustration that shows the distribution of various age groups in a population.
• Normally forms the shape of a pyramid.
• Consists of two back-to-back bar graphs, with the population plotted on the X-axis and age on the Y-axis.
• One shows the number of males and one shows the number of females in a particular population in five-year age groups.
• Males are shown on the left and females on the right.
Vital Statistics
• Among the focus areas of demography, some issues are more important and applicable in public health.
• Especially the measures of mortality and fertility are vital inputs to the health system, so they are called Vital Statistics.

Measures of Fertility
• Crude Birth Rate (CBR): The number of live births in a year per 1000 mid year population in the same year.
  CBR = (Total number of live births in a year / Mid year population in the same year) x 1000

Measures of Fertility Cont..
• General Fertility Rate (GFR): The number of live births in a year per 1000 mid year women of reproductive age.
  GFR = (Total number of live births in a year / Mid year female population aged 15−49 yrs in the same year) x 1000

Measures of Fertility Cont..
• Age Specific Fertility Rate (ASFR): Refers to the number of live births in a year per 1000 women of reproductive age in a given age or age group.
• Usually ASFR is calculated for the following 7 age groups of 5 years each: 15-19 yrs, 20-24 yrs, 25-29 yrs, 30-34 yrs, 35-39 yrs, 40-44 yrs, 45-49 yrs.
  ASFR = (Total no of live births to women of a given age group during a year / Mid year female population for the same age group in the same year) x 1000
Measures of Fertility Cont..

Age category  15-19  20-24  25-29  30-34  35-39  40-44  45-49
ASFR            104    228    241    231    160     84     34

Measures of Fertility Cont..
• Total Fertility Rate (TFR): The number of children a woman is expected to have at the end of her reproductive age, given that the current ASFRs are maintained.
• It is the average number of children a woman has in a given study area.
• Mathematically, it is the sum of all ASFRs from 15-49 yrs.
• TFR for data given in the usual 5 years age category is provided as:
  TFR = 5 x Σ (i = 1 to 7) ASFRi

Measures of Fertility Cont..
• Gross Reproduction Rate (GRR): Is the total fertility rate restricted to female births only.
  GRR = TFR x Proportion of female births x 1000

Measures of Fertility Cont..
• Child Ever Born (CEB):
• Total number of children a woman has ever given birth to.
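The TFR formula can be applied to the ASFR table shown in this section. The sketch below assumes the tabulated ASFRs are expressed per 1000 women, as in the ASFR definition, so the result is divided by 1000 to express it per woman.

```python
# ASFR per 1000 women for the seven 5-year age groups in the table above.
asfr_per_1000 = {
    "15-19": 104, "20-24": 228, "25-29": 241, "30-34": 231,
    "35-39": 160, "40-44": 84, "45-49": 34,
}

AGE_INTERVAL_YEARS = 5  # each ASFR applies to five single years of age

# TFR = 5 x sum of the ASFRs; still per 1000 women at this point.
tfr_per_1000_women = AGE_INTERVAL_YEARS * sum(asfr_per_1000.values())

# Assumption: dividing by 1000 expresses the TFR as children per woman.
tfr_per_woman = tfr_per_1000_women / 1000

print(f"TFR = {tfr_per_1000_women} per 1000 women = {tfr_per_woman:.2f} per woman")
```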
Measures of Fertility Cont..
Example 5.1:
• Calculate ASFR, TFR, GFR, CBR from the following data.

Age category   Women of reproductive age   Live births   ASFR
15-19          15,600                      1596
20-24          14,400                      3300
25-29          13,300                      3210
30-34          12,200                      2830
35-39          11,600                      1860
40-44          10,100                      850
45-49          9,200                       320
Total          86,400                      13,966

Measures of Mortality
• Crude Death Rate (CDR): Refers to the total number of deaths in a given area, usually in a year, per 1000 mid year population.
  CDR = (Total number of deaths per year / Mid year population) x 1000

Measures of Mortality
• Age Specific Death Rate (ASDR): Quantifies deaths occurring in a defined age category in a given area per 1000 mid year population of the same age category.
  ASDR = (No of deaths in a given age category in a year / Mid year population of that age category in the same year) x 1000
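Part of Example 5.1 can be worked through in code. The sketch below uses the women and live-birth counts as read from the example table (15,600 down to 9,200 women, total 86,400) and computes ASFR, GFR and TFR; CBR is not computed because the table does not give the total mid year population. The TFR is expressed per woman, which assumes the per-1000 ASFRs are divided by 1000.

```python
age_groups = ["15-19", "20-24", "25-29", "30-34", "35-39", "40-44", "45-49"]
women = [15600, 14400, 13300, 12200, 11600, 10100, 9200]
live_births = [1596, 3300, 3210, 2830, 1860, 850, 320]

# ASFR: live births per 1000 women in each age group.
asfr = [1000 * b / w for b, w in zip(live_births, women)]

# GFR: all live births per 1000 women aged 15-49.
gfr = 1000 * sum(live_births) / sum(women)

# TFR: 5 x sum of ASFRs, converted from per 1000 women to per woman.
tfr = 5 * sum(asfr) / 1000

for group, rate in zip(age_groups, asfr):
    print(f"ASFR {group}: {rate:.1f}")
print(f"GFR = {gfr:.1f} per 1000 women, TFR = {tfr:.2f} children per woman")
```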
Measures of Mortality
• Neonatal Mortality Rate (NMR): Refers to the number of deaths before the age of 28 days (neonatal period) in a year out of 1000 live births in the same year.
• Infant Mortality Rate (IMR): Refers to the number of deaths before the age of 1 year (infancy period) in a year out of 1000 live births in the same year.
• Child Death Rate (ChDR): Quantifies the probability of dying between the age of one and five years per 1000 live births in a given year.
• Under Five Mortality Rate (U5MR): Quantifies the probability of dying between birth and age five per 1000 live births in a given year.

Measures of Mortality
• Cause Specific Mortality Rate (CSMR):
  CSMR = (No of deaths secondary to a given cause in a year / Population at risk) x 1000
• Cause Specific Death Ratio (Proportionate Mortality Ratio):
  Proportionate Mortality Ratio = (No of deaths secondary to a cause in a year / Total no of deaths in the same year) x 1000

Measures of Mortality
• Maternal Mortality Ratio:
  MMRo = (Number of maternal deaths in a given year / Total number of live births in the same year) x 100000
• Maternal Mortality Rate:
  MMRa = (Number of maternal deaths in a given year / Total number of women of reproductive age in the same year) x 100000

Measures of Migration
• Crude In-Migration Rate: Number of in-migrants (I) per 1,000 population in a given year.
• Crude Out-Migration Rate: Number of out-migrants (O) per 1,000 population in a given year.
• Crude Net Migration Rate: Difference between the number of in-migrants (I) and the number of out-migrants (O) per 1000 population in a given year.
Measures of Marriage
• Crude Marriage Rate: Number of marriages (M) per 1000 population in a given year.
• General Marriage Rate: Number of marriages (M) per 1000 population aged 15 and older in a given year.

Measure of Population Growth and Projection
• Crude Rate of Natural Increase (r):
  r = CBR − CDR
• Population Projection:
  Pt = Po (1 + r)^t
• Population Doubling Time:
  t = log 2 / log (1 + r)

Health Service Statistics
• Data generated from the health system itself.
• Advantages:
– Gives morbidity information.
– Identify priority health problems in the area.
– Determine met and unmet health need.
– Determine success or failure of a specific health care program.
– Assess utilization of health service.

Health Service Statistics Cont..
• Limitations
– Lack of completeness
– Lack of representativeness to the general community
– Lack of denominators
– Lack of uniformity
– Lack of quality
– Lack of compliance with reporting
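The population projection and doubling time formulas given under "Measure of Population Growth and Projection" can be sketched as follows. The CBR, CDR and starting population below are hypothetical values chosen only for illustration, and the sketch assumes r must be converted from "per 1000" to a proportion before being used in the formulas.

```python
from math import log

def project_population(p0, r, t):
    """Pt = Po (1 + r)^t, with r as a proportion per year and t in years."""
    return p0 * (1 + r) ** t

def doubling_time(r):
    """t = log 2 / log (1 + r), in years."""
    return log(2) / log(1 + r)

# Hypothetical inputs (not from the slides): rates per 1000 population.
cbr, cdr = 40, 15
r = (cbr - cdr) / 1000      # assumption: convert per-1000 rate to a proportion
p0 = 1_000_000              # hypothetical starting population

print(f"r = {r:.3f}")
print(f"Population after 10 years: {project_population(p0, r, 10):,.0f}")
print(f"Doubling time: {doubling_time(r):.1f} years")
```

Projecting over exactly the doubling time returns twice the starting population, which is a quick consistency check on the two formulas.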
Health Service Statistics Cont..
1. Relative Frequency of a Disease:
  Relative Frequency of a given disease = (No of patients diagnosed with a specific disease / Total number of health institution visits) x 100%
2. Cure Rate:
• Quantifies the proportion of patients who have been cured of a disease condition using a treatment modality out of 100 patients who received a similar type of treatment.
• The term "Success Rate" can be used if the measured parameter is a procedure.
  Cure Rate = (No of cured patients of a given disease using a treatment modality / Number of patients who received the treatment) x 100%

Health Service Statistics Cont..
3. Admission Rate:
• Quantifies the proportion of admissions among patients who visited the health institution in a given period of time.
  Admission Rate = (No of patients admitted to a health institution / Total number of patients who visited the institution) x 100%
4. Hospital Death Rate:
• Quantifies the proportion of deaths among hospitalized patients in a given period of time.
  Hospital Death Rate = (No of deaths among hospitalized patients / Total no of admissions) x 100%

Health Service Statistics Cont..
5. Bed Occupancy Rate:
• Quantifies the percentage occupancy of hospital beds in a year.
  BOR = [Annual number of hospitalized patient days / (365 x total number of beds)] x 100%
6. Average Length of Stay:
• Quantifies the average duration (in days) of hospitalized patients.
  ALS = Annual number of hospitalized patient days / Number of discharges or deaths

Sampling Method
Why Sampling?
• Sampling is that part of statistical practice concerned with the selection of individual observations intended to yield reasonable knowledge about a population of concern, especially for the purposes of statistical inference.
• Study population Vs Target (Source) (Reference) Population.
• Parameter: A descriptive measure computed from the data of the source population.
• Statistic: A descriptive measure computed from the data of a sample.
• The issues of adequate sample size and representative sampling technique are important for correct estimation of the parameter using a statistic.

Why Sampling?
• Researchers rarely survey the entire population for two reasons: (1) the cost is too high and (2) the population is dynamic.
• Main advantages of sampling: (1) the cost is lower, (2) data collection is faster, and (3) it is possible to ensure accuracy and quality of the data because the dataset is smaller.
• Main disadvantage of sampling:
– Non representativeness (sampling error)

Sampling
Important terms:
• Sampling Unit: Is the unit of selection in the sampling process.
• Study Unit: The unit on which information is collected.
• Sampling Frame: The list of all the units in the source population from which a sample is to be taken.
• Sampling Fraction: The ratio of the number of units in the sample to the number of units in the source population (n/N). Its reciprocal (N/n) is the Sampling Interval.
3/1/2010
Types of Sampling
• Probability Sampling: Every unit in the population has a known, nonzero probability of being sampled, and the process involves random selection.
• Nonprobability Sampling: Any sampling method where some elements of the population have no chance of selection or where the probability of selection cannot be accurately determined.
Probability Sampling
– Simple Random Sampling (SRS)
– Systematic Random Sampling
– Stratified Sampling
– Cluster Sampling
– Multistage Sampling
A. Simple Random Sampling (SRS)
• Is the purest (the most representative) form. Each member of the population has an equal, nonzero and known chance of being selected.
• This could be accomplished by writing each study unit's name on a slip of paper and selecting an adequate number of them using the Lottery Method.
• It can also be done by assigning a number to each sampling unit; samples are then selected using a Table of Random Numbers or computer packages.
How to use table of random numbers
1. Number each member of the population.
2. Determine population size (N).
3. Determine sample size (n).
4. Determine starting point in table by randomly picking a page and dropping your finger on the page with your eyes closed.
5. Choose a direction to read (to the left, right, down or up).
6. Select the first n numbers read from the table whose last digits are between 0 and N.
7. Once a number is chosen, do not use it again.
8. If you reach the end of the table before obtaining your n numbers, pick another starting point, read in a different direction, and continue until done.
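The same selection can be done by a computer package instead of a printed table. The following is a minimal Python sketch, not part of the original slides; the function name, the population size of 500 and the sample size of 20 are illustrative assumptions.

```python
import random


def simple_random_sample(population_size, sample_size, seed=None):
    """Draw sample_size distinct unit numbers from 1..population_size."""
    rng = random.Random(seed)  # a seed makes the draw reproducible
    # random.sample draws without replacement, so no number is used twice (step 7).
    return sorted(rng.sample(range(1, population_size + 1), sample_size))


# Hypothetical population of N = 500 numbered units, sample of n = 20.
chosen = simple_random_sample(population_size=500, sample_size=20, seed=1)
print(chosen)
```

Every unit number has the same chance of appearing in the result, which is the defining property of SRS.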
Simple Random Sampling Cont…
Limitation of SRS
• Requires sampling frame.
• Takes longer time.

Simple Random Sampling Cont…
• When a large dataset is available in a database, statistical packages can select a sample of a given size randomly.
• In SPSS:
– Data > Select Cases > Random > complete the dialogue box accordingly.
• In Excel:
– Tools > Data Analysis > Sampling > complete the dialogue box accordingly.

B. Systematic Random Sampling
• Selects units at a fixed interval throughout the sampling frame after a random start.
• The steps are:
– Number the units in the population from 1 to N.
– Decide on the n (sample size) that you need.
– Calculate the sampling interval k (k = N/n).
– Randomly select an integer between 1 and k.
– Then take every kth unit.

Systematic Random Sampling Cont.
• Advantage:
– It is easier and less time consuming to perform.
– In rare cases it can be conducted without a sampling frame.
• Disadvantage:
– Can be biased when there is a cyclic pattern in the order of the subjects.
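The systematic sampling steps (number the units, compute the interval k = N/n, pick a random start between 1 and k, take every kth unit) can be sketched in Python as follows. This is only an illustration: the function name and the values N = 1000 and n = 50 are assumptions, and k is rounded down when N is not a multiple of n.

```python
import random


def systematic_sample(population_size, sample_size, seed=None):
    """Return unit numbers chosen at a fixed interval after a random start."""
    if not 0 < sample_size <= population_size:
        raise ValueError("sample_size must be between 1 and population_size")
    rng = random.Random(seed)
    k = population_size // sample_size   # sampling interval k = N / n (rounded down)
    start = rng.randint(1, k)            # random integer between 1 and k, inclusive
    return [start + i * k for i in range(sample_size)]


# Hypothetical frame of N = 1000 units, sample of n = 50, so k = 20.
units = systematic_sample(population_size=1000, sample_size=50, seed=7)
print(units[:5])
```

Only the starting point is random; if the frame is ordered in a cycle whose length matches k, the resulting sample can be biased, as noted above.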
C. Stratified Sampling
• Applied when the source population is heterogeneous on a variable of interest.
• The population is first divided into classes (strata).
• Then a separate sample is taken from each stratum using Simple or Systematic Random Sampling techniques.
• The number taken from each stratum might be equal (Non Proportional Stratified Sampling) or determined based on the proportion of each class in the source population (Proportional Stratified Sampling).

Stratified Sampling Cont…
• Advantage: improves representativeness of the sample (Proportional Stratified Sampling) or creates reasonable comparison among strata (Non Proportional Stratified Sampling).
• Limitation: Requires a separate sampling frame for each stratum.

D. Cluster Sampling
• Is a sampling method applied when the source population is composed of “natural” groups.
• Assuming the groups are homogenous among each other, cluster sampling selects a few groups (clusters) from the population as the Primary Sampling Unit (PSU).
• Then the required information is collected from all elements, the Secondary Sampling Units (SSU), within each selected group.

Cluster Sampling Cont.
• Advantage:
– It doesn’t require the sampling frame of the SSU.
– Requires less time and resource.
• Disadvantage:
– Relies on the assumption of homogeneity among clusters.
– Less control on sample size.
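For proportional stratified sampling, the number drawn from each stratum follows the stratum's share of the source population. A minimal Python sketch of that allocation is shown below; the stratum names and sizes are hypothetical, and because of rounding the allocated numbers may differ from the target total by a unit or two in other datasets.

```python
def proportional_allocation(stratum_sizes, total_sample):
    """Split total_sample across strata in proportion to stratum size."""
    population = sum(stratum_sizes.values())
    return {
        name: round(total_sample * size / population)
        for name, size in stratum_sizes.items()
    }


# Hypothetical source population with two strata and a total sample of 200.
strata = {"urban": 6000, "rural": 4000}
allocation = proportional_allocation(strata, total_sample=200)
print(allocation)
```

Within each stratum, the allocated number of units would then be drawn by simple or systematic random sampling.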
E. Multistage Sampling
• Is like cluster sampling, but involves selecting a sample within each chosen cluster, rather than including all units in the cluster.
• Thus, multistage sampling involves selecting a sample in at least two stages.
• The advantage is that it is simpler than SRS.
• The disadvantage is that as the “number of stages” increases, sampling error inflates.

Probability Proportional to Size Sampling Technique
• PPS is a variant of the cluster sampling technique.
• Probability of selecting a sampling unit (e.g. village, zone, district, health center) is proportional to the size of its population.
• Useful when the sampling units vary considerably in size.
Involves the following procedures:
• List all clusters with their respective source population size and cumulative frequency.
• Decide the number of clusters (a) which will be included in the study.

PPS Cont…
• Decide the number of individuals which will be studied per one selection of a cluster (b).
• Divide the total population by the number of clusters to be studied. This will give you the sampling interval (SI).
• Choose a number between 1 and the SI at random. This is the Random Start (RS) point.
• Calculate the following series: RS, RS + SI, RS + 2SI, ..., RS + (a-1)SI.
• Based on the cumulative frequency, identify at which clusters the selected numbers fall.
• For every selection of a cluster, select b individuals at random from it. Note that if a cluster is selected twice, 2b individuals should be selected at random.

Nonprobability Sampling
• Is used when there is no sampling frame or when it is impossible to conduct probability sampling due to economical and feasibility factors.
• Here, the sample is less likely to be representative of the population, thus it is difficult to extrapolate from the sample to the population.
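The PPS procedure described earlier (cumulative sizes, sampling interval SI, random start RS, and the series RS, RS + SI, ...) can be sketched in Python as below. This is an illustration under assumptions: the village names and sizes, a = 3 selections and b = 30 individuals per selection are all hypothetical, and SI is rounded down to a whole number.

```python
import random
from bisect import bisect_left
from collections import Counter
from itertools import accumulate


def pps_select(cluster_sizes, n_selections, seed=None):
    """Return the clusters hit by the series RS, RS + SI, ..., RS + (a-1)SI."""
    names = list(cluster_sizes)
    cumulative = list(accumulate(cluster_sizes[name] for name in names))
    interval = cumulative[-1] // n_selections      # sampling interval (SI)
    if interval < 1:
        raise ValueError("more selections than population members")
    rng = random.Random(seed)
    start = rng.randint(1, interval)               # random start (RS)
    points = [start + i * interval for i in range(n_selections)]
    # A point falls in the first cluster whose cumulative size reaches it.
    return [names[bisect_left(cumulative, p)] for p in points]


villages = {"A": 1200, "B": 300, "C": 2500, "D": 800, "E": 200}  # hypothetical sizes
hits = pps_select(villages, n_selections=3, seed=3)
b = 30  # individuals per selection of a cluster
individuals_per_cluster = {name: count * b for name, count in Counter(hits).items()}
print(individuals_per_cluster)
```

A large cluster such as "C" can be hit more than once, in which case the dictionary shows 2b individuals for it, matching the note in the procedure.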
Nonprobability Sampling Cont…
• Judgmental or Purposive Sampling: The researcher chooses the sample based on who he/she thinks would be appropriate for the study.
• Convenience Sampling: The selection of units from the population is based on availability and/or accessibility.
• Quota Sampling: It starts with systematically setting a “Quota” to represent subgroups of a population. Then data is collected to meet the predefined Quota.
• Snowball Sampling: The researcher begins by identifying someone who meets the inclusion criteria of the study. Then the study subject is asked to recommend others he/she may know who also meet the criteria.
Sampling Error
• Sampling error or estimation error is the part of the total error or uncertainty caused by observing a sample instead of the whole population.
• Nonsampling errors such as nonresponse and reporting errors may also affect the outcome of a sample based study.
• Theoretically, sampling error is the value estimated from a sample minus the population value.
• Unlike bias, sampling error can be predicted, calculated, and accounted for.
• There are several measures of sampling error.
Sampling Error Cont…
1. Standard error
• Is a measure of the variability of an estimate due to sampling.
• It indicates the extent to which an estimate derived from a sample survey can be expected to deviate from the population value.
• Depends upon the underlying variability in the population for the characteristic as well as the sample size used for the survey.
• The standard error is a foundational measure from which other sampling error measures are derived.
Sampling Error Cont…
2. Confidence intervals:
• A range that is expected to contain the population value of the characteristic with a known probability.
3. Margin of error:
• Is a measure of the precision of an estimate at a given level of confidence.
4. Coefficient of variation:
• The relative amount of sampling error in comparison with a sample estimate.
• CV = SE / Estimate × 100%
• There are no hard and fast rules to define an acceptable level.
• The smaller the CV, the more reliable the estimate.
Sampling Error Cont…
5. P values:
• The P value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true.
Importance of such measures:
• To indicate the statistical reliability and usability of estimates.
• To make comparisons between estimates.
• To conduct tests of statistical significance.
• To help users draw appropriate conclusions about data.
Exercise 1
• A medical practitioner wanted to assess the quality of the family planning service offered in a hospital. Accordingly, he conducted exit interviews with those women whose ID number was a multiple of five. What sampling method is employed?
Exercise 2
• A medical practitioner wanted to assess the prevalence of malnutrition among under five children in a woreda. Assuming all kebeles in the woreda are similar, he included all under five children in two randomly selected kebeles.
– What sampling method is employed?
– What possible limitation do you expect?
Exercise 3
• A medical practitioner wanted to assess the prevalence of malnutrition among under five children in a woreda. Assuming the problem is different across the three agroecological zones in the woreda, he included children from 2 kebeles each from Kolla, Dega and Woynadega.
– What sampling method is employed?
– What possible limitation do you expect?
Exercise 4
• A researcher wanted to study the prevalence of drug addiction among adolescents in Addis Ababa. First he randomly selected Bole sub city. Then he selected woreda 17 at random from all woredas in Bole sub city. Finally he conducted his study in Kebele 19 (after random selection).
– What sampling method is employed?
– What possible limitation do you expect?
– If woreda 17 was selected because of its proximity to the organization of the researcher, what would have been the sampling method?
Sampling Distribution and Estimation
Estimation
• Estimation refers to the process by which one makes inferences about a population, based on information obtained from a sample.
• Can be of two types:
– Point Estimation
– Interval Estimation
Point Estimate
• Point Estimate: A point estimate of a population parameter is a single value of a statistic.
• The following table gives commonly used point estimators. [table not reproduced]
Interval Estimate
• An interval estimate is defined by two numbers, between which a population parameter is said to lie.
• For example, $a < \mu < b$ is an interval estimate of the population mean $\mu$,
• i.e. the population mean is greater than a but less than b.
• An interval estimate is preferred to a point estimate as it considers the precision and uncertainty of estimation.

Interval Estimate Cont…
• An interval estimate has three components (concepts):
– A statistic (the point estimator)
– A margin of error (the measure of precision)
– A confidence level (the measure of uncertainty)
• The interval estimate at a given confidence level is defined by the sample statistic ± margin of error.

Interval Estimate Cont…
Margin of Error
• In a confidence interval, the range of values above and below the sample statistic is called the margin of error.
• It measures the precision of a sampling method.
• It is a function of the confidence level and another parameter called the standard error.

Interval Estimate Cont…
• Confidence Level
– The probability part of the interval.
– It describes how strongly we believe that a particular sampling method will produce an interval that includes the true population parameter.
– 90, 95, and 99% confidence intervals are commonly used.
– For example, 95% CI means: if we used the same sampling method to select different samples and compute different interval estimates, the true population mean would fall within a range defined by the sample statistic ± margin of error 95% of the time.
Interval Estimate Cont…
• Example 6.1:
– A local newspaper conducts an election survey and reports that the independent candidate will receive 30% of the vote. The newspaper states that the survey had a 5% margin of error and a confidence level of 95%.
– Meaning: We are 95% confident that the independent candidate will receive between 25% and 35% of the vote.

CI for a single mean
• Background Concept: Sampling Distribution of Means.
– One can generate the sampling distribution of means in the following manner:
– Obtain a sample of n observations selected completely at random from a large population. Determine their mean and then replace the observations in the population.
– Repeat the sampling procedure indefinitely.
– The result is a series of means of samples of size n.
– If each mean in the series is now treated as an individual observation and arranged in a frequency distribution, one comes up with the sampling distribution of means of samples of size n.

CI for a single mean cont.
• The sampling distribution of means has the following properties:
1. The mean of the sampling distribution of means is the same as the population mean.
2. The SD of the sampling distribution of means (which is called the standard error of the mean) is: $\sigma_{\bar{X}} = \sigma / \sqrt{n}$
3. The sampling distribution of means is approximately a normal distribution, regardless of the original distribution, provided n is large. (Central Limit Theorem)

CI for a single mean cont.
$\Pr\left(-1.96 \le \dfrac{\bar{X} - \mu}{\sigma / \sqrt{n}} \le 1.96\right) = 0.95$
$\Pr\left[\bar{X} - 1.96(\sigma / \sqrt{n}) \le \mu \le \bar{X} + 1.96(\sigma / \sqrt{n})\right] = 0.95$
$95\%\ \text{CI for } \mu = \bar{X} \pm 1.96(\sigma / \sqrt{n})$
• The general formula is: $\text{CI for } \mu = \bar{X} \pm Z_{\alpha/2}(\sigma / \sqrt{n})$
• CI = Sample statistic ± Z value × SE
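The general formula, sample statistic ± Z value × SE, can be checked numerically. The sketch below is an illustration only: the function name is an assumption, and the input values (mean 85 mg/dl, population SD 15 mg/dl, n = 100) are those of the blood glucose example that follows.

```python
from math import sqrt
from statistics import NormalDist


def ci_mean_known_sigma(sample_mean, sigma, n, confidence=0.95):
    """Confidence interval for a mean when the population SD is known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # about 1.96 for 95%
    margin = z * sigma / sqrt(n)                        # Z value x standard error
    return sample_mean - margin, sample_mean + margin


low, high = ci_mean_known_sigma(sample_mean=85, sigma=15, n=100)
print(round(low, 2), round(high, 2))  # roughly 82.06 and 87.94
```

Here the standard error is 15/√100 = 1.5, so the margin of error is about 1.96 × 1.5 = 2.94 mg/dl.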
CI for a single mean cont.
Example 6.2:
• The mean blood glucose level of 100 randomly selected healthy adults is 85 mg/dl. Find the 95% CI for the mean blood glucose level for all healthy adults (µ) given the standard deviation for the population is 15 mg/dl.

CI for a single mean cont.
• However, when the population variance is unknown and the sample size is less than 30:
– Sample variance should replace population variance.
– Student t distribution should be used in the place of standard normal distribution.
– Hence the formula would be: $\mu = \bar{X} \pm t_{\alpha/2,\,(n-1)}\,(s / \sqrt{n})$

CI for difference between two means
• Background Concept: The Sampling distribution of Difference of Means.
– Consider two different populations X and Y.
– The first population has mean of $\mu_x$ and standard deviation of $\sigma_x$.
– The second population has mean of $\mu_y$ and standard deviation of $\sigma_y$.
– From the first population take a sample of size $n_x$ and compute its mean $\bar{X}$.
– From the second population take a sample of size $n_y$ and compute its mean $\bar{Y}$.
– Then determine $\bar{X} - \bar{Y}$.

CI for difference between two means cont…
• Do the same for all pairs of samples that can be chosen independently from the two populations.
• The differences $\bar{X} - \bar{Y}$ are a new set of scores which form the sampling distribution of differences of means.
CI for difference between two means cont…
Properties of the sampling distribution of differences of means:
1. The mean of the sampling distribution of differences of means equals the difference of the population means ($\mu_1 - \mu_2$).
2. The SD of the sampling distribution of differences of means (SE) is equal to: $\sigma_{(\bar{X} - \bar{Y})} = \sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
3. The distribution is approximately normally distributed.

CI for difference between two means cont…
$\Pr\left(-1.96 < \dfrac{(\bar{X} - \bar{Y}) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2/n_1 + \sigma_2^2/n_2}} < 1.96\right) = 0.95$
$\Pr\left[(\bar{X} - \bar{Y}) - 1.96\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}} \le (\mu_1 - \mu_2) \le (\bar{X} - \bar{Y}) + 1.96\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}\right] = 0.95$
$95\%\ \text{CI of } \mu_1 - \mu_2 = (\bar{X} - \bar{Y}) \pm 1.96\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$
• In general: $\mu_1 - \mu_2 = (\bar{X} - \bar{Y}) \pm Z_{\alpha/2}\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}$

CI for difference between two means cont…
Example 6.3:
• A randomly selected 120 HIV patients who were on ART had lived on average for 25 years, with SD of 5 years, since their diagnosis for the virus was made. Similarly, a randomly selected 140 HIV patients who were not on ART had lived on average for 14 years with SD of 4 years.
• Calculate the point estimate for the difference between the population means.
• Find the 95% CI for the difference between the means.

CI for single proportion
• Background Concept: The Sampling distribution of Proportions
• Here we are interested in the proportion of the population that has a certain characteristic, represented by P or $\pi$.
• If we take an indefinite number of random samples of n observations and calculate p for all samples, then we will have the sampling distribution of proportions.
• The sampling distribution of proportions has the following characteristics:
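The interval for the difference between two means in Example 6.3 can be worked through in a few lines of Python. This is a sketch rather than an official solution; it assumes that, with samples this large, the sample SDs (5 and 4 years) can stand in for the population SDs in the formula.

```python
from math import sqrt
from statistics import NormalDist


def ci_difference_of_means(mean1, sd1, n1, mean2, sd2, n2, confidence=0.95):
    """Point estimate and CI for mu1 - mu2 using the normal approximation."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = mean1 - mean2                        # point estimate of mu1 - mu2
    se = sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)    # SE of the difference
    return diff, diff - z * se, diff + z * se


# Example 6.3: on ART (n=120, mean 25, SD 5) vs not on ART (n=140, mean 14, SD 4).
diff, low, high = ci_difference_of_means(25, 5, 120, 14, 4, 140)
print(diff, round(low, 2), round(high, 2))  # 11, roughly 9.89 to 12.11
```

Because the interval does not include 0, the data are not compatible with equal mean survival in the two groups at the 95% confidence level.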
CI for single proportion cont…
• The sampling distribution of proportions has the following properties:
1. The mean of the sampling distribution of proportions = $\pi$
2. The SD (SE) of the sampling distribution of proportions: $\sigma_P = \sqrt{\dfrac{P(1 - P)}{n}}$ (in practice P is estimated by the sample proportion p)
3. The distribution is approximately normally distributed.

CI for single proportion cont.
$\Pr\left(-1.96 < \dfrac{p - \pi}{\sqrt{P(1 - P)/n}} < 1.96\right) = 0.95$
$\Pr\left[p - 1.96\sqrt{\dfrac{P(1 - P)}{n}} \le \pi \le p + 1.96\sqrt{\dfrac{P(1 - P)}{n}}\right] = 0.95$
$95\%\ \text{CI for } \pi = p \pm 1.96\sqrt{\dfrac{P(1 - P)}{n}}$
• In general: $\pi = p \pm Z_{\alpha/2}\sqrt{\dfrac{P(1 - P)}{n}}$

CI for single proportion cont.
Example 6.4:
• In Addis Ababa, blood tests of 120 randomly selected commercial sex workers revealed that 30 of them are HIV positive. What will be the 99% confidence interval of HIV/AIDS prevalence for all commercial sex workers in the city?

CI for difference between two proportions
• Consider two different populations X and Y.
• The first population has proportion of $\pi_x$ and the second population has proportion of $\pi_y$.
• From the first population take a sample of size $n_x$ and compute its sample proportion $p_x$. From the second population take a sample of size $n_y$ and compute its sample proportion $p_y$.
• Then determine $p_x - p_y$.
• Do the same for all pairs of samples that can be chosen independently from the two populations.
• The differences $p_x - p_y$ are a new set of scores which form the sampling distribution of differences of proportions.
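For the single-proportion interval, Example 6.4 can be worked through as follows. The sketch is illustrative, not an official solution; it uses the sample proportion p in the standard error and the exact normal quantile for 99% (about 2.576, where the slides' table rounds to 2.58).

```python
from math import sqrt
from statistics import NormalDist


def ci_proportion(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a single proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = successes / n                 # sample proportion
    se = sqrt(p * (1 - p) / n)        # SE with p used in place of P
    return p, p - z * se, p + z * se


# Example 6.4: 30 HIV positive out of 120 sampled, 99% confidence level.
p, low, high = ci_proportion(30, 120, confidence=0.99)
print(p, round(low, 3), round(high, 3))  # 0.25, roughly 0.148 to 0.352
```

In words, the prevalence among all commercial sex workers in the city is estimated to lie between about 15% and 35% with 99% confidence.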
CI for difference between two proportions cont…
The sampling distribution of differences of proportions has the following properties:
1. The mean of the sampling distribution of differences of proportions equals the difference of the population proportions ($\pi_1 - \pi_2$).
2. The SD (SE) is given as: $\sigma_{(p_1 - p_2)} = \sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}$
3. The distribution is approximately normally distributed.

CI for difference between two proportions cont…
$\Pr\left(-1.96 < \dfrac{(p_1 - p_2) - (\pi_1 - \pi_2)}{\sqrt{p_1(1 - p_1)/n_1 + p_2(1 - p_2)/n_2}} < 1.96\right) = 0.95$
$\Pr\left[(p_1 - p_2) - 1.96\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}} \le (\pi_1 - \pi_2) \le (p_1 - p_2) + 1.96\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}\right] = 0.95$
$95\%\ \text{CI for } \pi_1 - \pi_2 = (p_1 - p_2) \pm 1.96\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}$
• In general: $\pi_1 - \pi_2 = (p_1 - p_2) \pm Z_{\alpha/2}\sqrt{\dfrac{p_1(1 - p_1)}{n_1} + \dfrac{p_2(1 - p_2)}{n_2}}$

CI for difference between two proportions cont…
• Example 6.5:
• Among 200 randomly selected illiterate married women, 50 use contraceptives. Similarly, among 300 randomly selected married women who can read and write, 150 use contraceptives.
• Calculate the point estimate for the difference between the population proportions.
• Find the 95% CI for the difference between the proportions.

CI for OR and RR
• When the intention of a measurement of association is to make inference about a population parameter, the CI for OR or RR can be calculated using the following formulas.
$\text{CI for OR} = \exp\left[\ln(OR) \pm Z_{\alpha/2}\sqrt{\dfrac{1}{a} + \dfrac{1}{b} + \dfrac{1}{c} + \dfrac{1}{d}}\right]$
$\text{CI for RR} = \exp\left[\ln(RR) \pm Z_{\alpha/2}\sqrt{\dfrac{1 - \frac{a}{a + b}}{a} + \dfrac{1 - \frac{c}{c + d}}{c}}\right]$
• Why do we need natural logarithm here?
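The OR and RR formulas can be tried on a 2×2 table with a short Python sketch. The counts below are hypothetical, and the cell layout is an assumption: a = exposed with the outcome, b = exposed without, c = unexposed with the outcome, d = unexposed without.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def ci_odds_ratio(a, b, c, d, confidence=0.95):
    """OR with its confidence interval, computed on the log scale."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    odds_ratio = (a * d) / (b * c)
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)       # SE of ln(OR)
    return odds_ratio, exp(log(odds_ratio) - z * se), exp(log(odds_ratio) + z * se)


def ci_risk_ratio(a, b, c, d, confidence=0.95):
    """RR with its confidence interval, computed on the log scale."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    risk_exposed, risk_unexposed = a / (a + b), c / (c + d)
    risk_ratio = risk_exposed / risk_unexposed
    se = sqrt((1 - risk_exposed) / a + (1 - risk_unexposed) / c)  # SE of ln(RR)
    return risk_ratio, exp(log(risk_ratio) - z * se), exp(log(risk_ratio) + z * se)


# Hypothetical table: 20 of 100 exposed and 10 of 100 unexposed have the outcome.
or_est, or_low, or_high = ci_odds_ratio(20, 80, 10, 90)
rr_est, rr_low, rr_high = ci_risk_ratio(20, 80, 10, 90)
print(or_est, rr_est)  # point estimates 2.25 and 2.0
```

Working on the log scale keeps both limits positive and makes the interval symmetric around ln(OR) or ln(RR), which is one way to approach the question posed above.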
CI for OR and RR Cont.
• SPSS can compute OR and RR with their confidence intervals given the information is fed in the following manner.
• Create 3 variables in the variable view page:
– Frequency (for the four cells),
– Exposure (0 as Yes, 1 as No) and
– Outcome (0 as Yes, 1 as No)
• Enter the values into the data view page as mentioned above.
• Weight cases based on the “frequency” variable.
• Do the analysis in the following manner:
– Descriptive statistics > Cross tabs > Put “exposure” as row and “outcome” as column > Statistics > Check “risk” > Continue > Ok
– OR is given as “Odds ratio for exposure (yes/no)”
– RR is given as “For cohort disease = yes”

Unbiased and Biased Estimators
• A statistic is called an unbiased estimator of a population parameter if the mean of the sampling distribution of the statistic is equal to the value of the parameter.
• The sample mean is an unbiased estimator of the population mean, since the mean of the sampling distribution of means equals the population mean.
• If the mean value of an estimator is either less than or greater than the true value of the quantity it estimates, then the estimator is called biased.
• A case of biased estimation occurs when the sample variance is used to estimate the population variance using the following formula: $s^2 = \dfrac{\sum (x_i - \bar{x})^2}{n}$

Unbiased and Biased Estimators Cont…
• The sample variance calculated using this formula is, on average, less than the true population variance.
• This is because sample observations are closer to each other than population observations.
• To compensate for this, n-1 is used as the denominator.
• It is important to note that, using n-1 as the denominator, the sample standard deviation still remains a biased estimator of the population standard deviation, but for large sample sizes this bias is negligible.

Estimation of Sample Size for Cross Sectional Studies
Why we need to calculate sample size:
• Representativeness Vs Cost
• Estimation can be made based on a given confidence level and standard error.
Sample Size to Estimate a Single Population Proportion
• If the main objective of the study is to estimate a single population proportion, then the sample size can be determined using the formula:
$n = \dfrac{(Z_{\alpha/2})^2\, P(1 - P)}{d^2}$
• Where:
– n is the minimum sample size required for a very large population (≥ 100,000)
– Z is the critical value for a given confidence interval
– P is the expected proportion of the event to be studied (to be estimated based on findings of previous studies)
– d is the margin of error

Sample Size to Estimate a Single Population Proportion Cont…
NB:
• If p is not known it has to be taken as 0.5. (Why?)
• Depending on the nature of the study, 10-15% contingency should be added.
• If the size of the population is less than 100,000 the sample size should be corrected using the formula:
$\text{Corrected sample size} = \dfrac{n \times N}{n + N}$
• Where:
– n is the noncorrected sample size
– N is the size of the source population

Sample Size to Estimate a Single Population Proportion Cont…
Example 6.6:
• A researcher is interested to determine the prevalence of family planning use in Addis Ababa city. A previous study indicates the prevalence is around 55%. If the researcher is interested to determine the sample size with 95% CI and 5% margin of error, what number of women of reproductive age should be included in his study?

Sample Size to Estimate Single Population Mean
• If the main objective of the study is to estimate a single population mean, then the sample size can be determined using the formula:
$n = \dfrac{(Z_{\alpha/2})^2\, \sigma^2}{d^2}$
• Where:
– n is the minimum sample size required for a large population
– Z is the critical value for a given confidence level
– σ is the expected SD of the event to be studied
– d is the margin of error
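The sample size formulas can be checked with a short Python sketch. The function names are assumptions, Z is taken as 1.96 for 95% confidence as in the slides, and results are rounded up to whole persons. The proportion inputs are those of Example 6.6; the mean inputs (SD 15 mg/dl, margin of error 2 mg/dl) are those of the blood glucose example that follows; the source population of 5,000 used for the correction is hypothetical.

```python
from math import ceil


def n_for_proportion(p, d, z=1.96):
    """Minimum sample size to estimate a proportion in a very large population."""
    return z ** 2 * p * (1 - p) / d ** 2


def finite_population_correction(n, population_size):
    """Corrected sample size for a source population below 100,000."""
    return n * population_size / (n + population_size)


def n_for_mean(sigma, d, z=1.96):
    """Minimum sample size to estimate a mean in a large population."""
    return (z * sigma / d) ** 2


n_prop = ceil(n_for_proportion(p=0.55, d=0.05))                      # 381
n_corrected = ceil(finite_population_correction(
    n_for_proportion(p=0.55, d=0.05), population_size=5000))        # 354
n_mean = ceil(n_for_mean(sigma=15, d=2))                             # 217
print(n_prop, n_corrected, n_mean)
```

Any contingency (for example 10-15% for nonresponse) would be added on top of these minimum figures.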
Sample Size to Estimate Single Population Mean
Example 6.7:
• A researcher is interested to determine the mean blood glucose level among high school students. A previous study indicates the mean is 85 mg/dl with standard deviation of 15 mg/dl. If the researcher is interested to determine the sample size with 95% CI and tolerates 2 mg/dl margin of error, what number of students should be included in his study?

Hypothesis Testing

What is a Hypothesis
• A statistical hypothesis is an assumption or a statement, which may or may not be true, concerning one or more populations.
• Examples of statistical hypotheses:
– The mean pulse rate among AAUHI students is 72/min.
– The mean blood glucose level among Chinese and Indians is the same.
– The mean blood cholesterol level is the same before and after taking a drug.
– The prevalence of HIV in AA is 12%.
– The prevalence of Hypertension in US and UK is the same.
• Setting up and testing hypotheses is an essential part of statistical inference.

Steps in Hypothesis Testing
Hypothesis testing involves the following steps:
1. Choose the hypothesis to be tested.
2. Choose an alternative hypothesis which would be accepted if the first hypothesis is rejected.
3. Decide on the appropriate test statistic for the hypothesis (Z, t, X2).
4. Decide the level of significance and corresponding critical value.
5. Obtain the value of the test statistic.
6. Make a decision and interpret it.
The Null and Alternative Hypothesis
• In hypothesis testing two hypotheses are involved: The Null Hypothesis and the Alternative Hypothesis.
• Every hypothesis test requires the analyst to state a null hypothesis and an alternative hypothesis.
• They are mutually exclusive and complementary events.
• Both hypotheses are about the parameter, not about the statistic.

The Null and Alternative Hypothesis Cont..
• The null hypothesis (H0 or HN):
– The first hypothesis to be set by the researcher.
– It commonly implies the meaning of “equals to”, “no effect”, “no difference” or “no association” conclusions.
Example:
• The mean pulse rate among AAUHI students is 72/min.
• Drug A has no effect on the blood glucose level of diabetic patients.
• There is no difference in the prevalence of malaria in region A and region B.
• There is no association between smoking and lung cancer.

The Null and Alternative Hypothesis Cont..
• The alternative hypothesis (HA or H1):
• The hypothesis that will be accepted if H0 is rejected.
• Implies conclusions like “is not equal”, “has effect”, “there is difference” and “there is association”.
Example:
• The mean pulse rate among AAUHI students is not equal to 72/min.
• Drug A has effect on the blood glucose level of diabetic patients.
• There is difference in the prevalence of malaria in region A and B.
• There is association between smoking and lung cancer.

Test Statistic
• In hypothesis testing we accept or reject the hypothesis through calculating the probability of getting the estimated sample value given the hypothesized population value is true.
• If the probability is very low we reject the null hypothesis.
• The probability is calculated using a test statistic.
• The most commonly used test statistics are Z, Student’s t and X2 tests.
• The general formula to calculate a test statistic is:
$\text{test statistic} = \dfrac{(\text{estimate}) - (\text{hypothesized value})}{SE}$
Test Statistic
Student’s t Distribution:
• In statistics, as long as the sample size is large enough, most datasets can be explained by the standard normal distribution.
• The use of the z test requires a knowledge of the variance of the population from which the sample is taken.
• It is somewhat strange that one can have knowledge of the population variance and not know the value of the population mean.
• But when the sample size is small and the population SD is not known, statisticians rely on the distribution of the t statistic.

Test Statistic Cont…
• Student’s t distribution was developed by William Gosset (1876-1937) under the pseudonym “Student”.
$t = \dfrac{\bar{x} - \mu}{s / \sqrt{n}}$
• There are many different t distributions. (The t distribution is a family of distributions.)
• The particular form of the t distribution is determined by its Degrees of Freedom (df).
• The degrees of freedom (df) refers to the number of independent observations in a dataset after some restriction is made.

Test Statistic Cont…
• The t distribution has the following properties:
– The mean of the distribution is equal to 0.
– Symmetrical about the mean.
– The variance is equal to v / (v - 2), where v is the df (i.e. v > 2). In general the variance is greater than 1, but approaches 1 as the sample size becomes large.
– Extends from - infinity to + infinity.
– Compared to the normal distribution, the t distribution is less peaked in the center and has higher tails.
– The t distribution approaches the normal distribution as n-1 approaches infinity.
Test Statistic Cont…
• For the t distribution to apply strictly we need the following two assumptions:
1. The observations are selected at random from the population.
2. The population distribution is normal.
• Sometimes the second assumption may not be met, as the t test is robust for departures from the normal distribution.
• That means even when assumption 2 is not satisfied, the probabilities calculated from the t table are still approximately correct.

Test Statistic Cont…
Chi Square Distribution (X2):
• Mainly developed by Karl Pearson (1857-1936).
• A type of probability distribution like Z or t.
• Represented by the Greek letter Chi (χ).
• It is not a single distribution but rather a family of distributions, indexed by the df.
• It is the most frequently used statistical technique for analysis of count or frequency data.
• Mainly used to check association between two categorical variables.

Test Statistic Cont…
• It is the distribution of the sum of the squared values of the observations drawn from the N(0,1) distribution.
• Let {X1, X2, ..., Xn} be n independent random variables, all ~ N(0,1).
• Then $\chi^2_n$ is defined as the distribution of the sum X1² + X2² + ... + Xn².
• The mathematical formula of the X2 distribution with k degrees of freedom is given as (where x ≥ 0):
$Y = \dfrac{1}{\left(\frac{k}{2} - 1\right)!}\;\dfrac{1}{2^{k/2}}\;x^{(k/2) - 1}\,e^{-x/2}$

Test Statistic Cont…
• The graph is given as: [figure not reproduced]
Test Statistic Cont…
• The formula for the test statistic which approximates the X2 distribution is (where O is the observed frequency and E is the expected frequency):
$\chi^2 = \sum \dfrac{(O - E)^2}{E}$

Test Statistic Cont…
• It has the following characteristics:
– Extends indefinitely to the right from 0.
– Has only one tail.
– As the df increase, the chi-square curve approaches a normal distribution.

Errors in Hypothesis Testing
• In testing a hypothesis, two types of errors can be committed: Type I and Type II errors.

Decision of the           Null Hypothesis
hypothesis testing        H0 True          H0 False
Accept H0                 Correct          Type II error
Reject H0                 Type I error     Correct

• The probability of committing a type I error is denoted as α. It is also called the Level of significance. (1 - confidence level)
• The probability of committing a type II error is denoted as β. (1 - power of the study)

One and Two Tailed Hypothesis
• Some hypotheses test whether one value is different from another or not, without additionally predicting which will be higher: Nondirectional or two-tailed test.
• At times some hypotheses not only test the difference of one value from the other but also the direction of the difference, i.e. whether it would be lower or higher: Directional or one-tailed test.
Level of Significance, Critical Values and Critical Area
• In practice, the level of significance (α) is chosen arbitrarily.
• Three levels are commonly used: 0.01, 0.05, or 0.10 (depending on confidence level).
• The smaller the level of significance, the stronger the hypothesis test.
• The level of significance determines the values of the test statistic that would cause us to reject the hypothesis.

Level of Significance, Critical Values and Critical Area Cont…
• The corresponding test statistic values for the level of significance are called the Critical Values.
• In a probability distribution the area which is left to the extreme right and/or left of the critical value is called the Critical area (Rejection area).
• The area between the two critical values is called the Acceptance Area.

Level of Significance, Critical Values and Critical Area Cont…
• A level of significance has different critical values for one and two tailed tests.
• Level of significance of 0.05 has critical value of ±1.96 if the test is two tailed.
• However, if the test is one tailed the critical value would be 1.64 to either of the tails.
• Note that critical values for a given level of significance differ depending on the test statistic intended to be used.
Level of Significance, Critical Values and Critical Area Cont…

α (level of significance)    0.10     0.05     0.01
Two tailed test              ±1.64    ±1.96    ±2.58
One tailed test, <           -1.28    -1.64    -2.33
One tailed test, >            1.28     1.64     2.33

Interpretation and Conclusion
• Interpretation is made based on comparisons between:
– Test Statistic Calculated Vs Critical Value.
– P value Vs significance level.
• Conclusion (i.e. accepting or rejecting the null hypothesis) should be made at the given level of confidence.
Test of Hypothesis about Single Population Mean
• Shows how to test the null hypothesis that the population mean is equal to some hypothesized value.
• One begins with a statement that claims a particular value for the unknown population mean.
• The hypothesis testing for a single population mean either accepts or rejects this statement.
• The Z test and the t test are used:
– Sample > 30: Z test
– Sample < 30 and population SD known: Z test
– Sample < 30 and population SD unknown: t test
Test of Hypothesis about Single Population Mean Cont.
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}}$
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}}$
Example 7.1:
• Researchers are interested in the mean level of an enzyme in a certain population. They take a sample of 36 individuals, determine the level of enzyme in each and compute a sample mean of 22. It is known that the variable of interest is approximately normally distributed with a standard deviation of 10. Let's say that they are asking the following question: Can we conclude that the mean enzyme level in this population is different from 25?
• Step 1 and 2: Define the H0 and H1:
H0: µ = 25
H1: µ ≠ 25
• Step 3: Decide the appropriate test statistic:
– Z test
• Step 4: Decide the level of significance and critical value:
– α value of 0.05.
– ±1.96 is the critical value.
• Step 5: Obtain the value of the test statistic:
Test of Hypothesis about Single Population Mean Cont.
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{22 - 25}{10 / \sqrt{36}} = \frac{-3}{1.67} = -1.80$
• Step 6: Make a decision and interpret it.
– Accept the H0 at 95% confidence level:
– −1.80 is within the acceptance region.
– The one-tailed P value of 0.036 is greater than the α/2 value of 0.025.
Example 7.2:
• The researchers mentioned in example 7.1, instead of asking if they could conclude that µ ≠ 25, asked: Can we conclude that the mean enzyme level in this population is less than 25?
Solution:
• Step 1 and 2: Define the H0 and H1:
H0: µ ≥ 25
H1: µ < 25
• Step 3: Decide the appropriate test statistic:
– Z test
• Step 4: Decide the level of significance and critical value:
– α value of 0.05.
– −1.645 is the critical value (one tailed, lower tail).
• Step 5: Obtain the value of the test statistic:
$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{n}} = \frac{22 - 25}{10 / \sqrt{36}} = \frac{-3}{1.67} = -1.80$
Test of Hypothesis about Single Population Mean Cont.
• Step 6: Make a decision and interpret it.
– Reject the H0 µ ≥ 25 with 95% confidence level.
– The test statistic −1.80 is within the rejection region (Z < −1.645).
– The P value of 0.036 is less than the α value of 0.05.
Example 7.3:
• Serum amylase level determination was made on a sample of 15 apparently healthy subjects. The sample yielded a mean of 96 units/100 ml and a standard deviation of 35 units/100 ml. The variance of the population was unknown. We want to know whether we can conclude that the mean of the population is different from 120 units/100 ml.
• Step 1 and 2: Define the H0 and H1.
H0: µ = 120
H1: µ ≠ 120
• Step 3: Decide the appropriate test statistic.
– t test
• Step 4: Decide the level of significance and critical value.
– α value of 0.05.
– t value for α/2 of 0.025 at df of 14: ±2.145
• Step 5: Obtain the value of the test statistic.
$t = \frac{\bar{X} - \mu}{S / \sqrt{n}} = \frac{96 - 120}{35 / \sqrt{15}} = -2.65$
• Step 6: Make a decision and interpret it.
• We reject the null hypothesis because:
– The calculated test statistic −2.65 is in the rejection area.
– The corresponding one-tailed P value for 2.65 (between 0.01 and 0.005) is less than the α/2 value of 0.025.
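The test statistics in Examples 7.1 to 7.3 share one formula, so they can be checked with a single helper. This is a sketch only: the function name is illustrative, and because the Python standard library has no t distribution, the critical t value is taken from the t table quoted in Example 7.3 rather than computed.

```python
from math import sqrt
from statistics import NormalDist

def one_sample_statistic(sample_mean, mu0, sd, n):
    """(sample mean - hypothesized mean) / (SD / sqrt(n)).

    Read the result as Z when sd is the known population SD,
    and as t (df = n - 1) when sd is the sample SD.
    """
    return (sample_mean - mu0) / (sd / sqrt(n))

# Examples 7.1 and 7.2: n = 36, mean 22, known SD 10, hypothesized mean 25.
z = one_sample_statistic(22, 25, 10, 36)
p_lower = NormalDist().cdf(z)                  # one-tailed P(Z <= z)
p_two_sided = 2 * min(p_lower, 1 - p_lower)    # two-tailed P value
reject_7_1 = abs(z) > 1.96                     # two tailed, alpha = 0.05
reject_7_2 = z < -1.645                        # lower tailed, alpha = 0.05

# Example 7.3: n = 15, mean 96, sample SD 35, hypothesized mean 120.
t = one_sample_statistic(96, 120, 35, 15)
T_CRIT_DF14 = 2.145                            # t table, alpha/2 = 0.025, df = 14
reject_7_3 = abs(t) > T_CRIT_DF14

print(z, p_lower, p_two_sided, reject_7_1, reject_7_2)
print(t, reject_7_3)
```

Comparing the one-tailed P value with α/2, as the slides do, is equivalent to comparing `p_two_sided` with α.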
Example 7. H1: there is difference between the two means..67 Z= n2 −4 = − 2. • Step 1 and 2: Define the H0 and H1 H o : µm = µ f H1 : µ m ≠ µ f • Step 3: Decide approprate test statistic: – Z test • Step 4: Decide the level of significance and critical value: – value of 0. t= ( X1 − X 2 ) S2 S2 + n1 n2 Z= (X1 − X 2 ) σ1 2 n1 + σ2 2 S= n2 (n1 − 1) S1 + ( n 2 − 1) S 2 n1 + n 2 − 2 2 2 • Sumup the sample size of the two groups.47 336 84 ..72 1. H0: there is not difference between the two mean. Testing of Hypothesis about Two Population Means Cont. Is there significant difference between the two means? Testing of Hypothesis about Two Population Means Cont.4: • A researcher wants to check whether the systolic blood pressure among males is different from females or not. Among 60 females.. • Step 5: Obtain the value of the test statistic: Z= (X1 − X 2 ) σ1 2 n1 335 + σ2 2 Z= 100 − 104 5 2 10 2 + 50 60 Z= −4 0. if it is greater than 30 use Z test.5 + 1. the mean SPB was 104mmHg with standard deviation of 10 mmHg. Among 50 male samples the mean SBP was 100mmHg with standard deviation of 5 mmHg. • t test is carried out with df of n1+n22 334 333 Testing of Hypothesis about Two Population Means Cont.96 is the critical value. – ±1. if less than 30 use t test.05.3/1/2010 Testing of Hypothesis about Two Population Means • • • • Compare the difference between two populations mean. Z or t test can be employed.
Testing of Hypothesis about Two Population Means Cont.
• Step 6: Make a decision and interpret it.
– We reject the H0 and accept the H1 µm ≠ µf (at 95% confidence level) because:
– The calculated test statistic −2.72 is in the rejection region.
– The corresponding one-tailed P value for 2.72 (0.0033) is less than the α/2 value of 0.025.
Example 7.5:
• Serum amylase determination was made on a sample of 15 apparently healthy subjects and 12 hospitalized subjects. Among healthy subjects, the mean was 96 units/100 ml with a standard deviation of 35 units/100 ml. Among hospitalized patients, the mean was 120 units/100 ml with a standard deviation of 40 units/100 ml. Is there a significant difference between the two mean values?
• Step 1 and 2: Define the H0 and H1
H0: µ1 = µ2
H1: µ1 ≠ µ2
• Step 3: Decide the appropriate test statistic.
– t test
• Step 4: Decide the level of significance and critical value.
– α value of 0.01.
– t value for α/2 of 0.005 at df of 25: ±2.787
• Step 5: Obtain the value of the test statistic
$S = \sqrt{\frac{(n_1 - 1)S_1^2 + (n_2 - 1)S_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{(14)(35)^2 + (11)(40)^2}{25}} = \sqrt{\frac{17150 + 17600}{25}} = \sqrt{1390} = 37.3$
$t = \frac{96 - 120}{\sqrt{\frac{37.3^2}{15} + \frac{37.3^2}{12}}} = \frac{-24}{14.4} = -1.67$
• Step 6: Make a decision and interpret it.
• We accept the null hypothesis (at 99% confidence level) because:
• The calculated test statistic −1.67 is in the acceptance region.
• The corresponding one-tailed P value for 1.67 (which is between 0.1 and 0.05) is greater than the α/2 value of 0.005.
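The two-sample statistics of Examples 7.4 and 7.5 can be verified with a short sketch. The function names are illustrative. Note that the hand calculation in Example 7.5 rounds the denominator to 14.4, so the unrounded t value differs slightly in the second decimal place.

```python
from math import sqrt

def two_sample_z(mean1, sd1, n1, mean2, sd2, n2):
    """Z statistic for the difference between two means (large samples)."""
    return (mean1 - mean2) / sqrt(sd1 ** 2 / n1 + sd2 ** 2 / n2)

def pooled_sd(sd1, n1, sd2, n2):
    """Pooled standard deviation S with n1 + n2 - 2 degrees of freedom."""
    return sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / (n1 + n2 - 2))

def two_sample_t(mean1, sd1, n1, mean2, sd2, n2):
    """t statistic for the difference between two means using the pooled SD."""
    s = pooled_sd(sd1, n1, sd2, n2)
    return (mean1 - mean2) / sqrt(s ** 2 / n1 + s ** 2 / n2)

# Example 7.4: SBP in 50 males (mean 100, SD 5) vs 60 females (mean 104, SD 10).
z = two_sample_z(100, 5, 50, 104, 10, 60)
reject_7_4 = abs(z) > 1.96        # two tailed, alpha = 0.05

# Example 7.5: serum amylase in 15 healthy vs 12 hospitalized subjects.
s = pooled_sd(35, 15, 40, 12)
t = two_sample_t(96, 35, 15, 120, 40, 12)
reject_7_5 = abs(t) > 2.787       # t table, alpha/2 = 0.005, df = 25

print(z, reject_7_4)
print(s, t, reject_7_5)
```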
Testing of Hypothesis about Two Population Means Cont…
Paired t test for difference between two means:
• Every observation in one sample has one matching observation in the second sample.
• Commonly used in the evaluation of interventions like new treatment modalities.
• Hence pre and post intervention (treatment) results are compared.
• Usually the t test is used since the individuals involved in the trial are few.
• Procedures of hypothesis testing are the same, except the formula for the test statistic calculation:
$t = \frac{\bar{d}}{S_D / \sqrt{n}}$
– $\bar{d}$ = mean of the differences between the two samples.
– $S_D$ = standard deviation of the differences between the two samples.
– n = the number of paired cases.
• The null hypothesis: there is no significant difference between the two tests.
• Note that the calculated test statistic is compared at n − 1 degrees of freedom.
Example 7.6:
• A random sample of 10 young men was taken and the pulse rate (PR) was measured before and after taking a cup of coffee. The result is given as follows. Does the coffee have any effect on the heart rate? (Perform the hypothesis testing at 95% confidence level.)
Subject    PR before    PR after    Difference
1          68           74          +6
2          64           68          +4
3          52           60          +8
4          76           72          −4
5          78           76          −2
6          62           68          +6
7          66           72          +6
8          76           76          0
9          78           80          +2
10         60           64          +4
Mean       68           71          +3
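As a check on the paired t formula, the pulse rate data of Example 7.6 can be run through a few lines of Python. This is a sketch under the assumption that the critical value 2.262 is read from a t table (two tailed, α = 0.05, df = 9); the variable names are illustrative.

```python
from math import sqrt
from statistics import mean, stdev

# Pulse rates from Example 7.6, one entry per subject.
before = [68, 64, 52, 76, 78, 62, 66, 76, 78, 60]
after = [74, 68, 60, 72, 76, 68, 72, 76, 80, 64]

# Work on the within-subject differences (after - before).
diffs = [a - b for a, b in zip(after, before)]
d_bar = mean(diffs)                  # mean of the differences
s_d = stdev(diffs)                   # sample SD of the differences (n - 1 denominator)
t = d_bar / (s_d / sqrt(len(diffs)))

T_CRIT_DF9 = 2.262                   # t table value, assumed as stated above
reject_h0 = abs(t) > T_CRIT_DF9

print(d_bar, s_d, t, reject_h0)
```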
Testing of Hypothesis about Two Population Means Cont…
• H0: Coffee intake has no effect on PR
• H1: Coffee intake has an effect on PR
• Test statistic: t test (paired)
• Critical value ±2.262
• First calculate the SD, then the test statistic:
$S_D = \sqrt{\frac{\sum (d_i - \bar{d})^2}{n - 1}} = 3.92$
$t = \frac{3}{3.92 / \sqrt{10}} = 2.4$
• Reject the null hypothesis (at 95% confidence level).
• Coffee intake has an effect on PR.
Test of Hypothesis on Means Using SPSS
• In SPSS the one sample T test, independent T test and paired sample T test are available under:
• Analyze > Compare Means > One sample T test or independent T test or paired sample T test
Test of Hypothesis About Single Population Proportion
• Tests the null hypothesis that the population proportion is equal to some hypothesized value.
• One begins with a statement that claims a particular value for the unknown population proportion.
• The hypothesis testing for a single population proportion either accepts or rejects this statement.
• Here the Z test statistic is used. The formula is given as:
$Z = \frac{p - \pi}{\sqrt{\frac{\pi (1 - \pi)}{n}}}$
Example 7.7:
• A survey was conducted to determine the prevalence of protein energy malnutrition (PEM) in a rural kebele. Of 300 under five children assessed, 123 were stunted. Can we conclude that the prevalence of PEM in the population is 50%?
Test of Hypothesis About Single Population Proportion
• Step 1 and 2: Define the H0 and H1
H0: π = 0.5
H1: π ≠ 0.5
• Step 3: Appropriate test statistic:
– Z statistic
• Step 4: Decide the level of significance and the corresponding critical value:
– Let's take an α value of 0.1. Hence ±1.645 is the critical value.
• Step 5: Obtain the value of the test statistic:
$Z = \frac{p - \pi}{\sqrt{\frac{\pi (1 - \pi)}{n}}} = \frac{0.41 - 0.5}{\sqrt{\frac{0.5(0.5)}{300}}} = \frac{-0.09}{\sqrt{\frac{0.25}{300}}} = -3.11$
• Step 6: Make a decision and interpret it.
• At 90% confidence level we reject the null hypothesis that π = 0.5.
– The calculated test statistic −3.11 is in the rejection region.
– The corresponding one-tailed P value for 3.11 (i.e. 0.0009) is less than the α/2 value of 0.05.
Testing of Hypothesis About Two Population Proportions
• Tests the null hypothesis that a population proportion is equal to another population proportion.
• The hypothesis testing for two population proportions either accepts or rejects this statement.
• Here the Z test statistic is used. The formula is given as:
$Z = \frac{p_1 - p_2}{\sqrt{P(1 - P)\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}}$, where $P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$
Example 7.8:
• The prevalence of malaria in two malaria endemic kebeles X and Y was compared. In kebele X, 15 of 120 samples were positive. In kebele Y, 20 of 100 samples were positive. Is there any significant difference between the prevalence of malaria in kebeles X and Y?
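Both proportion formulas above are easy to check numerically. The sketch below applies them to the data of Examples 7.7 and 7.8; the function names are illustrative, and the unrounded Z for Example 7.7 differs from the hand calculation in the second decimal place because the slides round the denominator.

```python
from math import sqrt

def one_proportion_z(p, pi0, n):
    """Z statistic for a sample proportion p against a hypothesized value pi0."""
    return (p - pi0) / sqrt(pi0 * (1 - pi0) / n)

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic for two sample proportions using the pooled proportion P."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)   # equals (n1*p1 + n2*p2) / (n1 + n2)
    return (p1 - p2) / sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))

# Example 7.7: 123 of 300 children stunted, hypothesized prevalence 50%.
z_single = one_proportion_z(123 / 300, 0.5, 300)
reject_7_7 = abs(z_single) > 1.645   # two tailed, alpha = 0.10

# Example 7.8: 15 of 120 positive in kebele X, 20 of 100 in kebele Y.
z_two = two_proportion_z(15, 120, 20, 100)
reject_7_8 = abs(z_two) > 1.96       # two tailed, alpha = 0.05

print(z_single, reject_7_7)
print(z_two, reject_7_8)
```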
Testing of Hypothesis About Two Population Proportions
• Step 1 and 2: Define the H0 and H1:
H0: P1 = P2
H1: P1 ≠ P2
• Step 3: Decide the appropriate test statistic:
– Z statistic
• Step 4: Decide the α value and the critical value:
– Let's take an α value of 0.05. Hence ±1.96 is the critical value.
• Step 5: Obtain the value of the test statistic:
– First calculate the proportions and the pooled proportion
– P1 = 15/120 = 0.125, P2 = 20/100 = 0.2
$P = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2} = \frac{120(0.125) + 100(0.2)}{120 + 100} = \frac{15 + 20}{220} = 0.159$
• Then we calculate the test statistic:
$Z = \frac{0.125 - 0.2}{\sqrt{0.159(1 - 0.159)\left(\frac{1}{120} + \frac{1}{100}\right)}} = \frac{-0.075}{\sqrt{0.1337(0.0183)}} = -1.51$
• Step 6: Make a decision and interpret it. At 95% confidence level we accept the H0 P1 = P2 because:
– −1.51 is in the acceptance region.
– The one-tailed P value of 0.0655 is greater than the α/2 value of 0.025.
Test of Hypothesis on Proportions Using SPSS
• There is no "point and click" option in SPSS to do such hypothesis testing on proportions.
• Syntax based analysis can be done.
Test of Hypothesis about Categorical Data
• It is also possible to apply hypothesis testing on categorical data.
• The Chi-square ($\chi^2$) test statistic is commonly used.
• This test is usually applied to tabulated data.
• The table contains two variables called the row and column variables.
• The test measures the discrepancy between K observed frequencies (O) and the corresponding K expected frequencies (e), i.e. for all cells of the tabulation.
• Expected frequencies are the frequencies which would occur if there were no association between the row and column variables.
Test of Hypothesis about Categorical Data
• The H0 of the Chi-square test is that there is no association between the row and column variables.
• The H1 is that there is an association between the row and column variables.
• The closer the observed frequencies are to the expected frequencies, the more likely the H0 is true.
$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i}$
$e = \frac{\text{row total for the cell} \times \text{column total for the cell}}{\text{grand total}}$
• Assumptions of the Chi-square test:
– No cell of the table has an expected frequency less than 1.
– No more than 20% of the expected frequencies should be less than 5.
• The Chi-square statistic should be compared with the chi-square distribution with df of (R − 1)(C − 1).
• Though the distribution of Chi-square is one tailed, the test is always two tailed.
Example 7.9:
• A researcher is interested in assessing the effect of literacy on family planning use. Accordingly he collected data and tabulated the findings in the following manner. Can we say there is an association between educational status and family planning use?
Educational Status    FP use: Yes    FP use: No    Total
Illiterate            63             15            78
Literate              49             33            82
Total                 112            48            160
• Step 1 and 2: Define the H0 and H1:
– H0: There is no association between literacy and family planning use.
– H1: There is an association between literacy and family planning use.
• Step 3: Decide the appropriate test statistic:
– $\chi^2$ test.
• Step 4: Decide α and the corresponding critical value:
– Let's take an α value of 0.01.
– At df of 1 the critical value is 6.635.
– The acceptance area is 0 to 6.635. The rejection area is $\chi^2$ > 6.635.
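The expected frequencies and the Chi-square statistic for the table in Example 7.9 can be computed directly from the two formulas above. This is a minimal sketch with illustrative names; the P value line relies on the fact that a chi-square variable with 1 df is the square of a standard normal variable, so it is valid for 2 x 2 tables only.

```python
from math import erfc, sqrt

# Example 7.9: rows are illiterate / literate, columns are FP use yes / no.
observed = [[63, 15],
            [49, 33]]

def chi_square(table):
    """Return (chi-square statistic, expected frequencies) for an R x C table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    # e = row total x column total / grand total, for every cell
    expected = [[r * c / grand_total for c in col_totals] for r in row_totals]
    stat = sum((o - e) ** 2 / e
               for obs_row, exp_row in zip(table, expected)
               for o, e in zip(obs_row, exp_row))
    return stat, expected

stat, expected = chi_square(observed)

# Check the two assumptions listed on the slide.
cells = [e for row in expected for e in row]
assumptions_ok = min(cells) >= 1 and sum(e < 5 for e in cells) / len(cells) <= 0.20

p_value = erfc(sqrt(stat / 2))   # upper-tail P value, df = 1 only
reject_h0 = stat > 6.635         # critical value for alpha = 0.01, df = 1

print(expected, stat, p_value, assumptions_ok, reject_h0)
```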
Test of Hypothesis about Categorical Data
• Step 5: Obtain the value of the test statistic:
– First the expected frequencies should be calculated:
• Expected frequency for cell a: 78 x 112/160 = 54.6
• Expected frequency for cell b: 82 x 112/160 = 57.4
• Expected frequency for cell c: 78 x 48/160 = 23.4
• Expected frequency for cell d: 82 x 48/160 = 24.6
– The assumptions of the $\chi^2$ test are fulfilled.
– Then we calculate the Chi-square statistic.
$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - e_i)^2}{e_i} = \frac{(63 - 54.6)^2}{54.6} + \frac{(49 - 57.4)^2}{57.4} + \frac{(15 - 23.4)^2}{23.4} + \frac{(33 - 24.6)^2}{24.6}$
$\chi^2 = (1.29) + (1.23) + (3.02) + (2.87) = 8.41$
• Step 6: Make a decision and interpret it.
• At 99% confidence level we accept the H1 that the two variables are associated due to the following reasons:
– The calculated test statistic 8.41 is in the rejection area.
– The corresponding P value for 8.41 (between 0.005 and 0.002) is less than the α value of 0.01.
• But what is the direction of the association?
Test of Hypothesis about Categorical Data Using SPSS
• In order to do the chi-square test using SPSS, track the following steps.
• Analyze > Descriptive Statistics > Crosstabs > Put the two categorical variables as column and row > Statistics > Check "Chi-square" > OK.
• The Chi-square test is given in a table as "Pearson Chi-square".
Fisher's exact test
• Fisher's exact test is a statistical significance test used in the analysis of contingency tables when the sample size is small (when the assumptions of the chi-square test are not fulfilled).
• It is named after its inventor, R. A. Fisher.
• For hand calculations, the test is only feasible in the case of a 2 x 2 contingency table.
• Its application to higher order tables is controversial.
• H0: there is no association between the two variables
• H1: there is an association between the two variables
• The hypothesis is tested by comparing the probability of observing the given or more extreme tables, given that the null hypothesis is true, with the level of significance.
Fisher's exact test
a        c        (a+c)
b        d        (b+d)
(a+b)    (c+d)    N
• The exact probability of observing a given table is given as:
• P = [(a+b)!(c+d)!(a+c)!(b+d)!]/[N!a!b!c!d!]
• Hypothesis testing using Fisher's exact test involves the following steps:
1. Calculate the probability of the observed table itself.
2. List all possible more extreme tables manually (given that the marginal totals are maintained).
3. Calculate their respective exact probabilities.
4. Calculate the probability of getting the observed or more extreme tables.
5. Multiply the total by 2 (to get the two tailed value).
6. Compare the value with the value of α.
Example 7.10:
• In the following tabulated data, is there any association between the treatment type and the survival rate of patients? (Test the hypothesis at 95% confidence level.)
Treatment type    Survived    Died    Total
A                 7           2       9
B                 5           6       11
Total             12          8       20
• H0: No association between the treatment modalities and survival rate.
• H1: There is an association between the treatment modalities and survival rate.
• Test statistic: Fisher's exact test, because two of the expected frequencies have values less than 5.
• Level of significance: 5%
• Calculate the probability of getting the given or more extreme tables.
Fisher's exact test
• Observed table:
Treatment type    Survived    Died    Total
A                 7           2       9
B                 5           6       11
Total             12          8       20
• Probability of observing this table = 9!11!12!8!/(20!7!2!5!6!) = 0.132
• First possible extreme table:
Treatment type    Survived    Died    Total
A                 8           1       9
B                 4           7       11
Total             12          8       20
• Probability of observing this table = 9!11!12!8!/(20!8!1!4!7!) = 0.024
• Second possible extreme table:
Treatment type    Survived    Died    Total
A                 9           0       9
B                 3           8       11
Total             12          8       20
• Probability of observing this table = 9!11!12!8!/(20!9!0!3!8!) = 0.001
• Probability of getting the observed or more extreme tables:
– 0.132 + 0.024 + 0.001 = 0.157 (one tailed)
– Two tailed: 2 x 0.157 = 0.314
• Conclusion and interpretation:
– Accept the null hypothesis at 95% confidence level.
– There is no association between the treatment modalities and survival rate.
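The hand calculation in Example 7.10 can be sketched in Python. The code follows the procedure of the slides (sum the observed and the more extreme tables in one direction, then double), which is an assumption about the two-tailed convention; statistical packages may define the two-sided P value differently. The function names are illustrative, and the loop assumes the observed departure is toward a larger top-left cell, as in this example.

```python
from math import comb

def table_probability(a, b, c, d):
    """Exact probability of the 2x2 table [[a, b], [c, d]] with fixed margins.

    Algebraically equal to (a+b)!(c+d)!(a+c)!(b+d)! / (N! a! b! c! d!).
    """
    return comb(a + b, a) * comb(c + d, c) / comb(a + b + c + d, a + c)

def one_tailed_p(a, b, c, d):
    """Sum the probability of the observed table and of every more extreme
    table obtained by increasing the top-left cell while keeping all
    marginal totals unchanged."""
    total = 0.0
    while b >= 0 and c >= 0:
        total += table_probability(a, b, c, d)
        a, b, c, d = a + 1, b - 1, c - 1, d + 1
    return total

# Example 7.10: treatment A (7 survived, 2 died), treatment B (5 survived, 6 died).
p_observed = table_probability(7, 2, 5, 6)
p_one_tailed = one_tailed_p(7, 2, 5, 6)
p_two_tailed = 2 * p_one_tailed      # doubling, as in step 5 of the procedure
reject_h0 = p_two_tailed < 0.05

print(p_observed, p_one_tailed, p_two_tailed, reject_h0)
```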
Fisher's exact test using SPSS
• In order to do Fisher's exact test using SPSS, track the following steps.
• Analyze > Descriptive Statistics > Crosstabs > Put the two categorical variables as column and row > Statistics > Check "Chi-square" > OK.
• Fisher's exact test is given in a table titled "Chi-square tests".
• NB: SPSS doesn't do Fisher's exact test for higher order tables.
Summary
• The interpretation of the hypothesis test is dependent on the confidence level at which the test is conducted.
• A hypothesis which is rejected at a higher level of confidence cannot be accepted at a lower level of confidence.
• A hypothesis which is accepted at a lower level of confidence cannot be rejected at a higher level of confidence.
• A hypothesis which is accepted at a higher level of confidence can be rejected at a lower level of confidence.
• A hypothesis which is rejected at a lower level of confidence can be accepted at a higher level of confidence.
Sample Size Calculation for Comparative Studies
• The concepts discussed in this chapter can be applied to the calculation of sample size for comparative studies.
• For comparative studies like case control, cohort and interventional studies, the optimal size for the two groups is calculated using the formula:
$n_1 = \frac{\left[ Z_\alpha \sqrt{\left(1 + \frac{1}{r}\right) P(1 - P)} + Z_\beta \sqrt{P_1(1 - P_1) + \frac{P_2(1 - P_2)}{r}} \right]^2}{(P_1 - P_2)^2}$
• Where $P = \frac{P_1 + rP_2}{1 + r}$
Sample Size Calculation Cont.
• Where,
P        is the pooled proportion
P1       is the expected 1st proportion
P2       is the expected 2nd proportion
r        is the number of controls per case
Alpha    is the probability of type I error
Beta     is the probability of type II error
n1       is the sample size for the first group
NB: n2 is calculated by multiplying n1 by r.
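The sample size formula above translates directly into code. The sketch below is illustrative only: the input values (P1 = 0.5, P2 = 0.3, one control per case, Z values of 1.96 and 0.84) are hypothetical and do not come from the course material, and the function name is likewise an assumption.

```python
from math import ceil, sqrt

def sample_size_n1(p1, p2, r, z_alpha, z_beta):
    """Sample size of the first group for comparing two proportions.

    p1, p2  : expected proportions in the two groups
    r       : number of controls per case (n2 = r * n1)
    z_alpha : Z value for the chosen type I error
    z_beta  : Z value for the chosen type II error
    """
    p = (p1 + r * p2) / (1 + r)                       # pooled proportion
    term_alpha = z_alpha * sqrt((1 + 1 / r) * p * (1 - p))
    term_beta = z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2) / r)
    return (term_alpha + term_beta) ** 2 / (p1 - p2) ** 2

# Hypothetical inputs, chosen only to show how the function is called.
n1 = ceil(sample_size_n1(p1=0.5, p2=0.3, r=1, z_alpha=1.96, z_beta=0.84))
n2 = 1 * n1    # n2 = r * n1, with r = 1 here

print(n1, n2)
```

Rounding up with `ceil` is a design choice made here so that the study is never undersized by a fraction of a subject.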
Regression and Correlation
Correlation and Linear Regression
• Many medical investigations are concerned with:
– Establishment of a relationship between two variables.
– The strength of a relationship.
– Predicting one variable on the basis of another.
– Controlling the effect of unwanted variables.
• Such intentions can be addressed by using either correlation or regression analysis.
Correlation Analysis
• Initially developed by Sir Francis Galton (1888) and Karl Pearson (1896).
• Correlation is the quantification of the degree to which two random quantitative variables are related, provided the relationship is linear.
• Both of the variables should be measured on the same set of study units.
• Strength of relationship measurement: Correlation Coefficient.
• Most commonly used coefficient: Product Moment Correlation or Pearson Correlation Coefficient (r).
• Does not imply a cause and effect relationship.
• The symbol rho (ρ) is used to represent the population correlation coefficient.
• Unitless measure.
• The value of r ranges from −1 to +1.
• If the correlation coefficient is greater than 0, the variables are said to be positively correlated (i.e. as X increases, Y tends to increase).
• If the correlation coefficient is less than 0, the variables are said to be negatively correlated (i.e. as X increases, Y tends to decrease).
• If the correlation coefficient is 0 then the variables are said to be uncorrelated.
Correlation Analysis Cont…
• The formula for computing the sample correlation coefficient (r) for two variables X and Y is given as:
$r = \frac{\sum (x - \bar{x})(y - \bar{y})}{\sqrt{\left[\sum (x - \bar{x})^2\right]\left[\sum (y - \bar{y})^2\right]}}$
• Or
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$
• Before computing r, a scatter plot of the two variables should be drawn. Why?
[Figures: scatter plots illustrating linear vs curvilinear relationships, strong vs weak relationships, and no relationship]
Correlation Analysis Cont…
• Assumptions of correlation analysis:
– Independent random samples are taken
– Both variables are on interval/ratio scale
– Linear association between X and Y
– Paired measures for X and Y
– Normal distribution for X and Y
– Homogeneity of variance (Homoscedasticity)
• In situations where its assumptions are violated, correlation becomes inadequate to explain a given relationship.
Example 8.1:
• The data of a random sample of 20 countries are shown in the following table. X represents the percentage of children immunized by age one year and Y represents the under five year mortality rate (CMR per 1000 live births). Determine the strength of association between the two variables.
Country     % Immunized (X)    CMR/1000LB (Y)    XY       Y2        X2
Bolivia     77                 118               9086     13924     5929
Brazil      69                 65                4485     4225      4761
Cambodia    32                 184               5888     33856     1024
Canada      85                 8                 680      64        7225
China       94                 43                4042     1849      8836
Czech       99                 12                1188     144       9801
Egypt       89                 55                4895     3025      7921
Ethiopia    13                 208               2704     43264     169
Finland     95                 7                 665      49        9025
France      95                 9                 855      81        9025
Greece      54                 9                 486      81        2916
India       89                 124               11036    15376     7921
Italy       95                 10                950      100       9025
Japan       87                 6                 522      36        7569
Mexico      91                 33                3003     1089      8281
Poland      98                 16                1568     256       9604
Russia      73                 32                2336     1024      5329
Senegal     47                 145               6815     21025     2209
Turkey      76                 87                6612     7569      5776
UK          90                 9                 810      81        8100
Total       1548               1180              68626    147118    130446
$r = \frac{n\sum xy - \sum x \sum y}{\sqrt{\left[n\sum x^2 - \left(\sum x\right)^2\right]\left[n\sum y^2 - \left(\sum y\right)^2\right]}}$
$r = \frac{20(68626) - (1548 \times 1180)}{\sqrt{\left[20(130446) - (1548)^2\right] \times \left[20(147118) - (1180)^2\right]}}$
$r = -0.79$
• There is a strong negative linear relationship between the two variables.
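The column totals and the value of r in Example 8.1 can be reproduced with the computational formula. This is a minimal sketch using the 20 data pairs from the table; the variable names are illustrative.

```python
from math import sqrt

# Example 8.1: % immunized (x) and under five mortality per 1000 live births (y),
# listed in the same country order as the table.
x = [77, 69, 32, 85, 94, 99, 89, 13, 95, 95, 54, 89, 95, 87, 91, 98, 73, 47, 76, 90]
y = [118, 65, 184, 8, 43, 12, 55, 208, 7, 9, 9, 124, 10, 6, 33, 16, 32, 145, 87, 9]

n = len(x)
sum_x, sum_y = sum(x), sum(y)
sum_xy = sum(a * b for a, b in zip(x, y))
sum_x2 = sum(a * a for a in x)
sum_y2 = sum(b * b for b in y)

# Computational formula for the Pearson correlation coefficient.
r = (n * sum_xy - sum_x * sum_y) / sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)

print(sum_x, sum_y, sum_xy, sum_x2, sum_y2, r)
```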
Correlation Analysis Cont…
• Interpretation option:
– 100% × r²:
• Shows the proportion of variation of one variable explained by the other.
– Rule of thumb: [table of rule-of-thumb ranges for r not preserved in the source]
• Hypothesis Testing for a Correlation Coefficient
• As with means and percentages, it is also possible to test the significance of the population correlation.
• For a two tailed test
– H0: ρ is 0
– H1: ρ is different from 0
• The t test statistic is given as (with n − 2 df):
$t = r\sqrt{\frac{n - 2}{1 - r^2}}$
Example 8.2:
• At the 0.05 level of significance, can we claim the correlation coefficient in example 8.1 indicates a significant negative relationship between immunization coverage and child mortality?
• The critical t value for the 0.05 level of significance (one tailed) at 18 degrees of freedom is −1.734. Then we calculate the test statistic.
$t = r\sqrt{\frac{n - 2}{1 - r^2}} = -0.79\sqrt{\frac{20 - 2}{1 - (-0.79)^2}} = -0.79\sqrt{\frac{18}{0.3759}} = -5.47$
• Hence we accept the H1 that r indicates a significant negative relationship between immunization coverage and child mortality.
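The t statistic in Example 8.2 follows directly from r and n, as the short sketch below shows. The function name is illustrative, and the critical value −1.734 is the one-tailed t table value quoted in the example rather than something the code computes.

```python
from math import sqrt

def correlation_t(r, n):
    """t statistic (n - 2 df) for testing whether a correlation differs from 0."""
    return r * sqrt((n - 2) / (1 - r ** 2))

# Example 8.2: r = -0.79 from 20 countries.
t = correlation_t(-0.79, 20)
T_CRIT_DF18 = -1.734             # one tailed, alpha = 0.05, df = 18 (t table)
significant_negative = t < T_CRIT_DF18

print(t, significant_negative)
```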
Correlation Analysis Cont.
Limitations:
• Applies only to a linear relationship.
• Does not differentiate dependent and independent variables.
• Confounding by a third variable.
• One must not extrapolate an observed correlation beyond the observed ranges of the x and y values.
Spearman's Rank Correlation
• It is a nonparametric (distribution-free) rank statistic proposed by Charles Spearman in 1904 as a measure of the strength of the association between two variables.
• Denoted as rs
• Is applied when:
• The normality assumption is not satisfied or cannot be tested.
• At least one of the variables is given on an ordinal scale.
• In the calculation of the coefficient, the actual values of both variables should be changed into ranks.
• The formula for the Spearman Correlation Coefficient is (given that there are no tied ranks):
$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)}$
• Where,
– D is the difference between a subject's ranks on the two variables.
– 6 is a constant.
– n is the number of subjects.
• Consider the following example. The following table presents the MMR level and delivery service coverage in 10 developing countries.
Country    MMR (per 100,000 LB)    MMR Rank    Delivery Service Coverage (%)    Rank    D    D2
1          315                     4           55                               6       2    4
2          450                     6           40                               5       1    1
3          200                     1           70                               8       7    49
4          250                     3           79                               10      7    49
5          243                     2           75                               9       7    49
6          830                     9           25                               3       6    36
7          850                     10          20                               2       8    64
8          656                     7           20                               1       6    36
9          701                     8           30                               4       4    16
10         410                     5           60                               7       2    4
Total                                                                                        308
$r_s = 1 - \frac{6\sum D^2}{n(n^2 - 1)} = 1 - \frac{6 \times 308}{10(100 - 1)} = 1 - \frac{1848}{990} = 1 - 1.87 = -0.87$
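The Spearman coefficient for the table above can be recomputed from the two rank columns. The sketch takes the ranks exactly as listed in the table, including the arbitrary ordering the table gives to the two countries with 20% coverage; the function name is illustrative.

```python
# Ranks copied from the table above, in country order 1 to 10.
mmr_rank = [4, 6, 1, 3, 2, 9, 10, 7, 8, 5]
coverage_rank = [6, 5, 8, 10, 9, 3, 2, 1, 4, 7]

def spearman_from_ranks(rank_x, rank_y):
    """Spearman rank correlation from two lists of ranks (no-ties formula)."""
    n = len(rank_x)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

sum_d2 = sum((a - b) ** 2 for a, b in zip(mmr_rank, coverage_rank))
rs = spearman_from_ranks(mmr_rank, coverage_rank)

print(sum_d2, rs)
```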
Correlation Analysis Cont.
• Inference about rs
• For hypothesis testing the t score can be calculated (at df of n − 2) using the formula:
$t = r_s\sqrt{\frac{n - 2}{1 - r_s^2}}$
• For the previous example the t score would be:
$t = -0.87\sqrt{\frac{10 - 2}{1 - (-0.87)^2}} = -5$
• If the hypothesis test is a two tailed test at the 0.05 level of significance, we reject the H0 as |−5| > 2.306.
Partial Correlation
• A method used to describe the relationship between two variables while taking away the effects of another variable, or several other variables, on this relationship.
• Still requires meeting all the usual assumptions of Pearsonian correlation.
Correlation Analysis Using SPSS
• In order to do correlation analysis using SPSS follow these steps.
• But before that, don't forget the scatter plot.
• Analyze > Correlate > Bivariate correlations > Put the two variables in the variable box > Select Pearson or Spearman (another option is also there) > OK.
• Partial correlation can also be done.
• Analyze > Correlate > Partial correlation.
Regression Analysis
• In correlation analysis the interest is to show how two numeric variables are related.
• However, in regression analysis, we are interested in explaining or modeling a dependent variable (Y) as a function of one or more independent variables (X).
• The covariate need not necessarily be numeric.
• Regression analysis is used to:
– Assess the association between two variables.
– Predict/explain the value of a dependent variable based on the value of at least one independent variable (i.e. mathematical modeling).
– Control for confounding factors.
– Show the possible effect of interaction among variables.
Regression Analysis Cont.
• The general regression equation is given as:
$Y = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n$
Where: Y is the value of the dependent variable, X is the independent variable, α is the intercept, and β is the coefficient of the independent variable.
• If the equation has only one independent variable the regression is called Simple Regression.
• If multiple independent variables are involved it is called Multiple Regression.
• In public health the most commonly used types of regression analysis are: Linear and Logistic Regression
Linear Regression
• Also known as linear least squares regression.
• It is by far the most widely used modeling method.
• It attempts to model the relationship between the dependent and independent variables by fitting a linear equation to observed data.
• Can be simple or multiple regression.
• The DV (Y) is given on a continuous numeric scale while the IV/s (X) can be of any type (mostly numeric variables).
Linear Regression Cont.
• The dependent variable is assumed to be a linear function of one or more independent variables plus an error term introduced to account for all other factors.
$Y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n + \varepsilon$
• Where Y is the dependent variable, the Xs are the independent variables and ε is the random error term.
• The equation provides the value the DV would have for given value/s of the IV/s.
• For example, if we develop a linear model with the DV of body height and the IV of serum growth hormone (GH), we can predict height for a person with a given value of serum GH.
• A scatter plot is helpful to assess the presence of a linear trend of association.
• Consider the following data showing the number of households in China with TV.
Linear Regression Cont.
• If we plot these data, we get the following graph.
[Figure: scatter plot of the number of households with TV by year]
• How would you draw a line through the points? How do you determine which line 'fits best'?
• Although no straight line passes exactly through these points, there are many straight lines that pass close to them. Here is one of them.
[Figure: the same scatter plot with one candidate straight line drawn through the points]
• Hence, linear regression is a method of finding the linear equation that comes closest to fitting a collection of data points.
• The most common method for fitting a regression line is the method of least squares.
• This method calculates the best fitting line for the observed data by minimizing the sum of the squares of the vertical deviations from each data point to the line.
• "Best fit" means the differences between the actual Y values and the predicted Y values are at a minimum.
• Least squares (LS) minimizes $\sum_{i=1}^{n} \hat{\varepsilon}_i^2 = \hat{\varepsilon}_1^2 + \hat{\varepsilon}_2^2 + \hat{\varepsilon}_3^2 + \hat{\varepsilon}_4^2$
[Figure: fitted line $\hat{Y}_i = \beta_0 + \beta_1 X_i$ with the residuals $\hat{\varepsilon}_1$ to $\hat{\varepsilon}_4$ drawn as vertical deviations; for example $Y_2 = \beta_0 + \beta_1 X_2 + \hat{\varepsilon}_2$]
Linear Regression Cont.
• Suppose that we used the line rather than the data points to estimate the number of households with TV.
• Then we would get slightly different values from the original observed values shown above. These values are called predicted values.
Year (X) (0 represents 2000)    Households with TV (millions), observed    Households with TV (millions), predicted    Residual
0                               68                                         62                                          6
1                               72                                         70                                          2
2                               80                                         78                                          2
3                               83                                         86                                          −3
• The better our choice of line, the closer the predicted values will be to the observed values.
• The difference between the observed value and the predicted value is called the residual.
• Residual = Observed Value − Predicted Value
• The best line is the line with the smallest sum of squares of error (SSE) (i.e. least squares estimation).
• SSE = Sum of squares of residuals = Sum of (y observed − y predicted)²
• The manual calculation of the coefficients of linear regression is possible when we have one independent variable, i.e.: Y = α + βX
• As with correlation analysis, here we should have a set of paired DV and IV values for all study units.
• The line which represents the dataset (Y = α + βX) is calculated using the formulas:
$\beta = \frac{\sum xy - \frac{\sum x \sum y}{n}}{\sum x^2 - \frac{\left(\sum x\right)^2}{n}}$
$\alpha = \bar{y} - \beta\bar{x}$
• Consider the following data.
• First we should plot a scatter diagram.
Linear Regression Cont…
[Figure: scatter plot of the example data; axes: X (0 to 6), Y (0 to 4)]

Linear Regression Cont…
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} X_i Y_i - \frac{\sum_{i=1}^{n} X_i \sum_{i=1}^{n} Y_i}{n}}{\sum_{i=1}^{n} X_i^2 - \frac{\left(\sum_{i=1}^{n} X_i\right)^2}{n}} = \dfrac{37 - \frac{(15)(10)}{5}}{55 - \frac{(15)^2}{5}} = 0.70$
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 2 - (0.70)(3) = -0.10$

Linear Regression Cont…
• One of the indices used to measure model goodness of fit for simple linear regression is R-squared, or the coefficient of determination.
• It is the proportion of variation explained by the best-line model.
• It depends on the ratio of the sum of squared errors from the regression model (SSE) to the sum of squared differences around the mean (SST = sum of squares total):
$R^2 = 1 - \dfrac{SSE}{SST}$

Linear Regression Cont…
• For multiple linear regression, adjusted R-squared is used.
• As a general rule of thumb, the R-squared or adjusted R-squared should be higher than 0.80 to produce a good linear model.
• If your R-squared is less than 0.5, it is recommended that you consider another type of model rather than a linear model.
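The slope and intercept calculation above can be reproduced from the quoted sums alone. In this sketch the sums (n = 5, ΣX = 15, ΣY = 10, ΣXY = 37, ΣX² = 55) come from the slides. The individual x and y values are an assumption: a hypothetical dataset chosen only because it reproduces those sums, used here to illustrate R² = 1 − SSE/SST.

```python
# Sums quoted in the worked example
n, sum_x, sum_y, sum_xy, sum_x2 = 5, 15, 10, 37, 55

# Least-squares slope and intercept from the summary formulas
slope = (sum_xy - sum_x * sum_y / n) / (sum_x2 - sum_x ** 2 / n)
intercept = sum_y / n - slope * sum_x / n
print(round(slope, 2), round(intercept, 2))  # 0.7 -0.1

# ASSUMPTION: hypothetical data that merely match the sums above;
# the original data table is not available.
x = [1, 2, 3, 4, 5]
y = [1, 1, 2, 2, 4]

y_bar = sum(y) / len(y)
y_hat = [intercept + slope * xi for xi in x]

sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))  # unexplained variation
sst = sum((yi - y_bar) ** 2 for yi in y)               # total variation
r_squared = 1 - sse / sst
print(round(r_squared, 3))  # 0.817
```

With these assumed values the line explains about 82% of the variation in Y, just above the 0.80 rule of thumb mentioned in the slides.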
Linear Regression Cont…
Interpretation of linear regression coefficients:
• Let's consider the following simple linear regression equation.
• $Y = \alpha + \beta X$
• $\beta$ represents the slope, and $\alpha$ represents the y-intercept.
• The intercept represents the estimated average value of Y when X equals zero.
• The slope represents the estimated average change in Y when X increases by one unit.
• When the independent variable is binary (coded as 0/1), the slope represents the estimated average change in Y when X switches from 0 to 1.

Linear Regression Cont…
Example 8.3:
• Assume that the duration of breastfeeding in weeks (Y) was found to be positively correlated with maternal age in years (X). A linear regression model was developed to explain the association. The equation is given as Y = 5.92 + 0.389X. How would you explain the equation?

Linear Regression Cont…
Assumptions:
• Normal distribution: Regression assumes that variables have normal distributions.
• Linearity: The relationship between each x and y is linear.
• Homoscedasticity: The variance of the error terms is constant for each value of x.
• No multicollinearity: The independent variables are not correlated with each other.
• Normally distributed error terms: The error terms follow the normal distribution. (Practically less important)
• Independence of error terms: Successive residuals are not correlated.

Linear Regression Cont…
Hypothesis testing in linear regression:
• Questions to be answered through hypothesis testing are:
– Does the entire set of independent variables contribute significantly to the prediction of y?
– Does the addition of one particular variable of interest add significantly to the prediction of y achieved by the other independent variables already in the model?
• The null and alternative hypotheses are given as:
– H0: $\beta_1 = \beta_2 = \dots = \beta_p = 0$
– H1: $\beta_j \neq 0$ for at least one j.
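Returning to Example 8.3 above, the interpretation of the slope can be made concrete by evaluating the fitted equation at two ages one year apart. This is a small sketch; the function name and the ages 25 and 26 are chosen only for illustration.

```python
def predicted_duration(age_years):
    """Fitted breastfeeding duration (weeks) from Example 8.3: Y = 5.92 + 0.389 X."""
    return 5.92 + 0.389 * age_years

# Illustrative ages, one year apart
d25 = predicted_duration(25)
d26 = predicted_duration(26)

print(round(d25, 3))        # 15.645
print(round(d26 - d25, 3))  # 0.389
```

Each additional year of maternal age is associated with an estimated average increase of 0.389 weeks of breastfeeding. The intercept (5.92 weeks) is the fitted value at X = 0, which lies outside any realistic range of maternal age, so it mainly positions the line rather than describing a real mother.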
Linear Regression Cont…
• The F test and the t test are used to test the hypotheses.
• F is a test for the statistical significance of the regression equation as a whole. It is obtained by dividing the explained variance by the unexplained variance. (Given as an ANOVA table)
• The t test is used to see whether a specific variable is significant in explaining the dependent variable or not.

Linear Regression Using SPSS
• Analyze > Regression > Linear Regression > Put the dependent and independent variables > Select appropriate statistics > Ok.

Logistic Regression

Introduction
• Logistic Regression is a model used for predicting the probability of occurrence of a categorical event by fitting data to a Logistic Curve.
• Common dichotomous dependent variables include disease status (healthy or ill), clinical outcome (alive or dead), treatment outcome (success or failure), utilization of health commodities (utilization or non-utilization), etc.
• Application:
– Modeling for risk prediction, identification of determinants and health programming.
– Controlling confounding and interacting factors.
Introduction Cont…
• Comparative advantages of Logistic Regression:
– Fewer assumptions.
– Mathematically amenable.
– Easier interpretation.
• Classification of Logistic Regression (LR):
– Binomial LR: Dependent variable is dichotomous.
– Multinomial LR: Dependent variable with more than two classes.
– Ordinal LR: Dependent variable with multiple and ranked classes.

Logistic Regression Function
• Binary dependent variables are coded as 0 or 1.
• The probability of the distribution is equal to the proportion of 1s in the distribution (P).
• The logistic function associates the Independent Variable (IV) X with the probability of occurrence of the Dependent Variable (DV) Y.
• The function is given as:
$P = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

LR Function Cont…
• The function is represented by an S-shaped "Sigmoid graph", which is called the Logistic Curve.
• Examples:
[Figure: example logistic curves]

LR Function Cont…
• Derivation of the function can be demonstrated with an example.
• Suppose we want to predict a person's sex based on the person's height.
• Let's say the probability of being male at a given height is 0.9.
• Odds (P/(1−P)) of being male = 0.9/0.1 = 9
• Odds of being female = 0.1/0.9 = 0.11
• However, the values look asymmetrical.
• This can be corrected by applying the natural logarithm (ln).
• ln(9) = 2.197 and ln(1/9) = −2.197 (1/9 ≈ 0.11)
• The log of the odds is abbreviated as the Logit.
• The overall transformation is the Logit Transformation.
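The odds and logit arithmetic in the height example above can be verified with a few lines of standard-library Python. The helper names `odds` and `logit` are illustrative; only the probability 0.9 comes from the slides.

```python
import math

def odds(p):
    """Odds corresponding to a probability p: p / (1 - p)."""
    return p / (1 - p)

def logit(p):
    """Natural log of the odds."""
    return math.log(odds(p))

p_male = 0.9  # probability of being male at the given height

print(round(odds(p_male), 2))       # 9.0
print(round(odds(1 - p_male), 2))   # 0.11
print(round(logit(p_male), 3))      # 2.197
print(round(logit(1 - p_male), 3))  # -2.197
```

The odds (9 versus 0.11) are asymmetric around 1, while the two logits are equal in size and opposite in sign, which is the symmetry the logit transformation is meant to provide.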
LR Function Cont…
Mathematically:
$\ln\left(\dfrac{p}{1-p}\right) = \alpha + \beta x$
$\dfrac{p}{1-p} = e^{\alpha + \beta x}$
$P = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
$P = \dfrac{1}{1 + e^{-z}}$, where $z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$

LR Function Cont…
• One of the advantages of Logistic Regression: it is possible to compute the odds ratio (OR) from its coefficient.
• Let's assume a researcher is interested in studying the effect of smoking as the predictor variable (X) on the dependent variable lung cancer (Y).
– X can be present (X=1) or absent (X=0).
– Y can be present (Y=1) or absent (Y=0).

LR Function Cont…
• Hence,
$\log\left[\dfrac{P(Y=1)}{1 - P(Y=1)}\right] = \alpha + \beta X$
$\log[\text{odds}(Y=1 \mid X=1)] = \alpha + \beta(1)$
$\log[\text{odds}(Y=1 \mid X=0)] = \alpha + \beta(0)$
• The OR = Odds of smokers ÷ Odds of nonsmokers
$OR = \dfrac{e^{\alpha + \beta(1)}}{e^{\alpha}} = e^{\beta}$

Assumptions of Logistic Regression
• Logistic Regression has fewer assumptions than Linear Regression:
– The DV need not be normally distributed.
– Normally distributed error terms are not assumed.
– Error terms need not be homoscedastic for each level of the IVs.
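The result OR = e^β derived above can be checked numerically by computing the two odds from the logistic function and dividing them. This is a sketch under stated assumptions: the coefficient values `alpha` and `beta` are hypothetical and are not taken from any dataset in the slides.

```python
import math

def logistic(z):
    """Logistic function P = 1 / (1 + e^(-z))."""
    return 1 / (1 + math.exp(-z))

# ASSUMPTION: hypothetical intercept and coefficient, for illustration only
alpha, beta = -2.0, 1.2

p_smoker = logistic(alpha + beta * 1)     # P(Y=1 | X=1)
p_nonsmoker = logistic(alpha + beta * 0)  # P(Y=1 | X=0)

odds_smoker = p_smoker / (1 - p_smoker)
odds_nonsmoker = p_nonsmoker / (1 - p_nonsmoker)

odds_ratio = odds_smoker / odds_nonsmoker
print(odds_ratio, math.exp(beta))  # the two values agree up to rounding error
```

The intercept cancels in the ratio, which is why the odds ratio depends only on β.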
Assumptions of LR Cont…
But it has the following assumptions:
1. Data type: A dichotomous or polytomous DV.
2. Inclusion of all relevant variables and exclusion of the irrelevant ones, i.e. based on a scientific framework or a statistical cutoff point (P=0.3).
3. No outliers and influential cases: Such cases can affect the model significantly.
4. No interaction: LR doesn't consider interaction effects except when interactions are created as a variable.

Assumptions of LR Cont…
5. No multicollinearity: As the IVs increase in correlation with each other, the standard errors become inflated.
– Examining the correlations and associations between IVs.
– Tolerance and VIF.
– A standard error > 2.0.
6. Linearity:
– Linear relationship between numeric IVs and the logit of the DV.
– If not, the model underestimates the association and lacks power.
– Box-Tidwell Test: If there is nonlinearity for a numeric IV X, the [(X)*ln(X)] interaction term becomes significant in the model.

Assumptions of LR Cont…
7. Large samples:
– The minimum ratio of valid cases to variables should be at least 10:1. The preferred ratio is 20:1.

Fitting Logistic Model to a Dataset
• In Linear Regression, the model is fitted to the dataset through Least Squares Estimation (LSE).
• In Logistic Regression, LSE can't be used.
• In its place, Maximum Likelihood Estimation (MLE) is used.
• MLE relies on the concept of Likelihood.
• The likelihood of a set of data is the probability of obtaining that particular set of data, using a given model.
Fitting Logistic Model Cont…
For example:
• Dataset B has five cases. Observed values for Y are (1, 0, 0, 1, 1).
• The model predicts that the probability of occurrence of Y is 0.7 (i.e. the probability of Y=1 is 0.7 and of Y=0 is 0.3).
• The likelihood of B is the joint probability of predicting the correct observed value of Y for every case using the model.
• i.e. L(B) = (0.7)(0.3)(0.3)(0.7)(0.7) = 0.03087
$L(B) = \prod_{i=1}^{n} P^{y_i} (1 - P)^{1 - y_i}$

Fitting Logistic Model Cont…
• Mathematically, it is easier to work with the log likelihood.
$\ln L(B) = \sum_{i=1}^{n} \left[ y_i \ln(P) + (1 - y_i) \ln(1 - P) \right]$
• Maximum Likelihood picks the values of the model parameters that make the data "more likely" than any other values of the parameters would make them.
• The MLE of the parameter P is that value of P that maximizes L or ln L.
• Once P is determined, $\alpha$ and $\beta$ are estimated.

Fitting Logistic Model Cont…
• Iteration: Repeated testing of the data and tuning of the model parameters to provide the best-fitting equation.

Interpretation of Reg. Coefficients
$P = \dfrac{1}{1 + e^{-z}}$, where $z = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$
• $\alpha$ is called the Intercept and $\beta_1, \beta_2, \dots, \beta_n$ are called the Regression Coefficients of $x_1, x_2, \dots, x_n$, respectively.
• $\alpha$ is the value of z when the value of all risk factors is zero.
• A +ve coefficient means the risk factor increases the probability of the outcome, while a −ve coefficient means the opposite.
• A large coefficient means that the risk factor strongly influences the probability of the outcome, while a near-zero coefficient means the opposite.
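The likelihood example for dataset B above can be reproduced directly, and a simple grid search illustrates what "picking the value of P that maximizes ln L" means. This is a sketch under stated assumptions: the order of the observed values is assumed (only three 1s and two 0s matter for the product), and the grid search is a teaching device, not the iterative algorithm statistical packages use.

```python
import math

# Observed outcomes for dataset B (three 1s and two 0s; order assumed)
y = [1, 0, 0, 1, 1]

def likelihood(p, ys):
    """Joint probability of the observed ys when P(Y=1) = p for every case."""
    return math.prod(p if yi == 1 else 1 - p for yi in ys)

def log_likelihood(p, ys):
    """ln L = sum of y*ln(p) + (1 - y)*ln(1 - p)."""
    return sum(yi * math.log(p) + (1 - yi) * math.log(1 - p) for yi in ys)

print(round(likelihood(0.7, y), 5))  # 0.03087

# Grid search over candidate values of P between 0.01 and 0.99
grid = [i / 100 for i in range(1, 100)]
p_hat = max(grid, key=lambda p: log_likelihood(p, y))
print(p_hat)  # 0.6
```

The maximizing value 0.6 is simply the observed proportion of 1s (3 out of 5), so the model value P = 0.7 used in the example is close to, but not exactly, the maximum likelihood estimate for these five cases.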
Hypothesis Testing in Logistic Reg.
• In Logistic Regression, the t or F test statistic cannot be used for hypothesis testing since the dependent variable follows a Bernoulli distribution.
• Options:
– The (log) Likelihood Ratio Statistic (−2LL).
– The Wald Test.
• Both test either of the following null hypotheses:
– Ho: $\beta_1 = \beta_2 = \beta_3 = \dots = \beta_n = 0$
– Ho: Removing an IV from the model doesn't change its predictive ability.

Hypothesis Testing LR Cont…
A. Likelihood Ratio Test Statistic (−2LL):
• Usually two nested models (the Full and Reduced Models) are compared.
• A reduced model means a model from which a variable is purposely omitted.
• Ho: The removed variable is not significant in the model.
$\text{LR statistic} = -2 \log\left(\dfrac{L \text{ of the reduced model}}{L \text{ of the full model}}\right)$
$-2 \log L = -2\,[\log L_{\text{Reduced model}} - \log L_{\text{Full model}}]$

Hypothesis Testing LR Cont…
• LRT ~ $\chi^2$ with df = number of removed variables.
• If the reduced model explains the data as well as the full model, the difference will be close to 0: do not reject the Ho that the removed variable is nonsignificant.
• If the full model explains the data "much better" than the reduced model, the difference will be "large": reject the Ho that the removed variable is nonsignificant.

Hypothesis Testing LR Cont…
B. Wald Statistic:
• Commonly used to test the significance of the coefficient of each independent variable.
$\text{Wald test} = \dfrac{\beta^2}{\text{Variance of } \beta}$
• Ho: A particular coefficient is zero.
• W ~ $\chi^2$ with df of 1.
• For a particular IV, if W is significant, then the parameter associated with this variable is not zero, so the variable should be included in the model.
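The two test statistics above can be computed in a few lines. In this sketch every number is hypothetical: the two log-likelihoods, the coefficient and its standard error are made up for illustration, and exactly one variable is assumed to have been removed so that both statistics have 1 df. For 1 df the chi-square upper-tail probability equals erfc(sqrt(x/2)), which keeps the example within the standard library.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(x / 2))

# ASSUMPTION: hypothetical log-likelihoods of two nested models
ll_full = -55.0     # model including the variable of interest
ll_reduced = -58.5  # same model with that one variable removed

# LR statistic = -2 [log L(reduced) - log L(full)], df = 1 removed variable
lr_stat = -2 * (ll_reduced - ll_full)
p_lr = chi2_sf_df1(lr_stat)

# ASSUMPTION: hypothetical coefficient estimate and standard error
beta_hat, se_beta = 0.8, 0.3

# Wald statistic = beta^2 / variance of beta, df = 1
wald = beta_hat ** 2 / se_beta ** 2
p_wald = chi2_sf_df1(wald)

print(lr_stat, p_lr)
print(wald, p_wald)
```

With these made-up inputs both P values fall below 0.05, so the removed variable would be judged significant and kept in the model. When more than one variable is removed, the chi-square distribution with the matching df is needed instead of the 1-df shortcut used here.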
Pseudo R-Squares
• In Linear Regression, $R^2$ measures the proportion of variance of the DV explained by the predictors.
• It ranges from 0 to 1.
• Logistic Regression doesn't have an equivalent to $R^2$.
• However, there are varieties of Pseudo $R^2$ which are designed to simulate the real $R^2$.
• Commonly used: Cox & Snell $R^2$ and Nagelkerke $R^2$.
• Pseudo $R^2$ doesn't mean exactly what $R^2$ means in Linear Regression: interpretation should be made with caution.

Pseudo R-Squares Cont…
A. Cox and Snell's Pseudo $R^2$
$R^2 = 1 - \left[\dfrac{L(M_{\text{Intercept}})}{L(M_{\text{Full}})}\right]^{2/N}$
B. Nagelkerke Pseudo $R^2$
$R^2 = \dfrac{1 - \left[\dfrac{L(M_{\text{Intercept}})}{L(M_{\text{Full}})}\right]^{2/N}}{1 - L(M_{\text{Intercept}})^{2/N}}$

Goodness of Fit Analysis
A. Hosmer-Lemeshow Statistic
• The recommended test for the overall fit of a Logistic Regression model.
• A type of chi-square test, but considered stronger than the traditional chi-square test, particularly if continuous covariates are in the model or the sample size is small.
• The HL statistic first sorts observations in increasing order of their estimated event probability and divides them into deciles based on the predicted probabilities.
• A P value of 0.05 is considered the level of significance.

Goodness of Fit Analysis Cont…
$G^2_{HL} = \sum_{j=1}^{10} \dfrac{(O_j - E_j)^2}{E_j \left(1 - \frac{E_j}{n_j}\right)} \approx \chi^2$ with df of 8
• Where:
– $n_j$ is the number of observations in the jth group
– $O_j$ is the observed number of cases in the jth group
– $E_j$ is the expected number of cases in the jth group
• Nonsignificance means the model adequately fits the data.
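The two pseudo R² formulas above are usually evaluated from log-likelihoods rather than raw likelihoods, since [L0/L1]^(2/N) = exp((2/N)(ln L0 − ln L1)). The sketch below takes that route; the function names and all three input numbers are hypothetical.

```python
import math

def cox_snell_r2(ll_intercept, ll_full, n):
    """Cox & Snell: 1 - [L(intercept) / L(full)]^(2/N), from log-likelihoods."""
    return 1 - math.exp((2 / n) * (ll_intercept - ll_full))

def nagelkerke_r2(ll_intercept, ll_full, n):
    """Nagelkerke: Cox & Snell divided by its maximum, 1 - L(intercept)^(2/N)."""
    max_r2 = 1 - math.exp((2 / n) * ll_intercept)
    return cox_snell_r2(ll_intercept, ll_full, n) / max_r2

# ASSUMPTION: hypothetical log-likelihoods and sample size
ll_intercept, ll_full, n = -68.0, -55.0, 100

cs = cox_snell_r2(ll_intercept, ll_full, n)
nk = nagelkerke_r2(ll_intercept, ll_full, n)
print(round(cs, 3), round(nk, 3))
```

Because the Cox & Snell value cannot reach 1, the Nagelkerke rescaling always gives the larger of the two numbers for the same model.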
Goodness of Fit Analysis Cont…
B. Log-likelihood Statistics
• A good model is one that results in a high likelihood of the observed results.
• This translates into a small value for −2LL.
• If a model fits perfectly, the −2LL would be 0.
• Since there is no accepted upper cutoff point for the −2LL test, it is difficult to interpret the meaning of the score.
• Less commonly used.

Logistic Regression Using SPSS
• Analyze > Regression > Binary Logistic > Put the dependent and independent variables > Mark categorical independent variables > Check the options > Ok.
Or
• Analyze > Regression > Multinomial Logistic > Put the dependent variable > Put the independent variables as factors or covariates depending on their nature > Check the available options > Ok.

ANOVA

Analysis of Variance (ANOVA)
• Used to compare the mean of a quantitative variable across different categories of a categorical variable.
• If the categorical variable has only 2 values, a two-sample t-test can be used.
• ANOVA allows for comparison among 3 or more groups.
• ANOVA is helpful because it has an advantage over the two-sample t-test.
• Doing multiple two-sample t-tests would result in a largely increased chance of committing a type I error.
• The specific type is called One-way ANOVA.
• If two covariates are involved, it is called Two-way ANOVA.
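The point about multiple two-sample t-tests can be illustrated with a rough calculation. The sketch below assumes, as a simplification, that the pairwise tests are independent, so the chance of at least one false positive among m tests is 1 − (1 − α)^m. Pairwise tests on shared groups are not truly independent, so these numbers are only approximate.

```python
from math import comb

def familywise_error(k_groups, alpha=0.05):
    """Approximate chance of at least one type I error across all pairwise
    t-tests among k groups, assuming the tests are independent."""
    m = comb(k_groups, 2)  # number of pairwise comparisons
    return m, 1 - (1 - alpha) ** m

for k in (3, 4, 5):
    m, fwer = familywise_error(k)
    print(k, m, round(fwer, 3))
# 3 groups ->  3 tests -> about 0.143
# 4 groups ->  6 tests -> about 0.265
# 5 groups -> 10 tests -> about 0.401
```

A single ANOVA F test keeps the overall type I error at the chosen level, which is the advantage over repeated t-tests that the slide refers to.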
.. – Between group variation – Within group variation 455 ANOVA Cont… Between group variation: • Is there some variation between the groups? • Sometimes called the variation due to the factor. • ANOVA measures two sources of variation in the data and compares their relative sizes.. ANOVA Cont… Assumptions of ANOVA: • Each group is approximately normally distributed.n n ( x n − x ) 2 456 114 . • Hypothesis: – H0: The means of all the groups are equal. • Observed data constitute independent random samples from the respective population.. • Denoted SS(B) for Sum of Squares (variation) between the groups. • Once a global difference is detected. it should be follow up with “multiple comparisons” (Post hoc test) to identify 453 specific differences. – H1: Not all the means are equal. • Sum of square (SS) is another name for variation. • Doesn’t explain which ones differs. • Calculated as follows (given x double bar is the grand mean): SS ( B ) = k i =1 ni(x i− x )2 SS ( B) = k i =1 n1 ( x 1 − x ) 2 + n 2 ( x 2 − x ) 2 .. • Standard deviations of each group are approximately equal – Rule of thumb: ratio of largest to smallest sample standard deviation must be less than 2:1 454 ANOVA Cont… • ANOVA is a technique whereby the total variation present in a dataset is segregated into several components.3/1/2010 ANOVA Cont… • ANOVA functions by checking whether the differences between the groups are significant depends on: – The difference in the means – The standard deviations of each group – The sample sizes • ANOVA determines Pvalue from the F statistic.... • Variation is the sum of the squares of the deviations between a value and the mean of the value.
ANOVA Cont…
Within group variation:
• Is there some variation within the groups?
• Sometimes called the error variation, as it is the variation that can't be explained by the factor.
• Denoted SS(W) for Sum of Squares (variation) within the groups.
• Calculated as follows, where $n_i$ is the sample size and $s_i$ the standard deviation of group i:
$SS(W) = \sum_{i=1}^{k} (n_i - 1)(s_i)^2$
$SS(W) = (n_1 - 1)(s_1)^2 + (n_2 - 1)(s_2)^2 + \dots + (n_k - 1)(s_k)^2$

ANOVA Cont…
Variance:
• Based on the variation (SS), variance is calculated for both categories.
• The variance is also called the Mean of the Squares and abbreviated MS, often with an accompanying label: MS(B) or MS(W).
• Calculated by dividing the variation by the df.
• MS = SS / df
• The between group df is one less than the number of groups (k−1).
• The within group df is the sum of the individual dfs of each group, or in other words (n−k).

ANOVA Cont…
The F distribution:
• Used as the test of significance in ANOVA.
• The F distribution is defined as the distribution of $(Z/n_1)/(W/n_2)$, where Z has a chi-square distribution with $n_1$ df, W has a chi-square distribution with $n_2$ df, and Z and W are statistically independent.
• In ANOVA, the F test statistic is the ratio of two sample variances (MSB/MSW).
• The df for the numerator are the df for the between group (k−1) and the df for the denominator are the df for the within group (n−k).
• A large F is evidence against H0, since it indicates that there is more difference between groups than within groups.

ANOVA Cont…
Example:
• Suppose we have three groups:
– Group 1: 5.3, 6.0, 6.7
– Group 2: 5.5, 6.2, 6.4, 5.7
– Group 3: 7.5, 7.2, 7.9
• Then we compute the ANOVA F statistic in the following manner.
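The quantities defined above (SS(B), SS(W), MS and F) can be computed for the three example groups with plain Python. This is a minimal sketch, not the SPSS procedure. SS(W) is computed here as the sum of squared deviations from each group mean, which is algebraically the same as summing (n_i − 1)·s_i².

```python
groups = [
    [5.3, 6.0, 6.7],       # Group 1
    [5.5, 6.2, 6.4, 5.7],  # Group 2
    [7.5, 7.2, 7.9],       # Group 3
]

def mean(values):
    return sum(values) / len(values)

all_values = [x for g in groups for x in g]
n, k = len(all_values), len(groups)  # 10 observations, 3 groups
grand_mean = mean(all_values)        # 6.44

# Between-group variation: n_i * (group mean - grand mean)^2, summed over groups
ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)

# Within-group variation: squared deviations from each group's own mean
ss_within = sum(sum((x - mean(g)) ** 2 for x in g) for g in groups)

ms_between = ss_between / (k - 1)  # df = k - 1 = 2
ms_within = ss_within / (n - k)    # df = n - k = 7
f_stat = ms_between / ms_within

print(round(ss_between, 6), round(ss_within, 6))  # 5.127333 1.756667
print(round(f_stat, 5))                           # 10.21575
```

The F statistic is then compared with the F distribution on (2, 7) df; a value this large is evidence against the hypothesis that the three group means are equal.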
ANOVA Cont…
Overall mean: 6.44

                          WITHIN difference:        BETWEEN difference:
                          data − group mean         group mean − overall mean
data   group  group mean  plain     squared         plain     squared
5.3    1      6.00        −0.70     0.490           −0.4      0.194
6.0    1      6.00        0.00      0.000           −0.4      0.194
6.7    1      6.00        0.70      0.490           −0.4      0.194
5.5    2      5.95        −0.45     0.203           −0.5      0.240
6.2    2      5.95        0.25      0.063           −0.5      0.240
6.4    2      5.95        0.45      0.203           −0.5      0.240
5.7    2      5.95        −0.25     0.063           −0.5      0.240
7.5    3      7.53        −0.03     0.001           1.1       1.188
7.2    3      7.53        −0.33     0.109           1.1       1.188
7.9    3      7.53        0.37      0.137           1.1       1.188
TOTAL                               1.757                     5.106
TOTAL/df                            0.25095714                2.55275

F = MS(B) / MS(W) = 2.5637 / 0.2510 = 10.216 (from the unrounded sums of squares in the ANOVA table below; the rounded hand calculation above gives 2.5528 / 0.2510 ≈ 10.17)

ANOVA Cont…
ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        5.127333   2    2.563667   10.21575   0.008394   4.737416
Within Groups         1.756667   7    0.250952
Total                 6.884      9

• Between Groups df: 1 less than the number of groups.
• Within Groups df: number of data values − number of groups (equals the df for each group added together).
• Total df: 1 less than the number of individuals (just like other situations).

ANOVA Using SPSS
• Analyze > Compare means > One way ANOVA > Put the continuous variable under "Dependent list" > Put the categorical variable under "Factor" > Select "Post hoc" tests > Ok.

Thank You