
Advanced Biostatistics

Kedir Hussein Abegaz


Biostatistics and Health informatics (Asst Professor)
Madda Walabu University
Tel: +251-913-012630
Email: kedir6300@gmail.com
1. Introduction
1.1 Statistics

Statistics is defined in two senses:

1. As statistical data: it is a numerical representation of things.
2. As statistical method: it is a field of study that deals with COPAID (the collection, organization, presentation, analysis and interpretation of data).
It helps us to know the object under study in a better way.
– Statistical methods include:
1. Designing studies
2. Collecting data
3. Presenting data
4. Summarizing data
5. Drawing inferences
What is Biostatistics?
• It is the application of statistical methods to
the biological and life sciences.
Rationale of statistics

– Public health and medicine are becoming increasingly quantitative
• Statistics is the language of assembling and handling this quantitative material.

– The planning, conduct, and interpretation of much of medical research are becoming increasingly reliant on statistical technology
Rationale…

Among the rationale:

– To enlarge our knowledge of complex phenomena
– To present facts in a definite and precise form
– Data reduction, comparison, estimation
– Measuring the magnitude of variations in data
– Formulating and testing hypotheses
– Studying the relationship between variables
– Forecasting future events
Limitations of statistics
• As a science, statistics has its own limitations:
– It deals only with quantitative information
– It deals with aggregates of facts, not with individual data items
– Data are approximately, not mathematically, correct
– Statistics can easily be misused
1.2 Data

• Data: The Key Component of a Study

– More important than the methods used in the analysis are the use of an appropriate study design and the proper definition and measurement of the study variables.
• No good study without good data!
1.3 Design

• Design: The Road to Relevant Data

• Obtaining relevant data requires a carefully drawn plan that identifies
– the population of interest
– the procedure used to select study units
– the process used in the measurement of the attributes of interest.
Design…
• Standard methods of data collection:
1. Surveys: deal with ways to select a random sample that is representative of the population of interest and from which a valid inference can be made.
2. Experiments: involve the creation of a plan for determining whether or not there are differences between groups.
3. Records: provide ready-made data for routine and continuous information.
Design…

• Sometimes we also analyze data that were already collected.
• In this case, we need to understand how the data were collected
– in order to determine the appropriate methods of analysis
1.4 Replication

• Replication: Part of the Scientific Method
– Statistical analysis of data may demonstrate that there is a high probability of an association between two variables.
– However, a single study rarely provides proof that such an association exists.
– Results must be replicated by additional studies that eliminate other factors that could have accounted for the relationship observed between the study variables.
1.5 Applying Statistical Methods

• It requires more than the ability to use statistical software or derive formulas.
– It requires understanding the context in which statistical procedures are used (the study’s goal, the data, and how the data were collected and measured)
• Think, instead of simply memorizing formulas and statistical software
1.6 Scales of measurement
• Observations and variables:
– A variable is a characteristic under study that assumes different values for different elements, e.g. blood pressure, age, sex, …
– In statistics, we observe or measure characteristics, called variables, of study subjects, called observational units.

• The main divisions are qualitative (categorical) and quantitative (numerical) variables.
Scales of measurement…

• Qualitative variable: a variable which can’t be measured in quantitative form, but can only be identified by name or category
– E.g. place of birth, types of drug, stages of breast cancer (I, II, III, or IV), degree of pain (minimal, moderate, severe), …
Scales of measurement…

• Quantitative variable: a variable that can be measured and expressed numerically. It can be of two types (discrete or continuous).
– The values of a discrete variable are usually whole numbers, e.g. the number of episodes of diarrhea in the first five years of life.

– A continuous variable is a measurement on a continuous scale, e.g. weight, height, blood pressure, age, etc.
Types of measurement scales
• Nominal
– Data that represent categories or names
– There is no implied order to the categories of
nominal data.
– No arithmetic and relational operation can be
applied.
– E.g.
• Blood type (A, B, O and AB)
• Eye color (brown, black, blue, etc.)
• Sex (Male, Female)

Types of measurement scales…
• Ordinal
– Categories that can be ranked, but differences between ranks are not meaningful
– Arithmetic operations are not applicable, but relational operations are.
– Ordering is the sole property of the ordinal scale.
– E.g.
• Degree of pain (minimal, moderate, severe)
• Rating scales (Excellent, Very good, Good, Fair, Poor)
• Letter grade (A, B, C, D and F)
Types of measurement scales…
• Interval
– Data that can be ranked and whose differences are meaningful. However, there is no meaningful zero, so ratios are meaningless.
– Relational operations and arithmetic on differences (addition and subtraction) are possible; ratios of values (division) are not.
– E.g.
• IQ
• Temperature in degrees Fahrenheit (30°F is not twice as hot as 15°F)
Types of measurement scales…

• Ratio
– Data can be ranked, differences are meaningful, and there is a true zero.
– All arithmetic and relational operations are applicable.
– E.g.
• Age (a 30-year-old individual is twice as old as a 15-year-old)
• Weight (0 kg means no weight)
• Number of drugs (0 means no drug)
1.7 Sources of data

Two sources: primary and secondary

1. Primary data: data collected by the user directly from the source.

– Methods of collection
• Personal interview (telephone, face-to-face, …)
• Focus group discussion (FGD)
• Mail questionnaires
• Door-to-door survey
• New product registration
Sources of data…

2. Secondary data: data gathered or compiled from published and unpublished sources.
– From journals, reports, government publications, publications of professionals and research organizations.

– E.g. - CSA: Central Statistical Agency
- DHS: Demographic and Health Survey
- HDS: Health and Demographic Surveillance


1.8 Division of statistics

Depending on how data can be used:

• Descriptive statistics (exploratory): is concerned with summary calculations, graphs, charts and tables about a given data set.

• Inferential statistics (confirmatory): is a method used to generalize from a sample to a population.
– sometimes called analytical statistics
1.9 Stages in statistical investigation

Five stages

• Collection of data: the process of measuring, gathering and assembling the raw data upon which the investigation is to be based.

• Organization of data: summarization of data in some meaningful way, e.g. in table form
Stages in…
• Presentation of data: the process of reorganization, classification, compilation… of data to present it in a meaningful form.
• Analysis of data: the process of extracting relevant information from the summarized data
• Inference of data: the interpretation and further observation of the various statistical measures obtained through the analysis of the data
– and the implementation of those methods by which conclusions are formed and inferences made.
1.10 Types of questions

1. Open-ended questions
• Permit free responses
• Do not offer a list of possible answers to choose from.
• Mostly used for investigation of
– Facts with which the researcher is not familiar
– Opinions, attitudes, and suggestions of informants
– Sensitive issues
Types of questions…Example
• Can you describe exactly what the traditional birth attendant did when your labor started?

• What sensations did you experience during your cataract surgery?

• How do you feel when your baby’s diarrhea does not stop?
Types of questions…

2. Close-ended Questions
• Offer a list of possible options/answers
• When designing closed questions you should try to:
– Make lists that are complete (exhaustive) and mutually exclusive (no two options can apply at the same time)
– Keep the number of options as small as possible
• They are useful if the range of possible responses is known
Types of questions…Example
• What is your marital status?
1. Single
2. Married/living together
3. Separated/divorced/widowed

• Have you ever gone to the local village health worker for treatment?
1. Yes
2. No
Steps in designing questionnaire

1. Content
Decide what questions will be needed to measure your variables and reach your objectives

2. Formulating Questions
Questions should be specific and precise enough that respondents do not interpret them differently
Steps…

3. Sequencing of Questions
The sequence should be logical for the respondent

4. Formatting the Questionnaire
It should be not only consumer friendly but also user friendly

5. Translation
If the interview will be conducted in one or more local languages, translate the questionnaire
2. Data presentation
Data presentation

• Having collected and edited the data, the next step is to organize it.
• That is, to present it in a clear and condensed form
• The presentation of data is classified into two types:
1. Tabular
2. Diagrammatic
Tabular presentation

• Frequency distribution: the organization of raw data in table form using classes and frequencies
• There are three basic types of frequency distributions
– Categorical frequency distribution
– Ungrouped frequency distribution
– Grouped frequency distribution
Categorical frequency distribution

• Used for data that can be placed in specific categories, such as nominal or ordinal data.
E.g. a researcher collected the following data on marital status for 25 patients.
(M=married, S=single, W=widowed and D=divorced)
Question: Present the given data in table form
M S D W D
S S M M M
W D S M M
W D D S S
S W W D D
Solution
Make a table as shown
Class Tally Frequency Percent
M ////// 6 24%
S /////// 7 28%
D /////// 7 28%
W ///// 5 20%
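The tally above can be checked with a short script. This is only a sketch: the helper name `categorical_frequency` is introduced here for illustration and is not part of the original material.

```python
from collections import Counter

# Marital status codes copied row by row from the example (25 patients).
statuses = (
    "M S D W D "
    "S S M M M "
    "W D S M M "
    "W D D S S "
    "S W W D D"
).split()


def categorical_frequency(values):
    """Return {category: (frequency, percent)} for a list of labels."""
    counts = Counter(values)
    total = len(values)
    return {cat: (n, 100 * n / total) for cat, n in counts.items()}


table = categorical_frequency(statuses)
for cat in "MSDW":
    freq, pct = table[cat]
    print(f"{cat}  {freq:2d}  {pct:.0f}%")
```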
Ungrouped frequency distribution

• Is a table of all the potential raw score values together with their frequencies
• Often constructed for small data sets or for data on a discrete variable.
E.g. The following data represent the weight of 12 clients in a nutrition consulting clinic.
80 76 90
70 60 62
63 60 63
76 70 70
Construct an ungrouped frequency distribution
Solution

Make a table as shown


Weight Tally Frequency
60 // 2
62 / 1
63 // 2
70 /// 3
76 // 2
80 / 1
90 / 1
Grouped frequency Distribution

• When the range of the data is large, the data must be grouped into classes that are more than one unit in width
Example: Construct a frequency distribution
for the following data on.
11 29 6 33 14 31 22 27 19 20
18 17 22 38 23 21 26 34 39 27

N.B: After many steps… next page


(Reading Assignment on steps to do)
Solution

Make table as follows

Class limit   Class boundary   Class mark   Freq.   CF(<)   CF(>)
6-11          5.5-11.5         8.5          2       2       20
12-17         11.5-17.5        14.5         2       4       18
18-23         17.5-23.5        20.5         7       11      16
24-29         23.5-29.5        26.5         4       15      9
30-35         29.5-35.5        32.5         3       18      5
36-41         35.5-41.5        38.5         2       20      2
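As a rough check of the table, the grouping can be sketched in Python. The starting limit (6), the class width (6) and the number of classes (6) are read off the table; the function name `grouped_frequency` is a hypothetical helper, and the code assumes integer data so that boundaries lie 0.5 beyond the class limits.

```python
data = [11, 29, 6, 33, 14, 31, 22, 27, 19, 20,
        18, 17, 22, 38, 23, 21, 26, 34, 39, 27]


def grouped_frequency(values, lower, width, n_classes):
    """Build class limits, boundaries, marks, frequencies and both
    cumulative frequencies for integer-valued data."""
    rows = []
    cumulative = 0
    n = len(values)
    for k in range(n_classes):
        lo = lower + k * width
        hi = lo + width - 1
        freq = sum(lo <= v <= hi for v in values)
        rows.append({
            "limits": (lo, hi),
            "boundaries": (lo - 0.5, hi + 0.5),
            "mark": (lo + hi) / 2,
            "freq": freq,
            "cf_less": cumulative + freq,   # "less than" cumulative frequency
            "cf_more": n - cumulative,      # "more than" cumulative frequency
        })
        cumulative += freq
    return rows


rows = grouped_frequency(data, lower=6, width=6, n_classes=6)
for r in rows:
    print(r["limits"], r["boundaries"], r["mark"],
          r["freq"], r["cf_less"], r["cf_more"])
```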
Diagrammatic and Graphic
presentation
• presenting data in visual displays
• Importance
– They have greater attraction.
– They facilitate comparison
– They are easily understandable
• The commonly used diagrammatic
presentation for discrete as well as
qualitative data are:
– Pie charts, Bar charts, Pictogram, map…
Pie chart

• A pie chart is a circle that is divided into sections according to the percentage of frequencies in each category of the distribution.
Example: Draw a pie chart to represent the
following OPD Patients of the year 2018 in the
given hospital.
Men Women Girls Boys
2500 2000 4000 1500
Solution
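• First make a table of the percentage and the angle for each category:

Class    Frequency   Percent   Degree
Men      2500        25%       90°
Women    2000        20%       72°
Girls    4000        40%       144°
Boys     1500        15%       54°
Total    10000       100%      360°

• Then draw the pie chart, giving each category a sector with its angle.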
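The percentages and sector angles needed for the pie chart follow from each category's share of the total. A minimal sketch, assuming a hypothetical helper `pie_table` (not from the slides):

```python
# OPD patient counts from the example.
patients = {"Men": 2500, "Women": 2000, "Girls": 4000, "Boys": 1500}


def pie_table(counts):
    """Return {category: (percent, angle in degrees)}."""
    total = sum(counts.values())
    return {k: (100 * v / total, 360 * v / total) for k, v in counts.items()}


for category, (percent, angle) in pie_table(patients).items():
    print(f"{category:6s} {percent:5.1f}%  {angle:6.1f} degrees")
```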
Bar chart
• Is the most widely used graphical method for describing qualitative data.
• A set of bars representing some magnitude over time or space.
• The common types of bar chart
– Simple
– Multiple
– Component … etc
Simple bar chart

E.g. Distribution of decayed teeth among children of a primary school
Multiple bar chart
E.g. Distribution of marital status by sex

[Figure: multiple bar chart of percent (y-axis, 0 to 60) by marital status (x-axis: Single, Married, Divorced, Widowed), with separate bars for Male and Female]
Graphical presentation of data
• The commonly used graphs for
continuous data are
– histogram,
– Frequency polygon
– Ogive (CF graph)…
Histogram
• A graph which displays the data by using
vertical bars of various heights to represent
frequencies.
• Class boundaries are placed along the horizontal
axes.
• Example: Construct a histogram to represent
the previous data
– i.e., (example on grouped freq.distrib.)
Solution

[Figure: histogram with class boundaries (5.5, 11.5, 17.5, 23.5, 29.5, 35.5, 41.5) on the x-axis and frequency on the y-axis]


Frequency Polygon

• It is a line graph in which
– the frequency is placed along the vertical axis and the class marks along the horizontal axis
• Example: draw a line graph for the above example on the histogram
Solution
Class marks are on the x-axis
3. MCT and MV
MCT (Measures of central tendency)

• Useful in data editing as well as in aiding our understanding of the data
• Sometimes called averages

• Objectives
– To understand the data easily
– To facilitate comparison
– To make further statistical analysis
Types of MCT

• The Mean (Arithmetic, Geometric and Harmonic)
• The Mode
• The Median
• Quantiles (Quartiles, deciles and percentiles)

• The choice among these averages depends on which best fits the property under discussion.
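The Mean ($\bar{X}$)
• The Arithmetic Mean:
• Is defined as the sum of the magnitudes of the items divided by the number of items
• The mean of $X_1, X_2, X_3, \dots, X_n$ is denoted by A.M, $m$ or $\bar{X}$ and is given by:

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n}, \quad \text{or} \quad \bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$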
Mean for Ungrouped data

• Example: Obtain the mean age of the following ages of children at a pediatric clinic
2, 7, 8, 2, 7, 3, 7
• Solution:

$$\bar{X} = \frac{2 + 7 + 8 + 2 + 7 + 3 + 7}{7} = \frac{36}{7} \approx 5.14$$
Mean for grouped data

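• If data are given in the form of a continuous frequency distribution, then the mean is

$$\bar{X} = \frac{\sum_{i=1}^{k} f_i X_i}{\sum_{i=1}^{k} f_i}$$

where $X_i$ is the class mark and $f_i$ the frequency of the $i$-th class.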
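Example: Calculate the mean for the following data

Class    Frequency
6-10     35
11-15    23
16-20    15
21-25    12
26-30    9
31-35    6

Solution: using the class marks 8, 13, 18, 23, 28 and 33, $\sum f_i = 100$ and $\sum f_i X_i = 1575$, so $\bar{X} = 1575/100 = 15.75$.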
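The grouped mean weights each class mark by its frequency. A small sketch of that calculation for the example data; `grouped_mean` is a hypothetical helper name, and class marks are taken as the midpoints of the class limits.

```python
# (class limits, frequency) pairs from the example.
classes = [((6, 10), 35), ((11, 15), 23), ((16, 20), 15),
           ((21, 25), 12), ((26, 30), 9), ((31, 35), 6)]


def grouped_mean(classes):
    """Mean of a frequency distribution: sum(f * class mark) / sum(f)."""
    total_f = sum(f for _, f in classes)
    total_fx = sum(f * (lo + hi) / 2 for (lo, hi), f in classes)
    return total_fx / total_f


print(grouped_mean(classes))
```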
The Mode ( X̂)

• The mode is the value which occurs most frequently in a set of values
• The mode may not exist and even if it does exist, it may not be unique.
• In the case of a discrete distribution, the value having the maximum frequency is the modal value.
• The mode of a set of numbers $X_1, X_2, X_3, \dots, X_n$ is usually denoted by $\hat{X}$
Examples:

1. Find the mode of 5, 3, 5, 8, 9.
Mode = 5
2. Find the mode of 8, 9, 9, 7, 8, 2, and 5.
The data are bimodal: the modes are 8 and 9.
3. Find the mode of 4, 12, 3, 6, and 7.
There is no mode for this data set.
The Median ($\tilde{X}$)
• In a distribution, the median is the value of the variable which divides it into two equal halves.
• In an ordered series of data, the median is the observation lying exactly in the middle of the series.
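Example:
Find the median of the following numbers.
a) 6, 5, 2, 8, 9, 4
b) 2, 1, 8, 3, 5
Solution:
a) First order the data: 2, 4, 5, 6, 8, 9
Here n = 6, which is even, so the median is the average of the 3rd and 4th ordered values: $\tilde{X} = (5 + 6)/2 = 5.5$
b) Order the data: 1, 2, 3, 5, 8
Here n = 5, which is odd, so the median is the 3rd ordered value: $\tilde{X} = 3$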
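The mean, mode and median examples above can be reproduced with Python's standard `statistics` module. The `modes` helper below is a hypothetical function written for this sketch; it follows the convention used in the mode examples, where a data set in which no value repeats is said to have no mode.

```python
import statistics
from collections import Counter


def modes(values):
    """Return the sorted list of modes; empty if no value repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


ages = [2, 7, 8, 2, 7, 3, 7]
print(round(statistics.mean(ages), 2))        # mean age of the children

print(modes([5, 3, 5, 8, 9]))                 # single mode
print(modes([8, 9, 9, 7, 8, 2, 5]))           # bimodal data
print(modes([4, 12, 3, 6, 7]))                # no mode

print(statistics.median([6, 5, 2, 8, 9, 4]))  # even n
print(statistics.median([2, 1, 8, 3, 5]))     # odd n
```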
MV: Measures of variation

• The spread of the items of a distribution is known as dispersion or variation.
• In other words, the degree to which numerical data tend to spread about an average value is called the dispersion or variation of the data.
Objectives of measures of variation

• To judge the reliability of MCT
• To control variability itself
• To compare two or more groups of numbers in terms of their variability
• To make further statistical analysis
Types of Measures of Dispersion

• The most commonly used measures of dispersion are:
– Range and relative range
– Standard deviation and coefficient of variation
– Quartile deviation and coefficient of quartile deviation
The Range

• The range is the largest score minus the smallest score.
• It is a quick and rough measure of variability.
• It is greatly affected by extreme scores.
• R = L - S, where L = largest and S = smallest score

Example: 32 35 36 42 42 43 43 45
Range is 45 - 32 = 13
Mean Deviation

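• Is the arithmetic mean of the absolute deviations of the values from a given average
• Depending on the type of average used, we have different mean deviations
• The mean deviation about an average $A$, for raw data and for a frequency distribution respectively, is as follows:

$$MD = \frac{\sum_{i=1}^{n} |X_i - A|}{n}, \qquad MD = \frac{\sum_{i=1}^{k} f_i |X_i - A|}{\sum_{i=1}^{k} f_i}$$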
The variance and standard deviation
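Population Variance:
• If we divide the variation (the sum of squared deviations from the mean) by the number of values in the population, we get the population variance.
• This variance is the "average squared deviation from the mean"

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

• And for a frequency distribution

$$\sigma^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \mu)^2}{N}, \quad N = \sum_{i=1}^{k} f_i$$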
Sample Variance

• One would expect the sample variance simply to be the population variance with the population mean replaced by the sample mean.
• However, one of the major uses of statistics is to estimate the corresponding parameter, and dividing by the sample size tends to underestimate the population variance.
• To counteract this, the sum of the squares of the deviations is divided by one less than the sample size
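Sample variance formula
For raw data:

$$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n - 1}, \quad \text{or}$$

$$S^2 = \frac{\sum_{i=1}^{n} X_i^2 - n\bar{X}^2}{n - 1}, \quad \text{shorthand formula}$$

For frequency distribution:

$$S^2 = \frac{\sum_{i=1}^{k} f_i (X_i - \bar{X})^2}{n - 1}, \quad \text{or}$$

$$S^2 = \frac{\sum_{i=1}^{k} f_i X_i^2 - n\bar{X}^2}{n - 1}, \quad \text{shorthand formula}$$

where $n = \sum_{i=1}^{k} f_i$.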
Standard deviation

• It is the square root of the variance
• Population standard deviation: $\sigma = \sqrt{\sigma^2}$
• Sample standard deviation: $S = \sqrt{S^2}$
Examples:
• Find the variance and standard deviation of
the following sample data
1. 5, 17, 12, 10.
2. The data is given in the form of frequency
distribution
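For the first data set (5, 17, 12, 10), the sample variance can be computed both from the definition and from the shorthand formula. The two function names below are hypothetical helpers for this sketch; the standard-library `statistics.variance` is used only as a cross-check.

```python
import math
import statistics

sample = [5, 17, 12, 10]


def sample_variance(values):
    """Definition: sum of squared deviations divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_variance_shorthand(values):
    """Shorthand: (sum of squares - n * mean^2) divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return (sum(x * x for x in values) - n * mean ** 2) / (n - 1)


s2 = sample_variance(sample)
print(round(s2, 2), round(math.sqrt(s2), 2))
print(round(sample_variance_shorthand(sample), 2))
print(round(statistics.variance(sample), 2))  # library cross-check
```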

Coefficient of Variation (C.V)

• Is defined as the ratio of the standard deviation to the mean, usually expressed as a percentage:

$$C.V = \frac{S}{\bar{X}} \times 100\%$$

• The distribution having the smaller C.V is said to be less variable or more consistent.
Example:
• An analysis of the monthly wages paid to workers in two departments, Pedi (A) and Ortho (B), belonging to the same campus gives the following results

Value        Dep’t A   Dep’t B
Mean wage    52.5      47.5
Variance     100       121

In which department is there greater variability in individual wages?
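Cont…

$$C.V_A = \frac{\sqrt{100}}{52.5} \times 100\% \approx 19.05\%, \qquad C.V_B = \frac{\sqrt{121}}{47.5} \times 100\% \approx 23.16\%$$

• Since $C.V_B > C.V_A$, there is greater variability in individual wages in dep’t B.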
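The comparison rests on the coefficient of variation of each department. A short sketch of the arithmetic, with `coefficient_of_variation` as a hypothetical helper that takes the mean and the variance as given in the table:

```python
import math


def coefficient_of_variation(mean, variance):
    """C.V in percent: standard deviation divided by the mean, times 100."""
    return math.sqrt(variance) / mean * 100


cv_a = coefficient_of_variation(52.5, 100)   # Dep't A
cv_b = coefficient_of_variation(47.5, 121)   # Dep't B
print(round(cv_a, 2), round(cv_b, 2))
print("More variable:", "B" if cv_b > cv_a else "A")
```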
Standard Scores (Z-scores)

• If X is a measurement from a distribution with mean $\bar{X}$ and standard deviation $S$, then its value in standard units is

$$Z = \frac{X - \bar{X}}{S}$$
Cont…

• Z gives the deviation from the mean in units of standard deviation
• Z gives the number of standard deviations by which a particular observation lies above or below the mean.
• It is used to compare two observations coming from different groups
Examples:

1. Two sections were given Biostatistics examinations. The following information was given.

Value   HO (Sec1)   Nursing (Sec2)
Mean    78          90
SD      6           5

• Student A from section 1 scored 90 and student B from section 2 scored 95. Relatively speaking, who performed better?
Solutions:

• Calculate the standard score of both students

$$Z_A = \frac{90 - 78}{6} = 2, \qquad Z_B = \frac{95 - 90}{5} = 1$$

• Student A performed better relative to his section, because the score of student A is 2 SD above the mean score of his section, while the score of student B is only 1 SD above the mean score of his section.
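The standard scores used in this comparison are a one-line calculation. A minimal sketch, with `z_score` as a hypothetical helper name:

```python
def z_score(x, mean, sd):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / sd


z_a = z_score(90, mean=78, sd=6)   # student A, section 1
z_b = z_score(95, mean=90, sd=5)   # student B, section 2
print(z_a, z_b)
```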
Measures of shape
• Measures of skewness
– Skewed to the right
– Skewed to the left
– Symmetric
• Measures of kurtosis
– Leptokurtic
– Mesokurtic
– Platykurtic

• Reading Assignment

4. Probability and its distribution
What is probability?
• It is the chance of an outcome in an experiment.
• It is the measure of how likely an outcome is to occur.
• It helps us to cope with uncertainty
• Probability is "0" if an event cannot occur, and it is "1" if an event is certain to occur
– A physician may say that a patient has a 50-50 chance of surviving a given operation.
– Another physician may say that she is 95 percent certain that a patient has a particular disease.
Important definitions
• Experiment: any process of observation or measurement which generates a well-defined outcome.
– E.g. the parasite counts of malaria patients entering a hospital
• Probability experiment:
– It is an experiment that can be repeated a number of times under similar conditions
– It is possible to enumerate all possible outcomes without being able to predict an individual outcome.
Example: If a fair die is rolled once
– it is possible to list all the possible outcomes
• i.e. {1, 2, 3, 4, 5, 6}
– but it is not possible to predict which outcome will occur.
Important def…

• Outcome: the result of a single trial of a random experiment
• Sample space (S): the set of all possible outcomes of an experiment, for example {H, T}.
• Event: any subset of the sample space, for example {H} or {T} or {H, T}
• Empty set (Φ): contains no elements.
• Equally likely events: events which have the same chance of occurring.
• Complement of an event: the complement of an event A means the non-occurrence of A
Important def…
• Elementary Event: an event having only a single
element or sample point.
• Mutually Exclusive Events: Two events which cannot
happen at the same time.
• Independent Events: Two events are independent if the
occurrence of one does not affect the probability of
the other occurring.
• Dependent Events: Two events are dependent if the
first event affects the outcome or occurrence of the
second event in a way the probability is changed.

Approaches to measure probability

• There are four approaches to the study of probability theory.
– These are:
• The classical approach
• The frequentist approach
• The axiomatic approach
• The subjective approach
Classical approach

• If a random experiment with $N$ equally likely outcomes is conducted and $N_A$ of these outcomes are favorable to the event A,
• then the probability that event A occurs, denoted P(A), is defined as:

$$P(A) = \frac{N_A}{N} = \frac{n(A)}{n(S)}$$
Example

• A fair die is tossed once. What is the probability of getting
a) Number 4?
b) An odd number?
c) An even number?
d) Number 8?
Solutions:
• First identify the sample space,
S={1,2,3,4,5,6}; N=n(S)=6
a) Let A be the event of number 4; A={4}
NA= n(A)=1;
P(A)=n(A)/n(S)=1/6
b) Let A be the event of odd numbers; A={1,3,5}
NA= n(A)=3;
P(A)=n(A)/n(S)=3/6=0.5
c) and d)… calculate them yourself
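The classical definition counts favorable outcomes against the size of the sample space. A sketch for a fair die, using exact fractions; `classical_probability` is a hypothetical helper, and outcomes not in the sample space are simply ignored.

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}   # sample space of one roll of a fair die


def classical_probability(event, sample_space):
    """P(A) = n(A) / n(S), assuming equally likely outcomes."""
    favourable = set(event) & set(sample_space)
    return Fraction(len(favourable), len(sample_space))


print(classical_probability({4}, S))         # part a
print(classical_probability({1, 3, 5}, S))   # part b
```

The same helper can be used to check your own answers to parts c) and d).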
Addition Rule
• The sum of the probabilities of all mutually exclusive and exhaustive outcomes is equal to "1"
– For any events A and B:
P(A ∪ B) = P(A) + P(B) - P(A ∩ B) … Addition rule

– For mutually exclusive events:
P(A ∪ B) = P(A) + P(B)

– For two independent events A and B:
P(A ∩ B) = P(A)·P(B) … Multiplication rule

– For two dependent events:
P(A ∩ B) = P(A)·P(B/A) … conditional probability
Example
• Out of 200 seniors at a certain college, 98 are women, 34 are majoring in Health Officer (HO), and 20 Health Officer majors are women. If one student is chosen at random from the senior class, what is the probability that the choice will be either a Health Officer major or a woman?

Solution:
P(HO major or woman) = P(HO major) + P(woman) - P(HO major and woman)
= 34/200 + 98/200 - 20/200 = 112/200 = 0.56
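The addition rule in this example can be verified with exact fractions. This is only an illustrative sketch; the variable names are chosen here and do not come from the slides.

```python
from fractions import Fraction

n_seniors = 200
p_ho = Fraction(34, n_seniors)            # P(HO major)
p_woman = Fraction(98, n_seniors)         # P(woman)
p_ho_and_woman = Fraction(20, n_seniors)  # P(HO major and woman)

# Addition rule: subtract the overlap so it is not counted twice.
p_ho_or_woman = p_ho + p_woman - p_ho_and_woman
print(p_ho_or_woman, float(p_ho_or_woman))
```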
Conditional probability

• The conditional probability of an event A given that B has already occurred, denoted P(A/B), is

$$P(A/B) = \frac{P(A \cap B)}{P(B)}, \quad P(B) > 0$$
Example

• Suppose in country X the chance that an infant lives to age 25 is 0.95, whereas the chance that he lives to age 65 is 0.65. What is the chance that a person 25 years of age survives to age 65?

– Hint: to survive to age 65 means to survive both from birth to age 25 and from age 25 to 65.
Solution
A: survive from birth to age 25; P(A) = 0.95
B: survive from birth to age 65, i.e. both from birth to age 25 and from age 25 to 65; P(B) = P(A ∩ B) = 0.65
B/A: survive from age 25 to 65 given survival to age 25; P(B/A) = ?

Then, P(B/A) = P(A ∩ B)/P(A)
= 0.65/0.95
= 0.684

That is, a person aged 25 has a 68.4 percent chance of living to age 65
Probability distribution

Random Variable

• It is a numerical description of the outcomes of an experiment, or
• It is a numerically valued function defined on the sample space
• Usually denoted by capital letters
• A random variable takes a possible outcome and assigns a number to it
Cont…

• Example: Toss a coin two times and let X be the number of heads in the two tosses
• S = {(HH), (HT), (TH), (TT)}
– X(HH) = 2,
– X(HT) = X(TH) = 1,
– X(TT) = 0
• X = 0, 1, 2
• X assumes a specific number of values with some probabilities.
Random variables (RVs) are of two types:
1. Discrete random variable:
– A variable which can assume only a specific number of values.
– The possible values are finite or countably infinite in number
Examples:
– Number of bacteria per two cubic centimeters of water.
– Number of drugs sold per week in a pharmacy
– Number of children in a family.
Cont…
2. Continuous random variable: a variable that can assume all values between any two given values

Examples:
– Length of time required to complete a kidney transplant surgery.
– Lifetime of drugs.
– Height of students at a certain college.
Discrete Probability Distribution

• It may be a table, graph or formula that consists of the values that a RV assumes and the corresponding probabilities of those values

or

• A probability function maps the possible values of x against their respective probabilities of occurrence, p(x)
Properties of a discrete probability distribution

1. 0 ≤ P(X = x) ≤ 1
2. ∑ P(X = x) = 1
3. P(X < b) = P(X ≤ b-1), for integer-valued X
4. P(a ≤ X ≤ b) = P(X ≤ b) – P(X ≤ a-1), for integer-valued X
Example:

• Consider the experiment of tossing a single coin two times. Let X be the number of heads. Construct the probability distribution of X.
• S = {(HH), (HT), (TH), (TT)}
– X(HH) = 2,
– X(HT) = X(TH) = 1,
– X(TT) = 0
• X = 0, 1, 2

X        0     1     2
P(X=x)   1/4   2/4   1/4
Example 2:

• The following data show the number of diagnostic services a patient receives, with their respective probabilities

x    P(X=x)
0    0.671
1    0.229
2    0.053
3    0.031
4    0.010
5    0.006
Solution
• What is the probability that a patient receives exactly 3 diagnostic services?
P(X=3) = 0.031

• What is the probability that a patient receives at most one diagnostic service?
P(X≤1) = P(X = 0) + P(X = 1)
= 0.671 + 0.229
= 0.900

• What is the probability that a patient receives at least four diagnostic services?
P(X≥4) = P(X = 4) + P(X = 5)
= 0.010 + 0.006
= 0.016
Probability distributions can also be displayed using a graph

[Figure: bar graph of the probability P(X = x) (y-axis, 0 to 0.8) against the number of diagnostic services, x = 0, 1, …, 5 (x-axis)]
Intro. to Expectation [ E(X) ]

• Let a discrete random variable X assume the values $X_1, X_2, \dots, X_n$ with the probabilities $P(X_1), P(X_2), \dots, P(X_n)$ respectively.
• Then the expected value of X, denoted E(X), is defined as:

$$E(X) = X_1 P(X_1) + X_2 P(X_2) + \dots + X_n P(X_n) = \sum_{i=1}^{n} X_i P(X_i)$$
Example

• What is the expected value of the RV X, the number of diagnostic services a patient receives (Example 2)?

Solution
E(X) = X1P(X1) + X2P(X2) + … + X6P(X6)
= (0*0.671)+(1*0.229)+(2*0.053)+(3*0.031)+(4*0.010)+(5*0.006)
= 0+0.229+0.106+0.093+0.04+0.03 = 0.498

• The expected number of diagnostic services is, on average, 0.498
Mean and Variance of RV

Let X be a given random variable.

• Mean of X is E(X)
• Variance of X is: Var(X) = E(X²) – [E(X)]²

where, for a discrete RV, $E(X^2) = \sum_{i=1}^{n} X_i^2 P(X_i)$
Some general rules for expectation

Let X and Y be random variables and k be a constant.

1. E(k) = k
2. Var(k) = 0
3. E(kX) = kE(X)
4. Var(kX) = k² Var(X)
5. E(X+Y) = E(X)+E(Y)
Example:
• Calculate the mean and variance for the above example (the number of diagnostic services a patient receives)

Solution
Mean = E(X) = X1P(X1) + X2P(X2) + … + X6P(X6)
= (0*0.671)+(1*0.229)+(2*0.053)+(3*0.031)+(4*0.010)+(5*0.006)
= 0+0.229+0.106+0.093+0.04+0.03
= 0.498
Solution cont…
• Variance: Var(X) = E(X²) – [E(X)]²

First calculate E(X²)
= (0² *0.671)+(1² *0.229)+(2² *0.053)+(3² *0.031)+(4² *0.010)+(5² *0.006)
= 0+0.229+0.212+0.279+0.16+0.15
= 1.030
Then, Var(X) = E(X²) – [E(X)]²
= 1.030 - [0.498]²
= 1.030 - 0.248
= 0.782
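Because hand arithmetic with several products is error-prone, the mean and variance of the diagnostic-services distribution can be recomputed directly from the probability table. A sketch, with `expectation` as a hypothetical helper that returns the expected value of X raised to a given power:

```python
# Probability table of the number of diagnostic services (Example 2).
xs = [0, 1, 2, 3, 4, 5]
ps = [0.671, 0.229, 0.053, 0.031, 0.010, 0.006]


def expectation(values, probs, power=1):
    """E(X^power) for a discrete random variable."""
    return sum((x ** power) * p for x, p in zip(values, probs))


mean = expectation(xs, ps)                    # E(X)
second_moment = expectation(xs, ps, power=2)  # E(X^2)
variance = second_moment - mean ** 2          # Var(X) = E(X^2) - [E(X)]^2

print(round(mean, 3), round(second_moment, 3), round(variance, 3))
```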
Discrete probability distribution

The common discrete probability distributions are

• Binomial Distribution
• Poisson Distribution
Binomial Distribution
• A binomial experiment is a probability experiment which satisfies the following four requirements
• Assumptions of a binomial distribution:
1. The experiment consists of “n” identical trials
2. Each trial has only two possible mutually exclusive outcomes, a success or a failure
3. The probability of each outcome does not change from trial to trial, and
4. The trials are independent; thus, when sampling from a finite population, we must sample with replacement
Examples

• Registering a newly produced drug product as defective or non-defective.
• Asking 100 people if they favor the ruling party.
• Asking 200 people if they watch BBC/EBC news.
• Tossing a coin 20 times to see how many tails occur.
• Rolling a die to see if a 5 appears.
Binomial Distribution…
• The outcomes of a binomial experiment and the corresponding probabilities are called a binomial distribution.
• The probability of getting “x” successes in “n” trials is:

$$P(X = x) = \binom{n}{x} p^x q^{n-x}, \quad x = 0, 1, \dots, n$$

where p = probability of success
q = 1 - p = probability of failure
x = number of successes desired
n = number of trials
Example

• What is the probability of getting three female children from four births (not twins or more) in a family?

Solution
– Let X be the number of females in four single births in a family.
– n = 4, p = 0.5, q = 0.5

$$P(X = 3) = \binom{4}{3} (0.5)^3 (0.5)^1 = 4 \times 0.0625 = 0.25$$

Solution cont…
• Therefore, in one given family the probability of getting three female children from four births is 0.25
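The binomial probability in this example can be evaluated with the standard library alone. A minimal sketch; `binomial_pmf` is a hypothetical helper name, and `math.comb` supplies the number of combinations.

```python
import math


def binomial_pmf(x, n, p):
    """P(X = x) for a binomial RV with n trials and success probability p."""
    return math.comb(n, x) * p ** x * (1 - p) ** (n - x)


print(binomial_pmf(3, n=4, p=0.5))   # three females in four births

# The probabilities over x = 0..n should add up to 1.
print(sum(binomial_pmf(x, 4, 0.5) for x in range(5)))
```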
Poisson Distribution

• A random variable X is said to have a Poisson distribution if its probability distribution is given by:

$$P(X = x) = \frac{\lambda^x e^{-\lambda}}{x!}, \quad x = 0, 1, 2, \dots$$

• where $\lambda$ is the average number of occurrences
• The Poisson distribution depends only on the average number of occurrences per unit of time or space.
Poisson distribution…

• The Poisson distribution is used as a distribution of rare events, like
– Number of misused drugs.
– Car accidents
– Natural disasters like earthquakes.
Example

• If 1.6 accidents can be expected at an intersection on any given day, what is the probability that there will be 3 accidents on any given day?

Solution: with $\lambda = 1.6$ and $x = 3$,

$$P(X = 3) = \frac{1.6^3 e^{-1.6}}{3!} \approx 0.1378$$
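The Poisson probability for this example is easy to evaluate numerically. A sketch using only the standard library; `poisson_pmf` is a hypothetical helper, with the average of 1.6 accidents per day taken from the example.

```python
import math


def poisson_pmf(x, lam):
    """P(X = x) for a Poisson RV with average number of occurrences lam."""
    return lam ** x * math.exp(-lam) / math.factorial(x)


p_three = poisson_pmf(3, lam=1.6)
print(round(p_three, 4))
```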
Normal distribution

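• A random variable X is said to have a normal distribution if its probability density function is given by

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad -\infty < x < \infty$$

where $-\infty < \mu < \infty$ and $\sigma > 0$.

• $\mu$ (the mean) and $\sigma$ (the standard deviation) are the parameters of the normal distribution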
Properties of Normal distribution

• It is bell shaped, symmetrical about its mean, and mesokurtic
• It is unimodal
• It is a family of curves, one for each pair of mean and standard deviation
• It is a continuous distribution
• The total area under the curve is 1
• Mean = Median = Mode
• The probability that a random variable will have a value between any two points is equal to the area under the curve between those points
Standard Normal distribution

• It is the distribution of the value obtained by standardizing a normal RV X:

$$Z = \frac{X - \mu}{\sigma}$$

where
– the mean of Z is zero and its variance is one
Standard Normal distribution…

• Given a normally distributed random variable X with mean $\mu$ and standard deviation $\sigma$,
• note that

$$P(a < X < b) = P\left(\frac{a - \mu}{\sigma} < Z < \frac{b - \mu}{\sigma}\right)$$
Example

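• Find the area under the standard normal distribution which lies
a. Between Z=0 and Z=0.96
b. Between Z= -1.45 and Z=0
c. To the right of Z= -0.35
d. Between Z=-1.45 and Z=0.96

Solution
a) P(0 < Z < 0.96) = 0.3315
b) P(-1.45 < Z < 0) = P(0 < Z < 1.45) = 0.4265
c) P(Z > -0.35) = P(-0.35 < Z < 0) + P(Z > 0) = 0.1368 + 0.5 = 0.6368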
d) Exercise ??
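Areas under the standard normal curve are usually read from a table, but they can also be checked numerically. The sketch below builds the cumulative distribution function from `math.erf`; the names `phi` and `area_between` are hypothetical helpers introduced here.

```python
import math


def phi(z):
    """Standard normal cumulative distribution function, P(Z <= z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def area_between(z1, z2):
    """Area under the standard normal curve between z1 and z2."""
    return phi(z2) - phi(z1)


area_a = area_between(0.0, 0.96)    # part a
area_b = area_between(-1.45, 0.0)   # part b
area_c = 1.0 - phi(-0.35)           # part c: area to the right of -0.35

print(round(area_a, 4), round(area_b, 4), round(area_c, 4))
```

The same `area_between` helper can be used to check your answer to part d.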
5. Sampling
Definitions

• Population: the complete set of possible measurements for which inferences are to be made
• Sample: the set of measurements that are collected in the course of an investigation.
• Parameter: a characteristic or measure obtained from a population.
• Statistic: a characteristic or measure obtained from a sample
Definitions…

• Sampling: the process or method of selecting a sample from the population
• Sampling unit: the ultimate unit to be sampled, or the elements of the population to be sampled
• Sample size: the number of elements or observations to be included in the sample.
The main concern in sampling

• To ensure that the sample represents the population,
• so that the findings can be generalized
• Researchers are
– not interested in the sample itself, but in what can be learned from the sample and
– how this information can be applied to the entire population.
Reasons for Sampling
– Reduced cost
– Greater speed
– Greater accuracy
– Greater scope
– The only option when the population is infinite or simply too large to handle
Sampling Techniques

There are two types of sampling techniques

• Probability sampling
– Simple random sampling
– Stratified random sampling
– Cluster sampling
– Systematic sampling
– Multi-stage sampling

• Non-probability sampling
– Judgment sampling
– Convenience sampling
– Quota sampling

Probability sampling

• Is a method of sampling in which

– all elements in the population have a pre-assigned non-zero probability of being included in the sample.

Simple Random Sampling(SRS)

• Is a method of selecting items from a population
– such that every possible sample of a specific size has an equal chance of being selected.
– In this case, sampling may be with or without replacement.

• All elements in the population have the same pre-assigned non-zero probability of being included in the sample.
• SRS can be done using either the lottery method or a table of random numbers.
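
In practice the lottery method is often replaced by a pseudo-random number generator. The sketch below is only an illustration under assumptions: the sampling frame of 100 unit IDs, the sample size of 10, and the function names are hypothetical.

```python
import random

def srs_without_replacement(frame, n, seed=None):
    """Draw n distinct units; every subset of size n is equally likely."""
    rng = random.Random(seed)
    return rng.sample(frame, n)

def srs_with_replacement(frame, n, seed=None):
    """Draw n units one at a time; a unit may be selected more than once."""
    rng = random.Random(seed)
    return [rng.choice(frame) for _ in range(n)]

# Hypothetical sampling frame: unit IDs 1..100
frame = list(range(1, 101))
sample = srs_without_replacement(frame, 10, seed=1)
resample = srs_with_replacement(frame, 10, seed=1)
print(sample)
print(resample)
```

Fixing the seed makes the draw reproducible, which is useful when the selection has to be documented.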
Stratified Random Sampling
• The population is divided into non-overlapping but exhaustive groups called strata.
• An SRS is chosen from each stratum.
• Elements within the same stratum should be more or less homogeneous, while elements in different strata should differ.
• It is applied if the population is heterogeneous.
• Some of the criteria for dividing a population into strata are:
– Sex (male, female);
– Age (under 18, 18 to 28, 29 to 39);
– Occupation (blue-collar, professional, other).
Cluster Sampling

• The population is divided into non-overlapping groups called clusters.

• An SRS of clusters is chosen, and all the sampling units in the selected clusters are surveyed.

• Clusters are formed so that elements within a cluster are heterogeneous
– i.e. observations in each cluster should be more or less dissimilar.

• Cluster sampling is useful when it is difficult or costly to generate an SRS.
Systematic Sampling:
• A complete list of all elements within the population (sampling frame) is required.

• The procedure starts by determining the first element to be included in the sample, usually chosen at random from the first k elements.

• Then every kth item after it is taken from the sampling frame.
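
The procedure can be sketched in a few lines. This is an illustrative sketch, not a prescribed method: it assumes the sampling interval is taken as k = N // n (the slide does not define k), that n is no larger than N, and that the start is drawn at random from the first k elements; the frame of 100 IDs is hypothetical.

```python
import random

def systematic_sample(frame, n, seed=None):
    """Take every k-th element after a random start within the first k."""
    k = len(frame) // n            # assumed sampling interval
    rng = random.Random(seed)
    start = rng.randrange(k)       # random position 0..k-1
    return [frame[start + i * k] for i in range(n)]

# Hypothetical sampling frame: unit IDs 1..100
frame = list(range(1, 101))
sample = systematic_sample(frame, 10, seed=7)
print(sample)
```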

Non probability sampling
• It is a sampling technique in which the choice of individuals for a sample is based on:

– convenience,

– personal choice, or

– interest, etc.

Types of Non probability sampling

• Judgment Sampling
– The person taking the sample has direct or indirect
control over which items are selected for the sample.

• Convenience Sampling
– The decision maker selects a sample from the
population in a manner that is relatively easy and
convenient.

• Quota Sampling
– The decision maker requires the sample to contain a
certain number of items with a given characteristic.
Many political polls are, in part, quota sampling.
Errors in sample survey:

There are two types of errors

a) Sampling error:

• Is the discrepancy between the population value and the sample value.

• May arise from the use of inappropriate sampling techniques

b) Non sampling errors:

• are errors due to procedural bias, such as:

– Incorrect responses

– Measurement errors

– Errors at different stages in processing the data.

Sample size determination
How many subjects should a researcher study?

Decide how many people need to be studied in order to answer the study objectives

Beyond a certain point, it is much better to increase the accuracy of data collection than to increase the sample size.

Sample size determination…

Describe how the sample size is determined


Too small sample;
– May fail to detect important effects
– May estimate effects too imprecisely
Results have no practical use

Too large sample;


– Waste of resources
– Data quality compromised

Sample size determination…

When deciding on sample size, balance PRECISION against COST:

Larger sample size = greater precision = higher cost

Sample size determination…
Example:
• A prevalence of 10% from a sample size of 20
– would have a 95% CI of roughly 3% to 30%,
– which is not very precise or informative.
• But, a prevalence of 10% from a sample of size 400
– would have a 95% CI of roughly 7% to 13%,
– which may be considered sufficiently accurate.
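
The effect of sample size on the width of a confidence interval can be checked numerically. The sketch below is a minimal illustration that assumes the Wilson score method; the slide does not say which method it used, and other methods give somewhat different limits. The function name `wilson_interval` and the counts 2/20 and 40/400 (both a prevalence of 10%) are illustrative choices.

```python
import math
from statistics import NormalDist

def wilson_interval(successes, n, confidence=0.95):
    """Wilson score confidence interval for a proportion."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width

small = wilson_interval(2, 20)     # prevalence 10%, n = 20
large = wilson_interval(40, 400)   # prevalence 10%, n = 400

for n, (lo, hi) in [(20, small), (400, large)]:
    print(f"n = {n}: 95% CI {lo:.1%} to {hi:.1%}")
```

The same prevalence gives a much narrower interval with the larger sample, which is the point of the example.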

Sample size determination depends on the

• Objective of the study
• Design of the study
– Descriptive/Analytic
• Accuracy of the measurements to be made (margin of error)
• Degree of precision required for generalization
• Plan for statistical analysis
• Degree of confidence with which to conclude
• The feasible sample size is also determined by the availability of resources
– Time, manpower, transport, available facility and money

Sample Size for estimating a Single Proportion

$$ n = \frac{(z_{\alpha/2})^2 \, p \, q}{d^2} $$

Where, p = proportion
q = 1 − p
d = the degree of precision (margin of error)
z_{α/2} = the standard normal critical value for confidence level 1 − α (1.96 for 95% confidence)

This formula works for a large population!

Example:

• Suppose that you are interested in knowing the proportion of HIV infected adult patients who developed Tuberculosis in Goba Referral Hospital. Suppose that in this hospital the proportion (p) of Tuberculosis was found to be 20%. What sample size is required to estimate the true proportion within ±3% with 95% confidence level?

• Given: p=0.20, d=0.03, α=5%

$$ n = \frac{(z_{\alpha/2})^2 \, p \, q}{d^2} = \frac{1.96^2 (0.2)(0.8)}{(0.03)^2} \approx 683 $$

Example

• If the sample is to be taken from a relatively small population (<10,000), the above formula needs some adjustment (finite population correction, fpc).

Final sample (fpc) = n / (1 + n/N)

• Suppose in the above example that the total population of patients is 5000. What sample size will be needed to conduct the study?

Final sample = n / (1 + n/N) = 683 / (1 + 683/5000) ≈ 601
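
The two calculations above can be reproduced with a short script. This is a sketch under stated assumptions: the function names are illustrative, the critical value is taken from the standard normal distribution rather than rounded to 1.96, and results are rounded up to the next whole subject.

```python
import math
from statistics import NormalDist

def sample_size_proportion(p, d, confidence=0.95):
    """n = z^2 * p * (1 - p) / d^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil(z**2 * p * (1 - p) / d**2)

def finite_population_correction(n, N):
    """Adjusted sample size n / (1 + n/N) for a population of size N."""
    return math.ceil(n / (1 + n / N))

n0 = sample_size_proportion(p=0.20, d=0.03)
n_final = finite_population_correction(n0, N=5000)
print(n0, n_final)
```

Rounding up rather than to the nearest integer is the usual convention, since rounding down would give slightly less precision than requested.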

Example

• Suppose there is no prior information about the proportion (p) of HIV infected adult patients who developed TB
• If you don't have any information about p, take it as 50%, which gives the maximum value of p*q, namely ¼ (0.25).
• Assume p=q=0.5 (most conservative)
• Then, the required sample size will be

$$ n = \frac{(z_{\alpha/2})^2 \, p \, q}{d^2} = \frac{1.96^2 (0.5)(0.5)}{(0.03)^2} \approx 1068 $$

Sample Size for Estimating a Single Mean

$$ n = \frac{(z_{\alpha/2})^2 \, \sigma^2}{d^2} $$

Where,
n = sample size
σ = standard deviation
d = desired precision (margin of error, e) = half the width of the confidence interval (w = 2d)

Example

• Suppose that for a certain group of cancer patients, we are interested in estimating the mean age at diagnosis. We would like a 95% confidence interval that is 5 years wide (so d = 2.5). If the population standard deviation is 12 years, how large should our sample be?

$$ n = \frac{(z_{\alpha/2})^2 \, \sigma^2}{d^2} = \frac{(1.96)^2 (144)}{(2.5)^2} = 88.5 \approx 89 $$

Example

• Suppose d=1
• Then the sample size increases!

$$ n = \frac{(z_{\alpha/2})^2 \, \sigma^2}{d^2} = \frac{(1.96)^2 (144)}{1^2} = 553.2 \approx 554 $$
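
The sample-size formula for a mean is easy to script, which makes it simple to see how quickly n grows as d shrinks. A minimal sketch, assuming an illustrative function name and rounding up to the next whole subject:

```python
import math
from statistics import NormalDist

def sample_size_mean(sigma, d, confidence=0.95):
    """n = (z * sigma / d)^2, rounded up."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return math.ceil((z * sigma / d) ** 2)

n_wide = sample_size_mean(sigma=12, d=2.5)   # interval 5 years wide
n_narrow = sample_size_mean(sigma=12, d=1)   # interval 2 years wide
print(n_wide, n_narrow)
```

Because d enters the formula squared, making the interval 2.5 times narrower requires about 6.25 times as many subjects.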

But the population variance σ² is unknown most of the time!

As a result, it has to be estimated from:

• Pilot or preliminary sample:
– Select a pilot sample and estimate σ² with the sample variance, s²
• Previous or similar studies

6. Statistical Estimation and
Hypothesis Testing
Inference

• Inference is the process of drawing a conclusion about the whole population from sample data.
• It is only the sample data that is ready for inference.
• In statistics there are two ways through which inference can be made.
– Statistical estimation
– Statistical hypothesis testing
Inference…

• Data analysis is the process of extracting relevant


information from the summarized data.
Statistical estimation

• This is one way of making inference about the


population parameter where;

• the investigator does not have any prior notion about


values or characteristics of the population parameter.

• Two ways of estimation

– Point Estimation

– Interval Estimation
Statistical estimation…

Point Estimation

• It is a procedure that results in a single value as an estimate for a parameter

Sample statistic            Population parameter
x̄ (sample mean)             μ (population mean)
s² (sample variance)        σ² (population variance)
s (sample SD)               σ (population SD)
p (sample proportion)       P or π (population proportion)
Interval estimation

• It is a procedure that results in an interval of values as an estimate for a parameter,
• that is, an interval that contains the likely values of the parameter.
Confidence Interval
• How confident can we be that the value of the statistic
falls within a certain "distance" of the parameter?
• Or, what is the probability that the parameter's value is
within a certain range of the statistic's value?
• This range is the confidence interval.
Estimator and Estimate

• An estimator is the rule or random variable that helps us to approximate a population parameter.

• An estimate is one of the possible values which an estimator can assume.

Example: The sample mean X̄ is an estimator of the population mean μ, and an observed value x̄ is an estimate, which is one of the possible values of X̄.
Properties of best estimator

The following are some qualities of an estimator

• It should be unbiased

• It should be consistent

• It should be relatively efficient


Estimator…

• To explain these properties, let θ̂ be an estimator of θ
– Unbiased Estimator: An estimator whose expected value is the value of the parameter being estimated, i.e. E(θ̂) = θ

– Consistent Estimator: An estimator which gets closer to the value of the parameter as the sample size increases, i.e. θ̂ gets closer to θ as the sample size increases.
Cont….

• Relatively Efficient Estimator:


– The estimator for a parameter with the smallest
variance.

– This actually compares two or more estimators for


one parameter.
Hypothesis Testing

• This is also one way of making inference about


population parameter, where

• the investigator has prior notion about the value of the


parameter.
Definitions:

• Statistical hypothesis: a statement about the population whose acceptability is to be evaluated on the basis of the sample data.

• Test statistic: a statistic whose value serves to determine whether or not to reject the hypothesis being tested.

– It is a random variable.
Two types of hypothesis

Null hypothesis (H0):

• It is the hypothesis to be tested.

• It is the hypothesis of equality or the hypothesis of no


difference.

Alternative hypothesis(H1) or (HA):

• It is the hypothesis available when the null hypothesis


has to be rejected.

• It is the hypothesis of difference.


Types of errors:

Two types of errors in hypothesis testing

• Type I error: Rejecting the null hypothesis when it is true.

– Its probability, α, is called the level of significance.

• Type II error: Failing to reject the null hypothesis when it is false.

– Its probability is denoted by β.
General steps in hypothesis testing
1. The first step in hypothesis testing is to specify the
null and alternative hypothesis

2. The next step is to select a significance level, α

3. Identify the sampling distribution of the estimator.

4. Calculate a statistic analogous to the parameter


specified by the null hypothesis.

5. Identify the critical region.

6. Making decision.

7. Summarization of the result.


Example:
• It is known from a pharmacological experiment that rats fed a particular diet over a certain period gain an average of 40 g in weight. A new diet was tried on a sample of 20 rats, yielding a mean weight gain of 43 g with variance 7 g².

– Test the hypothesis that the new diet is an improvement, assuming normality.
Solution

Given:
Steps

Conclusion:
• Reject H0 and conclude that the new diet improves weight gain in the rats.
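
The calculation behind this conclusion can be sketched as a one-sample, one-sided t test. Several points are assumptions because the worked steps are not reproduced in the text: a significance level of α = 0.05, the variance of 7 being the sample variance, and the critical value 1.729 read from a t table with 19 degrees of freedom.

```python
import math

mu0 = 40.0        # mean weight gain under the old diet (H0: mu = 40)
n = 20            # number of rats on the new diet
x_bar = 43.0      # observed mean weight gain
s2 = 7.0          # sample variance (assumed to use the n - 1 divisor)

# Test statistic for H1: mu > 40
standard_error = math.sqrt(s2 / n)
t_stat = (x_bar - mu0) / standard_error

# Tabulated one-sided critical value, alpha = 0.05, df = n - 1 = 19
t_critical = 1.729
reject_h0 = t_stat > t_critical

print(f"t = {t_stat:.3f}, critical value = {t_critical}, reject H0: {reject_h0}")
```

The statistic is far beyond the critical value, which agrees with the conclusion stated on the slide.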
Test of Association

• Suppose we have a population consisting of observations having two attributes, say A and B.

• If the attributes are independent, then the probability of possessing both A and B is PA*PB, where
– PA is the probability that a member has attribute A.

– PB is the probability that a member has attribute B.

• Suppose
– A has r mutually exclusive and exhaustive classes.

– B has c mutually exclusive and exhaustive classes


Test of association…

• The entire set of data can be represented using an r × c contingency table
Test of association examples

• Whether the presence or absence of hypertension is


independent of smoking habit or not.

• Whether the size of the family is independent of the


level of education attained by the mothers.

• Whether there is association between father and son


regarding boldness
The Chi-square Test
Hypothesis

• The null and alternative hypotheses may be stated as:
– H0 : There is no association between A and B
– H1 : not H0 (There is an association between A and B)

Decision Rule:
Reject H0 at the α level of significance if the calculated value of χ2 exceeds the tabulated value of χ2 with (r − 1)(c − 1) degrees of freedom
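
The decision rule can be sketched in code. The 2 × 2 counts below are hypothetical and are not the data of the example that follows; the function name is illustrative, and the critical value 3.841 is the tabulated χ2 for 1 degree of freedom at α = 0.05.

```python
def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for an r x c table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence: row total * column total / N
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df

# Hypothetical counts: rows = attribute A (yes/no), columns = attribute B (yes/no)
table = [[30, 20],
         [20, 30]]
stat, df = chi_square_statistic(table)

critical = 3.841   # tabulated chi-square, df = 1, alpha = 0.05
reject_h0 = stat > critical
print(f"chi-square = {stat:.3f}, df = {df}, reject H0: {reject_h0}")
```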
Example

Solution:
Solution…
Solution…

Conclusion
• At 5% level of significance we have evidence to say
there is association between father and son regarding
boldness, based on this sample data.
Exercise

• Attack rates among those vaccinated and unvaccinated against measles are given in the Table below.
• Test the protective value of vaccination using the χ2 test.
7. Correlation and Linear regression
Correlation

• It is the quantification of the degree to which


two random variables are related, provided
that the relationship is linear.

• Used to investigate the relationships that can


exist among continuous variables

• We will discuss on
– Two way scatter plot

– Pearson’s correlation coefficient


Two way scatter plot
• Example: Percentage of children immunized against DPT and
under-five mortality rate for 20 countries, 1992

Not surprisingly, the mortality rate tends


to decrease as the % of children
immunized increases.
Pearson's Correlation Coefficient
Scatter plots showing possible relationships between X and Y
Pearson's Correlation…

• The correlation between the random variables X and Y is denoted by the Greek letter ρ (rho); its sample estimate is denoted by r.

• The correlation quantifies the strength of the linear relationship between the outcomes x and y.

• It can be thought of as the average of the product of the standard normal deviates of X and Y
Pearson's Correlation…

The correlation between X and Y may be one of the


following

1. Perfect positive (r=1)

2. Positive(r between 0 and 1)

3. No correlation (r=0)

4. Negative(r between -1 and 0)

5. Perfect negative(r=-1)
Pearson's Correlation…

• The presence of correlation between two variables


may be due to three reasons:

– One variable being the cause of the other.

– Both variables being the result of a common cause

– Chance
Pearson's Correlation…

• In the previous example, on immunization and U5


mortality.

• There is a strong linear relationship between the percentage


of children immunized against DPT in a specified country and
its under-five mortality rate;
• the correlation coefficient is fairly close to its minimum
value of -1. Since r is negative, mortality rate decreases in
magnitude as percentage of immunization increases
Simple Linear Regression

• It refers to the linear relationship between two continuous variables
– We usually denote the dependent variable by Y and the independent variable by X.

• A simple regression line is the line fitted to the points plotted in the scatter diagram, which describes the average relationship between the two variables.
– Therefore, to see the type of relationship, it is advisable to prepare a scatter plot before fitting the model.
Simple Linear Regression…

• The linear model is: $$ Y = \beta_0 + \beta_1 X + \varepsilon $$


Simple Linear Regression…

• The above model is estimated by ordinary least squares (OLS) as ŷ = a + bx

– Where “a” is a constant which gives the value of Y when X=0. It is called the Y intercept.

– And “b” is a constant indicating the slope of the regression line; it gives a measure of the change in Y for a unit change in X. It is also called the regression coefficient of Y on X

– The calculation formulas for “a” and “b” are:

$$ b = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad a = \bar{y} - b\,\bar{x} $$


Example:

The data on resting metabolic rate (RMR, in kcal/24 hrs) and body weight (BW, in kg) for 10 nutrition clinic clients

BW   57.6 64.9 59.2 60.0 72.8 77.1 82.0 86.2 91.6 99.8
RMR  1325 1365 1342 1316 1382 1439 1536 1466 1519 1639

a. Plot the scatter diagram to view the relationship
b. Calculate a simple correlation coefficient and interpret
c. Fit a regression line of RMR on BW using least squares estimates.
d. Predict the value of RMR if the BW is 85.
Solution
a. The scatter diagram suggests a linear relationship
Solution…

b. We can use the formula for calculating “r”

$$ r = \frac{13{,}510.72}{\sqrt{(1953.56)(102{,}424.9)}} = 0.955 $$

Interpretation
• There is a strong positive linear relationship between BW and RMR
Solution…

c. First we have to calculate a and b using the formula

Thus the regression line is given by y= 913.3729+ 6.91596x


Solution…

d. Simply substitute 85 for x in the fitted model y = 913.3729 + 6.91596x.

Thus, y = 913.3729 + 6.91596*85
y = 1501.2295

Interpretation

Based on the fitted model, the predicted RMR is about 1501.2 kcal/24 hrs when the BW is 85 kg
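
Parts b to d of this example can be reproduced from the raw data. The sketch below uses only the standard library and the least squares formulas for the slope and intercept; the variable names are illustrative.

```python
import math

bw = [57.6, 64.9, 59.2, 60.0, 72.8, 77.1, 82.0, 86.2, 91.6, 99.8]    # body weight (kg)
rmr = [1325, 1365, 1342, 1316, 1382, 1439, 1536, 1466, 1519, 1639]   # RMR (kcal/24 hrs)

n = len(bw)
mean_x = sum(bw) / n
mean_y = sum(rmr) / n

# Sums of squares and cross-products about the means
sxx = sum((x - mean_x) ** 2 for x in bw)
syy = sum((y - mean_y) ** 2 for y in rmr)
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(bw, rmr))

r = sxy / math.sqrt(sxx * syy)   # Pearson correlation coefficient
b = sxy / sxx                    # slope
a = mean_y - b * mean_x          # intercept
predicted = a + b * 85           # predicted RMR at BW = 85 kg

print(f"r = {r:.3f}, a = {a:.4f}, b = {b:.5f}, predicted RMR = {predicted:.1f}")
```

The value 85 kg lies inside the observed range of body weights, so the prediction is an interpolation rather than an extrapolation.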


Multiple linear regression

• is a model with two or more regressors having a linear relationship with a response variable Y.
• The multiple regression model is

$$ Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p + \varepsilon $$

• Where:
– Y = response variable
– Xs = regressors
– β0 = intercept
– β1, …, βp = slopes
– ε = random error component
Example: As a research question

• Do number of cigarettes (IV1) and exercise (IV2) predict CHD mortality (DV)?

Cigarettes → CHD Mortality ← Exercise

…the SPSS output
…cont
• From the 1st table we can see the correlation
between Cig and CHD
• From the 2nd table again we can see the
ANOVA table
• We are interested with the 3rd table
– We will focus on the Unstandardized predicted and
residual values.
– The model “CHD=27.08+0.45Cig–5.92Exercise”
…cont
The interpretation is as follows
• The model is “CHD=27.08+0.45Cig–5.92Exercise”
• Smoking and exercise are significant predictors of CHD mortality
The conclusion will be
• “In the given 21 countries, a one-unit increase in cigarette consumption is associated with a rise of 0.45 in CHD mortality, holding exercise constant” and “a decrease of 1 in the number of exercise sessions per week is associated with an increase of 5.92 in CHD mortality, holding smoking constant”
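
A quick way to read regression coefficients is to compare two predictions that differ by one unit in a single predictor. The sketch below uses the coefficients of the fitted model quoted above; the input values (10 cigarettes, 3 exercise sessions) and the function name are hypothetical.

```python
def predicted_chd(cig, exercise):
    """Prediction from the fitted model CHD = 27.08 + 0.45*Cig - 5.92*Exercise."""
    return 27.08 + 0.45 * cig - 5.92 * exercise

baseline = predicted_chd(cig=10, exercise=3)

# One more cigarette, exercise unchanged: change equals the Cig coefficient
effect_cig = predicted_chd(cig=11, exercise=3) - baseline

# One fewer exercise session, cigarettes unchanged: change equals minus the Exercise coefficient
effect_less_exercise = predicted_chd(cig=10, exercise=2) - baseline

print(f"{effect_cig:.2f} {effect_less_exercise:.2f}")
```

The intercept (27.08) cancels out of both differences, so it plays no part in the effect of a one-unit change.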
8. Logistic Regression
Logistic Regression

• Is a method for examining the relationship between a categorical (dichotomous) DV and one or more IVs.
– Simple Logistic Regression
• 1 dichotomous DV and 1 IV
– Multiple Logistic Regression
• 1 dichotomous DV and >1 IVs
…cont
• Logistic regression is used to predict a
categorical (usually dichotomous) variable from a
set of predictor variables.
• Logit analysis is usually employed if all of the
predictors are categorical; and
• logistic regression is often chosen if the predictor
variables are a mix of continuous and categorical
variables.
• Logistic regression has been especially popular
with medical research in which the dependent
variable is whether or not a patient has a disease.
Cont…

• The model;

$$ \ln\!\left(\frac{\hat{p}}{1-\hat{p}}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k $$

• where p̂ is the predicted probability of the event which is coded with 1.

• Odds Ratio (OR) = e^β


Assumptions of Binary LR
1. The DV must be binary (e.g. 0 and 1)
2. Since the model estimates P(Y=1), the DV should be coded accordingly
3. The model should be fitted correctly (neither over- nor under-fitted)
– Only meaningful variables should be included
4. Error terms need to be independent: each observation should be independent, and the IVs should not be highly collinear
5. Linearity between the independent variables and the log odds
6. It requires quite a large sample size
– at least 30 cases for each parameter to be estimated
Example
• Factors associated with physician agreement
on causes of death.
• DV;
– Physician Agreement (1=Agree Vs 0=Disagree)
• IV;
– AgeCat1
– OccupDeceaCataaa, EducDeceasedCat
– RespAge
– DeceaSex
Cont…

• See the output from the SPSS output window


• In the final model “OccupDecea” and “DeceaSex” were the factors that affect the physician agreement
Cont…

• The interpretation is as follows

– Among all deaths, the odds of physician agreement were 28% lower for deceased females than for deceased males (OR=0.72, 95% CI: 0.525-0.988).

– And the odds of physician agreement were 2.18 times as high for the deceased who were ineligible for any work as for the deceased who were workers (OR 2.18, 95% CI: 1.43-3.32).
– OR = e^β
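
The link between a logistic regression coefficient and statements such as "28% lower odds" can be shown with a few lines of arithmetic. In this sketch the coefficients are back-calculated from the reported odds ratios (the SPSS output itself is not reproduced in the text), and the function names are illustrative.

```python
import math

def odds_ratio(beta):
    """Odds ratio corresponding to a logistic regression coefficient."""
    return math.exp(beta)

def percent_change_in_odds(or_value):
    """Percentage change in the odds implied by an odds ratio."""
    return (or_value - 1) * 100

# Coefficients implied by the reported odds ratios (assumed, not taken from the output)
beta_sex = math.log(0.72)
beta_occupation = math.log(2.18)

print(round(percent_change_in_odds(odds_ratio(beta_sex))))         # negative: lower odds
print(round(percent_change_in_odds(odds_ratio(beta_occupation))))  # positive: higher odds
```

An odds ratio below 1 corresponds to a negative coefficient, and a 95% CI that excludes 1 corresponds to a coefficient that differs significantly from 0.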
Example 2

• The DV;
– Decision about research (0=stop and 1=continue)
• The IV;
– gender (0=F and 1=M)
• The model and the output looks

Meaning???
9. Survival analysis
What is Survival Analysis?
• Survival Analysis refers to statistical methods for analyzing survival data

• Survival data could be derived from laboratory


studies of animals or from clinical and
epidemiologic studies

• Survival data could relate to outcomes for studying


acute or chronic diseases
What is Survival Time?
• Survival time refers to a variable which measures the time from a particular starting point (e.g., the time treatment was initiated) to a particular endpoint of interest (e.g., attaining certain functional abilities)

• It is important to note that for some subjects in the study a complete survival time may not be available due to censoring
Censored Data
Some patients may still be alive or in remission
at the end of the study period

The exact survival times of these subjects are


unknown

These are called censored observation or


censored times and can also occur when
individuals are lost to follow-up after a period of
study
Random Right Censoring

• Suppose 4 patients with acute leukemia enter a clinical study for three years

• Remission times of the four patients are recorded as 10, 15+, 35 and 40 months

• 15+ indicates that for one patient the remission time is greater than 15 months but the actual value is unknown
Important Areas of Application

• Clinical Trials (e.g., Recovery Time after heart


surgery)

• Longitudinal or Cohort Studies (e.g., Time to


observing the event of interest)

• Life Insurance (e.g., Time to file a claim)

• Quality Control & Reliability in Manufacturing (e.g.,


The amount of force needed to damage a part such
that it is not useable)
Survival Function or Curve
Let T denote the survival time

S(t) = P(surviving longer than time t) = P(T > t)

The function S(t) is also known as the cumulative survival function. 0 ≤ S(t) ≤ 1

Ŝ(t) = (number of patients surviving longer than t) / (total number of patients in the study)
E.g.: Four patients' survival times are 10, 20, 35 and 40 months. Estimate the survival function.

[Figure: estimated survival curve; x-axis: Month (0 to 50), y-axis: % Surviving]
Example: Four patients' survival data are 10, 15+, 35 and 40 months. Estimate the survival function

[Figure: estimated survival curve with one censored observation; x-axis: Month (0 to 50), y-axis: % Surviving]
In 1958, the Product-Limit (P-L) method was introduced by Kaplan and Meier (K-M)

• As you move from left to right in estimating the survival curve, first assign equal weights to each observation. The curve does not drop at censored observations

• Redistribute the pre-assigned weight of each censored observation equally among all observations to its right

• Median survival is the point in time when S(t) is 0.5

• Mean survival is equal to the area under the survival curve
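
The product-limit calculation can be sketched for the two four-patient examples given earlier. This is a minimal illustration rather than a full implementation: the function name and the True/False event coding are assumptions, and the product-limit form used here (multiplying by the fraction surviving at each event time) is equivalent to the redistribute-to-the-right description above.

```python
def kaplan_meier(times, observed):
    """Product-limit estimate: list of (event time, S(t)) pairs.

    observed[i] is True if the event occurred at times[i],
    False if the observation was censored at that time.
    """
    data = sorted(zip(times, observed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [obs for time, obs in data if time == t]
        events = sum(tied)
        if events > 0:
            # The curve steps down only at times with observed events
            survival *= (at_risk - events) / at_risk
            curve.append((t, survival))
        at_risk -= len(tied)   # events and censored subjects leave the risk set
        i += len(tied)
    return curve

complete = kaplan_meier([10, 20, 35, 40], [True, True, True, True])
censored = kaplan_meier([10, 15, 35, 40], [True, False, True, True])   # 15+ is censored
print(complete)
print(censored)
```

With no censoring the estimate reduces to the simple proportion surviving longer than t; with the censored value 15+, the curve stays flat at 15 months and the later steps become larger.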


A few critical features of P-L or K-M
Estimator

• The PL method assumes that censoring is


independent of the survival times

• K-M estimates are limited to the time interval in


which the observations fall

• If the largest observation is uncensored, the PL


estimate at that time equals zero
Comparison Of Two Survival Curves

• Let S1(t) and S2(t) be the survival functions of the two groups.
• The null hypothesis is
H0: S1(t) = S2(t), for all t > 0

• The alternative hypothesis is:
H1: S1(t) ≠ S2(t), for some t > 0
The Logrank Test

• SPSS, Stata, SAS, S-Plus and many other statistical


software packages have the capability of analyzing
survival data
• The Logrank Test can be used to compare two survival curves
• A p-value less than the alpha level (e.g. 0.05) from the Logrank test indicates a statistically significant difference between the two survival curves
EXAMPLE

• Survival time of 30 patients with Acute


Myeloid Leukemia (AML)

• Two possible prognostic factors
Age = 1 if age of the patient ≥ 50
Age = 0 if age of the patient < 50
Cellularity = 1 if cellularity of marrow clot section is 100%
Cellularity = 0 otherwise
Format of the DATA

Survival Times and Data of Two Possible


Prognostic Factors of 30 AML Patients

* Censored = 1 if Lost to follow-up


Censored = 0 if Data is Complete
Comparing the survival curves by
Age Groups using Logrank Test
Comparing the survival curves by
Cellularity using Logrank Test
Hazard Function

• The hazard function h(t) of survival time T gives the


conditional failure rate

• The hazard function is also known as the


instantaneous failure rate, force of mortality, and
age-specific failure rate

• The hazard function gives the risk of failure per unit


time during the aging process
Multivariate Analysis: (CPHM)
Cox's Proportional Hazards Model
• CPHM is a technique for investigating the
relationship between survival time and
independent variables

• A PHM possesses the property that different


individuals have hazard functions that are
proportional to one another
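
For reference, the hazard function and the proportional hazards model are usually written as follows. The notation is an assumption, since the slides give no formulas: h_0(t) is the baseline hazard, x_1, …, x_p are the covariates, and β_1, …, β_p are the regression coefficients.

```latex
% Hazard function: instantaneous failure rate at time t, given survival up to t
h(t) = \lim_{\Delta t \to 0} \frac{P(t \le T < t + \Delta t \mid T \ge t)}{\Delta t}

% Cox proportional hazards model
h(t \mid x_1, \dots, x_p) = h_0(t) \, \exp(\beta_1 x_1 + \dots + \beta_p x_p)

% Hazard ratio for a one-unit increase in x_j, other covariates held fixed
\frac{h(t \mid x_j + 1)}{h(t \mid x_j)} = e^{\beta_j}
```

The ratio in the last line does not depend on t, which is what "proportional" means: the hazard functions of two individuals differ only by a constant multiplier.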
Comparing the survival curves by Age
Groups after Adjusting Cellularity using
CPHM
Comparing the survival curves by
Cellularity Groups after Adjusting Age
using CPHM
10. Study Designs in Epidemiology
Case report

• Case reports and case series are the most basic types of observational study designs.

• These studies describe the experiences of a single person (case report) or a group of people (case series) who have a specific disease or condition.
• Case reports and case series typically describe previously
unrecognized diseases or unusual variants of a known
disease process.
• Consequently, data from these studies are particularly
useful for alerting the health community to the presence of
a new disease and for generating hypotheses regarding
possible causes.
Cross sectional study

• Cross-sectional studies are a type of observational


study in which the exposure and outcome are
measured simultaneously.

• Concurrent measurement of potential risk factors and


a disease outcome implies that there is no follow-up
time in cross sectional studies.
Cohort study

• Are observational studies that compare the


incidence of disease among different exposure
groups.
• The cohort study design separates potential risk
factors from the development of disease over
time to demonstrate temporal associations.
• Cohort studies are conducted in three
fundamental steps:
1. Identify a group of people who are initially free of the
disease outcome
2. Measure the exposure(s) of interest to create cohorts
3. Follow the cohorts over time to determine the
incidences of disease
Cohort…

• Design of cohort study


Case control study

• Are observational studies that begin by targeting a


disease or condition of interest and then work
backward to determine associations with previous
exposures.

• The case-control study design is ideally suited for


examining potential risk factors for rare diseases.
Randomized trial

• A randomized trial is a prospective study in


humans that evaluates the benefits and
harms of an intervention against control
procedures
overview of common research study designs
11. Measurement error and bias
Bias
• Epidemiological studies measure characteristics of populations. These parameters may be
– a disease rate
– the prevalence of an exposure
– the association between an exposure and disease.
• Because studies are carried out on people and
have all the attendant practical and ethical
constraints, they are almost invariably subject
to bias.
Bias…

• Bias is a systematic tendency to under or


overestimate the parameter of interest
because of a deficiency in the design or
execution of a study.

• Two main sources of bias here:


– Selection bias and information bias
Bias…

• Selection bias occurs when the subjects


studied are not representative of the target
population about which conclusions are to be
drawn
– The possibility of selection bias should always be
considered when defining a study sample
• Information bias: arises from errors in
measuring exposure or disease.
Bias…

• Bias cannot usually be totally eliminated from


epidemiological studies.
• The aim, therefore, must be
– to keep it to a minimum
– to identify those biases that cannot be avoided
– to assess their potential impact, and
– to take this into account when interpreting results
• The motto of the epidemiologist could well be
“dirty hands but a clean mind”
Measurement error

• As indicated above, errors in measuring exposure


or disease can be an important source of bias in
epidemiological studies.
• In conducting studies, therefore, it is important to
assess the quality of measurements
• Sometimes a reliable standard is available
against which the validity of a survey method can
be assessed
– E.g. the validity of a mammographic diagnosis of
breast cancer can be tested by biopsy. More often,
however, there is no sure reference standard
Measurement error…

• Measurements of disease in life are often


incapable of full validation.
• In practice, therefore, validity may have to be
assessed indirectly
• Two techniques of measurement
1. Survey method
2. Standard reference test
Analyzing validity

• When a survey technique or test is used to


dichotomize subjects its validity may be
analyzed by classifying subjects as positive
or negative
– firstly by the survey method and secondly
according to a standard reference test
• Four important measures of validity: sensitivity, specificity, systematic error, and predictive value
Analyzing validity…

Comparison of a survey test with a reference test


Analyzing validity…

• Sensitivity: A sensitive test detects a high proportion of the true cases, and this quality is measured here by a/(a + c).
• Specificity: A specific test has few false positives, and this quality is measured by d/(b + d).
• Systematic error: For epidemiological rates it is particularly important for the test to give the right total count of cases.
– This is measured by the ratio of the total numbers positive to the survey and the reference tests, or (a + b)/(a + c).
• Predictive value: This is the proportion of positive test results that are truly positive, or a/(a + b).
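
The four measures can be computed directly from the cells of the comparison table. In the sketch below the cell layout is inferred from the formulas above (a = positive on both tests, b = survey positive but reference negative, c = survey negative but reference positive, d = negative on both); the counts and the function name are hypothetical.

```python
def validity_measures(a, b, c, d):
    """Validity of a survey test against a reference test.

    a: survey +, reference +    b: survey +, reference -
    c: survey -, reference +    d: survey -, reference -
    """
    return {
        "sensitivity": a / (a + c),
        "specificity": d / (b + d),
        "systematic_error": (a + b) / (a + c),
        "predictive_value": a / (a + b),
    }

# Hypothetical counts: 100 true cases and 1000 non-cases
measures = validity_measures(a=90, b=100, c=10, d=900)
for name, value in measures.items():
    print(f"{name}: {value:.3f}")
```

With these counts the test is 90% sensitive and 90% specific, yet fewer than half of the positive results are true cases, because non-cases greatly outnumber cases.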
Analyzing validity…

• For example,
• The sensitivity of mammography for detecting breast
cancer is 90%. This value is interpreted as “90% of
women who have biopsy-proven breast cancer will have a
positive mammogram.”
• The specificity of mammography for detecting breast cancer is also about 90%. This value is interpreted as “90% of women who have biopsy-proven absence of breast cancer will have a negative mammogram.”

Sensitive or specific? A matter of choice


Repeatability

• Examining repeatability is helpful when there is no satisfactory standard against which to assess the validity of a measurement technique

• Consistent findings do not necessarily imply that


the technique is valid:
– a laboratory test may yield persistently false positive
results, or a very repeatable psychiatric questionnaire
may be an insensitive measure of, for example,
“stress”.
Repeatability…

• Repeatability can be tested


– within observer (that is, the same
observer performing the measurement on two
separate occasions) and also
– between observers (comparing measurements
made by different observers on the same subject or
specimen).
The total variability can be dissected into four components

• Within observer variation—Discovering one’s own


inconsistency
• Between observer variation—This includes the first
component but adds to it an extra and systematic
component due to individual differences in techniques
and criteria
• Random subject variation—When measured repeatedly
in the same person, physiological variables like blood
pressure tend to show a roughly normal distribution
around the subject’s mean.
• Biased (systematic) subject variation—Blood pressure
is much influenced by the temperature of the
examination room, as well as by less readily
standardized emotional factors.
Analyzing repeatability

• For continuous numerical variables


• Calculated by SD or CV(standard deviation ÷
mean)
• Scatter plot will show the extent and pattern
of observer variation
– to plot the difference between each pair of
measurements against their mean.
– E.g. Blood pressure
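
The difference-against-mean plot described above can be sketched numerically. This is a minimal illustration, not a prescribed method: the blood pressure readings are hypothetical, the function names are invented for this sketch, and the 1.96 multiplier for the limits of agreement is the usual convention rather than something stated in these slides.

```python
import statistics


def difference_vs_mean(first, second):
    """Return (pair means, pair differences) for paired readings."""
    means = [(a + b) / 2 for a, b in zip(first, second)]
    diffs = [a - b for a, b in zip(first, second)]
    return means, diffs


def coefficient_of_variation(values):
    """CV = sample standard deviation divided by the mean."""
    return statistics.stdev(values) / statistics.mean(values)


# Hypothetical systolic blood pressure readings (mmHg) taken by two
# observers on the same five subjects; not data from the lecture.
observer_1 = [120, 130, 125, 140, 118]
observer_2 = [122, 128, 129, 138, 121]

means, diffs = difference_vs_mean(observer_1, observer_2)
mean_diff = statistics.mean(diffs)   # average disagreement between observers
sd_diff = statistics.stdev(diffs)    # spread of the disagreement

# Conventional limits of agreement: mean difference +/- 1.96 SD (assumption).
limits = (mean_diff - 1.96 * sd_diff, mean_diff + 1.96 * sd_diff)

for m, d in zip(means, diffs):
    print(f"pair mean = {m:6.1f}   difference = {d:+d}")
print(f"mean difference = {mean_diff}, SD = {sd_diff:.2f}")
print(f"limits of agreement = ({limits[0]:.2f}, {limits[1]:.2f})")
print(f"CV of observer 1 readings = {coefficient_of_variation(observer_1):.3f}")
```

Plotting `diffs` against `means` gives the scatter plot described on the slide; a mean difference far from zero would suggest a systematic difference between observers.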
Analyzing repeatability…

• For qualitative attributes


• κ statistic, which measures the level
of agreement over and above what would be
expected from the prevalence of the attribute.

• The proportion of the total in cells a and d is
the level of agreement
12. Validity and Reliability
Validity
• The ability of a test, data
collection mechanism or process,
to accurately measure the variable
of interest:
– to distinguish diseased from non-diseased
subjects
– To measure the presence or absence of a
particular risk factor (exposure)
– To measure the magnitude of disease or risk
in a population
Threats to Validity

• Bias
• Confounding
• Chance
Reliability
• The degree to which results are
consistently measured by any type
of data collection instrument
– medical test
– medical record
– observation
– study questionnaire
Example:

• In a study assessing the association between
alcohol intake (high vs. low) and high blood
pressure the investigator calculated the
following results:

RR = 2.13 (95% C.I.: 1.05 - 12.10) p = 0.01

• CONCLUSIONS?

• PROBLEMS?
Conclusions:
RR = 2.13 (1.05 - 12.10) p = 0.01

• P-value:
The observed association differs significantly from the
null hypothesis (RR=1): if there were truly no
association, the probability of observing an effect at
least this large by chance alone would be 1 in 100.

• Confidence Interval:
If we did this study 100 times (took 100 different
samples from the target population) approximately
95% of the time the interval would cover the true
population measure.
Conclusions/Problems
RR = 2.13 (1.05 - 12.10) p = 0.01

• How sure can we really be that the true RR is 2.13?

• Why is this Confidence Interval so wide?

• What are the sources of error which may have affected
precision (wide CI) and internal validity?
Possible errors in every study
should be carefully considered
Error
• “Epidemiology can be considered an exercise in
measurement and estimation”
– Sander Greenland

• Epidemiologic studies attempt to approximate the “real
world” by evaluating the relationship between exposure and
outcome in a sample of people.

• Therefore, error is inevitable!
– Both in sample selection: It is seldom desirable, necessary,
or possible to study everyone.
– And in measurement: We can never measure exposure and
outcome perfectly.
Truth and Approximation
Truth in the “real world”                  Approximation in the study
• Actual people to whom the findings       Study population
  apply (TARGET POPULATION)
• Actual exposure and outcome              Measurements collected
• Actual relationship between              Results from a given study
  exposure and outcome

Difference between truth and approximation = error


Sources of Error in a Study

• Lack of study precision (RELIABILITY)
– Random error
• Lack of study validity (INTERNAL VALIDITY)
– Systematic error: selection bias, information bias
– Confounding


Types of Error in Epidemiologic Research

• Random error
– reflects fluctuations around the true value of a
parameter.
– is essentially attributable to sampling variation,
the extent of which may depend on aspects of the
study design (e.g. sample size) and statistical
characteristics of the estimator (e.g. its variance).
• Systematic or non-random error
– leads to BIAS
– reflects a deviation of results or inferences from
the truth.
– the processes leading to such deviation can be
introduced at any point in an investigation.
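
The contrast between the two types of error can be illustrated with a small simulation. This is only a sketch under stated assumptions: the true value (120), the noise SD (10), the bias (+5) and the function name `simulate_mean` are all hypothetical choices made for illustration.

```python
import random
import statistics


def simulate_mean(true_value, n, noise_sd, bias=0.0, seed=0):
    """Mean of n simulated readings = truth + bias + Gaussian random error."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    readings = [true_value + bias + rng.gauss(0, noise_sd) for _ in range(n)]
    return statistics.fmean(readings)


TRUE_VALUE = 120.0  # hypothetical true parameter value

for n in (10, 100, 10_000):
    random_only = simulate_mean(TRUE_VALUE, n, noise_sd=10.0)
    with_bias = simulate_mean(TRUE_VALUE, n, noise_sd=10.0, bias=5.0)
    print(f"n = {n:6d}   random error only: {random_only:7.2f}   "
          f"with systematic error: {with_bias:7.2f}")
```

As the sample grows, the estimate affected only by random error settles near the true value, while the biased estimate settles near the true value plus the bias: a larger study reduces random error but does not remove systematic error.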
Errors and Study Size

(BIAS)
Effect of Bias
• Bias will result in an estimate that is not the same as
the true value.

• Directions of bias:
– Away from the null:
• study RR=8, true RR=2
• study RR=0.5, true RR=0.9
– Towards the null:
• study RR=1.3, true RR=5.0
• Study RR=0.9, true RR=0.4
– “Switchover”:
• Study RR=0.5, true RR=2.0
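
The three directions can be expressed as a simple rule on the log scale, where the null value RR=1 becomes 0. The function below is an illustrative sketch (its name and return labels are assumptions); in practice the true RR is unknown, so this only formalizes the definitions used in the examples above.

```python
import math


def bias_direction(study_rr, true_rr):
    """Classify the direction of bias by comparing log relative risks."""
    if study_rr == true_rr:
        return "no bias"
    log_study = math.log(study_rr)
    log_true = math.log(true_rr)
    if log_study * log_true < 0:
        # Estimates lie on opposite sides of the null value RR = 1.
        return "switchover"
    if abs(log_study) > abs(log_true):
        return "away from the null"
    return "towards the null"


# The examples listed on the slide.
for study, true in [(8, 2), (0.5, 0.9), (1.3, 5.0), (0.9, 0.4), (0.5, 2.0)]:
    print(f"study RR={study}, true RR={true}: {bias_direction(study, true)}")
```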
Internal vs. External Validity

• Bias undermines internal validity, which is the
ability to measure what the study sets out to
measure.
– It requires proper selection of study subjects and
lack of error in measurement.

• External Validity concerns inferences to an
external population beyond the study’s
restricted interest.
– Such inferences require generalization based on
judgmental aspects, such as findings from other
studies and existing knowledge about the biology of
the disease.

• In this course, we limit the discussion of
validity to internal validity.
Sources of error

• Random error affects precision (RELIABILITY)
• Systematic error (bias) affects VALIDITY

[Figure: scatter of points (X) illustrating random versus systematic error. Source: Aday, 1996]
How can the relative risk or odds ratio be wrong?

• Lack of study precision (RELIABILITY)
– Random error
• Lack of study validity (INTERNAL VALIDITY)
– Systematic error: selection bias, information bias
– Confounding


Sources of error
(random or systematic)

Error can be introduced by the…
• Study observer/investigator
• Study participant
• Study instrument

During the process of…
• Selection of study subjects
• Measurement of disease and/or exposure
• Analysis or interpretation of findings


How do we prevent threats to validity
(systematic error) in our research?

1) Study design: Minimize bias
(more on this in upcoming lectures)

2) Study implementation:
Quality Assurance & Quality Control

3) Use “validated tools” (best if validated in your
population)
Assessing the Quality of a
Measurement Tool
• Accuracy (validity):
– sensitivity/specificity with a gold standard
• Either validity or reliability:
– Mean difference
– Kappa/% agreement
– Correlation
– Regression
• Reliability
– Correlation coefficient (ICC)
– Coefficient of variation (CV)
– Bland-Altman/limit of agreement (LOA)
Sensitivity/Specificity

• The ability of a test to distinguish
diseased from non-diseased subjects
Two by Two Table

              Disease Yes    Disease No    Total
Test Pos      TP             FP            TP + FP
Test Neg      FN             TN            TN + FN
Total         TP + FN        TN + FP
Sensitivity

• percentage of all true cases identified

Sensitivity = TP / (TP + FN) × 100


Specificity

• percentage of all true non-cases identified as negative

Specificity = TN / (TN + FP) × 100
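
Both formulas can be applied directly to the four cells of the two by two table. The sketch below uses hypothetical counts (100 diseased and 100 non-diseased subjects), chosen so that the results match the 90% figures quoted earlier for mammography; the function names are illustrative.

```python
def sensitivity(tp, fn):
    """Percentage of truly diseased subjects who test positive."""
    return tp / (tp + fn) * 100


def specificity(tn, fp):
    """Percentage of truly non-diseased subjects who test negative."""
    return tn / (tn + fp) * 100


# Hypothetical counts, not data from the lecture.
TP, FN = 90, 10   # diseased subjects: test positive / test negative
TN, FP = 90, 10   # non-diseased subjects: test negative / test positive

print(f"Sensitivity = {sensitivity(TP, FN):.1f}%")
print(f"Specificity = {specificity(TN, FP):.1f}%")
```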


Two by Two Table

              Disease Yes    Disease No    Total
Test Pos      TP             FP            TP + FP
Test Neg      FN             TN            TN + FN
Total         TP + FN        TN + FP

Sensitivity is read from the Disease Yes column; specificity from the Disease No column.
False positive & negative results
• False positives
– burden on the health care system
– unnecessary anxiety
– labeling
• False negatives
– delay treatment
– false sense of “security” regarding risk
behaviors
Improving sensitivity and/or specificity

• Sequential testing
– initial test positives examined using other
method
– improves specificity
• Simultaneous tests
– multiple variables assessed at the same time
– improves sensitivity
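
Why each strategy helps can be seen with a short calculation. The sketch below makes two assumptions that the slides do not state: the two tests err independently given true disease status, and the combined result is positive only if both tests are positive (sequential) or if either test is positive (simultaneous). The test characteristics are hypothetical.

```python
def sequential(sens1, spec1, sens2, spec2):
    """Second test given only to first-test positives; positive if both positive."""
    combined_sens = sens1 * sens2
    combined_spec = 1 - (1 - spec1) * (1 - spec2)
    return combined_sens, combined_spec


def simultaneous(sens1, spec1, sens2, spec2):
    """Both tests given together; positive if either test is positive."""
    combined_sens = 1 - (1 - sens1) * (1 - sens2)
    combined_spec = spec1 * spec2
    return combined_sens, combined_spec


# Hypothetical tests: A (sens 0.90, spec 0.90) and B (sens 0.80, spec 0.95).
seq_sens, seq_spec = sequential(0.90, 0.90, 0.80, 0.95)
sim_sens, sim_spec = simultaneous(0.90, 0.90, 0.80, 0.95)

print(f"Sequential:   sensitivity = {seq_sens:.3f}, specificity = {seq_spec:.3f}")
print(f"Simultaneous: sensitivity = {sim_sens:.3f}, specificity = {sim_spec:.3f}")
```

Under these assumptions, sequential testing raises specificity above that of either test at the cost of sensitivity, and simultaneous testing does the reverse.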
Measure of Yield

Predictive value positive (PVP) is the
proportion of people with a positive test who
actually have the disease

PVP = TP / (TP + FP)
Influences on PVP

PVP influenced by:


– sensitivity and specificity of the test used,
especially specificity
– prevalence of disease in the population
being tested
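
The dependence on prevalence follows from rewriting PVP = TP / (TP + FP) in terms of sensitivity, specificity and prevalence. The sketch below uses the 90% sensitivity and 90% specificity quoted earlier for mammography; the prevalence values and the function name are hypothetical choices for illustration.

```python
def predictive_value_positive(sens, spec, prevalence):
    """PVP = true positives / all positives, as proportions of the population."""
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Same test applied to populations with different (hypothetical) prevalence.
for prevalence in (0.01, 0.10, 0.50):
    pvp = predictive_value_positive(0.90, 0.90, prevalence)
    print(f"prevalence = {prevalence:.2f}   PVP = {pvp:.3f}")
```

With the same test, PVP falls sharply as the disease becomes rarer, because false positives from the large non-diseased group outnumber the true positives.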
Increasing Reliability:
(the precision and reproducibility of data collected)

1. Reduce intra-subject variability
– Repeated measurements
– Standardized data collection times

2. Reduce inter-observer variability
– Standardized diagnostic criteria, tests,
and instruments

3. Increase sample size
Assessing Reliability

– Inter-rater
• % agreement, kappa statistic
– Internal consistency
• Kuder-Richardson 20 (KR-20), Cronbach’s coefficient
alpha
– Test-retest
• Quantified by correlation coefficient
*See Szklo book for more examples*
Assessing agreement between observers,
instruments, etc.
• Percent (observed) agreement
– proportion of measurements that have the same
results by two (or more) methods, expressed as a
percentage

% agreement = (a + d) / (a + b + c + d)

• Kappa measure: the extent to which 2 measures agree,
taking into account the agreement expected by
chance alone (e.g., the agreement if two assessors rated
responses at random)
Calculating % (observed) agreement

              Test 1 +    Test 1 -    Total
Test 2 +      140         52          192
Test 2 -      69          725         794
Total         209         777         986

% agreement = (140 + 725) / 986 = 0.877, or 87.7%


Kappa
              Test 1 +    Test 1 -    Total
Test 2 +      140         52          192
Test 2 -      69          725         794
Total         209         777         986

1) Calculate % (observed) agreement

2) Calculate % chance agreement

Expected value for Test 1 + / Test 2 + (cell a) = (209 × 192) / 986 = 40.7
Expected value for Test 1 - / Test 2 - (cell d) = (777 × 794) / 986 = 625.7

3) Kappa = (% observed agreement − % chance agreement) / (1 − % chance agreement)

= [(140 + 725)/986 − (40.7 + 625.7)/986] / [1 − (40.7 + 625.7)/986]
= (0.877 − 0.676) / (1 − 0.676)
= 0.62
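
The three steps above can be checked with a short script. The cell labels a, b, c, d follow the layout of the table (a and d are the agreeing cells); the function name is an illustrative choice, not part of any library.

```python
def kappa_from_table(a, b, c, d):
    """Return (observed agreement, chance agreement, kappa) for a 2x2 table.

    a = both tests positive, d = both tests negative,
    b and c = the two discordant cells.
    """
    total = a + b + c + d
    observed = (a + d) / total
    # Expected counts in the agreeing cells if the two tests were independent:
    # (row total x column total) / grand total.
    expected_a = (a + b) * (a + c) / total
    expected_d = (c + d) * (b + d) / total
    chance = (expected_a + expected_d) / total
    kappa = (observed - chance) / (1 - chance)
    return observed, chance, kappa


# Counts from the table above.
observed, chance, kappa = kappa_from_table(a=140, b=52, c=69, d=725)
print(f"observed agreement = {observed:.3f}")
print(f"chance agreement   = {chance:.3f}")
print(f"kappa              = {kappa:.2f}")
```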
Evaluation of Kappa

Values of kappa range from –1 to 1:

– If kappa = 0, observed agreement same as chance
alone
– If kappa < 0, observed agreement worse than by
chance alone
– If kappa = 1, observed agreement = 100% (perfect!)

In medical research:
κ > 0.75 excellent
0.40 < κ < 0.75 good
0 < κ < 0.40 marginal/poor
Correlation coefficient

– Pearson for normally distributed data (actual values)


– Spearman for non-normally distributed data (ranks)
– Both measure the degree to which a scatter diagram
between 2 readings approaches a straight line (if one
goes up, the other goes up; if one goes down, the
other goes down)
– If replicate measurements show reliability, they will
be highly POSITIVELY correlated
• A negative correlation is NOT what you want in
reliability or validity studies
Correlation Coefficients


[Figure: three scatter plots of paired readings, panels A, B and C]

• All 3 r=1.0
– A. Both observers get same exact value
– B and C. Systematic differences between
observers, but very reliable differences
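
That a perfect correlation can coexist with systematic disagreement is easy to verify. In the sketch below the readings are hypothetical, and the two kinds of systematic difference (a constant shift and a proportional difference) are assumptions about what panels B and C might show; Pearson's r is computed from its definition.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient computed from its definition."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical readings by a first observer.
observer_1 = [100, 110, 120, 130, 140]

identical = list(observer_1)               # second observer gets the same values
shifted = [v + 10 for v in observer_1]     # always reads 10 units higher
scaled = [1.2 * v for v in observer_1]     # always reads 20% higher

for label, observer_2 in [("identical", identical),
                          ("shifted", shifted),
                          ("scaled", scaled)]:
    print(f"{label:9s} r = {pearson_r(observer_1, observer_2):.3f}")
```

All three comparisons give r = 1, so a high correlation alone does not show that two observers obtain the same values.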
Intraclass Correlation Coefficient (ICC)

• Also known as the reliability coefficient (RC)
• (more than one way to calculate)
• The fraction of the total measurement variability that
is due to variation between patients
• A high ICC indicates little variability due to the
technologist and large variability due only to the
patient: high is good!

• ICC = variation between patients / (variation between
patients + variation due to error)
• Therefore, substantially affected by the amount of
variation between patients (particularly important
when comparing studies)
• It is the equivalent of the kappa statistic for
continuous data, and typically ranges from 0 to 1
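
One of the several ways to calculate the ICC is from a one-way analysis of variance, sketched below. Choosing the one-way random-effects form is an assumption made for illustration, the function name is invented, and the duplicate measurements are hypothetical.

```python
def icc_oneway(ratings):
    """One-way random-effects ICC for n subjects with k measurements each.

    ratings: list of per-subject lists, all of the same length k.
    """
    n = len(ratings)
    k = len(ratings[0])
    subject_means = [sum(row) / k for row in ratings]
    grand_mean = sum(subject_means) / n
    # Between-subject mean square: variation between patients.
    ms_between = k * sum((m - grand_mean) ** 2 for m in subject_means) / (n - 1)
    # Within-subject mean square: variation due to measurement error.
    ms_within = sum((x - m) ** 2
                    for row, m in zip(ratings, subject_means)
                    for x in row) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)


# Hypothetical duplicate measurements on four patients.
measurements = [[10, 12], [20, 19], [30, 31], [40, 42]]
icc = icc_oneway(measurements)
print(f"ICC = {icc:.3f}")
```

The patients here differ widely from one another while the duplicate readings differ little, so the ICC is close to 1. The same measurement error in a more homogeneous group of patients would give a lower ICC, which is why comparisons across studies need care.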
Now that I know the reliability/validity for my
study, what do I do?

• If good – feel more confident in results


• If one measure better than others – use that one
• If not so good or systematically biased – may be able
to correct
– Simple or complex
• If not so good and can’t correct – need to mention in
discussion
Relationship between reliability and
validity

• A measure can, and often does, have reliability without
validity
Remember!

“If all appears to be going well in an epidemiologic study,
you have forgotten something”
– Kahn and Sempos
Thank you!
