
Biostatistics and Research Methodology

12/21/2019 Biostatistics: Compiled by Sead Z. 1


Course description
• This course extends Introduction to Statistics. It gives an
overview of research (e.g. its meaning and the steps to be
followed) and applies statistical knowledge to protocol
development, relating the objectives of a given study to
hypothesis testing, estimation, and assessing statistical
significance.


Course Objectives:
On successful completion of this course students
will be able to:
• Know and understand the basic concepts of biostatistics
• Be familiar with research design
• Be familiar with the skills of collecting, summarizing,
analyzing, interpreting and presenting data
• Be able to independently choose an appropriate statistical test
and conduct it
• Make a statistical inference
• Be able to determine the correct sample size
Course Content
1. Introduction
2. Study Designs
3. Data analysis and presentation
4. Analysis of continuous data
• t-test (different sample sizes, different variances)
• ANOVA
• Linear Regression Model
• Multiple Regression



Course Content (cont.)
5. Analysis of Categorical Data
• χ² (chi-squared) test
• Fisher’s exact test
• Multiple Logistic Regression
– Analysis of Count Data
6. Non-Parametric Statistical Tests
7. Sample Size Determination
• Field or Experimental trial
8. Research Protocol Development
9. Ethics in Research execution and communication
10. Projects
Further readings
• Thrusfield, M. 2007. Veterinary Epidemiology. 3rd edition, Blackwell Publishing Limited

• Salman, M. D. 2003. Animal Disease Surveillance and Survey Systems: Methods and
Applications. 1st edition, Blackwell Publishing Limited

• Bluman, A.G. Elementary Statistics, 2nd ed.

• Eshetu Wencheko. Introduction to Statistics.

• Freund, J.E. and G.A. Simon. Modern Elementary Statistics, 8th ed.

• Spiegel, M.R. Theory and Problems of Statistics, Schaum’s Outline Series.

• Downie, N.M. and R.W. Heath (1983). Basic Statistical Methods, 5th Edition. New York:
Harper and Row Publ.

• Moore, D.S., and G.P. McCabe, (1989). Introduction to the Practice of Statistics, New
York: W.H. Freeman and Company.



Statistics

Has TWO MEANINGS


• Specific numbers
• Methods of analysis



Statistics
• Specific Number
Numerical measurement determined by a set of
data

Example: 65% of students use a Facebook account


Statistics
• Methods of Analysis
A collection of methods for planning
experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing,
and drawing conclusions based on the data.
• Biostatistics
– The application of statistics to biological or
life science data


Definition

• Population: The complete collection of all elements
(scores, people, measurements, and so on) to be
studied. The collection is complete in the sense that it
includes all subjects to be studied.

– Population: Refers to all members of a defined group


Cont…
• Census: The collection of data from every element in a population

• Sample: A sub-collection of elements drawn from a population

Example: Patients in a hospital would constitute the entire
population for a study of infection control in that hospital.
However, for a study of infected patients in the nation’s hospitals, the
same group of patients would be but a tiny sample. The same
group can be a sample for one question about its characteristics
and a population for another question.
Cont…
• Parameter: A numerical measurement describing some
characteristic of a population.
• Statistic: A numerical measurement describing some
characteristic of a sample.
• Data: Numerical facts collected for some specific
purpose
• Quantitative data: Numbers representing counts or
measurements
Example: Cholesterol level in the blood
• Qualitative (or categorical or attribute) data: can be separated
into different categories that are distinguished by some
non-numeric characteristic.
Example: Gender (male/female)
Cont…
• Discrete data: Data that result when the number of possible values is
either a finite number or a ‘countable’ number of possible values
0, 1, 2, 3, ….
Example: The number of eggs that hens lay; for example, 3 eggs a day
• Continuous data: Numerical data that result from infinitely many possible
values that correspond to some continuous scale that covers a range of
values without gaps, interruptions, or jumps.
Example: The amounts of milk that cows produce; for example 2.341
gallons a day.


Data levels with reference to scales

• Nominal scales: This scale of measurement is
characterized by data that consist of names, labels,
or categories only. The data cannot be arranged in an
ordering scheme (such as low to high).

Example: Survey responses: yes, no, undecided.


Data levels with reference to scales

• Ordinal scales (levels): Involve data that may
be arranged in some order, but differences
between data values either cannot be
determined or are meaningless.
Example: Course grades A, B, C, D, or F
Stages of malignant cancer


Data levels with reference to scales
• Interval scales (levels): Like the ordinal level, with the
additional property that the difference between any
two data values is meaningful. However, there is no
natural zero starting point (where none of the quantity
is present).
Example: Years 1000, 2000, 1776 ….
Temperature in degrees Celsius
Intelligence quotient (IQ) scores
Data levels with reference to scales

• Ratio scales (levels) of measurement: The interval level
modified to include the natural zero starting point (where
zero indicates that none of the quantity is present). For
values at this level, differences and ratios are meaningful.
Example: Prices
Blood pressure
Weight….


Data levels with reference to scales

In summary

Nominal – Categories only

Ordinal – Categories with some order

Interval – Differences but no natural starting point

Ratio – Differences and a natural starting point.



Variables
• A variable is just a term for an observation or
reading giving information on the study
question to be answered.
• Blood pressure is a variable giving
information on hypertension.
• Blood Uric acid level is a variable giving
information on gout.



Variables
• Independent Variable: Is a variable that, for the
purposes of the study question to be answered, occurs
independently of the effects being studied.
– A variable thought to be the cause of some effect.
– This term is usually used in experimental research to
denote a variable that the experimenter has manipulated.



Variables

• Dependent Variable: Is a variable that
depends on, or more exactly is influenced by,
the independent variable.
– A variable thought to be affected by
changes in an independent variable. You
can think of this variable as an outcome.


Variables Cont…
• In a study on gout, suppose we ask if blood uric acid (level) is
a factor in causing pain.
– We record blood uric acid level as a measurable variable that
occurs in the patient.

– Then we record pain as reported by the patient.

– We believe blood uric acid level is predictive of pain.

– In this relationship, the blood uric acid is the independent
variable and pain is the dependent variable.


Data Summarization

• Approaches

– Tables/graphs

– Numerical summary measures



Data Summarization
• Numerical summary measures

– Measures of central location

– Measures of dispersion

– Measures of relative standing/detectors of outliers

– Measures of shape

– Measures of association



Data Summarization
• Measures of central location
– Arithmetic mean, median and mode
• The arithmetic mean is unique and takes into account all data
points, but is sensitive to extreme values

• The median is unique and is not affected by extreme values

• The mode might not exist
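
The properties listed above can be checked with a small sketch. The blood pressure readings below are hypothetical values invented for illustration; only the Python standard library is assumed.

```python
from statistics import mean, median, multimode

# Hypothetical systolic blood pressure readings (mmHg), invented for illustration.
readings = [110, 115, 120, 125, 130]
with_outlier = readings + [250]  # add one extreme value

# The mean shifts noticeably when the extreme value is added...
print(mean(readings), mean(with_outlier))
# ...while the median moves only slightly.
print(median(readings), median(with_outlier))

# Every value occurs exactly once, so no single value stands out as "the" mode;
# multimode() returns all tied values.
print(multimode(readings))
```

Here the mean rises from 120 to about 141.7 while the median only moves from 120 to 122.5, which is the sense in which the median is not affected by extreme values.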



Data Summarization
• Measures of Dispersion
– Measure the spread of observations of a
distribution
• Variance, standard deviation, and coefficient of variation
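
As a minimal sketch of these three measures, assuming sample (n − 1) formulas and hypothetical body weights invented for illustration:

```python
from statistics import mean, stdev, variance

# Hypothetical body weights in kg, invented for illustration.
weights_kg = [52.0, 55.0, 58.0, 61.0, 64.0]

m = mean(weights_kg)          # central location
var = variance(weights_kg)    # sample variance, divides by n - 1
sd = stdev(weights_kg)        # sample standard deviation = sqrt(variance)
cv = sd / m * 100             # coefficient of variation, in percent

print(f"mean={m:.1f}  variance={var:.1f}  sd={sd:.2f}  CV={cv:.1f}%")
```

The coefficient of variation is unit-free, so it can be used to compare the spread of variables measured on different scales.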



Data Summarization
• Measures of relative standing
–Tell the position of a particular
observation relative to others
• Standard score (Z-score)
• Percentiles
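
A short sketch of both measures, using hypothetical exam scores invented for illustration. Note that software packages use slightly different interpolation rules for percentiles, so quartile values may differ a little between tools.

```python
from statistics import mean, stdev, quantiles

# Hypothetical exam scores, invented for illustration.
scores = [55, 60, 65, 70, 75, 80, 85]

m, sd = mean(scores), stdev(scores)

# Standard score: how many standard deviations an observation lies from the mean.
z = (85 - m) / sd
print(f"z-score of 85: {z:.2f}")

# Quartiles are the 25th, 50th and 75th percentiles.
q1, q2, q3 = quantiles(scores, n=4)
print(q1, q2, q3)
```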



Data Summarization

• Measures of association

– Odds ratio, risk ratio, risk difference

– Correlation coefficient, regression coefficients


Measures of Association
• A measure of association quantifies the
strength (magnitude) of the statistical association
between the exposure and the health problem of
interest.
• Estimates the size/strength of association between exposure
and outcome
• Measures of association are sometimes called measures
of effect


Why Measures of Association?
• Cross-tabs and scatter plots are flexible tools for
exploring relationships between variables
• Chi-squared test evaluates statistical significance
• Neither method provides a summary measure of
the relationship
– What is the direction?
– How strong is the relationship?
• Measures of association seek to provide this
information


Measures of association……
• In cohort studies, the most commonly used measure is the
relative risk.
• In case-control studies, the odds ratio is the most
commonly used.
• In cross-sectional studies, either a prevalence
ratio or a prevalence odds ratio is used.


Risk Ratio
• Also called relative risk, it compares the risk of a
health event among one group with the risk among
another group.

• Compares incidence among exposed with incidence
among non-exposed

– Compare exposed and non-exposed or diseased and
non-diseased


Data layout to show the disease-exposure
relationship
• Data presentation…

            Disease
Exposure    Yes      No       Total
Yes         a        b        a+b
No          c        d        c+d
Total       a+c      b+d      a+b+c+d


Relative Risk ….

$$RR = \frac{\text{Incidence rate in exposed}}{\text{Incidence rate in non-exposed}} = \frac{a/(a+b)}{c/(c+d)}$$

Cut-off points for RR:

• If RR = 1, there is no association between the exposure and disease

• If RR > 1, there is a positive association between exposure and disease

• If RR < 1, there is a negative association between exposure and disease


Relative Risk…..
• Example: In an outbreak of varicella (chickenpox) in Oregon in 2002,
varicella was diagnosed in 18 of 152 vaccinated children compared with 3
of 7 unvaccinated children. Calculate the risk ratio.

                Varicella
                Yes      No       Total
Vaccinated      a=18     b=134    152
Unvaccinated    c=3      d=4      7
Total           21       138      159

Solution:
• Risk of varicella among vaccinated children = 18 / 152 = 0.118
• Risk of varicella among unvaccinated children = 3 / 7 = 0.429
• Risk ratio = 0.118 / 0.429 = 0.28
• The risk ratio is less than 1.0, indicating a decreased risk or protective
effect for the exposed (vaccinated) children.
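
The same calculation can be sketched in a few lines of Python. The function name `risk_ratio` is a hypothetical helper introduced here; the cell labels a, b, c, d follow the 2×2 layout used in these slides.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table.

    a = exposed with disease,    b = exposed without disease
    c = unexposed with disease,  d = unexposed without disease
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

# Varicella outbreak figures from the example: 18 of 152 vaccinated
# and 3 of 7 unvaccinated children were diagnosed.
rr = risk_ratio(a=18, b=134, c=3, d=4)
print(round(rr, 2))  # 0.28
```

A value below 1 reproduces the protective effect described in the worked solution.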



Odds Ratio
• Quantifies the relationship between an exposure with two
categories and a health outcome
• Used to examine the strength of association between risk factor and
outcome in a case-control study

$$OR = \frac{a/c}{b/d} = \frac{ad}{bc}$$

Cut-off points for OR:
- If OR = 1, the odds of exposure among cases and controls are the same
- If OR > 1, the odds of exposure among cases are higher than among
controls
- If OR < 1, the odds of exposure among cases are lower than among
controls


Odds Ratio…
Example: Exposure and disease in a hypothetical population of
10,000 persons, given in the table below

                Disease
                Yes      No       Total
Exposed         a=100    b=1,900  2,000
Unexposed       c=80     d=7,920  8,000
Total           180      9,820    10,000

Solution: Odds ratio = ad/bc = (100 x 7,920) / (1,900 x 80) = 5.2

Since OR > 1, the odds of exposure among cases are higher than
among controls; this indicates a positive association
between risk factor and outcome (exposure and disease).
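
A minimal sketch of the cross-product calculation, using the cell counts from the table above. The function name `odds_ratio` is a hypothetical helper, not part of any library.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table using the cross-product ad / bc."""
    return (a * d) / (b * c)

# Cell counts from the hypothetical population of 10,000 persons.
odds = odds_ratio(a=100, b=1900, c=80, d=7920)
print(round(odds, 1))  # 5.2
```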



Bias and confounding
• Bias is any trend in the collection, analysis,
interpretation, publication or review of data that can
lead to conclusions that are systematically different from
the truth.
• Bias can occur during any stage of a study:
– during the literature review of the study question
– during the selection of the study sample
– during the measurement of exposure and outcome
– during the analysis of data
– during the interpretation of the analysis
– during the publication of the results
Bias…
• Most biases, however, can be categorised into one of three general
types:
– Selection bias
– Information bias
– Confounding bias
• Selection bias: Occurs during the execution of a study when some
subjects are included and not others
– Admission bias: occurs when case-control and cross-sectional
studies are done exclusively in hospital settings
– Prevalence/incidence bias: happens when asymptomatic
cases as well as fatal short disease episodes are missed.
– Volunteer bias: occurs when those who volunteer to
participate in a study differ systematically from those who do not
Information (Observation) Bias
• Information bias occurs in the data collection stage of
studies
– Interviewer bias: systematic differences in soliciting, recording, or
interpreting information from study subjects
– Questionnaire bias: difference in accuracy
between compared groups.
– Recall bias: when people who have had adverse
health outcomes recall past exposures differently from those who have not


Confounding Bias
• Confounding bias: occurs when a factor (confounder) associated
with the exposure of interest is also associated with
development of the disease or outcome of interest
independently of exposure.

• A confounder must be predictive of disease occurrence
independent of its association with the exposure of interest.

• The confounding variable can affect the association between
exposure and disease positively or negatively.


Basic Concepts
Control groups and placebos
A frequent mechanism to pinpoint the effect of a
treatment and to reduce bias is to provide a control
group having all the characteristics of the experimental
group except the treatment under study.
For example: give a paracetamol tablet to the drug group and a
lactose tablet (placebo) to the control group; then compare their
fever-reducing effects.


Exploratory Methods

• Exploring data
– Data checking
– Understand distribution of variables
– Understand nature and strength of
relationships between variables



EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are
very important steps in any analysis task.

• get to know your data!


– distributions (symmetric, normal, skewed)
– data quality problems
– outliers
– correlations and inter-relationships
– subsets of interest
– suggest functional relationships

Definition of EDA
• It is an approach for data analysis that employs a
variety of techniques (mostly graphical)
Main reasons we use EDA:
• Detection of mistakes
• checking of assumptions
• Preliminary selection of appropriate models
• Determining relationships among the explanatory
variables
• Assessing the direction and rough size of
relationships between explanatory and outcome
variables.
• maximize insight into a data set;
• uncover underlying structure;
• extract important variables;
• detect outliers and anomalies;
• test underlying assumptions;
• determine optimal factor settings.
• find odd values
Typical data format
• Spreadsheet (stores data in rows and columns)
• Database (e.g. tabular form)
Classification of EDA
Graphical
• Scatter plot
• histogram,
• box plot,
• residual plot,
• Probability plot
Such graphical tools are the shortest path to
gaining insight into a data set in terms of
• testing assumptions
• model selection
• model validation
• estimator selection
• relationship identification
• factor effect determination
• outlier detection
• identify useful raw data & transforms (e.g. log(x))
Non-graphical (summary statistics)
• Averages (mean, median, etc.)
• Quantiles
Exploratory data analysis:
One variable
• Graphical displays
– Qualitative/categorical data: bar chart, pie chart, etc.
– Quantitative data: histogram, stem-leaf, boxplot, timeplot etc.

• Summary statistics
– Qualitative/categorical: contingency tables
– Quantitative: mean, median, standard deviation, range etc.

• Probability models
– Qualitative: Binomial distribution (others we won’t cover in this
class)
– Quantitative: Normal curve (others we won’t cover in this class)
Summary of categorical variables

• Graphically
– Bar graphs, pie charts
• A bar graph is nearly always preferable to a pie chart. It is
easier to compare bar heights than slices of a
pie

• Numerically: tables with total counts or percents
Summary table
• We summarize categorical data using a table. Note that
proportions or percentages are often called relative frequencies.

Class                      Frequency         Relative frequency
Highest Degree Obtained    Number of CEOs    Proportion
None                       1                 0.04
Bachelors                  7                 0.28
Masters                    11                0.44
Doctorate / Law            6                 0.24
Totals                     25                1.00
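
The relative frequency column can be reproduced from the counts alone. A minimal sketch using the CEO degree counts from the table:

```python
# Frequency counts from the summary table above.
counts = {"None": 1, "Bachelors": 7, "Masters": 11, "Doctorate / Law": 6}

total = sum(counts.values())
# Relative frequency = class frequency divided by the total count.
relative = {degree: n / total for degree, n in counts.items()}

for degree, n in counts.items():
    print(f"{degree:<16} {n:>3} {relative[degree]:.2f}")
print(f"{'Totals':<16} {total:>3} {sum(relative.values()):.2f}")
```

The relative frequencies always sum to 1, which is a quick consistency check on any such table.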
Bar graph
• The bar graph quickly
compares the degrees of the
four groups
• The heights of the four bars
show the counts for the four
degree categories
Pie chart
• A pie chart helps us see
what part of the whole
each group forms
• To make a pie chart, you
must include all the
categories that make up a
whole
Quantitative variables
• Graphical summary
– Histogram
– Stemplots
– Time plots
– more

• Numerical summary
– Mean
– Median
– Quartiles
– Range
– Standard deviation
– more
Single Variable Visualization
• Histogram:
– Shows center, variability, skewness, modality,
outliers, or strange patterns.
– Bin width and position matter
– Beware of real zeros
Issues with Histograms
• For small data sets, histograms can be misleading.
– Small changes in the data, bins, or anchor can deceive

• For large data sets, histograms can be quite effective at
illustrating general properties of the distribution.

• Histograms effectively only work with 1 variable at a
time
– But ‘small multiples’ can be effective
Boxplots
• Shows a lot of information about a
variable in one plot
– Median
– IQR
– Outliers
– Range
– Skewness
• Negatives
– Overplotting
– Hard to tell distributional shape
– no standard implementation in
software (many options for
whiskers, outliers)

Exploratory data analysis: two variables
• There are three combinations of variables we must consider.
We do so in the following order
– 1 qualitative/categorical, 1 quantitative variable
• Side-by-side box plots, counts, etc.

– 2 quantitative variables
• Scatter plots, correlations, regressions

– 2 qualitative/categorical variables
• Contingency tables (we will cover these later in the
semester)
Side-by-side box plots
• Side-by-side box plots are graphical summaries of data when
one variable is categorical and the other quantitative
• These plots can be used to compare the distributions
associated with the quantitative variable across the levels
of the categorical variable
Box plots

• A box plot is a graph of five
numbers (often called the
five-number summary)
– minimum

– Maximum

– Median

– 1st quartile

– 3rd quartile
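
A sketch of how the five numbers behind a box plot can be computed. The milk yields are hypothetical values invented for illustration, and `five_number_summary` is a helper name introduced here. Quartile conventions differ between software packages, so Q1 and Q3 may not match other tools exactly.

```python
from statistics import median, quantiles

def five_number_summary(values):
    """Return (minimum, Q1, median, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    q1, _, q3 = quantiles(data, n=4)  # default 'exclusive' interpolation
    return min(data), q1, median(data), q3, max(data)

# Hypothetical daily milk yields in gallons, invented for illustration.
milk = [2.1, 2.3, 2.4, 2.6, 2.9, 3.0, 3.4]
summary = five_number_summary(milk)
print(summary)
```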

Two Continuous Variables
• For two numeric variables, the scatterplot
is the obvious choice

2D Scatterplots
• standard tool to display relation
between 2 variables
– e.g. y-axis = response, x-axis =
suspected indicator
• useful to answer:
– x,y related?
• linear
• quadratic
• other
– variance(y) depend on x?
– outliers present?
Scatter Plot: No apparent relationship

Scatter Plot: Linear relationship

Scatter Plot: Quadratic relationship

Scatter plot: Homoscedastic

Why is this important in classical statistical modelling?

Scatter plot: Heteroscedastic

variation in Y differs depending on the value of X


e.g., Y = annual tax paid, X = income

Two variables - continuous
• Scatterplots
– But can be bad with lots of data

Two variables - continuous
• What to do for large data sets
– Contour plots

Describing scatter plots
• Form
– Linear, quadratic, exponential

• Direction
– Positive association
• An increase in one variable is accompanied by an increase in the other

– Negative association
• A decrease in one variable is accompanied by an increase in the other

• Strength
– How closely the points follow a clear form
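
For a linear form, direction and strength are commonly summarized by the Pearson correlation coefficient. The sketch below is an illustration only: the age and blood pressure pairs are hypothetical, and `pearson_r` is a helper written here rather than a library call.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation: sign gives direction, magnitude gives linear strength."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical ages (years) and systolic blood pressures (mmHg).
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 131, 135, 142]

r = pearson_r(age, sbp)
print(f"r = {r:.3f}")  # close to +1: strong positive linear association
```

The coefficient only measures linear association, so a clear quadratic pattern can still give a value near zero; the scatter plot should be inspected first.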

Describing scatter plots

• Form:
– Linear

• Direction
– Positive

• Strength
– Strong
Two Variables - one categorical
• Side by side boxplots are very effective in showing differences in a
quantitative variable across factor levels
– tips data
• do men or women tip better
– orchard sprays
• measuring potency of various orchard sprays in repelling honeybees

Barcharts and Spineplots
stacked barcharts can be
used to compare
continuous values across
two or more categorical
ones.

orange=M blue=F

spineplots show
proportions well, but can
be hard to interpret

Study Types



Study Types

Case-Control Study:

A case-control study is a study in which a
group of patients (the cases) is chosen for being
characterized by some outcome factor, such as
having acquired a disease, and a control group
lacking this factor is matched patient for patient.


Cohort Study
A cohort study starts by choosing groups that have
already been assigned to study categories, such as
diseases or treatments, and follows these groups
forward in time to assess the outcomes.



Randomized Controlled Trial

The soundest type of study is the randomized controlled
trial (RCT), often called a clinical trial. An RCT is a
true experiment in which patients are assigned
randomly to a study category, such as clinical treatment,
and are then followed forward in time (making it a
prospective study) and the outcome is assessed.


Paired and Crossover Designs

Some studies permit a design in which the patients
serve as their own controls, as in a “before-and-after”
study or a comparison of two treatments in
which the patient receives both in sequence.


STEPS THAT WILL AID IN PLANNING

1. Start with objectives. Specify, clearly and
unequivocally, a question to be answered about an
explicitly defined population

2. Develop the background and relevance

3. Plan your materials. From where will you obtain
your equipment?
STEPS THAT WILL AID IN PLANNING

4. Plan your methods and data. Identify at least 1
measurable variable capable of answering your
question. Define the specific data that will satisfy
your objectives and verify that your methods will
provide these data. Develop clearly specified null
and alternative hypotheses.
STEPS THAT WILL AID IN PLANNING
5. Plan data recording. Develop a raw data entry sheet
and a spreadsheet into which the raw data will be transferred to
facilitate analysis by computer software
6. Define the subject population and verify that your
sampling procedures will sample representatively.
7. Ensure that your sample size will satisfy your
objectives
STEPS THAT WILL AID IN PLANNING
8. Anticipate what statistical analysis will yield results
that will satisfy your objectives
9. Plan tests for sampling bias
10. Plan the bridge from results to conclusion
11. Anticipate the form in which your conclusion will be
exercised
12. Now you can draft an abstract.

• Assignment 1: Write a clear research title for your area

• Assignment 2: Prepare a questionnaire/tool for
collecting data from the defined population.


Data Presentation and processing
• Before analysis, data should be
– edited,
– coded,
– entered into computers, and
– cleaned
for consistency.


• Editing: Data should be edited for consistency
before analysis.

• Editing involves checking and correcting
all incomplete, erroneous and contradictory
responses recorded in the questionnaires.

• Editing includes discarding the questionnaires which
cannot be improved by applying the above activities.


• Coding: The process of transforming the recorded
responses into codes.
• Once the data have been edited and coded, the next step
is entering the data into computers so that they can
be processed and outputs produced.
• Data cleaning is then carried out based on edit-specification
programs; this is the final consistency check
before analyzing the results.


• Analysis: In this step the presented data
are investigated using different
statistical techniques.

• Among the different methods of analyzing data are
simple descriptive analyses
such as measures of central tendency,
measures of variation and so on.


• Interpretation of results: Once the final
outputs have been produced, the results
are interpreted.
• This stage requires due attention, as this is the
final result that decision makers will use;
a wrong interpretation can mislead them
into a wrong decision.


Numerical summaries
Measures of center:
• Mean
• Mode
• Median
Measures of dispersion
• Variance and standard deviation
• Standard score and coefficient of variation



One sample inference

• Estimation of single population parameter

• Hypothesis testing of single population parameter



Regression analysis



Regression Model

• Even if we are interested only in the relationship between a
response and an explanatory variable, we may still have to control
for at least one confounder that can influence the relationship
under investigation.
• In this chapter we will use models as the basis of such analyses.

• The goal is to find the best fitting and most parsimonious, yet
biologically reasonable model to describe the relationship
between an outcome (dependent or response variable) and a set
of independent (predictor or explanatory) variables.


Regression Model

• A good-fitting model has several benefits.
– inferences for model parameters help us evaluate which
explanatory variables affect the response, while controlling
effects of possible confounding variables.
– estimation of parameters is more informative than mere
significance testing (sizes of estimated model parameters
determine the strength and importance of the effects).
– model based predicted values can be obtained.
– models can handle more complicated situations than those in
previous chapters (e.g. analyzing simultaneously the effects of
several explanatory variables).
Type of Regression Models

• Depending on the type of the response variable, we classify
regression models as follows:
– normal: linear regression
– binary: probit and logit (logistic) regression
– counts: Poisson regression
– categorical data: log-linear modelling
– time-to-event: survival regression


Linear Regression Model

• The purpose of linear regression is to analyze the relationship
between metric or dichotomous independent variables and a
metric dependent variable.

• It is used to answer questions such as:
– Do changes in age result in changes in systolic blood pressure (SBP),
and if so, do the results depend on other characteristics such as sex,
cholesterol level and so on?


Linear Regression Model

• The goal of regression analysis is to understand how the values of
Y (outcome variable) change as X (the predictor variable) is
varied over its range of possible values.

• An essential first step in regression analysis is to draw appropriate
graphs of the data.

• A fundamental graphical tool for looking at regression data is
the scatter plot


Scatter plots of y versus x
• No relationship between x and y: the spread is even in all directions.
• Linear relationship: a line indicates the main direction
of the spread of points.
• Non-linear relationship between x and y: a curve best
describes the relationship.
The Model
• Suppose we have n subjects and measure the following on each
subject:
– 𝑌 = (𝑦1 , 𝑦2 ,…, 𝑦𝑛 ), the response
– 𝑋1 = (𝑋11 , 𝑋12 ,…, 𝑋1𝑛 ), independent variable 1
– 𝑋2 = (𝑋21 , 𝑋22 ,…, 𝑋2𝑛 ), independent variable 2
– 𝑋3 = (𝑋31 , 𝑋32 ,…, 𝑋3𝑛 ), independent variable 3

• Aim: To study the relationship between 𝑌 and 𝑋


The Model

• Model-1: Simple linear regression:

𝑌 = β0 + β1 𝑋1 + ε

• Model-2: Multiple linear regression:

𝑌 = β0 + β1 𝑋1 + β2 𝑋2 + β3 𝑋3 + ε

Where ε is the error term (estimated by the residual), the part of 𝑌 that
cannot be accounted for by the model; it is assumed to be normally
distributed with mean 0 and variance σ².



Assumptions of Linear Regression Model

1. Linear relationship between the outcome variable (y) and
explanatory variable (x)
2. The outcome variable (y) should be normally distributed for
each value of explanatory variable (x)
3. Standard deviation of y should be approximately the same for
each value of x
4. All the observations should be independent


Assumptions of linear regression

[Figure: scatter plots illustrating the assumptions]
Assumption 1: Linear relationship
Assumption 2: Y normally distributed at each value of x
Assumption 3: Same variance at each value of x


Testing Assumptions:
Assumption 1: linear relationship
Plot y against x to check for linearity



Testing Assumptions:
Assumption 2: Normality
[Figure: Histogram of residuals]
[Figure: Normal P-P plot of standardized residuals, dependent variable BMI;
x-axis: Observed Cum Prob, y-axis: Expected Cum Prob]
Assumption 3 can be stated as follows:
• The variance of y is the same for any x; that is, the spread of
values for y at each level of x remains approximately constant:

$$\sigma^2_{y|x_1} = \sigma^2_{y|x_2} = \sigma^2_{y|x_3} = \sigma^2$$

[Figure: spread of y|x at each value of x]

Testing Assumptions:
Assumption 3: Spread of y values constant over range of x values
(plot of residuals against X)

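Alongside the plot of residuals against x, a crude numerical check can compare the spread of the residuals for small and large x values. The sketch below is an informal aid, not a formal test; the helper names and the data are hypothetical.

```python
from statistics import mean, pvariance

def residuals_simple(x, y):
    """Residuals from a least-squares fit of y = b0 + b1*x."""
    x_bar, y_bar = mean(x), mean(y)
    b1 = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
          / sum((xi - x_bar) ** 2 for xi in x))
    b0 = y_bar - b1 * x_bar
    return [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

def spread_ratio(x, res):
    """Residual variance for the larger x values divided by that for the smaller x values."""
    ordered = [r for _, r in sorted(zip(x, res))]
    half = len(ordered) // 2
    return pvariance(ordered[-half:]) / pvariance(ordered[:half])

# Hypothetical data with alternating positive and negative errors
x = [1, 2, 3, 4, 5, 6, 7, 8]
signs = [1, -1, 1, -1, 1, -1, 1, -1]
y_constant = [2 * xi + s for xi, s in zip(x, signs)]             # same spread everywhere
y_growing = [2 * xi + 0.5 * xi * s for xi, s in zip(x, signs)]   # spread grows with x

ratio_constant = spread_ratio(x, residuals_simple(x, y_constant))
ratio_growing = spread_ratio(x, residuals_simple(x, y_growing))
```

A ratio near 1 is consistent with Assumption 3, while a ratio far from 1 suggests that the spread of y changes with x.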


Types of linear regression model
• There are three types of multiple regression, each of which is
designed to answer a different question:

– Standard multiple regression is used to evaluate the
relationships between a set of independent variables and a
dependent variable.
– Hierarchical, or sequential, regression is used to examine the
relationships between a set of independent variables and a
dependent variable, after controlling for the effects of some
other independent variables on the dependent variable.
– Stepwise, or statistical, regression is used to identify the subset
of independent variables that has the strongest relationship to a
dependent variable.


Types of linear regression model
Standard multiple regression

• In standard multiple regression, all of the independent variables are
entered into the regression equation at the same time.

• Multiple R and R² measure the strength of the relationship between
the set of independent variables and the dependent variable.

• An F test is used to determine if the relationship can be generalized
to the population represented by the sample.

• A t-test is used to evaluate the individual relationship between each
independent variable and the dependent variable.

Types of linear regression model
Hierarchical multiple regression

• In hierarchical multiple regression, the independent variables are
entered in two stages.
• In the first stage, the independent variables that we want to control
for are entered into the regression.
• In the second stage, the independent variables whose relationship
we want to examine after the controls are entered.
• A statistical test of the change in R² from the first stage is used to
evaluate the importance of the variables entered in the second
stage.


Types of linear regression model
Stepwise regression
• Stepwise regression is designed to find the most parsimonious set
of predictors that are most effective in predicting the dependent
variable.
• Variables are added to the regression equation one at a time, using
the statistical criterion of maximizing the R² of the included
variables.
• When none of the possible additions can make a statistically
significant improvement in R², the analysis stops.


Problem 1 - standard multiple regression
Is the following statement true, false, or an incorrect application of a
statistic? Assume that there is no problem with missing data,
violation of assumptions, or outliers. Use a level of significance of
0.05.

• “Body weight in pound” and “age of the person in years” have a
strong relationship to the variable “systolic blood pressure”.

• Survey respondents who had lower body weight had a lower systolic
blood pressure. Survey respondents who were younger had a lower
systolic blood pressure.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

Problem 1 - standard multiple regression
• When a problem states that there is a relationship between some
independent variables and a dependent variable, we do standard
multiple regression.
• The variables listed first in the problem statement are the
independent variables (ivs): “Body weight in pound” and “age of the
person in years”.
• The variable that they are related to is the dependent variable (dv):
“systolic blood pressure”.

Problem 1 - standard multiple regression

In order for a problem to be true, we will have to find:
• a statistically significant relationship between the ivs and the dv
• a relationship of the correct strength

The relationship of each of the independent variables to the dependent
variable must be statistically significant and interpreted correctly.

Problem 1 - standard multiple regression

The probability of the F statistic (27.92) for the overall regression
relationship is <0.001, less than or equal to the level of significance
of 0.05. We reject the null hypothesis that there is no relationship
between the set of independent variables and the dependent variable
(R² = 0). We support the research hypothesis that there is a
statistically significant relationship between the set of independent
variables and the dependent variable.

ANOVA(b)
Model            Sum of Squares   df   Mean Square   F        Sig.
1  Regression    10287.769         2   5143.884      27.924   .000(a)
   Residual      14921.219        81    184.213
   Total         25208.988        83
a. Predictors: (Constant), Body Weight in pound, Age of the person in years
b. Dependent Variable: Systolic Blood Pressure in mmHg
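The F statistic in the ANOVA table is the regression mean square divided by the residual mean square. The short sketch below re-derives it from the sums of squares and degrees of freedom in the table; the function name `anova_f` is illustrative only.

```python
def anova_f(ss_regression, df_regression, ss_residual, df_residual):
    """Return (regression mean square, residual mean square, F statistic)."""
    ms_regression = ss_regression / df_regression
    ms_residual = ss_residual / df_residual
    return ms_regression, ms_residual, ms_regression / ms_residual

# Sums of squares and degrees of freedom copied from the ANOVA table
ms_reg, ms_res, f_stat = anova_f(10287.769, 2, 14921.219, 81)
print(f"MS regression = {ms_reg:.3f}, MS residual = {ms_res:.3f}, F = {f_stat:.3f}")
```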


Problem 1 - standard multiple regression

The Multiple R for the relationship between the set of independent
variables and the dependent variable is 0.639, which would be
characterized as strong using the rule of thumb that a correlation:
• less than or equal to 0.20 is very weak
• greater than 0.20 and less than or equal to 0.40 is weak
• greater than 0.40 and less than or equal to 0.60 is moderate
• greater than 0.60 and less than or equal to 0.80 is strong
• greater than 0.80 is very strong

Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .639(a)   .408       .393                13.572
a. Predictors: (Constant), Body Weight in pound, Age of the person in years
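The Model Summary values can be recovered from the ANOVA sums of squares for Problem 1. The sketch below assumes n = 84 (total df of 83 plus 1) and k = 2 predictors, and encodes the rule of thumb quoted above; `strength_label` is a hypothetical helper name.

```python
import math

def strength_label(r):
    """Rule-of-thumb label for the size of a correlation."""
    r = abs(r)
    if r <= 0.20:
        return "very weak"
    if r <= 0.40:
        return "weak"
    if r <= 0.60:
        return "moderate"
    if r <= 0.80:
        return "strong"
    return "very strong"

# Values copied from the Problem 1 ANOVA table
ss_regression, ss_residual, ss_total = 10287.769, 14921.219, 25208.988
n, k = 84, 2  # assumed: n = total df + 1, k = number of predictors

r_square = ss_regression / ss_total
multiple_r = math.sqrt(r_square)
adjusted_r_square = 1 - (1 - r_square) * (n - 1) / (n - k - 1)
std_error_estimate = math.sqrt(ss_residual / (n - k - 1))
label = strength_label(multiple_r)
```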


Problem 1 - standard multiple regression

For the independent variable Age, the probability of the t statistic
(7.114) for the b coefficient is <0.001, which is less than or equal to
the level of significance of 0.05. We reject the null hypothesis that
the slope associated with age is equal to zero (b = 0) and conclude
that there is a statistically significant relationship between Age of
the respondent and systolic blood pressure.

Coefficients(a)
                                Unstandardized          Standardized
                                Coefficients            Coefficients
Model                           B        Std. Error     Beta           t        Sig.
1  (Constant)                   83.019   7.393                         11.229   .000
   Age of the person in years     .553    .078          .608            7.114   .000
   Body Weight in pound           .085    .038          .191            2.239   .028
a. Dependent Variable: Systolic Blood Pressure in mmHg
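Each t statistic in the Coefficients table is the b coefficient divided by its standard error. The sketch below repeats that division with the rounded values printed in the table, so the results agree with the reported t values only to within rounding error.

```python
# (B, Std. Error, reported t) copied from the Coefficients table
rows = {
    "(Constant)": (83.019, 7.393, 11.229),
    "Age of the person in years": (0.553, 0.078, 7.114),
    "Body Weight in pound": (0.085, 0.038, 2.239),
}

# t = B / Std. Error for each row
t_values = {name: b / se for name, (b, se, _) in rows.items()}

for name, t in t_values.items():
    reported = rows[name][2]
    print(f"{name}: t = {t:.3f} (table reports {reported})")
```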


Problem 1 - standard multiple regression

The b coefficient associated with Age (0.553) is positive, indicating a
direct relationship in which higher numeric values for age are
associated with higher numeric values for systolic blood pressure.


Problem 1 - standard multiple regression

For the independent variable body weight, the probability of the t
statistic (2.239) for the b coefficient is 0.028, which is less than or
equal to the level of significance of 0.05. We reject the null
hypothesis that the slope associated with body weight is equal to zero
(b = 0) and conclude that there is a statistically significant
relationship between body weight and systolic blood pressure.


Problem 1 - standard multiple regression

The b coefficient associated with body weight (0.085) is positive,
indicating a direct relationship in which higher numeric values for
body weight are associated with higher values of systolic blood
pressure.
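Taken together, the b coefficients for Problem 1 define a fitted equation for systolic blood pressure. The sketch below evaluates it for a hypothetical person (the age of 50 years and weight of 150 pounds are made-up inputs, not course data).

```python
# Unstandardized coefficients from the Problem 1 Coefficients table
B_CONSTANT = 83.019
B_AGE = 0.553      # mmHg per year of age
B_WEIGHT = 0.085   # mmHg per pound of body weight

def predicted_sbp(age_years, weight_pounds):
    """Fitted systolic blood pressure (mmHg) from the regression equation."""
    return B_CONSTANT + B_AGE * age_years + B_WEIGHT * weight_pounds

sbp_50 = predicted_sbp(50, 150)   # hypothetical person
sbp_60 = predicted_sbp(60, 150)   # same weight, ten years older
print(f"Predicted SBP at 50: {sbp_50:.1f} mmHg; at 60: {sbp_60:.1f} mmHg")
```

Holding weight fixed, ten additional years of age raise the fitted value by 10 × 0.553 = 5.53 mmHg, which is what a positive b coefficient means in practice.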


Problem 2 – Hierarchical multiple regression
Is the following statement true, false, or an incorrect application of a
statistic? Assume that there is no problem with missing data, violation of
assumptions, or outliers. Use a level of significance of 0.05.

After controlling for the effects of the variables “Place of residence” and
“sex”, the addition of the variable “birth weight” reduces the error in
predicting “BMI” by 17.2%.
After controlling for place of residence and sex, the variable birth weight
makes an individual contribution to reducing the error in predicting BMI.
Infants who live in rural areas had lower BMI. Male infants had a higher BMI.
Infants who had a birth weight of less than 2500 g had lower BMI.

1. True
2. True with caution
3. False
4. Inappropriate application of a statistic


Problem 2 – Hierarchical multiple regression
• The variables listed first in the problem statement are the
independent variables (ivs) whose effect we want to control before we
test for the relationship: “place of residence” and “sex”.
• The variables that we add in after the control variables are the
independent variables that we think will have a statistical
relationship to the dependent variable: “birth weight”.
• The variable that is to be predicted or related to is the dependent
variable (dv): “BMI”.


Problem 2 – Hierarchical multiple regression
In order for a problem to be true, the relationship between the added
variables and the dependent variable must be statistically significant,
and the strength of the relationship after including the control
variables must be correctly stated.

• The relationship between each of the independent variables entered
after the control variables and the dependent variable must be
statistically significant and interpreted correctly.
• We are generally not interested in whether or not the control
variables have a statistically significant relationship to the
dependent variable.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT
VARIABLES

ANOVA(a)
Model            Sum of Squares   df     Mean Square   F         Sig.
1  Regression      761.620           2    380.810      114.895   .000(b)
   Residual      26084.494        7870      3.314
   Total         26846.114        7872
2  Regression     5374.117           3   1791.372      656.497   .000(c)
   Residual      21471.998        7869      2.729
   Total         26846.114        7872
a. Dependent Variable: BMI
b. Predictors: (Constant), Sex, place of residence
c. Predictors: (Constant), Sex, place of residence, Birth weight

The probability of the F statistic (656.497) for the overall
regression relationship for all independent variables is
<0.001, less than or equal to the level of significance of
0.05. We reject the null hypothesis that there is no
relationship between the set of all independent variables
and the dependent variable (R² = 0). We support the
research hypothesis that there is a statistically significant
relationship between the set of all independent variables
and the dependent variable.

OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT
VARIABLES

Model Summary
                                               Std. Error of   Change Statistics
Model   R         R Square   Adjusted R Square   the Estimate    R Square Change   F Change   df1   df2    Sig. F Change
1       .168(a)   .028       .028                1.82056         .028               114.895   2     7870   .000
2       .447(b)   .200       .200                1.65187         .172              1690.375   1     7869   .000
a. Predictors: (Constant), Sex, place of residence
b. Predictors: (Constant), Sex, place of residence, Birth weight

The R Square Change statistic for the increase in R² associated with
the added variable (birth weight) is 0.172. Using a proportional
reduction in error interpretation for R², information provided by the
added variable reduces our error in predicting BMI by 17.2%.


OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT
VARIABLES

The probability of the F statistic (1690.375) for the change in R²
associated with the addition of the predictor variables to the
regression analysis containing the control variables is <0.001, less
than or equal to the level of significance of 0.05. We reject the
null hypothesis that there is no improvement in the relationship
between the set of independent variables and the dependent
variable when the predictors are added (R² Change = 0).

We support the research hypothesis that there is a statistically
significant improvement in the relationship between the set of
independent variables and the dependent variable.

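The R Square Change and F Change statistics can be re-derived from the two ANOVA blocks for Problem 2. The sketch below uses the standard nested-model F formula, (increase in regression sum of squares / number of added predictors) divided by the residual mean square of the larger model; the function name `f_change` is illustrative.

```python
def f_change(ss_reg_reduced, ss_reg_full, ss_res_full, df_added, df_res_full):
    """F statistic for the improvement when predictors are added to a model."""
    ms_added = (ss_reg_full - ss_reg_reduced) / df_added
    ms_residual_full = ss_res_full / df_res_full
    return ms_added / ms_residual_full

# Sums of squares copied from the Problem 2 ANOVA table
ss_total = 26846.114
ss_reg_model1 = 761.620     # Sex, place of residence
ss_reg_model2 = 5374.117    # Sex, place of residence, Birth weight
ss_res_model2 = 21471.998

r2_change = ss_reg_model2 / ss_total - ss_reg_model1 / ss_total
f_chg = f_change(ss_reg_model1, ss_reg_model2, ss_res_model2, 1, 7869)
print(f"R Square Change = {r2_change:.3f}, F Change = {f_chg:.3f}")
```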
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT
VARIABLES
Coefficients(a)
                           Unstandardized          Standardized
                           Coefficients            Coefficients
Model                      B        Std. Error     Beta           t         Sig.
1  (Constant)              13.061   .037                          351.460   .000
   place of residence       -.583   .041           -.157          -14.140   .000
   Sex                       .211   .041            .057            5.148   .000
2  (Constant)              13.202   .034                          389.523   .000
   place of residence       -.323   .038           -.087           -8.512   .000
   Sex                       .120   .037            .032            3.214   .001
   Birth weight            -2.795   .068           -.421          -41.114   .000
a. Dependent Variable: BMI

If there is a relationship between each added individual independent
variable and the dependent variable, the probability of the statistical
test of the b coefficient (slope of the regression line) will be less than
or equal to the level of significance. The null hypothesis for this test
states that b is equal to zero, indicating a flat regression line and no
relationship.

If we reject the null hypothesis and find that there is a relationship
between the variables, the sign of the b coefficient indicates the
direction of the relationship for the data values. If b is greater than
zero, the relationship is positive or direct. If b is less than
zero, the relationship is negative or inverse. If the variable is
dichotomous or ordinal, the direction of the coding must be taken
into account to make a correct interpretation.



Problem 3 – Stepwise Regression
Reading assignment



Logistic Regression Model
Review of simple and multiple linear regression
• Simple LR: Model the mean of a numeric response Y as a function
of a single predictor X, i.e.
E(Y|X) = b0 + b1X
The key is that E(Y|X) is linear in the parameters b0 and b1 but not
necessarily in X.

b0 = Estimated Intercept
• The value of ŷ at x=0
• Interpretable only if x=0 is a value of particular interest

b1 = Estimated Slope
• Change in ŷ for every unit increase in x
• Estimated change in the mean of Y for a unit change in X
• Always interpretable


Logistic Regression Model
Review of simple and multiple linear regression
• Multiple LR: Model the mean of a numeric response Y as a
function of k predictors X_1, X_2, …, X_k, i.e.

𝐸(𝑌|𝑋) = β0 + β1 𝑋1 + β2 𝑋2 + β3 𝑋3 + ⋯ + β𝑘 𝑋𝑘

• The regression coefficients (bi) represent the estimated change in
the mean of the response Y associated with a unit change in Xi while
the other predictors are held constant.
• They measure the association between Y and Xi
adjusted for the other predictors in the model.


Logistic Regression Model
• Model the relationship between a set of predictors
X_1, X_2, …, X_k, e.g.,
– dichotomous (yes/no, smoker/nonsmoker,…)
– categorical (social class, race, ... )
– continuous (age, weight, gestational age, ...)

and
• a dichotomous response variable Y, e.g.,
– Success/Failure
– Remission/No Remission
– Survived/Died
– CHD/No CHD
– Low Birth Weight/Normal Birth Weight, etc…

Logistic Regression Model
Example: Coronary Heart Disease (CD) and Age
• In this study sampled individuals were examined for signs of CD
(present = 1 / absent = 0) and the potential relationship between
this outcome and their age (yrs.) was considered.

• Portion of the data set



Logistic Regression Model
Example: Coronary Heart Disease (CD) and Age
How can we analyze these data?

• The mean age of the individuals with some signs of coronary
heart disease is 51.28 years vs. 39.18 years for individuals without
signs (t = 5.95, p < .0001).

Logistic Regression Model
Example: Coronary Heart Disease (CD) and Age
Simple linear regression?

$$E(\mathrm{CD} \mid \mathrm{Age}) = -.54 + .02 \times \mathrm{Age}$$

e.g. For an individual 50 years of age
$$E(\mathrm{CD} \mid \mathrm{Age} = 50) = -.54 + .02 \times 50 = .46\ ??$$
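One way to see why the fitted straight line is questionable for a 0/1 outcome is to evaluate it at other ages. In the sketch below, the ages 20 and 80 are chosen only for illustration; the coefficients are the ones shown above.

```python
def linear_fit_cd(age, b0=-0.54, b1=0.02):
    """Fitted value of E(CD | Age) from the straight-line model above."""
    return b0 + b1 * age

at_50 = linear_fit_cd(50)   # the worked example from the slide
at_20 = linear_fit_cd(20)   # illustrative younger age
at_80 = linear_fit_cd(80)   # illustrative older age

for age, value in ((20, at_20), (50, at_50), (80, at_80)):
    inside = 0 <= value <= 1
    print(f"Age {age}: fitted value {value:.2f}, valid probability: {inside}")
```

Because E(CD | Age) is a probability, fitted values below 0 or above 1 cannot be interpreted, which is one motivation for the logistic model.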


Logistic Regression Model
Example: Coronary Heart Disease (CD) and Age
Logistic regression
• We can group individuals into age classes and look at the
percentage/proportion showing signs of coronary heart disease.

Age group     # in group   # Diseased   Proportion
1) 20 - 29    10            1           .100
2) 30 - 34    15            2           .133
3) 35 - 39    12            3           .250
4) 40 - 44    15            5           .333
5) 45 - 49    13            6           .462
6) 50 - 54     8            5           .625
7) 55 - 59    17           13           .765
8) 60 - 64    10            8           .800

Notice the “S-shape” to the estimated proportions vs. age.

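The proportions in the table are the number diseased divided by the number in each group. The sketch below recomputes them and also computes the log odds of each proportion, which is the scale on which logistic regression is linear; computing the log odds here is an illustration added for this example, not part of the original table.

```python
import math

# (age group, number in group, number diseased) copied from the table
groups = [
    ("20-29", 10, 1), ("30-34", 15, 2), ("35-39", 12, 3), ("40-44", 15, 5),
    ("45-49", 13, 6), ("50-54", 8, 5), ("55-59", 17, 13), ("60-64", 10, 8),
]

proportions = [diseased / total for _, total, diseased in groups]
# Log odds ln(p / (1 - p)) for each group's observed proportion
log_odds = [math.log(p / (1 - p)) for p in proportions]

for (label, _, _), p, lo in zip(groups, proportions, log_odds):
    print(f"{label}: proportion = {p:.3f}, log odds = {lo:.2f}")
```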
Logistic Regression Model
Logistic function

$$P(\text{"Success"} \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

[Figure: S-shaped logistic curve; x-axis: X; y-axis: P(“Success”|X),
from 0 to 1]

Logistic Regression Model
Logit transformation
• The logistic regression model is given by

$$P(Y \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$

• Which is equivalent to

$$\ln\left(\frac{P(Y \mid X)}{1 - P(Y \mid X)}\right) = \beta_0 + \beta_1 X$$

This is called the Logit Transformation

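The logistic function and the logit transformation are inverses of each other, which the short sketch below checks numerically. The coefficient values b0 = -5.0 and b1 = 0.1 are hypothetical and chosen only so that the probabilities span most of the 0 to 1 range.

```python
import math

def logistic(eta):
    """P = e^eta / (1 + e^eta), where eta = b0 + b1*X."""
    return math.exp(eta) / (1 + math.exp(eta))

def logit(p):
    """Log odds ln(p / (1 - p)); maps a probability back to b0 + b1*X."""
    return math.log(p / (1 - p))

b0, b1 = -5.0, 0.1   # hypothetical coefficients, not estimated from any data

linear_predictors = [b0 + b1 * x for x in (20, 50, 80)]
probabilities = [logistic(eta) for eta in linear_predictors]
recovered = [logit(p) for p in probabilities]
```

Whatever the value of the linear predictor, the resulting probability stays strictly between 0 and 1, unlike the straight-line fit.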
Logistic Regression Model
Logit transformation
• Consider a dichotomous predictor (X) which represents the
presence of risk (1 = present)

                  Risk Factor (X)
Disease (Y)       Present (X = 1)          Absent (X = 0)
Yes (Y = 1)       P(Y = 1 | X = 1)         P(Y = 1 | X = 0)
No (Y = 0)        1 − P(Y = 1 | X = 1)     1 − P(Y = 1 | X = 0)

Since $\frac{P}{1 - P} = e^{\beta_0 + \beta_1 X}$:

$$\text{Odds for Disease with Risk Present} = \frac{P(Y = 1 \mid X = 1)}{1 - P(Y = 1 \mid X = 1)} = e^{\beta_0 + \beta_1}$$

$$\text{Odds for Disease with Risk Absent} = \frac{P(Y = 1 \mid X = 0)}{1 - P(Y = 1 \mid X = 0)} = e^{\beta_0}$$

Therefore the odds ratio (OR) is

$$\text{OR} = \frac{\text{Odds for Disease with Risk Present}}{\text{Odds for Disease with Risk Absent}} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$$

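The identity OR = e^β1 can be checked numerically by computing the two odds from the model probabilities. The coefficients below (b0 = -2.0, b1 = 0.7) are hypothetical values used only for the check.

```python
import math

def prob_disease(x, b0, b1):
    """P(Y = 1 | X = x) under the logistic model."""
    eta = b0 + b1 * x
    return math.exp(eta) / (1 + math.exp(eta))

def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1 - p)

b0, b1 = -2.0, 0.7   # hypothetical coefficients

odds_present = odds(prob_disease(1, b0, b1))   # risk factor present (X = 1)
odds_absent = odds(prob_disease(0, b0, b1))    # risk factor absent (X = 0)
odds_ratio = odds_present / odds_absent
print(f"OR = {odds_ratio:.4f}, exp(b1) = {math.exp(b1):.4f}")
```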
Logistic Regression Model: Example
Malaria incidence and entomological indices data among non-resettled and
resettled communities in Jimma town

• Aim: Evaluating the impact of resettlement on malaria
incidence and entomological indices

• Data:
– Four study villages (2 at risk, 2 control)
– Two types of study were conducted: entomological and
parasitological


Logistic Regression Model: Example
Malaria incidence and entomological indices data among non-resettled and
resettled communities in Jimma town

Parasitological study
• A cohort of 604 individuals (302 from resettled and 302 from
non-resettled villages) residing in 202 households was followed from
September 1 to November 30, 2013.
• During monthly house-to-house visits, a blood sample was
collected from the study participants
• Outcome: P. falciparum malaria infection status
• Covariates: month, resettlement status


Logistic Regression Model: Example
Malaria incidence and entomological indices data among non-resettled and
resettled communities in Jimma town

Result

Month of parasitological survey * PFIND Crosstabulation
Month        PFIND = 0        PFIND = 1     Total
September    583 (97.2%)      17 (2.8%)     600 (100.0%)
October      586 (97.7%)      14 (2.3%)     600 (100.0%)
November     564 (97.6%)      14 (2.4%)     578 (100.0%)
Total        1733 (97.5%)     45 (2.5%)     1778 (100.0%)
Cells show count (row %).

Settlement status of the household members * PFIND Crosstabulation
Settlement status   PFIND = 0        PFIND = 1     Total
Resettled           855 (96.7%)      29 (3.3%)     884 (100.0%)
Indigenous          878 (98.2%)      16 (1.8%)     894 (100.0%)
Total               1733 (97.5%)     45 (2.5%)     1778 (100.0%)
Cells show count (row %).

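Before fitting a model, the settlement-status crosstabulation already gives a crude (unadjusted) odds ratio. The sketch below computes it from the four cell counts; treating each person-visit as an independent observation is an assumption made here for simplicity.

```python
# Counts from the settlement-status crosstabulation: (PFIND = 1, PFIND = 0)
resettled = (29, 855)
indigenous = (16, 878)

def risk(cases, non_cases):
    """Proportion positive among all observations in the group."""
    return cases / (cases + non_cases)

def odds(cases, non_cases):
    """Odds of being positive in the group."""
    return cases / non_cases

risk_resettled = risk(*resettled)
risk_indigenous = risk(*indigenous)
crude_odds_ratio = odds(*resettled) / odds(*indigenous)
print(f"Risk: {risk_resettled:.3f} vs {risk_indigenous:.3f}; crude OR = {crude_odds_ratio:.3f}")
```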
Logistic Regression Model: Example
Malaria incidence and entomological indices data among non-resettled and
resettled communities in Jimma town

Result

Dependent Variable Encoding
Original Value   Internal Value
.00              0
1.00             1

Categorical Variables Codings
                                                                         Parameter coding
Variable                                     Category      Frequency    (1)      (2)
Month of parasitological survey              September     600          1.000    .000
                                             October       600          .000     1.000
                                             November      578          .000     .000
Settlement status of the household members   Resettled     884          1.000
                                             Indigenous    894          .000


Logistic Regression Model: Example
Malaria incidence and entomological indices data among non-resettled and
resettled communities in Jimma town

Result

Variables in the Equation
                           B        S.E.   Wald      df   Sig.   Exp(B)
Step 1(a)  Month                           .346      2    .841
           Month(1)         .162    .366   .197      1    .657   1.176
           Month(2)        -.037    .383   .010      1    .922    .963
           Settlement(1)    .621    .315   3.889     1    .049   1.862
           Constant       -4.051    .338   143.540   1    .000    .017
a. Variable(s) entered on step 1: Month, Settlement.
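The columns of the "Variables in the Equation" table are linked: Wald = (B / S.E.)² and Exp(B) = e^B. The sketch below re-derives them from the rounded B and S.E. values, so results match the table only to within rounding. The approximate 95% interval (using z = 1.96) and the predicted probabilities are additions for illustration and do not appear in the original output.

```python
import math

# (B, S.E.) copied from the "Variables in the Equation" table
coefficients = {
    "Month(1)": (0.162, 0.366),
    "Month(2)": (-0.037, 0.383),
    "Settlement(1)": (0.621, 0.315),
    "Constant": (-4.051, 0.338),
}

def summarize(b, se, z=1.96):
    """Return Wald chi-square, odds ratio exp(B), and an approximate 95% CI for the odds ratio."""
    wald = (b / se) ** 2
    return wald, math.exp(b), (math.exp(b - z * se), math.exp(b + z * se))

def predicted_probability(month1, month2, resettled):
    """P(PFIND = 1) from the fitted logit, using the 0/1 parameter coding in the coding table."""
    b = {name: est for name, (est, _) in coefficients.items()}
    eta = (b["Constant"] + b["Month(1)"] * month1
           + b["Month(2)"] * month2 + b["Settlement(1)"] * resettled)
    return 1 / (1 + math.exp(-eta))

wald, odds_ratio, ci = summarize(*coefficients["Settlement(1)"])
p_resettled_nov = predicted_probability(0, 0, 1)    # November is the reference month
p_indigenous_nov = predicted_probability(0, 0, 0)
```

Under this model, the odds of P. falciparum infection among resettled household members are roughly 1.86 times those of indigenous members after adjusting for month, close to the crude ratio from the crosstabulation.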
