Chapter 1 Lecture 1

Biostatistics and Research
Methodology
12/21/2019 Biostatistics: Compiled by Sead Z. 1

Course description
• This course is an extension of Introduction to
Statistics. It gives an overview of
research (e.g. its meaning and the steps to be
followed) and applies statistical
knowledge to protocol development, the objective
of a given study versus hypothesis testing,
estimation, and assessing statistical significance.

Course Objectives:
On successful completion of this course students
will be able to:
• Know and understand the basic concepts of biostatistics
• Be familiar with research design
• Be familiar with the skills of collecting, summarizing,
analyzing, interpreting and presenting data
• Independently be able to choose appropriate statistical test
and conduct the test
• Make a statistical inference
• Be able to determine correct sample size
Course Content
1. Introduction
2. Study Designs
3. Data analysis and presentation
4. Analysis of continuous data
• t-test (different sample sizes, different variances)
• ANOVA
• Linear Regression Model
• Multiple Regression

Course …
5. Analysis of Categorical Data
• χ² test (chi-squared test)
• Fisher’s exact test
• Multiple Logistic Regression
– Analysis of Count Data
6. Non-Parametric Statistical tests
7. Sample Size Determination
 Field or Experimental trial
8. Research Protocol Development
9. Ethics in Research execution and communication
10. Projects
Further readings
• Thrusfield, M. (2007). Veterinary Epidemiology, 3rd edition. Blackwell Publishing Limited.
• Salman, M. D. (2003). Animal Disease Surveillance and Survey Systems: Methods and
Applications, 1st edition. Blackwell Publishing Limited.
• Bluman, A. G. Elementary Statistics, 2nd ed.
• Eshetu Wencheko. Introduction to Statistics.
• Freund, J. E. and Simon, G. A. Modern Elementary Statistics, 8th ed.
• Spiegel, M. R. Theory and Problems of Statistics. Schaum’s Outline Series.
• Downie, N. M. and Heath, R. W. (1983). Basic Statistical Methods, 5th edition. New York:
Harper and Row.
• Moore, D. S. and McCabe, G. P. (1989). Introduction to the Practice of Statistics. New
York: W. H. Freeman and Company.

Statistics
Has TWO MEANINGS

• Specific numbers
• Methods of analysis

Statistics
• Specific Number
Numerical measurement determined by a set of
data
Example: 65% of students use a Facebook account

Statistics
• Methods of Analysis
A collection of methods for planning
experiments, obtaining data, and then
organizing, summarizing, presenting, analyzing,
and drawing conclusions based on the data.
• Biostatistics
– The application of statistics to biological or
life science data

Definition
• Population: The complete collection of all elements
(scores, people, measurements, and so on) to be
studied. The collection is complete in the sense that it
includes all subjects to be studied.
– Population: Refers to all members of a defined group

Cont…
• Census: The collection of data from every element in a population
• Sample: A sub-collection of elements drawn from a population
Example: Patients in a hospital would constitute the entire
population for a study of infection control in that hospital.
However, for a study of infected patients in the nation’s hospitals, the
same group of patients would be but a tiny sample. The same
group can be a sample for one question about its characteristics
and a population for another question.
Cont…
• Parameter: A numerical measurement for describing some
characteristics of a population.
• statistic: A numerical measurement describing some
characteristics of a sample.
• Data: Facts (often numerical) collected for some specific
purpose
• Quantitative data: Numbers representing counts or
measurements
Example: Cholesterol level in the blood
• Qualitative (or categorical or attribute) data: Data that can be separated
into different categories that are distinguished by some non-numeric
characteristic.
Example: Gender (male/female)
Cont…
• Discrete data: Data that result when the number of possible values is
either a finite number or a ‘countable’ number (0, 1, 2, 3, …).
Example: The number of eggs that hens lay; for example, 3 eggs a day
• Continuous data: Numerical data that result from infinitely many possible
values that correspond to some continuous scale that covers a range of
values without gaps, interruptions, or jumps.
Example: The amount of milk that cows produce; for example, 2.341
gallons a day.

Data levels with reference to scales
• Nominal scales: This scale of measurement is
characterized by data that consist of names, labels,
or categories only. The data cannot be arranged in an
ordering scheme (such as low to high).
Example: Survey responses yes, no, undecided.

• Ordinal Scales (levels): Involves data that may

be arranged in some order, but differences
between data values either cannot be
determined or are meaningless.
Example: Course grades A, B, C, D, or F
Stages of malignant cancer

• Interval scales (levels): Like the ordinal level, with the
additional property that the difference between any
two data values is meaningful. However, there is no
natural zero starting point (where none of the quantity
is present).
Example: Years 1000, 2000, 1776, …
Temperature in degrees Celsius
Intelligence quotient (IQ) scores
• Ratio Scales (levels) of measurement: The interval level

modified to include the natural zero starting point (Where
zero indicates that none of the quantity is present). For
values at this level, differences and ratios are meaningful.
Example: Prices
Blood pressure
Weight….

In summary
Nominal – Categories only
Ordinal – Categories with some order
Interval – Differences but no natural starting point
Ratio – Differences and a natural starting point.

Variables
• A variable is just a term for an observation or
reading giving information on the study
question to be answered.
• Blood pressure is a variable giving
information on hypertension.
• Blood Uric acid level is a variable giving
information on gout.

Variables
• Independent Variable: Is a variable that, for the
purposes of the study question to be answered, occurs
independently of the effects being studied.
– A variable thought to be the cause of some effect.
– This term is usually used in experimental research to
denote a variable that the experimenter has manipulated.

Variables
• Dependent Variable: Is a variable that

depends on, or more exactly is influenced by,
the independent variable.
– A variable thought to be affected by
changes in an independent variable. You
can think of this variable as an outcome.

Variables Cont…
• In a study on gout, suppose we ask if blood uric acid (level) is
a factor in causing pain.
– We record blood uric acid level as a measurable variable that
occurs in the patient.
– Then we record pain as reported by the patient.
– We believe blood uric acid level is predictive of pain.
– In this relationship, the blood uric acid is the independent

variable and pain is the dependent variable.

Data Summarization
• Approaches
– Tables/graphs
– Numerical summary measures

Data Summarization
• Numerical summary measures
– Measures of central location
– Measures of dispersion
– Measures of relative standing/detectors of outliers
– Measures of shape
– Measures of association

Data Summarization
• Measures Of Central Location
– Arithmetic Mean, Median And Mode
• Arithmetic Mean Is Unique, Takes Into Account All Data
Points But Sensitive To Extreme Values
• Median Is Unique, And Not Affected By Extreme Values
• Mode Might Not Exist
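
The contrast between the three measures can be seen on a small made-up data set. The sketch below uses Python's standard `statistics` module; the egg counts are hypothetical values chosen only so that one extreme observation can be added.

```python
import statistics

# Hypothetical daily egg counts for seven hens (illustrative values only).
eggs = [2, 3, 3, 4, 4, 4, 5]
eggs_with_outlier = eggs + [40]   # the same data plus one extreme value

for label, data in [("original", eggs), ("with outlier", eggs_with_outlier)]:
    mean = statistics.mean(data)
    median = statistics.median(data)
    modes = statistics.multimode(data)   # list of all most-frequent values
    print(f"{label}: mean={mean:.2f}, median={median}, mode(s)={modes}")

# When every value occurs exactly once, multimode returns all of them,
# which is one way to see that a single mode "might not exist".
print(statistics.multimode([1, 2, 3]))
```

The mean moves noticeably when the extreme value is added, while the median and mode stay where they were.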

Data Summarization
• Measures of Dispersion
–Measure spread of observations of a
distribution
• Variance, standard deviation and
• coefficient of variation

Data Summarization
• Measures of relative standing
–Tell the position of a particular
observation relative to others
• Standard score (Z-score)
• Percentiles
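
As a rough illustration of the dispersion and relative-standing measures listed above, the following sketch computes them with Python's standard `statistics` module. The milk yields are hypothetical, and the sample (n − 1) versions of variance and standard deviation are assumed.

```python
import statistics

# Hypothetical milk yields (gallons/day) for eight cows; illustrative only.
milk = [2.1, 2.3, 2.3, 2.5, 2.6, 2.8, 3.0, 3.2]

mean = statistics.mean(milk)
variance = statistics.variance(milk)   # sample variance (divides by n - 1)
sd = statistics.stdev(milk)            # sample standard deviation
cv = sd / mean * 100                   # coefficient of variation, in percent

# Z-score: how many standard deviations an observation lies from the mean.
x = 3.2
z = (x - mean) / sd

# Quartiles are the 25th, 50th and 75th percentiles.
p25, p50, p75 = statistics.quantiles(milk, n=4)

print(f"variance={variance:.3f}, sd={sd:.3f}, CV={cv:.1f}%")
print(f"z-score of {x} = {z:.2f}")
print(f"25th, 50th, 75th percentiles: {p25:.3f}, {p50:.3f}, {p75:.3f}")
```

Software packages use slightly different interpolation rules for percentiles, so the quartile values may differ a little from one package to another.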

Data Summarization
• Measures of association
– Odds ratio, risk ratio, risk difference
– Correlation coefficient, regression

coefficients

Measures of Association
• A measure of association quantifies the
strength(magnitude) of the statistical association
between the exposure and the health problem of
interest.
• Estimate size/strength of association between exposure
and outcome
• Measures of association are sometimes called measures
of effect

Why Measures of Association?
• Cross-tabs and scatter plots are flexible tools for
exploring relationships between variables
• Chi-squared test evaluates statistical significance
• Neither method provides a summary measure of
the relationship:
– What is the direction?
– How strong is the relationship?
• Measures of association seek to provide this
information

Measures of association……
• In cohort studies, most commonly used is the
relative risk.
• In case-control studies, the odds ratio is the most
commonly used
• In cross-sectional studies, either a prevalence
ratio or a prevalence odds ratio is used

Risk Ratio
• Also called relative risk; it compares the risk of a
health event among one group with the risk among
another group.
• Compares incidence among the exposed with incidence
among the non-exposed
– Compare exposed and non-exposed or diseased and
non-diseased

Data layout to show the disease-exposure
relationship
• Data presentation…
                 Disease
Exposure     Yes      No       Total
Yes          a        b        a+b
No           c        d        c+d
Total        a+c      b+d      a+b+c+d

Relative Risk ….
$$RR = \frac{\text{Incidence rate in exposed}}{\text{Incidence rate in non-exposed}} = \frac{a/(a+b)}{c/(c+d)}$$
Cut-off points for RR: If

• RR=1, there is no association between the exposure and disease
• RR>1, there is positive association between exposure & disease
• RR<1, there is Negative association between exposure & disease

Relative Risk…..
• Example : In an outbreak of varicella (chickenpox) in Oregon in 2002,
varicella was diagnosed in 18 of 152 vaccinated children compared with 3
of 7 unvaccinated children. Calculate the risk ratio.
                Varicella: Yes   Varicella: No   Total
Vaccinated      a = 18           b = 134         152
Unvaccinated    c = 3            d = 4           7
Total           21               138             159
Solution:
• Risk of varicella among vaccinated children = 18 / 152 = 0.118
• Risk of varicella among unvaccinated children = 3 / 7 = 0.429
• Risk ratio = 0.118 / 0.429 = 0.28
• The risk ratio is less than 1.0, indicating a decreased risk or protective
effect for the exposed (vaccinated) children.
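
The varicella calculation can be reproduced with a few lines of Python. This is a minimal sketch; the function name `risk_ratio` is a hypothetical helper, and the cell labels a, b, c, d follow the 2×2 layout used in these slides.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio from a 2x2 table.

    a, b: exposed with / without the outcome
    c, d: unexposed with / without the outcome
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    return risk_exposed / risk_unexposed

# Varicella outbreak: vaccinated children are treated as the exposed group.
rr = risk_ratio(a=18, b=134, c=3, d=4)
print(f"Risk ratio = {rr:.2f}")   # prints "Risk ratio = 0.28"
```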

Odds Ratio
• Quantifies the relationship between an exposure with two
categories and a health outcome
• Used to examine the strength of association between a risk factor and an
outcome in a case-control study
• $$OR = \frac{a/c}{b/d} = \frac{ad}{bc}$$
Cut-off points for OR: If
- OR=1, odds of exposure among cases and controls is the same
- OR>1, odds of exposure among cases is higher than among
controls
- OR<1, odds of exposure among cases is lower than among
controls

Odds Ratio…
Example: Exposure and disease in a hypothetical population of
10,000 persons, given in the table below
              Disease: Yes   Disease: No   Total
Exposed       a = 100        b = 1,900     2,000
Unexposed     c = 80         d = 7,920     8,000
Total         180            9,820         10,000
Solution: Odds ratio = ad/bc = (100 × 7,920) / (1,900 × 80) = 5.2

Since OR > 1, the odds of exposure among cases are higher than
among controls; this indicates a positive association
between risk factor and outcome (exposure and disease).
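
The same arithmetic can be checked in Python. This is only a sketch; `odds_ratio` is a hypothetical helper name, and the cell counts are those of the hypothetical 10,000-person table above.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table, using the cross-product ad / bc."""
    return (a * d) / (b * c)

# Cell counts from the hypothetical population of 10,000 persons.
or_value = odds_ratio(a=100, b=1900, c=80, d=7920)
print(f"Odds ratio = {or_value:.1f}")   # prints "Odds ratio = 5.2"
```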

Bias and confounding
• Bias is any trend in the collection, analysis,
interpretation, publication or review of data that can
lead to conclusions that are systematically different from
the truth.
• Bias can occur during any stage of a study:
– during the literature review of the study question
– during the selection of the study sample
– during the measurement of exposure and outcome
– during the analysis of data
– during the interpretation of the analysis
– during the publication of the results
Bias…
• Most biases, however, can be categorised into one of three general
types:
– Selection bias
– Information bias
– Confounding bias
• Selection bias: Occurs during the execution of a study when some
subjects are included and not others
– Admission bias: occurs when case-control and cross-sectional
studies are done exclusively in hospital settings
– Prevalence/incidence bias: happens when asymptomatic
cases as well as fatal short disease episodes are missed
– Volunteer bias: occurs when those who volunteer to
participate in a study differ systematically from those who do not
Information (Observation) Bias
• Information bias occurs in the data collection stage of
studies
– Interviewer bias: systematic differences in soliciting, recording, or interpreting
information from study subjects
– Questionnaire bias: difference in accuracy
between compared groups
– Recall bias: occurs when people who have had adverse
health outcomes recall past exposures differently from those who have not

Confounding Bias
• Confounding bias: occurs when a factor (confounder) associated
with the exposure of interest is also associated with
development of the disease or outcome of interest
independently of exposure.
• A confounder must be predictive of disease occurrence

independent of its association with the exposure of interest.
• The confounding variable can affect the association between

exposure and disease positively or negatively.

Basic Concepts
Control groups and placebos
A frequent mechanism to pinpoint the effect of a
treatment and to reduce bias is to provide a control
group having all the characteristics of the experimental
group except the treatment under study.
For example: give a paracetamol tablet (drug group) and a
lactose tablet (placebo group), then compare their
fever-reducing effects.

Exploratory Methods
• Exploring data
– Data checking
– Understand distribution of variables
– Understand nature and strength of
relationships between variables

EDA and Visualization
• Exploratory Data Analysis (EDA) and Visualization are
very important steps in any analysis task.
• get to know your data!

– distributions (symmetric, normal, skewed)
– data quality problems
– outliers
– correlations and inter-relationships
– subsets of interest
– suggest functional relationships
Definition of EDA
• It is an approach for data analysis that employs a
variety of techniques (mostly graphical)
Main reasons we use EDA:
• Detection of mistakes
• checking of assumptions
• Preliminary selection of appropriate models
• Determining relationships among the explanatory
variables
• Assessing the direction and rough size of
relationships between explanatory and outcome
variables.
• maximize insight into a data set;
• uncover underlying structure;
• extract important variables;
• detect outliers and anomalies;
• test underlying assumptions;
• determine optimal factor settings.
• find odd values
Typical data format
• Spreadsheet (stores data in rows and columns)
• Database (e.g. tabular form)
Classification of EDA
Graphical
• Scatter plot
• histogram,
• box plot,
• residual plot,
• Probability plot
Such graphical tools are the shortest path to
gaining insight into a data set in terms of
• testing assumptions
• model selection
• model validation
• estimator selection
• relationship identification
• factor effect determination
• outlier detection
• identify useful raw data & transforms (e.g. log(x))
Non-graphical (summary statistics)
• Averages(mean, median, etc)
• Quantiles
Exploratory data analysis:
One variable
• Graphical displays
– Qualitative/categorical data: bar chart, pie chart, etc.
– Quantitative data: histogram, stem-leaf, boxplot, timeplot etc.
• Summary statistics
– Qualitative/categorical: contingency tables
– Quantitative: mean, median, standard deviation, range etc.
• Probability models
– Qualitative: Binomial distribution(others we won’t cover in this
class)
– Quantitative: Normal curve (others we won’t cover in this class)
Summary of categorical variables
• Graphically
– Bar graphs, pie charts
• A bar graph is nearly always preferable to a pie chart. It is
easier to compare bar heights than slices of a
pie
• Numerically: tables with total counts or percents
Summary table
• We summarize categorical data using a table. Note that
proportions or percentages are often called relative frequencies.

Highest Degree Obtained (class)   Number of CEOs (frequency)   Proportion (relative frequency)
None                              1                            0.04
Bachelors                         7                            0.28
Masters                           11                           0.44
Doctorate / Law                   6                            0.24
Totals                            25                           1.00
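
The relative frequencies in the summary table are simply each count divided by the total. A minimal Python sketch using the CEO counts from the table (the variable names are illustrative):

```python
# Frequencies taken from the summary table above.
counts = {"None": 1, "Bachelors": 7, "Masters": 11, "Doctorate / Law": 6}

total = sum(counts.values())
relative = {degree: n / total for degree, n in counts.items()}

print(f"{'Highest degree':<18}{'Count':>6}{'Proportion':>12}")
for degree, n in counts.items():
    print(f"{degree:<18}{n:>6}{relative[degree]:>12.2f}")
print(f"{'Totals':<18}{total:>6}{sum(relative.values()):>12.2f}")
```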
Bar graph
• The bar graph quickly
compares the degrees of the
four groups
• The heights of the four bars
show the counts for the four
degree categories
Pie chart
• A pie chart helps us see
what part of the whole
each group forms
• To make a pie chart, you
must include all the
categories that make up a
whole
Quantitative variables
• Graphical summary
– Histogram
– Stemplots
– Time plots
– more
• Numerical summary
– Mean
– Median
– Quartiles
– Range
– Standard deviation
– more
Single Variable Visualization
• Histogram:
– Shows center, variability, skewness, modality,
outliers, or strange patterns.
– Bin width and position matter
– Beware of real zeros
Issues with Histograms
• For small data sets, histograms can be misleading.
– Small changes in the data, bins, or anchor can deceive
• For large data sets, histograms can be quite effective at

illustrating general properties of the distribution.
• Histograms effectively only work with 1 variable at a

time
– But ‘small multiples’ can be effective
Boxplots
• Shows a lot of information about a
variable in one plot
– Median
– IQR
– Outliers
– Range
– Skewness
• Negatives
– Overplotting
– Hard to tell distributional shape
– no standard implementation in
software (many options for
whiskers, outliers)
Exploratory data analysis: two variables
• There are three combinations of variables we must consider.
We do so in the following order
– 1 qualitative/categorical, 1 quantitative variables
• Side-by-side box plots, counts, etc.
– 2 quantitative variables
• Scatter plots, correlations, regressions
– 2 qualitative/categorical variables
• Contingency tables (we will cover these later in the
semester)
Side-by-side box plots
• Side-by-side box plots are graphical summaries of data when
one variable is categorical and the other quantitative
• These plots can be used to compare the distributions
associated with the quantitative variable across the levels
of the categorical variable
Box plots
• A box plot is a graph of five
numbers (often called the five
number summary)
– Minimum
– 1st quartile
– Median
– 3rd quartile
– Maximum
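
A five-number summary can be sketched with Python's standard `statistics` module. The blood pressure readings below are hypothetical, and the 1.5 × IQR rule for flagging outliers is one common convention assumed here, not the only one that software uses.

```python
import statistics

# Hypothetical systolic blood pressure readings (mmHg); illustrative only.
sbp = [110, 112, 118, 120, 121, 125, 130, 134, 140, 162]

# Quartiles; the "inclusive" method treats the data as the whole population
# of interest. Other packages may interpolate slightly differently.
q1, median, q3 = statistics.quantiles(sbp, n=4, method="inclusive")
five_number = (min(sbp), q1, median, q3, max(sbp))

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
iqr = q3 - q1
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in sbp if x < lower_fence or x > upper_fence]

print("min, Q1, median, Q3, max:", five_number)
print("IQR:", iqr, "outliers:", outliers)
```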
Two Continuous Variables
• For two numeric variables, the scatterplot
is the obvious choice
2D Scatterplots
• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
– x,y related?
  • linear
  • quadratic
  • other
– variance(y) depend on x?
– outliers present?
Scatter Plot: No apparent relationship

Scatter Plot: Linear relationship

Scatter Plot: Quadratic relationship

Scatter plot: Homoscedastic
Why is this important in classical statistical modelling?

Scatter plot: Heteroscedastic
variation in Y differs depending on the value of X
e.g., Y = annual tax paid, X = income
Two variables - continuous
• Scatterplots
– But can be bad with lots of data
• What to do for large data sets
– Contour plots
Describing scatter plots
• Form
– Linear, quadratic, exponential
• Direction
– Positive association
• An increase in one variable is accompanied by an increase in the other
– Negative association
• A decrease in one variable is accompanied by an increase in the other
• Strength
– How closely the points follow a clear form
Describing scatter plots
• Form:
– Linear
• Direction
– Positive
• Strength
– Strong
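
Direction and strength of a linear pattern are often summarized by the Pearson correlation coefficient. The sketch below computes it from first principles; the age and blood pressure values are hypothetical, and `pearson_r` is an illustrative helper name. Note that r describes only linear form, so a strong quadratic pattern can still give r near zero.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical ages (years) and systolic blood pressures (mmHg).
age = [25, 32, 41, 48, 55, 63, 70]
sbp = [112, 118, 121, 130, 134, 141, 150]

r = pearson_r(age, sbp)
direction = "positive" if r > 0 else "negative" if r < 0 else "none"
print(f"r = {r:.3f} ({direction} direction)")
```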
Two Variables - one categorical
• Side by side boxplots are very effective in showing differences in a
quantitative variable across factor levels
– tips data
• do men or women tip better
– orchard sprays
• measuring potency of various orchard sprays in repelling honeybees
Barcharts and Spineplots
• Stacked barcharts can be used to compare continuous values across
two or more categorical ones (in the figure, orange = M, blue = F).
• Spineplots show proportions well, but can be hard to interpret.
Study Types

Case-Control Study:

A case-control study is a study in which an
experimental group of patients is chosen for being
characterized by some outcome factor, such as
having acquired a disease, and a control group
lacking this factor is matched patient for patient.

Cohort Study
A cohort study starts by choosing groups that have
already been assigned to study categories, such as
diseases or treatments, and follows these groups
forward in time to assess the outcomes.

Randomized Controlled Trial
The soundest type of study is the randomized controlled
trial (RCT), often called a clinical trial. An RCT is a
true experiment in which patients are assigned
randomly to a study category, such as clinical treatment,
and are then followed forward in time (making it a
prospective study) and the outcome is assessed.

Paired and Crossover Designs
Some studies permit a design in which the patients
serve as their own controls, as in a “before-and-after”
study or a comparison of two treatments in
which the patient receives both in sequence.

STEPS THAT WILL AID IN PLANNING
1. Start with objectives. Specify, clearly,

unequivocally, a question to be answered about an
explicitly defined population
2. Develop the background and relevance
3. Plan your materials. From where will you obtain
your equipment?
4. Plan your methods and data. Identify at least 1
measurable variable capable of answering your
question. Define the specific data that will satisfy
your objectives and verify that your methods will
provide these data. Develop clearly specified null
and alternative hypotheses.
5. Plan data recording. Develop a raw data entry sheet
and a spreadsheet into which the raw data will be transferred,
so as to facilitate analysis by computer software.
6. Define the subject population, verify that your
sampling procedures will sample representatively.
7. Ensure that your sample size will satisfy your
objectives
8. Anticipate what statistical analysis will yield results
that will satisfy your objectives
9. Plan tests for sampling bias
10. Plan the bridge from results to conclusion
11. Anticipate the form in which your conclusion will be
expressed
12. Now you can draft an abstract.

• Assignment 1: Write a clear research title of your area
• Assignment 2: Prepare a questionnaire/tool for

collecting data from the defined population.

Data Presentation and processing
• Before making analysis, data should be
– edited,
– coded,
– entered into computers, and
– Cleaned
for its consistency.

• Editing: Data should be edited for its
consistency before analysis.
• Editing involves checking and making corrections

upon all incomplete, erroneous and contradictory
responses recorded in the questionnaires.
• Editing includes discarding questionnaires that
cannot be corrected by the above activities.

• Coding: is the process of transforming the recorded
responses into codes.
• Once the data have been edited and coded, the next step
will be entering the data into computers so that they can
be processed and outputs will be produced.
• Then data cleaning is carried out using edit-specification
programs; this is the final consistency check
before the results are analyzed.

• Analysis: In this step the processed data are
investigated using different statistical techniques.
• Among the different methods of analyzing data are
simple descriptive analyses
such as measures of central tendency,
measures of variation and so on.

• Interpretation of results: Once the final
outputs have been produced, appropriate
interpretation of results will be given by
analyzing the results obtained.
• This stage requires due attention because this is the
final result used by decision makers; a wrong
interpretation would mislead them into
a wrong decision.

Numerical summaries
Measures of center:
• Mean
• Mode
• Median
Measures of dispersion
• Variance and standard deviation
• Standard score and coefficient of variation

One sample inference
• Estimation of single population parameter
• Hypothesis testing of single population parameter

Regression analysis

Regression Model
• Even if we are interested only in the relationship between a

response and an explanatory variable, we may still have to control
for at least one confounder that can influence the relationship
under investigation.
• In this chapter we will use models as the basis of such analyses.
• The goal is to find the best fitting and most parsimonious, yet
biologically reasonable model to describe the relationship
between an outcome (dependent or response variable) and a set
of independent (predictor or explanatory) variables.

Regression Model
• A good-fitting model has several benefits.

 inferences for model parameters help us evaluate which
explanatory variables affect the response, while controlling
effects of possible confounding variables.
 estimation of parameters is more informative than mere
significance testing (sizes of estimated model parameters
determine the strength and importance of the effects).
 model based predicted values can be obtained.
 models can handle more complicated situations than those in
previous chapters (e.g. analyzing simultaneously the effects of
several explanatory variables).
Type of Regression Models
• Depending on the type of the response variable we classify

regression models as.
 normal: linear regression,
 binary: probit and logit (logistic) regression
 counts: Poisson regression
 categorical data: log-linear modelling
 time-to-event: survival regression

Linear Regression Model
• The purpose of linear regression is to analyze the relationship
between metric or dichotomous independent variables and a
metric dependent variable.
• It is used to answer questions such as:
– Do changes in age result in changes in systolic blood pressure (SBP), and if so, do the
results depend on other characteristics such as sex,
cholesterol level and so on?

linear Regression Model
• The goal of regression analysis is to understand how the values of
Y (outcome variable) change as X (the predictor variable) is
varied over its range of possible values.
• An essential first step in regression analysis is to draw appropriate
graphs of the data.
• A fundamental graphical tool for looking at regression data is
the scatter plot.

Scatter plots of y versus x
[Figure: three scatter plots of y against x]
• No relationship between x and y: spread is even in all directions.
• Linear relationship: a line indicates the main direction
of the spread of points.
• Non-linear relationship between x and y: a curve best
describes the relationship.
The Model
• Suppose we have n subjects and measure the following on each
subject
– 𝑌 = (𝑦1 , 𝑦2 ,…, 𝑦𝑛 ) be the response
– 𝑋1 =(𝑋11 , 𝑋12 ,…, 𝑋1𝑛 ) be independent variable 1
• Aim: To study the relationship between 𝑌 and 𝑋1

The Model
• Model-1: Simple linear regression:
𝑌 = β0 + β1 𝑋1 + ε
• Model-2: Multiple linear regression:
𝑌 = β0 + β1 𝑋1 + β2 𝑋2 + β3 𝑋3 + ε
where ε is the error (residual) term, the part of Y that cannot be accounted for
by the model; it is assumed to be normally distributed with mean 0 and variance
σ².

Assumptions of Linear Regression Model
1. Linear relationship between the outcome variable (y) and

explanatory variable (X)
2. The outcome variable (y) should be Normally distributed for
each value of explanatory variable (x)
3. Standard deviation of y should be approximately the same for
each value of x
4. All the observations should be Independent

Assumptions of linear regression
[Figure: three panels illustrating the assumptions]
Assumption 1: Linear relationship
Assumption 2: Y normally distributed at each value of x
Assumption 3: Same variance at each value of x

Testing Assumptions:
Assumption 1: linear relationship
Plot y against x to check for linearity

Assumption 2: Normality
Histogram of residuals
[Figure: Normal P-P plot of standardized residuals, dependent variable BMI; x-axis: Observed Cum Prob, y-axis: Expected Cum Prob]

Assumption 3 can be stated as follows:
• The variance of y is the same for any x; that is, the spread of
values for y at each level of x remains approximately constant:
$$\sigma^2_{y|x_1} = \sigma^2_{y|x_2} = \sigma^2_{y|x_3} = \sigma^2$$
[Figure: spread of y|x shown at several values of x]
Assumption 3: Spread of y values constant over range of x values
(plot of residuals against X)

Types of multiple linear regression
• There are three types of multiple regression, each of which is
designed to answer a different question:
– Standard multiple regression is used to evaluate the

relationships between a set of independent variables and a
dependent variable.
– Hierarchical, or sequential, regression is used to examine the
relationships between a set of independent variables and a
dependent variable, after controlling for the effects of some
other independent variables on the dependent variable.
– Stepwise, or statistical, regression is used to identify the subset
of independent variables that has the strongest relationship to a
dependent variable.

Standard multiple regression
• In standard multiple regression, all of the independent variables are

entered into the regression equation at the same time
• Multiple R and R² measure the strength of the relationship between

the set of independent variables and the dependent variable.
• An F test is used to determine if the relationship can be generalized

to the population represented by the sample.
• A t-test is used to evaluate the individual relationship between each

independent variable and the dependent variable.
Hierarchical multiple regression
• In hierarchical multiple regression, the independent variables are

entered in two stages.
• In the first stage, the independent variables that we want to control
for are entered into the regression.
• In the second stage, the independent variables whose relationship
we want to examine after the controls are entered.
• A statistical test of the change in R² from the first stage is used to
evaluate the importance of the variables entered in the second
stage.

Stepwise regression
• Stepwise regression is designed to find the most parsimonious set
of predictors that are most effective in predicting the dependent
variable.
• Variables are added to the regression equation one at a time, using
the statistical criterion of maximizing the R² of the included
variables.
• When none of the possible additions can make a statistically
significant improvement in R², the analysis stops.

Problem 1 - standard multiple regression
Is the following statement true, false, or an incorrect application of a
statistic? Assume that there are no problems with missing data,
violation of assumptions, or outliers. Use a level of significance of
0.05.
• “Body weight in pound” and “age of the person in years” have a

strong relationship to the variable “systolic blood pressure”
• Survey respondents who had less Body weight had a lower systolic
blood pressure. Survey respondents who were younger had a lower
systolic blood pressure.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
When a problem states that there is a
relationship between some independent
variables and a dependent variable, we
do standard multiple regression.

The variables listed first in the
problem statement are the
independent variables (IVs): “Body
weight in pound” and “age of the
person in years”.

The variable that they are
related to is the dependent variable
(DV): “systolic blood pressure”.

In order for a problem to be true, we
will have to find:
• a statistically significant relationship
between the IVs and the DV
• a relationship of the correct strength

The relationship of each of
the independent variables
to the dependent variable
must be statistically
significant and interpreted
correctly.
The probability of the F statistic (27.924) for the overall
regression relationship is <0.001, less than or equal to the level
of significance of 0.05. We reject the null hypothesis that there is
no relationship between the set of independent variables and the
dependent variable (R² = 0). We support the research hypothesis that
there is a statistically significant relationship between the set of
independent variables and the dependent variable.
ANOVAb
Model          Sum of Squares   df   Mean Square   F        Sig.
1 Regression   10287.769         2   5143.884      27.924   .000a
  Residual     14921.219        81    184.213
  Total        25208.988        83
a. Predictors: (Constant), Body Weight in pound, Age of the person in years
b. Dependent Variable: Systolic Blood Pressure in mmHg
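The entries of the ANOVA table are linked by simple arithmetic, which can be checked by hand or with a few lines of code. The sketch below only reuses the sums of squares and degrees of freedom printed in the table; the variable names are illustrative and are not part of the SPSS output.

```python
# Values copied from the ANOVA table for Problem 1.
ss_regression = 10287.769
ss_residual = 14921.219
df_regression = 2    # number of predictors (age, body weight)
df_residual = 81     # total df (83) minus regression df (2)

# Mean square = sum of squares / degrees of freedom.
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# F is the ratio of the two mean squares.
f_statistic = ms_regression / ms_residual

# R squared is the share of the total sum of squares explained by the model.
ss_total = ss_regression + ss_residual
r_squared = ss_regression / ss_total

print(f"MS regression = {ms_regression:.3f}")
print(f"MS residual   = {ms_residual:.3f}")
print(f"F             = {f_statistic:.3f}")
print(f"R squared     = {r_squared:.3f}")
```

The computed values should agree with the table to rounding (F about 27.92, R² about 0.408). The p-value itself needs the F distribution and is not computed here.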

The Multiple R for the relationship between the set of
independent variables and the dependent variable is 0.639,
which would be characterized as strong using the rule of
thumb that a correlation less than or equal to 0.20 is
characterized as very weak; greater than 0.20 and less than
or equal to 0.40 is weak; greater than 0.40 and less than or
equal to 0.60 is moderate; greater than 0.60 and less than or
equal to 0.80 is strong; and greater than 0.80 is very strong.
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .639a   .408       .393                13.572
a. Predictors: (Constant), Body Weight in pound, Age of the person in years
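The rule of thumb for describing the strength of a correlation can be written as a small lookup. This is a minimal sketch of the cut-offs quoted in the lecture; the function name is made up for illustration, and taking the absolute value is an added assumption so that negative correlations are graded by size.

```python
def correlation_strength(r):
    """Label a correlation using the lecture's rule-of-thumb cut-offs."""
    size = abs(r)  # assumption: grade negative correlations by magnitude
    if size <= 0.20:
        return "very weak"
    if size <= 0.40:
        return "weak"
    if size <= 0.60:
        return "moderate"
    if size <= 0.80:
        return "strong"
    return "very strong"


# Multiple R from the Model Summary table.
print(correlation_strength(0.639))  # strong
```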

For the independent variable Age, the probability of the
t statistic (7.114) for the b coefficient is <0.001, which
is less than or equal to the level of significance of 0.05.
We reject the null hypothesis that the slope associated
with age is equal to zero (b = 0) and conclude that
there is a statistically significant relationship between
age of the respondent and systolic blood pressure.
Coefficientsa
Model                          B        Std. Error   Beta   t        Sig.
1 (Constant)                   83.019   7.393               11.229   .000
  Age of the person in years     .553    .078        .608    7.114   .000
  Body Weight in pound           .085    .038        .191    2.239   .028
B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: Systolic Blood Pressure in mmHg

The b coefficient associated with Age (0.553) is positive,
indicating a direct relationship in which higher numeric
values for age are associated with higher numeric values
for systolic blood pressure.

For the independent variable body weight, the probability of the
t statistic (2.239) for the b coefficient is 0.028, which is less
than or equal to the level of significance of 0.05. We reject the
null hypothesis that the slope associated with body weight is equal
to zero (b = 0) and conclude that there is a statistically
significant relationship between body weight and systolic blood
pressure.
The b coefficient associated with body weight (0.085) is positive,
indicating a direct relationship in which higher numeric values for
body weight are associated with higher values of systolic blood
pressure.
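The unstandardized coefficients define a fitted equation, and evaluating it shows what "per unit change, holding the other predictor constant" means. The sketch below uses the B values from the Coefficients table (83.019, 0.553, 0.085); the function name and the example person (50 years, 150 pounds) are hypothetical and chosen only for illustration.

```python
# Unstandardized coefficients (column B) from the Coefficients table.
INTERCEPT = 83.019
B_AGE = 0.553      # mmHg per additional year of age
B_WEIGHT = 0.085   # mmHg per additional pound of body weight


def predicted_sbp(age_years, weight_pounds):
    """Predicted systolic blood pressure (mmHg) from the fitted equation."""
    return INTERCEPT + B_AGE * age_years + B_WEIGHT * weight_pounds


# Hypothetical person, used only to illustrate the arithmetic.
base = predicted_sbp(50, 150)
one_year_older = predicted_sbp(51, 150)
one_pound_heavier = predicted_sbp(50, 151)

print(f"Predicted SBP: {base:.3f} mmHg")
print(f"Effect of one more year:  {one_year_older - base:.3f} mmHg")
print(f"Effect of one more pound: {one_pound_heavier - base:.3f} mmHg")
```

Each difference equals the corresponding b coefficient, which is why a positive b is read as a direct relationship.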

Problem 2 – Hierarchical multiple regression
Is the following statement true, false, or an incorrect application of a
statistic? Assume that there is no problem with missing data, violation of
assumptions, or outliers. Use a level of significance of 0.05.
• After controlling for the effects of the variables “Place of residence” and
“sex”, the addition of the variable “birth weight” reduces the error in
predicting “BMI” by 17.2%.
• After controlling for place of residence and sex, the variable birth weight
makes an individual contribution to reducing the error in predicting BMI.
Infants who live in rural areas had a lower BMI. Male infants had a higher BMI.
Infants who had a birth weight of less than 2500 g had a lower BMI.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic

• The variables listed first in the problem statement are the
independent variables (ivs) whose effect we want to control before we
test for the relationship: “place of residence” and “sex”.
• The variables that we add in after the control variables are the
independent variables that we think will have a statistical
relationship to the dependent variable: “birth weight”.
• The variable that is to be predicted or related to is the dependent
variable (dv): “BMI”.
In order for a problem to be true, the relationship between the added
variables and the dependent variable must be statistically
significant, and the strength of the relationship after including the
control variables must be correctly stated.
• The relationship between each of the independent variables entered
after the control variables and the dependent variable must be
statistically significant and interpreted correctly.
• We are generally not interested in whether or not the control
variables have a statistically significant relationship to the
dependent variable.
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT
VARIABLES
ANOVAa
Model          Sum of Squares   df     Mean Square   F         Sig.
1 Regression     761.620           2    380.810      114.895   .000b
  Residual     26084.494        7870      3.314
  Total        26846.114        7872
2 Regression    5374.117           3   1791.372      656.497   .000c
  Residual     21471.998        7869      2.729
  Total        26846.114        7872
a. Dependent Variable: BMI
b. Predictors: (Constant), Sex, place of residence
c. Predictors: (Constant), Sex, place of residence, Birth weight
The probability of the F statistic (656.497) for the overall
regression relationship for all independent variables is
<0.001, less than or equal to the level of significance of
0.05. We reject the null hypothesis that there is no
relationship between the set of all independent variables
and the dependent variable (R² = 0). We support the
research hypothesis that there is a statistically significant
relationship between the set of all independent variables
and the dependent variable.
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2    Sig. F Change
1       .168a   .028       .028                1.82056                      .028               114.895   2     7870   .000
2       .447b   .200       .200                1.65187                      .172              1690.375   1     7869   .000
a. Predictors: (Constant), Sex, place of residence
b. Predictors: (Constant), Sex, place of residence, Birth weight
The R Square Change statistic for the increase in R²
associated with the added variable (birth weight)
is 0.172. Using a proportional reduction in error
interpretation for R², information provided by the
added variable reduces our error in predicting BMI
by 17.2%.

The probability of the F statistic (1690.375) for the change in R²
associated with the addition of the predictor variable to the
regression analysis containing the control variables is <0.001, less
than or equal to the level of significance of 0.05. We reject the
null hypothesis that there is no improvement in the relationship
between the set of independent variables and the dependent
variable when the predictor is added (R² Change = 0).
We support the research hypothesis that there is a statistically
significant improvement in the relationship between the set of
independent variables and the dependent variable.
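The R Square Change and its F test can be reproduced from the ANOVA sums of squares of the two models. The sketch below applies the usual F-change formula, (R² change / df1) / ((1 − R² of the full model) / df2), to the values printed for Problem 2; the variable names are illustrative.

```python
# Sums of squares from the ANOVA table for Problem 2.
ss_total = 26846.114
ss_regression_model1 = 761.620    # controls only: sex, place of residence
ss_regression_model2 = 5374.117   # controls plus birth weight

df_added = 1             # df1: number of predictors added in model 2
df_residual_model2 = 7869  # df2: residual df of the full model

r2_model1 = ss_regression_model1 / ss_total
r2_model2 = ss_regression_model2 / ss_total
r2_change = r2_model2 - r2_model1

# F test for the improvement in R squared when the predictor is added.
f_change = (r2_change / df_added) / ((1.0 - r2_model2) / df_residual_model2)

print(f"R squared, model 1: {r2_model1:.3f}")
print(f"R squared, model 2: {r2_model2:.3f}")
print(f"R squared change:   {r2_change:.3f}")
print(f"F change:           {f_change:.3f}")
```

The results should match the Model Summary to rounding: an R² change of about 0.172 and an F change of about 1690.4.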
Coefficientsa
Model                  B        Std. Error   Beta    t         Sig.
1 (Constant)           13.061   .037                 351.460   .000
  place of residence    -.583   .041         -.157   -14.140   .000
  Sex                    .211   .041          .057     5.148   .000
2 (Constant)           13.202   .034                 389.523   .000
  place of residence    -.323   .038         -.087    -8.512   .000
  Sex                    .120   .037          .032     3.214   .001
  Birth weight         -2.795   .068         -.421   -41.114   .000
B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: BMI
If there is a relationship between each added individual independent
variable and the dependent variable, the probability of the statistical
test of the b coefficient (slope of the regression line) will be less than
or equal to the level of significance. The null hypothesis for this test
states that b is equal to zero, indicating a flat regression line and no
relationship.
If we reject the null hypothesis and find that there is a relationship
between the variables, the sign of the b coefficient indicates the
direction of the relationship for the data values. If b is greater than
zero, the relationship is positive or direct. If b is less than
zero, the relationship is negative or inverse. If the variable is
dichotomous or ordinal, the direction of the coding must be taken
into account to make a correct interpretation.
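The decision rule for an individual coefficient can be sketched as a two-step check: first significance, then sign. The function below is a hypothetical helper written for illustration, not SPSS output. A Sig. value displayed as .000 is entered as 0.001, which is only an upper bound for p < 0.001. For dichotomous variables the sign still has to be read against the coding, which the function does not know.

```python
def describe_slope(b, p_value, alpha=0.05):
    """Describe a regression slope: significance first, then direction."""
    if p_value > alpha:
        return "no statistically significant relationship"
    if b > 0:
        return "positive (direct) relationship"
    if b < 0:
        return "negative (inverse) relationship"
    return "flat regression line"


# Model 2 coefficients from the Coefficients table for Problem 2.
print("Sex:", describe_slope(0.120, 0.001))
print("Birth weight:", describe_slope(-2.795, 0.001))  # .000 entered as 0.001
```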

Problem 3 – Stepwise Regression
Reading assignment

Logistic Regression Model
Review of simple and multiple linear regression
• Simple LR: Model the mean of a numeric response Y as a function
of a single predictor X, i.e.
$E(Y \mid X) = b_0 + b_1 X$
• The key is that E(Y|X) is linear in the parameters b0 and b1 but
not necessarily in X.
• b0 = Estimated Intercept
– The value of ŷ at x = 0
– Interpretable only if x = 0 is a value of particular interest
• b1 = Estimated Slope
– Change in ŷ for every unit increase in x
– Estimated change in the mean of Y for a unit change in X
– Always interpretable

Review of simple and multiple linear regression
• Multiple LR: Model the mean of a numeric response Y as a
function of k predictors X_1, X_2, …, X_k, i.e.
$E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_3 + \cdots + \beta_k X_k$
• The regression coefficients (b_i) represent the
estimated change in the mean of the response Y
associated with a unit change in X_i while the other
predictors are held constant.
• They measure the association between Y and X_i
adjusted for the other predictors in the model.

• Model the relationship between a set of predictors
X_1, X_2, …, X_k, e.g.,
– dichotomous (yes/no, smoker/nonsmoker, …)
– categorical (social class, race, ...)
– continuous (age, weight, gestational age, ...)
and
• A dichotomous response variable Y, e.g.,
– Success/Failure
– Remission/No Remission
– Survived/Died
– CHD/No CHD
– Low Birth Weight/Normal Birth Weight, etc.
Example: Coronary Heart Disease (CD) and Age
• In this study, sampled individuals were examined for signs of CD
(present = 1 / absent = 0), and the potential relationship between
this outcome and their age (yrs.) was considered.
• Portion of the data set (table not recovered)
How can we analyze this data?
• The mean age of the individuals with some signs of coronary
heart disease is 51.28 years vs. 39.18 years for individuals without
signs (t = 5.95, p < .0001).
Simple linear regression?
$E(CD \mid Age) = -0.54 + 0.02 \times Age$
e.g. For an individual 50 years of age
$E(CD \mid Age = 50) = -0.54 + 0.02 \times 50 = 0.46$ ??
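The question marks hint at the problem with fitting a straight line to a 0/1 outcome: the fitted "mean" is meant to be a probability, but a line is not confined to the range 0 to 1. The sketch below assumes the fitted line is E(CD | Age) = −0.54 + 0.02 × Age, the reading consistent with the value .46 at age 50. Ages 20 and 80 are hypothetical inputs added for illustration.

```python
def linear_estimate(age):
    """Straight-line estimate of P(CD = 1) at a given age (assumed fit)."""
    return -0.54 + 0.02 * age


for age in (20, 50, 80):
    estimate = linear_estimate(age)
    is_valid = 0.0 <= estimate <= 1.0
    print(f"age {age}: estimate = {estimate:.2f}, valid probability: {is_valid}")
```

At age 20 the line gives a negative value and at age 80 a value above 1, neither of which can be a probability. This motivates a model whose output always stays between 0 and 1.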

Logistic regression
• We can group individuals into age classes and look at the
percentage/proportion showing signs of coronary heart disease.
Age group    # in group   # Diseased   Proportion
1) 20 - 29   10            1           .100
2) 30 - 34   15            2           .133
3) 35 - 39   12            3           .250
4) 40 - 44   15            5           .333
5) 45 - 49   13            6           .462
6) 50 - 54    8            5           .625
7) 55 - 59   17           13           .765
8) 60 - 64   10            8           .800
Notice the “S-shape” to the estimated proportions vs. age.
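The proportion column is simply the number diseased divided by the number in each age group. A minimal sketch that recomputes it from the counts in the table (the list layout and names are illustrative):

```python
# (age group, number in group, number diseased) from the grouped table.
age_groups = [
    ("20-29", 10, 1),
    ("30-34", 15, 2),
    ("35-39", 12, 3),
    ("40-44", 15, 5),
    ("45-49", 13, 6),
    ("50-54", 8, 5),
    ("55-59", 17, 13),
    ("60-64", 10, 8),
]

# Proportion diseased in each group.
proportions = [(label, diseased / size) for label, size, diseased in age_groups]

for label, proportion in proportions:
    print(f"{label}: {proportion:.3f}")

total_people = sum(size for _, size, _ in age_groups)
total_diseased = sum(diseased for _, _, diseased in age_groups)
print(f"Overall: {total_diseased}/{total_people}")
```

The proportions rise with every age group and stay between 0 and 1, which is the pattern the logistic function is designed to describe.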
Logistic function
$P(\text{"Success"} \mid X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
[Figure: S-shaped curve of P(“Success”|X) on the vertical axis (0 to 1) against X]
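The logistic function can be evaluated directly to see the S-shape. In the sketch below the coefficients β0 = −5.0 and β1 = 0.1 are hypothetical values chosen only so that the curve crosses 0.5 at X = 50; they are not estimates from the lecture data.

```python
import math


def logistic_probability(x, beta0, beta1):
    """P(Success | X = x) under the logistic model."""
    linear_part = beta0 + beta1 * x
    return math.exp(linear_part) / (1.0 + math.exp(linear_part))


# Hypothetical coefficients, for illustration only.
BETA0, BETA1 = -5.0, 0.1

for x in (0, 25, 50, 75, 100):
    print(f"X = {x:3d}: P = {logistic_probability(x, BETA0, BETA1):.3f}")
```

Whatever the value of X, the result stays strictly between 0 and 1, rising slowly at the extremes and fastest in the middle.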
Logit transformation
• The logistic regression model is given by
$P(Y \mid X) = \dfrac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$
• Which is equivalent to
$\ln\left(\dfrac{P(Y \mid X)}{1 - P(Y \mid X)}\right) = \beta_0 + \beta_1 X$
This is called the Logit Transformation.
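The equivalence of the two forms can be checked numerically: taking the log odds of a probability and then applying the logistic function returns the original probability. A minimal sketch with illustrative function names:

```python
import math


def logit(p):
    """Log odds of a probability p, with 0 < p < 1."""
    return math.log(p / (1.0 - p))


def inverse_logit(z):
    """Logistic function: maps a log odds value back to a probability."""
    return math.exp(z) / (1.0 + math.exp(z))


for p in (0.1, 0.5, 0.8):
    z = logit(p)
    print(f"p = {p}: logit = {z:.3f}, back-transformed = {inverse_logit(z):.3f}")
```

A probability of 0.5 corresponds to a logit of 0, probabilities below 0.5 give negative logits, and probabilities above 0.5 give positive ones.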
Logit transformation
• Consider a dichotomous predictor (X) which represents the
presence of risk (1 = present)
                 Risk Factor (X)
Disease (Y)      Present (X = 1)          Absent (X = 0)
Yes (Y = 1)      P(Y = 1 | X = 1)         P(Y = 1 | X = 0)
No (Y = 0)       1 - P(Y = 1 | X = 1)     1 - P(Y = 1 | X = 0)
Since $\dfrac{P}{1 - P} = e^{\beta_0 + \beta_1 X}$:
Odds for Disease with Risk Present $= \dfrac{P(Y = 1 \mid X = 1)}{1 - P(Y = 1 \mid X = 1)} = e^{\beta_0 + \beta_1}$
Odds for Disease with Risk Absent $= \dfrac{P(Y = 1 \mid X = 0)}{1 - P(Y = 1 \mid X = 0)} = e^{\beta_0}$
Therefore the odds ratio (OR)
$= \dfrac{\text{Odds for Disease with Risk Present}}{\text{Odds for Disease with Risk Absent}} = \dfrac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$
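The result OR = e^β1 can be confirmed numerically by computing the two odds from the model probabilities. The coefficients in this sketch (β0 = −2.0, β1 = 0.7) are hypothetical and serve only to illustrate the identity.

```python
import math

# Hypothetical coefficients, for illustration only.
BETA0, BETA1 = -2.0, 0.7


def disease_probability(x):
    """P(Y = 1 | X = x) under the logistic model."""
    linear_part = BETA0 + BETA1 * x
    return math.exp(linear_part) / (1.0 + math.exp(linear_part))


def odds(p):
    """Odds corresponding to a probability p."""
    return p / (1.0 - p)


odds_risk_present = odds(disease_probability(1))
odds_risk_absent = odds(disease_probability(0))
odds_ratio = odds_risk_present / odds_risk_absent

print(f"Odds with risk present: {odds_risk_present:.4f}")
print(f"Odds with risk absent:  {odds_risk_absent:.4f}")
print(f"Odds ratio:             {odds_ratio:.4f}")
print(f"exp(beta1):             {math.exp(BETA1):.4f}")
```

The last two printed values agree, so exponentiating the coefficient of a 0/1 predictor gives the odds ratio directly.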
Logistic Regression Model: Example
Malaria incidence and entomological indices data among non-resettled and
resettled communities in Jimma town
• Aim: Evaluating the impact of resettlement on malaria
incidence and entomological indices
• Data:
– Four study villages (2 from at risk, 2 from control)
– Two types of study were conducted: Entomological and
Parasitological

Parasitological study
• A cohort of 604 individuals (302 from resettled and 302 from
non-resettled villages) residing in 202 households
was followed from September 1 to November 30, 2013.
• During monthly house-to-house visits, a blood sample was
collected from the study participants
• Outcome: P. falciparum malaria infection status
• Covariates: month, resettlement status

Result
Month of parasitological survey * PFIND Crosstabulation
Month of parasitological survey   PFIND = 0       PFIND = 1    Total
September                          583 (97.2%)    17 (2.8%)     600 (100.0%)
October                            586 (97.7%)    14 (2.3%)     600 (100.0%)
November                           564 (97.6%)    14 (2.4%)     578 (100.0%)
Total                             1733 (97.5%)    45 (2.5%)    1778 (100.0%)

Settlement status of the household members * PFIND Crosstabulation
Settlement status   PFIND = 0       PFIND = 1    Total
Resettled            855 (96.7%)    29 (3.3%)     884 (100.0%)
Indigenous           878 (98.2%)    16 (1.8%)     894 (100.0%)
Total               1733 (97.5%)    45 (2.5%)    1778 (100.0%)
Result
Dependent Variable Encoding
Original Value   Internal Value
.00              0
1.00             1

Categorical Variables Codings
Variable                                     Category     Frequency   Parameter coding (1)   Parameter coding (2)
Month of parasitological survey              September    600         1.000                   .000
                                             October      600          .000                  1.000
                                             November     578          .000                   .000
Settlement status of the household members   Resettled    884         1.000
                                             Indigenous   894          .000

Result
Variables in the Equation
Step 1a          B        S.E.   Wald      df   Sig.   Exp(B)
Month                            .346      2    .841
Month(1)          .162    .366   .197      1    .657   1.176
Month(2)         -.037    .383   .010      1    .922    .963
Settlement(1)     .621    .315   3.889     1    .049   1.862
Constant        -4.051    .338   143.540   1    .000    .017
a. Variable(s) entered on step 1: Month, Settlement.
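The Exp(B) column can be related back to the crosstabulation. The sketch below computes the crude (unadjusted) odds ratio for resettlement from the counts, compares it with exp(B) for Settlement(1), and converts two logits into predicted probabilities. It relies on the coding table, where November and Indigenous are the reference categories (coded 0). The variable and function names are illustrative.

```python
import math

# Counts from the settlement crosstabulation (PFIND = 1 means infected).
resettled_infected, resettled_not_infected = 29, 855
indigenous_infected, indigenous_not_infected = 16, 878

# Crude odds ratio: odds of infection, resettled vs. indigenous.
crude_odds_ratio = (resettled_infected / resettled_not_infected) / (
    indigenous_infected / indigenous_not_infected
)

# Coefficients (column B) from "Variables in the Equation".
B_CONSTANT = -4.051
B_SETTLEMENT = 0.621

# Odds ratio adjusted for month of survey.
adjusted_odds_ratio = math.exp(B_SETTLEMENT)


def predicted_probability(logit_value):
    """Convert a logit (log odds) into a probability."""
    return 1.0 / (1.0 + math.exp(-logit_value))


# November is the reference month, so only the settlement term varies.
p_indigenous_november = predicted_probability(B_CONSTANT)
p_resettled_november = predicted_probability(B_CONSTANT + B_SETTLEMENT)

print(f"Crude OR:            {crude_odds_ratio:.3f}")
print(f"Adjusted OR, exp(B): {adjusted_odds_ratio:.3f}")
print(f"P(infected | indigenous, November): {p_indigenous_november:.4f}")
print(f"P(infected | resettled, November):  {p_resettled_november:.4f}")
```

The crude and adjusted odds ratios are nearly identical (about 1.86), which is consistent with month contributing little to the model (Sig. = .841). The value exp(0.621) differs slightly from the printed 1.862 because B is rounded to three decimals in the table.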

