Methodology
• Salman, M. D. (2003). Animal Disease Surveillance and Survey Systems: Methods and
Applications. 1st edition. Blackwell Publishing Limited.
• Downie, N. M., and R. W. Heath (1983). Basic Statistical Methods, 5th edition. New York:
Harper and Row.
• Moore, D. S., and G. P. McCabe (1989). Introduction to the Practice of Statistics. New
York: W. H. Freeman and Company.
In summaries
• Approaches
– Tables/graphs
– Measures of dispersion
– Measures of shape
– Measures of association
• Measures of association
Disease
Exposure Yes No Total
Yes a b a+b
No c d c+d
Total a+c b+d a+b+c+d
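As a minimal sketch of how two common measures of association follow from the cell counts a, b, c and d of the 2×2 table above, the Python function below computes the risk ratio and the odds ratio. The function name and the example counts are hypothetical and are used only for illustration.

```python
def association_measures(a, b, c, d):
    """Risk ratio and odds ratio for a 2x2 exposure-by-disease table.

    a: exposed, diseased      b: exposed, not diseased
    c: unexposed, diseased    d: unexposed, not diseased
    """
    risk_exposed = a / (a + b)      # proportion diseased among the exposed
    risk_unexposed = c / (c + d)    # proportion diseased among the unexposed
    risk_ratio = risk_exposed / risk_unexposed
    odds_ratio = (a * d) / (b * c)  # cross-product ratio
    return risk_ratio, odds_ratio


# Hypothetical counts, not taken from the source
rr, odds = association_measures(a=20, b=80, c=10, d=90)
print(f"risk ratio = {rr:.2f}, odds ratio = {odds:.2f}")
```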
• Exploring data
– Data checking
– Understand distribution of variables
– Understand nature and strength of
relationships between variables
Definition of EDA
• It is an approach for data analysis that employs a
variety of techniques (mostly graphical)
Main reasons we use EDA:
• Detection of mistakes
• Checking of assumptions
• Preliminary selection of appropriate models
• Determining relationships among the explanatory
variables
• Assessing the direction and rough size of
relationships between explanatory and outcome
variables
• Maximize insight into a data set
• Uncover underlying structure
• Extract important variables
• Detect outliers and anomalies
• Test underlying assumptions
• Determine optimal factor settings
• Find odd values
Typical data format
• Spreadsheet (stores data in rows and columns)
• Database (e.g. tabular form)
Classification of EDA
Graphical
• Scatter plot
• Histogram
• Box plot
• Residual plot
• Probability plot
Such graphical tools are the shortest path to
gaining insight into a data set in terms of
• testing assumptions
• model selection
• model validation
• estimator selection
• relationship identification
• factor effect determination
• outlier detection
• identify useful raw data & transforms (e.g. log(x))
Non-graphical (summary statistics)
• Averages (mean, median, etc.)
• Quantiles
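The non-graphical summaries listed above can be sketched with Python's standard `statistics` module. The data values are hypothetical, and the "inclusive" quantile method is only one of several conventions, chosen here as an assumption.

```python
import statistics

# Hypothetical measurements, for illustration only
values = [2, 4, 4, 5, 7, 9, 10, 12]

mean = statistics.mean(values)
median = statistics.median(values)

# quantiles(n=4) returns the three cut points Q1, Q2 (the median) and Q3
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")

print(f"mean={mean}, median={median}, Q1={q1}, Q3={q3}")
```

Other software may report slightly different quartiles for the same data, because quantile definitions differ between packages.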
Exploratory data analysis:
One variable
• Graphical displays
– Qualitative/categorical data: bar chart, pie chart, etc.
– Quantitative data: histogram, stem-leaf, boxplot, timeplot etc.
• Summary statistics
– Qualitative/categorical: contingency tables
– Quantitative: mean, median, standard deviation, range etc.
• Probability models
– Qualitative: Binomial distribution (others we won’t cover in this class)
– Quantitative: Normal curve (others we won’t cover in this class)
Summary of categorical variables
• Graphically
– Bar graphs, pie charts
• A bar graph is nearly always preferable to a pie chart:
bar heights are easier to compare than slices of a
pie
• Numerical summary
– Mean
– Median
– Quartiles
– Range
– Standard deviation
– more
Single Variable Visualization
• Histogram:
– Shows center, variability, skewness, modality,
outliers, or strange patterns.
– Bin width and position matter
– Beware of real zeros
Issues with Histograms
• For small data sets, histograms can be misleading.
– Small changes in the data, bins, or anchor can deceive
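To illustrate how the anchor alone can change the picture for a small data set, the sketch below bins the same eight values twice with the same bin width but different starting points. The helper `histogram_counts` and the data are hypothetical.

```python
def histogram_counts(data, bin_width, anchor):
    """Count observations in bins [anchor + k*w, anchor + (k+1)*w)."""
    counts = {}
    for x in data:
        k = int((x - anchor) // bin_width)   # index of the bin containing x
        left_edge = anchor + k * bin_width   # label each bin by its left edge
        counts[left_edge] = counts.get(left_edge, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical small data set
data = [1.9, 2.1, 2.2, 3.8, 4.1, 4.2, 5.9, 6.1]

bins_from_0 = histogram_counts(data, bin_width=2, anchor=0)
bins_from_1 = histogram_counts(data, bin_width=2, anchor=1)
print(bins_from_0)
print(bins_from_1)
```

With the anchor at 0 the counts look symmetric around the middle bins, while with the anchor at 1 the same data appear to pile up at the low end.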
Boxplots
• Shows a lot of information about a
variable in one plot
– Median
– IQR
– Outliers
– Range
– Skewness
• Negatives
– Overplotting
– Hard to tell distributional shape
– No standard implementation in
software (many options for
whiskers, outliers)
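Because software differs in how it draws whiskers and flags outliers, it can help to compute the ingredients of a boxplot explicitly. The sketch below assumes the common 1.5 × IQR rule and the "inclusive" quartile method; both are assumptions rather than something the slides prescribe, and the data are hypothetical.

```python
import statistics


def boxplot_summary(values, k=1.5):
    """Median, quartiles, IQR and points beyond the k*IQR fences."""
    data = sorted(values)
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    outliers = [x for x in data if x < lower_fence or x > upper_fence]
    return {"median": median, "q1": q1, "q3": q3, "iqr": iqr, "outliers": outliers}


# Hypothetical data with one unusually large value
summary = boxplot_summary([2, 4, 4, 5, 7, 9, 10, 12, 40])
print(summary)
```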
Exploratory data analysis: two variables
• There are three combinations of variables we must consider.
We do so in the following order
– 1 qualitative/categorical, 1 quantitative variables
• Side-by-side box plots, counts, etc.
– 2 quantitative variables
• Scatter plots, correlations, regressions
– 2 qualitative/categorical variables
• Contingency tables (we will cover these later in the
semester)
Side-by-side box plots
• Side-by-side box plots are graphical summaries of data when
one variable is categorical and the other quantitative
• These plots can be used to compare the distributions
associated with the quantitative variable across the levels
of the categorical variable
Box plots
– Maximum
– Median
– 1st quartile
– 3rd quartile
Two Continuous Variables
• For two numeric variables, the scatterplot
is the obvious choice
2D Scatterplots
• standard tool to display relation between 2 variables
– e.g. y-axis = response, x-axis = suspected indicator
• useful to answer:
– x,y related?
• linear
• quadratic
• other
– variance(y) depend on x?
– outliers present?
Scatter Plot: No apparent relationship
Scatter Plot: Linear relationship
Scatter Plot: Quadratic relationship
Scatter plot: Homoscedastic
Scatter plot: Heteroscedastic
Two variables - continuous
• Scatterplots
– But can be bad with lots of data
Two variables - continuous
• What to do for large data sets
– Contour plots
Describing scatter plots
• Form
– Linear, quadratic, exponential
• Direction
– Positive association
• An increase in one variable is accompanied by an increase in the other
– Negative association
• A decrease in one variable is accompanied by an increase in the other
• Strength
– How closely the points follow a clear form
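For a linear form, direction and strength are often summarized numerically with Pearson's correlation coefficient: its sign gives the direction and its magnitude the strength. The slides do not name this statistic at this point, so the sketch below is an illustrative assumption, with hypothetical data.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical data: one increasing and one decreasing relationship
x = [1, 2, 3, 4, 5]
y_up = [2.1, 3.9, 6.2, 7.8, 10.1]
y_down = [10.0, 8.1, 5.9, 4.2, 1.8]

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
print(f"positive association: r = {r_up:.3f}")
print(f"negative association: r = {r_down:.3f}")
```

The coefficient only measures how closely points follow a straight line, so a strong quadratic or exponential form can still give a value near zero.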
Describing scatter plots
• Form:
– Linear
• Direction
– Positive
• Strength
– Strong
Two Variables - one categorical
• Side by side boxplots are very effective in showing differences in a
quantitative variable across factor levels
– Tips data
• Do men or women tip better?
– Orchard sprays
• Measuring the potency of various orchard sprays in repelling honeybees
Barcharts and Spineplots
• Stacked barcharts can be used to compare continuous values across
two or more categorical ones (in the figure, orange = M, blue = F).
• Spineplots show proportions well, but can be hard to interpret.
Study Types
• The goal is to find the best fitting and most parsimonious, yet
biologically reasonable model to describe the relationship
between an outcome (dependent or response variable) and a set
of independent (predictor or explanatory) variables.
Linear relationship: A line indicates the main direction
of the spread of points.
Non-linear relationship between x and y: A curve best
describes the relationship.
12/21/2019 Biostatistics: Compiled by Sead Z. 100
The Model
• Suppose we have n subjects and measure the following on each
subject
– 𝑌 = (𝑦1 , 𝑦2 ,…, 𝑦𝑛 ) be the response
– 𝑋1 =(𝑋11 , 𝑋12 ,…, 𝑋1𝑛 ) be independent variable 1
– 𝑋2 =(𝑋21 , 𝑋22 ,…, 𝑋2𝑛 ) be independent variable 2
– 𝑋3 =(𝑋31 , 𝑋32 ,…, 𝑋3𝑛 ) be independent variable 3
𝑌 = β0 + β1 𝑋1 + ε
𝑌 = β0 + β1 𝑋1 + β2 𝑋2 + β3 𝑋3 + ε
Where ε is the residual, the part that cannot be accounted for
by the model; it is assumed to be normally distributed with mean 0 and
variance σ².
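As a rough sketch of how the single-predictor model 𝑌 = β0 + β1 𝑋1 + ε can be estimated, the code below uses ordinary least squares, a standard method that the slides do not name explicitly. The function name and the data are hypothetical.

```python
def fit_simple_regression(x, y):
    """Least-squares estimates (b0, b1) for y = b0 + b1*x + error."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx                 # estimated slope
    b0 = mean_y - b1 * mean_x      # estimated intercept
    return b0, b1


# Hypothetical data
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

b0, b1 = fit_simple_regression(x, y)
# Residuals estimate the error term: observed minus fitted values
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
print(f"b0 = {b0:.2f}, b1 = {b1:.2f}")
```

With an intercept in the model, the least-squares residuals sum to zero, which mirrors the mean-0 property assumed for ε.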
Assumption 1: Linear relationship
Assumption 2: Y normally distributed at each value of x
Assumption 3: Same variance at each value of x
[Figure: spread of y|x plotted against x, with the same variance σ² at x1, x2 and x3]
Testing Assumptions:
Assumption 3: Spread of y values constant over range of x values
(plot of residuals against X)
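The plot of residuals against X is the check named above. As a crude numerical companion, the sketch below compares the standard deviation of the residuals in the lower and upper halves of the X range. The helper name, the split into halves and the residual values are all hypothetical choices for illustration.

```python
import statistics


def residual_spread_by_half(x, residuals):
    """Standard deviation of residuals for the lower and upper half of x."""
    pairs = sorted(zip(x, residuals))      # order the residuals by x
    half = len(pairs) // 2
    low = [r for _, r in pairs[:half]]
    high = [r for _, r in pairs[half:]]
    return statistics.stdev(low), statistics.stdev(high)


# Hypothetical residuals whose spread grows with x
x = [1, 2, 3, 4, 5, 6, 7, 8]
residuals = [0.1, -0.2, 0.2, -0.1, 1.5, -2.0, 2.5, -2.2]

sd_low, sd_high = residual_spread_by_half(x, residuals)
print(f"spread for small x: {sd_low:.2f}, spread for large x: {sd_high:.2f}")
```

A much larger spread in one half, as in this made-up example, would suggest that the constant-variance assumption is doubtful.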
• Survey respondents who had a lower body weight had a lower systolic
blood pressure. Survey respondents who were younger had a lower
systolic blood pressure.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
Problem 1 - standard multiple regression
• When a problem states that there is a relationship between some independent
variables and a dependent variable, we do standard multiple regression.
• The variables listed first in the problem statement are the independent
variables (ivs): “Body weight in pound” and “Age of the person in years”.
• The variable that they are related to is the dependent variable (dv):
“Systolic blood pressure”.
• The relationship of each of the independent variables to the dependent
variable must be statistically significant and interpreted correctly.
Problem 1 - standard multiple regression
ANOVA (b)
Model 1      Sum of Squares   df   Mean Square   F        Sig.
Regression   10287.769        2    5143.884      27.924   .000 (a)
Residual     14921.219        81   184.213
Total        25208.988        83
a. Predictors: (Constant), Body Weight in pound, Age of the person in years
b. Dependent Variable: Systolic Blood Pressure in mmHg
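The entries of the ANOVA table are linked by simple arithmetic, which the sketch below reproduces from the reported sums of squares and degrees of freedom. The R² value computed at the end is derived here from those sums of squares and is an addition, not a figure quoted from the output.

```python
# Values copied from the ANOVA table above
ss_regression, df_regression = 10287.769, 2
ss_residual, df_residual = 14921.219, 81

# Mean square = sum of squares / degrees of freedom
ms_regression = ss_regression / df_regression
ms_residual = ss_residual / df_residual

# Overall F statistic = regression mean square / residual mean square
f_statistic = ms_regression / ms_residual

# Share of total variation explained by the model (derived, not reported above)
r_squared = ss_regression / (ss_regression + ss_residual)

print(f"MS regression = {ms_regression:.3f}")
print(f"MS residual   = {ms_residual:.3f}")
print(f"F             = {f_statistic:.3f}")
print(f"R squared     = {r_squared:.3f}")
```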
Model Summary
Coefficients (a)
Model 1                      B        Std. Error   Beta   t        Sig.
(Constant)                   83.019   7.393               11.229   .000
Age of the person in years   .553     .078         .608   7.114    .000
Body Weight in pound         .085     .038         .191   2.239    .028
B and Std. Error are unstandardized coefficients; Beta is the standardized coefficient.
a. Dependent Variable: Systolic Blood Pressure in mmHg
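To show how the coefficient table is read, the sketch below recomputes each t statistic as B divided by its standard error and uses the unstandardized coefficients to predict systolic blood pressure. The dictionary keys, the function `predict_sbp` and the example person (age 50, 150 pounds) are hypothetical.

```python
# (B, Std. Error) pairs copied from the coefficient table above
coefficients = {
    "constant": (83.019, 7.393),
    "age_years": (0.553, 0.078),
    "weight_pounds": (0.085, 0.038),
}

# t = B / Std. Error; small differences from the reported t values arise
# because the table shows B and Std. Error rounded to three decimals
t_values = {name: b / se for name, (b, se) in coefficients.items()}


def predict_sbp(age_years, weight_pounds):
    """Predicted systolic blood pressure (mmHg) from the fitted equation."""
    return (coefficients["constant"][0]
            + coefficients["age_years"][0] * age_years
            + coefficients["weight_pounds"][0] * weight_pounds)


for name, t in t_values.items():
    print(f"{name}: t = {t:.3f}")

# Hypothetical person, not from the source data
predicted = predict_sbp(age_years=50, weight_pounds=150)
print(f"predicted SBP = {predicted:.1f} mmHg")
```

Both slopes are positive, which is consistent with the problem statement: lower age and lower body weight go with lower predicted systolic blood pressure.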
After controlling for the effects of the variables “Place of residence” and
“sex”, the addition of the variable “birth weight” reduces the error in
predicting “BMI” by 17.2%.
After controlling for place of residence and sex, the variable birth weight
makes an individual contribution to reducing the error in predicting BMI.
Infants who live in rural areas had a lower BMI. Male infants had a higher BMI.
Infants who had a birth weight below 2500 g had a lower BMI.
1. True
2. True with caution
3. False
4. Inappropriate application of a statistic
• The variables that we add in after the control variables are the independent
variables that we think will have a statistical relationship to the
dependent variable: “birth weight”.
• The variable to be predicted or related to is the dependent variable (dv): “BMI”.
ANOVA (a)
Model   Source       Sum of Squares   df     Mean Square   F         Sig.
1       Regression   761.620          2      380.810       114.895   .000 (b)
        Residual     26084.494        7870   3.314
        Total        26846.114        7872
2       Regression   5374.117         3      1791.372      656.497   .000 (c)
        Residual     21471.998        7869   2.729
        Total        26846.114        7872
a. Dependent Variable: BMI
b. Predictors: (Constant), Sex, place of residence
c. Predictors: (Constant), Sex, place of residence, Birth weight
The probability of the F statistic (656.497) for the overall
regression relationship for all independent variables is
<0.001, which is less than the level of significance of
0.05. We reject the null hypothesis that there is no
relationship between the set of all independent variables
and the dependent variable (R² = 0). We support the
research hypothesis that there is a statistically significant
relationship between the set of all independent variables
and the dependent variable.
OVERALL RELATIONSHIP BETWEEN INDEPENDENT AND DEPENDENT
VARIABLES
Model Summary
Model   R          R Square   Adjusted R Square   Std. Error of the Estimate   R Square Change   F Change   df1   df2    Sig. F Change
1       .168 (a)   .028       .028                1.82056                      .028              114.895    2     7870   .000
2       .447 (b)   .200       .200                1.65187                      .172              1690.375   1     7869   .000
a. Predictors: (Constant), Sex, place of residence
b. Predictors: (Constant), Sex, place of residence, Birth weight
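The change statistics for model 2 can be reproduced from the two ANOVA blocks reported for this problem. The sketch below is a worked check under that assumption: R² change is the difference between the two R² values, and F change tests the one added predictor (birth weight) against the residual mean square of the larger model.

```python
# Sums of squares and degrees of freedom copied from the ANOVA output
ss_total = 26846.114
ss_reg_1 = 761.620                       # model 1: sex, place of residence
ss_reg_2 = 5374.117                      # model 2: adds birth weight
ss_res_2, df_res_2 = 21471.998, 7869
added_predictors = 1

r2_model_1 = ss_reg_1 / ss_total
r2_model_2 = ss_reg_2 / ss_total
r2_change = r2_model_2 - r2_model_1

# Partial F test for the predictors added in model 2
f_change = ((ss_reg_2 - ss_reg_1) / added_predictors) / (ss_res_2 / df_res_2)

print(f"R squared, model 1 = {r2_model_1:.3f}")
print(f"R squared, model 2 = {r2_model_2:.3f}")
print(f"R squared change   = {r2_change:.3f}")
print(f"F change           = {f_change:.3f}")
```

The R² change of about .172 is the 17.2% reduction in prediction error quoted in the problem statement.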
b0 = Estimated Intercept
• The value of ŷ at x=0
• Interpretable only if x=0 is a value of particular interest
b1 = Estimated Slope
• Change in ŷ for every unit increase in x
• Estimated change in the mean of Y for a unit change in X
• Always interpretable
E(Y | X) = β0 + β1 X1 + β2 X2 + β3 X3 + ⋯ + βk Xk
• A dichotomous response variable Y, e.g.,
– Success/Failure
– Remission/No Remission
– Survived/Died
– CHD/No CHD
– Low Birth Weight/Normal Birth Weight, etc…
Logistic Regression Model
Example: Coronary Heart Disease (CD) and Age
• In this study, sampled individuals were examined for signs of CD
(present = 1 / absent = 0), and the potential relationship between
this outcome and their age (yrs.) was considered.
Age group (yrs.)   n    CD present   Proportion with CD
1) 20 - 29         10   1            .100
2) 30 - 34         15   2            .133
3) 35 - 39         12   3            .250
4) 40 - 44         15   5            .333
5) 45 - 49         13   6            .462
6) 50 - 54         8    5            .625
7) 55 - 59         17   13           .765
8) 60 - 64         10   8            .800
Notice the “S-shape” to the estimated proportions vs. age.
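The proportions in the age-group table are the number of individuals with CD divided by the group size. The sketch below recomputes them and also shows the log-odds of each proportion, the scale on which logistic regression is linear. The variable names are hypothetical; the counts are copied from the table.

```python
import math

# (age group, number examined, number with CD) copied from the table above
age_groups = [
    ("20-29", 10, 1), ("30-34", 15, 2), ("35-39", 12, 3), ("40-44", 15, 5),
    ("45-49", 13, 6), ("50-54", 8, 5), ("55-59", 17, 13), ("60-64", 10, 8),
]

proportions = [round(cases / n, 3) for _, n, cases in age_groups]
# Log-odds (logit) of the unrounded proportion in each group
log_odds = [math.log((cases / n) / (1 - cases / n)) for _, n, cases in age_groups]

for (label, _, _), p, lo in zip(age_groups, proportions, log_odds):
    print(f"{label}: proportion = {p:.3f}, log-odds = {lo:+.2f}")
```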
Logistic Regression Model
Logistic function
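$$P(\text{"Success"} \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
[Figure: logistic curve of P(“Success”|X), ranging from 0 to 1, plotted against X]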
Logistic Regression Model
Logit transformation
• The logistic regression model is given by
$$P(Y \mid X) = \frac{e^{\beta_0 + \beta_1 X}}{1 + e^{\beta_0 + \beta_1 X}}$$
• Which is equivalent to
$$\ln\left(\frac{P(Y \mid X)}{1 - P(Y \mid X)}\right) = \beta_0 + \beta_1 X$$
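The equivalence between the logistic model and its logit form can be checked numerically. In the sketch below the coefficient values β0 = −5 and β1 = 0.1 are hypothetical, chosen only so that the probability is 0.5 at X = 50; they are not estimates from the CD data.

```python
import math


def logistic(x, beta0, beta1):
    """P(Y = 1 | X = x) under the logistic regression model."""
    eta = beta0 + beta1 * x            # linear predictor
    return math.exp(eta) / (1 + math.exp(eta))


def logit(p):
    """Log-odds of a probability p, the inverse of the logistic function."""
    return math.log(p / (1 - p))


beta0, beta1 = -5.0, 0.1  # hypothetical coefficients

for age in (25, 45, 65):
    p = logistic(age, beta0, beta1)
    # logit(p) recovers the linear predictor beta0 + beta1 * age
    print(f"age {age}: P = {p:.3f}, logit(P) = {logit(p):.3f}")
```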
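              X = 1                  X = 0
Yes (Y = 1)   P(Y = 1 | X = 1)       P(Y = 1 | X = 0)
No (Y = 0)    1 − P(Y = 1 | X = 1)   1 − P(Y = 1 | X = 0)
With P = P(Y = 1 | X):
$$\frac{P}{1 - P} = e^{\beta_0 + \beta_1 X}$$
Odds for Disease with Risk Present:
$$\frac{P(Y = 1 \mid X = 1)}{1 - P(Y = 1 \mid X = 1)} = e^{\beta_0 + \beta_1}$$
Odds for Disease with Risk Absent:
$$\frac{P(Y = 1 \mid X = 0)}{1 - P(Y = 1 \mid X = 0)} = e^{\beta_0}$$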
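Dividing the odds with the risk present by the odds with the risk absent gives the odds ratio. The short derivation below uses only the two odds expressions stated above; the symbol OR is introduced here for convenience.

```latex
\mathrm{OR}
  = \frac{\text{odds for disease with risk present}}{\text{odds for disease with risk absent}}
  = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}}
  = e^{\beta_1},
\qquad
\ln(\mathrm{OR}) = \beta_1
```

So for a binary risk factor, the slope β1 of the logistic model is the log odds ratio.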
• Data:
– Four study villages (2 at-risk, 2 control)
– Two types of study were conducted: entomological and
parasitological
Parasitological study
A cohort of 604 individuals (302 from resettled and 302 from
non-resettled villages) residing in 202 households
was followed from September 1 to November 30, 2013.
During monthly house-to-house visits, a blood sample was
collected from each study participant.
Outcome: P. falciparum malaria infection status
Covariates: month, resettlement status
Result
Month of parasitological survey * PFIND Crosstabulation
Month of parasitological survey   PFIND = 0      PFIND = 1   Total
September                         583 (97.2%)    17 (2.8%)   600 (100.0%)
October                           586 (97.7%)    14 (2.3%)   600 (100.0%)
November                          564 (97.6%)    14 (2.4%)   578 (100.0%)
Total                             1733 (97.5%)   45 (2.5%)   1778 (100.0%)
Cells show count (row %).
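The row percentages in the crosstabulation can be recomputed from the counts. The sketch below assumes that PFIND = 1 marks a sample positive for P. falciparum, which is an interpretation of the variable name rather than something stated in the output; the counts are copied from the table.

```python
# (PFIND = 0 count, PFIND = 1 count) per survey month, from the crosstabulation
counts = {
    "September": (583, 17),
    "October": (586, 14),
    "November": (564, 14),
}


def positive_percent(negative, positive):
    """Percentage of samples with PFIND = 1 among all samples in the row."""
    return 100 * positive / (negative + positive)


monthly_percent = {m: positive_percent(neg, pos) for m, (neg, pos) in counts.items()}

total_negative = sum(neg for neg, _ in counts.values())
total_positive = sum(pos for _, pos in counts.values())
overall_percent = positive_percent(total_negative, total_positive)

for month, pct in monthly_percent.items():
    print(f"{month}: {pct:.1f}% positive")
print(f"Overall: {overall_percent:.1f}% positive")
```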
Result