Professional Documents
Culture Documents
University of Gonder
Collage of Medicine and Health Sciences
Institute of Public Health
November 4, 2021
Concept of Biostatistics
Introduction to STATA
Life Table
Individual Profiles
Average Profile
Correlation Matrix
Model Comparison
Tadesse A. Ayele (Universities of Gondar) November 4, 2021 12 / 292
Chapter 4: Longitudinal Analysis for non-Gaussian data
Model Comparison
Signed-rank test
Kruskal-Wallis test
Forecasting
Evaluation
Participation 5%
Tasks 15%
Assignment 15%
Project 20%
Final Exam: 45%
Chapters Date
General Introduction Day 1
Part I: Basic Models
Models for Survival data Day 2
Part II: Advanced Models
Longitudinal Analysis for Continuous Data Day 3
Longitudinal Analysis for categorical Data Day 4
Non-parametric methods Day 5
Time Series Analysis Day 5
Complete census were first carried out in: Sweden in 1749, USA in
1790, ..., Ethiopia 1984
The human must be able to intelligently interpret the output from the
computer
Used for:
Forming hypotheses
Designing experiments and observational studies
Gathering data
Summarizing data
Drawing inferences from data( eg. testing hypotheses)
Why biostatistics?
Quantitative variables
Qualitative variables
Predictor variables
Can be also called explanatory or independent variable
Affects the outcome variable
Qualitative data
Data are not just numbers, they are numbers with a context
In data analysis, context provides meaning
Context includes;
who collected the data,
how were the data collected,
why the data were collected,
when and where were the data collected
what exactly was collected
Data is;
Power
Wealth
Development
Wisdom
...
Used in
Figure 5: Speed versus car accident.
Machine Learning
Artificial
Intelligence
...
Agricultural
Climate
Mining and Environmental
This will allow the Horn of Africa
to improve its ability to plan for
changing weather patterns
”If you torture the data long enough, it will confess.” Ronald H.
Coase
We use models to ”torture” the data
Statistics includes the process of finding out about patterns in the
real world using data
When solving statistical problems it is often helpful to make models
of real world situations based on;
Observed data
Assumptions about the context, and
Theoretical probability
g (Yi ) = β0 + β1 Xi + ij
Where
- g is the link function - Yi is response of interest - Xi is
exposure/explanatory of interest - β0 is intercept - β1 is slope - ij is
residual term
This model works under the assumption of independent
The standard models can be extended when the basic assumptions are
violated. The extended models can be named as
Mixed Effect
Multilevel
Hierarchichal
...
Can be formulated as
The data were collected to established risk factors affecting infant survival.
Children born in Jimma, Kaffa, and Illubabor were examined for their first
year growth characteristics. There were 8050 infants enrolled in the study.
The variables included were:
ID TotPrg TotBrth Abortion StillB Gravida Event Durtaion parity MamAge
1 5 5 2 0 0 0 360 0 30
2 3 3 2 0 0 0 62 0 23
3 8 8 2 0 0 0 360 0 32
4 5 5 2 0 0 0 356 0 30
5 4 2 1 1 0 0 362 1 23
. . . . . . . . . .
. . . . . . . . . .
. . . . . . . . . .
8050 0 4 4 2 0 0 361 2 35
Stata is widely used in social science research and the most used
statistical software on campus.
Graphics
Data editor
Data viewer
‘do’
Log
It is case sensitive!
First set the memory of Stata: set memory 100m
Many Options:
Manually enter data into the Stata Data Editor
Copy data into the Data Editor from another source (ex.: Excel)
Importing an ASCII (text) file.
Reading in an Excel spreadsheet (tab- or comma- delimited text file).
Open existing Stata Data file with file extension .dta
Use a conversion package (StatTransfer) to read in data from another
package (eg, SPSS, SAS data, . . . ).
This is necessary because Stata can have only one dataset in memory
at a time!
Do file
The commands used to produce the output can be saved in the ”do
file”
It can be also executed by highlighting each command line
Any time we want to run the command, we can open the do file
Log file
Log file is used to save the outputs produced
This can bone by clicking begin under Log menu
File → Log → Begin
If you want to stop saving the output, then suspend using
File → Log → Suspend (or End)
2 Command driven
Consider variables ID, Sex, age, weight, height, marital status (S, M,
D, W), BG, family size, and income.
create an excel file and generate data for 100 sample size
Import the data to stata
Compute BMI from weight and height
Recode age in to 10 years age group
Save all the codes in to do file
Save all the outputs in to log file
Generate nutrition status from the variable BMI (l 18.5, between 18.5
to 24.9, between 25 to 29.9 and ≥ 30)
Label the variable sex in to 1-male and 2-Female
We will therefore analyze the time from birth to death (in days).
For infants still alive when these data were collected, time is the time
from birth to the time of data collection.
The variable event is an indicator for whether time refers to death (1)
or end of study (0).
Possible explanatory variables for time-to-death could be place of
residence and sex of infants, among others.
These data can be described as survival data.
Duration or survival data can generally not be analyzed by
conventional methods such as linear regression.
The main reason for this is that some durations are usually
right-censored.
That is, the endpoint of interest has not occurred during the period
of observation.
Another reason is that survival times tend to have positively skewed
distributions.
Recall
Y dj
S(t)
d = (1 − ).
nj
j|t(j) ≤t
Types of Analysis
------------------------------------------------------------------
8050 total obs.
57 obs. end on or before enter()
------------------------------------------------------------------
7993 obs. remaining, representing
698 failures in single record/single failure data
2591422 total analysis time at risk, at risk from t = 0
earliest observed entry t = 0
last observed exit t = 464
stsum
The following output results:
failure _d: event
analysis time _t: duration
Note: There are 7993 subjects, and if the incidence rate (i.e., the hazard function) could be
assumed to be constant, it would be estimated as 0.00027 per day which corresponds to 0.098
per year.
| Events Events
sexChild | observed expected
---------+-------------------------
female | 318 345.01
male | 380 352.99
---------+-------------------------
Total | 698 698.00
chi2(1) = 4.18
Pr>chi2 = 0.0408
| Events Events
CatBwt | observed expected
------------+-------------------------
normal | 537 611.52
underweight | 126 51.48
------------+-------------------------
Total | 663 663.00
chi2(1) = 117.03
Pr>chi2 = 0.0000
| Events Events
CatBwt | observed expected(*)
------------+-------------------------
normal | 537 612.58
underweight | 126 50.42
------------+-------------------------
Total | 663 663.00
chi2(1) = 123.26
Pr>chi2 = 0.0000
The baseline hazard is the hazard when all covariates are zero, and is
left unspecified.
Outcome: Duration.
------------------------------------------------------------------------------
_t | Haz. Ratio Std. Err. z P>|z| [95% Conf. Interval]
-------------+----------------------------------------------------------------
CatBwt | 2.78779 .2759798 10.36 0.000 2.29612 3.384742
------------------------------------------------------------------------------
Underweight infants are nearly 3 times more likely to die at any given time (given that
they remained in the study until that time) as compared to that of normal.
stcox CatBwt
Two factors...
Outcome: Duration.
Male infants are 17% more likely to die at any given time (given that they remained in the
study until that time) than that of females.
We can use the estat function in stata after the final model is fitted
Time: Time
----------------------------------------------------------------
chi2 df Prob>chi2
------------+---------------------------------------------------
global test 9.34 5 0.0961
----------------------------------------------------------------
The data structure for longitudinal data in its long form is;
8
7
6
log−CD4 Count
5
4
3
2
observed mean
Loess curve
1
0 10 20 30 40 50 60 70
Time (Month)
15
haemoglobin
Relaps
Not relaps
relaps
10
10
9
log(WBC)
Relaps
Not relaps
relaps
Ignoring the correlation among the observation will badly affect the
standard error
This further will affect the estimation of the confidence interval and
p-value
Which will later affect the decision of the test
Thus, we have to account the correlation in our analysis
For mixed effects models, the random effects in the models represent
the influence of each individual (cluster) on the repeated observations
that is not captured by the observed covariates
Fixed effects: β
Random effects: bi
Variance components: elements in D and Σi
εi
1 ρ1 ρ2 ρ3
ρ1 1 ρ1 ρ2
R(α) =
ρ2 ρ1 1 ρ3
ρ3 ρ2 ρ1 1
individual profiles
average evolution
correlation structure
6000 10000
8000
wt
4000
2000
0 2 4 6 8 10 12
AGE
STATA CODE:
gen wt=weight if ind<=30
xtline wt, overlay t(age) tlabel(#6) i(ind) legend(off) /*
*/ title("Individual Profiles") scheme(s2mono)
10000
10000
8000
8000
6000
6000
wt
wt
4000
4000
2000
2000
0 2 4 6 8 10 12 0 2 4 6 8 10 12
AGE AGE
STATA CODE:
xtline wt if sex==0, overlay t(age) tlabel(#6) i(ind) legend(off)/*
/* title("Individual Profiles for Females") scheme(s2mono)
xtline wt if sex==1, overlay t(age) tlabel(#6) i(ind) legend(off)/*
*/ title("Individual Profiles for Males") scheme(s2mono)
10000
8000
Mean weight
6000 4000
2000
0 2 4 6 8 10 12
AGE
10000
8000
Mean weight
6000 4000
2000
0 2 4 6 8 10 12
AGE
Females Males
Subjects high (low) at baseline seem to remain high (low) over time
This data set is taken from Potthoff and Roy, Biometrika (1964)
A set of measurements of the distance from the pituitary gland to the
pterygo-maxillary fissure taken every two years from 8 years of age
until 14 years of age on a sample of 27 children, 16 males and 11
females
The data, were collected by orthodontists from x-rays of the
children’s skulls
The individual profile plot and the mean profile for males and females
separately are given below
Research question: Is dental growth related to gender?
Variables in the data include: (1) obs: observation number (2) group:
group according to height of the mother (1: small; 2: medium; 3:
tall) (3) child: subject identification number (4) age: age at which the
observation is taken (years) (5) height: the response measured (cm)
34
30
30
26
26
measure
measure
22
22
18
18
14
14
6 8 10 12 14 16 6 8 10 12 14 16
age age
STATA CODE:
xtline measure if sex==2, overlay t(age) tlabel(6(2)16) i(ind) legend(off)/*
/* title("Individual Profiles for Girls") scheme(s2mono) ylabel(14(4)34)
Yi = Xi β + Zi bi + εi
bi ∼ N(0, D), εi ∼ N(0, Σi ),
b1 , . . . , bN , ε1 , . . . , εN independent
Fixed effects: β
Random effects: bi
Variance components: elements in D and Σi
individual profiles
average evolution
correlation structure
Female Male
10000
8000
Mean weight
6000
4000
2000
0 2 4 6 8 10 12 0 2 4 6 8 10 12
AGE
Graphs by SEX
Female Male
10000
8000
Mean weight
6000
4000
2000
0 2 4 6 8 10 12 0 2 4 6 8 10 12
AGE
95% CI mean weight
Graphs by SEX
From individual profiles, the variability seems almost the same among
the two groups.
30
28
Mean Distance
24 22
20 26
8 10 12 14
age
Boys Girls
Boys Girls
30
28
Mean Distance
26
24
22
20
8 10 12 14 8 10 12 14
age
95% CI Mean Distance
Graphs by sex
20 22 24 26 28 20 22 24 26 28 30
d1
26
24
22
20
18
28
d2
26
24
22
20
20 22 24 26 28 30
d3
20 22 24 26 28 30
d4
18 20 22 24 26 20 22 24 26 28 30
Marginally:
Yi ∼ N(Xi β, Zi DZi0 + Σi )
Formally, one can consider likelihood ratio test for model comparison
gen wt=weight/1000
xtmixed wt i.sex age c.age#c.age i.sex#c.age ||ind: age, cov(u
Hence a linear mean, with random intercept and slope is a good idea...
----------------------------------------------------------------------
measure | Coef. Std. Err. z P>|z| [95% Conf. Interval]
----------------------------------------------------------------------
2.sex .9352545 1.682763 0.56 0.578 -2.362901 4.233409
age | .7884227 .0880488 8.95 0.000 .6158502 .9609952
|
sex#c.age |
2 |-.2987766 .138049 -2.16 0.030 -.5693477 -.0282054
|
_cons | 16.27586 1.0727 15.17 0.000 14.17341 18.37831
----------------------------------------------------------------------
The stata command to compute AIC and BIC after you fit the model
is
STATA CODE:
estat ic
0 12 24 36
Time or visit
Control At risk
Coefficients relating Y to X
Gi is a gender indicator.
Aij is age of the i th infant at time j (also the time variable).
Inference valid in large samples even if distribution of Y and or
variance of Y are incorrectly specified
Valid inference if data are Missing Completely At Random (MCAR)
even if variance model is wrong
That is, the naive standard errors based on the assumed correlation
structure are updated using the information the sample data provide
about the dependence
The result is robust standard errors that are usually more appropriate
than ones based solely on the assumed correlation structure
GEE
Coefficients relating Y to X
Inference valid in large samples even if distribution of Y and or
variance of Y are incorrectly specified
Valid inference if data are Missing Completely At Random (MCAR)
even if variance model is wrong
GLMM
Coefficients relating Y to X conditional on b
Valid inference generally requires correct specification of distribution
of Y and of variance of Y
Valid inference if data are Missing At Random (MAR)
Gi is a gender indicator.
Aij is age of the i th infant at time j (also the time variable).
sex#age |
1 |-.0180339 .0173981 -1.04 0.300 -.0521336 .0160658
|
_cons |-1.872127 .1011225 -18.51 0.000 -2.070323 -1.67393
STATA CODE:
c1 c2 c3 c4 c5 c6 c7
r1 1.0000
r2 0.1458 1.0000
r3 0.1458 0.1458 1.0000
r4 0.1458 0.1458 0.1458 1.0000
r5 0.1458 0.1458 0.1458 0.1458 1.0000
r6 0.1458 0.1458 0.1458 0.1458 0.1458 1.0000
r7 0.1458 0.1458 0.1458 0.1458 0.1458 0.1458 1.0000
Compare the standard errors of the ‘model based’ and the ‘robust’
versions.
Interpretation of parameter estimates?
--------------------------------------------------------------------
Random-effects Parameters | Estimate Std. Err. [95% Conf. Interval]
--------------------------------------------------------------------
ind: Identity |
sd(_cons)| 1.20833 .075294 1.06941 1.365295
--------------------------------------------------------------------
LR test vs. logistic regression: chibar2(01) = 222.16
Prob>=chibar2 = 0.0000
STATA CODE:
xtgee Gamb i.Village i.Season i.Village#c.Time, i( ID ) t( Time ) corr(exc) link(log) family(poisson) eform
STATA CODE:
σ02
corr (Yij , Yij 0 ) =
σ02 + σ 2
If the correlation is zero that means the observation within cluster are
no more similar than the observations from different cluster
mcartest depvars
Imputation
Mean imputation
Multiple imputation
Parametric Non-parametric
Unpaired t-test Wilcoxon rank sum test
Mann-Whitney U test
Paired t-test Sign test
Wilcoxon signed rank test
ANOVA Kruskal-Wallis test
Repeated measures ANOVA Freedman test
Control 2 1 4
(25) (10) (35)
Treatment 5 3 6
(36) (26) (40)
1
PH0 (S1 = s1 , ..., Sn = sn ) = N
n
WS ≥ c
H0 : θ = 0
Zi = Yi − Xi , i = 1, 2, ..., n
are mutually independent;
come from a continuous population.
Hypotheses:
H0 : θ ≤ 0
H1 : θ > 0
At alpha level of significance, reject H0 if T − < tα or p < α.
Where tα is to be obtained from table.
Hypotheses:
H0 : θ ≥ 0
H1 : θ < 0
At alpha level of significance, reject H0 if T + < tα or p < α.
Where tα is to be obtained from table.
Hypotheses:
H0 : θ = 0
H1 : θ 6= 0
At alpha level of significance, reject H0 if the smallest of
T + orT − < tα/2 or p < α.
k
X Rj 2
12
H= − 3(n + 1)
n(n + 1) nj
j=1
+-----------------------+
| BMI3 | Obs | Rank Sum |
|------+-----+----------|
| 1 | 14 | 229.00 |
| 2 | 10 | 257.00 |
| 3 | 17 | 375.00 |
+-----------------------+
where
Ti = Trend value at year i
Si = Seasonal value at time i
Ci = Cyclical value at year i
Ii = Irregular (random) value at year i
Univariate time series data typically arise from the collection of many
data points over time from a single source, such as from a person,
country, financial instrument, etc
Understand the past: What happened over the last years, months?
Prediction: use the model to predict future values of the time series
Intervention analysis: how does a single event change the time series?
Thus, data may be skewed, with mean and variation not constant,
non-normally distributed, and not randomly sampled or independent
There are two main approaches used to analyze time series (1) in the
time domain or (2) in the frequency domain
ARIMA (2,0,1)
Kinds of processes:
Random (stochastic) process
Deterministic process
Mixed
20000
15000
malaria cases
10000
5000
0
0 20 40 60 80 100 120
Month
1000
800
600
Rain Fall
400
200
0
0 20 40 60 80 100 120
Month
Tadesse A. Ayele (Universities of Gondar) Figure 18: Plot of time series data. November 4, 2021 289 / 292
Forecasting
Exponential smoothing
Trend modeling