For the purpose of the study, we are looking for a measure that captures inter-group vaccine hesitancy/uptake
within a district. The index of disparity is a suitable choice for this purpose.
Definition 1.1. The index of disparity, ID, for n groups in a given district is given as
follows [1]:
\[
ID = \frac{\sum_{i=1}^{n-1} \frac{|r_i - r_{rp}|}{n-1}}{r_{rp}}
\]
where $r_i$ is the rate of vaccination for the ith group, $r_{rp}$ is the rate of vaccination for the reference
The drawback of the measure above is that it does not take into account the relative size of
the group with respect to the whole population. To remedy this problem, I am proposing a new
Definition 1.2. The weighted index of disparity IDw for a given population of group i from the
\[
ID_w = \frac{\sum_{i=1}^{n-1} \frac{|r_i - r_{rp}|}{n-1} \cdot \frac{p_i}{p_1 + \cdots + p_n}}{r_{rp}}
\]
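As a minimal sketch, Definitions 1.1 and 1.2 can be computed in Python as follows. The function names, the argument layout (non-reference groups passed separately from the reference group), and the sample numbers are illustrative assumptions, not part of the study. The sketch also assumes that the reference group is one of the n groups, so the sum runs over the other n - 1 groups while the population total includes all n.

```python
def index_of_disparity(rates, ref_rate):
    """Unweighted ID: average absolute deviation of the n-1 non-reference
    group rates from the reference rate, divided by the reference rate."""
    n_minus_1 = len(rates)
    total_deviation = sum(abs(r - ref_rate) for r in rates)
    return total_deviation / n_minus_1 / ref_rate


def weighted_index_of_disparity(rates, pops, ref_rate, ref_pop):
    """Weighted ID: each deviation is additionally scaled by the group's
    share p_i / (p_1 + ... + p_n) of the total population."""
    n_minus_1 = len(rates)
    total_pop = sum(pops) + ref_pop  # all n groups, reference included
    weighted_sum = sum(
        abs(r - ref_rate) / n_minus_1 * (p / total_pop)
        for r, p in zip(rates, pops)
    )
    return weighted_sum / ref_rate


# Hypothetical district: two non-reference groups plus one reference group.
rates = [0.6, 0.5]   # vaccination rates of the non-reference groups
pops = [300, 200]    # population sizes of the non-reference groups
ref_rate = 0.8       # vaccination rate of the reference group
ref_pop = 500        # population size of the reference group

id_value = index_of_disparity(rates, ref_rate)
idw_value = weighted_index_of_disparity(rates, pops, ref_rate, ref_pop)
```

Because every population share is at most 1, the weighted index in this sketch can never exceed the unweighted one for the same rates.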
Still, both the weighted and non-weighted indices of disparity are sensitive to inter-group disparity: but
such as ethnicity, gender, income brackets, and others, the index of disparity is a suitable measure
Definition 2.1. For an observation i, if we assume that the response variable $y_i$ and the explanatory
\[
y_i = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_p x_{ip} + e_i
\]
Or more concisely,
\[
Y = X\beta + e
\]
Given that X is the design matrix, assuming that $E(e_i) = 0$, $\mathrm{Var}(e_i) = \sigma^2$ for all i, and
Theorem 2.1. The least squares estimate for the parameter β in a linear model is
\[
\hat{\beta} = (X^T X)^{-1} X^T Y
\]
given that $X^T X$ is invertible (it usually is, unless you use two exactly dependent variables).
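A minimal numerical sketch of Theorem 2.1, assuming NumPy is available. The helper name `ols_estimate` and the toy data are hypothetical and chosen only for illustration. The normal equations are solved directly rather than by forming the inverse explicitly, which gives the same estimate when $X^T X$ is invertible.

```python
import numpy as np


def ols_estimate(X, y):
    """Least squares estimate: solve (X^T X) beta = X^T y for beta."""
    return np.linalg.solve(X.T @ X, X.T @ y)


# Toy data (illustrative only): y = 1 + 2x exactly, with no noise.
x = np.array([0.0, 1.0, 2.0, 3.0])
X = np.column_stack([np.ones_like(x), x])  # design matrix with intercept
y = 1.0 + 2.0 * x

beta_hat = ols_estimate(X, y)

# Cross-check against NumPy's built-in least squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

If two columns of the design matrix are exactly dependent, `np.linalg.solve` raises a `LinAlgError`, which corresponds to the invertibility condition in the theorem.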
Linear models are popular largely because of the following theorem, known as the
Gauss-Markov theorem.
Theorem 2.2. Among all linear unbiased estimators, the least squares estimate has the smallest
Although one could get a non-linear or biased estimator which can fit a model better, we usually
This is the most crucial part of the OLS, which is essentially a collection of techniques to pick the
There are two distinct techniques which we use to do variable selection; the first is stepwise
Here is the range of variable selection techniques commonly used:
3. Adjusted $R^2$ (criterion)
6. Mallows's $C_p$ (criterion)
7. Cross-validation (criterion)
All the techniques above will generally yield different models. Cross-validation is the most
computationally expensive technique among these but a favorite among many practitioners. We
can select the best models chosen by 1-6 and compare them using cross-validation.
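The comparison of candidate models by cross-validation can be sketched as follows, assuming NumPy. The function `kfold_cv_mse`, the choice of five folds, the mean squared prediction error as the score, and the synthetic data are all assumptions made for illustration; the candidate models stand in for whatever the other selection techniques return.

```python
import numpy as np


def kfold_cv_mse(X, y, k=5, seed=0):
    """Mean squared prediction error of an OLS fit, averaged over k folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    fold_errors = []
    for i, test_idx in enumerate(folds):
        # Fit on every fold except the i-th, then predict the i-th.
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        residuals = y[test_idx] - X[test_idx] @ beta
        fold_errors.append(np.mean(residuals ** 2))
    return float(np.mean(fold_errors))


# Synthetic data (illustrative only): y depends on x1 but not on x2.
rng = np.random.default_rng(42)
n = 60
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 2.0 + 3.0 * x1 + rng.normal(scale=0.5, size=n)

ones = np.ones(n)
candidates = {
    "intercept only": np.column_stack([ones]),
    "x1": np.column_stack([ones, x1]),
    "x1 + x2": np.column_stack([ones, x1, x2]),
}

# Lower cross-validated error indicates better out-of-sample prediction.
cv_scores = {name: kfold_cv_mse(X, y) for name, X in candidates.items()}
best_model = min(cv_scores, key=cv_scores.get)
```

Each candidate is refit k times, which is why cross-validation costs more than criteria computed from a single fit.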
We need to ensure that our model is robust by checking all the assumptions in Definition 2.1 and
removing outliers. Usually the OLS library in R can correct most issues.
Last but not least, we need to conduct hypothesis testing to ensure that our observations are
grounds for our hypothesis, the first step towards showing causality. The more popular tests
include the F-test and t-tests, which are equivalent forms of hypothesis testing for model significance
but require the assumption that our observations are normally distributed. Another interesting
test is the permutation test, which is also widely used because it involves no normality
assumption in the OLS. We usually test our model against an empty model (we essentially argue
that our hypothesized parameters fit the observations better than no parameters at all).
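A permutation test of a fitted model against the empty (intercept-only) model can be sketched as follows, assuming NumPy. The function names, the use of $R^2$ as the test statistic, the 999 permutations, and the synthetic data are illustrative assumptions. For a fixed set of predictors, $R^2$ is a monotone function of the overall F statistic, so ranking permutations by either gives the same p-value.

```python
import numpy as np


def r_squared(X, y):
    """R^2 of an OLS fit; X must contain an intercept column."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    tss = np.sum((y - y.mean()) ** 2)
    return 1.0 - rss / tss


def permutation_p_value(X, y, n_perm=999, seed=0):
    """Share of shuffled responses that fit at least as well as the real one."""
    rng = np.random.default_rng(seed)
    observed = r_squared(X, y)
    exceed = 0
    for _ in range(n_perm):
        # Shuffling y breaks any link between response and predictors,
        # which is what the empty model asserts.
        if r_squared(X, rng.permutation(y)) >= observed:
            exceed += 1
    # The +1 terms count the observed arrangement as one permutation.
    return (1 + exceed) / (1 + n_perm)


# Synthetic data (illustrative only) with a strong linear signal.
rng = np.random.default_rng(1)
x = rng.normal(size=40)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=40)
X = np.column_stack([np.ones_like(x), x])

p_value = permutation_p_value(X, y)
```

A small p-value here means that very few random reshufflings of the response fit as well as the observed data, without relying on a normality assumption.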
References
[1] Jeffrey N. Pearcy and Kenneth G. Keppel. A summary measure of health disparity. Public Health
Reports, 117(3):273-280, 2002.