Khawater Elgosbi
The data set was obtained from the Blackboard site of the ETR 790 class. It is a large data set
from a study of the children of immigrants, and several of its variables have missing data.
https://webcourses.niu.edu/bbcswebdav/pid-5888112-dt-content-rid-57346272_2/xid-
57346272_2.
For this paper I chose six variables from this data set. I treated the first four as continuous
variables. The first is student’s grade (V5), which has four response options: seventh grade,
eighth grade, ninth grade, and tenth grade. The second is parent satisfaction with child
education (P131), which has three options: very satisfied, satisfied, and not satisfied. The third
is parental neighborhood satisfaction (P108), which has the same satisfaction options as the
previous variable. The last continuous variable is mother’s age (V40), which is a scale variable.
The two categorical variables are respondent’s sex (SexFac) and respondent US citizen
(UScitezenFac). SexFac has two options, male and female, with values 1 and 2. UScitezenFac has
three options: yes, no, and don’t know. I treated don’t know as missing data.
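As a minimal sketch of this recoding, the base R snippet below uses a small made-up vector rather than the study data, and it assumes that don’t know is stored as code 3. Any code that is not listed in `levels` becomes NA when the factor is built.

```r
# Hypothetical raw codes: 1 = yes, 2 = no, 3 = don't know (coding assumed for illustration)
rawCitizen <- c(1, 2, 3, 1, 2, 3, 1)

# Codes that are not listed in `levels` are converted to NA by factor()
citizenFac <- factor(rawCitizen, levels = c(1, 2), labels = c("Yes", "No"))

# Counts of Yes, No, and NA
table(citizenFac, useNA = "ifany")

# Number of "don't know" responses that are now treated as missing
nMissing <- sum(is.na(citizenFac))
```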
In the Children of Immigrants data set, we can consider child’s wellness as a latent
variable. No single measurement in the data set is called “child’s wellness”, so it cannot be
measured directly and is instead a hidden variable. What we do measure are demographic and
behavioral data from the participants: student’s grade, respondent’s sex, respondent US
citizen, parent satisfaction with child education, parent neighborhood satisfaction, and
mother’s age. These separate measurements can be used to judge child’s wellness, based on
the demographics and experiences of the respondents and on the values reported by a variety of
parents.
CHILDREN OF IMMIGRANTS: MISSING DATA 3
The measurement model of a latent variable with effect indicators is the set of
relationships (modeled as equations) in which the latent variable is the predictor of the
indicator variables. Using structural equation modeling (SEM) techniques, we can examine
the variance shared among the observed variables in the data set (i.e., student’s grade,
respondent’s sex, respondent US citizen, parent satisfaction with child education, parent
neighborhood satisfaction, and mother’s age), and the latent variable (child’s wellness) is then
estimated by analyzing the variances and covariances of these observed variables. The
measurement model can therefore be written as a set of regression models, one for each
observed variable. Although the relationships between the latent variable and the observed
variables are not given directly by the data set, we can still find them in the variances and
covariances of the variables. They can
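To make the idea of a measurement model concrete, here is a small simulation sketch under stated assumptions: one standardized latent variable, three indicators, and made-up loadings. None of these values come from the Children of Immigrants data. It illustrates that the covariance between two indicators is approximately the product of their loadings, which is why the covariances of the observed variables carry information about the latent variable.

```r
set.seed(790)                 # arbitrary seed for reproducibility
n <- 100000                   # simulated sample size (hypothetical)
loadings <- c(0.8, 0.6, 0.4)  # hypothetical loadings, not estimates from the study

# Latent variable with variance 1
wellness <- rnorm(n)

# Each indicator = loading * latent variable + independent error
indicators <- sapply(loadings, function(l) l * wellness + rnorm(n, sd = sqrt(1 - l^2)))
colnames(indicators) <- c("x1", "x2", "x3")

# Sample covariance matrix of the indicators
observedCov <- cov(indicators)

# Model-implied covariance of x1 and x2 is the product of their loadings
impliedCov12 <- loadings[1] * loadings[2]
c(observed = observedCov["x1", "x2"], implied = impliedCov12)
```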
To deal with the missing data, I first ran descriptive statistics. For SexFac, n was 5262
with no missing values; there were 2575 males and 2687 females, with proportions of 0.489
and 0.511. For student’s grade, n was 5262, also with no missing data; the mean was 8.462,
and the frequencies were 1 in seventh grade, 2832 in eighth grade, 2428 in ninth grade, and 1
in tenth grade. For UScitezenFac, n was 4073 with 1189 missing values; the frequencies were
2582 for yes and 1491 for no, with proportions of 0.634 and 0.366. Parent neighborhood
satisfaction had n = 2431 with 2831 missing values; the mean was 1.666, and the proportions
for values 1, 2, 3, 4, and 5 were 0.527, 0.339, 0.081, 0.045, and 0.007, respectively. Mother’s
age had n = 4069 with 1193 missing values and a mean of 40.3. Finally, parent satisfaction
with child education had n = 2424 with 2838 missing values; the mean was 1.561, the
frequencies for values 1, 2, and 3 were 1262, 965, and 197, and the corresponding proportions
were 0.521, 0.398, and 0.081.
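The quantities reported for each variable (valid n, number missing, mean, frequencies, and proportions) can be computed in base R along these lines. The vector below is a made-up example, not the study data, and the object names are chosen only for illustration.

```r
# Hypothetical responses on a 3-point satisfaction item; NA marks a missing answer
satisfaction <- c(1, 2, 1, NA, 3, 1, NA, 2, 1, 2)

nValid   <- sum(!is.na(satisfaction))         # number of observed values
nMissing <- sum(is.na(satisfaction))          # number of missing values
itemMean <- mean(satisfaction, na.rm = TRUE)  # mean of the observed values
freq     <- table(satisfaction)               # frequency of each response option
prop     <- round(prop.table(freq), 3)        # proportion among the observed values

list(n = nValid, missing = nMissing, mean = itemMean, freq = freq, prop = prop)
```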
According to the graph, 1458 cases have no missing values on any variable, 10 cases
have only one missing value, on parent satisfaction with child education, and only three
cases have one missing value, on parent neighborhood satisfaction. Additionally, 1750
cases have two missing values, on parent neighborhood satisfaction and parent satisfaction with
child education. Another 458 cases have one missing value, on mother’s age, and one case has
two missing values, on mother’s age and parent neighborhood satisfaction. A further 393 cases
have three missing values, on parent neighborhood satisfaction, parent satisfaction with child
education, and mother’s age.
A total of 323 cases have one missing value, on respondent US citizen, and two cases have two
missing values, on parent satisfaction with child education and respondent US citizen. Three cases
have missing values on respondent US citizen and parent neighborhood satisfaction. Another 520
cases have missing values on three variables: parent satisfaction with child education, parent
neighborhood satisfaction, and respondent US citizen. Finally, 178 cases have missing values on
respondent US citizen and mother’s age, two cases have missing values on parental satisfaction
with child education, mother’s age, and respondent US citizen, and 161 cases have missing
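Pattern counts of this kind can also be tabulated directly. The base R sketch below uses a tiny made-up data frame whose column names are borrowed from the paper for readability; the values are hypothetical. Each case gets a pattern string in which 1 means observed and 0 means missing.

```r
# Hypothetical data frame with three variables; NA marks a missing value
toy <- data.frame(
  V40  = c(40, NA, 38, 45, NA, 41),
  P108 = c(1, 2, NA, 1, NA, NA),
  P131 = c(2, 1, NA, 1, NA, NA)
)

# Pattern string per case: 1 = observed, 0 = missing, in column order
patternPerCase <- apply(!is.na(toy), 1, function(r) paste(as.integer(r), collapse = ""))

# Number of cases that share each pattern
patternCounts <- sort(table(patternPerCase), decreasing = TRUE)
patternCounts
```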
Missing Data
The following data frame visualization shows the columns with missing data: respondent’s sex
and student’s grade have no missing data, while parent neighborhood and parent satisfaction with
I also applied Little’s MCAR test. The chi-square statistic was statistically significant,
χ²(14) = 51, p < .001 (reported by the software as P = 0), using alpha = 0.05, so we reject the
null hypothesis that the data are missing completely at random. The missing data don’t appear to be
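As a quick check of this decision, the p value can be recovered from the reported chi-square statistic and degrees of freedom with base R. This sketch assumes the conventional alpha of 0.05, and the object names are illustrative only.

```r
chiSquare <- 51    # test statistic reported for Little's MCAR test
df        <- 14    # degrees of freedom reported for the test
alpha     <- 0.05  # conventional significance level (assumed)

# Upper-tail probability of the chi-square distribution
pValue <- pchisq(chiSquare, df = df, lower.tail = FALSE)

# Critical value the statistic must exceed to reject MCAR at this alpha
criticalValue <- qchisq(1 - alpha, df = df)

# TRUE means the MCAR null hypothesis is rejected
rejectMCAR <- pValue < alpha
c(pValue = pValue, criticalValue = criticalValue)
```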
Tests of MAR
The logistic regressions on the variables with missing cases showed that missingness in
mother’s age is significantly predicted by SexFac (p < 0.01), while missingness in parental
satisfaction with child education is significantly predicted by student’s grade (p = 0.040). On
the other hand, missingness in respondent US citizen is not significantly predicted. Because at
least one missingness indicator was predicted by an observed variable with p less than 0.05, I
assumed MAR, which means the data are missing at random and the missingness depends on
observed data. I used multiple imputation to decrease the bias. After multiple imputation, I
found that respondent US citizen is still not significantly predicted (p = 0.65), while mother’s
age is significantly predicted by SexFac and parental satisfaction with child education is
significantly predicted by student’s grade. The sample size is therefore maintained by replacing
each missing value with an observed value from a similar case.
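The logic of this check can be sketched with simulated data in base R. Everything below is hypothetical: the variables are generated at random, and missingness in the simulated mother’s age is made to depend on sex by construction, so the logistic regression should flag sex as a predictor of the missingness indicator.

```r
set.seed(2019)  # arbitrary seed
n <- 2000       # simulated sample size (hypothetical)
sex <- factor(sample(c("Male", "Female"), n, replace = TRUE))
grade <- sample(8:9, n, replace = TRUE)
motherAge <- round(rnorm(n, mean = 40, sd = 6))

# Make missingness in motherAge depend on sex only (a MAR mechanism by construction)
probMissing <- ifelse(sex == "Male", 0.35, 0.15)
motherAge[runif(n) < probMissing] <- NA

# Missingness indicator: 1 = missing, 0 = observed
toy <- data.frame(sex, grade, ageMiss = as.integer(is.na(motherAge)))

# Logistic regression of the indicator on the fully observed variables
marCheck <- glm(ageMiss ~ sex + grade, data = toy, family = binomial)
coefTable <- summary(marCheck)$coefficients
round(coefTable, 3)
```

A small p value for the sex coefficient indicates that missingness is related to an observed variable, which is consistent with MAR rather than MCAR.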
Since my missing data are MAR, I chose to conduct multiple imputation. In the regression
using the multiply imputed data sets, SexFac is a statistically significant predictor of parent
satisfaction with child education (P131), p < 0.01, while UScitezen is not a statistically
significant predictor, p = 0.421.
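Pooling results across imputed data sets follows Rubin’s rules: the pooled estimate is the average of the per-imputation estimates, and its variance combines the within-imputation and between-imputation variances. The base R sketch below works through the arithmetic with made-up estimates and standard errors; they are not the values from this analysis.

```r
# Hypothetical coefficient estimates and standard errors from m = 5 imputed data sets
estimates <- c(0.12, 0.15, 0.10, 0.14, 0.13)
stdErrors <- c(0.040, 0.042, 0.039, 0.041, 0.040)
m <- length(estimates)

pooledEstimate  <- mean(estimates)    # average of the m estimates
withinVariance  <- mean(stdErrors^2)  # average sampling variance
betweenVariance <- var(estimates)     # variance of the estimates across imputations

# Total variance = within + (1 + 1/m) * between
totalVariance <- withinVariance + (1 + 1 / m) * betweenVariance
pooledSE <- sqrt(totalVariance)

c(estimate = pooledEstimate, se = pooledSE)
```

The pooled standard error is larger than the average per-imputation standard error because it also reflects the uncertainty caused by the missing data.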
Appendix A
R Codes
#Recode sex and US citizen as factors; codes outside levels (e.g., don't know) become NA
children$SexFac<-factor(children$sex,
levels=c(1,2),
labels=c("Male", "Female"))
children$UScitezenFac<-factor(children$UScitezen,
levels=c(1,2),
labels=c("Yes","No"))
library(dplyr)
#Keep the six analysis variables (UScitezenFac is the recoded factor used in the plots below)
childrenSubset1<- dplyr::select(children,
V40,P131,P108,V5,SexFac,UScitezenFac)
library(Hmisc)
Hmisc::describe (childrenSubset1)
library(visdat)
visdat::vis_miss(childrenSubset1)
library(mice)
mice::md.pattern(childrenSubset1)
library(naniar)
#Plotting the combinations of variables that are missing together
naniar::gg_miss_upset(childrenSubset1)
#Plotting the number of variables with missing values for each case
naniar::gg_miss_case(childrenSubset1)
naniar::gg_miss_case(childrenSubset1,
facet=SexFac)
naniar::gg_miss_case(childrenSubset1,
facet=UScitezenFac)
library(BaylorEdPsych)
MCARtest1<-BaylorEdPsych::LittleMCAR(childrenSubset1)
MCARtest1$chi.square
MCARtest1$df
MCARtest1$p.value
MCARtest1$missing.patterns
MCARtest1$amount.missing
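#Logistic regression predicting missingness of UScitezen from V40, P108, P131, SexFac, V5
#Uscitezenmiss is assumed to be a 0/1 missingness indicator already added to childrenSubset1
#The formula is reconstructed from the comment above
UScitezenMARcheck<-glm(data=childrenSubset1,
Uscitezenmiss ~ V40 + P108 + P131 + SexFac + V5,
family=binomial)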
summary(UScitezenMARcheck)
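#Logistic regression predicting missingness of V40 from UScitezenFac, P108, P131, SexFac, V5
#V40miss is assumed to be a 0/1 missingness indicator already added to childrenSubset1
#The formula is reconstructed from the comment above
V40MARcheck<-glm(data=childrenSubset1,
V40miss ~ UScitezenFac + P108 + P131 + SexFac + V5,
family=binomial)
summary(V40MARcheck)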
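#Logistic regression predicting missingness of P131 from UScitezenFac, P108, V40, SexFac, V5
#P131miss is assumed to be a 0/1 missingness indicator already added to childrenSubset1
#The formula is reconstructed from the comment above
P131MARcheck<-glm(data=childrenSubset1,
P131miss ~ UScitezenFac + P108 + V40 + SexFac + V5,
family=binomial)
summary(P131MARcheck)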
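ChildrenMultimpute<-mice::mice(childrenSubset1,m=5,maxit=50,method='pmm',seed=500)
summary(ChildrenMultimpute)
#Completed data sets 1 to 5 (this extraction step is assumed; it is absent from the listing)
ChildrenMultimpute1<-mice::complete(ChildrenMultimpute,1)
ChildrenMultimpute2<-mice::complete(ChildrenMultimpute,2)
ChildrenMultimpute3<-mice::complete(ChildrenMultimpute,3)
ChildrenMultimpute4<-mice::complete(ChildrenMultimpute,4)
ChildrenMultimpute5<-mice::complete(ChildrenMultimpute,5)
summary(ChildrenMultimpute1)
summary(ChildrenMultimpute2)
summary(ChildrenMultimpute3)
summary(ChildrenMultimpute4)
summary(ChildrenMultimpute5)
#Regression on each imputed data set, then pooled (model assumed from the results text)
ChildrenMultimputeReg<-with(ChildrenMultimpute,lm(P131~SexFac+UScitezenFac))
ChildrenMultimputeRegSummary<-summary(mice::pool(ChildrenMultimputeReg))
#Round only the numeric columns, since newer mice versions add a non-numeric term column
numericCols<-sapply(ChildrenMultimputeRegSummary,is.numeric)
ChildrenMultimputeRegSummary[numericCols]<-round(ChildrenMultimputeRegSummary[numericCols],digits=3)
ChildrenMultimputeRegSummary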
References
ETR 790 Class Blackboard (Fall 2019) Children of Immigrants [data file]. Retrieved from
https://webcourses.niu.edu/bbcswebdav/pid-5888112-dt-content-rid-57346272_2/xid-
57346272_2