Children - Of-Immigrants-Research-Project-Attempt2

Running head: CHILDREN OF IMMIGRANTS: MISSING DATA
Children of Immigrants Study
Khawater Elgosbi
Northern Illinois University

Fall 2019
The data set was obtained from the Blackboard site of the ETR 790 class. It is a large data set
from a study of children of immigrants, and it has missing data on several variables.
https://webcourses.niu.edu/bbcswebdav/pid-5888112-dt-content-rid-57346272_2/xid-57346272_2
For this paper I chose six variables from this data set. The first four are treated as continuous
variables. Student’s grade (V5) has four response options: seventh grade, eighth grade, ninth
grade, and tenth grade. Parent satisfaction with child education (P131) has three options: very
satisfied, satisfied, and not satisfied. Parental neighborhood satisfaction (P108) uses similar
satisfaction ratings, coded 1 to 5 in the data. The last continuous variable is mother’s age (V40),
which is treated as a scale variable. The two categorical variables are respondent’s sex (SexFac)
and respondent’s US citizenship (UScitezenFac). SexFac has two options, male and female, coded
1 and 2. UScitezenFac has three options: yes, no, and don’t know; the last is treated as missing
data.
The Latent Variable
In the Children of Immigrants data set, we can treat child’s wellness as a latent variable. No
single measurement in the data set is called “child’s wellness,” so it cannot be observed
directly; it is a hidden variable. Instead, we measure demographic and behavioral data from the
participants: student’s grade, respondent’s sex, respondent’s US citizenship, parent satisfaction
with child education, parent neighborhood satisfaction, and mother’s age. Together, these
separate measurements can be used to judge child’s wellness, based on the demographics and
experiences of the respondents and on the values reported by a variety of parents.
The measurement model of a latent variable with effect indicators is the set of relationships
(modeled as equations) in which the latent variable is the predictor of the indicator variables.
Using structural equation modeling (SEM) techniques, we can estimate the variances and
covariances among the observed variables in the data set (i.e., student’s grade, respondent’s
sex, respondent’s US citizenship, parent satisfaction with child education, parent neighborhood
satisfaction, and mother’s age). The latent variable (child’s wellness) is then estimated by
analyzing the variances and covariances of these observed variables. The measurement model can
therefore be written as a set of regression models, one for each indicator. Although the
relationships between the latent variable and the observed variables are not given directly in
the data set, they can be inferred from the variances and covariances of the observed variables
and modeled from their values.
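The measurement model described above can be sketched in equation form. This is a minimal illustration, assuming a single latent factor with uncorrelated measurement errors; the symbols (eta for child’s wellness, lambda for loadings, psi for the factor variance, theta for error variances) are notation introduced here and do not come from the data set.

```latex
% x_j : the j-th observed indicator (j = 1, ..., 6)
% \eta : latent variable (child's wellness)
\begin{align}
  x_j &= \nu_j + \lambda_j \eta + \varepsilon_j, \qquad j = 1, \dots, 6, \\
  \operatorname{Var}(x_j) &= \lambda_j^{2}\,\psi + \theta_j, \\
  \operatorname{Cov}(x_j, x_k) &= \lambda_j \lambda_k\,\psi, \qquad j \neq k,
\end{align}
% where \psi = Var(\eta), \theta_j = Var(\varepsilon_j),
% and Cov(\eta, \varepsilon_j) = Cov(\varepsilon_j, \varepsilon_k) = 0.
```

Under these assumptions, the loadings are estimated by matching the model-implied variances and covariances to the sample variances and covariances of the six observed variables.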
Descriptive Statistics for Missing Data
To examine the missing data, I ran descriptive statistics. For SexFac, n was 5262 with no
missing values; the frequencies for males and females were 2575 and 2687, with proportions of
0.489 and 0.511. For student’s grade, n was 5262, also without any missing data; the mean was
8.462, and the frequencies were 1 in seventh grade, 2832 in eighth grade, 2428 in ninth grade,
and 1 in tenth grade. For UScitezenFac, n was 4073 with 1189 missing values; the frequencies
were 2582 for yes and 1491 for no, with proportions of 0.634 and 0.366. Parent neighborhood
satisfaction had n = 2431 with 2831 missing values; the mean was 1.666, and the proportions for
values 1, 2, 3, 4, and 5 were 0.527, 0.339, 0.081, 0.045, and 0.007, respectively. Mother’s age
had n = 4069 with 1193 missing values, and the mean was 40.3. Finally, parent satisfaction with
child education had n = 2424 with 2838 missing values; the mean was 1.561, the frequencies were
1262, 965, and 197 for values 1, 2, and 3, and the corresponding proportions were 0.521, 0.398,
and 0.081.
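As a consistency check, the reported proportions, the grade mean, and the missing counts can be re-derived from the frequencies. The sketch below uses only the counts stated in this section; the object names are illustrative and are not part of the original analysis.

```r
# Frequencies as reported in the text (object names are illustrative)
sex_freq   <- c(Male = 2575, Female = 2687)
grade_freq <- c(`7` = 1, `8` = 2832, `9` = 2428, `10` = 1)
cit_freq   <- c(Yes = 2582, No = 1491)
p131_freq  <- c(`1` = 1262, `2` = 965, `3` = 197)

n_total <- 5262  # total number of cases in the data set

# Proportions among the observed (non-missing) cases
sex_prop  <- round(sex_freq / sum(sex_freq), 3)
cit_prop  <- round(cit_freq / sum(cit_freq), 3)
p131_prop <- round(p131_freq / sum(p131_freq), 3)

# Mean grade as a frequency-weighted average of the grade levels
grade_mean <- sum(as.numeric(names(grade_freq)) * grade_freq) / sum(grade_freq)

# Missing values = total cases minus observed cases
cit_missing  <- n_total - sum(cit_freq)
p131_missing <- n_total - sum(p131_freq)

print(sex_prop)
print(cit_prop)
print(p131_prop)
print(round(grade_mean, 3))
print(c(UScitezenFac = cit_missing, P131 = p131_missing))
```

The recomputed values agree with the figures reported above.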
Graphical Representation of Missing Data
According to the graph, 1458 cases have no missing values on any variable. Ten cases are
missing only parent satisfaction with child education, and three cases are missing only parent
neighborhood satisfaction. Additionally, 1750 cases are missing both parent neighborhood
satisfaction and parent satisfaction with child education. A further 458 cases are missing only
mother’s age, and one case is missing both mother’s age and parent neighborhood satisfaction.
There are 393 cases missing three variables: parent neighborhood satisfaction, parent
satisfaction with child education, and mother’s age. Another 323 cases are missing only
respondent US citizen, two cases are missing both parent satisfaction with child education and
respondent US citizen, and three cases are missing both respondent US citizen and parent
neighborhood satisfaction. There are 520 cases missing three variables: parent satisfaction with
child education, parent neighborhood satisfaction, and respondent US citizen. Finally, 178 cases
are missing both respondent US citizen and mother’s age; two cases are missing parental
satisfaction with child education, mother’s age, and respondent US citizen; and 161 cases are
missing all four of these variables.
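The pattern counts can be cross-checked against the descriptive statistics: the patterns should add up to the full sample, and the cases missing each variable should match the per-variable missing counts. The sketch below encodes the fourteen patterns listed above by hand (1 = missing, 0 = observed); the data frame and its object names are illustrative, not output from the original analysis.

```r
# Missing-data patterns as listed in the text, in the same order.
# n = number of cases with that pattern; 1 = variable missing, 0 = observed.
patterns <- data.frame(
  n            = c(1458, 10, 3, 1750, 458, 1, 393, 323, 2, 3, 520, 178, 2, 161),
  P131         = c(0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1),
  P108         = c(0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1),
  V40          = c(0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1),
  UScitezenFac = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1)
)

# Total number of cases across all patterns
total_cases <- sum(patterns$n)

# Cases missing each variable: weight every pattern indicator by its case count
missing_per_var <- colSums(as.matrix(patterns[, -1]) * patterns$n)

print(total_cases)
print(missing_per_var)
```

The totals reproduce the sample size of 5262 and the missing counts of 2838 (P131), 2831 (P108), 1193 (V40), and 1189 (UScitezenFac) from the descriptive statistics.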
Missing Data
The following data frame visualization shows the columns with missing data. Respondent’s sex
and student’s grade have no missing data, while parent neighborhood satisfaction and parent
satisfaction with child education have the most missing data.

Columns of Missing Data
Bar Graph by Missing Cases

Little’s MCAR test
I also applied Little’s MCAR test. The chi-square statistic was statistically significant,
χ²(14) = 51, p < .001, using alpha = 0.05, so we reject the null hypothesis that the data are
missing completely at random. Because the missing data do not appear to be missing completely
at random, I have to run tests for MAR.
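The decision for Little’s test can be reproduced from the reported statistic alone. The sketch below takes the chi-square value and degrees of freedom from the text and assumes the conventional alpha of 0.05; the object names are illustrative.

```r
# Reported test statistic and degrees of freedom
chi_sq <- 51
df     <- 14
alpha  <- 0.05  # assumed conventional significance level

# Upper-tail p-value of the chi-square distribution
p_value <- pchisq(chi_sq, df = df, lower.tail = FALSE)

# Critical value the statistic must exceed to reject MCAR at this alpha
critical_value <- qchisq(1 - alpha, df = df)

# TRUE means the null hypothesis of MCAR is rejected
reject_mcar <- chi_sq > critical_value

print(p_value)
print(critical_value)
print(reject_mcar)
```

A p-value that software prints as 0 is a rounded value; it is more accurately reported as p < .001.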
Tests of MAR
Logistic regressions on the variables with missing cases showed that missingness in mother’s
age is significantly predicted by SexFac (p < 0.01), while missingness in parental satisfaction
with child education is significantly predicted by student’s grade (p = 0.040). On the other
hand, missingness in respondent US citizen is not significantly predicted by any variable.
Because at least one predictor has p less than 0.05, I assumed MAR, which means the data are
missing at random and the missingness depends on observed data. I used multiple imputation to
reduce bias. After multiple imputation, respondent US citizen was still not significantly
predicted (p = 0.65), while mother’s age remained significantly related to SexFac and parental
satisfaction with child education remained significantly related to student’s grade. The sample
size is therefore maintained by replacing each missing value with an observed value from a
similar case (predictive mean matching).
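The logic of the MAR check can be illustrated on simulated data: if missingness in one variable depends on an observed variable, a logistic regression of the missingness indicator on the observed variables should detect it. The sketch below is a toy example with made-up data and an arbitrary seed; it reuses the variable names SexFac, V5, and V40 only as labels and does not reproduce the study’s results.

```r
set.seed(790)  # arbitrary seed so the simulation is reproducible
n <- 2000

# Simulated stand-ins for two fully observed variables
sim <- data.frame(
  SexFac = factor(sample(c("Male", "Female"), n, replace = TRUE),
                  levels = c("Male", "Female")),
  V5     = sample(8:9, n, replace = TRUE)
)

# Missingness in a third variable (labeled V40) depends on SexFac only,
# which is a MAR mechanism: it depends on observed data
p_miss      <- plogis(-2 + 1.5 * (sim$SexFac == "Female"))
sim$V40miss <- rbinom(n, size = 1, prob = p_miss)

# Logistic regression of the missingness indicator on the observed variables
mar_check <- glm(V40miss ~ SexFac + V5, data = sim, family = binomial)
p_values  <- summary(mar_check)$coefficients[, "Pr(>|z|)"]

print(round(p_values, 4))
```

A small p-value for SexFac in this output is evidence against MCAR. It is consistent with MAR but cannot prove it, because missingness could still depend on unobserved values.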
Using the multiply-imputed data set for regression
Because my missing data are MAR, I chose to conduct multiple imputation. In the regression
using the multiply-imputed data set, SexFac is a statistically significant predictor of parent
satisfaction with child education (P131), p < 0.01, while UScitezenFac is not a statistically
significant predictor, p = 0.421.
Appendix A
R Code
#Convert sex to factor (1 = Male, 2 = Female)
children$SexFac<-factor(children$sex,
  levels=c(1,2),
  labels=c("Male","Female"))
#Convert respondent US citizen to factor; any other code (e.g., don't know) becomes NA
children$UScitezenFac<-factor(children$UScitezen,
  levels=c(1,2),
  labels=c("Yes","No"))
#Create a subset of the children data with the six variables used in the paper
library(dplyr)
childrenSubset1<-dplyr::select(children,
  V40,P131,P108,V5,SexFac,UScitezenFac)
#Descriptive statistics using Hmisc::describe
library(Hmisc)
Hmisc::describe(childrenSubset1)
#Missing data plot using visdat package
library(visdat)
visdat::vis_miss(childrenSubset1)
#Checking missing data patterns
library(mice)
mice::md.pattern(childrenSubset1)
#Missing value plots with naniar package
library(naniar)
naniar::gg_miss_upset(childrenSubset1,
  nsets=10) #Note: nsets = number of vars
#Plotting the number of variables with missing values for each case
naniar::gg_miss_case(childrenSubset1)
naniar::gg_miss_case(childrenSubset1,
  facet=SexFac)
naniar::gg_miss_case(childrenSubset1,
  facet=UScitezenFac)
#Testing for MCAR
library(BaylorEdPsych)
MCARtest1<-BaylorEdPsych::LittleMCAR(childrenSubset1)
MCARtest1$chi.square
MCARtest1$df
MCARtest1$p.value
MCARtest1$missing.patterns
MCARtest1$amount.missing
#In the models below, is.na() builds the missingness indicator (TRUE = missing)
#Logistic regression predicting missingness of UScitezenFac from V40, P108, P131, SexFac, V5
UScitezenMARcheck<-glm(is.na(UScitezenFac)~V40+P108+P131+SexFac+V5,
  data=childrenSubset1,
  family=binomial)
summary(UScitezenMARcheck)
#Logistic regression predicting missingness of V40 from UScitezenFac, P108, P131, SexFac, V5
V40MARcheck<-glm(is.na(V40)~UScitezenFac+P108+P131+SexFac+V5,
  data=childrenSubset1,
  family=binomial)
summary(V40MARcheck)
#Logistic regression predicting missingness of P131 from UScitezenFac, P108, V40, SexFac, V5
P131MARcheck<-glm(is.na(P131)~UScitezenFac+P108+V40+SexFac+V5,
  data=childrenSubset1,
  family=binomial)
summary(P131MARcheck)
#Creating 5 multiply-imputed data sets with predictive mean matching
ChildrenMultimpute<-mice::mice(childrenSubset1,m=5,maxit=50,method='pmm',seed=500)
summary(ChildrenMultimpute)
#Examining descriptive statistics for the multiply-imputed data sets
ChildrenMultimpute1<-mice::complete(ChildrenMultimpute,action=1)
ChildrenMultimpute2<-mice::complete(ChildrenMultimpute,action=2)
ChildrenMultimpute3<-mice::complete(ChildrenMultimpute,action=3)
ChildrenMultimpute4<-mice::complete(ChildrenMultimpute,action=4)
ChildrenMultimpute5<-mice::complete(ChildrenMultimpute,action=5)
summary(ChildrenMultimpute1)
#Multiple regression and pooling results
#Outcome is P131 (parent satisfaction with child education), as reported in the text
ChildrenMultimputeReg<-with(ChildrenMultimpute,
  lm(P131~SexFac+UScitezenFac))
ChildrenMultimputeRegSummary<-summary(mice::pool(ChildrenMultimputeReg))
#print() with digits avoids rounding the non-numeric term column
print(ChildrenMultimputeRegSummary,digits=3)
References
ETR 790 Class Blackboard. (Fall 2019). Children of Immigrants [Data file]. Retrieved from
https://webcourses.niu.edu/bbcswebdav/pid-5888112-dt-content-rid-57346272_2/xid-57346272_2
