
Running head: CHILDREN OF IMMIGRANTS: MISSING DATA

Children of Immigrants Study

Khawater Elgosbi

Northern Illinois University


Fall 2019

The data set was obtained from the Blackboard site for the ETR 790 class. It is a large data set from the Children of Immigrants study, and several of its variables have missing data.

https://webcourses.niu.edu/bbcswebdav/pid-5888112-dt-content-rid-57346272_2/xid-57346272_2

For this paper I chose six variables from this data set. The first four are treated as continuous variables. Student’s grade (V5) has four response options: seventh grade, eighth grade, ninth grade, and tenth grade. Parent satisfaction with child education (P131) has three options: very satisfied, satisfied, and not satisfied. Parental neighborhood satisfaction (P108) has the same satisfaction options as P131. The last continuous variable is mother’s age (V40), which is treated as a scale variable. The two categorical variables are respondent’s sex (SexFac) and respondent US citizen (UScitezenFac). SexFac has two options, male and female, coded 1 and 2. UScitezenFac has three options: yes, no, and don’t know. Responses of don’t know were treated as missing data.
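The recoding of don’t know responses as missing can be illustrated with a minimal sketch. The raw codes below (1 = yes, 2 = no, 3 = don’t know) and the vector names are assumptions made for illustration, not values taken from the study file.

```r
# Hypothetical raw codes: 1 = yes, 2 = no, 3 = don't know (assumed coding)
us_citizen_raw <- c(1, 2, 3, 1, 3, 2)

# Only codes 1 and 2 are declared as levels, so every 3 becomes NA
us_citizen_fac <- factor(us_citizen_raw,
                         levels = c(1, 2),
                         labels = c("Yes", "No"))

print(us_citizen_fac)

# Number of responses that ended up as missing
n_missing <- sum(is.na(us_citizen_fac))
print(n_missing)
```

Because any value not listed in `levels` is converted to `NA`, the don’t know responses are then counted with the other missing values in later steps.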

The Latent Variable

In the Children of Immigrants data set, we can treat child’s wellness as a latent variable. No single variable in the data set measures “child’s wellness” directly, so it is a hidden (unobserved) variable. Instead, we measure demographic and behavioral data from the participants: student’s grade, respondent’s sex, respondent US citizen, parent satisfaction with child education, parent neighborhood satisfaction, and mother’s age. Taken together, these separate measurements can be used to judge “child’s wellness” based on the demographics and experiences of the respondents and the values reported by a variety of parents.

The measurement model of a latent variable with effect indicators is the set of relationships (modeled as equations) in which the latent variable is the predictor of the indicator variables. Using structural equation modeling (SEM) techniques, we can identify the variance shared among the observed variables in the data set (i.e., student’s grade, respondent’s sex, respondent US citizen, parent satisfaction with child education, parent neighborhood satisfaction, and mother’s age). The latent variable (child’s wellness) is then estimated by analyzing the variances and covariances of these observed variables.

Therefore, the relationship between the latent variable and its indicators can be written as a set of regression models. Although the relationships between the latent variable and the observed variables are not given directly by the data set, we can still find them in the variances and covariances of the variables and model them from the values of the observed variables.
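As a sketch of the regression equations described above, a one-factor measurement model with six effect indicators could be written as follows. The notation is an assumption introduced for illustration: x_i is the i-th observed variable, ξ is the latent variable (child’s wellness), ν_i is an intercept, λ_i is a loading, and δ_i is a measurement error that is assumed to be uncorrelated with ξ and with the other errors.

```latex
\begin{align}
x_i &= \nu_i + \lambda_i \xi + \delta_i, \qquad i = 1, \dots, 6, \\
\operatorname{Var}(x_i) &= \lambda_i^{2}\,\phi + \theta_i, \\
\operatorname{Cov}(x_i, x_j) &= \lambda_i \lambda_j\,\phi \qquad (i \neq j),
\end{align}
```

Here φ = Var(ξ) and θ_i = Var(δ_i). The last two lines show why the loadings can be estimated from the variances and covariances of the observed variables even though ξ itself is never observed.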

Descriptive Statistics for Missing Data

To assess the missing data, I ran descriptive statistics. SexFac had n = 5262 with no missing values; the frequencies for males and females were 2575 and 2687, proportions of 0.489 and 0.511. Student’s grade also had n = 5262 with no missing data; the mean was 8.462, and the frequencies were 1 in seventh grade, 2832 in eighth grade, 2428 in ninth grade, and 1 in tenth grade. UScitezenFac had n = 4073 with 1189 missing values; the frequencies were 2582 for yes and 1491 for no, proportions of 0.634 and 0.366. Parent neighborhood satisfaction had n = 2431 with 2831 missing values; the mean was 1.666, and the proportions for values 1, 2, 3, 4, and 5 were 0.527, 0.339, 0.081, 0.045, and 0.007, respectively. Mother’s age had n = 4069 with 1193 missing values and a mean of 40.3. Finally, parent satisfaction with child education had n = 2424 with 2838 missing values; the mean was 1.561, the frequencies for values 1, 2, and 3 were 1262, 965, and 197, and the corresponding proportions were 0.521, 0.398, and 0.081.
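The proportions and means above follow directly from the reported frequencies. The short sketch below recomputes a few of them in base R; the object names are made up for illustration, and only the counts come from the text.

```r
# Frequencies as reported in the text
sex_counts   <- c(Male = 2575, Female = 2687)
grade_counts <- c("7" = 1, "8" = 2832, "9" = 2428, "10" = 1)
p131_counts  <- c("1" = 1262, "2" = 965, "3" = 197)

# Proportions are counts divided by the number of non-missing cases
sex_props  <- round(prop.table(sex_counts), 3)
p131_props <- round(prop.table(p131_counts), 3)

# Means of the coded variables, weighted by how often each code occurs
grade_mean <- weighted.mean(c(7, 8, 9, 10), w = grade_counts)
p131_mean  <- weighted.mean(1:3, w = p131_counts)

print(sex_props)
print(p131_props)
print(round(c(grade = grade_mean, P131 = p131_mean), 3))
```

The results agree with the values reported by the descriptive statistics: 0.489 and 0.511 for SexFac, 0.521, 0.398, and 0.081 for P131, and means of 8.462 and 1.561.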

Graphical Representation of Missing Data

According to the graph, 1458 cases have no missing values on any variable. Ten cases are missing only parent satisfaction with child education, and three cases are missing only parent neighborhood satisfaction. Additionally, 1750 cases are missing both parent neighborhood satisfaction and parent satisfaction with child education. A further 458 cases are missing only mother’s age, and one case is missing both mother’s age and parent neighborhood satisfaction. In 393 cases, three values are missing: parent neighborhood satisfaction, parent satisfaction with child education, and mother’s age. Another 323 cases are missing only respondent US citizen, two cases are missing both parent satisfaction with child education and respondent US citizen, and three cases are missing both respondent US citizen and parent neighborhood satisfaction. In 520 cases, three variables are missing: parent satisfaction with child education, parent neighborhood satisfaction, and respondent US citizen. Finally, 178 cases are missing respondent US citizen and mother’s age; two cases are missing parent satisfaction with child education, mother’s age, and respondent US citizen; and 161 cases are missing all four of these variables.
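The pattern counts above can be cross-checked against the per-variable missing counts from the descriptive statistics. The sketch below encodes each pattern as a row (1 = missing, 0 = observed) and sums the cases; the matrix layout and object names are assumptions for illustration, while the counts are those given in the text.

```r
# One row per missingness pattern described in the text (1 = missing, 0 = observed)
# Columns: P131, P108, V40 (mother's age), US (respondent US citizen)
patterns <- rbind(
  c(0, 0, 0, 0),  # complete cases
  c(1, 0, 0, 0),
  c(0, 1, 0, 0),
  c(1, 1, 0, 0),
  c(0, 0, 1, 0),
  c(0, 1, 1, 0),
  c(1, 1, 1, 0),
  c(0, 0, 0, 1),
  c(1, 0, 0, 1),
  c(0, 1, 0, 1),
  c(1, 1, 0, 1),
  c(0, 0, 1, 1),
  c(1, 0, 1, 1),
  c(1, 1, 1, 1)   # missing on all four variables
)
colnames(patterns) <- c("P131", "P108", "V40", "US")

# Number of cases showing each pattern, in the same order as the rows
counts <- c(1458, 10, 3, 1750, 458, 1, 393, 323, 2, 3, 520, 178, 2, 161)

# Each row is weighted by its count, then summed within each column
total_cases <- sum(counts)
missing_per_variable <- colSums(patterns * counts)

print(total_cases)
print(missing_per_variable)
```

The patterns add up to 5262 cases, and the column totals are 2838 (P131), 2831 (P108), 1193 (V40), and 1189 (US), which matches the missing counts reported in the descriptive statistics.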

Missing Data

The following data frame visualization shows the missing data by column. Respondent’s sex and student’s grade have no missing data, while parent neighborhood satisfaction and parent satisfaction with child education have the most missing data.



Columns of Missing Data

Bar Graph by Missing Cases



Little’s MCAR test

I also applied Little’s MCAR test. The chi-square statistic was statistically significant, χ²(14) = 51, p < .001 (reported as P = 0), using alpha = 0.05, so we reject the null hypothesis that the data are missing completely at random. Because the missing data do not appear to be missing completely at random, I ran tests for MAR.
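As a rough check on the reported result, the p-value can be recomputed from the test statistic and its degrees of freedom, assuming the statistic follows a chi-square distribution under the null hypothesis. The significance level of 0.05 used below is the conventional value and is an assumption here.

```r
chi_square <- 51    # test statistic reported in the text
df         <- 14    # degrees of freedom reported in the text
alpha      <- 0.05  # conventional significance level (an assumption)

# Upper-tail probability of the chi-square distribution
p_value <- pchisq(chi_square, df = df, lower.tail = FALSE)

# TRUE means the MCAR null hypothesis is rejected at the chosen alpha
reject_mcar <- p_value < alpha

print(format.pval(p_value, eps = 0.001))
print(reject_mcar)
```

A p-value this small is typically printed as 0 after rounding, which is consistent with the value reported for the test.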

Tests of MAR

Logistic regressions on the missingness of each variable with missing cases showed that missingness in mother’s age was significantly predicted by SexFac (p < 0.01), while missingness in parental satisfaction with child education was significantly predicted by student’s grade (p = 0.040). Respondent US citizen, on the other hand, was not significantly predicted. Because at least one variable had p less than 0.05, I assumed MAR, which means the data are missing at random and the missingness depends on observed data. I used multiple imputation to reduce bias. After multiple imputation, respondent US citizen was still not significantly predicted (p = 0.65), while mother’s age remained statistically significant for SexFac and parental satisfaction with child education remained statistically significant for student’s grade. To maintain the sample size, each missing value was therefore replaced with a similar observed value.
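The logic of this check can be illustrated with a small simulation: if missingness in one variable depends on an observed variable, a logistic regression of a missingness indicator on that variable should detect it. Everything in the sketch below is simulated and hypothetical (the variable names, the coding of sex, and the strength of the effect); it is not the study data.

```r
set.seed(1)  # reproducible simulated data, not the study data
n   <- 2000
sex <- rbinom(n, size = 1, prob = 0.5)  # hypothetical coding: 0 = male, 1 = female
age <- rnorm(n, mean = 40, sd = 6)      # hypothetical mother's age

# MAR mechanism: the chance that age is missing depends only on observed sex
p_missing <- plogis(-2 + 1.5 * sex)
age[runif(n) < p_missing] <- NA

# Missingness indicator: 1 if age is missing, 0 if it is observed
age_miss <- as.numeric(is.na(age))

# Logistic regression of the indicator on the observed predictor
mar_check   <- glm(age_miss ~ sex, family = binomial)
sex_p_value <- summary(mar_check)$coefficients["sex", "Pr(>|z|)"]

print(summary(mar_check)$coefficients)
```

A significant coefficient for `sex` indicates that missingness is related to an observed variable, which is evidence against MCAR and is compatible with MAR. A test of this kind cannot rule out that missingness also depends on unobserved values.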

Using the multiply-imputed data set for regression

Because my missing data are MAR, I chose to conduct multiple imputation. In the regression using the multiply-imputed data set, SexFac was a statistically significant predictor of parent satisfaction with child education (P131), p < 0.01, while UScitezenFac was not a statistically significant predictor, p = 0.421.
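Pooling results across imputed data sets is done with Rubin’s rules, which is what mice::pool applies. The sketch below shows the arithmetic for a single regression coefficient in base R. The five estimates and standard errors are made-up numbers for illustration only and are not results from this study.

```r
# Made-up coefficient estimates and standard errors from m = 5 imputed data sets
estimates  <- c(0.110, 0.125, 0.118, 0.105, 0.122)
std_errors <- c(0.030, 0.031, 0.029, 0.030, 0.032)
m <- length(estimates)

# Rubin's rules
pooled_estimate  <- mean(estimates)     # average of the m estimates
within_variance  <- mean(std_errors^2)  # average sampling variance
between_variance <- var(estimates)      # variance of the estimates across imputations
total_variance   <- within_variance + (1 + 1 / m) * between_variance
pooled_se        <- sqrt(total_variance)

print(round(c(estimate = pooled_estimate, se = pooled_se), 4))
```

The pooled standard error is larger than the average within-imputation standard error because it also carries the uncertainty caused by the missing data, which is the reason for pooling instead of analyzing a single completed data set.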

Appendix A

R Code

#Convert sex to factor
children$SexFac<-factor(children$sex,
                        levels=c(1,2),
                        labels=c("Male", "Female"))

#Convert respondent US citizen to factor (codes other than 1 and 2 become NA)
children$UScitezenFac<-factor(children$UScitezen,
                              levels=c(1,2),
                              labels=c("Yes","No"))

#Create a subset of the children data
library(dplyr)
childrenSubset1<- dplyr::select(children,
                                V40,P131,P108,V5,SexFac,UScitezenFac)

#Descriptive statistics using Hmisc::describe
library(Hmisc)
Hmisc::describe(childrenSubset1)

#Missing data plot using visdat package
library(visdat)
visdat::vis_miss(childrenSubset1)

#Checking missing data patterns
library(mice)
mice::md.pattern(childrenSubset1)

#Missing value plots with naniar package
library(naniar)
naniar::gg_miss_upset(childrenSubset1,
                      nsets=10) #Note: nsets = number of vars

#Plotting the number of variables with missing values for each case
naniar::gg_miss_case(childrenSubset1)
naniar::gg_miss_case(childrenSubset1,
                     facet=SexFac)
naniar::gg_miss_case(childrenSubset1,
                     facet=UScitezenFac)

#Testing for MCAR
library(BaylorEdPsych)
MCARtest1<-BaylorEdPsych::LittleMCAR(childrenSubset1)
MCARtest1$chi.square
MCARtest1$df
MCARtest1$p.value
MCARtest1$missing.patterns
MCARtest1$amount.missing

#Missingness indicators (1 = missing, 0 = observed)
#Assumption: created here because they are not columns of the original data
UScitezenmiss<-as.numeric(is.na(childrenSubset1$UScitezenFac))
V40miss<-as.numeric(is.na(childrenSubset1$V40))
P131miss<-as.numeric(is.na(childrenSubset1$P131))

#Logistic regression predicting missingness of UScitezenFac from V40, P108, P131, SexFac, V5
UScitezenMARcheck<-glm(data=childrenSubset1,
                       UScitezenmiss ~ V40 + P108 + P131 + SexFac + V5,
                       family=binomial)
summary(UScitezenMARcheck)

#Logistic regression predicting missingness of V40 from UScitezenFac, P108, P131, SexFac, V5
V40MARcheck<-glm(data=childrenSubset1,
                 V40miss ~ UScitezenFac + P108 + P131 + SexFac + V5,
                 family=binomial)
summary(V40MARcheck)

#Logistic regression predicting missingness of P131 from UScitezenFac, P108, V40, SexFac, V5
P131MARcheck<-glm(data=childrenSubset1,
                  P131miss ~ UScitezenFac + P108 + V40 + SexFac + V5,
                  family=binomial)
summary(P131MARcheck)

#Creating 5 multiply-imputed data sets with predictive mean matching
ChildrenMultimpute <-mice::mice(childrenSubset1,m=5,maxit=50,meth='pmm',seed=500)
summary(ChildrenMultimpute)

#Examining descriptive statistics for the multiply-imputed data sets
ChildrenMultimpute1 <- mice::complete(ChildrenMultimpute, action=1)
ChildrenMultimpute2 <- mice::complete(ChildrenMultimpute, action=2)
ChildrenMultimpute3 <- mice::complete(ChildrenMultimpute, action=3)
ChildrenMultimpute4 <- mice::complete(ChildrenMultimpute, action=4)
ChildrenMultimpute5 <- mice::complete(ChildrenMultimpute, action=5)
summary(ChildrenMultimpute1)
summary(ChildrenMultimpute2)
summary(ChildrenMultimpute3)
summary(ChildrenMultimpute4)
summary(ChildrenMultimpute5)

#Multiple regression and pooling results
#Outcome is P131 (parent satisfaction with child education), as reported in the text
ChildrenMultimputeReg <- with(ChildrenMultimpute,
                              lm(P131 ~ SexFac + UScitezenFac))
ChildrenMultimputeRegSummary<-summary(mice::pool(ChildrenMultimputeReg))
round(ChildrenMultimputeRegSummary,digits=3)

References

ETR 790 Class Blackboard. (Fall 2019). Children of Immigrants [Data file]. Retrieved from https://webcourses.niu.edu/bbcswebdav/pid-5888112-dt-content-rid-57346272_2/xid-57346272_2
