
Predicting Literacy Proficiency Level from Age, Sex, Previous Year

Preschool Enrollment, and Race

Lidia Lentz

Department of Education Research and Evaluation, Northern Illinois University

ETR 560: Computer Data Analysis

Dr. Thomas Smith

12/07/2020
PREDICTING LITERACY PROFICIENCY

Predicting Literacy Proficiency Level from Age, Sex, Previous Year Preschool Enrollment, and

Race

Kindergarten standards for literacy and mathematics have risen over the years, placing

higher expectations on children. The percentage of children enrolled in preschool and

pre-kindergarten had risen only slightly from the previous census, to 66% of four-year-olds

and 43% of three-year-olds (Yoshikawa et al., 2016). Low enrollment is striking, considering

the importance of early education that research has shown over the years. This trend raises

two questions: Why are enrollment rates low? What are the barriers to early education, and

what are the implications of going without it?

Literature Review

Early education in this study is defined as preschool, a structured educational setting

for children under four years old, and pre-kindergarten, a structured educational setting in

the school year before kindergarten. Early education has shifted from play time to a structured

learning environment focused on all aspects of being a learner. Bassok et al. (2016) found

that kindergarten expectations increased in rigor from 1998 to 2010 across five domains:

literacy, language, mathematics, social behavior, and functional behavior. Because the bar is

set higher for children entering kindergarten, the effects of early education and its

importance for later achievement have been the topic of many studies.

Literacy, the focus of this research, is complex because of the numerous skill sets it

encompasses. Literacy is a combination of different reading and language skills,

particularly oral language, phonological/phonemic awareness, alphabetic knowledge, print

knowledge, and invented spelling (Slutzky & Debruin-Parecki, 2019). Kindergarten students are

expected to identify sight words, read pattern texts, and understand story elements and the

differences between genres. Because expectations for entering kindergarten students are

considerably higher in literacy than in years past, it is vital that children build the

necessary reading skills before kindergarten.

Children across America come from different racial groups and family dynamics. Family

dynamics can affect whether a child attends preschool and the type of preschool attended.

Past studies have identified race as a predictor of academic struggle. Bowdon et al. (2019)

found that Hispanic and Black students were 28 and 24 days behind White students, a

significant gap. Across studies, race was commonly a predictor of achievement. Gender, on the

other hand, was not a predictor of academic achievement (St. Clair-Christman et al., 2011).

Identifying predictors and the relationships among them will help early elementary

stakeholders identify areas of focus. Previous research shows some compelling results about

possible predictors such as race, gender, and preschool enrollment (St. Clair-Christman et al.,

2011; Morrow, 2005). This study does not focus on the years before pre-kindergarten (ages 0-3)

or the impact of childcare during that time on later academic achievement. Instead, it considers

all of these predictors together and examines whether they are significantly related to

literacy proficiency.

Purpose Statement

The purpose of this data analysis is to investigate age, sex, race, and previous-year

preschool enrollment as possible predictors of Spring literacy proficiency level for

students entering kindergarten.

Research Questions

1. Does age significantly predict Spring literacy proficiency level for children entering

kindergarten?

2. After controlling for age, do sex and previous-year preschool enrollment significantly

predict Spring literacy proficiency level for children entering kindergarten?

3. After controlling for age, sex, and previous-year preschool enrollment, does race

significantly predict Spring literacy proficiency level for children entering kindergarten?

Methodology

Dataset

The dataset used in this study combines the datasets of two studies, the Multi-State

Study of Pre-Kindergarten and the State-Wide Early Education Programs (SWEEP) study, which

together collected information across 11 states with a focus on early childhood education.

The participants were children, 4 to 6 years old, and early education teachers. In total,

721 classrooms and 2,982 pre-kindergarten children participated (Early et al., 2013).

The Multi-State Study of Pre-Kindergarten was conducted in the 2001-2002 school year

in six states that had a dedicated focus on, and initiatives for, early education. The

sample was a stratified random sample of 40 centers or schools from a selected list. Data

were collected from participating teachers and families. Eligible children were entering

kindergarten the next year, did not qualify for an IEP, and understood directions in English

or Spanish (Early et al., 2013).

The State-Wide Early Education Programs (SWEEP) study was conducted in the 2003-2004

school year in five states. These states differed from those in the Multi-State Study of

Pre-Kindergarten so as to represent states that use other initiatives and funding models.

State-funded pre-kindergarten sites were selected at random. A total of 465 sites from the

states' lists participated, and only two discontinued in the spring. As in the previous study,

data were collected from teachers and families, and eligible participants were selected in the

same way (Early et al., 2013). The data collected for both studies included demographic

information.

Variables of Interest

The full dataset 2, renamed MSS_SWEEP, was imported into the R statistical software

package for analysis. Dataset 1 was not used because it contained the teachers' information and

classroom observations, which were not needed for this analysis. Missing values were

specified for all variables used in the data analysis. A data frame omitting cases with

missing values, MSregdata1cc, was created (Early et al., 2013).

Outcome Variable

The dependent (outcome) variable is a composite score computed as the mean of five items

related to literacy proficiency, each rated by teachers on a rating scale. Teachers were

asked to rate each participant on a scale from 1 to 5, where 1 = not yet, 2 = beginning,

3 = in progress, 4 = intermediate, and 5 = proficient. These five variables

(CSLANGPF2 through CSLANGPF6) were combined into a composite score (Proflitskill) for each

participant. Proflitskill is a quantitative, ratio-level variable. The composite scale items

rated participants' comprehension, letter identification, phonological skills, prediction

skills, and early reading skills (Early et al., 2013).
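
A minimal sketch of how a mean-of-items composite behaves is given here. The item names and ratings are hypothetical illustration values, not MSS_SWEEP data. It shows that averaging with na.rm=TRUE uses whichever items were rated, so children with a partial set of ratings still receive a score, while a child with no ratings receives NaN and is later dropped by na.omit().

```r
# Toy ratings for three hypothetical children on five 1-5 items
# (illustrative values only, not taken from MSS_SWEEP).
items <- data.frame(
  item2 = c(3, 2, NA),
  item3 = c(4, NA, NA),
  item4 = c(2, 2, NA),
  item5 = c(3, 1, NA),
  item6 = c(3, 3, NA)
)

# Mean of the available items per child; na.rm = TRUE skips missing items.
composite <- rowMeans(items, na.rm = TRUE)
print(composite)

# A child with no rated items gets NaN, which is.na() and na.omit() treat as missing.
print(is.na(composite))
```

Under this approach, the first child's score is the mean of five ratings and the second child's is the mean of four, so composites are not all based on the same number of items.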

Predictor Variables

Four variables from MSS_SWEEP, a combination of both studies, were selected as

predictors in this data analysis. The predictors are age in years (ASMTAGEPS), a

quantitative, ratio-level variable; gender (CHGENP), a categorical variable; race (CHRACEP),

a nominal variable; and previous-year preschool enrollment (CHATNDPRKP), a binary nominal

variable (Early et al., 2013).

Age was converted to a numeric variable named ageR. Sex, previous-year preschool

enrollment, and race were converted to factors and assigned labels. The sex variable has two

levels, 1 = "Male" and 2 = "Female", and was renamed sexR. Previous-year preschool enrollment

has two levels, 1 = "Yes" and 2 = "No", and was named prekR. prekR was releveled (prekR2 in

Appendix C) so that the reference category is 'No'; the regression coefficient therefore

describes children who attended preschool the previous year. The race variable has six levels,

1 = Latino, 2 = African American, 3 = Native American, 4 = Asian, 5 = White, and

6 = Multiracial, and was renamed raceR. raceR was releveled as raceRR so that the reference

category is 'White'.
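
Because the choice of reference category determines which group a regression coefficient describes, a short sketch of releveling may be useful. The vector prek_raw and the object names below are hypothetical. Under R's default treatment contrasts, the first level of a factor is the reference, and relevel() moves a chosen level to the front.

```r
# Hypothetical enrollment codes (1 = Yes, 2 = No), mirroring the codebook layout.
prek_raw <- c(1, 2, 2, 1, 2, 2)
prek <- factor(prek_raw, levels = c(1, 2), labels = c("Yes", "No"))

# By default the first level ("Yes") is the reference category.
print(levels(prek))

# relevel() moves the chosen level to the front, making it the reference.
prek_ref_no <- relevel(prek, ref = "No")
print(levels(prek_ref_no))

# The dummy column in the design matrix is named after the non-reference level,
# so the coefficient here would describe "Yes" relative to "No".
dummy_names <- colnames(model.matrix(~ prek_ref_no))
print(dummy_names)
```

Checking levels() after releveling is a quick way to confirm which category a reported coefficient is being compared against.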

Analytic Methods Used

Descriptive statistics and graphical representations were used to summarize the

variables and assess their distributions using the data frame MSregdata1cc. Cronbach's alpha

was calculated for the composite score, Proflitskill, to check the reliability of the items. A

Q-Q plot was used to check the normality of the distribution. Multiple linear regression was

used for this analysis. Additionally, the Shapiro-Wilk test of normality and a score test for

non-constant variance (homogeneity of variance) were computed.

Results

Descriptive Statistics

Descriptive statistics were computed for each variable. Figure 1 shows a visual

representation of each predictor variable (Early et al., 2013). Skewness for ageR is 0, which

indicates a symmetric distribution. The distributions of raceR (skewness = -0.22) and sexR

(skewness = -0.02) are slightly negatively skewed, and that of prekR (skewness = -1.04) is

negatively skewed. Negative kurtosis for all variables indicates platykurtic distributions:

ageR, -0.85; raceR, -1.70; sexR, -2.00; and prekR, -0.92.

Figure 1
Descriptive statistics for each variable in the subset.
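
As a rough illustration of what the skewness and kurtosis statistics measure, the sketch below computes both from their basic moment definitions. The helper functions skewness() and excess_kurtosis() are hypothetical names written for this example; packaged routines may apply small-sample adjustments, so their output can differ slightly. The example variable is a perfectly balanced two-category code, which shows why a near 50/50 binary variable has skewness near 0 and kurtosis near -2.

```r
# Moment-based skewness and excess kurtosis (population-moment definitions).
skewness <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

excess_kurtosis <- function(x) {
  x <- x[!is.na(x)]
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

# A balanced two-category variable coded 1/2 (hypothetical data).
balanced <- rep(c(1, 2), each = 50)
print(skewness(balanced))         # 0: the distribution is symmetric
print(excess_kurtosis(balanced))  # -2: the minimum possible excess kurtosis
```

For categorical predictors these statistics mostly reflect how evenly the categories are filled, so they are best read alongside the frequency tables.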

Cronbach's alpha was computed to determine the reliability of the items in the

Proflitskill composite scale score, collected in a data frame named Proflitskillitems. Raw

alpha was 0.88 (based on covariances) and standardized alpha was 0.88 (based on correlations).

This value of alpha indicates adequate reliability. Removing any single item would not

meaningfully increase alpha, so all items were retained in the composite score. Descriptive

statistics were computed and a histogram was constructed (Figure 2). The results show a

negative skewness statistic, indicating a "left-skewed" distribution, and a slightly positive

kurtosis statistic, indicating a somewhat "peaked" distribution. A normal distribution can be

assumed.
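
For readers unfamiliar with how alpha is obtained, the following sketch computes raw (covariance-based) Cronbach's alpha directly from its formula, alpha = k/(k - 1) × (1 - sum of item variances / variance of the total score). The function cronbach_alpha() and the simulated items are assumptions made for illustration. The sketch uses listwise deletion, which may differ from how a packaged routine handles missing ratings.

```r
# Raw Cronbach's alpha from item variances and total-score variance.
cronbach_alpha <- function(items) {
  items <- na.omit(as.data.frame(items))  # listwise deletion for this sketch
  k <- ncol(items)
  item_vars <- sapply(items, var)
  total_var <- var(rowSums(items))
  (k / (k - 1)) * (1 - sum(item_vars) / total_var)
}

# Five simulated items that share a common trait plus item-specific noise.
set.seed(1)
trait <- rnorm(200)
toy_items <- as.data.frame(sapply(1:5, function(i) trait + rnorm(200, sd = 0.7)))

alpha_toy <- cronbach_alpha(toy_items)
print(round(alpha_toy, 2))
```

Alpha rises as the items share more common variance and as the number of items grows, which is why a five-item scale with strongly related items can reach values in the high .80s.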

A Q-Q plot was constructed for Proflitskill, and the points fell close to a straight

line, which suggests an approximately normal distribution. Descriptive statistics were

calculated, with plots shown in Figure 2. The descriptive statistics show a positive skewness

statistic, indicating a "right-skewed" distribution, and a negative kurtosis statistic,

indicating a somewhat "flattened" distribution. Values of skew.2SE and kurt.2SE were more

extreme than ±1.0, which is evidence of statistically significant (p < .05) skewness and

kurtosis.



The Shapiro-Wilk test for normality was conducted. The descriptive results showed a 95%

confidence interval for the mean of 2.438 ± 0.039 and a coefficient of variation of SD/mean =

0.997/2.438 = 0.409. The null hypothesis was that the data come from a normal distribution.

The null hypothesis is rejected, W = 0.952, p < .001. There is a statistically significant

departure from normality for the composite "Literacy Proficiency Level" scores. The Hmisc

descriptive output reported an information statistic of 0.997; because this value is close to

1, it suggests a high degree of continuity in this variable.
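
The decision rule for the Shapiro-Wilk test can be sketched as follows. The scores are simulated right-skewed values, not the study data, and the object names are hypothetical. The null hypothesis is normality, so a small p-value leads to rejecting it.

```r
set.seed(42)
# Simulated right-skewed scores (illustration only).
scores <- rexp(500, rate = 1)

sw <- shapiro.test(scores)
print(sw$statistic)  # W; values near 1 are consistent with normality
print(sw$p.value)

# Decision at alpha = .05: a small p-value means normality is rejected.
reject_normality <- sw$p.value < 0.05
print(reject_normality)

# Coefficient of variation = SD / mean, here using the values reported in the text.
cv_reported <- round(0.997 / 2.438, 3)
print(cv_reported)
```

With samples in the thousands, the test detects even modest departures from normality, so it is usually read together with the Q-Q plot rather than on its own.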

Figure 2

Q-Q Plot of Composite Scale Score

Inferential Statistics

A simple linear regression was computed to predict "Literacy Proficiency Level" from

age, using the equation Proflitskill = b0 + b1(ageR). The model was computed in RStudio:

Proflitskill = -0.63458 + 0.61172(ageR), R2 = 0.03881. An F test of the null hypothesis was

conducted, F(1, 2091) = 84.44, p < 2.2e-16; because p < .05, we reject the null hypothesis.

It is a small effect (R2 = 0.03881).
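
The overall F statistic can be recovered from R2 and the degrees of freedom, which offers a quick consistency check on reported results. The helper f_from_r2() is a hypothetical function written for this sketch; it assumes a model with an intercept, k predictors, and n complete cases (here n = 2093, implied by the residual degrees of freedom of 2091).

```r
# F statistic for H0: R^2 = 0, from R^2, number of predictors k, and sample size n.
f_from_r2 <- function(r2, k, n) {
  df1 <- k
  df2 <- n - k - 1
  f <- (r2 / df1) / ((1 - r2) / df2)
  c(F = f, df1 = df1, df2 = df2,
    p = pf(f, df1, df2, lower.tail = FALSE))
}

# Values reported above: R^2 = 0.03881 with one predictor and 2091 residual df.
res <- f_from_r2(r2 = 0.03881, k = 1, n = 2093)
print(round(res[["F"]], 1))

# The p-value falls below machine precision, which R prints as "< 2.2e-16".
print(res[["p"]] < .Machine$double.eps)
```

The result agrees with the reported F(1, 2091) = 84.44 up to rounding of R2, and it shows why the p-value is best reported as an inequality rather than an exact value.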

The scatterplot in Figure 3 is not excessively curved, so a linear relationship is

suggested. A formal test from the car package was computed to test the null hypothesis that

the residuals have constant variance. The test results were χ2(1) = 7.432077, p = 0.0064071.

Since p < .05, we reject the null hypothesis of constant variance. The residual plots and

histogram in Figure 3 further show that the homoscedasticity assumption has not been met. The

Shapiro-Wilk test for normality of residuals rejected the null hypothesis, W = 0.9652,

p < 2.2e-16, because p < .05.

Figure 3

Scatterplot and residual plots of simple regression model proflitskill ~ age

𝑦̂𝑖 = −0.63458 + 0.61172𝑥𝑖

Multiple linear regression predicting "Literacy Proficiency Level" from age, previous-year

preschool enrollment, and sex was computed, R2 = 0.08108; that is, 8.1% of the variation in

"Literacy Proficiency Level" is explained by the full set of predictors. This is an increase

in R2 of .04227 over the age-only model. An F test of the null hypothesis was computed,

F(3, 2089) = 61.44, p < 2.2e-16; because p < .05, we reject the null hypothesis. This

combination of three predictors significantly predicts perceived Literacy Proficiency Level.

The Shapiro-Wilk test for normality of residuals rejected the null hypothesis, W = 0.96681,

p < .05. The distribution is positively skewed with negative kurtosis, as can be seen in the

histogram in Figure 4. The non-constant variance score test gave χ2(1) = 6.880831,

p = 0.0087125. All VIF values were approximately 1.0, which means there is no variance

inflation due to multicollinearity. Plots can be seen in Figure 4. The age-only model and this

model were compared, F(2, 2089) = 48.041, p < 2.2e-16; because p < .05, we reject the null

hypothesis.
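
A comparison of nested models of this kind can be sketched with anova() on two lm fits. The data below are simulated, and the coefficients and object names are arbitrary illustration values rather than the study's. The sketch also shows that the same partial F follows from the change in R2.

```r
set.seed(7)
n <- 300
age  <- runif(n, 4, 6)
sex  <- factor(sample(c("Male", "Female"), n, replace = TRUE))
prek <- factor(sample(c("No", "Yes"), n, replace = TRUE))

# Simulated outcome; the effect sizes are arbitrary.
y <- 0.6 * age + 0.2 * (sex == "Female") + 0.4 * (prek == "Yes") + rnorm(n)
toy <- data.frame(y, age, sex, prek)

m1 <- lm(y ~ age, data = toy)
m2 <- lm(y ~ age + prek + sex, data = toy)

# Partial F test: do prek and sex add anything beyond age?
cmp <- anova(m1, m2)
print(cmp)

# The same F recovered from the change in R^2 (2 added predictors).
r2_1 <- summary(m1)$r.squared
r2_2 <- summary(m2)$r.squared
f_change <- ((r2_2 - r2_1) / 2) / ((1 - r2_2) / df.residual(m2))
print(f_change)
```

Applying the same formula to the rounded R2 values reported above, with 2089 residual degrees of freedom, gives about 48.0, in line with the reported F(2, 2089) = 48.041.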

Figure 4

Scatterplot and residual plots of multiple regression model proflitskill ~ age + preschool + sex

A third linear regression model was computed with an additional predictor, race, a

nominal variable requiring dummy coding. The reference category for the dummy coding is

'White', so the model compared White children with children from each of the other race

categories. The model showed that Native American, Asian, and Multiracial children do not

differ significantly from White children in perceived Literacy Proficiency Level, because

their p-values are greater than .05. Latino children (p = 2.64e-12) and Black children

(p = 6.67e-05) differed significantly, and these categories predict lower Literacy Proficiency

Levels. In addition, age (b = 0.55119, p < 2e-16), pre-K enrollment (b = 0.39725, p < 2e-16),

and female sex (b = 0.20888, p = 2.63e-07) are statistically significant predictors of

Literacy Proficiency Level.

The R2 = 0.1354 for this model. The combined set of predictors explains 13.54% of the

variability in perceived literacy proficiency level. This is an increase in R2 of .09659 over

the age-only model. The test of the null hypothesis, H0: R2 = 0 in the population, found

F(8, 2084) = 40.8, p < 2.2e-16, and because p < .05, we reject the null hypothesis. The

non-constant variance score test was computed. To check for multicollinearity, VIF statistics

were computed; all were approximately one, which means there are no concerns about

multicollinearity among predictors. The models were compared, and race is a statistically

significant predictor of perceived literacy proficiency; see Figure 5.
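
The variance inflation factor for a predictor is 1/(1 - R2), where R2 comes from regressing that predictor on the remaining predictors. The sketch below implements this for numeric predictors only. The function vif_numeric() and the simulated variables are assumptions for illustration; for factors with several dummy columns, packaged routines report a generalized version instead.

```r
# VIF for predictor j: regress it on the other predictors, then 1 / (1 - R_j^2).
vif_numeric <- function(X) {
  X <- as.data.frame(X)
  sapply(names(X), function(v) {
    others <- setdiff(names(X), v)
    r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
    1 / (1 - r2)
  })
}

set.seed(3)
x1 <- rnorm(500)
x2 <- rnorm(500)                  # unrelated to x1
x3 <- x1 + rnorm(500, sd = 0.3)   # strongly related to x1

vifs <- vif_numeric(data.frame(x1, x2, x3))
print(round(vifs, 2))
```

Values close to 1, as for x2 here, indicate that a predictor shares little variance with the others, which is the pattern described for the models in this analysis.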

Figure 5

Scatterplot and residual plots of multiple regression model proflitskill ~ age + preschool + sex
+ race

Discussion of the Findings and Recommendations

RQ1: Does age significantly predict Spring literacy proficiency level for children

entering kindergarten? There is a positive linear relationship between age and perceived Literacy

Proficiency Level. 3.88% of the variation in "Literacy Proficiency Level" is explained by age.

RQ2: After controlling for age, do sex and previous-year preschool enrollment significantly

predict Spring literacy proficiency level for children entering kindergarten? Considered

individually, age (b1 = 0.59494, p < 2e-16), preschool enrollment 'Yes' (b2 = 0.38652, p =

2.76e-16), and female sex (b3 = 0.21363, p = 3.04e-07) are each statistically significant,

positive predictors of Literacy Proficiency Level. As age increases, the Literacy Proficiency

Level also increases. Females have a higher perceived Literacy Proficiency Level than

males. Children who went to preschool the previous year have higher perceived Literacy

Proficiency Levels than children attending other childcare types. After controlling for age,

the combined set of predictor variables (sex and preschool enrollment) is a statistically

significant predictor of literacy proficiency level.

RQ3: After controlling for age, sex, and previous-year preschool enrollment, does race

significantly predict Spring literacy proficiency level for children entering kindergarten? After

controlling for age, preschool enrollment, and sex, race is a statistically significant

predictor of literacy proficiency level.

The findings show that preschool enrollment two years before kindergarten is related to

literacy achievement. Further funding and initiatives in early education could improve

academic achievement for young children. This research is also consistent with findings that

early entrance to kindergarten is not ideal, because older age is a predictor of academic

achievement. Hispanic and Black children are at risk for lower academic achievement and should

be a focus of efforts to close the achievement gap for these marginalized groups.

Limitations

One limitation of this dataset is the possibility of error in data input. There was little

information about how the two datasets were combined and processed. Additionally, there is a

considerable chance of error in data collection, since multiple teachers collected data

across 11 states in two different studies. The guidelines for data collection were not exact,

which could cause discrepancies. The levels of literacy proficiency were vague and dependent

on peer performance. Because the rating scale for proficiency was based solely on a teacher's

perspective, these levels are subjective, shaped by the teacher's expectations and the

standards set in the district or state. The studies were conducted in different states that

might have different standards or initiatives. State funding for early education and for

marginalized groups may also differ.

Future Research

The questionnaire for this research asked whether a child attended preschool the previous

year. Because this was a significant predictor in the data, it is an area where further

research is essential. Future research should examine how the consistency of preschool

attendance from 3 years old until kindergarten entry, and the quality of that preschool,

relate to academic achievement. Additionally, preschool enrollment, race, gender, and age

should be examined in relation to growth rather than achievement alone. Achievement does not

account for the child's ability at the start, whereas growth will show preschool's impact.



References

Bassok, D., Latham, S., & Rorem, A. (2016). Is kindergarten the new first grade? AERA Open,

2(1), 233285841561635. doi:10.1177/2332858415616358

Bowdon, J., Dahlke, K., Yang, R., Pan, J., Marcus, J., & Lemieux, C. (2019). Children's

knowledge and skills at kindergarten entry in Illinois: Results from the first statewide

administration of the Kindergarten Individual Development Survey (Rep. No. REL

2020012). Retrieved from https://ies.ed.gov/ncee/edlabs/projects/project.asp?projectID=4573

(ERIC Document Reproduction Service No. ED599357)

Early, D., Burchinal, M., Barbarin, O., Bryant, D., Chang, F., Clifford, R., . . . Barnett, W. S.

(2013). Pre-Kindergarten in eleven states: NCEDL's multi-state study of pre-kindergarten

and study of state-wide early education programs (SWEEP). ICPSR Data Holdings.

doi:10.3886/icpsr34877.v1

Morrow, L. M. (2005). Language and literacy in preschools: Current issues and concern.

Literacy Teaching and Learning, 9(1), 7-19. Retrieved from https://eric.ed.gov/?id=EJ966159

Slutzky, C., & Debruin‐Parecki, A. (2019, December). State‐level perspectives on kindergarten

readiness. ETS Research Report Series, 2019(1), 1-40. doi:10.1002/ets2.12242

St.Clair-Christman, J., Buell, M., & Gamel-McCormick, M. (2011). Money matters for early

education: The relationships among childcare quality, teacher characteristics, and subsidy

status. Early Childhood Research & Practice, 13(2).

Yoshikawa, H., Weiland, C., & Brooks-Gunn, J. (2016). When does preschool matter? The Future

of Children, 26(2), 21-35. doi:10.1353/foc.2016.0010



Appendix A

Excerpt of Survey Items

CHGENP [ENTER RESPONDENTS GENDER:]

1 Male
2 Female

ASMTAGEPS ASK ALL


What is this child’s date of birth?

ASK ALL
Q.20 Rate the student’s achievement in comparison to other students of the same grade
level. The examples do not exhaust all the ways that a child may demonstrate what he/she
knows or can do. This child (INSERT ITEM) is not yet, beginning, in progress,
intermediate, proficient, not applicable.

b. Understands and interprets a story or other text read to him/her – for example,
retelling a story just read to the group, or telling about why a story ended as it did,
or connecting part of the story to his/her own life.
c. Easily and quickly names all upper- and lower-case letters of the alphabet.
d. Produces rhyming words – for example, says a word that rhymes with "chip,"
"shop," "drink," or "light."
e. Predicts what will happen next in stories by using the pictures and storyline for
clues.
f. Reads simple books independently – for example, reads books with a repetitive
language pattern.

RESPONSE CATEGORIES:
1 Not yet
2 Beginning
3 In progress
4 Intermediate
5 Proficient

CHRACEP and CHATNDPRKP were survey items in the family questionnaire. The family

questionnaire was unavailable.



Appendix B

Data Values for Each Variable

ASMTAGEPS: ASSMT PK S: AGE AT TIME OF ASSMT (YEARS)


Based upon 2,757 valid cases out of 2,982 total cases.
• Mean: 5.05
• Median: 5.06
• Mode: 5.34
• Minimum: 4
• Maximum: 6
• Standard Deviation: 0.32
Location: 2129-2136 (width: 8; decimal: 2)
Variable Type: numeric
(Range of) Missing Values: -99.00

CHGENP: PRESCHOOL: CHILD`S GENDER


Value Label Unweighted Frequency %
1 Male 1459 48.9 %
2 Female 1507 50.5 %
Missing Data
-99 System Missing 16 0.5 %
Total 2,982 100%

CHRACEP: FAMQ: PK CHILD`S RACE (MutExCat)


Value Label Unweighted Frequency %
1 Latino 764 25.6 %
2 African American 533 17.9 %
3 Native American 21 0.7 %
4 Asian 83 2.8 %
5 White 1200 40.2 %
6 Multiracial 297 10.0 %
Missing Data
-99 System Missing 84 2.8 %
Total 2,982 100%

CHATNDPRKP: TQSC PK F: DID CHILD ATTEND PREK LAST YEAR?


Value Label Unweighted Frequency %
1 Yes 669 22.4 %
2 No 1803 60.5 %
Missing Data
-99 System Missing 510 17.1 %
Total 2,982 100%
Based upon 2,472 valid cases out of 2,982 total cases.
Location: 857-864 (width: 8; decimal: 0)
Variable Type: numeric
(Range of) Missing Values: -99

CSLANGPF2: ECLSK Acad skills PK FL&L ITEM2:STORY


Value Label Unweighted Frequency %
1 Not Yet 294 9.9 %
2 Beginning 682 22.9 %
3 In Progress 677 22.7 %
4 Intermediate 512 17.2 %
5 Proficient 356 11.9 %
Missing Data
-99 System Missing 461 15.5 %
Total 2,982 100%
Based upon 2,521 valid cases out of 2,982 total cases.
Location: 793-800 (width: 8; decimal: 0)
Variable Type: numeric
(Range of) Missing Values: -99

CSLANGPF3: ECLSK Acad skills PK FL&L ITEM3:ALPHABET


Value Label Unweighted Frequency %
1 Not Yet 821 27.5 %
2 Beginning 698 23.4 %
3 In Progress 478 16.0 %
4 Intermediate 208 7.0 %
5 Proficient 155 5.2 %
Missing Data
-99 System Missing 622 20.9 %
Total 2,982 100%
Based upon 2,360 valid cases out of 2,982 total cases.
Location: 801-808 (width: 8; decimal: 0)
Variable Type: numeric
(Range of) Missing Values: -99

CSLANGPF4: ECLSK Acad skills PK FL&L ITEM4:RHYME


Value Label Unweighted Frequency %
1 Not Yet 940 31.5 %
2 Beginning 648 21.7 %
3 In Progress 408 13.7 %
4 Intermediate 170 5.7 %
5 Proficient 120 4.0 %
Missing Data
-99 System Missing 696 23.3 %
Total 2,982 100%
Based upon 2,286 valid cases out of 2,982 total cases.
Location: 809-816 (width: 8; decimal: 0)
Variable Type: numeric
(Range of) Missing Values: -99

CSLANGPF5: ECLSK Acad skills PK FL&L ITEM5:PREDICTS


Value Label Unweighted Frequency %
1 Not Yet 323 10.8 %
2 Beginning 795 26.7 %
3 In Progress 703 23.6 %
4 Intermediate 444 14.9 %
5 Proficient 254 8.5 %
Missing Data
-99 System Missing 463 15.5 %
Total 2,982 100%
Based upon 2,519 valid cases out of 2,982 total cases.
Location: 817-824 (width: 8; decimal: 0)
Variable Type: numeric
(Range of) Missing Values: -99

CSLANGPF6: ECLSK Acad skills PK FL&L ITEM6:READS


Value Label Unweighted Frequency %
1 Not Yet 1059 35.5 %
2 Beginning 574 19.2 %
3 In Progress 297 10.0 %
4 Intermediate 168 5.6 %
5 Proficient 92 3.1 %
Missing Data
-99 System Missing 792 26.6 %
Total 2,982 100%
Based upon 2,190 valid cases out of 2,982 total cases.
Location: 825-832 (width: 8; decimal: 0)
Variable Type: numeric
(Range of) Missing Values: -99

Appendix C

R Syntax

####Making a backup copy of your dataframe####


MSS_SWEEPup<-MSS_SWEEP

####load packages####

library(dplyr)

library(psych)

library(car)

library(mice)

library(MissMech)

library(imputeR)

library(naniar)

library(Hmisc)
library(ggplot2)
library(pastecs)

nrow(MSS_SWEEP)

####Converting variables to numeric####


MSS_SWEEP$CSLANGPF2R<-as.numeric(MSS_SWEEP$CSLANGPF2)
MSS_SWEEP$CSLANGPF3R<-as.numeric(MSS_SWEEP$CSLANGPF3)
MSS_SWEEP$CSLANGPF4R<-as.numeric(MSS_SWEEP$CSLANGPF4)
MSS_SWEEP$CSLANGPF5R<-as.numeric(MSS_SWEEP$CSLANGPF5)
MSS_SWEEP$CSLANGPF6R<-as.numeric(MSS_SWEEP$CSLANGPF6)

####Computing a composite score as the mean of item scores####


MSS_SWEEP$Proflitskill<-rowMeans(cbind(MSS_SWEEP$CSLANGPF2R,
MSS_SWEEP$CSLANGPF3R,
MSS_SWEEP$CSLANGPF4R,
MSS_SWEEP$CSLANGPF5R,
MSS_SWEEP$CSLANGPF6R),
na.rm=TRUE)

####Classify age as numeric####


MSS_SWEEP$ageR<-as.numeric(MSS_SWEEP$ASMTAGEPS)

####Classifying sex as a factor and assigning labels####


MSS_SWEEP$sexR<-factor(MSS_SWEEP$CHGENP,
levels=c(1,2),
labels=c("Male", "Female"))

####Converting variable to a factor and assigning labels####


MSS_SWEEP$raceR<-factor(MSS_SWEEP$CHRACEP,
levels=c(1,2,3,4,5,6),
labels=c("Latino",
"AfricanAmerican",
"Native American",
"Asian",
"White",
"Multiracial"))

####Classifying Last Year Preschool Enrollment as a factor and assigning labels####


MSS_SWEEP$prekR<-factor(MSS_SWEEP$CHATNDPRKP,
levels=c(1,2),
labels=c("Yes", "No"))

####Create subset dataframe MSregdata1cc####


MSregdata1<-dplyr::select(MSS_SWEEP,
Proflitskill,
ageR,
sexR,
raceR,
prekR)

MSregdata1cc<-na.omit(MSregdata1)

####Check for MCAR Create a temporary data set of six variables####


MSS_SWEEP$sexRR<-factor(MSS_SWEEP$CHGENP,
levels=c(1,2))
MSS_SWEEP$raceRR<-factor(MSS_SWEEP$CHRACEP,
levels=c(1,2,3,4,5,6))
MSS_SWEEP$prekRR<-factor(MSS_SWEEP$CHATNDPRKP,
levels=c(1,2))

####Descriptive statistics for age####


summary(MSregdata1cc$ageR)
psych::describe(MSregdata1cc$ageR)
hist(MSregdata1cc$ageR, col = "red")

####Descriptive statistics for sex####


summary(MSregdata1cc$sexR)

####construct a frequency distribution table for sex####


table(MSregdata1cc$sexR, useNA="ifany")

####compute descriptive stats for age by sex####


psych::describeBy(MSregdata1cc$ageR,MSregdata1cc$sexR)

####Assign frequency table to an object called sex_table####


sex_table <- table(MSregdata1cc$sexR)

####construct a barplot of sex####


barplot(sex_table,
col="purple",
main="Barplot of Sex")

####Bar plot of mean values, including 95% bootstrapped CI####


ggplot2::ggplot(MSregdata1cc,
aes(sexR, ageR)) +
stat_summary(fun=mean,
geom="bar",
fill=c("lightgreen","lightblue"),
color="black") +
labs(title="Barplot of Mean Age by Sex",
x="Sex",
y="Mean Age") +
stat_summary(fun.data=mean_cl_boot,
geom="pointrange")

####Bar plot of median values, including 95% CI####


ggplot2::ggplot(MSregdata1cc,
aes(sexR, ageR)) +
stat_summary(fun=median,
geom="bar",
fill=c("lightgreen","lightblue"),
color="black") +
labs(title="Barplot of Median Age by Sex",
x="Sex",
y="Median Age") +
stat_summary(fun.data=median_hilow,
geom="pointrange")

####Constructing a freq table####


table(MSregdata1cc$raceR)

####Constructing a freq table that also shows missing values####


table(MSregdata1cc$raceR, useNA="ifany")

####construct a frequency distribution table for race####


race_table<-table(MSregdata1cc$raceR)

####construct barplot for Race####


barplot(race_table,
col="yellow",
main="Barplot for Race",
xlab="Race of Student",
ylab="Frequency")

####compute descriptive stats for age by race####


psych::describeBy(MSregdata1cc$ageR,MSregdata1cc$raceR)

####Descriptive statistics for “PROF” by Sex####


psych::describeBy(MSregdata1cc$Proflitskill, MSregdata1cc$sexR)

####Side-by-side boxplots####
boxplot(data=MSregdata1cc,
Proflitskill~raceR,
main="Boxplot of Literacy Proficiency Level by Race",
ylab="Composite Literacy Proficiency Level",
col="yellow",
notch=TRUE)

boxplot(data=MSregdata1cc,
Proflitskill~raceR,
main="Boxplot of Literacy Proficiency Level by Race",
ylab="Composite Literacy Proficiency Level",
col="yellow",
notch=FALSE)

####construct a frequency distribution table for Preschool Setting####


table(MSregdata1cc$prekR, useNA="ifany")

####Assign frequency table to an object called prek_table####


prek_table <- table(MSregdata1cc$prekR)

####construct a barplot of last year preschool enrollment####


barplot(prek_table,
col="purple",
main="Barplot of Last Year Preschool Enrollment",
xlab="Enrolled in Preschool Last Year",
ylab = "Frequency")

####Descriptive statistics for “PROF” by Enrollment####


psych::describeBy(MSregdata1cc$Proflitskill, MSregdata1cc$prekR)

####Side-by-side boxplots####
boxplot(data=MSregdata1cc,
Proflitskill~prekR,
main="Boxplot of Literacy Proficiency Level by Preschool Enrollment",
ylab="Composite Literacy Proficiency Level",


xlab="Last year Preschool Enrollment",
col="purple",
notch=TRUE)

#Computing descriptive statistics and histogram for composite variable


#na.rm=TRUE indicates to exclude missing values from computations
summary(MSregdata1cc$Proflitskill,
na.rm=TRUE)
hist(MSregdata1cc$Proflitskill, col = "green")

####Advanced Descriptive Statistics Histogram for composite variable####


#Descriptive stats from psych package
psych::describe(MSregdata1cc$Proflitskill)

#Histogram using additional options


hist(MSregdata1cc$Proflitskill,
main="Histogram for Proficiency Level of\nLiteracy Skills",
xlab="Proficiency Scores",
col="lightblue")

####Cronbach's alpha####
####Creating a dataframe that includes items only####
Proflitskillitems<- subset(MSS_SWEEP,
select=(c("CSLANGPF2R",
"CSLANGPF3R",
"CSLANGPF4R",
"CSLANGPF5R",
"CSLANGPF6R")))

names(Proflitskillitems)

####Computing Cronbach's alpha####


psych::alpha(Proflitskillitems)

####Constructing a Q-Q plot####


ggplot2::qplot(sample=MSregdata1cc$Proflitskill,
main = "Q-Q plot of Composite Scale Score (Proflitskill)")

####Computing descriptive statistics####


pastecs::stat.desc(MSregdata1cc$Proflitskill,
norm=TRUE)

####Computing descriptive statistics and round values to 3 digits####


ProflitskillStats1<-pastecs::stat.desc(MSregdata1cc$Proflitskill,
norm=TRUE)
round(ProflitskillStats1, digits=3)

####Descriptive statistics####
Hmisc::describe(MSregdata1cc$Proflitskill)

####boxplot####
boxplot(MSregdata1cc$Proflitskill,
main="Boxplot of 'Literacy Proficiency Level'",
ylab="Composite Score",
col="darkgreen",
notch=TRUE)

####Side-by-side boxplots####
boxplot(data=MSregdata1cc,
Proflitskill~sexR,
main="Boxplot of 'Literacy Proficiency Level'",
ylab="Composite Score",
col="darkgreen",
notch=TRUE)

#Scatterplot of scores on age with linear regression line and confidence interval#
ggplot2::ggplot(MSregdata1cc,
aes(x=ageR,
y=Proflitskill))+
geom_point(na.rm=TRUE, alpha=0.2)+
labs(title="Scatterplot of Literacy Proficiency Level",
x="Age",
y="Literacy Proficiency Level") +
theme_bw(base_size=8) +
theme(plot.title=element_text(hjust = 0.5)) +
geom_smooth(method=lm,
se=TRUE,
color="red",
fill="darkgreen",
alpha=0.2)

ggplot2::ggplot(MSregdata1cc,
aes(x=raceR,
y=Proflitskill))+
geom_point(na.rm=TRUE, alpha=0.2)+
labs(title="Scatterplot of Literacy Proficiency Level",
x="Race",
y="Literacy Proficiency Level") +
theme_bw(base_size=8) +
theme(plot.title=element_text(hjust = 0.5)) +
geom_smooth(method=lm,
se=TRUE,
color="red",
fill="darkgreen",
alpha=0.2)

ggplot2::ggplot(MSregdata1cc,
aes(x=sexR,
y=Proflitskill))+
geom_point(na.rm=TRUE, alpha=0.2)+
labs(title="Scatterplot of Literacy Proficiency Level",
x="Sex",
y="Literacy Proficiency Level") +
theme_bw(base_size=8) +
theme(plot.title=element_text(hjust = 0.5)) +
geom_smooth(method=lm,
se=TRUE,
color="red",
fill="darkgreen",
alpha=0.2)

ggplot2::ggplot(MSregdata1cc,
aes(x=prekR,
y=Proflitskill))+
geom_point(na.rm=TRUE, alpha=0.2)+
labs(title="Scatterplot of Literacy Proficiency Level",
x="Last Year Preschool Enrollment",
y="Literacy Proficiency Level") +
theme_bw(base_size=8) +
theme(plot.title=element_text(hjust = 0.5)) +
geom_smooth(method=lm,
se=TRUE,
color="red",
fill="darkgreen",
alpha=0.2)

####Linear regression of Literacy Proficiency Level on age####


MSreg1<-lm(data=MSregdata1cc,
Proflitskill~ageR)
summary(MSreg1)

####Test for non-constant variance####


car::ncvTest(MSreg1)

####Regression diagnostic plots####


plot(MSreg1)

#Shapiro-Wilk test for normality of residuals


shapiro.test(MSreg1$residuals)

####Releveling the predictor prekR so its second level is the reference category (Appendix D reports prekR2Yes, so ‘Yes’ is the comparison level)####


MSregdata1cc$prekR2 <- relevel(MSregdata1cc$prekR, 2)

#Multiple linear regression model


MSreg2 <- lm(data=MSregdata1cc,
Proflitskill ~
ageR +
prekR2 +
sexR)
summary(MSreg2)

#histogram of residuals
hist(MSreg2$residuals, col="red")

#Shapiro-Wilk test for normality of residuals


shapiro.test(MSreg2$residuals)

#Descriptive statistics for residuals


psych::describe(MSreg2$residuals)

#Test for non-constant variance


car::ncvTest(MSreg2)

#Residual plots
plot(MSreg2)

#Assessing multicollinearity among predictors


car::vif(MSreg2)

#Comparing ‘reduced’ and ‘full’ models#


anova(MSreg1,MSreg2)

####Releveling the predictor raceR so reference category is ‘White’ ####


MSregdata1cc$raceR2 <- relevel(MSregdata1cc$raceR, 5)

#Releveling a categorical predictor (equivalent code)


MSregdata1cc$raceR2<-relevel(MSregdata1cc$raceR, "White")

MSreg3 <- lm(data=MSregdata1cc,


Proflitskill ~
ageR +
prekR2 +
sexR +
raceR2)
summary(MSreg3)

#histogram of residuals
hist(MSreg3$residuals, col="red")

#Shapiro-Wilk test for normality of residuals


shapiro.test(MSreg3$residuals)

#Descriptive statistics for residuals


psych::describe(MSreg3$residuals)

#Test for non-constant variance


car::ncvTest(MSreg3)

#Residual plots
plot(MSreg3)

#Assessing multicollinearity among predictors
car::vif(MSreg3)

#Plot of actual outcome on predicted outcome


plot(predict(MSreg3),MSregdata1cc$Proflitskill,
xlab="predicted Literacy Proficiency Level",ylab="actual Literacy Proficiency Level")

#Comparing ‘reduced’ and ‘full’ models#


anova(MSreg2,MSreg3)

library(imputeR)
####Create new dataframe with mean-imputed missing values####
MSregmean<-imputeR::guess(MSregdata1cc, type="mean")
Hmisc::describe(MSregmean)

####Create new dataframe with randomly-imputed missing values ####


MSregrand<-imputeR::guess(MSregdata1cc, type="random")
MSregrand<-as.data.frame(MSregrand)
Hmisc::describe(MSregrand)

#Classifying sex as a factor and assigning labels#


MSregrand$sexR<-factor(MSregrand$sexR,
levels=c(1,2),
labels=c("Male", "Female"))

#Multiple linear regression model w/imputed data


MSregrandIMP <- lm(data=MSregrand,
Proflitskill ~
ageR +
prekR2 +
sexR +
raceR2)
summary(MSregrandIMP)

#Create five multiply-imputed data sets


library(mice)
MSregMULTIMP <- mice::mice(MSregdata1cc,
m=5,
maxit=50,
seed=500)

#Pooling regression results from multiply-imputed data


MSreg2MI <- with(MSregMULTIMP,
lm(Proflitskill ~
ageR +
prekR2 +
sexR +
raceR2))
summary(pool(MSreg2MI))

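MSreg2MIsum<-summary(pool(MSreg2MI))
#Round numeric columns only; recent mice versions return a data frame with a non-numeric 'term' column
numcols<-sapply(MSreg2MIsum, is.numeric)
MSreg2MIsum[numcols]<-round(MSreg2MIsum[numcols], digits=3)
MSreg2MIsum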

Appendix D

Regression Outputs
> summary(MSreg1)
Call:
lm(formula = Proflitskill ~ ageR, data = MSregdata1cc)

Residuals:
    Min      1Q  Median      3Q     Max
-1.8663 -0.7696 -0.1712  0.6485  2.9194

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.59696    0.33005  -1.809   0.0706 .
ageR         0.60441    0.06518   9.274   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9825 on 2223 degrees of freedom
Multiple R-squared: 0.03724, Adjusted R-squared: 0.03681
F-statistic: 86 on 1 and 2223 DF, p-value: < 2.2e-16
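
As a quick consistency check on the output above, the overall F-statistic of a single-predictor model can be recovered both from R-squared and from the squared t value of the slope. The sketch below uses only the rounded values printed in the output; the variable names are illustrative and are not part of the original script.

```r
# Values copied from the summary(MSreg1) output (rounded as printed)
r2     <- 0.03724   # Multiple R-squared
df_res <- 2223      # residual degrees of freedom
t_age  <- 9.274     # t value for ageR

# F for a one-predictor model: (R^2 / 1) / ((1 - R^2) / df_res)
f_from_r2 <- (r2 / 1) / ((1 - r2) / df_res)

# With a single predictor, F equals the squared t statistic of the slope
f_from_t <- t_age^2

# Both values should agree with the reported F-statistic up to rounding
round(c(f_from_r2, f_from_t), 1)
```

Small differences between the two values are expected because the inputs are rounded as printed.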

> summary(MSreg2)

Call:
lm(formula = Proflitskill ~ ageR + prekR2 + sexR, data = MSregdata1cc)

Residuals:
    Min      1Q  Median      3Q     Max
-2.2377 -0.7493 -0.1409  0.6425  2.9133

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.72665    0.32384  -2.244   0.0249 *
ageR         0.58858    0.06385   9.218  < 2e-16 ***
prekR2Yes    0.37893    0.04599   8.239 2.93e-16 ***
sexRFemale   0.21286    0.04081   5.216 1.99e-07 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.9621 on 2221 degrees of freedom
Multiple R-squared: 0.07762, Adjusted R-squared: 0.07637
F-statistic: 62.3 on 3 and 2221 DF, p-value: < 2.2e-16

> summary(MSreg3)

Call:
lm(formula = Proflitskill ~ ageR + prekR2 + sexR + raceR2, data = MSregdata1cc)

Residuals:
    Min      1Q  Median      3Q     Max
-2.4218 -0.6927 -0.1498  0.5909  2.9971

Coefficients:
                      Estimate Std. Error t value Pr(>|t|)
(Intercept)           -0.31699    0.31702  -1.000    0.317
ageR                   0.54815    0.06207   8.831  < 2e-16 ***
prekR2Yes              0.39170    0.04491   8.722  < 2e-16 ***
sexRFemale             0.20619    0.03960   5.207 2.09e-07 ***
raceR2Latino          -0.59414    0.04937 -12.035  < 2e-16 ***
raceR2AfricanAmerican -0.22920    0.05597  -4.095 4.38e-05 ***
raceR2Native American -0.20650    0.26049  -0.793    0.428
raceR2Asian           -0.03671    0.11869  -0.309    0.757
raceR2Multiracial     -0.09545    0.06998  -1.364    0.173
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.932 on 2216 degrees of freedom
Multiple R-squared: 0.1364, Adjusted R-squared: 0.1332
F-statistic: 43.73 on 8 and 2216 DF, p-value: < 2.2e-16

> anova(MSreg2,MSreg3)
Analysis of Variance Table

Model 1: Proflitskill ~ ageR + prekR2 + sexR
Model 2: Proflitskill ~ ageR + prekR2 + sexR + raceR2
  Res.Df    RSS Df Sum of Sq      F    Pr(>F)
1   2221 2055.8
2   2216 1924.9  5    130.91 30.141 < 2.2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
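
The F value in the model comparison above can be reproduced by hand from the table entries, which may help when reading the test of whether race adds explanatory power beyond age, preschool enrollment, and sex. The following is a minimal sketch that uses only the rounded values printed in the table; the variable names are illustrative and do not appear in the original script.

```r
# Values copied from the Analysis of Variance Table (rounded as printed)
rss_full  <- 1924.9   # residual sum of squares, model with raceR2
df_full   <- 2216     # residual degrees of freedom, model with raceR2
ss_change <- 130.91   # reduction in RSS from adding raceR2
df_change <- 5        # number of added race indicator terms

# Partial F: mean square of the added terms over the full model's mean squared error
f_change <- (ss_change / df_change) / (rss_full / df_full)
round(f_change, 2)

# Upper-tail p-value from the F distribution with (5, 2216) degrees of freedom
p_change <- pf(f_change, df_change, df_full, lower.tail = FALSE)
p_change
```

Because the inputs are rounded, the recomputed F may differ from the reported 30.141 in the last decimal place.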
