Child Development, May/June 2001, Volume 72, Number 3, Pages 887–906

A Meta-Analysis of Measures of Self-Esteem for Young Children:

A Framework for Future Measures
Pamela E. Davis-Kean and Howard M. Sandler

The objective of this study was to synthesize information from literature on measures of the self in young chil-
dren to create an empirical framework for developing future methods for measuring this construct. For this
meta-analysis, all available preschool and early elementary school self-esteem studies were reviewed. Reli-
ability was used as the criterion variable and the predictor variables represented different aspects of method-
ology that are used in testing an instrument: study characteristics, method characteristics, subject characteris-
tics, measure characteristics, and measure design characteristics. Using information from two analyses, the
results indicate that the reliability of self-esteem measures for young children can be predicted by the setting
of the study, number of items in the scale, the age of the children being studied, the method of data collection
(questionnaires or pictures), and the socioeconomic status of the children. Age and number of items were
found to be critical features in the development of reliable measures for young children. Future studies need
to focus on the issues of age and developmental limitations on the complicated problem of how young chil-
dren actually think about the self and what methods and techniques can aid in gathering this information
more accurately.

INTRODUCTION

Researchers in the areas of psychology, education, and other social sciences view the self-concept as the cornerstone of both social and emotional development (Kagen, Moore, & Bredekamp, 1995, p. 18). Because of the importance of this construct, many early childhood programs (e.g., Head Start) and educational programs have emphasized this construct in their curricula (Kagen et al., 1995). Currently, however, these programs have no reliable means of knowing whether their curricula have had an impact on children's self-concept (Fantuzzo, McDermott, Manz, Hampton, & Burdick, 1996). The instruments that exist for young children have used different methodologies (e.g., pictures, Q-sort, questionnaires, puppets) to measure the self, but have achieved limited success (Byrne, Shavelson, & Marsh, 1992; Wylie, 1989). Thus, researchers have continued to create instrumentation in an attempt to adequately measure the self in young children (Chapman & Tunmer, 1995; Marsh, Craven, & Debus, 1991). Unfortunately, there has been little attempt to integrate the information from these differing methodologies; hence, it is unclear if any of the methodologies have been successful at measuring this construct.

The objective of this study was to synthesize information from the literature on measures of the self in young children to create an empirical framework for developing future methods for measuring this construct. By studying different methodologies and quality-of-measure indicators (reliability, discriminant validity, convergent validity), it may be possible to understand what issues are important to examine in constructing a measure of self-concept in young children.

Researchers view the enhancement of self-concept as a vital component to achievement in academics as well as in social and emotional experiences (Byrne et al., 1992; Harter, 1983; Marsh et al., 1991; Rambo, 1982; Stipek, Recchia, & McClintic, 1992). Stipek and colleagues (1992), for example, note that research in areas such as attributions, perceived control, and self-efficacy demonstrates that children who evaluate themselves or their behaviors as low in competence are more likely to have difficulties performing tasks. Stipek et al. (1992) also report that positive academic achievements or failures can affect how children develop a sense of self-worth or self-esteem. Similarly, the clinical psychology literature views a healthy and positive self-concept as a means for individuals of all ages to better deal with life stresses and achieve more in their lives (Coopersmith, 1967). It is clear, then, that the development of the self has strong support from diverse research paradigms. The question, then, is why is it so difficult to develop a good measure of self-concept for young children? The answer to this question is complicated and involves the general debate about the definition of self-concept as well as the issues related to the developmental/cognitive ability of young children.

© 2001 by the Society for Research in Child Development, Inc. All rights reserved. 0009-3920/2001/7203-0019
Definitional Issues: Self-Concept or Self-Esteem

One of the problems in measuring self-concept, in general, is creating an operational definition for the construct. There is a long history of debate about what constitutes the self and how it evolves. When studying young children, for example, developmental researchers are generally interested in how the knowledge of self develops over time (Brinthaupt & Erwin, 1992). Young children may understand that they are boys or girls but not incorporate any meaning to this categorization (Lewis & Brooks-Gunn, 1979). Over time, information from social interactions may change the way they view themselves as boys and girls (Hattie, 1992). Thus, the concept of self changes across the life span. Other researchers, however, are interested not in how the self changes but rather in how children evaluate themselves across time and in comparison with others. The distinction between these two research areas is seen by many researchers as the difference between self-concept and self-esteem (Brinthaupt & Erwin, 1992).

Self-esteem has been defined as the positive or negative attitude about the self, the degree of liking of or satisfaction with the self, and one's feeling of perceived worth as compared with others (Brinthaupt & Erwin, 1992; Cook, 1987). What is common to all of these slightly different definitions is the evaluative component. Self-concept, on the other hand, has been described in broader terms that include the sum of all experiences across the life span that affect our opinions, behaviors, and social interactions and the evaluations of these experiences (Cook, 1987). Many researchers argue that discriminating between self-statements that are purely descriptive and those that contain some evaluative elements is difficult (Brinthaupt & Erwin, 1992). When asked "Who are you?" for example, it is possible that individuals will respond by describing those things that they consider their best qualities. Thus, even though it is easy to label a statement as evaluative (e.g., "I am smart"), it is not as easy to label a statement as simply descriptive. Thus, it is difficult to create a self-concept measure that taps into only the descriptive and not the evaluative aspect of the self. Brinthaupt and Erwin (1992) reviewed the problems associated with creating self-concept measures and concluded that the methodology used to elicit information about the self dictates what aspect of the self is measured. They found that reactive measures (forced choice) are more likely to be self-evaluative, whereas spontaneous measures (open-ended) tap into the more descriptive part of the self (Brinthaupt & Erwin, 1992). In their own study they found that participants would give spontaneous evaluative statements when presented with statements such as "Tell us what you like and dislike about yourself," but not when presented with statements that contained no evaluative element, such as "Tell us about yourself" (Brinthaupt & Erwin, 1992). Thus, these researchers concluded that the question format or methodology is the most important factor in distinguishing between self-concept and self-esteem. They acknowledged, however, that the majority of the instruments being used to measure the self are reactive measures that tap only the self-evaluative aspect (Brinthaupt & Erwin, 1992). Hence, the majority of the literature pertaining to the self involves the self-evaluative aspect and not the self-descriptive.

Developmental and Cognitive Issues

Definitional problems are not the only issues impacting the development of self-concept measures for young children. Distinctive to the assessment of young children are their limitations in language and cognitive development. In general, assessments require a certain amount of verbal ability to either read or understand the spoken language. If terms are not familiar to young children or they cannot read the terms, then they may have difficulty answering the questions. Similarly, some researchers believe that young children (younger than 8 years old) have not reached the cognitive/developmental level needed to grasp the abstract ideas used in self-concept assessments, for example, "Who are you?" (Harter, 1983). These two problems alone would make the assessment of self-concept in young children difficult, if not impossible, to achieve. Research has shown, however, that young children do have both the language and the cognitive ability to discuss the self by the time they are preschoolers (Bates, 1990; Damon & Hart, 1988; Lewis & Brooks-Gunn, 1979).

In studies by Lewis and Brooks-Gunn (1979), children between the ages of 18 and 30 months were able to use self-referent terms and to distinguish themselves by feature recognition (the categorical self). Children in this age group, for example, can correctly identify their size in comparison with other children (e.g., "I am bigger") and use terms like "I," "mine," and "me" (Lewis & Brooks-Gunn, 1979). Similarly, researchers have demonstrated that children are able to discuss abstract ideas about emotions and inner states if the questions are explicit (Damon & Hart, 1988; Eder, 1989; Marsh et al., 1991). For example, children may have problems with a vague question like, "Do you like who you are?" but have an easier time answering, "Do you like being a girl/boy?" (Damon & Hart, 1988; Eder, 1989).
In research on young children's verbal ability, Bates (1990) argues that children integrate the concept of self and the required grammatical and lexical forms for describing the self between the ages of 3 and 4. Bates (1990) cautions, however, that having the knowledge and the tools to discuss the self does not mean that young children understand all the nuances of the self-structure. Hence, measuring the self-concept remains problematic if one wishes to obtain more than categorical information (e.g., "I am a boy").

Many researchers have tried to address the issue of developmental limitations when designing self-concept instruments for young children (Eder, 1989; Harter & Pike, 1984). Eder (1990), for example, believes that children's lack of language skills makes it difficult for them to demonstrate all that they comprehend. Thus, even if children have a sense of self at a young age they will be unable to express it so that it can be recorded by researchers. Eder (1989), therefore, promotes the use of nonverbal methods (puppets) for assessing young children's self-concept. Other researchers, however, find that asking children simple and direct questions is the best way to obtain reliable information (Damon & Hart, 1988; Davis-Kean & Sandler, 1995; Marsh et al., 1991; Mischel, Zeiss, & Zeiss, 1974).

Another related issue is whether children's understanding of the types of scaling (e.g., Likert scale) used in questionnaires is affected by the cognitive or language ability of the young children. Research by Fantuzzo et al. (1996), for example, suggests that low-income children have trouble comprehending the Likert-type quantities that are required for reliably answering questions in self-report measures. They found that only 1 of the 153 low-income children in their study had both the ability to comprehend the concepts of quantity and the picture recognition ability that are required to answer questions on the Harter and Pike (1984) Pictorial Scale of Perceived Competence and Social Acceptance (Fantuzzo et al., 1996). Thus, when assessing the usefulness of self-related measures for young children, addressing their cognitive/developmental limitations is important.

Measuring the Self in Young Children

Throughout the course of the last 3 decades, instrumentation on self-concept in young children has attempted to account for some of the cognitive/developmental limitations of this age group (Walker, Bane, & Bryk, 1973; Wylie, 1989). In general, these attempts have not succeeded in producing an instrument that has reliable psychometric properties (Byrne et al., 1992; Wylie, 1989). As discussed earlier, the literature is also inconclusive about what methodology (e.g., pictures, questions, or puppets) to use when eliciting information from young children; thus, new instruments continue to be introduced into the literature. In an attempt to evaluate these measures, some narrative reviews of self-concept in young children exist (Walker et al., 1973; Wylie, 1989). These reviews, however, usually examine only a few measures or present the information about the measures with no commentary on their usefulness. Researchers who are creating new instruments need an empirical synthesis of the literature that integrates and summarizes what has been done in the past. In this way, new instruments can be developed that do not continue to use techniques that are ineffective at eliciting reliable information. This type of empirical integration of the literature is termed a meta-analysis (Cooper & Hedges, 1994; Peter & Churchill, 1986; Rosenthal, 1995).

One of the advantages of the meta-analytic technique is that it is flexible enough to accommodate different types of synthesis. When performing ordinary meta-analyses, methodological elements such as number of participants, socioeconomic status (SES), and method of data collection are often tested for effects on the primary hypothesis or effect size. For a synthesis of methodology, these elements would be used as independent variables with an appropriate methodological dependent variable. In research by Peter and Churchill (1986), this type of synthesis was used effectively to test the importance of methodological elements on the reliability and validity of rating scales in the marketing literature.

Peter and Churchill's (1986) primary goal in their research was to test the relations between research design variables and quality of measure indicators to determine what impacted construct validity. In their study they categorized research design variables into sampling characteristics (sample size, type of participants, method of data collection), measure characteristics (number of items, number of dimensions, type of scale), and measure design processes (procedures used to generate items). The quality of measure indicators included estimates of reliability, convergent validity, and discriminant validity. They found that research design variables, most notably measure characteristics, did have differential impact on the quality indicators. Peter and Churchill concluded that when designing a quality measure, it is important to consider all aspects of the measurement process and to be specific about what criteria will be used in the construction of the instrument.
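The type of synthesis described here, in which study-level design variables serve as predictors of a measure-quality indicator, can be sketched as a simple regression. The data below are invented solely for illustration (none of the numbers come from the studies reviewed), and a real synthesis would also weight studies and check model assumptions:

```python
import numpy as np

# Hypothetical study-level data: each row is one instrument/study.
n_items = np.array([10.0, 24.0, 16.0, 40.0, 8.0, 30.0])  # scale length
age = np.array([4.5, 6.0, 5.0, 6.5, 4.0, 5.5])           # mean sample age
reliab = np.array([0.525, 0.740, 0.610, 0.925, 0.480, 0.775])  # reported reliability

# Ordinary least squares with an intercept: reliability ~ items + age.
X = np.column_stack([np.ones_like(n_items), n_items, age])
coef, _, _, _ = np.linalg.lstsq(X, reliab, rcond=None)
intercept, b_items, b_age = coef
print(f"items slope = {b_items:.3f}, age slope = {b_age:.3f}")
```

Here the reliability coefficient plays the role of the dependent variable and scale length and sample age the independent variables, mirroring the structure Peter and Churchill used with rating scales.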
Study Objectives

This study concentrated on those measures that examined the evaluative aspect of the self (see earlier discussion on definitional issues). These measures include self-esteem, self-concept, perceived competence, and other self-related constructs that use evaluative statements or questions. For simplicity, these measures were referred to as self-esteem measures.

This study had three major questions. The first question was whether a relation existed between reliability and study characteristics, subject characteristics, sample characteristics, measure characteristics, and measure design characteristics. Convergent and discriminant validity were also reviewed, but because of the small sample size (eight studies reporting convergent validity and one reporting discriminant validity) they were excluded from the analysis. The lack of reporting of these two critical psychometric indicators is consistent with research by both Peter and Churchill (1986) and Wylie (1989), who found that the majority of researchers who create new instruments do not perform adequate psychometric testing. In the meta-analysis by Peter and Churchill (1986), a positive relation was found to exist between measure characteristics, such as length of instruments and type of scale, and the reliability of rating scales. Thus, it was expected that similar relations would exist in this study.

In addition to finding out about specific relations, it was important to determine whether some combination of the characteristics listed earlier could predict higher reliability estimates. It was important, for example, to test whether the tenet in measurement theory that adding more items of parallel content and difficulty increases the internal reliability of a measure is true, even when testing young children (Crocker & Algina, 1986). It is also possible that neither the number nor the content of the items is instrumental in achieving good reliability. It may, instead, be the characteristics of the participants (e.g., age or SES) or an interaction of these characteristics that affect the reliability of an instrument. Hence, it was important in this study to know which features or combination of features in current measures of young children have the most impact on reliability. This information may be useful in designing future measures.

The second question to be answered in this study was related to the impact of cognitive issues on measurement. Do the items related to participant characteristics (e.g., SES) have an impact on reliability? As demonstrated by Fantuzzo et al. (1996), low-income children have a more difficult time understanding the quantities that are symbolized in the Likert scale. Thus, it is expected that participant characteristics such as SES will differentially predict the reliability of the measure. Measures that assess children from high- and middle-class families should have higher reliabilities than those that measure low-income children.

The final question in this study involved the methods of data collection used in assessing young children's self-esteem. As discussed earlier, there is disagreement among researchers about which techniques or devices should be used to aid in the measurement of self-esteem in young children. As a result of perceived cognitive/developmental limitations, some researchers promote the use of devices (e.g., puppets, cartoons, Polaroids) to elicit information from children about their self-concepts. Other researchers use no devices and simply ask the children questions. In this study, the different techniques or devices used were compared to determine if one technique produced significantly higher reliabilities. Thus, the use of pictures as a nonverbal methodology was tested against the use of questions with no pictures as well as other methods that have been used with young children. It is our hope that the results will help inform the development of future measures and the type of methodology that is used to elicit responses from young children.

METHOD

Literature Searches

The literature review for this study was divided into three phases. The review process began with a broad search of the literature and then became more focused through each subsequent phase.

Phase I. Because of the lack of differentiation between self-esteem and self-concept in the literature, the first phase of this study consisted of a broad review of articles compiled through the use of keywords that relate to preschool and early elementary school self-construct (e.g., self-concept, self-esteem, self-understanding, constructs of the self, and perceived competence). This review created a population of self-related measures for young children that were then reviewed, in Phase II, for their adherence to the inclusion criteria. The primary sources of literature for the broad review were found in computer searches from the Educational Resource Information Center (ERIC), Psychological Abstracts (PsycINFO, PsycLIT), Social Science Citation Index (Social Scisearch or SSCI), Dissertation Abstracts, and the Wilson Indexes to Journal Articles. To decrease the bias of reviewing only journal articles, other sources
of data were also examined (Glass, McGaw, & Smith, 1981; Rosenthal, 1995; White, 1994). These sources included bibliographic information from the articles and books obtained in the computer search, communication with others who have studied preschool and early elementary school self-esteem, manual search of reference books, internet searches by keywords, and formal requests to agencies or researchers who may have had access to measures. It is possible that other measures exist that were either not published or not found using the search criteria described.

Phase II. After the initial literature review, 59 measures were obtained that met the inclusion criteria for Phase I. These measures were reviewed a second time to narrow the focus of the literature review to include only those measures that focused on the self-evaluative aspect of the self and also reported some psychometric information.

To select the appropriate types of measures, the following selection criteria were used.

1. The measures had to assess or have some aspect that assessed the self-esteem or self-evaluative aspect of the self. As summarized by Brinthaupt and Erwin (1992), this includes measures that evaluate the self (e.g., "I like myself"), the degree of liking of or satisfaction with some aspect of the self (e.g., "I am good at running"), or the discrepancy between actual and ideal self (e.g., "I'm not as good at math as I want to be"). In general, these measures had to contain statements or questions that involved evaluation of the self as "good" or "bad." If it was unclear whether a question or statement was self-evaluative, then the decision to include a measure was based on how the author defined the measure (i.e., self-esteem or other self-related construct).
2. The measures had to have been tested on a sample of children who were older than 4 but younger than 7 (preschool and some first-grade ages).
3. Internal reliability information had to be reported for the measure (Cronbach's α, KR-20/21, Hoyt's method, or split-half).
4. The methodology that was used to collect data for the measure had to be reported (e.g., number of participants).
5. The form of data collection that was used in assessment of self-construct (i.e., pictures, questionnaires, puppets) had to be reported.

This filtering process eliminated 37 of the 59 original studies (63%). The majority of the studies (15) were removed from this study because no published information on their internal reliabilities was found in the literature. Another 10 studies were excluded because they did not fit the definition of self-esteem that was used. Seven of the studies could not be located in the literature or through personal correspondence. Only studies in which children were the primary respondents were used for analysis, thus eliminating three more studies. Finally, two studies were excluded because they did not fit the age criteria. Twenty-two studies met all of the requirements of the inclusion criteria and were retained in the sample for the final phase of the literature review (see Appendix A for a listing of all measures reviewed for this study).

Phase III. After ascertaining which studies met the inclusion criteria, a secondary literature review was undertaken. The purpose of this secondary review was to gather information on the psychometric properties of the measures. It is possible, for example, that adequate psychometric information was not presented in the studies or manuals that first introduced these measures into the literature. Over time, however, these measures may have been used in other studies that did report good psychometric information. It was important to obtain the best psychometric information available on these measures. In this way, the questions for this study could be answered more accurately.

For those measures that met the inclusion criteria in Phase II, a secondary search using the computerized education and psychological databases previously listed was conducted. Both the title and the author were entered as keywords to locate all possible references to the instrument. If no other reference was made in the literature, then the psychometric information from the original study was used.

Other studies that reported psychometric information were examined to see if they provided more information than the original study. If no additional information was provided, then the original study was used. If only one study reported more adequate psychometric information than the original study, then that study was chosen for inclusion. To guarantee independence of the studies, only one study using each instrument and the criteria noted earlier was included.

Coding of the Studies

Each of the 22 studies retained after the literature review phases was coded using a 36-item coding scheme (see Appendix B). Intercoder reliabilities were obtained by randomly selecting seven studies coded by a trained coder and then comparing them with the
initial coding completed by an experienced coder. Using techniques described by Orwin (1994) and Yeaton and Wortman (1993), reliabilities were calculated for each subcategory of the coding scheme (e.g., participant characteristics, measure design characteristics, etc.). The interrater agreement of the subcategory variables ranged from 83% for the measure design variables to 100% for the quality of measure estimates, with 94% overall agreement for the whole coding measure. Differences in the coding were identified and resolved by the initial coder.

Because of the type of analyses that were to be performed, the variables were coded as either continuous or dichotomous. Thus, some of the categorical variables were transformed to accommodate the coding restrictions of the statistics to be used (see Table 2 for coding information). Descriptive information for these variables can be found in Tables 1 and 2. A brief description of the categories of this coding scheme and reasons for their inclusion follows.

Quality of Measure Estimate

Because of the lack of reported information on convergent and discriminant validity, only the reliability coefficient was used as a quality of measure estimate (see Study Objectives section). Reliability refers to the consistency of instruments in measuring a phenomenon (Baltes, Reese, & Nesselroade, 1977; Crocker & Algina, 1986). There are various ways of looking at measurement consistency including the replicability of scores over time, or stability; alternate forms of a test, or equivalence; and internal consistency, or homogeneity (Baltes et al., 1977; Crocker & Algina, 1986). For this study, all three types of reliabilities were coded, but only the internal consistency coefficients were used for analysis because they were reported most often in the studies reviewed.

Internal consistency refers to the extent that items in a measure intercorrelate (Baltes et al., 1977). Thus, researchers can administer a test at one time only and determine how reliable or internally consistent the items are at measuring a construct. There are two ways of obtaining this intercorrelation: by correlating two halves of the same measure or by calculating each item's covariance. The first method is called the split-half reliability, and the second, the α coefficient (Baltes et al., 1977; Crocker & Algina, 1986). Because the present study focused on the internal consistency of a measure, both the split-half and the α coefficients were accepted as meeting the internal consistency requirement for inclusion in the study. Two analyses were conducted to see if the split-half coefficients varied significantly from α estimates. The first examined the relationship between instruments that reported both an α coefficient and a split-half reliability coefficient (n = 3). The correlation between these coefficients was .86. The second analysis was a t test that compared the mean of the α coefficients and the split-half coefficients. The analysis showed that the α coefficients (M = .75) were not significantly different from the split-half coefficients (M = .82), t(23) = .70, p > .05. Hence, it was decided that the split-half reliabilities did not differ enough from the α coefficients to be excluded from the study. Thus, both the split-half reliabilities and the α coefficients were used to represent the internal consistency coefficients in this study. In the three cases where both coefficients were reported, the α coefficient was used because it is considered a better estimate of internal consistency (Crocker & Algina, 1986).
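The two internal-consistency estimates described above can be made concrete in a short sketch. The item scores below are made up for illustration and do not come from any reviewed measure; the split-half function applies the standard Spearman-Brown step-up so the estimate reflects the full-length test:

```python
import numpy as np

def cronbach_alpha(scores):
    """Coefficient alpha from an (n_children, n_items) score matrix:
    (k / (k - 1)) * (1 - sum of item variances / variance of total scores)."""
    x = np.asarray(scores, dtype=float)
    k = x.shape[1]
    return (k / (k - 1)) * (1 - x.var(axis=0, ddof=1).sum()
                            / x.sum(axis=1).var(ddof=1))

def split_half(scores):
    """Correlate odd-item and even-item half scores, then step up with
    the Spearman-Brown formula to estimate full-length reliability."""
    x = np.asarray(scores, dtype=float)
    r = np.corrcoef(x[:, 0::2].sum(axis=1), x[:, 1::2].sum(axis=1))[0, 1]
    return 2 * r / (1 + r)

# Made-up responses: 6 children answering 4 dichotomous (yes/no) items.
scores = [[1, 1, 1, 0],
          [1, 0, 1, 1],
          [0, 0, 1, 0],
          [1, 1, 1, 1],
          [0, 0, 0, 0],
          [1, 1, 0, 1]]
print(round(cronbach_alpha(scores), 2), round(split_half(scores), 2))
```

The Spearman-Brown step inside the split-half estimate is also the formal version of the tenet, discussed under Study Objectives, that adding items of parallel content and difficulty raises internal reliability.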

Table 1  Means, Standard Deviations, Kurtosis, Skewness, and Ranges for Continuous Variables in the Initial and Alternative Analyses

Variables                        N    M      SD    Kurtosis  Skewness  Range

Initial analysis
  Reliability (a)                22   .75    .13   2.31      1.22      .36–.94
  Age                            22   5.57   .66   .65       .27       4.5–6.5
  Measure design characteristic  22   1.32   1.32  .19       .98       0–4
  No. of items                   22   26.24  15.6  .16       .76       4–64
  Publication year               22   1980   8.1   1.03      .28       1968–1995
Alternative analysis
  Reliability (a)                30   .76    .14   1.31      1.09      .36–.94
  Age                            30   5.38   .78   .81       .18       3.8–6.5
  Measure design characteristic  30   1.47   1.38  .45       .74       0–4
  No. of items                   30   26.90  15.5  .41       .97       4–64
  Publication year               30   1981   8.2   1.15      .16       1968–1995

(a) Criterion variable.
Table 2  Frequencies, Percentage, Kurtosis, and Skewness for Dichotomous Variables in the Initial and Alternative Analyses

Variables                       Frequency  Percentage  Kurtosis  Skewness

Initial analysis
  Nonschool (0)                 2          9           8.09      3.06
  School (1)                    20         91
  Middle class/mixed class (0)  17         77          .57       1.4
  Low-income (1)                5          23
  Scale type
    Likert/rating (0)           7          32          1.44      .84
    Dichotomous (1)             15         68
  Pictures/cartoons (0)         18         82          1.25      1.77
  Questionnaires (1)            4          18
Alternative analysis
  Nonschool (0)                 3          10          6.31      2.81
  School (1)                    27         90
  Middle class/mixed class (0)  24         80          .53       1.58
  Low-income (1)                6          20
  Scale type
    Likert/rating (0)           11         37          1.78      .58
    Dichotomous (1)             19         63
  Pictures/cartoons (0)         24         80          .53       1.58
  Questionnaires (1)            6          20

Note: Numbers in parentheses indicate coding for analysis.
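As a point of reference for reading Tables 1 and 2, the reported descriptive statistics can be computed from raw coded values with simple moment formulas. This is a generic sketch using population moments; software that applies small-sample corrections will give slightly different skewness and kurtosis values:

```python
def describe(xs):
    """Mean, SD, skewness, and excess kurtosis from population moments."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / n
    sd = var ** 0.5
    skew = sum((x - mean) ** 3 for x in xs) / (n * sd ** 3)
    kurt = sum((x - mean) ** 4 for x in xs) / (n * sd ** 4) - 3
    return mean, sd, skew, kurt

# A symmetric toy sample has zero skewness:
print(describe([1.0, 2.0, 3.0]))
```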

There are three methods for obtaining the α coefficient: Cronbach's α, KR-20/21, and Hoyt's method. As discussed by Crocker and Algina (1986), these methods produce the same internal consistency results. The only exception is that the KR-21 assumes that all items are of equal difficulty. When items are of equal difficulty, the KR-20 and KR-21 estimates are the same (Crocker & Algina, 1986). When the difficulty varies across items, then the KR-21 estimate will be systematically lower than the KR-20 estimate (Crocker & Algina, 1986). It is best to report both estimates for a test using this reliability estimate (Crocker & Algina, 1986). In general, the overall internal consistency coefficient was recorded for each of the measures. Two measures, the Eder (1990) and the Eccles, Wigfield, Harold, and Blumenfeld (1993) scales (see Appendix A), did not report an overall α coefficient, but instead reported individual alphas for each of the subscales. The total number of items was adjusted for these measures to account for the averaging of the reliabilities to create an overall reliability coefficient. Thus, for the Eccles et al. (1993) scale, for example, the scores were averaged across the four domains in the study (Math, Reading, Music, Sports) to get a mean reliability for the whole measure. Because this averaging did not reflect the actual increase in reliability that would be obtained by including all the items, the number of items recorded for this scale was reduced to reflect the number in the subscale (9.25 items). In this way, the mean reliability represented the mean of the subscales and the number of items that were reflected by the subscales and thus did not bias these two studies.

Study Characteristics

To get a general overview of each instrument, data about the instruments were collected. This information included year of publication and study setting. These variables gave some indication of the context of the study. The year that an article was published, for example, was coded to see if instruments varied in quality across time. The study setting was coded to represent the most common settings in which instruments were administered to children (school, home, professional office, laboratory).

Method Characteristics

The method characteristics are generally listed in the methodology section of research reports and include the overall sample size of the study and the age of the children. Sample size was selected to determine whether the number of participants to whom an instrument was administered impacted quality indicators. The age of the children who used the instruments was an important variable for this study. Thus, some data transformations were used to get an accurate account of the ages of the children to whom these instruments were administered.

When coding age, it became clear that it would not be easy to obtain directly the mean age of the children. Six studies reported the ages of the children to whom the instruments were administered, but the other 16 reported only age categories (e.g., kindergarten, first grade, etc.). Thus, to use the data from all of the studies and to obtain a continuous variable for use in the analyses, a new age variable was created. Three categories of ages were used: preschool (4- to 5-year-olds), kindergarten (5- to 6-year-olds), and first grade (6- to 7-year-olds). These age categories were established by examining the studies and determining how the ages were grouped. The data were consistent across the studies. The actual labeling of the categories (i.e., preschool, kindergarten, and first grade) was based on the American school system terms for these ages. If the description of an instrument stated that subjects fell into one of the three groups, then the midpoint (i.e., 4.5, 5.5, and 6.5, respectively) of that age range was coded for that measure. Thus, all measures received codes representing the age groups in their studies, and missing data for age were avoided. For consistency, if authors did report the actual ages of the children in their studies, that information was not used in this variable. The actual age information, however, provided an opportunity to test whether or not this continuous age variable adequately represented the age groups. Using a correlation analysis, the continuous age variable was found to be significantly related, r(20) = .93, p < .01, to actual age reported by measures that reported exact age (N = 6). Also, the greatest difference between actual age and the continuous age variable was .50. Hence, the continuous age variable was found to be adequate for representing the ages of children being assessed by the instrument.

Participant Characteristics

Participant characteristics are also found in the methodology section of research reports, usually under the heading of "Subjects." These variables represent the characteristics of the participants to whom the instruments were administered. These characteristics include identification of the respondent for the measure (child or parent/teacher), and SES, race, and gender of the respondent. The respondent variable is important for distinguishing between a self-report measure and an inferred measure. Self-report measures ask the children questions regarding their self-esteem, and inferred measures ask either parents or teachers about their perceptions of children's self-esteem. Only studies in which the child was the primary respondent were used for the analyses.

As discussed earlier, the SES of children can have an impact on their ability to understand the quantity differences in a Likert scale (Fantuzzo et al., 1996). Thus, it was of interest to test the impact that instruments used for low-income samples might have on quality indicators. Because of a lack of information regarding income ranges or the measures used to determine SES in the studies reviewed, only those studies that reported that the whole sample was from a low-income sample were categorized as poverty samples. The other studies were coded as mixed SES. Thus, a dichotomous code was used to represent SES: A 1 indicated poverty and a 0 indicated mixed SES.

Information on the reliability of different ethnic groups was limited and thus was not examined in this article. Similarly, because only three studies reported reliability by gender, the data were too limited to conduct analysis. The reliabilities reported by these three studies, however, showed almost no difference between males and females, mean reliability: females, M = .74; males, M = .73.

Measure Characteristics

Characteristics such as number of items, type of scale, and method of data collection were used to examine the nature of the measures (Peter & Churchill, 1986). In previous research, these types of characteristics were shown to have the strongest relation with the reliability of a measure (Peter & Churchill, 1986).

Two types of scales were primarily used by the authors of the instruments: Likert and dichotomous. Even though the main research question involved the comparison of studies using dichotomous scales to those using Likert scales, additional analyses were conducted to determine if the number of categories (2-7) had an effect on the reliability of a scale. The two questions answered were: (1) Is there a significant relationship between number of responses (2-7)

with internal reliability? (2) Is there a significant difference between response categories of 3 or fewer and more than 3? In the first case, there was a .14 correlation between number of responses and the weighted reliability coefficient. In the second case, there was a nonsignificant difference in the means between response categories of 3 or fewer and those above (M = .72 and M = .80, respectively). This study suggested that the number of response categories did not have an impact, but the n was small. It would be interesting to look at this issue in more depth in a direct study of method and response.

In general, the dichotomous scales represented the answers "like me" and "not like me." The Likert scales varied in what each of their points represented, but usually just added the distinction "like me sometimes or always" and "not like me sometimes or always." The numbers representing the points ranged from 3 to 7, M = 4.4.

The method of data collection was defined as any device that was used in a study to aid in either helping the child to understand the questions being asked by the instrument or to help in answering the questions. These methods were assigned to one of two categories: those that used some type of device to assist in asking the question (e.g., cartoons, photographs, smiley faces, puppets) and those that used only questionnaires.

Measure Design Characteristics

Measure design characteristics were coded to gather empirical data on how authors of the instrument created their instruments (Peter & Churchill, 1986). These procedures included not only the generation of questions, but also the definition of the domain of the construct, an a priori specification of dimensions, and the investigation of dimensionality (Peter & Churchill, 1986). Peter and Churchill (1986) believe that these procedures give strong evidence for the quality of the measure. They caution, however, that this evidence is subjective, not empirical, and thus is difficult to measure. Peter and Churchill (1986) also point out that these measures may be highly related to each other. Researchers, for example, often investigate the dimensionality of their instrument (i.e., factor analysis) when there are a priori theories or dimensions to be examined. To deal with the potential multicollinearity that might exist between these variables, a measure design index was created.

A measure is generally considered stronger if it is derived from theory, has dimensions specified a priori, empirically investigates dimensionality, and uses a confirmatory factor analysis to empirically investigate the dimensions of the instrument. Using a dichotomous scale (present/not present), the values are summed across these variables to create one continuous index that represents the rigor of the measure development process, with the highest value being a 4 (most rigorous), and the lowest being a 0 (least rigorous), M = 1.32, SD = 1.28. This one index was used to represent the impact of the measure design variables in the analyses.

RESULTS

Descriptives

The descriptive summary of the continuous variables (Table 1) demonstrates that the reliability variable (the criterion variable) was negatively skewed (which indicates that the scores cluster on the high end of the scale).

The descriptive summary for the dichotomous variables also shows that some of the predictor variables were skewed. Skewness is assessed for dichotomous variables by examining how many scores fall into each category (Tabachnick & Fidell, 1983, 1996). If between 80% and 90% are in one category, then the variables are considered to be skewed (Tabachnick & Fidell, 1983, 1996). Table 2 indicates that two of the dichotomous variables had skewed distributions: setting and method of data collection.

Effect Size

The primary objective of this study was to examine the relation between measurement characteristics and quality of measurement indicators. As discussed earlier, the quality indicator being used for analysis was the internal reliability coefficient. Before conducting the correlation analysis, the reliability coefficient was transformed to create an effect size statistic that represented the reliability information and was statistically comparable across samples (M. W. Lipsey, personal communication, November 14, 1997). Following procedures outlined by Shadish and Haddock (1994) and M. W. Lipsey (personal communication, November 14, 1997), the reliability coefficients were transformed using Fisher Z transformation, and the correlation analysis was weighted using the inverse of the conditional variance (i.e., w = n - 3). This transformation weights the effect sizes by the sample size of the combined studies, which makes the correlations more stable because of the increased sample size (De Wolff & van IJzendoorn, 1997). After the effect sizes were transformed, the mean weighted effect sizes, confidence intervals, and homogeneity test were calculated for the combined effect sizes (see Table 3).
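The transformation and weighting steps described above can be sketched briefly (a generic illustration with hypothetical reliabilities and sample sizes, not the coefficients from the reviewed studies):

```python
import numpy as np

def weighted_effect_summary(r, n):
    """Fisher Z transform of coefficients r, weighted by w = n - 3."""
    z = np.arctanh(r)                    # Fisher Z transformation
    w = n - 3.0                          # inverse of the conditional variance
    mean_z = np.sum(w * z) / np.sum(w)   # weighted mean effect size
    se = np.sqrt(1.0 / np.sum(w))
    ci = (mean_z - 1.96 * se, mean_z + 1.96 * se)  # 95% confidence interval
    q = np.sum(w * (z - mean_z) ** 2)    # homogeneity Q, df = k - 1
    return mean_z, ci, q

# Hypothetical reliabilities and sample sizes for three studies
r = np.array([0.70, 0.85, 0.78])
n = np.array([50, 120, 80])
mean_z, ci, q = weighted_effect_summary(r, n)
```

A large Q relative to its chi-square reference (df = number of studies minus 1) signals heterogeneity of the kind reported in Table 3.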

Table 3  Meta-Analytic Summary Statistics for Transformed Correlations and Weighted Effect Size in Initial and Alternative Analyses

Type of Summary             Initial Analysis    Alternative Analysis
No. of studies              22                  30
Weighted mean effect size   1.13                1.07
Weighted SD                 .27                 .27
95% CI                      1.10-1.15           1.05-1.09
Min.-Max.                   .38-1.74            .38-1.74
Homogeneity Q               499.27, p < .05     637.38, p < .001

The test for homogeneity of effect sizes was significant and indicates that the correlations are most likely not from the same population. Generally, when this occurs in meta-analysis, moderator variables (e.g., year of publication, setting of study, age) are examined to determine if they differently impact the effect size estimate (Durlack & Lipsey, 1991). This study, however, assumed differences in the characteristics of the studies being examined and, as with traditional meta-analysis, modeled the impact of these characteristics on the effect size. Thus, it is important that heterogeneity in effect sizes was found in this sample of studies.

The correlations were used as descriptors of the relation between variables in the study. Thus, this discussion focuses on the magnitude of the relation as an indication of the strength of the bivariate relation. The correlations indicate that there were moderate to large, positive relations between the weighted reliability coefficient and four of the predictor variables: age, SES, number of items, and method of data collection (see Table 4 for correlation matrix). The largest magnitude of relation existed between age and internal reliability; that is, the reliability of a measure increased with the age of the children being assessed. Similarly, the internal reliability of a measure increased as the number of items in an instrument increased. The correlations also indicated that, on average, measures using questionnaire formats were more reliable than those using pictures or other devices. Finally, instruments that were used with low-income children produced lower reliabilities than those used with middle- to higher-income children.

The correlation matrix (see Table 4) also allowed examination of the relations among various predictor variables. The measure design characteristic variable (which was created by summing across items that represented good measurement theory) was highly correlated with the year that an article was published. Thus, recent articles more often developed and tested measures using measurement theory than did older articles.

Table 4  Fisher Z Transformed and Weighted Correlation Coefficients for Variables in the Initial and Alternative Analysis

Variables                            1    2    3    4    5    6    7    8    9

Initial analysis
1. Reliability^a                        .06  .16  .20  .65  .37  .53  .28  .29
2. Measure design characteristics            .81  .10  .02  .11  .06  .57  .86
3. Publication year                               .06  .19  .49  .02  .52  .74
4. Setting                                             .16  .03  .17  .04  .04
5. Age                                                      .47  .12  .13  .31
6. SES                                                           .27  .30  .23
7. No. of items                                                       .40  .28
8. Scale type                                                              .61
9. Method

Alternative analysis
1. Reliability^a                        .08  .17  .23  .54  .47  .57  .30  .25
2. Measure design characteristics            .74  .07  .04  .23  .23  .58  .81
3. Publication year                               .15  .29  .55  .06  .54  .73
4. Setting                                             .14  .04  .14  .02  .04
5. Age                                                      .67  .24  .25  .27
6. SES                                                           .39  .37  .27
7. No. of items                                                       .43  .31
8. Scale type                                                              .61
9. Method

Note: Initial analysis: weighted N = 7,626, 22 (df = 20) cases for all variables. Alternative analysis: weighted N = 9,545, 30 (df = 28) cases for all variables.
a Transformed, weighted criterion variable.

The measure design variable was also highly related to type of method and to type of scale used; it indicated that questionnaires using Likert scales were more likely to use good measurement techniques.

The type of income group with which young children are associated was also related to some of the variables. The correlation matrix (Table 4) showed that recent articles were more likely to examine middle-class rather than low-income groups. These low-income groups were also more likely to be represented by the younger ages, and instruments using a dichotomous scale were more likely to be used with this group.

Items related to the methodology of the instruments were also related to each other (Table 4). Instruments that have been created recently, for example, were more likely to be questionnaires instead of picture instruments. Questionnaires were also more likely to be based on a Likert scale than on a dichotomous scale, and to be used with older children.

Initial Regression Analysis

The second question to be addressed involved whether reliability could be predicted by measure and participant characteristics. The correlation coefficients were adequate at showing the bivariate relation between the criterion and predictor variables, but were not useful for showing the predictive ability of multiple predictor variables. Thus, a multiple regression analysis was used to predict reliability using multiple variables.

In meta-analysis, an analogue to the basic multiple regression procedure is used to model the relation between categorical and continuous predictor variables and a continuous criterion variable (Hedges, 1994). This procedure uses a weighted, generalized least squares algorithm for estimation (Hedges, 1994). Standard statistical analysis packages can be used to compute the weighted regression analysis, but they do not give the correct coefficients for the standard error or significance test because they are based on different models than those used for fixed-effects meta-analysis (Hedges, 1994). Thus, corrections must be made to the standard error coefficients to accurately report the estimates of the regression coefficients (Hedges, 1994). An SPSS macro program obtained from M. W. Lipsey (personal communication, November 25, 1997), which makes the necessary corrections to the standard error coefficients, was used to run the simultaneous regression analysis. The results from this analysis can be found in Table 5. The regression analysis showed that testing children in a school setting was predictive of increased reliability.
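The fixed-effects weighted least squares estimator described here can be written directly in matrix form (a generic sketch, not the SPSS macro itself, and not the data from the reviewed studies); the key correction is that coefficient variances come from the inverse of the weighted cross-product matrix rather than from the ordinary least squares error variance:

```python
import numpy as np

def fixed_effects_meta_regression(X, z, w):
    """Weighted GLS meta-regression: z = effect sizes, w = inverse-variance
    weights (e.g., n - 3 for Fisher Z values), X = design matrix with intercept."""
    W = np.diag(w)
    xtwx = X.T @ W @ X
    b = np.linalg.solve(xtwx, X.T @ W @ z)      # coefficient estimates
    se = np.sqrt(np.diag(np.linalg.inv(xtwx)))  # fixed-effects standard errors
    return b, se

# Hypothetical example: effect size rises linearly with number of items
n_items = np.array([10.0, 20.0, 24.0, 30.0, 16.0])
X = np.column_stack([np.ones(5), n_items])
z = 0.5 + 0.02 * n_items                        # exact linear relation
b, se = fixed_effects_meta_regression(X, z, np.array([47.0, 57.0, 97.0, 37.0, 77.0]))
```

Because the sampling variances of Fisher Z effect sizes are treated as known, no residual mean square enters the standard errors, which is exactly the correction a standard regression package fails to make.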

Table 5  Weighted Generalized Least Squares Regression Results for the Prediction of Reliability in the Initial and Alternative Analyses

Predictor Variables                 B       β       Q          95% CI Lower   95% CI Higher

Initial analysis
Measure design characteristics     .04     .18     1.36          -.03            .10
Setting                            .31     .13     6.73**         .08            .55
Method                            4.88    6.30    13.98***       2.32           7.43
SES                               1.44    2.41     7.47**         .41           2.48
Publication year                   .00     .14     1.00          -.01            .01
No. of items                      -.06   -3.02     8.87**        -.09           -.02
Scale type                         .01     .02      .13          -.06            .09
Age                                .16     .30     3.69*          .00            .33
Age × SES                         -.25   -2.43     6.54**        -.45           -.06
Age × Method                      -.83   -6.66    17.19***      -1.22           -.44
No. of Items × Age                 .01    3.56    11.00**         .00            .02
Alternative analysis
Measure design characteristics    -.02    -.13     1.79          -.06            .01
Setting                            .19     .09     4.11*          .01            .37
Method                             .40     .47    13.45***        .18            .60
SES                               -.10    -.19     3.97*         -.19            .00
Publication year                   .00     .01      .00          -.01            .01
No. of items                       .01     .32    20.97***        .00            .01
Scale type                        -.01    -.02      .10          -.09            .06
Age                                .08     .21     8.43**         .02            .13
Age × SES                         -.07    -.24    16.49***       -.11           -.04
Age × Method                      -.16    -.49    51.38***       -.21           -.12
No. of Items × Age                 .01     .03      .26          -.03            .06

Note: Initial analysis: adjusted R2 = .77, df = 1; alternative analysis: adjusted R2 = .62, df = 1.

* p < .05; ** p < .01; *** p < .001.

Method of data collection (questionnaire versus pictures), SES, age of the child, and number of items in a scale were also significant main effect predictors of reliability. These variables were also examined to determine if the age of the child had any impact on these three important study variables.

As Table 5 shows, there was a significant interaction between number of items and age. As the number of items increased, the reliability also increased across all age groups. This increase, however, was more rapid for the younger group than for the older (see Figure 1). The reliability of the older group remained the same despite the number of items added, but the reliability of the younger group increased with the addition of more items.

Method of data collection and SES were also examined for interactions with age. Both these variables had significant interactions with age, but because of the limitation of this data, these interactions were difficult to interpret. Children who were in the lower SES group were younger than those in the middle-class group. Thus, there were no scores represented for the low-income group at the older ages (6.0- to 6.5-year-olds). The interaction showed that middle-class group average reliability scores increased steadily as children got older and the low-income group rose slightly but remained low (see Figure 2).

The same problem occurred when looking at the differences between methods of data collection. In this case, no studies used the questionnaire-only format for younger children. Figure 3 shows that at the older ages, the reliabilities for picture and device format were as high as or higher than those for the questionnaires. The main effect for the methods of data collection showed a positive, significant contribution of questionnaires to the weighted reliability that was more consistent with the correlation. The interaction, however, made clear that the positive contribution was coming from the older children because no questionnaires were administered to the younger children.

Alternative Regression Analysis

Because of the strong impact of age on these analyses, an additional analysis was completed to examine the full range of ages represented by these studies.

Figure 1  The effect of interaction between age and number of scale items on internal reliability.

Figure 2  The effect of interaction between age and SES on internal reliability.

Figure 3  The effect of interaction between age and data collection method on internal reliability.

In the initial analysis, information regarding all of the ages in a study was combined to create a composite age score, thus losing valuable information regarding the range of ages in each study. The alternative analysis included all the ages of the children in a study and the individual reliabilities obtained for those ages when reported (see Appendix A for studies that reported more than one age). Unfortunately, this analysis violates the assumption of independence of effect size by using multiple effect sizes from the same study and using data collected by the same research team; thus, the results need to be interpreted cautiously on the basis of these redundancies (Matt & Cook, 1994). To deal with the problem of multiple data points representing the same study, Matt and Cook (1994) suggest using the covariance matrix obtained from each of the studies. This was not possible for this meta-analysis because of the lack of reporting for this statistic. Another more conservative method is to create a weighted composite that combines the sample information; this is the technique used in the initial analysis.

The alternative analysis used the same methodology as the initial analysis, with the only exception being the addition of information from studies reporting reliabilities for more than one target age. Instead of combining the ages to create an age composite for each study, studies with more than one age group were entered as a separate study for each age group. Thus, the number of studies for the analysis increased from 22 to 30. Information on the descriptive data and correlations showed almost no change between the two analyses (see Tables 1 to 4). The regression analysis results, in general, were also similar to the previous regression, with one important exception: the interaction between number of items and age, Q(1) = .15, p > .05, was no longer significant (see Table 5). This analysis suggests that the number of items in a measure, and not the age of the child, predicts an increase in the reliability coefficient. Figure 1 shows the graphical depiction of the interaction between age and number of items. This figure shows that the more items in a scale, the higher the reliability. This is consistent across all age groups even though the youngest children do have the lowest reliabilities.

The magnitude of the difference between middle- and low-income groups also changed in the alternative analysis (Figure 2). There was still a significant interaction between age and SES, but both groups now appeared very similar, with average reliability scores increasing as the children got older. The final graphs, on the interaction between data collection methodology and age, show very little change between the initial and alternative analyses (Figure 3).

One of the objectives of this study was to determine whether a significant difference exists between the different types of data collection methodologies that can be used to collect data on self-esteem in young children. Because of the preliterate skills of this age group, many researchers have relied on nonverbal inventories that use pictures to convey the situation they are assessing. Other researchers use individually administered questionnaires to collect data. It would be useful, then, to know if there is a methodology that is more consistent at gaining information on self-esteem in young children. Unfortunately, the results show that information across the ages was insufficient to make a determination about the impact of the different methods used in the collection of self-esteem data in young children (Figure 3).

DISCUSSION

This study has made some interesting contributions to the study of the development of self-esteem measures for young children. The results showed that the reliability of self-esteem measures for young children can be predicted by the setting of the study, number of items in the scale, age of the children being studied,

method of data collection (questionnaires or pictures), and SES of the children. Also, age interacted significantly with SES, number of items (in the initial analysis), and method of data collection in the study. These findings, though not conclusive, provide important directions for what future studies should focus on to further the area of self-concept and self-esteem test development.

The setting in which the instrument was administered is a predictor of increased reliability. All the instruments except for two were administered in a school setting. This makes it difficult to make any strong statements about the school being a preferable setting in which to administer self-esteem scales. It is possible that completing self-esteem instruments in the school setting helps children to answer questions concerning math and reading abilities because those abilities are more salient to children in that setting and therefore easier to retrieve from memory (Davis-Kean & Sandler, 1995; Ericsson & Simon, 1993).

The results supported the findings of Peter and Churchill (1986) that measure characteristics (i.e., number of items in a scale) are positively related to reliability. Both the initial and alternative analyses indicated that reliability increased across all ages as the number of items increased. Overall, these findings are consistent with test theory and contrary to earlier reasoning (see Introduction) that the number of items in preschool self-esteem scales might be affected negatively by the age of young children. The initial analysis indicates that there might be an interaction with age, but this finding was not replicated in the alternative analysis. The reliability coefficient is strongly influenced by the number of similar items that are in a scale (Crocker & Algina, 1986). Therefore, regardless of the age of a respondent, if researchers add enough items of parallel content to an instrument, the instrument will be highly reliable. Recent research by Marsh et al. (1998) also supports this finding. They found that fatigue, due to short attention span, did not impact reliability in young children. Marsh et al. (1998) believe that the length of the instrument may assist in teaching children how to respond accurately to a scale. Hence, the number of items in a scale may not only increase the reliability of an instrument by measurement principle, but may also give children time to understand and respond more appropriately to the questions or stimuli presented to them.

Even though there appears to be little impact of age on the number of items in a scale, both the correlation and regression analyses found that the age of the children had a strong impact on the reliability of the measure. Instruments used with older children (6- to 6.5-year-olds) were more reliable than those used with younger children (4- to 5-year-olds). This result replicates findings from other researchers who have studied perceived competence (Harter & Pike, 1984), self-perceptions (Eccles et al., 1993), self-descriptions (Marsh et al., 1991), and self-understanding (Damon & Hart, 1988). All found that, in general, the reliability of their measures increased as the age of the children increased. Neither this study nor other studies have been able to determine which aspects of the age of a child influence reliability. In general, there are two options: developmental limitations or language limitations.

As discussed earlier, it is unclear whether young children have difficulty answering questions related to the self because of developmental reasons (i.e., they do not have a developed sense of self) or because of the limitation of their language ability (they don't understand the terms). A direct test of this question was not possible in this study. Instead, a proxy variable, SES, indicated that there was a relation between SES and the reliability of a measure, with low-income children showing lower reliabilities. The existence of an interaction between age and SES was difficult to interpret because of insufficient information on the older children; additional analysis suggests that this variable may have little impact. This clouded the interpretation that could be made about the impact of SES on reliability. The use of SES as a proxy variable, however, did not give a sufficiently adequate indication of either developmental level or of language ability; thus, this critical issue remains unresolved. Future research needs to focus on unraveling the possible confound between limitations due to age and to other factors such as SES.

Another contribution made by this study was the use of meta-analysis as a technique to integrate information from the literature and to create parameters that future researchers can use to develop measures. The meta-analysis uses cumulative knowledge across the literature to answer questions that would be difficult to answer in a single primary study (Schmidt, 1992). For this study, the meta-analysis has added important empirical information about questions that plague research in preschool and young school children's self-esteem.

In summary, the primary focus of this study was to test the relation between measurement variables and quality indicators of instruments (i.e., reliability and convergent and discriminant validity). It was clear before conducting any analysis, however, that convergent and discriminant validity could not be examined by using the group of studies selected for inclusion in this study. Few of the studies reported validity, and those that did focused on convergent rather than discriminant validity. Thus, this study used only the reliability coefficient as an indication of the quality of a measure.

a measure. Unfortunately, the reliability coefficient is not the strongest indicator of quality. Convergent and, more importantly, discriminant validity are generally more valuable at assessing construct validity (Crocker & Algina, 1986; Messick, 1995; Peter & Churchill, 1986). Hence, the reliance on the reliability coefficient is a limitation, but one that reflects the available literature.

Another limitation was the reliance on secondary data for looking at the impact of study characteristics on self-esteem in young children. Because of this limitation, it was not possible to directly compare the picture and questionnaire methodology, because not all ages were represented. This type of question would be better answered with an experimental study that looks at direct comparisons between methods across the preschool and early school ages.

This study has summarized and reiterated some of the fundamental problems in constructing instruments to measure self-concept and self-esteem in young children. One of the most important and easiest to reconcile is the psychometric limitations of the studies represented in this meta-analysis. The lack of information on convergent and discriminant validity demonstrates poor test development procedures used by researchers of preschool and young school children's self-esteem. Wylie (1989) has reviewed self-esteem measures for all ages and has found that researchers consistently ignore good psychometric practice when testing their models or theories. Indeed, almost everyone who reviews the self-concept and self-esteem literature criticizes the area for not being more stringent in the development of new measures (Davis-Kean & Sandler, 1995; Hughes, 1984; Wylie, 1989). This study has shown that it is important to look at the developmental implications of studying self-esteem in young children and to make sure that the instruments used are valid for these age groups.

Studies on the development of psychometrically sound instruments, however, are secondary to the need in the field to study the issue of young children's understanding of self-esteem constructs and their ability to communicate this understanding. It is clear from this study that age has a strong impact on the quality of a measure. It is not clear why this is the case. Future studies need to address whether young children are able to understand both the concepts that are being used in these measures (e.g., "I like myself") and whether that understanding can be measured with current techniques. It is possible that young children are not capable of answering questions about whether they like qualities about themselves because they do not think of themselves in those terms or do not have a full understanding of what they are thinking (Flavell, Green, & Flavell, 1995). In fact, research has shown that school helps define for children what they are and are not good at (Eccles et al., 1993). Thus, it may be important in the preschool and early school years to use inferred methods from teachers and parents to get a general idea of children's self-esteem and then to use self-report methods when they get older and can grasp the concepts better. There is also a greater emphasis in the literature on the multidimensionality of self-esteem (Marsh et al., 1991; Marsh, Craven, & Debus, 1998). Thus, future measures should examine not just the school environment but also the home and peer relationships. No matter which avenue is taken, it is clear that more research needs to be done concerning children's understanding of the self before any further instruments are developed. Understanding how young children think about themselves will help us create better measures that are more appropriate for them, thus increasing our knowledge about social cognition.

ACKNOWLEDGMENTS

The authors wish to thank David Cordray, Larry Hedges, Mark Lipsey, William Shadish, Kai Schnabel, and Gregory Davis-Kean for their statistical advice on analyzing the data.

ADDRESSES AND AFFILIATIONS

Corresponding author: Pamela E. Davis-Kean, Gender and Achievement Research Program, 204 S. State St., University of Michigan, Ann Arbor, MI 48109-1290. Howard M. Sandler is at Vanderbilt University, Nashville, TN.
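The discussion above identifies the number of items in a scale as a critical determinant of reliability. In classical test theory that relation is described by the Spearman-Brown prophecy formula; the following minimal Python sketch illustrates it with hypothetical values (the function and numbers below are illustrative, not data from this study):

```python
def spearman_brown(r: float, k: float) -> float:
    """Predicted reliability of a scale lengthened k-fold,
    given its current reliability r (Spearman-Brown prophecy formula)."""
    return (k * r) / (1 + (k - 1) * r)

# A hypothetical 10-item scale with reliability .60, doubled to 20 items:
print(round(spearman_brown(0.60, 2.0), 3))  # 0.75
```

Under this formula, shortening the very brief scales used with the youngest children would be expected to depress reliability, consistent with the pattern reported here.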
902 Child Development
APPENDIX A

Instrument Name | Author(s), Publication Year | Study Location | Sample Size | Information Source | Data Collection Method | Scale Type
1. Bickley Assessment of Self-Esteem, School Esteem Subtest (a) | Figa, 1979 | United States | 107 | Child | Pictures, device, cartoons | Self-esteem, dichotomous
2. Pictorial Self-Concept Scale (PSCS) (a) | Bolea, Felker, & Barnes, 1971 | United States | 549 | Child | Pictures, device, cartoons | Self-concept, Likert
3. Brown's IDS Self-Concept Referents Test (a) | Walker, Bane, & Bryk, 1973 | United States | 1,154 | Child | Pictures, device, cartoons | Self-concept, dichotomous
4. Reading Self-Concept Scale (RSCS) (a) | Chapman & Tunmer, 1995 | New Zealand | 290 | Child | Questionnaires | Self-concept, Likert
5. Purdue Self-Concept Measure (a) | Cicirelli, 1973 | United States | 198 | Child | Pictures, device, cartoons | Self-concept, dichotomous
6. North York Primary Self-Concept Test | Crawford, 1977 | Canada | 1,585 | Child | Pictures, device, cartoons | Self-esteem, dichotomous
7. Competence Perceptions and Subjective Task Value Beliefs | Eccles, Wigfield, Harold, & Blumenfeld, 1993 | United States | 284 | Child | Questionnaires | Self-esteem, Likert
8. Self-View Interview (a) | Eder, 1990 | United States | 60 | Child | Pictures, device, cartoons | Self-concept, dichotomous
9. Pictorial Scale of Perceived Competence and Social Acceptance for Young Children | Harter & Pike, 1984 | United States | 146 | Child | Pictures, device, cartoons | Perceived competence, Likert
10. Children's Self-Concept Index (CSCI) | Helms, Holthouse, Granger, Cicarelli, & Cooper, 1968 | United States | 633 | Child | Pictures, device, cartoons | Self-concept, dichotomous
11. The Self-Social Constructs Test, Self-Esteem Scale | Long, Ziller, & Henderson, 1969 | United States | 39 | Child | Pictures, device, cartoons | Self-concept, Likert
12. Maryland Preschool Self-Concept Scale, Revised (MPSS-R) | Hughes & Leatherman, 1982 | United States | 78 | Child | Pictures, device, cartoons | Self-concept, dichotomous
13. Piers Preschool Pictorial Self-Concept Scale | Jensen, 1985 | United States | 92 | Child | Pictures, device, cartoons | Self-concept, dichotomous
14. Joseph Preschool and Primary Self-Concept Screening Test | Joseph, 1979 | United States | 1,245 | Child | Pictures, device, cartoons | Self-concept, dichotomous
15. Self-Description Questionnaire I (a) | Marsh, Craven, & Debus, 1991 | Australia | 332 | Child | Questionnaires | Self-concept, Likert
16. Martinek-Zaichkowsky Self-Concept Scale | Martinek & Zaichkowsky, 1975 | United States | 18 | Child | Pictures, device, cartoons | Self-concept, dichotomous
17. McDaniel-Piers Young Children's Self-Concept Scale | McDaniel & Leddick, 1978 | United States | 103 | Child | Questionnaires | Self-concept, dichotomous
18. Self-Concept and Motivation Inventory | Davis & Johnston, 1987 | United States | 167 | Child | Pictures, device, cartoons | Self-concept, Likert
19. Primary Self-Concept Scale | McDowell & Lindholm, 1986 | United States | 20 | Child | Pictures, device, cartoons | Self-concept, dichotomous
20. U-Scale Self-Concept Test | Ozehosky & Clark, 1971 | United States | 306 | Child | Pictures, device, cartoons | Self-concept, dichotomous
21. Perez Self-Concept Inventory | Perez, 1982 | United States | 252 | Child | Pictures, device, cartoons | Self-concept, dichotomous
22. Thomas Self-Concept Values Test (TSCVT) | Michael, 1972 | United States | 34 | Child | Pictures, device, cartoons | Self-concept, dichotomous

Note: N.R. = not reported.
(a) Studies used in alternative analysis.
(b) The same data for the study were used for each age represented in the study.
Instrument No. | No. Items | Internal Reliability | Test-Retest Reliability | Parallel Reliability | Converg. Validity | Discrimin. Validity | Age for Initial Analysis | Age(s) for Alternative Analysis (b) | Setting | SES
1 | 14 | .73 | N.R. | N.R. | N.R. | N.R. | 6.0 | 5.0, 6.0 | School | Middle-class
2 | 50 | .85 | N.R. | N.R. | .42 | N.R. | 6.0 | 6.0 | School | Middle-class
3 | 16 | .72 | N.R. | N.R. | N.R. | N.R. | 5.5 | 4.5, 5.0 | School | Poverty
4 | 30 | .85 | N.R. | N.R. | N.R. | N.R. | 6.0 | 5.46, 6.45 | School | Middle-class
5 | 40 | .86 | .70 | N.R. | .44 | N.R. | 5.5 | 4.0, 5.0 | School | Middle-class
6 | 24 | .89 | N.R. | N.R. | N.R. | N.R. | 6.5 | 6.5 | School | Middle-class
7 | 9.25 | .74 | N.R. | N.R. | N.R. | N.R. | 6.5 | 6.5 | School | Middle-class
8 | 5 | .52 | .46 | N.R. | N.R. | N.R. | 5.5 | 3.8, 5.7 | Not school | Middle-class, mixed
9 | 24 | .88 | N.R. | N.R. | N.R. | N.R. | 5.0 | 4.4, 5.5, 6.3 | School | Middle-class
10 | 26 | .80 | .66 | N.R. | N.R. | N.R. | 5.5 | 5.5 | School | Poverty
11 | 4 | .68 | N.R. | N.R. | N.R. | N.R. | 4.5 | 4.5 | Not school | Poverty
12 | 20 | .67 | N.R. | N.R. | .14 | N.R. | 5.5 | 5.5 | School | Middle-class
13 | 30 | .65 | N.R. | N.R. | .52 | N.R. | 4.5 | 4.5 | School | Poverty
14 | 15 | .73 | .87 | N.R. | .51 | N.R. | 5.5 | 5.5 | School | Middle-class
15 | 64 | .94 | N.R. | N.R. | .38 | .52 | 6.0 | 5.0, 6.0 | School | Middle-class
16 | 25 | .88 | N.R. | N.R. | .49 | N.R. | 6.5 | 6.5 | School | Middle-class
17 | 40 | .83 | N.R. | N.R. | .32 | N.R. | 6.5 | 6.5 | School | Middle-class
18 | 12 | .69 | .50 | N.R. | N.R. | N.R. | 5.5 | 5.5 | School | Middle-class
19 | 20 | .36 | N.R. | N.R. | N.R. | N.R. | 4.5 | 4.5 | School | Middle-class
20 | 50 | .67 | N.R. | N.R. | N.R. | N.R. | 5.5 | 5.5 | School | Middle-class
21 | 40 | .80 | .77 | N.R. | N.R. | N.R. | 5.5 | 5.5 | School | Middle-class
22 | 19 | .73 | N.R. | N.R. | N.R. | N.R. | 4.5 | 4.5 | School | Poverty
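As a worked example of how internal-consistency coefficients like those tabulated in this appendix can be summarized, the sketch below computes a sample-size-weighted mean of the reliabilities for the first three instruments. This is an illustration only, not necessarily the pooling procedure used in this meta-analysis:

```python
# (N, internal reliability) pairs for instruments 1-3 in the appendix table.
studies = [(107, 0.73), (549, 0.85), (1154, 0.72)]

# Weight each coefficient by its study's sample size, so large
# samples contribute proportionally more to the summary value.
total_n = sum(n for n, _ in studies)
weighted_mean = sum(n * r for n, r in studies) / total_n
print(round(weighted_mean, 2))  # 0.76
```

Weighting by N is one simple convention; meta-analytic practice often weights by inverse variance instead (see Shadish & Haddock, 1994, cited in the references).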
APPENDIX B

META-ANALYSIS CODING SHEET

Study Characteristics

1. Name of instrument
2. Authors of instrument
3. Publishers or journal
4. Source type: 1. Journal, 2. Conference paper, 3. Manual, 4. Dissertation, 5. Book, 6. Government publication, 7. Unpublished, 8. Other
5. Year of publication
6. Location of study (city and state or country)
7. Setting of study: 1. School, 2. Home, 3. Professional office (e.g., psychologist's office, counselor's office), 4. Other

Method Characteristics

8. Number of samples
9. Sample size and ages of groups in inclusion criteria (sample size, M age, SD age for each group): Pre-kindergarten (4- to 5-year-olds), Kindergarten (5- to 6-year-olds), First grade (6- to 7-year-olds)
10. Overall sample size
11. Type of administration: 1. Individual, 2. Group, 3. Other

Subject Characteristics

12. Number of male/female participants: Male, Female, Not reported
13. Source of information: 1. Child, 2. Parent, 3. Teacher, 4. Other
14. Socioeconomic status of family/child: 1. Middle-class, 2. Poverty, 3. Unknown, 4. Other
15. If mixed SES, what number was in each group? 1. Middle-class, 2. Poverty, 3. Unknown, 4. Other
16. If an academic or cognitive test was given, what were the correlations between self-construct and test (list by SES group if applicable): Name of test, correlation with self-construct, correlation matrix (for subscales)

Measure Characteristics

17. Method of data collection: 1. Pictures/cartoons, 2. Pictures/film, 3. Pictures/other, 4. Questionnaires, 5. Puppets, 6. Interview, 7. Other
18. Type of measure: 1. Self-concept, 2. Self-esteem, 3. Other
19. Number of items in final scale
20. Number of scales or dimensions
21. How many questions in each scale?
22. Are some items negatively worded? If yes, what is the percentage of negatively worded items?
23. Type of scale: 1. Likert scale, 2. Rating scale, 3. Dichotomous (yes/no), 4. Other
24. If Likert or rating scale, number of points on scale?
25. Treatment of respondent uncertainty or ignorance: 1. Forced choice, 2. Neutral point, 3. Unknown, 4. Other

Measure Design Characteristics

26. Does measure have theoretical orientation? Yes / No
27. If yes, what is the theoretical orientation?
28. Procedures used to generate sample of items: 1. Literature only, 2. Pretesting of large sample of items, 3. Expert opinion, 4. Combination of techniques, 5. Unknown, 6. Other measure or questionnaire
29. Are dimensions specified a priori? Yes / No
30. Were dimensions empirically investigated? Yes / No
31. Did study use confirmatory or exploratory factor analysis?

Quality of Measure Estimates

32. Type of reliability and estimate: 1. Cronbach's alpha, 2. KR 20/21, 3. Hoyt's method, 4. Test-retest, 5. Split-half, 6. Parallel tests
33. Reliability of scales (scale name, reliability)
34. Convergent validity correlation, names of measure used, type of respondent
35. Discriminant validity correlation, names of measure used, type of respondent
36. Multitrait/multimethod (MTMM): what methods? What traits? Correlation matrix

REFERENCES

References marked with an asterisk indicate studies included in the meta-analysis.

Baltes, P., Reese, H., & Nesselroade, J. (1977). Measurement. In Life-span developmental psychology: Introduction to research methods (pp. 58-74). Monterey, CA: Brookes/Cole.
Bates, E. (1990). Language about me and you: Pronominal reference and the emerging concept of self. In D. Cicchetti & M. Beeghly (Eds.), The self in transition (pp. 165-182). Chicago: University of Chicago Press.
*Bolea, A. S., Felker, D. W., & Barnes, M. D. (1971). A pictorial self-concept scale for children in K-4. Journal of Educational Measurement, 8(3), 223-224.
Brinthaupt, T. M., & Erwin, L. J. (1992). Reporting about the self: Issues and implications. In T. M. Brinthaupt & R. P. Lipka (Eds.), The self: Definitional and methodological issues (pp. 137-171). New York: State University of New York Press.
Byrne, B. M., Shavelson, R. J., & Marsh, H. W. (1992). Multigroup comparisons in self-concept research: Reexamining the assumption of equivalent structure and measurement. In T. M. Brinthaupt & R. P. Lipka (Eds.), The self: Definitional and methodological issues (pp. 172-203). New York: State University of New York Press.
*Chapman, J. W., & Tunmer, W. E. (1995). Development of young children's reading self-concepts: An examination of emerging subcomponents and their relationship with reading achievement. Journal of Educational Psychology, 87(1), 154-167.
*Cicirelli, V. G. (1973). The Purdue Self-Concept Scale for Preschool Children manual. West Lafayette, IN: Purdue University.
Cook, P. J. (1987). A meta-analysis of studies on self-concept between the years of 1976 and 1986. Unpublished doctoral dissertation, North Texas State University, Denton.
Cooper, H., & Hedges, L. V. (1994). Research synthesis as a scientific enterprise. In H. Cooper & L. V. Hedges (Eds.), Handbook of research synthesis (pp. 3-13). New York: Russell Sage Foundation.
Coopersmith, S. (1967). The antecedents of self-esteem. San Francisco: W. H. Freeman and Company.
*Crawford, P. (1977). Norms for the North York self-concept inventory: Intermediate and primary levels. North York, Canada: North York Board of Education. (ERIC Document Reproduction Service No. ED 226 023)
Crocker, L., & Algina, J. (1986). Introduction to classical and modern test theory. New York: Holt, Rinehart, & Winston.
Damon, W., & Hart, D. (1988). Self-understanding in childhood and adolescence. New York: Cambridge University Press.
*Davis, T. M., & Johnston, J. M. (1987). On the stability and internal consistency of the self-concept and motivation inventory: Preschool/kindergarten form. Psychological Reports, 61, 871-874.
Davis-Kean, P. E., & Sandler, H. M. (1995). Issues related to the study of self-concept in preschool children. Unpublished manuscript, Vanderbilt University, Nashville, TN.
De Wolff, M. S., & van IJzendoorn, M. H. (1997). Sensitivity and attachment: A meta-analysis on parental antecedents of infant attachment. Child Development, 68(4), 571-591.
Durlak, J. A., & Lipsey, M. W. (1991). A practitioner's guide to meta-analysis. American Journal of Community Psychology, 19(3), 291-332.
*Eccles, J., Wigfield, A., Harold, R. D., & Blumenfeld, P. (1993). Age and gender difference in children's self- and task perceptions during elementary school. Child Development, 64, 830-847.
Eder, R. A. (1989). The emergent personologist: The structure and content of 3½-, 5½-, and 7½-year-olds' concepts of themselves and other persons. Child Development, 60, 1218-1228.
*Eder, R. A. (1990). Uncovering young children's psychological selves: Individual and developmental differences. Child Development, 61, 849-863.
Ericsson, K. A., & Simon, H. A. (1993). Protocol analysis: Verbal reports as data (Rev. ed.). Cambridge, MA: MIT Press.
Fantuzzo, J. W., McDermott, P. A., Manz, P. H., Hampton, V. R., & Burdick, N. A. (1996). The Pictorial Scale of Perceived Competence and Social Acceptance: Does it work with low-income urban children? Child Development, 67, 1071-1084.
*Figa, L. E. (1979). The ability of the student-teacher factor to discriminate between subjects on the Bickley Assessment of Self-Esteem. Florence, SC: Francis Marion College. (ERIC Document Reproduction Service No. ED 313 447)
Flavell, J. H., Green, F. L., & Flavell, E. R. (1995). Young children's knowledge about thinking. Monographs of the Society for Research in Child Development, 60(1, Serial No. 243).
Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.
Harter, S. (1983). Developmental perspectives on the self-system. In E. M. Hetherington (Ed.), P. H. Mussen (Series Ed.), Handbook of child psychology: Vol. 4. Socialization, personality, and social development (4th ed., pp. 275-385). New York: Wiley.
*Harter, S., & Pike, R. (1984). The pictorial scale of perceived competence and social acceptance for young children. Child Development, 55, 1969-1982.
Hattie, J. (1992). Self-concept. Hillsdale, NJ: Erlbaum.
Hedges, L. V. (1994). Fixed effects models. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 285-299). New York: Russell Sage Foundation.
*Helms, D., Holthouse, N., Granger, R. L., Cicarelli, V. G., & Cooper, W. H. (1968). The children's self-concept index (CSCI). New York: Westinghouse Learning Corporation.
*Hughes, H. M., & Leatherman, M. K. (1982). Refinement of the Maryland preschool self-concept scale. Fayetteville: University of Arkansas. (ERIC Document Reproduction Service No. ED 222 568)
Hughes, H. M. (1984). Measures of self-concept and self-esteem for children ages 3-12 years: A review and recommendations. Clinical Psychology Review, 4, 657-692.
*Jensen, M. A. (1985). Development of a preschool self-concept scale. Early Child Development and Care, 22(2-3), 89-107.
*Joseph, J. (1979). Joseph preschool and primary self-concept screening test. Chicago: Stoelting.
Kagen, S. L., Moore, E., & Bredekamp, S. (1995). Considering children's early development and learning: Toward common views and vocabulary (Report No. 95-03). Washington, DC: National Education Goals Panel.
Lewis, M., & Brooks-Gunn, J. (1979). Social cognition and the acquisition of self. New York: Plenum.
*Long, B. H., Ziller, R. C., & Henderson, E. H. (1969). The self-social constructs test. Towson, MD: Goucher College. (ERIC Document Reproduction Service No. ED 033 744)
*Marsh, H. W., Craven, R. G., & Debus, R. (1991). Self-concepts of young children 5 to 8 years of age: Measurement and multidimensional structure. Journal of Educational Psychology, 83(3), 377-392.
Marsh, H. W., Craven, R. G., & Debus, R. (1998). Structure, stability, and development of young children's self-concepts: A multicohort-multioccasion study. Child Development, 69, 1030-1053.
*Martinek, T. J., & Zaichkowsky, L. D. (1975). The development and validation of the Martinek-Zaichkowsky self-concept scale for children. Boston: Department of Movement, Health, and Leisure, Boston University. (ERIC Document Reproduction Service No. ED 020 005)
Matt, G. E., & Cook, T. D. (1994). Threats to the validity of research synthesis. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 285-299). New York: Russell Sage Foundation.
*McDaniel, E. D., & Leddick, G. H. (1978). Elementary children's self-concepts, factor structures and teacher ratings. West Lafayette, IN: Purdue University. (ERIC Document Reproduction Service No. ED 175 919)
*McDowell, A. L., & Lindholm, B. W. (1986). Measures of self-esteem by preschool children. Psychological Reports, 59(2), 615-621.
Messick, S. (1995). Validity of psychological assessment: Validation of inferences from persons' responses and performances as scientific inquiry into score meaning. American Psychologist, 50(9), 741-749.
*Michael, J. J. (1972). The Thomas Self-Concept Values Test. In O. K. Buros (Ed.), Seventh mental measurement yearbook (Vol. 1, pp. 371-374). Highland Park, NJ: Gryphon Press.
Mischel, W., Zeiss, R., & Zeiss, A. (1974). Internal-external control and persistence: Validation and implications of the Stanford Preschool Internal-External Scale. Journal of Personality and Social Psychology, 29(2), 265-278.
Orwin, R. G. (1994). Evaluating coding decisions. In H. Cooper & L. V. Hedges (Eds.), The handbook of research synthesis (pp. 285-299). New York: Russell Sage Foundation.
*Ozehosky, R. J., & Clark, E. T. (1971). Verbal and non-verbal measures of self-concept among kindergarten boys and girls. Psychological Reports, 28, 195-199.
*Perez, J. R. (1982). Perez self-concept inventory: Test manual. Dallas Independent School District. (ERIC Document Reproduction Service No. ED 228 877)
Peter, J. P., & Churchill, G. A., Jr. (1986). Relationships among research design choices and psychometric properties of rating scales: A meta-analysis. Journal of Marketing Research, 23(1), 1-10.
Rambo, B. C. (1982). A comparative study of children's self-concept in two preschool programs as measured by child, mother, and teacher. Unpublished doctoral dissertation, Vanderbilt University, Nashville, TN.
Rosenthal, R. (1995). Writing meta-analytic reviews. Psychological Bulletin, 118(2), 183-192.
Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173-1181.
Shadish, W. R., & Haddock, C. K. (1994). Combining estimates of effect size. In H. Cooper & L. V. Hedges (Eds.), Handbook of research synthesis (pp. 261-281). New York: Russell Sage Foundation.
Stipek, D. J., Gralinski, J. H., & Kopp, C. B. (1990). Self-concept development in the toddler years. Developmental Psychology, 26(6), 972-977.
Stipek, D., Recchia, S., & McClintic, S. (1992). Self-evaluation in young children. Monographs of the Society for Research in Child Development, 57(1, Serial No. 226).
Tabachnick, B. G., & Fidell, L. S. (1983). Using multivariate statistics. New York: Harper & Row.
Tabachnick, B. G., & Fidell, L. S. (1996). Using multivariate statistics (3rd ed.). New York: HarperCollins.
*Walker, D. K., Bane, M. J., & Bryk, A. (1973). The quality of the Head Start planned variation data. Cambridge, MA: Huron Institute. (ERIC Document Reproduction Service No. ED 082 856)
White, H. D. (1994). Scientific communication and literature retrieval. In H. Cooper & L. V. Hedges (Eds.), Handbook of research synthesis (pp. 41-55). New York: Russell Sage Foundation.
Wylie, R. C. (1989). Measures of self-concept. Lincoln: University of Nebraska Press.
Yeaton, W. H., & Wortman, P. M. (1993). On the reliability of meta-analytic reviews: The role of intercoder agreement. Evaluation Review, 17(3), 292-309.