This article was downloaded by: [Middle Tennessee State University]

On: 16 August 2013, At: 13:53


Publisher: Routledge
Informa Ltd Registered in England and Wales Registered Number: 1072954
Registered office: Mortimer House, 37-41 Mortimer Street, London W1T 3JH,
UK

Measurement in Physical
Education and Exercise Science
Publication details, including instructions for
authors and subscription information:
http://www.tandfonline.com/loi/hmpe20

Applying a Score Confidence Interval to Aiken's Item
Content-Relevance Index
Randall D. Penfield & Peter R. Giacobbi, Jr.
Published online: 18 Nov 2009.

To cite this article: Randall D. Penfield & Peter R. Giacobbi, Jr. (2004) Applying a
Score Confidence Interval to Aiken's Item Content-Relevance Index, Measurement
in Physical Education and Exercise Science, 8:4, 213-225, DOI: 10.1207/s15327841mpee0804_3

To link to this article: http://dx.doi.org/10.1207/s15327841mpee0804_3

Downloaded by [Middle Tennessee State University] at 13:53 16 August 2013
MEASUREMENT IN PHYSICAL EDUCATION AND EXERCISE SCIENCE, 8(4), 213–225
Copyright © 2004, Lawrence Erlbaum Associates, Inc.

Applying a Score Confidence Interval to Aiken’s Item Content-Relevance Index

Randall D. Penfield
Department of Educational and Psychological Studies
University of Miami

Peter R. Giacobbi, Jr.


Department of Applied Physiology and Kinesiology
University of Florida

Item content-relevance is an important consideration for researchers when developing
scales used to measure psychological constructs. Aiken (1980) proposed a statistic,
V, that can be used to summarize item content-relevance ratings obtained from a
panel of expert judges. This article proposes the application of the Score confidence
interval to Aiken’s V statistic to improve the inference of the unknown population
value of V. The application of the Score confidence interval to V is described, a nu-
merical example is provided, and a demonstration of the Score confidence interval is
presented for ratings obtained in the development of a scale measuring life skills.

Key words: content validity, item content-relevance, Score confidence interval

A primary concern for researchers measuring psychological constructs using
scales and inventories is the extent to which the content of each item of the scale
matches the content domain intended to be measured by the item. This property of
the item is commonly referred to as item content-validity, or item content-rele-
vance (Crocker, Miller, & Franks, 1989; Dunn, Bouffard, & Rogers, 1999; Haynes,
Richard, & Kubany, 1995; Sireci, 1998; Yalow & Popham, 1983). The assessment
of item content-relevance is an important step in ensuring that inferences drawn
from educational and psychological measurements are meaningful (Crocker et al.,
1989; Haynes et al., 1995; Sireci, 1998), and is highly recommended for sport psychology
researchers creating scales for use in applied research settings (Dunn et
al., 1999).

Requests for reprints should be sent to Randall D. Penfield, School of Education, P.O. Box 248065,
University of Miami, Coral Gables, FL 33124-2040, E-mail: penfield@miami.edu

Item content-relevance is commonly assessed by obtaining ratings from a panel
of expert judges on the extent to which the item in question matches the intended
content domain. Although the precise form of the rating method may vary across
applications, typically the item content-relevance ratings are obtained using either
a 5- or 7-point Likert-type rating scale, where the lowest possible rating corre-
sponds to very poor content-relevance and the highest possible rating corresponds
to very good content-relevance. The obtained ratings for each item are then sum-
marized using an appropriate descriptive statistic, such as the mean or some trans-
formation of the mean (Crocker, Llabre, & Miller, 1988; Crocker et al., 1989;
Sireci & Geisinger, 1995).
Dunn et al. (1999) reviewed published articles over a two-decade period in The
Sport Psychologist, the Journal of Sport and Exercise Psychology, the Journal of
Applied Sport Psychology, and Research Quarterly for Exercise and Sport. Their
intent was to assess item content-relevance procedures reported by authors of stud-
ies whose main focus included the development of new psychological inventories.
Of the articles reviewed, several trends were noted. First, the number, characteris-
tics, and qualifications of expert judges used to assess item content-relevance var-
ied considerably from study to study. In many of the studies reviewed, Dunn et al.
(1999) noted that little to no information was presented regarding the judges’ char-
acteristics or why specific judges were chosen to serve as expert raters. Dunn et al.
(1999) recommended that “authors provide some information regarding experts’
familiarity not only with the construct domains under investigation, but also with
the population for whom the test is intended” (p. 18).
A second trend noted by Dunn et al. (1999) was that little emphasis was placed
on using statistical procedures to appropriately summarize the obtained judges’
ratings. To provide guidance concerning available procedures that can be used to
summarize the obtained ratings, Dunn et al. (1999) recommended the use of
Aiken’s V statistic (Aiken, 1980, 1985) because it can not only be used to summa-
rize the magnitude of the obtained expert ratings, but also to test specific hypothe-
ses concerning the values of the ratings for the population. The V statistic is com-
puted using the formula

$$ V = \frac{\bar{X} - l}{k} \tag{1} $$

where $\bar{X}$ represents the sample mean of the judges’ ratings, l represents the lowest
possible rating, and k represents the range of possible values of the rating scale
used (e.g., a scale having possible values extending from 1 to 5 has l = 1 and k = 5 –
1 = 4). The statistic V provides an index of rater endorsement that ranges from 0 to
1. A value of V = 0 is obtained when all judges select the lowest possible rating, and
a value of V = 1 is obtained when all judges select the highest possible rating. Hy-
pothesis tests concerning the unknown population value of V, denoted Vp, can also
be conducted. For example, a scale developer may wish to test the null hypothesis
that Vp = 0.50 against the directional alternative hypothesis that Vp > 0.5; any item
for which the null hypothesis is rejected may be deemed to have a sufficient level
of item content-relevance. The hypothesis test is based on an exact binomial test
(see Aiken, 1985, for details), and Aiken (1985) provides a table containing the critical
values of V required to reject the specific null hypothesis Vp = 0.5 in favor of
the alternative hypothesis Vp > 0.5. Applications of this approach are presented by
Aiken (1985) and Dunn et al. (1999).
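Equation 1 can be sketched in a few lines of Python. The function name `aiken_v` and its arguments are illustrative choices rather than anything defined by Aiken (1980, 1985); the three calls simply check the boundary behavior described above on a 1-to-5 scale with seven judges.

```python
def aiken_v(ratings, lowest, highest):
    """Aiken's V (Equation 1): (mean rating - lowest rating) / range of the scale."""
    k = highest - lowest                      # range of possible ratings
    mean_rating = sum(ratings) / len(ratings)
    return (mean_rating - lowest) / k

# Boundary cases from the text, using a hypothetical panel of seven judges.
v_all_lowest = aiken_v([1] * 7, lowest=1, highest=5)    # every judge selects 1
v_all_highest = aiken_v([5] * 7, lowest=1, highest=5)   # every judge selects 5
v_midpoint = aiken_v([3] * 7, lowest=1, highest=5)      # every judge selects 3

print(v_all_lowest, v_all_highest, v_midpoint)          # prints 0.0 1.0 0.5
```

The midpoint case shows why Vp = 0.5 corresponds to a mean rating at the middle of the scale.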
Although Aiken’s V provides a useful framework for making descriptive state-
ments about the level of content-relevance of an item, the inferential procedure for
testing hypotheses concerning Vp has several drawbacks. First, the critical values
of V listed in the table provided by Aiken (1985) are only applicable for the null
hypothesis that Vp = 0.5, a somewhat arbitrary null hypothesis. Because hypothesis
tests of Vp greater than 0.5 (e.g., 0.6, 0.7, or 0.8) may be of great interest to
researchers wishing to place a more conservative criterion on the value of Vp for
inclusion of the item on the scale, the table of critical values of V provided by Aiken
(1985) may be of limited use to some researchers assessing item content-relevance.
Second, the computation of the binomial probabilities required for the hypothesis
test can be intensive, and thus unless the rating specifications being used
are within the criteria of Aiken’s (1985) table of critical values, the researcher must
be able to compute the binomial tail probabilities either by hand or using statistical
software. Third, the discrete nature of the data inherent in the exact binomial test
leads to difficulties in making inferential statements, particularly when the number
of raters is small, because the critical values of V do not correspond precisely to the
intended Type I error rate. As a result, the specific critical values of V listed in the ta-
ble can be somewhat misleading for a researcher intending to assume a Type I error
rate of 0.05 or 0.01, a commonly encountered problem in conducting exact hypothe-
sis tests of discrete variables (see Agresti, 1990). Fourth, the outcome of a hypothesis
test alone provides little information about the actual value of Vp. That is, the hypoth-
esis test leads only to a decision of whether or not Vp equals a particular value, but
does not provide information concerning what the value of Vp might actually be.
Fifth, the hypothesis test alone provides no information concerning the expected er-
ror of V as an estimate of Vp, and thus provides no information concerning how close
the sample value of V is expected to be to the unknown value of Vp.
The five drawbacks of the binomial-based hypothesis test of Vp discussed earlier
can be overcome through the use of a confidence interval for Vp. The advantages of
a confidence interval for Vp include (a) the existence of rich information concern-
ing the actual value of Vp, in contrast to the reject–accept nature of a hypothesis
test; (b) the existence of information concerning the amount of error expected in
using V as an estimate of Vp; (c) the availability of directional and nondirectional
hypothesis tests concerning Vp using the lower and upper limits of the confidence
interval; (d) a way to test hypotheses concerning Vp regardless of the number of
judges providing ratings, the number of response categories of the rating scale
items, and the desired Type I error rate; (e) a computationally simple way to test
hypotheses about any value of Vp; and (f) the existence of information concerning
whether the sample size is adequate to obtain a desired level of precision in V as an
estimator of Vp. In addition, the use of a confidence interval for Vp is consistent
with the growing emphasis placed on the use of confidence intervals in reporting
all quantitative psychological research (Fidler, 2002).
The difficulty in constructing a confidence interval for Vp is the bounded nature
of V, making confidence intervals based on asymptotic normal distribution as-
sumptions inappropriate. That is, because V is not normally distributed, traditional
confidence intervals for a population mean, such as the Wald interval (Wald, 1943)
given by most introductory statistics texts as

$$ \bar{X} \pm t_{df}\left(\frac{s}{\sqrt{n}}\right), $$

will lead to inaccurate results. However, other methods of constructing a confidence
interval that are not based on the assumption that V is distributed as normal
can be applied. In particular, because V can assume values between 0 and 1, the V
statistic can be conceptualized as a sample proportion in which the number of successes
is given by $n(\bar{X} - l)$ and the number of trials is given by nk. Treating V as a
sample proportion and Vp as the corresponding population proportion, the Score
confidence interval for a population proportion (Wilson, 1927) can be effectively
applied to Vp. The Score confidence interval has the desirable properties of being
asymmetric about the sample proportion (or V in this case) and of not depending on
a normal distribution of the sample proportion (or V in this case), and it has been
shown to be highly effective and accurate, even when the sample size is small and
the population proportion is extreme (Newcombe, 1998; Wilson, 1927).
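The difficulty with the traditional interval can be illustrated with a minimal sketch that applies the t-based formula above to the seven ratings used in the numeric example later in this article (3, 3, 4, 5, 5, 5, 5). The critical value 2.447 is the tabled two-sided 95% value of t with 6 degrees of freedom and is hard-coded here as an assumption, so that no statistics package is required.

```python
import math
import statistics

ratings = [3, 3, 4, 5, 5, 5, 5]    # ratings from the article's numeric example
lowest, highest = 1, 5
k = highest - lowest
n = len(ratings)

mean_rating = statistics.mean(ratings)
sd = statistics.stdev(ratings)      # sample standard deviation (n - 1 denominator)
t_crit = 2.447                      # assumed tabled value: two-sided 95%, df = 6

half_width = t_crit * sd / math.sqrt(n)
wald_lower = mean_rating - half_width
wald_upper = mean_rating + half_width

# Rescale both limits to the 0-to-1 metric of V using Equation 1.
v_lower = (wald_lower - lowest) / k
v_upper = (wald_upper - lowest) / k

print(f"limits on the rating scale: {wald_lower:.2f} to {wald_upper:.2f}")
print(f"limits on the V metric:     {v_lower:.2f} to {v_upper:.2f}")
```

For these ratings the upper limit falls above the highest possible rating of 5, and therefore above 1 on the V metric, which is the kind of impossible limit that a bounded statistic invites.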
The use of the Score confidence interval as a method of constructing a confidence
interval for Vp may provide a valuable improvement in the inferential procedures
available for the interpretation of item content-relevance ratings. The purpose
of this article is to introduce the application of the Score confidence interval to
Aiken’s V, and display the rich information provided by the confidence interval. To
this end, the remainder of this article is divided into four sections. The first section
introduces the Score confidence interval and its application to Aiken’s V. The sec-
ond section provides a numeric example of the computation of the Score confi-
dence interval. The third section applies the Score confidence interval to a real set
of item content-relevance ratings obtained in the development of an instrument
measuring life skills. The final section provides concluding remarks on the appli-
cation of the Score confidence interval to Aiken’s V.

THE SCORE CONFIDENCE INTERVAL

Consider the case of a group of n judges rating an item using ratings that have a
possible range of k. Note that k can be computed as the highest possible rating mi-
nus the lowest possible rating, or as the number of points on the rating scale minus
one. Based on the ratings of the n judges, suppose that the statistic V is computed
using Equation 1. Then, the lower (L) and upper (U) limits to a C% Score confidence
interval for Vp can be obtained using the following form originally developed
by Wilson (1927):

$$ L = \frac{2nkV + z^2 - z\sqrt{4nkV(1 - V) + z^2}}{2(nk + z^2)} \tag{2} $$

$$ U = \frac{2nkV + z^2 + z\sqrt{4nkV(1 - V) + z^2}}{2(nk + z^2)} \tag{3} $$

In Equations 2 and 3, z corresponds to the value of a standard normal distribution
such that C% of the area of the distribution lies between –z and z (e.g., for a 95%
confidence interval z = 1.96). The Score confidence interval as it pertains to a pop-
ulation proportion is described in greater detail by Agresti (1996), Newcombe
(1998), and Penfield (2003). The derivation of Equations 2 and 3 is presented in the
Appendix. Simplified forms of the lower and upper limits of the Score confidence
interval presented in Equations 2 and 3 are given by
$$ L = \frac{A - B}{C} \tag{4} $$

$$ U = \frac{A + B}{C} \tag{5} $$

where

$$ A = 2nkV + z^2 \tag{6} $$

$$ B = z\sqrt{4nkV(1 - V) + z^2} \tag{7} $$

$$ C = 2(nk + z^2) \tag{8} $$

The Score confidence interval has the desirable property of being asymmetric
about V. If V is greater than 0.5, then the Score confidence interval will extend fur-
ther below V than above V, and if V is less than 0.5, then the Score confidence inter-
val will extend further above V than below V. In addition, the bounds of the Score
confidence interval cannot extend below 0 or above 1.0, thus overcoming a problem
of impossible confidence interval limits commonly encountered in the application
of the traditional Wald interval to bounded variables. The results of empirical
investigations of the Score confidence interval indicate that the Score
confidence interval is typically substantially shorter in length, and has a higher
probability of containing the population parameter of interest than the traditional
Wald confidence interval (Ghosh, 1979; Newcombe, 1998).
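The asymmetry and boundedness just described can be checked with a short sketch of Equations 4 through 8. The function name `score_interval` and the choice of n = 7 judges on a 5-point scale (k = 4) are illustrative assumptions, not part of the original presentation.

```python
import math

def score_interval(v, n, k, z=1.96):
    """Lower and upper Score limits for Vp, following Equations 4 through 8."""
    a = 2 * n * k * v + z ** 2                              # Equation 6
    b = z * math.sqrt(4 * n * k * v * (1 - v) + z ** 2)     # Equation 7
    c = 2 * (n * k + z ** 2)                                # Equation 8
    return (a - b) / c, (a + b) / c                         # Equations 4 and 5

n, k = 7, 4   # assumed panel: seven judges, 5-point scale
for v in (0.25, 0.50, 0.75, 1.00):
    lower, upper = score_interval(v, n, k)
    # Distance from V to each limit; unequal distances indicate asymmetry.
    print(f"V={v:.2f}  L={lower:.3f}  U={upper:.3f}  "
          f"below={v - lower:.3f}  above={upper - v:.3f}")
```

At V = 0.50 the two distances agree, at V = 0.75 the interval reaches further below V, at V = 0.25 it reaches further above V, and even at V = 1.00 the upper limit does not pass 1.0.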

A NUMERIC EXAMPLE

Consider the situation of a sample of seven expert judges providing content-relevance
ratings for a particular item of a scale. The ratings are assigned the values 1
to 5, where 1 corresponds to a very poor fit of the item to the specified content do-
main, and 5 corresponds to a very good fit of the item to the specified content do-
main. The ratings for this sample of seven judges are: 3, 3, 4, 5, 5, 5, and 5. Note
that in this case $\bar{X}$ = 4.29, l = 1, n = 7, and k = 4. Based on these ratings, the value of
V is computed by

$$ V = \frac{4.29 - 1.00}{4.00} = 0.82. $$

The obtained value of V tells us that the sampled raters tended to provide relatively
high ratings for this item. The value of V may deviate substantially, however, from
the population value it estimates (Vp), and thus it is useful to construct a confidence
interval for Vp. Let us construct a 95% confidence interval for Vp using the Score
confidence interval. Note that a 95% confidence interval uses z = 1.96. Using this
information, the terms A, B, and C of Equations 6, 7, and 8 are given by

$$ A = 2 \times 7 \times 4 \times 0.82 + 1.96^2 = 49.76 $$

$$ B = 1.96\sqrt{4 \times 7 \times 4 \times 0.82 \times (1 - 0.82) + 1.96^2} = 8.85 $$

$$ C = 2(7 \times 4 + 1.96^2) = 63.68. $$

Substituting the values of A, B, and C into Equations 4 and 5 yields the lower and
upper limits of

$$ L = \frac{49.76 - 8.85}{63.68} = 0.64 $$

$$ U = \frac{49.76 + 8.85}{63.68} = 0.92. $$

Thus, we can be 95% confident that the value of Vp lies between 0.64 and 0.92.
Note that the lower bound of 0.64 lies 0.18 units below V, and the upper bound of
0.92 lies 0.10 units above V. The Score confidence interval provides more room for
error below V than above V because the value of V was closer to 1.0 than to 0.
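The arithmetic of this example can be retraced with the following sketch. V is rounded to two decimals before A, B, and C are computed, as an assumption made only to mirror the hand calculation above; the variable names are illustrative.

```python
import math

ratings = [3, 3, 4, 5, 5, 5, 5]
lowest, k, z = 1, 4, 1.96
n = len(ratings)

# Equation 1, rounded to two decimals to match the worked example (assumption).
v = round((sum(ratings) / n - lowest) / k, 2)

a = 2 * n * k * v + z ** 2                              # Equation 6
b = z * math.sqrt(4 * n * k * v * (1 - v) + z ** 2)     # Equation 7
c = 2 * (n * k + z ** 2)                                # Equation 8
lower = (a - b) / c                                     # Equation 4
upper = (a + b) / c                                     # Equation 5

print(f"V={v:.2f} A={a:.2f} B={b:.2f} C={c:.2f} L={lower:.2f} U={upper:.2f}")
# prints: V=0.82 A=49.76 B=8.85 C=63.68 L=0.64 U=0.92
```

Carrying the unrounded value V = 23/28 through the same steps gives the same limits to two decimals.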

APPLYING THE SCORE INTERVAL TO REAL DATA



As an example of applying the Score confidence interval to item content-relevance
ratings, we will use content-relevance ratings obtained from seven expert judges
for the items of a scale designed to measure an exploratory construct, life skills.
The expert judges were selected on the criterion of having gained expertise in the
area of teaching children life skills through participation in one or more national
training seminars. The initial scale consisted of 60 items intended to measure three
subconstructs: setting goals, communicating with others, and emotional control.
Each item consisted of a stem, followed by five response options ranging from 1
(strongly disagree) to 5 (strongly agree). Examples of items on the life skills scale
are (a) I introduce myself to people I meet, (b) I set goals to improve myself, and
(c) I cannot control my anger. The life skills scale was developed with the ultimate
goal of comparing the levels of life skills of adolescents who do and do not partici-
pate in a particular athletic program.
The seven expert judges were instructed to rate the consistency of each item
with the subconstruct with which he or she viewed the item as being most consistent.
The judges provided ratings on a scale ranging from 1 (complete lack of consis-
tency) to 5 (very strong consistency). The use of seven judges is typical of con-
tent-relevance studies (Haynes et al., 1995; Lynn, 1986). Although the initial life
skills scale contained 60 items, only the first 20 will be described here to keep the
discussion concise. The ratings of the seven judges for each of the first 20 items are
displayed on the left side of Table 1. In addition, for each item, Table 1 displays
Aiken’s V, the lower and upper bounds of the 90% Score confidence interval for Vp,
and the lower and upper bounds of the 95% Score confidence interval for Vp. For
example, Item 1 was associated with a value of V equal to 0.89, a 90% Score confi-
dence interval extending from 0.76 to 0.96, and the 95% Score confidence interval
extending from 0.73 to 0.96.
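The Item 1 values quoted above can be reproduced from its rating frequencies (one judge chose 2 and six judges chose 5). The sketch below is one possible implementation; the helper names are hypothetical, and z is obtained from Python's standard-library `statistics.NormalDist` rather than from a table.

```python
import math
from statistics import NormalDist

def v_from_frequencies(freqs, lowest=1):
    """Aiken's V from the counts of judges choosing each scale point (lowest first)."""
    n = sum(freqs)                       # number of judges
    k = len(freqs) - 1                   # range of the rating scale
    total = sum(count * (lowest + i) for i, count in enumerate(freqs))
    return (total / n - lowest) / k, n, k

def score_interval(v, n, k, level):
    """Score limits (Equations 4 and 5) for a two-sided confidence level such as 0.90."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    a = 2 * n * k * v + z ** 2
    b = z * math.sqrt(4 * n * k * v * (1 - v) + z ** 2)
    c = 2 * (n * k + z ** 2)
    return (a - b) / c, (a + b) / c

item1 = [0, 1, 0, 0, 6]                  # frequencies of ratings 1, 2, 3, 4, 5
v, n, k = v_from_frequencies(item1)
ci90 = score_interval(v, n, k, 0.90)
ci95 = score_interval(v, n, k, 0.95)

print(f"V={v:.2f}  90% CI=({ci90[0]:.2f}, {ci90[1]:.2f})  "
      f"95% CI=({ci95[0]:.2f}, {ci95[1]:.2f})")
# prints: V=0.89  90% CI=(0.76, 0.96)  95% CI=(0.73, 0.96)
```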
TABLE 1
Outcomes of Ratings, Values of Aiken’s V, and 90% and 95% Score
Confidence Interval for 20 Items of the Life Skills Questionnaire

        Rating Frequency                90% CI           95% CI
Item    1    2    3    4    5   V     Lower   Upper     Lower   Upper
                                      Limit   Limit     Limit   Limit
   1    0    1    0    0    6   .89*    .76     .96       .73     .96
   2    0    2    3    2    0   .50     .35     .65       .33     .67
   3    0    1    1    3    2   .71     .56     .83       .53     .85
   4    0    2    1    1    3   .68     .52     .80       .49     .82
   5    0    0    2    1    4   .82*    .68     .91       .64     .92
   6    0    0    1    2    4   .86*    .72     .93       .69     .94
   7    0    0    2    4    1   .71     .56     .83       .53     .85
   8    1    1    2    1    2   .57     .42     .71       .39     .74
   9    1    1    3    1    1   .50     .35     .65       .33     .67
  10    1    0    1    5    0   .61     .45     .74       .42     .76
  11    0    0    0    5    2   .82*    .68     .91       .64     .92
  12    0    0    0    2    5   .93*    .81     .98       .77     .98
  13    0    1    1    3    2   .71     .56     .83       .53     .85
  14    1    1    1    4    0   .54     .39     .68       .36     .71
  15    0    0    0    1    6   .96*    .86     .99       .82     .99
  16    0    0    0    2    5   .93*    .81     .98       .77     .98
  17    0    0    1    4    2   .79*    .64     .89       .61     .90
  18    0    0    3    4    0   .64     .49     .77       .46     .79
  19    0    0    1    2    4   .86*    .72     .93       .69     .94
  20    0    0    1    3    3   .82*    .68     .91       .64     .92

Note. The critical value of V for testing the null hypothesis that Vp = 0.5 according to Aiken’s
(1985) table of critical values is 0.75 under a Type I error rate of 0.05. The items for which the null
hypothesis is rejected according to Aiken’s critical value are noted with *. CI = confidence interval.

The 90% and 95% confidence intervals provide valuable information concerning
the expected precision of V as an estimator of Vp, and inform decisions concerning
sample size. For example, the typical length of the 95% Score confidence
interval for the data presented in Table 1 is approximately 0.30, although this value
varies across the items depending on the specific value of V for the item. Using the
typical length of the interval as a measure of precision of V as an estimator of Vp, a
researcher may make statements concerning the adequacy of the precision of V. For
example, a researcher may set a criterion level of typical length of a 95% confidence
interval equal to 0.20 to ensure adequate precision of V as an estimate of Vp.
If the typical length of the Score confidence interval exceeds this (as is the case
with the example provided in Table 1), then the researcher may opt to examine the
content of the items for potential lack of content-relevance, or increase the number
of expert judges providing ratings for the items of the scale. Increasing the number
of expert judges will act to increase the precision of V, and thus decrease the length
of the confidence interval. If the researcher sets a criterion level of typical length of
a 95% confidence interval equal to 0.30, then the data presented in Table 1 are
likely sufficient to meet this criterion. Note that, although we have used interval
length criteria of 0.20 and 0.30 in the aforementioned discussion, these values
were heuristically chosen to serve the purpose of the example. Score confidence
interval lengths of 0.20 and 0.30 correspond to distances of 0.80 and 1.20, respec-
tively, on a 5-point rating scale (a 5-point rating scale spans a total distance of four
units, and thus the lengths of 0.20 and 0.30 on a range of 0 to 1.0 correspond to 0.80
and 1.20 on a range of 1 to 5). As a result, we viewed the cited criteria to be meaningful
for applied settings, but acknowledge that the criteria adopted by a particular
researcher may vary depending on the content area and intended use of the obtained
scale scores. We are not aware of any research providing guidelines
concerning criteria for acceptable lengths of confidence intervals for item con-
tent-relevance studies.
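The relation between the number of judges and the length of the interval can be explored with a small planning sketch. It relies on the fact that the length U – L equals 2B/C in Equations 4 and 5. The planning value V = 0.75, the function names, and the search limit `max_n` are assumptions introduced for illustration; a different anticipated V would give different answers.

```python
import math

def score_length(v, n, k, z=1.96):
    """Length U - L of the Score interval, which equals 2B / C (Equations 4 to 8)."""
    b = z * math.sqrt(4 * n * k * v * (1 - v) + z ** 2)
    c = 2 * (n * k + z ** 2)
    return 2 * b / c

def judges_needed(v, k, target, z=1.96, max_n=1000):
    """Smallest number of judges whose interval length is at most `target`."""
    for n in range(2, max_n + 1):
        if score_length(v, n, k, z) <= target:
            return n
    return None   # target not reached within max_n judges

v_assumed, k = 0.75, 4    # hypothetical planning value of V on a 5-point scale
for n in (7, 10, 15, 20, 30):
    print(f"n={n:2d}  95% interval length={score_length(v_assumed, n, k):.3f}")

print("judges needed for length <= 0.20:", judges_needed(v_assumed, k, 0.20))
```

With seven judges the length under this planning value is close to the typical length of 0.30 noted for Table 1, and the search illustrates how many additional judges a 0.20 criterion would require.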
The Score confidence interval can also be used to assess hypotheses concerning
the value of Vp. For a directional test of the null hypothesis that Vp equals some
value, V0, using a Type I error rate of α, the acceptance of the null hypothesis is associated
with a (1 – 2α) × 100% confidence interval about V. The null hypothesis is
accepted if the confidence interval for Vp contains the null value of V0, and the null
hypothesis is rejected in favor of the directional alternative hypothesis if the lower
limit of the confidence interval exceeds the null value, V0. As an example, consider
a researcher interested in testing the null hypothesis that Vp = 0.5 against the alter-
native hypothesis that Vp > 0.5 (note that the value of Vp = 0.5 corresponds to mean
rating equaling the middle point on the rating scale). Items for which the null hy-
pothesis is rejected are retained, and items for which the null hypothesis is ac-
cepted are flagged for revision or removal.
Applying the hypothesis test to the items presented in Table 1, we see that Items
2, 8, 9, 10, 14, and 18 have a 90% confidence interval that contains 0.5, and thus for
each of these items we accept the null hypothesis that Vp = 0.5. These items should
be examined for their content, and either revised or removed from the scale. The
remaining items are retained because there is sufficient evidence to support the hy-
pothesis that Vp exceeds 0.5, and thus that the mean rating in the population of rat-
ers reflects a positive endorsement of the item (e.g., the mean rating in the popula-
tion exceeds 3.0 on a 5-point scale). Note that using the significance test proposed
by Aiken (1980, 1985), the critical value of V for testing the null hypothesis that Vp
= 0.5 at α = 0.05 is equal to 0.75 (Aiken, 1985, p. 134). The items for which the null
hypothesis is rejected using this critical value are denoted by an asterisk next to the
value of V in Table 1. Using Aiken’s critical value, the null hypothesis is accepted
for Items 2, 3, 4, 7, 8, 9, 10, 13, 14, and 18. Based on these results, Aiken’s critical
value appears to be more conservative than the Score confidence interval. The
conservative nature of Aiken’s hypothesis test, relative to that of the results pro-
vided by the Score confidence interval, is most likely due to the fact that the critical
values provided by Aiken’s (1985) table do not correspond precisely to the in-
tended Type I error rate because of the highly discrete nature of the variable under
investigation. This problem, as noted earlier, is commonly encountered with exact
tests of discrete variables (Agresti, 1996).
Unlike the hypothesis test proposed by Aiken (1980, 1985), the use of confi-
dence intervals permits us to assess any arbitrary null hypothesis. For example, we
may wish to make the criteria of item revision more stringent, through testing the
null hypothesis that Vp = 0.75. Note that a value of 0.75 is associated with an average
rating of 4 on a 5-point scale with response options ranging from 1 to 5, or good
fit to the intended construct. In this case, determining the items for which the null
hypothesis of Vp = 0.75 is accepted using a Type I error rate of 0.05 can be conducted
by determining the items for which the lower limit of the 90% confidence interval
does not exceed 0.75 (all items but 1, 12, 15, and 16). Although the criterion value of 0.75 may be
too stringent in practical applications, we present it here strictly for didactic pur-
poses to illustrate the flexibility of the Score confidence interval over the hypothe-
sis test proposed by Aiken (1980, 1985). Researchers in the beginning stages of
scale development may choose to select a more liberal criterion value (e.g., V0 =
0.4) or use a higher Type I error rate (e.g., α = 0.10), particularly if the number of
expert raters used is small.
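The directional decision rule described above (reject the null hypothesis when the lower limit of the 90% interval exceeds V0, for α = 0.05) can be sketched as follows. The lower limits are copied from Table 1 as printed, so they are rounded to two decimals, and the function names are hypothetical.

```python
# Lower limits of the 90% Score confidence interval for Items 1 to 20 (Table 1).
lower_90 = {
    1: 0.76, 2: 0.35, 3: 0.56, 4: 0.52, 5: 0.68, 6: 0.72, 7: 0.56,
    8: 0.42, 9: 0.35, 10: 0.45, 11: 0.68, 12: 0.81, 13: 0.56, 14: 0.39,
    15: 0.86, 16: 0.81, 17: 0.64, 18: 0.49, 19: 0.72, 20: 0.68,
}

def rejected(lower_limits, v0):
    """Items whose lower limit exceeds v0, so H0: Vp = v0 is rejected in favor of Vp > v0."""
    return [item for item, lower in lower_limits.items() if lower > v0]

def not_rejected(lower_limits, v0):
    """Items whose lower limit does not exceed v0; these are flagged for review."""
    return [item for item, lower in lower_limits.items() if lower <= v0]

print("flagged at V0 = 0.50:", not_rejected(lower_90, 0.50))
# prints: flagged at V0 = 0.50: [2, 8, 9, 10, 14, 18]
print("retained at V0 = 0.75:", rejected(lower_90, 0.75))
# prints: retained at V0 = 0.75: [1, 12, 15, 16]
```

Changing `v0` is all that is needed to examine a more liberal criterion such as 0.4; a different Type I error rate would require the lower limits of the corresponding (1 – 2α) × 100% interval instead.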
As a final note on the application of the Score confidence interval to Aiken’s V,
an often useful method for assessing item content-relevance is to ask each expert
judge to rate the content-relevance of each item regarding each subconstruct in-
tended to be measured by the scale. Computing Aiken’s V for each combination of
item and subconstruct permits the scale developer to obtain an index of con-
tent-relevance for each item in relation to each subconstruct (see Dunn et al.,
1999). Because an item should yield higher values of V for subconstructs intended
to be measured by the item than subconstructs not intended to be measured by the
item, this approach can yield useful convergent and divergent validity information,
and as a result lead to more accurate conclusions concerning item content-rele-
vance. The Score confidence interval can be applied to this situation in a similar
fashion as described earlier. In this case, a Score confidence interval would be con-
structed for each subconstruct in relation to each item. Although the content-rele-
vance ratings collected for the life skills scale described earlier do not accommo-
date this particular analysis (because each item was not rated in relation to each
subconstruct), we considered it important to bring this potentially useful application of
the Score confidence interval to the reader’s attention.

DISCUSSION

The application of a Score confidence interval to Aiken’s V statistic was proposed
for item content-relevance ratings. The Score confidence interval provides rich information
concerning the likely value of the population value of V, Vp, and can be
easily used to test hypotheses concerning the value of Vp. A numeric example and
the application of the Score confidence interval to a real data set of content-rele-
vance ratings were used to illustrate the calculations required for the Score confi-
dence interval and the flexibility of the confidence interval to answer many ques-
tions about the content-relevance of scale items.
A possible obstacle to the successful application of the Score confidence interval
to content-relevance ratings is the computation of the interval; however, using
Equations 4, 5, 6, 7, and 8, the computation of the Score confidence interval using
any data management program (e.g., Excel) is a simple matter, requiring only six
columns of elementary computations: (a) a column computing V, (b) a column
computing A as displayed in Equation 6, (c) a column computing B as displayed in
Equation 7, (d) a column computing C as displayed in Equation 8, (e) a column
computing the lower limit to the Score confidence interval as displayed in Equa-
tion 4, and (f) a column computing the upper limit to the Score confidence interval
as displayed in Equation 5. Using this framework, the 90% and 95% Score confidence
intervals displayed in Table 1 for the 20 items of the life skills scale were
computed in just a few minutes.
In conclusion, the application of the Score confidence interval to Aiken’s V can
enhance the analysis of item content-relevance by providing valuable information
concerning the expected precision of Aiken’s V as an estimator of the unknown
population value, Vp. The primary obstacle to the implementation of the Score con-
fidence interval is its computational complexity; however, as described earlier, the
Score confidence interval can be computed with little difficulty using any data
management software. One unresolved issue of the application of the Score confidence
interval to item content-relevance studies concerns criteria of acceptable
length of the interval. Because meaningful guidelines concerning acceptable inter-
val lengths have not yet been established, this is an important topic for future re-
search in the area of scale validation.

REFERENCES

Agresti, A. (1990). Categorical data analysis. New York: Wiley.
Agresti, A. (1996). An introduction to categorical data analysis. New York: Wiley.
Aiken, L. R. (1980). Content validity and reliability of single items or questionnaires. Educational and
Psychological Measurement, 40, 955–959.
Aiken, L. R. (1985). Three coefficients for analyzing the reliability and validity of ratings. Educational
and Psychological Measurement, 45, 131–142.
Crocker, L., Llabre, M., & Miller, M. D. (1988). The generalizability of content validity ratings. Jour-
nal of Educational Measurement, 25, 287–299.
Crocker, L., Miller, M. D., & Franks, E. A. (1989). Quantitative methods for assessing the fit between
test and curriculum. Applied Measurement in Education, 2, 179–194.
Dunn, J. G. H., Bouffard, M., & Rogers, W. T. (1999). Assessing content-relevance in sport psychology
scale-construction research: Issues and recommendations. Measurement in Physical Education and
Exercise Science, 3, 15–36.
Fidler, F. (2002). The fifth edition of the APA Publication Manual: Why its statistics recommendations
are so controversial. Educational and Psychological Measurement, 62, 749–770.
Ghosh, B. K. (1979). A comparison of some approximate confidence intervals for the binomial parame-
ter. Journal of the American Statistical Association, 74, 894–900.
Haynes, S. N., Richard, D. C., & Kubany, E. S. (1995). Content validity in psychological
assessment: A functional approach to concepts and methods. Psychological Assessment, 7, 238–247.
Lynn, M. R. (1986). Determination and quantification of content validity. Nursing Research, 35,
382–385.
Newcombe, R. G. (1998). Two-sided confidence intervals for the single proportion: Comparison of
seven methods. Statistics in Medicine, 17, 857–872.
Penfield, R. D. (2003). A score method of constructing asymmetric confidence intervals for the mean of
a rating scale item. Psychological Methods, 8, 149–163.
Sireci, S. G. (1998). The construct of content validity. Social Indicators Research, 45, 83–117.
Sireci, S. G., & Geisinger, K. F. (1995). Using subject-matter experts to assess content representation:
An MDS analysis. Applied Psychological Measurement, 19, 241–255.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters when the number of ob-
servations is large. Transactions of the American Mathematical Society, 54, 426–482.
Wilson, E. B. (1927). Probable inference, the law of succession, and statistical inference. Journal of the
American Statistical Association, 22, 209–212.
Yalow, E. S., & Popham, W. J. (1983). Content validity at the crossroads. Educational Researcher,
12(8), 10–14, 21.

APPENDIX

Consider the sample proportion, p, obtained by dividing the number of successes
of a dichotomous variable by the number of trials, x. Wilson’s (1927) Score
confidence interval begins with the following equation, which is used to conduct
an asymptotic test of the null hypothesis that the population proportion, π,
equals some value π0:

$$ z = \frac{p - \pi_0}{\sqrt{\dfrac{\pi_0 (1 - \pi_0)}{x}}}. \tag{A1} $$

When both sides of Equation A1 are squared, the terms can be rearranged to give

$$ z^2 (\pi_0 - \pi_0^2) = x (p - \pi_0)^2. \tag{A2} $$

Expanding Equation A2 into a typical quadratic form yields

$$ \pi_0^2 (x + z^2) + \pi_0 (-z^2 - 2xp) + xp^2 = 0. \tag{A3} $$

Next, Equation A3 can be solved for π0 using the quadratic formula, which gives

$$ \frac{2px + z^2 \pm z \sqrt{4px(1 - p) + z^2}}{2(x + z^2)}. \tag{A4} $$

In the context of Aiken’s V, the value of V is conceptualized as a sample proportion
obtained by dividing the number of successes, n(X̄ – l), by the number of trials, nk,
where X̄ is the mean of the judges’ ratings, n equals the number of judges, k
represents the range of possible ratings, and l represents the lowest possible
rating. Substituting V for p and nk for x in Equation A4 yields

$$ \frac{2nkV + z^2 \pm z \sqrt{4nkV(1 - V) + z^2}}{2(nk + z^2)} \tag{A5} $$

which is the result given in Equations 2 and 3.
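
The derivation can be checked numerically. The sketch below is illustrative only: the function names `score_limits` and `a1_statistic` and the input values are hypothetical. It confirms that both limits from Equation A4 are roots of Equation A3 and that substituting either limit for π0 in Equation A1 returns a test statistic equal to z in absolute value.

```python
import math


def score_limits(p, x, z):
    """Both roots of Equation A3, computed with the closed form of Equation A4."""
    centre = 2 * p * x + z ** 2
    half_width = z * math.sqrt(4 * p * x * (1 - p) + z ** 2)
    denominator = 2 * (x + z ** 2)
    return (centre - half_width) / denominator, (centre + half_width) / denominator


def a1_statistic(p, pi0, x):
    """Test statistic of Equation A1 for a hypothesized proportion pi0."""
    return (p - pi0) / math.sqrt(pi0 * (1 - pi0) / x)


# Hypothetical inputs: sample proportion, number of trials, critical value.
p, x, z = 0.9, 20, 1.96
lower, upper = score_limits(p, x, z)

for pi0 in (lower, upper):
    # Left-hand side of Equation A3; it should vanish at each limit.
    residual = pi0 ** 2 * (x + z ** 2) + pi0 * (-z ** 2 - 2 * x * p) + x * p ** 2
    print(f"pi0 = {pi0:.4f}  A3 residual = {residual:.1e}  "
          f"A1 statistic = {a1_statistic(p, pi0, x):+.4f}")
```

The statistic is positive at the lower limit and negative at the upper limit, because p lies above the first and below the second.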
