
Chapter 5 Validity

Refining the Definition of Validity


Customary definition of validity: the extent to which a test measures what it purports to measure.

1. However, it is imprecise to refer to the validity of a test per se. One establishes the validity of a test
score when used for a particular purpose (e.g. don’t ask “Is the Rorschach valid?”; ask “Is the
Depression Index from the Rorschach valid for identifying the severity of depression?”).
2. Validity is a matter of degree; it’s not an all-or-none affair. Some tests have no validity for a
particular purpose. Validity may be very slight, moderate, or considerable. One’s concern is
determining the extent of validity (for practical purposes, we want to know if the validity is
sufficient to make use of the test worthwhile).
3. One must distinguish between validity and accuracy of norms for a test. It is possible to have
a test with good validity but with norms that are well off the mark.

Construct Underrepresentation and Construct-Irrelevant Variance


It is useful to consider the overlap between the construct we wish to measure and the test that will
measure that construct. The construct is a trait or characteristic (e.g. depression). The relationship
between the construct and the test is represented by overlapping geometric forms. Overlap between
the construct and the test represents validity (what we want to measure). The part not covered is
called construct underrepresentation. The construct of interest is not entirely covered by the test.
Moreover, the test might measure some characteristics other than what we want to measure (called
construct irrelevant variance). The ideal case is complete overlap of the construct and the test. We
do not attain that ideal in practice.

Example: we conceptualize depression as having 3 components: cognitive (thoughts about
depression), emotional (feeling depressed), and behavioral (doing/not doing things symptomatic of
depression). The questionnaire might cover the cognitive and emotional components yet give no
information about the behavioral component. Thus, the complete construct of depression is
underrepresented by the test. It might also be that, to some extent, the questionnaire scores reflect a
tendency to give socially desirable responses. This aspect is construct-irrelevant variance.

The Traditional and Newer Classifications of Types of Validity Evidence


There is a traditional tripartite system for classifying types of validity evidence, one that has shown
considerable staying power. The 1999 Standards partially abandoned it in favor of a more diversified
representation of types of validity evidence.

A comparison of the terminology: content validity is the same in both systems. Criterion-related validity in the
traditional system corresponds to “relationships to other variables” in the new system, especially to the “test-
criterion relationships” subcategory. Convergent and discriminant validity (from the
traditional system) are made much more explicit in the new system. Response processes and internal
structure (from the new system) are represented under the general category of construct validity in the
traditional system. Consideration of “consequences” is a newly introduced topic.

The Issue of Face Validity


Test validity is an empirical demonstration that a test measures what it purports to measure
(specifically, the scores on the test can be interpreted meaningfully for a particular purpose). In
contrast, face validity refers to whether a test looks like it measures its target construct.

 Detractors (those against) of face validity object to its frequent use as a substitute for
empirical demonstration of validity.
 Advocates for face validity note that we work with real people in the real world.

Face validity is never a substitute for empirical validity, but it can be a useful addition (if 2 tests have
equal empirically established validity, the test with better face validity is preferred).

Content Validity
Content validity deals with the relationship between the content of a test and some well-defined
domain of knowledge or behavior. For a test to have good content validity, there must be a good
match between the content of the test and the content of the relevant domain. Application of
content validity involves sampling from all possible content of the domain. A test might, in principle, cover all
the possible content of the domain, but most of the time the domain is too large for that.
Content validity has 2 primary applications: educational achievement tests and employment tests.

Application to Achievement Tests


Content validity is generally considered the most important type of validity for achievement tests.
The purpose of such tests is to determine the extent of knowledge of some body of material (e.g.
high school chemistry, history of the Civil War). The process of establishing content validity begins with
a careful definition of the content to be covered. The process results in a table of specifications or a
blueprint (for high school chemistry the blueprint might arise from examination of the content of the 5 most
widely used textbooks in the field). In many cases the content area is represented by a two-way
table of specifications. The first dimension covers the content topics. The second dimension
represents mental processes (e.g. knowledge of facts, ability to apply material).

Example: the best-known scheme for representing processes is Bloom’s taxonomy. There are 3 taxonomies in
total (one for the cognitive domain, one for the psychomotor domain, and one for the affective domain). The
cognitive taxonomy is the most widely cited, as well as the most relevant to the content validity of
achievement tests. The full taxonomy is rarely used; generally the 6 categories are reduced to 3. Note
that the attempts to validate the distinctions in the taxonomy (show that various categories
represent distinct mental processes) have failed. Nonetheless, Bloom’s taxonomy (or its variation) is
frequently encountered in the literature.

After preparing the table of specifications for a content area, we determine the content validity of a
test by matching the content of the test with the table. It usually is done on an item-by-item basis.
The analysis shows a) areas of content not covered by the test and b) test items that do not fit the
table of specifications. The result is not usually summarized numerically (e.g., as the percentage of the domain
covered by test items and the percentage of the test items not reflected in the domain). Rather, a
judgment is made as to whether the test does/does not have sufficient content validity.

Instructional Validity
Instructional validity is a special application of content validity, which asks whether the content
has actually been taught. For a test to have instructional validity, there must be evidence that the
content was adequately covered in an instructional program (sometimes called “the opportunity to
learn”). This can be checked by asking students taking the test whether they have been exposed to the material
covered in the test. Instructional validity primarily applies to educational achievement tests. It is also
not well established as something distinct from content validity; thus it does not introduce an
entirely novel type of validity.

Application to Employment Tests


A second major application of content validity is to employment tests. In that setting, the content
domain consists of knowledge and skills required by a particular job. It is customary to restrict the list
to knowledge and skills required at the entry level. Factors like motivation and personality are not
ordinarily included. The process of developing the list of knowledge and skills is called job analysis.
After completing a job analysis, the test content is matched with the job content.

Two differences when applying content validity to achievement and employment tests:

1. For achievement tests, print documents (books, curriculum guides) usually serve as the basis
for content specifications, whereas for employment tests the job analysis serves that role.
2. Although a percentage-agreement figure is rarely used with achievement tests, such a figure
is used in the evaluation of employment tests (Lawshe presented a methodology for
expressing the extent to which experts judge test content essential for job performance,
called the content validity ratio; see the sketch below).
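A minimal sketch of Lawshe’s content validity ratio for a single item, assuming a hypothetical expert panel: the standard formula divides the excess of “essential” ratings over half the panel by half the panel size.

```python
def content_validity_ratio(n_essential, n_experts):
    """Lawshe's CVR for one item: (n_e - N/2) / (N/2).
    Ranges from -1 to +1; positive values mean more than half
    the panel rated the item essential."""
    half = n_experts / 2
    return (n_essential - half) / half

# Hypothetical panel: 10 job experts, 8 rate the item "essential"
print(content_validity_ratio(8, 10))  # 0.6
```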

Content Validity in Other Areas


Content validity’s application in other areas (e.g. intelligence, personality) is limited because few
other areas are susceptible to clear specification of the domains to be covered. However, in a few
instances, content validity may have some limited use in these areas (e.g., to show that a test designed to
measure a certain disorder covers all the traits specified for the disorder).

Problems with Content Validity


Conceptually, establishing content validity is a basic process: specify the content of a domain, check
how well the test matches this content. In practice, however, complications arise from these sources:

1. Except in a few very simple cases, getting a clear specification of the content domain is often
difficult. For the chemistry test mentioned above, state curriculum guides could differ. One
handles this by specifying the depth of knowledge that is wanted (same process for
employment tests).
2. Judging how well test items cover elements of the content specifications might be difficult.
Items with a common classification can vary widely in the skills they require; the person
judging content validity must examine the actual test items and not rely exclusively on a
listing of categories.
3. Content validity does not refer in any way to actual performance on the test. All other
methods of determining validity refer (in some way) to empirical performance. Thus, it
doesn’t take into consideration the examinees’ interaction with the test.

Criterion-Related Validity
An essential feature of criterion-related validity is establishing the relationship between
performance on the test and on some other criterion that is taken as an important indicator of the
construct of the test.

Three common applications of criterion-related validity involve use of:

a. An external, realistic criterion defining the construct of interest


b. Group contrasts
c. Another test.

Fundamentally, these three approaches reduce to the same thing; however, they have some practical
differences.

The two general contexts for criterion-related validity are:

1. Predictive validity, where the test aims to predict status on some criterion that will be
attained in the future (e.g. college entrance test to predict GPA at the end of freshman year
in college)
2. Concurrent validity, where we check on agreement between test performance and current
status on some other variable (e.g. relationship between performance on a standardized
achievement test and a teacher-made test, both administered at approximately the same
time).

External, Realistic Criterion


An external criterion provides a realistic definition of the construct of interest (it is the information of
main interest). Why use a test rather than simply obtaining the criterion information? For two reasons:

1. It may be that we cannot get information on the criterion until some time in the future, and
we would like to predict future status on that criterion.
2. Getting information on the criterion may be very time-consuming or expensive.

Usually, one expresses this type of validity as a correlation coefficient (i.e. the Pearson
correlation coefficient). In this case the correlation coefficient is called a validity coefficient. Once we
know the correlation between the 2 variables, we use it to predict status on variable Y from standing
on variable X (here, Y is the external criterion, X is the test). Thus, we apply the usual regression
equation: Yʹ = bX + a
When we have means and standard deviations for X and Y and the correlation between X and Y,
we can write the regression equation directly in those terms (formula 5.2, page 163, not mentioned in the slides).

The standard error of estimate is the standard deviation of actual criterion scores around their
predicted scores (formula 5.3, page 164, not mentioned in the slides). See the sketch below.
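The referenced formulas take standard textbook forms; a minimal sketch in Python, with hypothetical numbers for a college-entrance test predicting freshman GPA:

```python
import math

def predict_y(x, mean_x, sd_x, mean_y, sd_y, r_xy):
    """Predicted criterion score: Y' = M_Y + r_XY * (SD_Y / SD_X) * (X - M_X),
    which is the usual Y' = bX + a with b = r_XY * SD_Y / SD_X."""
    b = r_xy * sd_y / sd_x
    a = mean_y - b * mean_x
    return b * x + a

def standard_error_of_estimate(sd_y, r_xy):
    """SD of actual criterion scores around predicted scores: SD_Y * sqrt(1 - r^2)."""
    return sd_y * math.sqrt(1 - r_xy ** 2)

# Hypothetical values: test M = 50, SD = 10; GPA M = 2.8, SD = 0.5; validity r = .60
print(predict_y(65, 50, 10, 2.8, 0.5, 0.60))   # 3.25
print(standard_error_of_estimate(0.5, 0.60))   # 0.4
```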

Contrasted Groups
The second major method for demonstrating criterion-related validity is the contrasted groups method.
Here, the criterion is group membership. We want to demonstrate that the test differentiates one
group from another. The better the differentiation between groups, the more valid the test. When
viewing the results of a contrasted-groups study of test validity, it is important to consider the degree
of separation between groups. Reporting “statistically significant difference” between groups is not
sufficient (if n is large enough, it is not difficult to obtain significance). It is important that the test
differentiates between the groups to an extent that is useful in practice. Effect size measures would be a useful
addition to contrasted-groups studies, but they are rarely used for this purpose.

Correlations with Other Tests


Another method for demonstrating criterion-related validity is to show that the test correlates with some other test that
is presumed to be a valid measure of the relevant construct. In this application, the other test
becomes the criterion (analogous to the external criterion). Some reasons for establishing the validity of a
new test in this way are:

a. New test might be shorter/less expensive than the criterion test (e.g. a 15 minute
intelligence test validated against the Wechsler Intelligence Scale for Children)
b. New test might have better norms, or more efficient scoring procedures

The same methodology as with the external criterion is used (Pearson correlation). Note: be careful
about the jingle fallacy (thinking that two things are really the same because they are labeled with the same
or similar words) and the jangle fallacy (thinking that 2 things are really different because they
are labeled with different words). Guarding against these fallacies calls for empirical evidence.

The most important index for reporting criterion-related validity is the correlation coefficient. The
degree of correlation can be depicted with bivariate distributions. A special application of this type of
array is the expectancy table, which has a structure similar to that of the bivariate chart. Entries for
each row are percentages of cases in that row. Thus, entries and combinations of entries translate
into probabilities.
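A minimal sketch of an expectancy table, using simulated (not real) test and criterion scores: test scores are banded, and each row is converted to percentages so entries read as probabilities of each criterion outcome given a score band.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
test = rng.normal(50, 10, n)                                 # simulated test scores
criterion = 0.6 * (test - 50) / 10 + rng.normal(0, 0.8, n)   # correlated criterion

score_band = pd.cut(test, bins=[0, 40, 50, 60, 100],
                    labels=["<40", "40-49", "50-59", "60+"])
outcome = np.where(criterion >= 0, "successful", "unsuccessful")

# Each row sums to 100%: probability of each outcome within a score band
expectancy = pd.crosstab(score_band, outcome, normalize="index") * 100
print(expectancy.round(1))
```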

Special Considerations for Interpreting Criterion-Related Validity


Conditions Affecting the Correlation Coefficient
Several conditions affect the magnitude of the correlation coefficient (r).

 If the relationship is nonlinear, the Pearson correlation will underestimate the true degree of
relationship. Always examine the bivariate distribution (scatter plot) for the two variables.
Nonlinearity is not a common problem when studying test validity.
 Differences in group homogeneity are a common problem. A validity study may be conducted
on a very heterogeneous group, yielding a high validity coefficient, when we want to apply
the result to a more homogeneous group (see the sketch after this list).
 Homoscedasticity refers to the assumption that data points are equally scattered around the
prediction line throughout the range. This is not a general problem. Correlations between
scores and other criteria are not often high enough to warrant concern about this. Plus, it’s
easy to check the scatter plot to determine if there’s a problem in this regard.
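The group-homogeneity point can be seen in a small simulation (assumed data, not from the text): compute r in a heterogeneous sample, then again after restricting the range of test scores; the restricted coefficient is typically noticeably lower.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
test = rng.normal(0, 1, n)
criterion = 0.6 * test + rng.normal(0, 0.8, n)     # population correlation about .60

r_full = np.corrcoef(test, criterion)[0, 1]

# Keep only the upper half of test scores: a more homogeneous group
mask = test > np.median(test)
r_restricted = np.corrcoef(test[mask], criterion[mask])[0, 1]

print(round(r_full, 2), round(r_restricted, 2))    # restricted r is clearly smaller
```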

The Reliability-Validity Relationship


The validity of the test depends to some extent on the reliability of the test (and vice versa). Thus,
limited reliability of either the test or the criterion will limit the criterion-related validity of the test.
The reliability-validity relationship is treated in the context of criterion-related validity. If a test has
no reliability whatsoever (the test scores are simply random error), then the test can have no validity.
However, a test can be perfectly reliable and still have no validity (the test measures something other
than what we intended it to measure). If the criterion has no reliability (status on the criterion is
simply random error) then the test can have no validity with respect to the criterion. There are
formulas (that we don’t need to know) to express the effect of limited reliability on criterion-related
validity. Attenuation (lessening /reducing) is the technical term for the limit placed on validity by
imperfect reliability. From the obtained validity coefficient, we can calculate (either for the criterion
or for the test) the disattenuated validity coefficient (also called the validity coefficient corrected for
unreliability). In practical applications of these procedures, we correct for unreliability in the test
(we assume that the criterion is reliable) to bring it to a level of perfect reliability (1.00). Although
theoretically possible, perfect reliability is unrealistic in practice (thus, we correct to a reliability of .85 or .90 instead; see the sketch below).
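A minimal sketch of the correction for attenuation, with hypothetical numbers: the obtained validity coefficient is scaled by the square root of the ratio of the target reliability to the observed reliability of the test (a target of 1.00 gives the fully disattenuated coefficient).

```python
import math

def corrected_validity(r_xy, rel_test, rel_target=1.0):
    """Validity coefficient corrected for unreliability in the test:
    r_xy * sqrt(rel_target / rel_test)."""
    return r_xy * math.sqrt(rel_target / rel_test)

# Hypothetical: obtained validity .45, test reliability .70
print(round(corrected_validity(0.45, 0.70), 2))                   # corrected to 1.00
print(round(corrected_validity(0.45, 0.70, rel_target=0.90), 2))  # corrected to .90
```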

Validity of the Criterion


How well does the test predict or correlate with the criterion? The test should be one’s focus,
because one assesses the test’s validity. However, we also should consider the operational definition
of the criterion (how good is freshman GPA as an operational definition of success in college?).

Criterion Contamination
Criterion contamination refers to a situation in which performance on the test influences status on
the criterion. Example: a sample of 50 cases is used to establish the validity of the Cleveland
Depression Scale by showing that it correlates highly with the ratings of 3 clinicians. If the clinicians
have access to the scores, this could lead to an inflation of the correlation between the test and the
criterion. One must be alert to detecting criterion contamination, as there are no analytical
methods/formulas that estimate its effect.

Convergent and Discriminant Validity


Convergent validity refers to a relatively high correlation between the test and some criterion
thought to measure the same construct as the test (e.g. a test measuring level of depression correlates
with another test measuring depression). In contrast, discriminant validity shows
that the test has relatively low correlations with constructs other than the one it is intended to
measure (e.g., the depression test does not correlate highly with constructs like anxiety). Convergent and
discriminant validity are widely used in the field of personality measurement, and not widely used, in
practice, for ability and achievement testing.
The Multitrait-Multimethod Matrix
A special application of convergent and discriminant validity is the multitrait-multimethod matrix.
The matrix is just a correlation matrix. Variables in the matrix include tests that purport to measure
several different traits (this is the multitrait part) using several different methods (the multimethod
part). The essential purpose of the multitrait-multimethod analysis is to demonstrate that
correlations within a trait but cutting across different methods are higher than correlations within
methods cutting across different traits (and, of course, higher than correlations cutting both traits
and methods). The method is widely cited in the psychometric literature, but not widely used in
practice.
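A minimal sketch of the logic with a hypothetical 2-trait × 2-method correlation matrix: the monotrait–heteromethod correlations (the “validity diagonal”) should clearly exceed the heterotrait–monomethod correlations.

```python
import pandas as pd

# Hypothetical correlations among 4 measures: traits A and B, each measured
# by method 1 (self-report) and method 2 (clinician rating)
labels = ["A_m1", "B_m1", "A_m2", "B_m2"]
r = pd.DataFrame(
    [[1.00, 0.30, 0.65, 0.20],
     [0.30, 1.00, 0.25, 0.60],
     [0.65, 0.25, 1.00, 0.35],
     [0.20, 0.60, 0.35, 1.00]],
    index=labels, columns=labels)

# Same trait, different methods (convergent evidence)
monotrait_heteromethod = (r.loc["A_m1", "A_m2"] + r.loc["B_m1", "B_m2"]) / 2
# Different traits, same method (should be lower: discriminant evidence)
heterotrait_monomethod = (r.loc["A_m1", "B_m1"] + r.loc["A_m2", "B_m2"]) / 2

print(monotrait_heteromethod, heterotrait_monomethod)  # 0.625 vs 0.325
```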

Combining Information from Different Tests


In some cases, several tests are used to predict status on a criterion. The usual method is multiple
correlation (a technique for expressing the relationship between one variable (the criterion) and the
optimal combination of 2 or more other variables (i.e. several tests)). Example: predicting freshman GPA
from a combination of an admissions test, high school rank, and a test of academic motivation. In
this case, one needs to find the optimal weights for the other variables that maximize the
correlation between the criterion and the combination of tests (the weights depend not only on the
tests’ correlations with the criterion, but also on the relationships among the predictor tests).

Two main purposes for the multiple correlations procedure:

1. A very practical purpose of yielding the best possible prediction of a dependent variable as
economically as possible, by not including any more variables than necessary.
2. To understand theoretically which variables contribute effectively to a prediction of a
dependent variable and which variables are redundant.

There are two end products from multiple correlation procedure:

1. A multiple correlation coefficient, represented by R, with subscripts indicating what is being predicted and
from what. Example: if variable 1 is predicted from variables 2, 3, and 4, we write R1.234. R is
interpreted the same way as the Pearson r (here called a zero-order correlation coefficient).
2. The weights assigned to the predictor variables. They have 2 forms: bs (applied to raw
scores) and βs (applied to scores in “standardized” form; z-scores).

The difference between bs and βs is that bs simply tell how much to weight each raw-score variable
(variables with “big” numbers get small weights, variables with “little” numbers get large
weights). In z-score form, all variables have M = 0 and SD = 1, so beta weights are directly comparable
and thus tell which variables are receiving the most weight.

R-squared (R²) is the percent of variance in Y accounted for by, or overlapping with, variance in the
predictors.
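A minimal sketch of the multiple correlation idea with simulated data (the predictors and weights are invented for illustration): standardize the variables, find the least-squares weights (the βs), and take the correlation between predicted and actual criterion scores as R, so R² is the variance accounted for.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 300
sat = rng.normal(500, 100, n)                     # hypothetical admissions test
hs_rank = rng.normal(50, 15, n)                   # hypothetical high school rank
motivation = rng.normal(20, 5, n)                 # hypothetical motivation scale
gpa = (0.004 * sat + 0.02 * hs_rank + 0.03 * motivation
       + rng.normal(0, 0.4, n))                   # criterion

X = np.column_stack([sat, hs_rank, motivation])
standardize = lambda a: (a - a.mean(axis=0)) / a.std(axis=0)
Xz, yz = standardize(X), standardize(gpa)

betas, *_ = np.linalg.lstsq(Xz, yz, rcond=None)   # standardized (beta) weights
R = np.corrcoef(Xz @ betas, yz)[0, 1]             # multiple correlation

print(np.round(betas, 2))                         # which predictors get the most weight
print(round(R, 2), round(R ** 2, 2))              # R and R-squared
```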

Example A: tests 2 and 3 show substantial overlap with the criterion. However, they are themselves
highly correlated. After we enter one of them into the multiple regression formula, the other one
adds little new information. If test 2 enters the equation first, it will have the greatest weight (β).
Test 4 will have the next greatest weight, even though test 3 has a higher correlation with
the criterion than does test 4, because test 4 provides more new information once test 2 is in the equation.
Important points in multiple correlation methodology

a. The order in which variables enter the equation


b. The overlap between predictors
c. When new variables do not add any predictive power.

Multiple correlation is a crucial technique for determining incremental validity (how much
new, unique information a test adds to an existing body of information).

Cross-Validation and Validity Shrinkage


If enough variables are plugged into the formula, some of them will turn out “significant” just by
chance. A desirable practice is to use cross-validation: determine the equation (and R) on one
sample, then apply the equation to a new sample to see what R emerges. The loss in validity
(reduction in R) from the first to the second sample is called validity shrinkage. The problem of validity
shrinkage (hence the need for cross-validation) can be severe when the initial sample is small
(the problem diminishes as sample size increases); see the sketch below.
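A minimal sketch of cross-validation with simulated data: weights are fitted on a small derivation sample (where most predictors are pure noise) and then applied unchanged to a large new sample; the drop in R is the validity shrinkage.

```python
import numpy as np

rng = np.random.default_rng(3)

def make_sample(n, n_predictors=10):
    X = rng.normal(size=(n, n_predictors))
    y = 0.5 * X[:, 0] + rng.normal(size=n)     # only the first predictor really matters
    return X, y

def multiple_r(X, y, weights):
    return np.corrcoef(X @ weights, y)[0, 1]

X1, y1 = make_sample(40)                       # small derivation sample
w, *_ = np.linalg.lstsq(X1, y1, rcond=None)    # weights tuned to sample 1 (capitalize on chance)

X2, y2 = make_sample(1000)                     # new cross-validation sample
print(round(multiple_r(X1, y1, w), 2))         # inflated R in the derivation sample
print(round(multiple_r(X2, y2, w), 2))         # smaller R in the new sample: shrinkage
```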

Statistical versus Clinical Prediction


With multiple correlation techniques, one determines empirically what information to use, and what
information to discard. As an alternative, we could combine the information based on clinical
intuition and experience.

Example: we can make a statistical prediction based on high school rank and SAT scores (using
multiple correlation methodology) or we can ask admission counselors to make predictions (they
combine all this information in any way they wish and make a clinical judgment about probable
success).

In general, statistical predictions are at least equal to and usually beat clinical predictions. Thus,
clinicians could be replaced, but not always. Development of the formulas requires an adequate
database (if one has it, one relies on it). But databases are not always available; thus, one must rely
on clinicians to make the best judgment. Moreover, clinicians are needed to develop the original
notions of what should be measured to go into the formulas. Clinicians who are firmly guided by
statistical formulas can be better than the formulas themselves.

Decision Theory: Basic Concepts and Terms


Decision theory is a body of concepts, terms, and procedures for analyzing the quantitative effects of
our decisions. As applied to testing, the decisions involve using tests, especially in the context of
criterion-related validity for diverse purposes (selection, certification and diagnosis). In applying the
theory, one wants to optimize the results of his/her decisions according to certain criteria.

Hits, False Positives, and False Negatives


A hit is a case that has the same status with respect to both the test and the criterion (this includes cases
that exceed the cutscores on both the criterion and the test [a positive hit] and cases that fall below the
cutscores on both the criterion and the test [a negative hit]). A high hit rate indicates good criterion-related
test validity. However, unless the correlation between the test and criterion is perfect, there will be
some errors in prediction. False positives are cases that exceed the test cutscores but fail on the
criterion. False negatives are cases that fall below the test cutscore but are successful on the
criterion. Note: false positives and false negatives are often confused. To avoid confusing them:

a. Always draw the chart so that the test is on the bottom axis and the criterion on the left axis
b. After drawing the cutscore lines, place the “hit” labels
c. “positive”values are to the right, and “negative” values are to the left, respectively, false
positives are to the right, and false negatives are to the left

Two factors affect the percentages of hits, false positives, and false negatives:

1. The degree of correlation between the test and the criterion. Limiting cases are a perfect
correlation (no false positive and false negatives will exist, just hits) or zero correlation (the
sum of false positives and false negatives will equal the number of hits).
2. Placement of the cutoff score on the test. Changes in cutoff affect the relative percentage of
false positives and false negatives. General rule: when the correlation is less than perfect
(always the case in practice), there is a trade-off between the false positive rate and the false
negative rate. When setting the cutscore, one must decide which result is preferable: a
relatively high false positive rate or a relatively high false negative rate (see the sketch after this list).
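A minimal sketch with simulated data: apply cutscores on the test and criterion, count hits, false positives, and false negatives, and watch the two error rates trade off as the test cutscore moves.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 1000
test = rng.normal(0, 1, n)
criterion = 0.6 * test + rng.normal(0, 0.8, n)
criterion_cut = 0.0                              # "success" on the criterion

def classify(test_cut):
    pred_pos = test >= test_cut                  # selected by the test
    actual_pos = criterion >= criterion_cut      # successful on the criterion
    hits = np.mean(pred_pos == actual_pos)       # positive hits + negative hits
    false_pos = np.mean(pred_pos & ~actual_pos)  # pass the test, fail the criterion
    false_neg = np.mean(~pred_pos & actual_pos)  # fail the test, succeed on the criterion
    return round(hits, 2), round(false_pos, 2), round(false_neg, 2)

for cut in (-0.5, 0.0, 0.5):                     # raising the cutscore trades FP for FN
    print(cut, classify(cut))
```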

Base Rate
Base rate is the percentage of individuals in the population having some characteristic. When base
rate is extreme (either very high or very low), it is difficult to show that the test has good validity in
identifying individuals in the target group. Example: a characteristic is possessed by 0.5% in the
population. In this case, unless the test for identifying such individuals has exceptionally high validity,
we minimize errors by simply stating that no one has the characteristic (no matter the test score).
Good test validity is easiest to attain if the base rate is near 50%. The base rate also may change depending on the
definition of the population (e.g. 1% in the general population, but 39% among those who seek help). Test
validity interacts with base rate at a given selection ratio.
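The extreme-base-rate point is simple arithmetic, sketched below with a hypothetical test: with a 0.5% base rate, the blanket statement “no one has the characteristic” is already correct 99.5% of the time, so even a fairly accurate test can produce a lower overall hit rate.

```python
base_rate = 0.005                      # 0.5% of the population has the characteristic

# Strategy 1: say "no one has it" -> wrong only for the 0.5% who do
blanket_hit_rate = 1 - base_rate

# Strategy 2: a hypothetical test with 90% sensitivity and 90% specificity
sensitivity, specificity = 0.90, 0.90
test_hit_rate = sensitivity * base_rate + specificity * (1 - base_rate)

print(round(blanket_hit_rate, 3))   # 0.995
print(round(test_hit_rate, 3))      # 0.9 -> lower overall hit rate than the blanket prediction
```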

Sensitivity and Specificity


Specificity and sensitivity are applied when a test is used to classify individuals into 2 groups (e.g.
suicide attempters and suicide nonattempters). Example: if a test’s purpose is to identify suicide
attempters, one would use a criterion group (a group of people who have attempted suicide) and a
contrast group (people who suffer from depression but have not attempted suicide). The test and the cutoff score
should a) identify the criterion group (suicide attempters) but b) not identify the contrast group
(nonattempters). The test’s sensitivity is the extent to which it correctly identifies the criterion
group. The test’s specificity is the extent to which it doesn’t identify the contrast group. Both terms
are usually expressed in simple percentages (representing “hits”).

Two factors affect sensitivity and specificity for a test:

1. The degree of separation between the groups (the greater the degree of separation, the
better both sensitivity and specificity).
2. The placement of the cutscore (moving the cutscore will make sensitivity and specificity vary
inversely [as one increases, the other decreases]).

It is important to have meaningful contrasts when considering discrimination between groups (more
useful to contrast suicide attempters with nonattempters who suffer from depression than to
contrast attempters with the general population). Thus, one must be alert to the nature of the groups
that are compared. Clinical applications employ concepts of positive predictive power (PPP) and
negative predictive power (NPP).
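A minimal sketch of the four indices from a hypothetical 2 × 2 classification table; note that positive and negative predictive power, unlike sensitivity and specificity, also depend on how common the criterion group is in the sample.

```python
# Hypothetical counts from a validity study
true_pos, false_neg = 40, 10     # criterion group (attempters) flagged / missed by the test
false_pos, true_neg = 30, 120    # contrast group (nonattempters) flagged / correctly not flagged

sensitivity = true_pos / (true_pos + false_neg)   # proportion of the criterion group identified
specificity = true_neg / (true_neg + false_pos)   # proportion of the contrast group not identified
ppp = true_pos / (true_pos + false_pos)           # positive predictive power
npp = true_neg / (true_neg + false_neg)           # negative predictive power

print(sensitivity, specificity, round(ppp, 2), round(npp, 2))   # 0.8 0.8 0.57 0.92
```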

Construct Validity
Construct validity encompasses the attempt to gather a variety of kinds of
evidence to support the proposition that the test measures its construct.

Internal Structure
A high degree of internal consistency (high KR-20 or coefficient alpha) indicates that the test is
measuring the construct or trait in a consistent manner. Internal consistency provides only weak,
ambiguous evidence regarding validity. It is best thought of as a prerequisite for validity, rather than as
validity evidence itself.

Factor Analysis
Factor analysis is a family of statistical techniques (including principal components analysis, various
rotation procedures, and stopping rules) that help to identify the common dimensions underlying
performance on many different measures. Factor analysis begins with raw data (from a practical
perspective, it begins with a correlation matrix). Example: A and B are considered one underlying
dimension (it is not fruitful to think of them as 2 variables) if they are highly correlated (e.g. .95).
However, if their correlations with a third variable C are fairly low (e.g. .20), we cannot collapse A, B,
and C; thus we now have 2 underlying dimensions (AB and C).

Results of a factor analysis are usually presented as a factor matrix, which shows the weight (called a
loading) that each original variable has on the newly established factors. Loadings in excess
of .30 are considered noteworthy. Factors are named and interpreted rationally. After “extracting” factors, it is
customary to “rotate” the axes to aid interpretation. The most common technique is called varimax
rotation.
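A minimal sketch of the principal-components style of extraction with simulated data (rotation omitted for brevity): start from the correlation matrix, take its eigendecomposition, and read loadings above .30 as noteworthy.

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
verbal = rng.normal(size=n)                        # two hypothetical underlying dimensions
quant = rng.normal(size=n)
data = np.column_stack([
    verbal + 0.3 * rng.normal(size=n),             # vocabulary test
    verbal + 0.3 * rng.normal(size=n),             # reading test
    quant + 0.3 * rng.normal(size=n),              # arithmetic test
    quant + 0.3 * rng.normal(size=n),              # algebra test
])

R = np.corrcoef(data, rowvar=False)                # begin with the correlation matrix
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]                  # largest components first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

loadings = eigvecs[:, :2] * np.sqrt(eigvals[:2])   # loadings on the first two factors
print(np.round(loadings, 2))                       # |loading| > .30 is noteworthy
```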

Response processes
The study of response processes (how examinees go about responding to a test), which can involve
mechanical and electronic recordings, may provide evidence (though not very strong evidence) regarding the
validity of the test.

Effect of Experimental Variables


Studying the effects of experimental variables (which can demonstrate the validity of a test) is similar
to the contrasted- groups method (logically, they’re the same). The difference is that contrasted
groups involve naturally occurring groups, while the experimental variable groups are created
specifically to study test validity. Example: a test is administered to a group; the group is then
subjected to an anxiety-producing situation, and then the test is readministered. We expect the scores
on the test to increase.

Developmental Changes
In this instance, one contrasts groups at different ages/grades. Thus, increases in test scores and in
performance on individual test items at successively higher grades are used to argue for the validity
of achievement tests (reading performance in grade 5 is better than in grade 3).
Consequential Validity
Consequential validity references the test to the (intended and unintended) consequences of its
uses and interpretations. Two issues are considered:

1. Explicit claims regarding consequences made by the test authors. If the test has an explicit
consequence (followed from the purpose of the test), the validity process should address
that consequence.
2. Consequences that may occur regardless of explicit claims by the authors. Example: the test only
claims to predict GPA, but someone claims that the use of the test improves the quality of
instruction at the institution.

Because consequential validity is a new term, there is as yet no agreed-upon way to evaluate it (how does one
identify all the consequences of using a test?).

Test Bias as Part of Validity


Test bias (or the opposite, test fairness) refers to whether a test measures its target construct
equivalently in different groups. A biased test will not measure its target the same in different
groups.

The Practical Concerns


For many tests, multiple validity studies are conducted, with results varying from one study to the
next. There may be legitimate reasons that the validity of a test varies from situation to situation
(e.g. depression may have a different complexion in younger versus older adults).

Integrating the Evidence


In the final analysis, one must weigh all the available evidence and make a judgment about the
probable validity of a test for different circumstances. Validity generalization is the process of
weighing all the evidence and judging the relevance of existing studies to a specific anticipated use. It
requires a) knowledge of the relevant content area, b) familiarity with the research already
conducted with the test and with similar tests, c) understanding of the concepts and procedures,
and d) perceptive analysis of local circumstances for an anticipated use of the test.

Meta-analysis is a (currently preferred) technique for summarizing the actual statistical information
contained in many different studies on a single topic. Its result is a statistic (e.g. a correlation coefficient or a
measure of effect size).

In the Final Analysis: A Relative Standard


How high should validity be? The answer about validity is a relative one (is the test more or less valid
than another test). Sometimes, the practical question is whether to use a test or nothing. The validity
of psychological tests compares favorably with the validity of many commonly used medical tests.
