1. However, it is imprecise to refer to the validity of a test. One establishes the validity of a test
score when used for a particular purpose (e.g. don’t ask “Is the Rorschach valid?”; ask “Is the
Depression Index from the Rorschach valid for identifying the severity of depression?”).
2. Validity is a matter of degree; it’s not an all-or-none affair. Some tests have no validity for a
particular purpose. Validity may be very slight, moderate, or considerable. One’s concern is
determining the extent of validity (for practical purposes, we want to know if the validity is
sufficient to make use of the test worthwhile).
3. One must distinguish between validity and accuracy of norms for a test. It is possible to have
a test with good validity but also with norms that are well off the mark.
Comparing the two terminologies: content validity is the same in both systems. Criterion-related
validity in the traditional system corresponds to the “relationship to other variables” category,
especially to the “test-criterion relationship” subcategory, in the new system. Convergent and discriminant validity (from the
traditional system) are made much more explicit in the new system. Response processes and internal
structure (from the new system) are represented under the general category of construct validity
(traditional system). Consideration of “consequences” is a newly introduced topic.
Detractors (those against) of face validity object to its frequent use as a substitute for
empirical demonstration of validity.
Advocates for face validity note that we work with real people in the real world.
Face validity is never a substitute for empirical validity, but it can be a useful addition (if 2 tests have
equal empirically established validity, the test with better face validity is preferred).
Content Validity
Content validity deals with the relationship between the content of a test and some well-defined
domain of knowledge or behavior. For a test to have good content validity, there must be a good
match between the content of the test and the content of the relevant domain. Application of
content validity involves sampling of all possible contents of the domain. The test might also cover all
the possible contents of the domain, but most of the time, the domain is too large to do that.
Content validity has 2 primary applications: educational achievement tests and employment tests.
Example: the best-known scheme for representing processes is Bloom’s taxonomy. There are 3
taxonomies in total (one for the cognitive domain, one for the psychomotor domain, and one for the affective domain). The
cognitive taxonomy is the most widely cited, as well as most relevant to the content validity of
achievement tests. The full taxonomy is rarely used; generally the 6 categories are reduced to 3. Note
that the attempts to validate the distinctions in the taxonomy (show that various categories
represent distinct mental processes) have failed. Nonetheless, Bloom’s taxonomy (or its variation) is
frequently encountered in the literature.
After preparing the table of specifications for a content area, we determine the content validity of a
test by matching the content of the test with the table. It usually is done on an item-by-item basis.
The analysis shows a) areas of content not covered by the test and b) test items that do not fit the
content specifications. The result is usually not summarized numerically (e.g. as the percentage of
the domain covered by test items and the percentage of test items not reflected in the domain). Rather,
a judgment is made as to whether the test has sufficient content validity.
Instructional Validity
Instructional validity is a special application of content validity, which asks whether the content
has actually been taught. For a test to have instructional validity, there must be evidence that the
content was adequately covered in an instructional program (sometimes called “the opportunity to
learn”). This can be checked by asking students taking the test whether they have been exposed to the material
covered in the test. Instructional validity primarily applies to educational achievement tests. It is also
not well established as something distinct from content validity, thus it does not introduce an
entirely novel type of validity.
Two differences when applying content validity to achievement and employment tests:
1. For achievement tests, print documents (books, curriculum guides) usually serve as the basis
for content specifications.
2. Although a percentage-agreement figure is rarely used with achievement tests, such a figure
occurs for the evaluation of employment tests (Lawshe presented a methodology for
expressing the percentage of test content that experts judged essential for job performance,
called content validity ratio).
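Lawshe’s content validity ratio for a single item can be sketched as follows (the panel size and ratings are invented for illustration):

```python
def content_validity_ratio(n_essential: int, n_experts: int) -> float:
    """Lawshe's CVR = (n_e - N/2) / (N/2); ranges from -1 to +1."""
    half = n_experts / 2
    return (n_essential - half) / half

# Invented example: 8 of 10 experts judge an item essential.
print(content_validity_ratio(8, 10))   # 0.6
# If exactly half the panel says "essential", CVR is 0.
print(content_validity_ratio(5, 10))   # 0.0
```

A CVR is computed per item; items with low or negative values are candidates for removal from the employment test.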
1. Except in a few very simple cases, getting a clear specification of the content domain is often
difficult. For the chemistry test mentioned above, state curriculum guides could differ. One
handles this by specifying the depth of knowledge that is wanted (same process for
employment tests).
2. Judging how well test items cover elements of the content specifications might be difficult.
Items with a common classification can vary widely in the skills they require; the person
judging content validity must examine the actual test items and not rely exclusively on a
listing of categories.
3. Content validity does not refer in any way to actual performance on the test. All other
methods of determining validity refer (in some way) to empirical performance. Thus, it
doesn’t take into consideration the examinees’ interaction with the test.
Criterion-Related Validity
An essential feature of criterion-related validity is establishing the relationship between
performance on the test and on some other criterion that is taken as an important indicator of the
construct of the test.
Fundamentally, these approaches reduce to the same thing; however, they have some practical
differences.
1. Predictive validity, where the test aims to predict status on some criterion that will be
attained in the future (e.g. college entrance test to predict GPA at the end of freshman year
in college)
2. Concurrent validity, where we check on agreement between test performance and current
status on some other variable (e.g. relationship between performance on a standardized
achievement test and a teacher-made test, both administered at approximately the same
time).
1. It may be that we cannot get information on the criterion until some time in the future, and
we would like to predict future status on that criterion.
2. Getting information on the criterion is very time-consuming or expensive.
Usually, one expresses the criterion-related validity of a test as a correlation coefficient (typically
the Pearson correlation coefficient), in which case it is called a validity coefficient. Once we
know the correlation between 2 variables, we use it to predict the status on variable Y from standing
on variable X (here, Y is the external criterion, X is the test). Thus, we apply the usual regression
equation: Yʹ = bX + a
When we have the means and standard deviations for X and Y and the correlation between X and Y,
the slope and intercept follow from the standard regression formulas b = r(SDY/SDX) and
a = MY − bMX (formula 5.2, page 163, not mentioned in the slides).
The standard error of estimate is the standard deviation of actual criterion scores around their
predicted scores: SEest = SDY * sqrt(1 − r^2) (formula 5.3, page 164, not mentioned in the slides).
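These two formulas can be sketched in code (the means, SDs, and the r = .60 validity coefficient below are invented for illustration):

```python
import math

def predict_criterion(x, m_x, sd_x, m_y, sd_y, r):
    """Y' = bX + a, with b = r * (SD_Y / SD_X) and a = M_Y - b * M_X."""
    b = r * (sd_y / sd_x)
    a = m_y - b * m_x
    return b * x + a

def standard_error_of_estimate(sd_y, r):
    """SD of actual criterion scores around their predicted scores."""
    return sd_y * math.sqrt(1 - r ** 2)

# Invented example: test X (M=50, SD=10), criterion Y (M=100, SD=15), r=.60.
print(predict_criterion(60, 50, 10, 100, 15, 0.60))  # ~109.0
print(standard_error_of_estimate(15, 0.60))          # ~12.0
```

Note how a less-than-perfect r shrinks the prediction toward the criterion mean: a score 1 SD above the mean on X predicts a score only r SDs above the mean on Y.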
Contrasted Groups
Second major method for demonstrating criterion-related validity is the contrasted groups method.
Here, the criterion is group membership. We want to demonstrate that the test differentiates one
group from another. The better the differentiation between groups, the more valid the test. When
viewing the results of a contrasted-groups study of test validity, it is important to consider the degree
of separation between groups. Reporting “statistically significant difference” between groups is not
sufficient (if n is large enough, it is not difficult to obtain significance). It is important that the test
differentiates between the groups to an extent that is useful in practice. Effect size would be a
useful statistic for contrasted-groups studies, but it is rarely used for this purpose.
a. New test might be shorter/less expensive than the criterion test (e.g. a 15 minute
intelligence test validated against the Wechsler Intelligence Scale for Children)
b. New test might have better norms, or more efficient scoring procedures
The same methodology as with the external criterion is used (Pearson correlation). Note to be careful
with the jingle fallacy (thinking that using the same words or similar words for two things means that
they are really the same) or the jangle fallacy (thinking that 2 things are really different because they
use different words). Guarding against these fallacies calls for empirical evidence.
The most important index for reporting criterion-related validity is the correlation coefficient. The
degree of correlation can be depicted with bivariate distributions. A special application of this type of
array is the expectancy table, which has a structure similar to that of the bivariate chart. Entries for
each row are percentages of cases in that row. Thus, entries and combinations of entries translate
into probabilities.
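A toy sketch of computing the row percentages for an expectancy table (the simulated scores, the score bands, and the “success” cutoff are all invented):

```python
import numpy as np

# Simulate correlated test (X) and criterion (Y) scores.
rng = np.random.default_rng(0)
x = rng.normal(50, 10, 500)
y = 0.6 * x + rng.normal(0, 8, 500)

success = y >= 30                       # invented criterion cutoff
bands = [-np.inf, 40, 50, 60, np.inf]   # invented test-score rows

# Each row entry: percentage of cases in that score band who "succeed".
for lo, hi in zip(bands[:-1], bands[1:]):
    in_band = (x >= lo) & (x < hi)
    pct = 100 * success[in_band].mean()
    print(f"test score {lo} to {hi}: {pct:.0f}% succeed")
```

Reading down the rows, higher test-score bands should show higher success percentages; those row percentages are the probabilities the expectancy table reports.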
If the relationship is nonlinear, the Pearson correlation will underestimate the true degree of
relationship. Always examine the bivariate distribution (scatter plot) for the two variables.
Nonlinearity is not a common problem when studying test validity.
Difference in group homogeneity is a common problem. A validity study may be conducted
on a very heterogeneous group, yielding a high validity coefficient, when we want to apply
the result to a more homogeneous group.
Homoscedasticity refers to the assumption that data points are equally scattered around the
prediction line throughout the range. This is not a general problem. Correlations between
scores and other criteria are not often high enough to warrant concern about this. Plus, it’s
easy to check the scatter plot to determine if there’s a problem in this regard.
Criterion Contamination
Criterion contamination refers to a situation in which performance on the test influences status on
the criterion. Example: a sample of 50 cases is used to establish the validity of the Cleveland
Depression Scale by showing that it correlates highly with the ratings of 3 clinicians. If the clinicians
have access to the scores, this could lead to an inflation of the correlation between the test and the
criterion. One must be alert to detecting criterion contamination, as there are no analytical
methods/formulas that estimate its effect.
1. A very practical purpose of yielding the best possible prediction of a dependent variable as
economically as possible, by not including any more variables than necessary.
2. To understand theoretically which variables contribute effectively to a prediction of a
dependent variable and which variables are redundant.
The difference between bs and βs is that bs just tell how much to weight each raw-score variable
(variables with “big” numbers get small weights, variables with “little” numbers get large
weights). In z-score form, all variables have M=0 and SD=1, so beta weights are directly comparable
and thus tell which variables are receiving the most weight.
R-squared (R^2) is the percent of variance in Y accounted for by, or overlapping with, variance in
the predictors.
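A minimal sketch of b weights, beta weights, and R-squared with two simulated predictors (all numbers invented; the “big numbers” predictor gets a small b, but its beta is directly comparable):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(500, 100, n)   # predictor with "big" numbers
x2 = rng.normal(3.0, 0.5, n)   # predictor with "little" numbers
y = 0.002 * x1 + 0.8 * x2 + rng.normal(0, 0.5, n)

# Raw-score regression: y' = a + b1*x1 + b2*x2
X = np.column_stack([np.ones(n), x1, x2])
a, b1, b2 = np.linalg.lstsq(X, y, rcond=None)[0]

# Beta weights: bs rescaled to z-score form, so they are comparable.
beta1 = b1 * x1.std() / y.std()
beta2 = b2 * x2.std() / y.std()

# R-squared: proportion of Y variance accounted for by the predictors.
resid = y - X @ np.array([a, b1, b2])
r_squared = 1 - resid.var() / y.var()
print(b1, b2, beta1, beta2, r_squared)
```

Here b1 is tiny only because x1 is measured in big units; the betas show each predictor’s actual weight on a common scale.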
Example A: tests 2 and 3 show substantial overlap with the criterion. However, they are themselves
highly correlated. After we enter one of them into the multiple regression equation, the other one
adds little new information. If test 2 enters the equation first, it will have the greatest weight (β).
Test 4 will have the next greatest weight, even though test 3 has a higher correlation with
the criterion than does test 4, because test 4 provides more new information once test 2 is in the equation.
Important points in multiple correlation methodology
Multiple correlation is a crucial technique for determining incremental validity (which refers to how
much new, unique information a test adds to an existing body of information).
Example: we can make a statistical prediction based on high school rank and SAT scores (using
multiple correlation methodology) or we can ask admission counselors to make predictions (they
combine all this information in any way they wish and make a clinical judgment about probable
success).
In general, the statistical predictions are at least equal to and usually beat the clinical predictions.
Thus, clinicians could be replaced, but not always. Development of the formulas requires an adequate
database (if one has it, one relies on it). But databases are not always available; thus, one must rely
on clinicians to make the best judgment. Moreover, clinicians are needed to develop the original
notions of what should be measured to go into the formulas. Clinicians who are firmly guided by
statistical formulas can be better than the formulas themselves.
a. Always draw the chart so that the test is on the bottom axis and the criterion on the left axis
b. After drawing the cutscore lines, place the “hit” labels
c. “positive”values are to the right, and “negative” values are to the left, respectively, false
positives are to the right, and false negatives are to the left
Two factors affect the percentages of hits, false positives, and false negatives:
1. The degree of correlation between the test and the criterion. Limiting cases are a perfect
correlation (no false positives or false negatives will exist, just hits) or zero correlation (the
sum of false positives and false negatives will equal the number of hits).
2. Placement of the cutoff score on the test. Changes in cutoff affect the relative percentage of
false positives and false negatives. General rule: when the correlation is less than perfect
(always the case in practice), there is a trade-off between false positive rate and false
negative rate. When setting the cutscore, one must decide which result is preferable: a
relatively high false positive rate or a high false negative rate.
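The trade-off can be illustrated with simulated data (the correlation, cutscores, and criterion cutline below are all invented):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
test = rng.normal(0, 1, n)
criterion = 0.6 * test + rng.normal(0, 0.8, n)  # imperfect correlation
success = criterion > 0                          # criterion cutline

# Moving the cutscore up trades false positives for false negatives.
for cut in (-0.5, 0.0, 0.5):
    selected = test > cut
    false_pos = np.mean(selected & ~success)   # selected, but fail
    false_neg = np.mean(~selected & success)   # rejected, but would succeed
    print(f"cutscore {cut:+.1f}: FP {false_pos:.1%}, FN {false_neg:.1%}")
```

Raising the cutscore shrinks the false positive rate while inflating the false negative rate; the correlation fixes the total error, the cutscore only redistributes it.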
Base Rate
Base rate is the percentage of individuals in the population having some characteristic. When base
rate is extreme (either very high or very low), it is difficult to show that the test has good validity in
identifying individuals in the target group. Example: a characteristic is possessed by 0.5% in the
population. In this case, unless the test for identifying such individuals has exceptionally high validity,
we minimize errors by simply stating that no one has the characteristic (no matter the test score).
Good test validity is easiest to attain if the base rate is near 50%. The base rate also may change with
the definition of the population (e.g. 1% in the general population, but 39% among those who seek help). Test
validity interacts with base rate at a given selection ratio.
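The 0.5% example can be checked with simple arithmetic (the 90% sensitivity and specificity figures below are invented; the point holds for any test short of exceptionally high validity):

```python
base_rate = 0.005     # 0.5% of the population has the characteristic
sensitivity = 0.90    # P(test positive | has it) -- invented figure
specificity = 0.90    # P(test negative | lacks it) -- invented figure

# Overall error rate when classifying by the test:
false_negatives = base_rate * (1 - sensitivity)
false_positives = (1 - base_rate) * (1 - specificity)
test_error = false_negatives + false_positives

# Error rate when simply declaring that no one has the characteristic:
say_no_error = base_rate

print(test_error, say_no_error)  # ~0.100 vs 0.005
```

Here the “always say no” rule misclassifies only 0.5% of cases, while the test misclassifies about 10%, almost all of them false positives drawn from the huge unaffected group.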
1. The degree of separation between the groups (the greater the degree of separation, the
better both sensitivity and specificity).
2. The placement of the cutscore (moving the cutscore will make sensitivity and specificity vary
inversely [as one increases, the other decreases]).
It is important to have meaningful contrasts when considering discrimination between groups (more
useful to contrast suicide attempters with nonattempters who suffer from depression than to
contrast attempters with general population). Thus, one must be alert to the nature of the groups
that are compared. Clinical applications employ concepts of positive predictive power (PPP) and
negative predictive power (NPP).
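Sensitivity, specificity, PPP, and NPP can be sketched from a hypothetical 2x2 classification table (all cell counts invented):

```python
# Hypothetical counts:          has condition   lacks condition
tp, fp = 45, 90               # test positive:      45              90
fn, tn = 5, 860               # test negative:       5             860

sensitivity = tp / (tp + fn)  # of those who have it, fraction flagged
specificity = tn / (tn + fp)  # of those who lack it, fraction cleared
ppp = tp / (tp + fp)          # of positive tests, fraction correct
npp = tn / (tn + fn)          # of negative tests, fraction correct

print(sensitivity, specificity, ppp, npp)
```

Even with 90% sensitivity, PPP is only about .33 here because the condition is rare (50 of 1000 cases): most positive results come from the much larger unaffected group.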
Construct Validity
Construct validity encompasses the attempts to measure a construct by assembling a variety of kinds
of evidence to support the proposition that the test measures the construct.
Internal Structure
A high degree of internal consistency (high KR-20 or alpha coefficient) indicates that the test is
measuring the construct or trait in a consistent manner. However, internal consistency provides only
weak, ambiguous evidence regarding validity. It is best thought of as a prerequisite for validity,
rather than validity evidence itself.
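Coefficient alpha can be sketched from a small made-up item-score matrix (rows = examinees, columns = items):

```python
import numpy as np

scores = np.array([
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
], dtype=float)

k = scores.shape[1]                              # number of items
item_var_sum = scores.var(axis=0, ddof=1).sum()  # sum of item variances
total_var = scores.sum(axis=1).var(ddof=1)       # variance of total scores
alpha = (k / (k - 1)) * (1 - item_var_sum / total_var)
print(alpha)  # ~0.52 for this toy matrix
```

The more the items covary (total-score variance exceeding the sum of item variances), the higher alpha climbs toward 1.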
Factor Analysis
Factor analysis is a family of statistical techniques (including principal components analysis, various
rotation procedures, and stopping rules) that help to identify the common dimensions underlying
performance on many different measures. Factor analysis begins with raw data (from a practical
perspective, it begins with a correlation matrix). Example: A and B are considered one underlying
dimension (it is not fruitful to think of them as 2 variables) if they are highly correlated (e.g. .95).
However, if the correlation with a third variable is fairly low (e.g. .20), we cannot collapse A and B
and C, thus we now have 2 underlying dimensions (AB and C).
Results of a factor analysis are usually presented as a factor matrix, which shows the weight (called a
loading) that each original variable has on the newly established factors. Consider loadings in excess
of .30 as noteworthy. Factors are named and interpreted rationally. After “extracting” factors, it is
customary to “rotate” the axes to aid interpretation. Most common technique is called Varimax
rotation.
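The A/B/C example can be sketched with a principal-components pass over the correlation matrix (a simplified stand-in for a full factor analysis; the .95 and .20 correlations are the ones from the example above):

```python
import numpy as np

# Correlation matrix for variables A, B, C: A and B correlate .95,
# but each correlates only .20 with C -- two underlying dimensions.
R = np.array([
    [1.00, 0.95, 0.20],
    [0.95, 1.00, 0.20],
    [0.20, 0.20, 1.00],
])

# Principal components: eigenvectors scaled by sqrt(eigenvalues) give
# the (unrotated) loading of each variable on each component.
eigvals, eigvecs = np.linalg.eigh(R)
order = np.argsort(eigvals)[::-1]       # largest component first
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
loadings = eigvecs * np.sqrt(eigvals)

# A and B load heavily on the first component; C dominates the second.
print(loadings[:, :2].round(2))
```

Applying the .30 rule of thumb to the printed matrix recovers the AB-versus-C grouping described above; in practice these loadings would then be rotated (e.g. Varimax) to aid interpretation.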
Response processes
Response process (the study of how examinees go about responding to a test), which can involve
mechanical and electronic recordings, may provide evidence (not very strong, though) regarding the
validity of the test.
Developmental Changes
In this instance, one contrasts groups at different ages/grades. Thus, increases in test scores and in
performance on individual test items at successively higher grades are used to argue for the validity
of achievement tests (reading performance in grade 5 is better than in grade 3).
Consequential Validity
Consequential validity concerns the (intended and unintended) consequences of a test’s
uses and interpretations. Two issues are considered:
1. Explicit claims regarding consequences made by the test authors. If the test has an explicit
consequence (followed from the purpose of the test), the validity process should address
that consequence.
2. Consequences that may occur regardless of explicit claims by the authors. Example: a test only
claims to predict GPA, but someone claims that the use of the test improves the quality of
instruction at the institution.
Because consequential validity is a new term, there is no consensus yet about its scope (how does one
identify all the consequences of using a test?).
Meta-analysis is a (currently preferred) technique for summarizing the actual statistical information
contained in many different studies on a single topic. Its result is a single statistic (e.g. a
correlation coefficient or a measure of effect size).