
PSYCHOMETRICS – FAIRNESS

The presence of test bias represents a very real threat to test validity, since it indicates that test scores do not have
the same meaning for members of different groups. Bias can also give rise to claims of unfairness.

Impact= A term used for any finding of differences in the average scores of two groups on a test or test item. It may be due to:

1. True ability differences between the groups  does not imply bias
2. The inclusion of items that are inappropriate or unfair for one of the groups  implies bias
3. Other construct-irrelevant aspects of testing, such as test instructions or test delivery modes, that affect groups
differentially  implies bias

Once impact is found, it must be determined whether it is due to actual group differences or to inappropriate items or test
features. One way to do this at the item level is to match the two groups on their levels of the construct being measured.
If examinees who are from different groups but have the same level of the construct obtain different scores on an item,
then reason (1) can be ruled out (e.g., if groups of men and women with the same level of math ability had different
probabilities of answering a math item correctly, that item would be considered biased).

Differential item functioning (DIF)= item-level bias; it can occur in either direction (an item may be relatively harder or easier for a given group).

Bias= a systematic difference between two quantities that should be equal. Item or test score means from two groups
should be equal if the groups have equal levels of the construct being measured. If, after matching the groups on the
construct level, the test or item means still differ systematically, test or item bias is implied.

 An item, or a whole test, could be considered biased if two groups that should obtain equal scores do not.
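In symbols (a compact restatement, not from the notes, with θ standing for the construct level and X for the item or test score), an unbiased item or test satisfies:

E[ X | θ, group = A ] = E[ X | θ, group = B ]   for every level of θ

Bias is implied when these matched conditional means differ systematically.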

DETECTING TEST AND ITEM BIAS  Based on the “predictive bias model” by Cleary (1968).

Test Bias

When tests are used to predict future outcomes (e.g., success in a job), it is important that the prediction be as accurate as
possible  that the test predicts equally well for all individuals

Biased test= one that does not result in the same predictive accuracy (on average) for every test-taker.

This bias is sometimes called differential prediction, since it is assessed by determining whether the predictive relation
between test scores and the criterion differs across groups  Differential prediction can be assessed from the results of a
multiple regression model in which test scores, group membership, and the group by test score interaction are used to
predict a criterion of interest.
 Group membership and the group by test score interaction are entered into the regression equation to allow for
differences in group regression lines.
 Purpose= to determine whether the same regression line can be used to predict the criterion for all groups of interest.
 If not: different regression lines are needed for different groups, and differential prediction is present.
E.g.: SAT scores used to predict first-year GPA (grouping variables: gender or racial group)  to allow for differences in
regression lines, enter into the equation: 1st, SAT scores; 2nd, group membership (dummy coded); 3rd, the group
membership by SAT score interaction. A sketch of this model appears below.
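In regression form, the model just described is Ŷ = b0 + b1·X + b2·G + b3·(X·G), where X is the test score and G a dummy-coded group indicator; b2 captures intercept differences and b3 slope differences. Below is a minimal sketch of how this check might be run, assuming hypothetical column names (gpa, sat, group coded 0/1) and a hypothetical data file; it illustrates the Cleary-style approach rather than a prescribed implementation:

```python
# Hedged sketch of a differential-prediction (Cleary-style) check.
# Assumes a hypothetical file with columns: gpa (criterion Y),
# sat (test score X), and group (0 = Group A, 1 = Group B).
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.read_csv("admissions.csv")  # hypothetical data set

# Reduced model: one common regression line for everyone.
reduced = smf.ols("gpa ~ sat", data=df).fit()
# Full model: adds group membership and the group-by-score interaction.
full = smf.ols("gpa ~ sat + group + sat:group", data=df).fit()

# Joint test of the group and interaction terms: if non-significant,
# a single regression line suffices and no differential prediction is found.
print(anova_lm(reduced, full))

# Individually: a significant 'group' coefficient suggests intercept bias;
# a significant 'sat:group' coefficient suggests slope bias.
print(full.summary())
```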

When testing for differential prediction, one of these 4 things can happen:

1. No bias  The same outcome is predicted for a given test score regardless of group membership.
This is the ideal situation. Even if the groups differ in their average scores on the predictor (X) and
on the criterion being predicted (Y), a higher or lower score on X in either group (both marked with
the circle in the figure) is associated with correspondingly higher or lower values on Y. The
relationship between X and Y is therefore the same for both groups, and the same regression line can
be used for both. Thus, someone with a given test score would obtain the same predicted criterion
value whether he or she was in Group A or Group B. This can be seen in the graph by how the dashed
line links a given score on X to a predicted score on Y, and this predicted score is the same regardless
of the group.
Different types of differential prediction:

2. Differences in intercept only  The test systematically over- or under-predicts criterion values for a particular group. In
this case, Group B has higher average scores on X, but the two groups have equal average scores on Y. The X-Y
relation has the same slope in both groups, as indicated by the parallel regression lines. This situation is the
one most often associated with a lack of fairness in testing. E.g., Group A is a traditionally
underrepresented group and Group B is a majority group. In the figure it can be seen that although Group B
has higher test scores, members of Group A with the same test score as members of Group B do at least as
well on the criterion. If the common regression line were used (the dotted line obtained when the same
regression equation is fitted to both groups), predictions for Group A would be consistently too low and
predictions for Group B would be consistently too high. Members of the underrepresented group would
therefore obtain predicted criterion values that underestimate their actual performance. This is sometimes
referred to as intercept bias (a simulated illustration of this case is sketched just after this list).
3. Differences in slope only  The groups have the same intercept but different slopes. Slope bias
indicates that the test does not bear the same relation to the criterion in the two groups. E.g., SAT
scores do not predict grades equally well for the two groups. Thus, if a single regression line (shown
as a dotted line) were used, the performance of both groups would be predicted incorrectly. As can
be seen in the figure, the line that relates test scores to the criterion differs across groups.
4. Differences in both intercept and slope  both of the above problems occur together.
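The following is a minimal simulated illustration of case 2 (intercept bias), with entirely made-up numbers: both groups share the same X-Y slope, Group B scores higher on X, yet the groups have equal average criterion scores, so a single pooled regression line under-predicts Group A and over-predicts Group B:

```python
# Simulated intercept bias: same slope, different intercepts, made-up values.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
x_a = rng.normal(45, 10, n)                      # Group A: lower test scores
x_b = rng.normal(55, 10, n)                      # Group B: higher test scores
y_a = 1.0 + 0.05 * x_a + rng.normal(0, 0.3, n)   # higher intercept for Group A
y_b = 0.5 + 0.05 * x_b + rng.normal(0, 0.3, n)   # same slope, lower intercept

# Fit one common regression line to the pooled data.
slope, intercept = np.polyfit(np.concatenate([x_a, x_b]),
                              np.concatenate([y_a, y_b]), 1)

# Mean residual (actual minus predicted) per group:
# positive = the common line under-predicts that group's performance.
print("Group A:", np.mean(y_a - (intercept + slope * x_a)))   # ~ +0.2
print("Group B:", np.mean(y_b - (intercept + slope * x_b)))   # ~ -0.2
```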

Evidence has been found for both types of bias: e.g., a study including nearly 100,000 students who entered college found
that SAT scores underpredicted the grades of women and overpredicted the grades of Black and Hispanic students.

Item bias (or DIF)

DIF= unexpected difference in item difficulty between groups that is due to something other than the construct of interest.

 Methods: require matching members of the groups of interest on levels of the construct to rule out true differences
in construct level as a reason for the DIF.

Therefore, DIF is related to test validity  if an item exhibits DIF, that item is not measuring the same thing in both
groups. This may happen because the test taps into different dimensions in the two groups.

e.g.: a math test might measure knowledge of English + math for non-native English speakers vs. only math knowledge
for native English speakers  such unanticipated multidimensionality can result in DIF. One widely used matching-based
detection method is sketched below.
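One standard implementation of this matching logic (not named in these notes, but widely used) is the Mantel-Haenszel procedure: examinees are stratified on total test score, and within each stratum the odds of answering the studied item correctly are compared across groups. Below is a minimal, self-contained sketch with simulated data; all variable names and values are illustrative assumptions:

```python
# Hedged sketch of Mantel-Haenszel DIF detection on simulated data.
import numpy as np

def mh_odds_ratio(item, group, total):
    """item: 1 = correct; group: 0 = reference, 1 = focal; total: matching score."""
    num = den = 0.0
    for k in np.unique(total):                 # one 2x2 table per score stratum
        m = total == k
        n_k = m.sum()
        a = np.sum(m & (group == 0) & (item == 1))   # reference, correct
        b = np.sum(m & (group == 0) & (item == 0))   # reference, incorrect
        c = np.sum(m & (group == 1) & (item == 1))   # focal, correct
        d = np.sum(m & (group == 1) & (item == 0))   # focal, incorrect
        num += a * d / n_k
        den += b * c / n_k
    return num / den                           # 1.0 = no DIF after matching

# Simulated data: an item that is harder for the focal group at every
# matched score level (a DIF effect of 0.5 on the logit scale).
rng = np.random.default_rng(1)
total = rng.integers(0, 11, 2000)              # total score used for matching
group = rng.integers(0, 2, 2000)
p = 1 / (1 + np.exp(-(total - 5 - 0.5 * group)))
item = rng.binomial(1, p)

alpha = mh_odds_ratio(item, group, total)
delta = -2.35 * np.log(alpha)                  # ETS delta scale for DIF size
print(f"MH odds ratio = {alpha:.2f}, delta = {delta:.2f}")
```

On the ETS delta scale, values of |delta| below 1 are usually treated as negligible DIF and values above 1.5 as large.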

INTERPRETATION OF DIF AND TEST BIAS

Inability to predict which items will result in DIF  DIF can be defined and detected mathematically, but it cannot yet be
predicted in advance  although there has been much theorizing about particular item characteristics that may cause DIF,
researchers have been unable to formulate general rules that can be used to predict it. Suggested reasons:
1- The study of DIF is still relatively new
2- Theories of item difficulty are not well developed—that is, we do not really understand what makes an item difficult
3- The focal groups studied are very heterogeneous
4- DIF is probably not related to just one item characteristic.

In the future, differences in scores may be found to be due to individual differences rather than group differences. In other
words, item effects are so idiosyncratic, and racial and cultural groups are so heterogeneous, that individual effects are
likely to be more salient than group effects.

DIF as Construct-Irrelevant Variance


Standards for Educational and Psychological Testing = “The detection of DIF does not always indicate bias in an item: there
needs to be a suitable, substantial explanation for the DIF to justify the conclusion that the item is biased”.

Bias= a type of construct-irrelevant variance  It can therefore be seen as a validity issue: a situation in which an item
or test has differential validity for two or more groups.
e.g.: the item “What colour is a ruby?” used to appear in tests of general knowledge for children. However, the answer may
be more available to children of high-income parents (who are probably more familiar with rubies). Thus, this question
may measure general knowledge for higher-income children but specialized knowledge for lower-income children, thereby
creating a bias (the item is a valid general-knowledge measure for higher-income children but not for lower-income children).
In order to determine whether an item is biased, researchers must determine why the item shows DIF and whether the
reason is relevant to the construct.

Sources of Test Bias

If the bias-inducing mechanism pervades the entire set of test items in the same way, the result would be group differences
in total test scores as well as item scores. However, the item-level differences would not be detectable. The reason is that
DIF detection procedures involve matching groups on total test scores, and because the total test scores would be a sum
of the item-level differences, the latter differences would essentially be “covaried out” and item-level DIF would not be
detected. Testing Standards suggest the following sources of test bias (some of which are also causes of DIF):

• Test content: Bias can arise if the content of a test is more familiar or interesting to one group than to another.
E.g., Black and Hispanic examinees have consistently been found to perform better on reading passages containing content
relevant to them. Studies spanning over 20 years have shown bias favouring women for content based on the humanities
or on human relationships, and bias against women when the content was science related. Another source of both test
bias and DIF is content that may be offensive or disturbing to a particular group (e.g., reading passages or test items that
refer to the Holocaust). The portrayal of stereotypic roles should also be avoided. Language also matters: language
demands should be kept to a minimum unless the construct being measured is reading, writing, or speaking ability in
that language.

• Test context: Test bias can result from lack of clarity in the instructions, because examinees who are test-wise will be
better able to decipher such directions correctly. A lack of rapport with, or trust in, the person administering the test may
also result in test bias. This is particularly true for tests that are individually administered, as the match of an examinee’s
race, ethnicity, gender, and cultural background to that of the test administrator has been found to affect responses.

• Test responses: In some cases, test bias or DIF can result from responses that were unanticipated or were arrived at using
an unconventional strategy. If examinees from a particular group are more likely to provide such responses, and these
responses are not accounted for in the scoring rubric, that group may be unfairly penalized for responses that are
unconventional but correct. It is also important to ensure that scoring rubrics focus on the most important features of
responses, rather than irrelevant features such as neatness or handwriting (unless these are the focus of the assessment).

• Opportunities to learn: Learning occurs both within and outside the classroom, and examinees from different groups may
have more or less exposure to such learning. In education contexts, it may happen that students in some schools are simply
not taught the material needed to answer questions on an achievement test. In other cases, such as in the ruby example,
some examinees may lack the experiences or informal learning opportunities to acquire certain types of knowledge.

A situation in which differential performance might not be considered bias  a math test in which word problems were
found to be more difficult for Black than for White examinees, whereas disparities between the groups on computational
problems were much smaller. The math content was argued to have two dimensions (computation + word problems). Thus,
assuming that both were part of math ability, this would not be considered bias, since the ability to solve word problems
was relevant to the construct (even if it might lead test developers to reconsider the mix of the two dimensions used on the test).

TEST FAIRNESS

Cole  Psychometric decisions relevant to test bias and policy decisions relevant to the appropriateness of test use
are separate issues  A test may be unbiased in a psychometric sense but still be used in an unfair manner.
 It may be appropriate to interpret a test as a measure of intelligence, but using the test scores to select employees
is a different matter and might result in unfair selection practices.

Fairness in tests= the way in which test scores are used (how scores are interpreted in the evaluation of a test-taker)

↑ Related to test validity


Its evaluation requires a broad range of evidence: empirical data and legal, ethical, political, philosophical, and economic reasoning
Testing Standards  “all steps in the testing process, including test design, validation, development, administration, and
scoring procedures, should be designed in such a manner as to minimize construct-irrelevant variance and to promote valid
score interpretations for the intended uses for all examinees in the intended population”
Universal Design (UD)

The principles of UD were developed to create tests that are as accessible as possible to all test-takers in the intended
population  To create a test in which no one is disadvantaged, and sources of construct-irrelevant variance are minimized.

The use of UD principles in developing and administering tests should ↓ test features that may result in bias.

UD principles:
- Are based on sensitivity to all characteristics that might disadvantage test takers, including language, culture, race, and
sex, but also focus on characteristics of test takers with special needs.

- Include avoiding test characteristics that might disadvantage some test takers, such as unnecessary test speediness,
wording or examples that may be unfamiliar to some test takers, or use of language that might be offensive to some groups.

- Focus on providing a range of test-taking options, such as choice of font sizes and response formats (paper and pencil,
computer, or verbal). Such choices allow the test to be accessible to as many test takers as possible.

Accommodations and Modifications


Not all tests can be made accessible to all test takers, even if universal design principles are used, and in these cases test
adaptations may be needed.

Adaptations = changes to the original test that are intended to increase access to the test for certain test takers. 2 types:

Accommodations preserve the comparability of scores across the adapted and original test versions. E.g.: time extension,
text magnification, braille format, dictionary for non-native, etc.
It’s important that the level of difficulty does not change it should eliminate a source of bias for those needing it but would
not improve the scores of those who do not need it (e.g., braille won’t improve the scores of those with sight, but it will
improve those whose sight is impaired).  Nevertheless, there are cases in which this is not as clear, such as with time
extension. For this accommodation to be done, thus, evidence should be gathered proving that the accommodation
performs better when given extra time, whereas those not qualifying for the accommodation do not (= performance),
which could be fairly difficult since most test-takers would benefit from extra time. This is why as the UD says that no time
limit should be added unless it’s part of the measured construct: eliminating time as a biasing factor eliminates the need
to provide this accommodation.
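The evidence pattern just described amounts to a need-by-condition interaction. Here is a hedged sketch of how such a study might be analyzed, assuming a hypothetical data file with columns score, need (1 = documented need for extra time), and extra_time (1 = extended-time condition):

```python
# Hedged sketch: does extra time help only those who need the accommodation?
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("accommodation_study.csv")   # hypothetical data set

# Interaction model: score ~ need + extra_time + need:extra_time
model = smf.ols("score ~ need * extra_time", data=df).fit()
print(model.summary())

# Supportive pattern: a positive, significant need:extra_time coefficient
# (those with a documented need gain from extra time) together with a
# near-zero extra_time main effect (others do not gain). A large extra_time
# main effect would suggest everyone benefits, undermining comparability.
```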

Modifications  adapt the test in such a way that the construct being measured will likely change to some extent.
They are usually made for test takers for whom the test would not otherwise be accessible  e.g., a test based on reading
passages for a test-taker with severe dyslexia: the test could be read aloud to them. If the construct is defined as the ability
to read, such a modification changes the nature of the construct and thus the interpretation of the test score. If the
construct is general comprehension ability, the construct could be considered the same.

In the first case, it would not generally be appropriate to compare scores from the adapted test to those from the
original test, since such a comparison presupposes that the same inferences can be made from scores on the two tests.
If the adaptation is such that the scores no longer measure the same construct, it is unlikely that inferences from the
two tests will be the same; but if the construct remains the same, then the scores might reasonably be compared.

Standardized testing procedures require standardized administration and scoring conditions, and test adaptations are
essentially disruptions of these standardized conditions  Such adaptations can compromise the validity of the resulting
scores, so research is needed to show that test scores are equally valid for accommodated and non-accommodated tests.
Testing Standards  “When a test is changed to remove barriers to the accessibility of the construct being measured, test
developers and/or users are responsible for obtaining and documenting evidence of the validity of score interpretations for
intended uses of the changed test, when sample sizes permit”

 Any claim that scores from an accommodated test are comparable to those from the non-accommodated test
should be backed up by evidence, in the same manner as any validity claim.
 Modified assessments should be treated as new assessments and should adhere to the same standards of reliability,
validity, and fairness as any other test.
Potential validity threats when modifying a test:

Construct underrepresentation  e.g., accommodations sometimes allow examinees to take tests under untimed
conditions. But if speed of processing is part of the intended construct, such an accommodation would result in an
incomplete representation of the construct.

Construct irrelevance  test adaptations may result in construct-irrelevant easiness: they may go beyond their purpose
of allowing access to the test and instead render it too easy. E.g., allowing calculators on a math test: for students with
dyscalculia this may help, but for the rest it could make the test too easy  a sound adaptation removes barriers to test
access for those receiving it without giving these test takers any unfair advantage.

In some cases, it would not be appropriate to offer testing accommodations at all: if the test measures essential skills
that those taking it must be able to perform, providing accommodations that remove the need for those skills would not
be appropriate (e.g., speed is essential in some jobs, such as short-order cook and firefighter).

Important to standardize procedures if adaptations are provided: who is eligible and how the adaptation should be
administered  Adaptations should only be provided if the test taker has a documented need, such as an individualized
education plan or documentation by a physician, psychologist, or other qualified professional.

Need for More Research on DIF and Test Bias

More research is needed on the causes of DIF and test bias. Said research could use methods like:
o Matching studies  Could be useful for situations in which DIF or test bias is hypothesized to be due to differential
patterns of course taking or to differences in reading or English-language ability. Examinees could be matched on these
variables to determine if bias is reduced or eliminated after matching is done.
o Experimental studies  Could be used to study certain item features (e.g., personally relevant content / inclusion of tables
or figures) that are thought to cause differential responding, by manipulating them in the study.
o Qualitative methods (such as interviews or focus groups)  Think-aloud protocols, in which examinees are asked to verbalize
their thinking as they answer items, can yield valuable information on response strategies. Examinees could also be asked
directly about their strategies or about the effects of various item or test characteristics on their ability to respond. If
items are given in a free-response format, error analyses in which features of incorrect answers are categorized might
provide useful information about misconceptions that occur for certain examinees. In addition to these analyses, which
are typically conducted after test or item bias has been detected, testing organizations also routinely pre-screen items for
possible bias (a process known as a sensitivity review).

Sensitivity Reviews
Most testing organizations are quite sensitive to the possibility of item bias and thus, items are routinely screened for
content that is potentially unfair.
It’s common to convene a group of reviewers, including representatives of all groups for which DIF is considered possible.
 Reviewers typically spend between half a day and a full day on review activities. Procedures differ, but a typical session includes:
1. An orientation in which reviewers are trained in the review process:
a. Specially selected or constructed tests that illustrate common types of biased or unfair items may be used to train
reviewers (though not always).
b. For each item, reviewers indicate whether they think there is a problem and how the item might be revised.
c. Since this judgment is by nature subjective, reviewers are expected to come to a consensus about whether an item is
actually problematic, whether and how it should be revised, or whether it should be deleted.
2. Reviewers provide judgments of actual and/or potential test items.
a. Focus: identifying items that exhibit stereotyping or biased representations of women or minority group members,
offensive wording, or item content with which examinees may have differential familiarity or instruction  Ideally
there would not be any, since the items should already have gone through a careful writing process.
3. Decisions on whether to eliminate or reword some of the test items based on the results of the sensitivity review

Recommendations for sensitivity review panels:

 Include at least 2 representatives from each group. Writers of the items should not serve as sensitivity reviewers.
 Ask reviewers to state specific reasons for judgments of unfairness or bias.
 Provide reviewers with a rating form that delineates specific categories for types of bias (stereotypical, offensive,
unfamiliar, etc.) and includes space for additional comments.
INTERNATIONAL GUIDELINES FOR TEST USE

2.3 Give due consideration to issues of test bias
When tests are to be used with people from different groups (e.g., gender, culture, education, ethnicity, origin,
or age, among others), competent test users will make all reasonable efforts to ensure that:

1. The tests are unbiased and appropriate for all the groups assessed.
2. The constructs being measured are relevant for each of the groups assessed.
3. Data on group differences in performance on the test are available.
4. Data on Differential Item Functioning are available where relevant.
5. Validity evidence supports the use of the test with the different groups.
6. The effects of group differences unrelated to the purpose of the measurement are minimized.
7. Guidelines on test fairness are interpreted within the framework of the relevant legislation of each country.

When tests are used in more than one language (different languages, dialects, sign language, etc.), competent
users will make all reasonable efforts to ensure that:
8. The versions in the different languages or dialects have been developed using a rigorous methodology.
9. The test developers have been sensitive to issues of content, culture, and language.
10. Test administrators are able to communicate fluently in the language in which the test is administered.
11. The test takers' proficiency in the language of the test is systematically checked, and the most appropriate
version, or a bilingual one if necessary, is used.

When tests are used with people who have a disability, competent users will do everything possible to ensure that:

12. Expert advice has been sought on the effects of the disability on test performance.
13. The people to be assessed have been consulted, and their needs and wishes have been addressed appropriately.
14. Appropriate adjustments have been made when assessing people with hearing, visual, or motor disabilities,
dyslexia, or other conditions.
15. The possibility of using alternative assessment procedures, rather than test modifications or adjustments, has
been considered.
16. Expert advice has been sought when the degree of modification required by the test is beyond the user's
experience and knowledge.
17. Modifications, when necessary, are tailored to the nature of the disability and designed to minimize their
impact on the validity of the scores.
18. Information about any adjustment or modification made to the test or its administration is communicated to
those who interpret or use the test scores, so as to facilitate appropriate interpretation of the scores.

Article: "Uso equitativo de tests en ciencias de la salud" (Equitable use of tests in the health sciences)
