Copyright © 2015. Van Schaik Publishers. All rights reserved. May not be reproduced in any form without permission from the publisher, except fair uses permitted under U.S. or applicable copyright law.
8  Assessing in a multicultural context

OBJECTIVES
By the end of this chapter, you should be able to
o describe the importance of assessment in a cross-cultural context
o list the factors that affect the outcomes of assessment in cultures that differ from those in which the assessment was devised
o describe the various forms of equivalence and the factors that jeopardise the equivalence of different tests and items
o describe the factors that influence the cross-cultural validity and fairness of assessments.

8.1 Introduction

Psychological assessment is being increasingly applied to people from different cultural contexts, either in a single country (involving immigrants) or in different countries. This is done for many reasons – for example, academic researchers may be interested in looking at universals of behaviour across all groups, or they may be interested in national group differences or, finally, they may be interested in individual differences. The focus of their attention depends on why assessing is being done in the first place. In many instances, people are assessed to understand their current level of functioning as may occur when they are experiencing some form of difficulty, either as a result of personal adjustment problems or as a result of circumstances imposed on them by natural disasters. Here one thinks of earthquake survivors or people who have been traumatised by volcanic eruptions and cyclones. People may also have been negatively affected by man-made disasters such as war and nuclear disasters or large-scale chemical pollution accidents such as occurred at Bhopal, India. People are also assessed for educational placement and for the award of academic bursaries and scholarships. In the organisational arena, people are usually assessed for selection purposes to determine their suitability for specific jobs or positions.

In the context of this book, most of this assessment will be for the selection and placement purposes of people with limited English ability and/or experience of English culture, such as with recent immigrant populations, or where the assessments are used in transnational settings. Various economic, political and social developments, both nationally and internationally, in the past few decades have resulted in a great increase in the need for, and interest in, cross-cultural assessment (Van de Vijver, 2002). These trends include a more global economy and increased labour migration, the internationalisation of education, and a massive influx of political refugees into many European and other stable countries, all of which have given impetus to the understanding of cross-cultural interactions. According to a March 2000 report by the International Labour Organization (ILO, 2000, p. 1):

    [t]he growing pace of economic globalization has created more migrant workers than ever before. Unemployment and increasing poverty have prompted many workers in developing countries to seek work elsewhere, while developed countries have increased their demand for labour, especially unskilled labour. As a result, millions of workers and their families travel to countries other than their own to find work. At present there are approximately 175 million migrants around the world, roughly half of them workers (of these, around 15% are estimated to have an irregular status).

EBSCO Publishing : eBook Collection (EBSCOhost) - printed on 8/19/2019 6:56 AM via THE SOUTH AFRICAN COLLEGE OF APPLIED PSYCHOLOGY
AN: 1243028 ; Moerdyk, A. P.; The Principles and Practice of Psychological Assessment
Account: ns190599
SECTION 2 INTRODUCTION TO PSYCHOMETRIC THEORY

The report also predicts further increases in international economic migration, particularly if the disparity in wealth between rich and poor countries continues to grow as it did in the last decade of the 20th century. In addition, political disturbances and natural disasters in many parts of the world have resulted in the displacement and migration of large numbers of people to more stable and economically attractive countries.

As a result of these and similar factors, there is an increasing need to assess various educational, mental health and work-related competencies, and to apply tests and assessments of different kinds across different cultural contexts, either in a single country (involving migrants) or in different countries. In addition, research into various personality and other psychological constructs necessitates widespread assessment across a broad range of cultural and social contexts. Accordingly, when it comes to assessing psychological functioning in such areas as cognitive ability, personality, mental health status and legal competence across a range of contexts – such as in the educational or mental health fields, in neuropsychological assessment, or for selection or promotion in organisational settings – the effects of culture on psychological and cognitive processes and outcomes need to be taken into account.

These assessments need to be carried out in ways that are fair and unbiased, irrespective of why they are carried out. As shown in Chapter 7, fairness is a special case of validity generalisation. (Are our measures equally valid across different groups?) In much the same way, cross-cultural fairness asks whether the measures are equally valid across groups of people with different cultural backgrounds and linguistic ability in the language of assessment. At the same time, as Coyne (2008) has noted, “[e]qual opportunity laws in many countries will prohibit the use of tests in a manner that discriminates unfairly against protected groups of the population (such as gender, racial, age, disability, religion, etc.).”

8.1.1  Definitions of culture

The term “culture” is widely used in anthropology, where a typical definition is as follows:

    Culture is a shared meaning system, a shared pattern of beliefs, attitudes, self definitions, norms, roles and values ... Cultural differences are best conceptualised as different patterns of sampling information found in the environment (Triandis, 2000, p. 146).

A similar definition is offered by Robbins (1996, p. 48):

    The primary values and practices that characterise a particular country.

Perhaps the most widely cited definition of culture is that put forward by Geert Hofstede (1991), who sees culture as “software of the mind” and as “the collective programming of the mind which distinguishes the members of one group or category of people from others” (p. 5).

There are various ways of categorising culture, the most prominent of which are theories by Kluckhohn and Strodtbeck (1961) and Hofstede (e.g. 1980). However, space does not warrant a discussion of these models here. Readers are referred to Hofstede (e.g. 1991) and to Kluckhohn and Strodtbeck (1961).

8.1.2  Emic and etic approaches

A closely related issue is the widely accepted distinction made between emic and etic approaches to the understanding of psychological phenomena in a cross-cultural context. These terms derive from linguistics, where phonemics is the study of sound patterns in a particular social group and phonetics is the study of universal sound patterns. They are discussed in some depth in Chapter 7 (section 7.2.1).

In the field of assessment, especially assessment across sociocultural categories, one is constantly faced with the issue of whether differences in assessment scores reflect group differences or whether they reflect bias and other problems in the measuring technique. In other words, are any measured group differences on various psychological dimensions real, or are they simply artefacts that arise because the assessment processes measure things differently in the different groups? To put it crudely, if we find group differences in ability or personality structure, are they real – that is, can we assume that the assessment techniques we use are correct, and that scores accurately reflect these differences? Or do we argue that the differences in the assessment

outcomes reflect weaknesses in the assessment instruments – that is, they are biased? If people score differently on any assessment technique and this is because the assessment techniques used are invalid, this is a psychometric issue – the measuring tool is “at fault”, and ways of dealing with this need to be found. Alternatively, if the assessment techniques are equally valid for all people being assessed, any observed differences are the result of real social, historical and educational differences that impact on the abilities and behaviours of the people being assessed and hence on the assessment process and/or outcomes. Addressing these differences is a social and a political problem, not a measurement issue. Of course, a middle way between these two extremes is possible – that while some differences in group performance may reflect real differences in ability and structure, few measures are culture-fair, and bias and differential item functioning (DIF) may well suggest differences when none exist.

8.1.3  The issue of acculturation

An important aspect of culture and its impact on psychological structures is acculturation, or the transition from one culture to another. It is commonly known that humans are not static organisms but change in reaction to (and often lead) changes in their environments. An important source of these changes is moving from one sociocultural context to another, for whatever reason. Acculturation is one such process and involves the psychological adaptation of people (such as migrants and minorities) to a new and different cultural setting as a result of movement from one context and adjustment to another (see, for example, Van de Vijver & Phalet, 2004). The extent of this adaptation depends on a range of exogenous variables, such as length of residence, generational status, education, language mastery, social disadvantage and cultural distance (Aycan & Berry, 1996; Ward & Searle, 1991). In addition, it depends on the extent to which the individuals wish to adapt and integrate into the new culture. In this regard, Van de Vijver and Phalet (2004) argue that two basic models of acculturation can be identified in the literature, depending on whether acculturation is seen as a uni-dimensional or a bi-dimensional process. The best-known uni-dimensional model is that proposed by Gordon (1964), which assumes that acculturation is a process of change in the direction of the mainstream culture. Although migrants may differ in the speed of the process, it results in adaptation to the culture of destination.

Recently (over the past two or three decades), the uni-dimensional model has been increasingly replaced with a bi-dimensional model of acculturation that is seen as more appropriate (Ryder, Alden & Paulhus, 2000). Rather than pursuing complete adjustment to the new culture in an assimilationist way, the trend has been towards developing a bi-cultural identity or retaining the original culture without extensively adjusting to the society of settlement. Van de Vijver and Phalet (2004) attribute this to two factors: first, the sheer magnitude of migration has allowed incoming migrant populations to develop and sustain their own cultural institutions such as education, health care and religion; and second, the Zeitgeist of the assimilationist doctrine among existing cultures has been replaced by one that is more accepting of diversity, in which the retention of various cultural institutions and behaviour patterns by migrants is more readily accepted.

As a result, a popular current model is one proposed by Berry (Berry & Sam, 1997). According to this model, a migrant is required to deal with two questions. The first is, does he want to establish good relationships with the culture of destination or his host culture (adaptation dimension)? The second question involves cultural maintenance: does he want to maintain good relations with the culture of origin or his native culture? These two dimensions interact to yield four distinct coping strategies, as shown in Table 8.1.

Table 8.1  Migrants’ strategies in a bi-dimensional model of acculturation

                                               Cultural adaptation:
                                               Do I want to establish good relations
                                               with the culture of destination?
  Cultural maintenance:                        Yes              No
  Do I want to maintain good       Yes         Integration      Separation/segregation
  relationships with my culture
  of origin?                       No          Assimilation     Marginalisation
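The logic of Table 8.1 can be sketched in a few lines of code. The function below is purely illustrative (the function name and return labels are our own, not part of Berry’s published model); it simply maps the two yes/no answers onto the four coping strategies.

```python
def berry_strategy(maintain_origin: bool, adapt_destination: bool) -> str:
    """Map the two questions of the bi-dimensional acculturation model
    (Berry & Sam, 1997) onto the four strategies of Table 8.1."""
    if maintain_origin and adapt_destination:
        return "Integration"              # keep own culture and engage with host culture
    if maintain_origin:
        return "Separation/segregation"   # keep own culture, avoid host culture
    if adapt_destination:
        return "Assimilation"             # adopt host culture, drop culture of origin
    return "Marginalisation"              # ties with neither culture

# A migrant who wants both to maintain the culture of origin and to
# establish good relations with the culture of destination:
print(berry_strategy(True, True))   # Integration
```

In practice the two dimensions are continua rather than dichotomies, so the acculturation measures discussed below use ratings rather than binary answers.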

The first strategy put forward by Van de Vijver and Phalet (2004) is integration, where characteristics of both cultures are maintained in a process of biculturalism. They quote a number of research studies in Belgium and the Netherlands (e.g. Phalet & Hagendoorn, 1996; Phalet, Van Lotringen & Entzinger, 2000; Van de Vijver, Helms-Lorenz & Feltzer, 1999), which consistently show a preference for this strategy, namely that migrants want to combine their original culture with elements of the mainstream culture.

The second strategy identified by Van de Vijver and Phalet (2004) is one where migrants retain most elements of their original culture and generally ignore most aspects of the host culture. Van de Vijver and Phalet term this separation (in sociology and demography it is also labelled segregation). In South Africa, where this cultural separation was enforced by white nationalists, it was termed apartheid or “separate-ness”.

The third strategy is assimilation, which is the opposite of the separation strategy, in that it aims at complete absorption of the migrant into the host culture with the concomitant loss of most elements of the original culture. This is the notion of the melting pot, which was the dominant policy for many years in many Anglophone countries (the UK, the US, Australia and Canada, to name a few). In recent years, this melting-pot view has given way to multiculturalism of various kinds.

The fourth (and, in the view of Van de Vijver and Phalet (2004), the least often observed) strategy is marginalisation, which involves the loss of the original culture without establishing ties with the new culture. In some countries youth, often second or third generation, show marginalisation of this kind; they do not feel any attachment to the parental culture, nor do they want to establish strong ties with the host culture (often they are prevented from identifying with the host culture because of societal discrimination or other forms of exclusion). As Denoso (2010) argues, in real life marginalisation is seen as a negative outcome of the acculturation process, rather than as a conscious choice by the people concerned.

When it comes to assessment, Van de Vijver and Phalet (2004, p. 218) argue that the culture maintenance dimension is usually less relevant than the adjustment dimension. This is so because the position of a person on the latter dimension (which is essentially a continuum rather than a dichotomy) determines the suitability of the assessment technique for the person and the applicability of the norms used for interpreting the outcomes. Simply assuming that all tests are invalid for minority groups, or that they can simply be used with all minority groups, is clearly false: the level of acculturation may be an important moderator of test performance in multicultural groups (Cuéllar, 2000). For this reason, Van de Vijver and Phalet (2004) argue that the various measures of acculturation that have been developed need to be applied as a precursor to assessment in a multicultural context. They argue (p. 218) that

    [i]t is regrettable that assessment of acculturation is not an integral part of assessment in multicultural groups (or ethnic groups in general) when mainstream instruments are used among migrants.

Using this two-dimensional model, measures of acculturation are typically based on different combinations of positive or negative attitudes towards adaptation and maintenance. These attitudes are assessed using three distinct question formats, namely one, two or four questions (Van de Vijver, 2001). The Culture Integration-Separation index (CIS: Ward & Kennedy, 1992) is an example of a one-question format measure. These measures typically ask forced-choice questions, with a choice between valuing the ethnic culture, the host culture, both, or neither, for example: “Do you prefer (A) your own [e.g. Turkish] way of doing things; (B) the Dutch way of doing things; (C) equally like both Turkish and Dutch ways of doing things; and (D) neither – I dislike both ways of doing things”. An advantage of this one-question format is that the questions tend to be efficient and short, but they cannot distinguish the complex attitudes of bicultural individuals.

The two-question format asks for separate importance ratings for maintaining the ethnic culture and for adapting to the host culture, thus assessing individuals’ attitudes to cultural maintenance and to adaptation separately. An example of this is Phalet and Swyngedouw’s (2003) Acculturation in Context Measure (ACM). In this case, the ACM asks these two questions: “Do you think that

[“Culture of Origin Groups”, e.g. Turks] in the [Country of Destination, e.g. the Netherlands] should maintain the [Turkish] culture (4) completely; (3) mostly; (2) only in part; or (1) not at all?” and “Do you think that [Turks] in the [Netherlands] should adapt to the [Dutch] culture (4) completely; (3) mostly; (2) only in part; or (1) not at all?”

Four-question format measures, such as the Acculturation Attitudes Scale (AAS) developed by Berry, Kim, Power, Young and Bujaki (1989), use agreement ratings with four statements that independently assess each of Berry’s four strategies, with participants indicating whether they Agree Strongly (A), Agree (a), Disagree (d) or Disagree Strongly (D) with each of the following statements:

1. I think that [Turks] in [the Netherlands] should maintain the [Turkish] culture and not adopt any Dutch ways of doing things [Separation].
2. I think that Turks in the Netherlands should try to fully adopt Dutch ways and forget about their Turkish ways of doing things [Assimilation].
3. I think that Turks in the Netherlands should try to keep their Turkish customs and culture, while at the same time trying to fit into Dutch culture as far as possible [Integration].
4. I think it is stupid of people to have any form of culture – I reject both my Turkish culture and that of the Netherlands [Marginalisation].

(Note: These are not the actual questions used, but merely illustrate the approach.)

Denoso (2010, p. 38) argues that the two- and four-question format measures successfully discriminate between the integration strategy, which is generally considered to be more adaptive, and the other, less adaptive strategies (Arends-Tóth & Van de Vijver, 2003). On the other hand, Rudmin and Ahmadzadeh (2001, cited by Denoso) have argued that the marginalisation strategy was misconceived and incorrectly operationalised during the test construction process. They argue that the four-fold paradigm commits the Fundamental Attribution Error* by presuming that acculturation outcomes are caused by the preferences of the acculturating individuals rather than by the acculturation situations. They further argue that the four-question approach to assessment in general has poor psychometric properties, in that the questions are ipsative, that is, they are positively correlated and thus not independent of one another (see section 3.6.6).

Another development in the assessment of acculturation is the view that individuals do not adopt a single approach in this area. Rather, the approach adopted is contingent on the situation in which it is shown. In this regard, according to Arends-Tóth and Van de Vijver (2003), the acculturation strategies adopted depend on whether the behaviour occurs in the public or the private domain. Similarly, Phalet and Swyngedouw (2003) found that willingness to engage in maintenance or adaptation was context dependent. In particular, they showed that most migrants tend to favour cultural maintenance in the private domain, such as family relationships, and adaptation to the host culture in the public domain, such as school, work, etc. (Arends-Tóth & Van de Vijver, 2003; Phalet & Andriessen, 2003; Phalet & Swyngedouw, 2003). Moreover, in these studies this acculturation profile was considered the most adaptive pattern. The Acculturation in Context Measure (ACM) developed by Phalet and Swyngedouw (2003) is a two-question format measure that repeats the same questions in multiple contexts (e.g. home, family, school and work situations).

In closing this section on acculturation, Arends-Tóth and Van de Vijver (2006b) provide five guidelines for the assessment of acculturation. These are as follows:

1. Acculturation conditions, orientations and outcomes usually cannot be combined in a single measure. Combining them makes it difficult to determine how acculturation could explain other variables (e.g. cognitive developmental outcomes) if all aspects of acculturation are used as predictors.
2. A measure of acculturation can only be comprehensive if it contains aspects of both the mainstream and heritage cultures.
3. Proxy measures (e.g. generation, number of years living in the country) can provide valuable complementary information to other measures of acculturation, but are usually poor stand-alone measures of acculturation. Simply taking stock of a set of background conditions and ignoring psychological aspects results in an indirect, limited appraisal of acculturation.

4. The use of single-index measures should be avoided. The content validity of these types of measures is typically low and inadequate to capture the multifaceted complexities of acculturation. Moreover, there is no support in the literature for any single-index measure of acculturation.
5. The psychometric properties of instruments (validity and reliability) should be reported.

8.2 Approaches to cross-cultural assessment

In addressing the issues associated with using psychometric instruments in societies for which they have not been developed, Van de Vijver and Hambleton (1996) identify three approaches, which they term Apply, Adapt and Assemble. In this book, the third approach (namely Assemble) has been split into two to yield Develop Culture-Friendly Tests and Develop Culture-Specific Tests. The four approaches discussed in this text are set out below.

8.2.1 Apply

Firstly, instruments that have been developed in one particular social context (essentially Western/Eurocentric) can simply be applied to all groups across different sociocultural settings without checking the meaningfulness or the psychometric properties (such as reliability and validity) of the instruments. This approach adopts an assumption of universality, the view that these instruments retain their original properties in the new setting. Personality questionnaires developed by Eysenck are examples of instruments that have been translated and validated in various cultural groups on the assumption that personality structures and the items assessing each aspect are the same in all cultures and contexts (e.g. Barrett, Petrides, Eysenck & Eysenck, 1998; Eysenck, Barrett & Eysenck, 1985). In Chapter 7 (section 7.2.1), this approach is described as an “etic” approach, and as Van de Vijver (2002, p. 545) notes, it is a form of “blind” application of an instrument in a culture for which it has not been designed, and is simply bad practice where there is no concern for either the applicability of the instrument or its psychometric properties in the new context. He argues that if any instrument is “borrowed” from another cultural group, it must be shown to have been validly adapted: the test items must have conceptual and linguistic equivalence, the test and test items must be free of bias (Fouad, 1993; Geisinger, 1994), and appropriate norms must be developed. These properties have to be empirically determined.

8.2.2 Translate/adapt

Secondly, existing tests and measures can be adapted and translated into the language of the target group. However, this goes beyond a literal and even idiomatic translation, in order to ensure the proper conceptual translation of the test material. For example, the Minnesota Multiphasic Personality Inventory (MMPI – a clinically oriented personality scale) contains various implicit references to the American culture of the test designers, and extensive adaptations to many items are required before the scale can be used in other languages and cultures.

8.2.3  Develop culture-friendly tests

The third approach to assessing cross-culturally is to develop instruments that are designed to measure the targeted construct in ways that are “user friendly” in specific cross-cultural contexts. This was the idea behind the so-called “culture-free” (Cattell, 1940), “culture-fair” (Cattell & Cattell, 1963) and “culture-reduced” tests (Jensen, 1980). The claim that there are psychological assessment processes that are not affected by cultural factors was criticised more than 40 years ago (e.g. Frijda & Jahoda, 1966). Nevertheless, the idea that some assessment formats are more suited for use in cross-cultural contexts than others because of particular features such as their format, mode of administration or item contents still underlies much test design and data analysis in cross-cultural psychology (Van de Vijver, 2002).

8.2.4  Develop culture-specific tests

The fourth approach to assessing cross-culturally is to develop culture-specific instruments from scratch to assess constructs that may be very different in the specific cultural setting (e.g. Cheung, Leung, Fan, Song, Zhang & Chang, 1996). This is especially important

when existing instruments have been shown not only to be invalid and unreliable, but more especially when they do not adequately assess the particular construct in the “other” cultural group. This is termed an “emic” approach. (The origin and meaning of the terms “emic” and “etic” are discussed in some depth in Chapter 7, section 7.2.1.)

Irrespective of whether a psychometric test is taken as is and applied to a new group of people, whether the test has been adapted or whether it has been developed from scratch, it needs to be calibrated or “normed” for the population for which it is to be used. Perhaps more importantly, the behaviour of the test across cultural boundaries needs to be investigated in order to determine whether the tests are measuring the same phenomenon in the same way – do the results mean the same thing for different groups? Put differently, are the tests and their results equivalent across the different groups? In order to examine this further, we need to understand the various factors or sources of bias that contaminate and detract from the cross-cultural validity of our measures.

8.3  Forms of bias

In addressing the issues of cross-cultural equivalence, a useful starting point is to identify the various sources of bias so that steps can be taken to prevent them from contaminating the assessment scores. Van de Vijver and others (Van de Vijver & Leung, 1997a, 1997b; Van de Vijver & Poortinga, 1997) identify three distinct sources of bias and unfairness, assuming that blatant forms of discrimination on the basis of sex, race, caste, etc. are excluded. These are construct bias, item bias and method bias.

8.3.1  Construct bias

Construct bias* is the most important reason for construct inequivalence, and occurs when the constructs are associated with different behaviours or characteristics across cultural groups (“cultural specifics”). Schumacher (2010), for example, argues that in individualistic Western cultures, leadership is usually associated with traits such as dominance and assertiveness, whereas in more communalistic cultures leadership is more likely to be associated with self-effacing, community-supporting traits and behaviours. This is, of course, in line with the findings of Hofstede (e.g. 1991, 1994, 1996), who distinguishes between masculinity and femininity as one of five cultural dimensions. As such, test items assessing self-presenting or self-enhancing traits that would be viewed as important in individualistic Western cultures would be seen as socially undesirable, rated lower, and seen as incongruent with possible leadership emergence and effectiveness in communalistic cultures.

Another example comes from research in personality on the five-factor model. On the basis of widespread research, McCrae and Costa (1997) found considerable evidence for the universality of the structure in US English, German, Portuguese, Hebrew, Chinese, Korean and Japanese samples. On the other hand, Cheung et al. (1996) found that the five-factor model leaves out aspects of psychological functioning that are considered important by Chinese people. For example, interpersonal factors such as “harmony” and “losing face” are often observed when descriptions of personality are given by Chinese informants, but are not represented in the five-factor model.

A third example can be found in Ho’s (1996) work on filial piety (psychological characteristics associated with being a good son or daughter). The Western conceptualisation is more restricted than the Chinese, according to which children are supposed to assume the role of caretakers of their parents when the latter grow old.

Finally, Dyal (1984, cited in Van de Vijver & Phalet, 2004) shows that measures of locus of control often show different factor structures across cultures, strongly suggesting either that the Western concept of control is inappropriate in cross-cultural settings or that the behaviours associated with the concept differ across cultures.

Construct equivalence thus implies that the same construct is being measured across cultures, and inequivalence occurs when the instrument measures a construct differently in two cultural groups, when the concepts of the construct overlap only partially across cultures, or when the measure identifies somewhat different constructs (resulting in “apples and oranges being compared”). This absence of structural equivalence indicates bias at the construct level, and unless construct equivalence is demonstrated, erroneous or misleading conclusions about the

EBSCO Publishing : eBook Collection (EBSCOhost) - printed on 8/19/2019 6:56 AM via THE SOUTH AFRICAN COLLEGE OF
APPLIED PSYCHOLOGY
AN: 1243028 ; Moerdyk, A. P..; The Principles and Practice of Psychological Assessment
Account: ns190599
SECTION 2 INTRODUCTION TO PSYCHOMETRIC THEORY

nature and significance of the construct in the particular context are likely to result. This suggests the need for an “emic” approach involving the development of an appropriate assessment process that is tailored to the unique constellation of dimensions in the particular context.

8.3.2  Item bias

Item bias, also known as differential item functioning or DIF, refers to systematic error in how a test item measures a construct for the members of a particular group (Camilli & Shepard, 1994). When a test item unfairly favours one group of examinees over another, the item is biased. Even if the construct itself does not vary across cultural divides, many of the items in the assessment may behave quite differently in different contexts. When anomalies exist at the item level, item bias is detected (Fontaine, 2005), which points towards differences in the psychological meaning of the items across cultures or the inapplicability of item content in a specific culture. An item of, say, an assertiveness scale is said to be biased if people from different sociocultural contexts with a given level of assertiveness are not equally likely to endorse the item. A good example, given by Hambleton (1994, p. 235, cited by Van de Vijver, 2002, p. 549), is the test item “Where is a bird with webbed feet most likely to live?” The English phrase “the bird with webbed feet” is translated into Swedish as “the bird with swimming feet”, with the result that the English and Swedish items are no longer equivalent, as the Swedish version provides a much stronger clue to the answer than the original English item. (In South Africa, many school-leaving examination papers in technical subjects such as science or biology were in the past presented simultaneously in English and Afrikaans on the same question paper. Many English students, when stumped about the meaning of a technical term, would turn to the Afrikaans version of the question for a clue. For example, in biology, the English term stamen is translated into Afrikaans as meeldraad, which literally translates as pollen wire.)

This type of bias is a major issue in determining the cross-cultural equivalence of a measure and has been extensively studied by psychometricians (see e.g. Berk, 1982; Holland & Wainer, 1993). At the same time, it must be realised that item bias does not reside only in the translation of items from one language to another. Van de Vijver (2002) gives the hypothetical example of the item “Are you afraid when you walk alone on the street in the middle of the night?”, pointing out that this item may be responded to very differently by persons depending on the safety of their neighbourhood, even though they fully comprehend the question. An item is deemed equivalent across cultural groups when it behaves in the same way in both cultures – that is, when this form of item bias is absent. Ways of demonstrating and measuring the extent of equivalence across cultural groups are discussed in some depth in section 8.5, but in general these take the form of chi-square expectancies*, item-whole correlations*, factor loadings* and item characteristic curves* (ICC) of these items, which need to be shown to be (acceptably) similar to each other.

8.3.3  Method bias

The third source of bias in cross-cultural assessment refers to the presence of nuisance variables due to method-related factors. Three types of method bias can be envisaged. First, incomparability of samples on aspects other than the target variable can lead to method bias (sample bias). For instance, cultural groups often differ in educational background and, when dealing with mental tests, these differences can confound real population differences on a target variable. Secondly, method bias also refers to problems that relate to the assessment materials used (instrument bias). A well-known example illustrating this is the study by Deregowski and Serpell (1971), who asked Scottish and Zambian children in one condition to sort miniature models of animals and motor vehicles, and in another condition to sort photographs of these models. Although no cross-cultural differences were found for the physical models, the Scottish children obtained higher scores than the Zambian children when photographs were sorted. In the latter case, the Zambian children were relatively unfamiliar with photographic material.

The third form of method bias arises from the manner in which the assessment is administered (administration bias). Communication problems between testers and testees (or interviewers and interviewees) can easily occur, especially when they have different first languages and cultural backgrounds (see Gass & Varonis, 1991).

Interviewees’ insufficient knowledge of the testing language, and inappropriate modes of address or cultural norm violations on the part of the interviewer, can seriously endanger the collection of appropriate data, even in structured interviews. One can see how computerised administration of a test would affect computer-literate people and those with very little computer experience quite differently.

The distinction between measurement unit equivalence (e.g. degrees Celsius and degrees Kelvin) and scalar equivalence (where the meaning of the values obtained on the measure is identical across groups) is important because only the latter assumes that the measurement is completely free of bias (Van de Vijver & Tanzer, 2004). As indicated, construct bias indicates conceptual inequivalence, and instruments that do not adequately cover the target construct in one of the cultural groups cannot be used for cross-cultural score comparisons. Construct bias precludes the cross-cultural measurement of a construct with the same measuring instrument or scale (Van de Vijver & Tanzer, 2004). If no direct score comparisons are to be made across cultures, then neither method nor item bias will affect cross-cultural equivalence. However, both method and item bias can have major effects on scalar equivalence, as items that systematically favour a particular cultural group may conceal real differences in scores on the construct being assessed.

8.4  Forms of equivalence

Equivalence is essentially the absence of bias – that is, of the systematic but irrelevant component of the observed scores. It is the extent to which any measure yields the same results across different groups, is able to correctly identify individuals or groups possessing equal amounts of the attribute concerned, and correctly distinguishes between people and groups with different amounts of the attribute. As Kanjee and Foxcroft (2009) show (using a South African example),

[f]or measures to be equivalent, individuals with the same or similar standing on a construct, such as learners with high mathematical ability, but belong to different groups, such as Xhosa- and Afrikaans-speaking, should obtain the same or similar scores on the different language versions of the items or measure. If not, the items are said to be biased and the two versions of the measure are non-equivalent (p. 79).

Individual test items and the test as a whole should not vary in levels of difficulty or intensity when the groups are known to be similar. Equivalence is thus achieved when the assessment behaves in a similar way across cultures, as shown by a pattern of high correlations with related measures (convergent validity) and low correlations with measures of other constructs (discriminant validity), as would be expected from an instrument measuring a similar construct. If there are major differences in the way in which the groups behave, or if there are marked differences in the way in which the attributes occur, then specifically designed measures need to be developed and tailored to meet the demands of the cultural context. This means that at least some items will be different in the two countries. This approach is consistent with the “emic” position.

Three kinds of equivalence have been identified and are linked in a hierarchy of increasing importance (Van de Vijver & Poortinga, 1997; Van de Vijver & Leung, 1997a, 1997b). These levels are construct equivalence, measurement unit equivalence and scalar equivalence.

8.4.1  Construct equivalence*

This form of equivalence, also termed structural equivalence* or functional equivalence*, indicates that the same construct is measured across all cultural groups studied, even if the measurement of the construct is not based on identical instruments across all cultures. In cross-cultural assessment, the test constructor or user cannot assume that the construct being assessed has the same meaning and psychological import across cultural divides – this needs to be empirically demonstrated. A measure shows construct bias if there is incomplete identity of a construct across groups or incomplete overlap of the behaviours associated with the construct (Van de Vijver & Phalet, 2004). Construct equivalence thus implies that the construct is universal (i.e. culture independent) and is measured with a given instrument with equal validity in the different sociocultural groups. It assumes

the similarity of the underlying psychological construct in the various groups, a view that is associated with an “etic” position.

8.4.2  Measurement unit equivalence

This second level of equivalence is called measurement unit equivalence* and is obtained when two metric measures have the same measurement unit but different origins – that is, the point at which they cross the y-axis (point “c” in the formula Y = aX + c) (Van de Vijver & Leung, 1997a, 1997b). In other words, the scales can be equated by adding (or subtracting) a constant equal to the difference between the c-values obtained on the two measures. An example of this can be found in the measurement of temperature using the Kelvin and Celsius scales, as shown in Figure 1.1 when the different levels of data are discussed. The two scales have the same unit of measurement, but their origins differ by 273,15 degrees. Converting between degrees Celsius and degrees Kelvin is achieved by adding a constant 273,15 to the former (so that 100 °C is equivalent to 373,15 K).

In some cases, the scores are obtained using two measures that are scaled differently and can therefore not be directly compared. The scores can be compared if the relationship between the two scales is known, as occurs with Celsius and Fahrenheit temperature measures. To illustrate this, conversion from the Celsius scale to the Fahrenheit scale involves multiplying the °C value by 9/5 and adding an offset constant of 32 (so that 60 °C equals 60 × 9/5 + 32, or 108 + 32, which equals 140 °F).

In the case of cross-cultural studies, both measurement unit equivalence and scale relationships need to be known if scores are to be compared. Where the measurement unit is equivalent, direct score comparisons cannot be made across groups unless the size of the offset is known, which is seldom the case. At the same time, differences within each group can still be compared across groups. For example, change scores in pretest–post-test designs can be compared across cultures for instruments with measurement unit equivalence. Similarly, gender differences found in one culture can be compared with gender differences in another culture for scores showing measurement unit equivalence, even though across-group comparisons of each gender are not meaningful.

8.4.3  Scalar equivalence

The third and highest level of equivalence is scalar equivalence*, or full-scale equivalence. This level of equivalence is obtained when two measures have the same measurement unit and the same origin; these need to be demonstrated empirically. Thus a score of 10 on a scale of job satisfaction would have the same psychological meaning in all sociocultural groups assessed only if scalar equivalence has been demonstrated – that is, only if the regression line obtained for the two groups is equivalent in both slope and point of intersection with the y-axis. Full-scale equivalence uses the same scale across cultures, thereby maintaining the same unit of measure. Naturally, such equivalence can only be achieved if scales are universally used and accepted to hold the same universal meaning (e.g. the Fahrenheit or Celsius scale). Scalar or full-scale equivalence thus requires equivalence at both the construct and measurement levels (Van de Vijver & Tanzer, 2004). In other words, full-test or scalar equivalence is achieved when the construct, measurement and scalar characteristics obtained within cross-cultural testing contexts are all similar to those achieved in mono-cultural conditions (Van de Vijver & Tanzer, 2004).

A general model for assessing construct equivalence has been developed by Douglas and Nijssen (2002). (See Figure 8.1.)

8.5  Detecting item bias

Various methods of demonstrating and measuring the extent of equivalence across cultural groups take the form of differences in item means and standard deviations, and various non-parametric techniques based on chi-square expectancies*, item-whole correlations*, factor loadings* and item characteristic curves* (ICC). Differential item functioning (DIF) is perhaps the most important indicator of non-equivalence of assessment items or tests and of bias. At the same time, an item that exhibits DIF may not necessarily be biased for or against any group (Kanjee, 2007), but may reflect performance differences that the test is designed to measure (Camilli & Shepard, 1994) or real differences in the phenomenon being assessed. This is illustrated in Chapter 7 (section 7.3.1),

[Figure 8.1 is a flowchart. A construct/scale developed in Country A is compared for contextual similarity with Country B. Its salience in Country B is assessed on the basis of a literature review of similar/related constructs/concepts, discussion with local researchers (local experts) and, where necessary, evaluative research such as focus groups and in-depth interviews. A literal translation of the scale is then compared with a modified version of the scale/construct (broadened domain specification, added items); the internal structure of each is examined via principal components and/or confirmatory factor analysis, the nomological validity of the literal and modified versions is examined, and finally criterion and predictive validity are assessed.]

Figure 8.1  A general model for assessing equivalence
Source:  Douglas & Nijssen (2002)

where males and females, and athletes with a European and an African heritage, perform quite differently in various sporting events such as long-distance running and sprinting.

In order to detect the presence and extent of inequivalence, we need to move away from classical test theory to what is known as differential item functioning (DIF), which is perhaps the most important indicator of non-equivalence of assessment items or tests and of bias. According to Hambleton, Swaminathan and Rogers (1991), the accepted definition of DIF is as follows:

An item shows DIF if individuals having the same ability, but from different groups, do not have the same probability of getting the item right (p. 110).

DIF thus refers to the differing probabilities of success on an item for people of the same ability but belonging to different groups – that is, when people with equivalent overall test performance but from different groups have a different probability or likelihood of answering an item correctly. DIF analysis aims at identifying these unexpected differences in performance by comparing the performance of matched reference and focal groups of equal ability. Of course, the equivalence of the ability needs to be shown independently of the assessment process – and this creates a great area for heated debate and political grandstanding.
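The matching logic behind DIF analysis can be sketched in a few lines of code. This is a toy illustration, not a procedure from the text: the data and the helper function are hypothetical. Test-takers are stratified by total score (as a stand-in for ability), and the pass rates of a reference and a focal group on a single item are compared within each matched stratum.

```python
# A toy sketch of the matching step in DIF analysis. The records and the
# helper below are hypothetical: test-takers are grouped into ability
# strata by total score, and each group's pass rate on one item is
# compared within each matched stratum.
from collections import defaultdict

def pass_rates_by_ability(records):
    """records: iterable of (group, total_score, item_correct) tuples.
    Returns {total_score: {group: pass_rate}}."""
    tally = defaultdict(lambda: defaultdict(lambda: [0, 0]))  # [passes, n]
    for group, score, correct in records:
        cell = tally[score][group]
        cell[0] += int(correct)
        cell[1] += 1
    return {score: {g: p / n for g, (p, n) in groups.items()}
            for score, groups in tally.items()}

# Two groups matched on ability, but with different success rates on the
# item at the same ability level -- the signature of DIF:
records = ([("reference", 1, c) for c in (1, 1, 1, 0)] +
           [("focal", 1, c) for c in (1, 0, 0, 0)] +
           [("reference", 2, c) for c in (1, 1, 1, 1)] +
           [("focal", 2, c) for c in (1, 1, 0, 0)])
rates = pass_rates_by_ability(records)
# rates[1] -> {"reference": 0.75, "focal": 0.25}: equal-ability
# test-takers, unequal chances of getting the item right.
```

A real analysis would, of course, test such differences for statistical significance rather than eyeball the proportions, which is what the non-parametric procedures discussed in section 8.5.2 do.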

There are several ways in which item bias can be demonstrated. Some are based on expert judgements involving inspection and back translation, while others are based on various forms of statistical analysis. The statistical techniques are divided into two main categories: non-parametric methods developed for dichotomously scored items using contingency tables, and parametric methods for test scores with interval-scale properties based on the analysis of variance (ANOVA).

8.5.1  Judgemental techniques

Judgemental approaches for determining the equivalence of measures rely on the degree to which two or more experts in the area agree that the measures are similar. The most common judgemental approaches to identifying inequivalence involve experts in test construction, very familiar with both the culture of origin and the target culture, who inspect the items for cultural and linguistic equivalence. These techniques include forward translation and back translation of test items. In forward translation, the measure is translated from the source language (SL) into the target language (TL) by a person (or group of people) expert in both languages. TL speakers are then asked to complete the translated measure, and they are questioned by the experts about their responses and their understanding of the various items.

In back translation, the test is translated into the target language and then re-translated by an independent expert back into the source language; a panel of bilingual scholars then reviews the back-translated version to monitor retention of the original meaning. An independent back translation means that “an original translation would render items from the original version of the instrument to a second language, and a second translator – one not familiar with the instrument – would translate the instrument back into the original language” (Geisinger, 1994, p. 306). Once the process is complete, the final back-translated version is compared to the original version (Brislin, 1980; Hambleton, 1994). Finally, the translated version of the assessment is “tried out” with a sample of participants and refined in the light of this experience. This process can be repeated several times. The simplicity of the option has led to its widespread use.

8.5.2  Non-parametric statistical approaches

Whereas the judgemental approaches rely on judgements by experts, non-parametric statistical approaches look for differences in the frequency with which test scores are given, using a contingency approach and the chi-squared statistic. These patterns are based on various factors such as age, gender and cultural-group membership – when differences in the predicted scoring patterns occur on the basis of group membership, bias is identified. Three non-parametric approaches can be identified, namely the Mantel-Haenszel (MH) approach, the Simultaneous Item Bias Test (SIBTEST) and Distracter Response Analysis (DRA).

8.5.2.1  The Mantel-Haenszel (MH) approach

The first non-parametric method for identifying DIF was developed by Mantel and Haenszel (1959) (see also Holland & Thayer, 1988). The Mantel-Haenszel (MH) approach uses contingency tables and is based on the assumption that an item does not show DIF if the odds (or chances) of getting the item correct are the same at all ability levels for two matched groups of test-takers who differ only in terms of their group membership (call these two Group A and Group B). The pass/fail results of Group A and Group B are tabulated in a two-by-two table for each item and compared. This is repeated for each item in the measure. Suppose there are 100 people in each group, and that 58 As and 23 Bs get the item right while 42 As and 77 Bs get the item wrong. This is shown in Table 8.2.

Table 8.2  Contingency table for Item 1

        Group A    Group B
Pass    58         23
Fail    42         77

Just by looking at this distribution, it is clear that the item is much easier for members of Group A than it is for members of Group B, which suggests that Item 1 is biased against the members of Group B.
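The counts in Table 8.2 can be checked with the standard Pearson chi-square statistic for a two-by-two contingency table, which is the quantity the MH procedure builds on. The helper function below is a minimal sketch of ours, not a routine from the text; a full MH analysis would repeat the computation within each ability level, as described next.

```python
# Pearson chi-square for a 2x2 contingency table (df = 1), applied to
# the pass/fail counts in Table 8.2. An illustrative sketch: a full MH
# analysis repeats this within each matched ability level.

def chi_square_2x2(a, b, c, d):
    """Cells: a, b = passes for Groups A and B; c, d = fails."""
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

chi2 = chi_square_2x2(58, 23, 42, 77)
# chi2 is roughly 25.4, far beyond the 5% critical value of 3.84 for one
# degree of freedom, so the two pass rates differ significantly.
assert chi2 > 3.84
```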

However, inspection is not good enough, and so the chi-square statistic is used. MH yields a chi-square test with one degree of freedom to test the null hypothesis that there is no relation between group membership and test performance on one item after controlling for ability as given by the total test score. In other words, an item is biased if there is a significant difference in the proportions of each membership group achieving a correct or desired response on the item. Once the item has been examined in this way, the next step is to compare the scores for Item 2 in exactly the same way. This is continued for Item 3 and all other items, until they have all been compared.

In order to calculate the Mantel-Haenszel statistic, the following steps need to be taken. Firstly, the test data must be coded and scored: each examinee must have (a) a code or label for group membership; (b) the actual response (right or wrong) for each item; and (c) a total score on the test. Secondly, the data for each item must be organised into a three-way contingency table. Thirdly, the statistical analysis for detecting and testing for DIF and item bias (chi-square) has to be conducted for each item and each ability level.

The method outlined above assumes that the amount of DIF is the same across all members of Groups A and B, and that there is no interaction between item difficulties for members with different levels of ability. This assumption is termed uniform DIF and holds when the probability of answering an item correctly is consistently greater for one group over all ability levels – in other words, when there is no interaction between ability level and group membership. As Ekermans (2009) shows, uniform bias results from differences in item difficulty as shown by differences in the regression intercept of the observed item scores on the variable across different sociocultural groups (the offset described in Chapter 3, section 3.6.3). She argues further that if assumptions of scalar equivalence remain untested, there is minimal impact on within-cultural-group decisions, because all scores will be affected in the same direction. At the same time, in the absence of evidence of scalar invariance, between-group differences may be incorrectly interpreted as showing real differences between the groups (Cheung & Rensvold, 2002; Steenkamp & Baumgartner, 1998; Van de Vijver & Tanzer, 2004). In the absence of empirical evidence of metric equivalence of the measurements, any findings about group differences on the attributes being assessed, and the subsequent practical implications of the results in important areas of functioning, are simply not known.

Non-uniform DIF occurs when there are differences in the probabilities of a correct response for the two groups at different levels of ability (in other words, when there is an interaction between ability level and group membership). Non-uniform item bias has implications at the measurement unit (or metric) equivalence level because the variables of interest are not measured on the same metric scales across different groups. As a result, assessment outcome decisions (e.g. personnel selection, mental health status) that are based on the attribute measured may not be meaningful where relative differences exist between groups. The only way around this is to develop and use group-specific norms to avoid adverse impact (as determined by similar selection ratios for majority and minority groups).

When non-uniform DIF occurs or is suspected, it is necessary to calculate DIF scores at the different levels of ability. To do this, the whole sample must be divided into a number of subgroups (K) on the basis of their ability scores (call these K1, K2, K3, etc.). The comparison of responses for each item is then carried out for each of the ability subgroups, so that the passes and fails for Groups A and B are compared at ability level K1, then again at ability level K2, and so on for each ability level. The whole process is then repeated for Item 2 and Item 3, and so on. As can be seen, this approach requires a two-by-two contingency table for each item and each ability level. If there are 50 items and four subgroups (K = 4), the chi-square statistic must be computed 50 × 4 or 200 times. However, as Gierl, Jodoin and Ackerman (2000, p. 11) note, non-uniform DIF is quite rare in practice. Nevertheless, an alternative approach that takes non-uniform DIF into account is likely to be more useful.

8.5.2.2  Simultaneous Item Bias Test (SIBTEST)

A second non-parametric statistical method for detecting DIF is the Simultaneous Item Bias Test (SIBTEST) proposed by Shealy and Stout (1993). SIBTEST, which is an extension of the MH approach, differs from MH by using a more sophisticated matching process. Shealy and Stout argue

that the observed ability score is not the best means of categorising the ability groups, as these scores contain an error component, as shown in Chapter 3, where it is argued that the Observed Score is made up of a True Score, plus or minus an Error Score. As Zumbo (1999) correctly notes, composite scores (i.e. scale total scores) are merely indicators of a latent (unobservable) variable. The SIBTEST uses a regression estimate of the true score instead of the Observed Score as the matching or categorising variable. As a result, examinees are matched on an estimated latent ability score rather than an observed score. An advantage of this method is that SIBTEST can be used to evaluate DIF in two or more items simultaneously in the analysis, whereas in the MH approach a separate analysis has to be carried out for each item in the test. SIBTEST does this by grouping the items into "testlets" or item bundles (Douglas, Roussos & Stout, 1996).

Although MH has been the preferred method in DIF detection (Roussos & Stout, 1996), researchers have shown that SIBTEST has superior statistical characteristics compared to MH, especially for detecting uniform DIF (Narayanan & Swaminathan, 1994; Roussos & Stout, 1996; Shealy & Stout, 1993).
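The MH comparison that SIBTEST refines can be made concrete. The following is a minimal sketch of the Mantel-Haenszel common odds ratio computed across matched total-score strata; the strata and counts are invented for illustration:

```python
# Mantel-Haenszel common odds ratio for one item across matched
# total-score strata. Each stratum holds a 2x2 table of counts:
# (a, b) = reference group (correct, incorrect),
# (c, d) = focal group (correct, incorrect).
strata = [
    (10, 30, 8, 32),   # low-score band
    (25, 15, 20, 20),  # middle band
    (35, 5, 30, 10),   # high band
]

def mh_odds_ratio(strata):
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# A value near 1 suggests no uniform DIF on the item; a value far
# from 1 means one group is favoured at every matched ability level.
print(round(mh_odds_ratio(strata), 3))  # → 1.696
```

Because the two groups are compared within each score band, overall ability differences between the groups do not masquerade as DIF — which is exactly the matching logic that SIBTEST replaces with an estimated true score.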

8.5.2.3  Distracter Response Analysis (DRA)

A variant of the MH approach that can be used when multiple alternatives are provided is known as Distracter Response Analysis (DRA), which examines the incorrect alternatives or distracters to a test item for differences in patterns of response among different subgroups of a population. In the DRA, responses are analysed in terms of the null hypothesis that there is no significant difference in proportions when selecting distracters on the test items between the reference and focal groups. As with MH, contingency tables are used and evaluated using chi-square. In terms of this framework, no item bias occurs when there is no significant difference in the proportion of the different groups selecting particular distracters on the test items.
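The contingency-table test just described can be sketched as follows. The counts of how often each wrong option was chosen are invented, and a plain chi-square test of homogeneity stands in for the full DRA procedure:

```python
# Distracter Response Analysis sketch: do reference and focal groups
# spread differently across the incorrect options (B, C, D) of an item?
observed = {
    "reference": {"B": 40, "C": 25, "D": 15},
    "focal":     {"B": 20, "C": 30, "D": 30},
}

def chi_square(observed):
    groups = list(observed)
    options = list(observed[groups[0]])
    row = {g: sum(observed[g].values()) for g in groups}
    col = {o: sum(observed[g][o] for g in groups) for o in options}
    total = sum(row.values())
    stat = 0.0
    for g in groups:
        for o in options:
            expected = row[g] * col[o] / total
            stat += (observed[g][o] - expected) ** 2 / expected
    return stat

# Compare against the chi-square distribution with
# (groups - 1) x (options - 1) = 2 degrees of freedom (5% cut-off 5,99).
print(round(chi_square(observed), 2))  # → 12.12, so the patterns differ
```

A significant value here would flag the item for follow-up: something about the wrong options is pulling the two groups in different directions.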
8.5.3  Parametric approaches to DIF analysis

8.5.3.1  Item Response Theory (IRT)

Item Response Theory (IRT) is an extremely powerful theory that can be used to detect bias, especially with large-scale testing programmes. The basic argument of IRT is that the higher an individual's ability level, the greater the individual's chance of getting an item correct. This is understandable, as people with higher scores can generally expect to get more items right than those with lower scores. This relationship can be shown graphically by plotting the ability level of the test-taker (represented by the total score) on the x-axis, and the probability of getting the item correct on the y-axis. Such a plot is known as an Item Characteristic Curve or ICC. This is shown in Figure 8.2.

Figure 8.2  Item Characteristic Curve [plot of the probability of a correct response (0,0–1,0) against ability level (low to high) for Item 1 and Item 2]

As can be seen in Figure 8.2, the probability of doing well on Item 1 increases as the ability levels of the individuals taking the test increase – low-ability individuals do relatively badly on the item, whereas high-ability individuals do relatively well on the item. Item 2, on the other hand, is far more difficult, as the probability of getting the item correct remains low, irrespective of the respondents' ability level. The slope of the curve indicates the discriminating power of the item. Note that if the curve is relatively flat, then the item does not discriminate among individuals with high, moderate or low total scores on the measure.

Zumbo (1999, p. 16) identifies a number of parameters that characterise ICCs. These include the slope of the curve (which in the case of a cognitive test indicates the ability of the item to discriminate between individuals), the position along the axis (which represents the item difficulty level, or conversely the ability level required to get the item correct) and the minimum, non-zero level, which represents what someone with zero ability would get – that is, the


guessing level. In the case of a personality measure, these three parameters reflect firstly the ability of the item to distinguish between people with a particular characteristic and those without; secondly, the amount of the characteristic that the person must have to endorse the item; and finally, the likelihood that the person will endorse the item without due consideration, as a result of social desirability, guessing and the like. The meanings of these three parameters in both the cognitive and personality domains are summarised in Table 8.3.

Table 8.3  Interpretation of ICC properties for cognitive and personality measures

Slope (commonly called the a-parameter in IRT)
• Cognitive, aptitude, achievement or knowledge test: Item discrimination – a flat ICC does not differentiate among test-takers
• Personality, social or attitude measures: Item discrimination – a flat ICC does not differentiate among test-takers

Position along the X-axis (commonly called the b-parameter in IRT)
• Cognitive, aptitude, achievement or knowledge test: Item difficulty – the amount of a latent variable needed to get an item right
• Personality, social or attitude measures: Threshold – the amount of a latent variable needed to endorse the item

Y-intercept (commonly called the c-parameter in IRT)
• Cognitive, aptitude, achievement or knowledge test: Guessing
• Personality, social or attitude measures: The likelihood of indiscriminate responding or socially desirable responses
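The three parameters in Table 8.3 correspond to the three-parameter logistic (3PL) model commonly used in IRT. A minimal sketch, with invented parameter values – Item 1 discriminating and of moderate difficulty, Item 2 very difficult, echoing Figure 8.2:

```python
import math

def icc(theta, a, b, c):
    """3PL Item Characteristic Curve: probability of a correct response
    at ability theta, given discrimination a (slope), difficulty b
    (position along the x-axis) and pseudo-guessing c (the curve's
    non-zero minimum)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

for theta in (-2.0, 0.0, 2.0):
    p1 = icc(theta, a=1.5, b=0.0, c=0.2)   # rises steeply around b
    p2 = icc(theta, a=1.5, b=4.0, c=0.0)   # stays low: a very hard item
    print(f"theta={theta:+.1f}  item1={p1:.2f}  item2={p2:.2f}")
```

At theta = b the curve sits halfway between c and 1, and as ability falls the probability approaches c — the guessing level — rather than zero.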
Against this background, it is relatively simple to demonstrate how ICCs can be used to show DIF. Figure 8.3 shows the ICCs for a single item for two groups (A and B) at various ability levels.

Figure 8.3  Item Characteristic Curve demonstrating item bias [probability of a correct response against ability level for Group A (circles) and Group B (squares)]

Here it can be seen that the difficulty level of the item is greater for all ability levels of Group B (squares) than it is for Group A (circles) – the probability of getting the answer right is lower for each ability subgroup in Group B than for the corresponding ability subgroup in Group A. In addition, the difference in the probability of being correct increases as the ability of the groups increases – the gap between Group A and Group B (i.e. the relative difficulty of the item) increases as the ability levels of the groups increase. Remember, the two groups have been matched for ability level in each of the five ability groups.
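The pattern in Figure 8.3 can be reproduced with a simple two-parameter logistic curve. The parameter values below are invented, with the item harder (a larger b) for Group B:

```python
import math

def p_correct(theta, a, b):
    # two-parameter logistic curve for one group on one item
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Uniform DIF: equal slopes, but the item is harder for Group B,
# so B's curve lies below A's at every ability level.
for theta in (-2, -1, 0, 1, 2):
    pa = p_correct(theta, a=1.0, b=-0.5)   # Group A
    pb = p_correct(theta, a=1.0, b=1.0)    # Group B
    print(f"theta={theta:+d}  A={pa:.2f}  B={pb:.2f}  gap={pa - pb:.2f}")

# Non-uniform DIF would appear instead as different slopes (a) for the
# two groups, making the curves cross rather than stay apart.
```

The distinction sketched in the final comment matters for the next section: MH is designed to pick up the first (uniform) pattern, while logistic regression can test for both.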
8.5.3.2  Logistic Regression (LR)

A different approach to detecting DIF is based on parametric analysis (unlike MH and SIBTEST, which are both non-parametric), and makes use of Logistic Regression (LR) techniques (Swaminathan & Rogers, 1990). Logistic Regression is a kind of regression analysis often used when the dependent variable is dichotomous and scored 0 or 1. It can also be used when the dependent variable has more than two categories. It is usually used for predicting whether something will happen or not – anything that can be expressed as Event/Non-event. Independent variables may be categorical or continuous. The logistic regression approach compares group membership variables (such as gender, ethnicity or age) and/or item parameters associated with two groups. With LR, the presence of DIF is determined by testing the improvement in model fit that occurs when a term for group membership and a term for the interaction between test score and group membership are successively added to the logistic regression model. When there is no DIF, the Item Characteristic Curves (ICC) for the two groups will be the same. The null hypothesis is that, for two groups at a given ability level, the population value is zero for either the difference between the proportions correct or the log odds ratio on the test items between the reference and the focal group. A chi-square test is used to evaluate the presence of uniform and non-uniform DIF on


the item of interest by successively testing each term included in the model. LR can thus detect both uniform and non-uniform DIF, which is an improvement over the MH. (For a more detailed account of LR, see Foxcroft & Roodt, 2001, pp. 97–101 and Foxcroft & Roodt, 2009, pp. 83–84.)

The decision whether to use the non-parametric (MH and/or SIBTEST) techniques or the parametric LR technique depends on the situation. As Gierl, Jodoin and Ackerman (2000) have shown, LR is more powerful than MH at detecting non-uniform DIF, since the latter method was only designed to detect uniform DIF (Rogers & Swaminathan, 1993). However, although the IRT approach is superior theoretically and clearly recommended in the literature (Shepard, Camilli & Williams, 1985, p. 84), it requires large sample sizes – the use of three different parameters requires a minimum of 1000 cases per group. As Schumacher (2010) notes,

[t]he Mantel-Haenszel method can be used with smaller sample sizes, while logistic regression, which can be conceptualized as a link between the contingency table method (Mantel-Haenszel) and IRT method, offers a more robust solution under both uniform and non-uniform DIF conditions.
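The nested-model logic of LR-based DIF detection can be sketched end to end. The data set and the optimiser below are invented for illustration (a guarded gradient ascent stands in for the maximum-likelihood routines of real DIF software); because the full model starts from the reduced solution, the likelihood-ratio statistic cannot go negative:

```python
import math

def sigmoid(z):
    # numerically stable logistic function
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def loglik(w, rows, feats):
    ll = 0.0
    for row in rows:
        p = sigmoid(w[0] + sum(wi * f(row) for wi, f in zip(w[1:], feats)))
        p = min(max(p, 1e-12), 1.0 - 1e-12)
        ll += row[2] * math.log(p) + (1 - row[2]) * math.log(1.0 - p)
    return ll

def fit(rows, feats, w=None, steps=2000, lr=0.02):
    """Maximise the log-likelihood by gradient ascent, keeping a step
    only when it improves the fit and shrinking it otherwise."""
    w = list(w) if w else [0.0] * (len(feats) + 1)
    best, step = loglik(w, rows, feats), lr
    for _ in range(steps):
        grad = [0.0] * len(w)
        for row in rows:
            resid = row[2] - sigmoid(
                w[0] + sum(wi * f(row) for wi, f in zip(w[1:], feats)))
            grad[0] += resid
            for j, f in enumerate(feats, start=1):
                grad[j] += resid * f(row)
        trial = [wi + step * gi for wi, gi in zip(w, grad)]
        cand = loglik(trial, rows, feats)
        if cand > best:
            w, best = trial, cand
        else:
            step *= 0.5
    return w, best

# (total score, group, correct): the focal group (1) needs a higher
# score for the same chance of success, i.e. built-in uniform DIF
rows = [(s, g, int(s >= (5 if g == 0 else 7)))
        for g in (0, 1) for s in range(11)]

def score(r): return float(r[0])
def group(r): return float(r[1])
def inter(r): return float(r[0] * r[1])

w_red, ll_red = fit(rows, [score])                 # reduced: score only
w_full, ll_full = fit(rows, [score, group, inter], # full: + group terms
                      w=w_red + [0.0, 0.0])
chi2 = 2.0 * (ll_full - ll_red)   # 2 added terms -> 2 degrees of freedom
print(round(chi2, 2))             # compare with the 5% cut-off of 5,99
```

A significant gain from the group term alone indicates uniform DIF; a significant gain from the score-by-group interaction indicates non-uniform DIF.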
Pedrajita and Talisayon (2009) point out that the common measure of bias across these various approaches is the significance of the chi-square value obtained. A significant chi-square value indicates: (1) a difference in the proportion attaining a correct response across total score categories for the chi-square procedure; (2) a difference in the proportions selecting distracters for the DRA; (3) a difference in the odds of getting an item right between the reference/focal groups compared for the LR; and (4) a large DIF effect for the MH statistic. They argue further that no one method is better than any other, and that the argument for the presence of DIF is strengthened when two, three or all four of the methods yield a statistically significant chi-square value on an item or group of items. They summarise the various methods of DIF detection and their accompanying statistical analyses in Table 8.4.

Table 8.4  Statistical criteria for identifying biased items

• Chi-square – focus of analysis: differences in proportions attaining a correct response across score categories; measure of bias: significance of chi-square
• Distracter Response Analysis – focus of analysis: difference in proportions selecting distracters; measure of bias: significance of chi-square
• Logistic Regression – focus of analysis: odds of getting the item right; measure of bias: significance of chi-square
• Mantel-Haenszel – focus of analysis: performing chi-square statistical tests for the DIF effect; measure of bias: significance of chi-square

Source: Pedrajita & Talisayon (2009), Table 1, p. 25

8.5.3.3  Factor analysis

A third parametric approach to the detection of inequivalence is the use of factor analysis. As shown in Chapter 5, section 5.2.1.3, one way of showing that a measure has theoretical or construct validity is to show that the factor structure of the new measure is very close to that of more established measures assessing the same construct. This technique can also be used to show equivalence, using both Exploratory Factor Analysis (EFA) and Confirmatory Factor Analysis (CFA).

• Exploratory Factor Analysis: EFA can be used to check and compare factor structures, especially when the underlying dimensions of a construct are unclear. Groups can then be compared, and the similarity of the underlying structures can be taken as an indicator of the degree to which the groups attach a similar meaning to the assessment. Multiple groups can be compared either in a pairwise or a one-to-all (each cultural group versus the pooled solution) fashion. Target rotations are employed to compare the structure across countries and to evaluate factor congruence, often by means of the computation of Tucker's phi coefficient (Van de Vijver & Poortinga, 2002). This statistic examines the extent to which factors are identical across cultures. Values of Tucker's phi above 0,90 are usually considered to be adequate and values above 0,95 to be excellent. Tucker's phi can be computed with dedicated software such as an SPSS routine (syntax available from Van de Vijver &


Leung, 1997a, and http://www.fonsvandevijver.org) (He & Van de Vijver, 2012, p. 11).
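Tucker's phi itself is a one-line computation on the loadings of a factor in the two groups. The loadings below are invented for illustration:

```python
import math

def tucker_phi(loadings_a, loadings_b):
    """Congruence coefficient between one factor's loadings in two groups."""
    num = sum(x * y for x, y in zip(loadings_a, loadings_b))
    den = math.sqrt(sum(x * x for x in loadings_a) *
                    sum(y * y for y in loadings_b))
    return num / den

# loadings of the same five items on one factor in two cultural groups
group_a = [0.71, 0.65, 0.58, 0.69, 0.62]
group_b = [0.68, 0.61, 0.55, 0.72, 0.60]
phi = tucker_phi(group_a, group_b)
print(round(phi, 3))  # above 0,90 is adequate, above 0,95 excellent
```

Because phi measures proportionality rather than equality, two groups whose loadings differ only by a constant scaling factor still obtain a phi of 1.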


• Confirmatory Factor Analysis: A more refined and theory-driven way of examining the similarity of factor structures across different groups is through the use of Confirmatory Factor Analysis (CFA, or structural equation modelling). In general, the closeness of one factor structure to another is demonstrated using CFA, in which the "goodness of fit" between the two is determined using the chi-square statistic. This technique can be used to compare the equivalence of factor structures in different cultural settings (Marsh & Byrne, 1993) by showing the degree of similarity between the factor structures obtained in the target group and the reference group. In this respect, the cross-cultural equivalence of the two tests is seen as a form of validity generalisation – is the test equally valid for both groups? If the goodness-of-fit statistic shows an acceptable fit (0,90 or higher), the hypothesis that the two structures are similar cannot be rejected.

CFA is more sophisticated than the EFA approach, as it is based on covariance matrix information to test hierarchical models (He & Van de Vijver, 2011, p. 11). In addition to the use of the chi-square test to determine goodness of fit, this can also be evaluated using the Tucker Lewis Index (acceptable fit is indicated by values above ,90 and excellent fit by values above ,95), the Root Mean Square Error of Approximation (RMSEA, with acceptable fit indicated by values below ,06 and excellent fit by values below ,04), and the Comparative Fit Index (acceptable above ,90 and excellent above ,95) (Kline, 2010). These analyses can be carried out with software such as AMOS and Mplus (Byrne, 2001, 2010).
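These indices are simple functions of the model and baseline chi-square values. A hedged sketch with invented figures (not taken from any published analysis):

```python
import math

def fit_indices(chi2_m, df_m, chi2_0, df_0, n):
    """Tucker Lewis Index, RMSEA and CFI from the fitted model's
    chi-square (chi2_m, df_m), the baseline "null" model's chi-square
    (chi2_0, df_0) and the sample size n."""
    tli = ((chi2_0 / df_0) - (chi2_m / df_m)) / ((chi2_0 / df_0) - 1.0)
    rmsea = math.sqrt(max(chi2_m - df_m, 0.0) / (df_m * (n - 1)))
    cfi = 1.0 - max(chi2_m - df_m, 0.0) / max(chi2_m - df_m,
                                              chi2_0 - df_0, 0.0)
    return tli, rmsea, cfi

tli, rmsea, cfi = fit_indices(chi2_m=54.2, df_m=40,
                              chi2_0=480.0, df_0=55, n=300)
print(f"TLI={tli:.3f}  RMSEA={rmsea:.3f}  CFI={cfi:.3f}")
# With these invented figures TLI and CFI exceed ,90 (acceptable) and
# ,95 (excellent), and RMSEA falls below the ,04 mark for excellent fit.
```

Note how TLI and CFI reward the model for improving on the baseline, while RMSEA penalises misfit relative to the degrees of freedom and sample size — which is why the thresholds run in opposite directions.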
According to Van de Vijver and Hambleton (1996), the advantage of using CFA is that it allows for incomplete overlap of stimuli. However, they point out (p. 10) that the amount of overlap in the conceptualisation of the construct, or the extent of shared behaviours across cultures, can be so small that an entirely new instrument has to be assembled. This is most likely to happen when an instrument that has been developed in one cultural context, usually some Western country, contains various – implicit or explicit – references to the local context of the test developer.

Of course, this approach does not identify particular items that behave differently across the groups, although an examination of the items that load differently on particular factors in the two factor analyses will point to the differential behaviour of items. These may then be further explored, as described in the previous section on DIF.

In conclusion, DIF is a strong indication that some items of the measure, or the measure as a whole, may be biased against one of the socio-cultural groups being assessed. At the same time, DIF is a necessary, but not sufficient, condition for item bias to exist. In other words, if an item does not show DIF, then no item bias is present. However, if DIF is detected, this is not sufficient reason to declare item bias; it rather indicates the possibility that such bias exists, and one would have to apply follow-up item bias analyses (e.g. content analysis, empirical evaluation) to determine the presence of item bias. Two approaches to examining potential measurement bias have been described, namely judgemental approaches and statistical approaches. Judgemental methods rely solely on one or more expert judges' opinions to select potentially biased items. This is clearly an impressionistic methodology; statistical techniques that identify items or measures showing potential bias and then probe these in greater depth are scientifically far more defensible.

8.6  Method bias*

As indicated in Chapter 3, section 3.4, assessments can be administered in many ways, including the following:

• Interviewing
• Pencil and paper
• Card sorting (e.g. sorting a pile of cards with an adjective on each into piles such as "very much like me", "somewhat like me" and "not at all like me")
• Manual (e.g. fitting objects together to make a whole, such as jigsaw puzzles, drawing lines/objects)
• Computerised testing, including adaptive testing


Various problems are experienced when these different formats are used in a cross-cultural context – these are termed method bias or instrument bias*, as discussed in section 8.2.3. Clearly, when people are not used to being assessed (i.e. are relatively low on test wiseness or test sophistication*), they may suffer from test anxiety and as a result tend to underperform. This may be particularly relevant when high-tech methods such as questionnaires and computer-based applications are used, and less likely when interviews and assessment techniques based on culturally familiar methods such as toys and the like are used. Novel assessment techniques have used sand tray drawings, clay modelling, models of animals and everyday objects, and so forth. Such techniques have been used, inter alia, by Deregowski and Serpell (1971), who asked Scottish and Zambian children in one condition to sort miniature models of animals and motor vehicles, and in another condition to sort photographs of these models. Many of these same techniques are used in various forms of psychotherapy, including art therapy (e.g. Oaklander, 1997).

8.6.1  Detecting method bias

Van de Vijver and Hambleton (1996) also argue that an often-neglected source of bias in cross-cultural studies is method bias. They identify several approaches to detecting it, including triangulation, response set detection and non-standard administration.

8.6.1.1  Triangulation

In order to detect method bias, they argue for a process of triangulation (e.g. Lipson & Meleis, 1989) using single-trait, multimethod matrices (e.g. Campbell & Fiske, 1959; Marsh & Byrne, 1993). Unless these different measures that are known to assess similar constructs yield very similar outcomes, one or all of the methods used are likely to be suspect. An alternative method is to use repeated test administrations and to examine score patterns between the two administrations. If individuals from different groups with equal test scores on the first occasion have very different scores on the second administration, the validity of the first administration is open to doubt. They argue that this approach is particularly useful for mental tests.

8.6.1.2  Measures of response sets

A second method for detecting method bias identified by Van de Vijver and Hambleton (1996) involves the measurement of social desirability or other response sets (e.g. Fioravanti, Gough & Frere, 1981; Hui & Triandis, 1989). Should these scores be very different across the cultures assessed, one can surmise that the assessment itself is behaving quite differently in the different contexts.

8.6.1.3  Non-standard administration

Finally, method bias can be examined by administering the instrument in a non-standard way, soliciting all kinds of responses from a respondent about the interpretation of instructions, items, response alternatives and motivations for answers. Such a non-standard administration provides an approximate check on the suitability of the instrument in the target group.

8.7  Addressing issues of bias and lack of equivalence

In general terms, all psychological assessments require the assessors to demonstrate the reliability, validity and fairness of the techniques used. By extension, part of this requirement is that the equivalence of the assessment techniques used in a cross-cultural context also needs to be demonstrated. Minimising bias in cross-cultural assessment usually amounts to a combination of strategies, integrating design, implementation and analysis procedures. Van de Vijver and Tanzer (2004) have identified a number of strategies for describing and dealing with the different biases outlined above.

According to He and Van de Vijver (2011), actions can or should be taken to reduce or prevent low levels of equivalence from occurring at various stages of the assessment process. They identify three such stages, namely the design, implementation and analysis stages (see pp. 9–14). Although this categorisation makes sense, it is difficult to see how actions taken at the analytic stage can reduce inequivalence – at best, analysis will identify the presence, nature and extent of such inequivalence.

8.7.1  At the design stage

The actions that can be taken at the design stage to ensure construct equivalence in a cross-cultural comparative study fall into two broad categories, namely decentring* and convergence* (Van de Vijver & Leung, 1997a). According to Werner and Campbell (1970), cultural decentring means that an instrument is developed simultaneously in several cultures and only the common items are retained for the comparative study; making items suitable for a cross-cultural context in this approach often implies the use of more general items and the removal of specifics, such as references to places and currencies, when these concepts are not part of the construct being measured. This is essentially an adaptation approach. He and Van de Vijver (2011) point out that large international educational assessment programmes, such as the Programme for International Student Assessment (PISA), generally adopt this approach, which involves committee members from target cultures meeting to develop culturally suitable concepts and items.

When the convergence approach is used, instruments measuring similar constructs are developed independently within cultures, and the various instruments are then administered across the various cultures (Campbell, 1986). It is essentially a process of assembly and then adoption. An example of this is given by He and Van de Vijver (2011, pp. 9–10) when they describe a study by Cheung, Cheung, Leung, Ward and Leung (2003). Both the NEO-Five Factor Inventory (NEO-FFI) (a Big Five measure developed and validated mostly in Western countries) and the Chinese Personality Assessment Inventory (CPAI) (which was developed in the Chinese context) were administered to both Chinese and Americans. Joint factor analysis of the two personality measures revealed that the Interpersonal Relatedness factor of the CPAI was not covered by the NEO-FFI, whereas the Openness domain of the NEO-FFI was not covered by the CPAI. Consequently, one can expect that merging items from measures developed in distinct cultural settings may show a more comprehensive picture of personality than when a measure is developed in one setting and then adapted for use in others.

8.7.2  At the implementation stage

Because the interaction between administrators and respondents can be a significant source of error variance, the right administrators/interviewers should be selected so that the respondents feel at ease and do not experience any cultural barriers (Brislin, 1986). As shown in section 8.3.3, an important source of inequivalence that arises during this implementation stage is method bias, which refers to problems caused by the manner in which a study is conducted (method-related issues). Four types of method bias are identified, namely sample bias, instrument bias, response sets and administration bias. Steps need to be taken to address each of these components.

Sample bias arises when sample parameters differ systematically between the people being assessed and those for whom the assessment process was initially developed. These differences may be the result of educational levels, urban versus rural residency and religious affiliation, or even intensity of religious belief. To address the issue of sampling bias, Boehnke, Lietz, Schreier and Wilhelm (2011) suggest that the sampling of cultures should be guided by research goals (e.g. select a broad cultural spectrum if the goal is to establish cross-cultural similarities, and far more homogeneous cultural groups if cultural differences are being looked for). When participants are recruited using convenience sampling, the generalisability of findings to their population needs special attention. Accordingly, sampling must be guided by the distribution of the target variable being assessed. Convenience sampling must be tempered by the nature of the characteristic being investigated in order to match the two samples as closely as possible. If this matching strategy does not work, it may well be possible to control for factors that induce sample bias so that a statistical correction for the confounding differences can be achieved. For example, educational quality has a significant impact on the assessment of intelligence, and therefore the nature, quality and extent of education must be collected for later use as possible moderating or adjustment variables. In this respect, He and Van de Vijver (2011) show, via a study by Blom, De Leeuw and Hox (2011), how, when the non-response information from the European Social Survey (see http://www.europeansocialsurvey.org for more details) was combined with a detailed interviewer questionnaire, systematic country differences in non-response could in part be attributed to interviewer characteristics such as contacting strategies.

As we have seen, instrument bias arises when the assessment method used behaves differently across the different groups, as illustrated by Deregowski and Serpell's (1971) findings in respect of Scottish and Zambian children's ability


to sort photographs and models of animals and motor vehicles.

Response sets refer to systematic differences in the tendency to respond in particular ways. These need to be identified early and response formats adjusted accordingly. For example, if particular groups are known to agree with everything (acquiescence response set), care must be taken to ensure that equal numbers of positively and negatively phrased items are presented in the assessment instrument. These paired items need to be interrogated to ensure consistency of response. Because second-guessing and/or presentation of self may be a problem, a few distracter items should be included to ensure that this is minimised. A useful technique in this respect is to label the instrument in some innocuous way – instead of labelling the scale "Integrity", "Job Satisfaction" or "Trust in Minorities", the instrument could simply be labelled "Attitudes to Others", "Attitudes to Work", and so forth.
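The balanced-keying idea can be made concrete. In this sketch (item keys and responses are invented), half the items are negatively phrased and reverse-scored, so an acquiescent respondent who agrees with everything lands at the scale midpoint rather than at the extreme:

```python
KEYS = ["+", "-", "+", "-", "+", "-"]   # phrasing of each item

def scale_score(responses, keys, low=1, high=5):
    """Sum 1-5 Likert responses, reverse-scoring negatively keyed items."""
    total = 0
    for r, k in zip(responses, keys):
        total += r if k == "+" else (low + high) - r
    return total

acquiescent = [5, 5, 5, 5, 5, 5]        # agrees with every statement
consistent = [5, 1, 5, 1, 5, 1]         # genuinely high on the trait
print(scale_score(acquiescent, KEYS))   # → 18, the neutral midpoint
print(scale_score(consistent, KEYS))    # → 30, the scale maximum
```

Comparing agreement rates on the positively versus negatively keyed halves of the scale is then a quick check for an acquiescence response set in any particular group.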
Administration bias is a form of method bias that comes about as a result of various administration practices and conditions (e.g. data collection modes, class size), ambiguous instructions, interaction between administrator and respondents, and communication problems (e.g. language differences, taboo topics), to name a few. This is not only a question of the administrator's language ability, but also involves sensitivity to important aspects of culture and the avoidance of inappropriate modes of address or cultural norm violations by the interviewer, all of which can seriously endanger the collection of appropriate data, even in very structured assessment situations. For example, male medical staff may experience difficulty in collecting sexual or other sensitive information from female participants, especially in conservative societies. In this regard, a study by Davis and Silver (2003) revealed that, in answering questions regarding political knowledge, African-American respondents got fewer answers right when interviewed by a European-American interviewer than when the information was collected by an African-American interviewer.

In order to minimise this form of bias, a standardised administration protocol should be developed and adhered to by all assessors. The establishment of rapport between the administrators and those being assessed is always crucial, but is of particular importance when assessing cross-culturally. Ensuring proper administration can help minimise the various response biases that can affect the interpretation of cross-cultural assessment processes. This must also involve the provision of clear instructions with sufficient examples. A fact that is often overlooked is the need to ensure rapport with the participants – "warm-up" or practice exercises to ensure understanding of the assessment procedures need to be given at the outset of any assessment process. Needless to say, the results from these practice components are not analysed.

These various forms of bias and the strategies for reducing them are shown in Table 8.5.

Table 8.5  Strategies to reduce bias in cross-cultural assessment

Construct bias
• Decentring (i.e. simultaneously developing the same instrument in several cultures)
• Convergence approach (i.e. independent within-culture development of instruments and subsequent cross-cultural administration of all instruments)

Construct and/or method bias
• Use of informants with expertise in local culture and language
• Use of samples of bilingual subjects
• Use of local surveys (e.g. content analyses of free-response questions)
• Non-standard instrument administration (e.g. thinking aloud)
• Cross-cultural comparison of nomological networks (e.g. convergent/discriminant validity studies, monotrait-multimethod studies, connotation of key phrases)

Method bias
• Extensive training of administrators (e.g. increasing cultural sensitivity)
• Detailed manual/protocol for administration, scoring and interpretation
• Establishing rapport through cultural sensitivity and practice items
• Detailed instructions (e.g. with a sufficient number of examples and/or exercises)
• Use of subject and context variables (e.g. educational background)
• Addressing sample issues
• Use of collateral information (e.g. test-taking behaviour or test attitudes)
• Assessing response styles
• Use of test-retest, training and/or intervention studies

EBSCO Publishing : eBook Collection (EBSCOhost) - printed on 8/19/2019 6:56 AM via THE SOUTH AFRICAN COLLEGE OF APPLIED PSYCHOLOGY
AN: 1243028 ; Moerdyk, A. P.; The Principles and Practice of Psychological Assessment
Account: ns190599
Item bias
- Judgemental methods of item bias detection (e.g. linguistic and psychological analysis)
- Psychometric methods of item bias detection (e.g. Differential Item Functioning analysis)
- Error or distracter analysis
- Documentation of "spare items" in the test manual which are equally good measures of the construct as the test items actually used

Source: Van de Vijver & Tanzer (2004). Reproduced from Van de Vijver, F. J. R., & Tanzer, N. K. Bias and equivalence in cross-cultural assessment: An overview. Copyright © 2004. Elsevier Masson SAS. All rights reserved.

8.7.3 At the analysis stage

As He and Van de Vijver (2012) show, there are a number of different ways of showing the existence of bias at the analysis stage. Among the most important ways that they identify are exploratory factor analysis (EFA) and confirmatory factor analysis (CFA) for different levels of equivalence, and differential item functioning (DIF) analysis for detecting item bias, as outlined above in section 8.5.3. In brief, EFA can be used to check and compare factor structures. However, they argue that CFA is a better approach, as a good fit between the two factor structures suggests that there is a high level of equivalence across the cultural groups. If a discrepancy exists between the two groups, DIF analysis can be used to identify anomalous items. (Recall that DIF indicates that respondents from different groups have differing probabilities of getting an item correct or endorsing the item, after matching on the underlying ability or latent trait that the item is intended to measure (Zumbo, 1999).) In this analysis, scales should be uni-dimensional; for multi-dimensional constructs, DIF analyses can be performed on each dimension (see section 8.5.3 above).

In conclusion, the various sources of bias and inequivalence need to be understood, assessed and combated wherever possible. In addition, the assessment process needs to be carefully documented, and feedback from respondents about their experience of the assessment process should be collected for further analysis and, where the effects cannot be prevented, these data can be used to account for, and adjust for, any systematic differences that may be identified (He & Van de Vijver, 2012, p. 11).

8.8 Summary

The need to assess people from minority or immigrant communities arises as a result of various factors such as migration, natural disasters, warfare, and the like. With particular reference to the job situation, most parts of the developed world have seen large numbers of immigrants for economic reasons, and it is often necessary to determine their suitability for employment and education/training. As a result, occupational psychologists and other selection specialists are increasingly being confronted with the need to assess people whose home language and cultural background are very different from the dominant ethos in which the assessment techniques were conceived and/or are administered.

In looking at how this assessment takes place, Van de Vijver and Hambleton (1996) have identified three approaches, which they term Apply, Adapt and Assemble, although in this text the third approach (namely Assemble) has been divided in two to yield Develop Culture-Friendly Tests and Develop Culture-Specific Tests. In order to explain why any assessment technique cannot be blindly applied in contexts for which it has not been designed, three distinct sources of bias and unfairness, namely construct bias, item bias and method bias, have been identified. The presence of these sources of bias affects the equivalence of the assessment techniques and outcomes when used in different sociocultural groups. In this respect, three kinds of equivalence have been identified and linked in a hierarchy of increasing importance (Van de Vijver & Poortinga, 1997; Van de Vijver & Leung, 1997a, 1997b). These levels are: construct equivalence, measurement unit equivalence and scalar equivalence.

Various methods of detecting and measuring the extent of equivalence across cultural groups take the form of differences in item means and standard deviations, and various non-parametric techniques based on chi-square expectancies*, item-whole correlations*, factor loadings* and item characteristic curves* (ICC). Perhaps the most widely used technique in this regard is Differential Item Functioning (DIF), which refers to the differing probabilities of success on an item of

SECTION 2 INTRODUCTION TO PSYCHOMETRIC THEORY

people of the same ability but belonging to different groups – that is, when people with equivalent overall test performance but from different groups have a different probability or likelihood of answering an item correctly.

There are several ways in which item bias can be demonstrated. Some are based on expert judgement, such as inspection and forward and back translation, while others are based on various forms of statistical analysis. The statistical techniques are divided into two main categories: non-parametric methods developed for dichotomously scored items using contingency tables, and parametric methods for test scores with interval-scale properties based on the analysis of variance (ANOVA). Non-parametric statistical approaches look for differences in the frequency with which test scores are given, using a contingency approach and the chi-square statistic. There are three such non-parametric approaches, namely the Mantel-Haenszel (MH) approach, the Simultaneous Item Bias Test (SIBTEST) and Distracter Response Analysis (DRA). The best known of the non-parametric techniques is the Mantel-Haenszel statistic, which uses chi-square to test the null hypothesis that there is no relation between group membership and test performance on one item after controlling for ability as given by the total test score. In terms of MH, an item is biased if there is a significant difference in the proportions of each membership group achieving a correct or desired response on the item. Once an item has been examined in this way, the process is continued until all items have been compared.

Parametric approaches to DIF analysis make use of Item Response Theory (IRT), which is an extremely powerful theory that can be used to detect bias, especially in large-scale testing programmes. The basic argument of IRT is that the higher an individual's ability level, the greater the individual's chance of getting a more difficult item correct, and the less likely it is that a person with lower ability would get the more difficult items correct. This relationship can be shown graphically by plotting the ability level of the test-taker (represented by the total score) on the x-axis, and the probability of getting the item correct on the y-axis. Such a plot is known as an item characteristic curve or ICC. If this pattern of responses to items of equal difficulty differs across cultural (or other) groups, then it is clear that the items are behaving differently for the different groups – this is what is meant by Differential Item Functioning or DIF.

DIF is a strong indication that some items of the measure, or the measure as a whole, may be biased against one of the sociocultural groups being assessed. At the same time, DIF is a necessary, but not sufficient, condition for item bias to exist – if DIF is detected, this is not a sufficient reason to declare item bias, but indicates the possibility that such bias exists, and various other techniques should be used to determine whether item bias is present. Factor analysis is one such measure that could be used.

In order to detect whether method bias is present, Van de Vijver and Hambleton (1996) suggest several approaches, including triangulation, response set detection and non-standard administration.

Finally, He and Van de Vijver (2012) identify three actions that can/should be taken to reduce or prevent inequivalence from occurring at various stages of the assessment process, namely at the design, implementation and analysis stages.
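The Mantel-Haenszel procedure summarised above can be sketched in a few lines of code. The sketch below is illustrative only – the function name and data layout are invented for this example – but it follows the logic described in the text: examinees are first matched on total test score, a 2 x 2 table (group x correct/incorrect) is built for each score level, and a continuity-corrected chi-square statistic is accumulated across those tables.

```python
from collections import defaultdict

def mantel_haenszel_chi2(scores, groups, item_correct):
    """Mantel-Haenszel DIF check for a single item.

    scores:       total test score per examinee (the matching variable)
    groups:       'ref' or 'focal' per examinee
    item_correct: 1 if the examinee answered the studied item correctly, else 0
    """
    # One 2x2 table per score stratum: [ref/focal] x [correct/incorrect]
    strata = defaultdict(lambda: [[0, 0], [0, 0]])
    for s, g, c in zip(scores, groups, item_correct):
        row = 0 if g == 'ref' else 1
        col = 0 if c == 1 else 1
        strata[s][row][col] += 1

    sum_a, sum_e, sum_v = 0.0, 0.0, 0.0
    for table in strata.values():
        (a, b), (c_, d) = table            # a = reference-group correct, etc.
        n_ref, n_foc = a + b, c_ + d
        m_corr, m_wrong = a + c_, b + d
        total = n_ref + n_foc
        if total < 2 or m_corr == 0 or m_wrong == 0:
            continue                       # stratum carries no information
        sum_a += a                         # observed reference-group successes
        sum_e += n_ref * m_corr / total    # expected under the null hypothesis
        sum_v += (n_ref * n_foc * m_corr * m_wrong) / (total ** 2 * (total - 1))

    if sum_v == 0:
        return 0.0
    # Continuity-corrected MH chi-square (1 degree of freedom)
    return (abs(sum_a - sum_e) - 0.5) ** 2 / sum_v
```

With one degree of freedom, values above 3.84 would conventionally flag the item for closer scrutiny at the 5 per cent level; as the text stresses, such a flag indicates possible DIF, not proof of item bias.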
Additional reading

For good insight into the use of psychometric scales across cultural boundaries, see Douglas, S.P.
& Nijssen, E.J. (2003). On the use of ‘borrowed’ scales in cross-national research: a cautionary note.
International Marketing Review, 20(6), 621–642.

Test your understanding

Essays
1. In the light of the theories discussed in this chapter, revisit Case study 6.1 (p. 72) in Chapter 6 and sug-
gest how you would demonstrate the cross-cultural equivalence of the Trauma Symptom Inventory.
2. Suppose that you want to compare two countries on individualism–collectivism and its effect, if any,
on workplace behaviour, bearing in mind that the sample from one country has, on average, a higher
level of education than the sample from the other. Discuss how this difference could challenge
your findings and how you could try to disentangle educational and cultural differences.
3. Suppose that you wanted to investigate the conformity levels of employees in your organisation,
which has sizeable groups of people from Eastern Europe, Asia, the US and South Africa. How can
sources of method bias be controlled in this cross-cultural study? Discuss procedures at
both the design and analysis stages.