
REVIEW OF RELATED LITERATURE

Assessment
Assessment is one of the most critical dimensions of the education process; it focuses not only on identifying how many of the predefined education aims and goals have been achieved but also works as a feedback mechanism that educators should use to enhance their teaching practices. Assessment is among the main factors that contribute to a high-quality teaching and learning environment. Banta and Palomba (2015) note that the term assessment is often used to describe the measurement of what an individual knows and can do and, particularly outside of student affairs in the academic arena, the term has come to mean the assessment of student learning specifically (Suskie, 2009). With that focus on student learning in higher education, the term outcomes assessment has come to "imply aggregating individual measures for the purpose of discovering group strengths and weaknesses" (Banta & Palomba, 2015, p. 1); however, in practice, many people drop the term outcomes and use assessment to describe the specific concept of the assessment of student learning.
On this matter, Lamprianou and Athanasou (2009, p. 22) point out that assessment is connected with the education goals of "diagnosis, prediction, placement, evaluation, selection, grading, guidance or administration". Consequently, assessment is a critical process that provides information about the effectiveness of teaching and the progress of students, and it also makes clearer what teachers expect from students (Biggs, 1999).
According to Depdiknas, as cited in Nur Fatmah's thesis, assessment is the application of various methods to obtain information about the extent to which students have achieved a competence. It is not the same as evaluation. Assessment refers to any of a variety of methods and procedures used to obtain information about students' learning achievement or competence.
The type of assessment used must be suitable for "what" it is that is being assessed, i.e. fit for purpose. It would not be appropriate, for example, to assess a dental student's ability to fill a tooth cavity via a multiple-choice or written examination. Such traditional written examinations and assessments may effectively measure a student's knowledge about a skill, and even their knowledge and understanding of how that skill can be applied. They will not, however, capture the student's ability to actually "do" the skill. For this to be effectively measured, assessment needs to be designed in such a way that the student can demonstrate their ability to do or perform the skill. This can be likened to learning the skill of riding a bike: it is not sufficient to have a person describe how to ride a bike if they cannot actually do it.
Thus, the assessment method chosen must be appropriate for the desired
learning outcomes being assessed.
Furthermore, assessment tasks must be appropriate to the stage or level of the
course within the program. Students must first have the foundational knowledge and
skills within an area, prior to being able to apply that knowledge or use it in other ways.
Assessment tasks early in a program are more likely to focus on recall of information;
whereas it is reasonable to expect students further in their program to competently
complete assessment tasks that require higher levels of cognitive application (i.e.
analyzing, evaluating, comparing, and judging).
The primary purpose of the assessment is also an important consideration when designing appropriate tasks. Is the assessment intended to be formative or summative? For example, a formative assessment where you want students to attempt and practice something new needs to be a low-weight, low-risk assessment for learning. It would not be appropriate (or fair) for such a task to count heavily towards a final grade. To be purely formative, no marks are awarded; instead, the formative task relates directly to an assessment where marks are awarded.
VALIDITY
Validity is described as the degree to which a research study measures what it
intends to measure. There are two main types of validity, internal and external.
Internal validity refers to the validity of the measurement and test itself, whereas
external validity refers to the ability to generalize the findings to the target population.
Both are very important in analyzing the appropriateness, meaningfulness and
usefulness of a research study.
If the results of a study are not deemed to be valid, then they are meaningless to the study. If a study does not measure what we want it to measure, then the results cannot be used to answer the research question, which is the main aim of the study. Such results cannot then be used to generalize any findings, and the study becomes a waste of time and effort. It is important to remember that just because a study is valid in one instance, it does not mean that it is valid for measuring something else.

External Validity. External validity is related to generalization: the extent to which an effect found in the research can be generalized to other populations, settings, treatment variables and measurement variables. It is generally divided into two types: population validity and ecological validity. Both are essential elements for judging the strength of an experimental design.
Internal Validity. Threats to internal validity can come in many forms, and it is important that these are controlled for as much as possible during research to reduce their impact on validity.
The term history refers to effects that are not related to the treatment that may result
in a change of performance over time. This could refer to events in the participant’s
life that have led to a change in their mood etc. Instrumental bias refers to a change
in the measuring instrument over time which may change the results. This is often
evident in behavioral observations where the practice and experience of the
experimenter influences their ability to notice certain things and changes their
standards. A main threat to internal validity is testing effects. Often participants can
become tired or bored during an experiment, and previous tests may influence their
performance. This is often counterbalanced in experimental studies so that
participants receive the tasks in a different order to reduce their impact on validity.
There are four main types of validity used when assessing internal validity. Each type views validity from a different perspective and evaluates different relationships between measurements.
Face Validity. This refers to whether a technique looks as if it should measure the variable it intends to measure. For example, a method where a participant is required to click a button as soon as a stimulus appears, with this time being measured, appears to have face validity for measuring reaction time. Hardesty and Bearden (2004) provide an example of analyzing research for face validity.
Concurrent Validity. This compares the results from a new measurement technique to those of a more established technique that claims to measure the same variable, to see if they are related. Often two measurements will behave in the same way but are not necessarily measuring the same variable; therefore, this kind of validity must be examined thoroughly. An example, and some weaknesses associated with this type of validity, is discussed by Shuttleworth (2009).
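Since concurrent validity is usually expressed as a correlation between the new measure and the established one, a minimal sketch of that check is given below. It assumes hypothetical reaction-time scores for the same participants on both measures and uses the numpy library; the variable names and data are illustrative only and are not taken from any study cited here.

import numpy as np

# Hypothetical scores for the same ten participants on an established
# reaction-time measure and on a new measurement technique (milliseconds).
established = np.array([210, 250, 305, 190, 275, 330, 240, 260, 295, 220])
new_measure = np.array([215, 245, 310, 200, 270, 340, 235, 255, 300, 225])

# Pearson correlation between the two sets of scores; a strong positive value
# is taken as evidence of concurrent validity, although, as noted above,
# correlation alone does not prove the two techniques measure the same variable.
r = np.corrcoef(established, new_measure)[0, 1]
print(round(r, 2))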
Predictive Validity. This is when the results obtained from measuring a
construct can be accurately used to predict behavior. There are obvious limitations to
this as behavior cannot be fully predicted to great depths, but this validity helps predict
basic trends to a certain degree. A meta-analysis by van IJzendoorn (1995) examines
the predictive validity of the Adult Attachment Interview.
Construct Validity. This is whether the measurements of a variable in a study
behave in exactly the same way as the variable itself. This involves examining past
research regarding different aspects of the same variable. The use of construct validity in psychology is examined by Cronbach and Meehl (1955).
A research study will often have one or more of these types of validity, but perhaps not all of them, so caution should be taken. For example, using measurements of weight to measure the variable height has concurrent validity, as weight generally increases as height increases; however, it lacks construct validity because weight fluctuates with food deprivation whereas height does not.
Validity’s relationship to Reliability
It is important to ensure that validity and reliability do not get confused. Reliability is the consistency of results when the experiment is replicated under the same conditions, which is very different from validity. These two evaluations of research studies are independent factors; therefore, a study can be reliable without being valid, and vice versa. However, a good study will be both reliable and valid.
So, to conclude, validity is very important in a research study to ensure that the results can be used effectively, and variables that may threaten validity should be controlled as much as possible.
Validity refers to the evidence base that can be provided about appropriateness
of the inferences, uses, and consequences that come from assessment (McMillan,
2001a). Appropriateness has to do with the soundness (accuracy), trustworthiness, or legitimacy of the claims or inferences (conclusions) that testers would like to make on the basis of obtained scores. Clearly, we have to evaluate the whole assessment
process and its constituent (component) parts by how soundly (thoroughly) we can
defend the consequences that arise from the inferences and decisions we make.
Validity, in other words, is not a characteristic of a test or assessment; but a judgment,
which can have varying degrees of strength.
The second characteristic of good tests is validity, which refers to whether the
test is actually measuring what it claims to measure. This is important for us as we do
not want to make claims concerning what a student can or cannot do based on a test
when the test is actually measuring something else. Validity is usually determined
logically although several types of validity may use correlation coefficients.
According to Brown (2010), a valid test of reading ability actually measures
reading ability and not 20/20 vision, or previous knowledge of a subject, or some other
variables of questionable relevance. To measure writing ability, one might ask
students to write as many words as they can in 15 minutes, then simply count the
words for the final score. Such a test is practical (easy to administer) and the scoring
quite dependable (reliable). However, it would not constitute (represent) a valid test
of writing ability without taking into account its comprehensibility (clarity), rhetorical
discourse elements, and the organization of ideas.
Hence, a valid assessment does not require knowledge or skills that are
irrelevant to what is actually being assessed. Examples include: ability to read, write,
role-play, or understand the context; personality; physical limitations; or knowledge of
irrelevant background information.
In order to provide sound evidence of the extent of a student's learning, the assessment must be representative of the area of learning being assessed. Assessing content randomly selected from only 6 weeks of a 13-week semester for an end-of-semester exam is not representative and will not validly measure a student's overall learning achievement. A student with solid overall knowledge of the entire semester's work can still fail such a narrow exam, because adequate opportunity to demonstrate their learning is not provided.
Validity can also relate to the value of an assessment task in preparing students
for what they will be required to do once they have graduated. Students need to see
clear links to how the assessment is related to their future roles.
Reliability

The reliability of an assessment tool is the extent to which it consistently and accurately measures learning.

When the results of an assessment are reliable, we can be confident that repeated or equivalent assessments will provide consistent results. This puts us in a
better position to make generalized statements about a student’s level of achievement,
which is especially important when we are using the results of an assessment to make
decisions about teaching and learning, or when we are reporting back to students and
their parents or caregivers. No results, however, can be completely reliable. There is
always some random variation that may affect the assessment, so educators should
always be prepared to question results.

Factors which can affect reliability:

• The length of the assessment – a longer assessment generally produces more reliable results.
• The suitability of the questions or tasks for the students being assessed.
• The phrasing and terminology of the questions.
• The consistency in test administration – for example, the length of time given for the assessment, or the instructions given to students before the test.
• The design of the marking schedule and moderation of marking procedures.
• The readiness of students for the assessment – for example, a hot afternoon or straight after physical activity might not be the best time for students to be assessed.

Reliability tools do not have a deliverable like a function or a feature. They are an
assist for the function or feature during the process of creation. As an assist discipline,
it only works if implemented while the other processes are in motion. Attempting to
add it after the design has been completed requires a deconstruction of the previous
work if improvements are needed (Bahret, 2018).

Reliability means the degree to which an assessment tool produces stable and consistent results. It is a concept that is easily misunderstood (Feldt & Brennan, 1989).
Reliability essentially denotes 'consistency, stability, dependability, and accuracy of assessment results' (McMillan, 2001a, p. 65, in Brown, G. et al., 2008). Since there is tremendous variability from teacher to teacher or tester to tester that affects student performance, reliability in planning, implementing, and scoring student performances gives rise to valid assessment.
Internal consistency reliability refers to the consistency of results across the items of a test, ensuring that the various items that measure a given construct provide consistent results.
For example, an English test is divided into vocabulary, spelling, punctuation and grammar. An internal consistency reliability check gives a measure indicating that each of these different skills is measured correctly and reliably. One way to test this is with the test-retest method, where the same test is administered again after the initial test and the results are compared. However, this creates some problems, which is why many researchers prefer to measure internal consistency by including two versions of the same instrument within the same test. Our example of the English test could include two very similar questions about the use of the comma, two about spelling, and so on.
The basic principle is that the student should give the same answer to both. A student who does not know how to use the comma will respond incorrectly both times. Some ingenious statistical manipulation will then provide the internal consistency reliability and
allow the researcher to evaluate the reliability of the test. There are three main
techniques to measure the reliability of internal consistency, depending on the degree,
complexity and scope of the test.
All of them verify that the results and constructs measured by a test are correct
and that the exact type used is dictated by the subject, the size of the data set and the
resources.
Split-Half Test
The split-half test of internal consistency reliability is the simplest type and consists of dividing a test into two halves.
For example, a questionnaire to measure extroversion could be
divided into even and odd questions. The results of both halves are
analyzed statistically and if the correlation between the two is weak, then
the test has a reliability problem.
However, in an era where computers handle all the calculations, scientists tend to use much more powerful tests.
The split-half test gives a measure between 0 and 1, where 1 means a perfect correlation.
The division of the questions into the two halves should be random. Split-half testing was a popular way to measure reliability because of its simplicity and speed.
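To illustrate the calculation itself, the sketch below (Python with the numpy library) correlates odd- and even-numbered item totals and then applies the Spearman-Brown correction to estimate full-test reliability; the score matrix is invented purely for this example.

import numpy as np

def split_half_reliability(scores):
    # scores: one row per student, one column per item (numeric scores)
    odd_total = scores[:, 0::2].sum(axis=1)    # items 1, 3, 5, ...
    even_total = scores[:, 1::2].sum(axis=1)   # items 2, 4, 6, ...
    r_half = np.corrcoef(odd_total, even_total)[0, 1]
    # Spearman-Brown correction: estimates the reliability of the full-length test
    return 2 * r_half / (1 + r_half)

# Hypothetical data: five students answering six right/wrong items
scores = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
])
print(round(split_half_reliability(scores), 2))

Values close to 1 indicate that the two halves agree closely, in line with the description above.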
Kuder-Richardson Test
The Kuder-Richardson test of internal consistency reliability is a more advanced, and somewhat more complex, version of the split-half test.
In this version, the test calculates the average correlation of all possible split-half combinations of a test. The Kuder-Richardson test also generates a correlation between 0 and 1, with a more accurate result than the split-half test. The weakness of this approach, as with the split-half test, is that the answer to each question must be scored as simply correct or incorrect, 0 or 1.
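A minimal sketch of the underlying KR-20 formula is shown below, reusing the same kind of hypothetical 0/1 score matrix as in the split-half example; it combines the item difficulty values p and q = 1 - p with the variance of the students' total scores.

import numpy as np

def kr20(scores):
    # scores: 0/1 matrix, one row per student, one column per item
    k = scores.shape[1]                               # number of items
    p = scores.mean(axis=0)                           # proportion answering each item correctly
    q = 1 - p                                         # proportion answering incorrectly
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - np.sum(p * q) / total_variance)

Applied to the hypothetical score matrix from the split-half sketch, kr20(scores) returns a coefficient that is read the same way, with values near 1 indicating high internal consistency.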
Cronbach's Alpha Test
Cronbach's alpha not only averages the correlation between all possible split-half combinations but also allows multi-level responses.
For example, a series of questions could ask subjects to rate their response
between 1 and 5. Cronbach's alpha gives a score between 0 and 1, where 0.7 is
generally accepted as an acceptable sign of reliability.
The test also takes into account the size of the sample and the number of
possible answers. A test of 40 questions with possible grades between 1 and 5 is
considered to be more accurate than a test of 10 questions with 3 possible levels of
response.
Of course, even with Cronbach's clever methodology, which makes the calculation much simpler than working through every possible permutation, it is still better to leave this test to computers and statistical software.
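For completeness, a minimal sketch of the usual computation is given below, assuming a hypothetical matrix of 1-5 ratings with one row per respondent and one column per item; the result is driven by the ratio of the summed item variances to the variance of the total scores.

import numpy as np

def cronbach_alpha(scores):
    # scores: one row per respondent, one column per item (e.g. ratings from 1 to 5)
    k = scores.shape[1]                               # number of items
    item_variances = scores.var(axis=0, ddof=1)       # variance of each separate item
    total_variance = scores.sum(axis=1).var(ddof=1)   # variance of respondents' total scores
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical ratings from four respondents on five items scored 1-5
ratings = np.array([
    [4, 5, 4, 4, 5],
    [2, 3, 2, 3, 2],
    [5, 5, 4, 5, 5],
    [3, 2, 3, 3, 3],
])
print(round(cronbach_alpha(ratings), 2))   # values above about 0.7 are usually accepted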
Internal consistency reliability measures the degree to which the different items of a test address the intended constructs and offer reliable results. The test-retest method involves administering the same test again after a period of time and comparing the results.
Measuring internal consistency reliability, on the other hand, consists of including two different versions of the same element within the same test.
Instruments in Research
As an example, a researcher will always test the reliability of balance instruments with a set of calibration weights, ensuring that the results obtained are within an acceptable margin of error.
Some high-precision scales can give false results if they are not placed on a
completely flat surface, so this calibration process is the best way to avoid this.
In non-physical sciences, the definition of an instrument is much broader and
ranges from a set of survey questions to an intelligence test. A survey to measure
reading ability in children should produce reliable and consistent results to be taken
seriously.
On the other hand, political opinion polls are known to produce inaccurate results and to have an almost unworkable margin of error.
In the physical sciences, it is possible to isolate a measuring instrument from
external factors, such as environmental conditions and temporal factors. In the social
sciences, this is much more difficult, so any instrument must be tested with a
reasonable range of reliability.
Stability Test
Any reliability test of an instrument should check the stability of the test over time, ensuring that the same test performed on the same individual gives exactly the same results.
The test-retest method is a way to ensure that any instrument is stable over time.
Of course, there is no perfection and there will always be some disparity and
potential for regression, so statistical methods are used to determine if the stability
of the instrument is within acceptable limits.
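One common statistical check here is simply to correlate the scores from the two administrations, as in the hedged sketch below; the score vectors are invented for illustration, and the numpy library is assumed to be available.

import numpy as np

# Hypothetical total scores for the same six students on two administrations
first_administration = np.array([12, 18, 15, 20, 9, 14])
second_administration = np.array([13, 17, 16, 19, 10, 15])

# Test-retest (stability) coefficient: values near 1 suggest the instrument
# yields stable results over time; a low value signals a stability problem.
stability = np.corrcoef(first_administration, second_administration)[0, 1]
print(round(stability, 2))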
Equivalence Test
Testing equivalence involves ensuring that a test administered to two people, or similar tests administered at the same time, gives similar results.
A/B tests are one way to guarantee this, especially in tests or observations where the results are expected to change over time. In a school test, for example, the same test on the same subjects will surely generate better results the second time, so it is not practical to test stability.
Checking that two researchers observe similar results also falls within the scope of equivalence testing.
Internal Consistency Test
The internal consistency test involves ensuring that each part of the test
generates similar results and that each part of a test measures the correct construct.
For example, an IQ test should measure only IQ, and each question should contribute to that measurement. One way to check this is with variations on the split-half test, where the test is divided into two sections that are checked against each other. Odd-even reliability is a similar method used to check internal consistency.
The physical sciences also make use of internal consistency tests, which is why sports drug testers take two samples. Each sample is measured independently in a different laboratory to ensure that experimental or human error does not bias or influence the results.
Hence, reliability of assessment refers to the accuracy and precision of
measurement; and therefore also its reproducibility. When an assessment provides an
accurate and precise measurement of student learning, it will yield the same,
consistent result regardless of when the assessment occurs or who does the marking.
There should be compelling evidence to show that results are consistent across
markers and across marking occasions. If there is only one marker, there should be
evidence to support the claim that marks awarded by that person would be comparable
to those awarded by similarly qualified persons elsewhere.
One area which poses particular challenges for assessment reliability and
consistency is when the same course is delivered and assessed across multiple
campuses. Employing consensus moderation processes will ensure that the standards required of students to achieve a particular mark or grade are consistent and comparable across all campuses.
Assessments need to be reliable if the decisions based on the results are to be
trusted and defensible.
TRUSTWORTHINESS
In qualitative research, trustworthiness has become an important concept
because it allows researchers to describe the virtues of qualitative terms outside of the
parameters that are typically applied in quantitative research. Credibility and internal
validity are also considered to be parallel concepts.
The concepts of validity and reliability are relatively foreign to the field of
qualitative research. The concepts are just not a good fit. Instead of focusing on
reliability and validity, qualitative researchers substitute data trustworthiness.
Trustworthiness consists of the following components: (a) credibility; (b) transferability; (c) dependability; and (d) confirmability.
Credibility
Credibility depends on the richness of the data and analysis and can be
enhanced by triangulation (Patton, 2002), rather than relying on sample size aiming at
representing a population.
There are four types of triangulation as introduced by Denzin (1970), which can
also be used in conjunction with each other:
1. Data triangulation – using different sources of data, e.g. from existing
research
2. Methodological triangulation – using more than one method, e.g. mixed
methods approach, however with focus on qualitative methods
3. Investigator triangulation – using more than one researcher adds to the
credibility of a study in order to mitigate the researcher’s influence
4. Theoretical triangulation – using more than one theory as conceptual
framework
Transferability
Transferability corresponds to external validity, i.e. generalizing a study’s
results. Transferability can be achieved by thorough description of the research
context and underlying assumptions (Trochim, 2006). By providing that information, the research results may be transferred from the original research situation to a similar situation.
Dependability
Dependability aims to replace reliability, which requires that when replicating
experiments, the same results should be achieved. As this would not be expected to
happen in a qualitative setting, alternative criteria are general understandability, flow
of arguments, and logic. Both the process and the product of the research need to be
consistent (Lincoln & Guba, 1985).
Confirmability
Instead of the general objectivity sought in quantitative research, the researcher's neutrality in interpreting the research is required. This can be achieved by means of a
confirmability audit that includes an audit trail of raw data, analysis notes,
reconstruction, and synthesis products, process notes, personal notes, as well as
preliminary developmental information (Lincoln & Guba, 1985).
The approach to sampling differs significantly in quantitative and qualitative
research. Qualitative samples are usually small and should be selected purposefully
in order to select information-rich cases for in-depth study (Patton, 2002). There may
be as few as five (Creswell, 1998, p. 64) or six participants (Morse, 1994, p. 225).
As seen from the above criteria, qualitative research requires far more
documentation than quantitative research in order to establish trustworthiness.
Quantitative research, on the other hand, requires more effort during the research
design phase.
With qualitative and quantitative research serving different objectives and being
designed in a different way, quality assessment criteria must be adapted and adhered
to accordingly.
Credibility and Trustworthiness
Credibility contributes to a belief in the trustworthiness of data through the
following attributes: (a) prolonged engagement; (b) persistent observations; (c)
triangulation; (d) referential adequacy; (e) peer debriefing; and (f) member checks.
Triangulation and member checks are primary and commonly used methods to
address credibility.
Triangulation is accomplished by asking the same research questions of
different study participants and by collecting data from different sources and by using
different methods to answer these research questions. Member checks occur when
the researcher asks participants to review both the data collected by the interviewer
and the researchers' interpretation of that interview data. Participants are generally
appreciative of the member check process, and knowing that they will have a chance
to verify their statements tends to cause study participants to willingly fill in any gaps
from earlier interviews.
Trust is an important aspect of the member check process.
Generalization and Trustworthiness
Transferability is the generalization of the study findings to other situations and
contexts. Transferability is not considered a viable naturalistic research objective. The
contexts in which qualitative data collection occurs define the data and contribute to
the interpretation of the data. For these reasons, generalization in qualitative research
is limited.
Purposive sampling can be used to address the issue of transferability since
specific information is maximized in relation to the context in which the data collection
occurs. That is, specific and varied information is emphasized in purposive sampling,
rather than generalized and aggregate information, which would be the case,
generally, in quantitative research. Purposive sampling requires the consideration of
the characteristics of the individual members of a sample in as much as those
characteristics are very directly related to the research questions.
Reliability and Trustworthiness
Reliability is dependent upon validity. Therefore, many qualitative researchers
believe that if credibility has been demonstrated, it is not necessary to also and
separately demonstrate dependability. However, if a researcher permits parsing of the
terms, then credibility seems more related to validity and dependability seems more
related to reliability.
Sometimes data validity is assessed through the use of a data audit. A data audit can be conducted if the data set is sufficiently rich and thick that an auditor can determine whether the research situation applies to their own circumstances. Without sufficient
details and contextual information, this is not possible. Regardless, it is important to
remember that the aim is not to generalize beyond the sample.
A qualitative researcher must doggedly record the criteria on which category
decisions are to be taken (Dey, 1993, p. 100). The ability of a qualitative researcher to
use the data analysis framework flexibly, to remain open to alterations, to avoid
overlaps, and to consider previously unavailable or unobservable categories, is largely
dependent on the researcher's familiarity and understanding of the data. This level of
data analysis is achieved by wallowing in the data (Glaser & Strauss, 1967).
Qualitative research can be conducted to replicate earlier work, and when that
is the goal, it is important for the data categories to be made internally consistent. For
this to happen, the researcher must devise rules that describe category properties and
that can, ultimately, be used to justify the inclusion of each data bit that remains
assigned to the category as well as to provide a basis for later tests of replicability
(Lincoln & Guba, 1985, p. 347).
The Art of Qualitative Research and Trustworthiness
The process of refining the data within and across categories must be
systematically carried out, such that the data is first organized into groups according
to similar attributes that are readily apparent. Following that step, the data is put into
piles and sub-piles, such that the differentiation is based on finer and finer
discriminations.
Through the process of writing memos, a qualitative researcher records notes
about the emergence of patterns or the changes and considerations that are
associated with the category refining process. Categorical definitions can be expected
to change over the course of the study since that is fundamental to the constant
comparative process-categories become less general and more specific as data is
grouped and regrouped over the course of the research. In defining categories,
therefore, we have to be both attentive and tentative - attentive to the data, and
tentative in our conceptualizations of them (Dey, 1993, p.102).
FAIRNESS
Fairness is akin to defensibility and is an increasingly important concept in
testing. It is dependent on psychometric adequacy, diligence of construction, attention
to consequential validity and appropriate standard setting (McCoubrie, 2004).
Moreover, fairness falls least under the category of ‘performance’ character and
most under the category of ‘moral’ character. Although the importance of fairness goes
beyond this, developing fairness necessarily develops the desire to act towards the
greater good of those around us, and contribute to a society that is better to live in.
Fairness also refers to the consideration of learners' needs and characteristics, and any reasonable adjustments that need to be applied to take account of them. Paulus and Moore (2017) stated that it is important to ensure that the learner is informed about, understands, and is able to participate in the assessment process, and agrees that the process is appropriate. It also includes an opportunity for the person being
assessed to challenge the result of the assessment and to be reassessed if necessary.
Ideally an assessment should not discriminate between learners except on grounds of
the ability being assessed.
Fair and just assessment tasks provide all students with an equal opportunity to demonstrate the extent of their learning. Achieving fairness throughout the assessment of students involves considerations about the workload, timing and complexity of the task.
In addition, a fair assessment must take into consideration issues surrounding
access, equity and diversity. Assessment practices need to be as free as possible from
gender, racial, cultural or other potential bias and provisions need to be made for
students with disabilities and/or special needs.
The impact of fairness and equity norms may render direct wage cuts
unprofitable (Agell and Lundborg 1995; Kahneman, Knetsch and Thaler 1986). Firms
may, therefore, be forced to cut wages in indirect ways, e.g., by outsourcing activities.
Fairness concerns may thus influence decisions about the degree of vertical
integration.
According to Felder, students deem an exam unfair in the following cases: (1) problems on content not covered in lectures or homework assignments; (2) problems the students consider tricky, with unfamiliar twists that must be worked out on the spur of the moment; (3) excessive length, so that only the best students can finish in the allotted time; (4) excessively harsh grading, with little distinction being made between major conceptual errors and minor calculation mistakes; and (5) inconsistent grading, so that two students who make the identical mistake lose different numbers of points.
Test fairness is a value-laden judgement based on several factors, including
moral, ethical, social and sometimes legal standards (Thorndike & Thorndike-Christ,
2010), and is often expressed in guidelines for fairness in testing (Zieky, 2006). Tests
that are unfair result in systematic differences in scores for predefined groups of
examinees – that is, bias. So, test bias is a psychometric property of test scores; it is
the quantitative evidence that supports claims of test unfairness (Furr & Bacharach,
2008).
Fairness, in the credential testing realm, is a type of validity evidence resting on the principle that test results are based solely on the ability of candidates to provide safe, competent practice. It is important to note here that not only examination
items, but exam policies may also hinder the performance of pre-defined groups of
candidates based on factors such as their gender, language, culture, ethnicity and
disability. Any exam item or examination policy that systemically advantages or
disadvantages groups of candidates is said to be unfair. Thus, test fairness is more
akin to the social concept of equity.
