Messick (1989, p. 11) defined validity as “an integrated evaluative judgment of the degree to which empirical evidence and theoretical rationales support the adequacy and appropriateness of inferences and actions based on test scores or other modes of assessment”. In other words, validity concerns how well weekly exercises, mini-tests, midterm tests, final tests, and international standardized tests measure the language ability of learners and achieve their intended purposes. In research, for instance, validity refers to the extent to which results from the study participants represent individuals outside the study. This concept of validity also applies to most clinical studies, including research that analyses prevalence, associations, and interventions. In the language teaching process, six main qualities make validity one of the crucial parts of assessment. First, a valid test focuses as much as feasible on actual performance. Second, it does not measure unrelated or irrelevant variables. Third, a valid test relies on empirical evidence of performance in real-world situations. Fourth, a valid test requires performance that samples the test's criterion. Fifth, it provides useful, meaningful information about a test-taker's ability. Finally, there must be a theoretical justification or rationale for a test to be valid.

Theoretically, there are four different kinds of validity evidence. The first is construct validity, which is concerned with the appropriateness and significance of the inferences we draw from test results. The central question is how we justify interpreting assessment scores as indicators of test takers' language proficiency. The test maker must therefore provide clear and authentic details of the language assessment criteria. In other words, the teacher needs to create tasks that measure learners' language proficiency accurately and appropriately. To achieve that, we may see construct validity as a test task's explicit definition of the skill that serves as the foundation for scoring and analysis. For example, a clinician who wants to measure a client's depression severity may choose the Beck Depression Inventory, a self-scored depression questionnaire supported by evidence that its items are accurate and valid. In establishing the construct validity of score interpretations, there are two essential elements. First, the test maker should define the characteristics of the task, including the rubric, the input, the expected response, and how learners interact in specific scenarios or tests (Alderson, 2000; Bachman, 1990; Bachman, 1991; Bachman and Palmer, 1996; Douglas, 2000). Second, the teacher has to identify the students' language competencies that the test is meant to tap, which in turn supports setting acceptable performance levels. The second kind is content-related evidence, which asks whether the behaviors a test elicits from students adequately sample the content the test claims to cover (e.g., Hughes, 2003; Mousavi, 2009). To establish content-related evidence, the goals of the measurement process must be stated as clearly as possible. You cannot, for instance, ask students to run a marathon to estimate their second-language speaking ability. An authentic context, by contrast, uses print and auditory materials that engage learners in tasks such as creating a menu, writing blog posts, or hosting a TV show, all of which serve that purpose well. To appreciate the requirement of content validity, it helps to clarify the difference between direct and indirect testing. Direct testing elicits the actual target performance from candidates. In indirect testing, by contrast, candidates do tasks that are only indirectly related to the target performance. For instance, a teacher who wants to measure learners' oral production may, instead of asking students to produce speech individually, use written tasks such as marking word stress or picking the odd word out. Next, criterion-related evidence examines how well results on a test correspond to an external criterion measure, which helps decide which test is more effective and valid. For example, a researcher who wants to know whether a college entrance exam can forecast future academic success may use first-semester GPA as the criterion variable, as sketched after this paragraph. Last but not least, Brindley (2001), Fulcher and Davidson (2007), Kane (2010), McNamara (2000), Messick (1989), and Zumbo and Hubley (2016) also note the importance of consequential validity, which refers to the possible consequences of an assessment. For instance, the consequences of standardized examinations include positive effects such as motivating students and ensuring that all students have access to the same curriculum.
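
To make the criterion-related example concrete, here is a minimal sketch, with entirely hypothetical scores and variable names, of estimating predictive validity as the Pearson correlation between entrance-exam results and first-semester GPA:

```python
import math

# Hypothetical data: entrance-exam scores and first-semester GPA
# for the same ten students (illustrative values only).
exam = [520, 580, 610, 490, 700, 650, 560, 630, 540, 600]
gpa = [2.6, 2.9, 3.1, 2.4, 3.8, 3.4, 2.8, 3.3, 2.7, 3.0]

def pearson(x, y):
    """Pearson correlation, the usual index of criterion-related validity."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(f"predictive validity coefficient r = {pearson(exam, gpa):.2f}")
```

A coefficient near 1 would suggest the exam forecasts academic success well; a value near 0 would suggest it does not.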

IELTS (the International English Language Testing System) is one of the most reputable language proficiency assessments and is hugely popular around the globe. Yen and Zenma (2009) indicated a significant relationship between IELTS scores and GPA, which partly demonstrated that IELTS can predict students' academic progress. An early study calculating the predictive validity of IELTS was carried out by Fiocco in 1992 with 61 non-English-speaking participants at Curtin University in Western Australia. In the first phase, there appeared to be almost no relationship between IELTS grades and semester weighted averages (PPI). However, the second phase showed the opposite result: students' reports about teaching methods, lecturers, and a sense of solitude and alienation emphasized the essential role of language proficiency in improving GPA (Dooey and Oliver, 2002). Ferguson and White (1994) found that the predictive validity of IELTS is stronger when IELTS scores are lower. Specifically, IELTS Reading scores had a positive effect on academic records, whereas IELTS Speaking scores showed an inverse correspondence with overall academic outcomes. As a result, the scores of the reading subtest substantially gauged future academic success (p. 109).

Reliability refers to the consistency and dependability of a test: it ensures that outcomes stay the same for the same participant at different times. For example, if the same person steps on a scale twice, it must report the same weight. The principles of reliability include consistency between two or more scorers, clarity in the assessment, uniform standardization, scorer consistency in creating and applying rubrics, and giving unambiguous tasks to candidates.

In addition, researchers identify several factors that can make a test unreliable, including the student, the rater, the test administration, and the quality of the test itself. First, test-taker emotions and behaviors such as anxiety, depression, and carelessness can make scores unstable. Next, individual prejudice and subjective judgment affect inter-rater reliability, which is achieved only when the examiners reach consensus in their assessments (see the sketch after this paragraph). If the judges lack inter-rater reliability, they may apply the scoring criteria inconsistently. Lumley (2002) offered valuable ideas for reducing inter-rater unreliability. In the classroom, intra-rater reliability is a key factor of fairness, because some teachers may be partial in giving scores to their favorite students. In addition, teachers may evaluate more strictly or more leniently at different times and thereby cause an imbalance in scoring. Therefore, before assigning any grades, it is a good idea to score half of the papers carefully and then go back and reread the entire set. In this way, teachers can ensure fair assessment despite fatigue or bias. Barkaoui (2011) especially emphasized that writing skills are difficult to evaluate because writing involves many facets for which it is hard to set a common criterion. Nevertheless, analytic rating tools can improve both inter- and intra-rater reliability by applying a detailed classification scheme to recorded data or analysed material samples. Thirdly, unreliability can arise when test administration runs into trouble. For example, students may be distracted by the loud honking of a car horn in the street while taking a listening test. Last but not least, in rare cases, the nature of the test itself can directly affect the assessment of test takers' competence. An effective test meets many criteria, including appropriate difficulty, properly designed distractors, and a distribution of items from easy to difficult. Weaknesses in these areas, such as rater bias or poorly written test items, can diminish reliability.
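
As a minimal sketch of how inter-rater agreement can be quantified (the passage above names no particular statistic; Cohen's kappa is one common choice, and the ratings below are hypothetical):

```python
from collections import Counter

# Hypothetical pass/fail judgments by two examiners on the same ten essays.
rater_a = ["pass", "pass", "fail", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
rater_b = ["pass", "fail", "fail", "pass", "fail", "pass", "pass", "pass", "pass", "fail"]

def cohens_kappa(a, b):
    """Cohen's kappa: inter-rater agreement corrected for chance agreement."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters independently pick the same category.
    expected = sum(counts_a[c] * counts_b[c] for c in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

print(f"inter-rater agreement (Cohen's kappa) = {cohens_kappa(rater_a, rater_b):.2f}")
```

A kappa near 1 indicates strong agreement between examiners; a value near 0 indicates agreement no better than chance.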

For the IELTS test, reported reliability is based on test production: the Reading and Listening papers are designed to reach a Cronbach's alpha of 0.88 on average. In addition, data from over 90,000 test-takers on approved Reading and Listening versions have confirmed that these are consistent and reliable measurements (UCLES, 2007). The reliability of the Writing and Speaking tests is safeguarded through thorough training, certification, and oversight procedures for examiners. To be more specific, the Speaking test is recorded and the entire recording is archived for two months in case candidates want to appeal their scores. IELTS scores are published on the organization's homepage or sent to personal devices so that test takers can look up and review them within two weeks of receiving the results. Despite the test's credibility and worldwide distribution, however, the IELTS homepage currently provides no clear evidence of the reliability of the individual modules. The reported reliability of the test rests entirely on the framework of Feldt and Brennan (1989), which yields a high reliability coefficient of 0.95 and a low standard error of measurement (SEM) of 0.21 (a computational sketch follows at the end of this section). Researchers, however, offer differing arguments about reliability. Lado (1961) suggested that tests of listening comprehension are less reliable than tests of reading comprehension, which bears on how IELTS scores correlate with school results. Dooey and Oliver (2002) argued that the correlation between academic performance and language competence depends on the nature of the discipline. Alshammari (2016) mentioned the rigor of the IELTS test as a factor affecting test takers' performance. Perhaps more research is therefore needed to perfect the scoring system so that it matches the test output. The overall IELTS score consists of four separate sections that assess each learner's skills: Listening, Speaking, Reading, and Writing. Accordingly, Fazel and Ahmani (2011) synthesized the framework of Feldt and Brennan (1989) to apply it to the assessment of each skill.

Overall, IELTS is still seen as a standard measure of second language proficiency. Governments have recognized the value of IELTS for the purposes of studying, living, and working in English-speaking countries, and it has therefore attracted a large number of academics to examine its strengths and weaknesses. Extensive studies have partly proven the reliability of IELTS. Nowadays IELTS is also required by businesses that want candidates to have a relatively good level of English for the job. Although IELTS has limitations that need to be overcome, it can be gradually improved depending on the purpose of study and the overall goals of the learner. Finally, schools need to be mindful of some predictable consequences of using IELTS as a benchmark for all students so that it can become a suitable criterion for the vast majority.
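
As a minimal sketch of how the reliability statistics cited above are computed (the item scores here are made up, since IELTS does not publish item-level data), Cronbach's alpha and the SEM can be derived from a matrix of item scores:

```python
import math

# Hypothetical item scores (rows = test takers, columns = items)
# for a short listening quiz; illustrative data only.
scores = [
    [1, 0, 1, 1, 1],
    [1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
]

def variance(xs):
    """Sample variance (n - 1 denominator)."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

def cronbach_alpha(rows):
    """Cronbach's alpha: internal-consistency reliability of a set of items."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

alpha = cronbach_alpha(scores)
total_sd = math.sqrt(variance([sum(row) for row in scores]))
# SEM = SD * sqrt(1 - reliability): the expected spread of an observed
# score around the test taker's "true" score.
sem = total_sd * math.sqrt(1 - alpha)
print(f"Cronbach's alpha = {alpha:.2f}, SEM = {sem:.2f}")
```

The higher the reliability coefficient (the 0.88 and 0.95 figures above), the smaller the SEM for a given score spread, which is why a low SEM of 0.21 is read as evidence of a dependable measure.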

Self-assessment is the process of quantifying aspects of one's own self, and everyone has a desire to discover themselves through it. Sedikides (1993) suggested that self-assessment helps people evaluate themselves and enhance their own self-confidence. Often, however, people are more concerned with protecting their sense of self-worth in the current assessment than with finding their own weaknesses in order to overcome them. As a result, some people's self-esteem is damaged, which leads to less confidence in communication as well as in studying and working.

In the classroom, there are four important criteria teachers need to keep in mind when asking students to self-assess in order to build a classroom of positive, self-correcting learners. Firstly, teachers need to create a comfortable atmosphere when asking students to self-assess their abilities by showing them exactly what each assessment item means: the teacher should carefully analyse the purpose of each component in the questionnaire and then explain it to the students. Secondly, the students must clearly understand the task or objective of the assessment. For example, open-ended questions often confuse students because they do not always have an appropriate answer in mind, so teachers can provide sample sentences for students to follow in creating their own answers. Thirdly, teachers should emphasize that students must evaluate themselves objectively and honestly. Many students are overconfident or self-deprecating, leading to assessments vastly different from their true selves, so teachers need to reinforce students' ability to see both their good and bad sides. Lastly, teachers should not give students a questionnaire without emphasizing that washback is important for subsequent goals. Later on, students may be asked to explore themselves in more detail by journaling or responding to school texts.
