
Validity, Reliability, Practicality, Authenticity, and Washback in Assessment
Annisa Ulhasanah
17178046

This article focuses on what makes an assessment valid, reliable, practical, and authentic, and on how it can create positive washback, as well as on the statistical tools used to establish these qualities. It explains the uses of language tests and five major principles of language assessment: validity, reliability, practicality, authenticity, and washback. It then discusses each principle.

Keywords: Validity, Reliability, Practicality, Authenticity, Washback

Introduction

Real-life language has entered language testing through a new approach to teaching methodology, namely communicative language teaching (CLT), since the 1970s (Lewkowicz, 2000; Richards, 2006). This approach concentrates on developing communicative competence in using language. Communicative competence includes four components: grammatical competence, discourse competence, sociolinguistic competence, and strategic competence, as explained by Shumin (2002). However, these competences cannot be assessed merely by using traditional assessment, which usually comes in the form of open-ended, short-answer, true-false, multiple-choice, and matching tests (Nasab, 2015; Caliskan & Kasikci, 2010). Therefore, an alternative assessment is needed to assess these communicative competences. An assessment that can come in multiple forms and can reflect students' learning, achievement, motivation, and attitudes in instructionally relevant classroom activities is called an authentic assessment (O'Malley & Valdez Pierce, 1996). Authentic assessment provides students with an illustration of what they might encounter in the real world, a meaningful way of assessing, many forms of assessment, and a chance to adapt the assessment to the students' needs and background (Reynisdottir, 2016; Case, 2013; Wang, 2008, in Reynisdottir, 2016).

Generally speaking, a valid test can be understood as a test that tests what is being taught, and how it is being taught, along the learning and teaching process. If a test measures what needs to be measured from the students, then it is a valid test (Brown & Abeywickrama, 2010). Validity can be investigated by analyzing content, construct, and face validity. Content validity concerns whether the test is a representative sample of the content the test was designed to measure (Bachman, 1990; Brown, 2010). Construct validity, as Bachman (1990) mentioned, exists in a test if it is demonstrated that the test measures just the ability it is supposed to measure. Meanwhile, face validity refers to the degree to which a test looks right and appears to measure the knowledge or abilities it claims to measure, based on the subjective judgment of the examinees who take it, the administrative personnel who decide on its use, and other psychometrically unsophisticated observers (Mousavi, 2009, as cited in Brown & Abeywickrama, 2010).
The reliability of a test mostly concerns the consistency of the test, and a test cannot be valid unless it is also reliable. Best and Kahn (2003) explain that a test is reliable to the extent that it measures accurately and consistently from one time to another. For instance, when people perform differently at different times, the smaller the difference between the two sets of scores, the more reliable the test is (Hughes, 2002). In other words, a reliable test minimizes error, so that a test taker does not receive a markedly different score after being retested. Rater judgement is one of the methods for determining test reliability (Kumar, 1996, in Sak, 2008).

The consistency of rater judgments should be determined by relying on inter-rater and intra-rater reliability (Brown, 2005:185). The degree to which scores from two or more markers agree is considered inter-rater reliability (Nunan, 1992:14-15; Weir & Roberts, 1994:172). In other words, inter-rater reliability refers to the consistency between the marks given by different teachers. Doubts about inter-scorer reliability arise when answers of the same quality are given different scores by different teachers. On the other hand, intra-rater reliability is measured by having an assessor measure balance and then repeat the measurement of the same person after a specified time lapse (Downs et al., 2013). That is to say, intra-scorer reliability refers to marks given by the same teacher on different occasions. An example of intra-scorer reliability at stake is when a teacher gets tired of marking and starts to give lower marks as time goes on. Consistent grading is essential to ensure the reliability of test scores. Scorer reliability can be improved by a marking scheme or a scoring rubric that is prepared in advance and used to assist the markers.
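As a concrete illustration of the statistics involved, here is a minimal sketch in Python; it is not from any of the studies cited, and the scores and names are hypothetical. It estimates inter-rater reliability as the Pearson correlation between two raters' marks for the same ten students (Python 3.10+ also provides this computation as statistics.correlation):

```python
# Minimal sketch: inter-rater reliability as a Pearson correlation
# between two raters who scored the same performances.
# Illustrative data only, not taken from the article.
from statistics import mean

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of scores."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

# Hypothetical rubric scores (0-100) from two raters for ten students.
rater_a = [72, 85, 60, 90, 55, 78, 66, 88, 70, 81]
rater_b = [70, 88, 58, 92, 60, 75, 68, 85, 74, 80]

print(f"inter-rater r = {pearson_r(rater_a, rater_b):.2f}")  # near 1.0 = strong agreement
```

Applied to one rater's marks for the same performances on two different occasions, the same computation gives a rough estimate of intra-rater reliability.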
DISCUSSION

The content validity, represented by the learning objectives and basic competence indicators in both products, is categorized as valid. A test can be defined as valid in its content if it represents a sample of the language skills, structures, or subject matter it is meant to cover (Bachman, 1990; Brown, 2010). This finding is in line with the findings of Suarimbawa et al. and Ing et al., both from 2017. Suarimbawa et al. found that the authentic assessment used in one of the junior high schools of Singaraja had been based on Curriculum 13: in terms of content, the assessment was designed under the learning objectives of Curriculum 13, and it assessed the attitudes, knowledge, and skills of the students. Content validity was also fulfilled by the teacher-made assessment of Chinese elementary schools in Johor (Ing, 2017: 193).

Nevertheless, the interviews revealed that some units had a problem with learning content that suited neither the learning objectives nor the students' ability. Context and tasks inappropriate to the learning objectives of an oral test were also found by Salaberry in 2000, namely a lack of features of conversational interaction, a limited range of interactional contexts, and a lack of specification of the content areas to be addressed (Salaberry, 2000: 31). It is suggested that the content of the tasks and topics should be more authentic.

Moreover, the teachers also suggested that the instructions should be clearer and more specific so that they can be understood by both teachers and students. The tasks and topics in the product should also be related to students' lives to encourage and motivate them to perform in an oral test in front of the class. Validity, particularly content validity, in authentic assessment should ensure that the tasks given resemble what students encounter in their real lives; as Gosh et al. (2017) explain, content validity should ascertain whether the authentic tasks resemble real-world scenarios, encompassing wide but required content and assessing only the intended outcomes.
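One way this representativeness requirement could be made operational is a small table-of-specification audit, in the spirit of the approach Ing et al. describe: map each assessment task to the curriculum objectives it samples and flag any objective left uncovered. The sketch below is a hypothetical illustration; the objective and task names are invented, not taken from the products under discussion.

```python
# Sketch: auditing the content coverage of an assessment against
# curriculum objectives (all objective and task names are hypothetical).
objectives = {
    "express ideas orally",
    "use appropriate grammar",
    "respond in conversation",
    "describe daily activities",
}

# Which objectives does each task sample?
tasks = {
    "role play at a market": {"express ideas orally", "respond in conversation"},
    "picture description":   {"express ideas orally", "describe daily activities"},
}

covered = set().union(*tasks.values())
missing = objectives - covered
print("uncovered objectives:", missing or "none")
# An empty result supports, but does not by itself prove, content validity.
```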
The construct validity was also categorized as valid by all teachers. Construct validity is fulfilled if the speaking test measures the components of fluency and accuracy. Similarly, Nakamura in 1997 investigated the construct validity of an English speaking test, examining whether nine traits, including pronunciation, grammar, discourse, fluency, content, vocabulary, and comprehensibility, were relevant and separable parts of speaking ability. It was found that the nine traits functioned as construct factors in the oral test.

However, the interviews revealed that some criteria in the scoring rubric needed to be adjusted, added, or omitted, so the construct was not rated in the very valid category. Rating criteria can greatly affect the construct validity of a test (Brown & Taylor, 2006; Esquinca, Yaden, & Rueda, 2005; Hubbard, Gilbert, & Pidcock, 2006, as cited in Pishghadam & Shams, 2012:72). It was suggested that grammar not be assessed in either grade, since the junior high school students were still at a very low level of English ability, having had no English material at their previous level of education. Therefore, some teachers suggested that grammar should not be assessed in the oral performance test. Nevertheless, this argument could not be taken as a note for improving the product, since doing so would threaten the validity, particularly the construct validity. Construct validity ensures that the assessment assesses the learning objectives in the curriculum (McAlpine, 2002: 11). The learning objectives of Curriculum 13 state that students need to be able to express their ideas orally in appropriate grammar. In addition, assessing speaking must include assessment of students' fluency and accuracy, and one of the indicators of accuracy is the use of grammar (Brown, 2004:157).

Criteria for assessing ideas, improvisation, and facial expression (mimic) were suggested as additions to enhance the construct validity of the product. Having a brilliant idea, improvising, and using the right facial expression relate to students' cognitive skills, and, as mentioned earlier, the cognitive domain is also a learning target of Curriculum 13. This is in line with the research done by Pishghadam & Shams in 2012, who stated that, to avoid misconceptions about the speaking construct, the nonlinguistic factors involved in the process should not be overlooked; among the major nonlinguistic factors are the cognitive ones, of which intelligence is of great importance.

Finally, the face validity can also be considered valid. The types of speaking activities and tasks and the form of the authentic speaking assessment are the two indicators of this validity type. According to Brown & Abeywickrama (2010:35), examinees' perception of fairness is increased by well-constructed and familiar tasks, an allotted time limit, clear instructions, subject material that has been covered in class, and an appropriate difficulty level. It was concluded that both products were valid, since they fulfilled the face validity criteria of providing well-constructed and familiar tasks and clear instructions. An evaluation of the IELTS speaking test likewise fulfilled the criteria of face validity, as unfamiliarity of format and lack of authenticity in the tasks were not found in the test items (Karim & Haq, 2014:156).

The inter-rater reliability coefficients found were mostly high or moderate. This is similar to the findings of Halleck (1996), who investigated the inter-rater reliability of trained raters on Oral Proficiency Interviews (OPI); as a result of the correlations computed, statistically significant results were obtained, and most scores were satisfactory in each unit of both products. It is important to have high reliability between raters, so we can be sure that students are being scored accurately and fairly (WIDA, 2017:2).

On the other hand, a study by Jafarpur (1988) reported lower correlation coefficients. An FSI-type oral interview was used in his study, which was conducted at Shiraz University. The performances of 58 students were scored by 3 raters, and the inter-rater reliability was reported as between 0.58 and 0.65. The researcher indicated that the low correlations may have emerged because the raters were not language teachers who had received training. In light of this, it can be said that the inter-rater reliability of the exam is not as satisfactory as expected, since the correlation of the scores of the third pair is fairly low.
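To show how such pairwise coefficients are typically obtained, the following sketch correlates every pair of three raters' scores, the situation described above in which one pair turns out noticeably weaker. The data are hypothetical, not Jafarpur's, and statistics.correlation requires Python 3.10+.

```python
# Sketch: pairwise inter-rater correlations among three raters.
# Illustrative data only; a weakly agreeing pair stands out immediately.
from itertools import combinations
from statistics import correlation

scores = {
    "rater_1": [70, 82, 64, 91, 58, 77, 85, 63],
    "rater_2": [68, 85, 61, 93, 62, 74, 88, 60],
    "rater_3": [75, 70, 72, 80, 69, 71, 74, 73],  # flatter, less consistent marks
}

for (name_a, a), (name_b, b) in combinations(scores.items(), 2):
    print(f"{name_a} vs {name_b}: r = {correlation(a, b):.2f}")
```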
There are many factors that can lower inter-rater reliability. Research done by Attali in 2015 found that scores on subjective assessments depended dramatically on the rater's experience, training, and bias. Other researchers (Schaefer, 2008; Kim, 2015; Barkaoui, 2010) found that the use of novice versus experienced raters may affect inter-rater reliability, as the scores of experienced raters were more reliable. Kim also found that giving training to novice raters can make their scores as consistent and objective as those of experienced ones. These findings suggest that training in scoring and immediate feedback are as valuable as years of teaching and grading experience. In addition, the published evidence on inter-rater reliability suggests that high correlation coefficients are generally achieved when multiple trained raters are used to score performances (Fulcher, 2003:142). However, Armes (2016:12) suggested using at least two raters in any speaking test for practical reasons.

The raters' interpretation of the rubric was also found to affect the reliability of a test. Research done by Chong and Romkey in 2016 indicated that only fair inter-rater agreement resulted from the absence of formal training in using the rubric.

The raters themselves are not the only source affecting the reliability of a test. Gebril in 2009 found that the type and number of tasks students are asked to do can have a measurable effect on performance and, in turn, affect the reliability of the score. Test procedure is also a source of unreliability, as Cassady found in 2005 that procedures for test administration affect the outcome of a test.

Conclusion

The products are valid in terms of content, construct, and face validity and can be used by teachers to help them assess speaking ability under the principles of Curriculum 13. Although the teachers raised some objections, namely that some of the material is neither interesting nor related to the curriculum, that some instructions are ambiguous, and that some scoring-rubric criteria are unspecified or inappropriate, overall both products fulfilled the content, construct, and face validity criteria, as they included most of the learning objectives demanded by Curriculum 13. In sum, both products can be implemented in junior high schools with some improvements. They can ease teachers' burden in finding appropriate authentic assessment for speaking, as they offer a variety of speaking assessment types and represent the demands of Curriculum 13.

References

Abdollah. 2016. An Analysis of Authentic Assessment in 2013 Curriculum Used by the English Teacher at SMA Negeri 4 Malang. Malang: University of Muhammadiyah Malang.

Attali, Y. 2016. A Comparison of Newly-Trained and Experienced Raters on a Standardized Writing Assessment. doi:10.1177/0265532215582283.

Armes, J.W. 2016. Quantifying the Qualitative: Increasing the Reliability of Subjective Language Assessments. The University of San Francisco.

Bachman, Lyle F. 1990. Fundamental Considerations in Language Testing. Oxford: Oxford University Press.

Bachman, Lyle F. & Palmer, A.S. 2009. Language Testing in Practice: Designing and Developing Useful Language Tests. New York: Oxford University Press.

Bernhardt, E.B. et al. 2004. The Practicality and Efficiency of Web-Based Placement Testing for College-Level Language Programs. Foreign Language Annals, 37(3).
Braskamp, L.A. & Engberg, M.E. 2014. Guidelines for Judging the Effectiveness of Assessing Student Learning. Chicago: Loyola University.

Barkaoui, K. 2010. Variability in ESL Essay Rating Processes: The Role of the Rating Scales and Rater Experience. Language Assessment Quarterly, 7(1), 54-74. doi:10.1080/15434300903464418.

Brown, H. Douglas & Abeywickrama, Priyanvada. 2010. Language Assessment: Principles and Classroom Practices. New York: Pearson Education.

Buzzetto-More, Nicole. 2010. Assessing the Efficacy and Effectiveness of an E-Portfolio Used for Summative Assessment. Interdisciplinary Journal of E-Learning and Learning Objects, Volume 6.

Brown, H. Douglas. 1994. Teaching by Principles: An Interactive Approach to Language Pedagogy. New Jersey: Prentice Hall Regents.

Case, R. 2013. Four Principles of Authentic Assessment. Blogs.ubc.ca/file. Retrieved February 13, 2018.

Cambridge English. 2016. Principles of Good Practice. Cambridge University.

Chong, Alan & Romkey, Lisa. 2016. Testing Inter-Rater Reliability in Rubrics for Large-Scale Undergraduate Independent Projects. Proc. 2016 Canadian Engineering Education Association (CEEA16) Conf.

Gay, L.R. & Airasian, P. 2009. Educational Research: Competencies for Analysis and Applications. New Jersey: Pearson Education, Inc.

Gebril, A. 2009. Score Generalizability of Academic Writing Tasks: Does One Test Method Fit It All? Language Testing, 26(4), 507-531. doi:10.1177/0265532209340188.

Genesee, Fred & Upshur, John A. 2002. Classroom-Based Evaluation in Second Language Education. Cambridge: Cambridge University Press.

Gosh, Samrat et al. 2017. Improving the Validity and Reliability of Authentic Assessment in Seafarer Education and Training: A Conceptual and Practical Framework. WMU Journal of Maritime Affairs.

Gronlund, N.E. & Waugh, C.K. 2009. Assessment of Student Achievement. Upper Saddle River, New Jersey: Pearson Education, Inc.

Ing, L.M. et al. 2015. Validity of Teacher-Made Assessment: A Table of Specification Approach. Asian Social Science, 11(5). Canada: Canadian Center of Science and Education. ISSN 1911-2017.

Harris, David P. Testing English as a Second Language. New York: McGraw-Hill Book Company.

Hidayanti, N. 2016. The Authenticity of English Language Assessment for the Twelfth Graders of SMK Negeri 4 Surakarta. Premise Journal, 5(1).

Hughes, Arthur. 2003. Testing for Language Teachers. Cambridge: Cambridge University Press.

McAlpine, Mhairi. 2002. Principles of Assessment. University of Glasgow.

Johnson, Nat. 2007. A Consideration of Assessment Validity in Relation to Classroom Practice. Cambridge University.

Karim, S. & Haq, N. 2014. An Assessment of IELTS Speaking Test. International Journal of Evaluation and Research in Education (IJERE), 3(3), September 2014, pp. 152-157.

Kimberlin, C.L. & Winterstein, A.G. 2008. Validity and Reliability of Measurement Instruments Used in Research. Research Fundamentals, Vol. 65, Dec 1, 2008.

Kim, Hyun Jung. 2015. A Qualitative Analysis of Rater Behavior on an L2 Speaking Assessment. Language Assessment Quarterly, 12(3).

Lewkowicz, Jo A. 2000. Authenticity in Language Testing: Some Outstanding Questions. Language Testing, 17(1), 43-64.

Luoma, S. 2004. Assessing Speaking. Cambridge: Cambridge University Press.

Olfos, R. & Zulantay, H. 2007. Reliability and Validity of Authentic Assessment in a Web-Based Course. Educational Technology & Society, 10(4), 156-173.

Ojung, J. & Allida, D. 2017. A Survey of Authentic Assessment Used to Evaluate English Language Learning in Nandi Central Sub-County Secondary Schools, Kenya. Baraton Interdisciplinary Research Journal, 7 (special issue), pp. 1-11.

O'Malley, J. Michael & Valdez Pierce, Lorraine. 1996. Authentic Assessment for English Language Learners: Practical Approaches for Teachers. Addison-Wesley Publishing Company.

Richards, Jack C. 2006. Communicative Language Teaching Today. New York: Cambridge University Press.

Richards, J. & Renandya, W.A. 2002. Methodology in Language Teaching. New York: Cambridge University Press.

Riduwan. 2004. Metode dan Teknik Menyusun Tesis. Bandung: Alfabeta.

Riemer, M.J. 2007. Communication Skills for the 21st Century Engineer. Australia: Global Journal of Engineering Education.

Rudner, Lawrence M. 1994. Questions to Ask When Evaluating Tests. ERIC Digest.

Sak, Gonca. 2008. An Investigation of Validity and Reliability of the Speaking Exam at a Turkish University. Middle East Technical University.

Schaefer, Edward. 2008. Rater Bias Patterns in an EFL Writing Assessment. Language Testing, 25(4), 465-493.

Srikaew, D. et al. 2016. Development of an English Speaking Skill Assessment Model for Grade 6 Students by Using Portfolio. Procedia - Social and Behavioral Sciences, 191 (2015), 764-768.

Marhaeni, A.A.I.N. et al. 2014. Toward Authentic Language Assessment: A Case in Indonesian EFL Classrooms. The European Conference on Language Learning.

Nasab, Fatemeh Ghanavati. 2015. Alternative versus Traditional Assessment. Journal of Applied Linguistics and Language Research, 2(6), 165-178. ISSN: 2376-760X.

Nation, I.S.P. & Newton, J. 2009. Teaching ESL/EFL Listening and Speaking. New York & London: Routledge.

Natsir, Yuliana et al. 2018. The Rise and Fall of Curriculum 2013: Insights on the Attitude Assessment from Practicing Teachers. SHS Web of Conferences 42, 00010.

Nunan, David. 1988. Syllabus Design (Language Teaching: A Scheme for Teacher Education). UK: OUP Oxford.

Pishghadam, R. & Shams, M.A. 2013. A New Look into the Construct Validity of the IELTS Speaking Module. The Journal of Teaching Language Skills (JTLS), 5(1), Spring 2012, Ser. 70/4, pp. 71-90. ISSN: 2008-8191.

Quaid, E.D. 2018. Reviewing the IELTS Speaking Test in East Asia: Theoretical and Practice-Based Insights. Language Testing in Asia, 8:2. DOI 10.1186/s40468-018-0056-5.

Reynisdottir, B.B. 2016. The Efficacy of Authentic Assessment. University of Iceland.

Suarimbawa, K.A. et al. 2017. An Analysis of Authentic Assessment Implementation Based on Curriculum 2013 in SMP Negeri 4 Singaraja. Journal of Education Research and Evaluation, 1(1), pp. 38-45.

Sudjana. 2008. Metoda Statistika. Bandung: Tarsito.

Swanson. 2002. Constructing Written Test Questions for the Basic and Clinical Sciences. Third Edition (Revised). National Board of Medical Examiners. Available at http://www.abme.org/PDF/Item Writing 2003/2003IWGwhole.pdf.

Trisanti, N. 2014. English Teacher's Perspective on Authentic Assessment Implementation of Curriculum 2013. The 61st TEFLIN International Conference, UNS Solo.

Wang, Y.A. 2008. Authenticity in Language Testing. Hawaii Pacific University, TESOL Working Paper Series, 6(2).

WIDA Resources for Educators. 2017. Maintaining Rater Reliability in Scoring.

Nakamura, Yuji. 2008. Theoretical and Practical Issues of Assessing Speaking in University Entrance Examinations. Keio University.
