
Measurement and Evaluation Issues in Science Education

Abstract

This paper examines the assertion that the key to effective testing is the successful integration of
all the major components of teaching a course or subject, namely objectives, instruction,
assessment and evaluation. With reference to Geography in particular, and to the teaching of
science in general, the paper also attempts to summarise some of the instructional processes in
science education and recommend appropriate measurement and evaluation procedures.

1 Author: Ezra Chipatiso (MScEd), Bindura University of Science Education (BUSE), Department of Science Education. Correspondence: echipatiso@gmail.com

1. Introduction to Instructional Process
In the processes of instruction and education in any society, the level of success at any given time is determined by assessment or testing, which takes various forms. A test consists of a task or tasks used to enable systematic observation and recording of behaviours selected to represent important educational goals (Mpofu, 1991). Testing must therefore be a formalised, systematic way of gathering data or information about students' behaviours. A test may be administered orally, on paper, on computer, or in a confined area that requires the test-taker to physically perform a set of skills. It is for this reason that tests differ in nature, style, rigour and requirements, and their administration may be formal or informal. A test may be developed and administered by an instructor, a governing body or a test provider, and in some cases the test developer is not in charge of administering the test. According to Marry (1998), for tests to be reliable instruments of measurement, their construction must follow these steps: identifying the test specifications, selecting the content of the test, writing test items, reviewing test items, making the final selection of items, and finally administering the test to the students. More often than not, the format and difficulty of the test depend on the thrust of the instructor, the subject matter, the class size, the policy of the educational institution, and the requirements of governing bodies. Tests, though having their fair share of demerits, remain a vital tool for determining performance and providing feedback and outcomes in the classroom in particular and in the education system in general, especially with proper checks and balances in place.

Every learning system is based on the relationship between the teacher and the student. In this regard it is necessary to evaluate the knowledge students are acquiring, and to facilitate this evaluation we usually test students on the course they have attended. Zimmaro (2004) observes that the key to effective testing is the successful integration of all the major components of teaching a course or subject, namely objectives, instruction, assessment and evaluation. Objectives are specific statements of the goals of the instruction which express what the students should be able to do or know as a result of taking the course (Zimmaro, 2004). The objectives should indicate the cognitive level of performance expected, for example basic knowledge, deeper comprehension, or application. Course objectives should contain clear statements about what must be achieved by the end of the course or term. Zimmaro argues that course objectives should not be so specific that the creativity of the instructor and student is stifled, nor so vague that students are left without direction; in support of this view, objectives should be clear and specific so that the instructor has an effective means of evaluating what the students have learned. An example of a well constructed objective might be: "Geography students will be able to demonstrate their knowledge of Central Place Theory by outlining its five assumptions and applying them to the Zimbabwean setting." Peat (2006) pointed out that objectives should be written in terms of what the students will be able to do, not what the instructor will teach. Since learning objectives should focus on what the students should be able to do or know at the end of the semester, this underlines the importance of integrating objectives and instruction.

Instruction consists of all the usual elements of the curriculum designed to teach a course, including lesson plans, study guides, and reading and homework assignments (Zimmaro, 2004). According to Zimmaro, the amount of weight given to the different subject matter areas on the test should match the relative importance of each of the course objectives as well as the emphasis given to each subject area during instruction. The instruction should correspond directly to the course objectives, and the two are inseparable. In his Taxonomy of Educational Objectives, Bloom specified different abilities and behaviours that are related to thinking processes. This taxonomy can also be helpful in test construction through a test specification grid comprising knowledge, comprehension, application, analysis, synthesis and evaluation questions. For example: the student will recall the four major sectors of the economy without error (knowledge); given a description of a country's environmental management system, the student will defend it by basing arguments on environmental principles and policies (evaluation).
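
As an illustration of how such a test specification grid might allocate items in proportion to the emphasis given to each content area and cognitive level, the minimal sketch below uses hypothetical content areas, weights and item counts; none of these values is prescribed here.

```python
# Illustrative sketch of a test specification grid (blueprint).
# Content areas, weights and cognitive levels are hypothetical examples;
# in practice they should mirror the course objectives and teaching emphasis.

CONTENT_WEIGHTS = {          # relative emphasis given during instruction
    "Central Place Theory": 0.40,
    "Sectors of the economy": 0.25,
    "Environmental management": 0.35,
}

LEVEL_WEIGHTS = {            # Bloom's taxonomy levels used on this test
    "knowledge": 0.30,
    "comprehension": 0.25,
    "application": 0.25,
    "analysis/synthesis/evaluation": 0.20,
}

def specification_grid(total_items: int) -> dict:
    """Allocate items to each content area x cognitive level cell
    in proportion to the weights, rounding to whole items."""
    grid = {}
    for area, aw in CONTENT_WEIGHTS.items():
        for level, lw in LEVEL_WEIGHTS.items():
            grid[(area, level)] = round(total_items * aw * lw)
    return grid

if __name__ == "__main__":
    for (area, level), n in specification_grid(40).items():
        print(f"{area:28s} | {level:30s} | {n} items")
```

Because the cell counts are rounded, they may not sum exactly to the requested total; in practice a test developer would adjust a few cells by hand.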

Assessment is also a valuable element in testing. According to Jackson (2009), to assess is to decide how valuable or useful something is. Assessment is generally used to refer to all the activities teachers use to help students learn and to gauge student progress. Thus, assessment gathers, describes, or quantifies information about performance. Assessment is a process by which information is obtained relative to some known objective or goal (Jackson, 2009). Whether implicit or explicit, assessment is most usefully connected to some goal or objective for which the assessment is designed. A test or assessment yields information relative to an objective or goal; in that sense, we test or assess to determine whether or not an objective or goal has been attained. In short, assessment stipulates the conditions by which the behaviour specified in an objective may be ascertained. We assess whether a skill exists at some acceptable level or not. In other words, all tests are assessments, but not all assessments are tests. We test at the end of a lesson or unit; we assess progress during and at the end of a school year through testing, and we can also assess verbal and quantitative skills. Marzano (2000) argued that assessment of understanding is much more difficult and complex: skills can be practised, whereas understandings cannot. Although we assess a student's knowledge in a variety of ways, there is always a leap, an inference we make from what a student does to what it signifies about what he or she knows. Assessment can be formative or summative. According to Marzano (2000), summative and formative assessments are often referred to in a learning context as 'assessment of learning' and 'assessment for learning', respectively. Summative assessment of learning generally occurs at the conclusion of a class, course, semester, or academic year. In an educational setting, formative assessment might be a teacher, or the learner, providing feedback on a student's work, and would not necessarily be used for grading purposes. Formative assessment is used by teachers to consider approaches to teaching and next steps for individual learners and the class (Marzano, 2000). For example, if a teacher notices that some students did not learn a concept, she or he can try a different instructional strategy or re-teach the content. Thus, formative assessments allow students to monitor their progress and are often used to make decisions about instructional practices. In summative assessments, learning outcomes are reported to students, parents, and administrators, and this form of assessment is evaluative in nature. A common form of formative assessment is diagnostic assessment. Diagnostic assessment measures a student's current knowledge and skills for the purpose of identifying a suitable programme of learning (Thaler et al., 2009). Self-assessment is a form of diagnostic assessment in which students assess themselves.

2. Performance-Based Assessment

Performance-based assessment is similar to summative assessment in that it focuses on achievement. According to Marry (1998), performance-based assessment is a procedure of reviewing a task so as to make judgements about its quality. A well-defined task is identified and students are asked to create, produce, or do something, often in settings that involve real-world application of knowledge and skills. It integrates learning and doing within and across content areas and places emphasis on teaching students to acquire skills, knowledge and experiences that directly relate to societal needs. The performance may result in a product, such as a research portfolio, or it may consist of a performance of a skill, for example skill in the use of the ArcView software application to study geographic phenomena.

3. Forms of Assessment

Assessment (either summative or formative) is often categorised as either objective or subjective. Objective assessment is a form of questioning which has a single correct answer, while subjective assessment is a form of questioning which may have more than one way of expressing the correct answer (Fisher and Frey, 2007). There are various types of objective and subjective questions. Objective question types include true/false, multiple-choice, multiple-response and matching questions; subjective questions include extended-response questions and essays. Objective assessment is also well suited to the increasingly popular computerised online assessment format.

Test results can be compared against an established criterion, against the performance of other students, or against previous performance. Criterion-referenced assessment typically uses criterion-referenced tests to measure students against defined and objective criteria (Fisher and Frey, 2007). Criterion-referenced assessment is often, but not always, used to establish a person's competence. An example of criterion-referenced assessment is a test on the graphical representation of geographic data, in which learners are measured against a range of explicit criteria such as the key, the compass showing direction, the scale, and general accuracy. Norm-referenced assessment typically uses norm-referenced tests, and is effectively a way of comparing students rather than measuring them against defined criteria. The Intelligence Quotient (IQ) test is the best known example of norm-referenced assessment. Many entrance tests to schools or universities are norm-referenced, permitting only a fixed proportion of students to pass; passing in this context means being accepted into the school or university rather than reaching an explicit level of ability. This means that standards may vary from year to year, depending on the quality of the cohort.
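
The contrast between the two interpretations can be illustrated with the short sketch below, which reads the same set of raw scores in a criterion-referenced and a norm-referenced way; the learner names, scores, pass mark and admission quota are invented for illustration only.

```python
# Illustrative sketch: interpreting the same scores criterion-referenced
# versus norm-referenced. All numbers are hypothetical.
from statistics import mean, stdev

scores = {"Chipo": 72, "Tendai": 58, "Rudo": 65, "Farai": 81, "Nyasha": 49}

# Criterion-referenced: each learner is judged against a fixed criterion
# (e.g. "at least 60% on the graphical-representation test"), regardless
# of how the rest of the class performed.
PASS_MARK = 60
criterion_result = {name: ("competent" if s >= PASS_MARK else "not yet competent")
                    for name, s in scores.items()}

# Norm-referenced: each learner is judged by position relative to the group,
# e.g. a z-score, or admission of a fixed proportion of the cohort.
mu, sigma = mean(scores.values()), stdev(scores.values())
z_scores = {name: round((s - mu) / sigma, 2) for name, s in scores.items()}

QUOTA = 2  # only the top two are "accepted", whatever their absolute marks
accepted = sorted(scores, key=scores.get, reverse=True)[:QUOTA]

print("Criterion-referenced:", criterion_result)
print("Norm-referenced z-scores:", z_scores)
print("Norm-referenced admission (top", QUOTA, "):", accepted)
```

Note that the norm-referenced quota admits the top performers regardless of their absolute marks, which is precisely why standards may vary from year to year with the quality of the cohort.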

There is growing concern that the focus on summative assessment has distracted most teachers from the importance of high-quality formative assessment. All these forms of assessment are useful in the learning process, should be applied where appropriate, and should be integrated with objectives and instruction; however, high-quality assessments are considered to be those with a high level of reliability and validity. A valid assessment is one which measures what it is intended to measure (Fisher and Frey, 2007). For example, it would not be valid to assess environmental management skills through a written test alone. A more valid way of assessing such skills would be a combination of tests that help determine what a student knows, such as a written test of environmental management knowledge, and what a student is able to do, such as a performance assessment of actual environmental management skills. A good assessment has both validity and reliability. In practice, an assessment is rarely totally valid or totally reliable: a rain gauge which is marked incorrectly will always give the same wrong measurements, so it is very reliable but not very valid. Validity takes two forms, namely "subject-matter" validity and "predictive" validity (Fisher and Frey, 2007). The former, used widely in education, predicts the score a student would get on a similar test with different questions, while the latter predicts performance. Thus, a subject-matter valid test assesses knowledge of environmental laws, while a predictively valid test would assess whether the potential environmentalist or student could follow those laws. Since one component of assessment is the amount of weight given to the different subject matter areas on the test, this weighting should match the relative importance of the learning objectives.
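
This paper does not prescribe a particular reliability statistic, but the idea of consistency can be illustrated with Cronbach's alpha, one common internal-consistency estimate; the item-score matrix in the sketch below is hypothetical.

```python
# Illustrative sketch: Cronbach's alpha as one internal-consistency
# (reliability) estimate. The item-score matrix is hypothetical.
from statistics import pvariance

# rows = students, columns = items (e.g. marks out of 5 per question)
item_scores = [
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 4, 4, 5],
    [3, 3, 2, 2],
    [4, 4, 5, 4],
]

def cronbach_alpha(matrix):
    """alpha = k/(k-1) * (1 - sum of item variances / variance of totals)."""
    k = len(matrix[0])                           # number of items
    item_vars = [pvariance(col) for col in zip(*matrix)]
    total_var = pvariance([sum(row) for row in matrix])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

print(f"Cronbach's alpha = {cronbach_alpha(item_scores):.2f}")
```

A high alpha only indicates that the items behave consistently; like the mis-marked rain gauge above, a highly reliable test can still lack validity.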

4. Paradigm Shifts in Assessment Practices

It has been widely noted that with the emergence of social media, web technologies and new mindsets, learning is increasingly collaborative and knowledge is increasingly distributed across many members of a learning community. Traditional assessment practices, however, focus in large part on the individual and fail to account for knowledge-building and learning in context. As researchers in the field of assessment consider the cultural shifts that arise from the emergence of a more participatory culture, they will need to find new methods of applying assessments to learners (Stevens and Levi, 2004).

5. Assessment and Evaluation

Assessment is a vital component of any evaluation, especially in an educational context. Evaluation is the process of examining learners' performance, comparing and judging their ability, and it is used to determine whether the learner has met the objectives and how well they have done so (Zimmaro, 2004). Evaluation also judges elements of educational provision such as programmes, curricula, organisations, and institutions. In the classroom, decisions can be made on the basis of test results. Evaluation provides a systematic process of determining the extent to which pupils have achieved instructional objectives (Bone, 1999). Evaluation can be seen in school, district and national examinations as an indicator of whether the education system is succeeding or not. Test or examination evaluations should also involve students' opinions: a teacher may ask students to write their reactions to tests and examinations. In this way students feel that their input matters, and faculty receive feedback that helps them make examinations more effective as learning and assessment devices.

6. Context, Input, Process and Product (CIPP) Model

The context, input, process and product (CIPP) model has aspects of evaluation that assist in decision making (Stufflebeam et al., 1971). The first aspect involves collecting and analysing needs in order to determine goals, priorities and objectives. The second aspect answers the question, 'How should we do it?', and involves identifying the steps and resources needed to meet the new goals and objectives; it might include identifying successful external programmes and materials as well as gathering information. The third aspect provides decision-makers with information about how well the programme is being implemented or the test is being administered. By continuously monitoring the programme, decision-makers learn how well it is following the plans and guidelines, what conflicts are arising, the level of staff support and morale, and the strengths and weaknesses of materials. The last aspect measures the actual outcomes and compares them to the anticipated outcomes; on this basis, decision-makers are better able to decide whether the programme should be continued, modified, or dropped altogether. This is the essence of product or test evaluation.

The four aspects of evaluation in the CIPP model support different types of decisions and questions. The CIPP model is useful in instructional practice in that it takes a holistic approach to evaluation, aiming to paint a broad picture of the integration of objectives, instruction and assessment of the processes at work, together with the overall evaluation. It has the potential to act in a formative as well as a summative way, helping to shape improvements in the process of teaching as well as providing a summative or final evaluation. The formative aspect should also, in theory, be able to provide a well-established archive of data for a final or summative evaluation. The main purposes of using tests and other evaluation instruments during the instructional process are to guide and direct pupil learning and to monitor progress towards course objectives. For evaluation to be sound, the objectives and the purpose of the test should be clearly defined. In the end, a comprehensive evaluation of the pupil's achievement at the end of the course, or at some intermediate point, shows how far the pupil has progressed towards the final examination. Newell et al. (2002) argue that evaluation of instruction should be informative and educative. Incorporating the CIPP model in schools promotes educational evaluations that are proper, useful, feasible, and accurate, while the student accuracy standards help ensure that student evaluations provide sound, accurate, and credible information about student learning and performance.

7. Rubrics for Evaluation


Rubrics can also be employed for evaluation; they are essential tools for evaluating and providing guidance on students' writing. Andrade (2005) claimed that rubrics significantly enhance the learning process by providing both students and instructors with a clear understanding of the goals of the writing assignment and the scoring criteria, and by facilitating timely and meaningful feedback to students. Peat (2006) suggested that, because of their explicitly defined criteria, rubrics lead to increased objectivity in the assessment of writing. Thus, different instructors might use a common rubric across courses and course sections to ensure consistent measurement of students' performance. The analytic rubric provides more detailed feedback for the student and increases consistency between graders (Zimmaro, 2004). Regardless of its format, when used as the basis of evaluating student performance, a rubric is a type of measurement instrument and, as such, it is important that the rubric exhibits reliability and validity. Although reliability and validity have been noted as issues of concern in rubric development, establishing the reliability and validity of grading rubrics requires effort and a commitment of time.
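
As an illustration of how an analytic rubric can support consistent marking between graders, the sketch below scores one hypothetical essay against invented criteria and weights and reports a simple per-criterion agreement between two markers; none of the criteria or marks is drawn from this paper.

```python
# Illustrative sketch of an analytic rubric and a simple inter-rater
# agreement check. Criteria, weights and scores are hypothetical.

RUBRIC = {                          # criterion -> (weight, top level)
    "Thesis and argument": (0.30, 4),
    "Use of evidence": (0.30, 4),
    "Organisation": (0.20, 4),
    "Language and referencing": (0.20, 4),
}

def weighted_score(marks: dict) -> float:
    """Combine per-criterion levels into a single percentage."""
    return 100 * sum(w * marks[c] / top for c, (w, top) in RUBRIC.items())

# Two markers grade the same essay against the same rubric.
marker_a = {"Thesis and argument": 3, "Use of evidence": 2,
            "Organisation": 3, "Language and referencing": 4}
marker_b = {"Thesis and argument": 3, "Use of evidence": 3,
            "Organisation": 3, "Language and referencing": 4}

# Proportion of criteria on which the two markers awarded the same level.
agreement = sum(marker_a[c] == marker_b[c] for c in RUBRIC) / len(RUBRIC)

print(f"Marker A: {weighted_score(marker_a):.1f}%")
print(f"Marker B: {weighted_score(marker_b):.1f}%")
print(f"Exact per-criterion agreement: {agreement:.0%}")
```

In practice, low agreement on a criterion would prompt markers to revisit its descriptors, which is part of establishing the rubric's reliability.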

8. Norm-Referenced and Criterion-Referenced Evaluation

In the context of individual-environment interaction, evaluation judges measured competences against a defined benchmark (Tinsley and Weiss, 2000). In this context, two approaches are distinguished: norm-referenced and criterion-referenced evaluation. In the case of norm-referenced evaluation, the measure of competence is interpreted and judged in terms of the individual's position relative to some known group. Criterion-referenced evaluation interprets and judges the measured student performance in terms of a clearly defined criterion (Gagnon and Collay, 2001). The two approaches have characteristics in common as well as differences. Both require a specification of the achievement domain to be mastered and a relevant and representative sample of tasks or test items. Norm-referenced evaluation typically covers a large domain of requirements with a few tasks used to measure mastery, emphasises discrimination among individuals, favours tasks of average difficulty, omits very easy and very hard tasks, and requires a clearly defined group of persons for interpretation. Criterion-referenced evaluation focuses on a large number of tasks used to measure mastery, emphasises what requirements the individual can or cannot perform, matches task difficulty to the requirements, and demands clearly defined criteria (Stevens and Levi, 2004). These different orientations have distinct impacts on the statistical measurement model: norm-referenced assessment, or the classical model, is tied to the normal curve, whereas criterion-referenced assessment is aligned with probabilistic models that focus on the level of mastery of a task and are not related to the performance of other persons (Stevens and Levi, 2004).
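
The probabilistic models referred to above are not specified here, but a simple one-parameter (Rasch-type) item response function illustrates the idea: the probability of mastering a task depends only on the learner's ability and the task's difficulty, not on how other learners performed. The ability and difficulty values in the sketch below are invented.

```python
# Illustrative sketch: a one-parameter (Rasch-type) item response function,
# P(mastery) = 1 / (1 + exp(-(ability - difficulty))), as an example of a
# probabilistic, criterion-oriented model. Values are hypothetical.
import math

def p_master(ability: float, difficulty: float) -> float:
    """Probability that a learner of given ability masters a task of given
    difficulty; independent of how any other learner performed."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

task_difficulty = 0.5
for ability in (-1.0, 0.0, 0.5, 1.5):
    print(f"ability {ability:+.1f} -> P(mastery) = "
          f"{p_master(ability, task_difficulty):.2f}")

# A norm-referenced (classical) interpretation of the same abilities would
# instead report each learner's position under the normal curve, e.g. a
# z-score or percentile relative to the cohort mean and standard deviation.
```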

9. Testing
From the perspective of a test instructor, there is great variability in the time and effort needed to prepare a test. When the test developer constructs a test, the amount of time and effort invested depends upon the significance of the test itself, the proficiency of the test-takers, the format of the test, the class size, and the experience of the test developer. Perceptions of classroom preparedness and confidence are best understood and articulated using the concept of teacher efficacy, which has been connected to positive teaching behaviours, to a willingness to remain in teaching, and to student achievement (Gagnon and Collay, 2001).

Conclusion

The integration of instruction, objectives, assessment and evaluation remains significant in testing, provided validity and reliability are maintained. The relationship between the assessor and the test-taker exerts an influence on assessment, especially in teacher-made assessments, where pupils know the teacher personally and professionally. Marking practices are not always reliable: markers may be too generous, or may mark by effort and ability rather than by performance. Therefore the test construction process must follow a number of stages in order to produce assessment tools that contribute to increased reliability, validity, objectivity, simplicity, comprehensiveness and scorability.

REFERENCES

Andrade, H. G. (2005). Teaching with rubrics: The good, the bad, and the ugly. College
Teaching, 53, 27–30.

Bone, A. (1999). Ensuring successful assessment. In R. Burridge & T. Varnava (Eds.), Assessment. Coventry, UK: The National Centre for Legal Education, University of Warwick.

Crocker, L., & Algina, J. (1986). Introduction to classical & modern test theory. Belmont, CA:
Wadsworth.

Fisher, D. & Frey, N. (2007). Checking for understanding. Alexandria, VA: Association for
Supervision and Curriculum Development.

Gagnon, G. W. & Collay, M. (2001). Designing for learning. Thousand Oaks, CA: Corwin Press.

Jackson, R. R. (2009). Never work harder than your students. Alexandria, VA: Association for
Supervision and Curriculum Development.

Marry, J. (1998). Using Assessment for School Improvement. Oxford: Heinemann Educational Publishers.
Marzano, R. J. (2000). Transforming Classroom Grading. Alexandria, VA: Association for
Supervision and Curriculum Development.

Mpofu, E. (1991). Testing for Teaching: Longman Teacher Education handbooks. Zimbabwe:
Longman.
Peter, W. A. (1996). Assessment in the Classroom. Boston: McGraw-Hill.
Moskal, B. M., & Leydens, J. A. (2000). Scoring rubric development: Validity and reliability. Practical Assessment, Research & Evaluation, 7(10).

Newell, J. A., et al. (2002). Rubric development and interrater reliability issues in assessing learning outcomes. Chemical Engineering Education, 36, 212–215.

Peat, B. (2006). Integrating writing and research skills: Development and testing of a rubric to
measure student outcomes. Journal of Public Affairs Education, 12, 295–311.

Stufflebeam, D. L., et al. (1971). Educational Evaluation and Decision Making. Itasca, IL: Peacock.

Stevens, D. D., & Levi, A. (2004). Introduction to rubrics: An assessment tool to save grading time, convey feedback, and promote student learning. Sterling, VA: Stylus.

Thaler, N., Kazemi, E., & Huscher, C. (2009). Developing a rubric to assess student learning
outcomes using a class assignment. Teaching of Psychology, 36, 113–116.

Tinsley, H. E. A., & Weiss, D. J. (2000). Interrater reliability and agreement. In H. E. A. Tinsley & S. D. Brown (Eds.), Handbook of applied multivariate statistics and mathematical modeling (pp. 95–124). San Diego, CA: Academic Press.

Zimmaro, D. M. (2004). Developing grading rubrics. Available at: http://www.utexas.edu/academic/mec/research/pdf/rubricshandout.pdf (accessed 20 January 2020).

