
Introduction

Education is a word that we come across every day and an aspect that has a profound impact on our social, economic and even psychological status. The history of education can be traced back to human origins and humanity's never-ending quest for knowledge.

The word education is derived from the Latin educare (with a short u), meaning "to raise", "to bring up", "to train", "to rear". In recent times, there has been a return to an alternative assertion that education derives from a different verb, educere (with a long u), meaning "to lead out" or "to lead forth".

Definition of education

Education has been defined from a multitude of perspectives, and its real essence remains the subject of critical deliberation even today.

Some definitions of education drawn from Wikipedia are:

• the gradual process of acquiring knowledge
• educational activity that primarily involves the presentation of material by the faculty to students who are learning about the subject matter; the material being studied is fundamentally well-known material, and activities known as teaching and training are included in this category
• the action or process of educating or of being educated; also, a stage of such a process
• the knowledge and development resulting from an educational process
• the field of study that deals mainly with methods of teaching and learning in schools
• the act or process of imparting or acquiring general knowledge, developing the powers of reasoning and judgment, and generally of preparing oneself or others intellectually for mature life
• the act or process of imparting or acquiring particular knowledge or skills, as for a profession
• the result produced by instruction, training, or study
• the act or process of educating or being educated
• the knowledge or skill obtained or developed by a learning process

History of Education

The history of education is both long and short. In 1994, Dieter Lenzen,
president of the Freie Universität Berlin and an authority in the field of education, said
"education began either millions of years ago or at the end of 1770". This quote by
Lenzen includes the idea that education as a science cannot be separated from the
educational traditions that existed before.

Education was the natural response of early civilizations to the struggle of surviving and thriving as a culture. Adults trained the young of their society in the
knowledge and skills they would need to master and eventually pass on. The
evolution of culture, and of human beings as a species, depended on this practice of
transmitting knowledge. In pre-literate societies this was achieved orally and through
imitation. Story-telling continued from one generation to the next. Oral language
developed into written symbols and letters. The depth and breadth of knowledge that
could be preserved and passed soon increased exponentially.

When cultures began to extend their knowledge beyond the basic skills of communicating, trading, gathering food, religious practices and so on, formal education and schooling eventually followed. Schooling in this sense was already in place in Egypt between 3000 and 500 BC.

History of education in India

India has a long history of organized education. The Gurukul system of education is one of the oldest on earth, and was dedicated to the highest ideals of all-
round human development: physical, mental and spiritual. Gurukuls were traditional
Hindu residential schools of learning; typically the teacher's house or a monastery.
Education was free, but students from well-to-do families paid Gurudakshina, a voluntary contribution, after the completion of their studies. At the Gurukuls, the teacher imparted knowledge of Religion, Scriptures, Philosophy, Literature, Warfare, Statecraft, Medicine, Astrology and History (the Sanskrit word "Itihaas" means
History). The first millennium and the few centuries preceding it saw the flourishing of higher education at Nalanda, Takshashila University, Ujjain, & Vikramshila
Universities. Art, Architecture, Painting, Logic, Grammar, Philosophy, Astronomy,
Literature, Buddhism, Hinduism, Arthashastra (Economics & Politics), Law, and
Medicine were among the subjects taught and each university specialized in a
particular field of study. Takshila specialized in the study of medicine, while Ujjain
laid emphasis on astronomy. Nalanda, being the biggest centre, handled all branches
of knowledge, and housed up to 10,000 students at its peak. British records show that
education was widespread in the 18th century, with a school for every temple, mosque
or village in most regions of the country. The subjects taught included Reading,
Writing, Arithmetic, Theology, Law, Astronomy, Metaphysics, Ethics, Medical
Science and Religion. The schools were attended by students representative of all
classes of society. The current system of education, with its western style and content, was introduced and funded by the British in the 19th century, following recommendations by Macaulay. Traditional structures were not recognized by the
British government and have been on the decline since. Gandhi is said to have
described the traditional educational system as a beautiful tree that was destroyed
during the British rule.

Assessment of Education
Assessment is the process of documenting, usually in measurable terms, knowledge,
skills, attitudes and beliefs.

History of assessment

The earliest recorded example of academic assessment arose in China in 206 BC, when the Han dynasty sought to introduce testing to assist with the selection
of civil servants. The objectivity of the assessment was questionable (it being oral and
still subject to the whims of the assessors) but it was the first example of introducing
merit to the selection process in place of favouritism. In 622 AD the Tang dynasty
administered formal written exams to candidates for the civil service; these exams
lasted for several days and had a pass rate of 2% - and successful candidates were
then subjected to an oral assessment by the Emperor. In Europe, tests were used
during the Middle Ages to aid the selection of priests and knights, and school children
were tested for their knowledge of the catechism. Oral exams were used to assess knowledge, and skills demonstrations were used to measure practical abilities. The University of Paris first introduced formal examinations during the 12th century. These
exams were theological oral disputations. Questions were known in advance,
requiring students to memorise and regurgitate answers. In the 1740s, Cambridge
University began using (oral) examinations to compare students, similar to the earlier
Chinese tests. During the 18th Century, Cambridge and Oxford began testing students'
mathematical abilities using written tests and thereafter the use of paper for
assessment spread to all subjects. The United States introduced formal written
examinations in the 1830s in an attempt to reduce the subjectivity of assessment.
Horace Mann introduced written tests in the Boston Public Schools to compare school
performance. However, the United States' main contribution to the history of testing
came during the First World War when the US Army introduced large scale IQ testing
to assign massive numbers of recruits to positions within the Army. The Army Alpha,
as it was known, consisted of multiple choice questions and was administered to over
two million recruits.

Types
Assessments can be classified in many different ways. The most important
distinctions are:
(1) formative and summative;
(2) objective and subjective;
(3) criterion-referenced and norm-referenced; and
(4) informal and formal.

Formative and summative

There are two main types of assessment:

• Summative Assessment - Summative assessment is generally carried out at the end of a course or project. In an educational setting, summative
assessments are typically used to assign students a course grade.
• Formative Assessment - Formative assessment is generally carried out
throughout a course or project. Formative assessment, also referred to as
educative assessment, is used to aid learning. In an educational setting,
formative assessment might involve a teacher (or peer) or the learner providing feedback on a student's work, and would not necessarily be used for grading
purposes.

Summative and formative assessment are referred to in a learning context as "assessment of learning" and "assessment for learning" respectively.

A common form of formative assessment is diagnostic assessment. Diagnostic assessment measures a student's current knowledge and skills for the
purpose of identifying a suitable program of learning. Self-assessment is a form of
diagnostic assessment which involves students assessing themselves. Forward-
looking assessment asks those being assessed to consider themselves in hypothetical
future situations.

Objective and subjective

Assessment (either summative or formative) can be objective or subjective. Objective assessment is a form of questioning which has a single correct answer.
Subjective assessment is a form of questioning which may have more than one correct
answer (or more than one way of expressing the correct answer). There are various
types of objective and subjective questions. Objective question types include
true/false, multiple choice, multiple-response and matching questions. Subjective
questions include extended-response questions and essays. Objective assessment is
becoming more popular due to the increased use of online assessment (e-assessment)
since this form of questioning is well-suited to computerisation.

Criterion-referenced and norm-referenced

Criterion-referenced assessment, typically using a criterion-referenced test, as the name implies, occurs when candidates are measured against defined (and objective)
criteria. Criterion-referenced assessment is often, but not always, used to establish a
person’s competence (whether s/he can do something). The best known example of
criterion-referenced assessment is the driving test, when learner drivers are measured
against a range of explicit criteria (such as “Not endangering other road users”).
Norm-referenced assessment (colloquially known as "grading on the curve"),
typically using a norm-referenced test, is not measured against defined criteria. This
type of assessment is relative to the student body undertaking the assessment. It is effectively a way of comparing students. The IQ test is the best known example of
norm-referenced assessment. Many entrance tests (to prestigious schools or
universities) are norm-referenced, permitting a fixed proportion of students to pass
(“passing” in this context means being accepted into the school or university rather
than an explicit level of ability). This means that standards may vary from year to
year, depending on the quality of the cohort; criterion-referenced assessment does not
vary from year to year (unless the criteria change).
Informal and formal

Assessment can be either formal or informal. Formal assessment usually involves a written document, such as a test, quiz, or paper, and is given a numerical score or grade based on student performance. Informal assessment, by contrast, does not contribute to a student's final grade. It usually occurs in a more
casual manner, including observation, inventories, participation, peer and self
evaluation, and discussion.

Standards of quality

The considerations of validity and reliability typically are viewed as essential elements for determining the quality of any assessment. However, professional and
practitioner associations frequently have placed these concerns within broader
contexts when developing standards and making overall judgments about the quality
of any assessment as a whole within a given context.

Testing standards

In the field of psychometrics, the Standards for Educational and Psychological Testing [1] place standards about validity and reliability, along with errors of
measurement and related considerations under the general topic of test construction,
evaluation and documentation. The second major topic covers standards related to
fairness in testing, including fairness in testing and test use, the rights and
responsibilities of test takers, testing individuals of diverse linguistic backgrounds,
and testing individuals with disabilities. The third and final major topic covers
standards related to testing applications, including the responsibilities of test users, psychological testing and assessment, educational testing and assessment, testing in
employment and credentialing, plus testing in program evaluation and public policy.

Evaluation standards

In the field of evaluation, and in particular educational evaluation, the Joint Committee on Standards for Educational Evaluation has published three sets of
standards for evaluations. The Personnel Evaluation Standards was published in
1988, The Program Evaluation Standards (2nd edition) was published in 1994, and
The Student Evaluation Standards [5] was published in 2003.

Each publication presents and elaborates a set of standards for use in a variety
of educational settings. The standards provide guidelines for designing, implementing,
assessing and improving the identified form of evaluation. Each of the standards has
been placed in one of four fundamental categories to promote educational evaluations
that are proper, useful, feasible, and accurate. In these sets of standards, validity and
reliability considerations are covered under the accuracy topic. For example, the
student accuracy standards help ensure that student evaluations will provide sound,
accurate, and credible information about student learning and performance.

Validity and reliability

A valid assessment is one which measures what it is intended to measure. For example, it would not be valid to assess driving skills through a written test alone. A
more valid way of assessing driving skills would be through a combination of tests
that help determine what a driver knows, such as through a written test of driving
knowledge, and what a driver is able to do, such as through a performance assessment
of actual driving. Teachers frequently complain that some examinations do not
properly assess the syllabus upon which the examination is based; they are,
effectively, questioning the validity of the exam.

Reliability relates to the consistency of an assessment. A reliable assessment is one which consistently achieves the same results with the same (or similar) cohort of
students. Various factors affect reliability – including ambiguous questions, too many
options within a question paper, vague marking instructions and poorly trained
markers.

A good assessment has both validity and reliability, plus the other quality
attributes noted above for a specific context and purpose. In practice, an assessment is
rarely totally valid or totally reliable. A ruler which is marked wrong will always give
the same (wrong) measurements. It is very reliable, but not very valid. Asking random
individuals to tell the time without looking at a clock or watch is sometimes used as
an example of an assessment which is valid, but not reliable. The answers will vary
between individuals, but the average answer is probably close to the actual time. In
many fields, such as medical research, educational testing, and psychology, there will
often be a trade-off between reliability and validity. A history test written for high
validity will have many essay and fill-in-the-blank questions. It will be a good
measure of mastery of the subject, but difficult to score completely accurately. A
history test written for high reliability will be entirely multiple choice. It isn't as good
at measuring knowledge of history, but can easily be scored with great precision.

Controversy

The assessments which have caused the most controversy are high school graduation examinations, which first appeared to support the defunct Certificate of Initial Mastery and which can be used to deny diplomas to students who do not meet high standards. Critics argue that one measure should not be the sole determinant of success or failure. Technical notes for standards based assessments
such as Washington's WASL warn that such tests lack the reliability needed to use
scores for individual decisions, yet the state legislature passed a law requiring that the
WASL be used for just such a purpose. Others such as Washington State University's
Don Orlich question the use of test items far beyond standard cognitive levels for
the ages being tested, and the use of expensive, holistically graded tests to measure the quality
of both the system and individuals for very large numbers of students.

High stakes tests, even when they do not invoke punishment, have been cited
for causing sickness and anxiety in students and teachers, and narrowing the
curriculum towards test preparation. In an exercise designed to make children comfortable about testing, a Spokane, Washington newspaper published a picture drawn by a student who, asked to show what she thought of the state assessment, depicted a monster that feeds on fear. This, however, is thought to be acceptable if it increases student learning outcomes.

Standardized multiple choice tests do not conform to the latest education
standards. Nevertheless, they are much less expensive, less prone to disagreement
between scorers, and can be scored quickly enough to be returned before the end of
the school year. Legislation such as No Child Left Behind also defines failure if a
school does not show improvement from year to year, even if the school is already
successful. The use of IQ tests has been banned in some states for educational
decisions, and norm referenced tests have been criticized for bias against minorities.
Yet the use of standards based assessments to make high stakes decisions, with
greatest impact falling on low-scoring ethnic groups, is widely supported by education
officials because such assessments show the achievement gap, which is promised to be closed merely by implementing standards based education reform. Many states are currently
using testing practices which have been condemned by dissenting education experts
such as Fairtest and Alfie Kohn.

Evaluation in India (except in schools)

The grading system in India varies somewhat as a result of being a large country. The most predominant form of grading is the percentage system. An
examination consists of a number of questions, each of which carries credit. The sum of
credit for all questions generally counts up to 100. The grade awarded to a student is
the percentage obtained in the examination. The percentage of all subjects taken in an
examination is the grade awarded at the end of the year. The percentage system is
used at both the school and university levels. Some universities also use the grading system
and a CGPA on a 10 or 4 point scale. Notably, all the IITs, BITS Pilani (Pilani, Goa
campuses) and most NITs use a 10-point GPA system. However, the grades
themselves may be absolute (as in NITs), exclusively relative (as in BITS Pilani), or a
combination of absolute, relative and/or historic, as in some IITs.
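
To make the arithmetic of the percentage system described above concrete, the short Python sketch below computes a student's aggregate percentage from per-subject marks. It is only an illustration: the subject names and marks are hypothetical, and actual mark sheets may weight subjects differently.

# Illustrative sketch of the percentage system described above.
# Each subject's examination is taken to carry a maximum of 100 credits;
# the aggregate percentage is the total obtained as a share of the total maximum.

marks = {                      # hypothetical marks out of 100 per subject
    "English": 72,
    "Mathematics": 85,
    "Science": 78,
    "Social Studies": 66,
    "Second Language": 81,
}

max_per_subject = 100
total_obtained = sum(marks.values())              # 382
total_maximum = max_per_subject * len(marks)      # 500

aggregate_percentage = 100 * total_obtained / total_maximum
print(f"Aggregate percentage: {aggregate_percentage:.1f}%")   # 76.4%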

There are several universities and recognized school boards in India, which makes an objective comparison of the percentage grades awarded by one examination with those of another difficult, even for examinations at the same level. At the school level, percentages of 80-90 are considered excellent, while above 90 is exceptional and uncommon. At the university level, however, percentages between 70 and 80 are considered excellent and are quite difficult to obtain. It should be pointed out that the percentage of marks awarded varies from one university to another, which makes direct comparison of percentages obtained at different universities difficult.

Official Grading System (for all Government/Autonomous/Deemed Indian Universities except schools)

Percentage Range    Grade    U.S. Grade    Class/Division
80-100%             A+       4.0           First Division with Honours/Distinction
75-79%              A        3.75-3.95     First Division with Honours/Distinction
70-74%              A-       3.5-3.7       First Division with Honours/Distinction
65-69%              B+       3.25-3.45     First Division
60-64%              B/B-     3.0-3.2       First Division
55-59%              C+       2.5-2.9       Second Division
50-54%              C/C-     2.0-2.4       Second Division
45-49%              D+       1.5-1.9       Third Division
40-44%              D/D-     1.0-1.4       Third Division
Less than 40%       F        0             Fail
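
As a further illustration, the Python sketch below converts a percentage into the grade and class/division given in the table above. The cut-offs follow the table; the function itself is a hypothetical helper, not an official conversion utility.

# Illustrative conversion from percentage to grade and class/division,
# following the cut-offs in the table above (not an official tool).

def grade_for_percentage(pct):
    bands = [
        (80, "A+",   "First Division with Honours/Distinction"),
        (75, "A",    "First Division with Honours/Distinction"),
        (70, "A-",   "First Division with Honours/Distinction"),
        (65, "B+",   "First Division"),
        (60, "B/B-", "First Division"),
        (55, "C+",   "Second Division"),
        (50, "C/C-", "Second Division"),
        (45, "D+",   "Third Division"),
        (40, "D/D-", "Third Division"),
    ]
    for cutoff, grade, division in bands:
        if pct >= cutoff:
            return grade, division
    return "F", "Fail"

print(grade_for_percentage(76.4))   # ('A', 'First Division with Honours/Distinction')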

Evaluation in schools

Until 1994, schools up to the tenth standard followed the marking or ranking system, in which each student's overall marks were calculated and his or her rank relative to others was released. However, the system had some accompanying problems contrary to the interest of the students, such as
• Unhealthy competition
• Increased parental pressure
• Heightened tension levels in students

Introduction of Grading

Owing to the inherent flaws in the ranking system and the persistent demand from academic and social circles that the outdated, anti-student system be replaced in the best interest of the students, the government decided to change the system of evaluation from ranking to grading on a stage-by-stage basis, which was rolled out from the year 1994 and completed by 2003.

However, it is not known whether the change has produced the expected results.

This study aims to analyse whether the introduction of the grading system has achieved the major objectives which justified its introduction, such as

• Alleviation of students' tension
• Change in unhealthy competition
• Better evaluation of students
• More scientific approach
It has been more than two years since the system of grading was implemented, and as per advice from academic circles, a study of how far grading has achieved its objectives is relevant and an evaluation of its effectiveness is due.

Review of literature - Journals

This chapter presents some of the studies and journal articles that have been published; it also helps the researcher to concentrate on the main aspects and to avoid duplication.

Grading and Assessment of Student Writing (Focus on Teaching). Rebecca B. Worley and Marilyn A. Dyrud

ALTHOUGH GRADING is undoubtedly at or near the top of our "to do" lists,
it is not the recurrent nature of this task that prompted the topic for this column. As
the articles in this journal testify, business communication as a discipline has been
changing. Most of us are not teaching the same course in the same way that we did 15
or 10 or even five years ago. But are we still grading in the same way? If content,
teaching methods, and delivery systems have changed, has grading also changed? And
if so, how? This column addresses those questions.

As Joel Bowman discusses in the first article of this column, technology complicates the submission and return of assignments in distance education courses.
Although software provides some solutions, marking papers and providing
meaningful feedback to students in an online environment proves to be more
complicated and time consuming than the traditional pen-on-paper method. As is so
often the case, however, the challenge of technology encourages us to re-examine how
we assess student writing.

In the second article, Marilyn Dyrud has done just that, advocating an holistic
method for evaluating assignments. Reflecting the evaluation procedures used in
business for determining the success of a written communication, the system assigns
numbers (0, 1, or 2) for unacceptable, acceptable, and excellent work respectively. As
the article explains, this system not only reduces the time commitment of instructors,
but also encourages students to take ownership of their written work.

Although her approach differs somewhat, Nancy Schullery also adopts the
holistic approach to grading, using seven foundation concepts that are essential to
every business communication, regardless of genre. Such assessment, she argues, more clearly simulates the response of supervisors, clients, customers, and other
readers in the business world.

LeeAnne Kryder, in the final article of this column, explains the rubrics she
has developed to assure some degree of consistency in grading among instructors and
teaching assistants in various sections of the same writing course. These rubrics she
finds particularly useful for evaluating individual student performance in group
projects.

If there is a consistent thread among these four articles, it is this: the quality of
the evaluation is what matters, not the amount of time instructors spend grading.
Regardless of the methods they use, all of the authors represented here are searching
for ways to provide meaningful and impartial evaluation of student writing that
encourages learning and rewards excellence.

It's Not Easy Being Green: Evaluating Student Performance in Online Business Communication Courses

Joel P. Bowman
Western Michigan University, Kalamazoo

ANYONE WHO HAS SEEN the popular children's TV show "Sesame Street"
knows that Kermit the Frog worried about being green. Although Kermit had a
different metaphor in mind, those of us teaching business communication in an online
format also face the issue of being "green." Classes taught over the Internet are
relatively new, and online instructors are having to learn--often the hard way--how to
take full advantage of electronic delivery to provide good instruction and effective
feedback on student work.

Until the advent of e-mail that permitted attaching formatted files, students
submitted work for evaluation in essentially the same way: as paper documents to be
marked and returned by the instructor. Most of us currently teaching business
communication learned what we know about marking student papers from our
instructors, who typically used a combination of marginal notations and standard
proofreader's marks. As instructors, we probably use essentially the same basic
approach to provide feedback on student work and justify our evaluation of that work.

Until the advent of the Internet, the procedures were basically the same for
those completing courses at a distance. Assignments arrived on paper and were
marked and returned, whether by mail (typical of "correspondence" courses) or by
fax, as video-based delivery became increasingly common. With the advent of the
Internet and submission of work by e-mail attachment, everything changed.

Online Class Structure and Coursework

Over the past three years I have taught eight sections of a standard business
communication course using Web-based, Internet technology for delivery of
information. While most of my students have been within easy driving distance of
campus and were also taking classes on campus, a significant number have been
several hundred miles away. A few have been several thousand miles away. Many
were working full time and elected to take the online version of the class because of
the flexibility it afforded. Some enrolled out of curiosity, knowing that online courses
were proliferating. A few enrolled expecting it to be easier than the traditional version
of the course because of the absence of regularly scheduled classes.

In the online sections, all my students submit most of their assignments as formatted documents (in Microsoft Word) attached to e-mail messages. The
exceptions are a videotaped presentation, which may be hand delivered or mailed, and
a PowerPoint presentation, which is also submitted as an e-mail attachment. With one
exception, the assignments are all completed individually. The exception is a group
report requiring collaboration. An electronic conference serves as a substitute for
classroom discussion. The conference affords students the opportunity to ask
questions about course materials and assignments and to discuss possible solutions to
the assigned cases. A majority of the assignments are based on cases requiring
students to write short documents using an appropriate format and correct spelling,
grammar, mechanics, and business writing style. These are submitted twice, once in
"draft" form and then as a revision.

Planning and the Vagaries of Technology

In every semester that I have taught online classes, technical difficulties have
created problems. Servers have a tendency to fail without notice, and the class conference and/or e-mail service may not be available for hours--and, on occasion, for
days. E-mail server software may strip the contents from attachments, so while a
document is sent and received, the contents are missing. Such challenges present one
of the principal differences in the evaluation of online students. To be successful in an
online environment, both faculty and students need greater flexibility and
perseverance. Online classes, like all courses taught at a distance, require more
attention to planning so that students know exactly what is expected of them at the
beginning of the semester to ensure that they can plan around their work schedules
and complete the assignments by their due dates.

For students at a distance, an old-fashioned technology, the US Postal Service, can present challenges. Students find that delivery by overnight mail and next-day
service by Airborne Express can take a week to 10 days. (So far, at least, UPS and
FedEx have provided the most reliable delivery of student videos.)

Online Evaluation of Written Work

Before preparing their solutions to the cases, students use the class conference
to ask questions and post sample solutions for my comments. Students earn points for
their participation in the conference. I evaluate entries for class relevance (students
may use the conference to discuss other topics of interest), spelling, grammar,
mechanics, and formatting. I also award extra credit to students who are the first to
find and report errors in my postings. At regular intervals, I send each student a record
of his or her postings indicating problems with the postings that resulted in a loss of
points for the evaluation period. Students may compensate for problems in an
evaluation period by increasing their level of participation in the next.

For the short documents, online discussion of the cases and their possible
solutions begins the week before the case is due. Students submit the drafts of their
cases on Monday. I return the marked drafts on Tuesday. The final versions of the
assignments are due Friday, and I return those by Sunday to ensure that students know
how well they have done before submitting the drafts of their solutions to the next
case.

The feedback on formatted documents, both the draft and final versions of
solutions to the cases, requires a slightly different strategy from that most of us
learned to use for documents submitted on paper. On paper, if a student has a problem
with comma usage, it is a simple matter to mark problems with usage and draw lines
to general comments about the need to review and what to look for in the review. This
isn't so easy to do with documents that never see paper. Microsoft Word (and other
word processing programs) allow for highlighting and changing text color. It is also
possible to use a "draw" function to show connections between related elements, but
imitating the kind of feedback possible with pen-on-paper is both time consuming and
awkward when done electronically.

I elected not to use the "track changes" function in Microsoft Word, preferring
to save copies of the drafts returned with my comments for comparison with the final
submissions. Much of the feedback on the draft was commentary about language
usage (such as explaining why a modifier "dangled"), business writing style (such as
commenting on message structure and tone), and the need to follow special
instructions for the case (such as using a numbered list if the directions said to do so).
Tracking the changes did not adequately allow for the revisions most students were
having to make between the draft and the final versions of their solutions and actually
added to the difficulty of evaluating the final copies.

Providing this kind of running commentary on the cases has both advantages
and disadvantages. The principal advantage is that documents submitted electronically
encourage instructors to provide more comprehensive, explanatory feedback than is
typical for paper submissions. The principal disadvantage is that providing such
feedback for each student individually takes more time than is required for marking
papers and returning them in class, where the explanation for the cryptic comments on
paper can be provided orally to all the students at once.

Bottom Lines

A variety of studies have shown no significant difference in learning based on the method of educational delivery. Even so, teaching and learning in an online
environment are not the same as teaching and learning in a traditional classroom
environment. The skills required for success may overlap, but they are not identical.

Online classes tend to have higher attrition. Students enroll but drop the course
early in the semester or simply stop doing the work. The mixture of highly motivated,
often nontraditional students and those who enrolled expecting the course to require
less time and effort typically results in bimodal distribution of grades, with a majority
of the students doing very well or very poorly.

Student success in online classes has a high correlation to willingness to learn from reading and to participate in electronic discussion of course concepts. Success
also depends on student comfort level with the technology. At least a few who insist
that they are comfortable with sending and receiving e-mail attachments when they
enroll discover that, while they may have sent an attachment or two, they do not know
how to retrieve attachments sent to them and frequently have difficulty sending
attachments as well.

In the history of education, online classes are new, and we have yet to
determine how to take full advantage of the technology. The natural inclination is to
try to imitate the classroom environment. The extraordinary family therapist Virginia
Satir referred to this inclination as "the lure of the familiar." Even when we
experiment, we tend to do what we have always done.

Because we are most familiar with the traditional classroom setting, we tend to
assume that it is the "gold standard" for educational delivery and seek to replicate
what we see as the advantages of that setting even as we use new technology. Those
of us who teach so-called "upper division" classes, however, might pause to consider
how much our students actually learned in their previous traditional classes. It may be
time to reinvent the wheel. As changing cultural needs continue to push us in the
direction of "any time, anywhere" delivery of education, what we are learning now
about how to evaluate student performance in an online environment may well
provide the foundation for new strategies of teaching and learning.

Preserving Sanity by Simplifying Grading


Marilyn A. Dyrud
Oregon Institute of Technology, Klamath Falls

ONCE--AND ONLY ONCE--I calculated how many papers I graded during an
academic year. With four writing classes averaging 25 students each, the total was a
staggering 4,500 papers. And, since all of my writing classes are process-oriented, the
real total was at least double that. I was, I thought, a very thorough and conscientious
grader, circling mechanical errors, rewriting wretched sentences, and carefully
marking each mistake as awk, pn, sp, ro, frag, and a plethora of other hieroglyphs. I
was also putting in 10-hour work days and, on occasion, waking up at 4 a.m. so I
could finish grading before classes met. Obviously, things needed to change.

My epiphany occurred during a discussion with a sagely professor in my department. He used a very different grading system and recommended a then-popular
technical writing text, Technical Writing: Process and Product, by Charles R. Stratton
from the University of Idaho, that discusses the types of technical writing generated in
industry. Stratton uses a three-tiered scale: about 20 percent of industrial documents
are deemed excellent, about 20 percent are unacceptable, and the remaining 60
percent are considered adequate. Excellent documents yield promotions and pay
boosts for their writers; unacceptable writing results in crash writing-improvement
classes or termination. Acceptable documents are salvageable, but require, in some
cases, extensive revisions which, of course, translates to money in business.

It was a small step from Stratton's book and further discussions with colleagues
to a new and improved grading system, one that would be more efficient, help prepare
students for the writing they would produce professionally, and encourage revision.

Thus was born 0-1-2, a system which I, and many of my communication colleagues, have now incorporated in virtually all classes that require written
documents. The 0-1-2 system encourages instructors to quit editing and start
evaluating and has three primary virtues: simplicity, efficiency, and equity.

Simplicity

In the olden days, before 0-1-2, I tended to edit and revise my students' papers,
accompanied by a rather unwieldy grading sheet. While the students liked the
expository essay grading sheet primarily because it looked organized and "identified"
all of their mistakes, from an instructor perspective, it was fraught with peril. Under mechanics, for example, 7 points are allowed for spelling. But what if a student made
10 errors in their paper? Does that student owe points? Is 6 points adequate to assess
logic in a persuasive paper? How much "freshness" can we expect from a newly-
minted high school graduate? What if a student submitted a paper that was
mechanically and structurally excellent but devoid of meaning? Using this grading
form, an instructor can subtract only 8 points; this well-written but vapid paper might
earn an "A."

Moreover, the instructor is compelled to use at least 12 different abbreviations as comments and needs to take valuable class time to explain their meanings.
Although we all also wrote comments in margins and at the end of papers, the
abbreviations were confusing to the students.

Using 0-1-2, most of these problems are non-existent because instructors read
the papers holistically, which results in simplified grading criteria and fewer editing
remarks. The figure is my current grading criteria handout for students in my business
correspondence class. Criteria will vary according to class; a business writing class,
for example, may place more emphasis on audience and formatting than a
composition class.

The 0-1-2 stands for unacceptable, acceptable, and excellent. Students who receive 0s and 1s may revise their work as often as they wish within a specified time.
In my more-than-a-decade of using this system, I have witnessed much better writing
as a result of revision, and the students willingly revise because everyone has the
potential to earn an "A".

Efficiency

In addition to downsizing assessment criteria, I have also quit editing and revising my students' papers. In lieu of circling and writing, I now make a simple
checkmark next to a line that has a mechanical error. The student's job is to find the
error and correct it. It is, after all, the student's paper, not mine, and reduced editing
empowers students to take possession of their own writing.

Just this simple change has resulted in massive time savings. In the past, using
a grading form and editing comments, it took me about 90 minutes to grade a set of business letters; now, I spend less than an hour. To score longer reports, I used to
spend about 45 minutes per report; now, it takes about 20.

While 0-1-2 speeds up the process, it does not reduce the quality of evaluation.
My students still receive ample feedback for improvement. It is, in fact, a form of
holistic assessment, an evaluation system widely used to score student work: the
Educational Testing Service uses holistic methods to assess SAT and GRE exams, and
the State of Oregon, which has significantly revised K-12 standards, uses it to score
for minimal competencies. Furthermore, numerous studies on holistic assessment
praise its reliability and note its versatility.

Equity

A third virtue of 0-1-2 is equity. With so few points available, there is little
room for instructor larceny, which, sadly, certainly does occur. In a traditional 100-
point system, it is easy to mark down a paper that, for example, might disagree with
the instructor's political philosophy. But with only 2 points available, arbitrary
deductions cannot occur.

Student Satisfaction

Student complaints are also minimized. With a traditional system, students could quibble about a point or two; but with this system, two points is a whole
assignment! Since I have implemented 0-1-2, I receive virtually no student
complaints. Quite the contrary: students like this system because they know that they
can correct their errors or polish their style. According to comments on my quarterly
student evaluation forms, satisfaction is high. Responding to the question "What did
you like about this course," my students obviously appreciated the revision
opportunity:

* "It is great that you allow rewrites because you learn a lot from your mistakes. Keep
it the same so others will get as much from this class as I did."

* "I liked having the opportunity to do rewrites."

* "Rewrite options are nice."

* "Opportunity to rewrite papers; fair chance to get the grade you desire."

Conclusion

A simplified grading system such as 0-1-2 offers many benefits to both students and instructors. For students, it offers them a chance to improve their work
and allows them to prepare for the work world where review cycles are common. It
allows them to debunk the myth that once something is written, it is set in granite.
Professional writers, both scholarly and industrial, revise their work many times prior
to publication. It also helps students develop their editorial skills and, hopefully,
establishes solid work habits.

For instructors, 0-1-2 offers the luxury of more time, substantially reducing
the many hours of evaluation required for writing classes. Those extra hours can be
applied to other tasks, such as course preparation or professional development. More
importantly, though, it changes the instructor's role from judge to coach, since the
primary goal is to produce an excellent piece of work. As Toby Fulwiler, University of
Vermont, explains, his role as a writing teacher has "little to do with teaching students
about semicolons, dangling modifiers, or predicate nominatives and a lot to do with
changing their attitude towards writing in general so they would care about it and
maybe learn to do it better".

A Holistic Approach to Grading


Nancy M. Schullery
Western Michigan University, Kalamazoo

TO GRADE students' accomplishments and assess learning, I use a holistic method modeled after that used in the "real world." When students enter the
workplace, their ability to accomplish their assigned goals will determine their career
success. Similarly, their ability to accomplish their purpose on an assignment should,
to my mind, determine their grades. A document's intended purpose is accomplished,
by definition, when the document is effective. Therefore, grades in my class are based
largely on my judgment of the degree to which students effectively accomplish the
purposes of their assigned papers. I would like to describe, first, how I define
effectiveness, and, second, how I implement a holistic approach to its assessment.

I define effectiveness as satisfying the following criteria: the writer has to
understand the complexity of the context and approach it in a way that respects the
viewpoint of the audience. The writer must demonstrate an appreciation of the
audience's viewpoint by using language that the audience understands and by
considering the relevant content in a way that relates to audience needs. The writer
must arrange the content in a strategically functional way, using an organization that
allows the reader (and any secondary audiences) to understand and accept the writer's
goals. Ideally, the format/design will invite the reader in to read the rest of the
document and allow the reader to make optimum use of the information. The content
must include enough detail to help the reader understand the situation from the
writer's perspective, yet include no irrelevant or prejudicial information. Thus, the
writer's self-presentation must be positive, giving the impression of a competent,
responsible, fair-minded business person speaking for a reasonable organization.
These seven foundation concepts (italicized above) are so essential and tightly
interwoven that they all must be incorporated into any written assignment, along with
the standard conventions of American English. These fundamentals are applicable
across the board, whether the student is writing a negative message, a persuasive
message, a research report, or an application letter and resume, each of which has its
own unique elements (as discussed in business communication textbooks). It is the
successful implementation of both the foundation concepts and the unique elements
that leads to effective accomplishment of purpose.

Effectiveness Criterion

My first task in grading is to make my use of the effectiveness criterion clear to the students. An "A" quality application letter or resume, for example, must be
ready to mail to the employer, and should appear to have a good chance of success.
Memos and letters must be written as though the writer's reputation, job security,
career advancement, and potential pay raises depend on the cumulative effect of the
writer's effectiveness in attaining his/her purpose with each document.

They are to assume that each paper is to be judged by that paper's intended
audience. In other words, when I do the grading, I will look at their papers through the
eyes of their workplace supervisor, customer/client, colleague, or potential employer.
As the assigned audience would react to the writer's document, so I also will react.

How do I do that? First, I make a quick preliminary judgment. For example,
we are told that prospective employers typically spend only 15-30 seconds reading
any single resume before making a preliminary (and sometimes final!) decision. I
strive for such a quick initial reading. My preliminary reaction may range from "this
is pretty good" to "oh, s/he's missed the point entirely." This initial judgment leads to
an initial valuation as an "A," "B," etc. grade, which I note in pencil and expect to
modify somewhat after a second, more careful analytical reading.

The second reading is where the real work is done. I identify, from the
perspective of the target audience, any of the seven fundamental concepts that the
student has implemented either particularly well or poorly. The good points are noted
in the margin to reinforce learning. The problems are also noted in the margin, but
always with specific instructions for improvement. Providing sufficient quality
feedback to the students is critical. I want students to understand the reader's likely
reaction to the document, so my comments are numerous. Some comments are
cryptic, such as "positive?" or "unclear." Others offer brief instruction (not editing)
toward a more appropriate direction, such as "rephrase in 'you' attitude" or "frame
more positively." Whenever possible, I point out important readers' perspectives that
the student has overlooked, such as, "could be read as defensive." Also, a few
summary statements at the bottom are given, especially early in the semester, when
positive reinforcement and motivation are most crucial. In keeping with my
employer's-eyes role, I attempt in my comments to both give the student credit for
what was done well and identify areas where improvement is needed or the purpose
has not been accomplished.

Thus, a paper that makes a very favorable first impression and has positive
comments regarding all, or most, of the fundamental concepts, and only a few minor
shortcomings, is judged excellent and the penciled-in grade is made an "A." One that
is "pretty good" but doesn't make the "A" grade due to falling short on one or more of
the criteria will receive a "B" based both on the seriousness of the errors and the
strength of the correct part of the paper in relation to previous papers I have read. A
"C" paper has potential, but lacks overall effectiveness due to some combination of
missing ingredients. Those papers that either miss the point entirely, ignore the basic
concepts or the key elements of the assignment, or display ignorance or carelessness with the basics of English punctuation, spelling, or grammar are, fortunately, rather
rare and earn a "D" very quickly.

Conclusion

Such an holistic approach contrasts with the use of a grading rubric, which is
the more common path to efficiency in grading. It is my observation that rubrics tend
to grow over time in response to students' creativity, becoming overly complex and
losing sight of the big picture. More important, I am concerned that the typical rubric
approach of points off for every error may be unfair and, possibly, do more harm than
good. For example, a common problem is repeated point deductions for the same type
of error (e.g., comma usage). To my mind, an error of a single type repeated
ten times is less objectionable than ten different errors. Further, sometimes a specific
technical error really does little to reduce the effectiveness of the message.

In contrast, the holistic method's focus on the larger goal, together with
explicit recognition of students' strengths, is an effective motivational tool, providing
students with a sense of building on a base toward greater mastery. While such
grading is not particularly efficient, I find that it does not take too much more time
than grading with a rubric. Of course, writing comments takes time. However, I
believe that such comments--reasonably prioritized and including genuine positives--
are both immediately helpful to the students and build goodwill that sets the stage for
future instructional success. This grading process does not end with the individual
papers; feedback is important also at the classroom level. As I return the papers, I give
positive reinforcement to the class by briefly explaining how they have generally
handled the concepts and showing a few exemplary papers in which the key elements
were masterfully treated. These are shown without names, to avoid heaping praise on
any individual student and inviting the suspicion of a teacher's favorite. I also describe
(but don't show examples of!) problems that seemed to be common. In this manner,
the grading process is transformed from a pure chore into an extension of the teaching
process, which, for me, makes it all worthwhile.

Although such holistic grading may be criticized for not enforcing absolute
standards with respect to details, I believe that its focus on the big picture helps
motivate students to strive toward improvement in all aspects of the subject. The method is akin to that used in industry of rewarding employees who do something
well with an "atta boy." Industry asks employees to work toward goals, and then
measures the degree of goal attainment in a performance evaluation (e.g., meeting
sales targets or budget allocations), in effect awarding points rather than deducting
points. It is my experience that students respond very favorably to any efforts that
convincingly connect their classroom subject matter with their own present or future
employment success. I recommend the method as one that helps motivate students
both to learn the principles of business communication and to master its skills.

Endnote

The foundation concepts noted here are drawn from E. A. Hoger & N. M.
Schullery (2001), Core Concepts for Business Communication, provided to instructors
of the business communication course taught at the Haworth College of Business,
Western Michigan University. The college does not mandate holistic grading.

Grading for Speed, Consistency, and Accuracy


LeeAnne G. Kryder
University of California at Santa Barbara

WITH AN AVERAGE OF 75 papers to grade for six or seven assignments per quarter, I have sought techniques to ensure efficiency and effectiveness in grading.
While grading still isn't easy, I feel comfortable with several techniques I've adopted
over the years.

Grading Rubrics

Most of my assignments in business communication are individually authored, but I also have at least one large collaborative report and collaborative oral
presentation. In my "writing for accounting" course, I have two collaboratively
written reports and oral presentations. I've developed rubrics for all these assignments,
similar to those often used for lower-division freshman composition. These help
ensure my consistency over the many hours of grading time, and they help speed my
evaluation and commenting time.

My rubrics generally reflect five categories of effective business writing:
organization and content, visual communication, concise and varied word choice,
mechanics, and format (the particular genre conventions for the memo, letter, and
report). For each category I determine the percent of points allotted. Depending on
what elements are being emphasized, points vary among categories in each
assignment. For an assignment calling for a resume, application letter, and brief memo
(explaining the job or organization targeted by the letter), I allot the following points:

* Assignment Content and Organization 2 pts

* Visual Communication 3 pts

* Word Choice/Tone 1 pt

* Grammar, Punctuation, Spelling 3 pts

* Format 1 pt

Before, I used to agonize over whether a document was a "B" or a "B+"; now I
focus attention on each category and let the numbers as totaled "tell me" if the
document is a "B" or a "B+."
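
A brief sketch may help show how such a point-based rubric turns category scores into a total that "tells" the grade. The category weights below mirror the example assignment above (10 points in all); the letter-grade cut-offs are hypothetical assumptions, since the article does not state them.

# Illustrative rubric scorer using the category weights listed above
# (10 points total). The letter-grade cut-offs are hypothetical.

RUBRIC_POINTS = {
    "Assignment Content and Organization": 2,
    "Visual Communication": 3,
    "Word Choice/Tone": 1,
    "Grammar, Punctuation, Spelling": 3,
    "Format": 1,
}

def rubric_total(awarded):
    # Sum the points awarded per category, capped at each category's maximum.
    return sum(min(awarded.get(cat, 0), maximum)
               for cat, maximum in RUBRIC_POINTS.items())

def letter_grade(total, out_of=10):
    # Hypothetical cut-offs for turning the total into a letter grade.
    pct = 100 * total / out_of
    if pct >= 90:
        return "A"
    if pct >= 80:
        return "B"
    if pct >= 70:
        return "C"
    return "D or below"

awarded = {                        # hypothetical scores for one student
    "Assignment Content and Organization": 2,
    "Visual Communication": 2,
    "Word Choice/Tone": 1,
    "Grammar, Punctuation, Spelling": 2,
    "Format": 1,
}
total = rubric_total(awarded)      # 8
print(total, letter_grade(total))  # 8 B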

Initially, the use of rubrics and numbers was foreign to me, but I needed a
clearly articulated guide when I began to work with teaching assistants in the "large
lecture" format. How could I ensure consistent treatment of student papers across
three graders (two TAs and myself)? For each of the assignment papers, I developed a
rubric and sent out a draft for comment from each TA. Discussions that followed
seemed to clarify the assignment for them, so the rubrics helped to "get us on the
same page" (literally). Perhaps because our students knew we were using common
evaluation sheets, we seldom had student complaints about unfair grading. In fact, the
final grade distribution assigned by the two TAs was very similar--and I believe the
rubrics helped us achieve that.

Although we haven't required all business communication teachers to follow the same rubric, I believe the use of a common rubric could further ensure consistency
among classes--especially when we add two or three new teachers to the mix.

The rubric also saves time in making summary comments because it forces me
to be concise due to space limitations. I still annotate individual passages in student
papers, but I now make brief marginal comments and then refer to the pre-printed
rubrics.

My large, collaborative project papers (reports or business plans) also benefit from a unique rubric. Because I want each author to receive my comments, I often
type them and the assigned points into the Word template so that I can print out four
or five copies. Certainly these rubric sheets can also be adapted for electronic
comments and sent as attachments to an e-mail from teacher to student.

Status Reports and Performance Evaluations

As a way to manage my students' group projects and ensure fair assignment of grades, I require some self-reporting. Each student on a team (typically five students
on each of five teams) takes a turn each week to send me an e-mail status report. Part
one is a set of minutes to be distributed to everyone on the team; part two is just for
me and includes the student's assessment of the project thus far. This is an excellent
way for me to gather an "early warning" if there are problems in the group and to
monitor individual students' attendance at team meetings. I also field questions and
address concerns earlier in the quarter.

Soon after I put the student teams together, I distribute the performance
evaluation that each student is to use in evaluating his/her own work and the
contributions of each team member. We discuss characteristics of successful teams
and I use the performance evaluation to highlight individual team member
responsibilities. The performance evaluation focuses on a variety of contributions
needed for a successful collaborative report; certainly writing and editing skills are
important, but so are computer skills, visual communication, willingness to take on
responsibility, initiative, and so on.

Periodically throughout the quarter, I refer to this performance evaluation; then, on the day the students submit their collaborative document, I ask that they also
submit the confidential performance evaluation. This, too, helps with final grading. I
tell the students that the project's rubric helps me assign a grade to the document;
whether each student shares in it depends on the performance evaluations and status
reports (sometimes individual grades are higher or lower, depending on that student's
contribution to the whole). With the periodic status reports and the final performance
evaluations, I feel comfortable in assigning collaborative project grades and the
ultimate course grades.

Concluding Remarks

The rubrics I have developed for each of my business communication courses have been an enormous help in my grading practices. They have made my grading less time-consuming, more consistent, and more effective.

Student perceptions of grades: a systems perspective. (The scholarship of teaching and learning). Jane Strobino, Kimberlee Gravitz and Cathy Liddle.
Academic Exchange Quarterly

Evaluation of student achievement in courses is one of the major responsibilities of faculty in the educational enterprise. Theoretically, grades are indicative of students' acquisition of knowledge in a particular content area, or the degree to which the student has learned. Faculty use a variety of rubrics for assigning grades and employ grading practices that may have evolved throughout their teaching careers.

During the grading process, tension arises when there is a difference between
the grade that the teacher assigns and the grade that the student expects (Goulden &
Griffin, 1995). Such tension can have important consequences for the student. Some
students, upon receipt of a grade lower than expected, may be discouraged from
further investment in the learning process, or may be motivated to work harder.
Additionally, grades may impact on students' self-esteem, self-worth, and self-efficacy
(Edwards & Edwards, 1999; Goulden & Griffin, 1995; House, 2000). One way to
understand this tension around grading is to consider student perceptions of grades.

Viewing student perceptions of grades from a Systems Perspective, specifically the Person-in-Environment (PE) paradigm, allows for a greater
understanding of factors that contribute to the tension (Germain & Gitterman, 1996).
The PE paradigm suggests that the person, in interaction with his/her environment,
creates a personal perception of grades. These environments include not only the
classroom and school environments, but also physical, social and psychological
environments. The broader context of student environments will affect the actions
taken by the student and thus his/her achievement in the classroom. Omitting
consideration of these other environments when assessing student achievement can
result in a grade that may reflect a biased assessment. Thus, from the PE paradigm, in order to understand student perceptions of grades, one must look at the student or person factors and the environments of which the student is a part.

In the current higher education environment, students are viewed as consumers, and the student body is composed of a large percentage of adults who are considered
non-traditional. Their perception of grades may be unique. Adult students, in
particular, may want to take charge of their learning and may be at odds with course
assignments and grading protocols. Unless the teacher and the student discuss any
differences in expectations, tension will arise when the student is given a grade with
which he/she does not agree.

Person Factors

According to the PE paradigm, students' perceptions of grades may be understood when person factors/characteristics are viewed in relation to the person's
multiple environments. These person characteristics include: previous school
experiences, student efforts to learn, motivation to learn, expectations regarding
grades, and readiness or preparedness for the academic program.

Previous school experiences of the students focus on the grading practices and
the extent to which the student subscribed to them. A number of authors found that some
students experienced grades as a reward or a punishment (Becker et al., 1968;
Kadakia et al., 1998; Tropp, 1970). These experiences relative to grading shape the
perceptions and behaviors of the students. Over time, students become conditioned to
the extrinsic rewards that grades convey and may continue this focus in graduate
school (Edwards & Edwards, 1999). Kadakia et al. (1998) found that although MSW
(Master of Social Work) students saw their colleagues as obsessive about grades, they
failed to identify this characteristic in themselves. Weiner (2000) makes the point that
an outcome (i.e., a grade) may be explained in terms of both personal behaviors (the student studied or did not prepare for the test) and actions of an 'other' (the teacher
recognized the effort expended). Further, Weiner suggests that the Protestant Work
Ethic is the value base upon which grades are perceived. And so, it seems that the
experiences one has throughout the years may serve as an impetus for the efforts one
expends and also for the motivation to engage in the learning process.

According to Becker et al. (1968), efforts expended by students are important when considering their perceptions of the grades that are assigned. The degree of
effort legitimizes the validity of the students' meaning of grades. This point is
consistent with Brookfield (1986) and Tiberius and Billson (1991) who suggested that
students have a responsibility in the learning enterprise and that responsibility
contributes to the teacher's evaluation and assignment of the grade. Likewise, House
(2000) viewed grading according to systems theory and saw the student inputs (efforts
and responsibilities) as being mediated by the school environment to produce the
output (grade). These authors concur with Rogers (1969) who suggested that learning
results from an interaction of the student with both the course materials and the
teacher. Efforts or responsibilities may be seen in the amount of time devoted to
reading, seeking resources, consulting with others and in actual writing and reflecting
on the course content. Thus, it is important that student effort or student
responsibilities are part of the context when looking at student perceptions of grades.

Learning efforts may be uniquely linked to other person factors such as student
motivations and grade expectations that may be illustrated by the consumer model.
This model views students as consumers of education, and emphasizes satisfaction
with their educational experience. The basis of student motivation is the monetary
contribution, which is made by the student in order to obtain a degree. If, indeed,
students perceive grades as a right (according to the consumer model) and faculty take
the traditional approach that students earn the grade, then tension surely will arise
during the grading periods. The consumer model may inadvertently diminish the
values of hard work and persistence which are essential to learning (Snare, 1997).
Also, this model projects a cynical viewpoint about students' motivations to learn and
fails to recognize the other contexts for which students are responsible. Other
motivating factors that have been suggested include the desire for recognition and for
knowledge so as to become expert in the field (House, 2000). It is important then to
document student motivations and grade expectations when investigating student
perceptions of grades.

Environments

Environments refer to the larger system contexts in which persons interact. These larger systems may support or hinder the efforts made by students in their
academic endeavors.

One environment of particular importance to learning is the classroom culture or climate. This environment includes transactions among students, and between
students and faculty. A classroom environment may stimulate or deter students from
engaging in learning experiences. Engagement in learning activities leads to greater
student achievement and consequently higher grades. A safe environment results in
higher student participation and student performance (Billson & Tiberius, 1991).
Student access to the teacher as a means to personalize the student-teacher relationship also is said to impact on grades (Becker et al., 1968; Rogers, 1969;
Somers, 1970). As that relationship develops, the teacher may be able to influence the
student to engage in productive learning activities which result in higher grades.

Another environment important to understanding student perceptions of grades is the student culture. Student culture includes students' knowledge and perceptions of the educational program and faculty, such as the strictness of grading requirements, along with grapevine information about faculty and how to succeed. "Most colleges probably develop a
student subculture that identifies tough graders and easy graders and often encourages
students to go for the easy ones" (Walhout, 1997, p. 86). The student subculture places
a negative spin on student approach to learning. Perhaps learning and grades are
viewed as separate with grades serving as a means to an end that may or may not
include knowledge acquisition.

An environment outside of school, yet intensely important to student perceptions of grades, is what might be called the student's personal environment. The
personal environment includes the home situation, employment, culture, health status,
finances, life stage, physical proximity/access to the school and faculty, and those
social environments that are meaningful to the student. This environment, because it
places demands on the student's time, serves to reduce the amount of time available
for students to use in academic pursuits. Consequently, although students may believe
their completed assignments are well done given the time available, they may be
graded unfavorably by the teacher. This discrepancy creates tension about the grade.
In studying student perceptions of grades, it is essential for their personal environment
to be included.

A broader environment, namely the higher education system, has contributed both to grading protocols and grading practices. While the university establishes
policies that mandate a grading system for teachers, there remains a wide variety of
grading practices. This lack of a standard grading practice demands that students learn
and adapt to variations among teachers. In some cases, students must choose the
courses on which to focus their attention. Also, easy grading may be part of an entitlement bargain in which universities seek students to fill the classrooms and meet minority and diversity requirements. Wiesenfeld (1996) raises questions about the
consequences of such a situation in terms of passing students who lack the expertise
and skills when they graduate and enter professional employment. Thus, documenting
the grading protocols that are used in the broader environment is important. They
interact with student learning behaviors and lead to a construction of the student
perception of grades. Given the literature on environments, any study about student
perceptions of grades should necessarily include those contexts. Doing so may
identify supports or barriers for student engagement and academic performance.

Meaning of Grades

Three studies were identified that focused specifically on the meaning of grades to students. Two involved undergraduate students and one involved graduate
students in a School of Social Work. In their classic work, Becker et al. (1968)
conducted a qualitative study of undergraduate students at a Midwestern university in
order to understand students' academic work in the context of their other life
experiences. The authors indicated that faculty were not usually aware of the demands
on the students' time, including those from other classrooms and from their personal
environments. Students believed that if they met faculty demands they could achieve
the desired grade. Such a perspective implies the external control of grades by faculty,
and minimizes student academic efforts in affecting the grade.

Nearly thirty years later, Goulden and Griffin (1995) also conducted their
study at a Midwestern university and included undergraduate students. They
addressed the different perceptions of teachers and students as one source of grade conflict. The underlying premise was that the meaning of grades differs between
students and faculty and this difference is a source of conflict. Students in their
sample were given two prompts to respond to: "What do grades mean to you?" and "Grades are like...." The results indicated that grades were viewed as a means of
feedback in which a measure or judgement about student effort was given. Also,
grades were seen as emotional triggers in which the student, as a person, was being
judged. Lastly, grades were seen as motivators within the context of a reward and
punishment system.

Kadakia et al. (1998) conducted a survey of students in one Northeastern Graduate School of Social Work. Their focus was on grade expectations and locus of
control. The results indicated that most students expected to receive high grades and
identified colleagues as obsessed with grades. The authors note that "Overestimating
academic performance in a profession that holds self-awareness as sacrosanct is
paradoxical and unacceptable" (p. 10). While not focusing directly on the meaning of
grades, the survey did inquire as to the importance of grades to personal beliefs about
self and the importance of receiving an acceptable grade. The reconceptualization of
grading, using the PE paradigm, provides direction for understanding student
perceptions of grades. Both student viewpoints and the multiple environments in
which these students interact are important to assess.

Summary

Grading is an area of tension between students and faculty that may be better
understood from a Systems Theory point of view. Utilizing the conceptual paradigm
of PE Fit recognizes the importance of environments as impacting on student
engagement in the academic enterprise. Moreover, documenting student perceptions of
grades will provide faculty with insights that can be used to reconsider rubrics for
grading. Such insights can serve to reduce the grading tension between faculty and
students. Additionally, this information can have important implications for
curriculum development and program planning. In terms of curriculum, the
identification of student perceptions of grades (and grading practices) may enhance
faculty discussions about the grading systems that should be in place. Also, such
discussions may include the importance of consistent expectations for student
achievement across courses as well as within different sections of a course. Regarding
program planning, documenting student perceptions and being responsive to these in
light of the environmental contexts may serve to heighten the reputation of the
school/program and thereby have an indirect effect on recruitment of new students.
This model also recognizes the importance of the student body profile, which includes both traditional and non-traditional students, who may have unique perceptions
about grades. Finally, the authors look forward to extending the ideas presented in this
paper by actually conducting research that addresses student perceptions of grades.

Davis, J. Thomas. "Fairness in Grading." Academic Exchange Quarterly 5.1 (Spring 2001): 64. British Council Journals Database. Thomson Gale.

This article reviews the difficulties in assigning grades to student work, briefly
reviewing highlights from the history of grading practices. It concludes with the
suggestion that, given the impossibility of comparing grades either within an
institution or between institutions, instructors should base grades on a measure of an
individual student's progress during a course.

One of the most important duties of a faculty member at the end of a term is
that of determining the final grade that individual students will receive in the class. As
difficult a process as this is, it is made even more difficult not only by having to
determine the process to arrive at the final grade, but also by the various
interpretations of that grade that will be made later by others.

These assigned grades are designed to serve a variety of purposes. Dr. James S.
Terwillinger wrote that grades are to serve three primary functions: administrative,
guidance, and informational. He indicated that grades should be viewed only "as an
arbitrarily selected set of symbols employed to transmit information from teachers to
students, parents, other teachers, guidance personnel, and school administrators."(1)
However, unless the meaning and interpretation of the grades assigned are universally
understood, the system, no matter how carefully designed and understood by the
instructor awarding the grade, will not be an effective means of communication to
others or over a period of time for cumulative evaluation.

This is true even if the purpose of grading is more specifically defined--as in the following list by Professor James M. Thyne: "To ascertain whether a specified
standard has been reached; To select a given number of candidates; To test the
efficiency of the teaching; To indicate to the student how he (sic) is progressing; To
evaluate each candidate's particular merit; and To predict each candidate's subsequent
performance."(2)

In the development of an individual or institutional grading policy, it is important that a decision be made as to the reason for the assessment. If it is merely to
have twelve grades at the end of the term or that departmental policy requires that all
work be graded, these will become ends in themselves, and the interpretation of the
final assigned grade will become even more difficult. Even with a definite purpose
beyond "institutional policy," it is extremely difficult to have a consensus as to how to
arrive at a grade to properly evaluate the progress made by any individual student in a
particular course.

Dr. William L. Wrinkle wrote in 1947 of six interpretation fallacies that are
made in understanding course grades. The number one fallacy that he listed in his
book was the belief that anyone can tell from the grade assigned what the student's
level of achievement was or what progress had been made in the class.(3) This fallacy
is as widely believed and probably as correct today as it was when he wrote it in 1947.

Even earlier in a study published in 1912, Dr. Daniel Starch and Edward
Elliott questioned the reliability of grades as a measurement of pupil accomplishment.
Their study involved the mailing of two English papers to two hundred high schools
to be graded according to the practices and standards of that school and its English
instructor. The papers were to be graded on a scale of 1 to 100, with 75 being
indicated as the passing grade. Teachers at one hundred forty-two schools graded and
returned the papers. On one paper the grades ranged from 64 to 98, with an average of
88.2. On the other, the range was 50 to 97, with an average of 80.2. With more than
thirty different grades assigned and a range of more than forty points for the same
paper, it is no wonder that the interpretation of assigned grades is extremely
difficult.(4)

Perhaps the earliest study on individual grading differences was done by Dr. F.
Y. Edgeworth of the University of Oxford in 1889. Professor Edgeworth included a
portion of a Latin prose composition in an article he wrote for the Journal of
Education. He invited his readers to assign a grade to the composition and forward it
to him. His only other instruction was that this composition was submitted by a
candidate for the India Civil Service, that the work was to be graded as if the reader
were the appointed examiner, and that a grade of 100 was the maximum possible.

He received twenty-eight responses distributed as follows: 45, 59, 67, 67.5, 70,
70, 72.5, 75, 75, 75, 75, 75, 75, 77, 80, 80, 80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90,
100, 100. In his conclusions Edgeworth wrote, "I find the element of chance in these
public examinations to be such that only a fraction--from a third to two thirds--of the
successful candidates can be regarded as quite safe, above the danger of coming out
unsuccessful if a different set of equally competent judges had happened to be
appointed."(5)
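The spread in Edgeworth's twenty-eight marks can be summarized directly. The short sketch below (in Python) computes the range and mean from the values listed above; nothing beyond those published marks is assumed.

# Sketch: summary statistics for the 28 marks Edgeworth received (values listed above).
import statistics

marks = [45, 59, 67, 67.5, 70, 70, 72.5, 75, 75, 75, 75, 75, 75, 77,
         80, 80, 80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90, 100, 100]

print("examiners:", len(marks))                       # 28
print("range:", min(marks), "to", max(marks))         # 45 to 100, a 55-point spread
print("mean:", round(statistics.mean(marks), 1))      # about 77.8
print("std dev:", round(statistics.stdev(marks), 1))  # sample standard deviation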

The criteria for evaluation vary not only from institution to institution but
from course to course within the same institution and from instructor to instructor of
the same courses within the same institution. Since methods used by various
instructors vary considerably, it becomes extremely difficult to read a student's
transcript to determine the student's standing among others at the same institution or
throughout the country at other institutions. The National Collegiate Athletic
Association's Division I institutions voted down a requirement that student athletes
maintain a standard grade point average in order to retain their eligibility to participate
in collegiate sports from year to year. The major reason given was the difference in
grading standards that existed between institutions and between courses and programs
at the same institution.

One difficulty is that the methods used in arriving at the final course grade are
almost too numerous to enumerate. These include the averaging of all course grades
made during the term, dropping the lowest one or two test marks, determining the
entire course grade on the basis of the final exam or one term paper, counting only the
final course grades, grading on the basis of class average, and having only a written
comment rather than course grade.

Compounding the problem of interpretation of the grades indicated for students is that both standards and grades themselves vary over time. For example, during the
decade of the 60's, many institutions began experimenting with a variety of grading
systems, both institution-wide and within selected courses. This was the era of student
protests, demonstrations, and student revolts on our college campuses. Institutions that
changed their grade reporting system include both small, private institutions and those
with longstanding Ivy League academic reputations.

These innovative grading systems included allowing pass/fail grades in selected courses; replacing the traditional grades with "High Pass, Pass, Fail" or with
Credit/No Credit; not counting failed but repeated courses in the grade point average;
and "A, B, C, No Credit," with the NC not counting in the GPA. Additionally, changes
had to be made in the system used to determine academic suspension, semester and
graduation honors, and class rank. In many cases, since class rank had become
virtually impossible to determine, it was left off the transcript entirely.

Many institutions that made global changes in the recording of grades during
the decade of the 60's have changed to a system that is based on the instructor's
evaluation as measured by traditional grades. But the same problems of interpretation
that existed earlier are still present, with the additional difficulty of interpretation of
the transcript of a student who was enrolled during the transition period. For example,
at the University of South Carolina, during a seven-year period, which many part-time
students need to complete their baccalaureate degree, a student's transcript would
indicate the assignment of course grades under four different grading systems. The
student would also have been subject to three different suspension and graduation
honor criteria.

Even where there appears to be a standardization of the items to be rated, there can still be difficulty. In his book, The Pyramid Climbers, Vance Packard reproduces
two report cards published in The New York Times Magazine. One report card was
for a kindergarten for four-year-olds and the other for evaluating executives in one of
the largest corporations in the country.

The first report card used a rating system of Very Satisfactory, Satisfactory, and
Unsatisfactory. The items to be evaluated were: Dependability, Stability, Imagination,
Originality, Self-expression, Health and vitality, Ability to plan and control, and
Cooperation. The second report card used a rating of Satisfactory, Improving, and
Needs Improvement. The items to be evaluated were: Can be depended upon,
Contributes to the good work of others, Accepts and uses criticism, Thinks critically,
Shows initiative, Plans work well, Physical resistance, Self-expression, and Creative
ability. The first report card was used to evaluate the executives, and the second to
evaluate the four-year-olds.(6)

In the January 1988 issue of the Academic Leader, Dr. Stephen J. Huxley
points out another difficulty and recommends a possible correction.(7) His
observation is that the final record of the student, the college transcript, is blind to the
differences indicated above. On the transcript in determining the student's grade point
average, an "A" in Organic Chemistry, earned under an instructor who rarely gives
them, is given the same weight as an "A" in Outdoor Fly Casting with an instructor
who rarely gives any grade other than an "A." Since these differences are generally
disregarded by employers, scholarship committees, and graduate and professional
schools, students with their peer network learn which courses and instructors to take
to bolster their grade point average.

Dr. Huxley's recommendation is that, in addition to the individual student's grade in the course, the transcript should indicate the average grade assigned by the
instructor for that particular course and section. This would allow a transcript reader
to determine more easily if the grade the student has in a particular course was the
result of individual academic performance or the result of enrollment in an "easy"
course. However, since this would necessitate not only a sophisticated computer
grading program, but the official recording of instructors' grading practices, it is
doubtful if many institutions will adopt Dr. Huxley's proposal.
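Huxley's proposal amounts to carrying one extra number on each transcript line. A minimal sketch of such a record is given below; the field names, the 4.0 scale, and the section averages are hypothetical, chosen only to echo the Organic Chemistry and Outdoor Fly Casting contrast above.

# Sketch of a transcript entry augmented with the section's average grade, along the
# lines Dr. Huxley proposes. Field names, the 4.0 scale, and the section means used
# here are assumptions for illustration.
from dataclasses import dataclass

@dataclass
class TranscriptEntry:
    course: str
    student_grade: float       # grade the student earned, on a 4.0 scale
    section_mean_grade: float  # average grade the instructor assigned in that section

    def relative_standing(self):
        """Positive when the student outperformed the section average."""
        return self.student_grade - self.section_mean_grade

entries = [
    TranscriptEntry("Organic Chemistry", 4.0, 2.4),    # an A where A's are rarely given
    TranscriptEntry("Outdoor Fly Casting", 4.0, 3.9),  # an A where nearly everyone gets one
]
for entry in entries:
    print(entry.course, "grade", entry.student_grade,
          "vs. section mean", entry.section_mean_grade,
          "-> relative standing", round(entry.relative_standing(), 1))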

Regardless of the grading system used, in interpreting the grades a central problem is in determining what they are trying to measure. Is the grade to measure an individual's achievement against others in the same class, course, or school, or is it
only to measure changes in the student's progress since the start of the course? If the
measure is of the individual's progress, it makes a comparison of one's standing against others almost impossible to ascertain. It is as if one were using an elastic ruler to
measure heights of individuals in a class. If the measuring device varies for each
student, then one student can be taller than another simply because the ruler was
stretched more in one measurement than in another, even if by simple observation it is
evident that the first is taller. In educational jargon, such a measuring device would be
labeled "unreliable." Yet in many courses the "measuring device" is changed for each
semester and possibly for each student.

In certain courses in which competency is to be developed, some instructors have assigned grades at the end of the course on the basis of a student's sustained
performance, regardless of the actual average attained or the average of the others in
the course. For example, a student enters a writing course making grades of D on the
material submitted. During the semester, the student makes the following grades: D,
D, C, D, C, C, B, C, B, B+, B. What grade should be assigned as a final course mark?
If a strict average is used, then the student has a grade of C or, at best, C+; however,
the belief of some faculty is that since this student is writing at a "B" grade at the end
of the course, then this is the final grade that should be assigned.
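The two grading philosophies in this example can be compared directly. In the sketch below, the grade-point values and letter cutoffs are an assumed 4-point convention, and "sustained performance" is crudely operationalized as the average of the last three grades; neither convention comes from the article itself.

# Sketch: strict average vs. end-of-course ("sustained performance") grading for the
# sequence above. The point values, cutoffs, and last-three-grades rule are assumptions.
POINTS = {"D": 1.0, "C": 2.0, "B": 3.0, "B+": 3.3}
CUTOFFS = [(3.15, "B+"), (2.5, "B"), (2.15, "C+"), (1.5, "C"), (0.0, "D")]

grades = ["D", "D", "C", "D", "C", "C", "B", "C", "B", "B+", "B"]

def to_letter(average):
    for cutoff, letter in CUTOFFS:
        if average >= cutoff:
            return letter
    return "F"

strict_average = sum(POINTS[g] for g in grades) / len(grades)  # about 2.1 -> a C
final_performance = sum(POINTS[g] for g in grades[-3:]) / 3    # about 3.1 -> roughly a B

print("strict average:", round(strict_average, 2), "->", to_letter(strict_average))
print("end-of-course performance:", round(final_performance, 2), "->", to_letter(final_performance))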

It would seem that in those courses in which competency is desired, the latter
example would be a reasonable approach to assigning the final mark for a student.
Certainly it is reasonable from the student's standpoint; however, it makes any comparison with others in the class impossible, as well as any comparison with students in other classes, even of the same subject at the same institution.

The attempt to arrive at a fair and equitable grade to assign to an individual student, without distorting either the student's standing in class or comparative ranking with students at that or other institutions, has proven to be one of the most difficult quests of the faculty member. To be reliable as a true measure of achievement over a period of time, the grade assigned must be understood by the instructor, students, colleagues, and future evaluators.

Since there appears to be little doubt that a given mark has different
interpretations, perhaps the best choice is for the faculty member to follow the course,
within departmental and institution guidelines, that in his or her opinion best measures
the student's progress during the measuring period without being overly concerned
with grading practices of other faculty and other institutions.

Grade distributions, grading procedures, and students' evaluations of instructors: a justice perspective. Jasmine Tata.
The Journal of Psychology

Grades are the basic currency of our educational system. Instructors in universities and colleges assign grades to students on a regular basis. High grades
result in both immediate benefits to students (e.g., intrinsic motivation, approval of
family) and long-term consequences (e.g., admission to graduate school, preferred
employment). Students who perceive their grades as unfair are more likely to react
negatively toward the instructor, and these negative reactions may influence students'
ratings of teaching effectiveness.

A number of researchers have examined the connection between students' grades and their evaluations of instructors. The literature is equivocal on this issue;
some researchers suggested that student grades and grading standards may bias
teaching evaluations because students who receive higher grades tend to rate the
instructor more positively than students who receive lower grades. Results of other
studies did not find consistent effects of grades on evaluations of teaching.

This inconsistency in the literature concerning the connection between grades and students' evaluations of instructors can be clarified by examining the fairness of
grades. It is possible that the connection between low grades and unfavorable
evaluations of instructors exists not because of the level of the outcome (the grade)
received by the student, but because the low grade is perceived to be unfair. The
literature on distributive justice (Adams, 1965; Crosby, 1984; Folger, 1986) indicates
that people receiving outcomes that are lower than expected are more likely to
perceive the distribution as unfair, and perceptions of unfairness may lead to negative
evaluations of distributors. In the context of grade allocations, a student who spends a
number of hours preparing for an examination may expect to receive an A. If the
student receives a lower grade, the grade may be perceived as unfair, especially if
other students who spent fewer hours preparing for the examination received higher
grades.

In addition to the grade distribution, grading procedures can also influence students' perceptions of fairness and evaluations of instructors. The literature on
procedural justice states that procedures that are consistent and impartial are perceived
as fairer than those that are inconsistent and biased (Leventhal, 1980). In the context
of the classroom, instructors are expected to apply grading standards (procedures)
consistently to all students. If an instructor lowers the standards for a few students,
this is likely to be perceived as procedurally unfair. Students who believe that the
grade allocation procedures are unfair may be more likely to evaluate the instructor
unfavorably.

For this study, I examined the connections between students' evaluations of instructors, the fairness of grade distributions, and the fairness of grading procedures. I also investigated the relative influence of procedural and distributive fairness on evaluations of instructors. Examining these relationships can be of importance to instructors, administrators, and students. Instructors may use the findings to identify how best to allocate grades and the appropriate procedures associated with grade distribution. Administrators and students may understand the extent to which evaluations are influenced by students' perceptions of procedural and distributive fairness, and the judgment processes involved in evaluations of instructors.

The Fairness of Grade Distributions

Justice theory and research have dealt with both distributive and procedural justice. Distributive justice is concerned with the fairness of decisions about the distribution of resources, whereas procedural justice is concerned with the fairness of the procedures used to reach those decisions. Distributive justice refers to the extent to which the outcomes received in an allocation decision are perceived as fair; this type of fairness has been considered implicitly within the contexts of equity theory, relative deprivation theory, and referent cognitions theory. These theories suggest that individuals use standards of distributive justice such as equality (outcomes allocated equally to all participants regardless of inputs) and equity (outcomes allocated based on inputs such as productivity) to establish the fairness or unfairness of the outcome. Thus, the experience of injustice involves the realization that outcomes do not correspond to expectations determined by standards of distributive justice.

In the context of the classroom, grades are the outcomes allocated to students.
Students receiving lower grades than expected are likely to perceive the grades as
distributively unfair, whereas students receiving expected grades are likely to perceive
the grades as fair; this phenomenon can be explained by relative deprivation and the
egocentric bias. Relative deprivation theory posits that the fundamental source of
feelings of injustice is the realization that one's outcomes fall short of expectations.
The egocentric bias in distributive justice suggests that individuals' expectations of
their own performance and outcomes are higher than their expectations of others'
outcomes; hence, people who receive higher outcomes are more likely to perceive
those outcomes as fair than people who receive lower outcomes. Empirical support for this phenomenon has been found in the work of Lind and Tyler and of Tyler, who
found connections between outcomes (relative to expectations) and perceptions of
distributive justice.

The Fairness of Grading Procedures

Procedural justice refers to the extent to which the processes used in making
allocation decisions are perceived as fair (Lind & Tyler, 1988; Thibaut & Walker, 1975).
Research on procedural justice has evolved from two conceptual models - Thibaut and
Walker's (1975) dispute resolution procedures and Leventhal's (1980) principles of
resource allocation procedures. These researchers suggested that procedural justice
involves the realization that procedures correspond to those determined by certain
standards (e.g., consistency, suppression of personal bias, use of accurate information,
voice, and congruity with prevailing standards or ethics). In the classroom context, the
procedures used to allocate grades could influence students' perceptions of procedural
fairness and evaluations of the instructor.

Results of research in organizational contexts have shown that distributive fairness and procedural fairness influence employees' reactions. Folger and Konovsky
found that perceived fairness was related to satisfaction, trust in supervisors, and
organizational commitment. Alexander and Ruderman determined that employees'
perceptions of fairness influenced their approval of supervisors.

Extrapolating to the classroom context, the fairness of grade distributions and grading procedures should influence student reactions, such as their evaluations of
instructors. Therefore, my first hypothesis was that evaluations of the instructor would
be higher for students who received expected grades (fair grade distributions) than for
students receiving grades lower than expected (unfair grade distributions). My second
hypothesis was that students' evaluations of the instructor would be higher for
consistent (fair) grade allocation procedures than for inconsistent (unfair) procedures.

It is possible that procedural and distributive justice are predictive of different types of outcomes. Sweeney and McFarlin's two-factor model suggests that
distributive justice primarily influences attitudes toward the outcome in question,
whereas procedural justice influences attitudes toward the system. For example,
Sweeney and McFarlin found that employees who believed their pay was lower than
expected (distributively unfair) demonstrated lower levels of pay satisfaction, an
attitude specifically directed toward the outcome (pay). In contrast, when pay
distributions were made using fair procedures, employees showed higher levels of
trust in management and commitment toward the organization (attitudes directed
toward the system).

In the classroom context, students' evaluations of instructors can be considered attitudes toward the university system; such attitudes are more likely to be influenced by procedural justice than by distributive justice. Based on Sweeney and McFarlin's
model, my third hypothesis was that consistency (fairness) of grading procedures
would influence students' evaluations of the instructor to a greater extent than grade
distributions.

Method

Participants and Design

Based on a definition of fairness as meeting expectations and being consistent, I used a 2 (grade distribution: met expectations vs. did not meet expectations) x 2 (grading
procedure: consistent vs. inconsistent) between-subjects, scenario-based experimental
design. Undergraduate students (51 men and 46 women) participated in the study.
Most were sophomores (32%) or juniors (41%), and the rest were seniors. The
average age of the students was 20.10 years.

Materials

The participants were asked to respond to one of four different scenarios. Each
scenario described a classroom situation and an instructor. Participants were given
contextual information about the situation and were asked to place themselves in the
position of a student in the class who had worked hard on a term paper by conducting
research, writing, and rewriting the paper. On the basis of the grading criteria
described in the syllabus, the student expected to receive a grade of A on the paper.

The grade distribution was manipulated by informing the participants that the
student received a grade that either met expectations (A = fair grade distribution) or
did not meet expectations (B = unfair grade distribution). The grading procedure was
manipulated by stating that the instructor used the grading scheme specified in the
syllabus to grade the paper (consistent/fair grading procedure) or that the instructor
changed the grading scheme after the paper was turned in (inconsistent/unfair grading
procedure).

Procedure

Participants were randomly assigned to one of the four manipulation conditions. After reading the scenario, they were asked to complete measures of the
dependent variable (students' evaluations of the instructor) and two manipulation
checks (distributive justice and procedural justice) on 7-point Likert-type scales.

Students' evaluations of the instructor were measured by asking them to rate the preparation of the instructor, course organization, subject matter presentation,
knowledge of the subject matter, availability of the instructor, his or her attitude
toward students, and an overall evaluation of the instructor. These items were based
on scales used in empirical research by Chacko (1983) and Carkenord and Stephens
(1994).

Distributive justice was measured by asking participants to rate the extent to
which they felt that the actual distribution of grades was fair and what they deserved,
and procedural justice was measured by asking participants to rate the extent to which
they felt that the decision about the grade was made in a fair way and they were
treated fairly; these scales were based on those used by Bies, Shapiro, and Cummings
(1988) and Shapiro (1991). After completing the ratings, participants were debriefed
about the purpose of the study.

Results

Reliability Analyses and Manipulation Checks

Cronbach's reliability coefficients were calculated for the scales and were
found to be greater than .75 for each condition; mean ratings were also computed for
each scale. Next, I conducted manipulation checks to examine the participants'
understanding of the distributive justice and procedural justice manipulations. I
conducted separate t tests for the two manipulation checks. The results of the t tests
indicated that the manipulations had the intended effects. Grade distributions
influenced perceptions of distributive justice; participants who had been assigned
expected grades gave higher ratings for distributive justice than those who had been
assigned grades lower than expected, Ms = 4.81 and 3.59, respectively, t(95) = 3.84, p
< .05. Also, the instructor's grading procedures influenced perceptions of
procedural justice; participants gave higher ratings of procedural justice for consistent
procedures than for inconsistent procedures, Ms = 5.11 and 3.87, respectively, t(95) =
4.06, p < .05.
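The computations reported in this paragraph are standard. The sketch below shows how a Cronbach's alpha and an independent-samples t test of this kind might be run; the ratings it generates are hypothetical, not the study's data, and only the analysis structure mirrors the text.

# Sketch: Cronbach's alpha for a multi-item scale and an independent-samples t test
# as a manipulation check. The ratings generated below are hypothetical.
import numpy as np
from scipy import stats

def cronbach_alpha(items):
    """items: a (respondents x items) array of ratings."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(0)

# Seven instructor-evaluation items for 97 hypothetical respondents.
base = rng.normal(5, 1, size=(97, 1))
ratings = np.clip(base + rng.normal(0, 0.5, size=(97, 7)), 1, 7)
print("alpha =", round(cronbach_alpha(ratings), 2))

# Manipulation check: distributive-justice ratings by grade-distribution condition.
fair = rng.normal(4.8, 1.5, 48)    # received the expected grade
unfair = rng.normal(3.6, 1.5, 49)  # received a lower grade than expected
t, p = stats.ttest_ind(fair, unfair)
print("t =", round(t, 2), "p =", round(p, 3))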

Tests of Hypotheses

Of interest in this study was the relative influence of grade distributions (distributive fairness) and grading procedures (procedural fairness) on students'
evaluations of the instructor. I conducted an analysis of variance (ANOVA) with grade
distributions and grading procedures as independent variables and students'
evaluations of the instructor as the dependent variable. The two main effects and the
interaction effect were significant.
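A 2 x 2 between-subjects ANOVA of this kind can be expressed compactly. The sketch below runs one with statsmodels on a hypothetical data frame; the column names, cell means, and cell sizes are assumptions, and only the design mirrors the study.

# Sketch: a 2 (grade distribution) x 2 (grading procedure) between-subjects ANOVA
# run on hypothetical data. Column names, cell means, and cell sizes are assumptions.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(1)
rows = []
for dist in ("met", "not_met"):
    for proc in ("consistent", "inconsistent"):
        # Lower evaluations when the grade fell short of expectations, and lower
        # still when the grading procedure was also inconsistent.
        mean = 5.5 if dist == "met" else (5.4 if proc == "consistent" else 3.9)
        for score in rng.normal(mean, 1.0, 24):
            rows.append({"evaluation": score, "distribution": dist, "procedure": proc})

df = pd.DataFrame(rows)
model = smf.ols("evaluation ~ C(distribution) * C(procedure)", data=df).fit()
print(anova_lm(model, typ=2))  # main effects and the interaction term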

In support of Hypothesis 1, evaluations of the instructor were influenced by
grade distributions. Participants who were assigned expected grades (fair
distributions) gave higher evaluations of the instructor than participants who were
assigned grades lower than expected (unfair distributions), Ms = 5.54 and 4.67,
respectively, t(95) = 2.42, p < .05. Hypothesis 2 was also supported.
Students' evaluations of the instructor were influenced by grading procedures; when
consistent (fair) procedures were used, participants gave higher evaluations of the
instructor than when inconsistent (unfair) procedures were used, Ms = 5.52 and 4.69,
respectively, t(95) = 2.31, p < .05.

The interaction of grade distributions and grading procedures was also significant, and simple main effects were calculated using Gabriel's simultaneous test
procedure (Kirk, 1982). Among participants who received expected grades (fair
distributions), there were no significant differences in evaluations of the instructor
between those who were provided with consistent procedures and those who were
provided with inconsistent procedures, Ms = 5.61 and 5.47, respectively, t(95) = 0.39,
p > .05. Among the participants who received grades lower than expected
(unfair distributions), however, respondents who were provided with consistent (fair)
grading procedures gave the instructor higher evaluations than those who were
provided with inconsistent (unfair) procedures, Ms = 5.42 and 3.91, respectively, t(95)
= 4.19, p < .05. Therefore, grading procedures appeared to influence
evaluations of the instructor only when students received grades lower than expected.

To test Hypothesis 3, I calculated partial correlation coefficients. The partial correlation between procedural fairness and the students' evaluations of the instructor
(controlling for distributive fairness) was compared with the partial correlation
between distributive fairness and evaluations of the instructor (controlling for
procedural fairness). The results suggest that the relationship between procedural
fairness and evaluations of the instructor was no stronger than the relationship
between distributive fairness and evaluations of the instructor. Thus, Hypothesis 3 was
not supported.
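Hypothesis 3 rests on comparing two first-order partial correlations. The sketch below computes them with the standard formula on hypothetical ratings; it illustrates the comparison, not the study's actual data.

# Sketch: first-order partial correlations of the kind used to test Hypothesis 3.
# The ratings generated below are hypothetical.
import numpy as np

def partial_corr(x, y, z):
    """Correlation of x and y with z partialled out."""
    rxy = np.corrcoef(x, y)[0, 1]
    rxz = np.corrcoef(x, z)[0, 1]
    ryz = np.corrcoef(y, z)[0, 1]
    return (rxy - rxz * ryz) / np.sqrt((1 - rxz**2) * (1 - ryz**2))

rng = np.random.default_rng(2)
distributive = rng.normal(4, 1.5, 97)
procedural = 0.4 * distributive + rng.normal(4, 1.2, 97)
evaluation = 0.5 * distributive + 0.5 * procedural + rng.normal(0, 1, 97)

print("procedural fairness, controlling for distributive:",
      round(partial_corr(evaluation, procedural, distributive), 2))
print("distributive fairness, controlling for procedural:",
      round(partial_corr(evaluation, distributive, procedural), 2))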

Discussion

The purpose of this study was to examine the influence of the fairness of
grade distributions and grading procedures on students' evaluations of the instructor.
Distributive fairness was manipulated by providing participants with grades that either
met expectations or were lower than expected. Procedural fairness was manipulated
by providing consistent or inconsistent grading procedures.

The results indicate that students' evaluations of an instructor are influenced by distributive fairness because participants who received expected grades gave higher
evaluations than those receiving grades lower than expected. Procedural fairness also
influenced evaluations of the instructor. Participants provided higher evaluations
under consistent procedures than under inconsistent procedures. The fairness of
grading procedures, however, influenced evaluations of the instructor only under
unfair grade distributions. When students received expected (fair) grade distributions,
grading procedures did not significantly influence evaluations of the instructor. This
suggests that procedural fairness becomes more salient under conditions of
distributive unfairness.

The fairness of grading procedures, however, did not influence students' evaluations of the instructor to a greater extent than the fairness of grade distributions;
this finding is not consistent with previous research conducted in organizations but
may be explained by examining the differences between organizational settings and
the classroom context. Employees in organizations are likely to have a long-term
perspective of their relationships with management and organizations. In contrast,
students are more likely to have a short-term perspective of their relationships with
instructors, as they generally interact with instructors for only the length of a semester.

These differences in time horizons can be connected to perceptions of fairness. Procedural fairness influences system variables (e.g., trust in management or
evaluations of instructors) partly because fair procedures ensure that, over time,
outcome distributions (e.g., pay or grades) will be favorable (Lind & Tyler, 1988).
Employees who have long-term relationships with management may be influenced to
a greater extent by procedural fairness than students who have short-term
relationships with instructors and are not concerned about future outcomes distributed
by the instructor. Thus, students may not emphasize grading procedures to a greater
extent than grade distributions when evaluating instructors.

Before generalizing from the results of this study, certain limitations of the
methodology should be kept in mind. Patterns obtained in a scenario-based study may
not always be generalizable to other settings. Unfortunately, the sensitive nature of
this line of research made it problematic to conduct in a classroom setting. Also, the
subtle differences between the independent variables used in this study made it
difficult to examine the independent and interaction effects under natural
circumstances. The external validity of the study, however, was increased by using
students as participants, because they could easily relate to the grading incidents.
Future researchers can extend the generalizability of this study by replicating it using
other methods in other settings.

Another potential limitation of the study is the use of only one manipulation of
grade distributions and one of grading procedures. In actuality, students' perceptions
of distributive fairness may be influenced not only by comparisons between their
grades and expectations, but also by comparisons between their grades and others'
grades. Similarly, procedural fairness may be perceived not only through the
consistency of grading procedures but also through other factors such as lack of bias
and the use of accurate information. Although the manipulation checks indicated that
participants' perceptions of distributive and procedural justice were influenced by the
manipulations used in the study, future researchers can use other techniques to
examine connections between the perceived fairness of grades and students'
evaluations of instructors.

When the results of this study are viewed along with past studies (Perkins et
al., 1990; Snyder & Clair, 1976), grade distributions appear to be a consistent
influence on evaluations of teaching. To the extent that students' evaluations of the
instructor's performance reflect the instructor's evaluations of the students'
performance (grades), teaching evaluations have the potential to be contaminated by
factors unrelated to teaching behavior.

The influence of grade distributions, however, can be mitigated by the grading procedures used. The results suggest that the fairness of grading procedures has a
significant influence on students' evaluations of instructors. As such, this study
connects the research on procedural justice in organizational settings (Greenberg,
1990; Lind & Tyler, 1988; Tyler, 1986) to the classroom context; just as managers
perceived as fair by employees are more likely to receive positive evaluations,
instructors perceived as fair receive higher ratings. Instructors can ensure the fairness
of their grading procedures by being consistent, using accurate information, and
maintaining an impartial process.

The validity of students' evaluations of instructors is a complex issue. Although factors external to instructor performance (such as grade distributions) can
influence evaluations, so can other factors intrinsic to performance such as the
fairness of the grading process. The validity of students' evaluations of instructors can
be strengthened by using other measures of teaching effectiveness along with student
ratings, especially in making decisions about salary increases, promotions, and tenure.

A new approach to exploring biases in educational assessment. Ian Dennis, Stephen E. Newstead and David E. Wright.
British Journal of Psychology

Assessment procedures which lead to a subjective evaluation of the work of students or pupils are extensively used at all levels in British and European education. Although the North American tradition has relied more heavily on objectively scored multiple choice assessments, in recent years there too has been increasing advocacy of subjectively marked, open-ended or 'authentic' assessment (e.g. Jones,
1988). One of the disadvantages of this form of assessment is that marking may be
subject to various forms of bias. Our knowledge of the biases which may be at work
and especially of the extent of their impact in live educational settings is quite limited
and relies in part on generalization from more general evidence concerning
judgmental biases. This paper aims to contribute relevant evidence based on marks
obtained from a real rather than simulated assessment situation. However, more
importantly it aims to illustrate an approach to studying assessment biases which, if
applied more generally, has the potential for adding considerably to our knowledge in
this area.

One situation in which there is good reason to suppose that marking biases
may operate is that in which the student whose work is being assessed is personally
known to the marker. Although this situation may be undesirable from the perspective
of summative assessment, there are often other considerations, such as the provision
of appropriate feedback, which outweigh this. Examples of marking being undertaken
by the same individuals who teach students, and who are therefore personally familiar
with them, are thus widespread. In such situations there is good a priori reason for
suspecting that marks may be contaminated by individual biases.

One form of individual bias which is likely to occur in assessment is based on generalization from previous performance. Theories of impression formation suggest
that in general individuals tend to form consistent impressions of others at an early
stage in the impression formation process and having done this are prone to discount
evidence which is inconsistent with these early views. Although there has been little
attempt to examine the consequences of this for educational assessment, it would be
strange if markers were exempt from these effects.

Related to this is an extensive if rather muddled literature in occupational psychology on halo effects in performance rating. This has primarily focused on the
situation in which an individual's performance is rated on a number of different
dimensions or attributes where high correlations are often found between the different
dimensions. The high observed correlations have been explained partly as a result of
the true correlation of the dimensions being rated (true halo) but also partly as a
consequence of systematic rater errors (illusory halo). Recent reviews have suggested
that the extent of illusory halo may have been exaggerated and that attempts to reduce
the total observed halo effect may sometimes be misguided. One argument made in
both reviews is that there is not a major problem if the purpose of rating is to make
comparisons between individuals according to a criterion which involves pooling the
rated dimensions. Such comparisons are not vitiated if the ratings on the dimension
under consideration are contaminated by some other dimension which also relates to
the decision being made or by some impression of general merit. Whilst this may be
true for some occupational settings, the argument does not hold good in education
training. In the educational context it is important to be accurate not only in
comparisons across individuals but also in comparisons within an individual over time
or over different assessments. There are clear reasons in the educational context for
seeking to avoid biases which downplay changes in the relative standing of
individuals since such biases are clearly unfair to individuals who show more than
average improvements in their performance.

Although illusory halo can occur because ratings of various relevant
dimensions contaminate one another, it can also occur if they are all contaminated by
a common influence which is not relevant to the decision being made. Influences
which might plausibly work in this way are not hard to identify, although there is only
a limited amount of work demonstrating their impact and even less which enables
their magnitude to be evaluated in applied settings. One dimension which has received
some attention is the physical attractiveness of the individual being assessed or rated.
Landy & Sigall found that the evaluation of essays could be influenced by the
physical attractiveness of their author, although a study by Bull & Stevens only
partially confirmed this result. In the occupational sphere Morrow, McElroy &
Stamper examined the influence of physical attractiveness on judgements of
suitability for promotion made by personnel professionals on the basis of simulated
assessment centre data. Physical attractiveness was found to have a significant effect
on the ratings, although the effect was not large, accounting for only 2 per cent of the
variance in the ratings given.

Another unwanted influence based on an assessor's or appraiser's knowledge of the individual being assessed is that of interpersonal liking. Cardy & Dobbins
(1986) asked student subjects to evaluate vignettes of professors. The inclusion of trait
terms that engendered different liking levels but which were irrelevant to performance
nevertheless affected the rating given. Tsui & Bruce, working with ratings of real
occupational performances, also concluded that interpersonal affect may contaminate
appraisal ratings. Ratings on a number of different dimensions were higher when
raters liked the person being rated. Moreover, the existence of either positive or
negative affect towards the ratee increased the intercorrelations between the different
dimensions relative to those obtained from raters whose feelings towards the person
being rated were neutral.

Thus there is good reason to suspect that where markers know the students
whose work they are assessing, their marks may be biased by overgeneralization from
the student's previous performance, by whether or not they like the student, and by
irrelevant considerations such as the student's physical attractiveness. However, such
influences have had little direct investigation in the educational setting and virtually
nothing is known of whether or how seriously marks may be contaminated by them.

As well as resulting in unfair treatment for individual students, it seems likely that if
these biases are operating, especially those based on interpersonal affect and physical
attractiveness, they will vary in their impact for different groups of students such as
males and females or students of different ages.

Thus the differential impact of biases based on individual knowledge of students is
one way in which different groups of students may come to suffer unequal
treatment. However, a second possible way in which this may come about is through
the operation of group stereotypes. These are likely to have most impact when the
marker does not know the student personally but can identify the group to which the
student belongs. The most widely discussed example of this is where markers are able
to identify gender from the student's name.

The possibility that gender stereotypes may bias marking has received more
attention than most aspects of marking bias. However, even here there is no clear
agreement on what types of effects are operating and on the extent of their impact.
Whilst differences in mark and grade distributions between males and females have
often been observed, it has not proved easy to make progress in disentangling whether
these reflect genuine differences in performance or whether they are in some part
attributable to biases in marking. A study by Bradley (1984) is one of the few which has
made progress on this issue. Bradley exploited the fact that where two markers mark
the same piece of work the discrepancy in their marks will in part be determined by
any differences in their biases. She found that second markers (whom she assumed to
have less knowledge both of the project topic and of the student) marked the projects
of female students closer to the centre of the scale than first markers. For male
students the pattern was reversed, with second markers tending to award more
extreme marks. Since Bradley's effect relates to the comparison between the marks
awarded by first and second markers it cannot be explained solely on the basis of
different distributions of performance in male and female students. Bradley attributed
the outcome to a centrality bias in the marking of female students' work which derived
from gender stereotypes and was consequently stronger in the less specialist second
markers.

However, the data could equally be explained if there were some influence
which inflated the variance of first marker's marks for female students. Bradley's
preference for the explanation in terms of group stereotypes was based partly on the
previous literature and partly on the fact that the pattern of results failed to obtain in a
department where the second markers were blind to the student's identity and gender.
However, there could well be other differences between the departments in which the
effect was observed and that in which it was not. This point was reinforced by the
failure of Newstead & Dennis (1990) to replicate Bradley's data pattern in a large
department where second markers did know the student's gender. A variety of
explanations for the discrepant outcomes drawing on different interpretations of
Bradley's initial effect have been advanced and discussed. However, this debate has
been largely indecisive and it might be concluded that the approach being adopted
provides insufficient evidence to adjudicate between alternative interpretations of the
effects.

Thus, whilst there is good reason to believe that personal biases could operate
in marking, there is little direct evidence of their importance, and in the area of gender
bias it has been difficult to disentangle effects of bias from true differences in
performance. One of the most promising approaches to the latter issue is subject to
ambiguities of interpretation. The main aim of the present paper is to propose and
illustrate an approach which can contribute considerably to progressing the study of
both individual biases and gender bias in marking. The essence of this approach has
previously been advanced in relation to occupational ratings by Kenny & Berman
(1980). However, it appears to have been little exploited for occupational ratings and
not at all in relation to marking bias. The present study goes beyond the proposals of
Kenny & Berman in using a multi-sample approach to compare the way in which the
work of male and female students is marked and in using a structured means model to
locate the source of differences in the average marks of males and females.

Overview of the approach

Consider a model in which the mark which a marker awards to a piece of work
is the sum of three components. The first component is determined by the true merit
of the work being assessed. The second component reflects the aggregate influence of
the marker's biases concerning the student in question. The third component consists
of purely random influences. We can hope to make some progress in distinguishing
the variance attributable to these different influences if markers assign nominally
independent marks to a number of pieces of work from the same student and if each
piece of work is marked by more than one marker. In this situation each of the marks
assigned to a particular piece of work will be influenced by its true worth (along with
other influences). All the marks which a marker awards to a particular student will be
influenced by, amongst other things, that marker's biases concerning the student.
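
In symbols (a sketch only, with notation introduced here for exposition rather than taken from the original), the mark which marker k awards to section s of a particular student's project can be written as

    X_{sk} = T_s + B_k + E_{sk}

where T_s is the true merit of section s, B_k is the aggregate bias of marker k concerning that student, and E_{sk} is purely random error. Marks awarded to the same section share T_s across markers, whilst all the marks which marker k awards to that student share B_k; it is this pattern of sharing which makes the two sources of variance separable when several pieces of work from the same student are each marked by more than one marker.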

When viewed in this way the problem of separating variance attributable to merit
from that attributable to bias is isomorphic with that of separating trait and
method variance in personality assessment. Thus marking data of the form discussed
above can be very effectively analysed using the same sort of confirmatory factor
models which have been applied to multitrait-multimethod matrices. This should be a
useful tool since it enables the percentage of variance attributable to individual biases
to be estimated for each marker. Moreover, using programs such as LISREL or EQS,
this type of model can be fitted simultaneously to data from separate groups such as
males and females. The advantage of this is that the consequences for the overall fit of
the model of constraining its various parameters to be equal for the two groups can be
assessed. Hence it can be determined where differences in the marking of the two
groups arise. Thus, for example, if differences in variance are found between male and
female marks it can be determined whether these are attributable to the merit
component of the mark, the bias component or the error component.

Structure of the data

The data which were used in this study derive from the marking of final year
undergraduate psychology projects in the University of Plymouth between 1991 and
1993. These projects report a piece of empirical work carried out over an
academic year. This work is supervised by one supervisor who meets with the student
regularly during the course of the year. The student's project report is sectioned in the
conventional way for reports of empirical work and is independently marked by the
supervisor, acting as first marker, and a second marker. Second markers are chosen,
subject to constraints imposed by workload, on the basis of their interest or expertise
in the topic area of the project. Project reports, which are typed, carry the student's
name on the cover. Second markers will vary in their degree of personal knowledge of
the student: they may have come to know the student quite well in some role such as
that of the student's personal tutor but may know the student hardly at all. Typically
they will know the student considerably less well than does the supervisor by the time
the project report is submitted.

The supervisor awards four marks to the project whilst the second marker
awards three. The supervisor's A mark is intended to be based on the student's
performance in designing and conducting the project work over the course of the year.
It is an explicit part of the marking policy that the remaining three marks awarded by
each marker are to be based solely on evidence which is available to both markers in
the form of the written project report. The B mark relates to the introduction section
of the report, the C mark to the method and results sections and the D mark to the
discussion. The marks are awarded on a 100-point scale commonly used in British
higher education where a mark of 70 or above corresponds to a first-class honours
degree and a mark below 40 represents a fail, with intermediate classifications all
having specified mark ranges. For each of the four marks, markers have available to
them a set of marking guidelines which provide a two or three sentence description of
the performance appropriate to each degree class. Second markers award the B, C and
D marks without knowledge of the corresponding marks given by the supervisor and
without receiving any comments from the supervisor. They do, however, have sight of
the supervisor's A mark prior to awarding their own marks. Data from one male
student and two female students were excluded from the analysis because they were
incomplete. This left data from 197 female students and 58 males which were used in
the analysis. Twenty-five different markers were involved, of whom eight were
female.

Structure of the model

The path diagram for the model fitted to the data is shown in Fig. 1. This
follows the standard conventions for such diagrams. The variables in square boxes are
manifest variables - in this case the seven different marks awarded to each project. A1
denotes the first marker's A mark, B1 the first marker's B mark, B2 the second
marker's B mark and so forth. The variables enclosed by circles are latent variables
which the model assumes to combine in determining the manifest variables. In general
the model assumes that each mark is determined by the sum of three influences.

The first influence on each mark is a factor which is specific to the section
being marked but which influences both markers. This influence is reflected in factors
SSB, SSC or SSD for sections B, C and D respectively (the labels are chosen to
provide a mnemonic for the fact that the influence of these factors is section specific).
These factors will probably reflect primarily the true merit of the section being
marked but they could also embody biases which are shared by both markers or any
other influence which operates on both of them.

The second influence on each mark is one which is marker specific but general
across all the marks awarded by that marker to the student. These factors appear on
the right of the model in Fig. 1. The factor MS1 influences only the marks of the first
marker and the factor MS2 influences only the marks of the second marker (again the
labelling is chosen to provide a mnemonic for the fact these are marker specific
factors). These two factors affect each of the marks which a marker awards to a
particular student. The influences represented by MS1 and MS2 would include any
pre-existing biases the marker may have concerning either the student in question or a
group to which the marker knows the student belongs. The marker's reaction to
aspects of the student's work which transcend the different sections, such as the
student's writing style, would also enter into the factors MS1 and MS2. It also needs
to be recalled that the markers under consideration are not the same individuals for all
projects being marked. This complicates interpretation a little since any differences in
stringency between markers, whereby some markers are more severe than others in all
the marks which they award, will also appear in MS1 and MS2. However, it also
means that rather more general conclusions can be reached than would have been
possible if only two markers had been involved.

The third influence on each mark is one of the error components E2 to E7.
These account for the component of the variance which is not explained by the other
factors. They represent the component of each mark which is idiosyncratic and section
specific and are analogous to the error component in traditional true score models of
marking reliability.

Factor SSA differs slightly from factors SSB to SSD. Because the A mark is
not independently awarded by a second marker there is no way of identifying the
component of this mark which is idiosyncratic to the supervisor. Thus the model does
not contain an error component influencing A. Instead in this case the error
component can be thought of as being hidden within SSA.

The double headed arcs linking factors SSA to SSD signify that these factors
are allowed to be intercorrelated. Clearly if the major determinant of these factors is
the real merit of the student's performance then students who perform well on one
element of the project are likely also to perform well on other elements so this is
necessary to make the model appropriate. It is important to note that influences which
are not linked by double headed arcs are assumed to be uncorrelated. In particular the
influences on marks reflected in MS1 and MS2 are ones which are uncorrelated with
each other and with the complex of SSA to SSD. Thus biases which are shared by the
two markers will appear in SSA to SSD rather than MS1 and MS2.

Biases which are correlated with ability will also not affect MS1 and MS2.
Consider, for example, Bradley's (1984) suggestion that because of gender stereotypes
second markers mark the work of female students less extremely. This bias would
reduce the mark given to good work presented by a female student but elevate the
mark given to poor work. Thus the bias which Bradley proposes to be operating in
second markers is one which is correlated, albeit negatively, with student
performance. Because it is not orthogonal to the section specific factors it will not
contribute to MS2. However, if this bias is operating, the second marker will award a
lower mark to a good female project than to an equally good male project and
conversely for poor projects from male and female students. Thus variations in project
quality will have less impact on the mark for female than for male students. This will
lead to smaller path coefficients for females than for males on the paths from SSB to
SSD to the corresponding marks awarded by the second marker.
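
For concreteness, the model just described can be set out as a set of measurement equations (a sketch; the notation is chosen here and the published parameterization may differ in detail). Writing X_{s1} and X_{s2} for the first and second marker's marks on section s, for s = B, C, D,

    A1 = \lambda_A\,SS_A + \gamma_A\,MS_1
    X_{s1} = \lambda_{s1}\,SS_s + \gamma_{s1}\,MS_1 + E_{s1}
    X_{s2} = \lambda_{s2}\,SS_s + \gamma_{s2}\,MS_2 + E_{s2}

with the covariances among SSA to SSD left free and with MS1 and MS2 uncorrelated with each other and with the section specific factors. Under this parameterization the variance of each mark decomposes as, for example,

    Var(X_{s1}) = \lambda_{s1}^2\,Var(SS_s) + \gamma_{s1}^2\,Var(MS_1) + Var(E_{s1})

and the share of mark variance attributable to the marker specific factor is the middle term divided by the total; this is presumably how the percentages of variance quoted in the Discussion are obtained.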

Results

Descriptive statistics for each of the seven marks are given for male and
female students separately in Table 1. The modelling procedures used here assume
multivariate normality of the underlying data and are known to be sensitive to
violations of this assumption. It can be seen that in general, but especially for the
females, the distributions tend to be leptokurtic and to have negative skew. After
experimentation with alternative transformations this was dealt with by squaring all
marks prior to calculating the variances and covariances used in modelling. Skew and
kurtosis for the transformed marks are also given in Table 1. It is apparent that the
transformation alleviates the problems which were present in the raw data and
produces data which are reasonably close to normality for both males and females.
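
The effect of the transformation can be illustrated on simulated data (an illustrative sketch only, not the study's marks): squaring a bounded, negatively skewed variable stretches its upper tail and so pulls the skew towards zero.

    # Illustrative only: simulated negatively skewed "marks", not the study data.
    import numpy as np
    from scipy.stats import skew, kurtosis

    rng = np.random.default_rng(0)
    marks = 100 * rng.beta(5, 2, size=200)   # bounded scores with negative skew

    squared = marks ** 2                     # the transformation used here

    for label, x in [("raw", marks), ("squared", squared)]:
        print(label, "skew = %.2f" % skew(x),
              "excess kurtosis = %.2f" % kurtosis(x))
    # The squared scores should show markedly less negative skew than the raw scores.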

The correlation matrix for the transformed data, along with the relevant
standard deviations, is presented in Table 2. These are the data to which the model
was fitted. Informal inspection of Table 2 reveals a number of features which are
helpful in gaining an insight into the conclusions which emerge more formally from
the fitting of the model in Fig. 1. The correlations in the table are of three types: (i)
correlations between marks awarded by the same marker to different sections, (ii)
correlations between markers in their marks for the same section, and (iii) correlations
involving both different markers and different sections. In Table 2 correlations of type
(i) are in bold type, correlations of type (ii) in italics, and correlations of type (iii) in
plain type. In general correlations of type (i) are larger than those of type (ii), which
are in turn larger than those of type (iii). This general pattern, which, as might be
expected, emerges more clearly in the larger female sample, is consistent with the
model in Fig. 1.

Correlations of type (iii) are expected to be smallest since the two measures
being correlated are not influenced by any shared factor. The fact that the correlations
of type (iii) are all positive presumably reflects the existence of positive correlations
amongst the factors SSA to SSD. Correlations of type (ii) differ from those of type
(iii) in that the measures being correlated share the influence of either MS1 or MS2.
The fact that correlations of type (ii) tend to be bigger than those of type (iii)
demonstrates the need for the inclusion of these two factors in the model. For
correlations of type (i) the two measures being correlated share the common influence
of one of the factors SSA to SSD. The observation that correlations of type (i) tend to
be bigger than those of type (ii) suggests that, as might be expected, the influence of
SSA to SSD is greater than that of MS1 or MS2.

A second feature of the data which is evident in Tables 1 and 2 and which
contributes to the outcome of model fitting is that the marks awarded by first markers
show a larger SD for males than for females. If this difference in variance is tested
using the transformed data it reaches significance on the B and D marks though not on
the A and C marks (for mark A, F(57, 196) = 1.28, p = .22; for mark B, F(57, 196) =
1.51, p = .04; for mark C, F(57, 196) = 1.17, p = .42; for mark D, F(57, 196) = 1.65, p
= .006). The difference in SD is only marginally and non-significantly evident in the
second marker's marks (for mark B, F(57, 196) = 1.12, p = .58; for mark C, F(57, 196)
= 1.05, p = .79; for mark D, F(57, 196) = 1.13, p = .53). Some aspect of the first
marker's marking must differ between males and females but without formal
modelling it is not clear which of several possibilities holds. It could be, for example,
that individual biases are larger for males than for females so that the influence of
factor MS1 is greater for males than for females. Alternatively, it could be that first
markers let the same degree of variation in merit produce larger degrees of variation in
marks when the work is that of male rather than female students. Thus the paths
leading from the factors SSA to SSD to the marks awarded by the first marker would
have larger coefficients for males than for females. Perhaps less plausibly it may be
that supervisors' marking of male students is noisier so that the error components of
the first marker's marks (E2, E4 and E6) are greater for males than for females.
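
A variance-ratio test of the kind reported above can be sketched as follows (the arrays male_marks and female_marks are hypothetical stand-ins for the transformed marks of the two groups; whether the original tests were one- or two-tailed is not stated, so the two-tailed convention used here is an assumption).

    # Sketch: F test for the equality of two group variances.
    import numpy as np
    from scipy.stats import f

    def variance_ratio_test(male_marks, female_marks):
        F = np.var(male_marks, ddof=1) / np.var(female_marks, ddof=1)
        df1, df2 = len(male_marks) - 1, len(female_marks) - 1
        # Two-tailed p-value: double the smaller of the two tail probabilities.
        p = 2 * min(f.sf(F, df1, df2), f.cdf(F, df1, df2))
        return F, p

    # With 58 males and 197 females the statistic is referred to an F distribution
    # on (57, 196) degrees of freedom, as in the tests quoted above.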

The purpose of the model fitting, carried out here using EQS, is to predict the
entire pattern of covariances and variances amongst the seven marks in each of the
two samples. EQS and similar programs determine values for the unknown
parameters of the model in such a way as to minimize the discrepancy between the
observed variances and covariances and those predicted by the model. A number of
alternative criteria for assessing this discrepancy are available. In the present case a
maximum likelihood criterion was employed. Under the assumption of multivariate
normality this criterion leads to a function which is approximately distributed as
χ² with a number of degrees of freedom which depends on the number of
measured variables and on the number of parameters which are estimated in fitting the
model. The value of this statistic can be used to assess the overall compatibility of the
model with the data. It is also possible when fitting models to constrain parameters to
particular values or to equality with one another so that fewer separate parameters
need to be estimated and the resulting χ² statistic has more degrees of
freedom. The change in χ² resulting from the imposition of constraints
provides a method of assessing whether those constraints are significantly worsening
the fit of the model - this is sometimes referred to as the χ² difference test.
In addition to the χ² statistics a variety of other fit indices are also
available. Some of these, such as the normed fit index (Bentler & Bonett, 1980),
provide a measure of fit on a scale from 0 to 1; for these the fit index inevitably
increases as constraints are released in a series of nested models. Other measures such
as Akaike's Information Criterion also take account of the parsimony of the model
and favour models in which a good fit is obtained with a small number of free
parameters.
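
The χ² difference test amounts to referring the change in χ² to a χ² distribution with the change in degrees of freedom; a minimal sketch, using the two model comparisons reported below as illustrations, is:

    # Sketch: chi-square difference test for nested models.
    from scipy.stats import chi2

    # Model 1 vs model 2 (reported below): a difference of 10.141 on 4 df.
    print(chi2.sf(10.141, 4))   # should be close to the quoted p = .0381
    # Model 2 vs model 3 (reported below): a difference of 5.097 on 3 df.
    print(chi2.sf(5.097, 3))    # should be close to the quoted p = .165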

Because of the previous evidence that males and females may be treated
differently in project marking they were treated as separate samples using the facilities
which EQS provides for multi-sample modelling. This approach makes it possible to
impose constraints which equate particular parameters across the two samples. The
degree of fit of the constrained model then provides a test of whether the assumption
that these parameters are the same for the two samples is consistent with the data. By
investigating which parameters, if any, need to differ across the two samples it should
be possible to be more specific about the origins of different mark distributions for
males and females.

Table 3 presents a number of measures of fit for a series of variants of the model
under discussion. Preliminary explorations of the male and female data
separately indicated that in both cases no strain was imposed on the model by making
the error component of the C and D marks equal for the two markers and that this
constraint helped to avoid some problems with error variances becoming negative in
the smaller male sample. Accordingly this constraint was retained in all the models
reported in Table 3. There was never any indication that releasing this constraint
would have produced a noticeable improvement in fit of any model.

Model 1 of Table 3 contains the requirement that all parameters of the model
should be equal for males and females. It can be seen that overall this model is quite
compatible with the data. However, it was noted previously that the marks awarded by
first markers to male and female students had different variances. This provides
evidence that some aspect of the model needs to be different for male and female
students. Detailed consideration of the results from fitting model 1 indicated that the
fit of the model could be improved if MS1 was allowed to have a greater influence for
males than for females, particularly on the B and D marks.

Model 2 in Table 3 differs from model 1 in that all four path coefficients of
MS1 have been allowed to differ in males and females. The χ² difference
test confirms that the fit of model 1 is significantly poorer than that of model 2
(χ²(4) = 10.141, p = .0381). Thus the influences represented by MS1 (the
nature of these is taken up later) have greater impact when the work being marked is
that of a male student.

Inspection of the fitted parameters of model 2 showed that the influence of the
first marker factor MS1 was considerably larger than that of the second marker factor
MS2 (parameter values from a model which is very close to model 2 are given in
Table 4). How strongly do the data dictate a model in which both MS1 and MS2 have
some influence but the influence of MS1 is stronger? The remaining models in Table
3 were considered with a view to exploring this question. In model 3 the second
marker factor, MS2, is removed from the model entirely. Model 3 is otherwise
identical to model 2. The change in χ² on moving from model 2 to model 3
is not significant (χ²(3) = 5.097, p = .165), indicating that the second
marker effect, MS2, is not necessary for an adequate fit to the data. Because of its
greater parsimony, model 3 appears slightly superior on Akaike's (1987) information
criterion but the other fit indices suggest that model 2 yields a marginally better fit.

The results from model 3 show that a model from which the second marker
effect, MS2, has been removed provides an adequate account of the data. Would a
model from which MS1 has been removed also be satisfactory? Model 4 examines
this by dropping MS1 whilst retaining MS2. To make the comparison with model 3 a
fair one, MS2 was allowed to have different path coefficients for males and females.
Model 4 produces a significant overall χ² (χ²(33) = 55.8, p =
.008), implying that the data pattern observed would be unlikely if this model were
correct. Thus whilst a model with only the first marker effect, MS1, is tenable, there is
enough evidence in the data to discount a model in which only the second marker
effect, MS2, is present.

It is worth noting that models which exclude both of the marker specific
factors, MS1 and MS2, unsurprisingly provide an even poorer account of the data than
model 4. Complications arise in fitting these models because some estimates of the
correlations between section specific factors become constrained at unity. This in
itself suggests that these models are unsatisfactory. Further evidence comes from the
fit measures obtained when both MS1 and MS2 are dropped from the model. In model
6 the complications mentioned above have been dealt with by having the same factor
load on both section C and section D marks. Additionally, differences between the
male model and the female model which lead to a significant improvement in
χ² have been introduced on an ad hoc basis. Despite these ad hoc changes
the model produces a highly significant overall χ² and can thus be rejected.
The exclusion of factors MS1 and MS2 (and also the merging of SSC and SSD) makes
model 6 more parsimonious than the other models in Table 3 and this will in part
account for its poorer fit. However, even on the AIC, whose purpose is to allow for
differences in parsimony, model 6 comes out markedly worse than the other models in
Table 3. This confirms the view that at least one of the marker specific factors is
needed to give an adequate account of the data.

Model 5 is intermediate between model 3 and model 4 in that the path coefficients
of MS1 and MS2 on the two marks awarded to a particular section were
constrained to equal one another. These equal path coefficients were allowed to differ
between males and females. Model 5 cannot be discounted at conventional
significance levels, although on all measures of fit it performs more poorly than
models 2 and 3.

In summary, the general form of the model presented in Fig. 1 provides a good
account of the data provided that the first marker effect, MS1, is included. The effect
of MS1 is greater for males than for females. The best account of the data is provided
by models in which the impact of MS2 is either substantially smaller than that of MS1
or absent. However, a model in which MS1 and MS2 have an equal impact cannot be
rejected. The good fit of the models should not be overemphasized given the relatively
large number of free parameters which they contain. However, less sensibly motivated
models with as many free parameters do less well in accounting for the data, and the
results reported above from model 4 show that a satisfactory fit is not guaranteed for
models of the order of complexity considered here. The satisfactory fit does imply that
sensible interpretations may be placed on the parameter values of the fitted models. A
number of issues arise from these parameter values and these are taken up in the
discussion.

Inspection of Table 1 reveals that the mean marks given to male students are
higher than those for females, especially in the supervisor's marks. The model
presented in Fig. 1 implies that this must occur either because males have higher
scores on some or all of the factors SSA to SSD or because they have higher scores on
one or both of MS1 or MS2. The fact that the difference is more apparent in the
supervisor's marking begins to suggest that it might arise from MS1. The question can
be looked at more formally by use of a structured means model.

The models whose fits are presented in Table 3 are based solely on the
variances and covariances of the seven observed marks. They differ from a structured
means model in that the latter also takes account of the means of the manifest
variables in estimating parameter values. Multi-sample structured means models are
discussed by Bentler. They make use of more evidence from the data in that they seek
to account for the means of the manifest variables in the different groups, but doing
this also involves estimating additional parameters.

The additional parameters which are needed in the present example include
the means of each of the six factors (SSA to SSD, MS1 and MS2) in each of the two
samples. Since the
zero point of the factors is arbitrary it can be set at zero in one of the samples. When
this is done the estimated means of the factors for the remaining sample provide
information on how the means of the factors differ in the two samples. It is these
estimated differences in factor means which are the focus of interest here. The other
additional parameters needed are the intercepts of the seven linear expressions giving
the manifest variables as functions of the latent factors. These intercepts are constrained
to equality in the two samples. In applying this approach to the present model 13 extra
parameters need to be estimated - the intercepts of the seven manifest variables and
the means for the six factors in one group. With 14 observed means available as data
to be explained by the model and 13 additional parameters to be estimated, the
structured means model provides virtually no additional information to assist in
discriminating between models. However, the parameter estimates it provides are of
considerable interest for the reasons given above.
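
In outline (a sketch, with notation chosen here), the structured means extension models the expected marks in group g as

    E[X^{(g)}] = \tau + \Lambda\,\kappa^{(g)}

where \tau holds the seven intercepts (constrained equal across groups), \Lambda the factor loadings, and \kappa^{(g)} the means of the six factors in group g. With one group taken as the reference (its \kappa fixed at zero), the estimated \kappa for the other group gives the between-group differences in factor means directly; the seven intercepts and six free factor means are the 13 additional parameters set against the 14 observed means mentioned above.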

The structured means version of model 2 again has a good fit (χ²(29) = 32.89,
p = .28, NFI = .973, NNFI = .995, CFI = .997). Table 4 presents all the
parameter estimates from the fitting of this model along with their respective standard
errors. The ratio of these two quantities is in effect a z score which may be used to test
whether the parameter differs significantly from zero. The parameter estimates
relating to variance and covariance are all virtually identical to those obtained from
model 2. Of immediate concern are the differences in factor means for the two groups.
These appear in the bottom row of Table 4. It is important to note that whilst the
means of the observed marks are higher for males, the means of the section specific
factors (SSA to SSD) are all higher for females, though the differences are slight and
fall well short of significance (for SSA z = 0.211, for SSB z = 0.086, for SSC z =
0.125, and for SSD z = 0.104). Thus the higher mean marks supervisors give to males
cannot be explained in terms of the greater merit of their projects. Rather the higher
marks of males are accounted for by their higher mean on MS1; here the difference
between males and females does reach significance (z = 2.177, p = .029). Whatever
influences are represented in MS1 are on average raising the marks of male candidates
relative to those of female candidates to a small but significant extent. MS2 also
operates to raise the marks of males relative to females; the strength of this effect is
only a little smaller than for MS1 but because of a larger standard error of estimate it
fails to reach significance (z = 1.09, p = .28).
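
These significance tests are simply two-tailed z tests of the ratio of estimate to standard error; a minimal sketch which reproduces the quoted p values from the quoted z ratios is:

    # Sketch: two-tailed p values for the z ratios (estimate / standard error)
    # quoted above for the male-female differences in factor means.
    from scipy.stats import norm

    def two_tailed_p(z):
        return 2 * norm.sf(abs(z))

    print(two_tailed_p(2.177))   # MS1 difference: should be close to .029
    print(two_tailed_p(1.09))    # MS2 difference: should be close to .28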

Discussion

The most important feature of these analyses lies in the satisfactory fit
obtained from the model presented in the Introduction, in the need to include factors
MS1 and possibly MS2 in order to obtain that fit, and in the proportion of mark
variance accounted for by these factors. Thus, for example, under the model whose
parameters are given in Table 4 the proportion of the total variance in the supervisor's
mark attributable to the influence of MS1 ranges from 17.1 per cent for the B mark
awarded to females up to 52.8 per cent for the D mark awarded to males with an
average of 26.5 per cent for females and 31.1 per cent for males. In the case of the
second marker the proportion of total variance due to MS2, which model 2 constrains
to be the same for males and females, is 2.4 per cent for the B mark, 18.7 per cent for
the C mark and 11.6 per cent for the D mark. What then do the factors MS1 and MS2
represent?

A number of influences which might contribute to these factors are mentioned in the
Introduction. Undoubtedly, the least interesting account of MS1 and MS2 is that
they merely reflect differences in stringency between the different individuals acting
as first and second markers. There are, however, good reasons for believing that this is
not a major component of MS1 and MS2. First, and most directly, it is possible to
estimate a factor score for each project on MS1 and MS2 and to see whether these
vary according to the marker. If differences in marker stringency are an important
component of MS1 and MS2 then some markers should be associated with
consistently high factor scores and some with consistently low factor scores. Although
records of marker identity were not available for the first cohort of students whose
data were included in the analysis, first markers were known for 190 projects and
second markers for 168 projects. Analyses of variance comparing MS1 scores and
MS2 scores across different markers yielded no hint of any significant differences
between markers (for first markers, F(23, 166) = 1.26, p = .20; for second markers,
F(23, 144) = 0.96, p = .52).
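
A check of this kind can be sketched directly (the names ms1_scores and first_marker_ids are hypothetical stand-ins for the estimated factor scores and the corresponding marker identities, which are not reproduced here):

    # Sketch: one-way analysis of variance of estimated MS1 factor scores across
    # first markers, of the kind described above.
    from collections import defaultdict
    from scipy.stats import f_oneway

    def factor_score_anova(ms1_scores, first_marker_ids):
        by_marker = defaultdict(list)
        for score, marker in zip(ms1_scores, first_marker_ids):
            by_marker[marker].append(score)
        # Systematic differences in marker stringency would appear as significant
        # between-marker differences in the mean factor score.
        return f_oneway(*by_marker.values())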

The second reason for discounting differences in marker stringency as the main
explanation, for MS1 especially, is that there is evidence that they are simply not
large enough to have the impact which MS1 has. Newstead & Dennis (1994) report a
study of the marking of examination answers by much the same group of markers as
were involved here using the same marking scale. In terms of raw marks the standard
deviation of marker means in that study was 2.32. Relating this to the standard
deviations presented in Table 2 suggests that if differences in marker stringency here
are of the same magnitude they would on average account for about 11.5 per cent of
the mark variance. On this basis differences in marker stringency seem capable of
accounting for only about two-fifths of the impact of MS1, although they might be
large enough to account entirely for MS2. It is of course possible that differences in
marker stringency are much greater in project marking than in exam marking but this
seems unlikely given that markers work in a greater variety of partnerships when
marking projects and there is more opportunity for transfer of standards.
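
As a check on the arithmetic, the 'two-fifths' figure follows from comparing this 11.5 per cent with the average shares of mark variance attributed to MS1 above: 11.5/26.5 ≈ 0.43 for females and 11.5/31.1 ≈ 0.37 for males, both of which are roughly two-fifths.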

Third, differences in stringency should be equal for first and second markers
since essentially the same individuals are acting in both roles. In so far as the data
tend to point to a model in which the impact of MS1 is greater than that of MS2, the
first marker factor MS1 cannot wholly be attributed to variation in marker stringency.

Finally, differences in marker stringency should apply equally to both male
and female students. The finding that the influence of MS1 is moderated by student
gender therefore leads to the conclusion that it cannot be entirely attributable to this
source. Thus whilst a part of MS1 and perhaps the entirety of MS2 might be
attributable to section-transcendent differences in marker severity there are good
reasons for believing that some part of MS1, and probably the largest part, is not
attributable to this cause.

A second possibility raised in the introduction is that factors MS1 and MS2
reflect a particular marker's reaction to section-transcendent features of the student's
work such as writing style. If this is the case then markers are giving such features an
inappropriately high weighting. The marking guidelines refer to quite separate
features of the project for each section. It should be noted that all projects are word
processed or typed so quality of handwriting is not a candidate as a section-
transcendent feature. There are in any case reasons for being sceptical about any
explanation of this general type. MS1 loads not only on the marks awarded to the
sections of the written project but also on the supervisor's A mark given for the
conduct of the project over the year. It is difficult to see what sort of feature could
transcend both conduct of the work of an empirical project during the year and the
written report on the work. Again, in so far as the data favour a model in which MS1
has more impact than MS2, it is difficult to see why factors such as writing style
should affect the marks awarded by supervisors to a greater extent than they affect the
marks given by second markers, especially when it is recalled that it is the same group
of individuals who act in both roles. Finally, an account based on section-transcendent
features of the work offers no ready explanation of why the impact of MS1 is different
when the work being marked is that of a male rather than a female student, though it
is possible that some tortuous account based on stylistic differences in the work of the
two genders might be developed.

If explanations based on differences in marker stringency or reactions to
section-transcendent features of the work are discounted, then what remains is the
conclusion that when a mark is awarded to a section of the project, that mark is
inappropriately influenced by factors external to that section. These external factors
could be a carryover from other sections of the project or they could reflect an
influence of knowledge of the student external to the project report. That knowledge
might in turn be an assessment of the student's abilities based on evidence outside of
the student's project performance or, even less appropriately, may represent a reaction
to the student's other personal characteristics. The present data do not offer a great
deal of evidence with which to disentangle these possibilities. There is no reason to
expect that halo effects internal to the project report should be any greater for
supervisors than for second markers. Thus if the first marker effect is stronger this
would suggest that supervisors' B, C and D marks are being influenced by their
contact with the student over the course of the year rather than solely by the project
report. Although the data do not provide a particularly strong case against a model in
which MS1 and MS2 have equal influence it would be surprising if both markers were
susceptible to halo effects within the project report whilst supervisors were totally
impervious to effects from their prior knowledge of the student.

The conclusion that, in marking, supervisors are influenced by their prior knowledge
of the student is perhaps not a greatly surprising one, although there have
been few previous attempts to demonstrate or investigate it. One useful feature of the
application of covariance modelling to investigating such influences is that it provides
an estimate of the magnitude of the effect. Whilst the existence of the influence it
reveals is unsurprising, the strength of the effect which emerges may provide grounds
for both surprise and concern. It would clearly be useful to discover the extent to
which the influences in question are related to the student's personal characteristics
and the extent to which they are imported from aspects of academic performance
other than that nominally being assessed.

One personal characteristic to which the influences under consideration are to some
extent related is the student's gender. A model in which the factor MS1 was
allowed to have different path coefficients for males and females produced a
significant improvement in fit over one in which these coefficients were constrained
to equality. Inspection of Table 4 shows that the main difference in path coefficients
between males and females relates to the B and D marks and that on these the
coefficient of MS1 is greater for males than for females. This finding relates to the
results of Bradley (1984) on project marking. She found that second markers who
knew the student's gender tended to mark female projects less extremely than male
projects compared to supervisors. Bradley explained this by suggesting that the less
expert second markers displayed a bias whereby they were reluctant to give female
students extreme marks. As noted previously, if this sort of mechanism were at work
in the present data the path coefficients from SSB to SSD to the second marker's marks
should be smaller for female students than for males. There is no evidence for this.

This may be unsurprising given that a previous study in the department from
which the present data come failed to replicate Bradley's effect (Newstead & Dennis,
1990). Newstead & Dennis (1990) suggested that an alternative explanation of
Bradley's results might lie in biases shown by supervisors concerning individual
students. Such biases would inject variance into the supervisor's marks tending to
make them more extreme than the second marker's marks. If the biases concerning
individual students were stronger for females this could then explain the pattern of
data observed by Bradley. In so far as they are consistent with the existence of quite a
strong influence on supervisors' marks coming from reactions to individual students
the present results are compatible with the proposal made by Newstead & Dennis.
However, the direction of the gender effect in the present data is opposite to that
which would be necessary to explain Bradley's data.

Possible reasons for the discrepancy between Bradley's findings and those of
Newstead & Dennis have been extensively rehearsed and the present results can only
make a very limited contribution to resolving that debate. The present findings might
perhaps most easily be reconciled with those of Bradley by suggesting that there are
biases relating to individual students which may have a differential impact on the two
genders but that whether this happens and which gender is most affected varies
according to factors such as the type of material being marked, the population of
students involved, and the particular set of markers. It is noteworthy that in this study
the greater impact on male marks was restricted to the marking of the more discursive
introduction and discussion sections. Another possibility is that gender is not the true
variable underlying these effects but some other variable which was correlated with it
in the samples studied, and which in particular was correlated in opposite directions in
the present sample and in Bradley's sample.

The pattern which is evident in Table 2 whereby the marks given to males
generally show greater variability than those of females, especially in the first
marker's marking, is an example of a more general finding which has provoked
considerable debate (Rudd, 1984). Is the fact that females seem to obtain less extreme
marks and fewer first-class and third-class degrees a reflection of truly different
patterns of performance in the two genders or does it reflect a difference in the way
their work is marked? In the present case the difference in variance between male and
female marks arises from the influence of MS1. Thus in this case the evidence
inclines towards the position that the greater variance of male marks arises primarily
from sources which influence all of the marks awarded by the first marker. Obviously
this need not imply that all examples of mark variance being greater in males than in
females arise in the same way.

One advantage of the method used here, and in particular the use of a
structured means model, is that it makes it possible to obtain more information about
how differences in means between groups arise. In this case the difference in means
between male and female marks seems to arise in the factor MS1 (and possibly also
MS2) but not in the section specific factors, with the non-significant effect on SSA to
SSD taking the form of a female superiority. It is important to recognize that this is
quite a small effect relative to the variance of MS1. Thus although the average effect
of MS1 is to favour males slightly, there will be some males as well as some females
where it acts quite strongly to reduce marks. Having said this it is difficult to escape
the conclusion that the sex difference in MS1 reflects an influence of the personal
knowledge which the marker gains of the student during the year and that in this
sample that influence was acting in a manner which on average favoured male
students over female students. If it has any degree of generality this is a finding with
significant implications for the marking process.

The data and modelling reported here also have implications for the reliability
of marking. It is evident from the correlations in Table 2 that inter-marker agreement
on each section is only very moderate. This outcome is not surprising in relation to
other data on the reliability of degree level assessment (Byrne, 1980; Cox, 1967;
Laming, 1990; Newstead & Dennis, 1994). However, given the weight attached to the
project in many degree schemes it might have been hoped that it would have a higher
level of reliability than a single exam answer or even an exam paper.

In the present case it is clear that much of the disagreement between markers
derives not from a random influence on their marks but rather from a consistent but
marker-specific influence on the marks awarded to a particular student. This has
important implications for the extent to which poor reliability on individual elements
of assessment is overcome by averaging. Thus, for example, if an overall mark for the
projects involved here were calculated by averaging the section marks, the agreement
between supervisors and second markers on the project average would not be greatly
superior to their correlation on individual sections. Authors who have found only
modest reliability in exam marking in higher education have sometimes taken comfort
from the benefits of averaging over large numbers of assessments (e.g. Newstead &
Dennis, 1994). The present results suggest that it may be unwise to assume that
averaging is satisfactorily dealing with the problems of unreliability.
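
The point can be made explicit under the model used here (a sketch in the notation introduced earlier). If marker k's overall project mark is the average of n section marks, then

    \bar{X}_k = \tfrac{1}{n}\sum_s \lambda_{sk}\,SS_s + \bar{\gamma}_k\,MS_k + \tfrac{1}{n}\sum_s E_{sk}

where \bar{\gamma}_k is the average of the MS loadings. The purely random components are damped by the averaging, but the marker specific component MS_k enters every section mark and is carried into the average undiminished, so the agreement between the two markers' averages remains limited by the MS variance however many elements are combined.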

Table 4 shows some differences between markers and sections in the error
component of mark variance. Interpretations of these differences are necessarily
speculative. Supervisors show greater error variance than second markers on the mark
for the introduction. Error variance here really means variance that is both marker and
section specific. It may be that it is in marking the introduction that the supervisor's
specialist knowledge of the literature is most relevant and hence it is here that their
mark is most idiosyncratic.

The data analysed in this study derive from a single department, and a
relatively small group of markers was involved. In view of this it would be unwise to
reach overly strong conclusions about marking in general. The situation examined in
which a supervisor teaches a student on an individual basis for a year and is then
required to assess the work around which all their meetings have centred is perhaps
unusually susceptible to the supervisor developing biases towards the student.
However, the strength of the effects detected does suggest that the influence of
markers' personal knowledge of the individuals whose work they are marking
deserves considerably more attention than it has previously received. If the results
reported here do have any generality then there is a strong case for avoiding, as far as
possible, the assessment of work by those who know the student. Student gender
appears to moderate the influence of the marker's personal knowledge of the student
and although the effect is not enormous it again gives cause for concern and the
generality of the effect warrants further investigation. Student gender was the only
personal characteristic of the students which was considered in this study and it may
well be that there are other characteristics which have a similar or even a larger
influence; this too deserves further study. Whilst caution is advisable in generalizing
the conclusions of this study, what may more usefully be generalized are its methods.
Whilst there are ambiguities of interpretation, which have been discussed, the use of
structural equation modelling provides a valuable new handle on marking bias. An
accumulation of studies using the approach illustrated here could substantially
improve our knowledge of the nature and magnitude of marking biases.
