Education is a word that we come across every day and an aspect that has a great
impact on our social, economic, and even psychological status. The history of
education can be traced back to human origins and humanity's never-ending passion
for knowledge.
The word education is derived from the Latin educare (with a short u),
meaning "to raise", "to bring up", "to train", or "to rear". In recent
times, there has been an alternative assertion that education derives from a
different verb, educere (with a long u), meaning "to lead out" or "to lead forth".
Definition of education
Education has been defined from a multitude of perspectives and the real
essence of education is even today the subject of critical deliberations.
The knowledge or skill obtained or developed by a learning process.
History of Education
The history of education is both long and short. In 1994, Dieter Lenzen,
president of the Freie Universität Berlin and an authority in the field of education, said
"education began either millions of years ago or at the end of 1770". This quote by
Lenzen includes the idea that education as a science cannot be separated from the
educational traditions that existed before.
When cultures began to extend their knowledge beyond the basic skills of
communicating, trading, gathering food, religious practices, etc., formal education, and
schooling, eventually followed. Schooling in this sense was already in place in Egypt
between 3000 and 500 BC.
In ancient India there were centres of higher education at the Nalanda, Takshashila,
Ujjain, and Vikramshila universities. Art, Architecture, Painting, Logic, Grammar,
Philosophy, Astronomy, Literature, Buddhism, Hinduism, Arthashastra (Economics and
Politics), Law, and Medicine were among the subjects taught, and each university
specialized in a particular field of study. Takshashila specialized in the study of
medicine, while Ujjain laid emphasis on astronomy. Nalanda, the biggest centre,
handled all branches of knowledge and housed up to 10,000 students at its peak.
British records show that education was widespread in the 18th century, with a school
for every temple, mosque or village in most regions of the country. The subjects
taught included Reading, Writing, Arithmetic, Theology, Law, Astronomy, Metaphysics,
Ethics, Medical Science and Religion. The schools were attended by students from all
classes of society. The current system of education, with its western style and
content, was introduced and founded by the British in the 19th century, following
the recommendations of Macaulay. Traditional structures were not recognized by the
British government and have been in decline since. Gandhi is said to have
described the traditional educational system as a beautiful tree that was destroyed
during British rule.
Assessment of Education
Assessment is the process of documenting, usually in measurable terms, knowledge,
skills, attitudes and beliefs.
History of assessment
Assessment dates back to imperial China, where examinations selected candidates for
the civil service and knowledge and skills demonstrations were used to measure
practical abilities.
University of Paris first introduced formal examinations during the 12 Century. These
exams were theological oral disputations. Questions were known in advance,
requiring students to memorise and regurgitate answers. In the 1740s, Cambridge
University began using (oral) examinations to compare students, similar to the earlier
Chinese tests. During the 18th Century, Cambridge and Oxford began testing students'
mathematical abilities using written tests and thereafter the use of paper for
assessment spread to all subjects. The Unitied States introduced formal written
examinations in the 1830s in an attempt to reduce the subjectivity of assessment.
Horace Mann introduced written tests in the Boston Public Schools to compare school
performance. However, the United States main contribution to the history of testing
came during the First World War when the US Army introduced large scale IQ testing
to assign massive numbers of recruits to positions within the Army. The Army Alpha,
as it was known, consisted of multiple choice questions and was administered to over
two million recruits.
Types
Assessments can be classified in many different ways. The most important
distinctions are:
(1) formative and summative;
(2) objective and subjective;
(3) criterion-referenced and norm-referenced; and
(4) informal and formal.
Formative assessment is generally carried out throughout a course, providing
feedback on a student's work, and would not necessarily be used for grading
purposes.
Norm-referenced assessment ranks each student against the rest of the cohort and is
effectively a way of comparing students. The IQ test is the best-known example of
norm-referenced assessment. Many entrance tests (to prestigious schools or
universities) are norm-referenced, permitting a fixed proportion of students to pass
(“passing” in this context means being accepted into the school or university rather
than an explicit level of ability). This means that standards may vary from year to
year, depending on the quality of the cohort; criterion-referenced assessment does not
vary from year to year (unless the criteria change).
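The practical difference between the two referencing schemes is where the cutoff comes from, and it can be sketched in a few lines of Python. This is a hypothetical illustration: the scores, cutoff, and pass fraction below are invented, not drawn from any real test.

```python
# Hypothetical sketch: the scores, cutoff, and pass fraction are invented.
scores = [42, 55, 61, 68, 70, 74, 78, 83, 88, 95]

def criterion_referenced(scores, cutoff=70):
    # Pass everyone who meets a fixed standard; the pass rate varies by cohort.
    return [s for s in scores if s >= cutoff]

def norm_referenced(scores, pass_fraction=0.3):
    # Pass a fixed proportion of the cohort; the effective cutoff varies by year.
    n_pass = max(1, round(len(scores) * pass_fraction))
    return sorted(scores, reverse=True)[:n_pass]

print(criterion_referenced(scores))  # → [70, 74, 78, 83, 88, 95]
print(norm_referenced(scores))       # → [95, 88, 83]
```

With a stronger cohort, the criterion-referenced list grows while the norm-referenced list stays the same size, which is exactly why norm-referenced standards drift from year to year.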
Informal and formal
Standards of quality
Testing standards
These standards cover psychological testing and assessment, educational testing and
assessment, testing in employment and credentialing, and testing in program
evaluation and public policy.
Evaluation standards
Each publication presents and elaborates a set of standards for use in a variety
of educational settings. The standards provide guidelines for designing, implementing,
assessing and improving the identified form of evaluation. Each of the standards has
been placed in one of four fundamental categories to promote educational evaluations
that are proper, useful, feasible, and accurate. In these sets of standards, validity and
reliability considerations are covered under the accuracy topic. For example, the
student accuracy standards help ensure that student evaluations will provide sound,
accurate, and credible information about student learning and performance.
A good assessment has both validity and reliability, plus the other quality
attributes noted above for a specific context and purpose. In practice, an assessment is
rarely totally valid or totally reliable. A ruler which is marked wrong will always give
the same (wrong) measurements. It is very reliable, but not very valid. Asking random
individuals to tell the time without looking at a clock or watch is sometimes used as
an example of an assessment which is valid, but not reliable. The answers will vary
between individuals, but the average answer is probably close to the actual time. In
many fields, such as medical research, educational testing, and psychology, there will
often be a trade-off between reliability and validity. A history test written for high
validity will have many essay and fill-in-the-blank questions. It will be a good
measure of mastery of the subject, but difficult to score completely accurately. A
history test written for high reliability will be entirely multiple choice. It isn't as good
at measuring knowledge of history, but can easily be scored with great precision.
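The ruler and clock examples can be made concrete with a small simulation. The numbers here (a 5-unit bias for the ruler, a spread of 15 for the guesses) are illustrative assumptions, not data from the text:

```python
import random
import statistics

random.seed(0)
true_value = 100.0  # the quantity being measured

# A mis-marked ruler: perfectly reliable (same answer every time) but not valid.
ruler_readings = [true_value + 5.0 for _ in range(20)]

# Random individuals guessing: roughly valid on average but unreliable.
guesses = [random.gauss(true_value, 15.0) for _ in range(20)]

print(statistics.stdev(ruler_readings))  # 0.0: no spread at all
print(statistics.mean(ruler_readings))   # 105.0: consistently wrong
print(statistics.mean(guesses))          # near 100: close to the truth on average
print(statistics.stdev(guesses))         # large: individual answers vary widely
```

Reliability shows up as low spread, validity as a mean near the true value; the two are independent, which is the point of the example.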
Controversy
The assessments that have caused the most controversy are high school graduation
examinations, which first appeared in support of the now-defunct Certificate of
Initial Mastery and which can be used to deny diplomas to students who do not meet
high standards. Critics argue that one measure should not be the sole determinant of
success or failure. Technical notes for standards-based assessments such as
Washington's WASL warn that such tests lack the reliability needed to use scores for
individual decisions, yet the state legislature passed a law requiring that the
WASL be used for just such a purpose. Others, such as Washington State University's
Don Orlich, question the use of test items far beyond standard cognitive levels for
the ages being tested, and the use of expensive, holistically graded tests to measure
the quality of both the system and individuals for very large numbers of students.
High-stakes tests, even when they do not invoke punishment, have been cited
as causing sickness and anxiety in students and teachers, and as narrowing the
curriculum towards test preparation. In an exercise designed to make children
comfortable about testing, a Spokane, Washington newspaper published one student's
drawing of a monster that feeds on fear, produced when she was asked to draw what she
thought of the state assessment. This, however, is thought to be acceptable if it
increases student learning outcomes.
Standardized multiple-choice tests do not conform to the latest education
standards. Nevertheless, they are much less expensive, less prone to disagreement
between scorers, and can be scored quickly enough to be returned before the end of
the school year. Legislation such as No Child Left Behind also defines a school as
failing if it does not show improvement from year to year, even if the school is
already successful. The use of IQ tests for educational decisions has been banned in
some states, and norm-referenced tests have been criticized for bias against
minorities. Yet the use of standards-based assessments to make high-stakes decisions,
with the greatest impact falling on low-scoring ethnic groups, is widely supported by
education officials because such assessments reveal the achievement gap that
standards-based education reform promises to close. Many states are currently
using testing practices that have been condemned by dissenting education experts
such as FairTest and Alfie Kohn.
The existence of several universities and recognized school boards in India
makes an objective comparison of the percentage grades awarded by one examination
with those of another difficult, even for examinations at the same level. At the
school level, percentages of 80-90 are considered excellent, while above 90 is
exceptional and uncommon. At the university level, however, percentages between 70
and 80 are considered excellent and are quite difficult to obtain. It should be
pointed out that marking percentages vary from one university to another, which makes
direct comparison of percentages obtained at different universities difficult.
Evaluation in schools
Until 1994, schools up to the tenth standard followed the marking, or
ranking, system, in which each student's overall marks were calculated and his or her
rank relative to others was released. However, the system had some accompanying
problems contrary to the interest of the students, such as:
Unhealthy competition
Increased parental pressure
Heightened tension levels in students
Introduction of Grading
Owing to the inherent flaws in the ranking system, and the persistent demand
from academic and social circles that the outdated and anti-student system be
replaced in the best interest of the students, the government decided to change the
system of evaluation from ranking to grading on a stage-by-stage basis, rolled out
from 1994 and completed by 2003.
However, it is not known whether the change achieved the expected results.
This study aims to analyse whether the introduction of the grading system has
achieved the major objectives that justified its introduction.
Review of literature - Journals
This chapter presents some of the studies and journal articles that have been
published; it also helps the researcher to concentrate on the main aspects and to
avoid duplication.
ALTHOUGH GRADING is undoubtedly at or near the top of our "to do" lists,
it is not the recurrent nature of this task that prompted the topic for this column. As
the articles in this journal testify, business communication as a discipline has been
changing. Most of us are not teaching the same course in the same way that we did 15
or 10 or even five years ago. But are we still grading in the same way? If content,
teaching methods, and delivery systems have changed, has grading also changed? And
if so, how? This column addresses those questions.
In the second article, Marilyn Dyrud has done just that, advocating an holistic
method for evaluating assignments. Reflecting the evaluation procedures used in
business for determining the success of a written communication, the system assigns
numbers (0, 1, or 2) for unacceptable, acceptable, and excellent work respectively. As
the article explains, this system not only reduces the time commitment of instructors,
but also encourages students to take ownership of their written work.
Although her approach differs somewhat, Nancy Schullery also adopts the
holistic approach to grading, using seven foundation concepts that are essential to
every business communication, regardless of genre. Such assessment, she argues,
more clearly simulates the response of supervisors, clients, customers, and other
readers in the business world.
LeeAnne Kryder, in the final article of this column, explains the rubrics she
has developed to assure some degree of consistency in grading among instructors and
teaching assistants in various sections of the same writing course. These rubrics she
finds particularly useful for evaluating individual student performance in group
projects.
If there is a consistent thread among these four articles, it is this: the quality of
the evaluation is what matters, not the amount of time instructors spend grading.
Regardless of the methods they use, all of the authors represented here are searching
for ways to provide meaningful and impartial evaluation of student writing that
encourages learning and rewards excellence.
Joel P. Bowman
Western Michigan University, Kalamazoo
ANYONE WHO HAS SEEN the popular children's TV show "Sesame Street"
knows that Kermit the Frog worried about being green. Although Kermit had a
different metaphor in mind, those of us teaching business communication in an online
format also face the issue of being "green." Classes taught over the Internet are
relatively new, and online instructors are having to learn--often the hard way--how to
take full advantage of electronic delivery to provide good instruction and effective
feedback on student work.
Until the advent of e-mail that permitted attaching formatted files, students
submitted work for evaluation in essentially the same way: as paper documents to be
marked and returned by the instructor. Most of us currently teaching business
communication learned what we know about marking student papers from our
instructors, who typically used a combination of marginal notations and standard
proofreader's marks. As instructors, we probably use essentially the same basic
approach to provide feedback on student work and justify our evaluation of that work.
Until the advent of the Internet, the procedures were basically the same for
those completing courses at a distance. Assignments arrived on paper and were
marked and returned, whether by mail (typical of "correspondence" courses) or by
fax, as video-based delivery became increasingly common. With the advent of the
Internet and submission of work by e-mail attachment, everything changed.
Over the past three years I have taught eight sections of a standard business
communication course using Web-based, Internet technology for delivery of
information. While most of my students have been within easy driving distance of
campus and were also taking classes on campus, a significant number have been
several hundred miles away. A few have been several thousand miles away. Many
were working full time and elected to take the online version of the class because of
the flexibility it afforded. Some enrolled out of curiosity, knowing that online courses
were proliferating. A few enrolled expecting it to be easier than the traditional version
of the course because of the absence of regularly scheduled classes.
In every semester that I have taught online classes, technical difficulties have
created problems. Servers have a tendency to fail without notice, and the class
conference and/or e-mail service may not be available for hours--and, on occasion, for
days. E-mail server software may strip the contents from attachments, so while a
document is sent and received, the contents are missing. Such challenges present one
of the principal differences in the evaluation of online students. To be successful in an
online environment, both faculty and students need greater flexibility and
perseverance. Online classes, like all courses taught at a distance, require more
attention to planning so that students know exactly what is expected of them at the
beginning of the semester to ensure that they can plan around their work schedules
and complete the assignments by their due dates.
Before preparing their solutions to the cases, students use the class conference
to ask questions and post sample solutions for my comments. Students earn points for
their participation in the conference. I evaluate entries for class relevance (students
may use the conference to discuss other topics of interest), spelling, grammar,
mechanics, and formatting. I also award extra credit to students who are the first to
find and report errors in my postings. At regular intervals, I send each student a record
of his or her postings indicating problems with the postings that resulted in a loss of
points for the evaluation period. Students may compensate for problems in an
evaluation period by increasing their level of participation in the next.
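The conference-participation bookkeeping described above can be sketched as follows. The class name, point values, and problem categories are illustrative assumptions, not the author's actual scheme:

```python
# Hypothetical bookkeeping for the participation scheme described above;
# the point values and category names are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class ParticipationRecord:
    points: int = 0
    problems: list = field(default_factory=list)

    def post(self, relevant=True, error_free=True, found_instructor_error=False):
        earned = 1 if relevant else 0
        if not error_free:
            # Spelling, grammar, mechanics, or formatting problems cost the credit
            # for that posting and are noted for the periodic report to the student.
            self.problems.append("mechanics")
            earned = 0
        if found_instructor_error:
            earned += 1  # extra credit for reporting an error in an instructor posting
        self.points += earned

record = ParticipationRecord()
record.post()                             # a clean, relevant posting
record.post(error_free=False)             # earns nothing; problem recorded
record.post(found_instructor_error=True)  # clean posting plus extra credit
print(record.points)    # → 3
print(record.problems)  # → ['mechanics']
```

Because points accrue per posting, a student can make up for a weak evaluation period simply by posting more (and more carefully) in the next one, which is the compensation mechanism the text describes.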
For the short documents, online discussion of the cases and their possible
solutions begins the week before the case is due. Students submit the drafts of their
cases on Monday. I return the marked drafts on Tuesday. The final versions of the
assignments are due Friday, and I return those by Sunday to ensure that students know
how well they have done before submitting the drafts of their solutions to the next
case.
The feedback on formatted documents, both the draft and final versions of
solutions to the cases, requires a slightly different strategy from that most of us
learned to use for documents submitted on paper. On paper, if a student has a problem
with comma usage, it is a simple matter to mark problems with usage and draw lines
to general comments about the need to review and what to look for in the review. This
isn't so easy to do with documents that never see paper. Microsoft Word (and other
word processing programs) allow for highlighting and changing text color. It is also
possible to use a "draw" function to show connections between related elements, but
imitating the kind of feedback possible with pen-on-paper is both time consuming and
awkward when done electronically.
I elected not to use the "track changes" function in Microsoft Word, preferring
to save copies of the drafts returned with my comments for comparison with the final
submissions. Much of the feedback on the draft was commentary about language
usage (such as explaining why a modifier "dangled"), business writing style (such as
commenting on message structure and tone), and the need to follow special
instructions for the case (such as using a numbered list if the directions said to do so).
Tracking the changes did not adequately allow for the revisions most students were
having to make between the draft and the final versions of their solutions and actually
added to the difficulty of evaluating the final copies.
Providing this kind of running commentary on the cases has both advantages
and disadvantages. The principal advantage is that documents submitted electronically
encourage instructors to provide more comprehensive, explanatory feedback than is
typical for paper submissions. The principal disadvantage is that providing such
feedback for each student individually takes more time than is required for marking
papers and returning them in class, where the explanation for the cryptic comments on
paper can be provided orally to all the students at once.
Bottom Lines
Online classes tend to have higher attrition. Students enroll but drop the course
early in the semester or simply stop doing the work. The mixture of highly motivated,
often nontraditional students and those who enrolled expecting the course to require
less time and effort typically results in a bimodal distribution of grades, with a majority
of the students doing very well or very poorly.
In the history of education, online classes are new, and we have yet to
determine how to take full advantage of the technology. The natural inclination is to
try to imitate the classroom environment. The extraordinary family therapist Virginia
Satir referred to this inclination as "the lure of the familiar." Even when we
experiment, we tend to do what we have always done.
Because we are most familiar with the traditional classroom setting, we tend to
assume that it is the "gold standard" for educational delivery and seek to replicate
what we see as the advantages of that setting even as we use new technology. Those
of us who teach so-called "upper division" classes, however, might pause to consider
how much our students actually learned in their previous traditional classes. It may be
time to reinvent the wheel. As changing cultural needs continue to push us in the
direction of "any time, anywhere" delivery of education, what we are learning now
about how to evaluate student performance in an online environment may well
provide the foundation for new strategies of teaching and learning.
ONCE--AND ONLY ONCE--I calculated how many papers I graded during an
academic year. With four writing classes averaging 25 students each, the total was a
staggering 4,500 papers. And, since all of my writing classes are process-oriented, the
real total was at least double that. I was, I thought, a very thorough and conscientious
grader, circling mechanical errors, rewriting wretched sentences, and carefully
marking each mistake as awk, pn, sp, ro, frag, and a plethora of other hieroglyphs. I
was also putting in 10-hour work days and, on occasion, waking up at 4 a.m. so I
could finish grading before classes met. Obviously, things needed to change.
It was a small step from Stratton's book and further discussions with colleagues
to a new and improved grading system, one that would be more efficient, help prepare
students for the writing they would produce professionally, and encourage revision.
Simplicity
In the olden days, before 0-1-2, I tended to edit and revise my students' papers,
accompanied by a rather unwieldy grading sheet. While the students liked the
expository essay grading sheet primarily because it looked organized and "identified"
all of their mistakes, from an instructor perspective, it was fraught with peril. Under
mechanics, for example, 7 points are allowed for spelling. But what if a student made
10 errors in their paper? Does that student owe points? Is 6 points adequate to assess
logic in a persuasive paper? How much "freshness" can we expect from a newly
minted high school graduate? What if a student submitted a paper that was
mechanically and structurally excellent but devoid of meaning? Using this grading
form, an instructor can subtract only 8 points; this well-written but vapid paper might
earn an "A."
Using 0-1-2, most of these problems are non-existent because instructors read
the papers holistically, which results in simplified grading criteria and fewer editing
remarks. The figure is my current grading criteria handout for students in my business
correspondence class. Criteria will vary according to class; a business writing class,
for example, may place more emphasis on audience and formatting than a
composition class.
The 0-1-2 stands for unacceptable, acceptable, and excellent. Students who
receive 0s and 1s may revise their work as often as they wish within a specified time.
In my more-than-a-decade of using this system, I have witnessed much better writing
as a result of revision, and the students willingly revise because everyone has the
potential to earn an "A".
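A minimal sketch of the 0-1-2 scheme follows. The criterion names here are invented for illustration and are not taken from the actual grading handout described in the text:

```python
# A minimal sketch of the 0-1-2 system; the criterion names are invented
# and not taken from the actual grading handout described in the text.
CRITERIA = ["content", "organization", "audience", "mechanics", "formatting"]

def holistic_score(marks):
    # marks maps each criterion to 0 (unacceptable), 1 (acceptable), or 2 (excellent)
    assert set(marks) == set(CRITERIA)
    return sum(marks.values())

def may_revise(marks):
    # Students who receive any 0s or 1s may revise within the allowed window.
    return any(v < 2 for v in marks.values())

draft = {"content": 1, "organization": 2, "audience": 0,
         "mechanics": 2, "formatting": 1}
print(holistic_score(draft))  # → 6 of a possible 10
print(may_revise(draft))      # → True
```

Note how the scheme sidesteps the negative-points problem of a 100-point deduction sheet: no criterion can lose more than its two available points, and any non-excellent mark simply signals an invitation to revise.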
Efficiency
Just this simple change has resulted in massive time savings. In the past, using
a grading form and editing comments, it took me about 90 minutes to grade a set of
business letters; now, I spend less than an hour. To score longer reports, I used to
spend about 45 minutes per report; now, it takes about 20.
While 0-1-2 speeds up the process, it does not reduce the quality of evaluation.
My students still receive ample feedback for improvement. It is, in fact, a form of
holistic assessment, an evaluation system widely used to score student work: the
Educational Testing Service uses holistic methods to assess SAT and GRE exams, and
the State of Oregon, which has significantly revised K-12 standards, uses it to score
for minimal competencies. Furthermore, numerous studies on holistic assessment
praise its reliability and note its versatility.
Equity
A third virtue of 0-1-2 is equity. With so few points available, there is little
room for instructor larceny, which, sadly, certainly does occur. In a traditional
100-point system, it is easy to mark down a paper that, for example, might disagree
with the instructor's political philosophy. But with only 2 points available,
arbitrary deductions cannot occur.
Student Satisfaction
* "It is great that you allow rewrites because you learn a lot from your mistakes. Keep
it the same so others will get as much from this class as I did."
* "Opportunity to rewrite papers; fair chance to get the grade you desire."
Conclusion
For instructors, 0-1-2 offers the luxury of more time, substantially reducing
the many hours of evaluation required for writing classes. Those extra hours can be
applied to other tasks, such as course preparation or professional development. More
importantly, though, it changes the instructor's role from judge to coach, since the
primary goal is to produce an excellent piece of work. As Toby Fulwiler, University of
Vermont, explains, his role as a writing teacher has "little to do with teaching students
about semicolons, dangling modifiers, or predicate nominatives and a lot to do with
changing their attitude towards writing in general so they would care about it and
maybe learn to do it better".
I define effectiveness as satisfying the following criteria: the writer has to
understand the complexity of the context and approach it in a way that respects the
viewpoint of the audience. The writer must demonstrate an appreciation of the
audience's viewpoint by using language that the audience understands and by
considering the relevant content in a way that relates to audience needs. The writer
must arrange the content in a strategically functional way, using an organization that
allows the reader (and any secondary audiences) to understand and accept the writer's
goals. Ideally, the format/design will invite the reader in to read the rest of the
document and allow the reader to make optimum use of the information. The content
must include enough detail to help the reader understand the situation from the
writer's perspective, yet include no irrelevant or prejudicial information. Thus, the
writer's self-presentation must be positive, giving the impression of a competent,
responsible, fair-minded business person speaking for a reasonable organization.
These seven foundation concepts (italicized above) are so essential and tightly
interwoven that they all must be incorporated into any written assignment, along with
the standard conventions of American English. These fundamentals are applicable
across the board, whether the student is writing a negative message, a persuasive
message, a research report, or an application letter and resume, each of which has its
own unique elements (as discussed in business communication textbooks). It is the
successful implementation of both the foundation concepts and the unique elements
that leads to effective accomplishment of purpose.
Effectiveness Criterion
Students are to assume that each paper is to be judged by that paper's intended
audience. In other words, when I do the grading, I will look at their papers through the
eyes of their workplace supervisor, customer/client, colleague, or potential employer.
As the assigned audience would react to the writer's document, so I also will react.
How do I do that? First, I make a quick preliminary judgment. For example,
we are told that prospective employers typically spend only 15-30 seconds reading
any single resume before making a preliminary (and sometimes final!) decision. I
strive for such a quick initial reading. My preliminary reaction may range from "this
is pretty good" to "oh, s/he's missed the point entirely." This initial judgment leads to
an initial valuation as an "A," "B," etc. grade, which I note in pencil and expect to
modify somewhat after a second, more careful analytical reading.
The second reading is where the real work is done. I identify, from the
perspective of the target audience, any of the seven fundamental concepts that the
student has implemented either particularly well or poorly. The good points are noted
in the margin to reinforce learning. The problems are also noted in the margin, but
always with specific instructions for improvement. Providing sufficient quality
feedback to the students is critical. I want students to understand the reader's likely
reaction to the document, so my comments are numerous. Some comments are
cryptic, such as "positive?" or "unclear." Others offer brief instruction (not editing)
toward a more appropriate direction, such as "rephrase in 'you' attitude" or "frame
more positively." Whenever possible, I point out important readers' perspectives that
the student has overlooked, such as, "could be read as defensive." Also, a few
summary statements at the bottom are given, especially early in the semester, when
positive reinforcement and motivation are most crucial. In keeping with my
employer's-eyes role, I attempt in my comments to both give the student credit for
what was done well and identify areas where improvement is needed or the purpose
has not been accomplished.
Thus, a paper that makes a very favorable first impression and has positive
comments regarding all, or most, of the fundamental concepts, and only a few minor
shortcomings, is judged excellent and the penciled-in grade is made an "A." One that
is "pretty good" but doesn't make the "A" grade due to falling short on one or more of
the criteria will receive a "B" based both on the seriousness of the errors and the
strength of the correct part of the paper in relation to previous papers I have read. A
"C" paper has potential, but lacks overall effectiveness due to some combination of
missing ingredients. Those papers that either miss the point entirely, ignore the basic
concepts or the key elements of the assignment, or display ignorance or carelessness
with the basics of English punctuation, spelling, or grammar are, fortunately, rather
rare and earn a "D" very quickly.
Conclusion
Such an holistic approach contrasts with the use of a grading rubric, which is
the more common path to efficiency in grading. It is my observation that rubrics tend
to grow over time in response to students' creativity, becoming overly complex and
losing sight of the big picture. More important, I am concerned that the typical rubric
approach of points off for every error may be unfair and, possibly, do more harm than
good. For example, a common problem is repeated point deductions for the same type
of error (e.g., comma usage). To my mind, an error of a single type repeated
ten times is less objectionable than ten different errors. Further, sometimes a specific
technical error really does little to reduce the effectiveness of the message.
In contrast, the holistic method's focus on the larger goal, together with
explicit recognition of students' strengths, is an effective motivational tool, providing
students with a sense of building on a base toward greater mastery. While such
grading is not particularly efficient, I find that it does not take too much more time
than grading with a rubric. Of course, writing comments takes time. However, I
believe that such comments--reasonably prioritized and including genuine positives--
are both immediately helpful to the students and build goodwill that sets the stage for
future instructional success. This grading process does not end with the individual
papers; feedback is important also at the classroom level. As I return the papers, I give
positive reinforcement to the class by briefly explaining how they have generally
handled the concepts and showing a few exemplary papers in which the key elements
were masterfully treated. These are shown without names, to avoid heaping praise on
any individual student and inviting the suspicion of a teacher's favorite. I also describe
(but don't show examples of!) problems that seemed to be common. In this manner,
the grading process is transformed from a pure chore into an extension of the teaching
process, which, for me, makes it all worthwhile.
Although such holistic grading may be criticized for not enforcing absolute
standards with respect to details, I believe that its focus on the big picture helps
motivate students to strive toward improvement in all aspects of the subject. The
method is akin to the industry practice of rewarding employees who do something
well with an "atta boy." Industry asks employees to work toward goals, and then
measures the degree of goal attainment in a performance evaluation (e.g., meeting
sales targets or budget allocations), in effect awarding points rather than deducting
points. It is my experience that students respond very favorably to any efforts that
convincingly connect their classroom subject matter with their own present or future
employment success. I recommend the method as one that helps motivate students
both to learn the principles of business communication and to master its skills.
Endnote
The foundation concepts noted here are drawn from E. A. Hoger & N. M.
Schullery (2001), Core Concepts for Business Communication, provided to instructors
of the business communication course taught at the Haworth College of Business,
Western Michigan University. The college does not mandate holistic grading.
Grading Rubrics
My rubrics generally reflect five categories of effective business writing:
organization and content, visual communication, concise and varied word choice,
mechanics, and format (the particular genre conventions for the memo, letter, and
report). For each category I determine the percentage of points allotted. Depending on
what elements are being emphasized, points vary among categories in each
assignment. For an assignment calling for a resume, application letter, and brief memo
(explaining the job or organization targeted by the letter), I allot the following points:
* Word Choice/Tone 1 pt
* Format 1 pt
Previously, I agonized over whether a document was a "B" or a "B+"; now I
focus attention on each category and let the numbers as totaled "tell me" if the
document is a "B" or a "B+."
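The totaling step described above can be sketched as follows. The five category names come from the text, but the point allotments and letter-grade cutoffs here are invented for illustration and are not the author's actual scheme.

```python
# Hypothetical rubric: category names are from the text; the point
# allotments and cutoffs below are illustrative assumptions only.
CATEGORIES = {
    "organization and content": 5,
    "visual communication": 2,
    "concise and varied word choice": 1,
    "mechanics": 1,
    "format": 1,
}

def letter_grade(scores):
    """Total the per-category points earned and map the percentage
    of available points to a letter grade."""
    total = sum(scores.get(name, 0) for name in CATEGORIES)
    possible = sum(CATEGORIES.values())
    pct = total / possible
    if pct >= 0.90:
        return "A"
    if pct >= 0.80:
        return "B"
    if pct >= 0.70:
        return "C"
    return "D"

scores = {"organization and content": 4.5, "visual communication": 2,
          "concise and varied word choice": 1, "mechanics": 0.5, "format": 1}
print(letter_grade(scores))  # the totals, not agonizing, decide the letter
```

The point is the workflow, not the numbers: each category is scored on its own, and the sum determines the letter, so the grader never has to adjudicate "B" versus "B+" directly.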
Initially, the use of rubrics and numbers was foreign to me, but I needed a
clearly articulated guide when I began to work with teaching assistants in the "large
lecture" format. How could I ensure consistent treatment of student papers across
three graders (two TAs and myself)? For each of the assignment papers, I developed a
rubric and sent out a draft for comment from each TA. Discussions that followed
seemed to clarify the assignment for them, so the rubrics helped to "get us on the
same page" (literally). Perhaps because our students knew we were using common
evaluation sheets, we seldom had student complaints about unfair grading. In fact, the
final grade distribution assigned by the two TAs was very similar--and I believe the
rubrics helped us achieve that.
The rubric also saves time in making summary comments because it forces me
to be concise due to space limitations. I still annotate individual passages in student
papers, but I now make brief marginal comments and then refer to the pre-printed
rubrics.
Soon after I put the student teams together, I distribute the performance
evaluation that each student is to use in evaluating his/her own work and the
contributions of each team member. We discuss characteristics of successful teams
and I use the performance evaluation to highlight individual team member
responsibilities. The performance evaluation focuses on a variety of contributions
needed for a successful collaborative report; certainly writing and editing skills are
important, but so are computer skills, visual communication, willingness to take on
responsibility, initiative, and so on.
whether each student shares in it depends on the performance evaluations and status
reports (sometimes individual grades are higher or lower, depending on that student's
contribution to the whole). With the periodic status reports and the final performance
evaluations, I feel comfortable in assigning collaborative project grades and the
ultimate course grades.
Concluding Remarks
During the grading process, tension arises when there is a difference between
the grade that the teacher assigns and the grade that the student expects (Goulden &
Griffin, 1995). Such tension can have important consequences for the student. Some
students, upon receipt of a grade lower than expected, may be discouraged from
further investment in the learning process, or may be motivated to work harder.
Additionally, grades may impact on students' self-esteem, self-worth, and self-efficacy
(Edwards & Edwards, 1999; Goulden & Griffin, 1995; House, 2000). One way to
understand this grading tension is to consider student perceptions of grades.
The PE paradigm suggests that the person, in interaction with his/her environment,
creates a personal perception of grades. These environments include not only the
classroom and school environments, but also physical, social and psychological
environments. The broader context of student environments will affect the actions
taken by the student and thus his/her achievement in the classroom. Omitting
consideration of these other environments when assessing student achievement can
result in a grade that may reflect a biased assessment. Thus, from the PE paradigm, in
order to understand student perceptions of grades, one must look at the student or
person factors and the environments in which the student is a part.
Person Factors
Students' previous school experiences center on the grading practices they
encountered and the extent to which they subscribed to them. A number of authors found that some
students experienced grades as a reward or a punishment (Becker et al., 1968;
Kadakia et al., 1998; Tropp, 1970). These experiences relative to grading shape the
perceptions and behaviors of the students. Over time, students become conditioned to
the extrinsic rewards that grades convey and may continue this focus in graduate
school (Edwards & Edwards, 1999). Kadakia et al. (1998) found that although MSW
(Master of Social Work) students saw their colleagues as obsessive about grades, they
failed to identify this characteristic in themselves. Weiner (2000) makes the point that
an outcome (i.e. grade) may be explained in terms of both personal behaviors (the
student studied or did not prepare for the test), and actions of an 'other' (the teacher
recognized the effort expended). Further, Weiner suggests that the Protestant Work
Ethic is the value base upon which grades are perceived. And so, it seems that the
experiences one has throughout the years may serve as an impetus for the efforts one
expends and also for the motivation to engage in the learning process.
Learning efforts may be uniquely linked to other person factors such as student
motivations and grade expectations that may be illustrated by the consumer model.
This model views students as consumers of education, and emphasizes satisfaction
with their educational experience. The basis of student motivation is the monetary
contribution, which is made by the student in order to obtain a degree. If, indeed,
students perceive grades as a right (according to the consumer model) and faculty take
the traditional approach that students earn the grade, then tension surely will arise
during the grading periods. The consumer model may inadvertently diminish the
values of hard work and persistence which are essential to learning (Snare, 1997).
Also, this model projects a cynical viewpoint about students' motivations to learn and
fails to recognize the other contexts for which students are responsible. Other
motivating factors that have been suggested include the desire for recognition and for
knowledge so as to become expert in the field (House, 2000). It is important then to
document student motivations and grade expectations when investigating student
perceptions of grades.
Environments
social environments that are meaningful to the student. This environment, because it
places demands on the student's time, serves to reduce the amount of time available
for students to use in academic pursuits. Consequently, although students may believe
their completed assignments are well done given the time available, they may be
graded unfavorably by the teacher. This discrepancy creates tension about the grade.
In studying student perceptions of grades, it is essential for their personal environment
to be included.
Meaning of Grades
the desired grade. Such a perspective implies the external control of grades by faculty,
and minimizes student academic efforts in affecting the grade.
Nearly thirty years later, Goulden and Griffin (1995) also conducted their
study at a Midwestern university and included undergraduate students. They
addressed the different perceptions of teachers and students as one source of grade-
conflict. The underlying premise was that the meaning of grades differs between
students and faculty and this difference is a source of conflict. Students in their
sample were given two prompts to respond to: "What do grades mean to you?" and
"Grades are like...." The results indicated that grades were viewed as a means of
feedback in which a measure or judgement about student effort was given. Also,
grades were seen as emotional triggers in which the student, as a person, was being
judged. Lastly, grades were seen as motivators within the context of a reward and
punishment system.
Summary
Grading is an area of tension between students and faculty that may be better
understood from a Systems Theory point of view. The conceptual paradigm of PE
Fit recognizes environments as important influences on student engagement in the
academic enterprise. Moreover, documenting student perceptions of
grades will provide faculty with insights that can be used to reconsider rubrics for
grading. Such insights can serve to reduce the grading tension between faculty and
students. Additionally, this information can have important implications for
curriculum development and program planning. In terms of curriculum, the
identification of student perceptions of grades (and grading practices) may enhance
faculty discussions about the grading systems that should be in place. Also, such
discussions may include the importance of consistent expectations for student
achievement across courses as well as within different sections of a course. Regarding
program planning, documenting student perceptions and being responsive to these in
light of the environmental contexts may serve to heighten the reputation of the
school/program and thereby have an indirect effect on recruitment of new students.
This model also recognizes the importance of the student body profile which includes
both traditional and non-traditional students, who may have unique perceptions
about grades. Finally, the authors look forward to extending the ideas presented in this
paper by actually conducting research that addresses student perceptions of grades.
This article reviews the difficulties in assigning grades to student work, briefly
reviewing highlights from the history of grading practices. It concludes with the
suggestion that, given the impossibility of comparing grades either within an
institution or between institutions, instructors should base grades on a measure of an
individual student's progress during a course.
One of the most important duties of a faculty member at the end of a term is
that of determining the final grade that individual students will receive in the class. As
difficult a process as this is, it is made even more difficult not only by having to
determine the process to arrive at the final grade, but also by the various
interpretations of that grade that will be made later by others.
These assigned grades are designed to serve a variety of purposes. Dr. James S.
Terwilliger wrote that grades are to serve three primary functions: administrative,
guidance, and informational. He indicated that grades should be viewed only "as an
arbitrarily selected set of symbols employed to transmit information from teachers to
students, parents, other teachers, guidance personnel, and school administrators."(1)
However, unless the meaning and interpretation of the grades assigned are universally
understood, the system, no matter how carefully designed and understood by the
instructor awarding the grade, will not be an effective means of communication to
others or over a period of time for cumulative evaluation.
Dr. William L. Wrinkle wrote in 1947 of six interpretation fallacies that are
made in understanding course grades. The number one fallacy that he listed in his
book was the belief that anyone can tell from the grade assigned what the student's
level of achievement was or what progress had been made in the class.(3) This fallacy
is as widely believed and probably as correct today as it was when he wrote it in 1947.
Even earlier in a study published in 1912, Dr. Daniel Starch and Edward
Elliott questioned the reliability of grades as a measurement of pupil accomplishment.
Their study involved the mailing of two English papers to two hundred high schools
to be graded according to the practices and standards of that school and its English
instructor. The papers were to be graded on a scale of 1 to 100, with 75 being
indicated as the passing grade. Teachers at one hundred forty-two schools graded and
returned the papers. On one paper the grades ranged from 64 to 98, with an average of
88.2. On the other, the range was 50 to 97, with an average of 80.2. With more than
thirty different grades assigned and a range of more than forty points for the same
paper, it is no wonder that the interpretation of assigned grades is extremely
difficult.(4)
Perhaps the earliest study on individual grading differences was done by Dr. F.
Y. Edgeworth of the University of Oxford in 1889. Professor Edgeworth included a
portion of a Latin prose composition in an article he wrote for the Journal of
Education. He invited his readers to assign a grade to the composition and forward it
to him. His only other instruction was that this composition was submitted by a
candidate for the India Civil Service, that the work was to be graded as if the reader
were the appointed examiner, and that a grade of 100 was the maximum possible.
He received twenty-eight responses distributed as follows: 45, 59, 67, 67.5, 70,
70, 72.5, 75, 75, 75, 75, 75, 75, 77, 80, 80, 80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90,
100, 100. In his conclusions Edgeworth wrote, "I find the element of chance in these
public examinations to be such that only a fraction--from a third to two thirds--of the
successful candidates can be regarded as quite safe, above the danger of coming out
unsuccessful if a different set of equally competent judges had happened to be
appointed."(5)
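The variability Edgeworth observed can be quantified directly from the marks he lists; a minimal descriptive sketch:

```python
# The 28 marks Edgeworth received for the same Latin composition,
# as listed in the text above.
marks = [45, 59, 67, 67.5, 70, 70, 72.5, 75, 75, 75, 75, 75, 75, 77,
         80, 80, 80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90, 100, 100]

mean = sum(marks) / len(marks)
spread = max(marks) - min(marks)
print(f"n={len(marks)}, mean={mean:.1f}, range={spread}")
```

A 55-point range among equally competent examiners for a single composition is the "element of chance" Edgeworth describes.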
The criteria for evaluation vary not only from institution to institution but
from course to course within the same institution and from instructor to instructor of
the same courses within the same institution. Since methods used by various
instructors vary considerably, it becomes extremely difficult to read a student's
transcript to determine the student's standing among others at the same institution or
throughout the country at other institutions. The National Collegiate Athletic
Association's Division I institutions voted down a requirement that student athletes
maintain a standard grade point average in order to retain their eligibility to participate
in collegiate sports from year to year. The major reason given was the difference in
grading standards that existed between institutions and between courses and programs
at the same institution.
One difficulty is that the methods used in arriving at the final course grade are
almost too numerous to enumerate. These include the averaging of all course grades
made during the term, dropping the lowest one or two test marks, determining the
entire course grade on the basis of the final exam or one term paper, counting only the
final course grades, grading on the basis of class average, and having only a written
comment rather than course grade.
Many institutions that made global changes in the recording of grades during
the decade of the 60's have changed to a system that is based on the instructor's
evaluation as measured by traditional grades. But the same problems of interpretation
that existed earlier are still present, with the additional difficulty of interpretation of
the transcript of a student who was enrolled during the transition period. For example,
at the University of South Carolina, during a seven-year period, which many part-time
students need to complete their baccalaureate degree, a student's transcript would
indicate the assignment of course grades under four different grading systems. The
student would also have been subject to three different suspension and graduation
honor criteria.
for a kindergarten for four-year-olds and the other for evaluating executives in one of
the largest corporations in the country.
The first report card used a rating system of Very Satisfactory, Satisfactory, and
Unsatisfactory. The items to be evaluated were: Dependability, Stability, Imagination,
Originality, Self-expression, Health and vitality, Ability to plan and control, and
Cooperation. The second report card used a rating of Satisfactory, Improving, and
Needs Improvement. The items to be evaluated were: Can be depended upon,
Contributes to the good work of others, Accepts and uses criticism, Thinks critically,
Shows initiative, Plans work well, Physical resistance, Self-expression, and Creative
ability. The first report card was used to evaluate the executives, and the second to
evaluate the four-year-olds.(6)
In the January 1988 issue of the Academic Leader, Dr. Stephen J. Huxley
points out another difficulty and recommends a possible correction.(7) His
observation is that the final record of the student, the college transcript, is blind to the
differences indicated above. On the transcript, in determining the student's grade point
average, an "A" in Organic Chemistry, earned under an instructor who rarely gives
them, is given the same weight as an "A" in Outdoor Fly Casting with an instructor
who rarely gives any grade other than an "A." Since these differences are generally
disregarded by employers, scholarship committees, and graduate and professional
schools, students and their peer networks learn which courses and instructors to take
to bolster their grade point average.
Is the grade intended to measure an individual's achievement against others in the same class, course, or school, or is it
only to measure changes in the student's progress since the start of the course? If the
measure is of the individual's progress, it makes the measure of one's progress against
others almost impossible to ascertain. It is as if one were using an elastic ruler to
measure heights of individuals in a class. If the measuring device varies for each
student, then one student can measure as taller than another simply because the ruler
was stretched more in one measurement than in another, even if simple observation
shows otherwise. In educational jargon, such a measuring device would be
labeled "unreliable." Yet in many courses the "measuring device" is changed for each
semester and possibly for each student.
It would seem that in those courses in which competency is desired, the latter
example would be a reasonable approach to assigning the final mark for a student.
Certainly it is reasonable from the student's standpoint; however, it makes comparison
with others in the class impossible, as well as any comparison with students in
other classes, even of the same subject at the same institution.
Since there appears to be little doubt that a given mark has different
interpretations, perhaps the best choice is for the faculty member to follow the course,
within departmental and institution guidelines, that in his or her opinion best measures
the student's progress during the measuring period without being overly concerned
with grading practices of other faculty and other institutions.
If a student who spent many hours preparing for an examination receives a lower
grade, the grade may be perceived as unfair, especially if
other students who spent fewer hours preparing for the examination received higher
grades.
Justice theory and research have dealt with both distributive and procedural
justice. Distributive justice is concerned with the fairness of decisions about the
distribution of resources, whereas procedural justice is concerned with the fairness of
the procedures used to reach those decisions. Distributive justice refers to the extent
to which the outcomes received in an allocation decision are perceived as fair; this
type of fairness has been considered implicitly within the contexts of equity theory,
relative deprivation theory, and referent cognitions theory. These theories suggest
that individuals use standards of distributive justice such as equality (outcomes
allocated equally to all participants regardless of inputs) and equity (outcomes
allocated based on inputs such as productivity) to establish the fairness or
unfairness of the outcome. Thus, the
experience of injustice involves the realization that outcomes do not correspond to
expectations determined by standards of distributive justice.
In the context of the classroom, grades are the outcomes allocated to students.
Students receiving lower grades than expected are likely to perceive the grades as
distributively unfair, whereas students receiving expected grades are likely to perceive
the grades as fair; this phenomenon can be explained by relative deprivation and the
egocentric bias. Relative deprivation theory posits that the fundamental source of
feelings of injustice is the realization that one's outcomes fall short of expectations.
The egocentric bias in distributive justice suggests that individuals' expectations of
their own performance and outcomes are higher than their expectations of others'
outcomes; hence, people who receive higher outcomes are more likely to perceive
those outcomes as fair than people who receive lower outcomes. Empirical support
for this phenomenon has been found in the work of Lind and Tyler and of Tyler, who
found connections between outcomes (relative to expectations) and perceptions of
distributive justice.
Procedural justice refers to the extent to which the processes used in making
allocation decisions are perceived as fair (Lind & Tyler, 1988; Thibaut & Walker, 1975).
Research on procedural justice has evolved from two conceptual models - Thibaut and
Walker's (1975) dispute resolution procedures and Leventhal's (1980) principles of
resource allocation procedures. These researchers suggested that procedural justice
involves the realization that procedures correspond to those determined by certain
standards (e.g., consistency, suppression of personal bias, use of accurate information,
voice, and congruity with prevailing standards or ethics). In the classroom context, the
procedures used to allocate grades could influence students' perceptions of procedural
fairness and evaluations of the instructor.
found that perceived fairness was related to satisfaction, trust in supervisors, and
organizational commitment. Alexander and Ruderman determined that employees'
perceptions of fairness influenced their approval of supervisors.
Method
This study used a 2 x 2 (grade distribution: fair vs. unfair x grading procedure:
consistent vs. inconsistent) between-subjects, scenario-based experimental
design. Undergraduate students (51 men and 46 women) participated in the study.
Most were sophomores (32%) or juniors (41%), and the rest were seniors. The
average age of the students was 20.10 years.
Materials
The participants were asked to respond to one of four different scenarios. Each
scenario described a classroom situation and an instructor. Participants were given
contextual information about the situation and were asked to place themselves in the
position of a student in the class who had worked hard on a term paper by conducting
research, writing, and rewriting the paper. On the basis of the grading criteria
described in the syllabus, the student expected to receive a grade of A on the paper.
The grade distribution was manipulated by informing the participants that the
student received a grade that either met expectations (A = fair grade distribution) or
did not meet expectations (B = unfair grade distribution). The grading procedure was
manipulated by stating that the instructor used the grading scheme specified in the
syllabus to grade the paper (consistent/fair grading procedure) or that the instructor
changed the grading scheme after the paper was turned in (inconsistent/unfair grading
procedure).
Procedure
Distributive justice was measured by asking participants to rate the extent to
which they felt that the actual distribution of grades was fair and what they deserved,
and procedural justice was measured by asking participants to rate the extent to which
they felt that the decision about the grade was made in a fair way and they were
treated fairly; these scales were based on those used by Bies, Shapiro, and Cummings
(1988) and Shapiro (1991). After completing the ratings, participants were debriefed
about the purpose of the study.
Results
Cronbach's reliability coefficients were calculated for the scales and were
found to be greater than .75 for each condition; mean ratings were also computed for
each scale. Next, I conducted manipulation checks to examine the participants'
understanding of the distributive justice and procedural justice manipulations. I
conducted separate t tests for the two manipulation checks. The results of the t tests
indicated that the manipulations had the intended effects. Grade distributions
influenced perceptions of distributive justice; participants who had been assigned
expected grades gave higher ratings for distributive justice than those who had been
assigned grades lower than expected, Ms = 4.81 and 3.59, respectively, t(95) = 3.84, p
[less than] .05. Also, the instructor's grading procedures influenced perceptions of
procedural justice; participants gave higher ratings of procedural justice for consistent
procedures than for inconsistent procedures, Ms = 5.11 and 3.87, respectively, t(95) =
4.06, p [less than] .05.
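The article does not give its computational details, but an independent-samples t test of the kind reported (two groups, t(95) on 97 participants) is conventionally computed with a pooled variance. The sketch below uses invented ratings, not the study's data; the direction simply mirrors the reported pattern of higher ratings in the fair condition.

```python
import math

def pooled_t(a, b):
    """Independent-samples t statistic with pooled variance,
    plus its degrees of freedom (n_a + n_b - 2)."""
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)   # sample variance, group a
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)   # sample variance, group b
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)  # pooled variance
    t = (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, na + nb - 2

# Invented 7-point ratings for two hypothetical conditions.
fair = [6, 5, 6, 7, 5]
unfair = [4, 3, 5, 4, 4]
t, df = pooled_t(fair, unfair)
print(f"t({df}) = {t:.2f}")
```

With 97 participants split across two conditions, the same formula yields the 95 degrees of freedom reported in the text.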
Tests of Hypotheses
In support of Hypothesis 1, evaluations of the instructor were influenced by
grade distributions. Participants who were assigned expected grades (fair
distributions) gave higher evaluations of the instructor than participants who were
assigned grades lower than expected (unfair distributions), Ms = 5.54 and 4.67,
respectively, t(95) = 2.42, p [less than] .05. Hypothesis 2 was also supported.
Students' evaluations of the instructor were influenced by grading procedures; when
consistent (fair) procedures were used, participants gave higher evaluations of the
instructor than when inconsistent (unfair) procedures were used, Ms = 5.52 and 4.69,
respectively, t(95) = 2.31, p [less than] .05.
Discussion
The purpose of this study was to examine the influence of the fairness of
grade distributions and grading procedures on students' evaluations of the instructor.
Distributive fairness was manipulated by providing participants with grades that either
met expectations or were lower than expected. Procedural fairness was manipulated
by providing consistent or inconsistent grading procedures.
Before generalizing from the results of this study, certain limitations of the
methodology should be kept in mind. Patterns obtained in a scenario-based study may
not always be generalizable to other settings. Unfortunately, the sensitive nature of
this line of research made it problematic to conduct in a classroom setting. Also, the
subtle differences between the independent variables used in this study made it
difficult to examine the independent and interaction effects under natural
circumstances. The external validity of the study, however, was increased by using
students as participants, because they could easily relate to the grading incidents.
Future researchers can extend the generalizability of this study by replicating it using
other methods in other settings.
Another potential limitation of the study is the use of only one manipulation of
grade distributions and one of grading procedures. In actuality, students' perceptions
of distributive fairness may be influenced not only by comparisons between their
grades and expectations, but also by comparisons between their grades and others'
grades. Similarly, procedural fairness may be perceived not only through the
consistency of grading procedures but also through other factors such as lack of bias
and the use of accurate information. Although the manipulation checks indicated that
participants' perceptions of distributive and procedural justice were influenced by the
manipulations used in the study, future researchers can use other techniques to
examine connections between the perceived fairness of grades and students'
evaluations of instructors.
When the results of this study are viewed along with past studies (Perkins et
al., 1990; Snyder & Clair, 1976), grade distributions appear to be a consistent
influence on evaluations of teaching. To the extent that students' evaluations of the
instructor's performance reflect the instructor's evaluations of the students'
performance (grades), teaching evaluations have the potential to be contaminated by
factors unrelated to teaching behavior.
Just as supervisors whose actions are perceived as fair by employees are more likely
to receive positive evaluations, instructors perceived as fair receive higher ratings.
Instructors can ensure the fairness
of their grading procedures by being consistent, using accurate information, and
maintaining an impartial process.
One situation in which there is good reason to suppose that marking biases
may operate is that in which the student whose work is being assessed is personally
known to the marker. Although this situation may be undesirable from the perspective
of summative assessment, there are often other considerations, such as the provision
of appropriate feedback, which outweigh this. Examples of marking being undertaken
by the same individuals who teach students, and who are therefore personally familiar
with them, are thus widespread. In such situations there is good a priori reason for
suspecting that marks may be contaminated by individual biases.
Although illusory halo can occur because ratings of various relevant
dimensions contaminate one another, it can also occur if they are all contaminated by
a common influence which is not relevant to the decision being made. Influences
which might plausibly work in this way are not hard to identify, although there is only
a limited amount of work demonstrating their impact and even less which enables
their magnitude to be evaluated in applied settings. One dimension which has received
some attention is the physical attractiveness of the individual being assessed or rated.
Landy & Sigall found that the evaluation of essays could be influenced by the
physical attractiveness of their author, although a study by Bull & Stevens only
partially confirmed this result. In the occupational sphere Morrow, McElroy &
Stamper examined the influence of physical attractiveness on judgements of
suitability for promotion made by personnel professionals on the basis of simulated
assessment centre data. Physical attractiveness was found to have a significant effect
on the ratings, although the effect was not large, accounting for only 2 per cent of the
variance in the ratings given.
Thus there is good reason to suspect that where markers know the students
whose work they are assessing their marks may be biased by overgeneralization from
the student's previous performance, by whether or not they like the student, and by
irrelevant considerations such as the student's physical attractiveness. However, such
influences have had little direct investigation in the educational setting and virtually
nothing is known of whether or how seriously marks may be contaminated by them.
As well as resulting in unfair treatment for individual students, it seems likely that if
these biases are operating, especially those based on interpersonal affect and physical
attractiveness, they will vary in their impact for different groups of students such as
males and females or students of different ages.
The possibility that gender stereotypes may bias marking has received more
attention than most aspects of marking bias. However, even here there is no clear
agreement on what types of effects are operating and on the extent of their impact.
Whilst differences in mark and grade distributions between males and females have
often been observed, it has not proved easy to make progress in disentangling whether
these reflect genuine differences in performance or whether they are in some part
attributable to biases in marking. A study by Bradley is one of the few which has
made progress on this issue. Bradley exploited the fact that where two markers mark
the same piece of work the discrepancy in their marks will in part be determined by
any differences in their biases. She found that second markers (whom she assumed to
have less knowledge both of the project topic and of the student) marked the projects
of female students closer to the centre of the scale than first markers. For male
students the pattern was reversed, with second markers tending to award more
extreme marks. Since Bradley's effect relates to the comparison between the marks
awarded by first and second markers it cannot be explained solely on the basis of
different distributions of performance in male and female students. Bradley attributed
the outcome to a centrality bias in the marking of female students' work which derived
from gender stereotypes and was consequently stronger in the less specialist second
markers.
However, the data could equally be explained if there were some influence
which inflated the variance of first marker's marks for female students. Bradley's
preference for the explanation in terms of group stereotypes was based partly on the
previous literature and partly on the fact that the pattern of results failed to obtain in a
department where the second markers were blind to the student's identity and gender.
However, there could well be other differences between the departments in which the
effect was observed and that in which it was not. This point was reinforced by the
failure of Newstead & Dennis (1990) to replicate Bradley's data pattern in a large
department where second markers did know the student's gender. A variety of
explanations for the discrepant outcomes drawing on different interpretations of
Bradley's initial effect have been advanced and discussed. However, this debate has
been largely indecisive and it might be concluded that the approach being adopted
provides insufficient evidence to adjudicate between alternative interpretations of the
effects.
Thus, whilst there is good reason to believe that personal biases could operate
in marking, there is little direct evidence of their importance, and in the area of gender
bias it has been difficult to disentangle effects of bias from true differences in
performance. One of the most promising approaches to the latter issue is subject to
ambiguities of interpretation. The main aim of the present paper is to propose and
illustrate an approach which can contribute considerably to progressing the study of
both individual biases and gender bias in marking. The essence of this approach has
previously been advanced in relation to occupational ratings by Kenny & Berman
(1980). However, it appears to have been little exploited for occupational ratings and
not at all in relation to marking bias. The present study goes beyond the proposals of
Kenny & Berman in using a multi-sample approach to compare the way in which the
work of male and female students is marked and in using a structured means model to
locate the source of differences in the average marks of males and females.
Consider a model in which the mark which a marker awards to a piece of work
is the sum of three components. The first component is determined by the true merit
of the work being assessed. The second component reflects the aggregate influence of
the marker's biases concerning the student in question. The third component consists
of purely random influences. We can hope to make some progress in distinguishing
the variance attributable to these different influences if markers assign nominally
independent marks to a number of pieces of work from the same student and if each
piece of work is marked by more than one marker. In this situation each of the marks
assigned to a particular piece of work will be influenced by its true worth (along with
other influences). All the marks which a marker awards to a particular student will be
influenced by, amongst other things, that marker's biases concerning the student.
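This three-component decomposition is easy to simulate. The sketch below (with made-up variance components, not values estimated from the study) generates marks for two sections from two markers and shows the correlation pattern the model implies: marks sharing a section correlate through true merit, and marks sharing a marker correlate through that marker's bias.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000  # simulated students

# Hypothetical variance components (made up for illustration).
merit_b = rng.normal(0, 10, n)   # true merit of section B
merit_c = rng.normal(0, 10, n)   # true merit of section C
bias_1 = rng.normal(0, 5, n)     # marker 1's bias about each student
bias_2 = rng.normal(0, 5, n)     # marker 2's bias about each student

def err():
    return rng.normal(0, 4, n)   # purely random component

# Each mark = true merit + that marker's bias + random error.
b1 = merit_b + bias_1 + err()    # marker 1, section B
c1 = merit_c + bias_1 + err()    # marker 1, section C
b2 = merit_b + bias_2 + err()    # marker 2, section B
c2 = merit_c + bias_2 + err()    # marker 2, section C

r = lambda x, y: np.corrcoef(x, y)[0, 1]

print(r(b1, b2))  # same section, different markers: high
print(r(b1, c1))  # same marker, different sections: modest
print(r(b1, c2))  # different marker and section: near zero
```

The three printed correlations fall in the order that the later correlational analysis of the real data would lead one to expect.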
The data which were used in this study derive from the marking of final year
undergraduate psychology projects in the University of Plymouth between 1991 and
1993. These projects report a piece of empirical work carried out throughout an
academic year. This work is supervised by one supervisor who meets with the student
regularly during the course of the year. The student's project report is sectioned in the
conventional way for reports of empirical work and is independently marked by the
supervisor, acting as first marker, and a second marker. Second markers are chosen,
subject to constraints imposed by workload, on the basis of their interest or expertise
in the topic area of the project. Project reports, which are typed, carry the student's
name on the cover. Second markers will vary in their degree of personal knowledge of
the student: they may have come to know the student quite well in some role such as
that of the student's personal tutor but may know the student hardly at all. Typically
they will know the student considerably less well than does the supervisor by the time
the project report is submitted.
The supervisor awards four marks to the project whilst the second marker
awards three. The supervisor's A mark is intended to be based on the student's
performance in designing and conducting the project work over the course of the year.
It is an explicit part of the marking policy that the remaining three marks awarded by
each marker are to be based solely on evidence which is available to both markers in
the form of the written project report. The B mark relates to the introduction section
of the report, the C mark to the method and results sections and the D mark to the
discussion. The marks are awarded on a 100-point scale commonly used in British
higher education where a mark of 70 or above corresponds to a first-class honours
degree and a mark below 40 represents a fail, with intermediate classifications all
having specified mark ranges. For each of the four marks, markers have available to
them a set of marking guidelines which provide a two or three sentence description of
the performance appropriate to each degree class. Second markers award the B, C and
D marks without knowledge of the corresponding marks given by the supervisor and
without receiving any comments from the supervisor. They do, however, have sight of
the supervisor's A mark prior to awarding their own marks. Data from one male
student and two female students were excluded from the analysis because they were
incomplete. This left data from 197 female students and 58 males which were used in
the analysis. Twenty-five different markers were involved, of whom eight were
female.
The path diagram for the model fitted to the data is shown in Fig. 1. This
follows the standard conventions for such diagrams. The variables in square boxes are
manifest variables - in this case the seven different marks awarded to each project. A1
denotes the first marker's A mark, B1 the first marker's B mark, B2 the second
marker's B mark and so forth. The variables enclosed by circles are latent variables
which the model assumes to combine in determining the manifest variables. In general
the model assumes that each mark is determined by the sum of three influences.
The first influence on each mark is a factor which is specific to the section
being marked but which influences both markers. This influence is reflected in factors
SSB, SSC or SSD for sections B, C and D respectively (the labels are chosen to
provide a mnemonic for the fact that the influence of these factors is section specific).
These factors will probably reflect primarily the true merit of the section being
marked but they could also embody biases which are shared by both markers or any
other influence which operates on both of them.
The second influence on each mark is one which is marker specific but general
across all the marks awarded by that marker to the student. These factors appear on
the right of the model in Fig. 1. The factor MS1 influences only the marks of the first
marker and the factor MS2 influences only the marks of the second marker (again the
labelling is chosen to provide a mnemonic for the fact these are marker specific
factors). These two factors affect each of the marks which a marker awards to a
particular student. The influences represented by MS1 and MS2 would include any
pre-existing biases the marker may have concerning either the student in question or a
group to which the marker knows the student belongs. The marker's reaction to
aspects of the student's work which transcend the different sections, such as the
student's writing style, would also enter into the factors MS1 and MS2. It also needs
to be recalled that the markers under consideration are not the same individuals for all
projects being marked. This complicates interpretation a little since any differences in
stringency between markers, whereby some markers are more severe than others in all
the marks which they award, will also appear in MS1 and MS2. However, it also
means that rather more general conclusions can be reached than would have been
possible if only two markers had been involved.
The third influence on each mark is one of the error components E2 to E7.
These account for the component of the variance which is not explained by the other
factors. They represent the component of each mark which is idiosyncratic and section
specific and are analogous to the error component in traditional true score models of
marking reliability.
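Under this additive structure, if the latent factors are scaled to unit variance, a mark's total variance is simply the sum of its squared path coefficients plus error variance, and the share attributable to any one factor follows directly. A sketch with hypothetical (made-up) coefficients, not estimates from the study:

```python
# Hypothetical standardized set-up for one mark, say B1 (made-up values):
#   B1 = lam_ss * SSB + lam_ms * MS1 + E2,  with Var(SSB) = Var(MS1) = 1.
lam_ss = 6.0    # path from the section-specific factor SSB
lam_ms = 4.0    # path from the marker-specific factor MS1
var_e = 20.0    # variance of the error component E2

total_var = lam_ss**2 + lam_ms**2 + var_e   # 36 + 16 + 20 = 72
share_ms1 = lam_ms**2 / total_var           # 16/72, about 22 per cent
print(round(100 * share_ms1, 1))  # 22.2
```

The same arithmetic, applied to the fitted coefficients, underlies the variance percentages quoted in the Discussion.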
Factor SSA differs slightly from factors SSB to SSD. Because the A mark is
not independently awarded by a second marker there is no way of identifying the
component of this mark which is idiosyncratic to the supervisor. Thus the model does
not contain an error component influencing A. Instead in this case the error
component can be thought of as being hidden within SSA.
The double headed arcs linking factors SSA to SSD signify that these factors
are allowed to be intercorrelated. Clearly if the major determinant of these factors is
the real merit of the student's performance then students who perform well on one
element of the project are likely also to perform well on other elements so this is
necessary to make the model appropriate. It is important to note that influences which
are not linked by double headed arcs are assumed to be uncorrelated. In particular the
influences on marks reflected in MS1 and MS2 are ones which are uncorrelated with
each other and with the complex of SSA to SSD. Thus biases which are shared by the
two markers will appear in SSA to SSD rather than MS1 and MS2.
Biases which are correlated with ability will also not affect MS1 and MS2.
Consider, for example, Bradley's (1984) suggestion that because of gender stereotypes
second markers mark the work of female students less extremely. This bias would
reduce the mark given to good work presented by a female student but elevate the
mark given to poor work. Thus the bias which Bradley proposes to be operating in
second markers is one which is correlated, albeit negatively, with student
performance. Because it is not orthogonal to the section specific factors it will not
contribute to MS2. However, if this bias is operating, the second marker will award a
lower mark to a good female project than to an equally good male project and
conversely for poor projects from male and female students. Thus variations in project
quality will have less impact on the mark for female than for male students. This will
lead to smaller path coefficients for females than for males on the paths from SSB to
SSD to the corresponding marks awarded by the second marker.
Results
Descriptive statistics for each of the seven marks are given for male and
female students separately in Table 1. The modelling procedures used here assume
multivariate normality of the underlying data and are known to be sensitive to
violations of this assumption. It can be seen that in general, but especially for the
females, the distributions tend to be leptokurtic and to have negative skew. After
experimentation with alternative transformations this was dealt with by squaring all
marks prior to calculating the variances and covariances used in modelling. Skew and
kurtosis for the transformed marks are also given in Table 1. It is apparent that the
transformation alleviates the problems which were present in the raw data and
produces data which are reasonably close to normality for both males and females.
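The effect of the transformation can be seen on synthetic data: squaring is monotone and convex on a 0-100 scale, so it stretches the upper end of the scale and pulls in a long lower tail. The sketch below uses made-up beta-distributed marks, not the study's data:

```python
import numpy as np
from scipy.stats import skew, kurtosis

rng = np.random.default_rng(1)
# Made-up marks on a 100-point scale with negative skew and positive
# excess kurtosis, loosely mimicking the pattern described in the text.
marks = 100 * rng.beta(8, 2, size=5000)

squared = marks ** 2  # the transformation applied before modelling

print(skew(marks) < 0)              # True: raw marks are negatively skewed
print(kurtosis(marks) > 0)          # True: raw marks are leptokurtic
print(skew(squared) > skew(marks))  # True: squaring reduces the skew
```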
The correlation matrix for the transformed data along with the relevant
standard deviations are presented in Table 2. These are the data to which the model
was fitted. Informal inspection of Table 2 reveals a number of features which are
helpful in gaining an insight into the conclusions which emerge more formally from
the fitting of the model in Fig. 1. The correlations in the table are of three types: (i)
correlations between marks awarded by the same marker to different sections, (ii)
correlations between markers in their marks for the same section, and (iii) correlations
involving both different markers and different sections. In Table 2 correlations of type
(i) are in bold type, correlations of type (ii) in italics, and correlations of type (iii) in
plain type. In general correlations of type (i) are larger than those of type (ii), which
are in turn larger than those of type (iii). This general pattern, which, as might be
expected, emerges more clearly in the larger female sample, is consistent with the
model in Fig. 1.
Correlations of type (iii) are expected to be smallest since the two measures
being correlated are not influenced by any shared factor. The fact that the correlations
of type (iii) are all positive presumably reflects the existence of positive correlations
amongst the factors SSA to SSD. Correlations of type (ii) differ from those of type
(iii) in that the measures being correlated share the influence of either MS1 or MS2.
The fact that correlations of type (ii) tend to be bigger than those of type (iii)
demonstrates the need for the inclusion of these two factors in the model. For
correlations of type (i) the two measures being correlated share the common influence
of one of the factors SSA to SSD. The observation that correlations of type (i) tend to
be bigger than those of type (ii) suggests that, as might be expected, the influence of
SSA to SSD is greater than that of MS1 or MS2.
A second feature of the data which is evident in Tables 1 and 2 and which
contributes to the outcome of model fitting is that the marks awarded by first markers
show a larger SD for males than for females. If this difference in variance is tested
using the transformed data it reaches significance on the B and D marks though not on
the A and C marks (for mark A, F(57, 196) = 1.28, p = .22; for mark B, F(57, 196) =
1.51, p = .04; for mark C, F(57, 196) = 1.17, p = .42; for mark D, F(57, 196) = 1.65, p
= .006). The difference in SD is only marginally and non-significantly evident in the
second marker's marks (for mark B, F(57, 196) = 1.12, p = .58; for mark C, F(57, 196)
= 1.05, p = .79; for mark D, F(57, 196) = 1.13, p = .53). Some aspect of the first
marker's marking must differ between males and females but without formal
modelling it is not clear which of several possibilities holds. It could be, for example,
that individual biases are larger for males than for females so that the influence of
factor MS1 is greater for males than for females. Alternatively, it could be that first
markers let the same degree of variation in merit produce larger degrees of variation in
marks when the work is that of male rather than female students. Thus the paths
leading from the factors SSA to SSD to the marks awarded by the first marker would
have larger coefficients for males than for females. Perhaps less plausibly it may be
that supervisors' marking of male students is noisier so that the error components of
the first marker's marks (E2, E4 and E6) are greater for males than for females.
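The variance comparisons above are variance-ratio F tests, and the reported p values are consistent with a two-tailed test (an assumption on my part; the text does not say). A sketch reproducing, approximately, the first marker's comparisons:

```python
from scipy.stats import f

def var_ratio_p(F, df1, df2):
    """Two-tailed p value for a variance-ratio F statistic
    (larger variance in the numerator)."""
    return 2 * f.sf(F, df1, df2)

# First marker: 58 males and 197 females give df = 57 and 196.
p_b = var_ratio_p(1.51, 57, 196)   # close to the reported p = .04 for mark B
p_a = var_ratio_p(1.28, 57, 196)   # close to the reported p = .22 for mark A
print(p_b, p_a)
```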
The purpose of the model fitting, carried out here using EQS, is to predict the
entire pattern of covariances and variances amongst the seven marks in each of the
two samples. EQS and similar programs determine values for the unknown
parameters of the model in such a way as to minimize the discrepancy between the
observed variances and covariances and those predicted by the model. A number of
alternative criteria for assessing this discrepancy are available. In the present case a
maximum likelihood criterion was employed. Under the assumption of multivariate
normality this criterion leads to a function which is approximately distributed as
χ² with a number of degrees of freedom which depends on the number of
measured variables and on the number of parameters which are estimated in fitting the
model. The value of this statistic can be used to assess the overall compatibility of the
model with the data. It is also possible when fitting models to constrain parameters to
particular values or to equality with one another so that fewer separate parameters
need to be estimated and the resulting χ² statistic has more degrees of
freedom. The change in χ² resulting from the imposition of constraints
provides a method of assessing whether those constraints are significantly worsening
the fit of the model - this is sometimes referred to as the χ² difference test.
In addition to the χ² statistics a variety of other fit indices are also
available. Some of these, such as the normed fit index (Bentler & Bonett, 1980),
provide a measure of fit on a scale from 0 to 1; for these the fit index inevitably
increases as constraints are released in a series of nested models. Other measures such
as Akaike's Information Criterion also take account of the parsimony of the model
and favour models in which a good fit is obtained with a small number of free
parameters.
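All of the fit statistics described here are straightforward to compute from χ² values and degrees of freedom. A sketch, using the difference-test figures reported for models 1-3 later in the text (NFI and AIC follow the standard formulas):

```python
from scipy.stats import chi2

# Chi-square difference test: allowing MS1's four path coefficients
# to differ across samples (model 1 vs. model 2) frees 4 df.
p_12 = chi2.sf(10.141, df=4)
print(round(p_12, 4))  # 0.0381, as reported in the text

# Removing MS2 entirely (model 2 vs. model 3) frees 3 df.
p_23 = chi2.sf(5.097, df=3)
print(round(p_23, 3))  # 0.165, as reported

def nfi(chi2_null, chi2_model):
    """Normed fit index (Bentler & Bonett, 1980), on a 0-1 scale."""
    return (chi2_null - chi2_model) / chi2_null

def aic(chi2_model, df_model):
    """Akaike's information criterion in the chi-square form common
    in SEM (Akaike, 1987): rewards a good fit with few parameters."""
    return chi2_model - 2 * df_model
```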
Because of the previous evidence that males and females may be treated
differently in project marking they were treated as separate samples using the facilities
which EQS provides for multi-sample modelling. This approach makes it possible to
impose constraints which equate particular parameters across the two samples. The
degree of fit of the constrained model then provides a test of whether the assumption
that these parameters are the same for the two samples is consistent with the data. By
investigating which parameters, if any, need to differ across the two samples it should
be possible to be more specific about the origins of different mark distributions for
males and females.
Model 1 of Table 3 contains the requirement that all parameters of the model
should be equal for males and females. It can be seen that overall this model is quite
compatible with the data. However, it was noted previously that the marks awarded by
first markers to male and female students had different variances. This provides
evidence that some aspect of the model needs to be different for male and female
students. Detailed consideration of the results from fitting model 1 indicated that the
fit of the model could be improved if MS1 was allowed to have a greater influence for
males than for females, particularly on the B and D marks.
Model 2 in Table 3 differs from model 1 in that all four path coefficients of
MS1 have been allowed to differ in males and females. The χ² difference
test confirms that the fit of model 1 is significantly poorer than that of model 2
(χ²(4) = 10.141, p = .0381). Thus the influences represented by MS1 (the
nature of these is taken up later) have greater impact when the work being marked is
that of a male student.
Inspection of the fitted parameters of model 2 showed that the influence of the
first marker factor MS1 was considerably larger than that of the second marker factor
MS2 (parameter values from a model which is very close to model 2 are given in
Table 4). How strongly do the data dictate a model in which both MS1 and MS2 have
some influence but the influence of MS1 is stronger? The remaining models in Table
3 were considered with a view to exploring this question. In model 3 the second
marker factor, MS2, is removed from the model entirely. Model 3 is otherwise
identical to model 2. The change in χ² on moving from model 2 to model 3
is not significant (χ²(3) = 5.097, p = .165), indicating that the second
marker effect, MS2, is not necessary for an adequate fit to the data. Because of its
greater parsimony, model 3 appears slightly superior on Akaike's (1987) information
criterion but the other fit indices suggest that model 2 yields a marginally better fit.
The results from model 3 show that a model from which the second marker
effect, MS2, has been removed provides an adequate account of the data. Would a
model from which MS1 has been removed also be satisfactory? Model 4 examines
this by dropping MS1 whilst retaining MS2. To make the comparison with model 3 a
fair one, MS2 was allowed to have different path coefficients for males and females.
Model 4 produces a significant overall χ² (χ²(33) = 55.8, p =
.008), implying that the data pattern observed would be unlikely if this model were
correct. Thus whilst a model with only the first marker effect, MS1, is tenable, there is
enough evidence in the data to discount a model in which only the second marker
effect, MS2, is present.
It is worth noting that models which exclude both of the marker specific
factors, MS1 and MS2, unsurprisingly provide an even poorer account of the data than
model 4. Complications arise in fitting these models because some estimates of the
correlations between section specific factors become constrained at unity. This in
itself suggests that these models are unsatisfactory. Further evidence comes from the
fit measures obtained when both MS1 and MS2 are dropped from the model. In model
6 the complications mentioned above have been dealt with by having the same factor
load on both section C and section D marks. Additionally, differences between the
male model and the female model which lead to a significant improvement in
χ² have been introduced on an ad hoc basis. Despite these ad hoc changes
the model produces a highly significant overall χ² and can thus be rejected.
The exclusion of factors MS1 and MS2 (and also the merging of SSC and SSD) makes
model 6 more parsimonious than the other models in Table 3 and this will in part
account for its poorer fit. However, even on the AIC, whose purpose is to allow for
differences in parsimony, model 6 comes out markedly worse than the other models in
Table 3. This confirms the view that at least one of the marker specific factors is
needed to give an adequate account of the data.
In summary, the general form of the model presented in Fig. 1 provides a good
account of the data provided that the first marker effect, MS1, is included. The effect
of MS1 is greater for males than for females. The best account of the data is provided
by models in which the impact of MS2 is either substantially smaller than that of MS1
or absent. However, a model in which MS1 and MS2 have an equal impact cannot be
rejected. The good fit of the models should not be overemphasized given the relatively
large number of free parameters which they contain. However, less sensibly motivated
models with as many free parameters do less well in accounting for the data, and the
results reported above from model 4 show that a satisfactory fit is not guaranteed for
models of the order of complexity considered here. The satisfactory fit does imply that
sensible interpretations may be placed on the parameter values of the fitted models. A
number of issues arise from these parameter values and these are taken up in the
discussion.
Inspection of Table 1 reveals that the mean marks given to male students are
higher than those for females, especially in the supervisor's marks. The model
presented in Fig. 1 implies that this must occur either because males have higher
scores on some or all of the factors SSA to SSD or because they have higher scores on
one or both of MS1 or MS2. The fact that the difference is more apparent in the
supervisor's marking begins to suggest that it might arise from MS1. The question can
be looked at more formally by use of a structured means model.
The models whose fits are presented in Table 3 are based solely on the
variances and covariances of the seven observed marks. They differ from a structured
means model in that the latter also takes account of the means of the manifest
variables in estimating parameter values. Multi-sample structured means models are
discussed by Bentler. They make use of more evidence from the data in that they seek
to account for the means of the manifest variables in the different groups but doing
this also involves estimating additional parameters.
The additional parameters which are needed in the present example include
the means of each of the factors SSA to MS2 in each of the two samples. Since the
zero point of the factors is arbitrary it can be set at zero in one of the samples. When
this is done the estimated means of the factors for the remaining sample provide
information on how the means of the factors differ in the two samples. It is these
estimated differences in factor means which are the focus of interest here. The other
additional parameters needed are the intercepts of the seven linear expressions given
manifest variables as a function of the latent factors. These intercepts are constrained
to equality in the two samples. In applying this approach to the present model 13 extra
parameters need to be estimated - the intercepts of the seven manifest variables and
the means for the six factors in one group. With 14 observed means available as data
to be explained by the model and 13 additional parameters to be estimated, the
structured means model provides virtually no additional information to assist in
discriminating between models. However, the parameter estimates it provides are of
considerable interest for the reasons given above.
The structured means version of model 2 again has a good fit (χ²(29)
= 32.89, p = .28, NFI = .973, NNFI = .995, CFI = .997). Table 4 presents all the
parameter estimates from the fitting of this model along with their respective standard
parameter estimates from the fitting of this model along with their respective standard
errors. The ratio of these two quantities is in effect a z score which may be used to test
whether the parameter differs significantly from zero. The parameter estimates
relating to variance and covariance are all virtually identical to those obtained from
model 2. Of immediate concern are the differences in factor means for the two groups.
These appear in the bottom row of Table 4. It is important to note that whilst the
means of the observed marks are higher for males, the means of the section specific
factors (SSA to SSD) are all higher for females, though the differences are slight and
fall well short of significance (for SSA z = 0.211, for SSB z = 0.086, for SSC z =
0.125, and for SSD z = 0.104). Thus the higher mean marks supervisors give to males
cannot be explained in terms of the greater merit of their projects. Rather the higher
marks of males are accounted for by their higher mean on MS1; here the difference
between males and females does reach significance (z = 2.177, p = .029). Whatever
influences are represented in MS1 are on average raising the marks of male candidates
relative to those of female candidates to a small but significant extent. MS2 also
operates to raise the marks of males relative to females; the strength of this effect is
only a little smaller than for MS1 but because of a larger standard error of estimate it
fails to reach significance (z = 1.09, p = .28).
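The significance tests on the factor-mean differences are simple Wald z tests: the estimate divided by its standard error is referred to the standard normal distribution. A sketch reproducing the two p values just quoted:

```python
from scipy.stats import norm

def wald_p(z):
    """Two-tailed p value for an estimate/standard-error ratio."""
    return 2 * norm.sf(abs(z))

print(round(wald_p(2.177), 3))  # 0.029: the MS1 mean difference, as reported
print(round(wald_p(1.09), 2))   # 0.28: the MS2 mean difference, as reported
```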
Discussion
The most important feature of these analyses lies in the satisfactory fit
obtained from the model presented in the Introduction, in the need to include factors
MS1 and possibly MS2 in order to obtain that fit, and in the proportion of mark
variance accounted for by these factors. Thus, for example, under the model whose
parameters are given in Table 4 the proportion of the total variance in the supervisor's
mark attributable to the influence of MS1 ranges from 17.1 per cent for the B marks
awarded to females up to 52.8 per cent for the D mark awarded to males with an
average of 26.5 per cent for females and 31.1 per cent for males. In the case of the
second marker the proportion of total variance due to MS2, which model 2 constrains
to be the same for males and females, is 2.4 per cent for the B mark, 18.7 per cent for
the C mark and 11.6 per cent for the D mark. What then do the factors MS1 and MS2
represent?
One possibility is that these factors merely reflect differences in stringency
between the different individuals acting
as first and second markers. There are, however, good reasons for believing that this is
not a major component of MS1 and MS2. First, and most directly, it is possible to
estimate a factor score for each project on MS1 and MS2 and to see whether these
vary according to the marker. If differences in marker stringency are an important
component of MS1 and MS2 then some markers should be associated with
consistently high factor scores and some with consistently low factor scores. Although
records of marker identity were not available for the first cohort of students whose
data were included in the analysis, first markers were known for 190 projects and
second markers for 168 projects. Analyses of variance comparing MS1 scores and
MS2 scores across different markers yielded no hint of any significant differences
between markers (for first markers, F(23, 166) = 1.26, p = .20; for second markers,
F(23, 144) = 0.96, p = .52).
Third, differences in stringency should be equal for first and second markers
since essentially the same individuals are acting in both roles. In so far as the data
tend to point to a model in which the impact of MS1 is greater than that of MS2, the
first marker factor MS1 cannot wholly be attributed to variation in marker stringency.
Finally, differences in marker stringency should apply equally to both male
and female students. The finding that the influence of MS1 is moderated by student
gender therefore leads to the conclusion that it cannot be entirely attributable to this
source. Thus, whilst a part of MS1 and perhaps the entirety of MS2 might be
attributable to section-transcendent differences in marker severity, there are good
reasons for believing that some part of MS1, and probably the largest part, is not
attributable to this cause.
A second possibility raised in the Introduction is that factors MS1 and MS2
reflect a particular marker's reaction to section-transcendent features of the student's
work such as writing style. If this is the case then markers are giving such features an
inappropriately high weighting. The marking guidelines refer to quite separate
features of the project for each section. It should be noted that all projects are word-processed or typed, so quality of handwriting is not a candidate as a section-transcendent feature. There are in any case reasons for being sceptical about any
explanation of this general type. MS1 loads not only on the marks awarded to the
sections of the written project but also on the supervisor's A mark given for the
conduct of the project over the year. It is difficult to see what sort of feature could
transcend both conduct of the work of an empirical project during the year and the
written report on the work. Again, in so far as the data favour a model in which MS1
has more impact than MS2, it is difficult to see why factors such as writing style
should affect the marks awarded by supervisors to a greater extent than they affect the
marks given by second markers, especially when it is recalled that it is the same group
of individuals who act in both roles. Finally, an account based on section transcendent
features of the work offers no ready explanation of why the impact of MS1 is different
when the work being marked is that of a male rather than a female student, though it
is possible that some tortuous account based on stylistic differences in the work of the
two genders might be developed.
A third possibility is that MS1 reflects the influence of knowledge of the
student external to the project report. That knowledge might in turn be an assessment
of the student's abilities based on evidence outside of the student's project
performance or, even less appropriately, a reaction to the student's other personal
characteristics. The present data do not offer a great
deal of evidence with which to disentangle these possibilities. There is no reason to
expect that halo effects internal to the project report should be any greater for
supervisors than for second markers. Thus if the first marker effect is stronger this
would suggest that supervisors' B, C and D marks are being influenced by their
contact with the student over the course of the year rather than solely by the project
report. Although the data do not provide a particularly strong case against a model in
which MS1 and MS2 have equal influence it would be surprising if both markers were
susceptible to halo effects within the project report whilst supervisors were totally
impervious to effects from their prior knowledge of the student.
Bradley found that second markers awarded less extreme marks to female students'
projects compared to supervisors. Bradley explained this by suggesting that the less
expert second markers displayed a bias whereby they were reluctant to give female
students extreme marks. As noted previously, if this sort of mechanism were at work
in the present data the path coefficients of SSB to SSD on the second marker's marks
should be smaller for female students than for males. There is no evidence for this.
This may be unsurprising given that a previous study in the department from
which the present data come failed to replicate Bradley's effect (Newstead & Dennis,
1990). Newstead & Dennis (1990) suggested that an alternative explanation of
Bradley's results might lie in biases shown by supervisors concerning individual
students. Such biases would inject variance into the supervisor's marks tending to
make them more extreme than the second marker's marks. If the biases concerning
individual students were stronger for females this could then explain the pattern of
data observed by Bradley. In so far as they are consistent with the existence of quite a
strong influence on supervisors' marks coming from reactions to individual students,
the present results are compatible with the proposal made by Newstead & Dennis.
However, the direction of the gender effect in the present data is opposite to that
which would be necessary to explain Bradley's data.
Possible reasons for the discrepancy between Bradley's findings and those of
Newstead & Dennis have been extensively rehearsed and the present results can only
make a very limited contribution to resolving that debate. The present findings might
perhaps most easily be reconciled with those of Bradley by suggesting that there are
biases relating to individual students which may have a differential impact on the two
genders but that whether this happens and which gender is most affected varies
according to factors such as the type of material being marked, the population of
students involved, and the particular set of markers. It is noteworthy that in this study
the greater impact on male marks was restricted to the marking of the more discursive
introduction and discussion sections. Another possibility is that gender is not the true
variable underlying these effects but some other variable which was correlated with it
in the samples studied, and which in particular was correlated in opposite directions in
the present sample and in Bradley's sample.
The pattern evident in Table 2, whereby the marks given to males generally
show greater variability than those of females, especially in the first marker's
marking, is an example of a more general finding which has provoked
considerable debate (Rudd, 1984). Is the fact that females seem to obtain less extreme
marks and fewer first-class and third-class degrees a reflection of truly different
patterns of performance in the two genders or does it reflect a difference in the way
their work is marked? In the present case the difference in variance between male and
female marks arises from the influence of MS1. Thus in this case the evidence
inclines towards the position that the greater variance of male marks arises primarily
from sources which influence all of the marks awarded by the first marker. Obviously
this need not imply that all examples of mark variance being greater in males than in
females arise in the same way.
One advantage of the method used here, and in particular the use of a
structured means model, is that it makes it possible to obtain more information about
how differences in means between groups arise. In this case the difference in means
between male and female marks seems to arise in the factor MS1 (and possibly also
MS2) but not in the section specific factors, with the non-significant effect on SSA to
SSD taking the form of a female superiority. It is important to recognize that this is
quite a small effect relative to the variance of MS1. Thus although the average effect
of MS1 is to favour males slightly, there will be some males as well as some females
where it acts quite strongly to reduce marks. Having said this, it is difficult to escape
the conclusion that the sex difference in MS1 reflects an influence of the personal
knowledge which the marker gains of the student during the year and that in this
sample that influence was acting in a manner which on average favoured male
students over female students. If it has any degree of generality this is a finding with
significant implications for the marking process.
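The structured means reasoning can be made concrete with a small numerical sketch; the loadings and the latent mean difference below are invented for illustration, not estimates from this study:

```python
import numpy as np

# Invented loadings of the supervisor's A-D marks on MS1.
loadings = np.array([0.4, 0.5, 0.6, 0.5])

# An invented male-female difference in the MS1 factor mean.
factor_mean_diff = 0.2

# Each observed mark's mean shifts by loading * latent mean difference,
# so one latent difference implies a proportional pattern of mean
# differences across every mark that loads on the factor.
observed_mean_diffs = loadings * factor_mean_diff
```

A structured means model estimates the latent difference directly and tests whether this proportional pattern holds across the groups.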
The data and modelling reported here also have implications for the reliability
of marking. It is evident from the correlations in Table 2 that inter-marker agreement
on each section is only very moderate. This outcome is not surprising in relation to
other data on the reliability of degree level assessment (Byrne, 1980; Cox, 1967;
Laming, 1990; Newstead & Dennis, 1994). However, given the weight attached to the
project in many degree schemes, it might have been hoped that it would have a higher
level of reliability than a single exam answer or even an exam paper.
In the present case it is clear that much of the disagreement between markers
derives not from a random influence on their marks but rather from a consistent but
marker-specific influence on the marks awarded to a particular student. This has
important implications for the extent to which poor reliability on individual elements
of assessment is overcome by averaging. Thus, for example, if an overall mark for the
projects involved here were calculated by averaging the section marks, the agreement
between supervisors and second markers on the project average would not be greatly
superior to their correlation on individual sections. Authors who have found only
modest reliability in exam marking in higher education have sometimes taken comfort
from the benefits of averaging over large numbers of assessments (e.g. Newstead &
Dennis, 1994). The present results suggest that it may be unwise to assume that
averaging is satisfactorily dealing with the problems of unreliability.
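The limited benefit of averaging when disagreement is marker-specific rather than random can be demonstrated by simulation; the variance components below are invented for illustration and are not the model estimates:

```python
import numpy as np

rng = np.random.default_rng(1)
n_projects, n_sections = 5000, 4

quality = rng.normal(size=n_projects)           # shared project quality
halo1 = rng.normal(scale=0.8, size=n_projects)  # marker 1's student-specific bias
halo2 = rng.normal(scale=0.8, size=n_projects)  # marker 2's student-specific bias
err1 = rng.normal(scale=0.8, size=(n_projects, n_sections))  # random section error
err2 = rng.normal(scale=0.8, size=(n_projects, n_sections))

marks1 = quality[:, None] + halo1[:, None] + err1
marks2 = quality[:, None] + halo2[:, None] + err2

# Agreement on a single section versus agreement on the section average.
r_single = np.corrcoef(marks1[:, 0], marks2[:, 0])[0, 1]
r_average = np.corrcoef(marks1.mean(axis=1), marks2.mean(axis=1))[0, 1]
# Averaging shrinks the random error fourfold but leaves the marker-
# specific component untouched, so r_average rises only modestly
# (from roughly .44 to roughly .56 with these invented components).
```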
Table 4 shows some differences between markers and sections in the error
component of mark variance. Interpretations of these differences are necessarily
speculative. Supervisors show greater error variance than second markers on the mark
for the introduction. Error variance here really means variance that is both marker-
and section-specific. It may be that it is in marking the introduction that the supervisor's
specialist knowledge of the literature is most relevant and hence it is here that their
mark is most idiosyncratic.
The data analysed in this study derive from a single department, and a
relatively small group of markers was involved. In view of this it would be unwise to
reach overly strong conclusions about marking in general. The situation examined, in
which a supervisor teaches a student on an individual basis for a year and is then
required to assess the work around which all their meetings have centred, is perhaps
unusually conducive to the supervisor developing biases towards the student.
However, the strength of the effects detected does suggest that the influence of
markers' personal knowledge of the individuals whose work they are marking
deserves considerably more attention than it has previously received. If the results
reported here do have any generality then there is a strong case for avoiding, as far as
possible, the assessment of work by those who know the student. Student gender
appears to moderate the influence of the marker's personal knowledge of the student;
although the effect is not enormous, it again gives cause for concern, and the
generality of the effect warrants further investigation. Student gender was the only
personal characteristic of the students which was considered in this study and it may
well be that there are other characteristics which have a similar or even a larger
influence; this too deserves further study. Whilst caution is advisable in generalizing
the conclusions of this study, what may more usefully be generalized are its methods.
Whilst there are ambiguities of interpretation, which have been discussed, the use of
structural equation modelling provides a valuable new handle on marking bias. An
accumulation of studies using the approach illustrated here could substantially
improve our knowledge of the nature and magnitude of marking biases.