social, economic, and even our psychological status. The history of education can be traced back to human origins and humanity's never-ending passion for knowledge. The word "education" is derived from the Latin educare (with a short u), meaning "to raise", "to bring up", "to train", "to rear". In recent times there has been a return to an alternative assertion that education derives from a different verb, educere (with a long u), meaning "to lead out" or "to lead forth".

Definition of education

Education has been defined from a multitude of perspectives, and the real essence of education is even today the subject of critical deliberation. Some definitions of education, drawn from Wikipedia, are:

- the gradual process of acquiring knowledge. Educational activity primarily involves the presentation of material by the faculty to students who are learning about the subject matter. The material being studied is fundamentally well-known material. The activities known as teaching and training are included in this category.
- the action or process of educating or of being educated; also, a stage of such a process
- the knowledge and development resulting from an educational process
- the act or process of imparting or acquiring general knowledge, developing the powers of reasoning and judgment, and generally of preparing oneself or others intellectually for mature life
- the field of study that deals mainly with methods of teaching and learning in schools
- the act or process of imparting or acquiring particular knowledge or skills, as for a profession
- the result produced by instruction, training, or study
- the act or process of educating or being educated
- the knowledge or skill obtained or developed by a learning process

History of Education

The history of education is both long and short. In 1994, Dieter Lenzen, president of the Freie Universität Berlin and an authority in the field of education, said "education began either millions of years ago or at the end of 1770". This quote by Lenzen includes the idea that education as a science cannot be separated from the educational traditions that existed before it. Education was the natural response of early civilizations to the struggle of surviving and thriving as a culture. Adults trained the young of their society in the knowledge and skills they would need to master and eventually pass on. The evolution of culture, and of human beings as a species, depended on this practice of transmitting knowledge. In pre-literate societies this was achieved orally and through imitation. Story-telling continued from one generation to the next. Oral language developed into written symbols and letters, and the depth and breadth of knowledge that could be preserved and passed on soon increased exponentially. When cultures began to extend their knowledge beyond the basic skills of communicating, trading, gathering food, religious practices, etc., formal education, and schooling, eventually followed. Schooling in this sense was already in place in Egypt between 3000 and 500 BC.

History of education in India

India has a long history of organized education. The Gurukul system of education is one of the oldest on earth, and was dedicated to the highest ideals of all-round human development: physical, mental and spiritual. Gurukuls were traditional Hindu residential schools of learning, typically the teacher's house or a monastery. Education was free, but students from well-to-do families paid Gurudakshina, a voluntary contribution, after the completion of their studies.
At the Gurukuls, the teacher imparted knowledge of Religion, Scriptures, Philosophy, Literature, Warfare, Statecraft, Medicine, Astrology and History (the Sanskrit word "Itihaas" means History). The first millennium and the few centuries preceding it saw the flourishing
of higher education at the Nalanda, Takshashila, Ujjain and Vikramshila universities. Art, Architecture, Painting, Logic, Grammar, Philosophy, Astronomy, Literature, Buddhism, Hinduism, Arthashastra (Economics and Politics), Law, and Medicine were among the subjects taught, and each university specialized in a particular field of study. Takshashila specialized in the study of medicine, while Ujjain laid emphasis on astronomy. Nalanda, being the biggest centre, handled all branches of knowledge, and housed up to 10,000 students at its peak. British records show that education was widespread in the 18th century, with a school for every temple, mosque or village in most regions of the country. The subjects taught included Reading, Writing, Arithmetic, Theology, Law, Astronomy, Metaphysics, Ethics, Medical Science and Religion. The schools were attended by students representative of all classes of society. The current system of education, with its western style and content, was introduced and founded by the British in the 19th century, following recommendations by Macaulay. Traditional structures were not recognized by the British government and have been in decline since. Gandhi is said to have described the traditional educational system as a beautiful tree that was destroyed during British rule.

Assessment of Education

Assessment is the process of documenting, usually in measurable terms, knowledge, skills, attitudes and beliefs.

History of assessment

The earliest recorded example of academic assessment arose in China in 206 BC, when the Han dynasty sought to introduce testing to assist with the selection of civil servants. The objectivity of the assessment was questionable (it being oral and still subject to the whims of the assessors), but it was the first example of introducing merit into the selection process in place of favouritism.
In 622 AD the Tang dynasty administered formal written exams to candidates for the civil service; these exams lasted for several days and had a pass rate of 2%, and successful candidates were then subjected to an oral assessment by the Emperor. In Europe, tests were used during the Middle Ages to aid the selection of priests and knights, and school children were tested for their knowledge of the catechism. Oral exams were used to assess
knowledge, and skills demonstrations were used to measure practical abilities. The University of Paris first introduced formal examinations during the 12th century. These exams were theological oral disputations. Questions were known in advance, requiring students to memorise and regurgitate answers. In the 1740s, Cambridge University began using (oral) examinations to compare students, similar to the earlier Chinese tests. During the 18th century, Cambridge and Oxford began testing students' mathematical abilities using written tests, and thereafter the use of paper for assessment spread to all subjects. The United States introduced formal written examinations in the 1830s in an attempt to reduce the subjectivity of assessment. Horace Mann introduced written tests in the Boston Public Schools to compare school performance. However, the United States' main contribution to the history of testing came during the First World War, when the US Army introduced large-scale IQ testing to assign massive numbers of recruits to positions within the Army. The Army Alpha, as it was known, consisted of multiple choice questions and was administered to over two million recruits.

Types

Assessments can be classified in many different ways. The most important distinctions are: (1) formative and summative; (2) objective and subjective; (3) criterion-referenced and norm-referenced; and (4) informal and formal.
Formative and summative

There are two main types of assessment:
Summative Assessment - Summative assessment is generally carried out at the end of a course or project. In an educational setting, summative assessments are typically used to assign students a course grade.
Formative Assessment - Formative assessment is generally carried out throughout a course or project. Formative assessment, also referred to as educative assessment, is used to aid learning. In an educational setting, formative assessment might involve a teacher, a peer, or the learner providing feedback on a student's work, and would not necessarily be used for grading purposes.

Summative and formative assessment are referred to in a learning context as "assessment of learning" and "assessment for learning" respectively. A common form of formative assessment is diagnostic assessment. Diagnostic assessment measures a student's current knowledge and skills for the purpose of identifying a suitable program of learning. Self-assessment is a form of diagnostic assessment which involves students assessing themselves. Forward-looking assessment asks those being assessed to consider themselves in hypothetical future situations.

Objective and subjective

Assessment (either summative or formative) can be objective or subjective. Objective assessment is a form of questioning which has a single correct answer. Subjective assessment is a form of questioning which may have more than one correct answer (or more than one way of expressing the correct answer). There are various types of objective and subjective questions. Objective question types include true/false, multiple choice, multiple-response and matching questions. Subjective questions include extended-response questions and essays. Objective assessment is becoming more popular due to the increased use of online assessment (e-assessment), since this form of questioning is well-suited to computerisation.

Criterion-referenced and norm-referenced

Criterion-referenced assessment, typically using a criterion-referenced test, as the name implies, occurs when candidates are measured against defined (and objective) criteria. Criterion-referenced assessment is often, but not always, used to establish a person's competence (whether s/he can do something). The best known example of criterion-referenced assessment is the driving test, when learner drivers are measured against a range of explicit criteria (such as "Not endangering other road users").
Norm-referenced assessment (colloquially known as "grading on the curve"), typically using a norm-referenced test, is not measured against defined criteria. This type of assessment is relative to the student body undertaking the assessment. It is
effectively a way of comparing students. The IQ test is the best known example of norm-referenced assessment. Many entrance tests (to prestigious schools or universities) are norm-referenced, permitting a fixed proportion of students to pass ("passing" in this context means being accepted into the school or university rather than reaching an explicit level of ability). This means that standards may vary from year to year, depending on the quality of the cohort; criterion-referenced assessment does not vary from year to year (unless the criteria change).

Informal and formal

Assessment can be either formal or informal. Formal assessment usually involves a written document, such as a test, quiz, or paper, and is given a numerical score or grade based on student performance. Informal assessment, by contrast, does not contribute to a student's final grade. It usually occurs in a more casual manner, and may include observation, inventories, participation, peer and self evaluation, and discussion.

Standards of quality

The considerations of validity and reliability are typically viewed as essential elements for determining the quality of any assessment. However, professional and practitioner associations have frequently placed these concerns within broader contexts when developing standards and making overall judgments about the quality of any assessment as a whole within a given context.

Testing standards

In the field of psychometrics, the Standards for Educational and Psychological Testing
place standards about validity and reliability, along with errors of
measurement and related considerations under the general topic of test construction, evaluation and documentation. The second major topic covers standards related to fairness in testing, including fairness in testing and test use, the rights and responsibilities of test takers, testing individuals of diverse linguistic backgrounds, and testing individuals with disabilities. The third and final major topic covers standards related to testing applications, including the responsibilities of test users,
psychological testing and assessment, educational testing and assessment, testing in employment and credentialing, plus testing in program evaluation and public policy.

Evaluation standards

In the field of evaluation, and in particular educational evaluation, the Joint Committee on Standards for Educational Evaluation has published three sets of standards for evaluations. The Personnel Evaluation Standards was published in 1988, The Program Evaluation Standards (2nd edition) in 1994, and The Student Evaluation Standards in 2003. Each publication presents and elaborates a set of standards for use in a variety of educational settings. The standards provide guidelines for designing, implementing, assessing and improving the identified form of evaluation. Each of the standards has been placed in one of four fundamental categories, to promote educational evaluations that are proper, useful, feasible, and accurate. In these sets of standards, validity and reliability considerations are covered under the accuracy topic. For example, the student accuracy standards help ensure that student evaluations will provide sound, accurate, and credible information about student learning and performance.

Validity and reliability

A valid assessment is one which measures what it is intended to measure. For example, it would not be valid to assess driving skills through a written test alone. A more valid way of assessing driving skills would be through a combination of tests that help determine what a driver knows, such as a written test of driving knowledge, and what a driver is able to do, such as a performance assessment of actual driving. Teachers frequently complain that some examinations do not properly assess the syllabus upon which the examination is based; they are, effectively, questioning the validity of the exam. Reliability relates to the consistency of an assessment.
A reliable assessment is one which consistently achieves the same results with the same (or similar) cohort of students. Various factors affect reliability – including ambiguous questions, too many options within a question paper, vague marking instructions and poorly trained markers.
A good assessment has both validity and reliability, plus the other quality attributes noted above for a specific context and purpose. In practice, an assessment is rarely totally valid or totally reliable. A ruler which is marked wrongly will always give the same (wrong) measurements. It is very reliable, but not very valid. Asking random individuals to tell the time without looking at a clock or watch is sometimes used as an example of an assessment which is valid but not reliable. The answers will vary between individuals, but the average answer is probably close to the actual time. In many fields, such as medical research, educational testing, and psychology, there will often be a trade-off between reliability and validity. A history test written for high validity will have many essay and fill-in-the-blank questions. It will be a good measure of mastery of the subject, but difficult to score completely accurately. A history test written for high reliability will be entirely multiple choice. It is not as good at measuring knowledge of history, but can easily be scored with great precision.

Controversy

The assessments which have caused the most controversy are high school graduation examinations, which first appeared in support of the now-defunct Certificate of Initial Mastery and which can be used to deny diplomas to students who do not meet high standards. Critics argue that one measure should not be the sole determinant of success or failure. Technical notes for standards-based assessments such as Washington's WASL warn that such tests lack the reliability needed to use scores for individual decisions, yet the state legislature passed a law requiring that the WASL be used for just such a purpose.
Others, such as Washington State University's Don Orlich, question the use of test items far beyond the standard cognitive levels for the ages being tested, and the use of expensive, holistically graded tests to measure the quality of both the system and individuals for very large numbers of students. High-stakes tests, even when they do not invoke punishment, have been cited for causing sickness and anxiety in students and teachers, and for narrowing the curriculum towards test preparation. In an exercise designed to make children comfortable about testing, a Spokane, Washington newspaper published a picture drawn by a student who, asked to show what she thought of the state assessment, drew a monster that feeds on fear. This, however, is thought to be acceptable if it increases student learning outcomes.
Standardized multiple choice tests do not conform to the latest education standards. Nevertheless, they are much less expensive, less prone to disagreement between scorers, and can be scored quickly enough to be returned before the end of the school year. Legislation such as No Child Left Behind also defines a school as failing if it does not show improvement from year to year, even if the school is already successful. The use of IQ tests for educational decisions has been banned in some states, and norm-referenced tests have been criticized for bias against minorities. Yet the use of standards-based assessments to make high-stakes decisions, with the greatest impact falling on low-scoring ethnic groups, is widely supported by education officials because such assessments show the achievement gap which is promised to be closed merely by implementing standards-based education reform. Many states are currently using testing practices which have been condemned by dissenting education experts such as Fairtest and Alfie Kohn.

Evaluation in India (except in schools)

The grading system in India varies somewhat as a result of its being a large country. The most predominant form of grading is the percentage system. An examination consists of a number of questions, each of which gives credit. The sum of credit for all questions generally adds up to 100. The grade awarded to a student is the percentage obtained in the examination, and the percentage across all subjects taken in an examination is the grade awarded at the end of the year. The percentage system is used at both the school and university levels. Some universities also use a grading system and a CGPA on a 10- or 4-point scale. Notably, all the IITs, BITS Pilani (Pilani and Goa campuses) and most NITs use a 10-point GPA system. However, the grades themselves may be absolute (as in the NITs), exclusively relative (as in BITS Pilani), or a combination of absolute, relative and/or historic grading, as in some IITs.
India has several universities and recognized school boards, which makes an objective comparison of the percentage grades awarded in one examination with those awarded in another difficult, even for examinations at the same level. At the school level, percentages of 80-90 are considered excellent, while above 90 is exceptional and uncommon. At the university level, however, percentages between 70-80 are considered excellent and are quite difficult to obtain. It should be pointed out that the percentage of marks awarded varies from one university to another, which makes direct comparison of percentages obtained at different universities difficult.
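The percentage computation described above amounts to a simple sum-and-scale calculation. The sketch below is illustrative only: the subject marks are invented, and the division cut-offs follow the grading table given in this document, though actual cut-offs vary between boards and universities.

```python
# Illustrative sketch of the percentage grading system described above.
# Marks and division cut-offs are hypothetical examples; real cut-offs
# differ between institutions.

def percentage(marks, max_marks=100):
    """Overall percentage: total credit earned across all subjects, scaled to 100."""
    return 100.0 * sum(marks) / (max_marks * len(marks))

def division(pct):
    """Map a percentage to a class/division, using the bands from the table in this document."""
    if pct >= 70:
        return "First Division with Honours/Distinction"
    if pct >= 60:
        return "First Division"
    if pct >= 50:
        return "Second Division"
    if pct >= 40:
        return "Third Division"
    return "Fail"

marks = [72, 65, 81, 58, 69]  # hypothetical marks in five subjects, each out of 100
pct = percentage(marks)
print(f"{pct:.1f}% -> {division(pct)}")  # 69.0% -> First Division
```

Because the grade is simply the percentage, two students with the same total marks always receive the same grade, regardless of how the cohort performed; this is what makes the percentage system criterion-referenced rather than norm-referenced.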
Official Grading System (for all Government/Autonomous/Deemed Indian Universities except schools)

Percentage Range   Grade   U.S. Grade   Class/Division
80-100%            A+      4            First Division with Honours/Distinction
75-79%             A       3.75-3.95    First Division with Honours/Distinction
70-74%             A-      3.5-3.7      First Division with Honours/Distinction
65-69%             B+      3.25-3.45    First Division
60-64%             B/B-    3-3.2        First Division
55-59%             C+      2.5-2.9      Second Division
50-54%             C/C-    2-2.4        Second Division
45-49%             D+      1.5-1.9      Third Division
40-44%             D/D-    1-1.4        Third Division
Less than 40%      F       0            Fail

Evaluation in schools

Until 1994, schools up to the tenth standard followed the marking, or ranking, system, in which each student's overall marks were calculated and his or her rank relative to others was released. However, the system had some accompanying problems contrary to the interest of students, such as:

- unhealthy competition
- increased parental pressure
- heightened tension levels in students

Introduction of Grading

Owing to the inherent flaws in the ranking system, and the persistent demand from academic and social circles that the outdated and anti-student system be replaced in the best interest of students, the government decided to change the
system of evaluation from ranking to grading on a stage-by-stage basis, which was rolled out from 1994 and completed by 2003. However, it is not known whether the change has produced the expected results. This study aims to analyse whether the introduction of the grading system has achieved the major objectives that justified its introduction, such as:

- alleviation of students' tension
- a change in unhealthy competition
- better evaluation of students
- a more scientific approach

It has been more than two years since the system of grading was implemented, and, as advised by academic circles, a study of how far grading has achieved its objectives is relevant and an evaluation of its effectiveness is due.
Review of literature - Journals

This chapter presents some of the studies and journals that have been published; it also helps the researcher to concentrate on the main aspects and to avoid duplication.

Grading and Assessment of Student Writing (Focus on Teaching)
Rebecca B. Worley and Marilyn A. Dyrud

ALTHOUGH GRADING is undoubtedly at or near the top of our "to do" lists, it is not the recurrent nature of this task that prompted the topic for this column. As the articles in this journal testify, business communication as a discipline has been changing. Most of us are not teaching the same course in the same way that we did 15 or 10 or even five years ago. But are we still grading in the same way? If content, teaching methods, and delivery systems have changed, has grading also changed? And if so, how? This column addresses those questions. As Joel Bowman discusses in the first article of this column, technology complicates the submission and return of assignments in distance education courses. Although software provides some solutions, marking papers and providing meaningful feedback to students in an online environment proves to be more complicated and time consuming than the traditional pen-on-paper method. As is so often the case, however, the challenge of technology encourages us to re-examine how we assess student writing. In the second article, Marilyn Dyrud has done just that, advocating a holistic method for evaluating assignments. Reflecting the evaluation procedures used in business for determining the success of a written communication, the system assigns numbers (0, 1, or 2) for unacceptable, acceptable, and excellent work respectively. As the article explains, this system not only reduces the time commitment of instructors, but also encourages students to take ownership of their written work.
Although her approach differs somewhat, Nancy Schullery also adopts the holistic approach to grading, using seven foundation concepts that are essential to every business communication, regardless of genre. Such assessment, she argues,
more clearly simulates the response of supervisors, clients, customers, and other readers in the business world. LeeAnne Kryder, in the final article of this column, explains the rubrics she has developed to assure some degree of consistency in grading among instructors and teaching assistants in various sections of the same writing course. She finds these rubrics particularly useful for evaluating individual student performance in group projects. If there is a consistent thread among these four articles, it is this: the quality of the evaluation is what matters, not the amount of time instructors spend grading. Regardless of the methods they use, all of the authors represented here are searching for ways to provide meaningful and impartial evaluation of student writing that encourages learning and rewards excellence.

It's Not Easy Being Green: Evaluating Student Performance in Online Business Communication Courses
Joel P. Bowman, Western Michigan University, Kalamazoo

ANYONE WHO HAS SEEN the popular children's TV show "Sesame Street" knows that Kermit the Frog worried about being green. Although Kermit had a different metaphor in mind, those of us teaching business communication in an online format also face the issue of being "green." Classes taught over the Internet are relatively new, and online instructors are having to learn--often the hard way--how to take full advantage of electronic delivery to provide good instruction and effective feedback on student work. Until the advent of e-mail that permitted attaching formatted files, students submitted work for evaluation in essentially the same way: as paper documents to be marked and returned by the instructor. Most of us currently teaching business communication learned what we know about marking student papers from our instructors, who typically used a combination of marginal notations and standard proofreader's marks.
As instructors, we probably use essentially the same basic approach to provide feedback on student work and justify our evaluation of that work.
Until the advent of the Internet, the procedures were basically the same for those completing courses at a distance. Assignments arrived on paper and were marked and returned, whether by mail (typical of "correspondence" courses) or by fax, as video-based delivery became increasingly common. With the advent of the Internet and submission of work by e-mail attachment, everything changed.

Online Class Structure and Coursework

Over the past three years I have taught eight sections of a standard business communication course using Web-based Internet technology for delivery of information. While most of my students have been within easy driving distance of campus and were also taking classes on campus, a significant number have been several hundred miles away. A few have been several thousand miles away. Many were working full time and elected to take the online version of the class because of the flexibility it afforded. Some enrolled out of curiosity, knowing that online courses were proliferating. A few enrolled expecting it to be easier than the traditional version of the course because of the absence of regularly scheduled classes. In the online sections, all my students submit most of their assignments as formatted documents (in Microsoft Word) attached to e-mail messages. The exceptions are a videotaped presentation, which may be hand delivered or mailed, and a PowerPoint presentation, which is also submitted as an e-mail attachment. With one exception, the assignments are all completed individually. The exception is a group report requiring collaboration. An electronic conference serves as a substitute for classroom discussion. The conference affords students the opportunity to ask questions about course materials and assignments and to discuss possible solutions to the assigned cases.
A majority of the assignments are based on cases requiring students to write short documents using an appropriate format and correct spelling, grammar, mechanics, and business writing style. These are submitted twice, once in "draft" form and then as a revision.

Planning and the Vagaries of Technology

In every semester that I have taught online classes, technical difficulties have created problems. Servers have a tendency to fail without notice, and the class
conference and/or e-mail service may not be available for hours--and, on occasion, for days. E-mail server software may strip the contents from attachments, so that while a document is sent and received, the contents are missing. Such challenges present one of the principal differences in the evaluation of online students. To be successful in an online environment, both faculty and students need greater flexibility and perseverance. Online classes, like all courses taught at a distance, require more attention to planning, so that students know exactly what is expected of them at the beginning of the semester and can plan around their work schedules and complete the assignments by their due dates. For students at a distance, an old-fashioned technology, the US Postal Service, can present challenges. Students find that delivery by overnight mail and next-day service by Airborne Express can take a week to 10 days. (So far, at least, UPS and FedEx have provided the most reliable delivery of student videos.)

Online Evaluation of Written Work

Before preparing their solutions to the cases, students use the class conference to ask questions and post sample solutions for my comments. Students earn points for their participation in the conference. I evaluate entries for class relevance (students may use the conference to discuss other topics of interest), spelling, grammar, mechanics, and formatting. I also award extra credit to students who are the first to find and report errors in my postings. At regular intervals, I send each student a record of his or her postings, indicating problems with the postings that resulted in a loss of points for the evaluation period. Students may compensate for problems in an evaluation period by increasing their level of participation in the next. For the short documents, online discussion of the cases and their possible solutions begins the week before the case is due. Students submit the drafts of their cases on Monday.
I return the marked drafts on Tuesday. The final versions of the assignments are due Friday, and I return those by Sunday to ensure that students know how well they have done before submitting the drafts of their solutions to the next case.
The feedback on formatted documents, both the draft and final versions of solutions to the cases, requires a slightly different strategy from that most of us learned to use for documents submitted on paper. On paper, if a student has a problem with comma usage, it is a simple matter to mark problems with usage and draw lines to general comments about the need to review and what to look for in the review. This isn't so easy to do with documents that never see paper. Microsoft Word (and other word processing programs) allow for highlighting and changing text color. It is also possible to use a "draw" function to show connections between related elements, but imitating the kind of feedback possible with pen-on-paper is both time consuming and awkward when done electronically. I elected not to use the "track changes" function in Microsoft Word, preferring to save copies of the drafts returned with my comments for comparison with the final submissions. Much of the feedback on the draft was commentary about language usage (such as explaining why a modifier "dangled"), business writing style (such as commenting on message structure and tone), and the need to follow special instructions for the case (such as using a numbered list if the directions said to do so). Tracking the changes did not adequately allow for the revisions most students were having to make between the draft and the final versions of their solutions and actually added to the difficulty of evaluating the final copies. Providing this kind of running commentary on the cases has both advantages and disadvantages. The principal advantage is that documents submitted electronically encourage instructors to provide more comprehensive, explanatory feedback than is typical for paper submissions. 
The principal disadvantage is that providing such feedback for each student individually takes more time than is required for marking papers and returning them in class, where the explanation for the cryptic comments on paper can be provided orally to all the students at once.

Bottom Lines

A variety of studies have shown no significant difference in learning based on the method of educational delivery. Even so, teaching and learning in an online environment are not the same as teaching and learning in a traditional classroom environment. The skills required for success may overlap, but they are not identical.
Online classes tend to have higher attrition. Students enroll but drop the course early in the semester or simply stop doing the work. The mixture of highly motivated, often nontraditional students and those who enrolled expecting the course to require less time and effort typically results in a bimodal distribution of grades, with a majority of the students doing very well or very poorly. Student success in online classes correlates highly with willingness to learn from reading and to participate in electronic discussion of course concepts. Success also depends on student comfort level with the technology. At least a few students who insist that they are comfortable with sending and receiving e-mail attachments when they enroll discover that, while they may have sent an attachment or two, they do not know how to retrieve attachments sent to them and frequently have difficulty sending attachments as well. In the history of education, online classes are new, and we have yet to determine how to take full advantage of the technology. The natural inclination is to try to imitate the classroom environment. The extraordinary family therapist Virginia Satir referred to this inclination as "the lure of the familiar." Even when we experiment, we tend to do what we have always done. Because we are most familiar with the traditional classroom setting, we tend to assume that it is the "gold standard" for educational delivery and seek to replicate what we see as the advantages of that setting even as we use new technology. Those of us who teach so-called "upper division" classes, however, might pause to consider how much our students actually learned in their previous traditional classes. It may be time to reinvent the wheel.
As changing cultural needs continue to push us in the direction of "any time, anywhere" delivery of education, what we are learning now about how to evaluate student performance in an online environment may well provide the foundation for new strategies of teaching and learning.

Preserving Sanity by Simplifying Grading
Marilyn A. Dyrud
Oregon Institute of Technology, Klamath Falls
ONCE--AND ONLY ONCE--I calculated how many papers I graded during an academic year. With four writing classes averaging 25 students each, the total was a staggering 4,500 papers. And, since all of my writing classes are process-oriented, the real total was at least double that. I was, I thought, a very thorough and conscientious grader, circling mechanical errors, rewriting wretched sentences, and carefully marking each mistake as awk, pn, sp, ro, frag, and a plethora of other hieroglyphs. I was also putting in 10-hour workdays and, on occasion, waking up at 4 a.m. so I could finish grading before classes met. Obviously, things needed to change. My epiphany occurred during a discussion with a sage professor in my department. He used a very different grading system and recommended a then-popular technical writing text, Technical Writing: Process and Product, by Charles R. Stratton of the University of Idaho, which discusses the types of technical writing generated in industry. Stratton uses a three-tiered scale: about 20 percent of industrial documents are deemed excellent, about 20 percent are unacceptable, and the remaining 60 percent are considered adequate. Excellent documents yield promotions and pay boosts for their writers; unacceptable writing results in crash writing-improvement classes or termination. Acceptable documents are salvageable but require, in some cases, extensive revisions, which, of course, translates to money in business. It was a small step from Stratton's book and further discussions with colleagues to a new and improved grading system, one that would be more efficient, help prepare students for the writing they would produce professionally, and encourage revision. Thus was born 0-1-2, a system which I, and many of my communication colleagues, have now incorporated in virtually all classes that require written documents.
The 0-1-2 system encourages instructors to quit editing and start evaluating and has three primary virtues: simplicity, efficiency, and equity.

Simplicity

In the olden days, before 0-1-2, I tended to edit and revise my students' papers, accompanied by a rather unwieldy grading sheet. While the students liked the expository essay grading sheet primarily because it looked organized and "identified" all of their mistakes, from an instructor's perspective, it was fraught with peril. Under
mechanics, for example, 7 points are allowed for spelling. But what if a student made 10 errors in their paper? Does that student owe points? Are 6 points adequate to assess logic in a persuasive paper? How much "freshness" can we expect from a newly minted high school graduate? What if a student submitted a paper that was mechanically and structurally excellent but devoid of meaning? Using this grading form, an instructor can subtract only 8 points; this well-written but vapid paper might earn an "A." Moreover, the instructor is compelled to use at least 12 different abbreviations as comments and needs to take valuable class time to explain their meanings. Although we all also wrote comments in margins and at the end of papers, the abbreviations were confusing to the students. Using 0-1-2, most of these problems are non-existent because instructors read the papers holistically, which results in simplified grading criteria and fewer editing remarks. The figure shows my current grading criteria handout for students in my business correspondence class. Criteria will vary according to class; a business writing class, for example, may place more emphasis on audience and formatting than a composition class. The 0-1-2 stands for unacceptable, acceptable, and excellent. Students who receive 0s and 1s may revise their work as often as they wish within a specified time. In my more than a decade of using this system, I have witnessed much better writing as a result of revision, and the students willingly revise because everyone has the potential to earn an "A".

Efficiency

In addition to downsizing assessment criteria, I have also quit editing and revising my students' papers. In lieu of circling and writing, I now make a simple checkmark next to a line that has a mechanical error. The student's job is to find the error and correct it. It is, after all, the student's paper, not mine, and reduced editing empowers students to take possession of their own writing.
Just this simple change has resulted in massive time savings. In the past, using a grading form and editing comments, it took me about 90 minutes to grade a set of
business letters; now, I spend less than an hour. To score longer reports, I used to spend about 45 minutes per report; now, it takes about 20. While 0-1-2 speeds up the process, it does not reduce the quality of evaluation. My students still receive ample feedback for improvement. It is, in fact, a form of holistic assessment, an evaluation system widely used to score student work: the Educational Testing Service uses holistic methods to assess SAT and GRE exams, and the State of Oregon, which has significantly revised K-12 standards, uses it to score for minimal competencies. Furthermore, numerous studies on holistic assessment praise its reliability and note its versatility.

Equity

A third virtue of 0-1-2 is equity. With so few points available, there is little room for instructor larceny, which, sadly, certainly does occur. In a traditional 100-point system, it is easy to mark down a paper that, for example, might disagree with the instructor's political philosophy. But with only 2 points available, arbitrary deductions cannot occur.

Student Satisfaction

Student complaints are also minimized. With a traditional system, students could quibble about a point or two; but with this system, two points is a whole assignment! Since I have implemented 0-1-2, I receive virtually no student complaints. Quite the contrary: students like this system because they know that they can correct their errors or polish their style. According to comments on my quarterly student evaluation forms, satisfaction is high. Responding to the question "What did you like about this course?", my students obviously appreciated the revision opportunity:

* "It is great that you allow rewrites because you learn a lot from your mistakes. Keep it the same so others will get as much from this class as I did."
* "I liked having the opportunity to do rewrites."
* "Rewrite options are nice."
* "Opportunity to rewrite papers; fair chance to get the grade you desire."
Conclusion

A simplified grading system such as 0-1-2 offers many benefits to both students and instructors. For students, it offers a chance to improve their work and allows them to prepare for the work world, where review cycles are common. It allows them to debunk the myth that once something is written, it is set in granite. Professional writers, both scholarly and industrial, revise their work many times prior to publication. It also helps students develop their editorial skills and, hopefully, establishes solid work habits. For instructors, 0-1-2 offers the luxury of more time, substantially reducing the many hours of evaluation required for writing classes. Those extra hours can be applied to other tasks, such as course preparation or professional development. More importantly, though, it changes the instructor's role from judge to coach, since the primary goal is to produce an excellent piece of work. As Toby Fulwiler, University of Vermont, explains, his role as a writing teacher has "little to do with teaching students about semicolons, dangling modifiers, or predicate nominatives and a lot to do with changing their attitude towards writing in general so they would care about it and maybe learn to do it better".

A Holistic Approach to Grading
Nancy M. Schullery
Western Michigan University, Kalamazoo

TO GRADE students' accomplishments and assess learning, I use a holistic method modeled after that used in the "real world." When students enter the workplace, their ability to accomplish their assigned goals will determine their career success. Similarly, their ability to accomplish their purpose on an assignment should, to my mind, determine their grades. A document's intended purpose is accomplished, by definition, when the document is effective. Therefore, grades in my class are based largely on my judgment of the degree to which students effectively accomplish the purposes of their assigned papers.
I would like to describe, first, how I define effectiveness, and, second, how I implement a holistic approach to its assessment.
I define effectiveness as satisfying the following criteria: the writer has to understand the complexity of the context and approach it in a way that respects the viewpoint of the audience. The writer must demonstrate an appreciation of the audience's viewpoint by using language that the audience understands and by considering the relevant content in a way that relates to audience needs. The writer must arrange the content in a strategically functional way, using an organization that allows the reader (and any secondary audiences) to understand and accept the writer's goals. Ideally, the format/design will invite the reader in to read the rest of the document and allow the reader to make optimum use of the information. The content must include enough detail to help the reader understand the situation from the writer's perspective, yet include no irrelevant or prejudicial information. Thus, the writer's self-presentation must be positive, giving the impression of a competent, responsible, fair-minded business person speaking for a reasonable organization. These seven foundation concepts (italicized above) are so essential and tightly interwoven that they all must be incorporated into any written assignment, along with the standard conventions of American English. These fundamentals are applicable across the board, whether the student is writing a negative message, a persuasive message, a research report, or an application letter and resume, each of which has its own unique elements (as discussed in business communication textbooks). It is the successful implementation of both the foundation concepts and the unique elements that leads to effective accomplishment of purpose.

Effectiveness Criterion

My first task in grading is to make my use of the effectiveness criterion clear to the students. An "A" quality application letter or resume, for example, must be ready to mail to the employer, and should appear to have a good chance of success.
Memos and letters must be written as though the writer's reputation, job security, career advancement, and potential pay raises depend on the cumulative effect of the writer's effectiveness in attaining his/her purpose with each document. Students are to assume that each paper will be judged by that paper's intended audience. In other words, when I do the grading, I will look at their papers through the eyes of their workplace supervisor, customer/client, colleague, or potential employer. As the assigned audience would react to the writer's document, so I also will react.
How do I do that? First, I make a quick preliminary judgment. For example, we are told that prospective employers typically spend only 15-30 seconds reading any single resume before making a preliminary (and sometimes final!) decision. I strive for such a quick initial reading. My preliminary reaction may range from "this is pretty good" to "oh, s/he's missed the point entirely." This initial judgment leads to an initial valuation as an "A," "B," etc. grade, which I note in pencil and expect to modify somewhat after a second, more careful analytical reading. The second reading is where the real work is done. I identify, from the perspective of the target audience, any of the seven fundamental concepts that the student has implemented either particularly well or poorly. The good points are noted in the margin to reinforce learning. The problems are also noted in the margin, but always with specific instructions for improvement. Providing sufficient quality feedback to the students is critical. I want students to understand the reader's likely reaction to the document, so my comments are numerous. Some comments are cryptic, such as "positive?" or "unclear." Others offer brief instruction (not editing) toward a more appropriate direction, such as "rephrase in 'you' attitude" or "frame more positively." Whenever possible, I point out important readers' perspectives that the student has overlooked, such as, "could be read as defensive." Also, a few summary statements at the bottom are given, especially early in the semester, when positive reinforcement and motivation are most crucial. In keeping with my employer's-eyes role, I attempt in my comments to both give the student credit for what was done well and identify areas where improvement is needed or the purpose has not been accomplished. 
Thus, a paper that makes a very favorable first impression and has positive comments regarding all, or most, of the fundamental concepts, and only a few minor shortcomings, is judged excellent and the penciled-in grade is made an "A." One that is "pretty good" but doesn't make the "A" grade due to falling short on one or more of the criteria will receive a "B" based both on the seriousness of the errors and the strength of the correct part of the paper in relation to previous papers I have read. A "C" paper has potential, but lacks overall effectiveness due to some combination of missing ingredients. Those papers that either miss the point entirely, ignore the basic concepts or the key elements of the assignment, or display ignorance or carelessness
with the basics of English punctuation, spelling, or grammar are, fortunately, rather rare and earn a "D" very quickly.

Conclusion

Such a holistic approach contrasts with the use of a grading rubric, which is the more common path to efficiency in grading. It is my observation that rubrics tend to grow over time in response to students' creativity, becoming overly complex and losing sight of the big picture. More important, I am concerned that the typical rubric approach of points off for every error may be unfair and, possibly, do more harm than good. For example, a common problem is repeated point deductions for the same type of error (e.g., comma usage). To my mind, a single type of error repeated ten times is less objectionable than ten different errors. Further, sometimes a specific technical error really does little to reduce the effectiveness of the message. In contrast, the holistic method's focus on the larger goal, together with explicit recognition of students' strengths, is an effective motivational tool, providing students with a sense of building on a base toward greater mastery. While such grading is not particularly efficient, I find that it does not take too much more time than grading with a rubric. Of course, writing comments takes time. However, I believe that such comments--reasonably prioritized and including genuine positives--are both immediately helpful to the students and build goodwill that sets the stage for future instructional success. This grading process does not end with the individual papers; feedback is important also at the classroom level. As I return the papers, I give positive reinforcement to the class by briefly explaining how they have generally handled the concepts and showing a few exemplary papers in which the key elements were masterfully treated. These are shown without names, to avoid heaping praise on any individual student and inviting the suspicion of a teacher's favorite.
I also describe (but don't show examples of!) problems that seemed to be common. In this manner, the grading process is transformed from a pure chore into an extension of the teaching process, which, for me, makes it all worthwhile. Although such holistic grading may be criticized for not enforcing absolute standards with respect to details, I believe that its focus on the big picture helps motivate students to strive toward improvement in all aspects of the subject. The
method is akin to the industry practice of rewarding employees who do something well with an "atta boy." Industry asks employees to work toward goals, and then measures the degree of goal attainment in a performance evaluation (e.g., meeting sales targets or budget allocations), in effect awarding points rather than deducting points. It is my experience that students respond very favorably to any efforts that convincingly connect their classroom subject matter with their own present or future employment success. I recommend the method as one that helps motivate students both to learn the principles of business communication and to master its skills.

Endnote

The foundation concepts noted here are drawn from E. A. Hoger & N. M. Schullery (2001), Core Concepts for Business Communication, provided to instructors of the business communication course taught at the Haworth College of Business, Western Michigan University. The college does not mandate holistic grading.

Grading for Speed, Consistency, and Accuracy
LeeAnne G. Kryder
University of California at Santa Barbara

WITH AN AVERAGE OF 75 papers to grade for six or seven assignments per quarter, I have sought techniques to ensure efficiency and effectiveness in grading. While grading still isn't easy, I feel comfortable with several techniques I've adopted over the years.

Grading Rubrics

Most of my assignments in business communication are individually authored, but I also have at least one large collaborative report and collaborative oral presentation. In my "writing for accounting" course, I have two collaboratively written reports and oral presentations. I've developed rubrics for all these assignments, similar to those often used for lower-division freshman composition. These help ensure my consistency over the many hours of grading time, and they help speed my evaluation and commenting time.
My rubrics generally reflect five categories of effective business writing: organization and content, visual communication, concise and varied word choice, mechanics, and format (the particular genre conventions for the memo, letter, and report). For each category I determine the percentage of points allotted. Depending on what elements are being emphasized, points vary among categories in each assignment. For an assignment calling for a resume, application letter, and brief memo (explaining the job or organization targeted by the letter), I allot the following points:

* Assignment Content and Organization 2 pts
* Visual Communication 3 pts
* Word Choice/Tone 1 pt
* Grammar, Punctuation, Spelling 3 pts
* Format 1 pt

I used to agonize over whether a document was a "B" or a "B+"; now I focus attention on each category and let the numbers as totaled "tell me" if the document is a "B" or a "B+." Initially, the use of rubrics and numbers was foreign to me, but I needed a clearly articulated guide when I began to work with teaching assistants in the "large lecture" format. How could I ensure consistent treatment of student papers across three graders (two TAs and myself)? For each of the assignment papers, I developed a rubric and sent out a draft for comment from each TA. Discussions that followed seemed to clarify the assignment for them, so the rubrics helped to "get us on the same page" (literally). Perhaps because our students knew we were using common evaluation sheets, we seldom had student complaints about unfair grading. In fact, the final grade distribution assigned by the two TAs was very similar--and I believe the rubrics helped us achieve that. Although we haven't required all business communication teachers to follow the same rubric, I believe the use of a common rubric could further ensure consistency among classes--especially when we add two or three new teachers to the mix.
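The arithmetic behind this kind of rubric, totaling category points and letting the sum "tell" the letter grade, can be sketched in a few lines. This is a hypothetical illustration only: the category names are paraphrased from the list above, and the letter-grade cutoffs are invented for the example, not drawn from the author's practice.

```python
# Hypothetical sketch of rubric-based totaling; cutoffs are invented for illustration.

RUBRIC = {
    "content_and_organization": 2,   # maximum points per category,
    "visual_communication": 3,       # mirroring the allotments listed above
    "word_choice_tone": 1,
    "grammar_punctuation_spelling": 3,
    "format": 1,
}

def total_score(earned: dict) -> int:
    """Sum earned points, capping each category at its allotted maximum."""
    return sum(min(earned.get(cat, 0), cap) for cat, cap in RUBRIC.items())

def letter_grade(points: int) -> str:
    """Map a point total (out of 10) to a letter grade using example cutoffs."""
    for floor, grade in ((9, "A"), (8, "B+"), (7, "B"), (6, "C"), (0, "D")):
        if points >= floor:
            return grade
    return "D"

# Example paper: strong on content and format, a few deductions elsewhere.
paper = {"content_and_organization": 2, "visual_communication": 2,
         "word_choice_tone": 1, "grammar_punctuation_spelling": 2, "format": 1}
print(total_score(paper), letter_grade(total_score(paper)))  # prints: 8 B+
```

The point of the sketch is the design choice the author describes: the grader judges each category separately, and the mapping from total to letter grade is fixed in advance, so no agonizing over borderline grades is needed.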
The rubric also saves time in making summary comments because it forces me to be concise due to space limitations. I still annotate individual passages in student papers, but I now make brief marginal comments and then refer to the pre-printed rubrics. My large, collaborative project papers (reports or business plans) also benefit from a unique rubric. Because I want each author to receive my comments, I often type them and the assigned points into the Word template so that I can print out four or five copies. Certainly these rubric sheets can also be adapted for electronic comments and sent as attachments to an e-mail from teacher to student.

Status Reports and Performance Evaluations

As a way to manage my students' group projects and ensure fair assignment of grades, I require some self-reporting. Each student on a team (typically five students on each of five teams) takes a turn each week to send me an e-mail status report. Part one is a set of minutes to be distributed to everyone on the team; part two is just for me and includes the student's assessment of the project thus far. This is an excellent way for me to gather an "early warning" if there are problems in the group and to monitor individual students' attendance at team meetings. I also field questions and address concerns earlier in the quarter. Soon after I put the student teams together, I distribute the performance evaluation that each student is to use in evaluating his/her own work and the contributions of each team member. We discuss characteristics of successful teams and I use the performance evaluation to highlight individual team member responsibilities. The performance evaluation focuses on a variety of contributions needed for a successful collaborative report; certainly writing and editing skills are important, but so are computer skills, visual communication, willingness to take on responsibility, initiative, and so on.
Periodically throughout the quarter, I refer to this performance evaluation; then, on the day the students submit their collaborative document, I ask that they also submit the confidential performance evaluation. This, too, helps with final grading. I tell the students that the project's rubric helps me assign a grade to the document;
whether each student shares in it depends on the performance evaluations and status reports (sometimes individual grades are higher or lower, depending on that student's contribution to the whole). With the periodic status reports and the final performance evaluations, I feel comfortable in assigning collaborative project grades and the ultimate course grades.

Concluding Remarks

The rubrics I have developed for each of my business communication courses have been an enormous help in my grading practices. They have helped make my grading less time-consuming, and more consistent and effective.

Student Perceptions of Grades: A Systems Perspective (The Scholarship of Teaching and Learning)
Jane Strobino, Kimberlee Gravitz and Cathy Liddle
Academic Exchange Quarterly

Evaluation of student achievement in courses is one of the major responsibilities of faculty in the educational enterprise. Theoretically, grades are indicative of students' acquisition of knowledge in a particular content area, or the degree to which the student has learned. Faculty use a variety of rubrics for assigning grades and employ grading practices that may have evolved throughout their teaching careers. During the grading process, tension arises when there is a difference between the grade that the teacher assigns and the grade that the student expects (Goulden & Griffin, 1995). Such tension can have important consequences for the student. Some students, upon receipt of a grade lower than expected, may be discouraged from further investment in the learning process, or may be motivated to work harder. Additionally, grades may impact on students' self-esteem, self-worth, and self-efficacy (Edwards & Edwards, 1999; Goulden & Griffin, 1995; House, 2000). One way to clarify this tension of grading is to consider student perceptions of grades.
Viewing student perceptions of grades from a Systems Perspective, specifically the Person-in-Environment (PE) paradigm, allows for a greater understanding of factors that contribute to the tension (Germain & Gitterman, 1996).
The PE paradigm suggests that the person, in interaction with his/her environment, creates a personal perception of grades. These environments include not only the classroom and school environments, but also physical, social and psychological environments. The broader context of student environments will affect the actions taken by the student and thus his/her achievement in the classroom. Omitting consideration of these other environments when assessing student achievement can result in a grade that may reflect a biased assessment. Thus, from the PE paradigm, in order to understand student perceptions of grades, one must look at the student or person factors and the environments in which the student is a part. In the current higher education environment, students are viewed as consumers and the student body comprises a large percentage of adults who are considered non-traditional. Their perception of grades may be unique. Adult students, in particular, may want to take charge of their learning and may be at odds with course assignments and grading protocols. Unless the teacher and the student discuss any differences in expectations, tension will arise when the student is given a grade with which he/she does not agree.

Person Factors

According to the PE paradigm, students' perceptions of grades may be understood when person factors/characteristics are viewed in relation to the person's multiple environments. These person characteristics include: previous school experiences, student efforts to learn, motivation to learn, expectations regarding grades, and readiness or preparedness for the academic program. Previous school experiences of the students focus on the grading practices and the extent to which the student subscribed to them. A number of authors found that some students experienced grades as a reward or a punishment (Becker et al., 1968; Kadakia et al., 1998; Tropp, 1970).
These experiences relative to grading shape the perceptions and behaviors of the students. Over time, students become conditioned to the extrinsic rewards that grades convey and may continue this focus in graduate school (Edwards & Edwards, 1999). Kadakia et al. (1998) found that although MSW (Master of Social Work) students saw their colleagues as obsessive about grades, they failed to identify this characteristic in themselves. Weiner (2000) makes the point that
an outcome (i.e., a grade) may be explained in terms of both personal behaviors (the student studied or did not prepare for the test) and actions of an "other" (the teacher recognized the effort expended). Further, Weiner suggests that the Protestant Work Ethic is the value base upon which grades are perceived. And so, it seems that the experiences one has throughout the years may serve as an impetus for the efforts one expends and also for the motivation to engage in the learning process. According to Becker et al. (1968), efforts expended by students are important when considering their perceptions of the grades that are assigned. The degree of effort legitimizes the validity of the students' meaning of grades. This point is consistent with Brookfield (1986) and Tiberius and Billson (1991), who suggested that students have a responsibility in the learning enterprise and that responsibility contributes to the teacher's evaluation and assignment of the grade. Likewise, House (2000) viewed grading according to systems theory and saw the student inputs (efforts and responsibilities) as being mediated by the school environment to produce the output (grade). These authors concur with Rogers (1969), who suggested that learning results from an interaction of the student with both the course materials and the teacher. Efforts or responsibilities may be seen in the amount of time devoted to reading, seeking resources, consulting with others, and in actual writing and reflecting on the course content. Thus, it is important that student effort or student responsibilities are part of the context when looking at student perceptions of grades. Learning efforts may be uniquely linked to other person factors such as student motivations and grade expectations that may be illustrated by the consumer model.
This model views students as consumers of education and emphasizes satisfaction with their educational experience. The basis of student motivation is the monetary contribution made by the student in order to obtain a degree. If, indeed, students perceive grades as a right (according to the consumer model) and faculty take the traditional approach that students earn the grade, then tension surely will arise during the grading periods. The consumer model may inadvertently diminish the values of hard work and persistence, which are essential to learning (Snare, 1997). Also, this model projects a cynical viewpoint about students' motivations to learn and fails to recognize the other contexts for which students are responsible. Other motivating factors that have been suggested include the desire for recognition and for
knowledge so as to become expert in the field (House, 2000). It is important, then, to document student motivations and grade expectations when investigating student perceptions of grades.

Environments

Environments refer to the larger system contexts in which persons interact. These larger systems may support or hinder the efforts made by students in their academic endeavors. One environment of particular importance to learning is the classroom culture or climate. This environment includes transactions among students, and between students and faculty. A classroom environment may stimulate or deter students from engaging in learning experiences. Engagement in learning activities leads to greater student achievement and consequently higher grades. A safe environment results in higher student participation and student performance (Billson & Tiberius, 1991). Student access to the teacher as a means to personalize the student-teacher relationship also is said to impact on grades (Becker et al., 1968; Rogers, 1969; Somers, 1970). As that relationship develops, the teacher may be able to influence the student to engage in productive learning activities which result in higher grades. Another environment important to understanding student perceptions of grades is the student culture. Student culture includes knowledge/perceptions of the educational program and faculty as to strictness of grading requirements, grapevine information about faculty, and how to succeed. "Most colleges probably develop a student subculture that identifies tough graders and easy graders and often encourages students to go for the easy ones" (Walhout, 1997, p. 86). The student subculture places a negative spin on the student approach to learning. Perhaps learning and grades are viewed as separate, with grades serving as a means to an end that may or may not include knowledge acquisition.
An environment outside of school, yet intensely important to student perceptions of grades, is what might be called the student's personal environment. The personal environment includes the home situation, employment, culture, health status, finances, life stage, physical proximity/access to the school and faculty, and those
social environments that are meaningful to the student. This environment, because it places demands on the student's time, serves to reduce the amount of time available for students to use in academic pursuits. Consequently, although students may believe their completed assignments are well done given the time available, they may be graded unfavorably by the teacher. This discrepancy creates tension about the grade. In studying student perceptions of grades, it is essential for their personal environment to be included. A broader environment, namely the higher education system, has contributed both to grading protocols and grading practices. While the university establishes policies that mandate a grading system for teachers, there remains a wide variety of grading practices. This lack of a standard grading practice demands that students learn and adapt to variations among teachers. In some cases, students must choose the courses on which to focus their attention. Also, easy grading may be part of an entitlement bargain in which universities seek students to fill the classrooms and meet minority and diversity requirements. Wiesenfeld (1996) raises questions about the consequences of such a situation in terms of passing students who lack the expertise and skills when they graduate and enter professional employment. Thus, documenting the grading protocols that are used in the broader environment is important. They interact with student learning behaviors and lead to a construction of the student perception of grades. Given the literature on environments, any study about student perceptions of grades should necessarily include those contexts. Doing so may identify supports or barriers for student engagement and academic performance.
Meaning of Grades
Three studies were identified that focused specifically on the meaning of grades to students. Two involved undergraduate students and one involved graduate students in a School of Social Work. 
In their classic work, Becker et al. (1968) conducted a qualitative study of undergraduate students at a Midwestern university in order to understand students' academic work in the context of their other life experiences. The authors indicated that faculty were not usually aware of the demands on the students' time, including those from other classrooms and from their personal environments. Students believed that if they met faculty demands they could achieve
the desired grade. Such a perspective implies the external control of grades by faculty, and minimizes student academic efforts in affecting the grade. Nearly thirty years later, Goulden and Griffin (1995) also conducted their study at a Midwestern university and included undergraduate students. They addressed the different perceptions of teachers and students as one source of grade conflict. The underlying premise was that the meaning of grades differs between students and faculty and this difference is a source of conflict. Students in their sample were given two prompts from which to respond: What do grades mean to you; and Grades are like.... The results indicated that grades were viewed as a means of feedback in which a measure or judgment about student effort was given. Also, grades were seen as emotional triggers in which the student, as a person, was being judged. Lastly, grades were seen as motivators within the context of a reward and punishment system. Kadakia et al. (1998) conducted a survey of students in one Northeastern Graduate School of Social Work. Their focus was on grade expectations and locus of control. The results indicated that most students expected to receive high grades and identified colleagues as obsessed with grades. The authors note that "Overestimating academic performance in a profession that holds self-awareness as sacrosanct is paradoxical and unacceptable" (p. 10). While not focusing directly on the meaning of grades, the survey did inquire as to the importance of grades to personal beliefs about self and the importance of receiving an acceptable grade. The reconceptualization of grading, using the PE paradigm, provides direction for understanding student perceptions of grades. Both student viewpoints and the multiple environments in which these students interact are important to assess.
Summary
Grading is an area of tension between students and faculty that may be better understood from a Systems Theory point of view. 
The conceptual paradigm of PE Fit recognizes the importance of environments in shaping student engagement in the academic enterprise. Documenting student perceptions of grades will provide faculty with insights that can be used to reconsider rubrics for grading. Such insights can serve to reduce the grading tension between faculty and
students. Additionally, this information can have important implications for curriculum development and program planning. In terms of curriculum, the identification of student perceptions of grades (and grading practices) may enhance faculty discussions about the grading systems that should be in place. Also, such discussions may include the importance of consistent expectations for student achievement across courses as well as within different sections of a course. Regarding program planning, documenting student perceptions and being responsive to these in light of the environmental contexts may serve to heighten the reputation of the school/program and thereby have an indirect effect on recruitment of new students. This model also recognizes the importance of the student body profile, which includes both traditional and non-traditional students who may have unique perceptions about grades. Finally, the authors look forward to extending the ideas presented in this paper by actually conducting research that addresses student perceptions of grades.
Davis, J. Thomas. "Fairness in Grading." Academic Exchange Quarterly 5.1 (Spring 2001): 64. British Council Journals Database. Thomson Gale.
This article reviews the difficulties in assigning grades to student work, briefly reviewing highlights from the history of grading practices. It concludes with the suggestion that, given the impossibility of comparing grades either within an institution or between institutions, instructors should base grades on a measure of an individual student's progress during a course. One of the most important duties of a faculty member at the end of a term is determining the final grade that individual students will receive in the class. Difficult as this process is, it is made more difficult still not only by having to determine the process for arriving at the final grade, but also by the various interpretations of that grade that will be made later by others. These assigned grades are designed to serve a variety of purposes. Dr. James S. Terwilliger wrote that grades serve three primary functions: administrative, guidance, and informational. He indicated that grades should be viewed only "as an arbitrarily selected set of symbols employed to transmit information from teachers to students, parents, other teachers, guidance personnel, and school administrators."(1)
However, unless the meaning and interpretation of the grades assigned are universally understood, the system, no matter how carefully designed and understood by the instructor awarding the grade, will not be an effective means of communication to others or over a period of time for cumulative evaluation. This is true even if the purpose of grading is more specifically defined--as in the following list by Professor James M. Thyne: "To ascertain whether a specified standard has been reached; To select a given number of candidates; To test the efficiency of the teaching; To indicate to the student how he (sic) is progressing; To evaluate each candidate's particular merit; and To predict each candidate's subsequent performance."(2) In the development of an individual or institutional grading policy, it is important that a decision be made as to the reason for the assessment. If the reason is merely to have twelve grades at the end of the term, or to satisfy a departmental policy requiring that all work be graded, these requirements become ends in themselves, and the interpretation of the final assigned grade will become even more difficult. Even with a definite purpose beyond "institutional policy," it is extremely difficult to have a consensus as to how to arrive at a grade to properly evaluate the progress made by any individual student in a particular course. Dr. William L. Wrinkle wrote in 1947 of six interpretation fallacies that are made in understanding course grades. The number one fallacy that he listed in his book was the belief that anyone can tell from the grade assigned what the student's level of achievement was or what progress had been made in the class.(3) This fallacy is as widely believed and probably as correct today as it was when he wrote it in 1947. Even earlier, in a study published in 1912, Dr. Daniel Starch and Edward Elliott questioned the reliability of grades as a measurement of pupil accomplishment. 
Their study involved the mailing of two English papers to two hundred high schools to be graded according to the practices and standards of that school and its English instructor. The papers were to be graded on a scale of 1 to 100, with 75 being indicated as the passing grade. Teachers at one hundred forty-two schools graded and returned the papers. On one paper the grades ranged from 64 to 98, with an average of 88.2. On the other, the range was 50 to 97, with an average of 80.2. With more than
thirty different grades assigned and a range of more than forty points for the same paper, it is no wonder that the interpretation of assigned grades is extremely difficult.(4) Perhaps the earliest study on individual grading differences was done by Dr. F. Y. Edgeworth of the University of Oxford in 1889. Professor Edgeworth included a portion of a Latin prose composition in an article he wrote for the Journal of Education. He invited his readers to assign a grade to the composition and forward it to him. His only other instruction was that this composition was submitted by a candidate for the India Civil Service, that the work was to be graded as if the reader were the appointed examiner, and that a grade of 100 was the maximum possible. He received twenty-eight responses distributed as follows: 45, 59, 67, 67.5, 70, 70, 72.5, 75, 75, 75, 75, 75, 75, 77, 80, 80, 80, 80, 80, 82, 82, 85, 85, 87.5, 88, 90, 100, 100. In his conclusions Edgeworth wrote, "I find the element of chance in these public examinations to be such that only a fraction--from a third to two thirds--of the successful candidates can be regarded as quite safe, above the danger of coming out unsuccessful if a different set of equally competent judges had happened to be appointed."(5) The criteria for evaluation vary not only from institution to institution but from course to course within the same institution and from instructor to instructor of the same courses within the same institution. Since methods used by various instructors vary considerably, it becomes extremely difficult to read a student's transcript to determine the student's standing among others at the same institution or throughout the country at other institutions. The National Collegiate Athletic Association's Division I institutions voted down a requirement that student athletes maintain a standard grade point average in order to retain their eligibility to participate in collegiate sports from year to year. 
The major reason given was the difference in grading standards that existed between institutions and between courses and programs at the same institution. One difficulty is that the methods used in arriving at the final course grade are almost too numerous to enumerate. These include the averaging of all course grades made during the term, dropping the lowest one or two test marks, determining the
entire course grade on the basis of the final exam or one term paper, counting only the final course grades, grading on the basis of class average, and having only a written comment rather than course grade. Compounding the problem of interpretation of the grades indicated for students is that both standards and grades themselves vary over time. For example, during the decade of the 60's, many institutions began experimenting with a variety of grading systems, both institution-wide and within selected courses. This was the era of student protests, demonstrations, and student revolts on our college campuses. Institutions that changed their grade reporting system include both small, private institutions and those with longstanding Ivy League academic reputations. These innovative grading systems included allowing pass/fail grades in selected courses; replacing the traditional grades with "High Pass, Pass, Fail" or with Credit/No Credit; not counting failed but repeated courses in the grade point average; and "A, B, C, No Credit," with the NC not counting in the GPA. Additionally, changes had to be made in the system used to determine academic suspension, semester and graduation honors, and class rank. In many cases, since class rank had become virtually impossible to determine, it was left off the transcript entirely. Many institutions that made global changes in the recording of grades during the decade of the 60's have changed to a system that is based on the instructor's evaluation as measured by traditional grades. But the same problems of interpretation that existed earlier are still present, with the additional difficulty of interpretation of the transcript of a student who was enrolled during the transition period. 
For example, at the University of South Carolina, over a seven-year period (the time many part-time students need to complete their baccalaureate degree), a student's transcript would show course grades assigned under four different grading systems. The student would also have been subject to three different suspension and graduation honor criteria. Even where the items to be rated appear to be standardized, there can still be difficulty. In his book, The Pyramid Climbers, Vance Packard reproduces two report cards published in The New York Times Magazine. One report card was
for a kindergarten for four-year-olds and the other for evaluating executives in one of the largest corporations in the country. The first report card used a rating system of Very Satisfactory, Satisfactory, and Unsatisfactory. The items to be evaluated were: Dependability, Stability, Imagination, Originality, Self-expression, Health and vitality, Ability to plan and control, and Cooperation. The second report card used a rating of Satisfactory, Improving, and Needs Improvement. The items to be evaluated were: Can be depended upon, Contributes to the good work of others, Accepts and uses criticism, Thinks critically, Shows initiative, Plans work well, Physical resistance, Self-expression, and Creative ability. The first report card was used to evaluate the executives, and the second to evaluate the four-year-olds.(6) In the January 1988 issue of the Academic Leader, Dr. Stephen J. Huxley points out another difficulty and recommends a possible correction.(7) His observation is that the final record of the student, the college transcript, is blind to the differences indicated above. On the transcript, in determining the student's grade point average, an "A" in Organic Chemistry, earned under an instructor who rarely gives them, is given the same weight as an "A" in Outdoor Fly Casting with an instructor who rarely gives any grade other than an "A." Since these differences are generally disregarded by employers, scholarship committees, and graduate and professional schools, students, with their peer network, learn which courses and instructors to take to bolster their grade point average. Dr. Huxley's recommendation is that, in addition to the individual student's grade in the course, the transcript should indicate the average grade assigned by the instructor for that particular course and section. 
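Huxley's recommendation can be sketched in a few lines of code. The course names, grades, and record layout below are entirely hypothetical (a toy registrar record, not an actual transcript system); the point is only that each transcript line carries the section average alongside the student's own grade.

```python
from statistics import mean

# Toy registrar records (hypothetical course names and 4.0-scale grades).
section_grades = {
    ("CHEM301", "Sec1"): [4.0, 3.0, 2.0, 3.0, 2.0, 1.0],  # a tough grader
    ("FLY101", "Sec1"): [4.0, 4.0, 4.0, 4.0, 3.7, 4.0],   # an easy grader
}

def transcript_line(course, section, student_grade):
    """One transcript entry that, per Huxley's recommendation, shows the
    average grade the instructor assigned in that section alongside the
    student's own grade."""
    avg = mean(section_grades[(course, section)])
    return f"{course} {section}: grade {student_grade:.1f} (section average {avg:.2f})"

# The same 4.0 reads very differently in the two sections.
print(transcript_line("CHEM301", "Sec1", 4.0))
print(transcript_line("FLY101", "Sec1", 4.0))
```

With the section average printed next to the grade, an "A" from the tough grader (section average 2.50) is visibly a different achievement than an "A" from the easy grader (section average 3.95).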
This would allow a transcript reader to determine more easily whether the grade the student earned in a particular course was the result of individual academic performance or of enrollment in an "easy" course. However, since this would necessitate not only a sophisticated computerized grading program but also the official recording of instructors' grading practices, it is doubtful that many institutions will adopt Dr. Huxley's proposal. Regardless of the grading system used, a central problem in interpreting grades is determining what they are trying to measure. Is the grade to measure an
individual's achievement against others in the same class, course, or school, or is it only to measure changes in the student's progress since the start of the course? If the measure is of the individual's progress, it makes the measure of one's progress against others almost impossible to ascertain. It is as if one were using an elastic ruler to measure heights of individuals in a class. If the measuring device varies for each student, then one student can be taller than another simply because the ruler was stretched more in one measurement than in another, even if by simple observation it is evident that the first is taller. In educational jargon, such a measuring device would be labeled "unreliable." Yet in many courses the "measuring device" is changed for each semester and possibly for each student. In certain courses in which competency is to be developed, some instructors have assigned grades at the end of the course on the basis of a student's sustained performance, regardless of the actual average attained or the average of the others in the course. For example, a student enters a writing course making grades of D on the material submitted. During the semester, the student makes the following grades: D, D, C, D, C, C, B, C, B, B+, B. What grade should be assigned as a final course mark? If a strict average is used, then the student has a grade of C or, at best, C+; however, the belief of some faculty is that since this student is writing at a "B" grade at the end of the course, then this is the final grade that should be assigned. It would seem that in those courses in which competency is desired, the latter example would be a reasonable approach to assigning the final mark for a student. Certainly it is reasonable from the student's standpoint; however, it makes impossible an interpretation with others in the class, as well as any comparison with students in other classes, even of the same subject at the same institution. 
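The arithmetic in the writing-course example is easy to verify. The letter-to-point mapping below is an assumption (a conventional 4.0 scale with B+ = 3.3; institutions vary), and defining "sustained performance" as the average of the last three submissions is our choice for illustration, but the contrast between the two grading philosophies comes through either way:

```python
from statistics import mean

# Conventional 4.0-scale point values (an assumption; institutions vary).
points = {"D": 1.0, "C": 2.0, "B": 3.0, "B+": 3.3}

# The grade sequence from the example in the text.
grades = ["D", "D", "C", "D", "C", "C", "B", "C", "B", "B+", "B"]

strict_average = mean(points[g] for g in grades)
# "Sustained performance": average only the closing stretch of the
# course -- here the last three submissions (the cutoff is our choice).
sustained = mean(points[g] for g in grades[-3:])

print(f"strict average: {strict_average:.2f}")    # about 2.12 -- a C
print(f"sustained performance: {sustained:.2f}")  # about 3.10 -- a B
```

The strict average lands squarely at a C, while the closing stretch of the course averages to a B, which is exactly the gap between the two approaches the text describes.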
The attempt to arrive at a fair and equitable grade to assign to an individual student, without distorting either the student's standing in class or comparative ranking with students at that or other institutions, has proven to be one of the most difficult quests of the faculty member. To be reliable as a true measure of achievement over a period of time, the grade assigned must be understood by the instructor, students, colleagues, and future evaluators.
Since there appears to be little doubt that a given mark has different interpretations, perhaps the best choice is for the faculty member to follow the course, within departmental and institutional guidelines, that in his or her opinion best measures the student's progress during the measuring period, without being overly concerned with the grading practices of other faculty and other institutions.
Grade distributions, grading procedures, and students' evaluations of instructors: a justice perspective. Jasmine Tata. The Journal of Psychology
Grades are the basic currency of our educational system. Instructors in universities and colleges assign grades to students on a regular basis. High grades result in both immediate benefits to students (e.g., intrinsic motivation, approval of family) and long-term consequences (e.g., admission to graduate school, preferred employment). Students who perceive their grades as unfair are more likely to react negatively toward the instructor, and these negative reactions may influence students' ratings of teaching effectiveness. A number of researchers have examined the connection between students' grades and their evaluations of instructors. The literature is equivocal on this issue; some researchers suggested that student grades and grading standards may bias teaching evaluations because students who receive higher grades tend to rate the instructor more positively than students who receive lower grades. Results of other studies did not find consistent effects of grades on evaluations of teaching. This inconsistency in the literature concerning the connection between grades and students' evaluations of instructors can be clarified by examining the fairness of grades. It is possible that the connection between low grades and unfavorable evaluations of instructors exists not because of the level of the outcome (the grade) received by the student, but because the low grade is perceived to be unfair. 
The literature on distributive justice (Adams, 1965; Crosby, 1984; Folger, 1986) indicates that people receiving outcomes that are lower than expected are more likely to perceive the distribution as unfair, and perceptions of unfairness may lead to negative evaluations of distributors. In the context of grade allocations, a student who spends a number of hours preparing for an examination may expect to receive an A. If the
student receives a lower grade, the grade may be perceived as unfair, especially if other students who spent fewer hours preparing for the examination received higher grades. In addition to the grade distribution, grading procedures can also influence students' perceptions of fairness and evaluations of instructors. The literature on procedural justice states that procedures that are consistent and impartial are perceived as fairer than those that are inconsistent and biased (Leventhal, 1980). In the context of the classroom, instructors are expected to apply grading standards (procedures) consistently to all students. If an instructor lowers the standards for a few students, this is likely to be perceived as procedurally unfair. Students who believe that the grade allocation procedures are unfair may be more likely to evaluate the instructor unfavorably. For this study, I examined the connections between students' evaluations of instructors, the fairness of grade distributions, and the fairness of grading procedures. I also investigated the relative influence of procedural and distributive fairness on evaluations of instructors. Examining these relationships can be of importance to instructors, administrators, and students. Instructors may use the findings to identify how best to allocate grades and the appropriate procedures associated with grade distribution. Administrators and students may understand the extent to which evaluations are influenced by students' perceptions of procedural and distributive fairness, and the judgment processes involved in evaluations of instructors.
The Fairness of Grade Distributions
Justice theory and research have dealt with both distributive and procedural justice. Distributive justice is concerned with the fairness of decisions about the distribution of resources, whereas procedural justice is concerned with the fairness of the procedures used to reach those decisions. 
Distributive justice refers to the extent to which the outcomes received in an allocation decision are perceived as fair; this type of fairness has been considered implicitly within the contexts of equity theory, relative deprivation theory, and referent cognitions theory. These theories suggest
that individuals use standards of distributive justice such as equality (outcomes allocated equally to all participants regardless of inputs) and equity (outcomes allocated based on inputs such as productivity) to establish the fairness or unfairness of the outcome. Thus, the experience of injustice involves the realization that outcomes do not correspond to expectations determined by standards of distributive justice. In the context of the classroom, grades are the outcomes allocated to students. Students receiving lower grades than expected are likely to perceive the grades as distributively unfair, whereas students receiving expected grades are likely to perceive the grades as fair; this phenomenon can be explained by relative deprivation and the egocentric bias. Relative deprivation theory posits that the fundamental source of feelings of injustice is the realization that one's outcomes fall short of expectations. The egocentric bias in distributive justice suggests that individuals' expectations of their own performance and outcomes are higher than their expectations of others' outcomes; hence, people who receive higher outcomes are more likely to perceive those outcomes as fair than people who receive lower outcomes. Empirical support for this phenomenon has been found in the work of Lind and Tyler and of Tyler, who found connections between outcomes (relative to expectations) and perceptions of distributive justice.
The Fairness of Grading Procedures
Procedural justice refers to the extent to which the processes used in making allocation decisions are perceived as fair (Lind & Tyler, 1988; Thibaut & Walker, 1975). Research on procedural justice has evolved from two conceptual models - Thibaut and Walker's (1975) dispute resolution procedures and Leventhal's (1980) principles of resource allocation procedures. 
These researchers suggested that procedural justice involves the realization that procedures correspond to those determined by certain standards (e.g., consistency, suppression of personal bias, use of accurate information, voice, and congruity with prevailing standards or ethics). In the classroom context, the procedures used to allocate grades could influence students' perceptions of procedural fairness and evaluations of the instructor. Results of research in organizational contexts have shown that distributive fairness and procedural fairness influence employees' reactions. Folger and Konovsky
found that perceived fairness was related to satisfaction, trust in supervisors, and organizational commitment. Alexander and Ruderman determined that employees' perceptions of fairness influenced their approval of supervisors. Extrapolating to the classroom context, the fairness of grade distributions and grading procedures should influence student reactions, such as their evaluations of instructors. Therefore, my first hypothesis was that evaluations of the instructor would be higher for students who received expected grades (fair grade distributions) than for students receiving grades lower than expected (unfair grade distributions). My second hypothesis was that students' evaluations of the instructor would be higher for consistent (fair) grade allocation procedures than for inconsistent (unfair) procedures. It is possible that procedural and distributive justice are predictive of different types of outcomes. Sweeney and McFarlin's two-factor model suggests that distributive justice primarily influences attitudes toward the outcome in question, whereas procedural justice influences attitudes toward the system. For example, Sweeney and McFarlin found that employees who believed their pay was lower than expected (distributively unfair) demonstrated lower levels of pay satisfaction, an attitude specifically directed toward the outcome (pay). In contrast, when pay distributions were made using fair procedures, employees showed higher levels of trust in management and commitment toward the organization (attitudes directed toward the system). In the classroom context, students' evaluations of instructors can be considered attitudes toward the university system; such attitudes are more likely to be influenced by procedural justice than by distributive justice. 
Based on Sweeney and McFarlin's model, my third hypothesis was that consistency (fairness) of grading procedures would influence students' evaluations of the instructor to a greater extent than grade distributions.
Method
Participants and Design
Based on a definition of fairness as meeting expectations and being consistent, I used a 2 (grade distribution: met expectations vs. did not meet expectations) x 2 (grading
procedure: consistent vs. inconsistent) between-subjects, scenario-based experimental design. Undergraduate students (51 men and 46 women) participated in the study. Most were sophomores (32%) or juniors (41%), and the rest were seniors. The average age of the students was 20.10 years.
Materials
The participants were asked to respond to one of four different scenarios. Each scenario described a classroom situation and an instructor. Participants were given contextual information about the situation and were asked to place themselves in the position of a student in the class who had worked hard on a term paper by conducting research, writing, and rewriting the paper. On the basis of the grading criteria described in the syllabus, the student expected to receive a grade of A on the paper. The grade distribution was manipulated by informing the participants that the student received a grade that either met expectations (A = fair grade distribution) or did not meet expectations (B = unfair grade distribution). The grading procedure was manipulated by stating that the instructor used the grading scheme specified in the syllabus to grade the paper (consistent/fair grading procedure) or that the instructor changed the grading scheme after the paper was turned in (inconsistent/unfair grading procedure).
Procedure
Participants were randomly assigned to one of the four manipulation conditions. After reading the scenario, they were asked to complete measures of the dependent variable (students' evaluations of the instructor) and two manipulation checks (distributive justice and procedural justice) on 7-point Likert-type scales. Students' evaluations of the instructor were measured by asking them to rate the preparation of the instructor, course organization, subject matter presentation, knowledge of the subject matter, availability of the instructor, his or her attitude toward students, and an overall evaluation of the instructor. 
These items were based on scales used in empirical research by Chacko (1983) and Carkenord and Stephens (1994).
Distributive justice was measured by asking participants to rate the extent to which they felt that the actual distribution of grades was fair and what they deserved, and procedural justice was measured by asking participants to rate the extent to which they felt that the decision about the grade was made in a fair way and they were treated fairly; these scales were based on those used by Bies, Shapiro, and Cummings (1988) and Shapiro (1991). After completing the ratings, participants were debriefed about the purpose of the study.
Results
Reliability Analyses and Manipulation Checks
Cronbach's reliability coefficients were calculated for the scales and were found to be greater than .75 for each condition; mean ratings were also computed for each scale. Next, I conducted manipulation checks to examine the participants' understanding of the distributive justice and procedural justice manipulations. I conducted separate t tests for the two manipulation checks. The results of the t tests indicated that the manipulations had the intended effects. Grade distributions influenced perceptions of distributive justice; participants who had been assigned expected grades gave higher ratings for distributive justice than those who had been assigned grades lower than expected, Ms = 4.81 and 3.59, respectively, t(95) = 3.84, p < .05. Also, the instructor's grading procedures influenced perceptions of procedural justice; participants gave higher ratings of procedural justice for consistent procedures than for inconsistent procedures, Ms = 5.11 and 3.87, respectively, t(95) = 4.06, p < .05.
Tests of Hypotheses
Of interest in this study was the relative influence of grade distributions (distributive fairness) and grading procedures (procedural fairness) on students' evaluations of the instructor. 
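The t statistics reported throughout this study carry df = 95 for two groups, which implies the pooled-variance form of the two-sample t test (df = n1 + n2 - 2). A minimal sketch of that computation, on hypothetical 7-point ratings rather than the study's data:

```python
from statistics import mean, variance

def pooled_t(sample_a, sample_b):
    """Pooled-variance two-sample t statistic; df = n_a + n_b - 2."""
    na, nb = len(sample_a), len(sample_b)
    # statistics.variance is the sample (n - 1) variance.
    sp2 = ((na - 1) * variance(sample_a) + (nb - 1) * variance(sample_b)) / (na + nb - 2)
    t = (mean(sample_a) - mean(sample_b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5
    return t, na + nb - 2

# Hypothetical 7-point fairness ratings for two small groups
# (illustration only -- not the study's data, where df was 95).
fair_group = [5, 6, 5, 4, 6, 5, 5]
unfair_group = [3, 4, 3, 4, 2, 4, 3]
t, df = pooled_t(fair_group, unfair_group)
print(f"t({df}) = {t:.2f}")
```

The resulting statistic would then be compared against the t distribution with the given degrees of freedom to obtain the p values the study reports.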
I conducted an analysis of variance (ANOVA) with grade distributions and grading procedures as independent variables and students' evaluations of the instructor as the dependent variable. The two main effects and the interaction effect were significant.
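A two-way ANOVA of this kind can be sketched for a balanced 2 x 2 design; the sums-of-squares decomposition below is the textbook procedure, and the cell data are invented for illustration, not the study's ratings.

```python
import numpy as np
from scipy import stats

def anova_2x2(cells):
    """Two-way ANOVA for a balanced 2x2 design.

    `cells` maps (level_a, level_b) -> list of observations, all cells the
    same size.  Returns {effect: (F, p)} for both main effects and the
    interaction, each tested on 1 and 4n-4 degrees of freedom.
    """
    data = {k: np.asarray(v, dtype=float) for k, v in cells.items()}
    n = len(next(iter(data.values())))           # observations per cell
    grand = np.concatenate(list(data.values())).mean()

    def level_mean(axis, level):
        vals = np.concatenate([v for k, v in data.items() if k[axis] == level])
        return vals.mean()

    # Sums of squares: main effects, cells, interaction (by subtraction), error
    ss_a = 2 * n * sum((level_mean(0, l) - grand) ** 2 for l in (0, 1))
    ss_b = 2 * n * sum((level_mean(1, l) - grand) ** 2 for l in (0, 1))
    ss_cells = n * sum((v.mean() - grand) ** 2 for v in data.values())
    ss_ab = ss_cells - ss_a - ss_b
    ss_err = sum(((v - v.mean()) ** 2).sum() for v in data.values())
    df_err = 4 * n - 4
    ms_err = ss_err / df_err

    out = {}
    for name, ss in (("A", ss_a), ("B", ss_b), ("AxB", ss_ab)):
        f = ss / ms_err                          # each effect has df = 1
        out[name] = (f, stats.f.sf(f, 1, df_err))
    return out

# Invented evaluations: A = grade distribution (0 fair / 1 unfair),
# B = grading procedure (0 consistent / 1 inconsistent)
cells = {(0, 0): [6, 5, 6], (0, 1): [6, 6, 5],
         (1, 0): [5, 6, 5], (1, 1): [3, 4, 3]}
results = anova_2x2(cells)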
In support of Hypothesis 1, evaluations of the instructor were influenced by grade distributions. Participants who were assigned expected grades (fair distributions) gave higher evaluations of the instructor than participants who were assigned grades lower than expected (unfair distributions), Ms = 5.54 and 4.67, respectively, t(95) = 2.42, p < .05. Hypothesis 2 was also supported. Students' evaluations of the instructor were influenced by grading procedures; when consistent (fair) procedures were used, participants gave higher evaluations of the instructor than when inconsistent (unfair) procedures were used, Ms = 5.52 and 4.69, respectively, t(95) = 2.31, p < .05. The interaction of grade distributions and grading procedures was also significant, and simple main effects were calculated using Gabriel's simultaneous test procedure (Kirk, 1982). Among participants who received expected grades (fair distributions), there were no significant differences in evaluations of the instructor between those who were provided with consistent procedures and those who were provided with inconsistent procedures, Ms = 5.61 and 5.47, respectively, t(95) = 0.39, p > .05. Among the participants who received grades lower than expected (unfair distributions), however, respondents who were provided with consistent (fair) grading procedures gave the instructor higher evaluations than those who were provided with inconsistent (unfair) procedures, Ms = 5.42 and 3.91, respectively, t(95) = 4.19, p < .05. Therefore, grading procedures appeared to influence evaluations of the instructor only when students received grades lower than expected. To test Hypothesis 3, I calculated partial correlation coefficients.
The partial correlation between procedural fairness and the students' evaluations of the instructor (controlling for distributive fairness) was compared with the partial correlation between distributive fairness and evaluations of the instructor (controlling for procedural fairness). The results suggest that the relationship between procedural fairness and evaluations of the instructor was no stronger than the relationship between distributive fairness and evaluations of the instructor. Thus, Hypothesis 3 was not supported. Discussion
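The partial-correlation comparison described above can be sketched as follows. This is the standard Pearson partial correlation computed from the pairwise correlations; the rating vectors are invented for illustration, not the study's data.

```python
import numpy as np

def partial_corr(x, y, z):
    """Correlation between x and y with z partialled out,
    computed from the three pairwise Pearson correlations."""
    x, y, z = (np.asarray(v, dtype=float) for v in (x, y, z))
    r_xy = np.corrcoef(x, y)[0, 1]
    r_xz = np.corrcoef(x, z)[0, 1]
    r_yz = np.corrcoef(y, z)[0, 1]
    return (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Hypothetical 7-point ratings (invented for illustration)
procedural = [5, 3, 6, 4, 7, 2, 6]
distributive = [4, 3, 5, 5, 6, 3, 5]
evaluation = [5, 4, 6, 4, 6, 3, 6]

# Procedural fairness vs. evaluation, controlling distributive fairness,
# and the reverse comparison
r_proc = partial_corr(procedural, evaluation, distributive)
r_dist = partial_corr(distributive, evaluation, procedural)
```

Comparing `r_proc` with `r_dist` mirrors the test of Hypothesis 3: each coefficient isolates one fairness dimension's relationship with the evaluations while holding the other dimension constant.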
The purpose of this study was to examine the influence of the fairness of grade distributions and grading procedures on students' evaluations of the instructor. Distributive fairness was manipulated by providing participants with grades that either met expectations or were lower than expected. Procedural fairness was manipulated by providing consistent or inconsistent grading procedures. The results indicate that students' evaluations of an instructor are influenced by distributive fairness because participants who received expected grades gave higher evaluations than those receiving grades lower than expected. Procedural fairness also influenced evaluations of the instructor. Participants provided higher evaluations under consistent procedures than under inconsistent procedures. The fairness of grading procedures, however, influenced evaluations of the instructor only under unfair grade distributions. When students received expected (fair) grade distributions, grading procedures did not significantly influence evaluations of the instructor. This suggests that procedural fairness becomes more salient under conditions of distributive unfairness. The fairness of grading procedures, however, did not influence students' evaluations of the instructor to a greater extent than the fairness of grade distributions; this finding is not consistent with previous research conducted in organizations but may be explained by examining the differences between organizational settings and the classroom context. Employees in organizations are likely to have a long-term perspective of their relationships with management and organizations. In contrast, students are more likely to have a short-term perspective of their relationships with instructors, as they generally interact with instructors for only the length of a semester. These differences in time horizons can be connected to perceptions of fairness. 
Procedural fairness influences system variables (e.g., trust in management or evaluations of instructors) partly because fair procedures ensure that, over time, outcome distributions (e.g., pay or grades) will be favorable (Lind & Tyler, 1988). Employees who have long-term relationships with management may be influenced to a greater extent by procedural fairness than students who have short-term relationships with instructors and are not concerned about future outcomes distributed by the instructor. Thus, students may not emphasize grading procedures to a greater extent than grade distributions when evaluating instructors.
Before generalizing from the results of this study, certain limitations of the methodology should be kept in mind. Patterns obtained in a scenario-based study may not always be generalizable to other settings. Unfortunately, the sensitive nature of this line of research made it problematic to conduct in a classroom setting. Also, the subtle differences between the independent variables used in this study made it difficult to examine the independent and interaction effects under natural circumstances. The external validity of the study, however, was increased by using students as participants, because they could easily relate to the grading incidents. Future researchers can extend the generalizability of this study by replicating it using other methods in other settings. Another potential limitation of the study is the use of only one manipulation of grade distributions and one of grading procedures. In actuality, students' perceptions of distributive fairness may be influenced not only by comparisons between their grades and expectations, but also by comparisons between their grades and others' grades. Similarly, procedural fairness may be perceived not only through the consistency of grading procedures but also through other factors such as lack of bias and the use of accurate information. Although the manipulation checks indicated that participants' perceptions of distributive and procedural justice were influenced by the manipulations used in the study, future researchers can use other techniques to examine connections between the perceived fairness of grades and students' evaluations of instructors. When the results of this study are viewed along with past studies (Perkins et al., 1990; Snyder & Clair, 1976), grade distributions appear to be a consistent influence on evaluations of teaching. 
To the extent that students' evaluations of the instructor's performance reflect the instructor's evaluations of the students' performance (grades), teaching evaluations have the potential to be contaminated by factors unrelated to teaching behavior. The influence of grade distributions, however, can be mitigated by the grading procedures used. The results suggest that the fairness of grading procedures has a significant influence on students' evaluations of instructors. As such, this study connects the research on procedural justice in organizational settings (Greenberg, 1990; Lind & Tyler, 1988; Tyler, 1986) to the classroom context; just as managers
perceived as fair by employees are more likely to receive positive evaluations, instructors perceived as fair receive higher ratings. Instructors can ensure the fairness of their grading procedures by being consistent, using accurate information, and maintaining an impartial process. The validity of students' evaluations of instructors is a complex issue. Although factors external to instructor performance (such as grade distributions) can influence evaluations, so can other factors intrinsic to performance such as the fairness of the grading process. The validity of students' evaluations of instructors can be strengthened by using other measures of teaching effectiveness along with student ratings, especially in making decisions about salary increases, promotions, and tenure.

A new approach to exploring biases in educational assessment
Ian Dennis, Stephen E. Newstead and David E. Wright, British Journal of Psychology

Assessment procedures which lead to a subjective evaluation of the work of students or pupils are extensively used at all levels in British and European education. Although the North American tradition has relied more heavily on objectively scored multiple-choice assessments, there too there has been increasing advocacy in recent years of subjectively marked, open-ended or 'authentic' assessment (e.g. Jones, 1988). One of the disadvantages of this form of assessment is that marking may be subject to various forms of bias. Our knowledge of the biases which may be at work, and especially of the extent of their impact in live educational settings, is quite limited and relies in part on generalization from more general evidence concerning judgmental biases. This paper aims to contribute relevant evidence based on marks obtained from a real rather than simulated assessment situation.
However, more importantly it aims to illustrate an approach to studying assessment biases which, if applied more generally, has the potential for adding considerably to our knowledge in this area. One situation in which there is good reason to suppose that marking biases may operate is that in which the student whose work is being assessed is personally known to the marker. Although this situation may be undesirable from the perspective of summative assessment, there are often other considerations, such as the provision
of appropriate feedback, which outweigh this. Examples of marking being undertaken by the same individuals who teach students, and who are therefore personally familiar with them, are thus widespread. In such situations there is good a priori reason for suspecting that marks may be contaminated by individual biases. One form of individual bias which is likely to occur in assessment is based on generalization from previous performance. Theories of impression formation suggest that in general individuals tend to form consistent impressions of others at an early stage in the impression formation process and having done this are prone to discount evidence which is inconsistent with these early views. Although there has been little attempt to examine the consequences of this for educational assessment, it would be strange if markers were exempt from these effects. Related to this is an extensive if rather muddled literature in occupational psychology on halo effects in performance rating. This has primarily focused on the situation in which an individual's performance is rated on a number of different dimensions or attributes where high correlations are often found between the different dimensions. The high observed correlations have been explained partly as a result of the true correlation of the dimensions being rated (true halo) but also partly as a consequence of systematic rater errors (illusory halo). Recent reviews have suggested that the extent of illusory halo may have been exaggerated and that attempts to reduce the total observed halo effect may sometimes be misguided. One argument made in both reviews is that there is not a major problem if the purpose of rating is to make comparisons between individuals according to a criterion which involves pooling the rated dimensions. 
Such comparisons are not vitiated if the ratings on the dimension under consideration are contaminated by some other dimension which also relates to the decision being made or by some impression of general merit. Whilst this may be true for some occupational settings the argument does not hold good in education and training. In the educational context it is important to be accurate not only in comparisons across individuals but also in comparisons within an individual over time or over different assessments. There are clear reasons in the educational context for seeking to avoid biases which downplay changes in the relative standing of individuals since such biases are clearly unfair to individuals who show more than average improvements in their performance.
Although illusory halo can occur because ratings of various relevant dimensions contaminate one another, it can also occur if they are all contaminated by a common influence which is not relevant to the decision being made. Influences which might plausibly work in this way are not hard to identify, although there is only a limited amount of work demonstrating their impact and even less which enables their magnitude to be evaluated in applied settings. One dimension which has received some attention is the physical attractiveness of the individual being assessed or rated. Landy & Sigall found that the evaluation of essays could be influenced by the physical attractiveness of their author, although a study by Bull & Stevens only partially confirmed this result. In the occupational sphere Morrow, McElroy & Stamper examined the influence of physical attractiveness on judgements of suitability for promotion made by personnel professionals on the basis of simulated assessment centre data. Physical attractiveness was found to have a significant effect on the ratings, although the effect was not large, accounting for only 2 per cent of the variance in the ratings given. Another unwanted influence based on an assessor's or appraiser's knowledge of the individual being assessed is that of interpersonal liking. Cardy & Dobbins (1986) asked student subjects to evaluate vignettes of professors. The inclusion of trait terms that engendered different liking levels but which were irrelevant to performance nevertheless affected the rating given. Tsui & Bruce, working with ratings of real occupational performances, also concluded that interpersonal affect may contaminate appraisal ratings. Ratings on a number of different dimensions were higher when raters liked the person being rated.
Moreover, the existence of either positive or negative affect towards the ratee increased the intercorrelations between the different dimensions relative to those obtained from raters whose feelings towards the person being rated were neutral. Thus there is good reason to suspect that where markers know the students whose work they are assessing their marks may be biased by overgeneralization from the student's previous performance, by whether or not they like the student, and by irrelevant considerations such as the student's physical attractiveness. However, such influences have had little direct investigation in the educational setting and virtually nothing is known of whether or how seriously marks may be contaminated by them.
As well as resulting in unfair treatment for individual students, it seems likely that if these biases are operating, especially those based on interpersonal affect and physical attractiveness, they will vary in their impact for different groups of students such as males and females or students of different ages. Thus the differential impact of biases based on individual knowledge of students is one way in which different groups of students may come to suffer unequal treatment. However, a second possible way in which this may come about is through the operation of group stereotypes. These are likely to have most impact when the marker does not know the student personally but can identify the group to which the student belongs. The most widely discussed example of this is where markers are able to identify gender from the student's name. The possibility that gender stereotypes may bias marking has received more attention than most aspects of marking bias. However, even here there is no clear agreement on what types of effects are operating and on the extent of their impact. Whilst differences in mark and grade distributions between males and females have often been observed, it has not proved easy to make progress in disentangling whether these reflect genuine differences in performance or whether they are in some part attributable to biases in marking. A study by Bradley is one of the few which has made progress on this issue. Bradley exploited the fact that where two markers mark the same piece of work the discrepancy in their marks will in part be determined by any differences in their biases. She found that second markers (whom she assumed to have less knowledge both of the project topic and of the student) marked the projects of female students closer to the centre of the scale than first markers. For male students the pattern was reversed, with second markers tending to award more extreme marks. 
Since Bradley's effect relates to the comparison between the marks awarded by first and second markers it cannot be explained solely on the basis of different distributions of performance in male and female students. Bradley attributed the outcome to a centrality bias in the marking of female students' work which derived from gender stereotypes and was consequently stronger in the less specialist second markers. However, the data could equally be explained if there were some influence which inflated the variance of first marker's marks for female students. Bradley's
preference for the explanation in terms of group stereotypes was based partly on the previous literature and partly on the fact that the pattern of results failed to obtain in a department where the second markers were blind to the student's identity and gender. However, there could well be other differences between the departments in which the effect was observed and that in which it was not. This point was reinforced by the failure of Newstead & Dennis (1990) to replicate Bradley's data pattern in a large department where second markers did know the student's gender. A variety of explanations for the discrepant outcomes drawing on different interpretations of Bradley's initial effect have been advanced and discussed. However, this debate has been largely indecisive and it might be concluded that the approach being adopted provides insufficient evidence to adjudicate between alternative interpretations of the effects. Thus, whilst there is good reason to believe that personal biases could operate in marking, there is little direct evidence of their importance, and in the area of gender bias it has been difficult to disentangle effects of bias from true differences in performance. One of the most promising approaches to the latter issue is subject to ambiguities of interpretation. The main aim of the present paper is to propose and illustrate an approach which can contribute considerably to progressing the study of both individual biases and gender bias in marking. The essence of this approach has previously been advanced in relation to occupational ratings by Kenny & Berman (1980). However, it appears to have been little exploited for occupational ratings and not at all in relation to marking bias. 
The present study goes beyond the proposals of Kenny & Berman in using a multi-sample approach to compare the way in which the work of male and female students is marked and in using a structured means model to locate the source of differences in the average marks of males and females. Overview of the approach Consider a model in which the mark which a marker awards to a piece of work is the sum of three components. The first component is determined by the true merit of the work being assessed. The second component reflects the aggregate influence of the marker's biases concerning the student in question. The third component consists of purely random influences. We can hope to make some progress in distinguishing the variance attributable to these different influences if markers assign nominally
independent marks to a number of pieces of work from the same student and if each piece of work is marked by more than one marker. In this situation each of the marks assigned to a particular piece of work will be influenced by its true worth (along with other influences). All the marks which a marker awards to a particular student will be influenced by, amongst other things, that marker's biases concerning the student. When viewed in this way the problem of separating variance attributable to merit from that attributable to bias is isomorphic with that of separating trait and method variance in personality assessment. Thus marking data of the form discussed above can be very effectively analysed using the same sort of confirmatory factor models which have been applied to multitrait-multimethod matrices. This should be a useful tool since it enables the percentage of variance attributable to individual biases to be estimated for each marker. Moreover, using programs such as LISREL or EQS, this type of model can be fitted simultaneously to data from separate groups such as males and females. The advantage of this is that the consequences for the overall fit of the model of constraining its various parameters to be equal for the two groups can be assessed. Hence it can be determined where differences in the marking of the two groups arise. Thus, for example, if differences in variance are found between male and female marks it can be determined whether these are attributable to the merit component of the mark, the bias component or the error component. Structure of the data The data which were used in this study derive from the marking of final year undergraduate psychology projects in the University of Plymouth between 1991 and 1993. These projects report a piece of empirical work carried out throughout an academic year. This work is supervised by one supervisor who meets with the student regularly during the course of the year.
The student's project report is sectioned in the conventional way for reports of empirical work and is independently marked by the supervisor, acting as first marker, and a second marker. Second markers are chosen, subject to constraints imposed by workload, on the basis of their interest or expertise in the topic area of the project. Project reports, which are typed, carry the student's name on the cover. Second markers will vary in their degree of personal knowledge of the student: they may have come to know the student quite well in some role such as that of the student's personal tutor but may know the student hardly at all. Typically
they will know the student considerably less well than does the supervisor by the time the project report is submitted. The supervisor awards four marks to the project whilst the second marker awards three. The supervisor's A mark is intended to be based on the student's performance in designing and conducting the project work over the course of the year. It is an explicit part of the marking policy that the remaining three marks awarded by each marker are to be based solely on evidence which is available to both markers in the form of the written project report. The B mark relates to the introduction section of the report, the C mark to the method and results sections and the D mark to the discussion. The marks are awarded on a 100-point scale commonly used in British higher education where a mark of 70 or above corresponds to a first-class honours degree and a mark below 40 represents a fail, with intermediate classifications all having specified mark ranges. For each of the four marks, markers have available to them a set of marking guidelines which provide a two or three sentence description of the performance appropriate to each degree class. Second markers award the B, C and D marks without knowledge of the corresponding marks given by the supervisor and without receiving any comments from the supervisor. They do, however, have sight of the supervisor's A mark prior to awarding their own marks. Data from one male student and two female students were excluded from the analysis because they were incomplete. This left data from 197 female students and 58 males which were used in the analysis. Twenty-five different markers were involved, of whom eight were female. Structure of the model The path diagram for the model fitted to the data is shown in Fig. 1. This follows the standard conventions for such diagrams. The variables in square boxes are manifest variables - in this case the seven different marks awarded to each project. 
A1 denotes the first marker's A mark, B1 the first marker's B mark, B2 the second marker's B mark and so forth. The variables enclosed by circles are latent variables which the model assumes to combine in determining the manifest variables. In general the model assumes that each mark is determined by the sum of three influences.
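The three influences just listed can be written as a single equation. The notation below is introduced here for illustration, since the paper itself presents the model as a path diagram (Fig. 1):

```latex
% Mark awarded by marker m (1 = supervisor, 2 = second marker)
% to section s of student i's project:
M_{ism} = \lambda_{sm}\,SS_{is} + \gamma_{sm}\,MS_{im} + E_{ism}
% SS_{is}: section-specific factor (largely the true merit of section s),
% shared by both markers;
% MS_{im}: marker-specific factor (that marker's biases and other
% marker-wide influences on student i's marks);
% E_{ism}: idiosyncratic, section-specific error.
```

For the A mark, which only the supervisor awards, no separate error term is identifiable, so its error is absorbed into the corresponding section-specific factor.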
The first influence on each mark is a factor which is specific to the section being marked but which influences both markers. This influence is reflected in factors SSB, SSC or SSD for sections B, C and D respectively (the labels are chosen to provide a mnemonic for the fact that the influence of these factors is section specific). These factors will probably reflect primarily the true merit of the section being marked but they could also embody biases which are shared by both markers or any other influence which operates on both of them. The second influence on each mark is one which is marker specific but general across all the marks awarded by that marker to the student. These factors appear on the right of the model in Fig. 1. The factor MS1 influences only the marks of the first marker and the factor MS2 influences only the marks of the second marker (again the labelling is chosen to provide a mnemonic for the fact these are marker specific factors). These two factors affect each of the marks which a marker awards to a particular student. The influences represented by MS1 and MS2 would include any pre-existing biases the marker may have concerning either the student in question or a group to which the marker knows the student belongs. The marker's reaction to aspects of the student's work which transcend the different sections, such as the student's writing style, would also enter into the factors MS1 and MS2. It also needs to be recalled that the markers under consideration are not the same individuals for all projects being marked. This complicates interpretation a little since any differences in stringency between markers, whereby some markers are more severe than others in all the marks which they award, will also appear in MS1 and MS2. However, it also means that rather more general conclusions can be reached than would have been possible if only two markers had been involved. The third influence on each mark is one of the error components E2 to E7. 
These account for the component of the variance which is not explained by the other factors. They represent the component of each mark which is idiosyncratic and section specific and are analogous to the error component in traditional true score models of marking reliability. Factor SSA differs slightly from factors SSB to SSD. Because the A mark is not independently awarded by a second marker there is no way of identifying the component of this mark which is idiosyncratic to the supervisor. Thus the model does
not contain an error component influencing A. Instead in this case the error component can be thought of as being hidden within SSA. The double headed arcs linking factors SSA to SSD signify that these factors are allowed to be intercorrelated. Clearly if the major determinant of these factors is the real merit of the student's performance then students who perform well on one element of the project are likely also to perform well on other elements so this is necessary to make the model appropriate. It is important to note that influences which are not linked by double headed arcs are assumed to be uncorrelated. In particular the influences on marks reflected in MS1 and MS2 are ones which are uncorrelated with each other and with the complex of SSA to SSD. Thus biases which are shared by the two markers will appear in SSA to SSD rather than MS1 and MS2. Biases which are correlated with ability will also not affect MS1 and MS2. Consider, for example, Bradley's (1984) suggestion that because of gender stereotypes second markers mark the work of female students less extremely. This bias would reduce the mark given to good work presented by a female student but elevate the mark given to poor work. Thus the bias which Bradley proposes to be operating in second markers is one which is correlated, albeit negatively, with student performance. Because it is not orthogonal to the section specific factors it will not contribute to MS2. However, if this bias is operating, the second marker will award a lower mark to a good female project than to an equally good male project and conversely for poor projects from male and female students. Thus variations in project quality will have less impact on the mark for female than for male students. This will lead to smaller path coefficients for females than for males on the paths from SSB to SSD to the corresponding marks awarded by the second marker. 
Results Descriptive statistics for each of the seven marks are given for male and female students separately in Table 1. The modelling procedures used here assume multivariate normality of the underlying data and are known to be sensitive to violations of this assumption. It can be seen that in general, but especially for the females, the distributions tend to be leptokurtic and to have negative skew. After experimentation with alternative transformations this was dealt with by squaring all
marks prior to calculating the variances and covariances used in modelling. Skew and kurtosis for the transformed marks are also given in Table 1. It is apparent that the transformation alleviates the problems which were present in the raw data and produces data which are reasonably close to normality for both males and females. The correlation matrix for the transformed data along with the relevant standard deviations are presented in Table 2. These are the data to which the model was fitted. Informal inspection of Table 2 reveals a number of features which are helpful in gaining an insight into the conclusions which emerge more formally from the fitting of the model in Fig. 1. The correlations in the table are of three types: (i) correlations between marks awarded by the same marker to different sections, (ii) correlations between markers in their marks for the same section, and (iii) correlations involving both different markers and different sections. In Table 2 correlations of type (i) are in bold type, correlations of type (ii) in italics, and correlations of type (iii) in plain type. In general correlations of type (i) are larger than those of type (ii), which are in turn larger than those of type (iii). This general pattern, which, as might be expected, emerges more clearly in the larger female sample, is consistent with the model in Fig. 1. Correlations of type (iii) are expected to be smallest since the two measures being correlated are not influenced by any shared factor. The fact that the correlations of type (iii) are all positive presumably reflects the existence of positive correlations amongst the factors SSA to SSD. Correlations of type (ii) differ from those of type (iii) in that the measures being correlated share the influence of either MS1 or MS2. The fact that correlations of type (ii) tend to be bigger than those of type (iii) demonstrates the need for the inclusion of these two factors in the model. 
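The effect of the squaring transformation on a negatively skewed, leptokurtic mark distribution can be illustrated with simulated data. The marks below are invented (most students scoring high, with a tail of low marks); the real distributions are summarized in Table 1.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Hypothetical marks on a 0-100 scale with a long left tail
marks = np.clip(100 - rng.exponential(scale=12, size=5000), 0, 100)

skew_raw = stats.skew(marks)
kurt_raw = stats.kurtosis(marks)   # excess kurtosis; > 0 means leptokurtic

# Squaring stretches the upper end of the scale more than the lower,
# pulling the long left tail in toward symmetry
transformed = marks ** 2
skew_sq = stats.skew(transformed)
kurt_sq = stats.kurtosis(transformed)
```

On data of this shape the transformed marks show markedly less negative skew than the raw marks, which is the property the modelling procedure requires.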
For correlations of type (i) the two measures being correlated share the common influence of one of the factors SSA to SSD. The observation that correlations of type (i) tend to be bigger than those of type (ii) suggests that, as might be expected, the influence of SSA to SSD is greater than that of MS1 or MS2. A second feature of the data which is evident in Tables 1 and 2 and which contributes to the outcome of model fitting is that the marks awarded by first markers show a larger SD for males than for females. If this difference in variance is tested using the transformed data it reaches significance on the B and D marks though not on
the A and C marks (for mark A, F(57, 196) = 1.28, p = .22; for mark B, F(57, 196) = 1.51, p = .04; for mark C, F(57, 196) = 1.17, p = .42; for mark D, F(57, 196) = 1.65, p = .006). The difference in SD is only marginally and non-significantly evident in the second marker's marks (for mark B, F(57, 196) = 1.12, p = .58; for mark C, F(57, 196) = 1.05, p = .79; for mark D, F(57, 196) = 1.13, p = .53). Some aspect of the first marker's marking must differ between males and females but without formal modelling it is not clear which of several possibilities holds. It could be, for example, that individual biases are larger for males than for females so that the influence of factor MS1 is greater for males than for females. Alternatively, it could be that first markers let the same degree of variation in merit produce larger degrees of variation in marks when the work is that of male rather than female students. Thus the paths leading from the factors SSA to SSD to the marks awarded by the first marker would have larger coefficients for males than for females. Perhaps less plausibly it may be that supervisors' marking of male students is noisier so that the error components of the first marker's marks (E2, E4 and E6) are greater for males than for females. The purpose of the model fitting, carried out here using EQS, is to predict the entire pattern of covariances and variances amongst the seven marks in each of the two samples. EQS and similar programs determine values for the unknown parameters of the model in such a way as to minimize the discrepancy between the observed variances and covariances and those predicted by the model. A number of alternative criteria for assessing this discrepancy are available. In the present case a maximum likelihood criterion was employed.
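The maximum likelihood discrepancy between the sample covariance matrix S and the model-implied matrix Sigma has a standard form, F = log|Sigma| + tr(S Sigma^-1) - log|S| - p. As a minimal sketch, hand-coded for 2 x 2 matrices so that it stays self-contained (EQS of course handles the general case):

```python
import math

def ml_discrepancy_2x2(S, Sigma):
    """Maximum likelihood discrepancy for p = 2 observed variables:
    F = log|Sigma| + tr(S * Sigma^-1) - log|S| - p.
    F is zero when the model-implied matrix reproduces the sample
    matrix exactly; (N - 1) * F is the approximate chi-square."""
    def det(M):
        return M[0][0] * M[1][1] - M[0][1] * M[1][0]
    def inv(M):
        d = det(M)
        return [[M[1][1] / d, -M[0][1] / d],
                [-M[1][0] / d, M[0][0] / d]]
    Si = inv(Sigma)
    trace = sum(S[i][k] * Si[k][i] for i in range(2) for k in range(2))
    return math.log(det(Sigma)) + trace - math.log(det(S)) - 2
```

Minimizing this quantity over the free parameters is what the fitting program does; the minimized value, scaled by the sample size, supplies the test statistic discussed next.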
Under the assumption of multivariate normality this criterion leads to a function which is approximately distributed as χ² with a number of degrees of freedom which depends on the number of measured variables and on the number of parameters which are estimated in fitting the model. The value of this statistic can be used to assess the overall compatibility of the model with the data. It is also possible when fitting models to constrain parameters to particular values or to equality with one another so that fewer separate parameters need to be estimated and the resulting χ² statistic has more degrees of freedom. The change in χ² resulting from the imposition of constraints provides a method of assessing whether those constraints are significantly worsening the fit of the model; this is sometimes referred to as the χ² difference test. In addition to the χ² statistics a variety of other fit indices are also
available. Some of these, such as the normed fit index (Bentler & Bonett, 1980), provide a measure of fit on a scale from 0 to 1; for these the fit index inevitably increases as constraints are released in a series of nested models. Other measures such as Akaike's Information Criterion also take account of the parsimony of the model and favour models in which a good fit is obtained with a small number of free parameters. Because of the previous evidence that males and females may be treated differently in project marking, they were treated as separate samples using the facilities which EQS provides for multi-sample modelling. This approach makes it possible to impose constraints which equate particular parameters across the two samples. The degree of fit of the constrained model then provides a test of whether the assumption that these parameters are the same for the two samples is consistent with the data. By investigating which parameters, if any, need to differ across the two samples it should be possible to be more specific about the origins of different mark distributions for males and females. Table 3 presents a number of measures of fit for a series of variants of the model under discussion. Preliminary explorations of the male and female data separately indicated that in both cases no strain was imposed on the model by making the error component of the C and D marks equal for the two markers and that this constraint helped to avoid some problems with error variances becoming negative in the smaller male sample. Accordingly this constraint was retained in all the models reported in Table 3. There was never any indication that releasing this constraint would have produced a noticeable improvement in fit of any model. Model 1 of Table 3 contains the requirement that all parameters of the model should be equal for males and females. It can be seen that overall this model is quite compatible with the data.
However, it was noted previously that the marks awarded by first markers to male and female students had different variances. This provides evidence that some aspect of the model needs to be different for male and female students. Detailed consideration of the results from fitting model 1 indicated that the fit of the model could be improved if MS1 was allowed to have a greater influence for males than for females, particularly on the B and D marks.
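The χ² difference test itself is easy to reproduce. For even degrees of freedom the chi-square survival function has a simple closed form, so a short stdlib-only sketch suffices:

```python
import math

def chi2_sf(x, df):
    """P(X > x) for a chi-square variable with even df = 2k, using the
    closed form exp(-x/2) * sum_{i<k} (x/2)**i / i!."""
    assert df > 0 and df % 2 == 0, "closed form requires even df"
    total, term = 0.0, 1.0
    for i in range(df // 2):
        total += term
        term *= (x / 2) / (i + 1)
    return math.exp(-x / 2) * total

def chi2_difference_test(chi2_full, df_full, chi2_constrained, df_constrained):
    """Delta chi-square for nested models: the constrained model's
    excess chi-square is referred to a chi-square distribution with
    df equal to the number of constraints imposed."""
    d_chi2 = chi2_constrained - chi2_full
    d_df = df_constrained - df_full
    return d_chi2, d_df, chi2_sf(d_chi2, d_df)
```

For example, a difference of 10.141 on 4 degrees of freedom gives p of about .038.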
Model 2 in Table 3 differs from model 1 in that all four path coefficients of MS1 have been allowed to differ in males and females. The χ² difference test confirms that the fit of model 1 is significantly poorer than that of model 2 (χ²(4) = 10.141, p = .0381). Thus the influences represented by MS1 (the nature of these is taken up later) have greater impact when the work being marked is that of a male student. Inspection of the fitted parameters of model 2 showed that the influence of the first marker factor MS1 was considerably larger than that of the second marker factor MS2 (parameter values from a model which is very close to model 2 are given in Table 4). How strongly do the data dictate a model in which both MS1 and MS2 have some influence but the influence of MS1 is stronger? The remaining models in Table 3 were considered with a view to exploring this question. In model 3 the second marker factor, MS2, is removed from the model entirely. Model 3 is otherwise identical to model 2. The change in χ² on moving from model 2 to model 3 is not significant (χ²(3) = 5.097, p = .165), indicating that the second marker effect, MS2, is not necessary for an adequate fit to the data. Because of its greater parsimony, model 3 appears slightly superior on Akaike's (1987) information criterion but the other fit indices suggest that model 2 yields a marginally better fit. The results from model 3 show that a model from which the second marker effect, MS2, has been removed provides an adequate account of the data. Would a model from which MS1 has been removed also be satisfactory? Model 4 examines this by dropping MS1 whilst retaining MS2. To make the comparison with model 3 a fair one, MS2 was allowed to have different path coefficients for males and females. Model 4 produces a significant overall χ² (χ²(33) = 55.8, p = .008), implying that the data pattern observed would be unlikely if this model were correct.
Thus whilst a model with only the first marker effect, MS1, is tenable, there is enough evidence in the data to discount a model in which only the second marker effect, MS2, is present. It is worth noting that models which exclude both of the marker specific factors, MS1 and MS2, unsurprisingly provide an even poorer account of the data than model 4. Complications arise in fitting these models because some estimates of the correlations between section specific factors become constrained at unity. This in
itself suggests that these models are unsatisfactory. Further evidence comes from the fit measures obtained when both MS1 and MS2 are dropped from the model. In model 6 the complications mentioned above have been dealt with by having the same factor load on both section C and section D marks. Additionally, differences between the male model and the female model which lead to a significant improvement in χ² have been introduced on an ad hoc basis. Despite these ad hoc changes the model produces a highly significant overall χ² and can thus be rejected. The exclusion of factors MS1 and MS2 (and also the merging of SSC and SSD) makes model 6 more parsimonious than the other models in Table 3 and this will in part account for its poorer fit. However, even on the AIC, whose purpose is to allow for differences in parsimony, model 6 comes out markedly worse than the other models in Table 3. This confirms the view that at least one of the marker specific factors is needed to give an adequate account of the data. Model 5 is intermediate between model 3 and model 4 in that the path coefficients of MS1 and MS2 on the two marks awarded to a particular section were constrained to equal one another. These equal path coefficients were allowed to differ between males and females. Model 5 cannot be discounted at conventional significance levels, although on all measures of fit it performs more poorly than models 2 and 3. In summary, the general form of the model presented in Fig. 1 provides a good account of the data provided that the first marker effect, MS1, is included. The effect of MS1 is greater for males than for females. The best account of the data is provided by models in which the impact of MS2 is either substantially smaller than that of MS1 or absent. However, a model in which MS1 and MS2 have an equal impact cannot be rejected.
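The fit measures compared in Table 3 are simple functions of the χ² statistics. A sketch of two of them, assuming the usual definitions (NFI relative to the independence model; AIC in the χ² minus 2 df form that EQS reports):

```python
def normed_fit_index(chi2_model, chi2_null):
    """NFI (Bentler & Bonett, 1980): proportional improvement of the
    fitted model over the null (independence) model, on a 0-1 scale."""
    return (chi2_null - chi2_model) / chi2_null

def aic(chi2_model, df_model):
    """Akaike's information criterion in the chi-square minus 2 df
    form: of two models that fit equally well, the one with more
    degrees of freedom (fewer free parameters) scores lower (better)."""
    return chi2_model - 2 * df_model
```

Releasing constraints can only lower χ², so the NFI of nested models can only rise; the AIC, by contrast, can worsen when the drop in χ² is too small to justify the degrees of freedom given up.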
The good fit of the models should not be overemphasized given the relatively large number of free parameters which they contain. However, less sensibly motivated models with as many free parameters do less well in accounting for the data, and the results reported above from model 4 show that a satisfactory fit is not guaranteed for models of the order of complexity considered here. The satisfactory fit does imply that sensible interpretations may be placed on the parameter values of the fitted models. A number of issues arise from these parameter values and these are taken up in the discussion.
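The balance between fit and free parameters can be made concrete by counting degrees of freedom. A sketch of the bookkeeping, where the figure of 41 free parameters used in the usage note below is back-calculated from the df of 29 reported later for the structured means model, purely as an arithmetic illustration:

```python
def model_df(n_measures, n_groups, n_free_params, with_means=False):
    """Degrees of freedom for a multi-sample covariance structure
    model. Each group supplies p(p + 1)/2 distinct variances and
    covariances (plus p observed means if a structured means model
    is fitted); each free parameter estimated uses one of them up."""
    p = n_measures
    data_points = n_groups * p * (p + 1) // 2
    if with_means:
        data_points += n_groups * p
    return data_points - n_free_params
```

With the seven marks and two samples here there are 56 variances and covariances, or 70 data points once the 14 means are added; a structured means model with df = 29 therefore implies 41 free parameters.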
Inspection of Table 1 reveals that the mean marks given to male students are higher than those for females, especially in the supervisor's marks. The model presented in Fig. 1 implies that this must occur either because males have higher scores on some or all of the factors SSA to SSD or because they have higher scores on one or both of MS1 or MS2. The fact that the difference is more apparent in the supervisor's marking begins to suggest that it might arise from MS1. The question can be looked at more formally by use of a structured means model. The models whose fits are presented in Table 3 are based solely on the variances and covariances of the seven observed marks. They differ from a structured means model in that the latter also takes account of the means of the manifest variables in estimating parameter values. Multi-sample structured means models are discussed by Bentler. They make use of more evidence from the data in that they seek to account for the means of the manifest variables in the different groups, but doing this also involves estimating additional parameters. The additional parameters which are needed in the present example include the means of each of the factors SSA to MS2 in each of the two samples. Since the zero point of the factors is arbitrary it can be set at zero in one of the samples. When this is done the estimated means of the factors for the remaining sample provide information on how the means of the factors differ in the two samples. It is these estimated differences in factor means which are the focus of interest here. The other additional parameters needed are the intercepts of the seven linear expressions giving the manifest variables as functions of the latent factors. These intercepts are constrained to equality in the two samples. In applying this approach to the present model 13 extra parameters need to be estimated: the intercepts of the seven manifest variables and the means for the six factors in one group.
With 14 observed means available as data to be explained by the model and 13 additional parameters to be estimated, the structured means model provides virtually no additional information to assist in discriminating between models. However, the parameter estimates it provides are of considerable interest for the reasons given above. The structured means version of model 2 again has a good fit (χ²(29) = 32.89, p = .28, NFI = .973, NNFI = .995, CFI = .997). Table 4 presents all the parameter estimates from the fitting of this model along with their respective standard
errors. The ratio of these two quantities is in effect a z score which may be used to test whether the parameter differs significantly from zero. The parameter estimates relating to variance and covariance are all virtually identical to those obtained from model 2. Of immediate concern are the differences in factor means for the two groups. These appear in the bottom row of Table 4. It is important to note that whilst the means of the observed marks are higher for males, the means of the section specific factors (SSA to SSD) are all higher for females, though the differences are slight and fall well short of significance (for SSA z = 0.211, for SSB z = 0.086, for SSC z = 0.125, and for SSD z = 0.104). Thus the higher mean marks supervisors give to males cannot be explained in terms of the greater merit of their projects. Rather the higher marks of males are accounted for by their higher mean on MS1; here the difference between males and females does reach significance (z = 2.177, p = .029). Whatever influences are represented in MS1 are on average raising the marks of male candidates relative to those of female candidates to a small but significant extent. MS2 also operates to raise the marks of males relative to females; the strength of this effect is only a little smaller than for MS1 but because of a larger standard error of estimate it fails to reach significance (z = 1.09, p = .28).

Discussion

The most important feature of these analyses lies in the satisfactory fit obtained from the model presented in the Introduction, in the need to include factors MS1 and possibly MS2 in order to obtain that fit, and in the proportion of mark variance accounted for by these factors.
Thus, for example, under the model whose parameters are given in Table 4 the proportion of the total variance in the supervisor's mark attributable to the influence of MS1 ranges from 17.1 per cent for the B marks awarded to females up to 52.8 per cent for the D mark awarded to males with an average of 26.5 per cent for females and 31.1 per cent for males. In the case of the second marker the proportion of total variance due to MS2, which model 2 constrains to be the same for males and females, is 2.4 per cent for the B mark, 18.7 per cent for the C mark and 11.6 per cent for the D mark. What then do the factors MS1 and MS2 represent? A number of influences which might contribute to these factors are mentioned in the Introduction. Undoubtedly, the least interesting account of MS1 and MS2 is that
they merely reflect differences in stringency between the different individuals acting as first and second markers. There are, however, good reasons for believing that this is not a major component of MS1 and MS2. First, and most directly, it is possible to estimate a factor score for each project on MS1 and MS2 and to see whether these vary according to the marker. If differences in marker stringency are an important component of MS1 and MS2 then some markers should be associated with consistently high factor scores and some with consistently low factor scores. Although records of marker identity were not available for the first cohort of students whose data were included in the analysis, first markers were known for 190 projects and second markers for 168 projects. Analyses of variance comparing MS1 scores and MS2 scores across different markers yielded no hint of any significant differences between markers (for first markers, F(23, 166) = 1.26, p = .20; for second markers, F(23, 144) = 0.96, p = .52). The second reason for discounting differences in marker stringency as the main explanation, for MS1 especially, is that there is evidence that they are simply not large enough to have the impact which MS1 has. Newstead & Dennis (1994) report a study of the marking of examination answers by much the same group of markers as were involved here using the same marking scale. In terms of raw marks the standard deviation of marker means in that study was 2.32. Relating this to the standard deviations presented in Table 2 suggests that if differences in marker stringency here are of the same magnitude they would on average account for about 11.5 per cent of the mark variance. On this basis differences in marker stringency seem capable of accounting for only about two-fifths of the impact of MS1, although they might be large enough to account entirely for MS2. 
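Both calculations in this passage reduce to simple variance ratios. In the sketch below the mark SD of 6.8 in the usage note is a hypothetical value chosen only to illustrate the order of magnitude of the figure cited above; the factor shares are squared loadings times factor variance over total variance:

```python
def factor_share(loading, factor_var, total_var):
    """Proportion of a mark's variance attributable to one factor,
    assuming the factor is uncorrelated with the mark's other
    influences (as the marker factors are with SSA to SSD here)."""
    return loading ** 2 * factor_var / total_var

def stringency_share(sd_marker_means, sd_marks):
    """Share of mark variance that between-marker differences in mean
    stringency could account for. sd_marker_means = 2.32 is the value
    from Newstead & Dennis (1994); the mark SD must be supplied."""
    return (sd_marker_means / sd_marks) ** 2
```

For instance, stringency_share(2.32, 6.8) is about .116, the order of the 11.5 per cent mentioned above.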
It is of course possible that differences in marker stringency are much greater in project marking than in exam marking but this seems unlikely given that markers work in a greater variety of partnerships when marking projects and there is more opportunity for transfer of standards. Third, differences in stringency should be equal for first and second markers since essentially the same individuals are acting in both roles. In so far as the data tend to point to a model in which the impact of MS1 is greater than that of MS2, the first marker factor MS1 cannot wholly be attributed to variation in marker stringency.
Finally, differences in marker stringency should apply equally to both male and female students. The finding that the influence of MS1 is moderated by student gender therefore leads to the conclusion that it cannot be entirely attributable to this source. Thus whilst a part of MS1 and perhaps the entirety of MS2 might be attributable to section-transcendent differences in marker severity, there are good reasons for believing that some part of MS1, and probably the largest part, is not attributable to this cause. A second possibility raised in the Introduction is that factors MS1 and MS2 reflect a particular marker's reaction to section-transcendent features of the student's work such as writing style. If this is the case then markers are giving such features an inappropriately high weighting. The marking guidelines refer to quite separate features of the project for each section. It should be noted that all projects are word processed or typed, so quality of handwriting is not a candidate as a section-transcendent feature. There are in any case reasons for being sceptical about any explanation of this general type. MS1 loads not only on the marks awarded to the sections of the written project but also on the supervisor's A mark given for the conduct of the project over the year. It is difficult to see what sort of feature could transcend both conduct of the work of an empirical project during the year and the written report on the work. Again, in so far as the data favour a model in which MS1 has more impact than MS2, it is difficult to see why factors such as writing style should affect the marks awarded by supervisors to a greater extent than they affect the marks given by second markers, especially when it is recalled that it is the same group of individuals who act in both roles.
Finally, an account based on section transcendent features of the work offers no ready explanation of why the impact of MS1 is different when the work being marked is that of a male rather than a female student, though it is possible that some tortuous account based on stylistic differences in the work of the two genders might be developed. If explanations based on differences in marker stringency or reactions to section-transcendent features of the work are discounted, then what remains is the conclusion that when a mark is awarded to a section of the project, that mark is inappropriately influenced by factors external to that section. These external factors could be a carryover from other sections of the project or they could reflect an
influence of knowledge of the student external to the project report. That knowledge might in turn be an assessment of the student's abilities based on evidence outside of the student's project performance or may even less appropriately represent a reaction to the student's other personal characteristics. The present data do not offer a great deal of evidence with which to disentangle these possibilities. There is no reason to expect that halo effects internal to the project report should be any greater for supervisors than for second markers. Thus if the first marker effect is stronger this would suggest that supervisors' B, C and D marks are being influenced by their contact with the student over the course of the year rather than solely by the project report. Although the data do not provide a particularly strong case against a model in which MS1 and MS2 have equal influence it would be surprising if both markers were susceptible to halo effects within the project report whilst supervisors were totally impervious to effects from their prior knowledge of the student. The conclusion that in marking supervisors are influenced by their prior knowledge of the student is perhaps not a greatly surprising one although there have been few previous attempts to demonstrate or investigate it. One useful feature of the application of covariance modelling to investigating such influences is that it provides an estimate of the magnitude of the effect. Whilst the existence of the influence it reveals is unsurprising, the strength of the effect which emerges may provide grounds for both surprise and concern. It would clearly be useful to discover the extent to which the influences in question are related to the student's personal characteristics and the extent to which they are imported from aspects of academic performance other than that nominally being assessed. 
One personal characteristic to which the influences under consideration are to some extent related is the student's gender. A model in which the factor MS1 was allowed to have different path coefficients for males and females produced a significant improvement in fit over one in which these coefficients were constrained to equality. Inspection of Table 4 shows that the main difference in path coefficients between males and females relates to the B and D marks and that on these the coefficient of MS1 is greater for males than for females. This finding relates to the results of Bradley (1984) on project marking. She found that second markers who knew the student's gender tended to mark female projects less extremely than male
projects compared to supervisors. Bradley explained this by suggesting that the less expert second markers displayed a bias whereby they were reluctant to give female students extreme marks. As noted previously, if this sort of mechanism were at work in the present data the path coefficient of SSB to SSD on the second marker's marks should be smaller for female students than for males. There is no evidence for this. This may be unsurprising given that a previous study in the department from which the present data come failed to replicate Bradley's effect (Newstead & Dennis, 1990). Newstead & Dennis (1990) suggested that an alternative explanation of Bradley's results might lie in biases shown by supervisors concerning individual students. Such biases would inject variance into the supervisor's marks tending to make them more extreme than the second marker's marks. If the biases concerning individual students were stronger for females this could then explain the pattern of data observed by Bradley. In so far as they are consistent with the existence of quite a strong influence on supervisors' marks coming from reactions to individual students the present results are compatible with the proposal made by Newstead & Dennis. However, the direction of the gender effect in the present data is opposite to that which would be necessary to explain Bradley's data. Possible reasons for the discrepancy between Bradley's findings and those of Newstead & Dennis have been extensively rehearsed and the present results can only make a very limited contribution to resolving that debate. 
The present findings might perhaps most easily be reconciled with those of Bradley by suggesting that there are biases relating to individual students which may have a differential impact on the two genders but that whether this happens and which gender is most affected varies according to factors such as the type of material being marked, the population of students involved, and the particular set of markers. It is noteworthy that in this study the greater impact on male marks was restricted to the marking of the more discursive introduction and discussion sections. Another possibility is that gender is not the true variable underlying these effects but some other variable which was correlated with it in the samples studied, and which in particular was correlated in opposite directions in the present sample and in Bradley's sample.
The pattern which is evident in Table 2 whereby the marks given to males generally show greater variability than those of females, especially in the first marker's marking, is an example of a more general finding which has provoked considerable debate (Rudd, 1984). Is the fact that females seem to obtain less extreme marks and fewer first-class and third-class degrees a reflection of truly different patterns of performance in the two genders or does it reflect a difference in the way their work is marked? In the present case the difference in variance between male and female marks arises from the influence of MS1. Thus in this case the evidence inclines towards the position that the greater variance of male marks arises primarily from sources which influence all of the marks awarded by the first marker. Obviously this need not imply that all examples of mark variance being greater in males than in females arise in the same way. One advantage of the method used here, and in particular the use of a structured means model, is that it makes it possible to obtain more information about how differences in means between groups arise. In this case the difference in means between male and female marks seems to arise in the factor MS1 (and possibly also MS2) but not in the section specific factors, with the non-significant effect on SSA to SSD taking the form of a female superiority. It is important to recognize that this is quite a small effect relative to the variance of MS1. Thus although the average effect of MS1 is to favour males slightly, there will be some males as well as some females where it acts quite strongly to reduce marks. Having said this it is difficult to escape the conclusion that the sex difference in MS1 reflects an influence of the personal knowledge which the marker gains of the student during the year and that in this sample that influence was acting in a manner which on average favoured male students over female students. 
If it has any degree of generality this is a finding with significant implications for the marking process. The data and modelling reported here also have implications for the reliability of marking. It is evident from the correlations in Table 2 that inter-marker agreement on each section is only very moderate. This outcome is not surprising in relation to other data on the reliability of degree level assessment (Byrne, 1980; Cox, 1967; Laming, 1990; Newstead & Dennis, 1994). However, given the weight attached to the project in many degree schemes it might have been hoped that it would have a higher level of reliability than a single exam answer or even an exam paper.
In the present case it is clear that much of the disagreement between markers derives not from a random influence on their marks but rather from a consistent but marker-specific influence on the marks awarded to a particular student. This has important implications for the extent to which poor reliability on individual elements of assessment is overcome by averaging. Thus, for example, if an overall mark for the projects involved here were calculated by averaging the section marks, the agreement between supervisors and second markers on the project average would not be greatly superior to their correlation on individual sections. Authors who have found only modest reliability in exam marking in higher education have sometimes taken comfort from the benefits of averaging over large numbers of assessments (e.g. Newstead & Dennis, 1994). The present results suggest that it may be unwise to assume that averaging is satisfactorily dealing with the problems of unreliability. Table 4 shows some differences between markers and sections in the error component of mark variance. Interpretations of these differences are necessarily speculative. Supervisors show greater error variance than second markers on the mark for the introduction. Error variance here really means variance that is both marker and section specific. It may be that it is in marking the introduction that the supervisor's specialist knowledge of the literature is most relevant and hence it is here that their mark is most idiosyncratic. The data analysed in this study derive from a single department, and a relatively small group of markers was involved. In view of this it would be unwise to reach overly strong conclusions about marking in general. 
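The limited benefit of averaging noted above can be illustrated with a deliberately simplified variance decomposition (all variance figures hypothetical): if part of each marker's mark is a marker-specific effect shared across sections, that component survives averaging untouched and caps inter-marker agreement.

```python
def project_average_agreement(var_g, var_s, var_m, var_e, k):
    """Inter-marker correlation of an average of k section marks under
    a simplified model: mark = g (overall merit, shared by markers and
    sections) + s (independent section merit, shared by markers)
    + m (marker-specific effect, shared across sections) + e (noise).
    Averaging shrinks s and e by 1/k but leaves m untouched, so the
    correlation plateaus at var_g / (var_g + var_m) rather than at 1."""
    shared = var_g + var_s / k
    total = shared + var_m + var_e / k
    return shared / total
```

With unit variances for the merit and marker components and an error variance of 2, agreement rises only from .40 for a single section to about .45 for an average of four sections, against a ceiling of .50.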
The situation examined in which a supervisor teaches a student on an individual basis for a year and is then required to assess the work around which all their meetings have centred is perhaps unusually susceptible to the supervisor developing biases towards the student. However, the strength of the effects detected does suggest that the influence of markers' personal knowledge of the individuals whose work they are marking deserves considerably more attention than it has previously received. If the results reported here do have any generality then there is a strong case for avoiding, as far as possible, the assessment of work by those who know the student. Student gender appears to moderate the influence of the marker's personal knowledge of the student and although the effect is not enormous it again gives cause for concern and the generality of the effect warrants further investigation. Student gender was the only personal characteristic of the students which was considered in this study and it may well be that there are other characteristics which have a similar or even a larger
influence; this too deserves further study. Whilst caution is advisable in generalizing the conclusions of this study what may be more usefully generalized are its methods. Whilst there are ambiguities of interpretation, which have been discussed, the use of structural equation modelling provides a valuable new handle on marking bias. An accumulation of studies using the approach illustrated here could substantially improve our knowledge of the nature and magnitude of marking biases.