

List of figures.........................................................................................................................2
List of tables...........................................................................................................................2
1. Project Goals....................................................................................................................3
2. Importance and Potential Applications............................................................................5
3. Implementation ...............................................................................................................6
4. Experimental Results and Analysis..................................................................................11
5. Criteria of Success...........................................................................................................16
6. Comparison with other Implementations........................................................................16
7. Challenges and Limitations in the Implementation........................................................17
8. Accomplishments and Future Work................................................................................17
9. Historical Background and Related Work......................................................................18
9.1 Recent Work and Ongoing Research.....................................................................20

P a g e 1 | 25

List of Figures
Figures 3, 4 ..........................................................................................................................7
Figures 5, 6...........................................................................................................................8
Figures 7, 8...........................................................................................................................9
Figures 9, 10.........................................................................................................................10
Figures 15, 16 ......................................................................................................................15

List of Tables


Human beings have always longed for systems capable of governing themselves, and with the advances in artificial intelligence this dream has reached new horizons. One field that has seen a great deal of work in recent decades is the machine interpretation of human language and expression. Among the most powerful creations in this field to date is IBM Watson, a state-of-the-art autonomous question-answering system based on natural language. Well before its development, however, ETS had started working on the automatic interpretation of human-written responses in 1990, and that work has matured considerably. One related field is short-answer grading, which received little attention in the past but is now the focus of a great deal of research into automated checking and grading systems, owing to its fundamental importance in academia. Short-answer questions sit at the intersection of the subjective and objective extremes of testing, so it is genuinely hard to build a system that can understand a very concise response drawn from a wide range of valid answers. My project is a prototype of such an automated system, built in software by applying most of the Natural Language Processing techniques learnt during the course. The system achieves a high average weighted accuracy, reaching 80%.
1. Project Goals:
Academia has long wanted a system that can relieve it of the tedious task of grading quizzes, assignments, and the like, especially in the case of short-answer questions, where a particular correct answer may have numerous variants and many correct answers may express the same idea. Defining what is correct, and grading accordingly, is therefore always hard. Students, on the other hand, are often not satisfied with their grades and always suspect, at the back of their minds, that they could have been marked better. This quest for relief from the grading load on the faculty side, and the expectation of unbiased marking on the student side, has driven the development of many automated grading systems since the 1990s.

The ultimate goal is, and has always been, a system that can robustly grade question answers or essays of any length and in any field, and grade them at least as well as a human mind would: taking into account all the possibilities of a correct response, weighing every pro and con of the answer in a systematic manner, and producing a grade that justifiably reflects every effort of the student. In other words, the system should not settle for the shallow parsing we sometimes expect from humans, but should always perform deep parsing and analysis. The grading system is also expected to exploit the full range of Natural Language Processing techniques, be built on them, and be well resourced against possible updates and additions to the English language. This task is more demanding for short answers than for essays: there is less room for error and more need for deep analysis, whereas in essay grading only particular criteria are set beforehand and the flow of the text is checked for concordance with them. From here on, this report explores and considers only short-answer grading systems.

For a short-answer response, it would also be interesting to have the grading algorithm assign a confidence indicator (an idea I introduce here for the first time, in appreciation of Watson). Though somewhat redundant, it would depict the internal grading process, show the satisfaction the algorithm achieved with the given response, and help students better understand where they fell short. Checking the accuracy and precision of such a system is very tricky, since no one can say for sure which is the final and best answer to a particular question; we can argue from visible errors, but ambiguity will remain in many actual responses. One should instead test the system on a huge corpus of questions, check the consistency of the outcome, and compare it with the best human-graded response. An ideal grading system should also have some sort of feedback mechanism, either embedded in the algorithm or through a training corpus of errors, from which it can learn continuously after every response, grow wiser, seek more incisiveness and better logical interpretation, come up with implicit and explicit conclusions every time it is tested, and update itself accordingly.

Some other desirable functionalities concern the handling of the system:

Easy to handle, with a simple GUI
No tracking or manual adjustment required; a fire-and-forget system
Easy to install everywhere and easy to upload documents to
Grading criteria adjustable by the instructor, hence flexible in operation
Fast, grading multiple responses at the same time
These are the desirable characteristics of an ideal grading system. Application-specific additions may be required; for example, a military grading school may demand a different grading nomenclature, but the overall system should still try to follow these guidelines. They can also be regarded as a gold standard that every other implementation tries to follow, or against which it compares itself.


In my project the goals were essentially those mentioned above, and the main inspiration was to build an application-specific implementation that imitates the basic operations of Watson. Owing to time restrictions and to the scope and resources available, I limited the project goals to a subset of them. As mentioned in my literature review, my goal was to produce an implementation that follows the basic operations of a grading system but with a new approach, introducing a small-scale realization of such a system. The following salient features were set as goals for the project, and I was able to accomplish them successfully.

An autonomous system
Simple and area-specific
Grading based on NLP techniques
Flexible, robust, unbiased, and persistent
High accuracy and precision.

The basic structure of the project is as follows: one or more questions are asked, and the students' responses are recorded. The actual answer is known beforehand, and all of its acceptable variants are computed, either at run time or in advance; the student's response is matched against them and graded according to the semantics and syntax of the match. The first goal was to make the system independent of its surrounding environment, relying only on corpora of words, dictionaries, and associated texts for reference. Some counterpart implementations instead use a database of errors that is consulted for every suspected mistake in order to identify the error, a technique that is very time-consuming and exhausting. In this project I used only the resources available in NLTK for word referencing. NLTK supports most of the well-known corpora, which makes it a very large reference database. A snapshot of the corpora NLTK supports follows

Figure 1. Corpuses supported by NLTK

As can be seen above, many well-known corpora are included in NLTK, and additional corpora can be downloaded into it if required. Tagged versions of the corpora are included as well, which are quite helpful in natural-language operations. (Note that several of the corpora we have studied and referenced in our course, e.g. Brown, FrameNet, PropBank, and WordNet, are present here.)
To keep the implementation simple, I added several restrictions, including limiting the text of an answer to at most 4 words. I also reduced the number of synsets considered per word, and the number of lemmas per synset, to 3 each, so there are at most 36 candidate lemmas for a particular 4-word response. This substantially reduces the operational load on the algorithm but limits the number of alternative answers that can be considered. I also made the implementation area- or domain-specific for a given set of questions: this is achieved by loading a corpus whose word counts and respective bigram and trigram probabilities best correspond to that field. For example, the probability of the sentence "God created earth" is higher in the Genesis corpus than in a Shakespeare corpus. In the general case, the largest corpus can be used for all types of questions.


The goal of grading a particular student's response using NLP techniques means the system checks the syntax and semantics of the response for proper grading; it does not go into the logic of the answer. The system is therefore well suited to questions from specific fields, but it cannot provide good results for, say, chemistry- or physics-based question answers.
Another goal was to make the grading system flexible, not only in its grading criteria, which is the easy part, but also so that it can act as a seed for other implementations. The implemented system adapts to new requirements very smoothly, with very few changes, and it can even be run in reverse. The forward case aims to find the maximum possible match, in other words the highest similarity (minimum distance) between the actual answer and its other possibilities on one side and the student's answer on the other: the greater the match and the fewer the errors, the higher the score. In the reverse case, the same setup works in the opposite direction with only a few changes required: instead of looking up synonyms for every word, we look up antonyms, find the maximum similarity between those antonyms and the student's response, and thereby measure the greatest mismatch between the actual and student answers. The greater the dissimilarity, the higher the probability of a low grade. Other interesting variations can be built on this algorithm thanks to its high flexibility. The system is quite robust due to its simple calculations and linear grading scale; since the number of candidate answers and the maximum number of words are restricted, it is guaranteed to perform fast on short answers. It is also unbiased and persistent as long as adequate word-matching resources are available in the corpus or database. The algorithm performs with high accuracy and precision using the bigram probabilities calculated at the last stage, but its results with trigram probabilities are not appreciable, since specific matching of long strings with that technique is not effective.
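To make the forward and reverse modes concrete, they can be sketched in plain Python; the small hand-written synonym and antonym tables below stand in for the WordNet lookups the real system performs, and the function names are illustrative assumptions, not the project's actual code:

```python
# Stub lookup tables standing in for WordNet synonym/antonym queries (assumption).
SYNONYMS = {"god": {"lord", "deity"}, "created": {"made"}, "earth": {"world"}}
ANTONYMS = {"created": {"destroyed"}, "highest": {"lowest"}}

def forward_score(actual, student):
    """Forward case: fraction of positions where the student's word equals the
    expected word or one of its synonyms (higher means a better match)."""
    hits = sum(1 for a, s in zip(actual, student)
               if s == a or s in SYNONYMS.get(a, set()))
    return hits / len(actual)

def reverse_mismatch(actual, student):
    """Reverse case: count positions where the student's word is an antonym of
    the expected word (higher means a stronger mismatch, hence a lower grade)."""
    return sum(1 for a, s in zip(actual, student)
               if s in ANTONYMS.get(a, set()))

print(forward_score(["god", "created", "earth"], ["lord", "made", "world"]))    # 1.0
print(reverse_mismatch(["god", "created", "earth"], ["god", "destroyed", "earth"]))  # 1
```

With real WordNet lookups in place of the stub tables, the two functions share all of their machinery, which is the flexibility argued for above.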
2. Importance and potential applications:
A brief introduction to the importance of the project in today's world was provided at the start of the project goals section; a detailed description follows. In the past, the general measure for evaluating students' performance was paper-based long questions, but grading and checking those answers was a tedious job, so multiple-choice questions were widely adopted for their ease of grading and lack of ambiguity. MCQs are still the dominant form of evaluation nowadays, but there are certain fields where they are not effective and a short answer is required to elaborate the claim properly. The assignments in academia are also mostly based on short-answer questions, which MCQs cannot replace. Short-answer questions are better in terms of grading relaxation because they offer students more chance to answer, or at least to reflect some similarity of thought, whereas with MCQs students have very little room to go around: they have to select the exact match, which in some cases can be guessed using the many tricks available for MCQ-based testing. Essay-based testing, on the other hand, offers a very large room for both the student and the grader: small mistakes can be ignored, and the major emphasis is placed on the coherence of the essay with the assigned task. The problem with short-answer questions, in turn, is their grading, which requires considerable effort and constant concentration from the human checker, and can still suffer many lapses. Those errors are mostly general human errors, and sometimes limitations of the grader's vocabulary in a particular topic, especially when the field of testing is very vast and deep. Students are also often not content with the grades they get and expect a more automated, unbiased approach that can better satisfy their doubts. Cost and time also weigh very heavily when grading a large set of short-answer responses. All of these shortfalls demand a system that takes them into account and produces a single solution to all of them. Such a system could bring life-changing progress to academia: think of an institution where teachers do not have to grade anything, no quizzes, no assignments, no exams, with everything done in an automated, unbiased, and flawless fashion. This would be a win-win situation for both the faculty and the students, and the overall efficiency of academics would be markedly higher.
The biggest application of such a system is in academia and in online testing systems, e.g. ETS and the British Council testing system. In the last two decades, online testing organizations have spent a lot of resources on automated checking, previously focused mostly on essay grading; ETS has now funded the c-rater project, a form of automated short-answer grading system, in the expectation that essay writing may become more concise and shorter in length, at which point a more specific grading system, as in the short-answer case, will be required. Essay grading is basically accomplished using a training-based approach, in which the computer evaluator is trained in advance to track the changes in the text, find its coherence with the defined criteria, and weight and grade it accordingly. It requires a very large training corpus, but since the essays are previously selected and mostly repeat in context and sense, once such a system is created it is not difficult to make changes to it. In short-answer grading, by contrast, the task can be quite hefty, as we encounter questions from different areas and domains requiring very concise answers; conversely, modifications can be made in question

answering grading system to make it usable for automatic essay grading. Another potential application of automated short-answer grading systems is in real-time systems providing answers to FAQs online. Such a system can also be a component of other applications, e.g. text recognition, voice recognition, pattern matching, and language translation.
3. Implementation:
For the implementation of the scheme I used NLTK with Python. The toolkit is very well resourced and helpful; in particular, it ships with some of the best built-in tools for the numerous operations NLP-based tasks require. Python was used in command-line mode, where NLTK is imported to get started and the relevant libraries and tools are then loaded. All the basic feature calculation and processing is done there, and the results are compiled in MATLAB for further calculation and plotting.
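For readers reproducing the setup, a plausible installation sequence is the following; the exact list of NLTK data packages is an assumption inferred from the modules used later in this report (tokenizer, POS tagger, WordNet, and the named-entity chunker):

```shell
# Install NLTK, then fetch the data packages the pipeline relies on (assumed list).
pip install nltk
python -m nltk.downloader punkt averaged_perceptron_tagger wordnet maxent_ne_chunker words
```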
The basic problem I focus on is to capture the syntactic and semantic characteristics of both the actual answer and the student's answer, find as many valid alternative answers to a particular question as possible, and grade accordingly. Some particular terms are introduced, e.g. penalty and confidence, which are computed as a student's response is assessed. As described in my literature review (LITERATURE, n.d.)*, the basic implementation module is:

Figure 2. Basic implementation module(Taken from(LITERATURE, n.d.))

(* (LITERATURE, n.d.) is my previous literature review )

The figure above shows the basic implementation blueprint that guided the real-time design and implementation. All of the blocks were kept intact in the final design, with some modifications; for example, the maximum number of words taken for a candidate answer was increased from 3 to 4. The first two modules are very simple: they break the sentence into words and lemmas and check the word count, whose maximum for grading has been raised to 4. Next is the word identification block, whose task is to assign the respective part of speech to each word and to perform the entity recognition and relation extraction procedure. Once a word has been properly identified, it is sent to the matching block, the central component of the design. The matching block establishes all possible matches between the processed student response and the possible correct answers to the question, under different matching criteria applied in sequence: first direct matching, then synonym-based matching, assigning a particular weight to each word by comparison with the actual answer. Bigram weighting then takes place, and the ambiguity score is calculated at the same time. The results are put into the grading module suggested


below in Figure 3. As the figure shows, there were two scoring modules, one based on linear calculation and the other on non-linear scoring. Both scores were to be combined into a final score by threshold comparison, but the final implementation could not include the non-linear module due to time constraints. The non-linear scoring block was intended to take the weights and other processing factors into account and produce a machine-learning-based best score. The grading criteria are very simple and flexible, since the instructor or operator can tune them by adjusting the score thresholds.

Figure 3. Grading procedure(taken from (LITERATURE, n.d.))

The final implementation was divided into two stages. The first stage is concerned with the actual answer, which is known beforehand; since there can be many valid interpretations of that answer, the nearest interpretations are sought, along with its syntactic parse. The possible interpretations of every word that can lead to a meaningful answer are stored, together with their respective POS tag positions, for further processing. This stage can run either before the student's response is processed or at run time, so its output is a form of a priori information. Next, the student's response-checking module is considered. Its first basic operations are the same as in the actual-answer calculation block, with additional entity identification and relation extraction blocks added for better granularity of the sense the student wants to convey. It is followed by the matching section, which applies different checking constraints, calculates grade values at run time, and assigns the respective penalty and confidence values. One of the most important parts of the matching block is the calculation of the bigram and trigram probability scores that further support the evidence; this section is a form of a posteriori information calculation. Let us first consider the first stage, the actual-answer calculation (a priori).

Figure 4. Actual Answer calculation (first stage)


Let us consider the details of the implementation through a simple sample question and the calculation of its possible answers by this set of modules. Suppose we have a couple of questions: 1. "Who created earth?" or 2. "Which war in the world has the highest casualties?" These are two of the simplest questions, yet their respective answers can take many forms, which is why I selected them to demonstrate how the scheme actually works. The answer to the first question seems very straightforward, but it has many possible forms: the one selected as the actual answer is "God created earth", but "Lord created earth" is another valid answer, as is "earth created by Lord/God", and so on; there can be numerous possibilities specific to a particular area or sect. The second question also admits many valid answers: the actual answer "Second World War", but also "world war II/2", "war of 1939", etc. Clearly, the number of correct possibilities depends first on the syntax, or positioning, of the words and second on their semantics, both of which are calculated in this stage. Let us see how it progresses. First, the actual answer is inserted at the top of the word separator module, which splits the sentence into its respective words and possible lemmas. If the number of words/lemmas turns out to be more than four, the response is discarded without any further processing. (In the case of shallow parsing, the tokenization and splitting steps would be identical.) The next stage finds the possible POS tags of the response. A screenshot of the sample calculation is shown below

Figure 5. Sample actual answer tokenization and POS tagging
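The tokenize, length-check, and tag steps can be sketched as follows; here whitespace splitting stands in for NLTK's tokenizer and a tiny hand-written tag lookup stands in for nltk.pos_tag, both simplifying assumptions made so the sketch is self-contained:

```python
# Minimal stand-in for the tokenize -> length-check -> POS-tag pipeline.
TAGS = {"god": "NNP", "created": "VBD", "earth": "NN"}  # stub for nltk.pos_tag

MAX_WORDS = 4  # responses longer than this are discarded without processing

def tag_answer(answer):
    tokens = answer.lower().split()          # stand-in for nltk.word_tokenize
    if len(tokens) > MAX_WORDS:
        return None                          # discarded: too long to grade
    return [(t, TAGS.get(t, "NN")) for t in tokens]

print(tag_answer("God created earth"))
print(tag_answer("a very long answer indeed"))  # None: exceeds the 4-word limit
```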

As can be seen, the correct POS tags have been assigned to the sample response. The next module finds the possible synsets for each word separately and stores only the first three of them; from each of these synsets, only the first three names (lemmas) are selected. This truncation is done to keep the calculations scalable, as it would become very difficult to consider every synonym and every possible placement for each word. Overall, up to 36 possible words can be selected for a given 4-word answer; the number can be smaller but cannot exceed this. Many words have a very small synset with few synonym names, but some can have more than 40 synonym names for a single word. The following snapshot shows the synset calculation and the respective synonyms within each synset. As seen in the figure below, the sample answer is first tokenized and POS-tagged, then the WordNet instance is called and used to find the possible synonyms and lemmas for each word in the sample answer. E.g. a1 represents all the synsets possible for the word "god" from WordNet. Each of those synsets contains a set of synonym names and lemmas, extracted using the lemma_names() function; the synonyms for the word "God" in the second synset are "deity", "divinity", and "immortal", any of which can replace the current word "God", and the same holds for the other words. The figure below shows how this is accomplished in NLTK


Figure 6. Synset calculation and synonym names and lemmas with in those synsets
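The 3-synsets-by-3-lemmas truncation can be sketched like this; the nested lists below are a stub for the lemma_names() of each synset WordNet would return (the second synset for "god" follows the one quoted above, the other entries are illustrative):

```python
# Stub for nltk.corpus.wordnet: each word maps to a list of synsets,
# each synset to its list of lemma names.
SYNSETS = {
    "god": [["god", "supreme_being"], ["deity", "divinity", "immortal", "god"],
            ["idol", "graven_image"], ["beau_ideal"]],
    "created": [["make", "create"], ["produce", "bring_forth"]],
    "earth": [["earth", "world", "globe"], ["land", "dry_land", "terra_firma", "ground"],
              ["earthly_concern"], ["soil", "dirt"]],
}

def candidate_lemmas(word, max_synsets=3, max_lemmas=3):
    """Keep only the first 3 synsets and the first 3 lemmas of each, so a word
    contributes at most 9 candidates (36 in total for a 4-word answer)."""
    out = []
    for synset in SYNSETS.get(word, [])[:max_synsets]:
        out.extend(synset[:max_lemmas])
    return out

print(candidate_lemmas("god"))
```

Note that the fourth synset of "god" and the fourth lemma of its second synset are dropped by the truncation, which is exactly the trade-off between load and coverage discussed above.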

Next, Viterbi parsing is used to find the maximum-likelihood placement of the words forming a candidate answer, given their respective POS tags, with the help of a predefined context-free grammar in which the probability of each word and tag following another in the sentence is already defined. A snapshot of the Viterbi-based parsing is shown below

Figure 7. Viterbi based parsing to find maximum likelihood position given their tags defined by grammar

Word combinations forming candidate answers, together with their corresponding tags, are accepted if their likelihood exceeds the defined minimum; otherwise the combination is not considered a valid answer sentence. We now have a set of valid answers obtained by determining all acceptable synonyms and word placements. This set serves as redundant matching data alongside the gold-standard answer.
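The idea behind the Viterbi-based check can be sketched with a compact probabilistic CKY parser over a toy grammar in Chomsky normal form; the grammar and its probabilities below are illustrative assumptions, not the grammar actually used in the project:

```python
# Toy PCFG in CNF: lexical rules map words to constituents with probabilities,
# binary rules combine two constituents into a parent.
LEXICAL = {"god": [("NP", 0.5)], "earth": [("NP", 0.5)], "created": [("V", 1.0)]}
BINARY = {("NP", "VP"): [("S", 1.0)], ("V", "NP"): [("VP", 1.0)]}

def best_parse_prob(words):
    """Return the probability of the most likely S spanning the whole sentence
    (0.0 if no valid parse exists), via Viterbi-style CKY."""
    n = len(words)
    chart = [[{} for _ in range(n + 1)] for _ in range(n + 1)]
    for i, w in enumerate(words):                      # fill length-1 spans
        for sym, p in LEXICAL.get(w, []):
            chart[i][i + 1][sym] = p
    for span in range(2, n + 1):                       # grow longer spans
        for i in range(n - span + 1):
            k = i + span
            for j in range(i + 1, k):                  # split point
                for b, pb in chart[i][j].items():
                    for c, pc in chart[j][k].items():
                        for a, pr in BINARY.get((b, c), []):
                            p = pr * pb * pc
                            if p > chart[i][k].get(a, 0.0):
                                chart[i][k][a] = p     # keep the best derivation
    return chart[0][n].get("S", 0.0)

print(best_parse_prob(["god", "created", "earth"]))   # 0.25
```

A candidate answer would be kept only if this probability exceeds the minimum likelihood threshold, exactly as described above; NLTK's ViterbiParser plays this role in the actual system.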
The second stage block of modules is shown below


Figure 8. Student Answer Checking (2nd Stage)

The first three modules are exactly the same as in the actual-answer stage: the student's response is fed to the word separator and is then tokenized and POS-tagged. The response is next sent to the entity recognition module, which employs chunking to perform binary entity recognition, i.e. it either recognizes the tagged words or it does not. For this task NLTK provides a trained classifier that recognizes named entities, called via the function nltk.ne_chunk(). For example, applying this function to the phrase "Japan and Nagasaki" gives the following outcome

Figure 9. Named entity recognition example answer

This entity relation can be drawn as follows


Figure 10. Named entity recognition graph for example answer

The named entity graph is similar to the parse tree, with the fundamental difference that the main emphasis here is on NEs (named entities) only. If the input to the named entity recognizer is not a valid one, the result returned is false. The next module, entity relation extraction, extracts the relations between the entities. This module is optional and can be skipped, as it takes additional time and the algorithm can work without it. If the words fail to establish the proper entity recognition and extraction procedure, a very high penalty of 0.6 is assigned and the confidence is lowered to 0.333, meaning there is no way the response can get any score higher than a C.
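The binary recognize-or-not behaviour can be sketched with a stub gazetteer standing in for the trained classifier behind nltk.ne_chunk(); the entity set and function name here are illustrative assumptions:

```python
# Stub entity list standing in for the NE classifier's knowledge (assumption).
KNOWN_ENTITIES = {"japan", "nagasaki", "hiroshima"}

def entities_recognized(tokens):
    """Binary entity recognition: every capitalized token must be a known
    named entity, otherwise the check fails (triggering the 0.6 penalty)."""
    proper = [t for t in tokens if t[:1].isupper()]
    return all(t.lower() in KNOWN_ENTITIES for t in proper)

print(entities_recognized(["Japan", "and", "Nagasaki"]))    # True
print(entities_recognized(["Atlantis", "and", "Nagasaki"]))  # False
```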

Penalty value
<= 0.3
<= 0.4
> 0.7 and <= 1

Table 1. Grading and penalty assignment

The grade score for a particular response starts at 1 and, as the algorithm progresses, is either reduced or left unchanged depending on the penalties incurred. If there is no penalty, or there is a direct match, the grading score stays at its maximum. The confidence threshold ranges from 1 down to 0.1665, in three steps, and confidence is shed in different proportions at different stages depending on the type of penalty. Like the grade score, the confidence threshold starts at 1 at the beginning of processing and decreases as penalties of various types are incurred. The penalties are simply summed as they occur at each stage, and the grade value and confidence threshold are modified accordingly. The penalty types, penalty values, and confidence threshold calculation are shown in the table below.

Type of penalty                               Penalty value   Confidence threshold at corresponding penalty (C)
Wrong entity recognition and extraction       0.6             0.333
Failed exact matching                         -               -
Unable to match with possible answers         0.3             -
Failed bigram/trigram probability (0.0)       0.4             -

Table 2. Types of penalty, their values, and confidence threshold
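The linear accumulation of penalties into the grade score can be sketched as follows; the stage names are mine, the penalty values (0.6, 0.3, 0.4) are those quoted in the surrounding text, and the confidence schedule is left out because its exact steps are not stated in full:

```python
# Penalty per failed stage, using the values given in the text (names are mine).
PENALTIES = {
    "entity_recognition": 0.6,   # failed entity recognition/extraction
    "possible_answers": 0.3,     # no match against the candidate answers
    "ngram": 0.4,                # failed bigram/trigram probability check
}

def grade_score(failed_stages):
    """Start from the maximum score of 1 and deduct each incurred penalty;
    penalties add up linearly, and the score never drops below 0."""
    total_penalty = sum(PENALTIES[s] for s in failed_stages)
    return max(0.0, 1.0 - total_penalty)

print(grade_score([]))                    # 1.0 (direct match, no penalty)
print(grade_score(["possible_answers"]))  # 0.7
```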

Next comes the matching module of Figure 8. It first performs direct matching between the actual answer and the student's answer; if the two match, no further processing is needed and the highest grade is awarded with the highest confidence. The next matching compares the student's response with all the valid possible answers calculated in the first stage. The matching is based on an AND condition between the actual answer (or one of its valid possibilities) and the student's response, i.e.

(W_A_1 == W_S_1) and (W_A_2 == W_S_2) and (W_A_3 == W_S_3) and (W_A_4 == W_S_4)

where W_A_i are the words at the respective index positions i of the actual answer and its possibilities, and W_S_i are the words at the respective index positions i of the student's answer, for i = 1, 2, 3, 4. One important point: if the actual answer and its resultant possible answers are shorter than the student's answer, only the actual answer's words are checked and the redundant words are ignored. When an answer matches in this way, its respective POS tags are matched as well. If the matching fails at this stage, a penalty score of 0.3 is applied, reducing the confidence accordingly.
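This positional AND rule can be sketched directly in Python (the function names, and the treatment of a student answer shorter than the candidate, are illustrative assumptions):

```python
def and_match(candidate, student):
    """Positional AND over the candidate answer's words: every word of the
    candidate must equal the student's word at the same index. Extra words
    in a longer student answer are treated as redundant and ignored."""
    if len(student) < len(candidate):
        return False  # assumption: a shorter student answer cannot match
    return all(a == s for a, s in zip(candidate, student))

def matches_any(candidates, student):
    """The student's answer is accepted if ANY candidate (the actual answer
    or one of its synonym/placement variants) matches positionally."""
    return any(and_match(c, student) for c in candidates)

print(and_match(["god", "created", "earth"], ["god", "created", "earth"]))   # True
print(and_match(["god", "created", "earth"], ["lord", "created", "earth"]))  # False
```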
In the last module, bigram and trigram probabilities are calculated. If no valid match is found for the bigram/trigram module, the penalty is 0.4 and the confidence threshold is adjusted accordingly. (Note that the penalties add up linearly for each module failure, as shown in Figure 8.) A sample bigram calculation follows

Figure 11. Bigram Probability Count of sample answer

Here the probability is highest for a well-known co-occurring pair, whereas the bigram probability of two unrelated words is low.
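The bigram probability used at this stage is a maximum-likelihood estimate from corpus counts, P(w2 | w1) = count(w1 w2) / count(w1). A self-contained sketch, with a toy token list standing in for the NLTK corpus actually loaded, is:

```python
from collections import Counter

def bigram_prob(tokens, w1, w2):
    """MLE bigram probability P(w2 | w1) estimated from a token sequence."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    if unigrams[w1] == 0:
        return 0.0                     # unseen history word
    return bigrams[(w1, w2)] / unigrams[w1]

# Toy stand-in for the Genesis corpus.
corpus = "in the beginning god created the heaven and the earth".split()
print(bigram_prob(corpus, "god", "created"))   # 1.0: "created" always follows "god"
print(bigram_prob(corpus, "the", "heaven"))    # 1 of the 3 occurrences of "the"
```

As the text notes, the well-known pair scores high while an unrelated pair such as ("earth", "god") scores 0.0; the trigram case extends the same counting to three-word windows.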
4. Experimental Results and Analysis:
The scheme was implemented in NLTK as mentioned before, and I created the questions myself with the corpus support in mind, so most relate to history and to prominent places, people, books, events, and science. A set of more than 30 questions was prepared and tested, later pruned to 30. A few questions from the sample set, with their gold-standard answers, are:
Q1- What is the fastest thing in the world?
Ans- Speed of light
Q2- What is the famous law of action and reaction?
Ans- Newton's third law
Q3- Which war has the highest loss of human lives?
Ans- Second World War
Q4- Which British trading company was the first to arrive in the subcontinent?
Ans- East India Company
Q5- First ever atomic bombs were dropped on which two cities?
Ans- Hiroshima and Nagasaki

Q6- What was the name of the famous novel by Charles Dickens in 1859?
Ans- A Tale of Two Cities
Q7- Where was the Titanic sunk on its maiden voyage?
Ans- North Atlantic Ocean
Q8- Which canal connects the Mediterranean Sea and the Red Sea?
Ans- Suez Canal
Q9- What is the name of the painting by da Vinci located in Milan?
Ans- The Last Supper
Q10- What is the founding document of the American political tradition?
Ans- The Declaration of Independence
After getting the responses from the test subjects (in this case, me and my friends), they were marked by peers (also some of my friends) to produce the grading that would act as the actual grading. The same answers to the thirty questions that the presumed students had given were fed to the system, and grades were calculated for the bigram and trigram cases. The graph of grade assignments for the students' responses using bigram and trigram probabilities for matching, together with the actual grading for the 30 questions, is shown below.

Figure 12. Grading by human evaluator and automated by system for each answer

One observation from the results is that bigrams correlate better with the actual grading than trigram matching does. This observation becomes more vivid when we use weighted accuracy, average weighted accuracy and mean square error to see how well the two approaches perform compared with the actual answer grading.
Weighted accuracy: in order to calculate the weighted accuracy for the bigram case, weights are assigned for different values of dissimilarity as follows:

Difference between actual/gold standard and student's grading | Assigned weight
Table 3. Weights assignment for calculating weighted accuracy

For the actual answer and the bigram-based matched result we have:

For A grade:
For B grade:
For C grade:
For Fail:
A plot of the weighted accuracy for bigram-based matching is shown below.

Figure 13. Weighted accuracy for automated case using Bigram

Most of the time we achieve high weighted accuracy for the different grades, but it is relatively higher for the C grade. One possible reason is that bigram matching at the final stage ensures that the grading does not deteriorate even when there is no exact match, provided there is enough bigram probability (calculated from the corpus) to give the student some credit instead of simply failing him. The average accuracy in this case is 80%.
The weighted accuracy in the case of trigram vs the actual/gold standard is calculated as follows:
For grade A: Abs(9-2) = 7
For grade B: Abs(8-6) = 2
For grade C: Abs(7-13) = 6
For Fail: Abs(6-9) = 3
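The Abs(...) terms above are per-grade absolute differences between the human and system grade counts. A short sketch of that calculation (the grade lists below are illustrative reconstructions consistent with the stated counts, not the actual data; the mapping from these differences to a weighted accuracy then uses the Table 3 weights):

```python
from collections import Counter

def grade_differences(human_grades, system_grades):
    """Count how often each grade was assigned and take per-grade |diff|."""
    h, s = Counter(human_grades), Counter(system_grades)
    return {g: abs(h[g] - s[g]) for g in "ABCF"}

# Illustrative trigram-case counts matching the text:
# the human assigned grade A 9 times, the system only 2 times, etc.
human  = ["A"] * 9 + ["B"] * 8 + ["C"] * 7 + ["F"] * 6
system = ["A"] * 2 + ["B"] * 6 + ["C"] * 13 + ["F"] * 9

print(grade_differences(human, system))  # {'A': 7, 'B': 2, 'C': 6, 'F': 3}
```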


A corresponding plot is shown below

Figure 14. Weighted Accuracy for automated case employing Trigram

The average weighted accuracy in this case is 55%. The weighted accuracy for the trigram case is not very appreciable, owing to the fact that for specific questions there are not many trigram matches available in the corpora in use.
The root mean square error calculated between the bigram- and trigram-based approaches vs the actual answers, for the 30 questions considered, is as follows:
RMSE for Bigram = 2.1213
RMSE for Trigram = 4.9497
As can be seen, the trigram case has a much larger root mean square error than the bigram case (more than double), since in my scheme matching against a set of three words is not that effective for question answering.
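On a numeric grade scale, the RMSE above is sqrt(mean((g_system - g_human)^2)) over the answers. A small sketch (the grade lists and the A=4/B=3/C=2/Fail=0 encoding are illustrative assumptions, not the actual project data):

```python
import math

def rmse(predicted, actual):
    """Root mean square error between two equal-length numeric grade lists."""
    assert len(predicted) == len(actual)
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                     / len(predicted))

# Illustrative numeric grades (e.g. A=4, B=3, C=2, Fail=0), not the real data.
human   = [4, 3, 2, 0, 4, 3]
bigram  = [4, 3, 2, 2, 4, 3]   # stays close to the human grading
trigram = [2, 2, 0, 0, 4, 0]   # drifts further from it

print(rmse(bigram, human), rmse(trigram, human))
```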
Another measure calculated during processing was the confidence threshold. This is an auxiliary measure included to show how confident the system was when it graded a particular answer; in other words, it is the system's internal self-assessment of a particular grade. For example, if the confidence threshold is 1, the system is 100% sure that the grade it has awarded is really the correct or deserved one, whereas a confidence threshold of 0.1665 shows that although the system has assigned a grade, it is only about 16% sure that the grade is the correct one. The confidence measurement in my system decreases with the penalties, so most of the low grades are accompanied by lower confidence levels. Numerous changes could be introduced into the confidence threshold calculation. Confidence thresholds for the bigram and trigram matching cases are plotted below for the 30 questions.
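Since the penalties accumulate linearly (0.3 for a failed POS-tag match, 0.4 for a failed bigram/trigram match, as given earlier), the confidence can be sketched as 1 minus the accumulated penalties. The stage names and the clamping at zero are assumptions for illustration:

```python
# Confidence threshold as 1 minus linearly accumulated penalties.
# The 0.3 and 0.4 values come from the text; names and clamping are assumed.

PENALTIES = {"pos_tag_fail": 0.3, "ngram_fail": 0.4}

def confidence(failed_stages):
    """Start from full confidence and subtract a penalty per failed stage."""
    c = 1.0
    for stage in failed_stages:
        c -= PENALTIES.get(stage, 0.0)
    return max(c, 0.0)

print(confidence([]))                              # exact match -> 1.0
print(confidence(["pos_tag_fail"]))                # -> 0.7
print(confidence(["pos_tag_fail", "ngram_fail"]))  # -> 0.3 (approx.)
```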


Figure 15. Confidence Threshold for Bigram case

Figure 16. Confidence Threshold for Trigram case

It can easily be seen that the bigram case has better confidence thresholds, due to better matching at the bottom stage, whereas in the trigram case we have much lower confidence thresholds except for the exact-matching cases, which have maximum confidence. The very low probabilities from trigrams result in clustering at the lower grades.
5. Criteria of Success:
In order to determine the criteria for success of any question answering system, real-time feedback is needed, along with responses from different cadres of students, to probe the pitfalls of the project, test its extremes and define its limitations. The general success criterion is to match any response against the gold standard or real answer, but there can be answers that are better than the real answer, so it is difficult to define hard and fast bounds around the actual answer. Another approach is to survey the system among students and faculty to argue about its successes and weaknesses. Yet another criterion is to assess the success of each individual module and then its contribution to the whole system. Some criteria for evaluating success are discussed in (Kakkonen & Sutinen, 2008), where the basic focus is on how a grading system first defines the goals to be achieved, how far it achieves those goals, and what other trade-offs it makes.
In my system, the general criterion of success for the overall system was how closely the grading calculated by the system for any given student answer followed the grading by the human evaluator. So the question arises: what is the best criterion that a human evaluator can come up with for a particular response? There can be more than one response from different human evaluators, and these may not correlate among themselves. In order to solve this and make my analysis close to a real determination of actual success, I made/selected the questions myself and then answered them with a couple of friends (acting as test subjects) to come up with different possible answers. Those responses acted as answers from the students. Then, with the help of other friends, those responses were graded and, out of every three gradings per answer, the grade with the majority vote was selected. The same answers from the students/test subjects were then given to the program for evaluation, and the resulting grades were recorded and later cross-checked against the grades from the human evaluators. The closer the proximity between the two, the better the success. The success rates for the grades using bigrams and trigrams are directly proportional to the weighted accuracies calculated above: the higher the accuracies, the better the success for that particular grade, and ultimately the more successful the overall system. The confidence threshold is a type of self-assessment of success for the system: the greater its value, the more the system assumes it has come up with a result that is expected to be successful (assigning the right grades), and vice versa.
To determine the actual success of my implementation, it has to be seen in a broader context, as defining hard-bound criteria for grading systems is always an arduous task. The overall performance of the system was satisfactory, and it mostly achieved the expected results. To be more specific about success, the system has to be tested and operated in a neutral environment: the question making and checking originated, in some sense, in a close circle, so in order to establish overall success it has to go through more rigorous and independent testing. The system achieved good weighted accuracies for the bigram case, resulting in a very small RMSE. The average weighted accuracy in the bigram case was 80%, which is quite healthy.
6. Comparison with other implementations:
There are other implementations that perform the same grading task but with unique characteristics depending on the type of task assigned. In (Leacock, 2004) the system uses paraphrases as the smallest unit of analysis, whereas my approach works on individual words instead of paraphrases and, after the word-based operations, examines the possible correlations between the different words. The algorithm in (Leacock, 2004) initially requires a large training set of 100 models for cross-validation of responses; though this can help performance, it is very cumbersome to find such large model sets for different cases. My approach needs no explicit training and works on the built-in resources of the software (NLTK), which makes it more robust but a little less accurate than trained systems. In (Mohler & Mihalcea, 2009) the proposed scheme is based on unsupervised learning and is tested with different corpora. Its information retrieval step uses a form of pseudo-relevance feedback and cosine-based similarity, whereas in our case similarity is based on exact matching and the information is retrieved directly from the student without any feedback. In (Mohler, Bunescu, & Mihalcea, 2011) the proposed scheme is based on dependency graphs: the graph is syntax-aware, and this feature drives a three-stage answering pipeline. The system takes in the student and instructor answers and, for each node in the instructor's dependency graph, calculates a corresponding score in the student's dependency graph; the scoring function is trained on a small corpus of manually aligned graphs using a perceptron. The fundamental difference between that approach and mine is that I have a previously checked answer from the instructor and only observe what the system does with the answer fed to it, whereas in that paper the two responses are fed simultaneously, similarity is calculated between them, and grading uses a dependency tree that takes syntactic measures into account without exploring semantic richness: it simply compares the two. Essentially, that work focuses on developing a syntax-based similarity between the two responses and then verifying it with different machine learning algorithms (e.g. SVM) to see how closely they match, with the resulting scores used to assign grades. Apart from its shortcoming of not exploring the semantic richness of the answer, a fundamental limitation of that approach is that it must be trained manually on a set of possible dependency graphs, which is never sufficient, and the results may vary with the type and quantity of the training set. There is no such requirement in my approach. In (Sukkarieh, Pulman, & Raikes, 2003) the paper describes an approach to short-text grading based on information-extraction techniques, with a naive text-classification approach used for baseline grading. This approach is quite specific to a particular field of study and may not perform well in others, as the classification is based on keywords of a particular topic obtained from a particular training set. It also requires a very large training set to make selections of the correct answer, as it does not directly calculate the possibilities using NLP techniques for semantics and syntax; it is a form of classification of the answer based on its origin and context, reflective of its training. In our case, by contrast, the approach is largely field-independent given a large, ubiquitous corpus, and we directly employ NLP techniques to find the possibilities, with the grading fully dependent upon them.


7. Challenges and limitations in the implementation:

The biggest challenge of the implementation was to make it as independent as possible, meaning it can be applied anywhere without prerequisites. I also decided to implement it in such a way that no previous feedback in terms of errors is inserted, in order to reduce computational time and complexity and to allow the system to make decisions at runtime without looking into any database other than the software's own resources for error feedback. This was a challenging trade-off between accuracy, computation and innovation, and it was coped with through different measures and the built-in support of the NLTK software. For example, instead of looking for a match of a particular set of words in a whole corpus of more than a million words, we can use a simple function such as nbest(), which gives much faster and nearly as accurate results. The other big challenge was the selection of the corpus or texts to be included; the simplest solution was to select the largest corpus with the most relevant data. Another challenge was determining the possible candidate answers using the Viterbi-based parsing, which was based on a context-free grammar; this grammar cannot be assumed final, and many modifications to it are possible. The likelihoods calculated over such a corpus were also not universally applicable and affected the selection.
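The nbest() mentioned above refers to the top-n ranking offered by NLTK's collocation finders, which return the n best-scoring candidate pairs rather than scanning the whole corpus for every query. A rough pure-Python stand-in for that top-n idea (not NLTK's actual implementation) is:

```python
import heapq
from collections import Counter

def nbest_pairs(tokens, n):
    """Return the n most frequent adjacent word pairs, most frequent first.
    A stand-in for the kind of ranking NLTK's collocation nbest() provides."""
    counts = Counter(zip(tokens, tokens[1:]))
    return heapq.nlargest(n, counts, key=counts.get)

tokens = "east india company east india trade east india company".split()
print(nbest_pairs(tokens, 2))  # [('east', 'india'), ('india', 'company')]
```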
One limitation of the project is that it is heavily dependent on the type (domain) and size of the corpus used: it answers questions more accurately when a more appropriate corpus is selected that can enhance the respective likelihoods. The preparation of test questions and the pool of test subjects (students) were premature; a very thorough survey including students of different cadres is required to diversify the results and determine the bounds of the approach used. The same is the case with the human evaluations that are taken as the actual answers: different people can grade the same response differently, so more professional evaluators from the corresponding fields are needed to grade better, and that would ultimately test the system's limitations. Also, the answers checked and graded using this approach are based on language text, which may relate to general language-processing test sets (e.g. history, business), but the system is not expected to perform well on answers from other fields (e.g. chemistry, physics). The system's accuracy also depends on whether we employ the bigram or the trigram probability measure for matching, for evidence retrieval and word-sense disambiguation. The size of the answer (number of words) was also limited, to keep the system simple. The synsets from WordNet, and the synonyms within them, for a particular word were also pruned to a small size in order to avoid complexity and avoid choking the system.
8. Accomplishments and Future Work:
The need for improvement in short-question answer grading is immense; there is a lot of effort nowadays to find effective ways to bridge the gap between the objective and subjective extremes with a short-answer solution, but it is quite challenging. In my work I have tried to show a diverse approach in this regard. I have tried to imitate the basic functionalities of the WATSON question answering system in a more specific way, e.g. segmentation, tokenization, entity recognition and relation extraction, word-sense disambiguation and confidence calculation for answering a particular response. The implemented system is independent of an error database for comparison and works on runtime data using mostly the resources of the underlying software (NLTK in this case). In this way it is quite independent of external resources and can work in a standalone environment. The implementation is flexible in the sense that it can also be applied in reverse: instead of finding the synonyms and the maximum match, we can find the maximum likelihood of the lowest possible score. As shown above, it has a high average weighted accuracy of 80% for the bigram case. The grading criteria, as well as the other parameters, can also be adjusted at any time, owing to the linear mode of calculation.
Future work in this field is not limited to academia; other fields can also benefit from advances in it. The current work is just a prototype of an actual large-scale implementation, and many changes could be made to improve performance. One piece of future work I would suggest is to increase the number of words under consideration from an answer; currently it was limited to only four. The answer was checked purely on the basis of NLP-based approaches; future work should include other information-extraction processes to cover a broader realm. The implementation should be domain-free, or should cover multiple domains, and should include a unified large corpus bridged with online resources to get better information extraction, free from any particular domain-related shell. In that way it could be expected to grade answers from a much wider variety of fields instead of just language-based ones. The grading, penalty calculation and confidence threshold used here were calculated linearly and in a very simple way due to shortage of time, but in future they could be calculated using machine learning methods, e.g. a perceptron or an SVM, which assign values efficiently using discriminative analysis. Future work may also include its application not only to LMSs but to other online services, e.g. Turnitin-like software and FAQ grading. The grading system currently works only for computer-based answer grading; an ambitious future work is to use it to check paper-based question answers by using image extraction and resolution, and then using the result as text input into this type of system.


Another future direction is to employ an error detection and correction mechanism using machine learning algorithms, so that every time the system makes a wrong grading it tries to improve next time and avoids making the same error again. A general future direction is the building of concise content structure by allowing common-sense handling. Another future application of such systems relates to speaking and listening, though it is very far from being realized, because voice recognition is itself still in a state of evolution; if it progresses to a credible stature, then all the tasks required for language qualification, including reading, writing, listening and speaking, could be marked automatically by software.
The validity of the answers from such systems will always pose the question of what level of success they can achieve. As WATSON has surpassed human intellect, would such systems come to expect more from the average human mind, and how should they be developed and maintained so that they correspond only to certain bounds of human intellect, neither too demanding nor too generous? This question has not been thought about much and will probably come into play when these types of systems become more accessible to the masses: how do we maintain the level of expectation for a particular set of people? For example, it would be difficult to expect the same criteria from a student at MIT and one at a low-ranked university, because we can only change the grading criteria, not the internal processing of the system.
9. Historical background and related work (taken from (LITERATURE, n.d.)):
Let us begin with a brief overview of the historical work and progress to date relevant to the implementation of the project. The use of language processing capabilities in the field of academics dates back to the 1960s, when linguistic experts started to form concepts of employing those techniques for educational scoring and checking. The first actual application of its kind was introduced in 1982 under the name Writer's Workbench, a form of text editor for helping teachers (Freedman, 1984). In the late 90s, ETS introduced a system for automated rating of essays, trained for different tasks, called e-rater (Attali & Burstein, 2006). This tool was used for rating GMAT essays and TWE (Test of Written English) essays. The underlying technique is shallow parsing to identify syntactic or other features that differ from the trained ones; weighted content words are used for checking. The essays that get high marks are those that deviate less from the actual topic and employ powerful vocabulary with diverse syntactic structures. The implementation is based on NLP and other machine learning and statistical tools. E-rater achieves a very high performance efficiency of more than 80% on unseen data, but the approach is mostly based on multi-class classification. Error feedback is grouped into four categories, namely mechanics, grammar, comments about style, and usage; many of these techniques are indicated by natural language processing. Counting the errors from the four classes above gives rise to four features, the rates reflecting the number of errors divided by the total number of words in the essay. The cumulative score is evaluated by adding up the number of theses, supporting ideas, main points and elements of result or conclusion in a given essay.
The Conceptual rater (c-rater) works on the foundations of NLP and is designed for the assessment of short, content-based question answers. The techniques adopted by c-rater are mostly inherited from NLP and from those made for e-rater, though the two systems differ in functionality: e-rater's score assignment is based on writing skills rather than specific content, whereas c-rater's score assignment is mostly a binary classification of correct or incorrect. The main concept lies in the close alignment of the response with a particular domain: if the sample response shows the relevant concepts, it is considered right, otherwise incorrect, and no care is taken of the writing efficacy. Also, in e-rater the grades are assigned on the rhetorical form of the essay, whereas c-rater needs to find more particular content. C-rater also does not need a large corpus of training essays; instead it relies on a particular right answer, the key. This is based on the assumption that it is impractical to have a large data set for marking short questions. The score agreement with the human reader for c-rater is around 80%.
The Intelligent Essay Assessor (IEA) was another approach introduced in the late 90s, based on Latent Semantic Analysis (LSA), a technique in which words and their respective documents are represented in a semantic space by virtue of a two-dimensional matrix (Assessor et al., n.d.). Employing the algebraic formulation of singular value decomposition (SVD), numerous new correlations between documents and their contents are observed. Each cell of the matrix denotes the frequency of a word in a particular context. This first matrix is converted to inverse frequency weighting for a particular approach. Three component matrices are obtained from the SVD decomposition. Word associations can be represented by the reduced representation of the three matrices; new co-occurrences between words and context are formed by restructuring approximations to the actual matrix, and these relations are then reflected. The grading procedure is performed by building a matrix corresponding to the document and then converting it, employing SVD, to reform the matrix with reduced dimensions, which is used as the semantic space for analyzing the essay.


Schema Extract Analyse and Report (SEARS) is a software system developed by Christie (Christie, 1999). According to the schema, either or both of style and content may be used for automated grading of essays, depending on the situation; thus both the content and the style of an essay can be marked accordingly. For the assessment of style, the technique adopted is to use common metrics along with an initial calibration. The candidate metrics are decided using a subset of essays acting as a training set and identifying them manually. The calibration process proceeds by adjusting the weight of every metric until an appropriate agreement between human marking and the computer-based marking is attained on the complete set of essays. For content-based assessment, technical essays are considered the candidates. The content-based schema is created once and can be updated quite easily and rapidly; in SEARS it requires no calibration or training and is treated as a simple data structure. Two approaches have been made to help in the automated marking process: usage and coverage. The usage-based approach determines how much of the schema each essay employs, whereas the coverage-based approach determines the quantity of the schema present in the essay under examination. The goal is to develop a relationship between essay and schema.
Another implementation was Project Essay Grade, developed and implemented in 2000 (Miel, 2014). It evaluates an essay on the basis of the quality of the writing and does not take the content into account. It is based on a technique called proxes, in which different parameters are included in the evaluation, e.g. the length of the essay and counts of different parts of speech (frequency of pronouns, nouns, etc.); these are taken as measures of the structural complexity of the sentences in a sample. Proxes are created from a set of training essays and then transformed by a standardized multiple regression; the coefficients of the regression, when applied to the proxes, approximate the human grades. Unfortunately, no NLP technique is employed in this approach, and the lexical content is not taken into account. The approach is also mostly dependent on the number and types of essays used for training.
Bayesian Essay Test Scoring System (BETSY) is another text-classification program, developed by Lawrence M. Rudner (University of Maryland) (Dikli, 2006). The objective of the system was to classify essays according to a four-point nominal scale, e.g. unsatisfactory, partial, essential and extensive, based on both style-based and content-based tasks. The models employed for classifying text are the Multivariate Bernoulli Model and the Bernoulli Model. In the first model, every essay is taken as a specific case of all the calibrated features, and the overall probability is calculated as the product of the probabilities of the individual features in the sample essay. This model needs a lot of time, since all the terms in a given vocabulary need to be parsed. Both of these models are considered naive Bayes models, owing to the underlying assumption of conditional independence. BETSY is based on approaches that incorporate the best aspects of e-rater, LSA and PEG. Other advantages include its applicability to short essays; it can also be used for diagnostic analysis, and multiple-skill tasks can be classified with its help. It is a Windows-based program. The accuracy achieved by BETSY on a particular set of data is 80%.
The Intelligent Essay Marking System (IEMS) (Valenti, Neri, & Cucchiarelli, 2003) relies on neural-network-based pattern indexes. It can be employed not only as an assessment tool but also as a diagnostic tool for content-based cases. Feedback from the system is immediate, and students can very easily learn about their actual performance. The grading of the essay relies upon qualitative question types, in contrast to numerical types. Indextron, a specific clustering algorithm, can be implemented as a neural network; this type of neural network tries to overcome the non-incremental and slow training that is the case for ordinary artificial neural networks. Experiments with this approach on numerous evaluations of students' essays obtained a correlation of 0.8.
Another proposed work (Mitchell, Russell, Broomhead, & Aldridge, 2002) is a tool for computerized marking of text answers named Automark, based on the information-extraction approach. The implementation has various separate modules that check for errors in syntax, vocabulary and semantics. A number of marked templates are already present, and the student's/candidate's response is matched against all the possible templates and then marked accordingly. The underlying scheme is based on natural language processing basics. Two main sub-approaches for evaluation can be employed, based either on keyword analysis or on natural language processing. Keyword analysis looks for the presence or absence of defined text in the answer and marks accordingly, whereas NLP-based systems perform an in-depth analysis of the text; this approach is quite complex and expensive, though it provides a more granular and realistic measure of the answer. The scheme gives performance accuracy close to that of humans (95%).
The Paperless School free-text Marking Engine (PS-ME) (Valenti et al., 2003) was designed as an integral part of a web-based learning management system, but due to its high processing requirements the system did not grade essays in real time. It uses NLP-based techniques to evaluate students' essays according to knowledge and semantics. Each essay response is uploaded to the server along with the task, to identify the right text by comparison; alongside, a master text based on negative responses or errors is created. Parameters are calculated through linguistic analysis, and the weightings obtained from the comparisons are combined into a numeric value that depicts the assignment grade. The procedure for setting up the auto-marker for a given task is easy and depends only on the selection of a master text and a number of different sources; regression analysis is then applied to find the best fit between the grades provided by the marker and those resulting from the different parameter combinations, and this is then uploaded to the server.
In (Pulman & Sukkarieh, 2005) the authors propose an approach in which a hidden Markov model part-of-speech tagger, trained on particular corpora, is used for data extraction, and inductive logic programming is used alongside decision-tree learning and Bayesian learning. An annotated data set is used for analysis and grading. The authors then compare the various machine learning approaches, which produced different grading results in the given scenario; the Naïve Bayes approach performs better than the rest. It is also shown that the same algorithm assigns marks differently depending on whether or not an annotated corpus is used, with the annotated corpus producing better results, as expected.
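The Naïve Bayes comparison reported above can be sketched as a tiny multinomial classifier over bag-of-words features. The training answers, mark bands, and test answers below are invented for illustration; the actual features and corpora in Pulman and Sukkarieh's work are far richer than this.

```python
from collections import Counter, defaultdict
import math

# Tiny invented training set: short answers labelled with a mark band.
train = [
    ("the cell membrane controls what enters and leaves the cell", "full"),
    ("membrane controls movement of substances in and out", "full"),
    ("it is the outside part of the cell", "partial"),
    ("the cell wall is green", "zero"),
]

class_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in train:
    for w in text.split():
        word_counts[label][w] += 1
        vocab.add(w)

def grade(answer):
    """Return the most probable mark band under a multinomial Naive Bayes
    model with add-one (Laplace) smoothing."""
    best, best_lp = None, -math.inf
    for label in class_counts:
        lp = math.log(class_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for w in answer.split():
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(grade("the membrane controls what leaves the cell"))  # → full
```

With an annotated corpus the labels come from human markers, which is the setting in which the paper reports the better results.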
9.1 Recent Work and Ongoing Research (taken from (LITERATURE, n.d.)):
The field of question answering has in recent years been divided into two main domains: open and closed (Walke & Karale, 2013). Two further subtypes are web-based QAS (Question Answering Systems) and rule-based QAS. The open domain, as the name suggests, is very wide and handles almost everything, owing to a very large and strong knowledge of the world and its ontology; such systems employ very large data sets to find the closest answer. Closed-domain answering deals only with questions from a particular field, e.g. medicine or biology, so the task can be seen as easier than its open-domain counterpart. A QAS is built on three modules that serve as the backbone of the system: classification, information retrieval, and answer extraction.
In (Sneiders, 2009) an interesting web service answering FAQs in an automated fashion is proposed. Various modules of knowledge are identified, and recurring user queries are maintained in a database that serves as a seed for answering future questions on the web. Because the content on a particular site is similar, the size of the database remains almost constant. Different question templates capture different types of knowledge, e.g. lexical, morphological, and syntactic.
(Ou, Mekhaldi, & Orasan, 2009) employ ontology-based QA that also uses textual entailment. In this approach, hypothesis questions are produced from a particular ontology and a corresponding query template is generated; an entailment engine is then employed to find the entailed hypothesis, and a later evaluation is carried out to find the accuracy of the QA method.
A study on the similarity of questions for question answering is carried out in (Dong, Shi, Wang, & Lv, 2009). The basic methodology first obtains the sets of questions that have dependencies, and keywords are then extracted from the major components of each sentence. Related libraries are created from the major parts of the sentences and the target question, the candidate set of questions is ascertained by comparing the field words and the keywords, and the sentences with the largest similarity are returned.
In (Varathan, Sembok, & Kadir, 2010) an automated lexicon-generation process is proposed. The generated lexicon is integrated with a question answering system that employs a logical inference model. The major benefit of this approach is that automated lexicon generation significantly lowers the manpower and time required for NLP-based QA, making the system fast.
A system named Auto-Assessor is proposed and implemented in (Cutrone, Chang, & Kinshuk, 2011), using a natural-language short-answer marking mechanism that takes into account the semantic meaning of the answer. Its two main phases are text preprocessing and word/synonym matching.
An enhanced automated question creator is proposed in (Gütl, Lankmayr, Weinhofer, & Höfler, 2011), in which questions are generated automatically for further testing and analysis using different natural language processing techniques. The results reveal that the created questions are comparable in relevance to human-generated questions.
In (Hao & Wenyin, 2011) an automated scheme is proposed for answering repeated questions, relying on semantic question patterns. Semantic constraints are introduced for the non-constant terms in each pattern, which optimizes the semantic representation and greatly lessens ambiguity. The three major steps of the implementation are structure processing, similarity-based pattern matching and filtering with evaluation of question similarity, and finally answer retrieval.
In (Fikri & Purwarianti, 2012) a case-based analysis of Indonesian is made in a closed domain. The analysis was carried out keeping in view that different information domains come with different QA statistics, so a case-based approach is proposed in which each QA set depends on the closed-domain frequency of occurrence of a particular topic.
In (Malik, Sharan, & Biswas, 2013) a domain-restricted QA system is proposed. The domain restriction allows a focus on the knowledge of a particular field; the scheme amalgamates the restricted domain knowledge with various NLP techniques at different stages.
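Several of the systems surveyed above rank stored questions by keyword overlap with the user's question before retrieving an answer. A minimal sketch of that idea follows; the stopword list and FAQ set are invented for illustration and this is not the actual algorithm of Sneiders or Dong et al.

```python
# Crude illustrative stopword list; real systems use much larger ones.
STOPWORDS = {"the", "a", "an", "is", "of", "in", "to", "what",
             "which", "how", "do", "does", "i"}

def keywords(question):
    """Crude keyword extraction: lowercase, strip '?', drop stopwords."""
    return {w for w in question.lower().replace("?", "").split()
            if w not in STOPWORDS}

def similarity(q1, q2):
    """Jaccard overlap between the keyword sets of two questions."""
    k1, k2 = keywords(q1), keywords(q2)
    return len(k1 & k2) / len(k1 | k2) if k1 | k2 else 0.0

def best_match(target, candidates):
    """Return the stored question most similar to the target question."""
    return max(candidates, key=lambda q: similarity(target, q))

faq = [
    "How do I reset my password?",
    "What are the opening hours of the library?",
    "How can I borrow a book from the library?",
]
print(best_match("When is the library open?", faq))
```

The matched question's stored answer would then be returned, which is why keeping the question database small and recurrent, as Sneiders does, keeps this lookup fast.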


Work is still ongoing towards a more generalized question answering system with good speed and high accuracy. Such work also depends on the fact that implementation requires testing to be online rather than paper-based, which presents a real difficulty for actual classroom deployment. Other challenges in the current era include handling multiple corpora that may follow different rules, effective and fast parsing, and domain-free analysis. As the web is populated with more and more data, a question may be answered in numerous ways and even in new lexical patterns, so the real challenges lie in open-domain, free-text analysis of questions. Additionally, there is no baseline or generalized pattern for a general QA system.
Taking into account all the progress made through the years, one can easily see that much more work has been devoted to essay rating than to question answering. All the work done to date still needs optimization and is focused on a particular feature, either speed or deep analysis, in order to achieve a minimum threshold of accuracy. Another limiting factor is the need for a huge corpus of already-evaluated training data. A further issue in most of the approaches is their dependence on the keywords in an individual's response: a student may get a higher score by providing a large number of keywords, or a poor score if not enough keywords are used. That is why, in preparation for such computer-graded exams, individuals are given specific instructions on how to score well, e.g. avoid repeating words in a sentence, use diverse vocabulary, and so on.
The reason is that most online testing systems to date employ the two most popular forms of testing: essay-based and MCQ-based. These two evaluation techniques ensure that there is very little ambiguity in assigning a grade. MCQs are the most direct and can be seen as pinpoint answers, with no room for any natural-language processing or statistics. Essays, on the other hand, leave considerable room for uncertainty, especially in natural-language interpretation, which may vary from person to person. To obtain a uniform automated marking and grading system, one has to train it on a large training set, validate it and test the errors on that training data, and apply various statistical analyses to the outcomes of the NLP modules: how many pronouns are used for a particular topic, the frequency of verb use in a sentence, the overall word frequencies, the line-by-line bigram probabilities compared to the key (model) essay, the variance among the types of parts of speech used, and so on. These and many other factors need to be addressed and are discussed later in this review.
From this we can see the depth of analysis required to achieve consistent marking for a particular topic and set of instructions. Still, essay-based marking leaves plenty of margin for biasing and weight adjustment in order to produce a well-founded grade, because marking is based, in a sense, on the flow and sequence of information from the individual, and that flow is the key that guides the way to the actual grade. When we compare this to short-answer marking, the room for error is very narrow and there is no sense of flow of information, no directed graph of expectations about the next state. In other words, we have to take care of accuracy just as in MCQ-based analysis, while also applying essay-grading techniques to a candidate answer.
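Some of the surface statistics mentioned above, word frequencies and bigram probabilities measured against a key essay, can be computed with a short sketch. The reference and candidate texts below are invented; a real system would add part-of-speech tagging, smoothing, and per-line analysis on top of this.

```python
from collections import Counter

def bigram_probs(text):
    """Maximum-likelihood bigram probabilities P(w2 | w1) from a text."""
    words = text.lower().split()
    unigrams = Counter(words)
    bigrams = Counter(zip(words, words[1:]))
    return {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

def answer_features(reference, candidate):
    """Compare a candidate answer against a reference (key) text:
    fraction of candidate bigrams also seen in the reference, plus the
    candidate's word-frequency profile."""
    ref_probs = bigram_probs(reference)
    cand_words = candidate.lower().split()
    cand_bigrams = list(zip(cand_words, cand_words[1:]))
    seen = sum(1 for b in cand_bigrams if b in ref_probs)
    overlap = seen / len(cand_bigrams) if cand_bigrams else 0.0
    return {"bigram_overlap": overlap, "word_freq": Counter(cand_words)}

reference = "the heart pumps blood through the body"
feats = answer_features(reference, "the heart pumps blood")
print(feats["bigram_overlap"])  # → 1.0
```

Features like these would feed the statistical analysis of the grading modules; a paraphrased but correct answer scores lower on bigram overlap, which is exactly the keyword-dependence weakness noted above.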

References
Assessor, I. E., Technologies, K. A., Kat, T., Analysis, L. S., Intelligence, A., & Lsa, U. (n.d.). Research-Based Scoring Across Subject Areas.
Attali, Y., & Burstein, J. (2006). Automated essay scoring with e-rater V.2. The Journal of Technology, Learning, and Assessment, 4(3), 1–30.
Christie, J. R. (1999). Automated Essay Marking for both Style and Content. Proceedings of the Third Annual Computer
Assisted Assessment Conference.
Cutrone, L., Chang, M., & Kinshuk. (2011). Auto-assessor: Computerized assessment system for marking students' short-answers automatically. Proceedings - IEEE International Conference on Technology for Education, T4E 2011, 81–88.


Dikli, S. (2006). An Overview of Automated Scoring of Essays. The Journal of Technology, Learning, and Assessment, 5(1).
Dong, Q., Shi, S., Wang, H., & Lv, X. (2009). Study on similarity of simple questions based on the catering field. 2009
International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2009, (3).
Fikri, A., & Purwarianti, A. (2012). Case based Indonesian closed domain question answering system with real world questions. 2012 7th International Conference on Telecommunication Systems, Services, and Applications, TSSA 2012, 181–186.
Freedman, S. S. W. (1984). The Evaluation of, and Response to Student Writing: A Review. American Educational Research Association (AERA).
Gütl, C., Lankmayr, K., Weinhofer, J., & Höfler, M. (2011). Enhanced automatic question creator - EAQC: Concept, development and evaluation of an automatic test item creation tool to foster modern e-Education. Electronic Journal of E-Learning, 9(1), 23–38.
Hao, T., & Wenyin, L. (2011). Automatically answering repeated questions based on semantic question patterns. Proceedings of the 10th IEEE International Conference on Cognitive Informatics and Cognitive Computing, ICCI*CC 2011, 272–277.
Kakkonen, T., & Sutinen, E. (2008). Evaluation criteria for automatic essay assessment systems - there is much more to it than just the correlation. ICCE 2008 Proceedings, 111–115.
Leacock, C. (2004). Scoring free-responses automatically: A case study of a large-scale assessment. Examens, 1, 19.
Malik, N., Sharan, A., & Biswas, P. (2013). Domain Knowledge enriched framework for restricted domain question answering.
Miel, S. (2014). Project Essay Grade (PEG) Summative Assessments.
Mitchell, T., Russell, T., Broomhead, P., & Aldridge, N. (2002). Towards robust computerised marking of free-text responses. Proceedings of the Sixth International Computer Assisted Assessment Conference, 233–249.
Mohler, M., Bunescu, R., & Mihalcea, R. (2011). Learning to Grade Short Answer Questions using Semantic Similarity Measures and Dependency Graph Alignments. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 752–762.
Mohler, M., & Mihalcea, R. (2009). Text-to-text Semantic Similarity for Automatic Short Answer Grading. Proceedings of the 12th Conference of the European Chapter of the Association for Computational Linguistics (EACL '09), (April), 567–575.
Ou, S., Mekhaldi, D., & Orasan, C. (2009). An ontology-based question answering method with the use of textual entailment.
2009 International Conference on Natural Language Processing and Knowledge Engineering, NLP-KE 2009.


Pulman, S. G., & Sukkarieh, J. Z. (2005). Automatic short answer marking. EdAppsNLP 05 Proceedings of the Second Workshop on Building Educational Applications Using NLP, (June), 9–16.
Sneiders, E. (2009). Automated FAQ answering with question-specific knowledge representation for web self-service. Proceedings - 2009 2nd Conference on Human System Interactions, HSI '09, 298–305.
Sukkarieh, J. Z., Pulman, S. G., & Raikes, N. (2003). Auto-marking: using computational linguistics to score short, free text responses. The Annual Conference of the International Association for Educational Assessment (IAEA), Manchester, UK, 1–15.
Valenti, S., Neri, F., & Cucchiarelli, A. (2003). An Overview of Current Research on Automated Essay Grading. Journal of Information Technology Education, 2, 3118.
Varathan, K. D., Sembok, T. M. T., & Kadir, R. A. (2010). Automatic Lexicon Generator for Logic Based Question Answering System. Computer Engineering and Applications (ICCEA), 2010 Second International Conference on, 2, 4–8.
Walke, M. P. P., & Karale, S. (2013). Implementation Approaches for Various Categories of Question Answering System, (Ict).
