
Resources: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.

Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff
The CoNLL-2013 Shared Task on Grammatical Error Correction, Ng et al.
Better Evaluation for Grammatical Error Correction, Dahlmeier and Ng
Resource: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.

[Figure: two panels, "Standard NLP evaluation" and "Error detection evaluation", each relating System output to Annotator tag]


 Comma restoration task
 Commas are removed from well-edited text (gold standard)
 System tries to restore commas by predicting their locations
 Comparison:
▪ Binary distinction (presence or absence of comma)
 Comparison can be represented in a contingency table

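Accuracy (A): $A = \frac{TP + TN}{TP + FP + FN + TN}$
Precision (P): $P = \frac{TP}{TP + FP}$
Recall (R): $R = \frac{TP}{TP + FN}$
F-measure (F1): $F_1 = \frac{2PR}{P + R}$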
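The four measures can be computed directly from the contingency table of the comma restoration task. The following is a minimal sketch, not code from the slides: the names `Counts` and `count_outcomes`, and the toy comma positions, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Counts:
    tp: int
    fp: int
    fn: int
    tn: int


def count_outcomes(positions, gold, predicted):
    """Tally the 2x2 contingency table over candidate comma positions."""
    c = Counts(0, 0, 0, 0)
    for pos in positions:
        if pos in predicted and pos in gold:
            c.tp += 1          # comma restored where the gold text has one
        elif pos in predicted:
            c.fp += 1          # comma predicted where the gold text has none
        elif pos in gold:
            c.fn += 1          # gold comma that the system missed
        else:
            c.tn += 1          # no comma in either
    return c


def accuracy(c):
    return (c.tp + c.tn) / (c.tp + c.fp + c.fn + c.tn)


def precision(c):
    return c.tp / (c.tp + c.fp) if c.tp + c.fp else 0.0


def recall(c):
    return c.tp / (c.tp + c.fn) if c.tp + c.fn else 0.0


def f1(c):
    p, r = precision(c), recall(c)
    return 2 * p * r / (p + r) if p + r else 0.0


# Hypothetical data: 10 inter-word positions, gold commas at 2, 5 and 8.
counts = count_outcomes(range(10), gold={2, 5, 8}, predicted={2, 5, 7})
print(counts, accuracy(counts), precision(counts), recall(counts), f1(counts))
```

Returning 0.0 when a denominator is empty is one possible convention; other evaluations leave the value undefined.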
 Comma error detection task
 System seeks to find and correct errors in the writer's usage of commas
 Intricacies:
▪ Positive class: an error by the writer that involves a comma (not the presence of a comma), i.e. a mismatch between the writer's sentence and the annotator's judgement
▪ Negative class: writer and annotator agree
▪ System's judgement has not been considered yet
▪ Writer-Annotator-System (WAS)
Contingency scheme for WAS

Each case is a triple (W, A, S) = (Writer, Annotator, System); C = comma, NC = no comma.

Considering System prediction and Writer's form together (W-S): the prediction is positive (P) when the system's form differs from the writer's, negative (N) otherwise.

Considering System prediction and Gold standard together (A-S): each case becomes TP, FP, FN or TN.

W    A    S    W-S   A-S
NC   NC   NC   N     TN
NC   NC   C    P     FP
NC   C    NC   N     FN
NC   C    C    P     TP
C    NC   NC   P     TP
C    NC   C    N     FN
C    C    NC   P     FP
C    C    C    N     TN
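The Writer-Annotator-System scheme for the binary comma case can be written as a small decision rule. This is an illustrative sketch under the assumption that a positive prediction means the system's form differs from the writer's; the function name `classify_was` is hypothetical.

```python
from itertools import product


def classify_was(writer, annotator, system):
    """Return (system_prediction, outcome) for one W-A-S triple.

    Each form is "C" (comma) or "NC" (no comma).
    """
    is_error = writer != annotator   # positive class: writer and annotator disagree
    flagged = system != writer       # system proposes changing the writer's form
    prediction = "P" if flagged else "N"
    if flagged:
        outcome = "TP" if is_error else "FP"
    else:
        outcome = "FN" if is_error else "TN"
    return prediction, outcome


# Enumerate all eight (W, A, S) combinations.
for w, a, s in product(("NC", "C"), repeat=3):
    print((w, a, s), classify_was(w, a, s))
```

With only two possible forms, a flagged true error is always corrected to the annotator's form, so detection and correction coincide here.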
Simplified contingency scheme
 The case of the * row in the simplified WAS contingency table
 Concerns different categories (TP, TN, FP, FN) depending on whether the evaluation is for detection or correction
▪ For detection: TP
▪ For correction: X ≠ Y ≠ Z is both FP (system ≠ writer) and FN (system ≠ annotator)
 Distribution of positive and negative error classes is highly skewed
 13% errors in preposition usage by L2 writers (Han et al., 2006)
 Baseline system that always predicts "no errors"
▪ 87% accuracy
 All the measures are affected by the proportion of errors in the gold standard
▪ Prevalence
 All the measures are affected by the proportion of cases that the system reports as errors
▪ Bias
 Effect on a system that performs no better than chance

 Increase in P when prevalence increases

 Increase in R when bias increases
Expected proportion of TP matches

• Expected match between Annotator and System = product of their probabilities for the respective categories (in this case Error/No-Error)

• The expected proportion of TP matches is equal to the product of the proportion of cases assigned the Error label by the Annotator (i.e., the prevalence) and the proportion of cases assigned the Error label by the System (i.e., the bias)
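The chance-level expectation can be made concrete with a short sketch. It assumes the system's labels are statistically independent of the annotator's, so every cell of the contingency table is a product of marginals; the function name `chance_level_metrics` is hypothetical.

```python
def chance_level_metrics(prevalence, bias):
    """Expected metrics for a system whose labels are independent of the gold labels.

    prevalence: proportion of cases the annotator labels Error
    bias:       proportion of cases the system labels Error
    """
    tp = prevalence * bias
    fp = (1 - prevalence) * bias
    fn = prevalence * (1 - bias)
    tn = (1 - prevalence) * (1 - bias)
    precision = tp / (tp + fp) if bias > 0 else 0.0        # simplifies to prevalence
    recall = tp / (tp + fn) if prevalence > 0 else 0.0     # simplifies to bias
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": tp + tn, "precision": precision, "recall": recall, "f1": f1}


print(chance_level_metrics(prevalence=0.2, bias=0.2))
print(chance_level_metrics(prevalence=0.4, bias=0.3))
```

Under this assumption precision equals the prevalence and recall equals the bias, which is why both measures move when either proportion changes even though the system carries no information.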


System1: Predictions are correct at chance level

• Cohen's kappa: $\kappa = \frac{p_o - p_e}{1 - p_e} = 0$, since the observed agreement $p_o = 0.68$ equals the chance agreement $p_e = 0.68$

• Accuracy = 0.68, Precision = 0.04/(0.04+0.16) = 0.2, Recall = 0.2 and F1 = 0.2

System2: Prevalence and bias remain the same

• Cohen's kappa: $\kappa = \frac{0.80 - 0.68}{1 - 0.68} \approx 0.38$

• Accuracy = 0.80

• Removing the cases expected to show agreement by chance, the System is correct in 38% of the remaining cases

System3: Increased bias and prevalence + Predictions are correct at chance level

• Cohen's kappa: $\kappa = 0$

• Accuracy = 0.54, Precision = 0.40, Recall = 0.30, F1 = 0.34
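Cohen's kappa for the three systems can be checked with a few lines of code. This sketch assumes 100 cases per system; the individual cell counts are not given on the slides and are derived here from the reported accuracy, precision and recall, so they should be read as an illustration.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a 2x2 Annotator-vs-System table."""
    total = tp + fp + fn + tn
    p_o = (tp + tn) / total                       # observed agreement (accuracy)
    prevalence = (tp + fn) / total                # annotator's Error rate
    bias = (tp + fp) / total                      # system's Error rate
    p_e = prevalence * bias + (1 - prevalence) * (1 - bias)  # chance agreement
    if p_e == 1:
        return 0.0                                # degenerate table: only one label used
    return (p_o - p_e) / (1 - p_e)


# Assumed counts per 100 cases (derived from the reported metrics).
system1 = dict(tp=4, fp=16, fn=16, tn=64)    # accuracy 0.68, chance level
system2 = dict(tp=10, fp=10, fn=10, tn=70)   # accuracy 0.80, same prevalence and bias
system3 = dict(tp=12, fp=18, fn=28, tn=42)   # accuracy 0.54, chance level

for name, table in [("System1", system1), ("System2", system2), ("System3", system3)]:
    print(name, round(cohen_kappa(**table), 3))
```

System1 and System3 score differently on accuracy and F1 yet both have kappa near zero, which is the point of correcting for chance agreement.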


 Variability in prevalence or error rates
 Prevalence changes with populations of learners with different native languages
 Different levels of proficiency in the second language
 Variability in bias
 Detection system dependent
 Threshold for marking Error/Non-Error
▪ Higher threshold → lower bias
▪ Lower threshold → higher bias
 Dealing with sensitivity to bias
 Vary the threshold and generate a precision-recall curve
 Area under the Receiver Operating Characteristic curve (AUC)
▪ True Positive Rate = $\frac{TP}{TP + FN}$ = p(true|true)
▪ False Positive Rate = $\frac{FP}{FP + TN}$ = p(true|false)
▪ Area under random prediction: the region below the 45° diagonal
▪ Effect of random prediction is not nullified
 Area under the kappa curve (AUK)
▪ Cohen's κ plotted against the False Positive Rate


 Positive class consists of an error in the writer's text
 No 1:1:1 correspondence between writer's sentence, annotator's correction and type of error

Writer's sentence: Book of my class inpired me
Article error: A book in my class inspired me
Number error: Books for my class inspired me
Article+Number error: The books of my class were inspiring to me
 Assuming no ambiguity in error type
 What would be the size of unit over which error is
defined?

The book in my class inspire me


a) The book in my class inspires me
b) The books in my class inspire me
• Unit size: Morpheme level? Word level? Phrase level?
String level?
• Token-based approach vs String-based approach
 Variability of size can be handled with Edit Distance Measures (EDM)
 inspire → inspires is the same as book…inspire → book…inspires
 EDM can handle multiple overlapping errors
 Sequence: "…development set is similar with test set…"
 Correction1: with → to, plus insertion of the
 Correction2: with → to the
 The EDM can handle both
 EDMs are good for comparison, not for providing feedback to the writer
 If book and inspire are not linked, feedback such as a subject-verb agreement violation cannot be provided
 Negatives consist of non-errors in writer’s
text
 set complement of positive class?
 Appropriate set of non-errors cannot be easily
specified
▪ Book of my class inspire to me
▪ Negatives: a of, a my, a class, a inspire, a to, a me, a .?
▪ Should only the noun phrases be counted?
▪ He is fond beer . ( negatives or negatives)
 Error data is biased towards the negative class
 The negative counting strategy has major consequences for performance reporting
 Identifying negatives through trivial means inflates true negatives (TN), keeping the other counts in the contingency table constant
▪ Increases A
▪ R and P are not affected
Accuracy: 0.54, Kappa = 0.00

Inject 100 more TNs

Accuracy: 0.77, Kappa = 0.21
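The effect of injecting true negatives can be reproduced numerically. The sketch below assumes the starting table holds 100 cases with TP = 12, FP = 18, FN = 28 and TN = 42; these counts are an assumption chosen to be consistent with the reported accuracy of 0.54 and kappa of 0.00, not figures taken from the slides.

```python
def summarize(tp, fp, fn, tn):
    """Return accuracy, precision, recall and Cohen's kappa for a 2x2 table."""
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    prevalence = (tp + fn) / total
    bias = (tp + fp) / total
    p_e = prevalence * bias + (1 - prevalence) * (1 - bias)  # chance agreement
    kappa = (accuracy - p_e) / (1 - p_e)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "kappa": kappa}


before = summarize(tp=12, fp=18, fn=28, tn=42)
after = summarize(tp=12, fp=18, fn=28, tn=42 + 100)   # inject 100 more TNs

print({k: round(v, 2) for k, v in before.items()})
print({k: round(v, 2) for k, v in after.items()})
```

Precision and recall do not involve TN and stay fixed, while accuracy and kappa both rise although the system's behaviour on errors is unchanged.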


 Given: Gold standard edits (G), System Edits (E)
▪ $P = \frac{|G \cap E|}{|E|}$, $R = \frac{|G \cap E|}{|G|}$, $F_1 = \frac{2PR}{P + R}$
 Example
 Learner sentence S
▪ There is no a doubt, tracking system has brought many benefits in this information age
▪ G = {a doubt → doubt, system → systems, has → have}
 System correction H
▪ There is no doubt, tracking system has brought many benefits in this information age .
▪ E = {a doubt → doubt}
 Performance
▪ P = 1/1, R = 1/3, F = 1/2
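The worked example can be verified with exact fractions. This is a minimal sketch in which each edit is represented as a hypothetical (original, correction) pair; treating an empty edit set as a score of 1 is an assumed convention, not something stated on the slides.

```python
from fractions import Fraction


def edit_prf(gold_edits, system_edits):
    """Precision, recall and F1 over sets of edits, as exact fractions."""
    matched = len(gold_edits & system_edits)
    # Assumed convention: an empty denominator counts as a perfect score.
    p = Fraction(matched, len(system_edits)) if system_edits else Fraction(1)
    r = Fraction(matched, len(gold_edits)) if gold_edits else Fraction(1)
    f = 2 * p * r / (p + r) if p + r else Fraction(0)
    return p, r, f


G = {("a doubt", "doubt"), ("system", "systems"), ("has", "have")}
E = {("a doubt", "doubt")}

p, r, f = edit_prf(G, E)
print(p, r, f)
```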
 Extraction of system edits from the writer's text (source) and the system output (hypothesis)
 Done with the GNU wdiff utility
Source: Our baseline system feeds word into PB-SMT pipeline
Hypothesis: Our baseline system feeds a word into PB-SMT pipeline

System edit: (ε → a), i.e. inserting the article a

Gold standard edit: (word → {a word, words})

The hypothesis matches the first gold standard correction but is flagged as invalid
 Key idea
 There may be multiple ways to arrive at the same
correction
 Extraction of the set of edits that matches the
gold standard maximally
 Notations
 S: set of writer (source) sentences
 H: set of hypotheses or system outputs
 G: set of gold standard annotations
▪ Each annotation is a set of edits
 Notations
 An edit is a triple <a,b,C>
▪ Start and end token offsets a and b with respect to a
source sentence.
▪ A correction C.
▪ For gold standard edit C is set of corrections
▪ For system edit C is a single correction
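The edit triple and the matching rule can be sketched as two small data classes. The class names, the 0-based token offsets with an exclusive end, and the rule that a system edit matches a gold edit when the offsets agree and the correction is among the gold alternatives are assumptions made for this illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SystemEdit:
    start: int          # start token offset in the source sentence
    end: int            # end token offset (exclusive)
    correction: str     # a single proposed correction


@dataclass(frozen=True)
class GoldEdit:
    start: int
    end: int
    corrections: frozenset   # set of acceptable corrections


def matches(system_edit, gold_edit):
    """A system edit matches when the span agrees and its correction is acceptable."""
    return (
        system_edit.start == gold_edit.start
        and system_edit.end == gold_edit.end
        and system_edit.correction in gold_edit.corrections
    )


# Source tokens: Our(0) baseline(1) system(2) feeds(3) word(4) into(5) PB-SMT(6) pipeline(7)
gold = GoldEdit(4, 5, frozenset({"a word", "words"}))
insertion = SystemEdit(4, 4, "a")        # word-level diff: insert "a" before "word"
phrase = SystemEdit(4, 5, "a word")      # phrase-level edit over the token "word"

print(matches(insertion, gold), matches(phrase, gold))
```

Both system edits produce the same output sentence, but only the phrase-level one matches the gold edit, which is why the choice of extracted edits matters.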
 Evaluation of system output
 Extracting a set of system edits for each source-hypothesis pair
▪ Construction of edit lattice
▪ Searching through the lattice for extracting optimal set
of edits
 Evaluating system edits with respect to gold
standard
 Edit metric: Levenshtein distance
 Minimum number of insertions, deletions and
substitutions needed to transform one string to
another
 How to compute Levenshtein distance?
▪ Use a 2-D matrix (Levenshtein matrix) to store edit costs
of substrings of string pairs
▪ Compute individual cell entries (edit costs) with dynamic
programming
▪ Rightmost corner cell stores optimal edit cost
 Slides from Jurafsky course page
 Spell correction
 The user typed "graffe"
 Which is closest?
▪ graf
▪ graft
▪ grail
▪ giraffe

• Computational Biology
• Align two sequences of nucleotides
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

• Also for Machine Translation, Information Extraction, Speech Recognition


 The minimum edit distance between two
strings
 Is the minimum number of editing operations
 Insertion
 Deletion
 Substitution
 Needed to transform one into the other
 Two strings and their alignment:
I N T E * N T I O N
* E X E C U T I O N
 If each operation has a cost of 1
 Distance between these is 5
 If substitutions cost 2 (Levenshtein)
 Distance between them is 8
 Searching for a path (sequence of edits) from
the start string to the final string:
 Initial state: the word we’re transforming
 Operators: insert, delete, substitute
 Goal state: the word we’re trying to get to
 Path cost: what we want to minimize: the
number of edits
 But the space of all edit sequences is huge!
 We can’t afford to navigate naïvely
 Lots of distinct paths wind up at the same state.
▪ We don’t have to keep track of all of them
▪ Just the shortest path to each of those revisited states.
 For two strings
 X of length n
 Y of length m
 We define D(i,j)
 the edit distance between X[1..i] and Y[1..j]
▪ i.e., the first i characters of X and the first j characters of Y
 The edit distance between X and Y is thus D(n,m)
 Dynamic programming:
 Solving problems by combining solutions to
subproblems.
 A tabular computation of D(n,m)

 Bottom-up
 We compute D(i,j) for small i,j
 And compute larger D(i,j) based on previously
computed smaller values
 i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
[Figure: empty edit distance grid with INTENTION on the rows and EXECUTION on the columns, showing the three moves into a cell: deletion, insertion and substitution]
 Initialization
D(i,0) = i
D(0,j) = j

 Recurrence Relation
For each i = 1 … n
For each j = 1 … m

$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)} \end{cases}$$

 Termination
D(n,m) is distance
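The initialization, recurrence and termination steps translate almost line for line into a table-filling routine. This is a minimal sketch; the function name `min_edit_distance` and the `sub_cost` parameter are illustrative choices, with `sub_cost=2` corresponding to the substitution cost used in the recurrence.

```python
def min_edit_distance(x, y, sub_cost=2):
    """Fill the edit distance table D bottom-up and return D(n, m)."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: turning a prefix into the empty string and vice versa.
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j

    # Recurrence: each cell depends on its three already-computed neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]          # strings are 0-indexed in Python
            d[i][j] = min(
                d[i - 1][j] + 1,                             # deletion
                d[i][j - 1] + 1,                             # insertion
                d[i - 1][j - 1] + (0 if same else sub_cost),  # substitution or match
            )

    # Termination: the full-string distance sits in the last cell.
    return d[n][m]


print(min_edit_distance("intention", "execution"))              # substitution cost 2
print(min_edit_distance("intention", "execution", sub_cost=1))  # unit costs
```

The table has (n+1) × (m+1) cells and each is filled in constant time, so the computation takes O(nm) time and space.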
N 9
O 8
I 7

T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
D(1,1) = min { D(0,1) + 1, D(1,0) + 1, D(0,0) + 2 if X(1) ≠ Y(1) or D(0,0) + 0 if X(1) = Y(1) } = 2

N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1 2
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table

N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
 Edit distance isn’t sufficient
 We often need to align each character of the two
strings to each other
 We do this by keeping a “backtrace”
 Every time we enter a cell, remember where
we came from
 When we reach the end,
 Trace back the path from the upper right corner to
read off the alignment
The Edit Distance Table

N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
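 Base conditions: D(i,0) = i, D(0,j) = j
 Termination: D(n,m) is distance

 Recurrence Relation:
For each i = 1 … n
For each j = 1 … m

$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 & \text{(deletion)} \\ D(i,j-1) + 1 & \text{(insertion)} \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} & \text{(substitution)} \end{cases}$$

$$ptr(i,j) = \begin{cases} \text{LEFT} & \text{(insertion)} \\ \text{DOWN} & \text{(deletion)} \\ \text{DIAG} & \text{(substitution)} \end{cases}$$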
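Storing a pointer alongside each cell is enough to recover an alignment. The sketch below is one possible implementation, not the slides' own code: the function name `align`, the use of "*" as a gap marker, and the tie-breaking order (diagonal first) are assumptions, and other tie-breaking choices yield different but equally cheap alignments.

```python
def align(x, y, sub_cost=2):
    """Return (distance, alignment), where alignment is a list of character pairs."""
    n, m = len(x), len(y)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):
        d[i][0], ptr[i][0] = i, "DOWN"      # only deletions reach column 0
    for j in range(1, m + 1):
        d[0][j], ptr[0][j] = j, "LEFT"      # only insertions reach row 0

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = x[i - 1] == y[j - 1]
            candidates = [
                (d[i - 1][j - 1] + (0 if same else sub_cost), "DIAG"),
                (d[i - 1][j] + 1, "DOWN"),
                (d[i][j - 1] + 1, "LEFT"),
            ]
            # Remember both the best cost and the move that produced it.
            d[i][j], ptr[i][j] = min(candidates, key=lambda c: c[0])

    # Trace back from D(n, m) to D(0, 0), collecting aligned pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            pairs.append((x[i - 1], y[j - 1]))
            i, j = i - 1, j - 1
        elif move == "DOWN":
            pairs.append((x[i - 1], "*"))   # deletion: character of x aligned to a gap
            i -= 1
        else:
            pairs.append(("*", y[j - 1]))   # insertion: gap aligned to character of y
            j -= 1
    pairs.reverse()
    return d[n][m], pairs


distance, alignment = align("intention", "execution")
print(distance)
print(" ".join(a for a, _ in alignment))
print(" ".join(b for _, b in alignment))
```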
Source: Our baseline system feeds word into PB-SMT pipeline
Hypothesis: Our baseline system feeds a word into PB-SMT pipeline

System edit: (ε → a), i.e. inserting the article a

Gold standard edit: (word → {a word, words})
0 1 2 3 4 5 6 7 8 9 10
# our baseline system feeds a word into PB-SMT pipeline .
0 #
1 Our
2 baseline
3 system
4 feeds
5 word
6 into
7 PB-SMT
8 pipeline
9 .
0 1 2 3 4 5 6 7 8 9 10
# our baseline system feeds a word into PB-SMT pipeline .
0 # 0 1 2 3 4 5 6 7 8 9 10
1 Our 1 0 1 2 3 4 5 6 7 8 9
2 baseline 2 1 0 1 2 3 4 5 6 7 8
3 system 3 2 1 0 1 2 3 4 5 6 7
4 feeds 4 3 2 1 0 1 2 3 4 5 6
5 word 5 4 3 2 1 1 1 2 3 4 5
6 into 6 5 4 3 2 2 2 1 2 3 4
7 PB-SMT 7 6 5 4 3 3 3 2 1 2 3
8 pipeline 8 7 6 5 4 4 4 3 2 1 2
9 . 9 8 7 6 5 5 5 4 3 2 1
 A lattice of all the shortest paths from the top-left corner to the bottom-right corner
 Each vertex corresponds to a cell in the Levenshtein matrix
 Each edge corresponds to an atomic edit operation
 Insert, delete, substitute, match
 Each path corresponds to a shortest sequence of edits that transforms the source into the hypothesis
Edit lattice for “Our baseline system feeds ( ) word into PB-SMT pipeline .”

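[Figure: lattice whose vertices are Levenshtein matrix cells (0,0), (1,1), (2,2), (3,3), (4,4), (4,5), (5,6), (6,7), (7,8), (8,9), (9,10); each edge carries a token and a weight, e.g. Our(1), baseline(1), PB-SMT(1), pipeline(1), .(1)]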
 Annotators can use longer phrases and can use unchanged words from context
 word → {a word, words}
 Should we allow an arbitrary number of unchanged words in an edit?
▪ Avoid very large edits with many unchanged words
▪ Put a limit on the number of unchanged words in an edit
 Allow phrase-level edits
 Add transitive edges, subject to the limit and provided the edit changes at least one word
 Let two edges be adjacent, i.e. the first ends at the vertex where the second starts
▪ Transitive edge: a single edge from the start of the first to the end of the second
Edit lattice for “Our baseline system feeds ( ) word into PB-SMT pipeline .”

[Figure: the lattice over Levenshtein matrix cells (0,0) to (9,10), now with transitive edges added]
Edit lattice for “Our baseline system feeds ( ) word into PB-SMT pipeline .”

Change the weight of each edge that matches a gold standard edit to a negative value

[Figure: the lattice over Levenshtein matrix cells (0,0) to (9,10), with the weights of matching edges changed]
 Perform a single-source shortest path with
negative weights from start to end vertex
 Bellman-Ford algorithm
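A single-source shortest path search that tolerates negative edge weights can be sketched with the textbook Bellman-Ford procedure. The toy graph below is hypothetical: it stands in for a tiny edit lattice in which one transitive edge has been given a negative weight because it matches a gold standard edit; the vertex numbering and weights are made up for illustration.

```python
def bellman_ford(num_vertices, edges, source):
    """Shortest distances from source; edges are (u, v, weight) triples."""
    inf = float("inf")
    dist = [inf] * num_vertices
    pred = [None] * num_vertices
    dist[source] = 0

    # Relax every edge up to |V| - 1 times; stop early once nothing changes.
    for _ in range(num_vertices - 1):
        changed = False
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                pred[v] = u
                changed = True
        if not changed:
            break

    # One more pass: any further improvement means a negative cycle.
    for u, v, w in edges:
        if dist[u] + w < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, pred


def path_to(pred, target):
    """Follow predecessor links back from target to the source."""
    path = []
    while target is not None:
        path.append(target)
        target = pred[target]
    return path[::-1]


# Hypothetical lattice: 0 -> 1 -> 2 -> 3 via atomic edges, plus a transitive
# edge 1 -> 3 whose weight was made negative because it matches a gold edit.
edges = [(0, 1, 0), (1, 2, 1), (2, 3, 0), (1, 3, -3)]
dist, pred = bellman_ford(4, edges, source=0)
print(dist[3], path_to(pred, 3))
```

Because an edit lattice only has edges that move forward through the matrix, it contains no cycles, so the negative-cycle check never fires there; it is kept because the general algorithm requires it.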
 Theorem
 The set of edits corresponding to the shortest path
has the maximum overlap with the gold standard
annotation.
 Proof
 Let be the edit sequence in shortest path and
be the number of matching edits
 be another edit sequence with higher path cost but


 Bound on right hand side

 Bound on left hand side

▪ LHS is
 Contradiction
