Helping Our Own: The HOO 2011 Pilot Shared Task, Dale and Kilgarriff
The CoNLL-2013 Shared Task on Grammatical Error Correction, Ng et al.
Better Evaluation for Grammatical Error Correction, Dahlmeier and Ng
Resource: Problems in Evaluating Grammatical Error Detection Systems, Chodorow et al.
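Evaluation measures (system output compared with the gold standard)
Accuracy (A) = (TP + TN) / (TP + TN + FP + FN)
Precision (P) = TP / (TP + FP)
Recall (R) = TP / (TP + FN)
F-measure (F1) = 2PR / (P + R)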
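A minimal sketch of how the four measures can be computed from confusion counts. The function name `metrics` and the convention of returning 0.0 for an undefined ratio are assumptions made for illustration, not part of the cited papers.

```python
def metrics(tp, tn, fp, fn):
    """Return (accuracy, precision, recall, F1) from confusion counts."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    # Assumption: undefined ratios (zero denominator) are reported as 0.0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1


# Hypothetical counts, chosen only to exercise the function.
print(metrics(tp=5, tn=85, fp=5, fn=5))
```

Note that accuracy is the only one of the four that uses TN, which is why it reacts so strongly to a large negative class.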
Comma error detection task
System seeks to find and correct errors in the
writer’s usage of commas
Intricacies:
▪ Positive class: a comma error by the writer (not the mere
presence of a comma), i.e. a mismatch between the writer’s
sentence and the annotator’s judgement
▪ Negative class: writer and annotator agree
▪ System’s judgement has not been considered yet
▪ Writer-Annotator-System (WAS)
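Contingency scheme for WAS
Each case is a triple (W, A, S) of writer, annotator and system decisions, with C = comma and NC = no comma, e.g. (C, C, C) or (NC, C, C)

W-S (P: system changes the writer's text, N: system leaves it unchanged)
                   W = NC                W = C
S = C,  A = NC     (NC, NC, C)    P      (C, NC, C)    N
S = C,  A = C      (NC, C, C)     P      (C, C, C)     N
S = NC, A = NC     (NC, NC, NC)   N      (C, NC, NC)   P
S = NC, A = C      (NC, C, NC)    N      (C, C, NC)    P

A-S (system judged against the annotator)
                   W = NC                W = C
S = C,  A = NC     (NC, NC, C)    FP     (C, NC, C)    FN
S = C,  A = C      (NC, C, C)     TP     (C, C, C)     TN
S = NC, A = NC     (NC, NC, NC)   TN     (C, NC, NC)   TP
S = NC, A = C      (NC, C, NC)    FN     (C, C, NC)    FP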
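The A-S labelling of WAS triples can be sketched as a small function. This is an illustrative reading of the scheme for the binary comma / no-comma case only; the name `classify_was` and the string labels are hypothetical.

```python
def classify_was(writer, annotator, system):
    """Label a (W, A, S) triple for a binary decision such as C / NC."""
    is_error = writer != annotator   # positive class: writer and annotator disagree
    flagged = system != writer       # the system changes the writer's text
    if is_error and flagged:
        return "TP"   # binary case: changing a wrong choice yields the right one
    if is_error:
        return "FN"   # the error is left untouched
    if flagged:
        return "FP"   # a correct choice is changed
    return "TN"


for triple in [("NC", "C", "C"), ("C", "NC", "C"), ("NC", "NC", "C"), ("C", "C", "C")]:
    print(triple, classify_was(*triple))
```

With more than two possible values the first branch no longer holds automatically, because the system can change the text to something that differs from both writer and annotator.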
Simplified contingency scheme
The case of the * row in the simplified WAS contingency
table
It falls into different categories (TP, TN, FP, FN) depending on
whether the evaluation is for detection or correction
▪ For detection: TP
▪ For correction: X Y Z is both an FP (system ≠ writer) and an FN (system ≠
annotator)
The distribution of positive and negative
classes is highly skewed
13% errors in preposition usage by L2 writers
(Han et al., 2006)
Baseline system always predicts “no errors”
▪ 87% accuracy
All the measures are affected by the proportion of
errors in gold standard
▪ Prevalence
All the measures are affected by the proportion of
cases that system reports as error
▪ Bias
Effect on a system that performs no better than
chance
P increases when prevalence increases
R increases when bias increases
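A small sketch of the chance-level case, assuming the system's error flags are statistically independent of the gold standard. Under that assumption the expected cell proportions are simple products, so the dependence of P and R on prevalence and bias can be read off directly. The function name `chance_level_scores` is hypothetical.

```python
def chance_level_scores(prevalence, bias):
    """Expected accuracy, precision and recall for a system at chance.

    prevalence: proportion of cases that are errors in the gold standard
    bias:       proportion of cases that the system reports as errors
    Assumption: system output is independent of the gold standard.
    """
    tp = prevalence * bias
    fp = (1 - prevalence) * bias
    fn = prevalence * (1 - bias)
    tn = (1 - prevalence) * (1 - bias)
    accuracy = tp + tn
    precision = tp / (tp + fp) if bias else 0.0       # reduces to prevalence
    recall = tp / (tp + fn) if prevalence else 0.0    # reduces to bias
    return accuracy, precision, recall


for bias in (0.1, 0.5, 0.9):
    print(bias, chance_level_scores(prevalence=0.13, bias=bias))
```

Algebraically, precision reduces to the prevalence and recall to the bias, so neither measure says anything about skill for such a system.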
Expected proportion of TP matches
• The expected match between Annotator and System is the product of their probabilities for
the respective categories (in this case Error / No-Error)
• System 1: predictions are correct at chance level
• Accuracy = 0.80
• Cohen’s kappa: κ = (P_o − P_e) / (1 − P_e), where P_o is the observed agreement and P_e the agreement expected by chance
• True Positive Rate = TP / (TP + FN)
Effect of random prediction is not nullified
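A minimal sketch of Cohen's kappa for the two-category Error / No-Error case, using the standard definition in which chance agreement is the sum over categories of the product of the annotator's and the system's category probabilities. The function name `cohens_kappa` and the 100-case example are assumptions for illustration.

```python
def cohens_kappa(tp, tn, fp, fn):
    """Cohen's kappa between gold standard and system for two categories."""
    n = tp + tn + fp + fn
    observed = (tp + tn) / n
    gold_error = (tp + fn) / n      # annotator's probability of "Error"
    system_error = (tp + fp) / n    # system's probability of "Error"
    expected = (gold_error * system_error
                + (1 - gold_error) * (1 - system_error))
    if expected == 1:
        return 0.0                  # assumption: define kappa as 0 when undefined
    return (observed - expected) / (1 - expected)


# Hypothetical data: 13 errors in 100 cases, system always says "no error".
print(cohens_kappa(tp=0, tn=87, fp=0, fn=13))
```

For the always-"no error" baseline, accuracy is 0.87 while kappa is zero up to floating-point rounding, because all of the observed agreement is already expected by chance.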
Hypothesis matches with first gold standard edit but flagged as invalid
Key idea
There may be multiple ways to arrive at the same
correction
Extraction of the set of edits that matches the
gold standard maximally
Notations
S = {s_1, …, s_n}: set of writer sentences
H = {h_1, …, h_n}: set of hypotheses or system
outputs
G = {g_1, …, g_n}: set of gold standard
annotations
▪ g_i: set of edits
Notations
An edit is a triple <a,b,C>
▪ Start and end token offsets a and b with respect to a
source sentence.
▪ A correction C.
▪ For a gold standard edit, C is a set of corrections
▪ For a system edit, C is a single correction
Evaluation of system output
Extracting a set of system edits (e_i) for each
source-hypothesis pair (s_i, h_i)
▪ Construction of edit lattice
▪ Searching through the lattice for extracting optimal set
of edits
Evaluating system edits with respect to gold
standard
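A sketch of the edit triple and of scoring system edits against gold edits for one sentence. The match rule used here (same offsets, and the system correction is one of the gold corrections) and the class and function names are assumptions made for illustration; the value returned for an empty edit set is likewise an arbitrary choice.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edit:
    start: int               # start token offset a in the source sentence
    end: int                 # end token offset b in the source sentence
    corrections: frozenset   # one correction (system) or several (gold)


def matches(system_edit, gold_edit):
    """Assumed rule: equal offsets and the correction is among the gold ones."""
    (correction,) = system_edit.corrections   # a system edit has exactly one
    return (system_edit.start == gold_edit.start
            and system_edit.end == gold_edit.end
            and correction in gold_edit.corrections)


def edit_prf(system_edits, gold_edits):
    """Precision, recall and F1 over edits for a single sentence."""
    matched = sum(any(matches(e, g) for g in gold_edits) for e in system_edits)
    # Assumption: an empty set of edits gives a ratio of 1.0.
    p = matched / len(system_edits) if system_edits else 1.0
    r = matched / len(gold_edits) if gold_edits else 1.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1


gold = [Edit(4, 5, frozenset({"a word", "words"}))]
print(edit_prf([Edit(4, 5, frozenset({"a word"}))], gold))   # phrase-level edit
print(edit_prf([Edit(4, 4, frozenset({"a"}))], gold))        # bare insertion
```

Both system edits produce the same corrected sentence, but only the phrase-level one lines up with the gold edit, which is why the choice of extracted edits matters.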
Edit metric: Levenshtein distance
Minimum number of insertions, deletions and
substitutions needed to transform one string to
another
How to compute Levenshtein distance?
▪ Use a 2-D matrix (Levenshtein matrix) to store edit costs
of prefix pairs of the two strings
▪ Compute individual cell entries (edit costs) with dynamic
programming
▪ The final corner cell D(n,m) stores the optimal edit cost
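A minimal sketch of the matrix computation with unit cost for insertion, deletion and substitution. The function name `levenshtein` is an illustrative choice; the example strings are the ones used later in the slides.

```python
def levenshtein(source, target):
    """Unit-cost Levenshtein distance via a (n+1) x (m+1) matrix."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i                      # delete the first i source symbols
    for j in range(1, m + 1):
        d[0][j] = j                      # insert the first j target symbols
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitution = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,                  # deletion
                          d[i][j - 1] + 1,                  # insertion
                          d[i - 1][j - 1] + substitution)   # substitution / match
    return d[n][m]


print(levenshtein("intention", "execution"))
```

The same function works on lists of tokens, which is the form needed when comparing a source sentence with a hypothesis.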
Slides from Jurafsky course page
Spell correction
The user typed “graffe”
Which is closest?
▪ graf
▪ graft
▪ grail
▪ giraffe
Computational Biology
• Align two sequences of nucleotides
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Bottom-up
We compute D(i,j) for small i,j
And compute larger D(i,j) based on previously
computed smaller values
i.e., compute D(i,j) for all i (1 ≤ i ≤ n) and j (1 ≤ j ≤ m)
[Figure: edit-distance grid with INTENTION on the vertical axis and EXECUTION on the horizontal axis; the moves through the grid are labelled DELETION, INSERT and SUBSTITUTION]
Initialization
D(i,0) = i
D(0,j) = j
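Recurrence Relation
For each i = 1 … n
For each j = 1 … m
D(i,j) = min of
D(i-1,j) + 1 (deletion)
D(i,j-1) + 1 (insertion)
D(i-1,j-1) + 2 if X(i) ≠ Y(j), or + 0 if X(i) = Y(j) (substitution)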
The Edit Distance Table
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
D(1,1) = min of
D(0,1) + 1
D(1,0) + 1
D(0,0) + 2 if X(1) ≠ Y(1), or + 0 if X(1) = Y(1)
N 9
O 8
I 7
T 6
N 5
E 4
T 3
N 2
I 1 2
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
The Edit Distance Table
N 9 8 9 10 11 12 11 10 9 8
O 8 7 8 9 10 11 10 9 8 9
I 7 6 7 8 9 10 9 8 9 10
T 6 5 6 7 8 9 8 9 10 11
N 5 4 5 6 7 8 9 10 11 10
E 4 3 4 5 6 7 8 9 10 9
T 3 4 5 6 7 8 7 8 9 8
N 2 3 4 5 6 7 8 7 8 7
I 1 2 3 4 5 6 7 6 7 8
# 0 1 2 3 4 5 6 7 8 9
# E X E C U T I O N
Edit distance isn’t sufficient
We often need to align each character of the two
strings to each other
We do this by keeping a “backtrace”
Every time we enter a cell, remember where we came from
When we reach the end, trace back the path from the upper right corner to
read off the alignment
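Base conditions: D(i,0) = i, D(0,j) = j
Termination: D(n,m) is distance
Recurrence Relation:
For each i = 1 … n
For each j = 1 … m
D(i,j) = min of
D(i-1,j) + 1 (deletion)
D(i,j-1) + 1 (insertion)
D(i-1,j-1) + 2 if X(i) ≠ Y(j), or + 0 if X(i) = Y(j) (substitution)
ptr(i,j) = one of
LEFT insertion
DOWN deletion
DIAG substitution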
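A sketch of the table fill with back-pointers, followed by the trace-back that reads off an alignment. It assumes a substitution cost of 2, as in the recurrence; the name `align`, the `-` gap symbol and the tie-breaking order (DIAG, then DOWN, then LEFT) are illustrative choices, so a different but equally cheap alignment is possible.

```python
def align(source, target, sub_cost=2):
    """Return (distance, alignment) using back-pointers LEFT / DOWN / DIAG."""
    n, m = len(source), len(target)
    d = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0], ptr[i][0] = i, "DOWN"       # only deletions reach column 0
    for j in range(1, m + 1):
        d[0][j], ptr[0][j] = j, "LEFT"       # only insertions reach row 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = source[i - 1] == target[j - 1]
            candidates = [
                (d[i - 1][j - 1] + (0 if same else sub_cost), "DIAG"),
                (d[i - 1][j] + 1, "DOWN"),
                (d[i][j - 1] + 1, "LEFT"),
            ]
            # min keeps the first of several equally cheap candidates
            d[i][j], ptr[i][j] = min(candidates, key=lambda c: c[0])
    pairs, i, j = [], n, m
    while i > 0 or j > 0:                    # walk back from D(n,m) to D(0,0)
        move = ptr[i][j]
        if move == "DIAG":
            pairs.append((source[i - 1], target[j - 1]))
            i, j = i - 1, j - 1
        elif move == "DOWN":
            pairs.append((source[i - 1], "-"))
            i -= 1
        else:
            pairs.append(("-", target[j - 1]))
            j -= 1
    pairs.reverse()
    return d[n][m], pairs


distance, pairs = align("intention", "execution")
print(distance)
print("".join(a for a, _ in pairs))
print("".join(b for _, b in pairs))
```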
Source: Our baseline system feeds word into PB-SMT pipeline
Hypothesis: Our baseline system feeds a word into PB-SMT pipeline
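[Figure: edit lattice for the source-hypothesis pair; vertex labels include 0,0 1,1 2,2 3,3 4,4 4,5 5,6 8,9 9,10 and edge labels include Our(1), baseline(1), pipeline(1), .(1)]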
Annotators can use longer phrases and can
use unchanged words from context
word → {a word, words}
Should we allow arbitrary number of unchanged
words in an edit?
▪ Avoid very large edits with many unchanged words
▪ Put a limit (u) on the number of unchanged words in an edit
Allow phrase level edits
Add transitive edges, as long as the limit on unchanged words is respected and the edit changes
at least one word
Let (v1, v2) and (v2, v3) be two
adjacent edges
▪ Transitive edge: (v1, v3)
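A simplified sketch of the transitive-edge rule. Instead of a full lattice it works on a single alignment path, given as a list of (source tokens, hypothesis tokens) steps, and merges consecutive steps into phrase-level edits as long as at most `u` unchanged words are included and at least one word changes. The function name `candidate_edits`, the path representation and the value u = 2 are assumptions made for illustration.

```python
def candidate_edits(path, u=2):
    """Return a set of (start, end, correction) edits from merged steps."""
    offsets = [0]
    for src, _ in path:
        offsets.append(offsets[-1] + len(src))   # source offset before each step
    edits = set()
    for first in range(len(path)):
        unchanged, changed, hyp_tokens = 0, False, []
        for last in range(first, len(path)):
            src, hyp = path[last]
            hyp_tokens.extend(hyp)
            if src == hyp:
                unchanged += 1                   # one more unchanged word
            else:
                changed = True
            if unchanged > u:
                break                            # too many unchanged words
            if changed:
                edits.add((offsets[first], offsets[last + 1],
                           " ".join(hyp_tokens)))
    return edits


source = "Our baseline system feeds word into PB-SMT pipeline .".split()
path = [([w], [w]) for w in source]
path.insert(4, ([], ["a"]))                      # the hypothesis inserts "a"
for edit in sorted(candidate_edits(path)):
    print(edit)
```

The bare insertion (4, 4, "a") and the phrase-level edit (4, 5, "a word") both appear among the candidates, so a later search step can pick whichever one agrees with the gold annotation.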
Edit lattice for “Our baseline system feeds (ε → a) word into PB-SMT pipeline .”
[Figure: vertex labels include 0,0 1,1 2,2 3,3 4,4 4,5 5,6 8,9 9,10 and edge labels include Our(1), baseline(1), pipeline(1), .(1)]
Run a single-source shortest-path search, with
negative weights allowed, from the start to the end vertex
Bellman-Ford algorithm
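A minimal Bellman-Ford sketch on a toy lattice. The graph, the edge labels and the weight of -10 are invented for illustration, on the assumption that an edge matching a gold edit carries a negative weight while other edges carry positive costs; the function names are hypothetical.

```python
def bellman_ford(num_vertices, edges, source):
    """edges: list of (u, v, weight, label). Returns (dist, pred)."""
    inf = float("inf")
    dist = [inf] * num_vertices
    pred = [None] * num_vertices
    dist[source] = 0
    for _ in range(num_vertices - 1):        # at most |V| - 1 relaxation rounds
        updated = False
        for u, v, weight, label in edges:
            if dist[u] + weight < dist[v]:
                dist[v] = dist[u] + weight
                pred[v] = (u, label)
                updated = True
        if not updated:
            break
    for u, v, weight, _ in edges:            # one more pass detects negative cycles
        if dist[u] + weight < dist[v]:
            raise ValueError("graph contains a negative-weight cycle")
    return dist, pred


def path_labels(pred, target):
    """Follow predecessor links back from target and return the edge labels."""
    labels, v = [], target
    while pred[v] is not None:
        v, label = pred[v]
        labels.append(label)
    labels.reverse()
    return labels


# Toy lattice: the edge 1 -> 3 stands for an edit that matches the gold standard.
edges = [
    (0, 1, 1, "feeds"),
    (1, 2, 1, "insert a"),
    (2, 3, 1, "word"),
    (1, 3, -10, "word -> a word"),
]
dist, pred = bellman_ford(4, edges, source=0)
print(dist[3], path_labels(pred, 3))
```

Dijkstra's algorithm is not applicable here because of the negative weights; an edit lattice is acyclic, so the negative-cycle check never fires in that setting.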
Theorem
The set of edits corresponding to the shortest path
has the maximum overlap with the gold standard
annotation.
Proof
Let e be the edit sequence in the shortest path and
p be the number of matching edits
Let e′ be another edit sequence with higher path cost but p′ > p matching edits
Bound on right hand side
▪
Bound on left hand side
▪
▪ LHS is
Contradiction
▪