CS6113 Semantic Computing: Tagging Data With XML
Semantic Computing
aqadir@cust.edu.pk
Tagging with XML
Information extraction from unstructured documents, then tagging the extracted information:
Find and understand limited relevant parts of texts
Gather information from many pieces of text
Produce a semi-structured representation in XML
Named Entity Recognition (NER)
A very important sub-task: find and classify
names in text:
For example:
names of persons,
names of organizations,
names of geographical locations (countries, cities),
dates,
products, etc.
NER Example
Salma lives in Rawalpindi and is studying Computer Science at Capital University of Science & Technology in 2019. She is a part-time worker at a call center in Islamabad.

<person> Salma </person> lives in <location> Rawalpindi </location> and is studying Computer Science at <organization> Capital University of Science & Technology </organization> in <date>2019</date>. She is a part-time worker at a call center in <location> Islamabad </location>.
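Such XML output can be produced mechanically once the entity spans and types are known. A minimal sketch, assuming the spans are already given as character offsets (the helper below is hypothetical, not part of any real NER system):

```python
from xml.sax.saxutils import escape

def tag_entities(text, entities):
    """Wrap (start, end, type) character spans in XML-style tags.
    Spans must be non-overlapping and sorted by start offset."""
    out, pos = [], 0
    for start, end, etype in entities:
        out.append(escape(text[pos:start]))          # untagged text before the span
        out.append(f"<{etype}>{escape(text[start:end])}</{etype}>")
        pos = end
    out.append(escape(text[pos:]))                   # trailing untagged text
    return "".join(out)

text = "Salma lives in Rawalpindi."
print(tag_entities(text, [(0, 5, "person"), (15, 25, "location")]))
# <person>Salma</person> lives in <location>Rawalpindi</location>.
```

`escape` keeps the output well-formed when the text contains characters such as "&" (as in "Science & Technology").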
NER
Evaluation of NER
Precision, Recall, and the F measure
2x2 Evaluation Table
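The 2x2 table compares the system's entity decisions with the gold standard: true positives (entities found and correct), false positives (spurious entities), and false negatives (missed entities). A minimal sketch of how precision, recall, and F1 follow from those counts (the counts below are invented for illustration):

```python
def precision(tp, fp):
    # fraction of predicted entities that are correct
    return tp / (tp + fp)

def recall(tp, fn):
    # fraction of gold entities that were found
    return tp / (tp + fn)

def f1(p, r):
    # harmonic mean of precision and recall
    return 2 * p * r / (p + r)

# e.g. 8 correct entities, 2 spurious, 4 missed (made-up counts):
p, r = precision(8, 2), recall(8, 4)
print(round(p, 2), round(r, 2), round(f1(p, r), 2))  # 0.8 0.67 0.73
```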
A combined measure: F
A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):

F = (β² + 1) · P · R / (β² · P + R)

With β = 1 (equal weight on P and R), this reduces to the balanced F1 = 2PR / (P + R).
A combined measure: F
P = 40% R = 40% F =?
P = 75% R = 25% F =?
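A quick way to check these exercises is to code the weighted harmonic mean directly (β = 1 gives the balanced F1):

```python
def f_measure(p, r, beta=1.0):
    # weighted harmonic mean of precision and recall
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

print(round(f_measure(0.40, 0.40), 3))  # 0.4
print(round(f_measure(0.75, 0.25), 3))  # 0.375
```

Note that the second case scores lower than the first even though its precision is much higher: the harmonic mean punishes the imbalance between P and R.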
Accuracy
OKE (Open Knowledge Extraction) Challenge
The Message Understanding Conference (MUC) was an annual event/competition where results were presented
Focused on extracting information from news
articles:
Terrorist events
Industrial joint ventures
Company management changes
NER
Typically, NER demands optimally combining a variety of clues, including:
orthographic features,
parts of speech,
similarity with existing database of entities,
presence of specific signature words and so on.
Methods for NER
Hand-written regular expressions
Finding (US) phone numbers
(?:\(?[0-9]{3}\)?[ -.])?[0-9]{3}[ -.]?[0-9]{4}
Develop rules
Using classifiers
Sequence models
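The phone-number pattern above can be tried out directly with Python's re module (the candidate strings below are invented):

```python
import re

# The slide's US phone-number pattern: area code optional,
# separators may be space, dot, or dash.
PHONE = re.compile(r"(?:\(?[0-9]{3}\)?[ -.])?[0-9]{3}[ -.]?[0-9]{4}")

for cand in ["(555) 123-4567", "555-123-4567", "123 4567", "12-3456"]:
    print(cand, "->", bool(PHONE.fullmatch(cand)))
# the first three match; "12-3456" does not
```

Note that `[ -.]` is a character range from space to ".", which happens to include "-" as well as a few other punctuation characters.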
CustNER

| Annotated Text | Illinois NER | DBpedia Spotlight | DBpedia Annotated Text | CustNER |
|---|---|---|---|---|
| White House National Trade Council | loc: White House | org: White House National Trade Council | thing: National Trade Council | org: White House National Trade Council |
| Mr Trump | per: Trump | per: Trump | surname: Mr Trump | per: Trump |
| Dublin City Council | loc: Dublin City | org: Dublin City Council | org: Dublin City Council | org: Dublin City Council |
| The Coming China Wars | misc: The Coming China Wars | loc: China | book: The Coming China Wars | loc: China |
| UK government | loc: UK | loc: UK | org: UK government | org: UK government |
| US President-elect | loc: US | title: President-elect | per: US President-elect | per: US President-elect |
Rule 2 - Addition of Entities Recognized by Stanford or Illinois NER
Rule 3 - Checking around Title Entity
Rule 4 - Expanding Nationality Entities
Rule 5 - Addition of Mentions Having Corresponding DBpedia Resources
Rule 6 - For Recognizing Acronyms
Rule 7 - For Adding Re-Occurrences of Added Entities
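One way to picture Rule 7 (this sketch is illustrative only, not the actual CustNER code): once an entity has been recognized somewhere in the document, later occurrences of the same surface form receive the same type.

```python
def add_reoccurrences(tokens, known_entities):
    """Tag tokens whose surface form matches an already-recognized
    entity; known_entities maps surface form -> entity type."""
    return [(tok, known_entities.get(tok)) for tok in tokens]  # None = untyped

known = {"Trump": "per", "Dublin": "loc"}
print(add_reoccurrences(["Trump", "visited", "Dublin"], known))
# [('Trump', 'per'), ('visited', None), ('Dublin', 'loc')]
```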
Example incorrect annotations in OKE dataset and the corrections made
| Named Entity | Previous annotation | Corrected annotation | Comments |
|---|---|---|---|
| Irish | 1 | 0 | "Irish" is not a person, organization or location. It is a nationality, and is therefore removed from the dataset. |
| Korean | 1 | org: Korean Air | "Korean" is a nationality, but the text actually has "Korean Air", which is an organization. |
| Yonhap news agency | 1 | org: Yonhap | "Yonhap" is the name of the organization, not "Yonhap news agency". |
| Ministry of Defence | 0 | 1: org | "Ministry of Defence" is an organization. |
| Russia | 0 | 1: loc | "Russia" is a location. |
| Paul Pogba's | 1 | per: Paul Pogba | "'s" is not part of the person name. |
| King Koopa | 1 | 0 | "King Koopa" is a turtle-like fictional character and not a person, location or organization. |
| legendary cryptanalyst Alan Turing | 1 | per: Alan Turing | "legendary cryptanalyst" is not part of the person name. |
| Santa | 0 | per: Santa | "Santa" or "Santa Claus" is a human fictional character. |
| U.S. | 0 | 1: loc | "U.S." is a location named entity. |
| Joker | 0 | 1: per | "Joker" is a fictional person character. |
| Persian army | 0 | 1: org | "Persian army" is the name of an organization. |
| Greenwich Village, Manhattan, New York City | 1 | loc: Greenwich Village, loc: Manhattan, loc: New York City | This entity has been broken down into three location entities: "Greenwich Village", "Manhattan" and "New York City". |
| FIFA | 0 | 1: org | "FIFA" is an acronym of an organization. |
Results comparison of NE recognition task on OKE evaluation dataset

| Annotator | Weak Annotation Match | | | Strong Annotation Match | | |
|---|---|---|---|---|---|---|
| | Precision | Recall | F1 | Precision | Recall | F1 |
| Stanford NER | 74.94 | 85.22 | 79.75 | 68.75 | 72.04 | 70.36 |
| Illinois NER | 94.66 | 84.17 | 89.11 | 86.14 | 77.45 | 81.56 |
| CustNER | 92.13 | 92.37 | 92.25 | 85.64 | 83.42 | 84.51 |
Results comparison of NE recognition and classification task on OKE evaluation dataset
Results comparison of strong annotation match for each type on OKE evaluation dataset

| Annotator | person | | | location | | | organization | | |
|---|---|---|---|---|---|---|---|---|---|
| | Precision | Recall | MicroF | Precision | Recall | MicroF | Precision | Recall | MicroF |
| Stanford NER | 62.56 | 72.11 | 66.99 | 64.71 | 64.23 | 64.47 | 68.85 | 60.00 | 64.12 |
| Illinois NER | 87.01 | 74.86 | 80.48 | 80.99 | 77.78 | 79.35 | 60.94 | 54.17 | 57.35 |
| CustNER | 85.88 | 83.52 | 84.68 | 80.49 | 75.57 | 77.95 | 66.67 | 68.49 | 67.57 |
Results comparison of NE recognition task on CoNLL03 evaluation dataset
Assignment: NER
1. Gather small paragraphs from the web with entities of your interest (at least ten)
2. Mark the entities in these paragraphs with relevant domain-specific tags
3. Use the publicly available NER systems to tag these paragraphs
4. Tabulate the results
5. Compute P, R, F1 for each paragraph and each NER system
6. Compute average P, R, F1 and then give your opinion in the discussion forum
7. Submit your report to the Assignment Folder before the next class
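For step 5, one way to compute entity-level P, R, and F1 is to treat every annotation as a (start, end, type) triple and count exact matches as true positives. A sketch with invented gold/predicted sets:

```python
def prf1(gold, pred):
    """Entity-level precision/recall/F1 from sets of
    (start, end, type) triples; only exact matches count."""
    tp = len(gold & pred)                       # correctly predicted entities
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

gold = {(0, 5, "person"), (15, 25, "location")}
pred = {(0, 5, "person"), (30, 39, "location")}
print(prf1(gold, pred))  # (0.5, 0.5, 0.5)
```

Averaging these per-paragraph scores gives the figures asked for in step 6.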