
CS6113

Semantic Computing

Tagging Data with XML

Dr. Mohammad Abdul Qadir


aqadir@cust.edu.pk
The Tree Model of XML Documents: An Example

<email>
  <head>
    <from name="Michael Maher" address="michaelmaher@cs.gu.edu.au"/>
    <to name="Grigoris Antoniou" address="grigoris@cs.unibremen.de"/>
    <subject>Where is your draft?</subject>
  </head>
  <body>
    Grigoris, where is the draft of the paper you promised me last week?
  </body>
</email>
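
As a minimal sketch of this tree model, the Python snippet below parses the email document with the standard xml.etree.ElementTree module and prints each element, its attributes and its text; the print_tree helper is just an illustrative name introduced here.

# Minimal sketch: parsing the <email> example with Python's standard library
# to show its tree structure (element nodes, attributes, text content).
import xml.etree.ElementTree as ET

EMAIL_XML = """
<email>
  <head>
    <from name="Michael Maher" address="michaelmaher@cs.gu.edu.au"/>
    <to name="Grigoris Antoniou" address="grigoris@cs.unibremen.de"/>
    <subject>Where is your draft?</subject>
  </head>
  <body>
    Grigoris, where is the draft of the paper you promised me last week?
  </body>
</email>
"""

def print_tree(element, depth=0):
    """Recursively print each element, its attributes and its text content."""
    indent = "  " * depth
    print(f"{indent}{element.tag} {element.attrib}")
    text = (element.text or "").strip()
    if text:
        print(f"{indent}  text: {text!r}")
    for child in element:
        print_tree(child, depth + 1)

root = ET.fromstring(EMAIL_XML)
print_tree(root)   # email -> head -> from/to/subject, email -> body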

Tagging with XML
 Information extraction from unstructured documents, and then tagging the relevant information
   Find and understand limited relevant parts of texts
   Gather information from many pieces of text
   Produce a semi-structured representation in XML

Named Entity Recognition (NER)
 A very important sub-task: find and classify names in text
 For example:
   names of persons,
   names of organizations,
   names of geographical locations (countries, cities),
   dates,
   products, ...

NER Example
 Salma lives in Rawalpindi and is studying Computer Science at Capital University of Science & Technology in 2019. She is a part-time worker at a call center in Islamabad.
 <person>Salma</person> lives in <location>Rawalpindi</location> and is studying Computer Science at <organization>Capital University of Science & Technology</organization> in <date>2019</date>. She is a part-time worker at a call center in <location>Islamabad</location>.
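
One simple way to see how such tags are produced is a toy, dictionary-based tagger: the mention list below is hand-crafted for this single sentence (a real NER system would supply the mentions), so it is only an illustration of wrapping recognized spans in XML.

# Toy illustration: wrapping known entity mentions in XML tags.
# The mention list is hand-crafted for this sentence; in practice it would
# come from an NER system rather than a fixed dictionary.
sentence = ("Salma lives in Rawalpindi and is studying Computer Science at "
            "Capital University of Science & Technology in 2019. She is a "
            "part-time worker at a call center in Islamabad.")

# (mention, tag) pairs, longest mentions first so they are matched before
# any shorter mention contained in them.
mentions = [
    ("Capital University of Science & Technology", "organization"),
    ("Rawalpindi", "location"),
    ("Islamabad", "location"),
    ("Salma", "person"),
    ("2019", "date"),
]

tagged = sentence
for mention, tag in mentions:
    tagged = tagged.replace(mention, f"<{tag}>{mention}</{tag}>")

print(tagged)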

Evaluation of NER
 Precision, Recall, and the F measure
 2x2 evaluation (contingency) table:

                  correct      not correct
   selected       tp           fp
   not selected   fn           tn

 Precision: % of selected items that are correct
   P = tp / (tp + fp)
 Recall: % of correct items that are selected
   R = tp / (tp + fn)
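As a small sketch of these definitions, the functions below compute precision and recall from the tp/fp/fn counts of the 2x2 table; the counts used in the example call are made-up numbers.

# Precision and recall from the 2x2 evaluation table counts.
def precision(tp, fp):
    # fraction of selected items that are correct
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    # fraction of correct items that are selected
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical counts, purely for illustration:
tp, fp, fn = 8, 2, 4
print(precision(tp, fp))  # 0.8   -> 80% of selected items are correct
print(recall(tp, fn))     # 0.667 -> 66.7% of correct items were selected
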
A combined measure: F
 A combined measure that assesses the P/R tradeoff is the F measure (weighted harmonic mean):
   F = 1 / (α/P + (1 − α)/R) = (β² + 1)PR / (β²P + R)
 The harmonic mean is a very conservative average
 People use the balanced F1, with β = 1 (that is, α = ½):
   F1 = 2PR / (P + R)
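
A corresponding sketch of the weighted harmonic mean; f_beta and f1 are illustrative helper names.

# F measure as the weighted harmonic mean of precision P and recall R.
def f_beta(p, r, beta=1.0):
    if p == 0 or r == 0:
        return 0.0
    b2 = beta ** 2
    return (b2 + 1) * p * r / (b2 * p + r)

def f1(p, r):
    # beta = 1 (alpha = 1/2): F = 2PR / (P + R)
    return f_beta(p, r, beta=1.0)

print(f1(0.8, 0.5))  # 0.615..., well below the arithmetic mean of 0.65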

A combined measure: F
 P = 40%   R = 40%   F = ?
 P = 75%   R = 25%   F = ?
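
A quick arithmetic check of the two cases above (the small helper simply restates F1 = 2PR/(P+R)):

# Worked check of the two cases above, using F1 = 2PR / (P + R).
def f1(p, r):
    return 2 * p * r / (p + r) if (p + r) else 0.0

print(f1(0.40, 0.40))  # 0.40  -> F1 = 40%
print(f1(0.75, 0.25))  # 0.375 -> F1 = 37.5%, dragged down by the low recall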

Accuracy
 Accuracy: % of all items (selected and not selected) that are classified correctly:
   Accuracy = (tp + tn) / (tp + tn + fp + fn)
 Accuracy is rarely informative for NER: non-entity tokens (tn) vastly outnumber entity tokens, so even a poor tagger scores high

Open Knowledge Extraction (OKE) Challenge

 Message Understanding Conference (MUC) was an
annual event/competition where results were
presented
 Focused on extracting information from news
articles:
 Terrorist events
 Industrial joint ventures
 Company management changes

NER
 Typically, NER demands optimally combining a variety of clues (a feature-extraction sketch follows this list), including:
   orthographic features,
   part-of-speech tags,
   similarity with an existing database of entities,
   presence of specific signature words, and so on.
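
A rough sketch of such clues at the token level; the gazetteer and signature-word lists are invented for illustration, and part-of-speech tags would come from a separate POS tagger (not computed here).

# Sketch of simple token-level clues for NER. The gazetteer and signature
# words below are made up for illustration; part-of-speech tags would come
# from a separate POS tagger and are not computed here.
GAZETTEER = {"Islamabad", "Rawalpindi", "FIFA"}          # known entity names
SIGNATURE_WORDS = {"University", "Ministry", "Council"}  # org-like cue words

def token_features(token, prev_token=""):
    return {
        "is_capitalized": token[:1].isupper(),   # orthographic clues
        "is_all_caps": token.isupper() and len(token) > 1,
        "has_digit": any(ch.isdigit() for ch in token),
        "in_gazetteer": token in GAZETTEER,      # similarity to known entities
        "is_signature_word": token in SIGNATURE_WORDS,
        "prev_is_signature": prev_token in SIGNATURE_WORDS,
    }

tokens = "Salma works at the Capital University in Islamabad".split()
for prev, tok in zip([""] + tokens, tokens):
    print(tok, token_features(tok, prev))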

Methods for NER
 Hand-written regular expressions
   e.g., finding (US) phone numbers (see the usage sketch after this list):
   (?:\(?[0-9]{3}\)?[ -.])?[0-9]{3}[ -.]?[0-9]{4}
 Develop rules
 Using classifiers
 Sequence models
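
The phone-number pattern above can be used directly with Python's re module; note that [ -.] is written as a character range from space to dot, so besides space, hyphen and dot it also admits a few punctuation characters in between.

# The slide's phone-number pattern, used with Python's re module.
import re

PHONE = re.compile(r"(?:\(?[0-9]{3}\)?[ -.])?[0-9]{3}[ -.]?[0-9]{4}")

text = "Call (555) 123-4567 or 555.987.6543; the extension 12345 is not a match."
print(PHONE.findall(text))   # ['(555) 123-4567', '555.987.6543']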

CustNER

Architecture (figure): the Input Text is annotated in parallel by Stanford NER, Illinois NER, and DBpedia Spotlight; the three annotated texts are passed through a Pre-Processor and then a Rule Engine (which consults DBpedia) to produce the final list of Named Entities.

Pre-Processor: the lists of entities returned by the annotators contain some apparent false positives such as "he", "his", "goes", "the", etc., which need to be removed.
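The exact pre-processing logic of CustNER is not spelled out on this slide; the sketch below only illustrates the idea, dropping candidate mentions that are stop words or contain no capitalized token. The stop-word list is invented for the example.

# Illustrative sketch only: drops annotator output that is a stop word
# or contains no capitalized word at all.
STOP_WORDS = {"he", "his", "she", "her", "goes", "the", "a", "an", "it"}

def preprocess(candidates):
    kept = []
    for mention in candidates:
        if mention.lower() in STOP_WORDS:
            continue                       # apparent false positive
        if not any(tok[:1].isupper() for tok in mention.split()):
            continue                       # no capitalized token at all
        kept.append(mention)
    return kept

print(preprocess(["he", "White House", "goes", "the", "Dublin City Council"]))
# ['White House', 'Dublin City Council']
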
Rule 1 – Deciding the Boundary and Type of an Entity e from Three Annotations

Text | Annotation by Stanford NER | Annotation by Illinois NER | Annotation by DBpedia Spotlight | Annotation selected by CustNER
White House National Trade Council | loc: White House | org: White House National Trade Council | thing: National Trade Council | org: White House National Trade Council
Mr Trump | per: Trump | per: Trump | surname: Mr Trump | per: Trump
Dublin City Council | loc: Dublin City | org: Dublin City Council | org: Dublin City Council | org: Dublin City Council
The Coming China Wars | misc: The Coming China Wars | loc: China | book: The Coming China Wars | loc: China
UK government | loc: UK | loc: UK | org: UK government | org: UK government
US President-elect | loc: US | title: President-elect | per: US President-elect | per: US President-elect
 Rule 2 – Addition of Entities Recognized by Stanford or Illinois NER
 Rule 3 – Checking around Title Entity
 Rule 4 – Expanding Nationality Entities
 Rule 5 – Addition of Mentions Having Corresponding DBpedia Resources
 Rule 6 – For Recognizing Acronyms (an illustrative sketch follows this list)
 Rule 7 – For Adding Re-Occurrences of Added Entities
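
Rule 6 is illustrated below with a rough sketch (not CustNER's actual implementation): an all-caps token is accepted as an acronym when it matches the initials of an entity that has already been recognized.

# Rough illustration only: treat an all-caps token as an acronym of an
# already recognized multi-word entity if it matches that entity's initials.
def initials(name):
    return "".join(word[0].upper() for word in name.split() if word[0].isalpha())

def match_acronym(token, recognized):
    """Return the type of the entity whose initials match the token, if any."""
    if not (token.isupper() and len(token) >= 2):
        return None
    for entity, ent_type in recognized.items():
        if initials(entity) == token:
            return ent_type
    return None

recognized = {"White House National Trade Council": "org", "Dublin City Council": "org"}
print(match_acronym("WHNTC", recognized))  # org
print(match_acronym("DCC", recognized))    # org
print(match_acronym("UK", recognized))     # None (no matching initials)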
Example incorrect annotations in the OKE dataset and the corrections made

Named Entity | Previous Annotation | Corrected Annotation | Comments
Irish | 1 | 0 | "Irish" is not a person, organization or location; it is a nationality, so it is removed from the dataset.
Korean | 1 | org: Korean Air | "Korean" is a nationality, but the text actually has "Korean Air", which is an organization.
Yonhap news agency | 1 | org: Yonhap | "Yonhap" is the name of the organization, not "Yonhap news agency".
Ministry of Defence | 0 | 1: org | "Ministry of Defence" is an organization.
Russia | 0 | 1: loc | "Russia" is a location.
Paul Pogba's | 1 | per: Paul Pogba | "'s" is not part of the person name.
King Koopa | 1 | 0 | "King Koopa" is a turtle-like fictional character and not a person, location or organization.
legendary cryptanalyst Alan Turing | 1 | per: Alan Turing | "legendary cryptanalyst" is not part of the person name.
Santa | 0 | per: Santa | "Santa" or "Santa Claus" is a human fictional character.
U.S. | 0 | 1: loc | "U.S." is a location named entity.
Joker | 0 | 1: per | "Joker" is a person fictional character.
Persian army | 0 | 1: org | "Persian army" is the name of an organization.
Greenwich Village, Manhattan, New York City | 1 | loc: Greenwich Village, loc: Manhattan, loc: New York City | This entity has been broken down into three location entities: "Greenwich Village", "Manhattan" and "New York City".
FIFA | 0 | 1: org | "FIFA" is an acronym of an organization.
Results comparison of NE recognition task on OKE evaluation dataset

Annotator | Weak Annotation Match (Precision / Recall / F1) | Strong Annotation Match (Precision / Recall / F1)
Stanford NER | 74.94 / 85.22 / 79.75 | 68.75 / 72.04 / 70.36
Illinois NER | 94.66 / 84.17 / 89.11 | 86.14 / 77.45 / 81.56
CustNER | 92.13 / 92.37 / 92.25 | 85.64 / 83.42 / 84.51

Results comparison of NE recognition and classification task on OKE evaluation dataset

Annotator | Weak Annotation Match (Precision / Recall / micro-F1) | Strong Annotation Match (Precision / Recall / micro-F1)
Stanford NER | 68.91 / 78.36 / 73.33 | 64.18 / 67.25 / 65.68
Illinois NER | 85.76 / 76.25 / 80.73 | 79.94 / 71.88 / 75.70
CustNER | 83.73 / 83.95 / 83.84 | 80.27 / 77.98 / 79.11

Results comparison of strong annotation match for each type on OKE evaluation dataset

Annotator | person (Precision / Recall / micro-F) | location (Precision / Recall / micro-F) | organization (Precision / Recall / micro-F)
Stanford NER | 62.56 / 72.11 / 66.99 | 64.71 / 64.23 / 64.47 | 68.85 / 60.00 / 64.12
Illinois NER | 87.01 / 74.86 / 80.48 | 80.99 / 77.78 / 79.35 | 60.94 / 54.17 / 57.35
CustNER | 85.88 / 83.52 / 84.68 | 80.49 / 75.57 / 77.95 | 66.67 / 68.49 / 67.57

Results comparison of NE recognition task on CoNLL03 evaluation dataset

Annotator | Weak Annotation Match (Precision / Recall / F1) | Strong Annotation Match (Precision / Recall / F1)
Stanford NER | 86.33 / 94.66 / 90.30 | 86.28 / 87.72 / 86.99
Illinois NER | 95.70 / 95.29 / 95.49 | 98.05 / 91.20 / 94.50
CustNER | 90.98 / 97.70 / 94.22 | 91.80 / 91.31 / 91.55

Assignment: NER
1. Gather small paragraphs from the web with entities of your interest (at least ten)
2. Mark the entities in these paragraphs with relevant domain-specific tags
3. Use publicly available NER systems to tag these paragraphs
4. Tabulate the results
5. Compute P, R, F1 for each paragraph and each NER system (a sketch follows this list)
6. Compute average P, R, F1 and then give your opinion in the discussion forum
7. Submit your report to the Assignment Folder before the next class
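
For steps 5 and 6, gold and system annotations for a paragraph can be compared as sets of (mention, type) pairs; the annotations below are hypothetical examples, and prf is just an illustrative helper.

# Sketch for assignment steps 5-6: compare gold annotations with one NER
# system's output for a paragraph and compute P, R, F1.
def prf(gold, predicted):
    tp = len(gold & predicted)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if (p + r) else 0.0
    return p, r, f1

gold = {("Salma", "person"), ("Rawalpindi", "location"), ("Islamabad", "location")}
system = {("Salma", "person"), ("Islamabad", "location"), ("Computer Science", "organization")}

p, r, f1 = prf(gold, system)
print(f"P={p:.2f} R={r:.2f} F1={f1:.2f}")  # P=0.67 R=0.67 F1=0.67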
