2-4 Knowledge Graph Construction and Applications in Internet Scenarios
[Figure: three KB applications.
Question Answering: the KB links Harry Potter to J.K. Rowling through the "author" relation (Harry Potter I and Harry Potter II are linked by "series"); answer: J.K. Rowling!
Faceted Search: all.filter(/material/leather).filter(/type/bag).filter(/price < $300).rerank()
Video Recommendation: a user "likes" /person/Taylor_Swift (instance_of: American female country singers); /event/The_1989_World_Tour and The Red Tour are linked by "artist" and "performs" to /song/Love_Story_(Taylor_Swift); the system can "recommend" along these edges.]
Entity Linking is critical in KB systems
• The crucial step: mapping text to KB entities!
linking(“harry potter”).follow(/r/Author)
[Figure: the same KB serving several tasks.
Question Answering: "Author of Harry Potter?" is answered by following the "author" edge to J.K. Rowling (has_occupation: Author; nationality: U.K.).
Video Recommendation: /song/Love_Story_(Taylor_Swift), /event/The_1989_World_Tour.
Text Understanding: “Ball Animal at Target”.]
Agenda
• /wiki/大众高尔夫 (Volkswagen Golf), and
• /wiki/大众新甲壳虫 (Volkswagen New Beetle).
• Candidate Generation: scores shown in figure: 0.96, 0.85, 0.14
• Scoring: scores shown in figure: 0.01, 0.04
Entity Linking: Traditional Approach
• Sentence: “高尔夫甲壳虫哪个贵” ("Which is more expensive, the Golf or the Beetle?")
• Mention detection as character-level tagging:
Input:  高     尔     夫     甲     壳
Output: ENT/B ENT/I ENT/I ENT/B ENT/I
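The tagging scheme above can be turned into mention strings with a simple decoder. This is a minimal sketch, not the speaker's implementation: the slide shows only the first five tags, so the remaining tags and the "O" (outside) label are assumptions added for illustration.

```python
def decode_mentions(chars, tags):
    """Group characters into mentions using ENT/B (begin) and ENT/I (inside) tags."""
    mentions = []
    current = []
    for ch, tag in zip(chars, tags):
        if tag == "ENT/B":
            # A new mention starts; close the previous one if it is open.
            if current:
                mentions.append("".join(current))
            current = [ch]
        elif tag == "ENT/I" and current:
            current.append(ch)
        else:
            # Any other tag (assumed "O") ends the open mention.
            if current:
                mentions.append("".join(current))
            current = []
    if current:
        mentions.append("".join(current))
    return mentions


chars = list("高尔夫甲壳虫哪个贵")
# The first five tags come from the slide; the last four are assumed.
tags = ["ENT/B", "ENT/I", "ENT/I", "ENT/B", "ENT/I", "ENT/I", "O", "O", "O"]
mentions = decode_mentions(chars, tags)
print(mentions)
```

Under these assumed tags the decoder yields the two mentions that the next slide lists.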
Entity Linking: Traditional Approach
• Mentions: 高尔夫 (Golf), 甲壳虫 (Beetle)
• Candidate Generation: Using entity alias tables…
• Candidate Generation:
• Scoring: hand-crafted features per candidate
❖ score 0.96: prior=0.32, full_match_alias, same_type_in_sentence
❖ score 0.04: prior=0.05, full_name_match
• Candidate Generation:
• Scoring:
❖ score 0.96: P(E|M) = X1, TypeCoherence(E,Context) = Y1
❖ score 0.04: P(E|M) = X2, TypeCoherence(E,Context) = Y2
[Raiman & Raiman] DeepType: Multilingual Entity Linking by Neural Type System Evolution
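The traditional recipe (alias-table candidate generation, then scoring with a prior P(E|M) and a type-coherence term) can be sketched as follows. Everything here is illustrative: the priors, the entity types, the two extra candidates (/wiki/高尔夫, /wiki/甲壳虫) and the way the two signals are multiplied are assumptions, not values or formulas from the talk or from DeepType.

```python
# Hypothetical alias table: surface form -> list of (entity, prior P(E|M)).
ALIAS_TABLE = {
    "高尔夫": [("/wiki/大众高尔夫", 0.32), ("/wiki/高尔夫", 0.68)],
    "甲壳虫": [("/wiki/大众新甲壳虫", 0.30), ("/wiki/甲壳虫", 0.70)],
}

# Hypothetical entity types used for the coherence signal.
ENTITY_TYPE = {
    "/wiki/大众高尔夫": "car",
    "/wiki/高尔夫": "sport",
    "/wiki/大众新甲壳虫": "car",
    "/wiki/甲壳虫": "insect",
}


def generate_candidates(mention):
    """Candidate generation: look the mention up in the alias table."""
    return ALIAS_TABLE.get(mention, [])


def type_coherence(entity, other_mentions):
    """1.0 if another mention has a candidate of the same type, else a small floor."""
    etype = ENTITY_TYPE[entity]
    for m in other_mentions:
        if any(ENTITY_TYPE[e] == etype for e, _ in generate_candidates(m)):
            return 1.0
    return 0.1


def link(mentions):
    """Pick, for each mention, the candidate maximising prior * type coherence."""
    results = {}
    for m in mentions:
        others = [o for o in mentions if o != m]
        raw = {e: prior * type_coherence(e, others)
               for e, prior in generate_candidates(m)}
        results[m] = max(raw, key=raw.get) if raw else None
    return results


linked = link(["高尔夫", "甲壳虫"])
print(linked)
```

With these toy numbers the car reading wins for both mentions even though its prior is lower, because each mention supplies same-type context for the other. A mention missing from the alias table gets no candidates at all, which is the weakness the next slide raises.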
Challenges & Limits
• Absence of good Alias Tables for CandGen
❖ Name Mining: challenging, low recall, and error-prone.
• Zero-shot for tail and fresh entities
❖ e.g. new people, products, albums and events show up.
• Massive KBs

alias            entity
san jose         San_Jose,_California
san jose         San_José,_Costa_Rica
san francisco    San_Francisco
san francisco    San_Francisco_International_Air…
san francisco    San_Francisco,_Córdoba

[Figure: plot with y-axis ticks from 0 to 1 and x-axis ticks 1e3, 1e5, 1e7; labels "No alias table" and "Tail entities".]
• CandGen: Entity retrieval with dual encoders; Featurized entity representations
• Scoring: Entity scoring with reading comprehension
[Figure: candidate entities San Francisco International Airport; San Jose, California; San Francisco, California, with scores 0.96, 0.14, 0.85.]
[Figure: Illustration of nearest neighbor search for entity retrieval. Candidates retrieved: Ricardo_Costa (0.588), Jorge_Costa (0.572), Fernando_Torres (0.508), …]
Entity retrieval with dual encoders (DEER)
• Train with two losses:
❖ Batch Softmax for positives
❖ Cross entropy for hard negatives (hard negative mining: +0.37 R@1)
• Inference:
❖ Nearest neighbor search for entity embeddings
[Figure: a Mention encoder and an Entity encoder compared by Cosine similarity.
Mention encoder inputs: Mention span ("costa"), Mention context ("has, not, played…").
Entity encoder inputs: Title ("jorge, costa"), Paragraph ("jorge, paulo, born, 1971…"), Categories ("basketball_player, person…").
Example mention: "Costa has not played since being struck by the AC Milan forward."]
[Gillick 2019] Learning Dense Representations for Entity Retrieval
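The "Batch Softmax for positives" loss can be sketched in plain Python as below. It assumes the encoders have already produced one vector per mention and per entity, that row i of each list is a matching pair, and that the other entities in the batch serve as negatives. The `scale` parameter and the toy vectors are assumptions for illustration; the hard-negative cross-entropy term is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def batch_softmax_loss(mention_vecs, entity_vecs, scale=1.0):
    """In-batch softmax: for mention i, entity i is the positive and
    every other entity in the batch acts as a negative."""
    losses = []
    for i, m in enumerate(mention_vecs):
        logits = [scale * cosine(m, e) for e in entity_vecs]
        log_z = math.log(sum(math.exp(l) for l in logits))
        losses.append(log_z - logits[i])  # negative log-likelihood of the positive
    return sum(losses) / len(losses)


# Toy embeddings (hypothetical): two mentions and their two gold entities.
mentions = [[1.0, 0.0], [0.0, 1.0]]
entities = [[1.0, 0.0], [0.0, 1.0]]

aligned_loss = batch_softmax_loss(mentions, entities)
swapped_loss = batch_softmax_loss(mentions, entities[::-1])
print(aligned_loss, swapped_loss)
```

The loss is lower when each mention is closest to its own entity than when the pairs are swapped, which is the signal the encoders are trained on.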
Limits to DEER
[Figure labels: Transformers, FFNN]
[Ling, FitzGerald, Shan 2019] Learning Cross-Context Entity Representations from Text
RELIC Evaluations
• R@1 Matches SOTA on EL (CoNLL-AIDA)
• Industry-friendly model (no entity features; no alias table); can be used for retrieval
• Outperforms DEER without doing negative mining

R@1 (Accuracy)        CoNLL-AIDA    KBP 2010
Raiman 2018 (SOTA)    94.9          90.9
DEER                  -             87.0
RELIC (ours)          94.9          89.8
Insight: question answering
• RELIC is also an approach to entity-centric QA:
❖ encode the question, retrieve nearest entities in the embedding space as answer candidates.
❖ “closed-book” vs “open-book” QA
[Figure: Question "In which Lake District town would you find the Cumberland Pencil Museum? [MASK]" → encode → Encoding [0.372, 0.187, -0.408, … ] → NNS → Candidates: Keswick (0.638) (correct), Hawkshead (0.602), Grasmere (0.517), …]
Recovers 80% R@1 of more complex SOTA models for QA, in a closed-book setting
[Ling, FitzGerald, Shan 2019] Learning Cross-Context Entity Representations from Text
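The retrieval step (encode the question, then take the nearest entities as answer candidates) can be sketched as a brute-force nearest neighbor search. The candidate names come from the slide, but the two-dimensional vectors and the `nearest_entities` helper are invented for illustration, so the scores do not reproduce the slide's numbers. A real system would use a learned encoder and an approximate index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def nearest_entities(query_vec, entity_index, k=3):
    """Return the k entity names whose embeddings are most similar to the query."""
    scored = [(cosine(query_vec, vec), name) for name, vec in entity_index.items()]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]


# Hypothetical entity embeddings; in RELIC these are learned from text.
entity_index = {
    "Keswick": [0.9, 0.3],
    "Hawkshead": [0.5, 0.8],
    "Grasmere": [-0.2, 1.0],
}

# Hypothetical encoding of the question with its [MASK] slot.
question_vec = [1.0, 0.2]

candidates = nearest_entities(question_vec, entity_index)
print(candidates)
```

No passage is read at answer time, which is what makes the setting "closed-book": everything the model knows about an entity has to be in its embedding.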
Limits to RELIC
• Massive KBs
❖ e.g. Google KG: billions of entities
• Zero-shot for tail and fresh entities —> featurized entity representations
❖ e.g. new people, products, albums and events show up
[Figure: dual-encoder inputs.
Mention input: [CLS] [E1] Costa [/E1] has not played since being struck by the AC Milan forward . [SEP]
Entity input: [CLS] Jorge Paulo Costa Almeida, known as Costa, is a Portuguese retired footballer … [SEP]]
• Train with Batch Softmax + Cross Entropy for hard negatives.
• R@1 Outperforms previous SOTA on Cross-Lingual EL
• With more entities and languages (harder)

R@1 (Accuracy)              Tsai+          Upadhyay+      Model F (ours)
Stats                       13 langs       13 langs       104 langs
                            5m entities    5m entities    20m entities
R@1 on TR2016-hard (avg)    0.51           0.54           0.57
Releasing New Multilingual EL dataset
Uniform improvements over Alias Table on all 104 languages, despite training size!
[Figure: results by languages in training; Micro-avg +0.02, +0.01]
Scales to larger KBs because the entity encoder can generalize!
[Botha, Shan, Gillick 2020] Entity Linking in 100 Languages
Example Architecture
[Figure: serving pipeline.
3. Mention Encoding: [CLS] [E1] mention [/E1] context [SEP] → encode → FFNN
   Entity side: [CLS] name [TYPE] type [REL] rels [SEP] → encode → FFNN
4. Embedding Search: ANN search over an Index of entities, e.g. /ent/Mojito_(song), /ent/Mojito_(cocktail), /ent/Glenrothes1990
5. downstream scoring, then return: /ent/Mojito_(song_by_Jay_Chou) (0.64), /ent/Mojito_(song_by_Abigail) (0.43), …]
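The two input templates in this architecture are easy to make concrete. The helpers below are a hypothetical sketch: the function names are made up, and the example type and relation strings ("song", "performer Jay_Chou") are assumed values, since the slide shows only the template. The mention example reuses the sentence from the earlier slides.

```python
def format_mention(left_context, mention, right_context):
    """Build the mention-side input: [CLS] ... [E1] mention [/E1] ... [SEP]."""
    parts = ["[CLS]", left_context, "[E1]", mention, "[/E1]", right_context, "[SEP]"]
    return " ".join(p for p in parts if p)  # skip empty context pieces


def format_entity(name, types, relations):
    """Build the featurized entity input: [CLS] name [TYPE] type [REL] rels [SEP]."""
    parts = ["[CLS]", name, "[TYPE]", " ".join(types),
             "[REL]", " ".join(relations), "[SEP]"]
    return " ".join(p for p in parts if p)


mention_input = format_mention(
    "", "Costa", "has not played since being struck by the AC Milan forward ."
)
# Type and relation values below are illustrative assumptions.
entity_input = format_entity("Mojito", ["song"], ["performer Jay_Chou"])

print(mention_input)
print(entity_input)
```

Because the entity side is built from features rather than from a learned per-entity ID, a newly added entity can be encoded and indexed without retraining, which is the motivation given earlier for featurized entity representations.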
• CandGen:
• Scoring: Entity scoring with reading comprehension
[Figure: two scorers applied to the same mention and entity.
Dual-encoder (Model-F like): "[CLS] [E1] Costa [/E1] has not … [SEP]" and "Jorge Paulo Costa Almeida, known [SEP]" are encoded separately, followed by an FFNN; score 0.58.
Mention+Entity Cross-Attention encoder (BERT): "[CLS] [E1] Costa [/E1] has not played since being struck by the AC Milan forward . [SEP] Jorge Paulo Costa Almeida, known as Costa, is a Portuguese retired footballer … [SEP]" is read jointly with Cross Attention; score 0.76.]
Also works well for the Multilingual EL paper setting & Google EL scoring.
[Logeswaran 2019] Zero-Shot Entity Linking by Reading Entity Descriptions
Agenda
? Knowledge Base
• Human Editing?
• Hard Rules?
[Figure: has_spouse between the Mentions "Barack" and "Michelle" (mention-level relation)]
Sources:
• Unstructured Text (e.g. Wikipedia text)
• Semi-structured Tables & Lists (e.g. Wikipedia infobox and lists)
• Structured KB (e.g. Wikidata)
Pipeline: NLP tagging → Mention Extraction → Relation Extraction → Coref resolution and Entity Linking
Example: "Swift said in a video posted on her Instagram story, which shows her with her …"
[Figure annotations on the pronoun mentions: pm2; coref(m1) and pm3; coref(m1)]
Extraction Pipeline: NLP tagging → Mention Extraction → Relation Extraction → Coref resolution and Entity Linking → Data Fusion → Knowledge Base
Error Analysis
http://deepdive.stanford.edu/labeling
A more modern approach with BERT
• Matching-The-Blanks: use BERT to read two mentions in
context and classify the relation.
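One common way to let a BERT-style model "read two mentions in context" is to wrap each mention in marker tokens before encoding. The sketch below covers only that input-preparation step. The `mark_mentions` helper, the [E1]/[E2] marker names, the example sentence and the token spans are assumptions for illustration, spans are assumed not to overlap, and the relation classifier itself is not shown.

```python
def mark_mentions(tokens, span1, span2):
    """Wrap two non-overlapping token spans (start, end-exclusive) in
    [E1]...[/E1] and [E2]...[/E2] markers and add [CLS]/[SEP]."""
    (s1, e1), (s2, e2) = span1, span2
    out = []
    for i, tok in enumerate(tokens):
        if i == s1:
            out.append("[E1]")
        if i == s2:
            out.append("[E2]")
        out.append(tok)
        if i == e1 - 1:
            out.append("[/E1]")
        if i == e2 - 1:
            out.append("[/E2]")
    return ["[CLS]"] + out + ["[SEP]"]


tokens = "Barack is married to Michelle .".split()
marked = mark_mentions(tokens, (0, 1), (4, 5))
print(" ".join(marked))
```

A classifier on top of the encoded sequence would then predict a mention-level relation such as has_spouse for the marked pair.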
[Figure labels: 2015, Mazda-6, Shenzhen, Birds, Baseball, San Francisco]
User Portrait vs PKG
• Traditional user portraits are collections of (unstructured) labels and attributes:
[Figure labels: Electric Guitar, Rock, Acoustic Guitar, Imagine Dragons, James, Basketball, Ragdoll, Baseball, Bird]
User Portrait vs PKG
• PKGs: entities, relations, temporal, expandable, reasoning
[Figure: example PKG.
Entities: Taylor (guitar brand), Guitar, Acoustic Guitar, Electric Guitar, Music, Folk Music, Rock Music, Muse (band), Imagine Dragons, LeBron James, Miami Heat, LA Lakers, NBA, Basketball, Baseball, Soccer, Skiing, Sports, Ragdoll, Cat, Birds, Pets, Animals.
Relations: Genre, Subclass of, Instance of, Employer, Former employer, League.]
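The difference from a flat label set can be illustrated by storing the PKG as triples and walking its hierarchy. This is a sketch under assumptions: the figure's exact edges are not recoverable from the text, so the triples below (including the "user" node and its "likes" edges) are plausible guesses chosen for illustration, not a transcription of the slide.

```python
from collections import defaultdict

# Hypothetical triples (subject, relation, object); edges are illustrative guesses.
TRIPLES = [
    ("user", "likes", "Ragdoll"),
    ("user", "likes", "LeBron James"),
    ("Ragdoll", "subclass_of", "Cat"),
    ("Cat", "subclass_of", "Pets"),
    ("Pets", "subclass_of", "Animals"),
    ("LeBron James", "employer", "LA Lakers"),
    ("LeBron James", "former_employer", "Miami Heat"),
    ("LA Lakers", "league", "NBA"),
]


def ancestors(node, triples):
    """Follow subclass_of / instance_of edges upward and collect every ancestor."""
    parents = defaultdict(list)
    for s, r, o in triples:
        if r in ("subclass_of", "instance_of"):
            parents[s].append(o)
    seen, stack = [], [node]
    while stack:
        cur = stack.pop()
        for p in parents[cur]:
            if p not in seen:
                seen.append(p)
                stack.append(p)
    return seen


def neighbors(node, relation, triples):
    """Objects reachable from node over a single named relation."""
    return [o for s, r, o in triples if s == node and r == relation]


ragdoll_ancestors = ancestors("Ragdoll", TRIPLES)
james_employer = neighbors("LeBron James", "employer", TRIPLES)
print(ragdoll_ancestors, james_employer)
```

A flat portrait holding only the label "Ragdoll" cannot infer an interest in pets, whereas a traversal like this can, which is the "expandable, reasoning" property the bullet names.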
Stacks for PKG Construction
Applications: Content Recommendations, Personalized Search, Interest Expansion, Personal Assistants, Ads & Shopping
[Figure examples: "The acoustic or electric one?"; Video Recommender: "You may be interested in these recent SF baseball videos!"]
PKG: Challenges and Open Questions
• Engineering PKGs with Privacy