
Multimodal Knowledge Graphs:

Construction, Inference, and Challenges

Meng Wang
Southeast University
[Timeline figure: milestones toward multimodal knowledge graphs]
• ESP Game (2004): labeling images with a computer game
• ImageNet, Visipedia (2009, 2010): fragments built upon the backbone of WordNet
• NEIL: Image Knowledge Miner (2013): automatically extracts visual knowledge by semi-supervised learning; discovers common-sense relationships
• IMGpedia (2015, 2017): Semantic Web approach — interlinking multimedia formats, applying Linked Data principles to multimedia
• Multimodal Knowledge Graph (2019): still many unsolved problems

• Why Multimodal?
• Construction
• Inference
• Challenges
Yoshua Bengio, NeurIPS Keynote, 2019: from System 1 DL to System 2 DL
Marvin Minsky, The Society of Mind, 1986: a framework for representing knowledge


Neural (System 1) vs. Symbolic (System 2) — neural networks vs. symbolism

[Figure: a pipeline from input (question) to goal (answer) combining neural and symbolic components.]

Cognitive theory: visual generalisation vs. symbolic generalisation.
Knowledge graph perspective: VQA, Commonsense QA, KBQA, and machine reading comprehension.
The two major schools: neural systems + symbolism.
Neural (System 1) methods are:
• powerful for some problems
• robust to data noise
• hard to understand or explain
• poor at symbol manipulation
• unclear in how to effectively use background knowledge

Symbolic (System 2) methods are:
• usually poor at machine learning problems
• intolerant of data noise
• easy for a human to understand and assess
• good at symbol manipulation
• designed to work with background knowledge

"Neural + Symbolic": The Development of Cognitive Reasoning from a Knowledge Graph Perspective. Communications of the CCF (《中国计算机学会通讯》), Vol. 16, No. 8, 2020.
Neural + Symbolic methods aim to be:
• a powerful machine learning paradigm
• robust to data noise
• easy for humans to understand and assess
• good at symbol manipulation
• able to work seamlessly with background knowledge

Multimodal Knowledge Graph?

[Figure: word cloud bridging the two schools. Neural: deep learning, meta learning, reinforcement learning, GNNs, representation learning. Symbolic: knowledge representation, databases, knowledge graphs, reasoning, semantic search, QA, recommendation.]
1. Multimodal is complementary.
2. More cross-modal relations, more details, and more answers.
3. Cross-modal disambiguation: heterogeneous in modality, but correlated in semantics (using multimodal knowledge to resolve ambiguity).
1. Multimodal is complementary

Multimodal knowledge: knowledge in different modalities describes the same thing! What is extracted is knowledge, not features!

[Figure: the same fact expressed as an image, as text, and as a KG.]
2. More cross-modal relations, more details, and more answers

Cross-modal entity grounding; VQA; complicated scene understanding. Example: "Yao Ming and Tracy McGrady of the Houston Rockets visit Beijing in 2004."
3. Cross-modal disambiguation: heterogeneous in modality, but correlated in semantics

Visual entity disambiguation — example result: Yao Ming (100), Will Smith (39), Mei Lanfang (36).

Textual entity disambiguation — example: "Liu Huan was spotted at a US supermarket buying $8 bread and readily signed autographs for fans."

Enhance KG to Multimodal KG

[Figure: from detection to a multimodal KG. Detected scene graph: Man1 (Yao Ming) holds Ball (Spalding), stands next to Man2 (McGrady), and is beside/at Rostrum (Tian'anmen); Man2 wears Jacket (Houston Rockets). Pipeline: detection → visual relation detection → cross-modal entity linking → application.]
1. Enhance KG to Multimodal KG

Images in KGs are very sparse, and visual relations are missing.


1. Enhance KG to Multimodal KG

[Figure: the Richpedia pipeline. Data collection (KG source: Wikipedia; image sources: Bing, Google, Yahoo) → image processing (clustering, diversity detection, feature extraction) → relation discovery (e.g., rpo:sameAs relations among images of cities, sights, and people).]

[Figure: the visual relation ontology (RPO). rpo:Image carries rpo:pixel (xsd:string), rpo:height (xsd:float), and rpo:width (xsd:float), and is linked to rpo:KGentity via rpo:Imageof; images are related to each other by rpo:sameAs and rpo:contain. rpo:ImageSimilarity connects a rpo:sourceImage and a rpo:targetImage with a rpo:Similarity score (xsd:float). Each image has descriptors (rpo:Descriptor, with rpo:value as xsd:string, linked by rpo:Describes), whose rpo:DescriptorType has subclasses rpo:GHD, rpo:CLD, rpo:CM, rpo:GLCM, rpo:HOGS, and rpo:HOGL.]


1. Enhance KG to Multimodal KG
Richpedia is a graph G = (E, R), where E is the set of entities and R the set of relations.

Motivation: KG entities have few images.


• Issue 1: noise-containing images
  • Method: K-means clustering
  • Implementation: noisy images appear as outliers within the clusters, and the outliers are removed

• Issue 2: high similarity among collected images
  • Method: a diversity detection algorithm
  • Implementation: image similarity is measured by histogram intersection over descriptor bins,

    $\mathrm{sim}(e_i, e_j) = \sum_{k=1}^{23} \min\big(H_k(e_i),\, H_k(e_j)\big)$

    First choose the center of mass, then choose the point farthest from the center of mass, and then repeatedly select the point farthest from the points already chosen.

• Image collection: gather image resources from Wikipedia, Bing, Yahoo, and Google. A minimal sketch of both filtering steps follows.
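Purely for illustration, a Python sketch of the two steps, assuming images are already encoded as feature vectors; the cluster count and outlier quantile are illustrative choices, not the Richpedia settings:

    import numpy as np
    from sklearn.cluster import KMeans

    def remove_outliers(features, n_clusters=5, keep_ratio=0.9):
        # Issue 1: noise shows up as outliers, i.e. points far from their centroid
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
        dist = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
        return features[dist <= np.quantile(dist, keep_ratio)]

    def diversity_select(features, k=10):
        # Issue 2: farthest-point selection, as described above
        center = features.mean(axis=0)
        chosen = [int(np.argmax(np.linalg.norm(features - center, axis=1)))]
        while len(chosen) < k:
            d = np.min([np.linalg.norm(features - features[i], axis=1)
                        for i in chosen], axis=0)
            chosen.append(int(np.argmax(d)))  # farthest from everything chosen so far
        return chosen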
1. Enhance KG to Multimodal KG
Richpedia is a graph G = (E, R), where E is the set of entities and R the set of relations.

For each image, we generate five descriptors:

• rpo:GHD (Gradation Histogram Descriptor)
• rpo:CLD (Color Layout Descriptor)
• rpo:CMD (Color Moment Descriptor)
• rpo:GLCM (Gray-Level Co-occurrence Matrix)
• rpo:HOG (Histogram of Oriented Gradients)

[Figure: a source image and its ten nearest neighbours (1-NN through 10-NN) under these descriptors.]
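As a sketch, three of these descriptors can be computed with scikit-image as below; the bin counts and GLCM parameters are illustrative choices, not the exact Richpedia settings:

    import numpy as np
    from skimage.color import rgb2gray
    from skimage.feature import hog, graycomatrix, graycoprops

    def descriptors(rgb):
        gray = rgb2gray(rgb)                      # RGB image, values in [0, 1]
        # GHD: normalised histogram over gray levels
        ghd, _ = np.histogram(gray, bins=32, range=(0.0, 1.0))
        ghd = ghd / ghd.sum()
        # GLCM: texture statistics from the gray-level co-occurrence matrix
        levels = (gray * 255).astype(np.uint8)
        glcm = graycomatrix(levels, distances=[1], angles=[0], levels=256, normed=True)
        texture = [graycoprops(glcm, p)[0, 0] for p in ("contrast", "homogeneity")]
        # HOG: histogram of oriented gradients over the whole image
        hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                      cells_per_block=(2, 2))
        return ghd, np.array(texture), hog_vec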


1. Enhance KG to Multimodal KG

Richpedia is a graph G = (E, R), where E is the set of entities and R the set of relations.

Example (wd:Q84, Name: London): two Richpedia images are both images of the same KG entity and are linked to each other, following the visual relation ontology:

<wd:Q84, wdt:P31, wd:Q515>  (London is an instance of city)
<rp:0000001, rpo:imageof, wd:Q84>
<rp:0000001, rpo:sameAs, rp:0000002>

The same triples can be emitted programmatically, as sketched below.
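For illustration, with Python's rdflib; the Richpedia namespace URIs below are assumptions for the sketch, not necessarily the official ones:

    from rdflib import Graph, Namespace

    WD = Namespace("http://www.wikidata.org/entity/")
    WDT = Namespace("http://www.wikidata.org/prop/direct/")
    RP = Namespace("http://richpedia.example.org/resource/")   # assumed URI
    RPO = Namespace("http://richpedia.example.org/ontology/")  # assumed URI

    g = Graph()
    g.add((WD.Q84, WDT.P31, WD.Q515))                  # London instance-of city
    g.add((RP["0000001"], RPO.imageof, WD.Q84))        # image 1 depicts London
    g.add((RP["0000001"], RPO.sameAs, RP["0000002"]))  # image 1 ~ image 2
    print(g.serialize(format="nt"))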
1. Enhance KG to Multimodal KG
Richpedia is a graph G = (E, R), where E is the set of entities and R the set of relations.

• Rule-based relation discovery: simple, but effective and efficient (a minimal sketch follows the rules below).

• Rule 1: If there is a single hyperlink in the description, the relation is discovered by a string-mapping algorithm between the keyword and the predefined relation ontology.
• Rule 2: If there are multiple hyperlinks in the description, cyclically process the KG entity corresponding to each hyperlink, reducing the situation to Rule 1.
• Rule 3: If there are no hyperlinks in the description, use a named entity recognition tool to find the KG entities, reducing the situation to Rules 1 and 2.
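Purely for illustration, a runnable Python sketch of this dispatch; the wiki-style markup convention, the toy NER fallback, and the keyword table are all assumptions, not the Richpedia implementation:

    import re

    # Illustrative stand-in for the predefined relation ontology: keyword -> relation
    RELATION_ONTOLOGY = {"next to": "rpo:nextTo", "inside": "rpo:contain"}

    def find_hyperlinks(description):
        # Rules 1 and 2: hyperlinks mark KG entities (here, [[...]] markup)
        return re.findall(r"\[\[([^\]]+)\]\]", description)

    def ner_entities(description):
        # Rule 3 fallback: a real system would call an NER tool here
        return [w for w in description.split() if w.istitle()]  # crude placeholder

    def discover_relations(description):
        entities = find_hyperlinks(description) or ner_entities(description)
        found = []
        for entity in entities:  # Rule 2: iterate, reducing each case to Rule 1
            for keyword, relation in RELATION_ONTOLOGY.items():
                if keyword in description:  # Rule 1: string mapping vs. ontology
                    found.append((entity, relation))
        return found

    print(discover_relations("The [[London Eye]] stands next to the [[River Thames]]."))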
1. Enhance KG to Multimodal KG

Richpedia.cn — we will introduce how to use this dataset later.

Meng Wang, Guilin Qi, Haofen Wang, and Qiushuo Zheng. Richpedia: A Large-Scale, Comprehensive Multi-Modal Knowledge Graph. Big Data Research, 2020.
Enhance KG to Multimodal KG — Visual Relation Detection

[Figure: the scene-graph pipeline again (detection → visual relation detection → cross-modal entity linking → application), now highlighting visual relation detection; the relation labels (hold, stand next to, beside/at, wear) are unbalanced and long-tailed.]
2. Visual Relation Detection
The aim of visual relation detection is to provide a comprehensive understanding of an image by describing all the objects within the scene and how they relate to each other.

Input: an image. Output: relation triples such as person-on-motorcycle, person-wear-helmet, motorcycle-has-wheel.

Representative approaches map the feature space to the relation space using statistical information and spatial and semantic cues:
• Visual relation detection with language priors, ECCV 2016
• VTransE, CVPR 2017
• Neural Motifs: Scene Graph Parsing with Global Context, CVPR 2018
2. Long-tail Visual Relation Detection

Our method vs. Learning to Compose Dynamic Tree Structures for Visual Contexts, CVPR 2019.
2. Long-tail Visual Relation Detection (initial attempt)

Feature sparsity can be alleviated with a feature-level attention mechanism, which achieves message passing from predicates or objects in different images (a minimal sketch follows).
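A minimal numpy sketch of such feature-level attention; the scaled dot-product scoring and the additive update are illustrative assumptions, not the exact model from the paper:

    import numpy as np

    def feature_attention(query, supports):
        # query: (d,) feature of a rare predicate/object;
        # supports: (n, d) features of predicates/objects from other images
        scores = supports @ query / np.sqrt(query.size)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over the supports
        return query + weights @ supports        # message passed into the query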
2. Long-tail Visual Relation Detection

Weitao Wang, Meng Wang, Sen Wang, Guodong Long, Lina Yao, and Guilin
Qi. One-Shot Learning for Long-Tail Visual Relation Detection. AAAI 2020.
2. Unbalanced Visual Relation Detection

[Figure: (a) nonstandard labels and (b) feature overlap. Low-frequency labels such as "pizza on a plate", "man on an elephant", "boy carrying surfboard" coexist with high-frequency counterparts "pizza on plate", "man on elephant", "boy on surfboard".]

Because such nonstandard labels exist, paying excessive attention to low-frequency visual relations hurts the performance of the scene graph generation model.
2. Unbalanced Visual Relation Detection (this year's work)

We use memory features to transfer high-frequency relation features to low-frequency relations.
2. Unbalanced Visual Relation Detection

• The visual relation memory is computed from per-class prototypes: the mean of each class's features over the training set.
• The directly observed features and the memory features are fused, enabling information exchange between the current relation and other relations.
• We also use statistical information (the relation distribution) from the training set to calibrate the model's results. A minimal sketch of the memory mechanism follows.
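A minimal numpy sketch of this memory mechanism; the dot-product read and the fusion weight alpha are illustrative assumptions, not the exact model from the paper:

    import numpy as np

    def prototypes(features, labels, n_classes):
        # Memory: per-class prototype = mean training feature of that relation class
        return np.stack([features[labels == c].mean(axis=0)
                         for c in range(n_classes)])

    def fuse(x, memory, alpha=0.5):
        # Read the memory with attention, then fuse with the observed feature x
        scores = memory @ x / np.sqrt(x.size)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return alpha * x + (1 - alpha) * (w @ memory)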
2. Unbalanced Visual Relation Detection

Our model achieves clear improvements on almost all relations (47 of 50).

Weitao Wang, Ruyang Liu, Meng Wang, Sen Wang, Xiaojun Chang, and Yang Chen. Memory-Based Network for Scene Graph with Unbalanced Relations. ACM MM 2020.
Enhance KG to Multimodal KG — Cross-modal Entity Linking

[Figure: the scene-graph pipeline again, now highlighting cross-modal entity linking, which moves from classes to entities (e.g., Man1 → Yao Ming, Jacket → Houston Rockets, Rostrum → Tian'anmen).]
3. Cross-modal Entity Linking (entity linking)

Approach: learning to rank with modal attention.
3. Cross-modal Entity Linking

[Figure: detected objects (Man1, Man2, Ball, Rostrum, Jacket) and their relations (hold, beside/at, wear) are linked to KG entities.]
3. Cross-modal Entity Linking

Qiushuo Zheng, Meng Wang, Guilin Qi, and Chaoyu Bai. Semantic Visual Entity Linking Based on Multimodal Learning. AAAI 2021 (submitted).
• Why Multimodal?
• Construction
• Inference
• Challenges
Multi-modal KG Completion

Logan IV, R. L., Humeau, S., and Singh, S. Multimodal Attribute Extraction. NIPS, 2017.
Latest attempt: inference over multi-modal KGs — Multi-modal KG Completion

[Figure: input — an item's details (image and text); output — predicted attribute values, e.g., "three-person", "white/light-colored", "without".]
Multimodal machine translation developed earlier than today's multimodal KGs.

Is the multi-modal KG really helpful?
We need to understand systematically in which scenarios multimodal entity linking and entity alignment are needed, and in which scenarios adding them merely introduces noise!

Adversarial Evaluation of Multimodal Machine Translation. EMNLP 2018. (CCF B)
This work questions whether multimodal information is useful and systematically characterizes the scenarios that need it: visual context is only needed for incorrect, ambiguous, and gender-neutral words.


Is the multi-modal KG really helpful?

Probing the Need for Visual Context in Multimodal Machine Translation. NAACL 2019. (CCF B)

"This dominance effect corroborates the seminal work of Colavita (1974) in Psychophysics where it has been demonstrated that visual stimuli dominate over the auditory stimuli when humans are asked to perform a simple audiovisual discrimination task."

This work verifies at a finer granularity where multimodal information is useful: in fact about 90% of the scenarios rely on image information and only 10% on text. This raises the question of whether today's design — text-based multimodal KGs with images as an auxiliary signal — is the right one.
Is the multi-modal KG really helpful?

Realistic Re-evaluation of Knowledge Graph Completion Methods: An Experimental Study. SIGMOD 2020. (CCF A)

A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs. PVLDB 2020. (CCF A)

These works systematically investigate in which scenarios KG entity alignment works well, and in which it instead introduces noise.
Is the multi-modal KG really helpful?

Cross-modal Entity Linking???

Is multi-modal information only needed in very specific cases for KGs?
• Why Multimodal?
• Construction
• Inference
• Challenges
Challenges

• Parsing text to a structured semantic graph
• Parsing images/videos to structures
• Grounding events/entities across modalities
• Annotation cost and limited training data (domain-specific)
• Computational complexity
• Limited, fixed vocabulary
• Abstract concepts that are not groundable

Three steps:
1. Graph-structure extraction from each modality (requires fine-grained entity construction and computational resources)
2. Grounding the graphs across modalities (requires grounding different concept graphs — entity alignment?)
3. Building the multimodal graph

(Credit: Shih-Fu Chang, Alireza Zareian, Hassan Akbari, Brian Chen, Columbia University; Heng Ji, Spencer Whitehead, Manling Li, UIUC)
Real Challenges

• Multimodal data (does the scenario actually provide these data?)
  • KG
  • Text
  • Image or video

• Multimodal knowledge representation (a more general representation framework is needed, covering heterogeneous data, spatio-temporal data, events, and rules — and above all, what kinds of associations to model)
  • Multimodal
  • Spatial-temporal
  • Events
  • Rules

• Multimodal representation learning (design pre-trained models to organize heterogeneous knowledge)
  • Pre-trained models for multimodal KGs
  • Cross-modal alignment
  • Computing and storage capacity
Richpedia Demo Tutorial
RDF
• The RDF data model is similar to classic conceptual modeling approaches such as entity-relationship or class diagrams, because it represents statements about resources in subject-predicate-object form.

• In RDF terminology these statements are called triples, e.g., <grass, color, green> and <Yao Ming, has_wife, Ye Li>.
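For the tutorial, a minimal Python sketch of these two triples with rdflib; the example.org namespace is an illustrative assumption:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.grass, EX.color, Literal("green")))   # <grass, color, green>
    g.add((EX.Yao_Ming, EX.has_wife, EX.Ye_Li))     # <Yao Ming, has_wife, Ye Li>
    print(g.serialize(format="turtle"))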
Architecture

• Interaction layer (Node.js, JavaScript): multimodal knowledge graph display
• Business layer (Apache Jena): selects the corresponding entity information and its multimodal knowledge
• Data layer (AllegroGraph): stores the textual knowledge graph and the image knowledge graph
AllegroGraph
• AllegroGraph is a modern, high-performance, durable graph database.

• AllegroGraph uses disk-based storage with efficient memory utilization.

• AllegroGraph supports SPARQL, RDFS++, and Prolog.

• AllegroGraph turns mathematical logic into a programming facility that can be used from many client programs.
Using AllegroGraph

• The query function of the online multi-modal knowledge graph platform uses the interface of the AllegroGraph server deployed on a dedicated port. An anonymous, read-only repository is set up so that data can be queried securely without exposing server administrator credentials. A sketch of such a query follows.
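As a sketch, such an anonymous read-only repository can be queried with a standard SPARQL-over-HTTP request in Python; the endpoint URL and namespace prefixes below are illustrative assumptions:

    import requests

    ENDPOINT = "http://richpedia.example.org:10035/repositories/richpedia"
    QUERY = """
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX rpo: <http://richpedia.example.org/ontology/>
    SELECT ?image WHERE { ?image rpo:imageof wd:Q84 . } LIMIT 10
    """

    resp = requests.get(ENDPOINT, params={"query": QUERY},
                        headers={"Accept": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["image"]["value"])  # IRI of each image of London (wd:Q84)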
Using AllegroGraph

[Screenshot: backend AllegroGraph query results.]
Architecture

• Interaction layer (Node.js, JavaScript): multimodal knowledge graph display
• Business layer (Apache Jena): selects the corresponding entity information and its multimodal knowledge
• Data layer (AllegroGraph): stores the textual knowledge graph and the image knowledge graph
Jena

• Apache Jena is an open-source Java framework dedicated to Semantic Web ontology operations. It provides RDF and SPARQL APIs to query and modify ontologies and to perform ontology reasoning, and it provides TDB and Fuseki to store and manage triples. There are two versions, Jena1 and Jena2.

• The tools provided by Jena1 include:

1. An RDF/XML parser
2. A query language
3. An I/O model for N3, N-Triples, and RDF/XML output
Jena Architecture

[Figure: Jena's three layers — the Model layer on top of the EnhGraph layer on top of the Graph layer.]
Using Jena

• First configure the environment on the machine (set the required environment variables). The main functions Jena provides are:

1. Writing RDF
2. Storing RDF as an XML file
3. Reading RDF from an XML file
4. Querying a given URI: model.getResource(URI).getProperty()...
5. Querying any S-P-O pattern: model.listSubjectsWithProperty(), model.listStatements()...
6. Processing an RDF model as a whole, including merging and splitting RDF models
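Jena itself is Java; purely to illustrate the same workflow, here is the equivalent flow in Python with rdflib (example.org namespace and file name are assumptions):

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")            # illustrative namespace
    g = Graph()
    g.add((EX.Yao_Ming, EX.has_wife, EX.Ye_Li))      # 1. write RDF
    g.serialize("out.rdf", format="xml")             # 2. store RDF as an XML file
    g2 = Graph().parse("out.rdf", format="xml")      # 3. read RDF from an XML file
    print(g2.value(EX.Yao_Ming, EX.has_wife))        # 4. query a URI's property
    for s, p, o in g2.triples((None, EX.has_wife, None)):
        print(s, p, o)                               # 5. S-P-O pattern query
    merged = g + g2                                  # 6. merge two RDF models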
Architecture

• Interaction layer (Node.js, JavaScript): multimodal knowledge graph display
• Business layer (Apache Jena): selects the corresponding entity information and its multimodal knowledge
• Data layer (AllegroGraph): stores the textual knowledge graph and the image knowledge graph
Online Website

• Website: www.Richpedia.cn
• Homepage: summarizes the multi-modal knowledge graph, Richpedia.
• Download: provides download links for the image files and the NT files.
• SPARQL: uses the AllegroGraph access interface; SPARQL queries over the data are issued as network requests.
• Query: directly queries images related to sights and people.
Online Website

• Tutorial: provides a tutorial on how to use the website.
• Ontology: provides information about the ontology.
• Resource: provides access addresses for all image resources.
• Bottom links: friendly links, a contact email address, the GitHub homepage, and the sharing agreement.
[Screenshots: Homepage; Download (download links); SPARQL (query example, query box, query results); Query (category selection, textual KG, visual KG); Tutorial; Ontology (OWL links); Resource (resource links); bottom links (friendly link, contact email, GitHub link, license, About Us).]
Thank you!
