
Multimodal Knowledge Graphs:

Construction, Inference, and Challenges

Meng Wang
Southeast University
[Timeline figure: milestones toward multimodal knowledge graphs]
• ESP Game (2004): labeling images with a computer game
• ImageNet, Visipedia (2009, 2010): fragments built upon the backbone of WordNet
• NEIL: Image Knowledge Miner (2013): automatically extracts visual knowledge by semi-supervised learning; discovers common-sense relationships
• IMGpedia (2015, 2017): Semantic Web approach — interlinking multimedia formats, applying Linked Data principles to multimedia
• Multimodal Knowledge Graph (2019): still many unsolved problems

• Why Multimodal?
• Construction
• Inference
• Challenges
Yoshua Bengio, NeurIPS Keynote, 2019: from System 1 DL to System 2 DL
Marvin Minsky, The Society of Mind, 1986: a framework for representing knowledge


Neural (System 1) vs. Symbolic (System 2) — neural networks vs. symbolism

[Figure: a pipeline from input (question) to goal (answer) combining neural and symbolic components.]

Cognitive theory: visual generalisation vs. symbolic generalisation.
Knowledge graph perspective: VQA, Commonsense QA, KBQA, and machine reading comprehension.
The two major schools: neural systems + symbolism.
Neural (System 1) methods are:
• powerful for some problems
• robust to data noise
• hard to understand or explain
• poor at symbol manipulation
• unclear in how to effectively use background knowledge

Symbolic (System 2) methods are:
• usually poor at machine learning problems
• intolerant of data noise
• easy for a human to understand and assess
• good at symbol manipulation
• designed to work with background knowledge

"Neural + Symbolic": The Development of Cognitive Reasoning from a Knowledge Graph Perspective. Communications of the CCF (《中国计算机学会通讯》), Vol. 16, No. 8, 2020.
Neural + Symbolic methods aim to be:
• a powerful machine learning paradigm
• robust to data noise
• easy for humans to understand and assess
• good at symbol manipulation
• able to work seamlessly with background knowledge

Multimodal Knowledge Graph?

[Figure: word cloud bridging the two schools. Neural: deep learning, meta learning, reinforcement learning, GNNs, representation learning. Symbolic: knowledge representation, databases, knowledge graphs, reasoning, semantic search, QA, recommendation.]
1. Multimodal is complementary.
2. More cross-modal relations, more details, and more answers.
3. Cross-modal disambiguation: heterogeneous in modality, but correlated in semantics (using multimodal knowledge to resolve ambiguity).
1. Multimodal is complementary

Multimodal knowledge: knowledge in different modalities describes the same thing! What is extracted is knowledge, not features!

[Figure: the same fact expressed as an image, as text, and as a KG.]
2. More cross-modal relations, more details, and more answers

Cross-modal entity grounding; VQA; complicated scene understanding. Example: "Yao Ming and Tracy McGrady of the Houston Rockets visit Beijing in 2004."
3. Cross-modal disambiguation: heterogeneous in modality, but correlated in semantics

Visual entity disambiguation — example result: Yao Ming (100), Will Smith (39), Mei Lanfang (36).

Textual entity disambiguation — example: "Liu Huan was spotted at a US supermarket buying $8 bread and readily signed autographs for fans."

Enhance KG to Multimodal KG

[Figure: from detection to a multimodal KG. Detected scene graph: Man1 (Yao Ming) holds Ball (Spalding), stands next to Man2 (McGrady), and is beside/at Rostrum (Tian'anmen); Man2 wears Jacket (Houston Rockets). Pipeline: detection → visual relation detection → cross-modal entity linking → application.]
1. Enhance KG to Multimodal KG

Images in KGs are very sparse, and visual relations are missing.


1. Enhance KG to Multimodal KG

[Figure: the Richpedia pipeline. Data collection (KG source: Wikipedia; image sources: Bing, Google, Yahoo) → image processing (clustering, diversity detection, feature extraction) → relation discovery (e.g., rpo:sameAs relations among images of cities, sights, and people).]

[Figure: the visual relation ontology (RPO). rpo:Image carries rpo:pixel (xsd:string), rpo:height (xsd:float), and rpo:width (xsd:float), and is linked to rpo:KGentity via rpo:Imageof; images are related to each other by rpo:sameAs and rpo:contain. rpo:ImageSimilarity connects a rpo:sourceImage and a rpo:targetImage with a rpo:Similarity score (xsd:float). Each image has descriptors (rpo:Descriptor, with rpo:value as xsd:string, linked by rpo:Describes), whose rpo:DescriptorType has subclasses rpo:GHD, rpo:CLD, rpo:CM, rpo:GLCM, rpo:HOGS, and rpo:HOGL.]


1. Enhance KG to Multimodal KG
Richpedia is a graph G = (E, R), where E is the set of entities and R the set of relations.

Motivation: KG entities have few images.


• Issue 1: noise-containing images
  • Method: K-means clustering
  • Implementation: noisy images appear as outliers within the clusters, and the outliers are removed

• Issue 2: high similarity among collected images
  • Method: a diversity detection algorithm
  • Implementation: image similarity is measured by histogram intersection over descriptor bins,

    $\mathrm{sim}(e_i, e_j) = \sum_{k=1}^{23} \min\big(H_k(e_i),\, H_k(e_j)\big)$

    First choose the center of mass, then choose the point farthest from the center of mass, and then repeatedly select the point farthest from the points already chosen.

• Image collection: gather image resources from Wikipedia, Bing, Yahoo, and Google. A minimal sketch of both filtering steps follows.
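Purely for illustration, a Python sketch of the two steps, assuming images are already encoded as feature vectors; the cluster count and outlier quantile are illustrative choices, not the Richpedia settings:

    import numpy as np
    from sklearn.cluster import KMeans

    def remove_outliers(features, n_clusters=5, keep_ratio=0.9):
        # Issue 1: noise shows up as outliers, i.e. points far from their centroid
        km = KMeans(n_clusters=n_clusters, n_init=10).fit(features)
        dist = np.linalg.norm(features - km.cluster_centers_[km.labels_], axis=1)
        return features[dist <= np.quantile(dist, keep_ratio)]

    def diversity_select(features, k=10):
        # Issue 2: farthest-point selection, as described above
        center = features.mean(axis=0)
        chosen = [int(np.argmax(np.linalg.norm(features - center, axis=1)))]
        while len(chosen) < k:
            d = np.min([np.linalg.norm(features - features[i], axis=1)
                        for i in chosen], axis=0)
            chosen.append(int(np.argmax(d)))  # farthest from everything chosen so far
        return chosen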
1. Enhance KG to Multimodal KG
Richpedia is a graph G = (E, R), where E is the set of entities and R the set of relations.

For each image, we generate five descriptors:

• rpo:GHD (Gradation Histogram Descriptor)
• rpo:CLD (Color Layout Descriptor)
• rpo:CMD (Color Moment Descriptor)
• rpo:GLCM (Gray-Level Co-occurrence Matrix)
• rpo:HOG (Histogram of Oriented Gradients)

[Figure: a source image and its ten nearest neighbours (1-NN through 10-NN) under these descriptors.]
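As a sketch, three of these descriptors can be computed with scikit-image as below; the bin counts and GLCM parameters are illustrative choices, not the exact Richpedia settings:

    import numpy as np
    from skimage.color import rgb2gray
    from skimage.feature import hog, graycomatrix, graycoprops

    def descriptors(rgb):
        gray = rgb2gray(rgb)                      # RGB image, values in [0, 1]
        # GHD: normalised histogram over gray levels
        ghd, _ = np.histogram(gray, bins=32, range=(0.0, 1.0))
        ghd = ghd / ghd.sum()
        # GLCM: texture statistics from the gray-level co-occurrence matrix
        levels = (gray * 255).astype(np.uint8)
        glcm = graycomatrix(levels, distances=[1], angles=[0], levels=256, normed=True)
        texture = [graycoprops(glcm, p)[0, 0] for p in ("contrast", "homogeneity")]
        # HOG: histogram of oriented gradients over the whole image
        hog_vec = hog(gray, orientations=9, pixels_per_cell=(16, 16),
                      cells_per_block=(2, 2))
        return ghd, np.array(texture), hog_vec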


1. Enhance KG to Multimodal KG

Richpedia is a graph G = (E, R), where E is the set of entities and R the set of relations.

Example (wd:Q84, Name: London): two Richpedia images are both images of the same KG entity and are linked to each other, following the visual relation ontology:

<wd:Q84, wdt:P31, wd:Q515>  (London is an instance of city)
<rp:0000001, rpo:imageof, wd:Q84>
<rp:0000001, rpo:sameAs, rp:0000002>

The same triples can be emitted programmatically, as sketched below.
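For illustration, with Python's rdflib; the Richpedia namespace URIs below are assumptions for the sketch, not necessarily the official ones:

    from rdflib import Graph, Namespace

    WD = Namespace("http://www.wikidata.org/entity/")
    WDT = Namespace("http://www.wikidata.org/prop/direct/")
    RP = Namespace("http://richpedia.example.org/resource/")   # assumed URI
    RPO = Namespace("http://richpedia.example.org/ontology/")  # assumed URI

    g = Graph()
    g.add((WD.Q84, WDT.P31, WD.Q515))                  # London instance-of city
    g.add((RP["0000001"], RPO.imageof, WD.Q84))        # image 1 depicts London
    g.add((RP["0000001"], RPO.sameAs, RP["0000002"]))  # image 1 ~ image 2
    print(g.serialize(format="nt"))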
1. Enhance KG to Multimodal KG
Richpedia is a graph G = (E, R), where E is the set of entities and R the set of relations.

• Rule-based relation discovery: simple, but effective and efficient (a minimal sketch follows the rules below).

• Rule 1: If there is a single hyperlink in the description, the relation is discovered by a string-mapping algorithm between the keyword and the predefined relation ontology.
• Rule 2: If there are multiple hyperlinks in the description, cyclically process the KG entity corresponding to each hyperlink, reducing the situation to Rule 1.
• Rule 3: If there are no hyperlinks in the description, use a named entity recognition tool to find the KG entities, reducing the situation to Rules 1 and 2.
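Purely for illustration, a runnable Python sketch of this dispatch; the wiki-style markup convention, the toy NER fallback, and the keyword table are all assumptions, not the Richpedia implementation:

    import re

    # Illustrative stand-in for the predefined relation ontology: keyword -> relation
    RELATION_ONTOLOGY = {"next to": "rpo:nextTo", "inside": "rpo:contain"}

    def find_hyperlinks(description):
        # Rules 1 and 2: hyperlinks mark KG entities (here, [[...]] markup)
        return re.findall(r"\[\[([^\]]+)\]\]", description)

    def ner_entities(description):
        # Rule 3 fallback: a real system would call an NER tool here
        return [w for w in description.split() if w.istitle()]  # crude placeholder

    def discover_relations(description):
        entities = find_hyperlinks(description) or ner_entities(description)
        found = []
        for entity in entities:  # Rule 2: iterate, reducing each case to Rule 1
            for keyword, relation in RELATION_ONTOLOGY.items():
                if keyword in description:  # Rule 1: string mapping vs. ontology
                    found.append((entity, relation))
        return found

    print(discover_relations("The [[London Eye]] stands next to the [[River Thames]]."))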
1. Enhance KG to Multimodal KG

Richpedia.cn — we will introduce how to use this dataset later.

Meng Wang, Guilin Qi, Haofen Wang, and Qiushuo Zheng. Richpedia: A Large-Scale, Comprehensive Multi-Modal Knowledge Graph. Big Data Research, 2020.
Enhance KG to Multimodal KG — Visual Relation Detection

[Figure: the scene-graph pipeline again (detection → visual relation detection → cross-modal entity linking → application), now highlighting visual relation detection; the relation labels (hold, stand next to, beside/at, wear) are unbalanced and long-tailed.]
2. Visual Relation Detection
The aim of visual relation detection is to provide a comprehensive understanding of an image by describing all the objects within the scene and how they relate to each other.

Input: an image. Output: relation triples such as person-on-motorcycle, person-wear-helmet, motorcycle-has-wheel.

Representative approaches map the feature space to the relation space using statistical information and spatial and semantic cues:
• Visual relation detection with language priors, ECCV 2016
• VTransE, CVPR 2017
• Neural Motifs: Scene Graph Parsing with Global Context, CVPR 2018
2. Long-tail Visual Relation Detection

Our method vs. Learning to Compose Dynamic Tree Structures for Visual Contexts, CVPR 2019.
2. Long-tail Visual Relation Detection (initial attempt)

Feature sparsity can be alleviated with a feature-level attention mechanism, which achieves message passing from predicates or objects in different images (a minimal sketch follows).
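A minimal numpy sketch of such feature-level attention; the scaled dot-product scoring and the additive update are illustrative assumptions, not the exact model from the paper:

    import numpy as np

    def feature_attention(query, supports):
        # query: (d,) feature of a rare predicate/object;
        # supports: (n, d) features of predicates/objects from other images
        scores = supports @ query / np.sqrt(query.size)
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()                 # softmax over the supports
        return query + weights @ supports        # message passed into the query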
2. Long-tail Visual Relation Detection

Weitao Wang, Meng Wang, Sen Wang, Guodong Long, Lina Yao, and Guilin
Qi. One-Shot Learning for Long-Tail Visual Relation Detection. AAAI 2020.
2. Unbalanced Visual Relation Detection

[Figure: (a) nonstandard labels and (b) feature overlap. Low-frequency labels such as "pizza on a plate", "man on an elephant", "boy carrying surfboard" coexist with high-frequency counterparts "pizza on plate", "man on elephant", "boy on surfboard".]

Because such nonstandard labels exist, paying excessive attention to low-frequency visual relations hurts the performance of the scene graph generation model.
2. Unbalanced Visual Relation Detection (this year's work)

We use memory features to transfer high-frequency relation features to low-frequency relations.
2. Unbalanced Visual Relation Detection

• The visual relation memory is computed from per-class prototypes: the mean of each class's features over the training set.
• The directly observed features and the memory features are fused, enabling information exchange between the current relation and other relations.
• We also use statistical information (the relation distribution) from the training set to calibrate the model's results. A minimal sketch of the memory mechanism follows.
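A minimal numpy sketch of this memory mechanism; the dot-product read and the fusion weight alpha are illustrative assumptions, not the exact model from the paper:

    import numpy as np

    def prototypes(features, labels, n_classes):
        # Memory: per-class prototype = mean training feature of that relation class
        return np.stack([features[labels == c].mean(axis=0)
                         for c in range(n_classes)])

    def fuse(x, memory, alpha=0.5):
        # Read the memory with attention, then fuse with the observed feature x
        scores = memory @ x / np.sqrt(x.size)
        w = np.exp(scores - scores.max())
        w /= w.sum()
        return alpha * x + (1 - alpha) * (w @ memory)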
2. Unbalanced Visual Relation Detection

Our model achieves clear improvements on almost all relations (47 of 50).

Weitao Wang, Ruyang Liu, Meng Wang, Sen Wang, Xiaojun Chang, and Yang Chen. Memory-Based Network for Scene Graph with Unbalanced Relations. ACM MM 2020.
Enhance KG to Multimodal KG — Cross-modal Entity Linking

[Figure: the scene-graph pipeline again, now highlighting cross-modal entity linking, which moves from classes to entities (e.g., Man1 → Yao Ming, Jacket → Houston Rockets, Rostrum → Tian'anmen).]
3. Cross-modal Entity Linking (entity linking)

Approach: learning to rank with modal attention.
3. Cross-modal Entity Linking

[Figure: detected objects (Man1, Man2, Ball, Rostrum, Jacket) and their relations (hold, beside/at, wear) are linked to KG entities.]
3. Cross-modal Entity Linking

Qiushuo Zheng, Meng Wang, Guilin Qi, and Chaoyu Bai. Semantic Visual Entity Linking Based on Multimodal Learning. AAAI 2021 (submitted).
• Why Multimodal?
• Construction
• Inference
• Challenges
Multi-modal KG Completion

Logan IV, R. L., Humeau, S., and Singh, S. Multimodal Attribute Extraction. NIPS, 2017.
Latest attempt: inference over multi-modal KGs — Multi-modal KG Completion

[Figure: input — an item's details (image and text); output — predicted attribute values, e.g., "three-person", "white/light-colored", "without".]
Multimodal machine translation developed earlier than today's multimodal KGs.

Is the multi-modal KG really helpful?
We need to understand systematically in which scenarios multimodal entity linking and entity alignment are needed, and in which scenarios adding them merely introduces noise!

Adversarial Evaluation of Multimodal Machine Translation. EMNLP 2018. (CCF B)
This work questions whether multimodal information is useful and systematically characterizes the scenarios that need it: visual context is only needed for incorrect, ambiguous, and gender-neutral words.


Is the multi-modal KG really helpful?

Probing the Need for Visual Context in Multimodal Machine Translation. NAACL 2019. (CCF B)

"This dominance effect corroborates the seminal work of Colavita (1974) in Psychophysics where it has been demonstrated that visual stimuli dominate over the auditory stimuli when humans are asked to perform a simple audiovisual discrimination task."

This work verifies at a finer granularity where multimodal information is useful: in fact about 90% of the scenarios rely on image information and only 10% on text. This raises the question of whether today's design — text-based multimodal KGs with images as an auxiliary signal — is the right one.
Is the multi-modal KG really helpful?

Realistic Re-evaluation of Knowledge Graph Completion Methods: An Experimental Study. SIGMOD 2020. (CCF A)

A Benchmarking Study of Embedding-based Entity Alignment for Knowledge Graphs. PVLDB 2020. (CCF A)

These works systematically investigate in which scenarios KG entity alignment works well, and in which it instead introduces noise.
Is the multi-modal KG really helpful?

Cross-modal Entity Linking???

Is multi-modal information only needed in very specific cases for KGs?
• Why Multimodal?
• Construction
• Inference
• Challenges
Challenges

• Parsing text to a structured semantic graph
• Parsing images/videos to structures
• Grounding events/entities across modalities
• Annotation cost and limited training data (domain-specific)
• Computational complexity
• Limited, fixed vocabulary
• Abstract concepts that are not groundable

Three steps:
1. Graph-structure extraction from each modality (requires fine-grained entity construction and computational resources)
2. Grounding the graphs across modalities (requires grounding different concept graphs — entity alignment?)
3. Building the multimodal graph

(Credit: Shih-Fu Chang, Alireza Zareian, Hassan Akbari, Brian Chen, Columbia University; Heng Ji, Spencer Whitehead, Manling Li, UIUC)
Real Challenges

• Multimodal data (does the scenario actually provide these data?)
  • KG
  • Text
  • Image or video

• Multimodal knowledge representation (a more general representation framework is needed, covering heterogeneous data, spatio-temporal data, events, and rules — and above all, what kinds of associations to model)
  • Multimodal
  • Spatial-temporal
  • Events
  • Rules

• Multimodal representation learning (design pre-trained models to organize heterogeneous knowledge)
  • Pre-trained models for multimodal KGs
  • Cross-modal alignment
  • Computing and storage capacity
Richpedia Demo Tutorial
RDF
• The RDF data model is similar to classic conceptual modeling approaches such as entity-relationship or class diagrams, because it represents statements about resources in subject-predicate-object form.

• In RDF terminology these statements are called triples, e.g., <grass, color, green> and <Yao Ming, has_wife, Ye Li>.
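For the tutorial, a minimal Python sketch of these two triples with rdflib; the example.org namespace is an illustrative assumption:

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/")
    g = Graph()
    g.add((EX.grass, EX.color, Literal("green")))   # <grass, color, green>
    g.add((EX.Yao_Ming, EX.has_wife, EX.Ye_Li))     # <Yao Ming, has_wife, Ye Li>
    print(g.serialize(format="turtle"))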
Architecture

• Interaction layer (Node.js, JavaScript): multimodal knowledge graph display
• Business layer (Apache Jena): selects the corresponding entity information and its multimodal knowledge
• Data layer (AllegroGraph): stores the textual knowledge graph and the image knowledge graph
AllegroGraph
• AllegroGraph is a modern, high-performance, durable graph database.

• AllegroGraph uses disk-based storage with efficient memory utilization.

• AllegroGraph supports SPARQL, RDFS++, and Prolog.

• AllegroGraph turns mathematical logic into a programming facility that can be used from many client programs.
Using AllegroGraph

• The query function of the online multi-modal knowledge graph platform uses the interface of the AllegroGraph server deployed on a dedicated port. An anonymous, read-only repository is set up so that data can be queried securely without exposing server administrator credentials. A sketch of such a query follows.
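As a sketch, such an anonymous read-only repository can be queried with a standard SPARQL-over-HTTP request in Python; the endpoint URL and namespace prefixes below are illustrative assumptions:

    import requests

    ENDPOINT = "http://richpedia.example.org:10035/repositories/richpedia"
    QUERY = """
    PREFIX wd:  <http://www.wikidata.org/entity/>
    PREFIX rpo: <http://richpedia.example.org/ontology/>
    SELECT ?image WHERE { ?image rpo:imageof wd:Q84 . } LIMIT 10
    """

    resp = requests.get(ENDPOINT, params={"query": QUERY},
                        headers={"Accept": "application/sparql-results+json"})
    for row in resp.json()["results"]["bindings"]:
        print(row["image"]["value"])  # IRI of each image of London (wd:Q84)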
Using AllegroGraph

[Screenshot: backend AllegroGraph query results.]
Architecture

• Interaction layer (Node.js, JavaScript): multimodal knowledge graph display
• Business layer (Apache Jena): selects the corresponding entity information and its multimodal knowledge
• Data layer (AllegroGraph): stores the textual knowledge graph and the image knowledge graph
Jena

• Apache Jena is an open-source Java framework dedicated to Semantic Web ontology operations. It provides RDF and SPARQL APIs to query and modify ontologies and to perform ontology reasoning, and it provides TDB and Fuseki to store and manage triples. There are two versions, Jena1 and Jena2.

• The tools provided by Jena1 include:

1. An RDF/XML parser
2. A query language
3. An I/O model for N3, N-Triples, and RDF/XML output
Jena Architecture

[Figure: Jena's three layers — the Model layer on top of the EnhGraph layer on top of the Graph layer.]
Using Jena

• First configure the environment on the machine (set the required environment variables). The main functions Jena provides are:

1. Writing RDF
2. Storing RDF as an XML file
3. Reading RDF from an XML file
4. Querying a given URI: model.getResource(URI).getProperty()...
5. Querying any S-P-O pattern: model.listSubjectsWithProperty(), model.listStatements()...
6. Processing an RDF model as a whole, including merging and splitting RDF models
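Jena itself is Java; purely to illustrate the same workflow, here is the equivalent flow in Python with rdflib (example.org namespace and file name are assumptions):

    from rdflib import Graph, Namespace

    EX = Namespace("http://example.org/")            # illustrative namespace
    g = Graph()
    g.add((EX.Yao_Ming, EX.has_wife, EX.Ye_Li))      # 1. write RDF
    g.serialize("out.rdf", format="xml")             # 2. store RDF as an XML file
    g2 = Graph().parse("out.rdf", format="xml")      # 3. read RDF from an XML file
    print(g2.value(EX.Yao_Ming, EX.has_wife))        # 4. query a URI's property
    for s, p, o in g2.triples((None, EX.has_wife, None)):
        print(s, p, o)                               # 5. S-P-O pattern query
    merged = g + g2                                  # 6. merge two RDF models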
Architecture

• Interaction layer (Node.js, JavaScript): multimodal knowledge graph display
• Business layer (Apache Jena): selects the corresponding entity information and its multimodal knowledge
• Data layer (AllegroGraph): stores the textual knowledge graph and the image knowledge graph
Online Website

• Website: www.Richpedia.cn
• Homepage: summarizes the multi-modal knowledge graph, Richpedia.
• Download: provides download links for the image files and the NT files.
• SPARQL: uses the AllegroGraph access interface; SPARQL queries over the data are issued as network requests.
• Query: directly queries images related to sights and people.
Online Website

• Tutorial: provides a tutorial on how to use the website.
• Ontology: provides information about the ontology.
• Resource: provides access addresses for all image resources.
• Bottom links: friendly links, a contact email address, the GitHub homepage, and the sharing agreement.
[Screenshots: Homepage; Download (download links); SPARQL (query example, query box, query results); Query (category selection, textual KG, visual KG); Tutorial; Ontology (OWL links); Resource (resource links); bottom links (friendly link, contact email, GitHub link, license, About Us).]
Thank you!
