StatSnowball
: a Statistical Approach to Extracting EntityRelationships
∗
Jun Zhu
†
⋆
Zaiqing Nie
‡
Xiaojing Liu
§
Bo Zhang
†
⋆
Ji-Rong Wen
‡†
Dept. of Comp. Sci. & Tech.Tsinghua UniversityBeijing, 100084 P.R China
{jun-zhu@mails,dcszb@mail}.thu.edu.cn
‡
Microsoft Research AsiaNo. 49 Zhichun RoadBeijing, 100080 P.R China
{znie,jrwen}@microsoft.com
§
Dept. of EEISUniversity of Sci. & Tech. ofChinaHefei, 230027 P.R China
xiaojiangliu84@hotmail.com
ABSTRACT
Traditional relation extraction methods require pre-specifiedrelations and relation-specific human-tagged examples. Boot-strapping systems significantly reduce the number of train-ing examples, but they usually apply heuristic-based meth-ods to combine a set of strict hard rules, which limit theability to generalize and thus generate a low recall. Further-more, existing bootstrapping methods do not perform openinformation extraction (Open IE), which can identify var-ious types of relations without requiring pre-specifications.In this paper, we propose a statistical extraction frameworkcalled
Statistical Snowball
(StatSnowball), which is a boot-strapping system and can perform both traditional relationextraction and Open IE.StatSnowball uses the discriminative Markov logic net-works (MLNs) and softens hard rules by learning their weightsin a maximum likelihood estimate sense. MLN is a generalmodel, and can be configured to perform different levels of relation extraction. In StatSnwoball, pattern selection isperformed by solving an
ℓ
1
-norm penalized maximum like-lihood estimation, which enjoys well-founded theories andefficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small butfully labeled data set and large-scale Web data. Empiricalresults show that StatSnowball can achieve a significantlyhigher recall without sacrificing the high precision during it-erations with a small numberof seeds, and the joint inferenceof MLN can improve the performance. Finally, StatSnowballis efficient and we have developed a working entity relationsearch engine called
Renlifang
based on it.
Categories and Subject Descriptors
I.5.1 [
Pattern Recognition
]: Models - Statistical
General Terms
Algorithms, Experimentation
∗
This work was done when Jun Zhu and Xiaojiang Liu werevisiting Microsoft Research Asia.
⋆
Jun and Bo are also with the State Key Laboratory of Intel-ligent Technology and Systems; and the Tsinghua NationalLaboratory for Information Science and Technology.
Copyright is held by the International World Wide Web Conference Com-mittee (IW3C2). Distribution of these papers is limited to classroom use,and personal use by others.
WWW 2009,
April 20–24, 2009, Madrid, Spain.ACM 978-1-60558-487-4/09/04.
Keywords
Relation extraction, statistical model, Markov logic networks
1. INTRODUCTION
The World Wide Web has been growing rapidly as a hugeinformation repository, which contains various kinds of valu-able semantic information about real-world entities, such aspeople, organizations, and locations. We have been work-ing on object-level search engines, which automatically ex-tract and integrate the semantic information about entitiesand return a list of ranked entities instead of webpages toanswer user queries [18]. In object-level search engines, itis particularly important to mine entity relations from theWeb to automatically build an entity relationship graph tolink all the extracted information together. With the en-tity relationship graph, users will be able to easily navigatethrough their interested entities just like the way they nav-igate through hyperlinked webpages. This paper focuses onentity relation mining from the Web.
1.1 Motivating Example
We have deployed an entity relationship search engine inChinese search market called
Renlifang
1
. Renlifang is a dif-ferent kind of search engine we have been building, one thatexplores relationships between entities. In Renlifang, userscan query the system about people, locations, and organi-zations and explore their relationships. Currently Renlifangonly serves in the Chinese language domain. These entitiesand their relationships are automatically mined from billionsof crawled Chinese webpages. For each crawled webpage inRenlifang, the system extracts entity information and de-tects relationships, covering a spectrum of everyday indi-viduals and well-known people, locations, or organizations.Below we list the key features of Renlifang:
Entity Relationship Mining and Navigation
. Renli-fang enables users to explore highly relevant informationduring searches to discover interesting relationships aboutentities associated with their queries.
Finding Expertise
. Renlifang can return a ranked list of people known for dancing or any other topic.
Web-Prominence Ranking
. Renlifang detects the pop-ularity of an entity and enables users to browse entities indifferent categories ranked by their prominence on the Webduring a given time period.
1
http://renlifang.msra.cn
WWW 2009 MADRID!Track: Data Mining / Session: Statistical Methods
101