Read without ads and support Scribd by becoming a Scribd Premium Reader.
 
StatSnowball 
: a Statistical Approach to Extracting EntityRelationships
Jun Zhu
Zaiqing Nie
Xiaojing Liu
§
Bo Zhang
Ji-Rong Wen
Dept. of Comp. Sci. & Tech.Tsinghua UniversityBeijing, 100084 P.R China
{jun-zhu@mails,dcszb@mail}.thu.edu.cn
Microsoft Research AsiaNo. 49 Zhichun RoadBeijing, 100080 P.R China
{znie,jrwen}@microsoft.com
§
Dept. of EEISUniversity of Sci. & Tech. ofChinaHefei, 230027 P.R China
xiaojiangliu84@hotmail.com
ABSTRACT
Traditional relation extraction methods require pre-specifiedrelations and relation-specific human-tagged examples. Boot-strapping systems significantly reduce the number of train-ing examples, but they usually apply heuristic-based meth-ods to combine a set of strict hard rules, which limit theability to generalize and thus generate a low recall. Further-more, existing bootstrapping methods do not perform openinformation extraction (Open IE), which can identify var-ious types of relations without requiring pre-specifications.In this paper, we propose a statistical extraction frameworkcalled
Statistical Snowball 
(StatSnowball), which is a boot-strapping system and can perform both traditional relationextraction and Open IE.StatSnowball uses the discriminative Markov logic net-works (MLNs) and softens hard rules by learning their weightsin a maximum likelihood estimate sense. MLN is a generalmodel, and can be configured to perform different levels of relation extraction. In StatSnwoball, pattern selection isperformed by solving an
1
-norm penalized maximum like-lihood estimation, which enjoys well-founded theories andefficient solvers. We extensively evaluate the performance of StatSnowball in different configurations on both a small butfully labeled data set and large-scale Web data. Empiricalresults show that StatSnowball can achieve a significantlyhigher recall without sacrificing the high precision during it-erations with a small numberof seeds, and the joint inferenceof MLN can improve the performance. Finally, StatSnowballis efficient and we have developed a working entity relationsearch engine called
Renlifang 
based on it.
Categories and Subject Descriptors
I.5.1 [
Pattern Recognition
]: Models - Statistical
General Terms
Algorithms, Experimentation
This work was done when Jun Zhu and Xiaojiang Liu werevisiting Microsoft Research Asia.
Jun and Bo are also with the State Key Laboratory of Intel-ligent Technology and Systems; and the Tsinghua NationalLaboratory for Information Science and Technology.
Copyright is held by the International World Wide Web Conference Com-mittee (IW3C2). Distribution of these papers is limited to classroom use,and personal use by others.
WWW 2009,
April 20–24, 2009, Madrid, Spain.ACM 978-1-60558-487-4/09/04.
Keywords
Relation extraction, statistical model, Markov logic networks
1. INTRODUCTION
The World Wide Web has been growing rapidly as a hugeinformation repository, which contains various kinds of valu-able semantic information about real-world entities, such aspeople, organizations, and locations. We have been work-ing on object-level search engines, which automatically ex-tract and integrate the semantic information about entitiesand return a list of ranked entities instead of webpages toanswer user queries [18]. In object-level search engines, itis particularly important to mine entity relations from theWeb to automatically build an entity relationship graph tolink all the extracted information together. With the en-tity relationship graph, users will be able to easily navigatethrough their interested entities just like the way they nav-igate through hyperlinked webpages. This paper focuses onentity relation mining from the Web.
1.1 Motivating Example
We have deployed an entity relationship search engine inChinese search market called
Renlifang 
1
. Renlifang is a dif-ferent kind of search engine we have been building, one thatexplores relationships between entities. In Renlifang, userscan query the system about people, locations, and organi-zations and explore their relationships. Currently Renlifangonly serves in the Chinese language domain. These entitiesand their relationships are automatically mined from billionsof crawled Chinese webpages. For each crawled webpage inRenlifang, the system extracts entity information and de-tects relationships, covering a spectrum of everyday indi-viduals and well-known people, locations, or organizations.Below we list the key features of Renlifang:
Entity Relationship Mining and Navigation
. Renli-fang enables users to explore highly relevant informationduring searches to discover interesting relationships aboutentities associated with their queries.
Finding Expertise
. Renlifang can return a ranked list of people known for dancing or any other topic.
Web-Prominence Ranking
. Renlifang detects the pop-ularity of an entity and enables users to browse entities indifferent categories ranked by their prominence on the Webduring a given time period.
1
http://renlifang.msra.cn
WWW 2009 MADRID!Track: Data Mining / Session: Statistical Methods
101
 
Figure 1: An entity relationship graph for the query“Bill Gates” generated by EntityCube (i.e. the En-glish version of Renlifang).People Bio Ranking
. Renlifang ranks text blocks fromwebpages by the likelihood of their being biography blocks.Renlifang has been well received by Chinese Internet usersand mainstream media in China (including CCTV and PhoenixTV) with positive comments and millions of daily page-viewsduring the peak days. The English version of Renlifang iscalled EntityCube, and it is currently under development
2
.In Figure 1, we show an automatically generated entity re-lationship graph using our English Renlifang prototype.Based on the overwhelming response from Chinese In-ternet users of the Renlifang entity relationship search, wefound that automatically extracting a large number of highlyaccurate entity relations is important to improve perfor-mance. However, existing work on entity relation extrac-tion in the literature could not meet the requirements of aWeb-scale entity relationship search engine.Relation extraction has been promoted by the MessageUnderstanding Conference (MUCs) and Automatic ContentExtraction (ACE) program. The task has been tradition-ally studied so as to extract
predefined 
semantic relationsbetween pairs of entities in text, e.g., in supervised learn-ing methods [27, 8, 9, 26] and bootstrapping systems [5, 1].Supervised methods require a set of relation-specific humantagged examples to learn an extractor. Labeling trainingexamples is tedious and labor intensive, thus, it makes su-pervised methods difficult to apply to Web-scale applicationslike Renlifang. Bootstrapping methods [5, 1, 7] significantlyreduce the number of training examples by iteratively dis-covering extraction patterns and identifying entity relationswith a small number of seeds, either target relation tuples [1]or general extraction templates [7]. Take the well-knownSnowball [1], which serves as the basis of our proposed ap-proach, as an example. Snowball takes a small set of seedtuples as inputs, and employs the pattern-entity duality [5]to iteratively generate extraction patterns and identify newrelation tuples. From the generated patterns and identifiedtuples, some confidence measures are carefully crafted to se-lect good ones and add them to Snowball as new knowledge.Evaluating patterns and tuples is one key component, since
2
www.entitycube.comit is crucial to select good patterns and good new seed tuplesto make sure the system will not be drifted by errors.Although the bootstrapping architecture is promising, Snow-ball has at least two obvious limitations, which make itunsuitable for the Web-scale relation extraction as moti-vated by Renlifang. First, since the target of Snowball isto extract a specific type of relation (e.g., companies andtheir headquarters) the extraction patterns in Snowball aremainly based on strict keyword-matching. Although thesepatterns can identify highly accurate results, the recall willbe limited. Second, Snowball does not have an elegant evalu-ation measure, such as the probability/likelihood of a prob-abilistic model, to evaluate generated patterns. The care-fully crafted measures and pattern selection criteria are notdirectly adaptable to general patterns (e.g., POS tag se-quences), which can significantly improve the recall as shownin our empirical studies. This is because many tuples ex-tracted by a general pattern are more likely not to be thetarget relations of Snowball, although they can be othertypes of relations. In this case, the confidence scores willbe very small, and it is inappropriate to use the criteria asused in Snowball to select these patterns.In this paper, we address these issues as suffered by Snow-ball to improve the recall while keeping a high precision. Wepresent a system called
Statistical Snowball 
(StatSnowball).StatSnowball adopts the bootstrapping architecture and ap-plies the recently developed feature selection method using
1
-norm [25, 11] to select extraction patterns—both keywordmatching and general patterns. Starting with a handful setof initial seeds, it iteratively generates new extraction pat-terns; performs an
1
-norm regularized maximum likelihoodestimation (MLE) to select good patterns; and extracts newrelation tuples. StatSnowball is a general framework and thestatistical model can be any probabilistic model. StatSnow-ball uses the general discriminative Markov logic networks(MLN) [21], which subsume logistic regression (LR) and con-ditional random fields (CRF) [15]. Discriminative modelscan incorporate arbitrary useful features without strong in-dependence assumptions as made in generative models, likena¨ıve Bayes (NB) and Hidden Markov Models (HMM).By incorporating general patterns, StatSnowball can per-form both traditional relation extraction like Snowball to ex-tract pre-specified relations and open information extraction(Open IE) [3] to identify general types of relations. Open IEis a novel domain-independent extraction paradigm, whichhas been studied in both the natural language documentcorpus [22] and the Web environment [3]. Although the ex-isting Open IE systems are self-supervised, they require aset of human-selected features in order to learn a good ex-tractor. In contrast, StatSnowball automatically generatesand selects the extraction patterns. Moreover, the Open IEsystems require expensive deep linguistic parsing techniquesto correctly label training samples, while StatSnwoball onlyuses cheaper and more robust shallow parsing techniquesto generate its patterns. Finally, by using the MLN model,StatSnowball can perform joint inference, while the O-CRFs[4] treat sentences independently. Our empirical studiesdemonstrate the promise of using joint inference.To the best of our knowledge, StatSnwoball is the firstworking system that takes a bootstrapping architecture andapplies the well-developed
1
-norm regularized MLE to in-crementally identify entity relations. Specifically, we makethe following contributions:
WWW 2009 MADRID!Track: Data Mining / Session: Statistical Methods
102
 
(a) We present a novel framework called StatSnowball byusing
1
-regularized feature selection to incrementallydiscover extraction patterns and identify relation tuples.Compared to the closely related Snowball, StatSnowballcontains the following advantages:i StatSnowball can perform both traditional relationextraction as in Snowball and Open IE.ii The probabilistic foundation provides StatSnowballa principled approach to evaluating and selectingpatterns. In Snowball, however, the confidence mea-sures lack an elegant interpretation and they are dif-ficult to be applied to general patterns.iii StatSnowball automatically learns theweights of gen-erated patterns, which are represented as formulaein MLN, while Snowball applies some heuristic rulesto assign the weights.iv StatSnowball can be easily extended. In current im-plementation, StatSnowball takes the same strategyas Snowball to separately identify named entities andentity relations. StatSnowball can easily integratethese two tasks into one probabilistic model and thuscan achieve a higher performance by exploring theirmutual enhancements. The promise of integratedextraction has been shown in different applications[20, 28]. But the heuristic-based Snowball is hardto be coherently integrated with the state-of-the-artstatistical named entity extraction systems.(b) We extensively evaluate StatSnowball and empiricallyshow that StatSnowball can achieve a significantly higherprecision and recall on large scale Web data.(c) StatSnowball is efficient and we have developed a work-ing Entity Relation Search Engine based on it.The rest of the paper is structured as follows. Section 2briefly overviews the StatSnowball system. Section 3 presentsthe statistical model (i.e., Markov logic networks), includ-ing some speeding up techniques. Section 4 presents how togenerate and select good patterns in StatSnowball. Section5 presents our empirical results. Section 6 discusses somerelated work, and Section 7 concludes this paper, with somefuture work discussed.
2. OVERVIEW OF STATSNOWBALL
In this section, we briefly overview the framework of Stat-Snowball. Different components will be explained in incom-ing sections.The task of StatSnowball is to identify relation tuples.Each relation tuple can be among several entities. Here, weonly focus on the binary relationship, and an extraction is atuple (
e
i
,e
j
,key
)
i
=
j
, where
e
i
and
e
j
are two entities and
key
is a set of keywords that indicate the relationship. Likemany other relation extraction systems [1, 8, 9], we assumethat the entities are given and focus on how to detect (i.e.,decide whether a relationship exists between two entities)and categorize (i.e., assign relation keywords to a detectedrelationship) the relationships.Figure 2 shows the architecture of StatSnowball. Gener-ally, StatSnowball has three parts. The first part P
1
is theinput, which contains a set of seeds and an initial model.The seeds are not required to contain relation keywords that
(e
5
, e
6
, key
1
)(e
3
, e
4
, key
2
)...(on-line)learn model(on-line)inference/extractiongenerate patternsselect patternsrelation clustering(e
1
, e
2
, key)(e
3
, e
4
, ? )...Initial Model
augment seeds
P
1
P
2
P
3
Figure 2: The framework of StatSnowball, withthree parts – P
1
(input), P
2
(statistical extractionmodel), and P
3
(output).
indicate the relationship. Thus, we have two types of seeds,i.e., seeds with relation keywords like (
e
1
,e
2
,key
) or seedswithout relation keywords like (
e
3
,e
4
,
?). If the initial modelis empty, we will first use the seeds to generate extractionpatterns in order to start the process.The second part P
2
is the statistical extraction model.To start the iterative extraction process, StatSnowball takesthe input seeds and the initial model (can be empty) in P
1
to learn an extractor. We apply the
2
-norm regularizedmaximum likelihood estimation (MLE) at this step. On-line learning is an alternative if batch learning is expensive.Then, StatSnowball uses the learned model to extract newrelation tuples on the data corpus. The third step in P
2
is to generate extraction patterns with the newly identifiedrelation tuples. These patterns are used to compose formu-lae of MLN. Finally, it selects good formulae to add to theprobabilistic model and re-train the model. In this step,we first do
1
-norm regularized MLE, which will set someformulae’s weights to zeros. Then, we remove these zero-weighted formulae and send the resultant model to the nextstep for re-training. StatSnowball iteratively performs thesefour steps until no new extraction tuples are identified orno new patterns are generated. In this part, an optionalcomponent is the augmenting seeds, which can be used tofind more seeds to start the process. In order to get highquality training seeds, this component applies strict keywordmatching rules. We do not use it in the current system.The third part P
3
is the output, which is necessary onlywhen StatSnowball is configured to do Open IE [3]. WhenStatSnowball performs Open IE, the extraction results in P
2
are general relation tuples. To make the results more read-able, we can apply clustering methods to group the relationtuples and assign relation keywords to them. The missingkeywords of the seeds can be filled in this part. Currently,we treat it as a post-processing step. Recent work on re-lational clustering with MLN in [14] suggests that we can
WWW 2009 MADRID!Track: Data Mining / Session: Statistical Methods
103
Search History:
Searching...
Result 00 of 00
00 results for result for
  • p.
  • More From This User

    Notes
    Load more