
Knowledge Graph and Corpus Driven Segmentation and Answer Inference for Telegraphic Entity-seeking Queries

Mandar Joshi∗, IBM Research, mandarj90@in.ibm.com
Uma Sawant, IIT Bombay and Yahoo Labs, uma@cse.iitb.ac.in
Soumen Chakrabarti, IIT Bombay, soumen@cse.iitb.ac.in

∗ Work done as a Masters student at IIT Bombay.

Abstract

Much recent work focuses on formal interpretation of natural question utterances, with the goal of executing the resulting structured queries on knowledge graphs (KGs) such as Freebase. Here we address two limitations of this approach when applied to open-domain, entity-oriented Web queries. First, Web queries are rarely well-formed questions. They are “telegraphic”, with missing verbs, prepositions, clauses, case and phrase clues. Second, the KG is always incomplete, unable to directly answer many queries. We propose a novel technique to segment a telegraphic query and assign a coarse-grained purpose to each segment: a base entity e1, a relation type r, a target entity type t2, and contextual words s. The query seeks an entity e2 ∈ t2 where r(e1, e2) holds, further evidenced by schema-agnostic words s. Query segmentation is integrated with the KG and an unstructured corpus where mentions of entities have been linked to the KG. We do not trust the best or any specific query segmentation. Instead, evidence in favor of candidate e2s is aggregated across several segmentations. Extensive experiments on the ClueWeb corpus and parts of Freebase as our KG, using over a thousand telegraphic queries adapted from TREC, INEX, and WebQuestions, show the efficacy of our approach. For one benchmark, MAP improves from 0.2–0.29 (competitive baselines) to 0.42 (our system). NDCG@10 improves from 0.29–0.36 to 0.54.

1 Introduction

A majority of Web queries mention an entity or type (Lin et al., 2012). To better support entity-oriented queries, commercial Web search engines are rapidly building up large catalogs of types, entities and relations, popularly called a “knowledge graph” (KG) (Gallagher, 2012). Despite these advances, robust, Web-scale, open-domain, entity-oriented search faces many challenges. Here, we focus on two.

1.1 “Telegraphic” queries

First, the surface utterances of entity-oriented Web queries are dramatically different from TREC- or Watson-style factoid question answering (QA), where questions are grammatically well-formed. Web queries are usually “telegraphic”: they are short, rarely use function words, punctuation or clausal structure, and use relatively flexible word orders. E.g., the natural utterance “on the bank of which river is the Hermitage Museum located” may be translated to the telegraphic Web query hermitage museum river bank. Even on well-formed question utterances, 50% of interpretation failures are contributed by parsing or structural matching failures (Kwiatkowski et al., 2013). Telegraphic utterances will generally be even more challenging.

Consequently, whereas TREC-QA/NLP-style research has focused on parsing and precise interpretation of a well-formed query sentence to a strongly structured (typically graph-oriented) query language (Kasneci et al., 2008; Pound et al., 2012; Yahya et al., 2012; Berant et al., 2013; Kwiatkowski et al., 2013), the Web search and information retrieval (IR) community has focused on telegraphic queries (Guo et al., 2009; Sarkas et al., 2010; Li et al., 2011; Pantel et al., 2012; Lin et al., 2012; Sawant and Chakrabarti, 2013). In terms of target schema richness, these efforts may appear more modest.

The act of query ‘interpretation’ is mainly a segmentation of query tokens by purpose. In the example above, one may report segments “Hermitage Museum” (a located artifact or named entity), and “river bank” (the target type). This is reminiscent of record segmentation in information extraction (IE). Over well-formed utterances, IE baselines are quite competitive (Yao and Van Durme, 2014). But here, we are interested exclusively in telegraphic queries.

1.2 Incomplete knowledge graph

The second problem is that the KG is always work in progress (Pereira, 2013), and connections found within nodes of the KG, between the KG and the query, or the KG and unstructured text, are often incomplete or erroneous. E.g., Wikipedia is considered tiny, and Freebase rather small, compared to what is needed to answer all but the “head” queries. Google’s Freebase annotations (Gabrilovich et al., 2013) on ClueWeb (ClueWeb09, 2009) number fewer than 15 per page to ensure precision. Fewer than 2% are to entities in Freebase but not in Wikipedia.

It may also be difficult to harness the KG for answering certain queries. E.g., answering the query fastest odi century batsman, the intent of which is to find the batsman holding the record for the fastest century in One Day International (ODI) cricket, may be too difficult for most KG-only systems, but may be answered quite effectively by a system that also utilizes evidence from unstructured text.

There is a clear need for a “pay-as-you-go” architecture that involves both the corpus and KG. A query easily served by a curated KG should give accurate results, but it is desirable to have a graceful interpolation supported by the corpus: e.g., if the relation r(e1, e2) is not directly evidenced in the KG, but strongly hinted at in the corpus, we still want to use this for ranking.

1.3 Our contributions

Here, we make progress beyond the above frontier of prior work in the following significant ways. We present a new architecture for structural interpretation of a telegraphic query into these segments (some may be empty):
• Mention(s) ê1 of an entity e1,
• Mention r̂ of a relation type r,
• Mention t̂2 of a target type t2, and
• Other contextual matching words s (sometimes called selectors),
with the simultaneous intent of finding and ranking entities e2 ∈ t2, such that r(e1, e2) is likely to hold, evidenced near the matching words in unstructured text.

Given the short, telegraphic query utterances, we limit our scope to at most one relation mention, unlike the complex mapping of clauses in well-formed questions to twig and join style queries (e.g., “find an actor whose spouse was an Italian bookwriter”). On the other hand, we need to deal with the unhelpful input, as well as consolidate the KG with the corpus for ranking candidate e2s. Despite the modest specification, our query template is quite expressive, covering a wide range of entity-oriented queries (Yih et al., 2014).

We present a novel discriminative graphical model to capture the entity ranking inference task, with query segmentation as a by-product. Extensive experiments with over a thousand entity-seeking telegraphic queries using the ClueWeb09 corpus and a subset of Freebase show that we can accurately predict the segmentation and intent of telegraphic relational queries, and simultaneously rank candidate responses with high accuracy. We also present evidence that the KG and corpus have synergistic salutary effects on accuracy.

§2 explores related work in more detail. §3 gives some examples fitting our query template, explains why interpreting some of them is nontrivial, and sets up notation. §4 presents our core technical contributions. §5 presents experiments. Data can be accessed at http://bit.ly/Spva49 and http://bit.ly/WSpxvr.

2 Related work

The NLP/QA community has traditionally assumed that question utterances are grammatically well-formed, from which precise clause structure, ground constants, variables, and connective relations can be inferred via semantic parsing (Kasneci et al., 2008; Pound et al., 2012; Yahya et al., 2012; Berant et al., 2013; Kwiatkowski et al., 2013) and translated to lambda expressions (Liang, 2013) or SPARQL-style queries (Kasneci et al., 2008), with elaborate schema knowledge. Such approaches are often correlated with the assumption that all usable knowledge has been curated into a KG. The query is first translated to a structured form and then “executed” on the KG.

Telegraphic query                          | ê1           | r̂            | t̂2           | s
first african american nobel prize winner | nobel prize  | winner       | -            | african american first
                                           | nobel prize  | -            | winner       | first african american
                                           | -            | -            | winner       | first african american nobel prize
dave navarro first band                    | dave navarro | band         | -            | first
                                           | dave navarro | band         | band         | first
merril lynch headquarters                  | merril lynch | headquarters | -            | -
                                           | merril lynch | -            | headquarters | -
spanish poet died in civil war             | spanish      | died in      | poet         | civil war
                                           | civil war    | died         | -            | spanish poet
                                           | spanish      | in           | poet         | died civil war
first american in space                    | -            | -            | -            | first american in space
                                           | -            | -            | american     | first, in space

Figure 1: Example queries and some potential segmentations.

A large corpus may be used to build relation expression models (Yao and Van Durme, 2014), but not as supporting evidence for target entities.

In contrast, the Web and IR community generally assumes a free-form query that is often telegraphic (Guo et al., 2009; Sarkas et al., 2010; Li et al., 2011). Queries being far more noisy, the goal of structure discovery is more modest, and often takes the form of a segmentation of the query regarded as a token sequence, assigning a broad purpose (Pantel et al., 2012; Lin et al., 2012) to each segment, mapping them probabilistically to a relatively loose schema, and ranking responses in conjunction with segmentations (Sawant and Chakrabarti, 2013). To maintain quality in the face of noisy input, these approaches often additionally exploit clicks (Li et al., 2011) or a corpus that has been annotated with entity mentions (Cheng and Chang, 2010; Li et al., 2010). The corpus provides contextual snippets for queries where the KG fails, preventing the systems from falling off the “structure cliff” (Pereira, 2013).

Our work advances the capabilities of the latter class of approaches, bringing them closer to the depth of the former, while handling telegraphic queries and retaining the advantage of corpus evidence over and above the KG. Very recently, Yao et al. (2014) have concluded that for current benchmarks, deep parsing and shallow information extraction give comparable interpretation accuracy. The very recent work of Yih et al. (2014) is similar in spirit to ours, but they do not unify segmentation and answer inference, along with corpus evidence, like we do.

3 Notation and examples

We use e1, r, t2, e2 to represent abstract nodes and edges (MIDs in the case of Freebase) from the KG, and ê1, r̂, t̂2 to represent their textual mentions or hints, if any, in the query. s is a set of uninterpreted textual tokens in the query that are used to match and collect corpus contexts that lend evidence to candidate entities.

Figure 1 shows some telegraphic queries with possible segmentations into the above parts. Consider another example: dave navarro first band. ‘Band’ is a hint for type /music/musical_group, so it comprises t̂2. Dave Navarro is an entity, with mention words ‘dave navarro’ comprising ê1. r̂ is made up of ‘band’, and represents the relation /music/group_member/membership. Finally, the word first cannot be mapped to any simple KG artifact, so it is relegated to s (which makes the corpus a critical part of answer inference). We use s and ŝ interchangeably.

Generally, there will be enough noise and uncertainty that the search system should try out several of the most promising segmentations, as shown in Figure 1. The accuracy of any specific segmentation is expected to be low in such adversarial settings. Therefore, support for an answer entity is aggregated over several segmentations. The expectation is that by considering multiple interpretations, the system will choose the entity with the best supporting evidence from the corpus and knowledge base.
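
To make the query template concrete in code, the following minimal sketch (plain Python; the dataclass, field names, and the placeholder MID string are ours for illustration, not part of the system described here) represents one candidate interpretation of dave navarro first band.

    from dataclasses import dataclass
    from typing import Optional, Tuple

    @dataclass(frozen=True)
    class Interpretation:
        # One candidate segmentation of a telegraphic query plus its KG grounding.
        e1_mention: Tuple[str, ...]    # ê1: tokens mentioning the base entity
        t2r_mention: Tuple[str, ...]   # tokens hinting at target type t2 and relation r
        selectors: Tuple[str, ...]     # s: uninterpreted tokens matched against the corpus
        e1: Optional[str] = None       # grounded base entity (a Freebase MID); None means null
        r: Optional[str] = None        # grounded relation type; None means null
        t2: Optional[str] = None       # grounded target type; None means null

    # The segmentation discussed above for "dave navarro first band".
    dave_navarro_band = Interpretation(
        e1_mention=("dave", "navarro"),
        t2r_mention=("band",),
        selectors=("first",),
        e1="MID_OF_DAVE_NAVARRO",      # placeholder; the actual MID is not listed in the text
        r="/music/group_member/membership",
        t2="/music/musical_group",
    )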
4 Our Approach

Telegraphic queries are usually short, so we enumerate query token spans (with some restrictions, similar to beam search) to propose segmentations (§4.1). Candidate response entities are lined up for each interpretation, and then scored in a global model along with query segmentations (§4.2). §4.3 describes how model parameters are trained.

1: input: query token sequence q
2: initialize segmentations I = ∅
3: E1 = (entity, mention) pairs from linker
4: for all (e1, ê1) ∈ E1 do
5:    assign label E1 to mention tokens ê1
6:    for all contiguous spans v ⊂ q \ ê1 do
7:       label each word w ∈ v as T2R
8:       label other words w ∈ q \ ê1 \ v as S
9:       add segments (E1, T2R, S) to I
10:   end for
11: end for
12: return candidate segmentations I

Figure 2: Generating candidate query segmentations.

4.1 Generating candidate query segmentations

Each query token can have four labels, E1, T2, R, S, corresponding to the mentions of the base entity, target type, connecting relation, and context words. We found that segments hinting at T2 and R frequently overlapped (e.g., ‘author’ in the query zhivago author). In our implementation, we simplified to three labels, E1, T2R, S, where tokens labeled T2R are involved with both t2 and r, the proposed structured target type and connecting relation. Another reasonable assumption was that the base entity mention and type/relation mentions are contiguous token spans, whereas context words can be scattered in multiple segments.

Figure 2 shows how candidate segmentations are generated. For step 3, we use TagMe (Ferragina and Scaiella, 2010), an entity linker backed by an entity gazette derived from our KG.
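
A direct, runnable rendering of Figure 2 might look as follows (a minimal Python sketch; the entity linker is stubbed out as a list of precomputed (entity, mention span) pairs, whereas the system described here obtains them from TagMe).

    def candidate_segmentations(query_tokens, linked_entities):
        """Enumerate (E1, T2R, S) labelings as in Figure 2.

        query_tokens:    list of query tokens
        linked_entities: list of (entity_id, (start, end)) mention spans from a linker
        Returns a list of dicts recording the token positions under each label.
        """
        n = len(query_tokens)
        segmentations = []
        for entity_id, (start, end) in linked_entities:
            e1_positions = set(range(start, end))
            rest = sorted(set(range(n)) - e1_positions)
            # Every contiguous span of the remaining tokens is a candidate T2R segment;
            # the empty span corresponds to a query with no type/relation hint.
            spans = [rest[i:j] for i in range(len(rest)) for j in range(i + 1, len(rest) + 1)]
            spans.append([])
            for span in spans:
                t2r_positions = set(span)
                if t2r_positions and max(t2r_positions) - min(t2r_positions) + 1 != len(t2r_positions):
                    continue   # enforce contiguity in the original token order
                segmentations.append({
                    "e1": entity_id,
                    "E1": e1_positions,
                    "T2R": t2r_positions,
                    "S": set(rest) - t2r_positions,
                })
        return segmentations

    # Example: candidate_segmentations(["dave", "navarro", "first", "band"],
    #                                  [("DaveNavarro", (0, 2))])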
4.2 Graphical model

Based on the previous discussion, we assume that an entity-seeking query q is a sequence of tokens q1, q2, ..., and this can be partitioned into different kinds of subsequences, corresponding to e1, r, t2 and s, and denoted by a structured (vector) labeling z = z1, z2, .... Given sequences q and z, we can separate out (possibly empty) token segments ê1(q, z), t̂2(q, z), r̂(q, z), and ŝ(q, z).

A query segmentation z becomes plausible in conjunction with proposals for e1, r, t2 and e2 from the KG. The probability Pr(z, e1, r, t2, e2 | q) is modeled as proportional to the product of several potentials (Koller and Friedman, 2009) in a graphical model. In subsequent subsections, we will present the design of specific potentials.
• ΨR(q, z, r) denotes the compatibility between the relation hint segment r̂(q, z) and a proposed relation type r in the KG (§4.2.1).
• ΨT2(q, z, t2) denotes the compatibility between the type hint segment t̂2(q, z) and a proposed target entity type t2 in the KG (§4.2.2).
• ΨE1,R,E2,S(q, z, e1, r, e2) is a novel corpus-based evidence potential that measures how strongly e1 and e2 appear in corpus snippets in the proximity of words in ŝ(q, z), and apparently related by relation type r (§4.2.3).
• ΨE1(q, z, e1) denotes the compatibility between the query segment ê1(q, z) and entity e1 that it purportedly mentions (§4.2.4).
• ΨS(q, z) denotes selector compatibility. Selectors are a fallback label, so this is pinned arbitrarily to 1; other potentials are balanced against this base value.
• ΨE1,R,E2(e1, r, e2) is A if the relation r(e1, e2) exists in the KG, and is B > 0 otherwise, for tuned/learnt constants A > B > 0. Note that this is a soft constraint (B > 0); if the KG is incomplete, the corpus may be able to supplement the required information.
• ΨE2,T2(e2, t2) is 1 if e2 belongs to t2 and zero otherwise. In other words, candidate e2s must be proposed to be instances of the proposed t2; this is a hard constraint, but can be softened if desired, like ΨE1,R,E2.

Figure 3 shows the relevant variable states as circled nodes, and the potentials as square factor nodes. To rank candidate entities e2, we pin the node E2 to each entity in turn. With E2 pinned, we perform a MAP inference over all other hidden variables and note the score of e2 as the product of the above potentials maximized over choices of all other variables:

    score(e2) = max_{z,t2,r,e1} ΨT2(q, z, t2) ΨR(q, z, r) ΨE1(q, z, e1) ΨS(q, z)
                ΨE2,T2(e2, t2) ΨE1,R,E2(e1, r, e2) ΨE1,R,E2,S(q, z, e1, r, e2).        (1)

We rank candidate e2s by decreasing score, which is estimated by max-product message-passing (Koller and Friedman, 2009).
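
In code, equation (1) amounts to taking, for each candidate e2, the maximum over interpretations of a product of potential values. The sketch below assumes the potentials are supplied as ordinary Python callables and enumerates interpretations explicitly; the system described here instead performs the same maximization by max-product message passing.

    def score_candidate(e2, interpretations, psi):
        """Equation (1): best product of potentials over all interpretations of the query.

        interpretations: iterable of (z, e1, r, t2) tuples, where z is a segmentation
                         and e1, r, t2 are KG proposals (None standing for the null value)
        psi:             dict of callables, one per potential in equation (1)
        """
        best = 0.0
        for z, e1, r, t2 in interpretations:
            score = (psi["T2"](z, t2)
                     * psi["R"](z, r)
                     * psi["E1"](z, e1)
                     * psi["S"](z)                     # pinned to 1 in the model above
                     * psi["E2_T2"](e2, t2)            # 1/0 type-membership constraint
                     * psi["E1_R_E2"](e1, r, e2)       # A if r(e1, e2) is in the KG, else B
                     * psi["E1_R_E2_S"](z, e1, r, e2)) # corpus snippet evidence
            best = max(best, score)
        return best

    def rank_candidates(candidates, interpretations, psi):
        # Candidates are returned in decreasing order of score, as in Section 4.2.
        return sorted(candidates,
                      key=lambda e2: score_candidate(e2, interpretations, psi),
                      reverse=True)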

[Figure 3 graphic: a factor graph connecting the segmentation, query entity, connecting relation, target type, selector, and candidate entity variables through entity, relation, and type language-model factors and a corpus-assisted entity-relation evidence factor.]

Figure 3: Graphical model for query segmentation and entity scoring. Factors/potentials are shown as squares. A candidate e2 is observed and scored using equation (1). Query q is also observed but not shown to reduce clutter; most potentials depend on it.

As noted earlier, any of the relation/type or query entity partitions may be empty. To handle this case, we allow each of the entity, relation and target type nodes in the graphical model to take the value ⊥ or ‘null’. To support this, the values of the factors between the query segmentation node Z and ΨE1(q, z, e1), ΨT2(q, z, t2), and ΨR(q, z, r) are set to suitable low values.

Next, we will describe the detailed design of some of the key potentials introduced above.

4.2.1 Relation language model for ΨR

Potential ΨR(q, z, r) captures the compatibility between r̂(q, z) and the proposed relation r. E.g., if the query is steve jobs death reason, and r̂ is (correctly chosen as) death reason, then the correct candidate r is /people/deceased_person/cause_of_death. An incorrect r is /people/deceased_person/place_of_death. An incorrect z may lead to r̂(q, z) being jobs death.

Using the corpus: Considerable variation may exist in how r is represented textually in a query. The relation language model needs to build a bridge between the formal r and the textual r̂, so that (un)likely r’s have (small) large potential. Many approaches (Berant et al., 2013; Berant and Liang, 2014; Kwiatkowski et al., 2013; Yih et al., 2014) to this problem have been intensely studied recently. Given our need to process billions of Web pages efficiently, we chose a pattern-based approach (Nakashole et al., 2012): with each r, discover the most strongly associated phrase patterns from a reference corpus, then mark these patterns in a much larger payload corpus.

We started with the 2000 (out of approximately 14000) most frequent relation types in Freebase, and the ClueWeb09 corpus annotated with Freebase entities (Gabrilovich et al., 2013). For each triple instance of each relation type, we located all corpus sentences that mentioned both participating entities. We made the crude assumption that if r(e1, e2) holds and e1, e2 co-occur in a sentence, then this sentence is evidence of the relationship. Each such sentence is parsed to obtain a dependency graph using the Malt Parser (Hall et al., 2014). Words in the path connecting the entities are joined together and added to a candidate phrase dictionary, provided the path is at most three hops. (Inspection suggested that longer dependency paths mostly arise out of noisy sentences or botched parses.) 30% of the sentences were thus retained. Finally, we defined

    ΨR(q, z, r) = n(r, r̂(q, z)) / Σ_{p′} n(r, p′),        (2)

where p′ ranges over all phrases that are known to hint at r, and n(r, p) denotes the number of sentences where the phrase p occurred in the dependency path between the entities participating in relation r.
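
Computationally, the corpus-derived relation model is just a normalized phrase count. A minimal Python sketch follows (the dependency parsing, entity annotation lookups, and the three-hop path filter are assumed to have produced the (relation, phrase) evidence pairs offline):

    from collections import defaultdict

    def build_phrase_counts(evidence_pairs):
        """evidence_pairs: iterable of (relation, dependency_path_phrase), one per corpus
        sentence that mentions both arguments of a KG triple for that relation."""
        counts = defaultdict(lambda: defaultdict(int))
        for relation, phrase in evidence_pairs:
            counts[relation][phrase] += 1        # n(r, p)
        return counts

    def psi_R(relation, relation_hint, counts):
        """Equation (2): Psi_R(q, z, r) = n(r, r-hat) / sum over p' of n(r, p')."""
        total = sum(counts[relation].values())
        if total == 0:
            return 0.0
        return counts[relation].get(relation_hint, 0) / total

    # e.g. psi_R("/people/deceased_person/cause_of_death", "death reason", counts)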

Assuming that entity co-occurrence implies evidence is admittedly simplistic. However, the primary function of the relation model is to retrieve top-k relations that are compatible with the type(s) of e1 and the given relation hint. Moreover, the remaining noise is further mitigated by the collective scoring in the graphical model. While we may miss relations if they are expressed in the query through obscure hints, allowing the relation to be ⊥ acts as a safety net.

Using Freebase relation names: As mentioned earlier, queries may express relations differently as compared to the corpus. A relation model based solely on corpus annotations may not be able to bridge that gap effectively, particularly because of the sparsity of corpus annotations or the rarity of Freebase triples in ClueWeb. E.g., for the Freebase relation /people/person/profession, we found very few annotated sentences. One way to address this problem is to utilize relation type names in Freebase to map hints to relation types. Thus, in addition to the corpus-derived relation model, we also built a language model that used Freebase relation type names as lemmas. E.g., the word ‘profession’ would contribute to the relation type /people/person/profession.

Our relation models are admittedly simple. This is mainly because telegraphic queries may express relations very differently from natural language text. As it is difficult to ensure precision at the query interpretation stage, our models are geared towards recall. The system generates a large number of interpretations and relies on signals from the corpus and KG to bring forth correct interpretations.

4.2.2 Type language model for ΨT2

Similar to the relation language model, we need a type language model to measure compatibility between t2 and t̂2(q, z). Estimating the target entity type, without over-generalizing or over-specifying it, has always been important for QA. E.g., when t̂2 is ‘city’, a good type language model should prefer t2 as /location/citytown over /location/location while avoiding /location/es_autonomous_city.

A catalog like Freebase suggests a straightforward method to collect a type language model. Each type is described by one or more phrases through the link /common/topic/alias. We can collect these into a micro-‘document’ and use a standard Dirichlet-smoothed language model from IR (Zhai, 2008). In Freebase, an entity node (e.g., Einstein, /m/0jcx) may be linked to a type node (e.g., /base/scientist/physicist) using an edge with label /type/object/type.

But relation types provide additional clues to the types of the endpoint entities. Freebase relation types have the form /x/y/z, where x is the domain of the relation, and y and z are string representations of the types of the entities participating in the relation. E.g., the (directed) relation type /location/country/capital connects from /location/country to /location/citytown. Therefore, “capital” can be added to the set of descriptive phrases of entity type /location/citytown.

It is important to note that while we use Freebase link nomenclature for relation and type language models, our models are not incompatible with other catalogs. Indeed, most catalogs have established ways of deriving language models that describe their various structures. For example, most YAGO types are derived from WordNet synsets with associated phrasal descriptions (lemmas). YAGO relations also have readable names such as actedIn, isMarriedTo, etc., which can be used to estimate language models. DBPedia relations are mostly derived from (meaningfully) named attributes taken from infoboxes, hence they can be used directly. Furthermore, others (Wu and Weld, 2007) have shown how to associate language models with such relations.
The factor ΨE1 ,R,E2 ,S (q, z, e1 , r, e2 ) should be
4.2.2 Type language model for ΨT2 large if many snippets contain a mention of e1 and
Similar to the relation language model, we need e2 , relation r, and many high-signal words from s.
a type language model to measure compatibil- Recall that we begin with a corpus annotated with
ity between t2 and tb2 (q, z). Estimating the tar- entity mentions. Our corpus is not directly anno-
get entity type, without over-generalizing or over- tated with relation mentions. Therefore, we get
specifying it, has always been important for QA. from relations to documents via high-confidence
E.g., when tb2 is ‘city’, a good type language model phrases. Snippets are retrieved using a combined
should prefer t2 as /location/citytown entity + word index, and scored for a given e1 , r,
over /location/location while avoiding e2 , and selectors sb(q, z).
/location/es_autonomous_city. Given that relation phrases may be noisy and
A catalog like Freebase suggests a straight- that their occurrence in the snippet may not nec-
forward method to collect a type language model. essarily mean that the given relation is being ex-
Each type is described by one or more phrases pressed, we need a scoring function that is cog-
through the link /common/topic/alias. We nizant of the roles of relation phrases and enti-
can collect these into a micro-‘document’ and ties occurring in the snippets. In a basic ver-
use a standard Dirichlet-smoothed language model sion, e1 , p, e2 , sb are used to probe a combined en-
from IR (Zhai, 2008). In Freebase, an entity tity+word index to collect high scoring snippets,
node (e.g., Einstein, /m/0jcx) may be linked with the score being adapted from BM25. The sec-
to a type node (e.g. /base/scientist/ ond, refined scoring function used a RankSVM-

The second, refined scoring function used a RankSVM-style (Joachims, 2002) optimization:

    min_{λ,ξ} ||λ||² + C Σ_{e+,e−} ξ_{e+,e−}
    s.t.  ∀ e+, e−:  λ · f(q, D_{e+}, e+) + ξ_{e+,e−} ≥ λ · f(q, D_{e−}, e−) + 1,        (3)

where e+ and e− are positive and negative entities for the query q and f(q, De, e) represents the feature map for the set of snippets De belonging to entity e. The assumption here is that all snippets containing e+ are “positive” snippets for the query. f consolidates various signals like the number of snippets where e occurs near query entity e1 and a relation phrase, or the number of snippets with a high proportion of query IDF, hinting that e is a positive entity for the given query. A partial list of features used for snippet scoring is given in Figure 4.

    Number of snippets with distance(e2, ê1) < k1 (k1 = 5, 10)
    Number of snippets with distance(e2, relation phrase) < k2 (k2 = 3, 6)
    Number of snippets with relation r = ⊥
    Number of snippets with relation phrases as prepositions
    Number of snippets covering a fraction of query IDF > k3 (k3 = 0.2, 0.4, 0.6, 0.8)

Figure 4: Sample features used for learning weights λ to score snippets.
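
The feature map f(q, De, e) behind Figure 4 can be sketched as simple counts over the snippets retrieved for one candidate entity. The per-snippet fields below are hypothetical names for the quantities the features need; the exact snippet representation is not spelled out here.

    def snippet_feature_map(snippets, idf, query_idf_total):
        """Aggregate Figure-4-style counts over the snippets of one candidate entity.

        Each snippet is a dict with (assumed) keys:
          dist_e1            token distance between the candidate and the e1 mention, or None
          dist_rel_phrase    token distance between the candidate and a relation phrase, or None
          relation_is_null   True if the interpretation proposed r = null
          rel_phrase_is_prep True if the matched relation phrase is a preposition
          covered_words      set of query words appearing in the snippet
        """
        feats = {}
        for k1 in (5, 10):
            feats["dist_e1<%d" % k1] = sum(1 for s in snippets
                                           if s["dist_e1"] is not None and s["dist_e1"] < k1)
        for k2 in (3, 6):
            feats["dist_rel<%d" % k2] = sum(1 for s in snippets
                                            if s["dist_rel_phrase"] is not None
                                            and s["dist_rel_phrase"] < k2)
        feats["relation_null"] = sum(1 for s in snippets if s["relation_is_null"])
        feats["rel_phrase_prep"] = sum(1 for s in snippets if s["rel_phrase_is_prep"])
        for k3 in (0.2, 0.4, 0.6, 0.8):
            feats["idf_cover>%.1f" % k3] = sum(
                1 for s in snippets
                if sum(idf.get(w, 0.0) for w in s["covered_words"]) > k3 * query_idf_total)
        return feats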
4.2.4 Query entity model

Potential ΨE1(q, z, e1) captures the compatibility between ê1(q, z) (i.e., the words that mention e1) and the claimed entity e1 mentioned in the query. We used the TagMe entity linker (Ferragina and Scaiella, 2010) for annotating entities in queries. TagMe annotates the query with Wikipedia entities, which we map to Freebase, and use the annotation confidence scores as the potential ΨE1(q, z, e1).

4.3 Discriminative parameter training with latent variables

We first set the potentials in (1) as explained in §4.2 (henceforth called ‘Unoptimized’), and got encouraging accuracy. Then we rewrote each potential as

    Ψ•(···) = exp(w• · φ•(···)),  or  log Π• Ψ•(···) = Σ• w• · φ•(···),        (4)

with w• being a weight vector for a specific potential •, and φ• being a corresponding feature vector. During inference, we seek to maximize

    max_{z,e1,t2,r} w · φ(q, z, e1, t2, r, e2),        (5)

for a fixed w, to find the score of each candidate entity e2. Here all w• and φ• have been collected into unified weight and feature vectors w, φ. During training of w, we are given pairs of correct and incorrect answer entities e2+, e2−, and we wish to satisfy constraints of the form

    max_{z,e1,t2,r} w · φ(q, z, e1, t2, r, e2+) + ξ  ≥  1 + max_{z,e1,t2,r} w · φ(q, z, e1, t2, r, e2−),        (6)

because collecting e2+, e2− pairs is less work than supervising with values of z, e1, t2, r, e2 for each query. Similar distant supervision problems were posed via the bundle method by (Bergeron et al., 2008), and by (Yu and Joachims, 2009), who used CCCP (Yuille and Rangarajan, 2006). These are equivalent in our setting. We use the CCCP style, and augment the objective with an additional entropy term as in (Sawant and Chakrabarti, 2013). We call this LVDT (latent variable discriminative training) in §5.
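
The constraints in (6) can be attacked with a simple alternating scheme: fix w, pick the highest-scoring interpretation for each of the paired entities as in (5), and take a subgradient step when the margin is violated. The sketch below is a deliberately simplified stand-in for the CCCP/bundle optimization with the added entropy term actually used for LVDT; the learning rate and regularization constant are hypothetical.

    def best_features(w, phi, query, interpretations, e2):
        """argmax over (z, e1, t2, r) of w . phi(q, z, e1, t2, r, e2), as in equation (5).
        phi returns a sparse feature dict for one interpretation and candidate entity."""
        def dot(f):
            return sum(w.get(k, 0.0) * v for k, v in f.items())
        return max((phi(query, itp, e2) for itp in interpretations), key=dot)

    def lvdt_epoch(w, phi, training_pairs, lr=0.1, reg=1e-4):
        """One pass of a simplified latent-variable update enforcing constraints (6).
        training_pairs: iterable of (query, interpretations, e2_pos, e2_neg)."""
        def dot(f):
            return sum(w.get(k, 0.0) * v for k, v in f.items())
        for query, interpretations, e2_pos, e2_neg in training_pairs:
            f_pos = best_features(w, phi, query, interpretations, e2_pos)
            f_neg = best_features(w, phi, query, interpretations, e2_neg)
            if dot(f_pos) < dot(f_neg) + 1.0:          # margin of 1 violated
                for k, v in f_pos.items():
                    w[k] = w.get(k, 0.0) + lr * v
                for k, v in f_neg.items():
                    w[k] = w.get(k, 0.0) - lr * v
            for k in list(w):                          # crude L2 shrinkage
                w[k] *= 1.0 - lr * reg
        return w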
5 Experiments

5.1 Testbed

Corpus and knowledge graph: We used the ClueWeb09B (ClueWeb09, 2009) corpus containing 50 million Web documents. This corpus was annotated by Google with Freebase entities (Gabrilovich et al., 2013). The average page contains 15 entity annotations from Freebase. We used the Freebase KG and its links to Wikipedia.

Queries: We report on two sets of entity-seeking queries. A sample of about 800 well-formed queries from WebQuestions (Berant et al., 2013) were converted to telegraphic utterances (such as would be typed into commercial search engines) by volunteers familiar with Web search. We call this WQT (WebQuestions, telegraphic). Queries are accompanied by ground truth entities. The second data set, TREC-INEX, from (Sawant and Chakrabarti, 2013), has about 700 queries sampled from TREC and INEX, available at http://bit.ly/WSpxvr. These come with well-formed and telegraphic utterances, as well as ground truth entities.

There are some notable differences between these query sets. For WQT, queries were generated using Google’s query suggestion interface. Volunteers were asked to find answers using single Freebase pages. Therefore, by construction, the queries retained can be answered using the Freebase KG alone, with a simple r(e1, ?) form. In contrast, TREC-INEX queries provide a balanced mix of t2 and r hints in the queries, and direct answers from triples are relatively less available.

5.2 Implementation details

On average, the pseudocode in Figure 2 generated 13 segmentations per query, with longer queries generating more segmentations than shorter ones.

We used an MG4J-based (Boldi and Vigna, 2005) query processor, written in Java, over entity and word indices on ClueWeb09B. The index supplies snippets with a specified maximum width, containing a mention of some entity and satisfying a WAND (Broder et al., 2003) predicate over words in ŝ. In the case of phrases in the query, the WAND threshold was computed by adding the IDFs of the constituent words. The index returned about 330,000 snippets on average for a WAND threshold of 0.6.

We retained the top 200 candidate entities from the corpus; increasing this horizon did not give benefits. We also considered as candidates for e2 those entities that are adjacent to e1 in the KG via top-scoring r candidates. In order to generate supporting snippets for an interpretation containing entity annotation e, we need to match e with Google’s corpus annotations. However, relying solely on corpus annotations fails to retrieve many potential evidence snippets, because entity annotations are sparse. Therefore we probed the token index with the textual mention of e1 in the query; this improved recall.

We also investigated the feasibility of our proposals for interactive search. There are three major processes involved in answering a query: generating potential interpretations, collecting/scoring snippets, and inference (MAP for Unoptimized and w · φ(·) for LVDT). For the WQT dataset, the average time per query for these stages was approximately 0.2, 16.6 and 1.3 seconds respectively. Our (Java) code did not optimize the bottleneck at all; only 10 hosts and no clever load balancing were used. We believe commercial search engines can cut this down to less than a second.

5.3 Research questions

In the rest of this section we will address these questions:
• For telegraphic queries, is our entity-relation-type-selector segmentation better than the type-selector segmentation of (Sawant and Chakrabarti, 2013)?
• When semantic parsers (Berant et al., 2013; Kwiatkowski et al., 2013) are subjected to telegraphic queries, how do they perform compared to our proposal?
• Are the KG and corpus really complementary as regards their support of accurate ranking of candidate entities?
• Is the prediction of r and t2 from our approach better than a greedy assignment based on local language models?
We also discuss anecdotes of successes and failures of various systems.

5.4 Benefits of relation in addition to type

Figure 5 shows entity-ranking MAP, MRR, and NDCG@10 (n@10) for the two data sets and various systems. “No interpretation” is an IR baseline without any KG. Type+selector is our implementation of (Sawant and Chakrabarti, 2013). Unoptimized and LVDT both beat “no interpretation” and “type+selector” by wide margins. (Boldface implies the best performing formulation.) There are two notable differences between S&C and our work. First, S&C do not use the knowledge graph (KG) and rely on a noisy corpus. This means S&C fail to answer queries whose answers are found only in the KG. This can be seen from the WQT results; they perform only slightly better than the baseline. Second, even for queries that can be answered through the corpus alone, S&C miss out on two important signals that the query may provide, namely the query entity and the relation. Our framework not only provides a way to use a curated and high-precision knowledge graph but also attempts to provide more reachability into the corpus by the use of relational phrases.
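
For reference, the per-query values behind the MAP, MRR and NDCG@10 numbers in Figures 5–9 can be computed as below (standard definitions with binary relevance, matching the ground-truth entity sets of §5.1) and then averaged over all queries.

    import math

    def average_precision(ranked, relevant):
        hits, ap = 0, 0.0
        for i, e in enumerate(ranked, start=1):
            if e in relevant:
                hits += 1
                ap += hits / i
        return ap / max(len(relevant), 1)

    def reciprocal_rank(ranked, relevant):
        for i, e in enumerate(ranked, start=1):
            if e in relevant:
                return 1.0 / i
        return 0.0

    def ndcg_at_k(ranked, relevant, k=10):
        dcg = sum(1.0 / math.log2(i + 1)
                  for i, e in enumerate(ranked[:k], start=1) if e in relevant)
        ideal = sum(1.0 / math.log2(i + 1) for i in range(1, min(len(relevant), k) + 1))
        return dcg / ideal if ideal > 0 else 0.0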

In the case of TREC-INEX, LVDT improves upon the unoptimized graphical model, whereas for WQT it does not. Preliminary inspection suggests this is because WQT has noisy and incomplete ground truth, and LVDT trains to the noise; a non-convex objective makes matters worse. The bias in our unoptimized model circumvents training noise.

    Dataset    Formulation        map   mrr   n@10
    TREC-INEX  No interpretation  .205  .215  .292
               Type+selector      .292  .306  .356
               Unoptimized        .409  .419  .502
               LVDT               .419  .436  .541
    WQT        No interpretation  .080  .095  .131
               Type+selector      .116  .152  .201
               Unoptimized        .377  .401  .474
               LVDT               .295  .323  .406

Figure 5: ‘Entity-relation-type-selector’ segmentation yields better accuracy than ‘type-selector’ segmentation.

5.5 Comparison with semantic parsers

For TREC-INEX, both Unoptimized and LVDT beat SEMPRE (Berant et al., 2013) convincingly, whether it is trained with Free917 or WebQuestions (Figure 6).

SEMPRE’s relatively poor performance, in this case, is explained by its complete reliance on the knowledge graph. As discussed previously, the TREC-INEX dataset contains a sizable proportion of queries that may be difficult to answer using a KG alone. When SEMPRE is compared with our systems on a telegraphic sample of WebQuestions (WQT), results are mixed. Our Unoptimized model still compares favorably to SEMPRE, but with slimmer gains. As before, LVDT falls behind.

    Dataset    Formulation       map   mrr   n@10
    TREC-INEX  SEMPRE (Free917)  .154  .159  .186
               SEMPRE (WQ)       .197  .208  .247
               Unoptimized       .409  .419  .502
               LVDT              .419  .436  .541
    WQT        SEMPRE (Free917)  .229  .255  .285
               SEMPRE (WQ)       .374  .406  .449
               Unoptimized       .377  .401  .474
               Jacana            .239  .256  .329
               LVDT              .295  .323  .406

Figure 6: Comparison with semantic parsers.

Our smaller gains over SEMPRE in the case of WebQuestions are explained by how WebQuestions was assembled (Berant et al., 2013). Although Google’s query suggestions gave an eclectic pool, only those queries survived that could be answered using a single Freebase page, which effectively reduced the role of a corpus. In fact, a large fraction of WQT queries cannot be answered well using the corpus alone, because FACC1 annotations are too sparse and rarely cover common nouns and phrases such as ‘democracy’ or ‘drug overdose’ which are needed for some WQT queries.

For WQT, our system also compares favorably with Jacana (Yao and Van Durme, 2014). Given that they subject their input to natural language parsing, their relatively poor performance is not surprising.

5.6 Complementary benefits of KG & corpus

Figure 7 shows the synergy between the corpus and the KG. In all cases and for all metrics, using the corpus and KG together gives superior performance to using either of them alone. However, it is instructive that in the case of TREC-INEX, corpus-only is better than KG-only, whereas this is reversed for WQT, which also supports the above argument.

    Data       Formulation           map   mrr   n@10
    TREC-INEX  Unoptimized (KG)      .201  .209  .241
               Unoptimized (Corpus)  .381  .388  .471
               Unoptimized (Both)    .409  .419  .502
               LVDT (KG only)        .255  .264  .293
               LVDT (Corpus)         .267  .272  .315
               LVDT (Both)           .419  .436  .541
    WQT        Unoptimized (KG)      .329  .343  .394
               Unoptimized (Corpus)  .188  .228  .291
               Unoptimized (Both)    .377  .401  .474
               LVDT (KG only)        .257  .281  .345
               LVDT (Corpus only)    .170  .210  .280
               LVDT (Both)           .295  .323  .406

Figure 7: Synergy between KG and corpus.

5.7 Collective vs. greedy segmentation

To judge the quality of interpretations, we asked paid volunteers to annotate queries with an appropriate relation and type, and compared them with the interpretations associated with top-ranked entities. Results in Figure 8 indicate that in spite of noisy relation and type language models, our formulations produce high quality interpretations through collective inference.

Figure 9 demonstrates the benefit of collective inference over greedy segmentation followed by evaluation. Collective inference boosts absolute MAP by as much as 0.2.

    Formulation          Type  Relation  Type/Rel
    Unoptimized (top 1)   23     49        60
    Unoptimized (top 5)   29     57        68
    LVDT (top 1)          25     52        61
    LVDT (top 5)          33     61        69

Figure 8: Fraction of queries (%) with correct interpretations of t2, r, and t2 or r, on TREC-INEX.

    Dataset    Formulation           map   mrr   n@10
    TREC-INEX  Unoptimized (greedy)  .343  .347  .432
               Unoptimized           .409  .419  .502
               LVDT (greedy)         .205  .214  .259
               LVDT                  .419  .436  .541
    WQT        Unoptimized (greedy)  .246  .271  .335
               Unoptimized           .377  .401  .474
               LVDT (greedy)         .212  .246  .317
               LVDT                  .295  .323  .406

Figure 9: Collective vs. greedy segmentation.

5.8 Discussion

Closer scrutiny revealed that collective inference often overcame errors in earlier stages to produce a correct ranking over answer entities. E.g., for the query automobile company makes spider, the entity disambiguation stage fails to identify the car Alfa Romeo Spider (/m/08ys39). However, the interpretation stage recovers from the error and segments the query with Automobile (/m/0k4j) as the query entity e1, /organization/organization and /business/industry/companies as the target type t2 and relation r respectively (from the relation/type hint ‘company’), and spider as a selector, to arrive at the correct answer Alfa Romeo (/m/09c50). The corpus features also play a crucial role for queries which may not be accurately represented with an appropriate logical formula. For the query meg ryan bookstore movie, the textual patterns for the relation ActedIn, in conjunction with the selector word ‘bookstore’, correctly identify the answer entity You’ve Got Mail (/m/014zwb).

We also analyzed samples of queries where our system did not perform particularly well. We observed that one of the recurring themes of these queries was that their answer entities had very little corpus support, and the type/relation hint mapped to too many or no candidate types/relations. For example, in the query south africa political system, the relevant type/relation hint ‘political system’ could not be mapped to /government/form_of_government and /location/country/form_of_government respectively.

Figure 10: Comparison of various approaches for NDCG at rank 1 to 10, TREC-INEX dataset.

Figure 11: Comparison of various approaches for NDCG at rank 1 to 10, WQT dataset.

6 Conclusion and future work

We presented a technique to partition telegraphic entity-seeking queries into functional segments and to rank answer entities accordingly. While our results are favorable compared to strong prior art, further improvements may result from relaxing our model to recognize multiple e1s and rs. It may also help to deploy more sophisticated paraphrasing models (Berant and Liang, 2014) or word embeddings (Yih et al., 2014) for relation hints. It would also be interesting to supplement entity-linked corpora and curated KGs with extracted triples (Fader et al., 2014). Another possibility is to apply the ideas presented here to well-formed questions.

References

Jonathan Berant and Percy Liang. 2014. Semantic parsing via paraphrasing. In ACL Conference.
Jonathan Berant, Andrew Chou, Roy Frostig, and Percy Liang. 2013. Semantic parsing on Freebase from question-answer pairs. In Empirical Methods in Natural Language Processing (EMNLP).
Charles Bergeron, Jed Zaretzki, Curt Breneman, and Kristin P. Bennett. 2008. Multiple instance ranking. In ICML, pages 48–55. ACM.
Paolo Boldi and Sebastiano Vigna. 2005. MG4J at TREC 2005. In Ellen M. Voorhees and Lori P. Buckland, editors, TREC, number SP 500-266 in Special Publications. NIST.
Andrei Z. Broder, David Carmel, Michael Herscovici, Aya Soffer, and Jason Zien. 2003. Efficient query evaluation using a two-level retrieval process. In CIKM, pages 426–434. ACM.
Tao Cheng and Kevin Chen-Chuan Chang. 2010. Beyond pages: supporting efficient, scalable entity search with dual-inversion index. In EDBT. ACM.
ClueWeb09. 2009. http://www.lemurproject.org/clueweb09.php/.
Anthony Fader, Luke Zettlemoyer, and Oren Etzioni. 2014. Open question answering over curated and extracted knowledge bases. In SIGKDD Conference.
Paolo Ferragina and Ugo Scaiella. 2010. TAGME: on-the-fly annotation of short text fragments (by wikipedia entities). CoRR/arXiv, abs/1006.3498. http://arxiv.org/abs/1006.3498.
Evgeniy Gabrilovich, Michael Ringgaard, and Amarnag Subramanya. 2013. FACC1: Freebase annotation of ClueWeb corpora. http://lemurproject.org/clueweb12/, June. Version 1 (Release date 2013-06-26, Format version 1, Correction level 0).
Sean Gallagher. 2012. How Google and Microsoft taught search to ‘understand’ the Web. ArsTechnica article. http://goo.gl/NWs0zT.
Jiafeng Guo, Gu Xu, Xueqi Cheng, and Hang Li. 2009. Named entity recognition in query. In SIGIR Conference, pages 267–274. ACM.
Johan Hall, Jens Nilsson, and Joakim Nivre. 2014. Maltparser. http://www.maltparser.org/.
Thorsten Joachims. 2002. Optimizing search engines using clickthrough data. In SIGKDD Conference, pages 133–142. ACM.
Gjergji Kasneci, Fabian M. Suchanek, Georgiana Ifrim, Maya Ramanath, and Gerhard Weikum. 2008. NAGA: Searching and ranking knowledge. In ICDE. IEEE.
Daphne Koller and Nir Friedman. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Tom Kwiatkowski, Eunsol Choi, Yoav Artzi, and Luke S. Zettlemoyer. 2013. Scaling semantic parsers with on-the-fly ontology matching. In EMNLP Conference, pages 1545–1556.
Xiaonan Li, Chengkai Li, and Cong Yu. 2010. EntityEngine: Answering entity-relationship queries using shallow semantics. In CIKM, October. (demo).
Yanen Li, Bo-Jun Paul Hsu, ChengXiang Zhai, and Kuansan Wang. 2011. Unsupervised query segmentation using clickthrough for information retrieval. In SIGIR Conference, pages 285–294. ACM.
Percy Liang. 2013. Lambda dependency-based compositional semantics. Technical Report arXiv:1309.4408, Stanford University. http://arxiv.org/abs/1309.4408.
Thomas Lin, Patrick Pantel, Michael Gamon, Anitha Kannan, and Ariel Fuxman. 2012. Active objects: Actions for entity-centric search. In WWW Conference, pages 589–598. ACM.
Ndapandula Nakashole, Gerhard Weikum, and Fabian Suchanek. 2012. PATTY: A taxonomy of relational patterns with semantic types. In EMNLP Conference, EMNLP-CoNLL ’12, pages 1135–1145. ACL.
Patrick Pantel, Thomas Lin, and Michael Gamon. 2012. Mining entity types from query logs via user intent modeling. In ACL Conference, pages 563–571, Jeju Island, Korea, July.
Fernando Pereira. 2013. Meaning in the wild. Invited talk at EMNLP Conference. http://hum.csse.unimelb.edu.au/emnlp2013/invited-talks.html.
Jeffrey Pound, Alexander K. Hudek, Ihab F. Ilyas, and Grant Weddell. 2012. Interpreting keyword queries over Web knowledge bases. In CIKM.
Nikos Sarkas, Stelios Paparizos, and Panayiotis Tsaparas. 2010. Structured annotations of Web queries. In SIGMOD Conference.
Uma Sawant and Soumen Chakrabarti. 2013. Learning joint query interpretation and response ranking. In WWW Conference, Brazil.
Fei Wu and Daniel S. Weld. 2007. Automatically semantifying Wikipedia. In CIKM, pages 41–50.
Mohamed Yahya, Klaus Berberich, Shady Elbassuoni, Maya Ramanath, Volker Tresp, and Gerhard Weikum. 2012. Natural language questions for the Web of data. In EMNLP Conference, pages 379–390, Jeju Island, Korea, July.
Xuchen Yao and Benjamin Van Durme. 2014. Information extraction over structured data: Question answering with Freebase. In ACL Conference. ACL.
Xuchen Yao, Jonathan Berant, and Benjamin Van Durme. 2014. Freebase QA: Information extraction or semantic parsing? In ACL 2014 Workshop on Semantic Parsing (SP14).
Wen-tau Yih, Xiaodong He, and Christopher Meek. 2014. Semantic parsing for single-relation question answering. In ACL Conference. ACL.
Chun-Nam John Yu and Thorsten Joachims. 2009. Learning structural SVMs with latent variables. In ICML, pages 1169–1176. ACM.
A. L. Yuille and Anand Rangarajan. 2006. The concave-convex procedure. Neural Computation, 15(4):915–936.
ChengXiang Zhai. 2008. Statistical language models for information retrieval: A critical review. Foundations and Trends in Information Retrieval, 2(3):137–213, March.

