
Keyword Searching and Browsing in Databases using BANKS

Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
Soumen Chakrabarti, S. Sudarshan

I.I.T. Bombay
11/25/2018 1
Motivation

• Keyword search of documents on the Web has been enormously successful
  - Simple and intuitive; no need to learn any query language
• Database querying using keywords is desirable
  - SQL is not appropriate for casual users
  - Form interfaces are cumbersome:
    - They require a separate form for each type of query, which is confusing for casual users of Web information systems
    - Not suitable for ad hoc queries

Motivation

• Many Web documents are dynamically generated from databases
  - E.g. catalog data
• Keyword querying of generated Web documents
  - May miss answers that need to combine information on different pages
  - Suffers from duplication overheads

Examples of Keyword Queries

• On a railway reservation database
  - “mumbai bangalore”
• On an e-store database
  - “camcorder panasonic”
• On a book store database
  - “sudarshan databases”

Differences from IR/Web Search

• Related data split across multiple tuples due to normalization
  - E.g. Paper (paper-id, title, journal),
    Author (author-id, name),
    Writes (author-id, paper-id, position),
    Cites (citing-paper-id, cited-paper-id)
• Different keywords may match tuples from different relations
  - What joins are to be computed can only be decided on the fly

Connectivity

• Tuples may be connected by
  - Foreign keys
  - Implicit links (shared words), etc.
  - Belonging to the same relation
• Would like to find sets of (closely) connected tuples that match all given keywords

Basic Model

• Database: modeled as a graph
  - Nodes = tuples
  - Edges = references between tuples
    - Foreign keys, other kinds of relationships
    - Edges are directed

[Figure: example graph with paper nodes “BANKS: Keyword search…” and “MultiQuery Optimization”, writes nodes, and author nodes “Charuta”, “S. Sudarshan”, and “Prasan Roy”.]
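
The graph model above can be sketched in a few lines of Python. This is only an illustration, not the BANKS implementation: the tuple identifiers (`p1`, `a1`, `w1`, ...), the attribute names, and the `build_graph` helper are hypothetical, and a foreign key is represented simply as an attribute whose value is the key of another tuple.

```python
from collections import defaultdict

# Hypothetical tuples keyed by (relation, primary key); contents are illustrative.
tuples = {
    ("paper", "p1"): {"title": "MultiQuery Optimization"},
    ("author", "a1"): {"name": "S. Sudarshan"},
    ("author", "a2"): {"name": "Prasan Roy"},
    ("writes", "w1"): {"author": ("author", "a1"), "paper": ("paper", "p1")},
    ("writes", "w2"): {"author": ("author", "a2"), "paper": ("paper", "p1")},
}


def build_graph(tuples):
    """Return forward adjacency lists: node -> nodes it references."""
    forward = defaultdict(list)
    for node, attrs in tuples.items():
        for value in attrs.values():
            # An attribute whose value is another tuple's key acts as a foreign key.
            if value in tuples:
                forward[node].append(value)
    return forward


graph = build_graph(tuples)
```

Each writes tuple ends up with two outgoing (forward) edges, one to its author and one to its paper, while author and paper tuples have none.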

Answer Example

Query: sudarshan roy

            paper: MultiQuery Optimization
               /                   \
           writes                 writes
              |                     |
   author: S. Sudarshan     author: Prasan Roy

Edge Directionality

• Some popular tuples are connected to many other tuples
  - E.g. students -> departments -> university
• Popular tuples would create misleading shortcuts from every tuple to every other
  - E.g. every student would be closely linked with every other student via the department/university
• Solution: define different forward and backward edge weights
  - Forward edges: in the direction of the foreign key reference

Edge Weight

• Weight of forward edge based on schema
  - E.g. citation link weights > writes link weights
• Weight of backward edge = indegree of the node, i.e. the number of edges pointing to it

[Figure: example graph with edges labeled with weights 1 and 3.]
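
A minimal sketch of this weighting rule, assuming every forward edge gets the same schema-based weight (the slide allows it to vary by link type). The function name `edge_weights` and the default `forward_weight=1.0` are illustrative choices, not part of BANKS.

```python
from collections import Counter


def edge_weights(forward_edges, forward_weight=1.0):
    """forward_edges: iterable of (src, dst) pairs following foreign keys.

    Returns {(u, v): weight} holding each forward edge and its reverse.
    """
    edges = list(forward_edges)
    indegree = Counter(dst for _, dst in edges)
    weights = {}
    for src, dst in edges:
        weights[(src, dst)] = forward_weight        # forward edge: schema-based weight
        weights[(dst, src)] = float(indegree[dst])  # backward edge: indegree of dst
    return weights
```

For three students referencing one department, each student-to-department edge weighs 1 and each department-to-student edge weighs 3, so paths that pass through the popular department become longer.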

Edge Weight Scaling

• Problem: some backward edges have unduly large weights
  - Scale edge weights by using log(1 + raw-edge-weight)
• total-edge-weight = Σ edge-weights
• Edge score E = 1 / total-edge-weight

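
The edge score can be written out as a short function. This is a sketch under stated assumptions: the slides do not give the base of the logarithm (natural log is assumed here), and they do not say how a tree with no edges is scored (this sketch returns infinity). The name `edge_score` is hypothetical.

```python
import math


def edge_score(raw_edge_weights):
    """E = 1 / sum(log(1 + w)) over the raw edge weights of one answer tree."""
    total = sum(math.log(1 + w) for w in raw_edge_weights)
    if total == 0:
        # Assumption: a single-node answer has no edges; treat it as maximally close.
        return math.inf
    return 1.0 / total
```

Because of the log scaling, replacing an edge of raw weight 1 with one of raw weight 3 only doubles its contribution (log 4 versus log 2) instead of tripling it.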
Node Weight

• Nodes have prestige weights too
  - Observation: nodes with intuitively greater prestige tend to have greater indegree
  - Set node weight = indegree
• Problem: nodes with many in-edges result in skewed answers
  - Subdue extreme node weights by using log(1 + indegree)
• Node score N = root-node-weight + Σ leaf-node-weights
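
As a sketch, the node score follows directly from the formula above. The function names are hypothetical and the natural logarithm is an assumption, since the slides do not specify a base.

```python
import math


def node_weight(indegree):
    """Prestige of a single node, damped with log(1 + indegree)."""
    return math.log(1 + indegree)


def node_score(root_indegree, leaf_indegrees):
    """N = weight of the root node plus the weights of all leaf nodes."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)
```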

Combining Scores

• Problem: how to combine two independent metrics: node weight and edge weight
  - Normalize each to 0-1
  - Combine using weighting factor λ
    - Additive: (1 - λ) E + λ N
    - Multiplicative: E · N^λ
• Performance study to compare alternatives and to find reasonable values for λ
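
Both combinations are one-liners once E and N are normalized. In this sketch the inputs `e` and `n` are assumed to already lie in the range 0-1 (the slides do not say how normalization is done), and the function names and the default `lam=0.2` are illustrative.

```python
def combine_additive(e, n, lam=0.2):
    """Additive combination: (1 - lam) * E + lam * N."""
    return (1 - lam) * e + lam * n


def combine_multiplicative(e, n, lam=0.2):
    """Multiplicative combination: E * N ** lam."""
    return e * n ** lam
```

With a small `lam`, both forms let the edge score dominate while still letting prestige break ties between answers of similar proximity.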

The BANKS Answer Model

• Query: set of keywords {k1, k2, ..., kn}
  - Each keyword ki matches a set of nodes Si
• Answer: rooted, directed tree connecting nodes, with one node from each Si
  - Root node (also referred to as the information node) has special significance; may be restricted to some relations
    - E.g. relations representing entities, not relationships
  - May include intermediate nodes not in any Si, and hence is a Steiner tree
• Multiple answers
  - Ranking based on proximity + prestige

Finding Answer Trees

• Computation of minimum-weight Steiner trees: NP-complete
• Backward Expanding Search Algorithm:
  - Intuition: find vertices from which a forward path exists to at least one node from each Si
  - Run concurrent single-source shortest-path algorithms from each node matching a keyword
    - Create an iterator for each node matching a keyword
    - Traverse the graph edges in reverse direction
  - Output a node whenever it is in the intersection of the sets of nodes reached from each keyword
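
The idea above can be sketched as several Dijkstra-style iterators sharing one priority queue. This is a simplified illustration, not the BANKS code: it reports each candidate root once instead of enumerating every answer tree, it assumes the graph already contains both forward and backward edges, and the names `backward_search` and `reverse_adj` are hypothetical.

```python
import heapq
import itertools


def backward_search(reverse_adj, keyword_nodes):
    """Yield candidate root nodes in the order they are discovered.

    reverse_adj: {v: [(u, w), ...]} meaning the graph has an edge u -> v of weight w.
    keyword_nodes: list of sets S_i, one set of matching nodes per keyword.
    """
    tie = itertools.count()  # tie-breaker so the heap never compares node objects
    heap = []                # entries: (distance, tie, keyword index, origin, node)
    for i, nodes in enumerate(keyword_nodes):
        for origin in nodes:
            heapq.heappush(heap, (0.0, next(tie), i, origin, origin))
    settled = set()          # (origin, node) pairs whose shortest distance is final
    reached = {}             # node -> set of keyword indexes that have reached it
    emitted = set()
    while heap:
        dist, _, i, origin, node = heapq.heappop(heap)
        if (origin, node) in settled:
            continue
        settled.add((origin, node))
        reached.setdefault(node, set()).add(i)
        # A node reached from every keyword is the root of at least one answer tree.
        if len(reached[node]) == len(keyword_nodes) and node not in emitted:
            emitted.add(node)
            yield node
        # Expand backwards: follow each edge u -> node from node to u.
        for pred, weight in reverse_adj.get(node, []):
            if (origin, pred) not in settled:
                heapq.heappush(heap, (dist + weight, next(tie), i, origin, pred))
```

Popping the globally smallest distance across all iterators means nearby roots tend to be reported before distant ones, although the order is not guaranteed to match the final relevance ranking.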

Finding Answer Trees

• For each vertex v visited, maintain a nodelist v.Li for each search term ti.
• Update the i-th nodelist when the search starting from a vertex u ∈ Si reaches the vertex v.
• The new result trees produced correspond to the nodelists: {u} × ∏_{j≠i} v.Lj
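
The cross product can be illustrated with `itertools.product`. This is a sketch only: `new_trees` and `v_lists` are hypothetical names, and each "tree" is represented just by its tuple of keyword nodes, one per search term, rather than by its full set of edges.

```python
from itertools import product


def new_trees(v_lists, i, u):
    """Leaf combinations created when the search from u (term i) reaches vertex v.

    v_lists: list of nodelists v.L_j, one per search term, for the vertex v.
    Returns {u} x product of v.L_j for all j != i, then records u in v.L_i.
    """
    factors = [[u] if j == i else nodes for j, nodes in enumerate(v_lists)]
    combos = list(product(*factors))
    v_lists[i].append(u)
    return combos
```

If any other nodelist is still empty, the product is empty, which matches the fact that v cannot root an answer until every keyword has reached it.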

Backward Expanding Search

Query: sudarshan roy

[Figure: backward expansion from the author nodes “S. Sudarshan” and “Prasan Roy” through writes nodes to the paper node “MultiQuery Optimization”.]

Result Ordering

• Answer trees may not be generated in relevance order
• Solution:
  - Best-first search across all iterators, based on path length
  - Output answers to a buffer
  - Eliminate duplicates: isomorphic trees
  - Output the highest-ranked answer from the buffer to the user when the buffer is full
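
A minimal sketch of the buffering step, assuming each generated answer arrives as a `(score, tree_key)` pair where a higher score is better and `tree_key` is some hashable canonical form that is equal for isomorphic trees. Both the representation and the name `reorder` are assumptions made for illustration.

```python
import heapq
import itertools


def reorder(answers, buffer_size):
    """Yield (score, tree_key) pairs in approximately descending score order."""
    tie = itertools.count()  # tie-breaker so equal scores never compare keys
    heap = []                # max-heap simulated by negating scores
    seen = set()
    for score, key in answers:
        if key in seen:      # duplicate (isomorphic) tree: drop it
            continue
        seen.add(key)
        heapq.heappush(heap, (-score, next(tie), key))
        if len(heap) >= buffer_size:
            neg, _, best = heapq.heappop(heap)
            yield (-neg, best)
    while heap:              # flush the remaining answers, best first
        neg, _, best = heapq.heappop(heap)
        yield (-neg, best)
```

A larger buffer gives an output order closer to the true ranking at the cost of a longer wait before the first answer is shown.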

The BANKS System

• BANKS provides keyword search coupled with extensive browsing facilities
  - Schema browsing + data browsing
  - Graphical display of data
• Implemented using Java + servlets
• Keyword search response times typically 1 to 3 seconds on
  - DBLP database with 100,000 tuples/300,000 edges
  - P3 600 MHz, 512 MB RAM
• Try it out at www.cse.iitb.ac.in/banks/

The BANKS Architecture

User <--HTTP--> BANKS (Web Server + Servlets) <--JDBC--> Database

• Connects to any database using JDBC
  - JDBC metadata features used to provide schema browsing
• No programming needed for customization
  - Minimal preprocessing of database to create indices and give weights to links
• Extensive set of browsing features

Browsing Features

• Hyperlinks are automatically added to all displayed results
• Template facilities to do a variety of tasks
  - Browsing data by grouping and creating crosstabs
    - E.g. theses grouped by department and year
  - Hierarchical views of data
    - Nested XML style, even on relational data
  - Graphical displays
    - Bar charts, pie charts, etc.

Example of Browsing in BANKS

BANKS Query Result Example

• Result of “Soumen Sunita”

Anecdotes

• “Mohan”
  - Returns C. Mohan at top based on prestige (number of papers written)
• “Transaction”
  - Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations)
• “Sunita Seltzer”
  - No common papers, but both have papers with Stonebraker: system finds this connection

Effect of Parameters

• Log scaling of edge weights worked well
• (1 - λ) E + λ N versus E · N^λ: made little difference
• Best with λ = 0.2 (subdue node weights but not entirely)

Related Work

• DataSpot (DTL)/Mercado Intuifind [VLDB 98]
  - Based on patent by Palmon (filed 1995, granted 1998)
  - Similar answer model to ours
  - Differences: our model of backward link weights and prestige
• Proximity Search [VLDB 98]
  - Different model of proximity
  - No edge weights or prestige; different evaluation algorithm
• Information units (linked Web pages) [WWW10]
  - No directionality, only studied in Web context
• Microsoft DBExplorer
  - No ranking, based on SQL generation
  - Addresses efficient construction of text indexes

Some Extensions to BANKS

• Searching for similar results: Template Search
  - Define the notion of similarity between two result trees
  - Perform the restricted search
• Efficiently handling metadata queries
  - Starting the search from each of the tuples in a table is too costly

Template Search

• Feedback in terms of result tree
• Type of a result tree defined in terms of
  - Type of nodes:
    - The table to which the node belongs
  - Type of edges:
    - The types of the nodes it connects
    - The link information, e.g. ‘cites’ and ‘cited’ links between two papers
• Which nodes to start the search from
  - Only the chosen nodes
  - All the nodes corresponding to a particular keyword

Template Search

• Start the backward search only from the allowed set of nodes
• Follow the edges as defined by the result type
• Example: consider the query “sudarshan database”
  - Two types of results for the above query:
    - Papers written by Professor Sudarshan
    - Papers cited by papers written by Professor Sudarshan
  - The two result types are distinguished by whether to follow the cites/cited link from a paper node.
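
One way to make the "type of a result tree" concrete is to reduce a tree to the set of (source table, link label, destination table) triples of its edges. This representation, the helper names, and the link labels used in the example are assumptions made for illustration; the slides do not prescribe a data structure.

```python
def tree_type(edges, table_of):
    """Type of a result tree as a set of (src table, link label, dst table) triples.

    edges: iterable of (src, label, dst) triples describing the tree's edges.
    table_of: mapping from node to the name of the table it belongs to.
    """
    return frozenset((table_of[s], label, table_of[d]) for s, label, d in edges)


def same_type(edges_a, edges_b, table_of):
    """Two result trees are considered similar here if their types are equal."""
    return tree_type(edges_a, table_of) == tree_type(edges_b, table_of)


# Hypothetical nodes: a = author, w/w2 = writes, p/q = paper, c = cites.
table_of = {"a": "author", "w": "writes", "w2": "writes",
            "p": "paper", "q": "paper", "c": "cites"}
written = [("w", "author", "a"), ("w", "paper", "p")]
written_other = [("w2", "author", "a"), ("w2", "paper", "q")]
cited = written + [("c", "cites", "p"), ("c", "cited", "q")]
```

Here `written` and `written_other` share a type, while `cited` differs because it additionally follows the cites/cited links from a paper node.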

Metadata Keyword Queries

• Metadata keywords: match all the tuples of a relation
• Too costly to start the search from each of the tuples of a table
• First-cut approach: start the forward search from the information node for the non-metadata keywords
  - Selectively choose the nodes from which to start the forward search

Example of Metadata Query

• Consider the query “sudarshan paper”

[Figure: from the “sudarshan” node, the search reaches writes table nodes and continues to the paper table by forward search.]

Conclusions and Future Work

• The next big wave: keyword searching and browsing of databases?
• Future work:
  - Keyword queries on XML
  - Disambiguating queries by selecting
    - Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
    - Tree structure: “coauthors” or “cites”
  - Boolean queries, stemming, thesaurus
  - Metadata: column/relation names

Thank You
