Keyword Searching and Browsing in Databases Using BANKS

Gaurav Bhalotia, Arvind Hulgeri, Charuta Nakhe,
Soumen Chakrabarti, S. Sudarshan
I.I.T. Bombay
11/25/2018 1
Motivation
 Keyword search of documents on the Web has been enormously successful
 Simple and intuitive, no need to learn any query language
 Database querying using keywords is desirable
 SQL is not appropriate for casual users
 Form interfaces are cumbersome:
 Require a separate form for each type of query, which is confusing for casual users of Web information systems
 Not suitable for ad hoc queries
Motivation
 Many Web documents are dynamically generated from databases
 E.g. Catalog data
 Keyword querying of generated Web documents
 May miss answers that need to combine information on different pages
 Suffers from duplication overheads
Examples of Keyword Queries
 On a railway reservation database
 “mumbai bangalore”
 On an e-store database
 “camcorder panasonic”
 On a book store database
 “sudarshan databases”
Differences from IR/Web Search
 Related data split across multiple tuples due to normalization
 E.g. Paper(paper-id, title, journal), Author(author-id, name), Writes(author-id, paper-id, position), Cites(citing-paper-id, cited-paper-id)
 Different keywords may match tuples from different relations
 What joins are to be computed can only be decided on the fly
Connectivity
 Tuples may be connected by
 Foreign key
 Implicit links (shared words), etc.
 Tuples belonging to the same relation
 Would like to find sets of (closely) connected tuples that match all given keywords
Basic Model
 Database: modeled as a graph
 Nodes = tuples
 Edges = references between tuples
 foreign keys, other kinds of relationships
 Edges are directed.
[Figure: example graph. Paper nodes "BANKS: Keyword search…" and "MultiQuery Optimization" are linked through writes nodes to author nodes Charuta, S. Sudarshan and Prasan Roy]
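The graph model on this slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the BANKS implementation: the `Graph` class, its method names and the tuple identifiers such as `paper:1` are hypothetical, and the sample tuples follow the Paper/Author/Writes schema from the earlier slide.

```python
from collections import defaultdict


class Graph:
    """Directed graph whose nodes are tuples and whose edges are references."""

    def __init__(self):
        self.nodes = {}                    # node id -> tuple contents
        self.out_edges = defaultdict(set)  # referencing node -> referenced nodes
        self.in_edges = defaultdict(set)   # referenced node -> referencing nodes

    def add_tuple(self, node_id, **attrs):
        self.nodes[node_id] = attrs

    def add_reference(self, src, dst):
        # Edge in the direction of the foreign key: src references dst.
        self.out_edges[src].add(dst)
        self.in_edges[dst].add(src)


g = Graph()
g.add_tuple("paper:1", title="MultiQuery Optimization")
g.add_tuple("author:1", name="S. Sudarshan")
g.add_tuple("author:2", name="Prasan Roy")
g.add_tuple("writes:1")
g.add_tuple("writes:2")

# Each writes tuple holds foreign keys to one author and one paper.
for w, a in (("writes:1", "author:1"), ("writes:2", "author:2")):
    g.add_reference(w, a)
    g.add_reference(w, "paper:1")

print(sorted(g.in_edges["paper:1"]))
```

Keeping both `out_edges` and `in_edges` makes it cheap to look up the indegree of a node, which the later slides use for edge and node weights.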
Answer Example
Query: sudarshan roy
[Figure: answer tree rooted at the paper "MultiQuery Optimization", connected through two writes nodes to the authors S. Sudarshan and Prasan Roy]
Edge Directionality
 Some popular tuples are connected to many other tuples
 E.g. Students -> departments -> university
 Popular tuples would create misleading shortcuts from every tuple to every other
 E.g. every student would be closely linked with every other student via the department/university
 Solution: define different forward and backward edge weights
 Forward edges: In the direction of the foreign key reference
Edge Weight
 Weight of forward edge based on schema
 e.g. citation link weights > writes link weights
 Weight of backward edge = number of edges pointing to the node (its indegree)
[Figure: example graph with edges labeled with weights 3 and 1]
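A minimal sketch of the weighting rule on this slide, assuming every forward edge also gets a backward edge in the opposite direction. The numeric values in `FORWARD_WEIGHT` are hypothetical (the slide only says that citation links weigh more than writes links), and the backward weight is taken literally as the indegree of the referenced node.

```python
from collections import Counter

# Forward edges: (referencing tuple, referenced tuple, link type).
forward_edges = [
    ("writes:1", "paper:1", "writes"),
    ("writes:2", "paper:1", "writes"),
    ("writes:3", "paper:1", "writes"),
    ("paper:2", "paper:1", "cites"),
]

# Hypothetical schema-level weights; only the ordering cites > writes is from the slide.
FORWARD_WEIGHT = {"writes": 1.0, "cites": 2.0}

# Indegree of each referenced node, counted over forward edges.
indegree = Counter(dst for _, dst, _ in forward_edges)


def edge_weights(edges):
    """Return {(u, v): weight} covering forward and backward edges."""
    weights = {}
    for src, dst, link in edges:
        weights[(src, dst)] = FORWARD_WEIGHT[link]   # forward edge, from the schema
        weights[(dst, src)] = float(indegree[dst])   # backward edge, from the indegree
    return weights


w = edge_weights(forward_edges)
print(w[("writes:1", "paper:1")], w[("paper:1", "writes:1")])
```

Because `paper:1` is referenced four times, leaving it along a backward edge costs 4.0, while reaching it along a writes edge costs only 1.0. This asymmetry is what discourages shortcuts through popular tuples.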
Edge Weight Scaling
 Problem: Some backward edges have unduly large weights
 Scale edge weights by using log(1+raw-edgeweight)
 total-edge-weight = Σ edge-weights
 Edge score E = 1 / total-edge-weight
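The scaling and the edge score can be written out directly. The sketch below assumes the natural logarithm (the slide does not name a base) and assumes the answer tree has at least one edge, since a tree with no edges would make the total weight zero. The function names are illustrative.

```python
import math


def scaled_weight(raw_weight):
    """Dampen large raw weights with log(1 + raw weight)."""
    return math.log(1 + raw_weight)


def edge_score(raw_weights):
    """E = 1 / (sum of scaled edge weights) over the edges of an answer tree."""
    total = sum(scaled_weight(w) for w in raw_weights)
    return 1.0 / total


tree_a = [1, 1]     # two light forward edges
tree_b = [1, 100]   # one light edge and one heavy backward edge
print(edge_score(tree_a) > edge_score(tree_b))
```

The tree with the heavy backward edge still scores lower, but the logarithm keeps a raw weight of 100 from dominating the total the way it would unscaled.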
Node Weight
 Nodes have prestige weights too
 Observation: nodes with intuitively greater prestige tend to have greater indegree
 Set node weight = indegree
 Problem: Nodes with many in-edges result in skewed answers
 Subdue extreme node weights by using log(1+indegree)
 Node score N = root-node-weight + Σ leaf-node-weights
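As a small worked sketch of the node score, assuming the natural logarithm and illustrative function names: each node's weight is log(1 + indegree), and the score of an answer tree adds the root's weight to the weights of its leaves.

```python
import math


def node_weight(indegree):
    """Prestige weight of a node, damped as log(1 + indegree)."""
    return math.log(1 + indegree)


def node_score(root_indegree, leaf_indegrees):
    """N = root node weight + sum of leaf node weights."""
    return node_weight(root_indegree) + sum(node_weight(d) for d in leaf_indegrees)


# A root with indegree 3 and a single leaf with indegree 1:
# log(4) + log(2) = log(8)
print(node_score(3, [1]))
```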
Combining Scores
 Problem: how to combine two independent metrics: node weight and edge weight
 Normalize each to 0-1
 Combine using weighting factor λ
 Additive: (1-λ) E + λ N
 Multiplicative: E N
 Performance study to compare alternatives and to find reasonable values for λ
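The two combination rules can be compared on a handful of candidate answers. This sketch makes two assumptions that the slide does not spell out: scores are normalized to the 0-1 range by dividing by the maximum over the candidates, and the sample scores are invented for illustration. The value λ = 0.2 is the one reported as best later in this deck.

```python
def normalize(values):
    """Scale non-negative scores into [0, 1] by dividing by the maximum."""
    top = max(values)
    return [v / top if top > 0 else 0.0 for v in values]


def additive(e, n, lam):
    """(1 - lambda) * E + lambda * N"""
    return (1 - lam) * e + lam * n


def multiplicative(e, n):
    """E * N, as written on the slide."""
    return e * n


edge_scores = normalize([0.5, 0.25, 0.1])   # made-up edge scores for three answers
node_scores = normalize([1.0, 4.0, 2.0])    # made-up node scores for the same answers
lam = 0.2

combined = [additive(e, n, lam) for e, n in zip(edge_scores, node_scores)]
ranking = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
print(ranking)
```

With a small λ, proximity dominates: the first answer ranks highest even though the second has the greatest prestige.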
The BANKS Answer Model
 Query: set of keywords {k1, k2, .., kn}
 Each keyword ki matches set of nodes Si
 Answer: rooted, directed tree connecting nodes, with one node from each Si
 Root node (also referred to as the Information Node) has special significance, may be restricted to some relations
 E.g. relations representing entities, not relationships
 May include intermediate nodes not in any Si, and hence is a Steiner tree
 Multiple answers
 Ranking based on proximity + prestige
Finding Answer Trees
 Computation of minimum weight Steiner trees: NP-complete
 Backward Expanding Search Algorithm:
 Intuition: find vertices from which a forward path exists to at least one node from each Si
 Run concurrent single-source shortest-path algorithm from each node matching a keyword
 Create an iterator for each node matching a keyword
 Traverse the graph edges in reverse direction
 Output a node whenever it is in the intersection of the sets of nodes reached from each keyword
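The steps above can be sketched as follows. This is a simplified illustration rather than the BANKS implementation: a single global priority queue stands in for the concurrent per-node iterators, the function only reports candidate roots with the distances from each keyword node (it does not build or rank the trees), and the names `backward_expanding_search` and `reverse_adj` and the sample weights are assumptions.

```python
import heapq
from collections import defaultdict


def backward_expanding_search(reverse_adj, keyword_nodes):
    """Yield (root, origins) once `root` has been reached from every keyword.

    reverse_adj: {v: [(u, weight), ...]} with one entry per directed edge u -> v.
    keyword_nodes: list of sets S_1..S_n, the nodes matching each keyword.
    """
    n = len(keyword_nodes)
    heap = []          # entries: (distance, origin, keyword index, node)
    settled = set()    # (origin, keyword index, node) triples already expanded
    origins = defaultdict(lambda: [dict() for _ in range(n)])  # v -> per-keyword {origin: dist}
    emitted = set()
    for i, nodes in enumerate(keyword_nodes):
        for origin in nodes:
            heapq.heappush(heap, (0.0, origin, i, origin))
    while heap:
        dist, origin, i, v = heapq.heappop(heap)   # best-first across all iterators
        if (origin, i, v) in settled:
            continue
        settled.add((origin, i, v))
        origins[v][i][origin] = dist
        if v not in emitted and all(origins[v]):
            emitted.add(v)                          # v is reached from every keyword
            yield v, origins[v]
        for u, w in reverse_adj.get(v, []):         # follow edges in reverse
            if (origin, i, u) not in settled:
                heapq.heappush(heap, (dist + w, origin, i, u))


# Forward edges (weight 1) and their backward counterparts (weight = indegree).
edges = [
    ("writes:1", "author:1", 1), ("author:1", "writes:1", 1),
    ("writes:1", "paper:1", 1), ("paper:1", "writes:1", 2),
    ("writes:2", "author:2", 1), ("author:2", "writes:2", 1),
    ("writes:2", "paper:1", 1), ("paper:1", "writes:2", 2),
]
reverse_adj = defaultdict(list)
for u, v, w in edges:
    reverse_adj[v].append((u, w))

# Query "sudarshan roy": one matching author node per keyword.
search = backward_expanding_search(reverse_adj, [{"author:1"}, {"author:2"}])
first_root, reached = next(search)
print(first_root, reached)
```

On this toy graph the first root produced is the paper node, which has a forward path to both authors through the writes nodes.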
Finding Answer Trees
 For each vertex v visited, maintain a nodelist v.Li for each search term ti
 Update the ith nodelist when the search starting from a vertex u ∈ Si reaches the vertex v
 The new result trees produced correspond to the cross product {u} × ∏ v.Lj over all j ≠ i
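A short sketch of the cross product on this slide: when the search from origin u (matching keyword i) reaches a vertex, u is combined with every choice of previously recorded origins for the other keywords. The function name and the sample node ids are hypothetical.

```python
from itertools import product


def new_result_combinations(u, i, node_lists):
    """Combine the new origin u (keyword i) with the recorded origins of every keyword j != i."""
    others = [node_lists[j] for j in range(len(node_lists)) if j != i]
    if any(len(lst) == 0 for lst in others):
        return []              # some keyword has not reached this vertex yet
    combos = []
    for rest in product(*others):
        combo = list(rest)
        combo.insert(i, u)     # put u back at keyword position i
        combos.append(tuple(combo))
    return combos


# Node lists at a vertex for three keywords, just before u = "b1" arrives for keyword 1.
v_lists = [["a1", "a2"], [], ["c1"]]
print(new_result_combinations("b1", 1, v_lists))
# prints [('a1', 'b1', 'c1'), ('a2', 'b1', 'c1')]
```

Each returned tuple names one origin per keyword and so identifies one new answer tree rooted at the vertex.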
Backward Expanding Search
Query: sudarshan roy
[Figure: backward expanding search from the author nodes S. Sudarshan and Prasan Roy, through writes nodes, to the paper node "MultiQuery Optimization"]
Result Ordering
 Answer trees may not be generated in relevance order
 Solution:
 Best-first search across all iterators, based on path length
 Output answers to a buffer
 Eliminate duplicates: Isomorphic Trees
 Output highest ranked answer from buffer to user when buffer is full
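The buffering scheme can be sketched with a heap. This is a minimal illustration under stated assumptions: each answer arrives as a (score, tree) pair, `tree` is a hashable canonical form so that isomorphic duplicates compare equal (how to compute that form is not shown), and the best buffered answer is released once the buffer holds more than `buffer_size` entries.

```python
import heapq


def ranked_output(answers, buffer_size):
    """Yield (score, tree) pairs in approximate relevance order via a bounded buffer."""
    heap, seen = [], set()
    for score, tree in answers:
        if tree in seen:
            continue                          # duplicate answer tree, drop it
        seen.add(tree)
        heapq.heappush(heap, (-score, tree))  # negate the score: heapq is a min-heap
        if len(heap) > buffer_size:
            neg, best = heapq.heappop(heap)   # buffer is full, release the best answer
            yield -neg, best
    while heap:                               # flush the rest in rank order
        neg, best = heapq.heappop(heap)
        yield -neg, best


# Answers in generation order, which is not relevance order; "B" is generated twice.
generated = [(0.5, "A"), (0.9, "B"), (0.7, "C"), (0.9, "B"), (0.8, "D")]
print(list(ranked_output(generated, buffer_size=2)))
# prints [(0.9, 'B'), (0.8, 'D'), (0.7, 'C'), (0.5, 'A')]
```

A larger buffer gives an ordering closer to true relevance order at the cost of a longer wait before the first answer appears.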
The BANKS System
 BANKS provides keyword search coupled with extensive browsing facilities
 Schema browsing + data browsing
 Graphical display of data
 Implemented using Java + servlets
 Keyword search response times typically 1 to 3 seconds on
 DBLP database with 100,000 tuples/300,000 edges
 P3 600 MHz, 512 MB RAM
 Try it out at www.cse.iitb.ac.in/banks/
The BANKS Architecture
[Figure: User <-> (HTTP) <-> BANKS Web Server + Servlets <-> (JDBC) <-> Database]
 Connects to any database using JDBC
 JDBC metadata features used to provide schema browsing
 No programming needed for customization
 Minimal preprocessing of database to create indices and give weights to links
 Extensive set of browsing features
Browsing Features
 Hyperlinks are automatically added to all displayed results
 Template facilities to do a variety of tasks
 Browsing data by grouping and creating crosstabs
 e.g., theses grouped by department and year
 Hierarchical views of data
 Nested XML style, even on relational data
 Graphical displays
 Bar charts, pie charts, etc.
Example of Browsing in BANKS
BANKS Query Result Example
 Result of “Soumen Sunita”
Anecdotes
 “Mohan”
 Returns C. Mohan at top based on prestige (number of papers written)
 “Transaction”
 Returns Jim Gray’s classic paper and textbook as top answers based on prestige (number of citations)
 “Sunita Seltzer”
 No common papers, but both have papers with Stonebraker: system finds this connection
Effect of Parameters
 Log scaling of edge weights worked well
 (1-λ) E + λ N versus E N: made little difference
 Best with λ = 0.2 (subdue node weights but not entirely)
Related Work
 DataSpot (DTL)/Mercado Intuifind [VLDB 98]
 Based on patent by Palmon (filed 1995, granted 1998)
 Similar answer model to ours
 Differences: our model of backward link weights and prestige
 Proximity Search [VLDB 98]
 Different model of proximity
 No edge weights, prestige, different evaluation algorithm
 Information units (linked Web pages) [WWW10]
 No directionality, only studied in Web context
 Microsoft DBXplorer
 No ranking, based on SQL generation
 Addresses efficient construction of text indexes
Some Extensions to BANKS
 Searching for similar results: Template Search
 define the notion of similarity between two result trees
 perform the restricted search
 Efficiently handling metadata queries
 starting the search from each of the tuples in a table is too costly
Template Search
 Feedback in terms of result tree
 Type of a result tree defined in terms of
 type of nodes
 the table to which the node belongs
 type of edges:
 the type of nodes which it connects
 the link information, e.g. ‘cites’ and ‘cited’ link between two papers
 Which nodes to start the search from
 only the chosen nodes
 all the nodes corresponding to a particular keyword
Template Search
 Start the backward search only from the allowed set of nodes
 Follow the edges as defined by the result type
 Example: Consider the query “sudarshan database”
 Two types of results for the above query
 papers written by Professor Sudarshan
 papers cited by papers written by Professor Sudarshan
 Two result types distinguished by whether to follow the cites/cited link from a paper node
Metadata Keyword Queries
 Metadata keywords: match all the tuples of a relation
 Too costly to start the search from each of the tuples of a table
 First-cut approach: start the forward search from the information node for the non-metadata keywords
 Selectively choose the nodes from where to start the forward search
Example of Metadata Query
 Consider the query “sudarshan paper”
[Figure: the keyword "sudarshan" matches a node; from the connected writes table nodes, a forward search proceeds to the paper table]
Conclusions and Future Work
The next big wave: keyword searching and browsing of databases?
Future work:
 Keyword queries on XML
 Disambiguating queries by selecting
 Nodes: G.W.Bush: “Bush Jr” or “Bush Sr”
 Tree structure: “coauthors” or “cites”
 Boolean queries, stemming, thesaurus
 Metadata: column/relation names
Thank You