
Information Retrieval

-Dr. Rahul Dixit

Dr Rahul Dixit
IIIT Pune
Wildcard queries

 User is unsure about spelling

 User wants to capture alternative spellings

 User wants to control stemming


Ex: terror* = terror, terrorist, terrorism, terrorizing, …

 Phrase searching: Ex: “IIIT Pune”

 Stemming-based searching: Ex: eat* = eats, eaten, eating.

 Wildcard searching: Ex: defen?e = defence, defense.

Wildcard queries

 Wildcard queries are used in any of the following situations:

 Sydney vs. Sidney, which leads to the wildcard query S*dney

 color vs. colour, which leads to the wildcard query colo*r

 judicial vs. judiciary, which leads to the wildcard query judicia*

 Wildcard operators: *, ?, @, [ ], ^, %, _, -, <>
 MarkLogic Server supports two wildcards: * and ?

 * : Matches zero or more non-space characters

 ? : Matches exactly one non-space character

Ex: he* = he, her, help, hello, etc.
he? = her, hen, etc.
s?t = sat, sit, etc.
Which of the following words are relevant for the wildcard term s*t?
1) Secret
2) “Signoverdocument”
3) “Sailingboat”
4) Striker
s*t=
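The * and ? rules above can be checked mechanically. The following is a minimal sketch, not MarkLogic's own implementation: it translates a pattern into a regular expression, and it assumes case-insensitive matching. The helper names wildcard_to_regex and matches are hypothetical.

```python
import re

def wildcard_to_regex(pattern: str) -> re.Pattern:
    """Translate a pattern that uses * and ? into a compiled regex.

    * becomes "zero or more non-space characters",
    ? becomes "exactly one non-space character".
    Case-insensitive matching is an assumption of this sketch.
    """
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(r"\S*")
        elif ch == "?":
            parts.append(r"\S")
        else:
            parts.append(re.escape(ch))  # every other character is literal
    return re.compile("".join(parts), re.IGNORECASE)

def matches(pattern: str, word: str) -> bool:
    """True if the whole word matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(word) is not None

candidates = ["Secret", "Signoverdocument", "Sailingboat", "Striker"]
hits = [w for w in candidates if matches("s*t", w)]
print(hits)  # ['Secret', 'Signoverdocument', 'Sailingboat']
```

Under these assumptions, Striker is rejected because it does not end in t. A word containing a space would not match s*t either, since * covers only non-space characters.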

 - : Represents a range of characters inside [ ]
Ex: c[a-e]t = cat, cbt, cct, cdt, cet
 SQL Server supports: %, _, -, [ ], ^

 % : Represents zero or more characters

Ex: bl% = bl, black, blue
 _ : Represents exactly one character
Ex: h_t = hit, hat, ??
 [ ] : Represents any single character listed within the [ ]
Ex: h[oa]t = hot, hat (what about hit?)
 ^ : Represents any single character not listed within the [ ]
Ex: h[^oa]t = hit, but not hot or hat
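The SQL-style operators can be tried out without a database. The sketch below translates a LIKE-style pattern into a Python regular expression. It is an illustration only, not SQL Server's matcher: like_to_regex is a hypothetical helper, escape clauses are not handled, and case-insensitive matching is assumed (in SQL Server this depends on the collation).

```python
import re

def like_to_regex(pattern: str) -> re.Pattern:
    """Translate %, _, [ ], [^ ] and ranges such as [a-e] into a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "%":
            parts.append(".*")          # zero or more characters
        elif ch == "_":
            parts.append(".")           # exactly one character
        elif ch == "[":
            end = pattern.find("]", i + 1)
            body = pattern[i + 1:end] if end != -1 else ""
            negate = body.startswith("^")
            members = body[1:] if negate else body
            if members:
                # keep "-" unescaped so that a-e stays a range
                escaped = "".join(c if c == "-" else re.escape(c) for c in members)
                parts.append("[" + ("^" if negate else "") + escaped + "]")
                i = end                 # skip past the closing bracket
            else:
                parts.append(re.escape(ch))  # malformed class: literal "["
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts), re.IGNORECASE)

def like(pattern: str, word: str) -> bool:
    """True if the whole word matches the LIKE-style pattern."""
    return like_to_regex(pattern).fullmatch(word) is not None

words = ["hit", "hot", "hat", "heat"]
print([w for w in words if like("h_t", w)])      # ['hit', 'hot', 'hat']
print([w for w in words if like("h[oa]t", w)])   # ['hot', 'hat']
print([w for w in words if like("h[^oa]t", w)])  # ['hit']
```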
Wildcard queries
 @ : Matches one or more occurrences of the previous character
Ex: lo@t = lot, loot
ful@ : ????
 Types of wildcard queries:
 Trailing (prefix) wildcard queries
 Leading (suffix) wildcard queries
 Inner wildcard queries

 Trailing (prefix) wildcard query: the * symbol occurs only once, at the end of the term.
Ex: mon* = Monday, monkey

Wildcard queries
 Leading (suffix) wildcard query: the * symbol occurs only once, at the beginning of the term.
Ex: *mon = common, lemon

 Inner wildcard query: the * symbol occurs inside the term or word.
Ex: wo*d = word, wood
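The three query types can be told apart by where the single * sits. The sketch below classifies a query and then answers it with a plain linear scan of a small lexicon. The lexicon and the helper names are hypothetical, and a scan like this is only a baseline for illustration, not how a search engine would answer such queries at scale.

```python
def split_wildcard(query: str) -> tuple[str, str]:
    """Split a query with exactly one * into (prefix, suffix)."""
    prefix, star, suffix = query.partition("*")
    if not star or "*" in suffix:
        raise ValueError("expected exactly one * in the query")
    return prefix, suffix

def wildcard_kind(query: str) -> str:
    """Name the query type according to the position of the *."""
    prefix, suffix = split_wildcard(query)
    if not suffix:
        return "trailing"   # mon*
    if not prefix:
        return "leading"    # *mon
    return "inner"          # wo*d

def matching_terms(query: str, lexicon: list[str]) -> list[str]:
    """Naive scan: keep terms that start with the prefix and end with the suffix."""
    prefix, suffix = split_wildcard(query.lower())
    return [
        term for term in lexicon
        if len(term) >= len(prefix) + len(suffix)  # prefix and suffix must not overlap
        and term.lower().startswith(prefix)
        and term.lower().endswith(suffix)
    ]

# Hypothetical lexicon built from the words used on these slides.
lexicon = ["Monday", "monkey", "common", "lemon", "word", "wood", "window"]
for q in ["mon*", "*mon", "wo*d"]:
    print(q, wildcard_kind(q), matching_terms(q, lexicon))
```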

Wildcard queries
How can we find all terms matching the wildcard query pro*cent?

Handling wildcard queries in IR
Ex: W.Q (wildcard query) = gen* universit*
This is processed as: gen* AND universit*
Possibilities: (geneva AND university) OR
(geneve AND university) OR
(geneva AND universite) OR
(geneve AND universite) OR
(general AND university) OR
(general AND universite) OR
……….
(1) Processing the above wildcard query with the help of a term-document incidence matrix
~ Expensive: every combination of matching terms becomes one AND clause
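The growth of the expanded Boolean query can be seen with a small sketch. It assumes a tiny hypothetical lexicon and a helper expand_prefix. Each wildcard is replaced by every matching term, and the clauses are the cross product of the two expansions.

```python
from itertools import product

# Hypothetical lexicon; a real vocabulary would be far larger.
lexicon = ["general", "geneva", "geneve", "pune", "universite", "university"]

def expand_prefix(prefix: str, terms: list[str]) -> list[str]:
    """All lexicon terms that match the trailing wildcard query prefix*."""
    return [t for t in terms if t.startswith(prefix)]

gen_terms = expand_prefix("gen", lexicon)        # matches for gen*
uni_terms = expand_prefix("universit", lexicon)  # matches for universit*

# One AND clause per pair of expansions, all joined by OR.
pairs = list(product(gen_terms, uni_terms))
boolean_query = " OR ".join(f"({a} AND {b})" for a, b in pairs)

print(len(pairs))  # 6
print(boolean_query)
```

With m matches for the first wildcard and n for the second, the expanded query has m × n clauses, which is why this approach becomes expensive on a realistic vocabulary.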
K-Gram indexes
 A k-gram is a sequence of k characters
 k = 2: bigram: look at two characters at a time
 k = 3: trigram: look at three characters at a time
 k = 4: tetragram: look at four characters at a time
 Add $ at the beginning and end of the term
 Example: Calculate the bigrams for the given term school
 Step 1: $school$
 Step 2: k = 2
$school$ = {$s, sc, ch, ho, oo, ol, l$}
 Calculate the trigrams for the given term school

K-Gram indexes
 Example: Calculate the bigrams for “August is the cruelest month”
Step 1: $August$ $is$ $the$ $cruelest$ $month$
Step 2 (k = 2): {} + {} + {} + {} + {}
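The k-gram exercises can be checked with a few lines of code. This is a minimal sketch: the function name kgrams is hypothetical, and the original letter case is kept, which is an assumption (an index would usually normalise case first).

```python
def kgrams(term: str, k: int = 2) -> list[str]:
    """Pad the term with $ on both sides and slide a window of k characters."""
    padded = f"${term}$"
    return [padded[i:i + k] for i in range(len(padded) - k + 1)]

print(kgrams("school"))       # ['$s', 'sc', 'ch', 'ho', 'oo', 'ol', 'l$']
print(kgrams("school", k=3))  # ['$sc', 'sch', 'cho', 'hoo', 'ool', 'ol$']

# For a sentence, each word is padded and processed separately.
sentence = "August is the cruelest month"
sentence_bigrams = {word: kgrams(word) for word in sentence.split()}
for word, grams in sentence_bigrams.items():
    print(word, grams)
```

A term of n characters yields n + 1 bigrams once the two $ symbols are added.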

Query processing
1. $ is a special word-boundary symbol
2. Maintain an inverted index from bigrams to the
terms that contain each bigram
Ex: on → common, lemon, Monday, monkey

3. The wildcard query is then run as a Boolean query over its bigrams
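The inverted index described in step 2 can be sketched as a dictionary from each bigram to the set of terms containing it. The lexicon below is hypothetical, terms are lower-cased as an assumption, and build_kgram_index is an illustrative name.

```python
from collections import defaultdict

def kgrams(term: str, k: int = 2) -> set[str]:
    """k-grams of a term after adding the $ boundary symbol on both sides."""
    padded = f"${term}$"
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}

def build_kgram_index(lexicon: list[str], k: int = 2) -> dict[str, set[str]]:
    """Map every k-gram to the set of lexicon terms that contain it."""
    index = defaultdict(set)
    for term in lexicon:
        for gram in kgrams(term.lower(), k):
            index[gram].add(term.lower())
    return index

# Hypothetical lexicon for illustration.
lexicon = ["Monday", "monkey", "moon", "common", "lemon"]
index = build_kgram_index(lexicon)

print(sorted(index["$m"]))  # ['monday', 'monkey', 'moon']
print(sorted(index["on"]))  # ['common', 'lemon', 'monday', 'monkey', 'moon']
```

Each posting list holds vocabulary terms, not documents. The matching terms are looked up afterwards in the ordinary term-to-document inverted index.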

Handling the wildcard query in IR
K-Gram indexes

 How does such an index help us with wildcard queries?

Example: W.Q = mon*

Step 1: $mon*$ (k = 2 by default)
Step 2: {$m, mo, on} (the piece after *, a lone $, is too short to form a bigram)
Step 3: $m AND mo AND on
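The three steps can be put together in a short sketch. It assumes a small hypothetical lexicon, and the names query_kgrams and wildcard_lookup are illustrative. One point the steps leave implicit: the Boolean AND over bigrams can return false positives (moon contains $m, mo and on), so the candidates are checked against the original query in a final filtering pass.

```python
import re
from collections import defaultdict

def kgrams(text: str, k: int = 2) -> set[str]:
    """All k-character windows of text (empty if text is shorter than k)."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}

def build_kgram_index(lexicon: list[str], k: int = 2) -> dict[str, set[str]]:
    """Map every k-gram of $term$ to the terms that contain it."""
    index = defaultdict(set)
    for term in lexicon:
        for gram in kgrams(f"${term}$", k):
            index[gram].add(term)
    return index

def query_kgrams(query: str, k: int = 2) -> set[str]:
    """Steps 1 and 2: pad with $, split at each *, take k-grams of every piece."""
    grams = set()
    for piece in f"${query}$".split("*"):
        grams |= kgrams(piece, k)
    return grams

def wildcard_lookup(query: str, lexicon: list[str], index, k: int = 2) -> list[str]:
    """Step 3 plus post-filtering of false positives."""
    candidates = set(lexicon)
    for gram in query_kgrams(query, k):
        candidates &= index.get(gram, set())   # Boolean AND over the k-grams
    check = re.compile(".*".join(re.escape(p) for p in query.split("*")))
    return sorted(t for t in candidates if check.fullmatch(t))

# Hypothetical lower-cased lexicon.
lexicon = ["monday", "monkey", "moon", "common", "lemon", "canon", "cannon"]
index = build_kgram_index(lexicon)

print(sorted(query_kgrams("mon*")))            # ['$m', 'mo', 'on']
print(wildcard_lookup("mon*", lexicon, index)) # ['monday', 'monkey']
```

The same function handles inner wildcards: for ca*on the pieces are $ca and on$, which give the bigrams $c, ca, on and n$.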

W.Q: CA*ON
