Indexing and Searching

Modern Information Retrieval

by R. Baeza-Yates and B. Ribeiro-Neto
Chapter 8

Outline
• Inverted Files
• Other Indices for Text
• Sequential Searching
• Pattern Matching
• Compression

Inverted Files
• An inverted file (or inverted index) is a word-oriented mechanism for indexing a text collection in order to speed up the searching task.
• Structure: vocabulary (the distinct words) and occurrences (the positions where each word appears)
• Block addressing: the text is divided into blocks, and the occurrences point to the blocks
• Full inverted indices: the occurrences are exact positions
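
As an illustration of the two addressing schemes, the following minimal Python sketch builds both kinds of index for a small sample text. The function names, the tokenizer, and the 16-character block size are assumptions made for this example.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Yield (word, character_offset) pairs; words are lowercased (assumed tokenizer)."""
    for match in re.finditer(r"[A-Za-z0-9]+", text):
        yield match.group().lower(), match.start()


def build_full_index(text):
    """Full inverted index: each word maps to its exact 0-based character offsets."""
    index = defaultdict(list)
    for word, offset in tokenize(text):
        index[word].append(offset)
    return dict(index)


def build_block_index(text, block_size):
    """Block addressing: each word maps to the numbers of the blocks that contain it."""
    index = defaultdict(list)
    for word, offset in tokenize(text):
        block = offset // block_size
        # Store each block only once per word; blocks arrive in increasing order.
        if not index[word] or index[word][-1] != block:
            index[word].append(block)
    return dict(index)


text = "This is a text. A text has many words. Words are made from letters."
full = build_full_index(text)
blocks = build_block_index(text, block_size=16)  # block size chosen arbitrarily
```

Here `full["text"]` holds the exact offsets 10 and 18, while `blocks["text"]` only records blocks 0 and 1: the block index is smaller, but the block must be scanned to find the exact position.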

Inverted Files
• The search algorithm on an inverted index
   - Vocabulary search
   - Retrieval of occurrences
   - Manipulation of occurrences
• Construction (split the index into two files)
   - Posting file: the lists of occurrences are stored contiguously
   - The vocabulary is stored in lexicographical order, and each word points to its list.
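
The three search steps and the two-part layout can be sketched as follows. This is only a minimal in-memory illustration: the two "files" are plain Python lists, and the function names and toy data are assumptions made for the example.

```python
from bisect import bisect_left


def build_two_part_index(word_positions):
    """Split an index into a sorted vocabulary and one contiguous postings list.

    Each vocabulary entry is (word, start, length) and points into `postings`.
    """
    vocabulary, postings = [], []
    for word in sorted(word_positions):
        occurrences = word_positions[word]
        vocabulary.append((word, len(postings), len(occurrences)))
        postings.extend(occurrences)
    return vocabulary, postings


def lookup(vocabulary, postings, word):
    """Vocabulary search (binary search), then retrieval of the occurrences."""
    # (word,) sorts just before (word, start, length), so bisect lands on the entry.
    i = bisect_left(vocabulary, (word,))
    if i == len(vocabulary) or vocabulary[i][0] != word:
        return []
    _, start, length = vocabulary[i]
    return postings[start:start + length]


def phrase(vocabulary, postings, first, second):
    """Manipulation of occurrences: positions where `second` directly follows `first`."""
    followers = set(lookup(vocabulary, postings, second))
    return [p for p in lookup(vocabulary, postings, first) if p + 1 in followers]


# Hypothetical toy data: word -> 0-based word positions in a small text.
toy = {"text": [3, 5], "many": [7], "words": [8, 9], "a": [2, 4]}
vocabulary, postings = build_two_part_index(toy)
```

With this data, `lookup(vocabulary, postings, "text")` returns `[3, 5]` and `phrase(vocabulary, postings, "many", "words")` returns `[7]`.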
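Inverted Files
• For large texts
   - Partial indices: an index is built for each piece of the text that fits in main memory, and the partial indices are then merged
   - Merging two indices consists of merging the sorted vocabularies and, whenever the same word appears in both, merging its lists of occurrences.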
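
A minimal sketch of the merge step, assuming each partial index is a list of (word, occurrences) pairs sorted by word, and that the second index covers a later part of the text than the first. The representation and the function name are assumptions made for this example.

```python
def merge_indices(left, right):
    """Merge two partial indices given as sorted lists of (word, occurrences).

    Assumes every occurrence in `right` is larger than every occurrence in
    `left`, so the lists of a shared word can simply be concatenated.
    """
    merged, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        left_word, left_occ = left[i]
        right_word, right_occ = right[j]
        if left_word == right_word:
            merged.append((left_word, left_occ + right_occ))
            i += 1
            j += 1
        elif left_word < right_word:
            merged.append((left_word, left_occ))
            i += 1
        else:
            merged.append((right_word, right_occ))
            j += 1
    # At most one of the two inputs still has entries left.
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged


# Hypothetical partial indices for two consecutive pieces of a text.
first_half = [("a", [2]), ("text", [3])]
second_half = [("many", [7]), ("text", [5])]
merged = merge_indices(first_half, second_half)
```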

Other Indices for Text
• Suffix Trees
• Suffix Arrays
• Signature Files

Suffix Trees and Suffix Arrays
• Each position in the text is considered as a text suffix
• Index points are selected from the text; they mark the beginnings of the text positions that will be retrievable
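
To make the idea of index points concrete, here is a minimal sketch that takes word beginnings as index points and sorts them by the suffix that starts at each one. Choosing word beginnings, the case-sensitive comparison, and the naive sort (which copies whole suffixes and is only reasonable for small texts) are all assumptions of this example.

```python
import re


def word_index_points(text):
    """Index points: the offsets where each word begins (an assumed choice)."""
    return [m.start() for m in re.finditer(r"[A-Za-z0-9]+", text)]


def build_suffix_array(text, index_points):
    """Sort the index points by the text suffix that starts at each of them."""
    return sorted(index_points, key=lambda p: text[p:])


text = "This is a text. A text has many words. Words are made from letters."
points = word_index_points(text)
suffix_array = build_suffix_array(text, points)
```

Each entry of `suffix_array` is a pointer into the text, and reading the suffixes in array order gives them in lexicographical order.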

Suffix Arrays
• The main drawback of suffix arrays is their costly construction process.
• They allow binary searches, done by comparing the contents of each pointer.
• Supra-indices (for large suffix arrays)
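
The binary search can be sketched as follows: two searches delimit the range of the array whose suffixes start with the pattern. The sketch indexes every character position and uses a naive construction; both choices, and the function names, are assumptions made to keep the example short. Supra-indices are not modeled.

```python
def build_suffix_array(text):
    """Naive construction over every character position (fine for small inputs)."""
    return sorted(range(len(text)), key=lambda p: text[p:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted start offsets of `pattern`, using two binary searches."""
    m = len(pattern)

    def first_index(accept):
        # Smallest array index whose suffix prefix satisfies the monotone predicate.
        lo, hi = 0, len(suffix_array)
        while lo < hi:
            mid = (lo + hi) // 2
            p = suffix_array[mid]
            if accept(text[p:p + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    start = first_index(lambda prefix: prefix >= pattern)
    end = first_index(lambda prefix: prefix > pattern)
    return sorted(suffix_array[start:end])


sample = "abracadabra"
sample_array = build_suffix_array(sample)
```

For example, `find_occurrences(sample, sample_array, "abra")` returns `[0, 7]`. Each comparison reads the text at the position a pointer refers to, which is what "comparing the contents of each pointer" amounts to.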

Construction of Suffix Arrays for Large Texts

Signature Files
• Word-oriented index structures based on hashing
• Maps words to bit masks of B bits
• Divides the text into blocks of b words each
• The mask Bi of a text block is obtained by bitwise ORing the signatures of all the words in the block.
• Hash the query word to a bit mask W
• If W & Bi = W, the text block may contain the word (false drops are possible, so the block must still be checked)
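
The test W & Bi = W can be sketched in a few lines. The signature width, the number of bits set per word, and the use of SHA-256 as the hash function are illustrative assumptions, as are the function names.

```python
import hashlib

B = 64             # signature width in bits (illustrative value)
BITS_PER_WORD = 3  # bits set per word signature (illustrative value)


def word_signature(word):
    """Hash a word to a B-bit mask with up to BITS_PER_WORD bits set."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    mask = 0
    for k in range(BITS_PER_WORD):
        mask |= 1 << (digest[k] % B)
    return mask


def build_signature_file(words, b):
    """Split the word list into blocks of b words and OR the signatures per block."""
    masks = []
    for start in range(0, len(words), b):
        mask = 0
        for word in words[start:start + b]:
            mask |= word_signature(word)
        masks.append(mask)
    return masks


def candidate_blocks(masks, word):
    """Blocks whose mask covers the query mask W; some may be false drops."""
    w = word_signature(word)
    return [i for i, mask in enumerate(masks) if (mask & w) == w]


words = "this is a text a text has many words words are made from letters".split()
masks = build_signature_file(words, b=4)
```

A block that really contains the word is always reported; a block that does not may still be reported, which is why candidates have to be verified against the text.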

Sequential Searching
• Brute Force
• Knuth-Morris-Pratt
• Boyer-Moore Family
• Shift-Or
• Suffix Automaton
   - Backward DAWG matching (BDM)
   - BNDM

Knuth-Morris-Pratt
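
As a reminder of the idea, here is a minimal sketch of Knuth-Morris-Pratt in its usual textbook form: a failure table computed from the pattern lets the scan move left to right over the text without ever backing up. The function names are assumptions of this example.

```python
def failure_function(pattern):
    """fail[i] = length of the longest proper border of pattern[:i + 1]."""
    fail = [0] * len(pattern)
    k = 0
    for i in range(1, len(pattern)):
        while k > 0 and pattern[i] != pattern[k]:
            k = fail[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        fail[i] = k
    return fail


def kmp_search(text, pattern):
    """Return all start offsets of pattern in text, reading each text character once."""
    if not pattern:
        return []
    fail = failure_function(pattern)
    matches, k = [], 0  # k = number of pattern characters currently matched
    for i, ch in enumerate(text):
        while k > 0 and ch != pattern[k]:
            k = fail[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            matches.append(i - k + 1)
            k = fail[k - 1]  # keep going to find overlapping matches
    return matches
```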

Boyer-Moore Family
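
One simple member of the family is the Horspool variant, sketched below: the window is compared right to left, and the shift depends only on the text character aligned with the last pattern position. This is a sketch of that one variant, not of the full Boyer-Moore algorithm, and the function name is an assumption.

```python
def horspool_search(text, pattern):
    """Return all start offsets of pattern in text (Boyer-Moore-Horspool variant)."""
    m, n = len(pattern), len(text)
    if m == 0 or m > n:
        return []
    # Shift for a character: distance from its last occurrence in pattern[:-1]
    # to the end of the pattern; characters not in pattern[:-1] shift by m.
    shift = {ch: m - 1 - i for i, ch in enumerate(pattern[:-1])}
    matches, pos = [], 0
    while pos <= n - m:
        j = m - 1
        while j >= 0 and text[pos + j] == pattern[j]:  # compare right to left
            j -= 1
        if j < 0:
            matches.append(pos)
        pos += shift.get(text[pos + m - 1], m)
    return matches
```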

Shift-Or
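
A minimal sketch of Shift-Or using Python integers as bit vectors: bit i of the state is 0 exactly when the first i + 1 pattern characters match the text ending at the current position. The function name is an assumption; a real implementation would keep the state in a machine word.

```python
def shift_or_search(text, pattern):
    """Return all start offsets of pattern in text using bit parallelism."""
    m = len(pattern)
    if m == 0:
        return []
    all_ones = (1 << m) - 1
    # masks[ch] has bit i cleared exactly when pattern[i] == ch.
    masks = {}
    for i, ch in enumerate(pattern):
        masks[ch] = masks.get(ch, all_ones) & ~(1 << i)
    state = all_ones  # no prefix matched yet
    matches = []
    for pos, ch in enumerate(text):
        # The shift brings in a 0 at bit 0; the OR cancels prefixes that ch breaks.
        state = ((state << 1) | masks.get(ch, all_ones)) & all_ones
        if (state & (1 << (m - 1))) == 0:
            matches.append(pos - m + 1)
    return matches
```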

Suffix Automaton

Pattern Matching
• Searching allowing errors
   - Dynamic Programming
   - Automaton
• Regular Expressions and Extended Patterns
• Pattern Matching Using Indices
   - Inverted files
   - Suffix Trees and Suffix Arrays

Dynamic Programming
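
A minimal sketch of the classical dynamic programming approach to searching with at most k errors (edit distance), keeping a single column that is updated for each text character. The function name and the choice to report match end positions are assumptions of this example.

```python
def approximate_search(text, pattern, k):
    """Return the end offsets (exclusive) where pattern matches with at most k errors."""
    m = len(pattern)
    # column[i] = fewest errors needed to match pattern[:i] against some
    # substring of the text ending at the current position.
    column = list(range(m + 1))
    ends = []
    for pos, ch in enumerate(text):
        previous_diagonal = column[0]  # value of column[i - 1] before this update
        column[0] = 0                  # a match may start at any text position
        for i in range(1, m + 1):
            current = column[i]        # value of column[i] before this update
            if pattern[i - 1] == ch:
                column[i] = previous_diagonal
            else:
                column[i] = 1 + min(column[i - 1], current, previous_diagonal)
            previous_diagonal = current
        if column[m] <= k:
            ends.append(pos + 1)
    return ends
```

For instance, searching the pattern "survey" in the text "surgery" with k = 2 reports matches ending after the characters "e", "r" and "y", that is, `[5, 6, 7]`.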

Automaton

Regular Expressions

Pattern Matching Using Indices
• Inverted Files
   - Queries such as suffix or substring queries, searching allowing errors, and regular expressions are solved by a sequential search over the vocabulary
   - The restriction is that approximate matches or regular expressions that span several words cannot be found
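
A minimal sketch of this strategy: the pattern is tested against every vocabulary word in turn, and the occurrences of the matching words are merged. The dictionary representation, the function name, and the toy index are assumptions made for the example.

```python
import re


def search_vocabulary(index, predicate):
    """Sequentially scan the vocabulary and merge the occurrences of matching words."""
    hits = []
    for word, occurrences in index.items():
        if predicate(word):
            hits.extend(occurrences)
    return sorted(hits)


# Hypothetical toy index: word -> 0-based character offsets in some text.
index = {
    "letters": [59],
    "made": [49],
    "many": [27],
    "text": [10, 18],
    "words": [32, 39],
}

substring_hits = search_vocabulary(index, lambda w: "ma" in w)
suffix_hits = search_vocabulary(index, lambda w: w.endswith("s"))
regex_hits = search_vocabulary(index, lambda w: re.fullmatch(r"m.*[ey]", w) is not None)
```

Because each predicate only ever sees one vocabulary word, a pattern that should match across a word boundary can never succeed, which is the restriction noted above.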

Pattern Matching Using Indices
• Suffix Trees
   - Suffix trees are able to perform complex searches
      - Word, prefix, suffix, substring, and range queries
      - Regular expressions
      - Unrestricted approximate string matching
   - Useful in specific areas
      - Find the longest repeated substring
      - Find the most common substring of a fixed size
Pattern Matching Using Indices
• Suffix Arrays
   - Some patterns can be searched directly in the suffix array without simulating the suffix tree
   - Word, prefix, suffix, subword search and range search

Compression
• Compressed text: Huffman coding
   - Taking words as symbols
   - Use an alphabet of bytes instead of bits
• Compressed indices
   - Inverted Files
   - Suffix Trees and Suffix Arrays
   - Signature Files
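
To illustrate Huffman coding with words as symbols, here is a minimal sketch. It builds an ordinary bit-oriented code for simplicity; the byte-oriented variant mentioned above is not modeled. The function names and the sample sentence are assumptions of this example, and separators between words are ignored.

```python
import heapq
from collections import Counter
from itertools import count


def huffman_codes(words):
    """Build a binary Huffman code whose symbols are whole words."""
    frequencies = Counter(words)
    if not frequencies:
        return {}
    if len(frequencies) == 1:
        return {next(iter(frequencies)): "0"}
    tie_breaker = count()  # unique counter so the heap never compares the dicts
    heap = [(freq, next(tie_breaker), {word: ""}) for word, freq in frequencies.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, codes1 = heapq.heappop(heap)
        f2, _, codes2 = heapq.heappop(heap)
        merged = {w: "0" + c for w, c in codes1.items()}
        merged.update({w: "1" + c for w, c in codes2.items()})
        heapq.heappush(heap, (f1 + f2, next(tie_breaker), merged))
    return heap[0][2]


def encode(words, codes):
    """Concatenate the codewords of the given words into one bit string."""
    return "".join(codes[w] for w in words)


def decode(bits, codes):
    """Decode a bit string; works because Huffman codes are prefix-free."""
    reverse = {c: w for w, c in codes.items()}
    out, current = [], ""
    for bit in bits:
        current += bit
        if current in reverse:
            out.append(reverse[current])
            current = ""
    return out


sample_words = "for each rose a rose is a rose".split()
codes = huffman_codes(sample_words)
```

Frequent words such as "rose" receive codewords no longer than those of rare words such as "for", and `decode(encode(sample_words, codes), codes)` recovers the original word sequence.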
