
What Is An “Inverted Index”?

Not many years ago, only Computer Science graduates knew what an “inverted index” was,
or what one might do with such a thing. Then along came Google, and now everyone uses
“inverted indexes” each time they “google” something on the World Wide Web (the Web).

This short paper is intended as the simplest of tutorials, giving people new to the
world of computers a better understanding of this important concept, which they use
every day when “googling”.

Book Index (or Simple Index)

So, what makes an “inverted” (sometimes called a “fully inverted”) index so useful?
Anyone who has ever picked up a reference book knows what an index is, and how useful
it is for finding specific information in the book that would otherwise require a lot
of “thumbing” to locate.

Example #1:

Apple
    Green     1
    Red       10
    Yellow    10, 15

Boy           12
Girl          13

In this case, the words on the left can be located on the pages identified in the column
to the right. However, such an index only tells the reader the page on which a word can
be found, and it generally makes no attempt to provide information beyond the book for
which it was constructed.
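
As a rough illustration, the book index of Example #1 can be written as a simple lookup
table. The sketch below assumes that Green, Red and Yellow are sub-entries of Apple; the
names book_index and pages_for are made up for this example.

```python
# A book index as a plain dictionary: entry -> list of page numbers.
# Assumption: "Green", "Red" and "Yellow" are sub-entries of "Apple".
book_index = {
    "Apple, Green": [1],
    "Apple, Red": [10],
    "Apple, Yellow": [10, 15],
    "Boy": [12],
    "Girl": [13],
}


def pages_for(entry):
    """Return the pages listed for an index entry, or an empty list."""
    return book_index.get(entry, [])


print(pages_for("Apple, Yellow"))  # [10, 15]
print(pages_for("Dog"))            # [] (not in this book's index)
```

Note that the table only knows about pages of one book, which is exactly the limitation
described above.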

“Inverted” (or Google-style) Indexes

Book indexes list only the occurrences of words in one specific book; libraries have not
seen fit to create library-wide indexes that list the book titles in which a specific
person, event or thing can be found. Google, it would seem, has taken on the task of
indexing everything in the world.

To widen the scope of an index so that it covers more than a single entity (such as a
book), a more powerful kind of index was created: an “inverted index”.

From Wikipedia, the free encyclopedia, “Inverted index”
(http://en.wikipedia.org/wiki/Inverted_index):

An inverted index (also referred to as postings file or inverted file) is an
index structure storing a mapping from words to their locations in a
document or a set of documents, allowing full text search. It is the most
popular data structure used in document retrieval systems.

There are two main variants of inverted indexes: A record level inverted
index (or inverted file index or just inverted file) contains a list of
references to documents for each word. A word level inverted index (or
full inverted index or inverted list) additionally contains the positions of
each word within a document.[1] The latter form offers more functionality
(like phrase searches), but needs more time and space to be created.
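
The difference between the two variants can be sketched in a few lines of Python. This
is only an illustration: the two tiny documents and the names record_level, word_level
and phrase_search are assumptions made for this example, not part of the Wikipedia text.

```python
from collections import defaultdict

# Two made-up documents, keyed by document number.
docs = {
    1: "red apple pie",
    2: "green apple and red pear",
}

record_level = defaultdict(set)   # word -> set of document numbers
word_level = defaultdict(list)    # word -> list of (document, position)

for doc_id, text in docs.items():
    for position, word in enumerate(text.split()):
        record_level[word].add(doc_id)
        word_level[word].append((doc_id, position))


def phrase_search(first, second):
    """Documents in which `second` immediately follows `first`."""
    following = set(word_level.get(second, []))
    return sorted({doc_id for doc_id, pos in word_level.get(first, [])
                   if (doc_id, pos + 1) in following})


print(sorted(record_level["red"]))    # [1, 2]
print(word_level["apple"])            # [(1, 1), (2, 1)]
print(phrase_search("red", "apple"))  # [1]
```

The record-level index can only say that both documents contain “red” and “apple”; the
stored positions are what let the word-level index tell that the phrase “red apple”
occurs in document 1 alone.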

Example #2:

Suppose we have three sentences:

(1) The book had a red cover and was about twelve inches tall.
(2) A tall building can have a red hue around sunset.
(3) About sunset the State Court Judge told the convicted felon: “you will serve twelve
years for stealing this sacred book!”

The following spreadsheet demonstrates how an “inverted” index would be constructed
from these three sentences:

(1) Initial List    (2) Sorted List     (3) Nuisance Words Removed  (4) Inverted Index
Item.1 The          Item.1 a            Item.1 book                 book        1, 3
Item.1 book         Item.2 a            Item.3 book                 building    2
Item.1 had          Item.1 About        Item.2 building             can         2
Item.1 a            Item.3 About        Item.2 can                  convicted   3
Item.1 red          Item.1 And          Item.3 convicted            Court       3
Item.1 cover        Item.2 Around       Item.3 Court                cover       1
Item.1 and          Item.1 Book         Item.1 cover                felon       3
Item.1 was          Item.3 Book         Item.3 felon                hue         2
Item.1 about        Item.2 building     Item.2 hue                  inches      1
Item.1 twelve       Item.2 Can          Item.1 inches               judge       3
Item.1 inches       Item.3 convicted    Item.3 judge                red         1, 2
Item.1 tall.        Item.3 Court        Item.1 red                  sacred      3
                    Item.1 Cover        Item.2 red                  serve       3
Item.2 A            Item.3 Felon        Item.3 sacred               State       3
Item.2 tall         Item.3 For          Item.3 serve                stealing    3
Item.2 building     Item.1 had          Item.3 State                sunset      3, 2
Item.2 can          Item.2 Have         Item.3 stealing             tall        1, 2
Item.2 have         Item.2 Hue          Item.3 sunset               told        3
Item.2 a            Item.1 Inches       Item.2 sunset.              twelve      1, 3
Item.2 red          Item.3 Judge        Item.2 tall                 years       3
Item.2 hue          Item.1 Red          Item.1 tall
Item.2 around       Item.2 Red          Item.3 told
Item.2 sunset.      Item.3 Sacred       Item.3 twelve
                    Item.3 Serve        Item.1 twelve
Item.3 About        Item.3 State        Item.3 years
Item.3 sunset       Item.3 stealing
Item.3 the          Item.3 sunset
Item.3 State        Item.2 sunset.
Item.3 Court        Item.2 tall
Item.3 judge        Item.1 tall
Item.3 told         Item.3 the
Item.3 the          Item.3 the
Item.3 convicted    Item.1 The
Item.3 felon        Item.3 this
Item.3 you          Item.3 told
Item.3 will         Item.3 twelve
Item.3 serve        Item.1 twelve
Item.3 twelve       Item.1 was
Item.3 years        Item.3 will
Item.3 for          Item.3 years
Item.3 stealing     Item.3 you
Item.3 this
Item.3 sacred
Item.3 book

Explanation:

First, List #1 (Initial List) is constructed by “tokenizing” each sentence (breaking
the sentence up into its individual words). Each entry in this list must also record
where the word was found (in this case, the sentence number, shown as Item.1, Item.2 or
Item.3).

Second, List #1 is sorted alphabetically to produce List #2 (Sorted List).

Third, the “nuisance words” (a, an, the, etc.) are removed from List #2 to produce
List #3.

Lastly, List #4 (the Inverted Index) is constructed from List #3 by deleting the
duplicate words without losing any information about them: each deleted word’s
identifier is added to the first occurrence of that word.
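
The four steps can be sketched in Python roughly as follows. This is an illustration
under stated assumptions, not the spreadsheet's exact procedure: the nuisance-word list
is inferred from the words the spreadsheet drops, sentence (3) is written with the word
“convicted” as it appears in the spreadsheet lists, and, unlike the spreadsheet, the
sketch lowercases every word and strips punctuation. All variable names are made up for
this example.

```python
import re

sentences = {
    1: "The book had a red cover and was about twelve inches tall.",
    2: "A tall building can have a red hue around sunset.",
    3: "About sunset the State Court Judge told the convicted felon: "
       '"you will serve twelve years for stealing this sacred book!"',
}

# Assumed nuisance words, inferred from what the spreadsheet removes.
NUISANCE_WORDS = {"a", "an", "about", "and", "around", "for", "had", "have",
                  "the", "this", "was", "will", "you"}

# List #1: tokenize, keeping the sentence number with every word.
initial_list = [(word, number)
                for number, text in sentences.items()
                for word in re.findall(r"[a-z]+", text.lower())]

# List #2: sort alphabetically (ties are ordered by sentence number).
sorted_list = sorted(initial_list)

# List #3: remove the nuisance words.
filtered_list = [(w, n) for w, n in sorted_list if w not in NUISANCE_WORDS]

# List #4: merge duplicates, keeping every sentence number for each word.
inverted_index = {}
for word, number in filtered_list:
    numbers = inverted_index.setdefault(word, [])
    if number not in numbers:
        numbers.append(number)

print(inverted_index["book"])    # [1, 3]
print(inverted_index["sunset"])  # [2, 3]
print(len(inverted_index))       # 20 distinct words, as in column (4)
```

Because punctuation is stripped before sorting, “sunset” and “sunset.” are treated as
the same word here, which the spreadsheet only achieves in its final column.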

So, we see that the word book can be found in sentences #1 and #3, while the word hue
occurs only in sentence #2.
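
Looking a word up is then a single dictionary access. The sketch below copies the
inverted index from column (4) of the spreadsheet (lowercased); the helper names find
and find_all are assumptions made for this example, and find_all shows one plausible way
to answer a multi-word query by intersecting the sentence lists.

```python
# Inverted index copied from column (4): word -> set of sentence numbers.
inverted_index = {
    "book": {1, 3}, "building": {2}, "can": {2}, "convicted": {3},
    "court": {3}, "cover": {1}, "felon": {3}, "hue": {2}, "inches": {1},
    "judge": {3}, "red": {1, 2}, "sacred": {3}, "serve": {3}, "state": {3},
    "stealing": {3}, "sunset": {2, 3}, "tall": {1, 2}, "told": {3},
    "twelve": {1, 3}, "years": {3},
}


def find(word):
    """Sentence numbers containing `word` (empty list if not indexed)."""
    return sorted(inverted_index.get(word.lower(), set()))


def find_all(*words):
    """Sentence numbers containing every one of `words`."""
    if not words:
        return []
    postings = [inverted_index.get(w.lower(), set()) for w in words]
    return sorted(set.intersection(*postings))


print(find("book"))                          # [1, 3]
print(find("hue"))                           # [2]
print(find_all("red", "tall"))               # [1, 2]
print(find_all("book", "twelve", "sacred"))  # [3]
```

No sentence is ever re-read during a search; only the short lists of sentence numbers
are consulted.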

Conclusion:

Inverted indexes give Google (and other search engines) the ability to track large
bodies of text in source documents efficiently, which allows information about those
documents to be retrieved relatively quickly. Most Internet users are probably unaware
of the underlying technology, since it has been packaged so unobtrusively. Google has
raised the bar: any search application these days should use “inverted indexes”, or its
users will be asking, “Why isn’t this like Google?”

Wayne Martin
Palo Alto, CA
www.twitter.com/wmartin46
www.youtube.com/wmartin46