
CHAPTER ONE

Introduction to Information
Storage and Retrieval
 IR and IR systems
 Data vs information retrieval
 IR and the retrieval process
 Basic structure of an IR system
 How search engines work
Introduction to ISR
 Information retrieval is the process of
– looking for relevant information in a large and
heterogeneous collection of information, and
– presenting the retrieved information ranked by relevance.
 Based on the kind of collection the retrieval works
on, there are different types of IR, such as
– textual IR,
– graphic IR,
– multimedia IR and others.
November 22 ISR 2
Introduction to ISR

 Systems that store information gathered from
different sources in such a way that it can be
retrieved easily and effectively upon request are
referred to as information storage and retrieval
systems.

Introduction to ISR

 Collecting information from different sources and
storing it either in a storage room (maintaining
paper records) or on storage devices such as hard
disks, DVDs, and CDs is called information storage.
 This information may be in any form, such as
audio, video, or text.

Information Retrieval
 The process of searching for, fetching, and serving
information to the users who request it is information
retrieval.
 It is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an
information need from within large collections
(usually stored on computers).

Information Retrieval
 Information Retrieval (IR) is the activity of obtaining
relevant documents, based on user needs, from a
collection of documents.

Information Retrieval
 Goal = find documents relevant to an information need
from a large document set

IR in Practice

 Information Retrieval is a research-driven
theoretical and experimental discipline.
– The focus is on different aspects of the
information-seeking process, depending on the
researcher’s background or interest.

IR in Practice
• Computer scientist – fast and accurate search engine
• Librarian – organization and indexing of information
• Cognitive scientist – the process in the searcher’s mind
• Philosopher – Is this really relevant? …
– Progress is influenced by advances in Computational
Linguistics, Information Visualization, Cognitive
Psychology, HCI, …

IR Systems
 Document (Web page) retrieval in response
to a query
Quite effective (at some things)
Commercially successful (some of them)
 But what goes on behind the scenes?
How do they work?
What happens beyond the Web?

IR Systems
 Information is organized into (a large
number of) documents
– Large collections of documents from various
sources:
• news articles,
• research papers,
• books,
• digital libraries,
• Web pages, etc.
IR Systems

 Example: Google index size
– In 1998 Google already had 26 million pages
– By 2008: more than 1 trillion (as in
1,000,000,000,000) unique URLs

IR Systems

 An Information Retrieval System mainly
focuses on electronically searching and retrieving
stored documents.
 An Information Retrieval System is a
system that is capable of storage, retrieval,
and maintenance of information.

IR Systems

 An IR system provides methods for
– adding documents to the database,
– modifying or deleting them from the database, and
– searching for and serving appropriate
documents to the users.

IR Systems
A static, or relatively static, document collection is indexed
prior to any user query.

IR Systems
 A query is issued by the user.
 A set of documents that are deemed
relevant to the query is ranked based on
computed similarity to the query and
presented to the user.
 Information Retrieval (IR) is devoted to
finding relevant documents, not finding
simple matches to patterns.
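
The ranking step described above can be sketched as follows. This is a minimal illustration, not the method of any particular system: it assumes whitespace tokenization, raw term counts as the document representation, and cosine similarity as the "computed similarity". The names `rank`, `cosine_similarity`, and `DOCS` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on whitespace; real systems do much more.
    return text.lower().split()


def cosine_similarity(a, b):
    # a and b are Counters mapping term -> frequency.
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query, documents):
    # Score every document against the query, drop non-matching ones,
    # and return document ids ordered from most to least similar.
    q = Counter(tokenize(query))
    scored = []
    for doc_id, text in documents.items():
        score = cosine_similarity(q, Counter(tokenize(text)))
        if score > 0:
            scored.append((score, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


# Illustrative collection (made up for this sketch).
DOCS = {
    "d1": "information retrieval systems rank documents",
    "d2": "database systems return exact matches",
    "d3": "cooking recipes for dinner",
}

print(rank("retrieval systems", DOCS))  # ['d1', 'd2']
```

Both d1 and d2 share a term with the query, but d1 shares two, so it is ranked first; d3 shares none and is not returned.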
IR Systems
 Automated information retrieval (IR)
systems were originally developed to help
manage the huge scientific literature that
has developed since the 1940s.
 Many university, corporate, and public
libraries now use IR systems to provide
access to books, journals, and other
documents.
IR Systems
 An Information Retrieval System consists
of a software program that helps a user
find the information the user needs.
 The system may use standard computer
hardware or specialized hardware to support
the search sub-function and to convert
non-textual sources to a searchable medium (e.g.,
transcription of audio to text).
General Goal of IR
 To help users find useful/relevant
information based on their information
needs (with minimum effort) despite
the challenges:
– Increasing complexity of information (overload)
– Changing needs of users
 Provide immediate random access to the
document collection.
 Retrieval systems such as Google, Yahoo, …
are developed with this aim.
Objectives of IR Systems
 To minimize the overhead of a user locating
needed information.
 Overhead can be expressed as the time a
user spends in all of the steps leading to
reading an item containing the needed
information (e.g., query generation, query
execution, scanning results of the query to
select items to read, reading non-relevant
items).
Objectives of IR Systems
 The success of an IR system is very subjective,
based upon what information is needed and the
willingness of a user to accept overhead.
 In some cases, the needed information is all
information in the system that relates to a user’s need.
 In other cases, it may be sufficient to find enough
information in the system to complete a task, allowing
for missed data.

Functions of an IRS
The Major Functions of an IRS are:
 To identify the sources of information
relevant to the areas of interest of the target
users’ community.
 To analyze the contents of the sources
(documents).
 To represent the contents of the analyzed
sources for matching with the users’
queries.
Functions of an IRS
The Major Functions of an IRS are:
 To match the search statement with the
stored database.
 To retrieve the information that is relevant.
 To make the necessary adjustments in the
system based on feedback from the users.
Types of IR System
 IR systems may be of different types,
depending upon the search conducted by the
information seeker.
 Broadly, they can be categorized as:
1. Reference Retrieval System
2. Document Retrieval System
3. Fact Retrieval System
4. Knowledge Retrieval System
Types of IR System
1. Reference Retrieval System:
 Information related to specific questions is
retrieved.
2. Document Retrieval System:
 Information is retrieved by the attributes of
documents such as author, title, subject, and
so on.
 Nowadays, complete texts are also
retrieved; such a system is called a text retrieval system.
Types of IR System
3. Fact Retrieval System:
 The specific data or facts are retrieved (viz.,
numerical databases).
4. Knowledge Retrieval System:
 A rule-based system in which there is a
knowledge base with a capability for
knowledge acquisition and an inference
engine.
Data vs Information Retrieval
Information Retrieval:
 The software program that deals with the
organization, storage, retrieval, and evaluation of
information from document repositories,
particularly textual information.
Data Retrieval:
 Data retrieval deals with obtaining data from a
database management system such as ODBMS.
 It is a process of identifying and retrieving
the data from the database, based on the query
provided by a user or application.
Data vs Information Retrieval
Information Retrieval:
 Retrieves information about a subject.
 Small errors are likely to go unnoticed.
 Results are ordered by relevance.
Data Retrieval:
 Determines the keywords in the user
query and retrieves the data.
 A single error object means total failure.
 Has a well-defined structure and semantics.
Data vs Information Retrieval
Information Retrieval:
 Does not provide a solution to the user of
the database system.
 The results obtained are approximate matches.
 It is a probabilistic model.
Data Retrieval:
 Provides solutions to the user of the
database system.
 The results obtained are exact matches.
 Results are unordered by relevance.
 It is a deterministic model.
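
The contrast between exact and approximate matching can be made concrete with a small sketch. The records, the function names `data_retrieval` and `information_retrieval`, and the word-overlap score are all assumptions chosen for illustration; real database and IR systems are far more elaborate.

```python
# Illustrative records (made up for this sketch).
RECORDS = [
    {"id": 1, "title": "Introduction to Information Retrieval"},
    {"id": 2, "title": "Database Management Systems"},
    {"id": 3, "title": "Modern Information Systems"},
]


def data_retrieval(records, title):
    # Deterministic, exact match: a record either satisfies the condition
    # or it does not, and the result carries no relevance ordering.
    return [r["id"] for r in records if r["title"] == title]


def information_retrieval(records, query):
    # Approximate match: score each record by the number of query words
    # it shares with the title, then return the best-scoring ids first.
    query_words = set(query.lower().split())
    scored = []
    for r in records:
        overlap = len(query_words & set(r["title"].lower().split()))
        if overlap:
            scored.append((overlap, r["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [record_id for _, record_id in scored]


print(data_retrieval(RECORDS, "information retrieval"))         # []
print(information_retrieval(RECORDS, "information retrieval"))  # [1, 3]
```

The exact-match lookup fails completely because the string differs from every stored title, while the approximate search still returns partially matching records in ranked order.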
IR and the Retrieval Process
 The IR process starts when a user enters
a query into the system through the
graphical interface provided.
 These user-defined queries are
statements of the information needed.
 For example, queries issued by users in search
engines.

IR and the Retrieval Process
 In IR, a single query does not match one
right data object; instead,
it matches several data objects in the
collection, from which the most relevant
documents are taken into consideration for
further evaluation.
 The relevant documents are ranked
to find the documents most related to the
given query.
IR and the Retrieval Process
 This is the key difference between
database searching and IR.
 The query is then sent to the core of the
system.
 This part has access to the content
management module, which is directly
linked with the back end, i.e., the large
collection of data objects.
IR and the Retrieval Process
 Once the results are generated by the core
system, they are returned to the user through
a graphical user interface.
 The process repeats and the results are refined
until the user finds what they are
actually looking for.

IR and the Retrieval Process

Document Parsing
 Document parsing deals with the overall
document structure.
 In this phase, the document is broken down
into discrete components.
 In the preprocessing phase, unit documents
are created, for example one document
representing an email and another representing
each additional specific part.
IR and the Retrieval Process

Lexical Analysis
 In lexical analysis, tokenization is the process of
breaking a stream of text into words, phrases, symbols, or
other meaningful elements called tokens.
 These tokens are then passed on to
part-of-speech tagging.
 Typically, tokenization occurs at the word level.
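
Word-level tokenization as described above can be sketched with a regular expression. The pattern below is an assumption made for this example (runs of letters and digits, optionally followed by an apostrophe suffix such as 's); it is one of many possible token definitions, and `TOKEN_PATTERN` and `tokenize` are hypothetical names.

```python
import re

# Assumed token definition: letters/digits, optionally with an
# apostrophe suffix (e.g. "user's"). Punctuation is discarded.
TOKEN_PATTERN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")


def tokenize(text):
    # Return the lowercase tokens in the order they appear in the text.
    return [match.group(0).lower() for match in TOKEN_PATTERN.finditer(text)]


print(tokenize("The user's query: Information Retrieval, 2nd ed."))
# ['the', "user's", 'query', 'information', 'retrieval', '2nd', 'ed']
```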

IR and the Retrieval Process

Stemming and Lemmatization


Stemming
 For grammatically correct sentences, we often
use different forms of a word, e.g., go, going, goes.
 Stemming is the process of cutting off the affixes
so that the root word can be found.
 For example, many words are formed as
regular noun + plural affix.
Example Stemming
 Removing identified prefixes and suffixes
from words
– Flies → fli
– Mules → mule
– Agreed → agre
– Owned → own
– Traditional → tradit
 Can be done heuristically without a
dictionary
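
A dictionary-free heuristic of this kind can be sketched as an ordered list of suffix rules. This is a toy stemmer written for illustration, not the Porter algorithm or any library's implementation; the rule list `SUFFIX_RULES` and the minimum stem length of two characters are assumptions.

```python
# Ordered suffix rules: (suffix to strip, replacement). Assumed for this sketch.
SUFFIX_RULES = [
    ("ies", "i"),
    ("ing", ""),
    ("ed", ""),
    ("s", ""),
]


def stem(word):
    # Apply the first rule whose suffix matches, provided at least
    # two characters of the word remain before the suffix.
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["Flies", "Mules", "Agreed", "Owned", "going"]:
    print(w, "->", stem(w))
# Flies -> fli
# Mules -> mule
# Agreed -> agre
# Owned -> own
# going -> go
```

The toy rules reproduce the first four examples on the slide. A word such as "Traditional" is left unchanged, because reducing it to "tradit" needs the additional rules of a full stemmer.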
IR and the Retrieval Process

Lemmatization
 Usually refers to doing these things properly
with vocabulary and morphological analysis
of words.
 Aiming to remove inflectional endings only.
 Reduces words to their base form
– Flies → fly
– Mules → mule
– Agreed → agree
– Owned → own
– Traditional → tradition
 Requires a dictionary
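
Because lemmatization relies on a dictionary, the simplest possible sketch is a lookup table. The table `LEMMA_DICTIONARY` below is a tiny hand-made assumption covering only a few inflected forms; a real lemmatizer uses a full vocabulary together with morphological analysis.

```python
# Hand-made lookup table (assumption): inflected form -> base form.
LEMMA_DICTIONARY = {
    "flies": "fly",
    "mules": "mule",
    "agreed": "agree",
    "owned": "own",
    "goes": "go",
    "going": "go",
    "went": "go",
}


def lemmatize(word):
    # Look the word up; unknown words are returned unchanged (lowercased).
    key = word.lower()
    return LEMMA_DICTIONARY.get(key, key)


print(lemmatize("Flies"))  # fly
print(lemmatize("went"))   # go
```

Unlike suffix stripping, the lookup handles irregular forms such as "went", but only for words that appear in the dictionary.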
Basic structure of an IR system
 The two subsystems of an IR system:
 Indexing is an offline process of organizing
documents using keywords extracted from the
collection.
 Searching is an online process of finding
relevant documents in the index list as per the user’s
query.
 Indexing and searching are unavoidably
connected.
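
The two subsystems can be sketched with an inverted index, a common (though not the only) index structure. The example assumes whitespace tokenization and treats a multi-word query as requiring all terms; `build_index`, `search`, and `DOCS` are hypothetical names.

```python
from collections import defaultdict


def build_index(documents):
    # Offline step: map every term to the set of documents containing it.
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    # Online step: return the documents that contain every query term.
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Illustrative collection (made up for this sketch).
DOCS = {
    "d1": "web search engines index pages",
    "d2": "libraries index books and journals",
    "d3": "search engines rank pages",
}

INDEX = build_index(DOCS)
print(search(INDEX, "search engines"))  # ['d1', 'd3']
print(search(INDEX, "index"))           # ['d1', 'd2']
```

The search step never reads the documents themselves, only the index, which is why anything that was not indexed cannot be found.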
Basic structure of an IR system
 You cannot search what was not first indexed
in some manner.
 Documents are indexed in order to make them
searchable.
 There are many ways to do indexing.
 To index, one needs an indexing language.
 There are many indexing languages.
 Even the words of a document themselves can
serve as an indexing language.
Issues in IR
 Text representation
– what makes a “good” representation?
– how is a representation generated from text?
 Information needs representation
– what is an appropriate query language?
 Comparing representations
– to identify relevant documents
– what is a “good” model of retrieval?
 Evaluating effectiveness of retrieval
– what are good metrics?
Why is IR so hard?
 Information retrieval problem: locating
relevant documents based on user input,
such as keywords or example documents.
 The real problem boils down to matching
the language of the query to the language of
the document.

Why is IR so hard?
 Simply matching on words is a very weak
approach.
 One word can have several different
meanings. Consider “take”:
– “take a place at the table”
– “take money to the bank”
– “take a picture”
 In Amharic ገና ……...ትፈራለህ …….
The World Wide Web
 The Web is an infrastructure of distributed
information combined with software that uses
networks as a vehicle to exchange that
information.
 A web page is a document that contains or
references various kinds of data, such as text,
images, graphics, and programs.
 Links are connections between one web page and
another that can be used to “move around” as
desired.
The World Wide Web
 A website is a collection of related web
pages.
 The Internet makes the communication
possible, but the Web makes that
communication easy, more productive, and
more enjoyable.

Characteristics of the Web
 Decentralized content publishing with essentially
no central control of authorship.
 This turned out to be the biggest challenge for web
search engines in their quest to index and retrieve
this content.
 Web page authors create content in dozens of
(natural) languages and thousands of dialects,
thus demanding many different forms of
stemming and other linguistic operations.
Characteristics of the Web
 Huge (1.75 terabytes of text)
 Allows people to share information globally and
freely
 Hides the details of communication protocols,
machine locations, and operating systems
 Data are unstructured
 Exponential growth
 Increasingly commercial over time (1.5% .com in
1993 to 60% .com in 1997)
Search Engine?
 A search engine, such as Yahoo or Google, is a
website that helps you find other websites.
 You enter keywords and the search engine
produces a list of links to potentially useful sites.
 There are two types of searches:
– Keyword searches
– Concept-based searches

Search Engine?
 A browser is a software tool that issues the
request for the web page we want and
displays it when it arrives.
 We often talk about “visiting” a website, as
if we were going there.
 In truth, we specify the information
we want, and it is brought to us.

The Browser
 A browser is a Web client program that uses
Hypertext Transfer Protocol (HTTP) to make
requests of Web servers throughout the Internet on
behalf of the browser user.
 Text-only mode, such as Lynx
 Graphic mode involves a graphical software
program that retrieves
– text,
– audio, and
– video
Challenges of Building a Search Engine

 Built by companies that hide the technical
details
 Data are distributed over the Web
 Large volume with a high percentage of
volatile data
 Unstructured and redundant data
 Quality of data cannot be easily verified
Challenges of Building a Search Engine

 Heterogeneous data, data structures, and data
sources.
 Dynamic data due to changes by owners of
various websites.
 How the user specifies a query is
always a challenge.
 So is how to interpret the answer provided by the
system.
Index and Search Engine
 The search query is everything that the user
types to get results.
 It is made up of one or more search terms, plus
optional special characters
 Analyzing the Query
 Expanding the query
 Word variants: plural/singular, various verb
forms
 Spelling correction
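
The query analysis steps listed above (spelling correction, then expansion with word variants) can be sketched with two small lookup tables. Both tables and the names `expand_term` and `expand_query` are assumptions made for illustration; production search engines derive such data from large vocabularies and query logs.

```python
# Hand-made data for this sketch (assumptions, not a real lexicon).
KNOWN_MISSPELLINGS = {"serach": "search", "libary": "library"}
VARIANT_GROUPS = [
    {"engine", "engines"},
    {"search", "searches", "searched", "searching"},
    {"library", "libraries"},
]


def expand_term(term):
    # Step 1: correct a known misspelling.
    term = term.lower()
    term = KNOWN_MISSPELLINGS.get(term, term)
    # Step 2: expand to all known variants (singular/plural, verb forms).
    for group in VARIANT_GROUPS:
        if term in group:
            return sorted(group)
    return [term]


def expand_query(query):
    # Each search term becomes a list of alternatives to look up.
    return [expand_term(term) for term in query.split()]


print(expand_query("serach engines"))
# [['search', 'searched', 'searches', 'searching'], ['engine', 'engines']]
```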
User query needs
 There appear to be three broad categories into
which common web search queries can be
grouped:
(i) Informational,
(ii) Navigational, and
(iii) Transactional.
 It should be clear that some queries will fall in
more than one of these categories, while others
will fall outside them.
User query needs
(i) Informational
 Queries seek general information on a broad topic, such
as leukemia or Provence.
(ii) Navigational
 Queries seek the website or home page of a single
entity that the user has in mind, say Ethiopian
Airlines.
(iii) Transactional
 Queries precede the user performing a transaction on the Web,
such as purchasing a product, downloading a file, or
making a reservation.
User Problems
 Do not exactly understand how to provide a
sequence of words for the search.
 Not aware of the input requirement of the
search engine.
 Problems understanding Boolean logic, so
the users cannot use advanced search.
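
For readers unfamiliar with the Boolean logic mentioned above, the three operators map directly onto set operations over the documents that contain each term. The index contents and the function names below are assumptions made for this sketch.

```python
# Toy inverted index: term -> set of document ids (illustrative values).
INDEX = {
    "web": {"d1", "d3"},
    "search": {"d1", "d2", "d3"},
    "library": {"d2"},
    "crawler": {"d3"},
}
ALL_DOCS = {"d1", "d2", "d3"}


def postings(term):
    # Documents containing the term; empty set if the term is unknown.
    return INDEX.get(term.lower(), set())


def boolean_and(left, right):
    return left & right  # documents in both sets


def boolean_or(left, right):
    return left | right  # documents in either set


def boolean_not(operand):
    return ALL_DOCS - operand  # documents not in the set


# web AND search
print(sorted(boolean_and(postings("web"), postings("search"))))  # ['d1', 'd3']
# library OR crawler
print(sorted(boolean_or(postings("library"), postings("crawler"))))  # ['d2', 'd3']
# search AND NOT web
print(sorted(boolean_and(postings("search"), boolean_not(postings("web")))))  # ['d2']
```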

User Problems
 Novice users do not know how to start
using a search engine.
 Users do not care about advertisements, yet
without advertisements there is no funding for
the search engine.
 Around 85% of users only look at the first
page of the results, so relevant answers
might be skipped.

Searching Guidelines
 Specify the words clearly (+, -)
 Use advanced search when necessary
 Provide as many particular terms as
possible
 If looking for a company, institution, or
organization, try:

Searching Guidelines
 Some search engines specialize in particular
areas.
 For broad queries, try using Web
directories as starting points.
 Keep in mind that anyone can
publish data on the Web, so information
obtained from search engines might not
be accurate.
Types of Search Engines
 Search by Keywords:
 e.g. AltaVista, Excite, Google
 Search by categories:
 e.g. Yahoo!
 Specialize in other languages:
 e.g. Chinese Yahoo! and Yahoo! Japan
 Interview simulation
 e.g. Ask Jeeves!
Web Crawlers
 Software agents that traverse the Web, sending
new or updated pages to a main server where they
are indexed.
 Also called robots, spiders, worms, wanderers,
walkers, and knowbots.
 The first crawler, Wanderer, was developed in 1993.
 Run on a local machine and send requests to
remote Web servers.

Web Crawlers
 Breadth-first or depth-first traversal
is applied
 Avoid crawling the same pages twice
 Web pages change dynamically
 The fastest crawlers are able to traverse up to 10
million pages per day
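
Breadth-first traversal with duplicate avoidance can be sketched over an in-memory link graph. No network requests are made here: `LINK_GRAPH` is a made-up stand-in for fetching a page and extracting its links, and `crawl_breadth_first` is a hypothetical name.

```python
from collections import deque

# Made-up link graph standing in for the Web: page -> pages it links to.
LINK_GRAPH = {
    "home": ["about", "news"],
    "about": ["home", "team"],
    "news": ["home", "story"],
    "team": [],
    "story": ["news"],
}


def crawl_breadth_first(start, link_graph, max_pages=100):
    # Visit pages level by level, never queuing a page twice.
    visited = []
    seen = {start}
    queue = deque([start])
    while queue and len(visited) < max_pages:
        page = queue.popleft()
        visited.append(page)  # a real crawler would fetch and index here
        for link in link_graph.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


print(crawl_breadth_first("home", LINK_GRAPH))
# ['home', 'about', 'news', 'team', 'story']
```

Replacing `queue.popleft()` with `queue.pop()` would turn the same loop into a depth-first traversal; the `seen` set is what prevents the same page from being crawled repeatedly.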

Thank you!
