IR Chapter I

CHAPTER ONE
Introduction to Information
Storage and Retrieval
 IR and IR systems
 Data vs information retrieval
 IR and the retrieval process
 Basic structure of an IR system
 How search engines work
Introduction to ISR
 Information retrieval is the process of
– looking for relevant information in a large and
heterogeneous collection of information and
– returning the retrieved information ranked by relevance.
 Based on the kind of collection the retrieval works
on there are different types of IR, like
– textual IR,
– graphic IR,
– multimedia IR and others.
November 22 ISR 2
Introduction to ISR
 Systems that store information gathered from different
sources in such a way that it can be retrieved easily and
effectively upon request are referred to as
information storage and retrieval systems.
Introduction to ISR
Collecting information from different sources and storing it,
either in a storage room (maintaining paper records) or on
storage devices such as hard disks, DVDs, and CDs, is called
information storage.
This information may be in any form: audio, video, or text.
Information Retrieval
 Information retrieval is the process of searching for, fetching,
and serving information to the users who request it.
 It is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an
information need from within large collections
(usually stored on computers).
Information Retrieval (IR)
An activity of obtaining relevant documents, based on user needs,
from a collection of documents.
 Goal = find documents relevant to an information need
from a large document set
IR in Practice
 Information Retrieval is a research-driven
theoretical and experimental discipline.
– The focus is on different aspects of the
information-seeking process, depending on the
researcher’s background or interest.
IR in Practice
• Computer scientist – fast and accurate search engine
• Librarian – organization and indexing of information
• Cognitive scientist – the process in the searcher’s mind
• Philosopher – Is this really relevant? …
– Progress influenced by advances in Computational
Linguistics, Information Visualization, Cognitive
Psychology, HCI, …
IR Systems
 Document (Web page) retrieval in response
to a query
Quite effective (at some things)
Commercially successful (some of them)
 But what goes on behind the scenes?
How do they work?
What happens beyond the Web?
IR Systems
 Information is organized into (a large
number of) documents
– Large collections of documents from various
sources:
• news articles,
• research papers,
• books,
• digital libraries,
• Web pages, etc.
IR Systems
 Example: Google index size
– In 1998 Google already had 26 million pages
– Now: more than 1 trillion (as in
1,000,000,000,000) unique URLs
IR Systems
 An Information Retrieval System mainly focuses on
electronic searching and retrieval of stored documents.
 An Information Retrieval System is a
system that is capable of storage, retrieval,
and maintenance of information.
IR Systems
 An IR System provides methods for
 adding documents to the database,
 modifying or deleting them from the database, and
 searching for and serving appropriate
documents to the users.
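The operations listed above can be sketched as a toy in-memory store. This is only an illustration under stated assumptions: the class name `DocumentStore`, its method names, and the sample documents are hypothetical and not taken from any particular IR system.

```python
class DocumentStore:
    """Toy in-memory document database (illustrative names only)."""

    def __init__(self):
        self.docs = {}  # doc_id -> text

    def add(self, doc_id, text):
        """Add a new document to the database."""
        self.docs[doc_id] = text

    def modify(self, doc_id, text):
        """Replace the text of an existing document."""
        if doc_id not in self.docs:
            raise KeyError(doc_id)
        self.docs[doc_id] = text

    def delete(self, doc_id):
        """Remove a document; do nothing if it is absent."""
        self.docs.pop(doc_id, None)

    def search(self, term):
        """Return sorted ids of documents that contain the term as a word."""
        term = term.lower()
        return sorted(d for d, t in self.docs.items() if term in t.lower().split())


store = DocumentStore()
store.add(1, "Information retrieval basics")
store.add(2, "Database systems")
store.modify(2, "Retrieval of database records")
store.delete(1)
print(store.search("retrieval"))
```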
IR Systems
A static, or relatively static, document collection is indexed
prior to any user query.
IR Systems
 A query is issued by the user.
 A set of documents deemed relevant to the query
is ranked by computed similarity to the query
and presented to the user.
 Information Retrieval (IR) is devoted to
finding relevant documents, not finding
simple matches to patterns.
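One way to picture "ranked by computed similarity" is the sketch below. It assumes cosine similarity over raw term counts, which is one common choice and not necessarily the measure these slides have in mind; the function names and the three sample documents are hypothetical.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs):
    """Return (doc_id, score) pairs sorted by decreasing similarity to the query."""
    q = Counter(query.lower().split())
    scored = [(doc_id, cosine(q, Counter(text.lower().split())))
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical sample collection.
docs = {
    "d1": "web search engines index web pages",
    "d2": "libraries index books and journals",
    "d3": "cooking recipes",
}
ranking = rank("web index", docs)
for doc_id, score in ranking:
    print(doc_id, round(score, 3))
```

Every document receives a score, including ones that share no term with the query; a real system would normally cut the list off at some threshold or rank.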
IR Systems
 Automated information retrieval (IR)
systems were originally developed to help
manage the huge scientific literature that
has developed since the 1940s.
 Many university, corporate, and public
libraries now use IR systems to provide
access to books, journals, and other
documents.
IR Systems
 An Information Retrieval System consists
of a software program that helps a user
find the information the user needs.
 The system may use standard computer
hardware or specialized hardware to support
the search sub-function and to convert
non-textual sources to a searchable medium (e.g.,
transcription of audio to text).
General Goal of IR
 To help users find useful/relevant
information based on their information
needs (with minimum effort), despite
the challenges:
 Increasing complexity of information (overload)
 Changing needs of users
 To provide immediate random access to the
document collection.
 Retrieval systems such as Google and Yahoo
are developed with this aim.
Objectives of IR Systems
 To minimize the overhead of a user locating
needed information.
 Overhead can be expressed as the time a
user spends in all of the steps leading to
reading an item containing the needed
information
(e.g., query generation, query execution,
scanning results of the query to select items to
read, reading non-relevant items).
Objectives of IR Systems
 The success of an IR system is very subjective: it depends on
what information is needed and on the
willingness of a user to accept overhead.
 In some cases, the needed information is all information
in the system that relates to a user’s need.
 In other cases, it is enough to find sufficient information in
the system to complete a task, allowing for missed
data.
Functions of an IRS
The major functions of an IRS are:
 To identify the sources of information
relevant to the areas of interest of the target
user community.
 To analyze the contents of the sources
(documents).
 To represent the contents of the analyzed
sources for matching with the users’
queries.
Functions of an IRS
The major functions of an IRS are (continued):
 To match the search statement with the
stored database.
 To retrieve the information that is relevant.
 To make the necessary adjustments in the
system based on feedback from the users.
Types of IR System
 IR systems may be of different types
depending upon the search conducted by the
information seeker.
 Broadly, they can be categorized as:
1. Reference Retrieval System
2. Document Retrieval System
3. Fact Retrieval System
4. Knowledge Retrieval System
Types of IR System
1. Reference Retrieval System:
 Information related to specific questions is
retrieved.
2. Document Retrieval System:
 Information can be retrieved by the attributes of
documents such as author, title, subject, and
so on.
 Nowadays, complete texts are also
retrieved; such a system is called a text retrieval system.
Types of IR System
3. Fact Retrieval System:
 Specific data or facts are retrieved (e.g., from
numerical databases).
4. Knowledge Retrieval System:
 A rule-based system in which there is a
knowledge base with capability for
knowledge acquisition and an inference
engine.
Data vs Information Retrieval
Information Retrieval:
 Software that deals with the organization, storage,
retrieval, and evaluation of information from
document repositories, particularly textual information.
 Retrieves information about a subject.
 Small errors are likely to go unnoticed.
 Results are not always ordered by relevance.
 Does not provide a solution to the user of the database system.
 The results obtained are approximate matches.
 It is a probabilistic model.
Data Retrieval:
 Deals with obtaining data from a database management
system such as an ODBMS.
 It is a process of identifying and retrieving data from the
database, based on the query provided by a user or application.
 Determines the keywords in the user query and retrieves the data.
 A single erroneous object means total failure.
 Has a well-defined structure and semantics.
 Provides solutions to the user of the database system.
 The results obtained are exact matches.
 Results are unordered by relevance.
 It is a deterministic model.
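The contrast between exact and approximate matching can be sketched as follows. Everything here is an assumption made for illustration: the `records` and `documents` dictionaries are hypothetical data, and simple term overlap stands in for whatever scoring a real IR system would use.

```python
# Data retrieval: a key either matches a record exactly or it does not.
records = {101: "Abebe", 102: "Almaz"}  # hypothetical table: key -> value


def data_retrieval(key):
    """Exact match lookup; returns None when the key is absent."""
    return records.get(key)


# Information retrieval: documents are scored and returned best first.
documents = {  # hypothetical collection: doc_id -> text
    "a": "ranking documents by relevance",
    "b": "relevance feedback",
    "c": "query languages",
}


def information_retrieval(query, docs):
    """Approximate match: rank documents by number of shared query terms."""
    terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:  # keep only documents sharing at least one term
            scored.append((overlap, doc_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored]


print(data_retrieval(101), data_retrieval(999))
print(information_retrieval("ranking relevance", documents))
```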
IR and the Retrieval Process
 The IR process starts when a user enters
a query into the system through the
graphical interface provided.
 These user-defined queries are
statements of needed information.
 For example, the queries submitted by users to search
engines.
 In IR, a single query does not match
one right data object;
 instead, it matches several data objects
in the collection, from which the most relevant
documents are taken into consideration for
further evaluation.
 The relevant documents are ranked
to find the documents most related to the
given query.
 This is the key difference between
database searching and IR.
 The query is then sent to the core of the
system.
 This part has access to the content
management module, which is directly
linked with the back end,
 i.e. the large collections of data objects.
 Once results are generated by the core
system, they are returned to the user through
a graphical user interface.
 The process repeats and the results are modified
until the user is satisfied with what he or she was
actually looking for.
Document Parsing
 Document parsing deals with the overall
document structure.
 In this phase, the document is broken down
into discrete components.
 In the preprocessing phase, unit
documents are created, for example one document
representing an email and another representing an
additional specific part.
Lexical Analysis
 In lexical analysis, tokenization is the process of
breaking a stream of text into words, phrases, symbols, or
other meaningful elements called tokens.
 These tokens are then passed on to
part-of-speech tagging.
 Typically, tokenization occurs at the word level.
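Word-level tokenization can be sketched with a regular expression. The rule used here (runs of lowercase letters, digits, and apostrophes form a token) is an assumption chosen for brevity; real tokenizers treat punctuation, hyphens, and non-English scripts more carefully.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens.

    Assumption: a token is a run of ASCII letters, digits, or apostrophes.
    """
    return re.findall(r"[a-z0-9']+", text.lower())


tokens = tokenize("Typically, tokenization occurs at a word level.")
print(tokens)
```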
Stemming and Lemmatization

Stemming
 In English, for correct sentence
structure, we often use different forms of a
word, e.g. go, going, goes.
 Stemming is the process of cutting off the affixes so that the
root word is found.
 For example, a plural word is formed as regular noun + plural
affix.
Example Stemming
 Removing identified prefixes and suffixes
from words
– Flies → fli
– Mules → mule
– Agreed → agre
– Owned → own
– Traditional → tradit
 Can be done heuristically without a
dictionary
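A heuristic, dictionary-free stemmer can be sketched as an ordered list of suffix rules. The rule table below is a hypothetical toy, chosen only so that it reproduces the five examples on this slide; it is not the Porter stemmer or any other published algorithm.

```python
# Hypothetical rule table: (suffix to strip, replacement), tried in order.
SUFFIX_RULES = [("ies", "i"), ("ional", ""), ("ed", ""), ("s", "")]


def stem(word):
    """Strip the first matching suffix, keeping at least two leading letters."""
    word = word.lower()
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 2:
            return word[: len(word) - len(suffix)] + replacement
    return word


for w in ["Flies", "Mules", "Agreed", "Owned", "Traditional"]:
    print(w, "->", stem(w))
```

Because the rules know nothing about vocabulary, the results need not be real words (fli, agre, tradit); that is acceptable as long as queries and documents are stemmed the same way.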
Lemmatization
 Usually refers to doing these things properly,
with vocabulary and morphological analysis of words.
 Aims to remove inflectional endings only.
 Reduces words to their base form
– Flies → fly
– Mules → mule
– Agreed → agree
– Owned → own
– Traditional → tradition
 Requires a dictionary
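Since lemmatization requires a dictionary, a minimal sketch is a lookup table. The table `LEMMA_DICT` below holds only the five pairs from this slide and is an illustrative stand-in for a real lexicon with morphological analysis.

```python
# Tiny stand-in dictionary built from the slide's examples (not a real lexicon).
LEMMA_DICT = {
    "flies": "fly",
    "mules": "mule",
    "agreed": "agree",
    "owned": "own",
    "traditional": "tradition",
}


def lemmatize(word):
    """Look the word up; unknown words are returned unchanged (lowercased)."""
    word = word.lower()
    return LEMMA_DICT.get(word, word)


for w in ["Flies", "Agreed", "Traditional", "table"]:
    print(w, "->", lemmatize(w))
```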
Basic structure of an IR system
 The two subsystems of an IR system:
 Indexing: an offline process of organizing
documents using keywords extracted from the
collection
 Searching: an online process of finding
relevant documents in the index list as per the user’s
query
 Indexing and searching are unavoidably
connected
 You cannot search what was not first indexed
in some manner.
 Documents are indexed in order to
be searchable
 There are many ways to do indexing
 To index, one needs an indexing language
 There are many indexing languages
 Taking every word in a document as an index term is one
possible indexing language.
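The two subsystems can be sketched with an inverted index, assuming the simplest indexing language mentioned above (every word is an index term) and Boolean AND matching. The function names and sample documents are hypothetical.

```python
from collections import defaultdict


def build_index(docs):
    """Offline step: map each word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search(index, query):
    """Online step: return ids of documents containing every query term."""
    postings = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []


# Hypothetical collection.
docs = {
    1: "information storage and retrieval",
    2: "information retrieval systems",
    3: "storage devices",
}
index = build_index(docs)
print(search(index, "information retrieval"))
print(search(index, "storage"))
```

A term that was never indexed yields an empty result, which is the point of the first bullet on this slide.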
Issues in IR
 Text representation
– what makes a “good” representation?
– how is a representation generated from text?
 Information needs representation
– what is an appropriate query language?
 Comparing representations
– to identify relevant documents
– what is a “good” model of retrieval?
 Evaluating effectiveness of retrieval
– what are good metrics?
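The slide leaves the question of metrics open. Two widely used ones are precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). The sketch below computes both for a hypothetical result list and a hypothetical set of relevance judgments.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a retrieved list and a relevant set."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 documents actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p, r)
```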
Why is IR so hard?
 Information retrieval problem: locating
relevant documents based on user input,
such as keywords or example documents.
 The real problem boils down to matching
the language of the query to the language of
the document.
Why is IR so hard?
 Simply matching on words is a very weak
approach.
 One word can have different semantic
meanings. Consider: Take
– “take a place at the table”
– “take money to the bank”
– “take a picture”
 In Amharic ገና ……...ትፈራለህ …….
The World Wide Web
 The Web is an infrastructure of distributed
information combined with software that uses
networks as a vehicle to exchange that
information.
 A web page is a document that contains or
references various kinds of data, such as text,
images, graphics, and programs.
 Links are connections between one web page and
another that can be used to “move around” as
desired.
The World Wide Web
 A website is a collection of related web
pages.
 The Internet makes the communication
possible, but the Web makes that
communication easy, more productive, and
more enjoyable.
Characteristics of the Web
 Decentralized content publishing with essentially
no central control of authorship.
 This turned out to be the biggest challenge for web
search engines in their quest to index and retrieve
this content.
 Web page authors created content in dozens of
(natural) languages and thousands of dialects,
thus demanding many different forms of
stemming and other linguistic operations.
Characteristics of the Web
 Huge (1.75 terabytes of text)
 Allows people to share information globally and
freely
 Hides the details of communication protocols,
machine locations, and operating systems
 Data are unstructured
 Exponential growth
 Increasingly commercial over time (1.5% .com in
1993 to 60% .com in 1997)
Search Engine?
 A search engine, such as Yahoo or Google, is a website
that helps you find other websites.
 You enter keywords and the search engine
produces a list of links to potentially useful sites.
 There are two types of searches:
 Keyword searches
 Concept-based searches
Search Engine?
 A browser is a software tool that issues the
request for the web page we want and
displays it when it arrives.
 We often talk about “visiting” a website, as
if we were going there.
 In truth, we specify the information
we want, and it is brought to us.
The Browser
 A browser is a Web client program that uses
Hypertext Transfer Protocol (HTTP) to make
requests of Web servers throughout the Internet on
behalf of the browser user.
 Text-only mode such as Lynx
 Graphic mode involves a graphical software
program that retrieves
 text,
 audio, and
 video
Challenges of Building a Search Engine
 Built by companies that hide the technical
details
 Data are distributed over the Web
 High percentage of volatile data, and a large
volume
 Unstructured and redundant data
 Quality of data cannot be easily verified
Challenges of Building a Search Engine
 Heterogeneous data, data structures, and data
sources.
 Dynamic data due to changes by owners of
various websites.
 How to specify a query from the user is
always a challenge.
 How to interpret the answer provided by the
system.
Index and Search Engine
 The search query is everything that the user
types to get results.
 It is made up of one or more search terms, plus
optional special characters
 Analyzing the Query
 Expanding the query
 Word variants: plural/singular, various verb
forms
 Spelling correction
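The query analysis steps above (spelling correction, then expansion with word variants) can be sketched as follows. The vocabulary and variant table are hypothetical, and `difflib.get_close_matches` from the Python standard library stands in for a real spelling corrector; production search engines use far richer models.

```python
import difflib

# Hypothetical index vocabulary and word-variant table.
VOCABULARY = ["retrieval", "search", "searches", "searching", "engine", "engines"]
VARIANTS = {"search": ["searches", "searching"], "engine": ["engines"]}


def correct(term):
    """Return the closest vocabulary word, or the term itself if none is close."""
    matches = difflib.get_close_matches(term, VOCABULARY, n=1, cutoff=0.8)
    return matches[0] if matches else term


def analyze_query(query):
    """Correct each search term, then append its known variants."""
    terms = []
    for raw in query.lower().split():
        term = correct(raw)
        terms.append(term)
        terms.extend(VARIANTS.get(term, []))
    return terms


expanded = analyze_query("Serch engine")
print(expanded)
```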
User query needs
 There appear to be three broad categories into
which common web search queries can be
grouped:
(i) Informational,
(ii) Navigational, and
(iii) Transactional.
 It should be clear that some queries will fall in
more than one of these categories, while others
will fall outside them.
User query needs
(i) Informational
 Queries seek general information on a broad topic, such
as the queries leukemia or Provence.
(ii) Navigational
 Queries seek the website or home page of a single
entity that the user has in mind, say Ethiopian
Airlines.
(iii) Transactional
 The user performs a transaction on the Web,
 such as purchasing a product, downloading a file, or
making a reservation.
User Problems
 Do not exactly understand how to provide a
sequence of words for the search.
 Not aware of the input requirement of the
search engine.
 Problems understanding Boolean logic, so
the users cannot use advanced search.
User Problems
 Novice users do not know how to start
using a search engine.
 Do not care about advertisements? No
funding!
 Around 85% of users only look at the first
page of the result, so relevant answers
might be skipped.
Searching Guidelines
 Specify the words clearly (+, -)
 Use advanced search when necessary
 Provide as many particular terms as
possible
 If looking for a company, institution, or
organization, try:
Searching Guidelines
 Some search engines specialize in particular
areas
 If the user uses broad queries, try Web
directories as starting points.
 The user should note that anyone can
publish data on the Web, so information
obtained from search engines might not
be accurate.
Types of Search Engines
 Search by Keywords:
 e.g. AltaVista, Excite, Google
 Search by categories:
 e.g. Yahoo!
 Specialize in other languages:
 e.g. Chinese Yahoo! and Yahoo! Japan
 Interview simulation
 e.g. Ask Jeeves!
Search Engine Architecture
Web Crawlers
 Software agents that traverse the Web, sending
new or updated pages to a main server where they
are indexed.
 Also called robots, spiders, worms, wanderers,
walkers, and knowbots
 The first crawler, Wanderer, was developed in 1993
 Run on a local machine and send requests to
remote Web servers.
Web Crawlers
 Breadth-first and depth-first search
strategies are applied
 Avoid crawling the same pages repeatedly
 Web pages change dynamically
 The fastest crawlers are able to traverse up to 10
million pages per day
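A breadth-first crawl that avoids revisiting pages can be sketched over an in-memory link graph. The `LINKS` dictionary is a hypothetical stand-in for the Web; a real crawler would fetch each page over HTTP and extract its links instead of looking them up.

```python
from collections import deque

# Hypothetical link graph: page -> pages it links to.
LINKS = {
    "home": ["about", "news"],
    "about": ["home", "team"],
    "news": ["home", "team", "archive"],
    "team": [],
    "archive": ["news"],
}


def crawl_breadth_first(start, links):
    """Visit pages level by level, never queuing the same page twice."""
    visited = {start}
    queue = deque([start])
    order = []
    while queue:
        page = queue.popleft()
        order.append(page)  # a real crawler would send the page for indexing here
        for target in links.get(page, []):
            if target not in visited:
                visited.add(target)
                queue.append(target)
    return order


order = crawl_breadth_first("home", LINKS)
print(order)
```

Replacing `queue.popleft()` with `queue.pop()` turns the same loop into a depth-first style traversal, the other strategy named on the slide.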
Thank you!
