
INFORMATION RETRIEVAL TECHNIQUES

Unit – I
INTRODUCTION
Information Retrieval – Early Developments – The IR Problem
– The User's Task – Information versus Data Retrieval – The IR
System – The Software Architecture of the IR System – The
Retrieval and Ranking Processes – The Web – The e-Publishing
Era – How the Web Changed Search – Practical Issues on the
Web – How People Search – Search Interfaces Today –
Visualization in Search Interfaces.
Information Retrieval
• Information retrieval is finding material of an
unstructured nature that satisfies an
information need from within large
collections.
• Information retrieval (IR) is concerned with
representing, searching, and manipulating
large collections of electronic text and other
human-language data.
Information Retrieval
Web Search

• Regular users of Web search engines casually expect
to receive accurate and near-instantaneous answers
to questions and requests merely by entering a short
query (a few words) into a text box and clicking
on a search button.
IR PROBLEMS
• The main objective of an IR system is to retrieve all
the items that are relevant to a user query, while
retrieving as few non-relevant items as possible.
Main problems in IR:
1. Document and query indexing
How best to represent their contents?
2. Query evaluation (or retrieval process)
To what extent does a document correspond to
a query?
3. System evaluation
How good is a system?
Three Big Issues in IR
• Relevance
• Evaluation
• Emphasis on users and their information
needs
1. Relevance
• A relevant document contains the information that a person was
looking for when she submitted a query to the search engine.

• To address the issue of relevance, retrieval models are used.

• A retrieval model is a formal representation of the process of matching
a query and a document.

• It is the basis of the ranking algorithm that is used in a search engine to
produce the ranked list of documents.

• A good retrieval model will find documents that are likely to be
considered relevant by the person who submitted the query.

• For example, ranking algorithms are more concerned with the counts of
word occurrences than with whether a word is a noun or an adjective.
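
As a minimal sketch of a retrieval model driven by counts of word occurrences, the Python snippet below scores each document by how often the query words appear in it and ignores part of speech entirely. The function names `tokenize` and `rank_by_term_count` and the toy collection are assumptions made for illustration; real search engines use more refined models.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def rank_by_term_count(query, documents):
    """Return (doc_id, score) pairs sorted by descending score.

    The score is the total number of times the query words occur in the
    document; whether a word is a noun or an adjective plays no role.
    """
    query_terms = tokenize(query)
    scored = []
    for doc_id, text in documents.items():
        counts = Counter(tokenize(text))
        score = sum(counts[term] for term in query_terms)
        if score > 0:
            scored.append((doc_id, score))
    # Highest score first; ties are broken by doc_id for a stable order.
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy collection.
DOCS = {
    "d1": "cats and dogs are popular pets",
    "d2": "the musical cats opened in london and cats ran for years",
    "d3": "stock prices fell sharply today",
}

ranking = rank_by_term_count("cats musical", DOCS)
```

Here `d2` is ranked above `d1` because the query words occur in it three times rather than once, and `d3` is not retrieved at all.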
2. Evaluation
Precision & Recall
Precision
• Are the retrieved documents relevant?
• Precision is the proportion of retrieved documents that are
relevant

Recall
• Are all the relevant documents retrieved?
• Recall is the proportion of relevant documents that are retrieved.
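
The two definitions above can be sketched in a few lines of Python. The function name `precision_recall` and the document ids are hypothetical, and returning 0.0 when a set is empty is an assumed convention, not something the slides specify.

```python
def precision_recall(retrieved, relevant):
    """Compute precision and recall for a single query.

    retrieved: iterable of document ids returned by the system.
    relevant:  iterable of document ids judged relevant to the query.
    """
    retrieved = set(retrieved)
    relevant = set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    # Assumed convention: an empty denominator yields 0.0.
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical example: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
p, r = precision_recall(
    retrieved=["d1", "d2", "d3", "d4"],
    relevant=["d1", "d2", "d3", "d7", "d8", "d9"],
)
```

In this example precision is 3/4 and recall is 3/6: most of what was retrieved is relevant, but half of the relevant documents were missed.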
3. Emphasis on users and their information needs
• Text queries are often poor descriptions of what the user actually
wants, unlike a request to a database system, such as a request for
the balance of a bank account.

• Despite their lack of specificity, one-word queries are very
common in web search.

• A one-word query such as “cats” could be a request for
information on where to buy cats or for a description of the Cats
(musical).

• Techniques such as query suggestion, query expansion and
relevance feedback use interaction and context to refine the
initial query in order to produce better ranked results.
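
Query expansion, one of the techniques named above, might be sketched as follows. The table `RELATED_TERMS` and the function `expand_query` are hypothetical; a real system would derive related terms from a thesaurus, query logs, or relevance feedback rather than from a hand-made dictionary.

```python
# Hypothetical, hand-made table of related terms (an assumption for
# illustration only).
RELATED_TERMS = {
    "cats": ["cat", "kitten"],
    "buy": ["purchase", "adopt"],
}


def expand_query(query, related_terms=RELATED_TERMS):
    """Return the query terms, each followed by its related terms.

    Duplicates are dropped while the original order is preserved.
    """
    expanded = []
    for term in query.lower().split():
        for candidate in [term] + related_terms.get(term, []):
            if candidate not in expanded:
                expanded.append(candidate)
    return expanded


expanded = expand_query("buy cats")
```

The expanded term list can then be passed to the ranking step, so that documents using "purchase" or "kitten" can still match a query for "buy cats".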
The User Task
• The user must first translate the information need
into a query.

• In an information retrieval system, the query is
a set of words that conveys the semantics of
the information that is required, whereas in
a data retrieval system a query expression
conveys the constraints that must be
satisfied by the objects.

• Example: A user sets out to search for
one thing but ends up looking at
something else. This means that the user is
browsing and not searching. The above
figure shows the interaction of the user
through different tasks.
Information vs Data Retrieval
The IR System
1. Collection of data (crawling)

2. Storing data in a repository

3. Indexing (for easy retrieval and ranking)

4. Retrieval process

5. User query / hyperlink

6. Expansion of query

7. Ranking

8. Feedback for better ranking
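
The indexing step can be illustrated with an inverted index, one common index structure; the slides do not say which structure is meant, so this choice is an assumption. In the sketch below, `REPOSITORY` is a hypothetical stand-in for pages that were already crawled and stored, and `build_inverted_index` is an illustrative name.

```python
from collections import defaultdict


def build_inverted_index(repository):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in repository.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Convert the sets to sorted lists so lookups give a stable order.
    return {term: sorted(doc_ids) for term, doc_ids in index.items()}


# Hypothetical repository standing in for crawled and stored pages.
REPOSITORY = {
    1: "information retrieval finds relevant documents",
    2: "data retrieval uses exact constraints",
    3: "web search is information retrieval on the web",
}

index = build_inverted_index(REPOSITORY)
```

With such an index, the retrieval step only needs to look up each query term (for example `index["information"]`) instead of scanning every stored document.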
THE RETRIEVAL AND RANKING PROCESSES

1. Spelling corrections and elimination of terms from the query

2. Query expansion

3. Eliminating stop words, stemming

4. Indexing terms used

5. Retrieved documents

6. Ranking
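
A few of the steps above (stop-word elimination, stemming, retrieval, and ranking) can be sketched together in Python. Everything here is an assumption made for illustration: the stop-word list is a small sample, `naive_stem` is a crude suffix stripper standing in for a real stemmer, and spelling correction and query expansion are left out.

```python
# Assumed, deliberately small stop-word list.
STOP_WORDS = {"the", "a", "an", "of", "in", "on", "for", "and", "is"}


def naive_stem(word):
    """Strip a few common suffixes; a crude stand-in for a real stemmer."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, drop stop words, and stem the remaining words."""
    return [naive_stem(w) for w in text.lower().split() if w not in STOP_WORDS]


def retrieve_and_rank(query, documents):
    """Retrieve documents sharing a term with the query and rank them.

    The score is the number of document terms that match a query term.
    """
    query_terms = set(preprocess(query))
    results = []
    for doc_id, text in documents.items():
        score = sum(1 for term in preprocess(text) if term in query_terms)
        if score > 0:
            results.append((doc_id, score))
    # Highest score first; ties are broken by doc_id.
    return sorted(results, key=lambda pair: (-pair[1], pair[0]))


# Hypothetical toy collection.
DOCS = {
    "d1": "ranking of documents in search engines",
    "d2": "indexing the documents for searching",
    "d3": "cooking recipes for the weekend",
}

results = retrieve_and_rank("searching the documents", DOCS)
```

Because "searching" and "search" are reduced to the same stem, `d1` matches the query even though it never contains the word "searching", while `d3` shares no terms and is not retrieved.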
The Web
• With the rapid growth of the Internet, more
information is available on the Web. Web
information retrieval presents additional technical
challenges compared to classic information
retrieval because of the heterogeneity and size of the Web.
• Web information retrieval is unique due to the
dynamism, variety of languages used, duplication, high
linkage, ill-formed queries and wide variance in the
nature of users.
• Many software tools are available for web information
retrieval, such as Google, Yahoo, and many other agents.
E-PUBLISHING-ERA
• Since its inception, the Web has been a huge
success: well over 20 billion pages are now
available and accessible on the Web, and more than
one fourth of humanity now accesses the Web on
a regular basis.
• Articles can be published on the Web easily and
without any time delay.
How the Web Changed the Search
IMPACTS
• Characteristics and variety of the documents (hyperlinks and their connections).

• Size of the collection and the volume of user queries submitted on
a daily basis.

• In a very large collection, predicting relevance is much harder than
before. Noise in relevance is high.

• The Web is now not only a medium for search but also a business.

• Web advertising and other economic incentives lead to Web
spam.

• Spam harms relevance.


HOW PEOPLE SEARCH
Information Lookup versus Exploratory Search

User interaction with search interfaces differs
depending on the type of task, the amount of
time and effort available to invest in the process,
and the domain expertise of the information
seeker.
Classic versus Dynamic Model of Information
Seeking
E.g.: Classic (static) – searching for simple information
Dynamic – researching or analyzing the data.
SEARCH INTERFACES TODAY
Query Specification
• Query Specification Interfaces
• Query Reformulation
VISUALIZATION IN SEARCH INTERFACES.

• Text
• Images
• Graphs
• GIFs
• Videos

The format varies depending upon the information
to be delivered.
