
Information Retrieval

CS485

Tibebe Beshah
https://sites.google.com/site/sadcoursesite/

HiLCoE 1
What the course is about

 How people search and find information


 How computers store and retrieve
information
 How computer systems are designed to
help people find information they need

HiLCoE 2
Motivation (why)

 Huge amount of information


 Information overload
 Difficulty in information access

HiLCoE 3
How?

 Lecture
 Assignment
 Project
 Quizzes/tests
 Class work (group)

HiLCoE 4
What?
 Course Outline
 Information Storage and Retrieval (ISR): Basic concepts
(one and a half weeks)
 Content (subject) analysis and Representation (two weeks)
 Standardization and Identifying term importance (one and
a half weeks)
 Models of Modern IR systems (three weeks)
 Users and interfaces of information retrieval systems (one
week)
 Evaluation of Information Retrieval Systems (one and a half
weeks)
 Current areas of research in Information Retrieval (half a
week)

HiLCoE 5
Course Objective:
 The general objective of the course is to let students
understand motivation, mechanisms, and potential of
Information Storage and Retrieval. In line with this, at
the end of the course students are expected to:
 Know the basic theories and principles of knowledge
organization and Information retrieval
 Understand the process of Information Storage and Retrieval
 Acquire qualities to study and analyze Information Retrieval
Systems designed using various models
 Be familiar with evaluation issues in Information Retrieval

HiLCoE 6
What the course emphasizes:
understanding of

 Theories

 Tools (lexical analyzers, stemmers, etc.)

 Algorithms (ranking, matching, clustering, etc.) and

 Evaluation of information retrieval systems

HiLCoE 7
What this course is NOT
 An algorithm design course
 We might use several related algorithms,
but not study them in detail
 A system development course
 Except that some assignments may require
you to write or compile some C, C++, Java,
etc. procedures
 We look at an IR system as a whole,
not as individual components

HiLCoE 8
Knowledge Useful for the Course
 Mathematics (set theory, probability, vector
algebra)
 Data / File structure
 Linguistics (read papers on Linguistics in Information
Science)
 System Analysis & Design
 Programming in higher level languages such as C, C++,
Java, VB, etc.

HiLCoE 9
What IR assumes?

 Information is stored (or available)


 A user has an information need
 An automated system exists from which
information can be retrieved
 The system works!!

HiLCoE 10
References
 Robert Korfhage (1997). Information Storage and Retrieval. John
Wiley & Sons, Inc.
 G.G. Chowdhury (1990). Introduction to Modern Information
Retrieval. Library Association Publishing. London.
 Ricardo Baeza-Yates and Berthier Ribeiro-Neto (1999). Modern
Information Retrieval. New York: ACM Press.
 A.C. Foskett (1988). The Subject Approach to Information. 5th ed.
London: Bingley.
 Charles T. Meadow (2000). Text Information Retrieval
Systems. 2nd ed. Academic Press. New York.

HiLCoE 11
Evaluation

 Course work (Assignments)


 Quizzes and test
 Final exam
 Class participation and attendance

HiLCoE 12
Information Storage and
Retrieval (ISR): Basic concepts

Chapter One

HiLCoE 13
Objectives (Chapter One)

 Provide a general overview of the historical
development, scope, and coverage of IR as a
discipline
 Make students aware of the basic
purpose, function, components, structure and
features of an IRS.

HiLCoE 14
Topics ( Chapter one )

 Definition, Foundation, theories and principles


 Information retrieval system: components,
structures and functions
 Database retrieval Vs. information retrieval
 The (information) retrieval process
 Factors affecting effective retrieval
 Challenges in IR
 Other central concepts in IR

HiLCoE 15
Definition, Foundation,
theories and principles

HiLCoE 16
Information Retrieval Systems?
 Document (Web page) retrieval in response to a query
 Quite effective (at some things)
 Commercially successful (some of them)
 But what goes on behind the scenes?
 How do they work?
 What happens beyond the Web?
 Web search systems
• Lycos, Excite, Yahoo, Google, Live, Northern Light, Teoma,
HotBot, Baidu, …
Web Search System
[Figure: a Web spider gathers pages into a document corpus. The IR
system takes a query string and the corpus as input and returns
ranked documents (1. Page1, 2. Page2, 3. Page3, ...).]
Information Retrieval - Definition

 Is an important sub-discipline of Information
Science that is concerned with developing
theories and methods of access to
information
 Focus is on helping users find information
that matches their information need (User
Centered View)

HiLCoE 19
Cont…

 Is a branch of applied Computer Science that
focuses on representation, storage,
organization of, and access to information
items (System Centered View).

HiLCoE 20
Cont…

 A good formal definition of information retrieval
is given in Baeza-Yates & Ribeiro-Neto (1999, p. 1)

“Information retrieval deals with representation,
storage, organization of, and access to
information items. The organization and access
of information items should provide the user with
easy access to the information in which he is
interested”

HiLCoE 21
Cont…
 The definition incorporates all important features of a
good information retrieval system
 Representation

 Storage

 Organization

 Access

 Evaluation

 As a field, IR focuses on advanced applications of
computers
 Is about finding relevant information in large
collections of data

HiLCoE 22
Cont…

 Information items: usually text, but possibly
also image, audio, video, etc.
 Text items are often referred to as
documents, and may be of different scope
(books, articles, paragraphs, etc.)
 The user's information need is translated into a query
consisting of keywords (word forms) which
summarize the description of that need.
HiLCoE 23
IR from different perspectives
 Conceptually,
 IR is used to cover all related problems in
finding needed information
 Historically,
 information retrieval is about document
retrieval, emphasizing the document as the basic
unit
 Technically,
 information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.

HiLCoE 24
Cont…

 IR involves helping users find information that
matches their information needs
 Its techniques and applications have reached
many fields where processing large amounts
of information is essential

HiLCoE 25
Information Retrieval
 Can be structured for ease of discussion as
 Text IR

 Discusses the classic problem of searching a collection of
documents for useful information
 Focus is on document images that are predominantly
text (rather than pictures)
 These are called textual images and are amenable to
automatic extraction of key words
 Multimedia IR
 Discusses how to index document images and other
binary data by extracting features from their content and
how to search them efficiently

HiLCoE 26
Cont…

 Human computer interaction (HCI) for IR
 Discusses current trends in IR towards improved
user interfaces and better data visualization tools
 Application of IR
 Covers modern applications of IR (such as the
Web, bibliographic systems, and digital libraries)

HiLCoE 27
Entities in IRS

 Two important entities


 Information need: to be represented by search
statements (query)
 Information items (documents): to be
represented by index terms or any form of
representation like summary
 Thus the process in an IRS is matching these
abstractions

HiLCoE 28
Thus the focus is on

 How to organize and represent information


items effectively and efficiently
 How to represent information needs
 How to match these two

HiLCoE 29
Key Issues in IR
 Organizing
 How to describe information resources or
information-bearing objects in ways so that they
may be effectively used by those who need to use
them
 Retrieving
 How to find the appropriate information resources
or information-bearing objects for someone’s (or
your own) needs
 Build a system that retrieves
documents that users are likely to find relevant to
their queries
 This set of assumptions underlies the field of IR

HiLCoE 30
IR is an Iterative Process
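[Figure: the information life cycle drawn as a loop. Labels
recoverable from the diagram: Creation (Authoring, Modifying),
Organizing / Indexing, Storing / Retrieval, Accessing / Filtering,
Distribution / Networking, Searching, Utilization (Using, Creating),
Retention / Mining, Disposition / Discard. The phases are marked
Active, Semi-Active and Inactive.]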
HiLCoE 31
IR-Representation/organizing

 The representation and organization
should provide the user with easy access
to the information in which he is interested
 Thus, focus is on the user information
need
 Unfortunately, characterization of the user
information need is not a simple problem

HiLCoE 32
Cont…
 An example of a user information need:
 Find all passages (documents) containing
information on college tennis teams which
are maintained by a USA university

HiLCoE 33
Cont…
 Basic remarks on user information need (in the context of
the World Wide Web):
 Such a full description of the user information need cannot
be used directly to request information using the current
interfaces of Web search engines
 The user must first translate his information need into a
query which can be processed by the search engine (or
IR system).
 In its most common form, the translation yields a set of
keywords (or index terms) which summarizes the
description of the user information need
 Emphasis is on the retrieval of information (not data)

HiLCoE 34
IR- Storing/Retrieving

 Information storage
 How and where is information stored?
 Retrieving information
 How is information recovered from storage?
 How to find needed information
 Linked with accessing/filtering stage

HiLCoE 35
IR - Accessing/Filtering

 Using the organization created in the organizing/indexing (O/I)
stage to:
 Select desired (or relevant) information
 Locate that information
 Retrieve the information from its storage
location (often via a network)

HiLCoE 36
Implementation

 Thus, in order to address the key issues above,
the implementation task is to develop an
Information System
 Retrieval system

HiLCoE 37
More on IR
 IR is concerned with retrieval of relevant
documents from a large collection of
documents
 Relevant documents are identified
according to specific criteria (usually
called query)
 IR usually deals with NL text which is not
always well structured and could be
semantically ambiguous

HiLCoE 38
Cont…

• IR deals with very large sets of documents
 High amount of robustness, efficiency
 Domain-independence & multilinguality
• IR considers NL text mainly from a lexical view
 Identifying possible word forms
 Elimination of stop words (e.g., the, of, zu, ...)
 Stemming (e.g., supporting, supported → support)
 Selection of index terms
 Term weighting
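
The lexical steps listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the stop list is a toy one, the suffix-stripping rule is a crude stand-in for a real stemmer (such as Porter's), and the names `STOP_WORDS`, `crude_stem` and `index_terms` are hypothetical.

```python
import re

# Toy stop list (assumption); real lists are much longer and language specific.
STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "zu"}

def crude_stem(word):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def index_terms(text):
    """Identify word forms, drop stop words, and stem what is left."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]

print(index_terms("Supporting the supported users of the system"))
# → ['support', 'support', 'user', 'system']
```

Both "supporting" and "supported" are conflated to the same term, which is the point of stemming; term weighting would then be applied to the resulting terms.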

HiLCoE 39
A sketch of a searcher… moving through many actions towards a
general goal of satisfactory completion of research related to an
information need
IR is an Iterative Process

[Figure: a searcher starts from Goals with query Q0 and moves through
successive queries Q1, Q2, Q3, Q4, Q5 across the Repositories.]

HiLCoE 40
Historical overview
 Organization and storage of knowledge for
ease of access is centuries old
 That is, the history of recording knowledge
goes back thousands of years.
 Important events
 Development of writing, Books and printing
technology, News publishing, Journal publishing
(economic reasons- books are not economical in
terms of money and time), Libraries (to put
publications in one centre)

HiLCoE 41
Cont…
 Now: the World Wide Web
 A gigantic distributed collection of
heterogeneous information items (web pages)
 New challenges

HiLCoE 42
Cont…

 As the size of the collection grows, access to
documents becomes more difficult without
proper mechanisms
 Therefore, in order to reach documents in
libraries or other collections, access
mechanisms were necessary

HiLCoE 43
Cont…
 Simple methods to facilitate access to
single document:
 Table of contents,
 Keyword index
 Classical methods to facilitate access to
collections of documents
 Index (keywords, authors)
 Hierarchical (Dewey-Decimal classification)

HiLCoE 44
Cont…

 Increasing demand for information access


created Information Science as a discipline
 IR, a term first coined in 1952 that gained
acceptance from 1961 onwards, became an
important sub-discipline that is concerned
with developing theories and methods of
access to information

HiLCoE 45
Creation of Disciplines
 The mechanized era (Sparck Jones and Willett, 1997)
 IR systems were mainly used by librarians
 for carrying out bibliographic searches in place of
manual tools such as card catalogue and universal
classification systems
 The advent of word processing technology (software
+ hardware)
 a rapid, wide spread growth in the usage of IR
 Increased interest in Web-based distributed information
processing and in the application of IR techniques to
non-textual information
− The growth of knowledge  Creation of discipline

HiLCoE 46
Cont…
 Discipline oriented era
e.g. Science from philosophy
physics from science
electricity from physics
electronics from electricity
 Similarly, information retrieval from the wider
discipline of information science
 Then came the Problem Oriented Era
 Disciplines are merged to form a new subject.
 E.g. Molecular Biology from Physics and Biology
(Foskett, 1988)

HiLCoE 47
Cont…
 Such growth of knowledge gave birth to the
creation of disciplines (domain knowledge)
which then brought about the need for
classification and indexing
 Putting related knowledge together
 E.g. Science, Arts, and Humanities
 Creation of subclasses within classes
 Designing ways and means of
accessing information (which is the
area of IR)

HiLCoE 48
Information retrieval system:
components, structures and
functions
How do we characterize IRS?

HiLCoE 49
What is a system?
 Is a set of interrelated components interacting
together to achieve an objective.
 Has basic characteristics like:
 Input, output, environment, boundary, objectives,
components, interaction, interface
 Can be living or non-living
 What is “systems thinking”?
 Do you agree with this? “A system is bigger than the sum
of its components”

HiLCoE 50
Systems thinking
 Is a mindset or way of thinking that views the world
(everything in the world) as a system.
 It emphasizes the interactions that keep the system
alive.
 Benefits
 Identification of a system leads to abstraction
 From abstraction you can think about essential
characteristics of specific system
 Abstraction allows analyst to gain insights into specific
system, to question assumptions, provide documentation
and manipulate the system without disrupting the real
situation

HiLCoE 51
Cont..

 Different types of Information systems


 IRS
 DBMS
 MIS
 DSS
 ESS

HiLCoE 52
IRS
 Is a system that is capable of storage, retrieval,
and maintenance of information items
 The process of an IR system is to match two
abstractions
 Data abstracted in the system
 Queries abstracted from users’ information needs
[Figure: Need ↔ Docs, with the system matching the two sides]

HiLCoE 53
Cont…
 The goal of IR systems is to help users find
information that satisfies their information
needs
 The user's information need is translated into a query
consisting of keywords (word forms) which
summarize the description of that need.
 Given the user query, the key goal of an IR
system is to retrieve information which
might be useful or relevant to the user.

HiLCoE 54
Cont…

 Examples :
 search engines like Google which retrieve
documents on the Web containing the
keywords, and return a ranked list of
references to relevant documents.
 Such search engines are word form based
and often analyze the link structure of the
WWW

HiLCoE 55
IRS – Presentation of IR

 Present results in a format that helps the
user determine relevant items
 Arbitrary (physical) order
 Relevance order

HiLCoE 56
Purpose of IRS
 The purpose of an IRS is to capture wanted items
(information ) and to filter out unwanted information
1. Writers' ideas
 Information is generated by authors
 Authors generate large quantities of information every day
 Ideas are represented in the form of documents
2. Information need
 At the other end of the chain of communication
 We have readers, each with his or her own individual need for
information which has to be selected from the mass available
3. IRS
 Helps to organize the information sources in (1)
 Helps to bring (1) and (2) together (writer and information
seeker)

HiLCoE 57
Information Retrieval- objective/goal
 Minimize search overload of a user who is
locating needed information
 Needed information: either
 All information in the system relevant to the user needs
 Sufficient information in the system to complete a task
 Example
 Looking for an item to purchase
 Looking for an item to purchase at minimal cost

HiLCoE 58
Cont…
 Support user search, providing tools to overcome
obstacles such as :
 Ambiguities inherent in languages
 Homographs – words with identical spelling but with
multiple meanings
 Limit to user’s ability to express needs
 Lack of system experience or aptitude
 Lack of experience in the area being searched
 Initially only vague concept of information sought
 Differences between user’s vocabulary and authors’
vocabulary: different words with similar meanings

HiLCoE 59
Basic functions of an IRS
 Analysis of documents and organization of information
(creation of a document database)
 Analysis of users' queries and preparation of a strategy to search
the database
 Actual searching or matching of users' queries with
the database
 Retrieval of items that fully or partially match the
search statement

HiLCoE 60
Subsystems of an IR system
 The two subsystems of an IR system:
 Searching: is an online process of finding relevant
documents in the index list that match the user's query
 Indexing: is an offline process of organizing documents
using keywords extracted from the collection
 Indexing is used to speed up access to desired information from
the document collection as per the user's query
 Indexing and searching: are inextricably connected
 You cannot search what was not first indexed in some manner
or other
 Indexing of documents is done in order to make them searchable
 there are many ways to do indexing
 to index one needs to select an indexing language
 there are many ways to organize the index, including inverted file,
sequential file, suffix tree, signature file, etc.
 even taking every word in a document is an indexing language
 Knowing searching is knowing indexing
Indexing Subsystem
[Figure: documents → Assign document identifier → text and
document IDs → Tokenize → tokens → Stop list → non-stoplist
tokens → Stemming & Normalize → stemmed terms → Term weighting
→ terms with weights → Index]
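
The indexing pipeline above might look roughly like this in Python. It is a sketch under assumptions, not the system described in the course: the stop list is a toy one, "stemming" only drops a trailing "s", the weight is the raw term frequency, and `normalize` and `build_index` are hypothetical names.

```python
import re
from collections import Counter, defaultdict

STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for"}  # toy list (assumption)

def normalize(token):
    """Very crude stemming: drop a trailing 's' (assumption, not a real stemmer)."""
    return token[:-1] if token.endswith("s") and len(token) > 3 else token

def build_index(documents):
    """Return {term: {doc_id: weight}}, where weight is the raw term frequency."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(documents, start=1):    # assign document identifier
        tokens = re.findall(r"[a-z0-9]+", text.lower())   # tokenize
        terms = [normalize(t) for t in tokens if t not in STOP_WORDS]  # stop list, stemming
        for term, tf in Counter(terms).items():           # term weighting (raw tf)
            index[term][doc_id] = tf
    return dict(index)

docs = ["Retrieval of documents", "Indexing documents for retrieval systems"]
index = build_index(docs)
print(index["document"])  # → {1: 1, 2: 1}
```

Each entry of the result maps a term to the documents that contain it, together with a weight, which is the information the searching subsystem needs.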
Searching Subsystem
[Figure: query → parse query → query tokens → Stop list →
non-stoplist tokens → Stemming & Normalize → stemmed terms →
Term weighting → query terms → Similarity Measure (compared
against index terms from the Index) → relevant document set →
ranking → ranked document set]
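
The searching side can be sketched in the same spirit. The example below assumes a small hand-made index (`INDEX`, hypothetical), omits stemming for brevity, and uses cosine similarity as the similarity measure; the slides do not say which measure is intended, so that choice is an assumption.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"the", "of", "a", "an", "and", "in", "to", "is", "for"}  # toy list (assumption)

# Hypothetical pre-built index: term -> {doc_id: term frequency}.
INDEX = {
    "information": {1: 2, 2: 1},
    "retrieval": {1: 1, 3: 1},
    "database": {2: 2},
    "web": {3: 2},
}

def query_terms(query):
    """Parse the query, apply the stop list, and weight terms by frequency."""
    tokens = re.findall(r"[a-z0-9]+", query.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)

def search(query):
    """Rank documents by cosine similarity between query and document vectors."""
    q = query_terms(query)
    dot = Counter()
    for term, q_weight in q.items():
        for doc_id, d_weight in INDEX.get(term, {}).items():
            dot[doc_id] += q_weight * d_weight
    doc_norm = Counter()
    for postings in INDEX.values():
        for doc_id, w in postings.items():
            doc_norm[doc_id] += w * w
    q_norm = math.sqrt(sum(w * w for w in q.values()))
    scores = {d: s / (math.sqrt(doc_norm[d]) * q_norm) for d, s in dot.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

results = search("retrieval of information")
print(results)  # document 1 comes first; documents 2 and 3 tie below it
```

Only documents that share at least one term with the query receive a score, which is why the index lookup comes before the similarity computation.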
Components of IR Systems
IR Systems

Human Components System Components


 Human Components
 Users -- who create the needs of the system (the user)
 Organization -- who makes it possible to have the system
(the funder)
 Information professionals -- who operate the system and
provide the services (the server)
 System Components
 Data -- the content of the system
 Device & media -- hardware of the system
 Algorithms & procedures -- software of the system

HiLCoE 64
High level structure of an IRS

Users Black box Documents

•Indexing
•Searching

HiLCoE 65
Structure of an IR System
[Figure: structure of an information storage and retrieval system,
adapted from Soergel, p. 19.
Search line: interest profiles & queries → translation (formulating
the query in terms of descriptors) → storage of profiles →
Store 1: profiles / search requests.
Storage line: documents & data → translation (indexing, descriptive
and subject) → storage of documents → Store 2: document
representations.
Rules of the game = rules for subject indexing + thesaurus (which
consists of lead-in vocabulary and indexing language).
Store 1 and Store 2 → comparison / matching → ranking →
potentially relevant documents.]
HiLCoE 66
HiLCoE 67
Cont…
 Each incoming item is analysed
 Appropriate descriptions are chosen to reflect the
information content of the item
 Each item is classified in accordance with established
procedures and incorporated into the collection of existing
information items
 Procedures are established for formulating
requests designed to satisfy an information need
and comparing these requests, queries, with the
descriptions of the stored items
 These comparisons are the bases for deciding
which items are appropriate for the respective
queries

HiLCoE 68
Cont…
 Finally, a retrieval and dissemination
mechanism is used to deliver the information
items of potential interest to the users of the
information system
 These steps are all carried out in conventional
libraries, where a card catalogue
forms the principal auxiliary tool used in an
information search
 In this course we are interested in the process
and methodologies needed to carry out such
tasks automatically
HiLCoE 69
Cont…
 Translation from user need to query
 Usually, manually ( by user himself)

 Translation from item to representation (surrogate)


 Often, automatically (by the system)

 Representation can be at different level:


 Full text, abstract only, index terms only, etc.
 Duality of the two translations
 User query can be regarded as the representation of

the ideal (sought-after) item


 Often, similar techniques are used to generate both

HiLCoE 70
Another view of IR
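[Figure: on one side, the information need becomes text input, which
is parsed into a Query; on the other side, Collections are
pre-processed into an Index; Query and Index feed Rank.]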

HiLCoE 71
Content analysis areas
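[Figure: Information need → text input → Parse → Query;
Collections → Pre-process → Index; Query and Index → Rank.
Annotations: "How is the query constructed?" on the query side and
"How is the text processed?" on the collection side.]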

HiLCoE 72
[Figure: Information need → text input → Parse → Query;
Collections → Pre-process → Index; Query and Index → Rank;
after Rank, a Reformulated Query leads to Re-Rank.]

HiLCoE 73
Example of IRS

HiLCoE 74
General Categories

 In-house
 Domain specific
 Set up by a particular information center
 Example: library catalogue
 On-line
 Designed to provide access to remote database
to a variety of users

HiLCoE 75
Cont…

 Most people have used IR systems


one way or the other:
 Traditional IR systems
 Library catalogue
 List of books (reading list)
 Bibliographies (universal, national or
subject)
 Abstracting and indexing serials

HiLCoE 76
Cont…
 Modern IR systems
 Computer based Library systems to search
for books, papers and course information
 Search engines, which retrieve documents on
the Web containing the keywords, and return a
ranked list of references to relevant documents.
 Such search engines are word form based and
often analyze the link structure of the WWW
 Google, AltaVista and Yahoo are among the most popular
examples of IR applications nowadays
 Electronic encyclopaedia (online or CDROM)

HiLCoE 77
Database Systems Vs
Information Retrieval Systems

Are they the same or not? Is there any


Overlap?

HiLCoE 78
DBMS vs IRS

 IRS is one of the different types of
information systems
 But it has more similarities than
differences with DBMS
 Accordingly, it is logical to compare
and contrast these two information
systems

HiLCoE 79
Cont…
 On the Information/data
 DBMS: structured data (often homogeneous records),
semantic unambiguity
 IR systems: unstructured (free text), ambiguity
 On the answers/results
 DBMS:
 Records (tuples)
 Perfect precision and recall, each item is relevant (no
ranking)
 Well defined results
 IR systems
 Documents
 Imperfect precision and recall, each item has specific
relevance (ranking)
 fuzzy results
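
Precision and recall are not defined on this slide. Under the standard definitions (precision is the fraction of retrieved items that are relevant, recall is the fraction of relevant items that are retrieved), they can be computed as in this small sketch. The document IDs and relevance judgments are made up for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical judgments: the system returned 4 documents, 3 documents are relevant.
p, r = precision_recall(retrieved={1, 2, 3, 4}, relevant={2, 4, 7})
print(p, r)  # precision 0.5, recall 2/3
```

A DBMS query that returns exactly the matching tuples scores 1.0 on both measures, whereas an IR system typically trades one off against the other.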

HiLCoE 80
Cont…

 On their relationship
 Systems complement each other
 On their history
 DB grew out of files and traditional business
system
 IR grew out of library science and need to
categorize/group/access books/articles

HiLCoE 81
Cont…
                      Data retrieval      Information retrieval
Content               Data                Information
Data object           Table               Document
Matching              Exact match         Partial match, best match
Items wanted          Matching            Relevant
Query language        SQL (artificial)    Natural
Query specification   Complete            Incomplete
Organization          Highly structured   Less structured
Classification        Monothetic          Polythetic
HiLCoE 82
Cont…

 Data retrieval
 records contain a set of keywords
 Well defined semantics
 a single erroneous object implies failure!
 Information retrieval
 information about a subject or topic
 semantics is frequently loose
 small errors are tolerated

HiLCoE 83
Cont…

 IR system:
 interpret contents of information items
 generate a ranking which reflects relevance
 notion of relevance is most important
 Information retrieval is much more difficult
than data retrieval

HiLCoE 84
The Retrieval Process

What does the basic retrieval process
look like?

HiLCoE 85
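The Retrieval Process
[Figure: the user (callouts: Web search engine, Web browser) states a
user need at the User Interface. Text Operations produce a logical
view of both the query text and the document text. Query Operations
build the query, taking user feedback into account. The DB Manager
Module defines the Text Database, and Indexing builds the Index
(inverted file). Searching uses the query and the Index to return the
retrieved docs, and Ranking returns the ranked docs to the user.]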
HiLCoE 86
The Retrieval Process
 The user interface – think of it as the user interface
available with current IR systems including
 Web search engines
 It is necessary to define the text database before
any of the retrieval processes are initiated
 This is usually done by the manager of the
database and includes specifying the following
 the documents to be used

 The operations to be performed on the text

 The text model to be used (the text structure and


what elements can be retrieved)
 The text operations transform the original
documents and generate a logical view of them

HiLCoE 87
Cont…
 Once the logical view of the documents is
defined, the database manager builds an index
of the text
 An index is a critical data structure
 It allows fast searching over large volumes of
data
 Different index structures might be used, but the
most popular one is the inverted file
 Once the document database is indexed, the
retrieval process can be initiated
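
A minimal sketch of an inverted file, assuming whitespace tokenization and a three-document toy collection, shows why it allows fast searching: a query only touches the posting lists of its own words instead of scanning every document. The function names are hypothetical.

```python
from collections import defaultdict

def build_inverted_file(docs):
    """Map each word to the sorted list of IDs of the documents containing it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings[word].add(doc_id)
    return {word: sorted(ids) for word, ids in postings.items()}

def lookup_all(inverted, words):
    """Return the documents containing every query word (Boolean AND)."""
    result = None
    for word in words:
        ids = set(inverted.get(word, []))
        result = ids if result is None else result & ids
    return sorted(result or [])

docs = {1: "fast text search", 2: "text index structures", 3: "fast index search"}
inverted = build_inverted_file(docs)
print(inverted["fast"])                          # → [1, 3]
print(lookup_all(inverted, ["text", "index"]))   # → [2]
```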

HiLCoE 88
Cont…

 The user first specifies a user need which is
then parsed and transformed by the same
text operations applied to the text
 Then the query operations might be applied
before the actual query, which provides a
system representation for the user need, is
generated

HiLCoE 89
Cont…
 The query is then processed to obtain the retrieved
documents
 Before the retrieved documents are sent to the
user, the retrieved documents are ranked
according to the likelihood of relevance
 The user then examines the set of ranked
documents in the search for useful information
 The user may need to reformulate the query

HiLCoE 90
Cont…
 At this point, he might pinpoint a subset of the
documents seen as definitely of interest and
initiate a user feedback cycle
 In such a cycle, the system uses the
documents selected by the user to change
the query formulation
 Hopefully, this modified query is a better
representation of the real user need
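
The slides do not say how the query formulation is changed. One common family of approaches, in the spirit of Rocchio's method, adds weight from the terms of the selected documents to the query. The sketch below uses positive feedback only; the weights `alpha` and `beta`, the term vectors and the function name are illustrative assumptions.

```python
from collections import Counter

def reformulate(query, selected_docs, alpha=1.0, beta=0.5):
    """Add the average term weights of the user-selected documents to the query.

    alpha and beta are illustrative weights (assumptions), not tuned values.
    """
    new_query = Counter({term: alpha * w for term, w in query.items()})
    for doc in selected_docs:
        for term, w in doc.items():
            new_query[term] += beta * w / len(selected_docs)
    return dict(new_query)

query = {"jaguar": 1.0}
# Term-frequency vectors of two documents the user marked as relevant (hypothetical).
selected = [{"jaguar": 1, "car": 2}, {"jaguar": 1, "engine": 2}]
new_query = reformulate(query, selected)
print(new_query)  # → {'jaguar': 1.5, 'car': 0.5, 'engine': 0.5}
```

The modified query now contains "car" and "engine", so it describes the intended sense of "jaguar" better than the original one-word query did.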

HiLCoE 91
Cont…

[Figure: labels recoverable from the diagram: User, Interface,
queries, spider, Index, Search engine, Web pages]

HiLCoE 92
Factors Affecting Effective
Retrieval

What are the two factors affecting
retrieval?

HiLCoE 93
Factors Affecting Effective Retrieval

• The effective retrieval of relevant
information is directly affected by two things
 The User Task
 The logical view of the documents
adopted by the retrieval system

HiLCoE 94
The User Task

Retrieval

Database

Browsing/ surfing

HiLCoE 95
Cont…
 The user task: The user task might be
one of retrieval or browsing
 Retrieval
 information or data
 Information need (retrieval goal) is focused and
crystallized, purposeful; often the user is sophisticated
 Browsing/ surfing
 Information need (retrieval goal) is vague and imprecise
 Glancing around; often the user is naive
 Both are initiated by the user

HiLCoE 96
Users
 The user: anyone who needs to find some
information
 The user groups
 group by their knowledge of the system
 novice users vs. experienced users
 end users vs. information specialists
 group by their domain knowledge
 Domain experts vs. general public
 group by information needs
 need to locate a particular item
 need some information
 need all information on a subject
HiLCoE 97
User’s Information Needs
 At all levels of our life we need information (e.g.
crossing the road, health, nutrition, travel,…)
 Information need is the desire to know, the desire
to fill a gap of knowledge
 Example- problem: one wants to cross a road in a
high traffic area: What is the information he
needs? He needs information
 About the direction people drive (left or right)
 About the meanings of the traffic light (green, yellow, and
red)
 Sign posts, etc ?

HiLCoE 98
Cont…

 People depend on information to carry


out their activities of daily life.
 need to accomplish some goals
 need to solve some problems
 People realize a lack of information
 perceive a gap in their knowledge state
 desire to fill the gap

HiLCoE 99
Logical view of documents

 The logical view of documents


 Full text
 Any point in between full text and index
terms
 Set of index terms

HiLCoE 100
Document Processing Steps

HiLCoE 101
From “Modern IR” textbook
Cont..
 Documents in a collection are
frequently represented through a set
of index terms or keywords
 An index term is a key word (or group of
related words) which has some meaning
of its own (which usually has the
semantics of a noun)
 In its more general form, an index term is
simply any word which appears in the
text of a document collection

HiLCoE 102
Cont…

 It is simply a (document) word whose
semantics help in remembering the
document’s main theme
 Index terms are used to index and
summarize the document content
 How to generate index terms? (next
chapter)

HiLCoE 103
Cont…
 Key words might be extracted directly from the text
of the document or
 Keywords might be specified by a human expert
(this is frequently done in the information science
arena)
 No matter whether these representative keywords
are derived automatically or generated by a
specialist, they provide a logical view of a document
(concise logical view)

HiLCoE 104
Cont...

 Modern computers make it possible to represent a
document by its full set of words
 In this case, we say that the retrieval system
adopts a full text logical view (or representation)
of the documents
 With very large collections, however, modern
computers might have to reduce the set of
representative keywords
 This can be accomplished through the following
standard steps

HiLCoE 105
Cont...
 Standard steps
 Recognizing document structures (titles,
sections, paragraphs, etc.)
 Break into tokens
 Usually space and punctuation delimited
 Special issues with some languages
 The elimination of stopwords (such as
articles and connectives)

HiLCoE 106
Cont…
 Conflation: The use of stemming/ morphological
analysis
 Purpose
 Overcome the variants of word forms by reducing all
words with the same root to that root (i.e., reducing distinct
words to their common grammatical root)
 Most IR systems perform stemming on both text
and query
 The identification of noun groups (which
eliminates adjectives, adverbs, and verbs)
 Other operations can also be performed
 Store in inverted index (to be discussed in later
chapters)

HiLCoE 107
Cont…
 Such text operations reduce the
complexity of the document
representation and allow moving the
logical view from that of a full text to
that of indexed terms
 Index - A list of important key words
from the documents

HiLCoE 108
Cont...
 The full text is the most complete logical
view of a document, but its usage usually
implies higher computational costs
 A small set of categories/ index terms
(generated automatically or by a human
specialist) provides the most concise
logical view of a document, but its usage
might lead to retrieval of poor quality
 Several intermediate logical views (of a
document) might be adopted by an
information retrieval system as shown in
the figure
HiLCoE 109
Cont…

 The issue of logically representing a
document should be viewed as a continuum
in which the logical view of the document
might shift (smoothly) from a full text
representation to a higher level
representation specified by a human subject

HiLCoE 110
Cont...
 The index terms obtained are a description of
a document content and of its structure
 Models may allow reference to the text
document
 The models might also allow references to
the structure normally present in written text
(in this case we say a structured model)
 Retrieval based on index terms or keywords
might be of fairly low quality

HiLCoE 111
Cont…
 Two major reasons for this
 The user query might be composed of too few terms which
usually implies the query context is poorly characterized
 This problem is dealt with through transformations in the query
such as query expansion and user relevance feedback
 The set of keywords generated for a given document might
fail to summarize its semantic content properly
 This problem is dealt with through transformations in the text such
as
 Identification of noun groups to be used as keywords
 Stemming
 The use of a thesaurus
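
Query expansion with a thesaurus can be sketched as below. The thesaurus entries and the function name are hypothetical and exist only for illustration; a real system would use a curated or automatically built thesaurus.

```python
# Hypothetical hand-made thesaurus: term -> related terms.
THESAURUS = {
    "takeover": ["acquisition", "merger"],
    "company": ["corporation", "firm"],
}

def expand_query(terms, thesaurus):
    """Return the original terms followed by related terms, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for related in thesaurus.get(term, []):
            if related not in expanded:
                expanded.append(related)
    return expanded

expanded = expand_query(["corporate", "takeover"], THESAURUS)
print(expanded)
```

A short query of two terms becomes a richer one, which helps when the original query characterizes its context poorly.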

HiLCoE 112
Cont...
 Given a set of index terms for a document, we
notice that not all the terms are equally useful for
describing the document contents
 There are index terms that are simply vaguer
than the others
 Deciding on the importance of a term for
summarizing the contents of a document is not a
trivial issue
 Despite this difficulty, there are properties of an
index term that are useful for judging its importance

HiLCoE 113
Cont…
 Examples of such properties
 A word which appears in each of the one hundred
thousand documents is completely useless as an
index term because it does not tell us anything
about which documents the user might be interested
in
 A word which appears in just five documents is quite
useful because it narrows down considerably the
space of documents which might be of interest to the
user
 Thus, distinct index terms have varying relevance
when used to describe document contents
 This effect is captured through the assignment of
numerical weights to each index term of a
document
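
One common way to turn the two examples above into numbers is inverse document frequency, log(N / df). The slides do not name a specific weighting scheme, so treat this as an assumed illustration rather than the weighting the course prescribes; the function name is also an assumption.

```python
import math

def idf(num_documents, document_frequency):
    """Inverse document frequency: log(N / df)."""
    return math.log(num_documents / document_frequency)

N = 100_000
weight_everywhere = idf(N, N)  # term appears in every document
weight_rare = idf(N, 5)        # term appears in just five documents
print(weight_everywhere, round(weight_rare, 2))
```

The term that occurs in every document gets weight zero, while the term that occurs in five documents gets a high weight, matching the intuition on the slide.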

HiLCoE 114
Challenges in IR

Why is IR a Difficult Problem?

HiLCoE 115
Why is IR a Difficult Problem?
 The size of the web is doubling every
year:
 50 million pages in November 1995, 320
million pages in December 1997, 800 million
pages in February 1999, 1 billion pages in
2000, and growing every day
 Huge amount of data (e.g., WWW) dictates
efficiency, effectiveness and user-friendliness
 Thus, any IR system needs the capability of
large-scale data processing. The use of indexes
and various representations is required
HiLCoE 116
Cont…

 Unstructured data: difficult to capture
semantics in documents. Compare:
 “select * from Employee where Salary >
100,000”
 “retrieve all news items about corporate
takeover”
 Why is the second query more difficult to
answer? The following query is even more
difficult:
 “retrieve all news items about corporate
takeover involving an internet company”
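
The contrast can be made concrete with a small sketch. All data below is hypothetical, and the keyword matcher is a deliberately naive stand-in for a retrieval system: the structured condition has exact semantics, while keyword matching misses a news item that is about a takeover but never uses the word.

```python
# Hypothetical sample data for illustration only.
employees = [
    {"name": "Employee A", "salary": 120_000},
    {"name": "Employee B", "salary": 90_000},
]
news_items = [
    "Corporate takeover of a regional bank announced",
    "Software firm agrees to be acquired by a larger rival",
    "Weather forecast for the weekend",
]

# Structured query: the condition has exact, unambiguous semantics.
high_earners = [e["name"] for e in employees if e["salary"] > 100_000]

def keyword_match(text, keywords):
    """Naive matching: every keyword must appear as a word in the text."""
    words = text.lower().split()
    return all(k in words for k in keywords)

# Unstructured query: only the literal wording is matched, not the meaning.
hits = [n for n in news_items if keyword_match(n, ["corporate", "takeover"])]
print(high_earners, hits)
```

The second news item describes an acquisition and is arguably relevant, yet it is not retrieved, which is the semantic gap the slide points at.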
HiLCoE 117
Cont…
 Documents have unrestricted domains
 it is hard to predefine or pre-categorize
the subject domains of documents
 a particular subject is related to several major
topics including linguistics, psychology,
Cybernetics, Communications, Information
System design, Engineering & Technology,
Networking, Computer Science, Mathematics,
Economics, Management Science, education

HiLCoE 118
Cont…
 Diversified user base: expert to casual
users
 The users of information retrieval systems
include
 Research scientists (that seek articles related to
particular experiments)
 Engineers (who try to determine whether a patent
covering some new idea has previously been
obtained)
 Attorneys (who search for legal precedents)
 Buyers in general (who try to obtain new product
information)

HiLCoE 119
Cont…
 Information retrieval users
 Have a wide variety of different information needs
(Interest), Exhibit many different backgrounds
 May be led by many different reasons to use the retrieval
facilities
 As a result, they require a variety of services and end
products
 In other words, a system may be clumsy for an expert
user but difficult to use for a casual user
 a system may return information too general to be
useful for an expert in the subject but too narrow for a
general user

HiLCoE 120
Cont…

[Figure: the retrieval process. The user's information needs are
translated into queries; the queries are matched against stored
information (search/select); the query result is then evaluated:
does the information found match the user's information needs?]
HiLCoE 121
Cont…
 Distributed and interlinked (e.g., Hypertext and WWW)
 Where to start a search? In a centralized database, by contrast, you have only
one (or a few) database(s) to search.
 How are the pieces of information related?

 Efficiency vs. effectiveness.
 With a limited amount of resources, one can only improve
efficiency and effectiveness to a certain degree. Moreover,
improving efficiency often means degrading effectiveness,
and vice versa.

HiLCoE 122
Other Central Concepts in IR

What else do we need to start the actual work?

HiLCoE 123
Other Central Concepts in IR

 Documents
 Queries
 Collections
 Evaluations
 Relevance

HiLCoE 124
Documents
 Document Retrieval Model. Are IR
systems better called Document Retrieval
systems?
 Document: a long string of characters contained
in a single file
• What do we mean by a document?
− Full document?
− Document surrogates?
− Pages?
• A document is a representation of some aggregation
of information, treated as a unit
HiLCoE 125
Cont…

 Logical Units of Text
 Units of records (text & other components)
 Units that can be stored, retrieved, and
displayed as a unique entity
 Units of semantic entity
 units of text grouped together for a purpose
 Units of unformatted text
 Text as written by authors of documents.

HiLCoE 126
Cont…

 Logical unit of text
 articles, books,
 links, web pages
 Other components that come with the text
 figures, charts, graphics
 multimedia

HiLCoE 127
Cont…
 Document Representation
 Since
 Documents are full of text.
 Not every word of the text is meaningful for
searching/retrieval.
 Sometimes documents themselves do not even have
identifiable attributes such as author or title.
 Documents need to be processed and
represented in concise and identifiable
formats/structures.

HiLCoE 128
Cont…
 Documents should be represented to help
users identify and receive information
from the system.
 to identify subjects
 to provide summaries/abstracts
 to classify subject categories

HiLCoE 129
Collection

 A collection is some physical or logical
aggregation of documents
 A database
 A Library
 An index?
 Others?

HiLCoE 130
Queries
 A query is some expression of a user’s information
needs
 Can take many forms
 Natural language description of need
 list of words
 Formal query in a query language, e.g., a Boolean query (A and B
or C)
 Queries may not be accurate expressions of the
information need
 Differences between conversation with a person and formal
query expression
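
A Boolean query such as the one above can be evaluated with set operations over the documents that contain each term. This is a sketch with hypothetical document ids, and it assumes that "and" binds more tightly than "or", so the query is read as (A and B) or C; the slide does not state the precedence.

```python
# Hypothetical postings: term -> set of ids of documents containing it.
postings = {
    "A": {1, 2, 3},
    "B": {2, 3, 4},
    "C": {5},
}

def a_and_b_or_c(index):
    """Evaluate (A and B) or C, assuming 'and' binds tighter than 'or'."""
    return (index["A"] & index["B"]) | index["C"]

result = sorted(a_and_b_or_c(postings))
print(result)
```

Set intersection plays the role of "and" and set union the role of "or"; a different precedence would give a different answer, which is one way a formal query can fail to express the user's real need.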

HiLCoE 131
Cont…
 Given the user query, the information system has
to retrieve the documents which are related to that
query
 The potentially large size of the document
collection (e.g. the Web is composed of millions of
documents) implies that specialized indexing
techniques must be used if efficient retrieval is to
be achieved
 Thus to speed up the task of matching documents
to queries, proper indexing and searching
techniques are used
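
The idea of indexing to speed up matching can be sketched with a small inverted index. The three-document collection and the function names are hypothetical; the point is only that a query is answered by looking up terms in the index instead of scanning every document.

```python
from collections import defaultdict

# Hypothetical three-document collection.
documents = {
    1: "information retrieval systems",
    2: "database systems store structured data",
    3: "retrieval of stored information",
}

def build_inverted_index(docs):
    """Map each term to the set of ids of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term."""
    term_sets = [index.get(term, set()) for term in query.lower().split()]
    return sorted(set.intersection(*term_sets)) if term_sets else []

index = build_inverted_index(documents)
print(search(index, "information retrieval"))
```

The index is built once; each query then costs a few dictionary lookups and set intersections, regardless of how long the documents are.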

HiLCoE 132
Relevance

 The quality of the relationship that
exists between an information
need and an information item.

HiLCoE 133
Why is IR Important?

 Most information available is in textual form and has no
predefined format (e.g., emails and newsgroup articles).
 Integration of text retrieval capability in most relational
database systems. SQL already supports limited search
capability such as search based on pattern matching:
 select * from Employee where Name like '%Lee%'
 Increasing number of online documentation systems (no
more hardcopy!)
 Of course, the rapid growth of the World Wide Web
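
The SQL search on employee names mentioned above can be tried with Python's built-in sqlite3 module. The table contents are hypothetical, and SQLite is used only because it needs no setup; other database systems may differ in details such as case sensitivity of LIKE.

```python
import sqlite3

# In-memory database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Employee (Name TEXT, Salary INTEGER)")
conn.executemany(
    "INSERT INTO Employee VALUES (?, ?)",
    [("Bruce Lee", 120000), ("Ann Leeson", 95000), ("John Smith", 80000)],
)

# % matches any sequence of characters, so this finds names containing "Lee".
rows = conn.execute(
    "SELECT Name FROM Employee WHERE Name LIKE '%Lee%' ORDER BY Name"
).fetchall()
names = [row[0] for row in rows]
conn.close()
print(names)
```

The pattern matches substrings only; it has no notion of word meaning or ranking, which is why this counts as a limited search capability.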

HiLCoE 134
Reference Materials from the web

 Information Retrieval. An online book by
C. J. van Rijsbergen
http://www.dcs.gla.ac.uk/keith/preface.html
 Modern Information Retrieval by
Ricardo Baeza-Yates and Berthier Ribeiro-Neto
http://www.sims.berkeley.edu/~hearst/irbook/

HiLCoE 135
Summary

 What were our objectives?
 Definition
 IRS vs. other systems
 Process of Retrieval
 Challenges
 Concepts

HiLCoE 136
Reflection

 Write a half-page essay on your
understanding of the course Information
Retrieval.
 It may include, but is not limited to
 Definition
 Scope of the subject
 Objective
 Etc…

HiLCoE 137
