
WOLKITE UNIVERSITY

COLLEGE OF COMPUTING AND INFORMATICS

DEPARTMENT OF INFORMATION SYSTEM

Course : Introduction to Information Storage & Retrieval

INSY 2063: IR

BSc(IS) Third Year, First Semester, 2018


ISAYAS W.
INFORMATION SYSTEM
Information Retrieval

Chapter 1:
Information Storage and Retrieval
Questions

• Find “BRUTUS AND CAESAR AND NOT CALPURNIA” in the big book of
Shakespeare.
• I want to get some idea about the concepts of information
retrieval.
Example
Introduction
• The practice of archiving written information (an archive is a
depository containing historical records and documents) can be
traced back to around 3000 BC, when the Sumerians designated
special areas to store clay tablets with cuneiform inscriptions
(Amit Singhal, 2001).

• The need to store and retrieve written information became
increasingly important over the centuries, especially with
inventions like paper and the printing press.

• After computers were invented, people realized that they could be
used for storing and mechanically retrieving large amounts of
information.
• Approaching the end of the twentieth century, societies all
over the world are changing.

• In countries of many different kinds, information now plays
an increasingly important part in economic, social, cultural
and political life.

• This phenomenon is taking place regardless of a country’s
size, state of development or political philosophy.
• Changes that are happening in Singapore, with a population
of 2.5 million, are similar to those taking place in Japan, with
its population of 125 million.

• Developing countries like Thailand are striving to build
information-intensive (concentrated) social and economic
systems just as hard as countries like the United Kingdom or
France.
The storage of information from the earliest times to now

• Clay tablets
• Paper and other soft materials
• Computers
• Cloud
What is Information, Storage and Retrieval?

• Information:
– Data that has been processed so that it carries meaning; the
meaning is often useful, but does not have to be.
– Information is a critical business resource and, like any other
critical resource, must be properly managed.
– Provides answers to who, what, where and when questions.

• Storage:
– The action or method of storing something.
– The place where data is held in electromagnetic or optical
form for access by a computer processor.

• Retrieval:
– The process of getting something back from somewhere easily.
– The action of obtaining or consulting material stored in a
computer system.

Example: find “BRUTUS AND CAESAR AND NOT CALPURNIA” in the big
book of Shakespeare.
Information Storage
• Computers can store different types of information in
different ways, depending on what the information is, how
much storage it requires and how quickly it needs to be
accessed.

• Information storage is the part of the accounting system
that keeps data accessible to the information processors
(CPU).

• An accounting system is the system used to manage the
income, expenses, and other financial activities of a
business.

• After the input devices enter data into an accounting system,
the information processors take the raw data and convert it
into a usable form.

• This information is then stored, often in the form of a
database, on the information storage component of the
accounting system.
Example
IR and IR systems
What is IR (Information Retrieval)?

Information Retrieval (IR)
• The term Information Retrieval was first coined by Calvin
Mooers (1950).

• Definition: an important sub-discipline of Information Science
that is concerned with developing theories and methods of access
to information.

– The focus is on helping users find information that matches
their information need (user-centered view).

• It is also a branch of applied Computer Science that focuses on
the representation, storage, organization of, and access to
information items (system-centered view).
• A good formal definition of information retrieval is given in
Baeza-Yates and Ribeiro-Neto (1999, p. 1):

“Information retrieval deals with representation, storage,
organization of, and access to information items. The
organization and access of information items should provide
the user with easy access to the information in which he is
interested.”

• IR is about finding relevant information in a large collection
of data.
• Conceptually, IR is used to cover all related problems in
finding needed information.

• Historically, information retrieval is about document retrieval,
emphasizing the document as the basic unit.
– Until recently, in the above sense, IR was considered a narrow
area of interest for librarians and information experts.
– Today, IR includes modelling, document classification, user
interfaces and visualization, multimedia retrieval, digital
libraries, filtering, natural languages, etc.

• Technically, information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.
The Task of Information Retrieval
• A large repository of documents is stored on a computer = corpus

• There is a topic about which we desire to get some
information = information need

• Some of those documents may contain the information that
satisfies our need = relevance

• How do we retrieve those documents?

• We communicate our information need to the computer by
expressing it in the form of a query
How to prepare a Query
• How the query is expressed will depend on whether the data is
structured or unstructured.

• Structured data: information in tables (relations); it is
organized and has a clear, overt (obvious) semantic structure.

• Example:

Employee   Manager   Salary
A          B         80000
Example
• Structured data allows for expressive queries like:

Give me the social security numbers of all the employees
who have stayed with the company for more than five years.

Id   Name    Manager   Years stayed
1    Abebe   Beka      2
2    Bona    Dedefo    2
3    Chala   Boru      6
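The structured query above can be sketched in Python as a simple filter over rows. This is only an illustration of why structured data supports precise queries, not how a DBMS actually executes them. The row dictionaries, the field names and the function `employees_with_tenure_over` are assumptions made for this example, and the Id column stands in for the social security number.

```python
# Hypothetical rows mirroring the table above; the field names are assumptions.
employees = [
    {"id": 1, "name": "Abebe", "manager": "Beka", "years_stayed": 2},
    {"id": 2, "name": "Bona", "manager": "Dedefo", "years_stayed": 2},
    {"id": 3, "name": "Chala", "manager": "Boru", "years_stayed": 6},
]


def employees_with_tenure_over(rows, years):
    """Return the rows whose 'years_stayed' value exceeds the threshold."""
    return [row for row in rows if row["years_stayed"] > years]


long_serving = employees_with_tenure_over(employees, 5)
print([row["name"] for row in long_serving])  # ['Chala']
```

Every row either satisfies the condition or does not, which is the exact-match behaviour typical of data retrieval.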
• Unstructured data: does not have a clear, overt semantic
structure (e.g. free text on a web page, video, audio).

• Allows less expressive queries of the form:

– Give me all documents that have the keywords
‘These Romans are crazy’.

Structured data → Database system
Unstructured data → Information retrieval
Generally:

• Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers). Information retrieval
technology has been central to the success of the Web.

• Question 1: What is the difference between structured
and unstructured data? What about semi-structured data?
Query

• (computing) A set of instructions passed to a database to
retrieve particular data (dictionary definition).

• Queries are formal statements of information needs that are put
to an IR system by the user to search for documents.

• The user’s query is matched to the documents stored in a
database through the documents’ index.
Query

• When formulating a query, the user can employ search
facilities such as search limits (by date of publication,
language, publication type, and so on) and Boolean operators
(AND/OR/NEAR/NOT) to make the query more specific or more
general (i.e. to refine or relax the query).

• The user can also often control the output in terms of, for
example, the number of retrieved documents to display and the
highlighting of search terms.
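As a rough sketch of how Boolean operators combine, the query BRUTUS AND CAESAR AND NOT CALPURNIA can be evaluated with set operations. The three-document collection and the helper `build_postings` are made up for illustration; a real IR system would use a proper index rather than plain Python sets. NEAR is left out because it needs word positions.

```python
# Toy collection; the document names and texts are made up for illustration.
docs = {
    "Julius Caesar": "brutus caesar calpurnia",
    "Hamlet": "brutus caesar",
    "Macbeth": "caesar",
}


def build_postings(collection):
    """Map each term to the set of document names that contain it."""
    postings = {}
    for name, text in collection.items():
        for term in text.split():
            postings.setdefault(term, set()).add(name)
    return postings


postings = build_postings(docs)
all_docs = set(docs)

# AND is set intersection; NOT is the complement with respect to all documents.
result = postings["brutus"] & postings["caesar"] & (all_docs - postings["calpurnia"])
print(sorted(result))  # ['Hamlet']

# OR is set union, which relaxes the query instead of refining it.
relaxed = postings["brutus"] | postings["calpurnia"]
print(sorted(relaxed))  # ['Hamlet', 'Julius Caesar']
```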
Goal of IR

• The general goal of IR is to:

– Help users find useful information based on their information
needs (with minimum effort), despite the increasing
complexity of information and the changing needs of users

– Provide immediate random access to the data

Remark

Retrieval systems such as Google are developed with this aim.


What IR assumes?

• Information is stored (or available)

• A user has an information need

• An automated system exists from which information can
be retrieved

• The system works!!
Challenges in IR

• Representation of information items and information needs
(first problem)
– Document representation is one area of IR
– Query representation is another area of IR

• Matching (second problem)
– How to match information needs against information items

• Modification of the representation as a result of judgment (query
expansion or reformulation)
Question
Information Retrieval Systems
• Are systems which are built to retrieve documents that are highly
likely to be relevant to the user

• Are systems built to reduce the user’s workload in searching
through the store of documents to find the relevant ones

• Are systems that give information about the presence or
absence of documents in accordance with the query
– Automated abstracts or summaries of documents were developed to
further simplify access to search results

• Are computer-based systems (we are talking about
automation)

• Are systems that attempt to find relevant documents to
respond to a user’s request

• Are systems that are interposed (placed) between a
potential user of information and the information collection
itself
– For a given information problem, the purpose of the
system is to capture wanted items and to filter out
unwanted items
Programmable IR Tools
• Apache Lucene
• Apache Solr
• Lemur
• Terrier
• RapidMiner
Generally:
• An IR system is a set of rules and procedures, as operated by
humans and/or machines, for doing some or all of the following
operations:

– Indexing (or constructing representations of documents)

– Search formulation (or constructing representations of
information needs)

– Searching (or matching representations of documents against
representations of needs)

– Feedback (or repeating any or all of the above processes with
modifications introduced in response to an assessment of the
results of some process)
Information retrieval system

• Consists of:

1. A set of information items (documents)
• Objects that have the information we need

2. A set of requests (information needs)

3. Some mechanism for determining which information items meet
the requirements of a request (matching functions)
Information Retrieval Systems
Examples of IR systems

• Typical examples of IR systems are search engines that
can be found on the web or in a library
– They concentrate on finding documents, performing
full-text retrieval
– After a user types in several keywords, the system
returns the documents that it judges most relevant
What should an IR system do?
• Store/archive information
• Provide access to that information
• Answer queries with relevant information
• Understand the user’s queries
• Understand the user’s need
• Act as an assistant
Major functions of an IR system
– Analyze the contents of information items

– Represent the contents of the analyzed sources in a way suitable
for matching with users’ queries

– Analyze users’ information needs and represent them in a form
that will be suitable for matching with the database

– Match the search statement with the stored database

– Retrieve or generate relevant information, in a ranking
which reflects relevance

– Make necessary adjustments in the system based on feedback
from users
Types of IR Systems/applications
• IR can be structured for ease of discussion as:
– Text IR
• Discusses the classic problem of searching a collection of documents
for useful information
• Focus is on documents that are predominantly text (rather than
pictures)
• These are called textual images and are amenable (agreeable) to
automatic extraction of keywords
– Multimedia IR
• Discusses how to index document images and other binary data by
extracting features from their content and how to search them
efficiently
– Human-computer interaction (HCI) for IR
• Discusses current trends in IR towards improved user interfaces and
better data visualization tools
– Applications of IR
• Covers modern applications of IR (such as the Web, bibliographic
systems, and digital libraries)
Components of an IR system
• An IR system comprises the following major subsystems:

– Document selection subsystem
• Documents are there in the database. How are we going to select those
documents that are relevant (matched with user requests)?

– Vocabulary subsystem
• In indexing we need to use a controlled vocabulary, i.e., a list of
selected subject terms to represent a document. The indexing is
updated based on this vocabulary.

– Text operations subsystem
• Tokenization, stopword removal, stemming
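The three text operations named above can be sketched as a small pipeline. This is a simplified illustration under stated assumptions: the stopword list and the suffix list are tiny made-up samples, and `crude_stem` is a stand-in for a real stemming algorithm such as Porter's, not an implementation of it.

```python
import re

# A tiny stopword list and suffix list, both assumptions for illustration.
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(token):
    """Strip one common suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [crude_stem(t) for t in remove_stopwords(tokenize(text))]


print(text_operations("The users are searching the indexed documents"))
# ['user', 'search', 'index', 'document']
```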


Data versus Information Retrieval
• 1. Information items
– DBMS
• Highly structured data (of known nature), often
homogeneous records, often semantically unambiguous
(well-defined semantics)
– IR systems
• Unstructured or unformatted data (as opposed to a relational
database). A specific document is not structured the way
records in a DB are
• Free text
– Text data: papers, technical reports, news articles (completely
untagged or plain text)
– Web pages: HTML and XML files (semi-structured)
• Non-textual data: images, graphics, etc.
• Heterogeneous, semantically ambiguous (semantics is
frequently loose; we want an approximate match)
• 2. Answers

– DBMS:
• Records, tuples; no ranking
• Well-defined results
• Perfect precision and recall; each item is relevant

– IR systems:
• Documents, ranked list of documents. The issue of ranking is
very important (users page through the top k documents)
• Imperfect precision and recall; each item has a specific
degree of relevance
• 3. Matching
– DBMS:
• Analogous to DB querying: which records contain a set of
keywords?
• Exact match: we talk of items that match exactly; every record
either matches or fails to match a query; no notion of relevance
• A single erroneous (containing an error) object implies failure!
– IR systems:
• Information about a subject or topic
• Partial or best match: we talk of possibly relevant items, not
exactly matched items
• The notion of relevance is most important; it needs a model
• Small errors are tolerated (and in fact inevitable)
• Interpret the contents of information items
• Generate a ranking which reflects relevance
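The contrast between exact match and best match can be sketched as follows. The scoring used for `best_match` (counting shared query terms) is a deliberately simple assumption chosen for illustration, not a model from the slides. The document ids and term sets are made up.

```python
def exact_match(query_terms, docs):
    """Data-retrieval style: keep only documents containing every query term."""
    return [name for name, terms in docs.items() if query_terms <= terms]


def best_match(query_terms, docs):
    """IR style: rank documents by how many query terms they share."""
    scored = [(len(query_terms & terms), name) for name, terms in docs.items()]
    # Drop documents with no overlap, then sort by score (high first) and name.
    ranked = sorted((s for s in scored if s[0] > 0), key=lambda s: (-s[0], s[1]))
    return [name for _, name in ranked]


docs = {
    "d1": {"information", "retrieval", "system"},
    "d2": {"information", "storage"},
    "d3": {"database", "system"},
}
query = {"information", "retrieval"}

print(exact_match(query, docs))  # ['d1']
print(best_match(query, docs))   # ['d1', 'd2']
```

With exact matching, d2 is simply rejected. With best matching it is returned as a lower-ranked, possibly relevant item.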
• 4. Items wanted
– DBMS:
• Matching
– IR systems:
• Relevant
• 5. Model
– DBMS:
• Deterministic (the answer can be predetermined)
– IR systems:
• Probabilistic, not deterministic; the answer is not
predetermined
• 6. Querying
– DBMS:
• A DB query assumes that the data is in a standardized
format
– IR systems:
• A query assumes that we work on plain, unformatted data
• 7. Query language
– DBMS:
• Artificial language
– IR systems:
• Natural language
• 8. Query specification
– DBMS:
• Complete (requires precise retrieval criteria)
• A single erroneous object implies failure
– IR systems:
• Incomplete
• Small errors are tolerated


• DB grew out of files and traditional business systems

• IR grew out of library science and the need to
categorize/group/access books/articles

• Information retrieval is much more difficult than data
retrieval

• Both support queries over large data sets, using indexing

• Relationship: the systems complement each other


Summary of Comparison (data retrieval Vs information retrieval)
Discussion questions
1. What are IR and IR systems?
2. What is a search engine?
3. List and explain the components of the IR block diagram.
4. Write the difference between information retrieval (IR) and data
retrieval (DR).
5. Apache Solr is open source software. What is open source
software?
IR and the Retrieval Process
• The purpose of an information retrieval strategy is to retrieve all
the relevant documents whilst at the same time
retrieving as few non-relevant ones as possible

• The process involves a certain amount of feedback
and is best illustrated using the diagram in the next slide

• It can be seen or interpreted in terms of component sub-processes
whose study yields many of the topics that will be
covered in the course
Retrieval Process
[Figure: A simple and generic software architecture to describe the
retrieval process. Components shown: User Interface, Text Operations,
Query Operations, Indexing, DB Manager Module, Searching, Ranking,
Index (inverted file) and Text Database. Labels on the connections:
user need, logical view, query, user feedback, retrieved docs and
ranked docs. The diagram is divided into an indexing part and a
searching part.]
• There are three main ingredients to the IR process
– Texts or documents
– Queries
– The process of evaluation
• For texts, the main problem is to obtain a representation of
the text in a form which is amenable to automatic indexing
• This is achieved (i.e., the representation) by creating an
abbreviated form of the text, known as a text surrogate
• A typical surrogate would consist of a set of index terms or
keywords or descriptors
Document surrogates
Example
For queries
• The query arises as a result of an information
need on the part of the user

• The query is then a representation of the information need and
must be expressed in a language understood by the system

• Due to the inherent difficulty of accurately representing the
information need, the query in an IR system is always regarded
as approximate and imperfect
For the evaluation

• The evaluation process involves a comparison of the texts
actually retrieved with those the user expected to retrieve

• This often leads to some modification, typically of the query,
though possibly of the information need or even of the
surrogates

• The extent to which modification is required is closely linked
with the process of measuring the effectiveness of the retrieval
operation (recall and precision)
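Recall and precision can be computed directly from the set of retrieved documents and the set of relevant documents. The sketch below uses the standard definitions (precision = relevant retrieved / retrieved, recall = relevant retrieved / relevant). The function name and the document ids are assumptions for illustration.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query.

    precision = |retrieved and relevant| / |retrieved|
    recall    = |retrieved and relevant| / |relevant|
    """
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# The system returned four documents; three documents are actually relevant.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(f"precision={p:.2f} recall={r:.2f}")  # precision=0.50 recall=0.67
```

Two of the four retrieved documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).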
• It is necessary to define the text database before any of the
retrieval processes are initiated

• This is usually done by the manager of the database and
includes specifying the following:
– The documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure and what
elements can be retrieved)

• The text operations transform the original documents and the
information needs and generate a logical view of them
• Once the logical view of the documents is defined, the
database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data

• Different index structures might be used, but the most popular
one is the inverted file (more on this later), as indicated in the
slide

• Once the document database is indexed, the retrieval process
can be initiated
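An inverted file maps each term to the documents in which it occurs. The sketch below is a minimal in-memory version that also records how often each term occurs per document. The function name, the whitespace tokenization and the toy documents are assumptions for illustration; real systems add text operations and compact on-disk structures.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to {doc_id: number of occurrences in that document}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return dict(index)


# Toy documents, made up for illustration.
docs = {
    1: "information retrieval",
    2: "information storage",
    3: "storage and retrieval of information storage",
}
index = build_inverted_file(docs)

print(index["storage"])      # {2: 1, 3: 2}
print(index["information"])  # {1: 1, 2: 1, 3: 1}
```

Searching for a term becomes a dictionary lookup instead of a scan over every document, which is why the index allows fast searching over large volumes of data.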
• The user first specifies a user need, which is then parsed and
transformed by the same text operations applied to the text

• Then query operations might be applied before the actual
query, which provides a system representation of the user
need, is generated

• Matching: the query is then processed to obtain the retrieved
documents

• Before the retrieved documents are sent to the user, they
are ranked according to their likelihood of
relevance
• The user then examines the set of ranked documents in the
search for useful information

• Two choices for the user:
– Reformulate the query, run it on the entire collection
– Reformulate the query, run it on the result set

• At this point, the user might pinpoint a subset of the documents
seen as definitely of interest and initiate a user feedback cycle

• In such a cycle, the system uses the documents selected by the
user to change the query formulation

• Hopefully, this modified query is a better representation of the
real user need
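One simple way to picture the feedback cycle is query expansion: terms that occur often in the documents the user selected are added to the query. This is a simplified sketch under stated assumptions, not the method the slides prescribe. The function `expand_query`, the `extra_terms` parameter and the example texts are all made up for illustration.

```python
from collections import Counter


def expand_query(query_terms, selected_docs, extra_terms=2):
    """Add the most frequent new terms from user-selected documents to the query."""
    counts = Counter()
    for text in selected_docs:
        counts.update(t for t in text.lower().split() if t not in query_terms)
    # Sort by descending count, then alphabetically, for a deterministic result.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return list(query_terms) + [term for term, _ in ranked[:extra_terms]]


query = ["jaguar"]
selected = ["jaguar car engine", "jaguar car dealer"]  # documents marked as of interest
print(expand_query(query, selected))  # ['jaguar', 'car', 'dealer']
```

The expanded query now says more about which sense of the original term the user meant.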
Basic Structure of an IR System
Components of an IR System
Why
1. Regulatory compliance
• A well-organized information storage and retrieval system
that follows compliance (agreement) regulations and tax
record-keeping guidelines significantly increases a business
owner’s confidence that the business is fully complying.
2. Efficiency and Productivity

• A good information storage and retrieval system, including an
effective indexing system, not only decreases the chances that
information will be misfiled but also speeds up the storing and
retrieval of information.

• The resulting time saving increases office efficiency
and productivity while decreasing stress and anxiety.
3. Improving working environment
• It can be disheartening to anyone walking through an office area
to see vital business documents and other information stacked
on top of file cabinets or in boxes next to office workstations.

• Not only does this create a stressful and poor working
environment, but if customers see it, it can cause them to
form a negative perception of the business.

• Contrast this with an office area in which file cabinets, passages
and workstations are clear and neatly organized to see how
important it is for even a small business to have a well-organized
information storage and retrieval system.
The Standard Retrieval Interaction Model
Question
Information Retrieval

Chapter 2:
Automatic Term Selection and
Term Weighting
Definition of Term Selection

• The act or fact of carefully choosing some term as being the
best or the most suitable

• The process of choosing the most important terms from the
given documents for the purpose of indexing and text
operations
Why term selection?

• Some words are not good for representing documents

• Using all words has a computational cost: it increases searching
time and storage requirements

• Using the set of all words in a collection to index
documents generates too much noise for the retrieval
task
Objective or aim of term selection
• Represent textual documents by a set of keywords called index
terms or simply terms

• Increase efficiency by extracting from the resulting document a
selected set of terms to be used for indexing the document

• If full-text representation is adopted, then all words are used for
indexing (this is not very efficient, as it incurs an overhead in
time and space)
Index term
• Is also called a keyword

• Is a word (a single word) or phrase (multiword) in a document
whose semantics gives an indication of the document’s theme
(main idea)

– A term that captures the subject or topic of a document

– Helps in remembering the document’s main theme


Index Terms

• Assumption
– The index terms selected are assumed to reflect the content of
the text (are descriptions of content)

• Index terms can be extracted from the title, abstract and text of
the document
Indexing

• Is a critical process
– Users’ ability to find documents on a particular subject is
limited by the indexing process used to create index terms for
the subject

• Indexing is the act of classifying and providing an index
in order to make items easier to retrieve
– Example: (in a book or set of books) an alphabetical list of names,
subjects, etc., with references to the pages on which they are
mentioned
Indexing
• Some definitions
– Is the art of organizing information
– Is an association of descriptors (keywords, concepts) to
documents in view of future retrieval
– Is a process of constructing document surrogates by
assigning identifiers to text items
– Is the process of analyzing the information content in the
language of the indexing system
Document surrogates
Example
Indexing
• Purpose/objective
– To give access points to a collection that are expected to be
most useful to the users of information
– To allow easy identification of documents (e.g., find
documents by topic)
– To relate documents to each other
– To allow prediction of document relevance to a particular
information need
Indexing
• Indexing may also assign weights to terms

– Non-weighted indexing

– Weighted indexing
Indexing
• Non-weighted indexing
– No attempt to determine the value of the different terms
assigned to a document
– Not possible to distinguish between major topics and casual
references
– All retrieved documents are equal in value
– Typical of commercial systems through the 1980s


Indexing
• Weighted indexing
– An attempt is made to place a value on each term of the
description of the document
– This value is related to the frequency of occurrence of the
term in the document (higher is better), but also to the
number of documents in the collection that use this term (lower
is better)
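One common way to combine these two factors is a tf-idf style weight: the term's frequency in the document multiplied by the logarithm of (number of documents / number of documents containing the term). The slide does not name a formula, so the exact weighting below is an assumption chosen for illustration, as are the function name and the toy documents.

```python
import math


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    tokenized = {doc_id: text.lower().split() for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term occur?
    df = {}
    for tokens in tokenized.values():
        for term in set(tokens):
            df[term] = df.get(term, 0) + 1

    weights = {}
    for doc_id, tokens in tokenized.items():
        weights[doc_id] = {
            term: tokens.count(term) * math.log(n_docs / df[term])
            for term in set(tokens)
        }
    return weights


# Toy collection, made up for illustration.
docs = {
    "d1": "retrieval retrieval system",
    "d2": "storage system",
    "d3": "database system",
}
w = tf_idf(docs)
print(w["d1"]["system"])     # 0.0, because "system" occurs in every document
print(w["d1"]["retrieval"])  # 2 * log(3), frequent in d1 and rare elsewhere
```

A term used by every document gets weight zero, while a term that is frequent in one document and rare in the collection gets a high weight.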
Indexing exhaustivity (completeness)

• Should we index only the most important concepts, or also more
minor concepts?

Indexing specificity
• Should we use general index terms or more specific terms?

• Should we use the term “computer” or “personal computer”?


Indexing

• Ways to do indexing

– Manual

– Automatic (focus of the course)


Manual Indexing

• Indexers decide which keywords to assign to documents based
on a controlled vocabulary
– Human indexers assign index terms to documents

• The indexers try to summarize the contents or aboutness of the
whole document in a few keywords

• That is, indexers analyze and represent the content of a document
through keywords

• Is based on the indexers’ intellectual judgment and semantic
interpretation (of concepts, themes)
Manual Indexing
• Indexers’ prior knowledge of the following is important to come
up with good keywords or index terms:

– Terms that will be used by the user

– Indexing vocabulary

– Collection characteristics
Advantages of Manual Indexing

• Ability to perform abstraction (conclude what the subject is) and
determine additional related terms

• Ability to judge the value of concepts (because it is done by
human beings)
Disadvantages of Manual Indexing
• Slow and expensive (significant cost)
– Professional indexers are very expensive

• Is based on intellectual judgment and semantic interpretation
(of concepts, themes)
– High probability of inconsistency among
indexers (maintaining consistency is difficult)

• Labor intensive

• Automatic indexing addresses most of these problems


Automatic Indexing
• Is the assignment of content identifiers with the help of modern
computing technology
– A computer system is used to record the descriptors generated
by the human

• The system extracts “typical”/“significant” terms

• The human may contribute by setting the parameters or
thresholds, or by choosing components or algorithms
Why automatic indexing?
• Reasons for the necessity of automatic indexing

– Information overload
• An enormous amount of information is being generated from
day-to-day activities

– Explosion of machine-readable text
• Massive amounts of information are available in electronic
format and on the Internet

– Cost effectiveness
• Human indexing is expensive and labor intensive
Current Procedures for Automatic Indexing
• Generating document representatives through automatic indexing
involves:

– Lexical analysis: the process of converting an input stream of
characters into a stream of words or tokens

– Use of a stoplist

– Use of conflation procedures (stemming, optional)

– Selection of index terms

– Weighting the resulting terms (optional)


Procedures for Building an Index Automatically

[Figure: Documents → tokenizing (break text into words) → noise
reduction with a stoplist (non-stoplist words) → feature
normalization by stemming* (stemmed words) → term weighting*
(terms with weights) → index database. * indicates an optional
operation.]
Procedures for Building an Index Automatically

• Thus, automatic indexing consists of two processes:
– Assigning terms or concepts capable of representing document
content
– Assigning a weight or value to each term reflecting its
presumed importance for the purpose of content identification
• Important words are assigned higher weights
• Less important words are assigned lower weights


Advantages of Automatic Indexing
• Reduced processing time (fast)
• Reduced cost (inexpensive)
• Easy to maintain
• Improved consistency (reliability)
– Algorithms select index terms much more consistently than
humans
• Better retrieval can be achieved

Disadvantages of Automatic Indexing

• Mechanical execution of algorithms, with no intelligent
interpretation (of aboutness/relevance)
Automatic Text Analysis
• Not all words in a text are good index terms

• Some are good, some are bad and some are indifferent

• How do we know whether a term is good, bad or indifferent for
indexing?

• Luhn’s idea gives us an answer to this question


Automatic Text Analysis

• It was Luhn (1957) who first suggested that certain words could
be automatically extracted from texts to represent their content
– He was one of the earliest researchers in IR

• He discovered that the distribution patterns of words could give
significant information about the property of being content
bearing

• Much of text analysis has been built on the original idea of Luhn
Automatic Text Analysis
• Luhn’s proposal
“The frequency of word occurrences in an article furnishes a useful
measure of word significance…”

• However, a high-frequency term will be acceptable for indexing
purposes only if its occurrence frequency is not equally high in
all documents of the collection

• Still today, the search engines that operate on the Internet index
documents based on this principle
Automatic Text Analysis
• Luhn’s observation
– He noted that high-frequency words tend to be common,
non-content-bearing words
– He also recognized that one or two occurrences of a word in
a relatively long text could not be taken as significant in defining
the subject matter

• He came up with a model for selecting terms based on their
frequency of occurrence
Automatic Text Analysis
• Luhn’s model

– Words which occur very infrequently in a collection are of
little importance for indexing since they are unlikely to be
specified in queries
• Such rare terms are likely to be specific to the documents
and they may not occur in users’ queries

– Words which occur very frequently in a collection are of
little importance for indexing since they do not
discriminate sufficiently between documents
• Such terms are unlikely to help distinguish
one document from others, so they are not important for
indexing

– The most important words for indexing are those which
occur with intermediate frequencies

• Thus, according to Luhn, medium-frequency terms are
better candidates for indexing
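Luhn's selection idea can be sketched as a frequency filter with a lower and an upper cut-off. The cut-off values and the toy token list below are arbitrary assumptions for illustration; there is no formula for the cut-offs, so in practice they are tuned by trial and error. The function name `luhn_select` is also hypothetical.

```python
from collections import Counter


def luhn_select(tokens, lower_cutoff, upper_cutoff):
    """Keep terms whose frequency lies between the two cut-offs (inclusive)."""
    counts = Counter(tokens)
    return sorted(
        term for term, freq in counts.items()
        if lower_cutoff <= freq <= upper_cutoff
    )


# Toy token stream: "the" occurs 6 times, "index" 3, "query" 2, "zipf" once.
tokens = ("the " * 6 + "index " * 3 + "query " * 2 + "zipf").split()

print(luhn_select(tokens, lower_cutoff=2, upper_cutoff=4))  # ['index', 'query']
```

The very frequent word "the" and the very rare word "zipf" are both dropped, leaving the medium-frequency terms as index term candidates.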
Automatic Text Analysis
• Let f be the frequency of occurrence of various word types in
a given position of text
• Let r be their rank order, the order of their frequency of
occurrence
• Then a plot relating f and r yields a curve similar to the
hyperbolic curve shown to the right
• The curve, in fact, demonstrates Zipf’s law
Automatic Text Analysis

• Therefore, Luhn suggested using the words in the middle of the
frequency range

• These findings are the basis of a number of classical weighting
schemes
Problems with Luhn’s Selection Mechanism

• Finding a way to eliminate high- and low-frequency words
– A certain arbitrariness is involved in determining the cut-offs
– That is, there is no method which gives their values
– They have to be determined by trial and error

• The risk of loss of retrieval performance
– The removal of high-frequency words may reduce recall
– The removal of low-frequency words may bring losses in
precision
Zipf’s Law in IR

• The law states that there is an inverse relation between the
frequency of a word f and its rank r (the highest-frequency term
has rank 1, the second-highest-frequency term has rank 2, etc.)

• If the terms in a collection are ranked (r) by their frequency (f),
they roughly fit the relation r × f ≈ C, i.e. f ≈ C × 1/r, which is
known as Zipf’s law

– In other words, the law states that the product of the
frequency of use of words and their rank order is
approximately constant

rank × frequency ≈ constant
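The rank × frequency relation can be checked on any token list by sorting terms by frequency and multiplying each frequency by its rank. The token counts below are contrived so that the product comes out exactly constant; real text only fits the law roughly. The function name `rank_frequency_table` is an assumption for this sketch.

```python
from collections import Counter


def rank_frequency_table(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    counts = Counter(tokens)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (rank, term, freq, rank * freq)
        for rank, (term, freq) in enumerate(ordered, start=1)
    ]


# Contrived counts: 12, 6, 4 and 3 occurrences for ranks 1 to 4.
tokens = ("the " * 12 + "of " * 6 + "and " * 4 + "to " * 3).split()

for row in rank_frequency_table(tokens):
    print(row)
# (1, 'the', 12, 12)
# (2, 'of', 6, 12)
# (3, 'and', 4, 12)
# (4, 'to', 3, 12)
```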


Question
