IR Chapter 1 & 2

WOLKITE UNIVERSITY
COLLEGE OF COMPUTING AND INFORMATICS
DEPARTMENT OF INFORMATION SYSTEM
Course : Introduction to Information Storage & Retrieval
INSY 2063: IR
BSc(IS) Third Year, First Semester, 2018

ISAYAS W.
INFORMATION SYSTEM
Information Retrieval
Chapter 1:
Information Storage and Retrieval
Questions
• Find “BRUTUS AND CAESAR AND NOT CALPURNIA” in the big book of
Shakespeare.
• I want to get some idea about the concepts of information
retrieval.
Example
Introduction
• The practice of archiving written information (an archive is a
depository containing historical records and documents) can be
traced back to around 3000 BC, when the Sumerians designated
special areas to store clay tablets with cuneiform inscriptions
(Amit Singhal, 2001).
• The need to store and retrieve written information became
increasingly important over the centuries, especially with inventions
like paper and the printing press.
• After computers were invented, people realized that they could be
used for storing and mechanically retrieving large amounts of
information.
Cont……
• Approaching the end of the twentieth century, societies all
over the world are changing.
• In countries of many different kinds, information now plays
an increasingly important part in economic, social, cultural
and political life.
• This phenomenon is taking place regardless of a country’s
size, state of development or political philosophy.
Cont……..
• Changes that are happening in Singapore, with a population
of 2.5 million, are similar to those taking place in Japan, with
its population of 125 million.
• Developing countries like Thailand are striving to build
information-intensive (concentrated) social and economic
systems just as hard as countries like the United Kingdom or
France.
The storage of information from the earliest times to now
• Clay tablets
• Paper and other soft materials
• Computers
• Cloud
What is Information, storage and retrieval?
• Information:
– Data that has been processed so that it has meaning in itself;
the meaning is often useful, but it does not have to be.
– Information is a critical business resource and, like any other
critical resource, must be properly managed.
– Provides answers to who, what, where and when questions.

Storage
– The action or method of storing something.
– The place where data is held in an electromagnetic or optical
form for access by a computer processor.
Retrieval
– The process of getting something back from somewhere easily.
– The action of obtaining or consulting material stored in a
computer system.
Example: find “BRUTUS AND CAESAR AND NOT CALPURNIA” in the big
book of Shakespeare.
Information Storage
• Computers can store different types of information in
different ways, depending on what the information is, how
much storage it requires and how quickly it needs to be
accessed.
• Information storage is the part of the accounting system
that keeps data accessible to the information processors
(CPU).
• An accounting system is the system used to manage the
income, expenses and other financial activities of a
business.
Cont..
• After the input devices enter data into an accounting system,
the information processors take the raw data and convert it
into a usable form.
• This information is then stored, often in the form of a
database, on the information storage component of the
accounting system.
Example
….cont
instead
IR and IR systems
What is IR (Information Retrieval)?
Information Retrieval(IR)
• The term Information Retrieval was first coined by Calvin
Mooers (1950)
• Definition: an important sub-discipline of Information Science
that is concerned with developing theories and methods of access
to information
– Focus is on helping users find information that matches their
information need (User Centered View)
• It is also a branch of applied Computer Science that focuses on the
representation, storage, organization of, and access to information
items (System Centered View).
…cont
• A good formal definition of information retrieval is given in
Baeza-Yates and Ribeiro-Neto (1999, p. 1):
“Information retrieval deals with representation, storage,
organization of, and access to information items. The
organization and access of information items should provide
the user with easy access to the information in which he is
interested”
• IR is about finding relevant information in large collections of
data
….cont
• Conceptually, IR is used to cover all related problems in
finding needed information
• Historically, information retrieval is about document retrieval,
emphasizing the document as the basic unit
– Until recently, in the above sense, IR was considered a narrow area of
interest for librarians and information experts
– Today, IR includes modelling, document classification, user
interfaces and visualization, multimedia retrieval, digital libraries,
filtering, natural languages, etc.
• Technically, information retrieval refers to (text) string
manipulation, indexing, matching, querying, etc.
The Task of Information Retrieval
• A large repository of documents is stored on a computer = corpus
• There is a topic about which we want to get some
information = information need
• Some of those documents may contain the information that
satisfies our need = relevance
• How do we retrieve those documents?
• We communicate our information need to the computer by
expressing it in the form of a query
How to prepare a Query
• How the query is expressed will depend on whether the data is
structured or unstructured
• Structured data: information in tables; it has a clear,
overt (obvious) semantic structure and is organized into relations
• Example:
Employee   Manager   Salary
A          B         80000
Example
• Structured data allows for expressive queries like:
Give me the social security numbers of all the employees
who have stayed with the company for more than five years
Id   Name    Manager   Years stayed
1    Abebe   Beka      2
2    Bona    Dedefo    2
3    Chala   Boru      6
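A minimal sketch of how such an expressive query could be run over the example table. The rows are copied from the slide; the field names and the helper `ids_with_more_than` are hypothetical and chosen only for illustration. Since the table has an Id column rather than social security numbers, the sketch returns ids.

```python
# Rows copied from the example table; the field names are illustrative assumptions.
employees = [
    {"id": 1, "name": "Abebe", "manager": "Beka", "years_stayed": 2},
    {"id": 2, "name": "Bona", "manager": "Dedefo", "years_stayed": 2},
    {"id": 3, "name": "Chala", "manager": "Boru", "years_stayed": 6},
]


def ids_with_more_than(rows, years):
    """Return the ids of employees who stayed longer than `years` years."""
    return [row["id"] for row in rows if row["years_stayed"] > years]


print(ids_with_more_than(employees, 5))  # prints [3]
```

Because every row has the same well-defined fields, the condition "more than five years" can be tested exactly; each row either satisfies it or does not.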
Cont……….
• Unstructured data: does not have a clear, overt semantic
structure (e.g. free text on a web page, video, audio)
• Allows less expressive queries of the form:
Give me all documents that have the keywords
‘These Romans are crazy’
Structured data: Database System
Unstructured data: Information Retrieval
Generally;
• Information Retrieval (IR) is finding material (usually
documents) of an unstructured nature (usually text) that
satisfies an information need from within large collections
(usually stored on computers). Information retrieval
technology has been central to the success of the Web.
• Question 1: What is the difference between structured
and unstructured data? What about semi-structured data?
Query
• (computing) A set of instructions passed to a database to
retrieve particular data (dictionary definition)
• Queries are formal statements of information needs that are put
to an IR system by the user to search for a document.
• The user’s query is matched to the documents stored in a
database through the documents’ index.
Query
• When formulating a query, the user can employ search
facilities such as search limits (by date of publication,
language, publication type, and so on) and Boolean operators
(AND/OR/NEAR/NOT) to make the query more specific
(i.e. refine or relax the query).
• The user can also often control the output, for
example the number of retrieved documents to display and the
highlighting of search terms.
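A minimal sketch of a Boolean query such as BRUTUS AND CAESAR AND NOT CALPURNIA, answered through an index. The play titles are real, but the one-line "contents" are made-up toy data, and the helpers `build_index` and `boolean_and_not` are hypothetical names, not part of any IR library.

```python
# Toy collection: the term lists are invented for illustration only.
docs = {
    "Julius Caesar": "brutus caesar calpurnia",
    "Antony and Cleopatra": "brutus caesar cleopatra",
    "Hamlet": "brutus caesar hamlet",
    "Macbeth": "macbeth caesar",
}


def build_index(collection):
    """Map each term to the set of documents that contain it."""
    index = {}
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index.setdefault(term, set()).add(doc_id)
    return index


def boolean_and_not(index, required, excluded):
    """Documents containing all `required` terms and none of the `excluded` terms."""
    result = None
    for term in required:
        postings = index.get(term, set())
        result = postings if result is None else result & postings
    result = set() if result is None else set(result)  # copy before modifying
    for term in excluded:
        result -= index.get(term, set())
    return sorted(result)


index = build_index(docs)
print(boolean_and_not(index, ["brutus", "caesar"], ["calpurnia"]))
# prints ['Antony and Cleopatra', 'Hamlet']
```

AND is implemented as set intersection and NOT as set difference; OR would be a set union. NEAR needs word positions, which this simple index does not store.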
Goal of IR
• The general goal of IR is to:
– Help users find useful information based on their information
needs (with minimum effort), despite the increasing
complexity of information and the changing needs of users
– Provide immediate random access to the data
Remark
Retrieval systems such as Google are developed with this aim

What IR assumes?
• Information is stored (or available)
• A user has an information need
• An automated system exists from which information can
be retrieved
• The system works!!

Cont…
Challenges in IR
• Representation of information items and information needs
(first problem)
– Document representation is one area of IR
– Query representation is another area of IR
• Matching (second problem)
– How to match the need vs. information items
• Modification of representation as a result of judgment (query
expansion or reformulation)
Question
Data Information Retrieval Systems
• Are systems which are built to retrieve documents highly
likely to be relevant to the user
• Are systems built to reduce the user’s workload in searching
through the store of documents to find relevant ones
• Are systems that give information about the presence or
absence of documents in accordance with the query
– Automated abstracts or summaries of documents were developed to
further simplify access to search results
• Are computer-based systems (we are talking about
automation)
….cont
• Are systems that attempt to find relevant documents to
respond to the user’s request
• Are systems that are interposed (placed) between a
potential user of information and the information collection
itself.
– For a given information problem, the purpose of the
system is to capture wanted items and to filter out
unwanted items
…….cont
Programmable IR Tools
 Apache Lucene
 Apache Solr
 Lemur
 Terrier
 Rapid Miner
Generally;
• An IR system is a set of rules and procedures, as operated by humans and/or
machines, for doing some or all of the following operations
– Indexing (or constructing representations of documents)
– Search formulation (or constructing representations of information
needs)
– Searching (or matching representations of documents against
representations of needs)
– Feedback (or repeating any or all of the above processes with
modifications introduced in response to an assessment of the results of
some process)
Information retrieval system
• Consists of:
1. A set of information items (documents)
• Objects that have the information we need
2. A set of requests (information needs)
3. Some mechanism for determining which information items meet
the requirements of the request (matching functions)
Information Retrieval Systems
Examples of IR systems
• Typical examples of IR systems are search engines that
can be found on the web or in a library
– They concentrate on finding documents, performing
full text retrieval
– After a user types in several keywords, the system
returns the documents that are most interesting
according to the system
What an IR system should do?
 Store/archive information
 Provide access to that information
 Answer queries with relevant information
 Understand the user’s queries
 Understand the user’s need
 Act as an assistant
Major functions of an IR system
– Analyze the contents of information items
– Represent the contents of the analyzed sources in a way suitable
for matching with users’ queries
– Analyze users’ information needs and represent them in a form
that will be suitable for matching with the database
– Match the search statement with the stored database
– Retrieve or generate information that is relevant, in a ranking
which reflects relevance
– Make necessary adjustments in the system based on feedback
from users
Types of IR Systems/applications
• IR can be structured for ease of discussion as:
– Text IR
• Discusses the classic problem of searching a collection of documents
for useful information
• Focus is on documents that are predominantly text (rather than
pictures)
• These are called textual images and are amenable (agreeable) to
automatic extraction of keywords
– Multimedia IR
• Discusses how to index document images and other binary data by
extracting features from their content and how to search them
efficiently
– Human computer interaction (HCI) for IR
• Discusses current trends in IR towards improved user interface and
better data visualization tools
– Application of IR
• Covers modern applications of IR (such as the Web, bibliographic
systems, and digital libraries)
Components of an IR system
• An IR system comprises the following major subsystems
– Document selection subsystem
• Documents are there in the database. How are we going to select those
documents that are relevant (matched with user requests)?
– Vocabulary subsystem
• In indexing we need to use a controlled vocabulary, i.e., a list of selected
subject terms to represent a document. The indexing is updated based on
this vocabulary
– Text Operations subsystem
• Tokenization, stopword removal, stemming
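A minimal sketch of the three text operations named above. The stoplist is a tiny illustrative sample, and `naive_stem` is a deliberately crude suffix stripper written for this example; it is not the Porter stemmer or any other standard algorithm.

```python
import re

# Tiny illustrative stoplist (an assumption; real stoplists are much longer).
STOPWORDS = {"the", "a", "an", "of", "and", "is", "are", "in", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def remove_stopwords(tokens):
    """Drop tokens that appear in the stoplist."""
    return [t for t in tokens if t not in STOPWORDS]


def naive_stem(token):
    """Strip a few common English suffixes, keeping at least 3 characters."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def text_operations(text):
    """Tokenize, remove stopwords, then stem what is left."""
    return [naive_stem(t) for t in remove_stopwords(tokenize(text))]


print(text_operations("The indexing of documents is searching"))
# prints ['index', 'document', 'search']
```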

Data versus Information Retrieval
• 1. Information items
– DBMS
• highly structured data (are of known nature), often
homogeneous records, often semantically unambiguous (well
defined semantics)
– IR systems
• Unstructured or unformatted data (as opposed to a relational
database). When you go to a specific document, it is not
structured as in a DB
• Free text
– Text data: papers, technical reports, news articles (completely
untagged or plain text)
– Web pages: HTML and XML files (semi-structured)
• Non-textual data: images, graphics, etc.
• Heterogeneous, semantically ambiguous (semantics is
frequently loose; we want an approximate match)
……cont
• 2. Answers
– DBMS:
• Records, tuples, no ranking
• Well-defined results
• Perfect precision and recall, each item is relevant
– IR systems
• Documents, ranked list of documents. The issue of ranking is
very important (page through the top k documents)
• Imperfect precision and recall, each item has a specific
relevance
….cont
• 3. Matching
– DBMS:
• Analogous to DB querying: Which records contain a set of
keywords?
• Exact match; we talk of items that match exactly; every record
either matches or fails to match a query; no notion of relevance
• A single erroneous (containing an error) object implies failure!
– IR systems
• Information about a subject or topic
• Partial or best match; We talk of possibly relevant items not
exact matched items
• Notion of relevance is most important- needs a model
• Small errors are tolerated (and in fact inevitable)
• Interpret contents of information items
• Generate a ranking which reflects relevance
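A minimal sketch contrasting the two kinds of matching. The scoring rule in `best_match` (share of query terms found in the document) is an assumption chosen for simplicity, not a standard retrieval model; the sample records and documents are invented.

```python
def exact_match(records, field, value):
    """Data retrieval: a record either matches the condition or it does not."""
    return [r for r in records if r.get(field) == value]


def best_match(docs, query):
    """IR: score each document by the share of query terms it contains, then rank."""
    q_terms = set(query.lower().split())
    scored = []
    for doc_id, text in docs.items():
        overlap = len(q_terms & set(text.lower().split()))
        if overlap:  # keep partially matching documents too
            scored.append((overlap / len(q_terms), doc_id))
    # Highest score first; ties broken by document id.
    return [doc_id for score, doc_id in sorted(scored, key=lambda p: (-p[0], p[1]))]


records = [{"name": "Chala", "years": 6}, {"name": "Bona", "years": 2}]
docs = {
    "d1": "information retrieval systems",
    "d2": "database systems",
    "d3": "cooking recipes",
}

print(exact_match(records, "name", "Chala"))  # one record, no ranking
print(best_match(docs, "information retrieval systems"))  # prints ['d1', 'd2']
```

Here d2 is returned even though it contains only one of the three query terms; it is simply ranked below d1. An exact-match system would have rejected it.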
….cont
• 4. Items wanted
– DBMS
• Matching
– IR systems
• Relevant
• 5. Model
– DBMS:
• Deterministic (answer can be predetermined)
– IR systems:
• Probabilistic, not deterministic; answer is not
predetermined
…cont
• 6. Querying
– DBMS:
• (DB query) assumes that the data is in standardized
format
– IR system
• Query assumes that we work on plain, unformatted data
• 7. Query language
– DBMS
• Artificial language
– IR system
• Natural language
….cont
• 8. Query specification
– DBMS
• Complete (requires precise retrieval criteria)
• A single erroneous object implies failure
– IR system
• Incomplete
• Small errors are tolerated

…cont
• DB grew out of files and traditional business systems
• IR grew out of library science and the need to
categorize/group/access books/articles
• Information retrieval is much more difficult than data
retrieval
• Both support queries over large data sets, using indexing
• Relationship: the systems complement each other

Summary of Comparison (data retrieval Vs information retrieval)
Discussion questions
1. What are IR and IR systems?
2. What is a search engine?
3. List and explain the components of the IR block diagram.
4. Write the difference between information retrieval (IR) and data
retrieval (DR).
5. Apache Solr is open source software. What is open source
software?
IR and the Retrieval Process
• The purpose of an information retrieval strategy is to retrieve all
the relevant documents whilst (while) at the same time
retrieving as few non-relevant ones as possible
• The process involves a certain amount of feedback
and is best illustrated using the diagram in the next slide
• It can be seen or interpreted in terms of component sub-processes
whose study yields many of the topics that will be
covered in the course
Retrieval Process
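[Figure: A simple and generic software architecture to describe the
retrieval process. The user need enters through the User Interface and
passes through Text Operations, which produce a logical view. Query
Operations turn it into a query, Searching runs the query against the
Index (inverted file) to obtain retrieved docs, and Ranking returns
ranked docs to the user, who may give user feedback. The DB Manager
Module and Indexing build the Index from the Text Database. The diagram
is divided into an indexing part and a searching part.]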
….cont
• There are three main ingredients to the IR process
– Texts or documents
– Queries
– The process of evaluation
• For texts, the main problem is to obtain a representation of
the text in a form which is amenable to automatic indexing
• This is achieved (i.e., the representation) by creating an
abbreviated form of the text, known as a text surrogate
• A typical surrogate would consist of a set of index terms or
keywords or descriptors
Document surrogates
Example
….cont
For queries
• The query arises as a result of an information
need on the part of the user
• The query is then a representation of the information need and
must be expressed in a language understood by the system
• Due to the inherent difficulty of accurately representing the
information need, the query in an IR system is always regarded
as approximate and imperfect
…..cont
For the evaluation

• The evaluation process involves a comparison of the texts
actually retrieved with those the user expected to retrieve
• This often leads to some modification, typically of the query,
though possibly of the information need or even of the
surrogates
• The extent to which modification is required is closely linked
with the process of measuring the effectiveness of the retrieval
operation (recall and precision)
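A minimal sketch of the two effectiveness measures mentioned above, using their usual set-based definitions: precision is the fraction of retrieved documents that are relevant, and recall is the fraction of relevant documents that were retrieved. The function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, treating both inputs as sets."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # relevant documents that were retrieved
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 3 documents are actually relevant, 2 are in both sets.
p, r = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
print(p)            # prints 0.5
print(round(r, 3))  # prints 0.667
```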
….cont
• It is necessary to define the text database before any of the
retrieval processes are initiated
• This is usually done by the manager of the database and
includes specifying the following
– The documents to be used
– The operations to be performed on the text
– The text model to be used (the text structure and what
elements can be retrieved)
• The text operations transform the original documents and the
information needs and generate a logical view of them
….cont.
• Once the logical view of the documents is defined, the
database module builds an index of the text
– An index is a critical data structure
– It allows fast searching over large volumes of data
• Different index structures might be used, but the most popular
one is the inverted file (more on this later), as indicated in the
slide
• Once the document database is indexed, the retrieval process
can be initiated
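A minimal sketch of an inverted file: for each term, a list of the documents it occurs in, here together with how often it occurs. The three sample documents and the function name `build_inverted_file` are assumptions made for illustration; real systems also apply text operations first and store the structure on disk.

```python
from collections import defaultdict


def build_inverted_file(docs):
    """Map each term to a sorted list of (doc_id, term_frequency) postings."""
    counts = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            counts[term][doc_id] = counts[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in counts.items()}


docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
}

inverted = build_inverted_file(docs)
print(inverted["home"])  # prints [(1, 1), (2, 1), (3, 1)]
print(inverted["in"])    # prints [(2, 1), (3, 2)]
```

Searching for a term is then a single dictionary lookup instead of a scan over every document, which is why the index allows fast searching over large volumes of data.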
……..cont.
• The user first specifies a user need, which is then parsed and
transformed by the same text operations applied to the text
• Then the query operations might be applied before the actual
query, which provides a system representation of the user
need, is generated
• Matching: the query is then processed to obtain the retrieved
documents
• Before the retrieved documents are sent to the user, they
are ranked according to the likelihood of
relevance
……..cont.
• The user then examines the set of ranked documents in the
search for useful information
• Two choices for the user
– Reformulate query, run on entire collection
– Reformulate query, run on result set
• At this point, the user might pinpoint a subset of the documents seen
as definitely of interest and initiate a user feedback cycle
• In such a cycle, the system uses the documents selected by the
user to change the query formulation
• Hopefully, this modified query is a better representation of the
real user need
Basic Structure of an IR System
Components of an IR System
Why
1. Regulatory compliance
• A well-organized information storage and retrieval system
that follows compliance (agreement) regulations and tax
record-keeping guidelines significantly
increases a business owner’s confidence that the business is fully
complying.
2. Efficiency and Productivity
• A good information storage and retrieval system, including an
effective indexing system, not only decreases the chances that
information will be misfiled but also speeds up the storing and
retrieval of information.
• The resulting time saving increases office efficiency
and productivity while decreasing stress and anxiety.
3. Improving working environment
• It can be disheartening to anyone walking through an office area
to see vital business documents and other information stacked
on top of file cabinets or in boxes next to office workstations.
• Not only does this create a stressful and poor working
environment, but if customers see this, it can cause them to
form a negative perception of the business.
• Contrast this with an office area in which file cabinets, passages
and workstations are clear and neatly organized to see how
important it is for even a small business to have a well-organized
information storage and retrieval system.
The Standard Retrieval Interaction Model
Question
Information Retrieval
Chapter 2:
Automatic Term Selection and
Term Weighting
Definition of Term Selection
• The act or fact of carefully choosing some term as being the
best or the most suitable
• The process of choosing the most important terms from the
given documents for the purpose of indexing and text
operations
Why term selection?
• Some words are not good for representing documents
• Using all words has a computational cost and increases search
time and storage requirements
• Using the set of all words in a collection to index
documents generates too much noise for the retrieval
task
Objective or aim of term selection
• Represent textual documents by a set of keywords called index
terms or simply terms
• Increase efficiency by extracting from the resulting document a
selected set of terms to be used for indexing the document
• If full text representation is adopted, then all words are used for
indexing (this is not very efficient, as it incurs an overhead in time
and space)
Index term
• Is also called a keyword
• Is a word (a single word) or phrase (multiword) in a document
whose semantics gives an indication of the document’s theme
(main idea)
– A term that captures the subject or topic of a document.
– Helps in remembering the document’s main theme

Index Terms
• Assumption
– The index terms selected are assumed to reflect the content of

the text (are descriptions of content)
• Index terms can be extracted from the title, abstract and text of
the document
Indexing
• Is a critical process
– The user’s ability to find documents on a particular subject is
limited by the indexing process used to create index terms for
the subject
• Indexing is the act of classifying and providing an index
in order to make items easier to retrieve
Example: (in a book or set of books) an alphabetical list of names, subjects,
etc., with references to the pages on which they are mentioned.
Indexing
Indexing
• Some definitions
– Is the art of organizing information
– Is an association of descriptors (keywords, concepts) to
documents in view of future retrieval
– Is a process of constructing document surrogates by
assigning identifiers to text items
– Is the process of analyzing the information content in the
language of the indexing system
Document surrogates
Example
Indexing
• Purpose/objective
– To give access points to a collection that are expected to be
most useful to the users of information
– To allow easy identification of documents (e.g., find
documents by topic)
– To relate documents to each other
– To allow prediction of document relevance to a particular
information need
Indexing
• Indexing may also assign weights to terms
– Non-weighted indexing
– Weighted indexing
Indexing
• Non-weighted indexing
– No attempt to determine the value of the different terms
assigned to a document
– Not possible to distinguish between major topics and casual
references
– All retrieved documents are equal in value
– Typical of commercial systems through the 1980s

Indexing
• Weighted indexing
– An attempt is made to place a value on each term of the
description of the document
– This value is related to the frequency of occurrence of the
term in the document (higher is better), but also to the
number of documents in the collection that use this term (lower is
better)
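A minimal sketch of one common way to combine the two factors just described: weight = tf * log(N / df), where tf is the term's frequency in the document, df is the number of documents containing the term and N is the collection size. The slides do not give a formula, so this particular tf-idf variant (natural logarithm, no normalization) is an assumption, and the sample documents are invented.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} using weight = tf * log(N / df)."""
    n_docs = len(docs)
    term_counts = {d: Counter(text.lower().split()) for d, text in docs.items()}
    df = Counter()  # number of documents each term appears in
    for counts in term_counts.values():
        df.update(counts.keys())
    return {
        d: {t: tf * math.log(n_docs / df[t]) for t, tf in counts.items()}
        for d, counts in term_counts.items()
    }


docs = {
    "d1": "retrieval retrieval index",
    "d2": "index storage",
    "d3": "index query",
}

weights = tf_idf(docs)
print(weights["d1"]["index"])                # prints 0.0 (term is in every document)
print(round(weights["d1"]["retrieval"], 3))  # prints 2.197
```

A term that occurs in every document gets weight 0 and so cannot distinguish documents, while a term that is frequent in one document and rare in the collection gets the highest weight.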
Indexing exhaustivity (completeness)
• Should we index only the most important concepts, or also more
minor concepts?
Indexing specificity
• Should we use general index terms or more specific terms?
• Should we use the term “computer” or “personal computer”?

Indexing
• Ways to do indexing
– Manual
– Automatic (focus of the course)

Manual Indexing
• Indexers decide which keywords to assign to documents based
on a controlled vocabulary
– Human indexers assign index terms to documents
• The indexers try to summarize the contents or aboutness of the
whole document in a few keywords
• That is, indexers analyze and represent the content of a document
through keywords
• Is based on the intellectual judgment and semantic interpretation
(of concepts, themes) of the indexers
Manual Indexing
• Indexers’ prior knowledge of the following is important to come
up with good keywords or index terms
– Terms that will be used by the user
– Indexing vocabulary
– Collection characteristics
Advantages of Manual Indexing
• Ability to perform abstraction (conclude what the subject is) and
determine additional related terms
• Ability to judge the value of concepts (because it is done by
a human being)
Disadvantages of Manual Indexing
• Slow and expensive (significant cost)
– Professional indexers are very expensive
• Is based on intellectual judgment and semantic interpretation
(concepts, themes)
– High probability of inconsistency among
indexers (maintaining consistency is difficult)
• Labor intensive
• In automatic indexing, all these problems will somehow be solved

Automatic Indexing
• Is the assignment of content identifiers with the help of modern
computing technology
– A computer system is used to record the descriptors generated
by the human
• The system extracts “typical”/“significant” terms
• The human may contribute by setting the parameters or
thresholds, or by choosing components or algorithms
Why automatic indexing?
• Reasons for the necessity of automatic indexing
– Information overload
• An enormous amount of information is being generated from
day-to-day activities
– Explosion of machine-readable text
• Massive amounts of information are available in electronic format and on
the Internet.
– Cost effectiveness
• Human indexing is expensive and labor intensive.
Current Procedures for Automatic Indexing
• Generating document representatives through automatic indexing
involves
– Lexical analysis: the process of converting an input stream of
characters into a stream of words or tokens
– Use of a stoplist
– Use of conflation procedures (stemming, optional)
– Selection of index terms
– Weighting the resulting terms (optional)

Procedures for Building an Index Automatically
[Figure: Documents → tokenizing (text is broken into words) → noise
reduction with a stoplist (non-stoplist words) → feature normalization
by stemming* (stemmed words) → term weighting* (terms with weights) →
index database. *Indicates an optional operation.]
Procedures for Building an Index Automatically
• Thus, automatic indexing consists of two processes
– Assigning terms or concepts capable of representing document
content
– Assigning a weight or value to each term reflecting its
presumed importance for the purpose of content identification
• Important words are assigned higher weights
• Less important words are assigned lower weights

Advantages of Automatic Indexing
• Reduced processing time (Fast)
• Reduced cost (inexpensive)
• Easy to maintain
• Improved consistency (reliability)
– Little or no inconsistency
– Algorithms select index terms much more consistently than
humans.
• Better retrieval (achieved)
Disadvantages of Automatic Indexing
• Mechanical execution of algorithms, with no intelligent interpretation (of
aboutness / relevance)
Automatic Text Analysis
• Not all words in a text are good index terms
• Some are good, some are bad and some are indifferent
• How do we know whether a term is good, bad or indifferent for
indexing?
• Luhn’s idea will give us an answer to this question

• It was Luhn (1957) who first suggested that certain words could
be automatically extracted from texts to represent their content
– He was one of the earliest researchers in IR
• He discovered that the distribution patterns of words could give
significant information about the property of being content
bearing
• Much of text analysis has been built on the original idea of Luhn
• Luhn’s proposal
– “The frequency of word occurrences in an article furnishes a useful
measure of word significance…”
– However, a high frequency term will be acceptable for indexing
purposes only if its occurrence frequency is not equally high in
all documents of the collection
– Still today, the search engines that operate on the Internet index
documents based on this principle
• Luhn’s observation
– He noted that high frequency words tend to be common,
non-content-bearing words
– He also recognized that one or two occurrences of a word in
a relatively long text could not be taken as significant in defining
the subject matter
• He came up with a model for selecting terms based on their
frequency of occurrence
• Luhn’s model
– Words which occur very infrequently in a collection are of
little importance for indexing since they are unlikely to be
specified in queries
• Such rare terms are likely to be specific to the documents
and they may not occur in users’ queries
– Words which occur very frequently in a collection are of
little importance for indexing since they do not
discriminate sufficiently between documents
Cont..
• Such frequent terms are unlikely to help discriminate
a document from the others, so they are not important for
indexing
– The most important words for indexing are those which
occur with intermediate frequencies
• Thus, according to Luhn, medium frequency terms are
better candidates for indexing
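A minimal sketch of selecting medium frequency terms with a lower and an upper cut-off. Luhn's model does not prescribe the cut-off values, so the thresholds 2 and 4, the toy token list and the function name `luhn_select` are all assumptions made for illustration.

```python
from collections import Counter


def luhn_select(tokens, low_cutoff, high_cutoff):
    """Keep terms whose frequency lies between the two cut-offs (inclusive)."""
    freq = Counter(tokens)
    return sorted(t for t, f in freq.items() if low_cutoff <= f <= high_cutoff)


# Toy collection: "the" occurs 6 times, "index" 3, "query" 2, "zipf" once.
tokens = ("the " * 6 + "index " * 3 + "query " * 2 + "zipf").split()

print(luhn_select(tokens, 2, 4))  # prints ['index', 'query']
```

The very frequent word "the" and the very rare word "zipf" are both dropped; only the terms in the middle of the frequency range remain as index term candidates.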
• Let f be the frequency of occurrence of various word types in
a given portion of text
• Let r be their rank order, the order of their frequency of
occurrence
• Then a plot relating f and r yields a curve similar to the
hyperbolic curve shown to the right
• The curve, in fact, demonstrates Zipf’s law
• Therefore, Luhn suggested using the words in the middle of the
frequency range
• These findings are the basis of a number of classical weighting
schemes
Problems with Luhn’s Selection Mechanism
• Finding a way for elimination of high and low frequency words
– Certain arbitrariness is involved in determining the cut-offs
– That is, there is no means which gives their values
– They have to be determined by trial and error
• The risk of loss of retrieval performance
– The removal of high frequency words may reduce recall
– The removal of low frequency words may bring losses in
precision
Zipf’s Law in IR
• The law states that there is an inverse relation between the
frequency of a word (f) and its rank (r): the highest frequency term
has rank 1, the second highest frequency term has rank 2, etc.
• If the terms in a collection are ranked (r) by their frequency (f),
they roughly fit the relation r * f = C, which is known as
Zipf’s law; equivalently, f = C * (1/r)
– In other words, the law states that the product of the
frequency of use of words and their rank order is
approximately constant
rank * frequency ≈ constant
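A minimal sketch for checking the relation rank * frequency ≈ constant on a list of tokens. The toy frequencies (12, 6, 4, 3) are chosen so that the product is exactly 12 at every rank; real text only fits the law roughly. The function name is hypothetical.

```python
from collections import Counter


def rank_frequency_table(tokens):
    """Return (rank, term, frequency, rank * frequency) rows, most frequent first."""
    freq = Counter(tokens)
    ordered = sorted(freq.items(), key=lambda p: (-p[1], p[0]))
    return [(rank, term, f, rank * f) for rank, (term, f) in enumerate(ordered, start=1)]


# Idealized toy data: frequencies 12, 6, 4, 3 for ranks 1 to 4.
tokens = ["a"] * 12 + ["b"] * 6 + ["c"] * 4 + ["d"] * 3

for row in rank_frequency_table(tokens):
    print(row)
# prints (1, 'a', 12, 12), (2, 'b', 6, 12), (3, 'c', 4, 12), (4, 'd', 3, 12)
```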

Question
