
INFS 427: AUTOMATED INFORMATION RETRIEVAL

(1st Semester, 2021/2022)

INTRODUCTION TO THE INFORMATION RETRIEVAL PROCESS

Lecturer: Dr. Mrs. Florence O. Entsua-Mensah


◼ Department of Information Studies ◼ fentsua-mensah@ug.edu.gh

Dr Mrs Florence O. Entsua-Mensah January 24, 2023 1


Session Overview

• This lecture session sets the pace for studying automated
information retrieval systems.

• It introduces learners to the concept of information retrieval
and the systems that are used to automate the information
retrieval process.

Session Outline
The key topics to be covered in the session are:

• Topic One: Understanding Information Retrieval


• Topic Two: Information Retrieval System

Recommended Reading

Chowdhury, G. G. (2010). Introduction to modern information
retrieval. London: Facet Publishing. – Read Chapter One.

Understanding information retrieval
Topic One



Defining Information Retrieval (IR)
• IR is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an
information need from within large collections (usually
stored on computers).

• IR includes:
• Web search
• Searching your laptop
• Searching large corporate databases

Definition of IR

Information retrieval
➢ The technique & process of searching, recovering, &
interpreting information from large amounts of stored
data (MScience & Technology Dictionary).

➢ It relates to “the organization of, processing of, and access to
information of all forms and formats” (Chowdhury, 2010).

Basic assumptions of IR
• Collection: A set of documents
– Assume it is a static collection for the moment
• Goal: Retrieve documents with information that is relevant
to the user’s information need and helps the user complete
a task.
• Precision : Fraction of retrieved documents that are
relevant to the user’s information need.
• Recall : Fraction of relevant documents in a collection that
are retrieved.

(Manning, Raghavan, & Nayak, 2015)
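
The two measures above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the document identifiers and the relevance judgements are hypothetical, and `precision_recall` is a helper name introduced here, not part of any IR library.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were actually retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query: 4 documents retrieved, 3 of them relevant,
# out of 6 relevant documents in the whole collection.
retrieved = ["d1", "d2", "d3", "d4"]
relevant = ["d1", "d2", "d3", "d7", "d8", "d9"]
p, r = precision_recall(retrieved, relevant)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here precision is 3/4 and recall is 3/6: most of what was retrieved is relevant, but half of the relevant documents were missed.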

Information Retrieval System (IRS)
Topic Two (2)



What is an IRS?
• An IRS is a system that allows people to communicate with an
information system or an information service in order to
find information – text, graphic images, sound recordings,
or video – that meets their specific needs (Chowdhury, 2010).

• The main objective of an IRS is to enable users to find
relevant information from an organised collection of
documents.

The purpose of an IR system
• To organize documents/records in a way that facilitates
easy access or retrieval of relevant information by its
users.
• IR systems retrieve bibliographic items, or texts that exactly
match a query, from full-text databases or multimedia
collections.

Elements of an IRS



The IR Process
• Documents in an IRS are processed into an index
(Indexing process)
• A user formulates his problem/information need in
a query (Query formulation)
• The IR software compares/matches the query to
the index (Matching process)
• The user is presented with a set of retrieved
documents which he judges for relevance or
appropriateness in meeting his need (Feedback)
• The query can be modified if the retrieved
documents are irrelevant
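
The steps above can be sketched as a toy program. This is a minimal illustration, not how any particular IR package works: the three-document collection is hypothetical, and `build_index` and `match` are names introduced here for the indexing and matching steps.

```python
from collections import defaultdict

# Hypothetical three-document collection.
docs = {
    1: "library catalogues support information retrieval",
    2: "web search engines index web pages",
    3: "digital libraries store information in digital formats",
}


def build_index(collection):
    """Indexing: map each term to the set of documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in collection.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def match(index, query):
    """Matching: return the ids of documents that contain every query term."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


index = build_index(docs)
first_try = match(index, "information catalog")     # "catalog" is not an indexed term
second_try = match(index, "information retrieval")  # reformulated query
print(first_try, second_try)
```

The first query returns nothing because its wording differs from the indexed terms; after the user reformulates it, document 1 is retrieved. This mirrors the feedback step in the list above.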

Characteristics of an effective IR system
Must be equipped for:
• Prompt information dissemination
• Information filtering (exclude unwanted information)
• Active switching of information (Such as switching from web search to
email access)
• Receiving information in a desired format
• Browsing
• Getting information in an economical way
• Current literature
• Accessing other information systems
• Personalized help, and must be
• User friendly, i.e., must consider the convenience of the
user (Liston and Schoene, as cited by Chowdhury, 2010).

Types of IR systems
1. OPAC
➢ Searching library catalogues online
➢ Checking availability of library resources
2. Online database
➢ Provide access to peer reviewed scholarly information resources
➢ Are subscription or fee-based services
3. Digital libraries and web information service
➢ Information is stored in digital formats
➢ Often free and accessed via the web
4. Web search engines
➢ Free search tools for web information retrieval

Everyday uses of IR systems
• The search for information from library OPACs
• Accessing information from bibliographic or full text
databases e.g. Web of Science, LISA
• Access to e-books & e-journals (World public library at
http://worldlibrary.net/ , Emerald at www.emeraldinsight.com)
• Access information from email services and mobile phones
• Searching for information on company or institutional
intranets
• Access web information via URLs, search engines, and subject
gateways (provide links to more academic, reliable information).
• Access information from social networking sites.



Summary
• In this session, we have learnt what information retrieval
is generally concerned with.
• We have also discussed the nature and characteristics of
information retrieval systems.

Activity 1.1
• Discuss the core elements of an information retrieval
system.

References

Chowdhury, G. G. (2010). Introduction to modern information
retrieval. London: Facet Publishing.

INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2021/2022)

Lecture 2:
Historical Developments in AIR

Lecturer: Dr. Mrs. Florence O. Entsua-Mensah


◼ Department of Information Studies ◼ fentsua-mensah@ug.edu.gh
Session Overview

The aim of this session is to:


• Provide an understanding of how the information retrieval
field progressed/evolved to its current state.
• Each evolutionary milestone is illustrated by describing
the standards and protocols and by discussing the
global initiatives and the research that shaped it.



Session Outline

The key topics to be covered in the session are:

• Topic 1: Information Retrieval Standards & Protocols


• Topic 2: Global Digital Library
• Topic 3: Intelligent Information Retrieval
• Topic 4: Hypertext and Hypermedia Systems



Reading List
Rego, A., Garcia, L., Llopis, M., & Lloret, J. (2016). A New Z39.50
Protocol Client for Searching in Libraries and Research
Collaboration. Network Protocols and Algorithms, 8(3), 29.
https://doi.org/10.5296/npa.v8i3.10147

The Apache Software Foundation: The Free and Open Productivity
Suite. Retrieved from:
http://www.openoffice.org/bibliographic/srw.html. Accessed on
August 7, 2018.



Introduction
• The growth of Information and Communication
Technology (ICT) has refashioned information search
and retrieval.

• Several advancements have taken place in this area over time.



Information Retrieval Standards & Protocols
Topic One



What is a Standard?

• A standard is an agreed way of performing a task or carrying out
an activity so as to obtain a predictable result.
• There are various standards and protocols that are in existence for
IR systems.
• Some of the popular search and retrieval standards and protocols
include:
• Z39.50
• SRW
• SRU
• CQL



Z39.50
• Z39.50 is a communication protocol between a client and a
server.

• The increasing amount of information available at libraries,
and the need for a mechanism to search for information at
several libraries at the same time, prompted the creation of
the Z39.50 protocol.



Z39.50 (Cont’d.)
• Sessions inside one connection between both nodes are
known as Z39.50-association or Z-association (Rego et al.,
2016).
• These sessions are initiated by the client. Once a Z-
association is open, both server and client can start any
operation defined in the Z39.50 protocol. In the same way, a Z-
association can be closed by either client or server, or
implicitly terminated by loss of connection (Rego et al.,
2016).



Z39.50 (Cont’d.)

• The main goal of Z39.50 is to provide a standard way to search
for information in an external database, whatever its data
organization.

• This goal is achieved because the communication between the
client and the server is standardized and independent of the
database (Rego et al., 2016). Thus, Z39.50 is widely used in
some of the biggest libraries.



Z39.50 (Cont’d.)
• Z39.50 is used at both the national and international level
as a standard protocol that defines a computer-to-computer
information retrieval technique. It is non-proprietary and
vendor-independent.
• Z39.50 was originally approved by the National Information
Standards Organization (NISO) in 1988. In 1998, the
International Organization for Standardization (ISO)
adopted Z39.50 and issued ISO 23950 Information and
documentation - Information retrieval (Z39.50).



Z39.50 (Cont’d.)
• Using Z39.50, a user can search and retrieve information from
other Z39.50-compliant computer systems through his/her own system,
without prior knowledge of the search syntax used by those systems.
• The primary goal of Z39.50 is to reduce the complexity and
difficulties involved in searching and retrieving electronic
information.



SRW
• SRW stands for Search/Retrieve Web Service protocol
(The Apache Software Foundation, 2018). Its aim is to
minimize the cross-language problems.

• The goal is to allow access to several networked resources
and support interoperability among distributed databases,
using a common utilization framework (The Apache
Software Foundation, 2018).

• It was developed by a group of implementers who combined more
than 20 years of experience with the Z39.50 Information Retrieval
protocol with then-new developments in web technologies.



SRU

• SRU stands for Search/Retrieve via URL. It is a standard
XML-based protocol for search that uses CQL
(http://www.loc.gov/cql/), a standard syntax for query
representation (The Apache Software Foundation, 2018).

• The main difference between SRU and SRW is that SRU uses
HTTP as the transport mechanism, whereas SRW is based on the
SOAP protocol and uses XML streams for both the query and
the results.

• This means that in SRU the query is communicated as a URL and
the XML is received as if it were a web page.
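
To make the idea of "the query as a URL" concrete, here is a minimal sketch that builds an SRU searchRetrieve request with Python's standard library. The endpoint `http://example.org/sru` is a placeholder, not a real server, and the use of the `dc.title` index assumes the server supports the Dublin Core context set. The parameter names (`operation`, `version`, `query`, `maximumRecords`) are the ones defined by SRU 1.1.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Placeholder endpoint; substitute the base URL of a real SRU server.
BASE_URL = "http://example.org/sru"


def sru_search_url(cql_query, max_records=10, version="1.1"):
    """Build a searchRetrieve request URL that carries a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,
        "maximumRecords": max_records,
    }
    # urlencode percent-encodes the quotes, spaces and '=' inside the query.
    return f"{BASE_URL}?{urlencode(params)}"


url = sru_search_url('dc.title = "information retrieval"')
print(url)
```

Fetching such a URL from a real SRU server would return an XML document containing the matching records; that step is left out here because it needs a live endpoint.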



CQL

• CQL stands for Contextual Query Language (formerly
known as Common Query Language).
• It is designed for use with SRW, a search protocol that is the
successor to Z39.50 (as discussed in the previous section).
• CQL is an abstract and extensible query language for
maximum interoperability amongst the connected systems.
The goal is to reduce the difficulty to learn and use while
retaining the capability to allow complex searches.
• Primarily CQL is used in the bibliographic domain, however
it is not restricted to this context alone.
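
A CQL query is built from search clauses of the form index, relation, term, which can be combined with boolean operators such as `and`. The sketch below assembles such a query as a string. It is a simplified illustration: the helper names are introduced here, only the quoting of terms is handled, and the `dc.title` and `dc.creator` indexes assume a server that supports the Dublin Core context set.

```python
def cql_term(term):
    """Quote a term if it is empty or contains whitespace or reserved characters."""
    reserved = set(' \t"()=<>/')
    if term == "" or any(ch in reserved for ch in term):
        escaped = term.replace('"', '\\"')  # escape embedded double quotes
        return f'"{escaped}"'
    return term


def cql_clause(index, relation, term):
    """A single search clause: index relation term."""
    return f"{index} {relation} {cql_term(term)}"


def cql_and(*clauses):
    """Combine clauses with the boolean operator 'and'."""
    return " and ".join(f"({clause})" for clause in clauses)


query = cql_and(
    cql_clause("dc.title", "=", "information retrieval"),
    cql_clause("dc.creator", "=", "chowdhury"),
)
print(query)
```

The printed query reads almost like plain English, which reflects the goal stated above: easy to learn and use while still allowing complex searches.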



Global Digital Library
Topic Two



Global Digital Library
• This is more or less a virtual library that consolidates the
collections of individual libraries as one collection.
• The WWW and the internet laid the foundation for the virtual/digital
libraries.

• Global Digital Library (GDL) is a prototype which aims to
connect several national libraries and some major libraries,
museums, archives, and information organizations with
each other (Chen, 2001).



Challenges/Issues with Digital information sharing
• Several legal issues may arise related to intellectual property,
copyright, confidentiality and privacy, security, personal, business
equity, etc.;
• Difference in culture may influence the way of information
communication;
• The presence of generational gaps;
• The sheer complexity of information architecture both at the global
and national level;
• To have an effective and adequate inventory of available resources
comprising the knowledge of information;
• The ability to locate, identify and retrieve relevant and quality
information;
• The huge amount of information makes it more complex to deal
with "undesirable" or "indecent" information.



Intelligent Information Retrieval
Topic Three



Intelligent IR defined
• Intelligent IR is a computer system that can infer knowledge
from what it already knows in order to establish a link between
the requirement of its user and a set of candidate documents
(Jones et al., 2000).

• Such a system can perform intelligent retrieval. Researchers'
recognition that an information retrieval system needs to use
knowledge led them to consider artificial intelligence systems,
which serve a similar purpose; one class of such systems is the
expert system.



Expert System Defined
• An expert system is “a computer system which emulates the
decision-making ability of human experts” (Jackson, 1998).
• The expert systems are designed to solve complex problems by
reasoning over knowledge stored in a knowledge base.
• The knowledge in the knowledge base is primarily represented as
IF-THEN rules rather than conventional procedural code.
• The first expert systems were invented in the 1970s and then
proliferated in the 1980s.
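
The idea of reasoning over IF-THEN rules in a knowledge base can be sketched with a few lines of forward chaining. This is a toy illustration only: the cataloguing rules and fact names are hypothetical, and real expert systems use far richer rule languages and inference engines.

```python
# Hypothetical knowledge base: each rule is (IF conditions, THEN conclusion).
RULES = [
    ({"has_issn"}, "is_serial"),
    ({"is_serial", "peer_reviewed"}, "is_scholarly_journal"),
    ({"is_scholarly_journal"}, "index_in_journal_database"),
]


def infer(facts, rules):
    """Forward chaining: keep firing rules until no new fact can be derived."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for conditions, conclusion in rules:
            # A rule fires when all of its conditions are already known.
            if conditions <= known and conclusion not in known:
                known.add(conclusion)
                changed = True
    return known


derived = infer({"has_issn", "peer_reviewed"}, RULES)
print(sorted(derived))
```

Note that the rules are data, not procedural code: adding a new rule to `RULES` changes the system's behaviour without touching the `infer` function.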



Developments in Expert Systems
• As expert systems evolved, several new techniques were adopted
into various types of inference engines. Some of the most important
ones include:
• Truth Maintenance
• Hypothetical Reasoning
• Fuzzy Logic
• Ontology Classification



Expert Systems for LIS Profession
• AUTOCAT was produced in Germany. The system was designed to
generate bibliographic records of physical sciences periodicals
available in machine-readable form (Endres-Niggemeyer and Knorz,
1987).

• Qualcat (Quality Control in Cataloguing) was undertaken at the
University of Bradford. The goals of the project were to develop
expert systems to select the best records, to link the databases and
centralized authority control, to build a fully automated control
package for day-to-day running, and to investigate interface
problems for cataloguing (Ayres et al., 1994).



Expert Systems for LIS Profession

• OCLC developed an expert system called Cataloguer’s Assistant.
The system was tested at Carnegie-Mellon University to reclassify
the mathematics and computer science collection (De Silva, 1997).

• FRUMP (developed by DeJong) analyses articles from newspapers
using frame-based techniques. The articles were first scanned and
the data were then automatically fed into the different slots within
frames.

• SCISOR (developed by Rau, Jacobs and Zernik, 1989) is a system
that generates reports on corporate acquisitions and mergers.



Hypertext & Hypermedia Systems
Topic Four



Hypertext

• Hypertext refers to the use of hyperlinks (or simply “links”) to
present text and static graphics. Many websites are entirely or
largely hypertexts (Farkas, 2004).



Hypermedia
• Hypermedia refers to the presentation of video, animation, and
audio, which are often referred to as “dynamic” or “time based”
content or as “multimedia” (Farkas, 2004).
• Hypermedia, a logical extension of hypertext, is a non-linear
medium of information space which includes plain text, audio, video,
graphics and hyperlinks.



Hypertext and Hypermedia (Cont’d.)
• Forms of hypertext and hypermedia include CD-ROM and DVD
encyclopaedias (such as Microsoft's Encarta), eBooks, and the
online help systems we find in software products.
• It is common for people to use "hypertext" as a general term that
includes hypermedia. For example, when researchers talk about
“hypertext theory,” they refer to theoretical concepts that pertain
to both static and multimedia content (Farkas, 2004).



Summary

• In this session we have discussed some of the IR techniques and
technologies that evolved in the recent past.
• We have discussed some of the significant IR standards and
protocols.
• We have also reported the state-of-the-art research in IR field, for
instance, the initiative of global digital library, application of
intelligent systems like expert system in library cataloguing,
classification and abstracting, the application and issues of
intelligent hypertext and hypermedia systems.



Activity 2.1
• Discuss the role of protocols and standards in the
development of modern IR systems.



References - 1
Ayres, F. H., Cullen, J., Gierl, C., Huggill, J. A. W., Ridley, M. J., &
Torsun, I. S. (1994). QUALCAT: automation of quality control
in cataloguing. BLRD REPORTS, 6068.
Chen, C.-C. (2001). Global Digital Library Development in the New
Millennium: Fertile Ground for Distributed Cross-Disciplinary
Collaboration. Tsinghua University Press.
De Silva, S. M. (1997). A review of expert systems in library and
information science. Malaysian Journal of Library &
Information Science, 2(2), 57–92.
Endres-Niggemeyer, B., & Knorz, G. (1987). AUTOCAT: knowledge-based
descriptive cataloguing of articles published in scientific
journals. In Second International GI Congress 1987.
Knowledge Based Systems (pp. 20–21).



References - 2
Farkas, D. K. (2004). Hypertext and hypermedia. In Berkshire
Encyclopedia of Human-Computer Interaction (Vol. 16, pp. 332–
336). https://doi.org/10.1016/0360-1315(91)90062-V
Jackson, P. (1998). Introduction to expert systems. Addison-Wesley
Longman Publishing Co., Inc.
Jones, K. S., Walker, S., & Robertson, S. E. (2000). A probabilistic
model of information retrieval: development and comparative
experiments: Part 2. Information Processing & Management,
36(6), 809–840.
Rego, A., Garcia, L., Llopis, M., & Lloret, J. (2016). A New Z39.50
Protocol Client for Searching in Libraries and Research
Collaboration. Network Protocols and Algorithms, 8(3), 29.
https://doi.org/10.5296/npa.v8i3.10147
The Apache Software Foundation: The Free and Open Productivity
Suite. Retrieved from:
http://www.openoffice.org/bibliographic/srw.html. Accessed on
August 7, 2018.
INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2021/2022)

THE COLLECTION COMPONENT OF AN AIR

Lecturer: Dr. Mrs. Florence O. Entsua-Mensah


Department of Information Studies | fentsua-mensah@ug.edu.gh
Session Overview

• In this class, we will discuss the nature of the body of knowledge/
collection that exists for the information user to access.

• We will discuss the various types of data and how they are
organized to enhance the information retrieval process.

• We will also look at the automated systems for information
gathering, processing, and presentation.



Session Outline
The key topics to be covered in the session are:

• Topic One: The Concept of a Collection in an AIR


• Topic Two: Automated Information Gathering
• Topic Three: Automated Systems for Information Processing and
Presentation
• Topic Four: Database technology



Recommended Reading

Chowdhury, G. G. (2010). Introduction to modern information
retrieval (3rd ed.). New York: Neal-Schuman
Publishers, Inc.

Ferguson, S., Hebels, R., & Charles Sturt University.
(2003). Computers for librarians: An introduction to
the electronic library. Wagga Wagga: Centre for
Information Studies, Charles Sturt University.

Korfhage, R. R. (2006). Information Storage and Retrieval.
Wiley India Pvt. Limited.



The Concept of a Collection in an AIR
Topic One



What is a Collection?
• An organized pool of knowledge or information resources which a
user may access to satisfy an information need.

• One of the essential components of an information retrieval system
is its collection or the database.

• The collection is generally made up of documents of different kinds.



Documents
• In information retrieval, a document is defined as “a stored data or
record in any form” (Korfhage, 2006).

• A document refers to a piece of written, printed, or electronic matter
that provides information or evidence or that serves as an official
record (Stevenson & Waite, 2011).

• The underlying idea is that the document must be stored in a
retrievable form.



Examples of documents
• Books
• Letters
• Messages
• Parts of a book, such as an encyclopaedia dealing with
different topics, i.e.,
  • A chapter
  • A section
  • A paragraph
• Graphics
• Sound/voice recordings
• Images
• Computer programs
• Data files
• Email messages etc.



Document surrogates
• Document Surrogates are “limited representations of full
documents” (Korfhage, 2006).

• Types of Document Surrogates include:


• Document Identifier
• Bibliographic Data/Records
• Keyword
• Abstract
• Extract
• Review



Types of Document Surrogates

• Document Identifier – a number/code e.g. accession or a
classification number for the purpose of inventory control or
document location.



Types of Document Surrogates –Cont’d.
• Bibliographic data/record – all the data elements used to identify,
describe, or retrieve a document/publication of information content
• OR
• A collection of data elements organized in a logical way to represent
a bibliographic item or document, publication or any record of
human communication.
• Examples - author, title, publication date, publisher, ISBN etc.
These are useful to the information seeker. For example, the
publication date indicates the timeliness and appropriateness of
the document.



Types of Document Surrogates – Cont’d.
• Keyword – one or a set of individual words chosen by the
author/editor or sometimes dictated by the database to represent
the contents of the document.
• Abstract – a brief one or two paragraph description of the contents
of a paper often written by the author.
• Its purpose is to help a reader to determine whether the entire document
should be retrieved.



Types of Document Surrogates – Cont’d.
• Extract – “Artificially constructed surrogates created by
someone other than the author of a paper” (Korfhage, 2006).
• May comprise the first sentence of each paragraph or
significant words and phrases in the document.

• Review – a critical article on a book, play, recital etc.,
written by someone other than the author
• Its purpose is to indicate the value of the document with respect to
other works in the same field.
• It can be retrieved separately to suit the purposes of a reader.
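
The surrogate types described above can be pictured as fields of one small record. The sketch below is an illustration only: the `Surrogate` class, the sample values and the accession number are hypothetical, and `make_extract` implements just one of the options mentioned (the first sentence of each paragraph) using a deliberately naive sentence splitter.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Surrogate:
    identifier: str        # e.g. an accession or classification number
    bibliographic: dict    # author, title, publication date, ...
    keywords: list = field(default_factory=list)
    abstract: str = ""
    extract: str = ""


def make_extract(text):
    """Build an extract from the first sentence of each paragraph."""
    sentences = []
    for paragraph in text.split("\n\n"):
        paragraph = " ".join(paragraph.split())  # normalise whitespace
        if paragraph:
            # Naive rule: a sentence ends at '.', '!' or '?' followed by a space.
            first = re.split(r"(?<=[.!?])\s", paragraph, maxsplit=1)[0]
            sentences.append(first)
    return " ".join(sentences)


text = (
    "IR finds documents. It works on large collections.\n\n"
    "An index speeds up matching. Queries are compared to it."
)
surrogate = Surrogate(
    identifier="ACC-0001",  # hypothetical accession number
    bibliographic={"author": "A. Author", "title": "A Sample Paper", "year": 2020},
    keywords=["information retrieval", "indexing"],
    extract=make_extract(text),
)
print(surrogate.extract)
```

An IR system can search and display such surrogates without holding the full document, which is why they are described as limited representations.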



Automated Information Gathering
Topic Two



What is Information Gathering?
• In general practice, information gathering is the collection of data for
dealing with the individual’s or the organization’s current situation
(Teamreporter, 2018).

• Simply put, information gathering involves building a collection
to satisfy the information needs of a defined user group.

• Information gathering is a time-consuming process because of the
overload of available information, and many organizations have
dedicated teams for this task (Kate, Prapanca & Kalagnanam, 2014).



Information Gathering

• The problem of information gathering has received considerable


attention from information professionals.
• Among the major challenges is gathering the right information to match
users’ information needs/goals.
• Research in this area has generally assumed a user’s
information goal is perfectly represented by the query.
• This ‘assumed’ notion has been challenged by recent studies,
arguing that the user's queries are only approximate
representations of the user’s true information goals, and
complete models of the content of information sources are not
available.
• What is important is to ensure that the information gathering
process is shaped by the users’ information needs.
(Hiyakumoto & Veloso, 2002)



Information Gathering
• It is also important to bear in mind that …

• Usually, more data means more and better ways of
dealing with the current situation.

• New ideas come more easily if there is a solid
knowledge base.



Automated Systems for Information Processing and
Presentation

Topic Three



Ways to automate the preparation of documents for an IRS.

• Preparing data for an automated information retrieval system
began as a manual process. However, new technologies have automated
some of the activities in preparing the documents for an IRS,
thereby reducing the level of human involvement.

• Some of the mechanisms that have helped to fully or partially
automate the document preparation process are:
• Automatic text analysis
• Automated classification



1. Automatic Text Analysis
• Before a computerised Information Retrieval system can actually
operate to retrieve some information, that information must have
already been stored inside the computer.
• Originally, it will usually have been in the form of documents.
• The computer, however, is not likely to have stored the complete
text of each document in the natural language in which it was
written.
• It will have, instead, a document representative which may have
been produced from the documents either manually or
automatically.
(van Rijsbergen, 2012)



Automatic Text Analysis – Cont’d.
• The starting point of the text analysis process may be the complete
document text, an abstract, the title only, or perhaps a list of words
only.
• From it, the process must produce a document representative in a form
which the computer can handle.
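
One very simple way to produce a document representative automatically is to keep only the most frequent content words of the text. The sketch below assumes this frequency-based approach for illustration; it is not the only method, the stop list is a tiny hypothetical one, and the sample title is made up.

```python
import re
from collections import Counter

# A tiny illustrative stop list; real systems use much longer ones.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "is", "to", "for", "on"}


def document_representative(text, top_n=5):
    """Reduce a text to its most frequent non-stop-word terms."""
    tokens = re.findall(r"[a-z]+", text.lower())          # split into lowercase words
    terms = [t for t in tokens if t not in STOP_WORDS]    # drop common function words
    return Counter(terms).most_common(top_n)              # (term, frequency) pairs


title = "Indexing and retrieval of documents in the digital library: indexing for retrieval"
representative = document_representative(title, top_n=2)
print(representative)
```

The input here is a title only, one of the possible starting points listed above; the same function could be applied to an abstract or to the complete text.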



2. Automated Classification
• Classification, in a narrower sense, describes the process
by which a classificatory system is constructed.

• There are two main areas of application of classification
methods in IR:
1. keyword clustering;
2. document clustering.

(Sobrino, 2014)



Automated Classification – Cont’d

• Keyword Clustering:
• This is a technique that groups search terms into clusters,
each relevant to a designated part of a collection/database.



Automated Classification – Cont’d

• Document Clustering:
• The task of organizing a collection of documents, whose classification is
unknown, into meaningful groups (clusters) that are homogeneous
according to some notion of proximity (distance or similarity) among
documents (Tagarelli , 2009).

• The process of grouping similar documents into partitions where
documents within the same partition exhibit higher degree of similarity
among each other than to any other document in any other partition
(Rahal, Wang, & Schnepf, 2009).
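
The notion of grouping documents by similarity can be sketched with a simple single-pass method: each document joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. This is one simple heuristic chosen for illustration, not the method of the authors cited above; the documents, the cosine similarity on raw term counts and the threshold of 0.3 are all assumptions.

```python
import math
from collections import Counter


def vector(text):
    """Term-frequency vector of a text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def cluster(docs, threshold=0.3):
    """Single-pass clustering against the first member (leader) of each cluster."""
    clusters = []  # each cluster is a list of document ids; index 0 is the leader
    for doc_id, text in docs.items():
        for members in clusters:
            if cosine(vector(text), vector(docs[members[0]])) >= threshold:
                members.append(doc_id)
                break
        else:
            clusters.append([doc_id])  # no similar cluster found: start a new one
    return clusters


docs = {
    "d1": "library catalogue search",
    "d2": "online library catalogue",
    "d3": "expert system rules",
    "d4": "rules for expert system design",
}
groups = cluster(docs)
print(groups)
```

The result depends on the order of the documents and on the threshold, which is a known limitation of single-pass methods.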



Database Technology
Topic Four



Key concepts in Database Technology

• Data – a “set of given facts” or “information in a form that
can be processed by a computer” (Chowdhury, 2010).
• Can be numbers, e.g., ages, heights or weights of a group
of people, or
• Words, e.g., a set of keywords, medical records of patients



Key concepts in Database Technology – Cont’d.

• Record – a collection of related information or unit of information
in a database e.g., bibliographic information of a book, such as
title, author, publication date, place of publication, etc.
• Field – the elements or segments included in a record e.g. Author
field, title field, etc.
• Subfield – further sub divisions of a field e.g., the imprint field
in a bibliographic database is made up of publisher, date of
publication and place of publication.
• Field tag/Primary key – unique identifier given to a field at the
design stage for the purposes of editing, printing, searching and
data input.
• A query is a request for data or information from a database table
or combination of tables.



An illustration of the key concepts of the database technology

• Typically, databases capture and organize data using “tables”.
Each column is a FIELD and each row is a RECORD.

Student ID   NAME      GENDER   SCORE
10901582     Reggie    Male     60
10797822     Valerie   Female   73
10805526     Julie     Female   81

• The Primary key/Unique Field here is the Student ID

• Databases [with structured data] allow numerical range and exact
match (for text) queries. For example:
• - QUERY: List the names of female students who scored above 80. [this is
usually executed with a structured query language (SQL)].
• - RESULTS: Julie
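
The example query can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch: the table name `students` and the column names are assumptions chosen to mirror the table above, and the database lives only in memory.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
conn.execute(
    "CREATE TABLE students ("
    "student_id INTEGER PRIMARY KEY, name TEXT, gender TEXT, score INTEGER)"
)
conn.executemany(
    "INSERT INTO students VALUES (?, ?, ?, ?)",
    [
        (10901582, "Reggie", "Male", 60),
        (10797822, "Valerie", "Female", 73),
        (10805526, "Julie", "Female", 81),
    ],
)

# Exact match on text (gender) combined with a numerical range (score).
rows = conn.execute(
    "SELECT name FROM students WHERE gender = 'Female' AND score > 80"
).fetchall()
names = [name for (name,) in rows]
print(names)
conn.close()
```

The `WHERE` clause is where the two query types meet: `gender = 'Female'` is the exact text match and `score > 80` is the numerical range.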
Class Exercise
• Use the table below to attempt the following queries:
1. QUERY: Year < 2000 AND Subject = Information Studies RESULTS: ?
2. QUERY: Year > 2015 AND Subject = Psychology RESULTS: ?

AUTHOR’S NAME   TITLE OF BOOK                            YEAR OF PUBLICATION   SUBJECT AREA
Chun-Li         Computer Application in Libraries.       2018                  Information Studies
Valerie         Preservation of Information Resources    1991                  Information Studies
Julie           The Origin of Man                        2008                  Archeology
Reggie          Introduction to Information Management   2017                  Information Studies
Mark            Psychology for Everyday Living           2013                  Psychology



Class Exercise
In response to the second Query in the exercise:

• There are times when you search the UGCat and you have ‘no’
results.
• This simply means that your query/search yielded no results – the
search engine could not match your search to the available
collection/database.
• In situations like this, you may have to reformulate your query or
search a different database.



Unstructured Data
• Unstructured Data: Typically refers to free text. Text that
appears in “no” particular order.

• Unstructured data allows:


• Keyword queries including operators
• More sophisticated “concept” queries e.g.,
• Find all web pages dealing with drug abuse
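
A keyword query with operators over free text can be sketched as follows. The three "web pages" are hypothetical, and the `search` function with its `all_of`, `any_of` and `none_of` parameters is a name introduced here to stand for the AND, OR and NOT operators.

```python
import re

# Hypothetical free-text "web pages".
pages = {
    "p1": "Drug abuse among teenagers is rising",
    "p2": "New drug approved for treating malaria",
    "p3": "Substance abuse counselling services",
}


def tokens(text):
    """Lowercase words of a text, as a set."""
    return set(re.findall(r"[a-z]+", text.lower()))


def search(pages, all_of=(), any_of=(), none_of=()):
    """Keyword query with AND (all_of), OR (any_of) and NOT (none_of)."""
    hits = []
    for page_id, text in pages.items():
        words = tokens(text)
        if (
            all(t in words for t in all_of)
            and (not any_of or any(t in words for t in any_of))
            and not any(t in words for t in none_of)
        ):
            hits.append(page_id)
    return hits


and_hits = search(pages, all_of=["drug", "abuse"])  # drug AND abuse
or_not_hits = search(pages, any_of=["drug", "substance"], none_of=["malaria"])
print(and_hits, or_not_hits)
```

The query "drug AND abuse" finds only p1, although p3 is arguably about the same topic. Closing that gap is what the more sophisticated "concept" queries aim at.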



Semi Structured Data
• Semi-structured data is data that has not been organized
into a specialized repository, such as a database, but that
nevertheless has associated information, such as
metadata¹, that makes it more amenable to processing
than raw data.

¹ Metadata: Descriptive data about data/information, e.g., the date
and time the data was recorded.

(Rouse & Wigmore, 2015)



Database
Definitions
• A database is a collection of information organized to provide
efficient retrieval (Online Library Learning).
• “A collection of interrelated data stored so that it may be accessed
by users with simple user friendly dialogues” (The Macmillan
Dictionary of Information Technology, cited in Chowdhury, 2004, p. 16)
• Any collection of data or information specifically organised for fast
searching and retrieval by a computer (Encyclopaedia Britannica)
• A database is structured to facilitate storage, retrieval,
modification, deletion of data and other data processing operations.



Databases - Cont’d

• A database is a persistent, logically coherent collection
of inherently meaningful data, relevant to some aspects
of the real world.
• Most databases these days are computerized or
electronic. They may also be accessed online (via the
internet) or offline (on a local storage).

• An electronic database is therefore an electronically
organized collection of logically related data.
• Databases are usually designed to manage large bodies
of data or information.



Databases - Cont’d
• Databases organized in the form of a matrix are referred to
as structured databases.
• Most databases used by librarians are structured and they
include:
• External or remote databases – they are accessed online over the
Internet
• Portable databases – they are stored on optical discs, e.g. CD-ROMs
• In-house or locally stored databases – also accessed online, e.g.
Catalogues or indexes to local collections



Representation of data as a matrix/table
• As mentioned earlier, databases typically hold data that has been
structured at least to some extent.
• The databases mainly employ the use of tables to capture and
organise the data in the structured form.
• In a matrix, each row represents a discrete record within the file
(Stuart & Hebels, 2003).
• Each cell represents a single datum (Stuart & Hebels, 2003).
ISBN     Author field   Title field (Title of Info Mat)   Publisher field (Name of Publisher)   Date field
800158   Maame Ama      Making fuel at home               Adinkra Publishing House              2003
100198   Author 2       Title 2                           Sage                                  2006
720154   Author 3       Title 3                           Blackwell                             1998
410152   Author 4       Title 4                           Cambridge                             2015



Databases - Cont’d.

• The following are examples of databases that we use often:
– address book
– dictionary
– telephone book

• Databases (DB) are organized so that data or information stored in
the DB can easily be
• accessed,
• managed, and
• updated.



Databases Cont’d.
• A database allows both information professionals and
users to avoid the loss of time, confusion and errors that
can result when information is scattered and disorganized.



Types of databases
• Databases can be classified according to type of content.
• When classified according to the type of content, we have
the following types:
• Bibliographic
• Full-text
• Numeric
• Images



Classification of Databases

• In information retrieval, the two major divisions/database
classifications are:
• reference databases
• source databases



Reference databases
They are bibliographies or indexes which serve as guides to
information in published literature
1. Bibliographic databases – provide a citation or descriptive
record of an item but the item itself is not included in the
database. Sometimes they include abstracts, e.g., Social
Science Abstracts.
2. Catalogue databases – show the catalogue of a given library or
a network of libraries.
3. Referral databases – connect people to community resources,
agencies, and specialised services, e.g., Physician Referral
databases, Child Care Referral databases, Legal Referral
databases, a database of NGOs, etc.



Source databases
They provide users with required information without the need for
referral. They are often grouped by content, examples are:
1. Numeric databases – contain numerical data such as survey,
financial, and statistical data
2. Full-text databases – contain the full text of documents and not
just the citations. Examples include journals, books, newspapers,
dissertations, reports, etc.



Source databases
3. Text-numeric databases – contain both text and numerical data,
such as annual reports of companies and handbooks.
4. Directory databases – provide information about individuals and
organisations. Check http://www.the100lists.com/ for examples of
directories.
5. Multimedia databases – contain one or more primary media file
types, such as video, audio, graphics, and animation sequences,
as well as documents.



The development of database in an information retrieval
environment
Factors to consider
1. Functionality/purpose – is it for online retrieval, resource sharing,
stand alone system etc.
2. Nature of documents/records to be included in the system.
3. Maximum number of records to be integrated into the system.
4. The nature and number of users
5. Availability of resources, i.e. Software, hardware and staff.
6. Knowledge and skills required to maximize use of software
7. Training facilities available.
These factors are important because they determine the choice of software
package, the number of fields to be created, and the optimal performance
of the system.



Other considerations
Hardware Requirements:
• The processor for executing the program
• Memory for holding ongoing works
• Disk storage for holding data files
• Devices for archiving data files to be used in the event of accidental
damage or loss of data
• Printers to produce hard copy when needed
• Terminals for data input and control of all processes



Practical SESSION
Tutorial Session / Individual Practice



Steps in the design of a database
• Database design is the first step in the development of a text
retrieval system.
• Pre-requisite decisions include determining:
• The nature of data
• Nature and number of fields and subfields
• Nature of database indexing
• Format for display and printing of data
• Sorting of data during printing
• Entry or editing of data



Steps in the design of a database contd.
• The number/list of fields and subfields is based on the nature of
the data/record. For example, the fields in a simple library catalogue
are as follows:
• Author
• Title of book
• Publisher’s name
• Place of publication
• Date of publication
• Price
• Call number
• Accession number
• Keywords
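
The field list above can be sketched as a record definition. This is only an illustration: the class name `CatalogueRecord`, the Python field names, the chosen data types, and the sample values are all assumptions, not part of any particular library system.

```python
from dataclasses import dataclass, field, fields
from typing import List, Optional

@dataclass
class CatalogueRecord:
    """One record of a simple library catalogue (hypothetical layout)."""
    author: str
    title: str
    publisher: str
    place_of_publication: str
    date_of_publication: int
    price: Optional[float] = None          # may be unknown
    call_number: str = ""
    accession_number: str = ""
    keywords: List[str] = field(default_factory=list)  # repeatable field

# Placeholder values, invented for the example.
record = CatalogueRecord(
    author="Author 1",
    title="Title 1",
    publisher="Publisher 1",
    place_of_publication="Accra",
    date_of_publication=2003,
    keywords=["fuel"],
)

field_names = [f.name for f in fields(CatalogueRecord)]
print(field_names)
```

Deciding early which fields are mandatory, optional, or repeatable (as `keywords` is here) is part of settling the nature and number of fields.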



Steps in the design of a database contd.
• Database indexing – this step generates the index file
that can be searched. This process depends on the
software package being used for developing the database.
Some software packages are programmed to update index
files as soon as new records are added or existing records
deleted.
• Data entry form/worksheet – this is a blank form used for
entering data.
• Output format – mode of display of records for browsing or
searching.
• Data entry, searching and printing – these steps
conclude the design of a database.
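
The idea of an index file that is updated as records are added or deleted can be sketched as below. The class `IndexedDatabase` and its methods are hypothetical; actual packages build and maintain their index files in their own ways.

```python
from collections import defaultdict

class IndexedDatabase:
    """Toy database that keeps a word index in step with its records."""

    def __init__(self):
        self.records = {}              # record id -> title
        self.index = defaultdict(set)  # word -> ids of records containing it

    def add(self, record_id, title):
        self.records[record_id] = title
        for word in title.lower().split():
            self.index[word].add(record_id)

    def delete(self, record_id):
        title = self.records.pop(record_id)
        for word in title.lower().split():
            self.index[word].discard(record_id)
            if not self.index[word]:   # drop index entries left empty
                del self.index[word]

    def search(self, word):
        return sorted(self.index.get(word.lower(), set()))

db = IndexedDatabase()
db.add(1, "Preservation of Information Resources")
db.add(2, "Introduction to Information Management")
before = db.search("information")
db.delete(1)
after = db.search("information")
print(before, after)
```

Searching consults the index rather than scanning every record, which is why the index must be refreshed whenever the records change.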



Summary

• Knowledge of the nature of the information content of
an automated retrieval system is crucial to both
the information professional and the user.
• We have, in this class, discussed the nature of the body
of knowledge/collection that exists in an information
system.
• We also deliberated on automated systems for
information gathering, processing, and presentation,
with special attention to database technology.



Activity 2.1

• Go to the link below and practice how to create a
searchable database.
• http://www.smallbusinesscomputing.com/buyersguide/article.php/3721436/Build-Your-First-Database-with-Access.htm



References

Hiyakumoto, L. S., & Veloso, M. M. (2002). Towards planning and
execution for information retrieval. 3, 22.
Korfhage, R. R. (2006). Information Storage and Retrieval.
Wiley India Pvt. Limited.
Stevenson, A. & Waite, M. (2011). Concise Oxford English
Dictionary: Book & CD-ROM Set. Oxford University Press.



INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2021/2022)

Session 04
SUBJECT ANALYSIS & REPRESENTATION

Lecturer: Dr. (Mrs.) Florence O. Entsua-Mensah, DIS


Contact Information: fentsua-mensah@staff.ug.edu.gh
Session Overview

• One of the major functions of an information retrieval system is to
match the contents of documents with users’ queries.
• The system personnel must prepare a surrogate for every
document, and all such surrogates must be maintained in an
organized manner.
• This activity is achieved through ‘subject analysis’.
• The session therefore examines the concept of subject analysis
and representation.

Dr. (Mrs.) F. O. Entsua-mensah,


DIS/SCDE Slide 2
Session Outline

The key topics to be covered in the session are:

• Topic 1: Understanding Subject Analysis


• Topic 2: Determining the ‘subject matter’ of a document
• Topic 3: Important factors to note in subject analysis
• Topic 4: Subject Indexing Systems

Recommended Reading

Chowdhury, G. G. (2010). Introduction to modern
information retrieval. London: Facet publishing.
(Chapter 5).

Korfhage, R. R. (2006). Information Storage and Retrieval. Wiley
India Pvt. Limited.

Taylor, A. G. (2009). The organization of information. (3rd
ed.). Westport, CT: Libraries Unlimited. – Chapter 9

Introduction
• The major function of an IR system is to match users’
queries to the contents of documents.
• This matching becomes possible after a surrogate has been
prepared for each document.
• Document surrogates are “limited representations of full
documents” (Korfhage, 2006).
• The construction of document surrogates by assigning
specific identifiers or keywords to text items is referred to
as indexing.
• Indexing based on the conceptual analysis of the subject of
documents is known as subject indexing.

Introduction – Cont’d.
• Subject indexing comprises two (2) main intellectual steps:

1. Conceptual analysis: What is the subject of this document?
What is it about?
2. Representation: What keyword can best represent this document?

Introduction – Cont’d.
• This session focuses on the what, why, when, how, and who of
subject analysis, or determining what an information object is about.
• All four subject description processes (classification, subject
cataloging, indexing, and abstracting) as well as searching depend
on subject analysis.
• This is a vital skill area for information professionals.
• Expert subject analysis requires high levels of verbal aptitude and
abstract thinking skills.

Subject Analysis
Topic One



Subject Analysis (SA)
• SA refers to examination of a bibliographic item by a
trained subject specialist to determine the most specific
subject heading(s) or descriptor(s) that fully describe
its content, to serve in the bibliographic record as
access points in a subject search of a library catalog,
index, abstracting service, or bibliographic database.

(Bastida, 2016)

SA Cont’d
• Subject analysis is the task of determining the intellectual
content or aboutness of an information object.

• But this does not apply only to the documents in the collection.
If you recall the basic information retrieval (IR)
model, two kinds of information go into an IR
system:
• representations of information objects (documents) and
• representations of information need (queries).

(Taylor, 2009)

SA Cont’d
• Document analysis is studying a document to determine how to
represent it in a record, or what indexing terms or codes to enter.

• Query analysis is studying an information request to determine
how to formulate a search query, or how to choose appropriate
search terms.

(Taylor, 2009)

SA Cont’d

• Subject analysis is the part of indexing or cataloging that
deals with the conceptual analysis of an item (document):
• It begins with determining what a document is about and
what its form/genre/format is.
• It then translates that analysis into a particular subject heading
system.
• It is usually the first step in classification.
(Robare, 2004)

Perspectives of subject analysis
Subject analysis is used in two ways in the library and information
science (LIS) literature
1. Relates to construction of indexing language and classification
systems.
2. Relates to the analysis of the topical content of a document
(which is our focus).
Thus, it determines the essence or the subject matter in
document texts, databases, controlled and natural
languages, information requests, and search strategies.

Determining the ‘subject matter’ of a
document.
Topic Two



Conducting Subject Analysis
Subject analysis involves four steps:

1. Familiarization: Becoming acquainted with general content of
document and query.
2. Extraction: Identifying and extracting significant concepts and
natural-language terms.
3. Translation: Converting extracted terms into controlled
vocabulary of system.
4. Formalization: Applying rules for exact format, spelling,
punctuation, codes, etc. for input to system.
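
The four steps can be sketched as a small pipeline. Everything here is a simplified assumption: the stop-word list, the tiny controlled vocabulary (these are not real LCSH headings), and the formatting rule are invented, and real subject analysis relies on human judgement at each step.

```python
import re

STOPWORDS = {"a", "an", "and", "for", "in", "of", "the", "to"}

# Hypothetical controlled vocabulary: natural-language term -> preferred heading.
CONTROLLED_VOCABULARY = {
    "library": "Libraries",
    "libraries": "Libraries",
    "computer": "Computers",
    "computers": "Computers",
    "preservation": "Preservation of materials",
}

def familiarize(document):
    """Step 1: gather the subject-rich parts (here only title and abstract)."""
    return f"{document['title']} {document.get('abstract', '')}"

def extract(text):
    """Step 2: pull out candidate natural-language terms."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def translate(terms):
    """Step 3: map extracted terms onto the controlled vocabulary."""
    return {CONTROLLED_VOCABULARY[t] for t in terms if t in CONTROLLED_VOCABULARY}

def formalize(headings):
    """Step 4: apply an (invented) input rule: upper case, sorted, ';' separated."""
    return "; ".join(sorted(h.upper() for h in headings))

document = {"title": "Computer Application in Libraries"}
subject_field = formalize(translate(extract(familiarize(document))))
print(subject_field)
```

In this sketch the term "application" is extracted but has no match in the vocabulary, so it is silently dropped at the translation step; a human indexer would decide whether a new term is needed.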

Conducting Subject Analysis – Cont’d.
• The steps do not necessarily occur in this order: subject analysis
requires evaluation and verification at every stage in a continuous,
iterative cycle.
• Some experts (e.g., Taylor, 2004) subsume steps 1 and 2 under
conceptual analysis, or determining “aboutness”, and steps 3 and 4
under subject analysis, or translating concepts into system terms.
• Although controlled-vocabulary systems are assumed above, steps
1 and 2 are also applied in natural-language systems. Regardless,
the same general considerations come into play.

Parts of a Book to Examine When Determining the subject content

Examine the subject-rich portions of the item
being cataloged to identify key words and
concepts:
• Title
• Table of contents
• Introduction or preface
• Author’s purpose or foreword
• Abstract or summary
• Index
• Illustrations, diagrams
• Containers

(Robare, 2004)



Types of concepts to identify in S. A.
• Topics
• Names of:
• Persons
• Corporate bodies
• Geographic areas
• Time periods
• Titles of works
• Form of the item

(Robare, 2004)

Translating key words & concepts into subject headings

• Controlled vocabulary
• Thesauri (examples)
• Art & Architecture Thesaurus (AAT)
• Thesaurus of ERIC Descriptors
• Subject heading lists (examples)
• Library of Congress Subject Headings
• Sears List of Subject Headings
• Medical Subject Headings (MeSH)

(Robare, 2004)

Why use controlled vocabulary?
• Controlled vocabularies:
• identify a preferred way of expressing a concept
• allow for multiple entry points (i.e., cross-references) leading to the
preferred term
• identify a term’s relationship to broader, narrower, and related terms
• “syndetic structure”

(Robare, 2004)
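
A minimal sketch of how a controlled vocabulary leads entry terms to a preferred term and records broader (BT), narrower (NT), and related (RT) terms. The terms and relationships below are invented for illustration and are not taken from any published thesaurus.

```python
# Cross-references: non-preferred entry term -> preferred term (hypothetical).
USE_REFERENCES = {
    "cars": "Automobiles",
    "motor cars": "Automobiles",
    "autos": "Automobiles",
}

# Syndetic structure of one preferred term (hypothetical).
THESAURUS = {
    "Automobiles": {
        "BT": ["Motor vehicles"],
        "NT": ["Sports cars"],
        "RT": ["Roads"],
    },
}

def preferred_term(entry_term):
    """Return the preferred term for an entry term, or None if unknown."""
    key = entry_term.strip().lower()
    if key in USE_REFERENCES:
        return USE_REFERENCES[key]
    for term in THESAURUS:
        if term.lower() == key:
            return term
    return None

def relationships(term):
    """Return the broader, narrower and related terms recorded for a term."""
    return THESAURUS.get(term, {})

print(preferred_term("Cars"), relationships("Automobiles"))
```

Whatever word the searcher starts with, the cross-references funnel the search to one heading, and the BT/NT/RT links suggest where to widen or narrow it.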

Function of keywords

• Advantages:
• provide access to the words used in bibliographic
records
• Disadvantages:
• cannot compensate for complexities of language and
expression
• cannot compensate for context
• Keyword searching is enhanced by assignment of
controlled vocabulary!

(Robare, 2004)

Guidelines for determining the subject matter of a
document for the purposes of indexing
Dewey Decimal Classification guidelines
• The indexer must:
• examine the title and table of contents,
• chapter headings and subheadings,
• read the foreword, preface, and introduction,
• and lastly scan through the main text.

Guidelines for determining the subject matter of a
document for the purposes of indexing
• Guidelines by the International Organization for Standardization (ISO)
• The indexer must examine:
• the title and abstract (if available)
• the list of contents and introduction
• the opening chapters and the conclusion
• illustrations, diagrams, tables and their captions
• words or groups of words which are underlined or printed
in an unusual typeface
• Finally, the indexer identifies the main concepts in the document
by consulting a checklist of questions.

Checklist for examining documents, determining their subjects and selecting
index terms – British Standards Institution (BS 6529:1984)

❑ Does the document deal with a specific product, condition or
phenomenon?
❑ Does the document contain an action concept, an operation or
a process?
❑ Does the document deal with the agent of this action?
❑ Does it refer to particular means of accomplishing the action,
e.g., special instruments, techniques or methods?
❑ Were these factors considered in the context of a particular
location or environment?
❑ Are any independent or dependent variables identified?
❑ Was the subject considered from a special viewpoint not normally
associated with that field of study, e.g. a sociological study of religion?
(Chowdhury, 2010, p. 97)

• Such a checklist demands some level of intellectual capacity on the
part of the indexer, leading to problems in manual indexing.



Wilson’s proposition for determining the subject of a
document
Wilson proposes 4 ways:
1. The purposive method – author oriented, i.e., the indexer
determines the purpose of the document to ascertain what the
author is narrating, proving, describing, questioning, or
explaining by looking for specific clues in the document
such as “I will show that” or “it shall be proved that”.
2. The figure-ground method – the indexer determines the aspects of
the document that are emphasized or stand out. This is more or less
the indexer’s impression of the document, and may differ from
person to person.
3. The constantly-referred-to method – the subject is determined by
counting frequencies of words in the document. The assumption is
that the word with the highest frequency is the subject of the
document. However, this might not be true.
4. The appeal to unity method – the indexer tries to determine what
unifies the document or makes it whole or cohesive. This might
differ from indexer to indexer.
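
The word-counting approach among Wilson's methods can be sketched as below. The sample text and the stop-word list are invented, and the function name is hypothetical.

```python
import re
from collections import Counter

STOPWORDS = {"a", "an", "and", "for", "in", "is", "of", "the", "to"}

def most_frequent_terms(text, n=3):
    """Count non-stop-words and return the n most frequent with their counts."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in STOPWORDS)
    return counts.most_common(n)

text = (
    "Libraries preserve information. Preservation of information resources "
    "keeps information usable for future library users."
)
top = most_frequent_terms(text, 1)
print(top)
```

The sample text is mainly about preservation, yet the most frequent word is "information". This illustrates the caveat that the most frequent word is not necessarily the subject of the document.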



Modelling the subject analysis process
• The available models for subject analysis can be generalised into a
3 step model:
1. Document analysis process - Analysis of the document to
determine the subject
2. Subject description process – formulation of an indexing
phrase or subject description.
3. Subject analysis process – translation of the subject
description into an indexing language or classification scheme

Important Factors to Note In Subject Analysis
Topic 3



Objectivity

• Catalogers must give an accurate, unbiased
indication of the contents of an item
• Assess the topic objectively, remain open-minded
• Consider the author’s intent and the audience
• Avoid personal value judgments
• Give equal attention to works, including:
• Topics you might consider frivolous
• Works with which you don’t agree

(Robare, 2004)

(Informed) Subjectivity
Cataloger’s judgment
• Individual perspective
• Informed by the cataloger’s background knowledge of the
subject
• Informed by the cataloger’s cultural background
• Consistency in determining “What is it about?” leads to
greater consistency in assignment of subject headings

(Robare, 2004)

Subject Representation: Subject Indexing Systems
Topic 4



Subject indexing systems
• They are indexing systems based on the analysis of contents of
documents.
• They facilitate retrieval of documents by assigning index terms after
the subject matter of a document has been analysed.
• Assignment of index terms can be manual or automatic.
• There are 2 types of subject indexing systems:
• Pre-coordinate systems
• Post-coordinate systems

Pre-coordinate indexing system
• Coordination of terms is performed at the indexing stage.
• Each index entry represents the full content of a specific
document
• Indexer may select terms from an authoritative source such
as Library of Congress Subject Headings (LCSH)
• There is no room for manipulation of terms during
searching. Searchers can only use the terms or compound
terms pre-determined by the indexer (more or less like a
controlled vocabulary)
• Examples are Chain indexing, relational indexing,
PREserved Context Index System (PRECIS), Postulate-
based Permuted Subject Indexing (POPSI)

Post-coordinate indexing system
• A single entry is prepared for each keyword selected to
represent the subject of a document. All entries are
organized in a file
• During searching, users’ queries are matched against the
file of index terms and the relevant documents are
retrieved.
• Examples: Uniterm, Peek-a-boo
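
Post-coordination can be sketched as one entry per keyword, with terms combined only at search time. The sample documents and keywords are invented, and the names `uniterm_file`, `index_document`, and `search` are hypothetical.

```python
from collections import defaultdict

# One entry per keyword; each entry lists the documents indexed under it.
uniterm_file = defaultdict(set)

def index_document(doc_id, keywords):
    """Indexing stage: file the document under each keyword separately."""
    for keyword in keywords:
        uniterm_file[keyword.lower()].add(doc_id)

def search(*keywords):
    """Search stage: coordinate the terms by intersecting their entries."""
    postings = [uniterm_file.get(k.lower(), set()) for k in keywords]
    return sorted(set.intersection(*postings)) if postings else []

index_document(1, ["Computers", "Libraries"])
index_document(2, ["Libraries", "Preservation"])
index_document(3, ["Computers", "Psychology"])

print(search("computers", "libraries"))
print(search("libraries"))
print(search("computers", "preservation"))
```

Unlike a pre-coordinate index, the searcher is free to combine any keywords, because no combinations were fixed by the indexer in advance.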

Summary

• In this session we examined the work of the subject
analyst in (automated) information retrieval.

• We also studied some of the ways to conduct
subject analysis.

Activity 5.1
• Discuss the role of subject indexing systems in modern information
retrieval.
• What parts of a book would you look at to determine the subject?
• How do you determine what an item is about?
• In what way(s) do controlled vocabularies help in providing subject access?

Activity 5.2

• Pick any book of your choice from the library.

• Read through it and determine its subject content.

• Create a list of key words and concepts that would be translated into a
controlled vocabulary.

Activity 5.2
• Access this link http://www.ugapress.org/upload/indexing.pdf
and make notes on how to systematically index a book

References

Bastida, G. (2016). Subject analysis and representation.
Accessed on 23rd August, 2018. Available at:
https://slideplayer.com/slide/7802631/
Chowdhury, G. G. (2010). Introduction to modern
information retrieval. London: Facet publishing.
Korfhage, R. R. (2006). Information Storage and Retrieval.
Wiley India Pvt. Limited.
Robare, L. (2004). Basic Subject Cataloguing Using
LCSH: Instructor’s manual. United States. Library of
Congress. Retrieved from
https://books.google.com.gh/books?id=Um--PAAACAAJ
Taylor, A. G. (2009). The organization of information. (3rd
ed.). Westport, CT: Libraries Unlimited.

INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2022/2023)

Lecture 5:
BIBLIOGRAPHIC FORMATS

Lecturer:
Dr. Mrs. Florence O. Entsua-Mensah (fentsua-mensah@ug.edu.gh)
Session Outline
The key topics to be covered in the session are:

• Bibliographic Formats
• The ISO 2709
• The MARC Format



Recommended Reading

Chowdhury, G. G. (2010). Introduction to modern information
retrieval. London: Facet publishing. – Chapter 3.



Introduction (1)

▪ In any organisation different kinds of information may be
required.

1. A large part of the information required is Factual.

2. The second large category of information required in
any organisation is Bibliographic or textual in nature.

3. The third category of information that may be required
in an organisation includes personal and
institutional information, project information, etc.

(Chowdhury, 2010).



Introduction (2)

Factual: Contains various facts such as the features of a particular
chemical element or compound, a metal, a tool, a piece of equipment, an
automobile, a spare part, a drug, a patient, a plant, etc. The creation
and maintenance of factual information retrieval systems require a
background of (i) the subject field and (ii) the actual and potential
users and their activities in relation to their information requirements
and interests.

Bibliographical: Contains details of bibliographic items, with or
without full text.

Institutional: Personal and institutional information and/or project
information, etc.

(Chowdhury, 2010).


Introduction (3)
• Exchange and sharing of these categories of information
across different user communities require the use of
standard formats to facilitate creation and exchange of
bibliographic data.
• In the case of libraries, to facilitate international library
resource sharing, interlibrary loan and global networking
by use of ICT, there is the development of several
bibliographic formats.
• A designer of a bibliographic database must choose a
format suitable to the needs of the target user community.

(Chowdhury, 2010).



What is a bibliographic record?
• All the elements that can be used “to describe, identify, or retrieve
any physical item of information content” (Chowdhury, 2010).

OR

• A collection of data elements organized in a logical way to
represent a bibliographic item
• Bibliographic item – Any record of human communication treated as
an entity, such as a book, a document, part of a document, or group
of documents.



Bibliographic Data Exchange
• Effective exchange of bibliographic data between agencies
can be accomplished only if the records conform in respect
of components such as the structure, the content
designators and the data elements.

(Shahil, Sivakumar, & Rejitha, 2011)



Components of a bibliographic format
Efficient exchange of bibliographic data between
institutions/agencies can be accomplished if their records
conform in respect of the following components:
1. Physical structure – Rules for the arrangement of the
data to be exchanged on a computer storage medium,
e.g. Floppy disk or CD-ROM
2. Content designators – These are codes or tags to
identify or define the different data elements in the
record, i.e., author, title, date of publication etc.
3. Content - contents of the record and the rules that
govern the formulation of the data elements of different
formats. For example all bibliographic IR systems must
follow some form of cataloguing rules to ensure uniform
data presentation, display and printed output.



Format of records in an AIR
• Records of a database exist in separate but compatible
formats. There are different formats for:
• Input of records into the system
• Long-term storage
• Retrieval of records
• Display of records
• Organizations seeking to exchange information
must also have a standard exchange format.
• There are several standard exchange formats with many
similarities but also differences hindering exchange of
information from one format to the other.



Bibliographic Formats
Topic 1



What is a bibliographic Format?
• A bibliographic format describes the structure that enables
cataloguing metadata to be processed by a computer
(IFLA, 2017).

• Numerous bibliographic formats exist both within and
beyond the library environment, e.g., MARC 21.



International Bibliographic Formats

• The lack of uniformity in national standard formats
has led to the development of international
standard exchange formats such as:
• ISO 2709
• UNISIST
• MARC
• UNIMARC
• MARC21



Types of International exchange formats

• UNIMARC (UNIversal MAchine Readable Cataloguing) – a format
standard supported by the International Federation of Library
Associations and Institutions (IFLA), with the primary function of
facilitating the international exchange of bibliographic data in
machine-readable form (Dunsire, Willer, & Perožić, 2013).



Programmes of UNISIST
• UNISIST (United Nations International Scientific
Information System) – It is UNESCO’s World Scientific
Information programme to facilitate scientific information.
It includes a number of programmes.

• ISSN (International Standard Serial Number) – This is
an eight-digit number for identifying a journal.
• The number is associated with the title of the journal and therefore
changes when the name of a journal is changed.
• The purpose of ISSN is to ensure bibliographic control, i.e.,
organization of recorded information according to established
standards to make it readily retrievable.
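
As an aside on the eight-digit ISSN: its last character is a check character computed from the first seven digits (weights 8 down to 2, modulus 11, with "X" standing for 10). This detail is not covered on the slide; the sketch below is only an illustration, the function names are hypothetical, and the sample numbers are used purely to exercise the arithmetic.

```python
def issn_check_character(first_seven):
    """Compute the check character from the first seven digits of an ISSN."""
    weights = range(8, 1, -1)  # 8, 7, 6, 5, 4, 3, 2
    total = sum(int(digit) * weight for digit, weight in zip(first_seven, weights))
    check = (11 - total % 11) % 11
    return "X" if check == 10 else str(check)

def is_valid_issn(issn):
    """Return True if the text has eight characters and a correct check character."""
    compact = issn.replace("-", "").upper()
    if len(compact) != 8 or not compact[:7].isdigit():
        return False
    return issn_check_character(compact[:7]) == compact[7]

print(is_valid_issn("0378-5955"))
print(is_valid_issn("0378-5954"))
```

A check of this kind lets a system catch most typing errors when an ISSN is entered, which supports the goal of bibliographic control.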



ISO 2709
Topic 2



What is ISO 2709?
• ISO 2709 is a format for bibliographic information
exchange.

• ISO 2709 is an ISO standard for bibliographic descriptions,


titled Information and documentation—Format for
information exchange.

• It is maintained by the Technical Committee for Information
and Documentation (ISO/TC 46).

(ISO, 2016)



ISO 2709
• ISO 2709 specifies the requirements for a generalized
exchange format which will hold records describing all
forms of material capable of bibliographic description as
well as other types of records.

• It does not define the length or the content of individual
records and does not assign any meaning to tags,
indicators or identifiers, these specifications being the
functions of an implementation format.

(ISO, 2016)

Characteristics of the ISO 2709

• It is a framework for communication between data
processing systems and also for use as a processing
format within a system.

• It does not define the length or the content of individual
records.

• It does not assign any meanings to tags, indicators or
identifiers.
• Such specifications are the functions of an implementation
format.



Elements of ISO 2709

ISO 2709 consists of four (4) elements:
• Record label
• Directory
• Fields
• Record separator



Elements of ISO 2709 – RECORD LABEL
• Record label – It is a fixed-length field of 24 characters
in total and holds the basic information of a record. This is
the only portion of the record that is fixed in length.
• It includes:
• record length (total no. of characters in the record),
• record status (e.g., new record),
• implementation codes (e.g., record type),
• identifier length.
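
To make the fixed-length idea concrete, here is a minimal Python sketch that slices a 24-character record label into named parts. The character positions used (0-4 record length, 5 status, 6-9 implementation codes, 10-11 indicator and identifier lengths, 12-16 base address of data, 20-22 directory map) follow the layout commonly documented for ISO 2709 and MARC 21; the slide itself only names the contents, so treat the positions, the class name and the sample label as assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecordLabel:
    record_length: int         # characters 0-4: total characters in the record
    record_status: str         # character 5: e.g. new or corrected record
    implementation_codes: str  # characters 6-9: e.g. record type
    indicator_length: int      # character 10
    identifier_length: int     # character 11
    base_address: int          # characters 12-16: where the data fields start
    directory_map: str         # characters 20-22: sizes of directory entry parts


def parse_record_label(label: str) -> RecordLabel:
    """Split a 24-character record label into its parts (assumed layout)."""
    if len(label) != 24:
        raise ValueError("a record label must be exactly 24 characters long")
    return RecordLabel(
        record_length=int(label[0:5]),
        record_status=label[5],
        implementation_codes=label[6:10],
        indicator_length=int(label[10]),
        identifier_length=int(label[11]),
        base_address=int(label[12:17]),
        directory_map=label[20:23],
    )


# Hypothetical label, assembled piece by piece so the positions are visible.
sample_label = "00714" + "c" + "am a" + "2" + "2" + "00205" + " a " + "450" + "0"
label = parse_record_label(sample_label)
print(label)
```

Because every part sits at a fixed offset, a program can read the label without knowing anything else about the record.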



Elements of ISO 2709 – DIRECTORY
• The directory is a variable-length field which provides the entry
positions of the fields in the record, together with the field
tags.

• A directory entry has four parts:

1. a tag (3 octets)
2. the length of the field
3. the starting character position
4. the implementation-defined part.

(Galabova, Trencheva & Trenchev, 2009)


Elements of ISO 2709 – DIRECTORY

• The length of the tag shall be three octets. The length in octets of
the other three parts in each directory entry shall be given by the
directory map (octets 20 to 22 in the record label).
• All entries in a directory shall have the same structure.
(Galabova, Trencheva & Trenchev, 2009)

• The directory contains a 'content designator' for each data field,
followed by an indication of the position in the record where the
data relating to that field start, and the length of the field.
• If a field occurs twice, it has two entries in the directory, one for
each appearance. (Shahil, Sivakumar, & Rejitha, 2011)


Elements of ISO 2709 – DIRECTORY
Structure of the Directory
• The Directory is made up of:
• field tag (3 characters),
• length of the field (4 characters),
• starting character position of the field (5 characters),
• occurrence of the field,
• number of segments containing the field.
• It ends with a terminating symbol.
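
The structure above can be sketched in Python as follows. This is a hedged illustration, not a full ISO 2709 reader: the function name and the sample directory are hypothetical, the directory map "450" (4-character length, 5-character starting position, no implementation-defined part) is an assumed value, and the terminating symbol is assumed to have been stripped before parsing.

```python
def parse_directory(directory: str, directory_map: str = "450"):
    """Split a directory string into (tag, length, start, extra) entries.

    directory_map holds three digits taken from the record label: the size of
    the length part, the starting-position part and the implementation-defined
    part of every entry. The tag is always three characters.
    """
    len_length, len_start, len_extra = (int(digit) for digit in directory_map)
    entry_size = 3 + len_length + len_start + len_extra
    if len(directory) % entry_size != 0:
        raise ValueError("directory length is not a whole number of entries")

    entries = []
    for offset in range(0, len(directory), entry_size):
        entry = directory[offset:offset + entry_size]
        tag = entry[:3]
        length = int(entry[3:3 + len_length])
        start = int(entry[3 + len_length:3 + len_length + len_start])
        extra = entry[3 + len_length + len_start:]
        entries.append((tag, length, start, extra))
    return entries


# Hypothetical directory with two entries of 12 characters each.
sample_directory = "100" + "0015" + "00000" + "245" + "0036" + "00015"
for tag, length, start, extra in parse_directory(sample_directory):
    print(tag, length, start)
```

Since all entries share one structure, the reader only needs the directory map to step through the directory entry by entry.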



Directory - Structure

Each directory entry has the following parts:

• Tag (3 characters): a three-character code identifying the data field
which corresponds to the directory entry.
• Length of the data field (4 characters): a four-digit number showing how
many characters are occupied by the data field, including indicators and
the data field separator but excluding the record separator code if the
data field is the last field in the record.
• Starting character position (5 characters): a five-digit number giving
the position of the first character of the data field relative to the base
address of data, i.e. the first character of the first data field.
• Segment identifier (1 character): a single character (chosen from 0-9
and/or A-Z) which designates the data field as being a member of a
particular segment.
• Occurrence identifier (1 character): a single character (chosen from 0-9
and A-Z) which differentiates multiple occurrences of data fields that
carry the same tag within the same record segment.
(Shahil, Sivakumar, & Rejitha, 2011)


Elements of ISO 2709 – FIELD (1)
• All fields shall end with a field separator.

• There are basically four types of fields:
• Record identifier
• Reference fields
• Bibliographic data fields
• Field separators

(Galabova, Trencheva & Trenchev, 2009)


Elements of ISO 2709 – FIELD (2)
Types of fields in the ISO 2709 record:
Record identifier
• A variable-length field given by the organization that
creates the record. Its purpose is to identify the record.

Reference fields
• These hold reference data of a given record that may
be required for processing.

Bibliographic data fields
• These hold the actual bibliographic data together with their
indicators or tags.

Field separators
• Each data field is terminated with a field separator symbol.


Data Field - Structure

• Indicators: 2 characters
• Subfield identifier: 2 characters
• Subfield: variable length
• Field separator: 1 character

(Shahil, Sivakumar, & Rejitha, 2011)
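
As a rough illustration of this layout, the Python sketch below splits one data field into its indicators and subfields. It assumes the control characters used by MARC 21 implementations (0x1F to start each two-character subfield identifier, 0x1E as the field separator); the slides do not name these values, and the function name and sample field are hypothetical.

```python
SUBFIELD_DELIMITER = "\x1f"  # assumed: first character of each subfield identifier
FIELD_SEPARATOR = "\x1e"     # assumed: single character that ends every field


def parse_data_field(field: str, indicator_length: int = 2):
    """Return (indicators, [(subfield_code, value), ...]) for one data field."""
    if not field.endswith(FIELD_SEPARATOR):
        raise ValueError("a data field must end with the field separator")
    body = field[:-1]
    indicators = body[:indicator_length]
    subfields = []
    # Everything after the indicators is a run of delimiter + code + data.
    for chunk in body[indicator_length:].split(SUBFIELD_DELIMITER):
        if chunk:
            subfields.append((chunk[0], chunk[1:]))
    return indicators, subfields


# Hypothetical title field: indicators "10", subfields a and c.
sample_field = (
    "10"
    + SUBFIELD_DELIMITER + "a" + "Raccoons and ripe corn /"
    + SUBFIELD_DELIMITER + "c" + "Jim Arnosky."
    + FIELD_SEPARATOR
)
print(parse_data_field(sample_field))
```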



Elements of ISO 2709 – RECORD SEPARATOR
• The record separator is the final character of each record.
• This will always be a single character.

• It follows the field separator of the final data field of the record.

(ISO, 2016; Galabova, Trencheva & Trenchev, 2009)


Advantages of ISO 2709
• It provides a small number of mandatory data elements, which are
recognized by all sectors of the information community as essential
in order to identify an item.
• It gives mandatory data elements that are sufficiently flexible to
accommodate varying descriptive practices.
• It also provides a number of optional elements, which may be useful
to describe an item according to practices of the agency, which
creates the record.
• It provides a mechanism for linking records and segments of
records without imposing on the originating agency any uniform
practice regarding the treatment of related groups of records or data
elements.

(Galabova, Trencheva & Trenchev, 2009)


The MARC Format
Topic Three (3)



MARC FORMAT
• MARC – stands for Machine-Readable Catalogue or
Cataloguing
• Machine-readable – means a machine or a computer can
read and interpret the data in the cataloging record
• Cataloging record – means a bibliographic record or the
traditional information on a library catalogue card, i.e. main
and added entries, subject headings, classification or call
no., etc.
• Purpose – The purpose of the MARC format is to employ a
set of conventions to identify and arrange bibliographic
data so that it can be handled by a computer
• MARC adheres to the ISO 2709 record structure


Brief history of MARC
• MARC was developed by the Library of Congress in the 1960s.
• There were two slightly different MARC formats
during the 1980s and 1990s: the US version, USMARC,
and the Canadian version, CAN/MARC. In 1999, the two
versions were merged into a single version called
MARC 21.
• It is maintained by the Standards and Support Office at
the National Library of Canada, and the Network
Development and MARC Standards Office at the Library
of Congress.


MARC 21
• A MARC 21 format is a set of codes and content
designators defined for encoding machine-readable
records.

• Formats are defined for five (5) types of data:
1. bibliographic
2. holdings
3. authority
4. classification
5. community information

(American Library Association)


Advantages of using MARC as a common bibliographic
standard
• It avoids duplication of work and allows libraries to share
bibliographic data
• It enables libraries to acquire reliable cataloging data
• The MARC format is compact and therefore saves space.
• MARC is the standard format used by most library
computer programs and systems, and therefore:
• It enables libraries to use commercially available automated library systems
to manage their operations
• Libraries are able to benefit from the latest advances in computer
technology
• Libraries have the flexibility of replacing one system with another without
fear of incompatibility with their data
• There is easy communication and exchange of information


MARC 21 – Guidelines for managing and formatting electronic
records of different information resources
• MARC 21 Format for Bibliographic Data – contains
specifications for encoding data elements required to describe
and retrieve all forms of bibliographic data.
• MARC 21 Format for Holdings Data – contains specifications
for encoding data elements relevant to locations and holdings
information for all formats
• MARC 21 Format for Authority Data – contains format
specifications for encoding data elements relating to records
subject to authority control
• Authority control is the establishment and maintenance of consistent
forms of terms (names, subjects, and titles) to be used as headings
in the bibliographic records of the library catalog. Headings must not
only be consistent, they must also be unique.
• For example, two authors who happen to have published under the
same name can be distinguished from each other by adding
middle initials, birth and/or death dates


MARC 21 – Guidelines for managing and formatting electronic
records of different information resources
• MARC 21 Format for Classification Data – contains specifications
for encoding data elements relating to classification numbers.
• MARC 21 Format for Community Information – provides format
specifications for encoding records relating to information about
events, programmes, and services to enable their integration into
the OPAC
• Others are the MARC Code Lists for Languages, Countries, Geographic
Areas, Organizations, etc.


Some key terms of MARC
• A field – a part of a bibliographic record, such as the author, title, etc.
It is the term used to describe the various sections of
cataloging information. In MARC, the fields are identified by 3-digit
tags supplied by the system software

• A tag – the 3-digit number that identifies the field.


Components of a MARC record
Field tags and component descriptions
(xx stands for numeric values from 00 to 99)

• 0xx: Control fields
• 1xx: Main entries
• 2xx: Title, edition and imprint information
• 3xx: Physical description
• 4xx: Series statements
• 5xx: Notes
• 6xx: Subject access entries
• 7xx: Added entries and linking fields
• 8xx: Series added entries and holdings information
• 9xx: Fields for local use
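
Because the first digit of a tag determines its group, the list above can be turned into a small lookup. The following Python sketch is only an illustration; the dictionary and function names are hypothetical.

```python
# Groups taken from the list above, keyed by the first digit of the tag.
TAG_GROUPS = {
    "0": "Control fields",
    "1": "Main entries",
    "2": "Title, edition and imprint information",
    "3": "Physical description",
    "4": "Series statements",
    "5": "Notes",
    "6": "Subject access entries",
    "7": "Added entries and linking fields",
    "8": "Series added entries and holdings information",
    "9": "Fields for local use",
}


def describe_tag(tag: str) -> str:
    """Return the component group for a 3-digit MARC tag such as '245'."""
    if len(tag) != 3 or not tag.isdigit():
        raise ValueError("a MARC tag is a 3-digit number, e.g. '245'")
    return TAG_GROUPS[tag[0]]


for tag in ("100", "245", "650"):
    print(tag, "->", describe_tag(tag))
```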



Definitions of components
• Control fields (0XX) – hold bibliographic control
numbers and coded information used for processing
MARC records. The control number is assigned by the
organization creating, using, or distributing the record. For
example, 001-006 contain control numbers and coded
information about the date and time of processing and the type of
material, e.g., e-resources or books.

• Main entry fields (1XX) – used for storing information on
the main entry heading of a record, for example 100 –
personal name (NR: non-repeatable)


Definitions of components
• Title and title-related fields – store the title of the item and related
information. E.g., 210 = abbreviated title, 222 = key title, 245 = title
statement

• Edition, imprint, etc. fields (250-270) – store information on
edition, imprint, address, etc. Example: 250 = edition statement,
260 = publication, distribution, etc. (imprint)

• Physical description (3XX) – stores information on physical
characteristics, publication frequency, price, etc. E.g.,
310 = publication frequency


Comparison of the same record as textual information and with MARC
tags

Component | Data | MARC
Main entry, personal name with a single surname: The name: | Arnosky, Jim. | 100 1# $a
Title and statement of responsibility area, pick up title for a title added entry, file under "Ra...": Title proper: | Raccoons and ripe corn / | 245 10 $a
Statement of responsibility: | Jim Arnosky. | $c
Edition area: Edition statement: | 1st ed. | 250 ## $a
Publication, distribution, etc., area: Place of publication: | New York : | 260 ## $a
Name of publisher: | Lothrop, Lee & Shepard Books, | $b
Date of publication: | c1987. | $c
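
To show how the MARC column relates to the data column, here is a minimal Python sketch that reads one line of a tagged display such as `245 10 $a Raccoons and ripe corn / $c Jim Arnosky.` and splits it into tag, indicators and subfields. The single-line text layout and the function name are assumptions for illustration; real systems exchange MARC records in the ISO 2709 structure rather than as text lines, and this sketch would not cope with data that itself contains a dollar sign.

```python
import re


def parse_tagged_line(line: str):
    """Split 'TAG IND $a data $b data' into (tag, indicators, subfields)."""
    tag, indicators, rest = line.split(" ", 2)
    # Each subfield starts with '$' and a one-character code.
    pairs = re.findall(r"\$(\w)\s*([^$]*)", rest)
    subfields = [(code, value.strip()) for code, value in pairs]
    return tag, indicators, subfields


record_lines = [
    "100 1# $a Arnosky, Jim.",
    "245 10 $a Raccoons and ripe corn / $c Jim Arnosky.",
    "250 ## $a 1st ed.",
    "260 ## $a New York : $b Lothrop, Lee & Shepard Books, $c c1987.",
]
for line in record_lines:
    print(parse_tagged_line(line))
```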

Summary
• In this class session we have established the role of
standardised bibliographic formats in information search
and retrieval.

• We looked at some of the international standards, such as:
• UNIMARC
• UNISIST
• ISO 2709
• MARC 21


Activity 4.1
• Define, with examples, the remaining components of a
MARC record, i.e., the series statement, notes, subject access,
added entry, and series added entry fields.


References

Chowdhury, G. G. (2010). Introduction to modern information retrieval.
London: Facet Publishing.

Dunsire, G., Willer, M., & Perožić, P. (2013). Representation of the
UNIMARC bibliographic data format in resource description
framework. Proceedings of the International Conference on
Dublin Core and Metadata Applications, (Ddc), 179–189.

Library of Congress (n.d.). What is a MARC record, and why is it
important? Retrieved from
http://www.loc.gov/marc/umb/um01to06.html#part2


INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2022/2023)

Lecturer: Dr. Mrs. Florence O. Entsua-Mensah, DIS


Contact Information: fentsua-mensah@ug.edu.gh
Session Overview

• Information search is a critical component of the decision
process for most information users.
• The identification and development of information search
models have underpinned our understanding of AIR for
several decades, offering important guidance on how users
interact with their information environments.
• The information search model should reflect the processes
users undertake in searching an AIRS.

Dr. F. O. Entsua-Mensah (Mrs) Slide 2


Session Outline

The key topics to be covered in the session are:

• Topic One: The Searching Process
• Topic Two: Factors that affect the search process
• Topic Three: Search Strategies
• Topic Four: Techniques for AIR
• Topic Five: Search Engines


Recommended Reading
Rowley, J. (2015). The Changing Nature of Information Behaviour. In
Encyclopedia of Information Science and Technology (3rd ed.). IGI
Global. https://doi.org/10.4018/978-1-4666-5888-2.ch389

Kuhlthau, C. C. (2008). Seeking Meaning: A Process Approach to Library
and Information Services. Libraries Unlimited. Retrieved from
https://books.google.com.gh/books?id=feDgAAAAMAAJ

Xie, I. (2009). Information Searching and Search Models. In Encyclopedia
of Library and Information Sciences, Third Edition (pp. 2592–
2604). https://doi.org/10.1081/E-ELIS3-120043745


THE SEARCHING PROCESS

Topic One



What is Information Search?

• Information search is a process which people undertake to
locate or retrieve specific information to meet an information
need, typically, but not always, with the aid of a search
engine or other information retrieval system (Rowley, 2015).

• Information searching can be defined as users' purposive
behaviours in finding relevant or useful information in their
interactions with information retrieval (IR) systems (Xie,
2009).


Categories of Information Searching

• Information searching can be categorized into intermediary
information searching and end-user information searching.
• In intermediary searching, information professionals serve
as intermediaries between users and the IR system in the
search process,
• whereas in end-user searching, users directly search for
information themselves.

(Xie, 2009)


The Generic Information Search Process

Source: https://uva.libguides.com/searching_information



Cleverdon’s Searching Process

• A database may use a controlled or an uncontrolled
vocabulary.
• Cleverdon mentions that a user searching a database that
has controlled index languages must do the following:

(Cleverdon, 1988)


Cleverdon’s Searching Process – Cont’d

(Cleverdon, 1988)



An Information Search Model

• Search models are illustrations of patterns of information
searching and the search process.

• Some of the models also identify the factors that influence
the search process.

(Xie, 2009)


A MODEL OF THE INFORMATION SEARCH PROCESS (ISP)

1. Initiation, when a person first becomes aware of a lack of knowledge or
understanding, and feelings of uncertainty and apprehension are common.
2. Selection, when a general area, topic, or problem is identified, and initial
uncertainty often gives way to a brief sense of optimism and a readiness to begin
the search.
3. Exploration, when inconsistent, incompatible information is encountered and
uncertainty, confusion, and doubt frequently increase, and people find themselves
"in the dip" of confidence [self-doubt creeps in].
4. Formulation, when a focused perspective is formed, and uncertainty diminishes
as confidence begins to increase.
5. Collection, when information pertinent to the focused perspective is gathered
and uncertainty subsides as interest and involvement deepen.
6. Presentation, when the search is completed with a new understanding enabling
the person to explain his or her learning to others or in some way put the learning
to use.
(Kuhlthau, 2008)


Kuhlthau Model of the Search Process



Revised Kuhlthau Model of the Search Process

(Kuhlthau, 2004)



Factors Affecting the Search Process

Topic Two



Factors Affecting the Search Process

Information searching is affected by different types of factors,
of which four main types determine the selection and
application of different search strategies:
1. User goal and task
2. User knowledge structure
3. Design of IR systems
4. The social and organizational context.

(Xie, 2009)


Factors Affecting the Search Process – Cont’d
1. User goal and task
• The complexity of the task and the stages of the task play major roles in
influencing search strategies. Task complexity has systematic
relationships with the types of information, information channels
and sources needed.
• As the level of task complexity increases, more information
channels and resources are required (Byström, 2002).
2. User knowledge structure
• Three types of knowledge are required for effective information
searching:
(i) IR knowledge; (ii) domain knowledge; and (iii) system knowledge.


Factors Affecting the Search Process – Cont’d
3. Design of IR systems
• The design of IR systems affects users in their selection of
search strategies.
• Interfaces, computational mechanisms, and information objects are the
main components of IR systems that guide or impede users in their
application of different search strategies.
• At the same time, the availability or unavailability of certain features
determines whether users can engage in certain strategies.

4. The social and organizational context.
• The social-organizational context also defines the environment in which user–
system interactions take place.
• Mainly, the work environment influences how users determine their
search strategies in the search process.
• In addition, cultural dimensions shape how users interact with IR systems.

(Xie, 2009)

Search Strategies

Topic Three



Information Search Strategy Defined

• A search strategy is a plan for the whole search (Bates,
1979).
• A search strategy involves multiple dimensions, such as
tactics (moves made to further a search), intentions,
resources, methods, and so on (Xie, 2009).


Developing a search Strategy

The following can serve as a guide in designing a search
strategy:
• Identifying your information need
• Choosing search terms
• Searching with keywords
• Searching for exact phrases
• Using truncated and wildcard searches
• Searching with subject headings
• Using Boolean logic
• Citation searching


Search Strategies

• Search strategies map out a plan for the entire search
process.
• That is, they set out what the 'searcher' will do at each stage in the search
process.

• To execute the items in your search strategy, however, you
may need certain techniques, generally referred to as search
techniques or techniques for information searching.


Techniques for AIR

Topic 4



Techniques for searching (1)
• Keyword search
• Boolean search + Implied Boolean
• Natural language search
• Proximity operators
• Truncation/wildcards
• Subject headings – controlled vocabulary
• Thesaurus or Index



Techniques for searching(2)
Keyword search
• The standard way of retrieving information from any electronic
database, e.g. an online library catalog, periodical database or
Internet database.
• Keywords describe the topic of research
• Can be individual words or a phrase
• Choose significant words and come up with synonyms
• Keyword search is available in almost all databases


Techniques for searching(3)
Keyword search ……
• Many databases require an explicit description of the relationship
between keywords
• e.g., for the topic "alternative fuels being used in automobiles":
• Alternative fuels: electricity, ethanol, natural gas, hydrogen
fuel cells
• Automobiles: cars, vehicles, transportation,
motor vehicle


Techniques for searching(4)
Boolean search
• Boolean searching uses three common words as logical operators: AND, OR
and NOT
• It allows one to combine words and phrases to either limit or expand
the search
• OR - connector that allows either word to be present in each
record in results. Use OR to expand your search.
• Search term: hits
• Adolescents: 15,900,000 hits
• Teenagers: 34,600,000 hits
• Adolescents OR teenagers: 46,500,000 hits
• Either 'adolescents' or 'teenagers' (or both) will be present in each
record.


Techniques for searching(5)
Boolean …..
• AND - connector that requires both words to be present in
each record in results. Use AND to narrow your search.
• Adolescents AND teenagers: 7,390,000 hits
• NOT - connector that requires the first word to be present in each
record in results, but only if the record does not contain the second
word.
• Adolescents NOT teenagers: 27,100,000 hits
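
The three operators correspond to set union, intersection and difference over the records that contain each term. The Python sketch below uses a tiny, made-up index (record numbers are hypothetical) to show this; on a real collection the arithmetic is exact in the same way, whereas hit counts reported by web search engines may be estimates and need not add up.

```python
# Hypothetical index: each term maps to the set of record numbers containing it.
index = {
    "adolescents": {1, 2, 3},
    "teenagers": {3, 4, 5, 6},
}


def boolean_or(a: str, b: str) -> set:
    """Records containing either term (or both): expands the search."""
    return index[a] | index[b]


def boolean_and(a: str, b: str) -> set:
    """Records containing both terms: narrows the search."""
    return index[a] & index[b]


def boolean_not(a: str, b: str) -> set:
    """Records containing the first term but not the second."""
    return index[a] - index[b]


print("OR :", sorted(boolean_or("adolescents", "teenagers")))
print("AND:", sorted(boolean_and("adolescents", "teenagers")))
print("NOT:", sorted(boolean_not("adolescents", "teenagers")))
```

Note that the OR set is never smaller than either term's set, and the AND and NOT sets are never larger than the first term's set.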



Techniques for searching(6)
• OR - allows either word to be present in each record in
results.
• Use OR to expand your search.
• Example: adolescents OR teenagers
[Venn diagram: Adolescents, Teenagers]


Techniques for searching(7)

• AND - requires both words to be present in
each record in results.
• Use AND to narrow your search.
• Example: Adolescents AND teenagers
[Venn diagram: Adolescents, Teenagers]


Techniques for searching(8)
• NOT - requires the first word to be present in
each record in results, but only if the record does
not contain the second word.
• The operator NOT is also used to make a more
restrictive set.
• Example: Adolescents NOT teenagers
[Venn diagram: Adolescents, Teenagers]


Techniques for searching(9)
Implied Boolean
• Refers to a search in which symbols are used to represent
Boolean logical operators
• the plus sign (+) represents AND
• the minus sign (-) represents NOT
• no sign at all represents an OR relation.
• Example:
• +scientific_revolution +women_in_Europe
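
A rough Python sketch of how such a query could be read is given below. The function names are hypothetical, and the matching rule (all `+` terms required, no `-` term allowed, unsigned terms treated as optional alternatives) is one plausible interpretation; individual search tools may treat unsigned terms differently.

```python
def parse_implied_boolean(query: str) -> dict:
    """Sort query terms into required (+), excluded (-) and optional (no sign)."""
    parsed = {"required": [], "excluded": [], "optional": []}
    for token in query.split():
        if token.startswith("+") and len(token) > 1:
            parsed["required"].append(token[1:])
        elif token.startswith("-") and len(token) > 1:
            parsed["excluded"].append(token[1:])
        else:
            parsed["optional"].append(token)
    return parsed


def matches(record_terms: set, parsed: dict) -> bool:
    """Assumed rule: all required, none excluded, and at least one optional
    term when the query has no required terms."""
    if any(term not in record_terms for term in parsed["required"]):
        return False
    if any(term in record_terms for term in parsed["excluded"]):
        return False
    if not parsed["required"] and parsed["optional"]:
        return any(term in record_terms for term in parsed["optional"])
    return True


query = parse_implied_boolean("+scientific_revolution +women_in_Europe -America")
print(query)
print(matches({"scientific_revolution", "women_in_Europe"}, query))
```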



Techniques for searching(10)
Phrase search
• Enclosing a phrase in quotation marks helps ensure that the
database searches for those words as a group.
• The database then searches for those words together, in the specific
order you provided.
Phrases or combinations of words
• "scientific revolution" OR scientific_revolution
Predetermined language in a user fill-in template
• all of these words (=AND)
• any of these words (=OR)
• must not contain (=NOT)


Techniques for searching(11)
Truncation
• Shortening a word or eliminating some characters from a
longer term to pick up variants
• It acts as a form of the Boolean operator OR:
items that share a common sequence of characters, even if
they do not share all the same characters, are put into a single
set
• The process is also called 'wildcard' searching or stemming


Techniques for searching(12)
• Some databases allow certain symbols to be
used for searching different forms of a word
(such as plurals) or different spellings
• Check the help screens of the particular database
to determine the appropriate symbols to use.
• Symbols used in truncation:
• asterisk (*)
• question mark (?)
• colon (:)
• plus sign (+)


Techniques for searching(13)
• adolescen* retrieves adolescent, adolescents, or
adolescence (right truncation)
• teen##### would retrieve teens and teenager and teenagers
• *ship retrieves relationship, librarianship, friendship (left
truncation)
• wom#n retrieves woman or women (middle truncation)
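
The examples above can be reproduced with a short Python sketch that turns a truncation pattern into a regular expression. It assumes `*` stands for any number of characters and `#` for at most one character, which is consistent with the examples on this slide; databases differ, so check the help screens of the one you use. The function names and the word list are hypothetical.

```python
import re


def truncation_to_regex(pattern: str) -> re.Pattern:
    """Translate a truncation pattern into a compiled regular expression."""
    parts = []
    for char in pattern:
        if char == "*":
            parts.append(r"\w*")   # assumed: any number of characters
        elif char == "#":
            parts.append(r"\w?")   # assumed: zero or one character
        else:
            parts.append(re.escape(char))
    return re.compile("".join(parts), re.IGNORECASE)


def truncation_search(pattern: str, vocabulary: list) -> list:
    """Return the words in vocabulary that match the whole pattern."""
    regex = truncation_to_regex(pattern)
    return [word for word in vocabulary if regex.fullmatch(word)]


words = [
    "adolescent", "adolescents", "adolescence", "adult",
    "teens", "teenager", "teenagers",
    "relationship", "friendship", "shipping",
    "woman", "women",
]
for pattern in ("adolescen*", "teen#####", "*ship", "wom#n"):
    print(pattern, "->", truncation_search(pattern, words))
```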



Techniques for searching(14)
Natural language search
• Easiest to understand, but many databases don't offer it as
a function.
• Search using regular spoken language, such as English.
• Ask the database a question or type a sentence that describes
the information you are looking for
• The database then uses programmed logic to determine
keywords in the sentence by their position in the sentence.
• The Internet search service Ask.com offers natural
language searching.


Techniques for searching(15)

Proximity operators (closeness)


• Allow location of one word within a certain distance of another.
• The symbols generally used in this type of search are w and n.
• w represents the word "with(in)" and n represents the word "near."
• This type of search is not available in all databases.
• Near Operator (Nx): finds words within x number of words
from each other, regardless of the order in which they occur.
• Example: television n2 violence would find "television
violence" or "violence on television," but not "television may
be the culprit in recent high school violence."


Techniques for searching(16)
• Within Operator (Wx): finds words within x number of
words from each other, in the order they are entered in the search.
• Example: Franklin w2 Roosevelt would find
Franklin Roosevelt or Franklin Delano Roosevelt
or Franklin D. Roosevelt but would not find
Roosevelt Franklin.
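
Both proximity operators can be sketched in Python by comparing word positions. In this illustration "within x words" is taken to mean that the positions of the two words differ by at most x, which reproduces the examples on these slides; the function names are hypothetical and real databases may count distance differently.

```python
import re


def tokenize(text: str) -> list:
    """Lower-case the text and split it into words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def positions(tokens: list, term: str) -> list:
    return [i for i, token in enumerate(tokens) if token == term.lower()]


def near(text: str, first: str, second: str, x: int) -> bool:
    """Nx: the two words occur within x words of each other, in any order."""
    tokens = tokenize(text)
    return any(
        abs(i - j) <= x
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )


def within(text: str, first: str, second: str, x: int) -> bool:
    """Wx: the second word follows the first by at most x words."""
    tokens = tokenize(text)
    return any(
        0 < j - i <= x
        for i in positions(tokens, first)
        for j in positions(tokens, second)
    )


print(near("violence on television", "television", "violence", 2))
print(within("Franklin Delano Roosevelt", "Franklin", "Roosevelt", 2))
print(within("Roosevelt Franklin", "Franklin", "Roosevelt", 2))
```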



Techniques for searching(17)
Subject headings – controlled vocabulary
• Match the search subject or concept with the term used by the indexer
• Two basic lists of subject headings are consulted by reference
librarians:
• Library of Congress Subject Headings – lists the standard LC
subject headings in alphabetical order
• Sears List of Subject Headings – rough equivalent of the LC
subject headings for smaller libraries.


Techniques for searching(18)
• The results of an initial search may reveal that article citations have
subject headings or descriptors. Articles that have similar
content will have the same subject headings even if the authors of
the articles used different terms to describe the topic.
• For example, one author may use the phrase "capital
punishment" and another "death penalty."
• The subject heading on both records will be the same.
• Sometimes subject headings are hyperlinked
• and can link to other articles with similar content.


Techniques for searching(19)
Thesaurus or Index
• A list of subject headings used in a particular database is
often referred to as a thesaurus.
• Some databases have sophisticated thesauri that provide
cross-references.
• E.g., if you search for "death penalty," the thesaurus might ask you to use "capital
punishment" instead
• Some thesauri also include a description of the term, and a list of
broader, narrower, and related terms.
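
The cross-reference idea can be illustrated with a very small Python lookup. The single entry comes from the example above; the dictionary and function names are hypothetical, and a real thesaurus would hold many more entries along with broader, narrower and related terms.

```python
# "Use" references: non-preferred term -> preferred subject heading.
USE_REFERENCES = {
    "death penalty": "capital punishment",
}


def preferred_term(term: str) -> str:
    """Return the controlled-vocabulary heading to search with."""
    key = term.strip().lower()
    return USE_REFERENCES.get(key, key)


for entered in ("Death penalty", "capital punishment"):
    print(entered, "->", preferred_term(entered))
```

Both entries lead to the same heading, which is why records indexed under one subject heading are found regardless of the wording the searcher starts with.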



Search Engines
Topic Five



Introduction to Search Engines

• Search engines are the hub of modern (computerised) information
retrieval systems.

• Search engines connect a user's search with relevant
collections in a very intuitive manner.

• A (web) search engine is a computer program that searches
for and identifies items in a database that correspond to
keywords or characters specified by the user, used
especially for finding particular sites on the World Wide Web
(Vivekavardhan, 2018).


Importance of Search Engines

• With over 8 billion web pages available, it is impossible to
search manually through all these pages for the specific
information that is needed.

• Search engines filter the information that is on the internet
and transform it into results that each individual can easily
access and use in a matter of milliseconds.

(Khurana, 2014)


How does a search Engine Work?

• The search engine visits (every) web page that it can find and
builds an index, which is a list of tokens such as words that
are associated with pages (images, audio, video, etc.).
• The feature of the search engine that visits the pages is the CRAWLER.

• Bear in mind that standard/conventional search engines do
not access the DEEP WEB.
• Deep Web: The portion of the web composed of specialty
databases, such as those housed by the US government. It is also
called the invisible web.
• The part of the deep web that is deliberately hidden, and is often
associated with illegal content, is called the Dark Web.
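
The indexing step described at the top of this slide can be sketched in Python as an inverted index: a mapping from each token to the pages that contain it. The pages below are made-up text under example.org addresses, and no real crawling is done; an actual crawler would fetch pages over the network and follow their links.

```python
import re
from collections import defaultdict


def build_index(pages: dict) -> dict:
    """Map every word (token) to the set of page addresses containing it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for token in re.findall(r"[a-z0-9]+", text.lower()):
            index[token].add(url)
    return dict(index)


def lookup(index: dict, keyword: str) -> list:
    """Return the pages associated with a keyword, in sorted order."""
    return sorted(index.get(keyword.lower(), set()))


# Hypothetical pages that a crawler has already visited.
pages = {
    "https://example.org/a": "Information retrieval systems and search engines",
    "https://example.org/b": "Library catalogues support information search",
}
index = build_index(pages)
print(lookup(index, "information"))
print(lookup(index, "engines"))
```

Answering a query then means looking up tokens in the index rather than scanning every page, which is what makes results available so quickly.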



Examples of Search Engines

• Google
• Bing
• Yahoo
• Ask.com
• AOL.com
• Baidu
• DuckDuckGo
• Yandex



Summary

• In today's class, we have examined the search process from
a typical user's perspective, paying close attention to
Kuhlthau's Model of the Information Search Process.
• We also explored the various search techniques for an
effective search process,
• as well as the role of search engines in modern information
retrieval.


Activity 6.1 (Class Activity)
TECHNIQUES FOR SEARCHING
• Define your information need or formulate your topic.
For example: The involvement of women in the scientific
revolution of the 16th-18th centuries in Europe
• Identify keywords/concepts in the topic – articles and
prepositions are not keywords. For example: women, science,
Europe, 16th century


Activity 6.1 (Class Activity)
Techniques for searching (2)

• Find synonyms for keywords/concepts – broader
terms, narrower terms or related terms, spelling
variants, abbreviations, alternative words
• E.g.: Women: gender, female; Science: zoology, biology, medicine, etc.;
Europe: England, Italy, Germany, France, etc.; 16th century: 15th, 17th, 18th,
Enlightenment age

• Words and synonyms become search terms
• Use search tools (search engines, etc.)
• Combine search terms with Boolean operators
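
The last step can be sketched in Python: synonyms of the same concept are joined with OR, and the concept groups are joined with AND. The function name is hypothetical, and the exact query syntax (parentheses, quotation marks for phrases) is an assumption that varies between search tools.

```python
def build_query(concepts: dict) -> str:
    """Join synonyms with OR inside each concept and concepts with AND."""
    groups = []
    for keyword, synonyms in concepts.items():
        terms = [keyword] + list(synonyms)
        # Quote multi-word terms so they are searched as phrases.
        quoted = [f'"{term}"' if " " in term else term for term in terms]
        groups.append("(" + " OR ".join(quoted) + ")")
    return " AND ".join(groups)


concepts = {
    "women": ["gender", "female"],
    "science": ["biology", "medicine"],
    "Europe": ["England", "Italy"],
}
print(build_query(concepts))
```

OR widens each concept to catch alternative wording, while AND keeps only records that cover every concept in the topic.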



Activity 6.2 – Individual Assignment
Using the following search process, advise a student on how to
search for academic information:

• Step 1: Develop a Topic. Select Topic. Identify Keywords. ...

• Step 2: Locate Information. Search Strategy. Books, Journals...

• Step 3: Evaluate Information. Evaluate Sources. ...

• Step 4: Write. Organize / Take Notes. ...

• Step 5: Cite Sources. Citation Styles. ...

• Step 6: Legal / Ethical Use. Copyright.



Activity 6.3

• Watch the video from this URL: https://youtu.be/BNHR6IQJGZs



References
Rowley, J. (2015). The Changing Nature of Information Behaviour. In Encyclopedia
of Information Science and Technology (3rd ed.). IGI Global.
https://doi.org/10.4018/978-1-4666-5888-2.ch389
Kuhlthau, C. C. (2008). Seeking Meaning: A Process Approach to Library and
Information Services. Libraries Unlimited. Retrieved from
https://books.google.com.gh/books?id=feDgAAAAMAAJ
Kuhlthau, C. C. (1988). Developing a model of the library search process:
Investigation of cognitive and affective aspects. Reference Quarterly 28 (2),
pp.232-242.
Kuhlthau, C. C. (1991) Inside the search process: Information seeking from the
user’s perspective. Journal of the American Society for Information Science,
42 (5), pp. 361-371.
UTA Libraries (2018). Research Process. Available at:
https://libguides.uta.edu/researchprocess/write
Xie, I. (2009). Information Searching and Search Models. In Encyclopedia of Library
and Information Sciences, Third Edition (pp. 2592–2604).
https://doi.org/10.1081/E-ELIS3-120043745



INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2022/2023)

Lecturer: Dr. Mrs. Florence O. Entsua-Mensah, DIS


Contact Information: fentsua-mensah@ug.edu.gh
Session Overview

• The information retrieval process is largely shaped by its
associated information model.
• This session discusses the approaches to Retrieval
Models and their significance in an Automated
Information Retrieval System.

Session Outline

The key topics to be covered in the session are:

• Topic One: Information Retrieval Models
• Topic Two: Probabilistic Model
• Topic Three: Vector Space Model



Recommended Reading

Manning, C. D., Raghavan, P., & Schütze, H. (2009). An
Introduction to Information Retrieval. Cambridge:
Cambridge University Press.

Xie, I. (2008). Interactive Information Retrieval in Digital
Environments. Hershey, PA: IGI Global.

Understanding IR Models

Topic One



IR Models

• Every information system has, either explicitly or implicitly,
an associated theory of information access and a set of
assumptions that underlie that theory.

• A model is an embodiment of the theory in which we define
the set of objects about which assertions can be made and
restrict the ways in which classes of objects can interact.

(Turtle & Croft, 1992)


IR Models Defined

• An IR model specifies the representations used for
documents and information needs, and how they are
compared.
• For example, many retrieval models have assumed
representations based on binary or weighted index terms,
and comparison based on a similarity measure.
• In the past, less emphasis has been placed on models that
specify how representations should be extracted from
document and query texts, and which representations
produce the best performance.

(Turtle & Croft, 1992)


Approaches/Categories of IR Models

• Probabilistic retrieval model
• Vector Space processing model
• Best match searching & relevance feedback model
• Natural Language Processing model
• Hypertext model
• XML (eXtensible Markup Language) retrieval model

(Chowdhury, 2010)



Probabilistic Info. Retrieval Model
Topic Two



Probabilistic retrieval model

• Given only a query, an IR system has an uncertain
understanding of the information need.
• Given the query and document representations, a system
has an uncertain guess of whether a document has content
relevant to the information need. Probability theory provides
a principled foundation for such reasoning under uncertainty.
• This topic provides one answer as to how to exploit this
foundation to estimate how likely it is that a document is
relevant to an information need.

(Manning, Raghavan, & Schütze, 2009)



Why Probability in information Retrieval?

• Queries are representations of users’ information needs.
• Relevance is binary.
• Retrieval is inherently uncertain, since the needs of users are
vague in nature, i.e., they change with time.
• Probability deals with uncertainty: it provides a good estimate
of which documents to choose, and is hence more reliable.

(Thakkar, 2015)



An Overview of the Probabilistic Approach to Information
Retrieval
• Given a user information need (represented as a query) and
a collection of documents (transformed into document
representations), a system must determine how well the
documents satisfy the query
• An IR system has an uncertain understanding of the user query,
and makes an uncertain guess of whether a document satisfies the
query
• Probability theory provides a principled foundation for such
reasoning under uncertainty
• Probabilistic models exploit this foundation to estimate how likely it
is that a document is relevant to a query.

(Thakkar, 2015)
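
To make the idea of estimating relevance concrete, here is a minimal sketch in the spirit of the Binary Independence Model. It assumes no relevance judgements are available, so each query term gets the weight log((N - df + 0.5) / (df + 0.5)), where N is the number of documents and df is the number of documents containing the term. The function name `bim_scores`, the toy collection and the whitespace tokenisation are assumptions for illustration only.

```python
import math


def bim_scores(query, docs):
    """Score each document by summing the weights of the query terms it contains."""
    n_docs = len(docs)
    doc_terms = [set(d.lower().split()) for d in docs]  # binary term presence
    query_terms = set(query.lower().split())

    weights = {}
    for term in query_terms:
        df = sum(1 for terms in doc_terms if term in terms)
        # Rarer terms receive higher weights; 0.5 smooths zero counts.
        weights[term] = math.log((n_docs - df + 0.5) / (df + 0.5))

    return [sum(weights[t] for t in query_terms if t in terms) for terms in doc_terms]


docs = [
    "probabilistic retrieval model",
    "vector space model",
    "boolean retrieval",
    "hypertext model",
    "web search engines",
    "library catalogue",
]
scores = bim_scores("probabilistic retrieval", docs)

# Rank document positions from most to least likely to be relevant.
ranking = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
print(ranking[:2])
```

Documents are then presented in decreasing order of score, which is the ordering the probability ranking principle calls for.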



Types of Probabilistic Models

There is more than one possible retrieval model which has a
probabilistic basis. Examples include:
• Classical probabilistic retrieval model
• Probability ranking principle
• Binary Independence Model
• Bayesian networks for text retrieval
• Language model approach to IR

(Thakkar, 2015)



Vector Space Model
Topic Three



Vector Space Model

• In the vector space model, documents and queries are
represented as vectors in a k-dimensional hyperspace where
each dimension corresponds to a possible document feature.
• Vector elements may be binary-valued, but they are
generally taken to be weights that describe the degree to
which the corresponding feature describes the document or
query.



Vector Space Model

• A weight of 0 is taken to mean that the corresponding feature
does not describe a document or query, with weights greater
than 0 representing the degree to which the feature
describes the document.
• These weights are constrained to lie in some fixed interval,
say [0, 1], so that documents and queries represent points in
a k-dimensional hypercube.



Vector Space Model

• The matching function is then a distance metric that operates
on the document and query vectors.
• Given by:
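
One widely used matching function for this model is the cosine of the angle between the query vector and the document vector. The sketch below assumes that choice, together with raw term-frequency weights and whitespace tokenisation; none of these details are confirmed by the slide, and the helper names are hypothetical.

```python
import math
from collections import Counter


def tf_vector(text, vocabulary):
    """Represent a text as term-frequency weights over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "information retrieval systems",
    "vector space retrieval model",
    "digital libraries",
]
query = "vector space model"

# One dimension per distinct term in the collection and the query.
vocabulary = sorted(set(" ".join(docs + [query]).lower().split()))
query_vec = tf_vector(query, vocabulary)
sims = [cosine(tf_vector(d, vocabulary), query_vec) for d in docs]
print(sims)
```

A value near 1 means the document and query point in nearly the same direction in the term space; a value of 0 means they share no terms.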



Summary

This session discussed information retrieval models
in general, and zoomed in on two of the widely used
IR models: the probabilistic and vector space models.



Activity 7.1

• Compare the vector space model with the probabilistic
information retrieval model.



References

Manning, C. D., Raghavan, P., & Schütze, H. (2009). An
Introduction to Information Retrieval. Cambridge:
Cambridge University Press.



INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2021/2022)

Lecturer: Dr. (Mrs.) Florence O. Entsua-Mensah, DIS


Contact Information: fentsua-mensah@ug.edu.gh
Session Overview

• The introduction and growth of the Internet has brought
significant change in the way we access information.
• With the introduction of the WWW and its related
technologies such as HTML, documents in an IRS are
‘well’ linked or associated with each other.
• This session examines some of the issues regarding
web technology and the searching of information on the
web.

Session Outline

The key topics to be covered in the session are:

• Topic One: Understanding Online IRS
• Topic Two: Types of Online databases for IR
• Topic Three: The WWW and Web Information Retrieval



Recommended Reading

Chowdhury, G. G. (2010). Introduction to modern
information retrieval. London: Facet Publishing.
(Chapter 18)

Xie, I. (2009). Information Searching and Search Models. In
Encyclopedia of Library and Information Sciences,
Third Edition (pp. 2592–2604).

Understanding Online IRS

Topic One



Defining Online IRS

• Online IR systems can be characterized as IR systems that
allow remote access with searches conducted in real time
(Walker & Janes, 1999).

• Users generally search information from four types of online
IR systems:
1. Online databases,
2. Online public access catalogs (OPACs),
3. Digital libraries,
4. Web search engines.
(Xie, 2009)



Types of Online IRS

Topic Two



Online Databases

• Online databases consist of full-text documents or
citations and abstracts accessible via dial-up or other
Internet services (Xie, 2009).
• Online databases give users easy access to documents,
which can be accessed anytime and anywhere by
means of the Internet.
• However, they can be very expensive.



Online public access catalogs (OPACs)

• OPACs hold interrelated bibliographic data of collections of a
library that can be searched directly by end users (Xie,
2009).
• The OPAC is a very useful tool in libraries: it makes locating
library materials easy, saves time, energy and money, and
can be used to place reservations, read news/bulletins and
check borrowers’ records, among others.
• Some OPACs are criticized for not implementing the
“principle of least effort” for the user, i.e. some say they
are not user friendly.



Digital Libraries

• Digital libraries collect, organize, store and disseminate
electronic resources in a variety of formats (Xie, 2009).
• The availability of online access to digital libraries began in
the 1990s.
• Digital libraries allow users to search and use multimedia
documents and can be hosted by a variety of organizations
and agencies, either for the public or for a specific user
group.
• Digital libraries also pose challenges for end users, who must
interact with multimedia information in different interface
designs without the same support as in physical libraries.
(Xie, 2009)



Web Search Engines

• Web search engines allow users to mainly search for Web
materials.
• Also, many of the Web search engines offer users the
opportunity to search for multimedia information and
personalize their search engines.

• The emergence of the Web in the early 1990s enabled
millions of users to search for information without the
assistance of intermediaries.

(Xie, 2009)



Web Search Engines – Cont’d

• Now, Web search engines also extend their services to
full-text books and articles in addition to Web materials.
• The popularity of Web search engines influences the way
that users interact with other types of online IR systems.
• A drawback of a web search engine as an information
retrieval tool is that a search can return far too many
results for the user to check.

(Xie, 2009)
The WWW & Web Information Retrieval
Topic 3



The WWW/ The Web

• The web is referred to as a “massive collection of web pages
stored on millions of computers across the world that are linked
by the Internet” (Chowdhury, 2010, p. 381).

• It was created in 1989 by Tim Berners-Lee and his team of
scientists at the European Laboratory for Particle Physics in
Geneva.

• The web has grown exponentially from over 9 million websites in
2002 to over 1 billion in 2014. Today the number of indexed
pages is 4.71 billion (Chowdhury, 2010, p. 381).

Hyper Text Transfer Protocol (HTTP)

• HTTP was created to standardize communication
between the clients and servers used by the web.
• Mosaic, released in 1993 by the US National Center for
Supercomputing Applications, was the first widely used
graphical web browser.
• This was followed by Netscape Navigator and
Internet Explorer. Today there are several web browsers,
such as Firefox, Chrome and Safari.

(Chowdhury, 2010)
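
To give a feel for what "standardized communication" means, the sketch below builds the text of a simple HTTP/1.1 GET request and reads the status line of a sample response. It makes no network connection; the host `example.org`, the sample response and the helper names are assumptions for illustration.

```python
def build_get_request(host, path="/"):
    """Return the text a client would send to request a page over HTTP/1.1."""
    lines = [
        f"GET {path} HTTP/1.1",  # request line: method, resource, protocol version
        f"Host: {host}",         # header naming the server being addressed
        "Connection: close",
        "",                      # blank line marks the end of the headers
        "",
    ]
    return "\r\n".join(lines)


def parse_status_line(response_text):
    """Split the first line of a response into version, status code and reason."""
    status_line = response_text.split("\r\n", 1)[0]
    version, code, reason = status_line.split(" ", 2)
    return version, int(code), reason


request = build_get_request("example.org", "/index.html")

# A hand-written response of the kind a web server might return.
sample_response = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
status = parse_status_line(sample_response)
print(status)
```

Because every browser and server agrees on this message layout, any client can talk to any server.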



Nature of Web Information Retrieval

• Traditionally, documents in information retrieval systems are
not linked to each other.
• The documents in the classical information retrieval system are
stored in physically disjoint forms.

• Web documents, however, are connected, or can easily be
linked, to each other, mainly through hyperlinks.
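
The linked nature of web documents can be sketched by pulling the hyperlinks out of a few pages and recording which page points to which. The three tiny pages and the class name `LinkExtractor` below are made up for illustration; the parsing uses Python's standard `html.parser` module.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect the href value of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


# Hypothetical pages standing in for documents on the web.
pages = {
    "a.html": '<p>See <a href="b.html">B</a> and <a href="c.html">C</a>.</p>',
    "b.html": '<p>Back to <a href="a.html">A</a>.</p>',
    "c.html": "<p>No outgoing links.</p>",
}

# Map each page to the pages it links to.
link_graph = {name: extract_links(html) for name, html in pages.items()}
print(link_graph)
```

A traditional document collection has no such graph, which is one reason web retrieval can use link information that classical systems lack.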



Differences between traditional and web retrieval

Quality of information
• Quality of web information is uncertain since anyone can publish on
the web. Text retrieval systems comprise published information
resources with definite quality control.
Frequency of changes
• Web information changes frequently. Contents of text retrieval
systems are static and thus easy for a retrieval system to track
and retrieve.
Ownership
• Ownership of web resources varies: some are free, while others
require permission or access rights, posing a challenge to retrieval.
Differences between traditional and web retrieval

Distributed nature of the web
• Web resources are distributed on millions of computers
throughout the world with different architecture, software, and
standards.
• Text retrieval systems deal with a defined set of documents and
a specified set of standards, such as hardware, software, and
processing standards (e.g. MARC formats and OPAC).
Size and growth of the web
• The rapid growth of the web makes indexing and retrieval
complex and difficult.
• Traditional text retrieval systems are amenable to research and
testing for eventual handling of large volumes of data.
Differences between traditional and web retrieval
Distributed users
• Text retrieval systems know the nature, characteristics,
information needs and seeking behaviours of their users. Web
users are distributed and largely unknown, which poses a challenge
to the designer of a web information retrieval system.
Multiple languages
• The languages of both information resources and users are diverse,
posing a challenge. An ideal IRS must be able to retrieve required
information irrespective of the language of the query or the source
of information.
Resource requirements
• The astronomical size of the web makes it difficult for a web IRS to
run effectively and efficiently, or to be funded by a single body,
although the world desires a good IRS to access web information
resources.
Activity 8.1

• Make notes on issues and challenges of web information
retrieval. (READ: Chowdhury, 2010, pp. 385-386).
• Not more than one page.
• Times New Roman
• Font size: 12pt
• Left, right, top, bottom margins = 1 inch each
• Alignment: Justified
• Submission is on SAKAI LMS
• Deadline: Two weeks from now
References

Xie, I. (2009). Information Searching and Search Models. In
Encyclopedia of Library and Information Sciences, Third Edition
(pp. 2592–2604). https://doi.org/10.1081/E-ELIS3-120043745



INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2021/2022)

Session 9

USERS OF AIR SYSTEMS

Lecturer: Dr. (Mrs.) Florence O. Entsua-Mensah, DIS


Contact Information: fentsua-mensah@ug.edu.gh
Session Overview
• This session is two-pronged:
• First, we will seek to know the information user;
• Second, we will consider how an AIRS can be designed to
maximize a user's experience.
• In view of this, the session introduces the centrality of the
user/ patron /client, when we develop systems for the
organization of information.
• We will examine the various types of users, their information
needs, and their information seeking behavior.
• The aim in this session is simply to stimulate your thoughts
on users and the way users interact with information retrieval
systems.

Dr. F. O. Entsua-Mensah (Mrs) 28-Feb-23 Slide 2


Session Overview
• This session introduces the centrality of the user (also
known as patron, client, etc.) when we develop systems
to organize information.
• We will examine the various types of users, their
information needs, and their information seeking
behavior.



Session Outline

The key topics to be covered in the session are:

• Topic 1: The Concept of an Information User
• Topic 2: Types/Categories of Users of AIRS
• Topic 3: Users & their Information Need



Recommended Reading

Hearst, M. (2009). Search User Interfaces. Cambridge:
Cambridge University Press. (pp. 1 – 38)

Singh, S. (2015). Users and information use in academic
libraries. Acad. Libr. Syst.



Introduction

• The user is the focal point of all information retrieval
systems, because the sole objective of any IRS is to transfer
information from the source to the user (Chowdhury, 2010).

• The percentage of relevant documents accessed by the
user is, to a very significant extent, influenced by how well the
user is able to interact with the information system.

• Hence, AIR systems ought to be designed with the
appropriate interface to allow the user to efficiently interact
with the information system.



The Concept of an Information User
Topic 1



Who is an Information User?

• An individual who makes use of information in any way to
complete a task (IGI Global, 2018).
• A person who uses one or more of a library’s services at least
once a year (Kenneth Whittaker, as cited in Singh, 2015).
• A person who needs information which can be provided by
specific library services; or someone who is known to have
the intention of using certain information services from the
library (Singh, 2015).



Who is an Information User? - Cont’d

• The term "user" can refer to any person who interacts with
an information system to search for and select resources
he/she needs.

• When we use the term user, we often imagine a person
visiting a library for the resources he/she needs.
• These people can also be called end users, patrons, clients,
searchers, consumers, readers, etc.

(UNT, 2017)



Who is an Information User? - Cont’d

• A more expansive notion of "users" includes the people who
work in libraries or information centers.
• Whether they are reference librarians, catalogers, online
specialists, intermediaries, indexers, or system designers, they
interact with the library's organization system in doing their work.

(UNT, 2017)



Definition of an Information User

• The user can be defined as someone (or something!) that
interacts with an information organization or system with the
goal of finding information to solve a problem, answer a
question, or for other reasons (UNT, 2017).

• In the digital environment of the Web and the Internet, a user
may not be a person but a software program that comes to
the online catalog or library's website to search and select on
behalf of the person who sent this "robot" out looking for
information (UNT, 2017).

[UNT: University of North Texas]



The Information User – Cont’d.

• The type of information user is in fact dependent on the
nature of the information.
• Users may be limited by:
• The nature of their work/profession
• Sex
• Age
• Other forms of social groups
Types/Categories of Information Users
Topic 2



Types of Information Users

• Users of IRS fall into two major categories that are not
mutually exclusive:
1. Those who develop and evaluate IR systems and
services.
2. Those who consume them.
• The former are researchers and developers in disciplines
such as computing and information sciences, while the
latter are everyday users of the technology.

(Fernández-Luna, Huete, MacFarlane, & Efthimiadis, 2009)



Characteristics of users who consume the services of IRS

Actual users:
• those who are using the information service at a given time.

Potential Users:
• those who are not yet served by the information services.

Expected Users:
• those who not only have the privilege of using the information
service, but also have the intention of doing so.

Beneficiary users:
• those who have derived some benefit from the information service.
(Chowdhury, 2010)



Users and their information needs
Topic 3



Users and their information needs

• It is important to recognize that different users will have
different types of (information) needs. Users differ in their
specific tasks, their capabilities and experience, and their
information seeking behaviours.
• Users have different motivations for seeking information.
• As information providers, one of our primary aims is to
develop IRS that accommodate the divergent ways in
which users search for information, and that respond to the
way people use and interact with information and IRS.

(UNT, 2017)



Users & their information needs –Cont’d.

• People bring their own needs, personal characteristics,
goals, skills, and knowledge when seeking information.
• An elementary school student looking for information on stars will
likely have different expectations and needs than an astronomer.
Sometimes a person is looking for information that will help them
do something (e.g., completing a craft).
• At another time, that same user needs information to verify a
fact or solve a work-related problem.

(UNT, 2017)



Users & their information needs –Cont’d.

• Users come to information systems with a diversity of skills,
knowledge, expectations, and needs.

• Information providers need to understand the users of
information retrieval systems well enough to assist them, in
all their diversity (UNT, 2017).



Summary

• In this session, we have briefly discussed the concept of
users and their various categories.
• Since IRS are meant to be used, we should have a clear
understanding of our users.



Activity 9.1

• Who is an information user?

• Discuss some of the characteristics of the information user
and how they shape the design of a user interface.



References
Dumais, S., Cutrell, E., Cadiz, J. J., Jancke, G., Sarin, R., & Robbins, D. C.
(2003). Stuff I’ve Seen: A System for Personal Information Retrieval
and Re-Use. Retrieved from
https://www.microsoft.com/en-us/research/wp-content/uploads/2003/01/siscore-sigir2003-final.pdf

Hearst, M. (2009). Search User Interfaces. Cambridge: Cambridge
University Press.

Kuhlthau, C. C. (1999). Accommodating the user’s information search
process: Challenges for information retrieval system designers.
Bulletin of the American Society for Information Science, 25, pp. 12-16.

University of North Texas. (2017). Users and their information needs. In
Concepts of Information Organisation (pp. 1–2).



INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2022/2023)

Session 10:
USER SEARCH INTERFACE

Lecturer: Dr. (Mrs.) Florence O. Entsua-Mensah, DIS


Contact Information: fentsua-mensah@ug.edu.gh
Session Overview

• In this session, the aim is to stimulate the learner’s
thoughts about users and the way they interact with
information retrieval systems.
• We will study the design guidelines for AIR system user
interfaces.



Session Outline

The key topics to be covered in the session are:

• Topic 1: User Interface Design
• Topic 2: Usability of AIR System
• Topic 3: Design Guidelines for User Search Interface
• Topic 4: Evaluation of User Search Interface



Recommended Reading

Hearst, M. (2009). Search User Interfaces. Cambridge:
Cambridge University Press. (pp. 1 – 38)

Singh, S. (2015). Users and information use in academic
libraries. Acad. Libr. Syst.



User Interface Design
Topic 1



Why Study User Interface Design?

• The design of IR systems no doubt affects users in their
selections of search strategies (Xie, 2009).
• The purpose of the search interface is to aid users in the:
• expression of their information needs
• formulation of their queries
• understanding of their search results
• keeping track of the progress of their information seeking efforts

(Hearst, 2009)



Keeping the interface Simple

• Search is a means towards some other end, rather than a
goal in itself.
• When a person is looking for information, they are usually
engaged in some larger task, and do not want their flow of
thought interrupted by an intrusive interface.

• User interface design aims to make the system usable
(hence the need to study the concept of USABILITY).



Usability of AIR Systems
Topic 2



Usability

• An important quality of a user interface (UI)
is its usability.

• USABILITY is a term which refers to those
properties of the interface that determine
how easy it is to use.

(Hearst, 2009)



The Components of Usability

Learnability: How easy is it for users to accomplish basic tasks
the first time they encounter the interface?

Efficiency: How quickly can users accomplish their tasks after
they learn how to use the interface?

Memorability: After a period of non-use, how long does it take
users to re-establish proficiency?

Satisfaction: How pleasant or satisfying is it to use the interface?

(Shneiderman and Plaisant, 2004; Nielsen, 2003)



How are interfaces Designed in order to attain the
goals of Usability?

• In order to achieve this goal, studies in the field of
Human-Computer Interaction have suggested a design technique
called user-centred design.

• The goal of this design technique is to lead to the
development of usable designs.

(Hearst, 2009)



User Centred Design:

• Decisions are made based on responses obtained from
target users of the system. Designers are not to assume they
know what users need.
Steps in User-centred Interface Design:
1. Needs Assessment: Designers investigate who the users
are, what their goals are, and what tasks they have to
complete.
2. Task Analysis: Designers characterise which steps the
users need to take to complete their tasks, decide which
user goals they will attempt to address.
3. …
(Hearst, 2009)



Design Guidelines For Search Interface
Design

Topic 3



Guidelines for user Interface Design

These guidelines include:
1. Offer efficient and informative feedback
2. Balance user control with automated action
3. Reduce short-term memory load
4. Provide shortcuts
5. Reduce errors
6. Recognise the importance of small details
7. Recognise the importance of aesthetics in design

(Hearst, 2009)



Offer Efficient & Informative Feedback
• Because the search task is so cognitively intensive,
feedback about query formulation, about the reasons the
particular results were retrieved, and about next steps to be
taken is critically important.

• Important feedback indicators for search interfaces:
• Show search results immediately
• Show informative document surrogates (highlight
query terms)
• Allow sorting of results by various criteria
• Show query term suggestions
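
As a small illustration of highlighting query terms in a document surrogate, the sketch below wraps each query word found in a result snippet with bold markers. The function name `highlight` and the use of `<b>` tags are assumptions for illustration; a real interface would apply whatever styling its display layer provides.

```python
import re


def highlight(snippet, query, open_tag="<b>", close_tag="</b>"):
    """Wrap every whole-word occurrence of a query term in the snippet."""
    terms = [re.escape(t) for t in query.split() if t]
    if not terms:
        return snippet
    # Match any query term as a whole word, ignoring case.
    pattern = re.compile(r"\b(" + "|".join(terms) + r")\b", re.IGNORECASE)
    return pattern.sub(lambda m: f"{open_tag}{m.group(0)}{close_tag}", snippet)


surrogate = highlight("Information retrieval on the Web", "retrieval web")
print(surrogate)
```

Seeing their own words picked out in the results helps searchers judge quickly why each document was retrieved.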



[Figure: screenshot of a search results page, annotated to show results displayed immediately, search terms highlighted in bold text, informative document surrogates, and suggestions]
Show Query Term Suggestion - 1

[Figure: screenshot annotated to show query term suggestions]



Show Query Term Suggestion - 2



Balance user control with automated action
• Suggestions from the intelligent systems should be balanced
with users’ control of the query.

• In the design of search interfaces, there is a delicate balance
between clever but opaque operations that correctly
anticipate the searcher’s needs most of the time, and less
powerful or less effective designs that are, however, easily
understandable and give the user control over system
behaviour.



Balance user control with automated action - Cont’d.

• Two important types of search interface design decisions
in which the trade-off between opaque system control and
transparent user control must be considered are Results
Ordering and Query Transformations.



Balance user control with automated action - Cont’d.

Results Ordering

• Perhaps the most understandable and transparent way
to order search results is according to how recently
they appeared (Hearst, 2009).
• For some information collections, such as news,
chronological ordering can be preferred over rank
ordering (Hearst, 2009).
• When it comes to searching personal information,
most users prefer chronological order over ranked
order (Dumais et al., 2003).
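
The difference between the two orderings can be sketched with a handful of made-up results, each carrying a relevance score and a last-modified date. The field names, the sample data and the function `order_results` are assumptions for illustration.

```python
from datetime import date

# Hypothetical search results for a personal document collection.
results = [
    {"title": "Budget memo", "score": 0.91, "modified": date(2021, 3, 2)},
    {"title": "Meeting notes", "score": 0.40, "modified": date(2022, 11, 15)},
    {"title": "Draft report", "score": 0.75, "modified": date(2022, 1, 20)},
]


def order_results(results, by="rank"):
    """Order results by relevance score ("rank") or by recency ("date")."""
    if by == "rank":
        key = "score"
    elif by == "date":
        key = "modified"
    else:
        raise ValueError("by must be 'rank' or 'date'")
    return sorted(results, key=lambda r: r[key], reverse=True)


by_rank = [r["title"] for r in order_results(results, by="rank")]
by_date = [r["title"] for r in order_results(results, by="date")]
print(by_rank)
print(by_date)
```

Chronological order is easy for a user to predict and verify, whereas a rank order depends on a scoring method the user cannot see.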



Balance user control with automated action - Cont’d.

Query Transformation
• Some search engines make subtle changes to queries to improve
results. Query Transformation is usually opaque to the user.
• For example, Microsoft’s web search engine (i.e. Bing) automatically
converts words like “VS.” to “versus”. The lack of user control for this
feature is mitigated by the fact that this transformation nearly always
matches the searcher’s intention.
• Although query transformation could be a useful feature, it could
sometimes frustrate the user.
• For example, Google returns pages that contain people’s names for
which the middle initial is missing, even if the original query specifies
the middle initial. In this example, the transformed query (which the
user did not ask for) will frustrate a user who is trying to
distinguish between two persons with similar names.
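
One way to keep a query transformation from being fully opaque is to report what was changed and let the user switch it off. The sketch below assumes a tiny abbreviation table and a hypothetical `allow_rewrite` switch; it is not a description of how any particular search engine works.

```python
# Hypothetical abbreviation table used to rewrite query words.
ABBREVIATIONS = {"vs.": "versus", "vs": "versus"}


def transform_query(query, allow_rewrite=True):
    """Expand known abbreviations and report each change that was made."""
    if not allow_rewrite:
        return query, []
    changes = []
    words = []
    for word in query.split():
        replacement = ABBREVIATIONS.get(word.lower())
        if replacement:
            changes.append((word, replacement))
            words.append(replacement)
        else:
            words.append(word)
    return " ".join(words), changes


rewritten, changes = transform_query("cats VS. dogs")
untouched, no_changes = transform_query("cats VS. dogs", allow_rewrite=False)
print(rewritten, changes)
```

Returning the list of changes lets the interface show a message such as "searched for versus instead of VS.", which restores some user control.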



Reduce Short-term Memory Load

• The interface guideline “reduce the users’ memory load”
is very important for information-rich system interfaces.
• The main idea behind this heuristic is to show users
relevant information rather than require them to remember
or keep track of it.
• Among the several methods applicable to search
interfaces are:
• Suggest Search Actions in the entry form
• Support Simple History Mechanisms



Reduce short-term memory load – Cont’d

Suggest Search Actions in the entry form


• A useful interface trope that has arisen recently is, rather
than showing a blank entry form, the designer places text
within the entry form to indicate what actions will result from
using that form.



Reduce short-term memory load – Cont’d
Another example from JSTOR

This text is usually shown greyed out, to signal that it is intended to be replaced by the
user’s text. The text within the form disappears when the user clicks in the form.



Reduce short-term memory load – Cont’d

Support Simple History Mechanisms:

• Research has shown that people are highly likely to revisit
information they have viewed in the past and to re-issue
queries that they have written in the past.
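
A simple history mechanism can be sketched as a short list of recent queries, most recent first, with repeats moved to the front rather than stored twice. The class name `SearchHistory` and its size limit are assumptions for illustration.

```python
from collections import OrderedDict


class SearchHistory:
    """Remember the most recent distinct queries a user has issued."""

    def __init__(self, max_size=10):
        self.max_size = max_size
        self._queries = OrderedDict()  # insertion order: oldest first

    def record(self, query):
        query = query.strip()
        if not query:
            return
        # Re-issuing a query moves it to the most recent position.
        self._queries.pop(query, None)
        self._queries[query] = True
        while len(self._queries) > self.max_size:
            self._queries.popitem(last=False)  # drop the oldest query

    def recent(self, n=5):
        """Return up to n queries, most recent first."""
        return list(reversed(self._queries))[:n]


history = SearchHistory(max_size=3)
for q in ["opac", "digital libraries", "opac", "web search", "usability"]:
    history.record(q)
print(history.recent())
```

Showing this list next to the search box lets users re-run an earlier query without having to remember or retype it.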



Provide shortcuts

• This refers to providing alternative interface mechanisms for
practiced users of an interface.
• A classic example is keyboard shortcuts for menu items that
otherwise require pulling down and selecting from menus.
• Keyboard shortcuts can save time and effort when the user is
typing, as the shortcut removes the need to move the hands
away from the keyboard to the mouse.
• But there is a barrier to using shortcuts, as they require
memorization.



Reduce Errors

• The steps taken by interface designers to reduce the
likelihood of user errors tend to overlap with other guidelines.
• For example, providing accurate suggestions for typographic and
spelling errors.



Reduce Errors – Cont’d

Traditional heuristics to reduce errors include:

• Avoid empty result sets: Mechanisms to achieve this include
spelling correction and term correction. One may also use ‘query
previews’ to show how many documents will result if a particular
navigation step is taken.

• Address the vocabulary problem: Avoid the use of words that
the user does not recognise in navigational cues, in the menu
items, or in the search itself.
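
Spelling correction can be sketched with Python's standard `difflib` module. This is a minimal illustration under stated assumptions: the vocabulary list, the function name `suggest_terms` and the similarity cutoff of 0.75 are made up for the example, and a real system would draw its vocabulary from its own index.

```python
import difflib

# Hypothetical index vocabulary (an assumption for this sketch).
VOCABULARY = ["information", "retrieval", "precision", "recall", "evaluation"]


def suggest_terms(term, vocabulary=VOCABULARY, max_suggestions=3):
    """Return vocabulary terms that closely resemble a possibly misspelled term."""
    term = term.lower()
    if term in vocabulary:
        return [term]  # already a known term, nothing to correct
    # get_close_matches ranks candidates by similarity, best match first.
    return difflib.get_close_matches(term, vocabulary, n=max_suggestions, cutoff=0.75)


print(suggest_terms("retreival"))  # misspelled query term
print(suggest_terms("recall"))     # correctly spelled query term
```

Offering the suggested term before the search runs helps the user avoid an empty result set caused by a typing error.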



Recognize the Importance of Aesthetics in Design

• It is always nice to have an appealing user search interface,
but this should not get in the way of efficiency. Thus, the interface
should not be cluttered with decorations.

• A visually appealing website serves the good of the
user only when it does not affect the user in negative
ways, such as:
• Taking too long to load because of the many pictures
and animations packed onto the search page.



EVALUATING USER SEARCH INTERFACE

Topic 4



Evaluating User Search Interface (1)

• Studies show that ‘simple’ measures of precision and recall
are not sufficient to judge the quality of a user search
interface (Wilson, 2012).

• Besides, most evaluative methods look at whether or not the
search results are relevant; this does not say much about the
user’s interaction with the search interface.



Evaluating User Search Interface (2)

• In evaluating a user search interface, HCI user study methods
focus on how well the system enables the searchers to
complete a task.

• There is a growing consensus that the quality of a search
results set alone cannot be used as a measure for
evaluation.

• The user interface therefore has a strong bearing on usability.

(Wilson, 2012)



Evaluating User Search Interface (3)

• There are too many evaluative methods for user search
interfaces to enumerate here.

• One way of choosing an evaluation approach, provided
by the HCI community, is the DECIDE process.

• There are six parts of this process:

(Wilson, 2012)
The D-E-C-I-D-E Process (1)

• D: Determine the goal of the evaluation. What do you want to prove or examine?

• E: Explore the specific questions to be answered. Which element of the framework is
being affected? Which disciplines are involved?

• C: Choose an evaluation paradigm, such as systematic IR or empirical user studies.

• I: Identify the practical issues in performing such an evaluation, such as structured
tasks and appropriate datasets.

• D: Decide how to deal with any ethical issues. Ethical issues are particularly
important when dealing with humans.

• E: Evaluate, interpret, and present the data.

(Wilson, 2012)
The D-E-C-I-D-E Process (2)

• The DECIDE process begins with identifying where results
have innovated and which types of discipline are involved in
the search user interface feature that is being evaluated.
• Once an evaluation method is decided, it is recommended,
regardless of the evaluation approach, to perform a
pilot study to make sure that both the evaluation and the
evaluator are properly prepared.

(Wilson, 2012)
Summary

• The lecture session introduced the ideas and practices
surrounding user interface design in general, and
search interface design in particular.

• The lecture session acknowledged some of the difficulties
with search interface design and provided a set of
design guidelines tailored specifically to search user
interfaces.



Activity 10.1

• Discuss the requirements for a good IRS user search
interface design.
• What are the challenges associated with the design of a user
search interface for an IRS?
• What are stop-words? Explain how they are treated in a
user’s search query, and why they are treated as such.



References

Dumais, S., Cutrell, E., Cadiz, J. J., Jancke, G., Sarin, R., & Robbins,
D. C. (2003). Stuff I’ve Seen: A System for Personal Information
Retrieval and Re-Use. Retrieved from
https://www.microsoft.com/en-us/research/wp-content/uploads/2003/01/siscore-sigir2003-final.pdf

Hearst, M. (2009). Search User Interfaces. Cambridge: Cambridge
University Press.

Kuhlthau, C. C. (1999). Accommodating the user’s information search
process: Challenges for information retrieval system designers.
Bulletin of the American Society for Information Science, 25, pp.
12-16.

University of North Texas. (2017). Users and their information needs. In
Concepts of Information Organisation (pp. 1–2).
INFS 427: AUTOMATED INFORMATION RETRIEVAL
(1st Semester, 2020/2021)

Lecturer: Dr. Mrs. Florence O. Entsua-Mensah, DIS


Contact Information: fentsua-mensah@ug.edu.gh

College of Education
School of Continuing and Distance Education
2014/2015 – 2016/2017
Session Overview
• In recent years, the evaluation of Information Retrieval
Systems and of techniques for indexing, sorting, searching
and retrieving information has become increasingly
important (Saracevic, as cited in Kowalski, 2007).

• Consequently, this session discusses the various ways
in which an IRS may be evaluated.

Dr. (Mrs.) F. O. Entsua-Mensah Slide 2


Session Outline
The key topics to be covered in the session are:

• Topic 1: Understanding Evaluation
• Topic 2: Steps in Evaluation (as proposed by Lancaster)
• Topic 3: Criteria for evaluation (Cleverdon & Lancaster)



Reading List
Chowdhury, G. G. (2010). Introduction to modern information
retrieval. Facet Publishing.

Kowalski, G. J. (2007). Information retrieval systems: theory and
implementation (Vol. 1). Springer.

Wilson, M. L. (2012). Search User Interface Design. Morgan &
Claypool Publishers.



Topic One

UNDERSTANDING EVALUATION



What is evaluation?
• An evaluation is basically a judgement of worth; in
other words, we evaluate a system in order to
ascertain the level of its performance or its worth.
• Ascertaining the value or worth of something.
• Assigning a rating
• Assessing/judging/appraising the quality, ability, or
extent of significance of something.



What is evaluation? (2)
• Lancaster states that an IRS can be evaluated by
considering the following three (3) issues:
1. How well the system is satisfying its objectives, i.e. how
well it is satisfying the demands placed on it.
2. How efficiently it is satisfying its objectives.
3. Whether the system justifies its existence.
• IR evaluation can be conducted from two main
viewpoints:
▪ Managerial view – Management oriented
▪ User view – User Oriented



Reasons for Evaluating IRS
• To monitor system effectiveness.
• To aid in the selection of a system to procure.
• To provide input to cost benefit analysis of an
information system.
• To assess the query generation process for
improvements.
• To determine the effects of changes made to an
existing information system.
(Kowalski, 2007)



Goal of Evaluation of IR systems
• The purpose of evaluation of an IR system is to
measure its performance based on a given scale.
• Performance is measured by 2 basic parameters:
– Efficiency
– Effectiveness



Goal of Evaluation of IRS – Cont’d.

1) Efficiency – cost factors involved in meeting the stated objectives.
Can be determined by:
– Response time: average time taken by a system to respond to a
query
– User effort: time taken by a user to obtain the right information
– Financial expenditure: cost per search
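
The three efficiency indicators above could be computed from a search log along the following lines. This is a sketch under stated assumptions: the log entries, the field names `response_seconds` and `user_minutes`, and the total cost figure are invented for illustration.

```python
from statistics import mean

# Hypothetical log of four searches (all values are made up).
search_log = [
    {"response_seconds": 0.4, "user_minutes": 3.0},
    {"response_seconds": 0.6, "user_minutes": 5.0},
    {"response_seconds": 0.5, "user_minutes": 4.0},
    {"response_seconds": 0.5, "user_minutes": 4.0},
]
total_cost = 12.0  # hypothetical total expenditure for these searches

# Response time: average time the system takes to respond to a query.
average_response = mean(s["response_seconds"] for s in search_log)
# User effort: average time a user needs to obtain the right information.
average_user_effort = mean(s["user_minutes"] for s in search_log)
# Financial expenditure: cost per search.
cost_per_search = total_cost / len(search_log)

print(average_response, average_user_effort, cost_per_search)
```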



Effectiveness
Goal of Evaluation of IRS – Cont’d.

2) Effectiveness – the level up to which a given system attains its stated
objective.
– The extent to which relevant information is retrieved and non-relevant
information withheld.
– To measure IR effectiveness in the standard way, we need a test collection
consisting of three things:
a. A document collection {Does the content of the database/collection
match user interest? Are the contents of the collection relevant to the
user needs?}
b. A test suite of information needs expressible as queries {sample user
queries formulated from prospective/possible user information needs}
c. A set of relevance judgements, standardly a binary assessment of either
relevant or nonrelevant for each query-document pair. Other binary
sets include Yes/No; Good/Bad; Present/Absent; On/Off; etc.
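
The three parts of a test collection can be pictured as three small data structures. The sketch below is illustrative only: the document texts, the identifiers (`d1`, `q1`, and so on) and the helper name `relevant_documents` are assumptions, not taken from any real test collection.

```python
# a. Document collection (hypothetical documents).
documents = {
    "d1": "evaluation of information retrieval systems",
    "d2": "designing search user interfaces",
    "d3": "precision and recall in retrieval evaluation",
}

# b. Test suite of information needs expressed as queries (hypothetical).
queries = {"q1": "retrieval evaluation"}

# c. Binary relevance judgements for each query-document pair
#    (1 = relevant, 0 = nonrelevant).
judgements = {
    ("q1", "d1"): 1,
    ("q1", "d2"): 0,
    ("q1", "d3"): 1,
}


def relevant_documents(query_id):
    """Return the set of documents judged relevant to the given query."""
    return {doc for (q, doc), rel in judgements.items() if q == query_id and rel == 1}


# Every query-document pair should carry a judgement before measuring begins.
all_pairs_judged = all((q, d) in judgements for q in queries for d in documents)

print(relevant_documents("q1"), all_pairs_judged)
```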
Topic 2

STEPS FOR EVALUATION (BY LANCASTER)



Lancaster’s Steps for Evaluation
1. Designing the scope of evaluation
2. Designing the evaluation programme
3. Execution of the evaluation
4. Analysis and interpretation of results
5. Modifying the system based on the results

(Chowdhury, 2010)
1. Designing the scope of evaluation
• This is where a detailed plan is set to form the basis
for the rest of the program.
• Set of Objectives that the given study will meet.
• Set the purpose and scope.
• How the evaluation will be conducted.
– Whether laboratory type setup or a real-life situation.
• At what level will it be evaluated?
– Macro evaluation or Micro evaluation
• Probable constraints in terms of cost, staff time, etc.



2. Designing the evaluation programme
• Identify the points on which data are to be collected.

• Proposition of the methodology to be used.

• Detailed plan of action for data collection

• Draw up a plan for data manipulation and the
consequent drawing of conclusions.



3. Execution of the evaluation
• Meticulous implementation of the methodology by
evaluator to avoid bias or error
• Constant communication between designer &
evaluator about observations & possible re-design of
programme



4. Analysis and interpretation of results
• Interpretation of results based on the objectives.
• Suggestions for improvement based on findings.
• At this stage, the evaluator does a failure analysis to
justify the results and suggest improvements.
• The joint use of performance figures and failure
analysis should answer most of the questions
identified in the objective(s) of the evaluation.



5. Modifying the system based on the results

• Finally, the IRS is modified where needed, as per the
results of the evaluation study.



Topic 3

CRITERIA FOR EVALUATION



Proponents of Evaluation Methods
• A number of researchers in the field of information
retrieval have suggested ways in which an IRS may be
evaluated.

• Some of the prominent ones are from:
– Cleverdon
– Lancaster



Cleverdon’s Criteria for Evaluating IRS
Cleverdon (1966) proposed six criteria:
1. Recall – ability of the system to present all relevant items.
2. Precision – ability of the system to present only those items that are
relevant.
3. Time lag – average interval between query submission and response
to query.
4. Effort – the physical and intellectual effort expended by the user to
obtain an answer to his query.
5. Form of presentation of search output, which affects the ability of the
user to make use of the retrieved items.
6. Coverage of the collection – the extent to which the system includes
relevant matter.



Lancaster’s Criteria for Evaluating
1. Coverage of the system
2. Ability of the system to retrieve wanted items (i.e.
recall)
3. Ability of the system to avoid retrieval of unwanted
items (i.e. precision)
4. The response time of the system, and
5. The amount of effort required by the user



Summary
• In this session, the importance of evaluating
information retrieval systems was discussed.
• The session also presented some of the criteria for
evaluation of IRS, with reference to:
– Lancaster’s Steps for Evaluation; and
– Cleverdon’s Evaluation Criteria



Activity 11.1
• You have been selected as a member of an in-house
team developing an automated information retrieval
system. Advise the team on how to choose an
evaluation method for the user interface of the AIRS.

• Compare user experiences of ‘browsing’ and ‘searching’.

• Read and make notes on “Lancaster’s Steps of
Evaluation” from Chapter 13 of the recommended text:
Chowdhury, G. G. (2010). Introduction to modern
information retrieval. Facet Publishing.
References
Chowdhury, G. G. (2010). Introduction to modern information
retrieval. Facet Publishing.

Kowalski, G. J. (2007). Information retrieval systems: theory and
implementation (Vol. 1). Springer.

Wilson, M. L. (2012). Search User Interface Design. Morgan &
Claypool Publishers.



Session Overview
• This session continues the discussion on the
various ways in which an AIRS may be evaluated.

• It discusses in detail some of the core tenets of
evaluation, such as precision and recall.



Session Outline
The key topics to be covered in the session are:

• Topic 1: Precision and Recall as a Measure of Evaluation.
• Topic 2: Other Measures of Evaluation.



Reading List
Chowdhury, G. G. (2010). Introduction to modern information
retrieval. Facet Publishing.

Kowalski, G. J. (2007). Information retrieval systems: theory and
implementation (Vol. 1). Springer.

Wilson, M. L. (2012). Search User Interface Design. Morgan &
Claypool Publishers.



Topic 1

PRECISION & RECALL



Precision & Recall

• These are the building blocks of evaluation.

• The two are always reported together; they are
complementary.



Recall and precision
• Recall and precision are the most important evaluative criteria for
assessing the performance of an IRS.
• Recall – the extent to which the relevant items in the collection are
retrieved.
• An IR system is expected to retrieve relevant documents in
response to a query.
• However, in a large collection, only a proportion of the total
relevant documents are retrieved.
• The system’s performance is measured by the recall ratio:
Recall ratio = (No. of relevant items retrieved / Total no. of relevant items in the collection) × 100
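
As a worked illustration of the recall ratio, the sketch below treats the retrieved and relevant items as sets of document identifiers. The identifiers and the function name `recall_ratio` are assumptions made for the example.

```python
def recall_ratio(retrieved, relevant):
    """Percentage of the relevant items in the collection that were retrieved."""
    if not relevant:
        raise ValueError("recall is undefined when no item in the collection is relevant")
    relevant_retrieved = retrieved & relevant  # items that are both retrieved and relevant
    return 100 * len(relevant_retrieved) / len(relevant)


# Hypothetical search: 4 relevant items exist, 3 items were retrieved, 2 of them relevant.
relevant = {"d1", "d3", "d5", "d7"}
retrieved = {"d1", "d2", "d3"}
print(recall_ratio(retrieved, relevant))
```

Here the search found only two of the four relevant items, so recall is 2/4 × 100.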



Recall and precision
• Precision – the extent to which the system retrieves
relevant items and withholds non-relevant items.
• It is defined as the proportion of the retrieved items
that is relevant:
Precision ratio = (No. of relevant items retrieved / Total no. of items retrieved) × 100

An ideal system would achieve 100% recall and 100% precision, which is not
possible in practice.
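
The precision ratio can be illustrated in the same way. The sketch below uses a made-up collection of ten documents and a hypothetical function name `precision_ratio`; it also shows why retrieving everything is not a route to an ideal system.

```python
def precision_ratio(retrieved, relevant):
    """Percentage of the retrieved items that are relevant."""
    if not retrieved:
        raise ValueError("precision is undefined when nothing is retrieved")
    relevant_retrieved = retrieved & relevant
    return 100 * len(relevant_retrieved) / len(retrieved)


# Hypothetical collection of ten documents, four of which are relevant.
collection = {f"d{i}" for i in range(1, 11)}
relevant = {"d1", "d3", "d5", "d7"}

narrow_search = {"d1", "d2", "d3"}  # 3 items retrieved, 2 of them relevant
print(precision_ratio(narrow_search, relevant))

# Retrieving the whole collection finds every relevant item,
# but most of what the user receives is non-relevant.
print(precision_ratio(collection, relevant))
```

Retrieving all ten documents would give full recall, yet only four of the ten retrieved items are relevant, so precision drops. This tension is one reason the two measures are reported together.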



Watch Video on Precision & Recall



Building Blocks of AIRS Evaluation
All documents in the IR System

(Lavrenko, 2013)



Confusion Matrix/ Contingency Table

(Lavrenko, 2013)



Visualizing Recall & Precision

(Koehrsen, 2018)
Limitations of recall and precision
• Different users may want different levels of recall: a person
preparing a report on a given topic may prefer high recall.
Conversely, someone who needs to know just something about a
given topic may prefer low recall.
• Recall assumes that all relevant items have the same value,
but the value may be relative and vary from user to user,
and even from time to time with the same user.
– Both recall and precision rely on the relevance judgement of the user,
and this judgement may be subjective.
– A subjective view of relevance may also depend upon the user’s
knowledge of the content at the time of search.
• Therefore all pertinent items may be relevant, but not all
relevant items may be pertinent.
(Chowdhury, 2010)
Limitations of recall and precision – contd.

• The evaluative criteria are document based and
therefore measure only the performance of the
system in retrieving items that have been
predetermined to be relevant to the information
need.
• They do not consider how the information will be
used or whether the documents fulfil the
information need of the user.



Topic 2

OTHER MEASURES OF EVALUATION



Other measures of evaluation
• Efficient IR systems must be designed to
maximize recall and precision. The limitations of the
precision and recall evaluative criteria call for
additional measures of evaluation. These include:
– Fallout ratio
– Generality ratio
– Usability
– Cost



Other measures contd.
• Fallout ratio = the proportion of the non-relevant items in the
collection that are retrieved in a given search.
• Generality ratio = the proportion of relevant
documents in the collection for a given query.
• Usability – a measure that considers the interface,
expectations, experiences and skills of the user.
• Cost – includes time used for searching the system,
search algorithms, options for display of search
results, etc.



Summary of Retrieval measures
Symbol  Evaluation measure  Formula            Explanation
R       Recall              a/(a+c)            Proportion of relevant items retrieved
P       Precision           a/(a+b)            Proportion of retrieved items that are relevant
F       Fallout             b/(b+d)            Proportion of non-relevant items retrieved
G       Generality          (a+c)/(a+b+c+d)    Proportion of relevant items per query

[a = relevant documents retrieved] [b = non-relevant documents retrieved]
[c = relevant documents not retrieved] [d = non-relevant documents not retrieved]
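
The four formulas can be applied together to one contingency table. The sketch below is illustrative: the function name `retrieval_measures` and the counts are assumptions, with a, b, c and d taken to mean relevant retrieved, non-relevant retrieved, relevant not retrieved and non-relevant not retrieved, as the formulas require.

```python
def retrieval_measures(a, b, c, d):
    """Compute recall, precision, fallout and generality as proportions.

    a: relevant documents retrieved
    b: non-relevant documents retrieved
    c: relevant documents not retrieved
    d: non-relevant documents not retrieved
    """
    return {
        "recall": a / (a + c),
        "precision": a / (a + b),
        "fallout": b / (b + d),
        "generality": (a + c) / (a + b + c + d),
    }


# Hypothetical search over a collection of 100 documents.
measures = retrieval_measures(a=20, b=10, c=5, d=65)
for name, value in measures.items():
    print(f"{name}: {value:.2f}")
```

Multiplying each proportion by 100 gives the percentage form used for the recall and precision ratios.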
Summary
• In this session we went further to study the core
measures used in evaluating an AIRS, viz. Recall and
Precision.
• Bear in mind that classical evaluation
parameters such as Recall and Precision have been
deemed problematic when applied to modern-day
online information retrieval evaluation.
– Hence, experts propose RELATIVE RECALL, though that
has its own problems.



Activity 12.1 - Recall-precision Matrix for Search Activity

                Relevant   Not Relevant   Total
Retrieved           8            5
Not Retrieved      10            3
Total

Using the table above:
• Compute the row and column totals.
• Calculate the recall, precision, fallout and generality of
the search results.
• Explain each of the results obtained.



References
Chowdhury, G. G. (2010). Introduction to modern information
retrieval. Facet Publishing.

Kowalski, G. J. (2007). Information retrieval systems: theory and
implementation (Vol. 1). Springer.

Wilson, M. L. (2012). Search User Interface Design. Morgan &
Claypool Publishers.

