Professional Documents
Culture Documents
4. How to present the search results in a format that facilitate the user in
determining relevant items
1.2 Objectives of Information Retrieval Systems
• Relevant items are those documents that contain information that helps
the searcher in answering his question.
• Non-relevant items are those items that do not provide any directly useful
information.
1. Item normalization,
2. Selective dissemination of information (i.e., “mail”),
3. Archival document database search,
4. An index database search along with the Automatic file build
process that supports index files.
1.3 Functional Overview :
1.3.1 Item Normalization:
• A typical item is sub-divided into zones, Title, Author, Abstract, Main Text,
Conclusion, References.
• A major limitation to the user is the size of the display screen which
constrains the number of items that are visible for review.
• Quite often the user will only display zones such as the Title or Title mid
Abstract. This allows multiple items to be displayed per screen.
• The user can expand those items of potential interest to see the complete
text.
1.3.1 Item Normalization
• 1.3.1.3 Identify Processing Tokens :
• Identify the information that are used in the search process –
Processing Tokens (Better than Words)
• The first step in identification of a processing token consists of
determining a word.
• Systems determine words by dividing input symbols into three classes
• Valid word symbols: Alphabetic characters, numbers.
• Inter-word symbols: Blanks, periods, semicolons (non searchable)
• Special processing symbols: hyphen (-)
• A word is defined as a contiguous set of word symbols bounded by
inter-word symbols.
1.3.1 Item Normalization
1.3.1.4 Stop Algorithm:
• a Stop List/Algorithm is applied to the list of potential processing tokens.
• The objective of the Stop function is to save system resources by
eliminating from the set of searchable processing tokens those that have
little value to the system
• Eliminating these words saves on storage and access structure complexities
• Any word found in almost every item
• Any word only found once or twice in the database
• The rank-frequency law of Ziph is
• Frequency * Rank = Constant
1.3.1 Item Normalization
1.3.1 Item Normalization
1.3.1.5 Characterize Tokens
• The next step in finalizing on processing tokens is identification of any
• specific word characteristics.
• The characteristic is used in systems to assist in disambiguation of a
particular word.
• Analysis of the processing token's part of speech is included here.
• Uppercase – proper names, acronyms, and organization Numbers and
dates
1.3.1 Item Normalization
1.3.1.6 Stemming Algorithm
• Normalize the token to a standard semantic representation
• For example, the system must keep singular, plural, past tense,
possessive, etc. as separate searchable tokens and potentially expand
a term at search time to all its possible representations
• Computer, Compute, Computers, Computing
• Comput
• Reduce the number of unique words the system has to contain.
1.3.1 Item Normalization
• 1.3.1.7 Create Searchable Data Structure:
• Based upon the stemmmg algorithm, they are used as updates to the
searchable data structure.
• The searchable data structure is the internal representation
• This structure contains
1. The semantic concepts that represent the items in the database
2. limits what a user can find as a result of their search.
1.3.2 Selective Dissemination of Information
• Provides the capability to dynamically compare newly received items
in the information system against standing statements of interest of
users and deliver the item to those users whose statement of interest
matches the contents of the items.
• It Consist of ,
1. Search process
2. User statements of interest (Profile)
3. User mail file.
• As each item is received, it is processed against every user’s profile
When the search statement is satisfied, the item is placed in the mail
file.
• A profile contains a typically broad search statement along with a list
of user mail files that will receive the document if the search
statement in the profile is satisfied
1.3.3 Document Database Search :
• Provides the capability for a query to search against all items received
by the system.
• Composed of the
1. Search process,
2. User entered queries
3. Document database
• Document database contains all items that have been received,
processed and store by the system.
• Usually items in the Document DB do not change May be partitioned
by time and allow for archiving by the Time partitions.
• The Document Database can be very large, hundreds of millions of
items or more.
1.3.4 Index Database Search:
• When an item is determined to be of interest, a user may want to
save it (file it) for future reference Accomplished via the index
process.
• In the index process, the user can logically store an item in a file along
with additional index terms and descriptive text the user wants to
associate with the item.
• The Index Database Search Process provides the capability to create
indexes and search them
• The user may search the index and retrieve the index and/or the
document it references
1.3.4 Index Database Search:
Two classes of index files:
1. Public
2. Private index files
• Every user can have one or more private index files leading to a very
large number of files
• Each private index file references only a small subset of the total
number of items in the Document database
• Public index files are maintained by professional library services and
typically index every item in the Document database
• Private Index files typically have very limited access lists.
1.3.5 Multimedia Database Search
• The multi-media data is not logically its own data structure, but an
augmentation to the existing structures in the information retrieval
system.
• The correlation between the multi-media and the textual domains
will be either via time or positional synchronization
• Time synchronization is the example of transcribed text from audio or
composite video sources.
• Positional synchronization is where the multi-media is localized by a
hyperlink in a textual item
1.4 Relationship to Database Management Systems :
• There are two major categories of systems available to process items:
1. Information Retrieval Systems
2. Data Base Management Systems (DBMS).
• An Information Retrieval System is software that has the features and
functions required to manipulate “information” items versus a DBMS that
is optimized to handle “structured” data.
• With structured data a user enters a specific request and the results
returned provide the user with the desired information
• The results are frequently tabulated and presented in a report format for
ease of use.
• In contrast, a search of “information” items has a high probability of not
finding all the items a user is looking for.
1.4 Relationship to Database Management Systems :
• The user has to refine his search to locate additional items of interest.
This process is called “iterative search “.
• The integration of DBMS’s and Information Retrieval Systems is very
important.
• Commercial database companies have already integrated the two
types of systems.
• One of the first commercial databases to integrate the two systems
into a single view is the INQUIRE DBMS .
1.5 Digital Libraries and Data Warehouses
1. Search Capabilities
2. Browse Capabilities
3. Miscellaneous Capabilities
• Search and browse capabilities are crucial to assist the user in locating
relevant items.
• The search capabilities address both Boolean and Natural Language
queries.
• The majority of existing commercial systems are based upon Boolean
query and search capabilities
1.6 Information Retrieval Systems Capabilities :
• 2.1 Search Capabilities:
• 2.2.3 Highlighting
• Highlighting has always been useful in Boolean systems to indicate the
cause of the retrieval.
• This is because of the direct mapping between the terms in the search and
the terms in the item
• In a ranking system different terms can contribute to different degrees to
the decision to retrieve an item.
• The highlighting may vary by introducing colors and intensities to indicate
the relative importance of a particular word in the item in the decision to
retrieve the item.
• Information visualization appears to be a better display process to assist in
helping the user formulate his query than highlights in items.
2.3 Miscellaneous Capabilities
• 2.3.1 Vocabulary Browse
• 2.3.2 lterative Search and Search History Log
• 2.3.3 Canned Query
2.3.1 Vocabulary Browse
• Vocabulary Browse provides the capability to display in alphabetical
sorted order words from the document database.
• Logically, all unique words in the database are kept in sorted order
along with a count of the number of unique items in which the word
is found.
• The user can enter a word or word fragment and the system will
begin to display file dictionary around the entered text
• The user can determine that entering the search term "compul*" in
effect is searching for "compulsion" or "compulsive" or“ compulsory.“
• The system indicates what word fragment the user entered and then
alphabetically displays other words found in the database in collating
sequence on either side of file entered term
2.3.1 Vocabulary Browse
2.3.2 Iterative Search and Search History Log
• Rather than typing in a complete new query, the results of the
previous search can be used as a constraining list to create a new
query that is applied against it.
• This has the same effect as taking the original query and adding
additional search statement against it in an AND condition.
• This process of refining the results of a previous search to focus on
relevant items is called iterative search.
• To facilitate locating previous searches as starting points for new
searches, search history logs are available.
• The search history log is the capability to display all the previous
searches that were executed during the current session.
2.3.3 Canned Query
• The capability to name a query and store it to be retrieved and
executed during a later user session is called canned or stored queries
• A canned query allows a user to create and refine a search that
focuses on the user's general area of interest one time and then
retrieve it to add additional search criteria to retrieve data that is
currently needed.
• Queries that start with a canned query are significantly larger than ad
hoc queries.
• Canned query features also allow for variables to be inserted into the
query and bound to specific values at execution time.
2.4 Z39.50 and WAIS Standards
• NISO is the only organization accredited by ANSI to approve and
maintain standards for information services, libraries and publishers.
• This standard is one of many standards they generated to support
interconnection of computer systems.
• American National Standards Institute/National Information
Standards Organization (ANSI/NISO) in generating its Z39.50,
Information Retrieval Application Service Definition and Protocol
Specification.
• In addition to the formal standard, a second de facto standard is the
WAIS standard based upon its usage in the INTERNET.
Z39.50
• The Z39.50 standard does not specify an implementation, but the
capabilities within an application (Application Service) and the protocol
used to communicate between applications
• Its objective is to overcome different system incompatibilities associated
with multiple database searching
• The client is identified as the "Origin“ and performs the communications
functions relating to initiating a search, translation of the query into a
standardized format, sending a query, and requesting return records.
• The server is identified as the "Target" and interfaces to the database at
the remote responding to requests from the Origin.
• This makes the dissimilarities of different database systems transparent to
the user and facilitates issuing one query against multiple databases at
different sites returning to the user a single integrated Hit file.
WAIS
• WAIS was developed by a project started in 1989 by three commercial
companies (Apple, Thinking Machines, andDow Jones).
• Wide Area Information Service (WAIS) is the de facto standard for many
search environments on the INTERNET.
• The original idea was to create a program that would act as a personal
librarian.
• It would act like a personal agent keeping track of significant amounts of
data and filtering it for the information most relevant to the user.
• The interface concept was user entered natural language statements of
topics the user had interest.
• In addition it provided the capability to communicate to the computer that
a particular item was of interest and have the computer automatically find
similar items