INFORMATION RETRIEVAL SYSTEMS (IRS)
UNIT-5 SYLLABUS
Text Search Algorithms:
⚫ Introduction
⚫ Software text search algorithms
⚫ Hardware text search systems
Information System Evaluation:
⚫ Introduction
⚫ Measures used in system evaluation
⚫ Measurement example – TREC results.
OVERVIEW
Three classical techniques have been defined for
organizing items in a textual database, for rapidly
identifying the relevant items, and for eliminating
items that do not satisfy the search:
1. Full text scanning (streaming)
2. Word inversion and
3. Multi-attribute retrieval
BAEZA-YATES AND GONNET APPROACH
This approach handles "don't care" symbols and
complement symbols. The search also handles the
cases of up to k mismatches.
Their approach uses a vector of m different
states, where m is the length of the search
pattern, and state i gives the state of the search
between the positions 1…i of the pattern
and positions (j - i + 1)…j of the text, where j
is the current position in the text.
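This per-position state vector is the basis of the bit-parallel shift-and (bitap) family of algorithms. A minimal Python sketch of the exact-match case follows; the "don't care" and k-mismatch extensions add a mask per symbol class and one state vector per allowed error, which are omitted here:

```python
def shift_and(text, pattern):
    """Exact-match shift-and: bit i of `state` is set when
    pattern[0..i] matches the text ending at the current position."""
    m = len(pattern)
    # Precompute one bitmask per character: bit i set if pattern[i] == c.
    mask = {}
    for i, c in enumerate(pattern):
        mask[c] = mask.get(c, 0) | (1 << i)
    state = 0
    hits = []
    for j, c in enumerate(text):
        # Shift in a fresh "match may start here" bit, then keep only
        # the bits consistent with the current text character.
        state = ((state << 1) | 1) & mask.get(c, 0)
        if state & (1 << (m - 1)):
            hits.append(j - m + 1)   # start position of a full match
    return hits
```

For example, `shift_and("abracadabra", "abra")` reports the overlapping starts 0 and 7 in one left-to-right pass, which is the streaming behaviour the text describes.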
HARDWARE TEXT SEARCH SYSTEMS
Because the searcher is hardware based, scalability is
achieved by increasing the number of hardware search
devices. The only limit on speed is the time it takes
to move the text off secondary storage (i.e., disk
drives) to the searchers.
By having one search machine per disk, the
maximum time it takes to search a database of any
size is the time to search one disk.
ADVANTAGES
One advantage is that the disks can be formatted to
optimize the data flow off of the drives.
Another major advantage of using a hardware text
search unit is in the elimination of the index that
represents the document database.
APPROACH-1
Specialized hardware that interfaces with computers
and is used to search secondary storage devices was
developed from the early 1970s, with the most recent
product being the Paracel Searcher (previously the
Fast Data Finder).
The need for this hardware was driven by the limits in
computer resources. The typical hardware
configuration is shown in the figure (dashed box). The
speed of search is then based on the speed of the I/O.
One of the earliest hardware text string search units
was the Rapid Search Machine developed by General
Electric (GE).
The machine consisted of a special purpose search
unit in which a single query was passed against a
magnetic tape containing the documents.
A more sophisticated search unit was developed by
Operating Systems Inc. called the Associative File
Processor (AFP).
It is capable of searching against multiple queries at
the same time. Following that initial development,
OSI, using a different approach, developed the High
Speed Text Search (HSTS) machine.
It uses an algorithm similar to the Aho-Corasick
software finite state machine algorithm except that it
runs three parallel state machines.
One state machine is dedicated to contiguous word
phrases, another to embedded term matches, and the
final one to exact word matches. In parallel with that
development effort, GE redesigned their Rapid Search
Machine into the GESCAN unit.
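The Aho-Corasick behaviour these machines mirror can be sketched in software. The minimal implementation below matches many terms in a single pass over the text; the HSTS effectively ran several such machines concurrently in hardware:

```python
from collections import deque

class AhoCorasick:
    """Minimal Aho-Corasick automaton: finds all occurrences of a
    set of terms in one left-to-right scan of the text."""

    def __init__(self, terms):
        self.goto = [{}]     # state -> {char: next state}
        self.fail = [0]      # failure link per state
        self.out = [[]]      # terms that end at each state
        for term in terms:
            self._insert(term)
        self._build_failure_links()

    def _insert(self, term):
        s = 0
        for c in term:
            if c not in self.goto[s]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[s][c] = len(self.goto) - 1
            s = self.goto[s][c]
        self.out[s].append(term)

    def _build_failure_links(self):
        q = deque(self.goto[0].values())   # depth-1 states fail to the root
        while q:
            s = q.popleft()
            for c, t in self.goto[s].items():
                q.append(t)
                f = self.fail[s]
                while f and c not in self.goto[f]:
                    f = self.fail[f]
                self.fail[t] = self.goto[f].get(c, 0)
                # Inherit outputs reachable through the failure link.
                self.out[t] += self.out[self.fail[t]]

    def search(self, text):
        s, hits = 0, []
        for j, c in enumerate(text):
            while s and c not in self.goto[s]:
                s = self.fail[s]
            s = self.goto[s].get(c, 0)
            for term in self.out[s]:
                hits.append((j - len(term) + 1, term))
        return hits
```

With the classic term set {"he", "she", "his", "hers"}, searching "ushers" reports "she", "he", and "hers" in a single scan, never backing up in the text stream.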
TRW, based upon analysis of the HSTS, decided to
develop its own text search unit. This became the Fast
Data Finder, now marketed by Paracel.
All of these machines were based upon state machines
that input the text stream and compare it to the
query terms.
The GESCAN system uses a text array processor
(TAP) that simultaneously matches many terms and
conditions against a given text stream.
The TAP receives the query information from the
user's computer and directly accesses the textual
data from secondary storage.
The TAP consists of a large cache memory and an
array of four to 128 query processors.
Each query processor is independent and can be
loaded at any time. A complete query is handled by
each query processor. Queries support exact term
matches, fixed length don't cares, variable length
"don't cares," terms restricted to specified zones,
Boolean logic, and proximity.
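The fixed- and variable-length "don't care" operators can be modeled in software by translating a query term into a regular expression. The '?'/'*' syntax below is purely illustrative, not the TAP's actual query language:

```python
import re

def query_to_regex(term):
    """Translate an illustrative don't-care syntax into a regex:
    '?' = fixed-length don't care (exactly one arbitrary character),
    '*' = variable-length don't care (any number of characters)."""
    parts = []
    for c in term:
        if c == "?":
            parts.append(".")      # exactly one arbitrary character
        elif c == "*":
            parts.append(".*?")    # any run of characters (non-greedy)
        else:
            parts.append(re.escape(c))
    return re.compile("".join(parts))
```

Under this sketch, `col?r` matches "color" but not "colour" (the don't care is exactly one character long), while `data*base` matches both "database" and "data base".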
A query processor performs two operations in parallel:
⚫ Matching query terms to input text
Term matching is performed by a series of
character cells, each containing one character of
the query. A string of character cells is
implemented on the same LSI chip, and the chips
can be connected in series for longer strings.
When a word or phrase of the query is matched, a
signal is sent to the resolution sub-process on the
LSI chip.
⚫ Boolean logic resolution
The resolution chip is responsible for resolving the
Boolean logic and proximity conditions between the
matched terms.
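The chain of character cells described under term matching can be modeled in software. In this sketch each cell is a one-character comparator whose output feeds the next cell on every clock (i.e., on every input character); the hardware does the same comparisons in parallel, while the model updates them in a loop:

```python
def stream_match(text, term):
    """Software model of a chain of character cells: cell i stores
    term[i-1] and raises its output when the preceding cell was
    raised on the previous character AND the streaming character
    equals its stored character."""
    n = len(term)
    # signal[0] is always high: a match may begin at any position.
    signal = [True] + [False] * n
    matches = []
    for j, c in enumerate(text):
        # Update back-to-front so each cell reads the previous
        # clock's value from the cell before it.
        for i in range(n, 0, -1):
            signal[i] = signal[i - 1] and term[i - 1] == c
        if signal[n]:                    # last cell fired: full match
            matches.append(j - n + 1)    # start position of the match
    return matches
```

The key property is that of the hardware unit: the text flows past once, at a rate independent of how many cells (query characters) are wired in series.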
(Figure: evaluation loop — the user's information need drives a search, the results are reviewed, and the evaluation feeds back to the user.)
There are many reasons to evaluate the
effectiveness of an Information Retrieval System
(Belkin-93, Callan-93):
⚫ To aid in the selection of a system to procure
⚫ To monitor and evaluate system effectiveness
⚫ To evaluate the query generation process for improvements
⚫ To provide inputs to cost-benefit analysis of an information system
⚫ To determine the effects of changes made to an existing information system
EVALUATION CRITERIA
1. Information View
The information view is subjective in nature and
pertains to human judgment of the conceptual
relatedness between an item and the search.
It involves the user's personal judgment of the
relevancy (aboutness) of the item to the user's
information need.
When reference experts (librarians, researchers,
subject specialists, indexers) assist the user, it is
assumed they can reasonably predict whether certain
information will satisfy the user's needs.
The information view can be divided into four types of "aboutness":
1. Author Aboutness - determined by the author's
language as matched by the system in natural
language retrieval
2. Indexer Aboutness - determined by the indexer's
transformation of the author's natural language
into a controlled vocabulary
3. Request Aboutness - determined by the user's or
intermediary's processing of a search statement
into a query
4. User Aboutness - determined by the indexer's
attempt to represent the document according to
user’s needs.
2. System View:
The system view relates to a match between query
terms and terms within an item. It can be objectively
observed, manipulated and tested without relying on
human judgment because it uses metrics associated
with the matching of the query to the item (Barry-94,
Schamber-90).
The semantic relatedness between queries and items
is assumed to be inherited via the index terms that
represent the semantic content of the item in a
consistent and accurate fashion.
3.The Situation View:
The situation view pertains to the relationship
between information and the user's information
problem situation. It assumes that only users can
make valid judgments regarding the suitability of
information to solve their information need.
Lancaster and Warner refer to information and
situation views as relevance and pertinence
respectively (Lancaster-93). Pertinence can be
defined as those items that satisfy the user's
information need at the time of retrieval.
MEASURES USED IN SYSTEM EVALUATION
To define the measures that can be used in
evaluating Information Retrieval Systems, it is useful
to define the major functions associated with
identifying relevant items in an information system.
Items arrive in the system and are automatically or
manually transformed by "indexing" into searchable
data structures.
The user determines what his information need is
and creates a search statement. The system processes
the search statement, returning potential hits. The
user selects those hits to review and accesses them.
Identifying Relevant Items
Measurements can be made from two perspectives:
⚫ System Perspective
⚫ User Perspective
1. Search Process
2. Response Time
3. Consistency
4. Quality of the Search
5. Fallout
6. Unique Relevance Recall( U R R )
7. Novelty Ratio
8. Coverage Ratio
9. Sought Recall
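Several of the measures above have standard set-based definitions in the IR literature. The sketch below computes fallout (measure 5) for a single search, together with precision and recall for context; the function name and argument layout are illustrative, not from the source:

```python
def evaluation_measures(retrieved, relevant, collection_size):
    """Set-based effectiveness measures for one search.
    `retrieved` and `relevant` are sets of item ids;
    `collection_size` is the total number of items in the database."""
    rel_retrieved = len(retrieved & relevant)
    # Precision: fraction of retrieved items that are relevant.
    precision = rel_retrieved / len(retrieved)
    # Recall: fraction of relevant items that were retrieved.
    recall = rel_retrieved / len(relevant)
    # Fallout: fraction of the NON-relevant items that were retrieved.
    nonrelevant = collection_size - len(relevant)
    fallout = (len(retrieved) - rel_retrieved) / nonrelevant
    return precision, recall, fallout
```

For example, retrieving items {1, 2, 3, 4} when {2, 4, 5, 6} are relevant in a 10-item collection gives precision 0.5, recall 0.5, and fallout 2/6: two of the six non-relevant items were incorrectly retrieved.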
SEARCH PROCESS
This is associated with a user creating a new search
or modifying an existing query. In creating a search,
an example of an objective measure is the time
required to create the query, measured from when
the user enters into a function allowing query input
to when the query is complete.
Completeness is defined as when the query is
executed. Although of value, the possibilities for
erroneous data (except in controlled environments)
are so great that data of this nature are not collected
in this area in operational systems.
Example: The erroneous data comes from the user
performing other activities in the middle of creating
the search such as going to get a cup of coffee.
RESPONSE TIME
Response time is a metric frequently collected to
determine the efficiency of the search execution. Response
time is defined as the time it takes to execute the search.
The ambiguity in response time originates from the
possible definitions of the end time.
The beginning is always correlated to when the user tells
the system to begin searching. The end time is affected by
the difference between the user's view and a system view.
From a user's perspective, a search could be considered
complete when the first result is available for the user to
review, especially if the system has new items available
whenever a user needs to see the next item.
From a system perspective, system resources are being
used until the search has determined all hits.
CONSISTENCY
To ensure consistency, response time is usually
associated with the completion of the search.
This is one of the most important measurements in a
production system. Determining how well a system is
working answers the typical concern of a user: "the
system is working slow today."
It is difficult to define objective measures on the
process of a user selecting hits for review and
reviewing them. The problems associated with search
creation apply to this operation.