The three different paradigms of information search and resource discovery are:
The main purpose of developing this process is to support naive users in areas like
electronic shopping and home banking. The goals include the following:
transmitted implicitly by different remote sources or by different alternative sources like
e-mail.
Software filters provide access control, ensuring that only appropriate information
is passed to the decision maker. The transmitted information helps decision makers
respond more flexibly to changing organizational surroundings. There are two types
of software filters:
i) Local Filters - Local filters process the incoming stream of data.
ii) Remote Filters - Remote filters are software agents that perform tasks
on behalf of users. They help users perform daily tasks, search for and retrieve
information, and support decision-making. They act as proxies for the user,
moving among databases located on different networks.
The process of searching for text strings in a large collection of documents can be
divided into two phases:
This phase consists of three steps that the user performs during the text search.
1) The user formulates a query, specifying in some way the material for which
the text database is to be searched.
2) The server interprets the user's query, performs the search, and returns to
the user a list of documents meeting the search criteria. Text systems usually
perform the search by comparing search terms with an index file containing
a sorted list of words found in the document database.
3) The user selects documents from the hit list and browses them, reading and
perhaps printing selected portions of the retrieved documents.
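Step 2 above can be sketched as a lookup in a sorted word index. This is only a minimal illustration, assuming a tiny in-memory document set; the names `build_index` and `search` and the sample documents are hypothetical and do not come from any particular text system.

```python
import bisect
import re

def build_index(docs):
    """Return a sorted list of (word, doc_ids) pairs for the given documents."""
    postings = {}
    for doc_id, text in docs.items():
        for word in set(re.findall(r"[a-z0-9]+", text.lower())):
            postings.setdefault(word, set()).add(doc_id)
    return sorted((word, sorted(ids)) for word, ids in postings.items())

def search(index, term):
    """Binary-search the sorted word list for one term and return its hit list."""
    words = [word for word, _ in index]
    pos = bisect.bisect_left(words, term.lower())
    if pos < len(words) and words[pos] == term.lower():
        return index[pos][1]
    return []

# Hypothetical sample documents, for illustration only.
docs = {
    "d1": "Home banking for new users",
    "d2": "Electronic shopping and home delivery",
    "d3": "Banking regulations",
}
index = build_index(docs)
hits = search(index, "home")  # documents the user could then browse
```

Because the word list is sorted, each lookup needs only a binary search rather than a scan of every document.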
This process often takes place during off-hours so that system performance is not
degraded during working hours. Some systems, such as those used by news agencies, add
documents to the database constantly with a live data feed. The process of loading
documents into the system and updating indexes is normally not a concern to the user.
These two phases are highly interdependent. The user interface should provide a
way of entering search queries and browsing matched documents. The index should
be structured to expedite the type of searching permitted by the queries, and the
data-entry procedures must work within the structure of the documents and the search indexes.
The purpose of a search engine in any indexing system is simple: to find the items that
match a query, no matter where they are located in the file system. The trick for the software
designers is to create a search engine that carries out this job quickly and accurately while
taking up as little disk space as possible. Search engines are now being designed to go
beyond simple, broadband searches for which WAIS is so popular.
Topic is a search engine used in Lotus Notes, Adobe Acrobat, and a variety of other
products. It uses both key words and information searching to rank the relevance of each
document. Topic might return a list of a hundred documents that match the user's criteria,
but they would be listed in order of the relevance that Topic assigns.
Wide Area Information Servers, or WAIS, enables users to search the contents of
files for any string of text that they supply. An extremely versatile service, WAIS uses an
English-language query front end to a large assortment of databases that contain text-
based documents.
WAIS lets users search the full text of all the documents on a server. Users on
different platforms can access personal, company, and published information from one
interface - text, pictures, voice or formatted documents. Since the system uses a single
computer-to-computer protocol, information can be stored anywhere on different types of
machines.
WAIS consists of three components:
1) a client,
2) a server, and
3) an indexer.
The indexer takes a list of files the publisher wants to index and generates from it
several index files. These indexes include a directory of all words appearing in the
database, a list of documents and files that constitute the database, and the "headline" of
the documents contained in the database.
With the index created, the publisher must tell the rest of the world about it. The
publisher does so automatically, by running WAIS with a register option, which places this
index next to the hundreds of WAIS indexes already available on the Internet - items such
as a legal index from West Publishing, indexes of government documents, and countless
academic databases.
WAIS solves a number of problems from the user's perspective. They are :
Uses of WAIS:
WAIS is a sophisticated search engine. Some publishers create WAIS indexes of the files
that they serve through the World Wide Web or Gopher. These indexes enable users to
search the contents of those files. Several companies use WAIS to sell information over
the Internet. One can associate a cost directly with WAIS sources, unlike many other
Internet services.
Indexing Methods
The disadvantage of word-level indexing schemes is that all the extra information
they contain gobbles up a lot of disk space - anywhere between 35 percent and 100 percent
of the size of the original text. They can also be slower than file-level indexes because
they have more information to search through.
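The size difference can be illustrated with a small sketch. It assumes that a file-level index records only which files contain a word, while a word-level index records every position at which the word occurs; the function names and sample files are hypothetical, and the entry counts are only a rough proxy for disk space.

```python
from collections import defaultdict

def file_level_index(files):
    """Map each word to the set of files that contain it."""
    index = defaultdict(set)
    for name, text in files.items():
        for word in text.lower().split():
            index[word].add(name)
    return index

def word_level_index(files):
    """Map each word to every (file, position) at which it occurs."""
    index = defaultdict(list)
    for name, text in files.items():
        for position, word in enumerate(text.lower().split()):
            index[word].append((name, position))
    return index

# Hypothetical sample files, for illustration only.
files = {
    "a.txt": "the index of the index",
    "b.txt": "the file",
}
coarse = file_level_index(files)
fine = word_level_index(files)

# Count stored entries as a rough proxy for index size.
coarse_entries = sum(len(entries) for entries in coarse.values())
fine_entries = sum(len(entries) for entries in fine.values())
```

The word-level index stores one entry per occurrence, which is what makes phrase and proximity searches possible but also what makes it larger.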
Indexing Packages
A large number of indexing packages have become available for UNIX-based workstations.
These packages fall into three categories:
1. The client-server method is based on the distributed approach in which the
document database and the text search and retrieval software reside on a central
server, while sophisticated data representation and user interface software reside
on the user's workstation. The power of the server is used for the data intensive
job of comparing search terms with text files or indexes, while the workstations are
best suited for graphical interfaces.
In this approach, the index file can be split into pieces corresponding to
work groups and maintained on separate servers. This approach provides fast
response time for documents "owned" locally. Searches of portions of the index
stored in other servers can be performed in the background while the user is
retrieving and studying locally owned documents. One disadvantage of this
approach is that each sub-index has to be updated individually each time the
master file is updated.
Robots, Wanderers, and Spiders are all programs that traverse the WWW
automatically gathering information. The terms robot and spider are often used in
reference to automated tools for access to publicly accessible databases on the Internet
for the purpose of building indices of documents. These web robots are generally used by
search engines such as Google to perform the following tasks:
v) For providing up-to-date information to the users.
A software agent views the World Wide Web as a graph. It starts at a set of nodes (.HTML)
and traverses the hypertext links in these nodes to a certain depth beginning at a URL
passed as an argument. Only URLs having ".html" suffixes or tagged as "HTTP:" and ending in
a slash are probed. Unsuccessful attempts and document leaves are logged into a separate
table to prevent revisiting. This method results in a limited-depth breadth-first traversal
of only HTML portions of the Web.
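The traversal described above can be sketched as follows. To stay runnable without network access, the sketch walks an in-memory link table instead of fetching pages; the function `crawl`, the `links` table, and the example URLs are hypothetical, and the suffix test assumes that the suffix in question is ".html".

```python
from collections import deque

def crawl(links, start_url, max_depth):
    """Limited-depth breadth-first traversal of a link graph."""
    visited = {start_url}          # table of URLs already seen, to prevent revisiting
    order = []
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        order.append(url)
        if depth == max_depth:
            continue               # do not follow links beyond the depth limit
        for target in links.get(url, []):
            is_html = target.endswith(".html")
            is_directory = target.startswith("http:") and target.endswith("/")
            if not (is_html or is_directory):
                continue           # skip non-HTML resources such as images
            if target not in visited:
                visited.add(target)
                queue.append((target, depth + 1))
    return order

# Hypothetical link table standing in for fetched pages.
links = {
    "http://example.org/": ["http://example.org/a.html", "http://example.org/logo.gif"],
    "http://example.org/a.html": ["http://example.org/b.html", "http://example.org/"],
    "http://example.org/b.html": ["http://example.org/c.html"],
}
order = crawl(links, "http://example.org/", 2)
```

With a depth limit of 2, the page `c.html` is never reached, and the image link is skipped because it is not an HTML document.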
There are three models that are used for retrieving information from the database
in an efficient manner.
The Boolean model is based on the "exact match" principle and is the standard for most
popular information retrieval systems. The term Boolean is used because the query
specifications are expressed as words or phrases, combined using the standard operators
OR, AND, and NOT.
This model retrieves all text files containing the combination of words or phrases
specified in the query, but it makes no distinction between any of the retrieved documents.
Thus the result of the comparison operation is a partition of the database into a set of
retrieved documents and a set of documents that are not retrieved.
One disadvantage of this model is that it does not allow for any form of ranking of
the retrieved document set. Presenting documents to the user in presumed order of
relevance would result in more effective and usable systems. Similarly, excluding
documents that do not precisely match a query specification results in lower effectiveness.
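A minimal sketch of exact-match Boolean retrieval, using set operations for AND, OR, and NOT. The sample documents and the helper names `build_postings` and `term` are hypothetical.

```python
def build_postings(docs):
    """Map each word to the set of document ids that contain it."""
    postings = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            postings.setdefault(word, set()).add(doc_id)
    return postings

# Hypothetical sample documents, for illustration only.
docs = {
    1: "electronic commerce on the internet",
    2: "electronic mail directory",
    3: "internet directory services",
}
postings = build_postings(docs)

def term(word):
    """Return the documents containing the word (empty set if none)."""
    return postings.get(word, set())

# Query: electronic AND (internet OR directory) AND NOT mail
retrieved = (term("electronic") & (term("internet") | term("directory"))) - term("mail")
not_retrieved = set(docs) - retrieved
```

The result is an unordered set: every retrieved document is treated as equally relevant, which is exactly the lack of ranking noted above.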
Vector space and probabilistic models, based on best-match retrieval models, have
been formulated in response to the problems of Boolean models. The most widely known,
the vector space model, treats texts and queries as vectors in a multidimensional space,
the dimensions of which are the words used to represent the texts. The vector model
processes queries and texts by comparing the vectors, using, for example, a method called
the cosine correlation similarity measure.
The assumption is that the more similar a vector representing a text is to a query
vector, the more likely it is that the text is relevant to that query. The model can also
weight the terms (or dimensions) of a query, or text
representation, to take account of their importance. These weights are computed on the
basis of the statistical distributions of the terms in the database and in the texts.
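The following sketch ranks texts against a query by cosine similarity. The weighting used here (term frequency multiplied by a smoothed inverse document frequency) is one common choice and is an assumption, not necessarily the scheme the model's authors intended; the sample texts and function names are hypothetical.

```python
import math
from collections import Counter

def tf_idf_vectors(texts):
    """Return one weighted term vector per text, plus the idf table used."""
    tokenized = [text.lower().split() for text in texts]
    n = len(tokenized)
    df = Counter(word for words in tokenized for word in set(words))
    # Assumed weighting: log(N / df) + 1, so that common terms still count a little.
    idf = {word: math.log(n / count) + 1.0 for word, count in df.items()}
    vectors = [
        {word: tf * idf[word] for word, tf in Counter(words).items()}
        for words in tokenized
    ]
    return vectors, idf

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(word, 0.0) for word, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical sample texts, for illustration only.
texts = [
    "internet directory services",
    "electronic yellow pages directory",
    "home banking",
]
vectors, idf = tf_idf_vectors(texts)
query = {word: idf.get(word, 0.0) for word in "directory services".split()}
ranking = sorted(range(len(texts)), key=lambda i: cosine(query, vectors[i]), reverse=True)
```

Unlike the Boolean model, the second text is still returned even though it matches only one query term; it is simply ranked lower than the first.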
Need of Directories
Yellow pages focus on customers who have already decided to purchase a product.
They act as a low-profile advertising medium. The difference between print-based
yellow pages and electronic yellow pages is that the latter provide more enhanced
services than the former. In electronic yellow pages, a directory is created that acts as an
interface to various resources. These directories can even be accessed from within an
electronic commerce application, which creates a large demand for them.
- As distributed applications appear on the network, the directories have to begin
tracking all those objects and their components as well. Hence directories and
naming tend to go hand in hand. A good name service makes the use of a distributed
computing environment transparent to the user.
A directory or catalog is an information base about a set of real world objects. Users
often scan directories for telephone numbers or addresses, facts, or organizations or
persons. Directories must therefore be organized in a manner that facilitates easy access
to information, and the directory user must be able to locate "entries" in the directory
where the actual information is stored or presented. Directories are also being slowly
integrated with messaging services such as e-mail and EDI applications.
The electronic white pages provide services from a static listing of e-mail addresses
to directory assistance. The Internet directory assistance service can be more extensive
than the one provided by the phone companies, as the technology provides the ability to
publish important information that an individual may make publicly available, such as
photographs, home mailing addresses and fax numbers, office information, and job
descriptions.
The original intention behind organizational directories was to reduce the amount of
duplication as corporations spend money maintaining identical lists in several sites - for
phones, security, payroll, faxes, computers, e-mail, and other reasons.
1) Searching and
2) Retrieving
The approaches used for creating a white pages directory are interpretability and
conventional forms of communication. These approaches are strong enough to provide all
the functionality required to establish a directory service that works across
different technologies.
One of the first goals of the X.500 project has been to create a directory for keeping track
of individual electronic mail addresses on the Internet.
1) Decentralized maintenance: Each site running X.500 is responsible only for its
local part of the directory, so updates and maintenance can be done instantly.
2) Searching Capabilities: X.500 provides powerful searching facilities that allow
users to construct arbitrarily complex queries. For example, in the white pages,
you can search solely for users in one country. From there, you can view a list
of organizations, then departments, then individual names. This represents a
tree structure with successive descent to the terminal nodes or instances.
3) Single global name space: X.500 provides a single name space to users.
4) Structured information framework: X.500 defines the information framework
used in the directory, allowing local extensions.
5) Standards-based directory: X.500 can be used to build directory
applications that require distributed information (e-mail, automated resource
locators, special-purpose directory tools). These applications can access a
wealth of information in a uniform manner, no matter where they are based
or currently running.
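The successive descent described under searching capabilities (country, then organization, then department, then individual names) can be sketched with a nested structure. This is a toy model only, not the X.500 data model or protocol; all entries and the helper names `descend`, `list_children`, and `find_people` are hypothetical.

```python
# Toy directory tree: country -> organization -> department -> people.
directory = {
    "US": {
        "Example Corp": {
            "Sales": ["Alice Smith", "Bob Jones"],
            "Engineering": ["Carol White"],
        },
    },
    "GB": {
        "Sample Ltd": {"Support": ["Dan Brown"]},
    },
}

def descend(tree, *path):
    """Follow a path of names from the root down to one node."""
    node = tree
    for name in path:
        node = node[name]
    return node

def list_children(tree, *path):
    """List what lies directly below a node, in alphabetical order."""
    return sorted(descend(tree, *path))

def find_people(tree, country, surname):
    """Search solely within one country for people with the given surname."""
    matches = []
    for org, departments in tree[country].items():
        for dept, people in departments.items():
            for person in people:
                if person.split()[-1] == surname:
                    matches.append((country, org, dept, person))
    return matches

organizations = list_children(directory, "US")
departments = list_children(directory, "US", "Example Corp")
smiths = find_people(directory, "US", "Smith")
```

Restricting the search to one subtree keeps the query cheap, since entries under other countries are never examined.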
A user of the directory can be a person or a computer program. The organization
and distribution of information among the directory system agents (DSAs) is totally
transparent to the users. A user accesses the directory through a so-called directory
user agent (DUA). The DUA automatically contacts a nearby DSA, by means of which the
user may search or browse through the directory information tree (DIT)
and retrieve corresponding information.
A DUA can be implemented in all sorts of user interfaces, so users can access the
directory through dedicated DUA interfaces or e-mail applications. Currently, most DUA
interfaces are dedicated, but it is expected that in the near future a lot of DUA interfaces
will be integrated with other applications.
The information requested by the user agent may be located in the local server to which
the user agent is attached. However, it is often the case that the required information is
not contained within the local server (DSA). In this case, various server agents might
become involved and might need to cooperate to provide the information. For this reason,
several methods have been defined for the operation of the directory when information is
not located in the local server agent:
2) Referral - A referral identifies a "more suitable" DSA that can satisfy the needs of the
user. A DSA might return a referral to a user or another DSA if the request cannot
be performed.
3) Multicasting - Multicasting involves passing the same request by a DSA to two or
more DSAs.
4) Hybrids - Chaining, referrals, and multicasting can be combined as necessary to
perform the intended request.
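Referral and multicasting can be sketched with a toy model. The class `DSA` below is not an implementation of the X.500 protocols; it only imitates the control flow, and all names, entries, and the `max_hops` limit are hypothetical.

```python
class DSA:
    """A toy directory system agent holding part of the directory."""

    def __init__(self, name, entries, better=None):
        self.name = name
        self.entries = entries   # key -> information held locally
        self.better = better     # "more suitable" DSA to refer to, if any

    def lookup(self, key):
        if key in self.entries:
            return ("result", self.entries[key])
        if self.better is not None:
            return ("referral", self.better)
        return ("error", None)

def resolve_with_referrals(dsa, key, max_hops=5):
    """Follow referrals from DSA to DSA until an answer or an error is returned."""
    for _ in range(max_hops):
        kind, value = dsa.lookup(key)
        if kind != "referral":
            return kind, value
        dsa = value              # retry the request at the referred DSA
    return ("error", None)

def multicast(dsas, key):
    """Pass the same request to several DSAs and collect the successful answers."""
    answers = []
    for dsa in dsas:
        kind, value = dsa.lookup(key)
        if kind == "result":
            answers.append(value)
    return answers

# Hypothetical agents and entries, for illustration only.
remote = DSA("remote", {"alice": "alice@example.org"})
local = DSA("local", {"bob": "bob@example.org"}, better=remote)

found = resolve_with_referrals(local, "alice")
missing = resolve_with_referrals(local, "zed")
broadcast = multicast([local, remote], "bob")
```

With a referral, the requester does the extra work of contacting the next DSA itself; with multicasting, the same request is simply sent to several DSAs at once.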
1) It must be much easier to be part of the Internet white pages than to bring up an
X.500 DSA or make good use of the already deployed X.500 DSAs. X.500 is too
complicated and simpler white pages services must be defined to promote
widespread implementations.
2) To promote reliable operation and consistency of data, there must be some central
management of the X.500 system.
3) A common naming scheme must be identified and documented.
Electronic Yellow Pages
Users are increasingly turning to directory databases rather than printed Yellow Pages.
collection of printed financial directories. You may get additional information, such as
number of employees, sales, and ownership, that is omitted from Yellow Pages
listings.
The goal of interactive catalogs is simple: to enable customers everywhere to
buy goods from anywhere in a virtual mall open twenty-four hours a day, seven days a
week. Customers simply look through the on-line merchandise and interact with the
company using several methods, such as e-mail, form-based secure messaging systems,
interactive desktop video, and other methods.
Directories, in contrast to catalogs, are usually compiled by third parties and play
an influential role in guiding customers in the information space to reach catalogs.
Electronic yellow pages are organized by product and service and become
necessary as businesses move toward electronic commerce. The goal of yellow pages is
to organize the vast amount of information so that customers can quickly locate desired
and alternative products and services through eye-catching advertisements. Yellow pages
catalogs are a reactive medium, in that they satisfy a need but do not create a need.
However, it must be noted that yellow pages work only when someone is looking for some
particular information.
Interactive catalogs are based on the idea that effective marketing
relies on a two-way information flow between the marketer and the customer.
Interactive catalogs are ideally suited for small businesses, because they enable
them to effectively utilize their limited marketing, support, and sales staffs with potential
customers around the world. Catalogs must support product / service bundling,
coordinated purchasing, and associated financing.
Information filtering
1) Filtering systems involve large amounts of data. Typical applications would deal with
gigabytes of text, or much larger amounts of other media.
2) Filtering typically involves streams of incoming data, either being broadcast by remote
sources or sent directly by other sources (e-mail). Filtering is often meant to imply the
removal of data from an incoming stream, rather than finding data in that stream.
3) Filtering has also been used to describe the process of accessing and retrieving
information from remote databases, in which case the incoming data are the result of a
search query. This scenario is also used by the developers of systems that generate
"smart agents" for searching remote, heterogeneous databases.
4) Filtering is based on descriptions of individual or group information preferences, often
called profiles. Such profiles typically represent user interests. The use of user profiles is
common in the library community where the process is known as the selective
dissemination of information (SDI). SDI is defined as the service that attacks the
information overload problem by keeping individuals informed of new documents
published in their area of specialization so that they can keep abreast of new
developments.
5) Filtering systems deal primarily with textual information. The problem is more general
than that and should include other types of data such as images, voice, and video that
are part of multimedia information systems. None of these data types are handled well
by conventional filtering systems, and all have representations and meanings that are
difficult to filter.
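Profile-based filtering of an incoming stream can be sketched as follows. The sketch assumes that a profile is just a set of keywords and that a document matches when it contains at least one of them; real systems use richer profiles. The user names, profiles, and documents are hypothetical.

```python
def matches_profile(document, profile):
    """A document matches when it shares at least one word with the profile."""
    words = set(document.lower().split())
    return bool(words & profile)

def filter_stream(stream, profiles):
    """Route each incoming document to the users whose profile it matches."""
    deliveries = {user: [] for user in profiles}
    for document in stream:
        for user, profile in profiles.items():
            if matches_profile(document, profile):
                deliveries[user].append(document)
    return deliveries

# Hypothetical profiles and incoming documents, for illustration only.
profiles = {
    "ann": {"banking", "finance"},
    "raj": {"medical"},
}
stream = [
    "New banking rules announced",
    "Medical imaging advances",
    "Local sports results",
]
deliveries = filter_stream(stream, profiles)
```

The third document matches no profile and is dropped, which reflects the sense of filtering as removing data from an incoming stream.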
Mail-Filtering Agents
Users of mail-filtering agents can instruct them to watch for items of interest in e-mail
in-boxes, on-line news services, electronic discussion forums, and the like. The mail
agent will pull the relevant information and put it in the user's personalized newspaper at
predetermined intervals.
News-Filtering Agents
These deliver real-time on-line news. Users can indicate topics of interest, and the
agent will alert them to news stories on those topics as they appear on the newswire.
Users can also create personalized news clipping reports by selecting from news services.
Customers can receive their news stories through the delivery channel of their choice -
fax, e-mail, WWW page, or Lotus Notes platform.
For instance, one can create a user agent that, based on the categories selected,
will download news clips daily on the computer, business, financial, or medical industries.
Currently, news filtering services are primarily targeted to executives who need to keep
current concerning their areas of interest.