1. Introduction
Gerard Salton, a pioneer in information retrieval and one of the leading figures from the 1960s
to the 1990s, proposed the following definition in his classic 1968 textbook (Salton, 1968):
Information retrieval is a field concerned with the structure, analysis, organization, storage,
searching, and retrieval of information.
Information retrieval (IR) is the process of searching within a document collection for a
particular information need (expressed as a query). Web IR can be defined as the application of
theories and methodologies from IR to the World Wide Web. It is concerned with addressing
the technological challenges facing information retrieval on the Web.
Increasingly, applications of information retrieval involve multimedia documents with
structure, significant text content, and other media. Popular information media include pictures,
video, and audio, including music and speech.
In addition to a range of media, information retrieval involves a range of tasks and
applications. The usual search scenario involves someone typing in a query to a search engine
and receiving answers in the form of a list of documents in ranked order. Although searching
the World Wide Web (web search) is by far the most common application involving
information retrieval, search is also a crucial part of applications in corporations, government,
and many other domains. Vertical search is a specialized form of web search where the
domain of the search is restricted to a particular topic. Enterprise search involves finding the
required information in the huge variety of computer files scattered across a corporate
intranet. Web pages are certainly a part of that distributed information store, but most
information will be found in sources such as email, reports, presentations, spreadsheets, and
structured data in corporate databases. Desktop search is the personal version of enterprise
search, where the information sources are the files stored on an individual computer,
including email messages and web pages that have recently been browsed. Peer-to-peer
search involves finding information in networks of nodes or computers without any
centralized control. This type of search began as a file sharing tool for music but can be used
in any community based on shared interests, or even shared locality in the case of mobile
devices. Search and related information retrieval techniques are used for advertising, for
intelligence analysis, for scientific discovery, for health care, for customer support, for real
estate, and so on. Any application that involves a collection of text or other unstructured
information will need to organize and search that information.
Search based on a user query (sometimes called ad hoc search because the range of possible
queries is huge and not prespecified) is not the only text-based task that is studied in
information retrieval. Other tasks include filtering, classification, and question answering.
Filtering or tracking involves detecting stories of interest based on a person’s interests and
providing an alert using email or some other mechanism. Classification or categorization uses
a defined set of labels or classes (such as the categories listed in the Yahoo! Directory) and
automatically assigns those labels to documents. Question answering is similar to search but
is aimed at more specific questions, such as “What is the height of Mt. Everest?”. The goal
of question answering is to return a specific answer found in the text, rather than a list of
documents. Table 1.1 summarizes some of these aspects or dimensions of the field of
information retrieval.
2. Search Engines:
A search engine is the practical application of information retrieval techniques to large-scale
text collections. A web search engine is the obvious example, but as has been mentioned,
search engines can be found in many different applications, such as desktop search or
enterprise search. Search engines have been around for many years. For example, MEDLINE,
the online medical literature search system, started in the 1970s.
Search engines come in a number of configurations that reflect the applications they are
designed for. Web search engines, such as Google and Yahoo!, must be able to capture, or
crawl, many terabytes of data, and then provide subsecond response times to millions of
queries submitted every day from around the world. Enterprise search engines—for example,
Autonomy—must be able to process the large variety of information sources in a company
and use company-specific knowledge as part of search and related tasks, such as data mining.
Data mining refers to the automatic discovery of interesting structure in data and includes
techniques such as clustering. Desktop search engines, such as the Windows Vista™ search
feature, must be able to rapidly incorporate new documents, web pages, and email as the
person creates or looks at them, as well as provide an intuitive interface for searching this
very heterogeneous mix of information. There is overlap between these categories with
systems such as Google, for example, which is available in configurations for enterprise and
desktop search. Open-source search engines are another important class of systems that have
somewhat different design goals than the commercial search engines. There are a number of
these systems, and the Wikipedia page for information retrieval provides links to many of
them. Three systems of particular interest are Lucene, Lemur, and Galago.
The “big issues” in the design of search engines include the ones identified for information
retrieval: effective ranking algorithms, evaluation, and user interaction. There are, however,
a number of additional critical features of search engines that result from their deployment
in large-scale, operational environments. Foremost among these features is the performance
of the search engine in terms of measures such as response time, query throughput, and
indexing speed. Response time is the delay between submitting a query and receiving the
result list, throughput measures the number of queries that can be processed in a given time,
and indexing speed is the rate at which text documents can be transformed into indexes for
searching. An index is a data structure that improves the speed of search.
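The index mentioned above is usually an inverted index, which maps each term to the documents that contain it. The following is a minimal sketch, not the implementation of any particular engine; the function names and toy documents are illustrative:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a sorted list of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

def search(index, query):
    """Return IDs of documents containing every query term (AND semantics)."""
    postings = [set(index.get(t, ())) for t in query.lower().split()]
    return sorted(set.intersection(*postings)) if postings else []

docs = {
    1: "information retrieval concerns search",
    2: "search engines index documents",
    3: "retrieval of indexed documents",
}
index = build_inverted_index(docs)
print(search(index, "search"))            # docs 1 and 2
print(search(index, "retrieval search"))  # only doc 1
```

Because each query term leads directly to its posting list, search time depends on the number of matching documents rather than the size of the whole collection, which is why the index improves the speed of search.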
Another important performance measure is how fast new data can be incorporated into the
indexes. Search applications typically deal with dynamic, constantly changing information.
Coverage measures how much of the existing information in, say, a corporate information
environment has been indexed and stored in the search engine, and recency or freshness
measures the “age” of the stored information.
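The response time and throughput measures defined above can be made concrete with a small timing sketch. The `measure` helper and the stand-in search function are hypothetical; a real engine would compute these statistics over live query traffic:

```python
import time

def measure(search_fn, queries):
    """Report mean response time (seconds per query) and
    throughput (queries processed per second)."""
    per_query = []
    start = time.perf_counter()
    for q in queries:
        t0 = time.perf_counter()
        search_fn(q)  # stand-in for a real search engine call
        per_query.append(time.perf_counter() - t0)
    elapsed = time.perf_counter() - start
    return {
        "mean_response_time": sum(per_query) / len(per_query),
        "throughput_qps": len(queries) / elapsed,
    }

stats = measure(lambda q: q.lower().split(), ["web search", "ir models"] * 100)
print(stats)
```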
Search engines can be used with small collections, such as a few hundred emails and
documents on a desktop, or extremely large collections, such as the entire Web. There may
be only a few users of a given application, or many thousands. Scalability is clearly an
important issue for search engine design. Designs that work for a given application should
continue to work as the amount of data and the number of users grow.
Figure 1.1 summarizes the major issues involved in search engine design.
3. Architecture of Search Engines:
An example of an architecture used to provide a standard for integrating search and related
language technology components is UIMA (Unstructured Information Management
Architecture). UIMA defines interfaces for components in order to simplify the addition of
new technologies into systems that handle text and other unstructured data.
An architecture is designed to ensure that a system will satisfy the application requirements
or goals. The two primary goals of a search engine are:
• Effectiveness (quality): We want to be able to retrieve the most relevant set of
documents possible for a query.
• Efficiency (speed): We want to process queries from users as quickly as possible.
Search engine components support two major functions, which we call the indexing process
and the query process. The indexing process builds the structures that enable searching, and
the query process uses those structures and a person’s query to produce a ranked list of
documents. The figure below shows the high-level “building blocks” of the indexing process.
These major components are text acquisition, text transformation, and index creation.
The task of the text acquisition component is to identify and make available the documents
that will be searched. Although in some cases this will involve simply using an existing
collection, text acquisition will more often require building a collection by crawling or
scanning the Web, a corporate intranet, a desktop, or other sources of information. In
addition to passing documents to the next component in the indexing process, the text
acquisition component creates a document data store, which contains the text and metadata
for all the documents.
Metadata is information about a document that is not part of the text content, such as the
document type (e.g., email or web page), document structure, and other features, such as
document length.
The text transformation component transforms documents into index terms or features.
Index terms, as the name implies, are the parts of a document that are stored in the index
and used in searching. The simplest index term is a word, but not every word may be used for
searching. In machine learning, the term “feature” refers to a part of a text document that is
used to represent its content, which is essentially what an index term is.
Examples of other types of index terms or features are phrases, names of people, dates, and
links in a web page. Index terms are sometimes simply referred to as “terms.” The set of all
the terms that are indexed for a document collection is called the index vocabulary.
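A minimal text transformation step might look like the following sketch, which lowercases tokens, strips punctuation, and drops stopwords before building the index vocabulary; the stopword list and function names here are illustrative only:

```python
STOPWORDS = {"the", "a", "of", "and", "in", "to"}  # tiny illustrative list

def to_index_terms(text):
    """Transform raw text into index terms: lowercase,
    strip surrounding punctuation, and drop stopwords."""
    terms = []
    for token in text.lower().split():
        token = token.strip(".,;:!?\"'()")
        if token and token not in STOPWORDS:
            terms.append(token)
    return terms

def vocabulary(docs):
    """The index vocabulary: the set of all terms indexed for a collection."""
    return sorted({t for d in docs for t in to_index_terms(d)})

docs = ["The height of Mt. Everest.", "Links in a web page"]
print(to_index_terms(docs[0]))  # ['height', 'mt', 'everest']
print(vocabulary(docs))
```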
The results output part of the user interaction component constructs the display of ranked
documents for the user. This includes, for example, generating the snippets used to summarize
documents. The document data store is one of the sources of information used in generating
the results. Finally, this component also provides a range of techniques for refining the query
so that it better represents the information need.
The ranking component is the core of the search engine. It takes the transformed query from
the user interaction component and generates a ranked list of documents using scores based
on a retrieval model. Ranking must be both efficient, since many queries may need to be
processed in a short time, and effective, since the quality of the ranking determines whether
the search engine accomplishes the goal of finding relevant information. The efficiency of
ranking depends on the indexes, and the effectiveness depends on the retrieval model. The
task of the evaluation component is to measure and monitor effectiveness and efficiency. An
important part of that is to record and analyze user behavior using log data. The results of
evaluation are used to tune and improve the ranking component. Most of the evaluation
component is not part of the online search engine, apart from logging user and system data.
Evaluation is primarily an offline activity, but it is a critical part of any search application.
Figure: The query process of a search engine.
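The ranking component described above can be sketched with a simple TF-IDF retrieval model, a stand-in for the more sophisticated models production engines use; all names and documents here are illustrative:

```python
import math
from collections import Counter

def rank(docs, query):
    """Score documents against a query with a simple TF-IDF model and
    return (doc_id, score) pairs in descending score order."""
    n = len(docs)
    tokenized = {d: text.lower().split() for d, text in docs.items()}
    df = Counter()  # document frequency of each term
    for terms in tokenized.values():
        df.update(set(terms))
    scores = {}
    for d, terms in tokenized.items():
        tf = Counter(terms)
        score = 0.0
        for q in query.lower().split():
            if q in tf:
                # term frequency weighted by inverse document frequency
                score += tf[q] * math.log(n / df[q])
        if score > 0:
            scores[d] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

docs = {1: "web search engines", 2: "search search ranking", 3: "semantic web"}
print(rank(docs, "search"))
```

The efficiency of this step in a real engine comes from reading only the inverted lists for the query terms; the effectiveness comes from the retrieval model used to compute the scores.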
We now look in more detail at the components of each of the basic building blocks.
3.1 Text Acquisition:
Crawlers: In many applications, the crawler component has the primary responsibility for
identifying and acquiring documents for the search engine. There are a number of different
types of crawlers, but the most common is the general web crawler. A web crawler is designed
to follow the links on web pages to discover and download new pages. Although this sounds
deceptively simple, there are significant challenges in designing a web crawler that can
efficiently handle the huge volume of new pages on the Web, while at the same time ensuring
that pages that may have changed since the last time a crawler visited a site are kept “fresh”
for the search engine. A web crawler can be restricted to a single site, such as a university, as
the basis for site search. Focused, or topical, web crawlers use classification techniques to
restrict the pages that are visited to those that are likely to be about a specific topic. This type
of crawler may be used by a vertical or topical search application, such as a search engine that
provides access to medical information on web pages.
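The link-following step of a crawler can be sketched as follows. Fetching is omitted; the example shows only link extraction from a downloaded page and the single-site restriction used in site search. The URLs and names are illustrative:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkExtractor(HTMLParser):
    """Collect href targets from anchor tags in a fetched page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def site_links(page_url, html, site):
    """Resolve relative links and keep only those on the given site,
    as a site-restricted crawler would."""
    parser = LinkExtractor()
    parser.feed(html)
    absolute = [urljoin(page_url, href) for href in parser.links]
    return [u for u in absolute if urlparse(u).netloc == site]

html = '<a href="/courses">Courses</a> <a href="http://other.org/x">Away</a>'
print(site_links("http://example.edu/index.html", html, "example.edu"))
# keeps only the example.edu link
```

A focused crawler would add a classification step here, visiting a discovered link only if the page is predicted to be about the target topic.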
Feeds: Document feeds are a mechanism for accessing a real-time stream of documents. For
example, a news feed is a constant stream of news stories and updates. In contrast to a
crawler, which must discover new documents, a search engine acquires new documents from
a feed simply by monitoring it. RSS is a common standard used for web feeds for content
such as news, blogs, or video. An RSS “reader” is used to subscribe to RSS feeds, which are
formatted using XML.
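Since RSS feeds are XML, monitoring one reduces to fetching and parsing the feed document. A minimal sketch using an inline RSS 2.0 example (the feed content is invented for illustration):

```python
import xml.etree.ElementTree as ET

RSS = """<rss version="2.0"><channel>
  <title>Example News</title>
  <item><title>Story one</title><link>http://example.com/1</link></item>
  <item><title>Story two</title><link>http://example.com/2</link></item>
</channel></rss>"""

def feed_items(rss_xml):
    """Extract (title, link) pairs from an RSS 2.0 feed document."""
    root = ET.fromstring(rss_xml)
    return [(item.findtext("title"), item.findtext("link"))
            for item in root.iter("item")]

print(feed_items(RSS))
```

A search engine monitoring this feed would periodically re-fetch it and pass any items it has not yet seen to the conversion component.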
Conversion: The documents found by a crawler or provided by a feed are rarely in plain text.
Instead, they come in a variety of formats, such as HTML, XML, Adobe PDF, Microsoft Word™,
Microsoft PowerPoint®, and so on. Most search engines require that these documents be
converted into a consistent text plus metadata format. In this conversion, the control
sequences and non-content data associated with a particular format are either removed or
recorded as metadata.
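For HTML, a minimal conversion step producing a consistent text-plus-metadata format might look like the following sketch; the metadata fields chosen are illustrative:

```python
from html.parser import HTMLParser

class TextConverter(HTMLParser):
    """Convert an HTML document into plain text plus metadata:
    markup is discarded, and the title is recorded as metadata."""
    def __init__(self):
        super().__init__()
        self.text = []
        self.metadata = {"type": "web page"}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.metadata["title"] = data
        elif data.strip():
            self.text.append(data.strip())

def convert(html):
    c = TextConverter()
    c.feed(html)
    return " ".join(c.text), c.metadata

html = "<html><head><title>IR Notes</title></head><body><p>Search engines.</p></body></html>"
print(convert(html))
```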
Another common conversion problem comes from the way text is encoded in a document.
ASCII is a common standard single-byte character encoding scheme used for text. ASCII uses
either 7 or 8 bits (extended ASCII) to represent either 128 or 256 possible characters.
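The 128-versus-256 character distinction can be seen directly: encoding text outside the 7-bit ASCII range fails unless an extended single-byte encoding such as Latin-1 is used:

```python
# 7-bit ASCII covers code points 0-127; extended single-byte encodings
# such as Latin-1 assign the remaining 128 values (128-255).
text = "resume"
print(text.encode("ascii"))        # every character is below code point 128

accented = "résumé"
print(accented.encode("latin-1"))  # é is byte 0xE9, outside 7-bit ASCII
try:
    accented.encode("ascii")
except UnicodeEncodeError:
    print("not representable in 7-bit ASCII")
```

A conversion component must detect which encoding a document uses before its text can be reliably interpreted.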
Document data store: The document data store is a database used to manage large numbers
of documents and the structured data that is associated with them. The document contents
are typically stored in compressed form for efficiency. The structured data consists of
document metadata and other information extracted from the documents, such as links and
anchor text (the text associated with a link). A relational database system can be used to store
the documents and metadata. Some applications, however, use a simpler, more efficient
storage system to provide very fast retrieval times for very large document stores.
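A document data store along these lines can be sketched as an in-memory structure holding compressed contents plus structured metadata. The class and field names are illustrative; a real store would persist the data and be optimized for very fast retrieval:

```python
import zlib

class DocumentStore:
    """Minimal document data store: compressed document contents plus
    structured metadata, keyed by document ID."""
    def __init__(self):
        self._docs = {}

    def put(self, doc_id, text, metadata):
        # Contents are compressed for storage efficiency.
        self._docs[doc_id] = (zlib.compress(text.encode("utf-8")), metadata)

    def get(self, doc_id):
        compressed, metadata = self._docs[doc_id]
        return zlib.decompress(compressed).decode("utf-8"), metadata

store = DocumentStore()
store.put("d1", "Anchor text is the text associated with a link.",
          {"type": "web page", "links": ["http://example.com"]})
text, meta = store.get("d1")
print(text)
print(meta["links"])
```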
3.2 Evaluation:
Logging: Logs of user queries and their interactions with the search engine are among the
most valuable sources of information for tuning and improving search effectiveness and
efficiency. Query logs can be used for spell checking, query suggestions, query caching, and
other tasks, such as helping to match advertising to searches.
Ranking analysis: Given either log data or explicit relevance judgments for a large number of
(query, document) pairs, the effectiveness of a ranking algorithm can be measured and
compared to alternatives. This is a critical part of improving a search engine and selecting
values for parameters that are appropriate for the application.
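Given relevance judgments, a basic effectiveness measure such as precision at rank k can be computed directly; the document IDs and judgments below are invented for illustration:

```python
def precision_at_k(ranked_doc_ids, relevant, k):
    """Fraction of the top-k ranked documents that were judged relevant."""
    top = ranked_doc_ids[:k]
    return sum(1 for d in top if d in relevant) / k

ranking = ["d3", "d1", "d7", "d2", "d9"]   # output of a ranking algorithm
judged_relevant = {"d1", "d2", "d5"}       # explicit relevance judgments
print(precision_at_k(ranking, judged_relevant, 5))  # 2 of 5 -> 0.4
```

Comparing such scores across ranking algorithms, or across parameter settings of one algorithm, is how the ranking analysis component selects the configuration best suited to the application.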
Performance analysis: The performance analysis component involves monitoring and improving
overall system performance, in the same way that the ranking analysis component monitors
effectiveness. A variety of performance measures are used, such as response time and
throughput, but the measures used also depend on the application.
4. Metasearch Engines
A metasearch engine is a search engine that combines the results of various search engines
into a single result list. It can also be described as an online information retrieval tool.
Metasearch engines were developed because individual search engines were prone to spam
from people trying to raise their websites' rankings. A search engine visits several websites
and creates a database of these sites; this is known as indexing. Any search engine answers
several queries every second.
A metasearch engine runs a query on several other search engines and presents the combined
results in summarized form.
Need for Metasearch Engines: Metasearch engines were developed to cover the entire web,
unlike most individual search engines. Some websites spam individual search engines to
enhance their page rankings, which is an illegitimate way of promoting content. An individual
search engine is also unable to find results from other search engines; this is where a
metasearch engine comes in handy. A metasearch engine also supports multiple formats,
unlike individual engines, and makes the combined search appear effortless to the user.
The Metasearch Engine Architecture
1. User Interface: The user interface of a metasearch engine is similar in look and feel
to that of individual search engines like Google and Yahoo!. It even has options to
search on the basis of types and categories, and to choose which search engines it
must use to produce the result.
2. Dispatcher: The dispatcher handles query generation, producing the queries that are
sent on to the underlying search engines.
3. Display: The display component uses the query results to write the output back to
the screen. It uses methods such as page ranking, parsing techniques, cluster
formation, and stitching to give the desired result.
4. Personalization: Personalization means, in other words, being user-specific. This
involves comparing the results with one another.
Operations of the Metasearch Engine:
A metasearch engine does not create a database of its own; rather, it creates a federated
database that integrates the databases of various other search engines.
The two main modes of operation are:
1. Ranking architecture: Various search engines have their own ranking algorithms. A
metasearch engine develops its own algorithm that eliminates duplicate results and
calculates a fresh ranking of the sites, on the understanding that websites ranked
highly by major engines are more relevant and will therefore provide better results.
2. Fusion: Fusion is used to create better and more efficient results, and is divided
into collection fusion and data fusion. Collection fusion deals with search engines
that contain unrelated data: the data sources are ranked based on their content and
the likelihood of providing relevant data, and the results are recorded in a list.
Data fusion deals with search engines that have indexes for common data sets: the
initial ranks of the data are compared with the original ranks, and a process of
normalization is applied using techniques such as the CombSUM algorithm.
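The normalization and CombSUM steps can be sketched as follows. Min-max normalization is one common choice of normalization; the engine names and scores below are invented for illustration:

```python
def normalize(scores):
    """Min-max normalize one engine's scores into [0, 1] so that engines
    with different score scales can be combined."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 1.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}

def combsum(result_lists):
    """CombSUM: sum each document's normalized scores across engines.
    Documents returned by several engines accumulate higher totals,
    which also merges duplicate results into a single entry."""
    fused = {}
    for scores in result_lists:
        for d, s in normalize(scores).items():
            fused[d] = fused.get(d, 0.0) + s
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

engine_a = {"page1": 12.0, "page2": 8.0, "page3": 4.0}
engine_b = {"page2": 0.9, "page4": 0.3}
print(combsum([engine_a, engine_b]))  # page2, found by both engines, ranks first
```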
5. Semantic Web
The Semantic Web is a vision of information that is understandable by computers, so that they
can perform more of the tedious work involved in finding, sharing, and combining information
on the web.
The Semantic Web is an evolving extension of the World Wide Web in which the semantics
of information and services on the web are defined, making it possible for the web to understand
and satisfy the requests of people and machines for web content. It derives from W3C
director Tim Berners-Lee's vision of the Web as a universal medium for data, information, and
knowledge exchange.
The World Wide Web is based mainly on documents written in HyperText Markup
Language (HTML), a markup convention used for coding a body of text interspersed
with multimedia objects such as images and interactive forms. The Semantic Web involves
publishing data in a language designed specifically for data, the Resource Description
Framework (RDF), so that it can be manipulated and combined just like data files on a local
computer. HTML describes documents and the links between them. RDF, by contrast, describes
arbitrary things such as people, meetings, and airplane parts.
Components of the Semantic Web: Several formats and languages form the building blocks of
the Semantic Web. These include:
1. Identifiers: Uniform Resource Identifier (URI).
2. Documents: Extensible Markup Language (XML).
3. Statements: Resource Description Framework (RDF), with a variety of data
interchange formats (e.g., RDF/XML, N3) and notations such as RDF Schema (RDFS)
and the Web Ontology Language (OWL), all of which are intended to provide a formal
description of concepts, terms, and relationships within a given knowledge domain.
4. Logic, Proof, and Trust.
Figure: The Semantic Web Stack (aka Semantic Web Cake). Source: Idehen 2017.
The Semantic Web builds upon the foundations of the original web. Data (old and new) must be
described with metadata. This metadata identifies data, interlinks data, and relates data to
concepts so that machines can understand them.
Data must be uniquely identified and this is done using Uniform Resource Identifier
(URI) or Internationalized Resource Identifier (IRI). Resource Description Framework
(RDF) provides the data model. Meaning is added at a higher layer with what we
call ontologies. In other words, RDF specifies the syntax while ontologies specify the
semantics.
Just as HTML was the building block of the original web, RDF is the building block of the
Semantic Web. Web content can expose their semantics by embedding RDF statements
within webpages. There are many ways to do this: RDFa, RDF-XML, RDF-JSON, JSON-LD,
Microdata, etc. Semantic data already processed and stored in RDF format can be queried.
Just as SQL is used to query relational databases, SPARQL is a language to query RDF stores.
Given the semantics, rules can help in applying logic and reasoning.
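The triple model and a SPARQL-style pattern query can be sketched in plain Python. This is not a real SPARQL engine, and the abbreviated `ex:` URIs and facts are illustrative:

```python
# RDF statements as (subject, predicate, object) triples; URIs abbreviated.
triples = [
    ("ex:TimBL",  "ex:directs", "ex:W3C"),
    ("ex:TimBL",  "ex:created", "ex:WWW"),
    ("ex:SPARQL", "ex:queries", "ex:RDF"),
]

def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; a None slot acts like a
    SPARQL variable that matches anything."""
    return [t for t in triples
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Analogous to the SPARQL query: SELECT ?p ?o WHERE { ex:TimBL ?p ?o }
print(match(triples, s="ex:TimBL"))
```

Real RDF stores index triples by subject, predicate, and object so that such pattern matches are fast even over billions of statements.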
What is Resource Description Framework (RDF)?