
The evolution of Web searching

David Green

The author
David Green is currently the New Media Editor for one of the ``big five'' professional services firms. He can be contacted via his Web site at www.clickmedia.freeserve.co.uk

Keywords
Information retrieval, Electronic publishing, Internet

Abstract
The interrelation between Web publishing and information retrieval technologies is explored. The different elements of the Web have implications for indexing and searching Web pages. There are two main platforms used for searching the Web - directories and search engines - which were later combined to create one-stop search sites, resulting in the Web business model known as portals. Portalisation gave rise to a second generation of firms delivering innovative search technology. Various new approaches to Web indexing and information retrieval are described. PC-based search tools incorporate intelligent agents to allow greater manipulation of search strategies and results. Current trends are discussed, in particular the rise of XML, and their implications for the future. It is concluded that the Web is emerging from a nascent stage and is evolving into a more complex, diverse and structured environment.

Electronic access
The current issue and full text archive of this journal is available at http://www.emerald-library.com

Online Information Review, Volume 24, Number 2, 2000, pp. 124-137. MCB University Press. ISSN 1468-4527. Received September 1999; accepted February 2000.

Defining the Web

In that constellation of computers known as the Internet, transmitted data are split into small ``packets'', which makes far more efficient use of bandwidth. This, together with simpler technologies, has dramatically reduced the cost of electronic publishing, resulting in an estimated daily increase of over 1 million Web pages of information (Clever Team, 1999). However, despite its uniform interface and seamless linked integration, the Web is not a single coherent element. There are two distinct elements of the Web: the visible and the invisible. In order to understand the implications of this distinction for information retrieval, it is necessary first to consider how Web pages are produced.

There are two types of Web page: static and dynamic. Static Web pages have been manually created by a Web designer, posted on to a Web server and are available to anyone or anything that visits the Web site of which they are a part. Any changes must be made manually. Dynamic Web pages are created by a computer using a script (often CGI, Java or Perl). This script acts as an intermediary between the user requesting, or submitting, information on a static Web page (the front-end) and a database (the back-end), which supplies, or processes, the information. The script slots the results into a blank Web page template and presents the visitor with a dynamically generated Web page (Green, 1998a). Figure 1 illustrates this process.

Figure 1 Dynamic Web page generation
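
The process shown in Figure 1 can be sketched in a few lines of code. The fragment below is purely illustrative: the product descriptions, template and function names are invented for the example rather than drawn from any real site, but the division of labour - front-end request, back-end lookup, results slotted into a blank template - is the one described above.

    # Minimal sketch of dynamic Web page generation (cf. Figure 1).
    # The "database" and template below are invented for illustration.
    import html

    BACK_END = {                       # the back-end database
        "widget": "Widgets ship within three days.",
        "gadget": "Gadgets are currently out of stock.",
    }

    TEMPLATE = "<html><body><h1>Results for {query}</h1><p>{body}</p></body></html>"

    def generate_page(query):
        """The intermediary script: query the back-end and slot the
        result into a blank page template before returning it."""
        body = BACK_END.get(query.lower(), "No information found.")
        return TEMPLATE.format(query=html.escape(query), body=html.escape(body))

    print(generate_page("widget"))

Because each such page exists only for the duration of the request that produced it, content of this kind forms part of the invisible Web discussed below.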

Static Web pages provide the same generic information to everyone, while dynamically generated Web pages provide unique information, customised to the user's specific requirements. Available to everyone, and for indexing to all search engines, static Web pages together constitute the visible Web. This is what researchers at the NEC Research Institute refer to as the ``publicly indexable World Wide Web'' (Lawrence and Giles, 1999).

The invisible Web comprises Web pages with authorisation requirements, pages excluded from indexing using the robots exclusion meta tag, and information that resides within databases and will only ever be temporarily present on the Web as dynamically generated Web pages (see Table I).

The first NEC study (Lawrence and Giles, 1998) estimated that the visible Web contained at least 320 million Web pages in December 1997, whilst the second study (Lawrence and Giles, 1999) estimated that the visible Web had risen to 800 million Web pages, representing six terabytes of text data, as of February 1999. Owing to its highly disparate structure and range of data types, there has been as yet no scientific research conducted to determine the size of the invisible Web.

However, most publishers distribute their data on the Web by integrating huge databases, often gigabytes in size, with a front-end search interface. By virtue of its commercial, professionally published origin, such information is typically of high value and more highly structured and indexed than the visible Web. The user's search enquiry will generate customised, as opposed to generic, results. Therefore, for professional researchers, it can be said that information is increasingly accessed via the Web, rather than on it.

Nonetheless, the ``visible'' Web constitutes a significant contribution to the dissemination of human knowledge, and as the NEC studies acknowledged, ``much of [this] material is not available in traditional databases''. It is no surprise that several surveys, such as Nielsen NetRatings or Media Matrix (www.mediamatrix.com), consistently show that search engines are amongst the most popular destination sites on the Web.

Table I Comparison of static and dynamically generated Web pages

Static Web pages          Dynamic Web pages
Manually produced         Computer generated
Generic information       Customised information
Most are indexable        Not indexable

Web directories and search engines

Web directories explained
What is the difference between a Web directory and a search engine? A Web directory is:
. a pre-defined list of Web sites;
. compiled by human editors;
. categorised according to subject/topic;
. selective.

Because humans compile Web directories, a qualitative decision concerning the content on each listed Web site has already been made. Consequently Web directories are popular with Internet users looking for particular information because they feel that they have a head start in identifying ``the best of the Web'' for the topic that they are interested in.

In using a Web directory the user can navigate through the listings or search across the entire directory (see Appendix). The major Web directories also license search engine indexes to provide secondary results whenever their human-compiled directory fails to produce matching results to the user's query. For example, the world's largest Web directory, Yahoo!, licenses the Inktomi search index for just this purpose.

As a result of the manual compilation process, Web sites that have been indexed by Web directories will remain listed within that directory unless, in the highly unlikely event, they are manually removed. This permanent presence is not guaranteed for a listing within a search engine index, thus making a listing within a popular Web directory such as Yahoo! highly desirable.

Broadly speaking, any Web site that comprises several pages of organised links can be considered a Web directory. Many individuals, whether experts in their field or those passionate about a particular subject, have compiled such sites. One such voluntary Web directory which has exploded to global status, becoming a real rival to world-leader Yahoo!, is the Open Directory. Other Web directories of specific relevance to information professionals include:
. Sheila Webber's excellent Business Information Sources on the Internet - www.dis.strath.ac.uk/business
. Business Researcher's Interests - www.brint.com/interest.html

Search engines explained
When using a search engine, the user is searching a database of indexed Web sites. All search engines have three primary components:
(1) a ``spider'' that examines Web sites;
(2) an index/database of Web site listings;
(3) interrogation/retrieval software.
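
A toy sketch may help to make these three components concrete. The miniature ``Web'' below is an invented in-memory dictionary; a real spider would fetch pages over HTTP, run continuously and respect the robots exclusion conventions mentioned earlier.

    # Toy spider, index and retrieval; the pages and links are invented.
    from collections import deque

    PAGES = {  # URL -> (page text, outgoing links)
        "http://a.example": ("Business information sources on the Internet", ["http://b.example"]),
        "http://b.example": ("Searching the Web for company news", ["http://a.example"]),
    }

    def spider(seed_urls):
        """Component 1: visit pages breadth-first, following links,
        and build Component 2, an inverted index of word -> URLs."""
        index, seen, queue = {}, set(), deque(seed_urls)
        while queue:
            url = queue.popleft()
            if url in seen or url not in PAGES:
                continue
            seen.add(url)
            text, links = PAGES[url]
            for word in text.lower().split():
                index.setdefault(word, set()).add(url)
            queue.extend(links)
        return index

    def retrieve(index, query):
        """Component 3 in its simplest form: pages containing every query term."""
        hits = [index.get(word, set()) for word in query.lower().split()]
        return set.intersection(*hits) if hits else set()

    index = spider(["http://a.example"])
    print(retrieve(index, "business Internet"))   # {'http://a.example'}

Each of the three components is discussed in more detail below.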

Search engine spiders
Search engine databases are primarily built up by ``spiders''. Despatched on an automatic and frequent basis by search engines, spiders are programs that search the Web for new Web pages, index words and/or links on those pages, and match the indexed words with the URL of the page on which they appear.

Search engine index/database
This is the main element of any search engine - it is what the user interrogates. Once it could be said that these indexes were built along similar guidelines, with the location and frequency of words the primary determining factors in results relevance ranking. However, during 1998 a number of new search engine providers appeared. These companies built their indexes according to differing criteria. The Direct Hit index is based on the ``popularity'' of a Web site, the Google index is based on the number of links between pages and sites, whilst the RealNames index is a pay-for service that enables companies to register keywords to protect their brands and company identity. Each of these approaches is discussed in further detail below.

Interrogation/retrieval software
All search engines have their own customised software to interrogate their databases. Essentially, though, they operate according to similar principles: any Web site which contains words or terms that match the user's search query will be presented in the list of results shown on screen to the user. Ranking each of these matching Web sites by relevance is determined by algorithms that analyse the location and frequency of the user's search terms against this list of matching Web sites. The nuances of how these algorithms work vary between search engines, which is one reason for the different results that users usually experience when running the same search across different search engines. However, a much more important reason for these search results differences is that ``the [content] overlap between the engines remains relatively low'' (Lawrence and Giles, 1999).
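
The ``location and frequency'' principle just described can be expressed as a small scoring routine. The weighting below (a fixed bonus for a term appearing in the title, plus a count of occurrences in the body) is an invented illustration, not any particular engine's algorithm.

    # Illustrative location-and-frequency relevance ranking; the weights are invented.
    DOCS = {
        "http://a.example": {"title": "web searching guide",
                             "body": "searching the web with search engines"},
        "http://b.example": {"title": "gardening tips",
                             "body": "web searching is covered briefly here"},
    }

    def score(doc, terms):
        total = 0.0
        for term in terms:
            total += 3.0 * doc["title"].split().count(term)   # location: title matches weigh more
            total += 1.0 * doc["body"].split().count(term)    # frequency: each body occurrence counts
        return total

    def rank(query):
        terms = query.lower().split()
        scored = [(score(doc, terms), url) for url, doc in DOCS.items()]
        return [url for s, url in sorted(scored, reverse=True) if s > 0]

    print(rank("web searching"))   # ['http://a.example', 'http://b.example']

Two engines applying even slightly different weights to the same index would already return differently ordered results, which is part of the point made above about the nuances of ranking algorithms.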

Portalisation

In a previous research paper (Green, 1998b) in which I examined the relationship between portals and information providers, I defined portals as ``those Web sites that are jockeying for pole position as starting points for the Internet user's experience''. Unlike today, where portals are the focal point for carefully crafted multi-million dollar e-commerce strategies, portals initially evolved gradually from the early experience (1994) of three types of Internet company. Each of these companies performed a distinct role in the information distribution chain that defined the Internet user's experience. The three types are discussed below.

Internet service/access providers
AOL and CompuServe (which was later acquired by AOL) recognised that they were at the bottom end of the Internet user's ``experience value-chain''. They merely provided connectivity, which was plentiful. Such a low-value evaluation would translate into low margins unless something was done.

Search sites
Yahoo!, Infoseek (now Go), Lycos and Excite all recognised that usage of their services was transitory: once the user had found what they were looking for, they would click on the result and move off to another Web site. Such narrowly defined usage was not an attractive business proposition for either investors or advertisers.

Browser providers
Millions of users were greeted with the home pages of Netscape (Netcenter) and Microsoft (MSN) as they switched on their PCs each morning. Most users did not know how, or could not be bothered, to change the default setting in their browsers. This behavioural inertia was primarily responsible for the huge traffic figures to these two Web sites.

Each type of Internet company had something to offer the others. Together they controlled most aspects of the information distribution chain. All wanted to increase their ``stickiness'' (the length of time each user spent at their Web site before clicking off onto another site). All wanted to attract as many ``eyeballs''/users as possible.

To achieve these two goals it was recognised that the user's primary destination Web site had to offer as much value to the user as possible. As users were ceding control of their initial Internet experiences (they still do), Microsoft, Netscape, AOL, etc. were able to control users' primary destination Web site each time the user logged on. As search sites were the first desired destination sites of users, these companies licensed search engine and Web directory providers to provide search services. Thus emerged the concept of the portal.

Portals became a huge success. However, in the rush for users and their attendant dollars, many search providers began to neglect their core service - their search index or Web directory. From late 1996 until September 1997 the growth of the main search engine indexes and Web directories was negligible (Sullivan, 1999a), despite the continued relentless growth of the Web. The spurt in growth of most search engine indexes in the first half of 1996 was primarily attributable to the arrival of AltaVista at the end of 1995, with the largest search index at the time. This period was also marked by a distinct lack of search technology innovation. Although meta search engines such as Mamma, Dogpile and Metacrawler first rose to prominence during this period, their search functionality was essentially based on the ``location and frequency of keywords'' approach that was developed by the main search engines. Meanwhile, the distinction between search engines and Web directories became somewhat blurred for the user as the search engines licensed Web directories and vice versa, whilst AOL, Netscape, Microsoft, etc. licensed both. This cross-fertilisation resulted in portals becoming all-encompassing search sites. It was not until the arrival of a second generation of search engine providers in 1998 that new approaches to indexing and searching the Web became available.

Evolution of search technology

Anyone who has ever seen a diagrammatic representation of the evolution of life on our planet, as we currently understand it, would notice that basic cellular lifeforms were around for a very long time before the evolution of more complex biological entities. However, once this point had been reached, the rapid diversification of life into ever more organised and intelligent forms occurred in ever decreasing timescales. The same can be said for Web search technology. By focusing their efforts on e-commerce and portalisation, the first generation of search sites - the ``big five'' - neglected their core search functionality. While they reigned supreme for several years, this neglect, and failure to appropriately adapt to a changing environment, created niche opportunities which were soon exploited by new types of search provider.

Meta search engines
Meta search engines enable the user to search across several search engines and Web directories simultaneously. Some of the most popular meta search engines include the following.

Dogpile (www.dogpile.com) searches 14 different engines and directories but does not eliminate duplicates. It was acquired in August 1999 by search engine GO2 for US$40 million in stock and a further US$15 million in cash. At the time of acquisition, Dogpile had only five employees (The Wall Street Journal, 5 August 1999).

Mamma (www.mamma.com) searches seven engines but removes duplicates and re-orders results according to its own relevance ranking algorithm; a sketch of this approach appears after the list below.

Others include:
. 2Q - www.2q.com
. Infind - www.infind.com
. Isleuth - www.isleuth.com
. Surfy - www.surfy.com
. Webtaxi - www.webtaxi.com
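
As flagged above, the de-duplicate-and-re-rank behaviour of a Mamma-style meta search engine can be sketched as follows. The two underlying ``engines'' are stand-in functions returning canned result lists; a real implementation would submit the query to the remote services over HTTP and parse their result pages.

    # Sketch of a meta search engine; the underlying engines are faked for illustration.
    def engine_a(query):
        return ["http://a.example/1", "http://shared.example", "http://a.example/2"]

    def engine_b(query):
        return ["http://shared.example", "http://b.example/1"]

    def metasearch(query, engines):
        merged = {}                     # URL -> (number of engines returning it, best position)
        for engine in engines:
            for position, url in enumerate(engine(query)):
                count, best = merged.get(url, (0, position))
                merged[url] = (count + 1, min(best, position))
        # Duplicates collapse into one entry; re-rank by agreement, then by best position.
        return sorted(merged, key=lambda url: (-merged[url][0], merged[url][1]))

    print(metasearch("web searching", [engine_a, engine_b]))
    # ['http://shared.example', 'http://a.example/1', 'http://b.example/1', 'http://a.example/2']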

Popularity based analysis
The first generation of search engines created indexes by spidering Web sites, analysing the location and frequency of words. Web directories were compiled manually. Launched in April 1998, Direct Hit (www.directhit.com) represented a radical new departure from these approaches, and dubbed its methodology ``the third way''. The system was claimed to be user-controlled, as the ranking of results is based on the Web sites that users have visited. Like many of the second-generation search technologies, it is not a separate search engine with its own index that can be accessed directly. Instead it provides a second-level analysis of search results where it is incorporated within existing search engines, one being HotBot.

Prior to licensing Direct Hit, HotBot returned a list of results based on the standard methodology of matching search terms with content on the Web sites in its index. Now, Direct Hit will run a second-level analysis on the user's set of results. From its database it will identify those Web sites which are popular, according to the number of visits that each Web site has received, and then re-rank the search results accordingly, with the most popular Web sites that match your search term presented first in the list of results.

However, the popularity of a Web site can be largely determined by its search engine rankings, and there are all sorts of ways to manipulate those if you have a good understanding of how search engines work. Direct Hit tries to compensate for this by boosting obscure sites. For example, a Web site could provide lots of valuable information about a particular topic but could nevertheless feature further down the list of results of search engines. If a searcher has been tenacious enough to dig down as far as result number 100 (presumably an information professional), and click on it, then Direct Hit's algorithms will give this site a big boost up the list of results next time it appears in other searchers' lists of results. If other users do not click on this obscure Web site, then it will drop down the list of results for subsequent searchers, because it did not prove popular (Green, 1999a). Since its launch, the company has successfully licensed its technology to ten search sites including AOL, HotBot, Lycos, MSN and LookSmart, and it is available within Netscape Communicator 4.5 and Apple's Sherlock search utility.
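
The click-popularity idea amounts to a second-level re-ranking of a result list that some other engine has already produced. The sketch below is a loose paraphrase of that idea and not Direct Hit's actual algorithm: clicks are tallied per URL, and pages that earlier searchers selected float towards the top for later searchers.

    # Loose sketch of second-level popularity re-ranking; not Direct Hit's real algorithm.
    from collections import Counter

    clicks = Counter()                  # URL -> number of times searchers have selected it

    def record_click(url):
        clicks[url] += 1                # an obscure but useful page gains ground each time it is chosen

    def rerank(results):
        """Stable sort: the most-clicked results rise, and pages
        nobody selects drift back down for subsequent searchers."""
        return sorted(results, key=lambda url: -clicks[url])

    results = ["http://big-brand.example", "http://obscure-but-useful.example"]
    record_click("http://obscure-but-useful.example")
    record_click("http://obscure-but-useful.example")
    print(rerank(results))              # the obscure page now appears first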

Natural language searching
As already discussed, the first generation of search engines operated by matching the keywords submitted by the user to the contents of the Web pages in their databases. They did not consider the context of the search terms, i.e. the syntactical relationships between the search terms and other vocabulary within their index. Furthermore, they search for literal exact matches and therefore fail to consider semantics or use thesauri (Green, 1999b). Most search engines also automatically ignore frequently used words such as ``or'', ``to'', ``not'', etc. In June 1998 there was a major breakthrough in addressing these limitations. Two new search engines were launched within weeks of each other. Both offered natural language searching, but adopted different philosophies in developing their solutions.

Ask Jeeves (www.askjeeves.com) was launched on 1 June 1998. Billed as ``the first natural language search agent on the Internet'', it operates by matching a user's query against a database of 7 million template questions. If there is no match then the user is presented with the nearest alternatives from the database and asked to select the most appropriate. It will also conduct a metasearch across AltaVista, Go (Infoseek), Lycos and Yahoo!. It has now been licensed by AltaVista for its own search site. However, artificial intelligence (AI) experts have criticised the company's natural language claims. It was named after the resourceful butler in P.G. Wodehouse's novels.

The Electric Monk (www.electricmonk.com) was launched a few weeks later. This search service conducts a syntactical analysis of the query using natural language algorithms. These algorithms will also make use of thesauri to consider alternative related words. The natural language search is then translated into a complex Boolean query and submitted to AltaVista. It was named after a character in a Douglas Adams novel.

Links-based analysis
The first-generation search engines have focused on building huge indexes with the goal of answering every possible kind of general query. They focus on the content of each specific page they visit, with little consideration of how these pages interrelate and connect. As already discussed, the indexing methodologies they use fail to consider the complexity of human language: syntax (sentence structure), synonyms (different words for the same meaning) and polysemy (different meanings of the same word).

Links-based analysis attempts to overcome these problems by examining the relationships between pages - the 1 billion or so hyperlinks that weave the Web together (Clever Team, 1999). By examining how Web pages link together, links-based analysis offers methodologies for identifying authoritative sources of topic-specific information, eliciting high-quality, highly relevant results to users' queries. Not surprisingly, links-based analysis has quickly gained prominence amongst Internet users and is attracting a lot of attention from both computer information scientists and corporate Internet investors.

Google (www.google.com), like Yahoo!, was developed by students at Stanford University. This technology uses a methodology known as PageRank (named after Larry Page, one of its creators) to crawl the Web and analyse how Web sites link to each other. Results are ranked on importance, i.e. how many other Web sites link to them. If you, as a Web site author, have included hyperlinks to other sites that you deem important, then you have exercised some editorial judgement. In the same way that Web directories, such as Yahoo!, are compiled by editors on a manual basis, Google seeks to capitalise on the editorial judgement of millions of Web site authors on an automated basis. As a result, of course, it can analyse far more Web sites than the humans who build directories such as Yahoo!. In fact, unlike search engines that become less useful the larger their index of Web sites becomes, Google claims to return even better results with a bigger index. Google also seeks to capitalise on the accompanying editorial commentary by processing the text around each hyperlink (Green, 1999a).

Links-based analysis does feature in the relevance ranking algorithms of some search engine providers such as Excite and HotBot. However, Google is the only search engine exclusively focused on links-based searching that is currently publicly available for Web-wide searching. The company estimates that its index is between 70 million and 100 million pages but, through the links analysis, enables users to reach an estimated 300 million Web pages. Google's combination of extensive reach and greater accuracy of results has quickly catapulted this relative latecomer to top ten status in search engine popularity. Data released by Nielsen NetRatings in August 1999 showed that Google gained the largest month-on-month increase in unique audience figures. Visits to Google increased by a massive 88 per cent, compared to the average of 2.1 per cent for the other top ten search engines for that month. Later that month Google signed its first licensing deal with AOL subsidiary Netscape, to be the main search provider on the Netcenter portal.
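
The link-counting intuition behind PageRank described above can be made concrete with the standard iterative formulation. The four-page link graph, damping factor and iteration count below are invented, textbook values; Google's production system is naturally far more elaborate, not least in how it folds in the text surrounding each hyperlink.

    # Minimal iterative PageRank over an invented four-page link graph.
    LINKS = {            # page -> pages it links to
        "a": ["b", "c"],
        "b": ["c"],
        "c": ["a"],
        "d": ["c"],
    }

    def pagerank(links, damping=0.85, iterations=50):
        pages = list(links)
        rank = {page: 1.0 / len(pages) for page in pages}
        for _ in range(iterations):
            new = {page: (1.0 - damping) / len(pages) for page in pages}
            for page, outgoing in links.items():
                share = rank[page] / len(outgoing)   # a page passes its importance to the pages it cites
                for target in outgoing:
                    new[target] += damping * share
            rank = new
        return rank

    print(sorted(pagerank(LINKS).items(), key=lambda item: -item[1]))
    # page "c", the most linked-to, comes out on top

A page linked to by an already important page inherits more weight than one linked to by an obscure page, which is the sense in which the millions of small editorial judgements mentioned above are aggregated.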

Clever (www.almaden.ibm.com/cs/k53/clever.html) came about when a team of IBM researchers examining search engine effectiveness developed a system that was referred to internally as HITS (Hyperlink-Induced Topic Search). The project later became known as ``Clever''.

Related to the scientific citation index (the study of how scientific papers refer to one another), Clever examines the hypertext context of a keyword search. Like Google, Clever examines hyperlinks and the surrounding commentary. Unlike Google, which crawls the Web, Clever first submits the query to a search engine such as AltaVista, and then conducts its links analysis on a set of pages from the results produced by that search engine - typically about 200 pages. By adding all the pages that link to and from these 200 pages, Clever creates what is called a root set - usually between 1,000 and 5,000 pages. Using linear algebraic analysis, Clever then begins an iterative process of analysing this root set of results to divide them into two categories: authorities and hubs (Clever Team, 1999). Authorities are Web pages about a particular topic that have lots of links to them, i.e. they are authoritative sources of information. Hubs are Web pages which are a guide to, or list, authoritative sources, i.e. they do the most citing.

Hubs are similar to portals in that they act as a jump point for anyone interested in the particular topic they cover. Unlike Google, which retains rankings for individual Web sites in its index, independently of the user's search query, Clever will always create a new root set for each query and prioritise each page according to the context of the specific search statement. While not yet available for Web-wide searching, IBM's research team is currently refining the Clever search engine and has been experimenting with Clever to automatically develop Web directories.
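
The iterative hub-and-authority calculation at the heart of HITS can be sketched as below. The tiny graph stands in for a query's root set, and the normalisation is the simplest that works; the published algorithm operates on the few thousand pages gathered as described above.

    # Simplified HITS: alternating hub and authority updates over an invented root set.
    import math

    LINKS = {            # page -> pages it points to
        "hub1": ["auth1", "auth2"],
        "hub2": ["auth1", "auth2", "auth3"],
        "auth1": [],
        "auth2": ["auth1"],
        "auth3": [],
    }

    def hits(links, iterations=20):
        hub = {page: 1.0 for page in links}
        auth = {page: 1.0 for page in links}
        for _ in range(iterations):
            for page in links:           # a good authority is cited by good hubs...
                auth[page] = sum(hub[p] for p in links if page in links[p])
            for page in links:           # ...and a good hub cites good authorities
                hub[page] = sum(auth[q] for q in links[page])
            for scores in (auth, hub):   # normalise so the scores stay bounded
                norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
                for page in scores:
                    scores[page] /= norm
        return hub, auth

    hub, auth = hits(LINKS)
    print(max(auth, key=auth.get), max(hub, key=hub.get))   # auth1 is the top authority, hub2 the top hub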

Focused Crawler is another search engine technology that is being developed by IBM. However, it is not yet as developed as Clever. Unlike other search engines (including Google and Clever), which perform an analysis after they have crawled through a collection of hyperlinks, Focused Crawler, as its name suggests, seeks to identify highly relevant collections of data for topic-specific searching by crawling the Web with a specific goal, ignoring irrelevant sections of the Web. In other words, it only crawls Web sites of relevance to the user's query, rather than identifying a subset of relevant Web sites as a result of an analysis of a larger set of crawled sites. Focused Crawler crawls the Web guided by a relevance and popularity mechanism that has two parts: a classifier that evaluates the relevance of a Web page to the user's search query, and a distiller that identifies ``hypertext nodes that are great access points to many relevant pages within a few links'' (Chakrabarti et al., n.d.).

Newsgroup searching
The Internet delivers two primary benefits: content and connectivity. Although distinct, the two are often closely interrelated. Portals are a perfect example: they represent the synergistic exploitation of both content and connectivity to create e-commerce opportunities. However, while the Web is the primary repository of human knowledge on the Internet, it is not the only one. Newsgroups, where collections of individuals share their experiences, knowledge and opinions on a subject of common interest, constitute an important area of consideration for information retrieval. The distinction between the Web and newsgroups is that the Web primarily represents a large body of explicit human knowledge whilst newsgroups primarily represent a large body of implicit knowledge. Explicit, codifiable knowledge can help individuals and organisations, but it is implicit knowledge - the realm of experience, creativity and ideas - that offers the greatest potential. In an increasingly knowledge-based information society, it will be implicit knowledge that will be needed to successfully exploit explicit knowledge to create new opportunities and develop adaptability.

Considering this, the role of specialised newsgroup search engines will become more important as individuals use the Internet to seek out experts (or indeed anyone who is qualified) to help with their problems. This prediction is based not merely on a belief in human altruism, but also on phenomena such as the emergent sociology of citations on the Web (Chakrabarti et al., n.d.), the explosive growth of the volunteer-based Open Directory (see Appendix) and the emphasis on people/expert connectivity in many corporate intranets.

There are literally thousands of newsgroups covering all manner of topics. These are organised in a tree-like structure with eight main categories: Comp, Rec, Sci, Soc, Talk, News, Alt and Misc. Owing to the huge number of groups available, specialised search engines have emerged to identify relevant groups and postings.

Deja News (www.dejanews.com) is probably the most widely known newsgroup search engine. It contains a directory of selected newsgroups which users can browse through or search for a particular group, topic or posting. A power search facility enables users to search by author, date and language, plus options on how the results are displayed.

Reference.com (www.reference.com) is similar to Deja News, but also enables searches in Web forums (Web-based bulletin boards) and mailing lists (where each posting is sent to your e-mail address). Users also have the option to save searches.

Liszt's Newsgroups Directory (http://liszt.com/news) uses Deja News for searching on newsgroups, but has its own extensive list of mailing lists and IRC channels. There is also a directory of newsgroups with descriptions for most.

Subject-specific indexes

Company information
There are many sites (usually from company and business information providers) that any researcher can visit. The amount and quality of information provided for free varies. However, all such sites are Web-enabled versions of commercial databases, rather than true search indexes. In a test of the ability of leading search engines and directories to deliver relevant results for searches on company names, conducted by the online industry publication Search Engine Watch, HotBot and Google were ranked joint first among search engines while Netscape Search was ranked first among Web directories (Sullivan, 1999b). However, company research is not the exclusive focus of these search sites.

Launched in August 1999, 1Jump (www.1jump.com) is a specialised search index that focuses exclusively on information and news about companies. In addition to providing news, this search engine also provides details of company executives (titles, age, background and e-mail addresses), patents (every patent owned by a company) and ``peers'' (subsidiaries, parents and related companies). It also enables the user to visit other Web pages that are relevant to a particular company, e.g. an industry association.

Multimedia and image files
According to industry analyst organisation Future Image, in its report ``Comparative Evaluation of Web Image Search Engines'', almost 70 per cent of the Web is non-textual. Considering that humans assimilate and process information in visual format more readily than in textual format, and the greater availability of broadband capacity in the near future, the role of multimedia search engines will continue to grow. The three main specialised search engines in this area are:
(1) Ditto - www.ditto.com
(2) Scour - www.scour.net
(3) AltaVista PhotoFinder - www.altavista.com

Some other specialised search indexes include:
. Finding People - www.whowhere.com
. Law - www.fastsearch.com/law
. Health - www.drkoop.com
. Movies - www.imdb.com/search
. Ask an Expert - www.vrd.org/locator/subject.html
. Information Please - www.infoplease.com

Search utilities and intelligent agents

As already discussed, meta-search sites such as Dogpile and Mamma have grown in popularity as they allow users to search across different search indexes simultaneously, with duplicates removed and results re-ranked (depending on the meta-search service used). Search utilities represent the logical evolution of this functionality. Unlike meta-search engines, where the processing power to refine results still remains on the server the user is interrogating, search utilities are programs that are installed on to the user's hard drive. By shifting processing power away from the server, and on to the user's own desktop, search utilities offer a much greater range of search and results analysis functionality.

Like several of the second-generation search technologies that have emerged (Electric Monk, Google), many of these search utilities incorporate intelligent agents (or bots). Indeed, many of the powerful features offered by search utilities - such as language-independent searching, filtering, automatic refinement of results and document summaries, active hyperlinking of query words and live highlighting of search terms - are possible because of the nature of intelligent agents. Unlike a standard software program that will execute specific functions within clearly defined parameters, agents/bots:
. are adaptive - they can interpret monitored events to make appropriate decisions;
. are self-organising - they assimilate both information and experience;
. can communicate with both the user and other bots (Green, 1999c).

Agents can search across a wide range of document types and formats. They can provide a uniform interface for search queries across different sources and are true ``infomediaries'' in that they can identify and search appropriate resources that may or may not be known to the searcher. The adaptive element of intelligent agents is central to the functionality of many search products that incorporate agents. The following popular search utilities, which all contain agent technology, are available as free downloads and as more comprehensive paid versions.

Mata Hari (www.thewebtools.com) can learn one set of power search commands and then automatically translate these for each search service/database that it queries for the user.

BullsEye Pro (www.intelliseek.com) incorporates 11 different intelligent agents, including technology from Verity, to conduct what it calls ``Web mining''. The different agents are used to target specific types of information, such as business news, in over 450 sources on both the visible and invisible Web. It will automatically run searches and allows import/export of searches to other users, whilst users can choose to receive change alerts by HTML e-mail, pager or other hand-held data devices.

Copernic (www.copernic.com) can translate a search statement for different services and then simultaneously submit the query to these search engines, Web directories and databases. There are also about 20 categories, such as business and finance, science, etc., with predetermined Web sources to search in.

Recognising the advantages offered by search utilities, some search providers have released a variety of free basic search utility programs as ``plug-ins''. As the name suggests, once installed, they are incorporated within the user's Web browser and enable the search engine provider to offer more features. Search providers that have released search utilities include Infoseek (Infoseek Express), AltaVista (AltaVista Discovery) and, more recently, Lycos (See More).

A common function of agents is that they allow the user to specify a high-level goal instead of issuing explicit instructions, leaving the ``how'' and ``when'' decisions to the agent. This, combined with their ability to search across data in unstructured formats, to automatically learn and adapt to user preferences and to identify patterns, is giving agent technology an ever increasing role in Web searching.

XML

HTML is dead. Since XML was completed by the World Wide Web Consortium (W3C, the body responsible for developing technical standards for the Web) in early 1998, it has attracted an almost evangelical response. Most Web pages are currently produced in hypertext mark-up language (HTML). While HTML's ease of use fuelled its widespread adoption, it is somewhat limited in that it is primarily concerned with the design/layout of a Web page, rather than the information that actually appears on that page. Considering that a primary use of the Web is for information retrieval, this design focus is something of a drawback. HTML is a spin-off from SGML, a much more robust mark-up language that was approved by the International Organisation for Standardisation (ISO) in 1986. However, SGML is too complex for the Web. Seeking to address the limitations of HTML, the W3C developed XML as a subset of SGML that would address the semantic and structural considerations of information retrieval and exchange and that would work on the Web.

XML is an open technology that offers tremendous possibilities for electronic publishing, e-commerce, information retrieval and data exchange. It consists of rules that enable anyone to create their own mark-up language. XML describes information using pairs of tags that are nested inside one another to multiple levels (Bosak and Bray, 1999). These create a tree structure of nested hierarchies. This convention allows users direct access to just the particular segment of the information that they are interested in; for example, hyperlinks can go through to the relevant section of a document rather than the entire document. It also enables powerful structured searching, akin to database field searching, but on textual Web pages. In other words, XML not only enables explicit description of Web page content, but also describes the rules for manipulating each data set contained within the information. This enables a small program such as a Java script to process the information on the user's local hard drive according to their requirements, rather than the user requesting a new Web page from the central server. Multiplied across millions of Web users, this capability will dramatically decrease the demands on Web servers and improve network traffic (Green, 1999d). Based on open standards, XML will allow data exchange between different computer systems regardless of operating system or hardware.
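
The nested-tag and field-searching ideas described above can be illustrated with a short fragment. The document and its tag names are hypothetical - they do not follow any agreed industry standard - but the example shows how a small local program can pull out just the segment of interest rather than the whole document.

    # Illustrative XML field searching; the document and tag names are invented.
    import xml.etree.ElementTree as ET

    DOCUMENT = """
    <report>
      <company>
        <name>Example Widgets Ltd</name>
        <sector>Manufacturing</sector>
        <headline>Example Widgets reports record sales</headline>
      </company>
      <company>
        <name>Sample Gadgets plc</name>
        <sector>Retail</sector>
        <headline>Sample Gadgets opens new stores</headline>
      </company>
    </report>
    """

    def search_by_field(xml_text, field, value):
        """Return only the matching segments, akin to database
        field searching performed on a textual Web page."""
        root = ET.fromstring(xml_text)
        return [company.findtext("headline")
                for company in root.findall("company")
                if (company.findtext(field) or "").lower() == value.lower()]

    print(search_by_field(DOCUMENT, "sector", "retail"))   # ['Sample Gadgets opens new stores']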

As XML is also based on Unicode, a character encoding system that supports the intermingling of text in all of the world's major languages, it will also allow the exchange of information across national and cultural boundaries (Bosak and Bray, 1999). Using XML style sheets (XSL), publishers will also be able to automatically redesign their content for various devices. There are even style sheets that will read the text of the Web page aloud, which is of great benefit to the visually impaired.

However, while XML will deliver great benefits for searching, publishing and exchanging information, these benefits will not be realised without some effort.

First, each industry will need to agree on standards for the tags used to describe information that is specific to its discipline. Mathematicians, genealogists and chemists have already agreed on standards to facilitate the realisation of XML's benefits. In other areas, standards are yet to be agreed upon and there will be struggles over who controls the standard (``XML and search'', n.d.).

Second, Web publishers will require greater sophistication than simply a knowledge of HTML, graphics and a few other applications. They will need new XML tools, and computer programmers and information scientists who will be able to interpret the content of the information being published and provide the appropriate data trees/nested hierarchies, hyperlink structures, meta data, style sheets and document type definitions (DTDs).

Finally, search engines will need to learn the standard tag structure that has been agreed by each industry/interest group. They will also need to change their search interfaces to offer users the choice between text searching and field/tag searching. Currently, text-based search engines will return a list of documents that will contain some information relating to the user's request. XML-enabled query searching, like any other query language, will return the relevant data that have been extracted from a document, rather than the entire document. Such query-based searching can also be used to perform computational analysis and manipulation of presentation on retrieved data items (``XML and search'', n.d.).

To facilitate the transition to XML, in August 1999 the W3C released a hybrid of HTML 4.0 and XML (XHTML 1.0) for review. It is highly unlikely that there will ever be an HTML 5.0. Earlier, in April, IBM launched the Internet's first search engine that is exclusively focused on XML data, called xCentral. This search engine is available from IBM's XML Web site.

The Future

Micro-payments for searches?
Jakob Nielsen predicts ``. . . that in the future we will have micro-payments for search. Realistically, to provide quality information over the long term requires serious effort. Companies have to be compensated for providing that'' (Janah, 1999). If users are to be charged micro-payments then they are going to start demanding better refinement technologies for their searches. Much of the search technology innovation over the last 18 months has come from second-generation search companies. By focusing on portalisation and e-commerce, the first generation of search firms have ceded control of technological innovation to their newer equivalents.

According to data from PricewaterhouseCoopers and research firm IPO Monitor, in the last year search engine companies have raised more than $274.7 million in private funds and another $282 million in public offerings. Almost all of these funds are going to this second generation of search firms (Investor's Business Daily, 1999).

Searching outsourced?
It is interesting that none of the second-generation search companies has adopted the portal model so beloved of their predecessors, the ``big five''. Also, if the main portals wish to introduce micro-payments for searches, these second-generation search companies will provide the refinement technologies. In the evolution of Web searching this has created a symbiotic relationship between the two generations: to succeed in attracting as many users as possible and to generate as many e-commerce sales opportunities as possible, the first generation of search firms will continue to focus their efforts on portalisation and e-commerce. However, they will need the new search technology offered by second-generation firms to provide consumers with the search capabilities they demand - capabilities that in turn will fuel e-commerce consumer buying. The second-generation search firms need the portals to attract the consumers who will use their search services.

Taken to its logical conclusion, it is quite possible that one or more of the ``big five'' search portal firms will drop out of the sporadic yet ongoing search index size war. Instead, they may decide to contract out all search functionality to second-generation firms whose core focus is providing better search technology. By co-opting the Open Directory, and relegating the results from its own index to secondary status behind those from the Open Directory, Lycos has hinted at the shape of things that may yet come. However, until the market matures somewhat, the ``big five'' first-generation search portals may feel uncomfortable about completely relinquishing control of search functionality. Instead, they may develop their existing relationships with second-generation firms into an outsourced/partnership model with clearly defined service level agreements, etc. Such a strategic realignment of their business operations would be in line with current business process outsourcing (BPO) trends and would prove popular with their institutional investors.

Portals go mobile
Using wireless application protocol (WAP), search sites and publishers alike will be able to extend their reach beyond the PC to mobile phones and other hand-held devices such as PDAs. One such portal already launched is Zingo, which has been jointly developed by Lucent Technologies and Netscape. Aimed at telecommunications providers, Zingo also enables HTML pages to be converted into VXML (voice extensible mark-up language) for audio applications on hand-held devices. Coupled with the reduced bandwidth demands that XML promises to deliver, the future of information retrieval can be anywhere you need it.

Search engine standardisation
Launched by Danny Sullivan of Search Engine Watch (www.searchenginewatch.com), the Search Engine Standards Project aims to foster standards amongst the major search services. Participants include representatives from the largest Web search engines, academics and industry analysts. Some of the common standards that the project has helped to develop include a common syntax for the command to narrow a search to a specific Web site, and the ability for all major search sites to locate an exact URL within their indexes using the URL: command. Future proposals include additional commands for searching and meta tags for controlling search indexing robots.

This voluntary initiative parallels voluntary efforts to develop standardised XML tag sets for specific industries and interest groups. It would appear that the connectivity provided by the Internet is also encouraging greater collaboration in general. These and other collaborative efforts (such as the Open Directory) represent admirable attempts to create a degree of order.

Order vs. chaos
The tension between these two diametrically opposing forces can be witnessed on the Internet. The relentless growth in activity is reducing the Web to a state of digital chaos. Against this are the commendable efforts of paid indexers and volunteers attempting to create an ordered structure. Could the Web prove to be self-organising, just like any other biological system that evolves to greater complexity and organisation? In the 9 September 1999 issue of Nature, two surprising research papers were published (Albert et al., 1999; Huberman and Adamic, 1999). Mathematicians had expected the Internet to follow the model of random inanimate networks, but both studies discovered that the Internet did indeed appear to be ``evolving'' and that its growth resembled organic life. The Internet is evolving according to the universal ``power principle'' of physics. This power principle governs the order found in many things ranging from plants to galaxies. Knowing this will help search engine providers develop better algorithms that exploit the predictive behaviour of systems governed by the power principle. Now emerging from its nascent stages, the Web may evolve into a highly organised, vastly diverse and complicated system.

References

Albert, R., Jeong, H. et al. (1999), ``Diameter of the World-Wide Web'', Nature, 9 September.

Bosak, J. and Bray, T. (1999), ``XML and the second-generation Web'', Scientific American, May.

Chakrabarti, S., Van den Berg, M. and Dom, B. (n.d.), ``Focused crawling: a new approach to topic-specific Web resource discovery'', www.almaden.ibm.com/almaden/feat/www8

Clever Team (1999), ``Hypersearching the Web'', Scientific American, June.

Green, D. (1998a), ``Search insider'', Information World Review, Vol. 14, 1 November.

Green, D. (1998b), ``First through the portal: the business potential of highly trafficked Web sites'', Business Information Review, Vol. 15 No. 3.

Green, D. (1999a), ``Search insider'', Information World Review, Vol. 14, 6 April.

Green, D. (1999b), ``In search of success'', The Independent, 29 March.

Green, D. (1999c), ``Search insider'', Information World Review, Vol. 14, 7 May.

Green, D. (1999d), ``Here come the X Files'', Information World Review, February.

Huberman, B.A. and Adamic, L.A. (1999), ``Growth dynamics of the World Wide Web'', Nature, 9 September.

Investor's Business Daily (1999), ``Computers and technology - investors betting on big hit in new Web search engines'', Investor's Business Daily, 2 August.

Janah, M. (1999), ``Web directories profit motive complicates searches by consumers'', San Jose Mercury News, 16 August.

Lawrence, S. and Giles, C.L. (1998), ``Accessibility of information on the Web'', Science, Vol. 280, April.

Lawrence, S. and Giles, C.L. (1999), ``Accessibility of information on the Web'', Nature, 8 July.

Sullivan, D. (1999a), ``Search engine sizes'', Search Engine Watch, September, www.searchenginewatch.com/reports/sizes.html

Sullivan, D. (1999b), ``Company names test'', Search Engine Watch, August, www.searchenginewatch.com/reports/names.html

``The top ten referring search engines'' (1999), September, www.statmarket.com

The Wall Street Journal (1999), ``Web-search firm acquired in $55 million transaction'', The Wall Street Journal, 5 August.

``XML and search'' (n.d.), Search Tools, www.searchtools.com/related/xml.html

Appendix. Company profiles

AltaVista
Launched by Digital Equipment Corporation in December 1995 as the largest search engine on the Web. The launch of such a large index forced all the other major search engines to increase the size of their own indexes during 1996. Has consistently remained one of the largest search engines. This, combined with its range of powerful search commands, has ensured its popularity, especially with researchers. In 1998 Digital Equipment Corp. was acquired by Compaq, which spun off AltaVista as a separate company in January 1999. Later that year, in June, Internet investment company CMGI (which also has a shareholding in the Lycos Network, which in turn owns HotBot) acquired a majority 83 per cent stockholding in AltaVista. CMGI has announced plans to add new services to AltaVista, including an updated index that refreshes at least every 28 days. A multimedia search engine has already been added. CMGI plans to publicly list AltaVista at some point in the future.
www.altavista.com

Ask Jeeves
Launched in June 1998 as ``the first natural language search agent on the Internet''. Operates by matching a user's query against a database of 7 million template questions, presenting variant questions if there is no match. It will also conduct a metasearch across AltaVista, Go (Infoseek), Lycos and Yahoo! Has been licensed by AltaVista for its own search site.
www.askjeeves.com

Direct Hit
Launched in April 1998. Offers co-branded search solutions to other search engine providers. It operates by providing a second-level ranking of the user's search results on the basis of ``popularity''. The company currently licenses its technology to ten search sites including AOL, HotBot, Lycos, MSN and LookSmart. In August 1999 the company announced that it had raised almost $27 million from venture capital firms and private investors.
www.directhit.com

Excite
Launched in late 1995. This search engine was immediately popular with users due to its large index and integration of non-Web material such as company information. The company acquired two of its competitors, Magellan and WebCrawler, during 1996. In January 1999 Excite was purchased by high-speed cable Internet access provider @Home and the company became known as Excite@Home. In June 1999 the company lost its licence to provide the search results at the AOL NetFind portal to competitor search engine provider Inktomi. Later, in September, the company announced the launch of a huge 250 million page index and powerful new search functionality.
www.excite.com

FAST
Launched in May 1999 with the largest ever search engine index at the time - over 200 million pages. This Norwegian company ambitiously aims to index all of the Web - hence its URL. Unlike other search engine companies, which use mainframe computers to power their services, FAST has linked together a few hundred Dell PCs (Dell has a 5 per cent stake in the company) and uses parallel processing to deliver its service. The company plans to have increased its index size to 300 million pages by the end of 1999.
www.alltheweb.com

Go (Infoseek)
The Infoseek search engine was launched in 1995. The Disney Corporation acquired a large stake in Infoseek in June 1998, and in January 1999 Infoseek was re-launched and re-branded as a portal site known as Go. Like many other search portals, Go offers users the option of searching the index or browsing through a human-compiled Web directory.
www.go.com

Google
Launched in 1998. Developed by students at Stanford University (as was Yahoo!), Google focuses on the link structure of the Web to determine relevant results for the user. Its proprietary technology, PageRank (named after co-founder Larry Page), crawls the Web analysing both the links between Web sites and the accompanying text around each hyperlink. The company estimates that its index is between 70 and 100 million pages, but through the links analysis it enables users to reach an estimated 300 million Web pages - currently a much greater reach than any other search engine provider. Like most other second-generation search providers, the company is focusing on co-branding its technology rather than building its own search portal. In August the company signed a deal with AOL subsidiary Netscape to be the main search provider on the Netcenter portal.
www.google.com

HotBot
Launched in May 1996 by Wired. Acquired by Lycos in October 1998, but continues to be run as a separate service from the Lycos search engine. Accesses the Inktomi search engine index, rather than compiling its own index. However, primary results are derived from Direct Hit, the popularity-based search provider (see above). Directory listings are derived from the Open Directory (see below).
www.hotbot.com

Inktomi
Founded in February 1996, Inktomi is probably the most famous search engine index. It powers the search results for several famous portals and search sites including HotBot (where it debuted), Yahoo!, AOL, MSN Search and SNAP. However, not all of these companies access Inktomi's full 110 million page index, and there are variations in results between the different search sites due to the different filtering and relevance ranking algorithms Inktomi provides to each partner company. It is not possible to interrogate the Inktomi index directly.
www.inktomi.com

LookSmart
Launched in October 1996. Like Yahoo!, LookSmart is a human-compiled directory. In addition to providing its own search site, the company also licenses its directory to other companies, including AltaVista (who in turn provide search results to LookSmart whenever there is no match to a user's query within the directory) and, in August 1999, Excite (replacing Excite's own directory). During that same month, the company raised US$92.4 million on its public listing of 7.7 million shares at US$12 each.
www.looksmart.com

Lycos
Launched as a search engine in May 1994. The company rapidly diversified into other areas (AngelFire, Tripod, WiseWire, etc.) and e-commerce has become its primary focus. Although it acquired rival search engine HotBot in October 1998, it switched to a Web directory format in April 1999. Primary results are now derived from the Open Directory (see below), with secondary results coming from its own index. It has also added almost 8,000 databases of information specific to different industries. HotBot continues to be operated as a separate venture.
www.lycos.com

Northern Light
Launched in August 1997. Has continually been one of the largest indexes, gradually increasing in size until it became the biggest search engine (indexing 16 per cent of the Web). This leading position has since been superseded by the launch of FAST in May 1999. The company also offers a special collection of non-Web material such as newspaper and magazine articles. Whilst it is free to search within the special collection, users must pay a charge (up to US$4) to view any articles from this collection. Search results are clustered in folders by topic. Like AltaVista, this search engine is popular with researchers due to its scope and functionality.
www.northernlight.com

Open Directory
Launched in June 1998. This directory uses volunteer editors to catalogue the Web. The initiative quickly gained prominence and was acquired by Netscape in November of that year. Netscape pledged to allow anyone to use the directory. In April 1999 Lycos re-launched itself as a directory service, deriving its primary results from the Open Directory.
http://dmoz.org

RealNames
Launched in 1998. Formerly known as Centraal Corp., RealNames charges companies an annual US$100 to register individual keywords, such as a company name or a brand name. Obviously many companies want to, and do, register many keywords to protect their brands, etc. This has proved a very successful economic model for the company, and in August 1999 it successfully raised over US$70 million from venture capitalists in a third round of financing. Although the index is directly available as a download from the company's Web site, and is incorporated within Microsoft's Internet Explorer 5 browser, its most notable success has been access from search engines that license its index, such as AltaVista and Go (Infoseek).
www.realnames.com

Yahoo!
Launched in late 1994, Yahoo! has become the most popular search site on the Web, accounting for a staggering 43.5 per cent of all search engine referrals in August 1999 (``Top ten'', 1999). It is the Web's largest human-compiled directory, listing over 1 million sites. These directory listings are also supplemented by search results derived from Inktomi's 110 million page search index. Launched a new photo search service during the summer.
www.yahoo.com
