
ADV PROB I-CS & ADV PROB II-CS (INTERNET SEARCH ENGINES)

SRAVAN KUMAR BUSIREDDY PROF: AUSIF MAHMOOD


STUDENT ID: 550554
CONTENTS

1) Introduction

2) Problem Statement

3) Analysis

4) How Search Engine Works

5) How Database is Built

6) Research Work on Search Engines

   6.1) Indexing
   6.2) Search Engine Comparison
   6.3) Google
   6.4) Automatic Hyperlinking
   6.5) Results and Challenges in Web Search Evaluation

7) Conclusion and Requirements

8) Implementation of the Spider Program

   8.1) Description of Spider Program in Java

   8.2) Output of the Spider Program

9) References

INTRODUCTION

Knowledge
Knowledge is often seen as a rich form of information. A more useful definition of knowledge is that it is
about know-how and know-why. It is important to note that to make knowledge productive you need
information.

Search engines
A search engine is a tool with which one can retrieve information. Its database is not a compilation of
web pages; rather, it consists of an index of websites related to the framed query. The database matches
the query to the information and presents links to the related websites. Search engines compile their
databases using spiders to scan through the Internet from link to link, identifying and perusing pages.
Sites that are not linked to from other pages may be missed by spiders altogether. A search engine is designed to
scan the web pages of related sites, match the keywords and phrases in the query and rank the list in
an order of priority depending on the amount of information available in relation to the query. This
provides maximum information in less search time.

PROBLEM STATEMENT

The scope of my project covers the applications and uses of internet search engines. It involves
understanding the basic concepts of search engines, knowing the source of the data and the processes
involved in retrieving the related data from the database. It also aims to present a comprehensive list of the
search tools currently available, to look at alternative sources of retrieving information and to give an idea
of the uses of specialized search engines.

ANALYSIS

In this project I intend to do a research study on the topic of knowledge and internet search engines, with
emphasis on the following topics:
1) Defining a search engine and knowing how it works.
2) Uses and applications of internet search services in everyday life and in the corporate sector.
3) Alternative tools used in searches.

Defining a search engine and knowing how it works:


A search engine is a tool with which one can retrieve information from databases. The database matches
the query to the information and presents links to the related websites. Search engines compile their
databases using spiders to scan through the Internet from link to link, identifying and perusing pages.
Sites that are not linked to from other pages may be missed by spiders altogether. When you perform a search
query, you are asking the engine to scan its index of sites and match your keywords and phrases with those
within the engine's database. Spiders regularly return to the web pages they index to look for changes;
when changes occur, the index is updated to reflect the new information. However, the process of updating
depends upon how often the spiders visit the web page. No two search engines are exactly the same in
terms of size, speed and content; no two engines use exactly the same ranking schemes, and not every
search engine offers you exactly the same results. A search engine is designed to scan the web pages of
related sites, match the keywords and phrases in the query and rank the list in an order of priority
depending on the amount of information available in relation to the query. This provides maximum
information in less search time.

Uses and applications of internet search services in everyday life and in the corporate sector:
In this age of information, internet search engines serve as important resources in day-to-day life for
academic, professional and household purposes. Students and research scholars use the internet to help in
their academic work, and internet search engines are extensively used in corporate offices today to
obtain data on which crucial decisions are made.

Alternative tools used in searches:

Besides search engines, we have other tools to search for information. We can search for data by posting
a query in discussion groups. If you are seeking international information, you may want to search the
Web within the country or countries of interest. There are also directories: directories are concerned not
so much with displaying a large quantity of links for a keyword query but rather with the quality of the links.
Directories employ teams of people, who follow strict guidelines, to review a submitted site and
determine its inclusion and ranking. Popular search engines are AltaVista, AOL Search, Google, Yahoo,
Lycos and many more.

HOW SEARCH ENGINE WORKS

Search engines match the query with the content in their databases and present information about matching
pages along with URLs to those pages on the web. These search engines in fact do not really search the
world wide web directly. A search engine has a database of information from the world wide web, stored mostly
as documents consisting of the information presented on the particular web pages. However, this
information is only a copy of the original web page.

Search engines build their databases by using programs called 'spiders'. These programs are designed to
search the World Wide Web via the Internet, visiting sites and databases and collecting information. They
operate continuously, updating the search engine databases. Most large search engines operate several of
these programs all the time. When a query is presented, it is only matched with the information existing in
the search engine's database. The spiders may collect information by following links on the existing pages
in their database.

Some web pages that contain information on the query may not be displayed by the search engine for various
reasons. Such pages are called 'invisible' pages. These pages normally fail to appear in the search because
no pages in the database link to them. However, such pages may be added by submitting their
URLs manually to the database.

Information collected by the spiders is passed on to another program for indexing. Search engines use
several methods for ranking these web pages, and the methods vary among different search engines. An
attempt is made here to enumerate a few of the most commonly used ranking methods (a simple scoring
sketch follows the list):
1. Pages with the keywords in their title are ranked higher.
2. Pages with a high density of the keyword tend to be ranked higher.
3. Pages containing the keyword in the first few lines are ranked higher.
4. URLs containing the keywords are ranked higher.
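
To make these heuristics concrete, the following toy sketch in Java combines them into a single score. The class name, method and weights are illustrative assumptions for this report only, not any particular engine's actual ranking formula.

// Illustrative only: a toy relevance score combining the heuristics listed
// above (keyword in title, keyword in URL, early occurrence, keyword density).
// The weights are arbitrary assumptions.
public class ToyRanker {
    public static double score(String keyword, String title, String url, String body) {
        String kw = keyword.toLowerCase();
        double score = 0.0;
        if (title.toLowerCase().contains(kw)) score += 3.0;        // keyword in the title
        if (url.toLowerCase().contains(kw)) score += 2.0;          // keyword in the URL
        String[] words = body.toLowerCase().split("\\s+");
        int hits = 0;
        for (int i = 0; i < words.length; i++) {
            if (words[i].contains(kw)) {
                hits++;
                if (i < 50) score += 0.5;                          // keyword near the top of the page
            }
        }
        if (words.length > 0) score += 10.0 * hits / words.length; // keyword density
        return score;
    }

    public static void main(String[] args) {
        System.out.println(score("java", "Java tutorials",
                "http://example.com/java", "Learn Java here. Java basics and more."));
    }
}

A real engine would of course combine many more signals, but the sketch shows how the individual heuristics can be folded into one comparable number per page.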

Search services that use spiders differ in the quality of their output from resources (indexes) compiled
manually, and both have their own advantages. While the manually compiled indexes tend to give a
short summary of the content of the web pages retrieved in a search, the search engines using spiders have
the advantage of being more complete due to their inherent capacity to update information. These services
differ very much in size, in the selection of the indexed servers and pages, and in their user interface.

Web usability experts say that web users, both amateurs and professionals, waste a considerable amount of
their online time looking for relevant, correct and up-to-date information. Although search engine
technologies are advancing at a great pace, there remains a huge gap between the information query of a
web user and relevant, accurate results. Experts say that in the future human-like systems, technically
called neural networks, together with today's highly comprehensive search engine indexes, can meet this
challenge. A pragmatic and presently available alternative is the use of web guides and directories,
which are created and maintained by humans [3].

HOW DATABASE IS BUILT

Internet search engines are special sites on the Web that are designed to help people find information
stored on other sites. There are differences in the ways various search engines work, but they all perform
three basic tasks:

• They search the Internet -- or select pieces of the Internet -- based on important words.
• They keep an index of the words they find, and where they find them.
• They allow users to look for words or combinations of words found in that index.

Building lists
Before a search engine can tell you where a file or document is, it must be found. To find information on
the hundreds of millions of Web pages that exist, a search engine employs special software robots, called
spiders, to build lists of the words found on Web sites. When a spider is building its lists, the process is
called Web crawling. In order to build and maintain a useful list of words, a search engine's spiders have
to look at a lot of pages. The usual starting points are lists of heavily used servers and very popular pages.
The spider will begin with a popular site, indexing the words on its pages and following every link found
within the site. In this way, the spidering system quickly begins to travel, spreading out across the most
widely used portions of the Web.

Building index:
Once the spiders have completed the task of finding information on the Web, the search engine must store the
information in a way that makes it useful. There are two key components involved in making the gathered
data accessible to users:
• The information stored with the data
• The method by which the information is indexed
A search engine could just store the word and the URL where it was found. In reality, this would make for
an engine of limited use, since there would be no way of telling whether the word was used in an
important or a trivial way on the page, whether the word was used once or many times or whether the
page contained links to other pages containing the word. In other words, there would be no way of
building the ranking list that tries to present the most useful pages at the top of the list of search results.

To make for more useful results, most search engines store more than just the word and URL. An engine
might store the number of times that the word appears on a page. The engine might assign a weight to
each entry, with increasing values assigned to words as they appear near the top of the document, in sub-
headings, in links, in the Meta tags or in the title of the page.

Regardless of the precise combination of additional pieces of information stored by a search engine, the
data will be encoded to save storage space. As a result, a great deal of information can be stored in a very
compact form. After the information is compacted, it's ready for indexing.

An index has a single purpose: It allows information to be found as quickly as possible. There are quite a
few ways for an index to be built, but one of the most effective ways is to build a hash table. In hashing, a
formula is applied to attach a numerical value to each word. The formula is designed to evenly distribute
the entries across a predetermined number of divisions. This numerical distribution is different from the
distribution of words across the alphabet, and that is the key to a hash table's effectiveness. Hashing evens
out the difference, and reduces the average time it takes to find an entry. It also separates the index from
the actual entry. The hash table contains the hashed number along with a pointer to the actual data, which
can be sorted in whichever way allows it to be stored most efficiently. The combination of efficient
indexing and effective storage makes it possible to get results quickly, even when the user creates a
complicated search.
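
As a rough illustration of the idea described above, the sketch below builds a tiny hash-based index in Java, keeping the stored entry (a URL plus a weight) separate from the hashed key. The class and field names are invented for this example and do not correspond to any real engine's code.

import java.util.*;

// Toy illustration of an index that maps a hashed word to pointers
// (here, URLs plus a weight) rather than to the documents themselves.
public class ToyIndex {
    // The entry is stored separately from the hash key, as described above.
    static class Posting {
        final String url;
        final int weight;           // e.g. occurrence count or position-based weight
        Posting(String url, int weight) { this.url = url; this.weight = weight; }
    }

    // Java's HashMap already applies a hash function to spread keys evenly.
    private final Map<String, List<Posting>> table = new HashMap<>();

    public void add(String word, String url, int weight) {
        table.computeIfAbsent(word.toLowerCase(), k -> new ArrayList<>())
             .add(new Posting(url, weight));
    }

    public List<Posting> lookup(String word) {
        return table.getOrDefault(word.toLowerCase(), Collections.emptyList());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.add("spider", "http://example.com/a.html", 3);
        index.add("spider", "http://example.com/b.html", 1);
        for (Posting p : index.lookup("spider"))
            System.out.println(p.url + " (weight " + p.weight + ")");
    }
}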

RESEARCH WORK ON SEARCH ENGINES

INDEXING
Pages that have been crawled and returned are processed and the resulting data stored in an index. The
index contains the information about the words found on the page, the location of the words and other
information dealing with presentation of the words. This forms a massive database of words and
associated links to pages that contain them.

Each search engine has its own rules for indexing a site. Some look at meta tags, some ignore them and look
at the beginning of the text of a page, some read page titles, and so on, but almost all search engines
first check the site for a file called robots.txt. The robots.txt file is a simple ASCII document in which
instructions for search engines are placed. The robots.txt file basically tells search engines two things
(a minimal example follows):
1) Which search engines (user agents) are not allowed to index the site.
2) Which specific pages or directories are to be excluded from indexing.
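
For illustration, a minimal robots.txt covering both cases might look like the following; the crawler name and directory paths are invented for the example.

# Block one specific crawler from the whole site
User-agent: BadBot
Disallow: /

# All other crawlers may index the site, except two directories
User-agent: *
Disallow: /private/
Disallow: /tmp/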

Four approaches to indexing documents on the web are:


1) Human or manual indexing
2) Automatic indexing
3) Intelligent or agent based indexing
4) Metadata and annotation-based indexing
Manual indexing is currently used by many search engines, although the volume of information
available over the internet increases at a pace that is difficult for human indexers to match. The major drawback
of manual indexing is the lack of consistency between two different indexers. Even so, compared to most
automatic indexers, human indexing is currently the most accurate.

Intelligent agents are most commonly referred to as crawlers, but are also known as ants, automatic
indexers, bots, spiders and worms. Many search engines depend on automatically generated indices,
either by themselves or in combination with other technologies. The major problems associated with the use
of robots are:
1) These agents are too invasive.
2) Robots can overload system servers and cause systems to be virtually frozen and
3) Some sites are updated several times a day.

The term metadata usually refers to an invisible file attached to a web page which facilitates collection of
information by automatic indexers; the file is invisible in the sense that it has no effect on the visual
appearance of the page when viewed using a standard web browser. A number of metadata standards have
been proposed for web pages.
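
One common way such metadata reaches an automatic indexer is through HTML meta tags in the page head, which do not affect how the page is displayed. The following snippet is a hypothetical example, not taken from any particular standard mentioned above.

<head>
  <title>Search Engine Basics</title>
  <meta name="description" content="An overview of how web search engines crawl and index pages.">
  <meta name="keywords" content="search engine, spider, crawler, indexing">
</head>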

Indexer robot:
The indexer robot is an autonomous WWW browser which communicates with servers through HTTP. It
visits a WWW site, traverses hyperlinks, extracts keywords and data from the pages and inserts the
keywords and hyperlink data into an index. The index consists of a page-ID table, a keyword-ID table, a
page-title table, a hyperlink table and two index tables, namely an inverted and a forward index. The
forward index is actually partially sorted. It is stored in a number of barrels, each of which holds a range of
wordIDs. The inverted index consists of the same barrels as the forward index, except that they have been
processed by the sorter. For every valid wordID, it contains a pointer into the barrel that the wordID falls
into.

SEARCH ENGINE COMPARISON


Google:
1) URL: www.google.com
2) Size: 1.5 Billion web pages
3) Update frequency: Monthly overall but daily for a few million time-sensitive pages.
4) Depth of Crawling and Indexing: All pages are crawled and indexed up to 110k of content. After that,
any remaining content is not searchable. All pages in cache are also limited to 110k.
5) Default Search: AND
6) Phrase Searching: Yes, put terms inside "quotation marks"
7) Case Sensitive: NO
8) Boolean Logic: Yes. Use the - (minus) sign to exclude a term. Use OR to search for either term. AND is
the default, so there is no need to use it between terms. The + (plus) sign is used only to search for stop words.
9) Search by Field: Yes. Use advanced search form to specify: language, file format, date, where terms
occur, domain, and filter for adult content, find similar pages, and links to a page.
10) Directory: Yes, Open Directory Project.

11) Other Databases or Directories available at this website: Image Search, Google Groups, and Google
Web Directory. From Advanced Search page search: U.S. government and military sites, university
sites (such as Stanford), mail order catalogs, Linux, BSD Unix and Apple Macintosh.
12) Special Features: Google is the only search engine to crawl and make the content of 12 main file
types, including PDF, PostScript, and Microsoft Office, searchable. Translation of web documents.
Telephone Search, Address Search. Results include cached versions of web pages, Dictionary
Definitions, and Similar pages.
13) Sorting: Results are sorted by relevance, which is determined by Google's PageRank analysis,
based on links from other pages with greater weight given to authoritative sites. Pages are also
clustered by site; only two pages per site will be displayed.
14) Spider Name: BackRub/2.1

Alta vista
1) URL: www.altavista.com
2) Size: 550 million web pages
3) Update frequency: Not updated overall for the last 6 months, though some pages are updated more frequently.
4) Depth of Crawling and Indexing: All pages are crawled up to 100k of content. After that, any
remaining content is not searchable. Up to 4MB of links are crawled and indexed.
5) Default Search: AND
6) Phrase Searching: Yes, put terms inside "quotation marks".
7) Case Sensitive: YES
8) Boolean Logic: Yes. Boolean (+, -, AND, OR, AND NOT) in Basic Interface. Boolean (and, or, and
not) in Advanced Interface.
9) Search by Field: Yes. Fields include: (1) anchor: (2) applet: (3) domain: (4) host: (5) image: (6) like:
(7) link: (8) text: (9) title: (10) URL: Also limit search by language and date.
10) Directory: Yes, Look Smart.
11) Other Databases or Directories available at this website: News, Comparison shop, Yellow Pages,
People Finder, Maps, Education Search, Government Search, Text Only Search, Images, Video, and
MP3/Audio.
12) Special Features: Ticker symbols provide direct links to stock quote, news, SEC filings. Translate
web documents.
13) Sorting: In the Basic Interface, results are ranked by AltaVista's relevance ranking formula. Results are
also clustered by site, and only one record per site appears on the main results page. In the Advanced
Interface, results are NOT sorted unless one or more terms are in the "sort by" box. In the Basic Interface,
only 200 records can be displayed. In the Advanced Interface, up to 1000 records can be displayed.
14) Spider Name: Scooter/2.0

All the Web


1) URL: www.alltheweb.com
2) Size: 625 million web pages
3) Update frequency: Entire site every 9-11 Days.
4) Depth of Crawling and Indexing: Entire content on a web page is crawled and indexed.
5) Default Search: AND.
6) Phrase Searching: Yes, put terms inside "quotation marks".
7) Case Sensitive: NO.
8) Boolean Logic: Yes. Use + (plus) sign, - (minus) sign or Boolean (AND, OR, NOT).
9) Search by Field: Yes. Use Advanced Search form to specify language, where terms occur, domain,
date, and filter for adult content.
10) Directory: NO.

11) Other Databases or Directories available at this website: News, Pictures, Videos, MP3 Files, and
FTP files plus Scirus (Elsevier scientific information) and Soccer Search.
12) Special Features: Customize search, language, filter, intelligence, results, search tips, and flash in
results. Note that News, Pictures, Videos, MP3 Files, and FTP files are all searched simultaneously.
At bottom of results page, click on tab to view. Folders grouped by category as well as suggested
terms to narrow a search result are also available near the top of the results page.
13) Sorting: By default, sites are sorted in order of perceived relevance. Only one page per domain is
displayed unless the Customize option for site collapsing is turned off. Sites clustered under the one
page per domain are not marked, and most users will not realize more hits from that domain might be
available.

There are many search engines other than these three; some of them are:

Excite:
URL: www.excite.com
It is a pretty good search engine despite its smaller index. It seems particularly good at rooting out the
"official" sites of organizations, institutions, etc., and because it provides so much personalized content on
its home page, it is a kind of "portal", or starting place, for many people. Results are similar to AltaVista's,
though perhaps with less duplication in the results.
Spider used: Architextspider.

HotBot:
URL: www.hotbot.com
It is also a large database, and has probably more search options than any other engine. The interface is not very
pleasing, though, and the results lists can be incredibly cluttered with newsgroup postings and multiple
pages. The real value in HotBot lies in being able to search for particular content, such as images or sound
files, or documents from a particular domain.
Spider used: Slurp/2.0

Northern Light:
URL: www.northernlight.com
It is a combination of a search engine and a full-text document delivery service. It has a great searching
bot, in that it sorts the results into subject-oriented 'folders', but it tends not to deliver the kinds of results
that FAST does.
Spider used: Gulliver/1.2

Yahoo:
URL: www.yahoo.com
It is basically a complicated directory with a search interface added to it. But it is an indiscriminate
directory, both in terms of content and organization, and tends to be a mess to try to search.
Spider used: Fido/1.0 Harvest

Spiders used in other search engines:

1) Web Crawler: Architext spider


2) AOL: Slurp/2.0
3) Euroseek: Arachnoidea
4) Planet Search: Fido/1.4.pl2
5) Lycos: Lycos_Spider_(T-Rex)
6) Infoseek: Infoseek Sidewinder/0.9

All the search engines build their indexes or databases on similar guidelines, with the location and
frequency of words being the primary determining factors in ranking the pages. Each company builds its
index according to its own criteria; for example, the Google index is based on the number of links between pages and
sites. All the search engines have their own customized software to search their databases. Even though
they operate on similar principles, the ranking of web sites is determined by the algorithms that analyze the
location and frequency of the user's search terms against the list of matching websites.

GOOGLE

One of the first web search engines, the World Wide Web Worm (WWWW), was developed in 1994. It had
an index of 110,000 web pages and web-accessible documents. Over time, index sizes have grown enormously,
and search engines are now expected to handle many millions of queries per day. The web
creates new challenges for information retrieval. Human-maintained lists cover popular topics effectively
but are subjective, expensive to build and maintain, slow to improve and cannot cover all important
topics. Automated search engines, on the other hand, rely on keyword matching and usually return too
many low-quality matches. The main goal is to improve the quality of web search. Google, whose name is a
common spelling of googol, or 10^100, is a large-scale search engine developed to address many of
the problems of existing systems. It makes especially heavy use of the additional structure present in
hypertext to provide much higher quality search results. It uses fast crawling technology to gather
web documents and keep them up to date, efficient storage space to store indices, and an efficient
indexing system to process hundreds of gigabytes of data. Google was designed in such a way that it
makes use of link structure and anchor text for making relevance judgements and quality filtering. A further
goal of its design is that reasonable numbers of people can actually use it. The last goal of the design
is to build an architecture that can support novel research activities on large-scale web data; to support
this, Google stores all of the documents it crawls in compressed form.
High precision (a large proportion of relevant documents among those returned) is considered a key feature of a
search engine. Google has two major features that help in getting such a result: first, the link structure of the
web is used to calculate a quality ranking for each web page, and second, link (anchor) text is used to improve
search results. PageRank is an objective measure of a web page's citation importance that corresponds well with
people's subjective idea of importance, and it is an excellent way to prioritize the results of web keyword searches.
Search engines associate the text of a link with the page that the link is on and the page the link points to.
Anchors often provide more accurate descriptions of web pages than the pages themselves, and anchors may
exist for documents which cannot be indexed by a text-based search engine, such as images, programs
and databases. Anchor propagation is used mainly because anchor text helps in providing better quality
results.
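
To give a feel for the idea behind PageRank (this is not Google's actual implementation), the following Java fragment iterates the widely published rank formula over a tiny hand-made link graph. The damping factor of 0.85 is the commonly cited value, and the graph itself is invented for the example.

// Toy PageRank iteration over a small, hand-made link graph.
// PR(p) = (1 - d)/N + d * sum over pages q linking to p of PR(q)/outdegree(q)
public class ToyPageRank {
    public static void main(String[] args) {
        // links[i] lists the pages that page i links to (0 = A, 1 = B, 2 = C)
        int[][] links = { {1, 2}, {2}, {0} };
        int n = links.length;
        double d = 0.85;                       // damping factor (commonly cited value)
        double[] rank = new double[n];
        java.util.Arrays.fill(rank, 1.0 / n);  // start with a uniform distribution

        for (int iter = 0; iter < 50; iter++) {
            double[] next = new double[n];
            java.util.Arrays.fill(next, (1 - d) / n);
            for (int q = 0; q < n; q++)
                for (int p : links[q])
                    next[p] += d * rank[q] / links[q].length;
            rank = next;
        }
        for (int i = 0; i < n; i++)
            System.out.printf("page %d: %.4f%n", i, rank[i]);
    }
}

After a few dozen iterations the ranks converge, and pages that are linked to by highly ranked pages end up with higher scores themselves, which is exactly the "citation importance" described above.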

Most of Google is implemented in C or C++ for efficiency and can run on either Solaris or Linux.
The architecture of Google is described here. A URLserver sends lists of URLs to be
fetched to the crawlers. The web pages that are fetched are then sent to the store server. The store server
then compresses and stores the web pages into a repository. Every web page has an associated ID number
called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing
function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads
the repository, uncompresses the documents, and parses them. Each document is converted into a set of
word occurrences called hits. The hits record the word, position in document, an approximation of font
size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted
forward index. The indexer performs another important function. It parses out all the links in every web
page and stores important information about them in an anchors file. This file contains enough
information to determine where each link points from and to, and the text of the link.
The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into
docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to.
It also generates a database of links which are pairs of docIDs. The links database is used to compute
Page Ranks for all the documents. The sorter takes the barrels, which are sorted by docID, and resorts
them by wordID to generate the inverted index. This is done in place so that little temporary space is
needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A
program called DumpLexicon takes this list together with the lexicon produced by the indexer and
generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the
lexicon built by DumpLexicon together with the inverted index and the Page Ranks to answer queries.
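
The following small Java sketch spells out the kind of "hit" record described above (word, position, approximate font size, capitalization). The field layout is an assumption for illustration; Google's real hits are packed into a compact bit-level encoding.

// Illustrative hit record: one word occurrence in one document.
// A production system would pack these fields into a few bits;
// here they are spelled out as plain fields for clarity.
public class Hit {
    final int wordId;          // id assigned via the lexicon
    final int position;        // word position within the document
    final int fontSize;        // approximation of relative font size
    final boolean capitalized; // capitalization flag

    Hit(int wordId, int position, int fontSize, boolean capitalized) {
        this.wordId = wordId;
        this.position = position;
        this.fontSize = fontSize;
        this.capitalized = capitalized;
    }

    @Override
    public String toString() {
        return "Hit{word=" + wordId + ", pos=" + position +
               ", font=" + fontSize + ", caps=" + capitalized + "}";
    }
}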

Data structures play an important role in search engines by improving cost efficiency. Disk seeks are a major
obstacle in the design of these data structures, and Google is designed to minimize disk seeks, thereby improving its
performance. The major data structures in Google are BigFiles, the repository, the document index, the lexicon,
hit lists, the forward index and the inverted index.

Crawling involves interacting with hundreds of thousands of web servers and various name servers which
are all beyond the control of the system. Google has a fast distributed crawling system. Each crawler
keeps roughly 300 connections open at once. Each crawler maintains its own DNS cache so it does not
need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a
number of different states: looking up DNS, connecting to host, sending request, and receiving response.
These factors make the crawler a complex component of the system. It uses asynchronous IO to manage
events, and a number of queues to move page fetches from state to state.
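
A greatly simplified Java sketch of one piece of this, a per-crawler DNS cache that avoids repeating lookups for hosts already resolved, is shown below. It is an illustration only: the class name is invented, and a real crawler would combine this with asynchronous IO and per-connection state machines as described above.

import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Simplified illustration of a crawler-side DNS cache: each host is
// resolved at most once, then served from memory on later requests.
public class DnsCachingResolver {
    private final Map<String, InetAddress> cache = new ConcurrentHashMap<>();

    public InetAddress resolve(String host) throws UnknownHostException {
        InetAddress cached = cache.get(host);
        if (cached != null) return cached;           // avoid a repeated DNS lookup
        InetAddress address = InetAddress.getByName(host);
        cache.put(host, address);
        return address;
    }

    public static void main(String[] args) throws Exception {
        DnsCachingResolver resolver = new DnsCachingResolver();
        System.out.println(resolver.resolve("example.com"));
        System.out.println(resolver.resolve("example.com")); // served from the cache
    }
}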

The indexing in Google has three steps. The first is parsing, in which a parser designed to handle a huge array
of possible errors on the web is applied to each document. The second step involves indexing documents into
barrels. In this step, after each document is parsed, it is encoded into a number of barrels. Every word is
converted into a wordID by using an in-memory hash table called the lexicon. Once the words are converted
into wordIDs, their occurrences in the current document are translated into hit lists and are written into the
forward barrels. The third and last step is sorting, where the sorter takes each of the forward barrels and sorts it by
wordID to produce an inverted barrel for title and anchor hits and a full-text inverted barrel. Since the
barrels do not fit into main memory, the sorter further subdivides them into baskets which do fit into
memory, based on wordID and docID. Then the sorter loads each basket into memory, sorts it and writes
its contents into the short inverted barrel and the full inverted barrel.
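
A compact sketch of the word-to-wordID conversion in the second step, using an in-memory lexicon, might look like this in Java. The class and method names are illustrative and are not Google's code.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative in-memory lexicon: assigns a new wordID the first time
// a word is seen and reuses it afterwards, as in the indexing step above.
public class Lexicon {
    private final Map<String, Integer> wordIds = new HashMap<>();

    public int wordId(String word) {
        String key = word.toLowerCase();
        Integer id = wordIds.get(key);
        if (id == null) {
            id = wordIds.size();      // next free id
            wordIds.put(key, id);
        }
        return id;
    }

    public static void main(String[] args) {
        Lexicon lexicon = new Lexicon();
        for (String w : List.of("Search", "engines", "index", "search", "pages"))
            System.out.println(w + " -> wordID " + lexicon.wordId(w));
        // "Search" and "search" map to the same wordID
    }
}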

Future work

The goal of the Google search engine is to improve search efficiency and to scale to approximately 100
million web pages. The developers of Google believe that expanding to much more than 100 million pages would
greatly increase the complexity of their system. Some of the planned improvements to the search include
query caching, sub-indices and better updates. Simple features like Boolean operators, negations,
feedback and the use of link structure and link text are being added to Google to improve search quality.
Google also makes use of proximity and font information for high-quality searches. Google's major data
structures make efficient use of the available storage space, making the system more scalable. Google is also
considered to be a research tool in addition to a high-quality search engine.

In this way we can conclude that Google is one of the best search engines, with features such as fast web
crawling, high search quality, efficient storage space and an efficient indexing system.

AUTOMATIC HYPERLINKING

Automatic hyperlinking is the process in which automation enriches a document with hyperlinks. In most
document languages, a hyperlink is understood as a navigable reference from a short piece of text in a
source document to a point in a target document. An automatic hyperlinker generally acts as a filter,
which takes a set of documents, performs some changes on it and delivers the transformed set of
documents.
Since a hyperlink is a reference from a piece of text to a location in a document, a hyperlinker has to
determine the source end and the target end(s) of every hyperlink. The more densely hyperlinks occur in a
document, the more important it becomes. If one wants to make more than about a handful of target
locations navigable, it has proved to be quite helpful to offer the reader not just one target per link source,
but a whole, possibly hierarchically traversable collection called a link bundle.

A prototypic architecture for the linking strategies was proposed, which should perform the following
tasks:
1) Create an abstract syntax tree of every source document.
2) Find the locations of the source phrases.
3) Find a first selection of target documents.
4) Perform a human language syntactical analysis of text sections.
5) Annotate the target documents.
6) Annotate the source documents.

From this we can conclude that automatic hyperlinking is very useful in link-based web technology,
opening up new fields of application.

RESULTS AND CHALLENGES IN WEB SEARCH EVALUATION

This section considers whether web search algorithms are effective or not. Aspects of effectiveness
include whether the web pages returned to the user are relevant. The foundations of such an evaluation
methodology have been developed in the Text Retrieval Conference (TREC) evaluation program. Some
research has been conducted on commercial web search engines, and using the TREC methodology we can
measure the effectiveness of web search. Things which work well on TREC often do not produce good results
on the web. The TREC approach to objective evaluation of effectiveness is to define a large set of
statements of user need. Some of the advantages of using TREC are reproducible results, blind testing
and collaborative experiments. TREC judgements are binary and completely independent of other
judgements.

VLC2 (the Very Large Collection) is a frozen snapshot of the web used in the TREC tests. The data in
VLC2 was obtained by spidering the web, and it is accessed under the terms and conditions of a permission
available through its web page. Web search engines were not penalized for returning URLs of non-existent
pages. The effectiveness of ranking algorithms cannot be compared in isolation. The performance advantage
of the TREC systems increased as the amount of topic text used in constructing the queries increased.

The web track will make use of the VLC2 frozen dataset to enable reproducibility of results. It may also
be possible to estimate the benefit of increasing query length. The web track allows measurement of both
speed and effectiveness, because an ideal web search engine should not only return answers fast but
should present results which satisfy the user. Evaluation measures include precision and recall, which can
be calculated at arbitrary points in the search engine's ranked list.
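
For reference, these two measures are commonly defined as:

Precision = (number of relevant documents retrieved) / (total number of documents retrieved)
Recall    = (number of relevant documents retrieved) / (total number of relevant documents in the collection)

For example, if a query returns 20 pages of which 10 are relevant, and the collection contains 40 relevant pages in total, then precision is 10/20 = 0.5 and recall is 10/40 = 0.25.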
Conclusion
From this we can conclude that the commercial search engine ranking algorithms are not the most
advanced; it is possible that the source of the problem lies in spidering rather than in ranking.
In future, search engine operators may take up the challenge and measure the effectiveness of their
systems on the VLC2 data set.

CONCLUSION AND REQUIREMENTS

An internet search engine is a tool used for information retrieval. It maintains a database of document
information and links to web pages useful to the general public, students and researchers. The
database is built using programs called spiders; these programs search the internet, building a list of the
words found on websites. The latter process is called web crawling. After finding
information on the Web, the search engine must store the information in a way that makes it useful. The data
or pieces of information stored by a search engine are encoded to save storage space, and after the
information is compacted it is ready for indexing. An index allows information to be found as quickly as
possible, and one of the most efficient ways to build an index is to build a hash table. The index contains
information about the words found on a page, the location of the words and other information dealing
with the presentation of the words. This forms a massive database of words and associated links to the pages that
contain them. The indexer robot is an autonomous WWW browser, which communicates with servers
through HTTP. It visits a site, traverses hyperlinks, extracts keywords and data from the pages and inserts
the keywords and hyperlink data into an index.

This is how an internet search engine works for information retrieval and for the benefit of its users. In
this project, many search engines have been compared based on their architecture, database, size and
efficiency in searching the web. Google turned out to be the best among the present search
engines because of its search quality, fast crawling and efficient indexing system, which processes
hundreds of gigabytes of data. To test the efficiency of web search
algorithms, an evaluation methodology known as the Text Retrieval Conference (TREC) evaluation program
has been developed. The advantages of using TREC are reproducible results, blind testing and collaborative
experiments. The ranking of web pages by a search engine depends upon its algorithms. Besides search
engines there are other tools to search for information; these include discussion groups and
directories.

In today's world, the internet search engine is an important and widely relied-upon source of information. This
project has explained how an internet search engine works and what functionality it provides. It is
concluded in this project that Google is one of the best search engines currently available. Quantity and
quality should go hand in hand: both the amount of results and their relevance to the search are very important
for a search engine.

REQUIREMENTS TO BUILD A SPIDER

1) Web Crawling: A search engine should have a fast web crawling capability that can crawl up to 20 web
pages per second (i.e., 72,000 web pages per hour). Web crawling is used for the fast retrieval of
information, which is the key for any search engine.
2) Repository: A search engine contains a lot of information, and efficient storage space is required for
storing indices and web documents in compressed form. If we have 80 GB of uncompressed data, it
can be compressed to approximately 40 GB in the repository, thereby reducing the storage space needed and
increasing the pace of the search.
3) Indexing: The important purpose of an index in a search engine is to find correct information as quickly
as possible. The index database contains approximately 20 million web documents. The documents are
encoded into barrels and then sorted to generate an inverted index to occupy very little space.
4) Page Rank: Because of the need for high precision results, a search engine uses link structure to
estimate the ranking of the web pages. Page Rank is an excellent way to prioritize the results of web
keyword searches.
5) Checking Errors: A parser that runs at a reasonable speed is used when indexing the web to handle the
wide variety of errors found in web pages.
6) Occurrence: Each occurrence of a particular word in a document, together with its font, position and
capitalization information, is recorded as a hit, and the set of these occurrences is called the hit list.

EVALUATION

1) Relevance: A search engine should not only return answers fast but should present results which satisfy
the users requesting the search.
2) Browsability: How easy it is to understand the results. Does the user receive enough information from
the retrieved results to make a decision?
3) Query features: Search engines support a variety of query features; some support full Boolean queries,
while others support only AND queries. Queries of shorter length will yield better results.
4) Advertising: The use of advertisements in search engines yields very low precision results and moves
away from the needs of the consumer.
5) Concurrency: If one user is updating an index, another user can still access the index for his search.

IMPLEMENTATION OF SPIDER PROGRAM IN JAVA

The Spider program has been implemented in the Java programming language. The description given here
explains all the Java classes used in the program and the function of a spider. A spider is used to crawl the
web and build up a database. The spider in this program creates indexes of the websites it visits and
stores them.

The Java files, which have been used in this program, are:

1. BasicSpider:
This class is the invoker; for each host it creates a thread, and it waits until all threads have completed.

2. SpiderTask:
This class loops through the paths to get information for a given host.
Program logic: For each URL in the to-do list, it connects to the host, gets the HTML content and parses that
content to find further URLs. If a URL has not already been visited, is not a bad link and is not already in
the to-do list, it is queued for processing. The information for each URL is stored in data storage for retrieval
(this can be a database or an indexer).
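
A condensed, hypothetical sketch of this loop is shown below. It is not the actual SpiderTask source (which is not reproduced here); the class name and the fetch/parse/store helpers are placeholders used only to illustrate the logic just described.

import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical outline of the crawl loop described above: take a URL from
// the to-do list, fetch and parse it, queue any new links and store the page.
public class SpiderLoopSketch {
    private final Deque<String> toDo = new ArrayDeque<>();
    private final Set<String> visited = new HashSet<>();

    public void crawl(String startUrl) {
        toDo.add(startUrl);
        while (!toDo.isEmpty()) {
            String url = toDo.poll();
            if (!visited.add(url)) continue;        // already visited
            String html = fetch(url);               // connect to host, get HTML content
            if (html == null) continue;             // bad link
            for (String link : parseLinks(html)) {
                if (!visited.contains(link) && !toDo.contains(link))
                    toDo.add(link);                 // queue for processing
            }
            store(url, html);                       // database or indexer
        }
    }

    // Placeholder helpers; a real implementation would use HTTP and an HTML parser.
    private String fetch(String url) { return ""; }
    private List<String> parseLinks(String html) { return List.of(); }
    private void store(String url, String html) { }

    public static void main(String[] args) {
        new SpiderLoopSketch().crawl("http://example.com/");
    }
}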

3. SpiderContext:
This is a utility class that provides configuration information and serves it across multiple objects. Throughout
the execution, only one SpiderContext object is created. SpiderContext needs a
configuration file (spider.properties) to initialize.

4. HtmlDocument:
This class is a bean which stores meta information for a URL, such as the title, content type and content.

5. HttpUrl:
This class is an extension of the standard URL implementation in the java.net package; it stores URL information
such as the port number and IP address.

6. HtmlTag:
This class stores static name/value constants for HTML tags.

7. HtmlParser:
This is an interface used to extract information from given HTML content. The extracted information
is returned as a HashMap. The implementation is selected dynamically; the parser.driver
property must be set in the spider.properties file.

8. TidyParse:
This is the implementation class for HtmlParser; it uses Tidy for parsing HTML content.

9. Constants:
This class stores static name/value constants for the project.

10. Name Value:

This class stores a name/value relation. It also tokenizes a string into its name and value parts.

11. SpiderWriter:
This is an interface for storing content. The implementation class is selected dynamically; to specify where the
content will be stored, set the writer.driver property in the properties file.

12. SpiderIndexWriter:
This is an implementation of the SpiderWriter interface which stores the content in an index file format,
using the Jakarta Lucene indexer.

THE OUTPUT OF THE SPIDER PROGRAM IN JAVA IS GIVEN BELOW AS A SNAPSHOT.

REFERENCES

[1] 'Positioning search engines', http://www.positioning-search-engines.com/searchengines.htm, 2/3/03.
[2] Angela Elkordy, 'Web Searching, Sleuthing and Sifting',
http://www.thelearningsite.net/cyberlibrarian/searching/ismain.html, 2/8/03.
[3] 'Indian Web Site Ratings',
http://www.geocities.com/indian_website_ratings/best_indian_entertainment.htm, 2/14/03.
[4] 'UC Berkeley - Teaching Library Internet Workshops',
http://www.lib.berkeley.edu/TeachingLib/Guides/Internet/FindInfo.html, 2/14/03.
[5] http://computer.howstuffworks.com/search-engine.htm, 2/19/03.
[6] http://www-sul.stanford.edu/depts/swain/libguides/searchenginecomp.html, 2/24/03.
[7] Stephanie Davidson, 'Search Engine Comparison', http://www.law.ohio-state.edu/training/wengines2.htm, 2/26/03.
[8] Susan O'Neil, 'What Are the Most Important Search Engines',
http://www.fundsnetservices.com/Engines/business.htm, 2/27/03.
[9] Sergey Brin and Lawrence Page, 'The Anatomy of a Large-Scale Hypertextual Web Search Engine',
http://www.stanford.edu/pub/papers/google.pdf, 3/02/03.
[10] http://citeseer.nj.nec.com/cache/papers/cs/1118/http:zSzzSzwww.cs.ust.hkzSz~dleezSzPaperszSzwwwzSzwww4.pdf/yuwono95world.pdf, 3/04/03.
[11] Dr. Bruce Litow, 'A Review of World Wide Web Searching Techniques',
http://citeseer.nj.nec.com/559198.html, 3/09/03.

