
Research Paper

Search Engine

Priyanka Nitin Sonawane 1, Dr. Ashwini Bramhe 2

1 Master of Computer Application (MCA), priyankasonawane212@gmail.com
2 Assistant Professor, SIMCA Pune, ashwinibramhe@sinhgad.edu

Abstract

In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from creating one three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date. This paper addresses the question of how to build a practical large-scale system which can exploit the additional information present in hypertext. We also look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

1. Introduction
The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as is the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high quality human-maintained indices such as Yahoo! or with search engines. Human-maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines.

1.1 Web Search Engines

Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94], had an index of 110,000 web pages and web accessible documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.
1.2 Google

Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access.

1.3 Design Goals

1.3.1 Improved Search Quality

Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision. There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications. In particular, link structure and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text.
1.3.2 Academic Search Engine Research

Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable. Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, there have already been several papers using databases generated by Google, and many others are underway.

2. System Features

The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in Section 2.1. Second, Google utilizes link text to improve search results.

2.1 PageRank

The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches.
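The paper states that PageRank can be computed rapidly from the link graph but does not spell out the recurrence itself. The sketch below, in Python (the language the paper says the crawlers were written in), uses the standard damped formulation from the original PageRank work; the damping factor, iteration count, and toy graph are illustrative assumptions, not values taken from this paper.

```python
# Minimal iterative PageRank sketch. Standard recurrence (assumed, not given
# in this paper): PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T that
# link to A, where d is a damping factor and C(T) is T's outgoing-link count.

def pagerank(links, d=0.85, iterations=50):
    """links maps each page to the list of pages it links to."""
    pages = list(links)
    pr = {p: 1.0 for p in pages}          # initial rank for every page
    for _ in range(iterations):
        new_pr = {}
        for page in pages:
            # Rank contributed by every page that links to `page`.
            incoming = sum(pr[src] / len(out)
                           for src, out in links.items() if page in out)
            new_pr[page] = (1 - d) + d * incoming
        pr = new_pr
    return pr

if __name__ == "__main__":
    toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(toy_graph))  # heavily cited pages end up with higher rank
```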

2.2 Anchor Text

The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.
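A minimal sketch of the anchor-text idea, assuming a simple in-memory index; the function name and sample links are illustrative, not part of the system described here.

```python
# Associate the text of each link with the page the link points to, so the
# target can be returned for matching queries even if it was never crawled.
from collections import defaultdict

def build_anchor_index(links):
    """links is an iterable of (source_url, target_url, anchor_text)."""
    anchor_index = defaultdict(list)
    for source, target, text in links:
        # Index the anchor words under the *target* page, not the source.
        anchor_index[target].append(text.lower())
    return anchor_index

index = build_anchor_index([
    ("http://a.example", "http://b.example/logo.gif", "company logo"),
    ("http://c.example", "http://b.example/logo.gif", "our new logo"),
])
print(index["http://b.example/logo.gif"])  # ['company logo', 'our new logo']
```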

4. System Anatomy

4.1 Google Architecture Overview

In this section, we will give a high level overview of how the whole system works as pictured in Figure 1. Further sections will discuss the applications and data structures not mentioned in this section.

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index.

4.2 Major Data Structures

Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although CPUs and bulk input/output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.

4.2.1 BigFiles

BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.
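A minimal sketch of the BigFiles idea: one virtual file, addressed by a 64-bit offset, spread across several ordinary files so it can span multiple file systems. The chunk size, file naming, and class name are illustrative assumptions; the real package also manages file descriptors, which this sketch leaves to the operating system.

```python
# Map a 64-bit virtual offset onto (underlying chunk file, local offset).
import os

CHUNK_SIZE = 2 * 1024 ** 3  # assumed size of each underlying file (2 GB)

class BigFile:
    def __init__(self, directory):
        self.directory = directory
        os.makedirs(directory, exist_ok=True)

    def _chunk_path(self, offset):
        index, local = divmod(offset, CHUNK_SIZE)
        return os.path.join(self.directory, f"chunk_{index:05d}"), local

    def write(self, offset, data):
        path, local = self._chunk_path(offset)
        mode = "r+b" if os.path.exists(path) else "w+b"
        with open(path, mode) as f:      # simplistic: no write across chunks
            f.seek(local)
            f.write(data)

    def read(self, offset, length):
        path, local = self._chunk_path(offset)
        with open(path, "rb") as f:
            f.seek(local)
            return f.read(length)
```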
4.2.2 Repository

The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC 1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL, as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it.
4.2.3 Document Index

The document index keeps information about each document. It is a fixed width ISAM (index sequential access mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title.

4.2.5 Hit Lists

A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization -- simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding.
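A minimal sketch of a hand-optimized compact hit encoding like the one described above. The bit allocation here (1 capitalization bit, 3 bits of font size, 12 bits of position, packed into 16 bits) is an illustrative assumption; the paper does not specify the exact layout it used.

```python
# Pack a single hit into two bytes and unpack it again.

def pack_hit(capitalized, font_size, position):
    position = min(position, 4095)      # positions past 4095 are clamped
    font_size = min(font_size, 7)
    return (int(capitalized) << 15) | (font_size << 12) | position

def unpack_hit(hit):
    return bool(hit >> 15), (hit >> 12) & 0x7, hit & 0xFFF

hit = pack_hit(capitalized=True, font_size=3, position=42)
assert unpack_hit(hit) == (True, 3, 42)   # round-trips in two bytes per hit
```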
4.2.6 Forward Index

The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordIDs with hit lists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs, but the difference is very small for a reasonable number of buckets, and it saves considerable time and coding complexity in the final indexing phase done by the sorter.

4.2.7 Inverted Index

The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docIDs together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.
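A minimal sketch of turning a forward barrel into an inverted one, assuming simple in-memory structures: the forward barrel lists hits per (docID, wordID), and the sorter regroups them by wordID so each wordID points at a doclist of (docID, hit list) pairs. The real barrels are on-disk structures; this only illustrates the regrouping step.

```python
# Regroup forward-barrel entries by wordID to produce an inverted barrel.
from collections import defaultdict

def invert_barrel(forward_barrel):
    """forward_barrel: list of (doc_id, {word_id: [hits, ...]})."""
    inverted = defaultdict(list)
    for doc_id, words in forward_barrel:
        for word_id, hits in words.items():
            inverted[word_id].append((doc_id, hits))
    # Sort each doclist by docID, roughly what the sorter's output looks like.
    return {w: sorted(docs) for w, docs in inverted.items()}

forward = [(1, {7: [3, 9]}), (2, {7: [1], 8: [4]})]
print(invert_barrel(forward))  # {7: [(1, [3, 9]), (2, [1])], 8: [(2, [4])]}
```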
4.3 Crawling the Web

Running a web crawler is a challenging task. There are tricky performance and reliability issues and, even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
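A minimal sketch of that crawler structure: many fetches in flight at once, asynchronous IO, and a per-crawler DNS cache so each host is resolved only once. It uses only the Python standard library; the connection limit and request format are illustrative assumptions, not the actual crawler code.

```python
# Asynchronous fetching with a shared DNS cache and a connection cap.
import asyncio
import socket
from urllib.parse import urlsplit

dns_cache = {}            # hostname -> resolved IP address

async def resolve(host):
    if host not in dns_cache:
        loop = asyncio.get_running_loop()
        infos = await loop.getaddrinfo(host, 80, type=socket.SOCK_STREAM)
        dns_cache[host] = infos[0][4][0]     # cache the first address
    return dns_cache[host]

async def fetch(url, limit):
    async with limit:                        # cap simultaneous connections
        parts = urlsplit(url)
        ip = await resolve(parts.hostname)
        reader, writer = await asyncio.open_connection(ip, 80)
        request = f"GET {parts.path or '/'} HTTP/1.0\r\nHost: {parts.hostname}\r\n\r\n"
        writer.write(request.encode())
        await writer.drain()
        page = await reader.read()           # receive the whole response
        writer.close()
        return url, len(page)

async def crawl(urls, max_connections=300):
    limit = asyncio.Semaphore(max_connections)
    return await asyncio.gather(*(fetch(u, limit) for u in urls))

# asyncio.run(crawl(["http://example.com/"]))
```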
4.5 Searching

The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seemed to have made great progress in terms of efficiency. Therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort.

4.5.1 The Ranking System

Google maintains much more information about web documents than typical search engines. Every hit list includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case -- a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off, so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.
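A minimal sketch of the single-word ranking just described: count hits of each type, damp the counts into count-weights, take the dot product with the type-weights, and combine the resulting IR score with PageRank. All numeric weights and the combination rule are illustrative assumptions; the paper does not publish its actual parameter values.

```python
# Score a document for a single-word query from its hit types and PageRank.
import math
from collections import Counter

TYPE_WEIGHTS = {"title": 10.0, "anchor": 8.0, "url": 6.0,
                "plain_large": 3.0, "plain_small": 1.0}   # assumed values

def ir_score(hit_types):
    """hit_types: list of hit-type names for one word in one document."""
    counts = Counter(hit_types)
    score = 0.0
    for hit_type, weight in TYPE_WEIGHTS.items():
        # Count-weight grows linearly at first, then tapers off; log damping
        # is one simple way to get that shape.
        count_weight = math.log1p(counts[hit_type])
        score += count_weight * weight
    return score

def final_rank(hit_types, pagerank, alpha=0.5):
    # Assumed combination: a simple weighted sum of IR score and PageRank.
    return alpha * ir_score(hit_types) + (1 - alpha) * pagerank

print(final_rank(["title", "anchor", "plain_small"], pagerank=2.4))
```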
4.5.2 Feedback

The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine. A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then, when we modify the ranking function, we can see the impact of this change on all previous searches which were ranked. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.
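A minimal sketch of that feedback loop, assuming simple data structures: saved judgments from trusted users are replayed against a modified ranking function to see how many previously relevant results stay near the top. The metric and names are illustrative, not the mechanism actually used.

```python
# Compare an old and a new ranking function against saved relevance judgments.

def evaluate_change(saved_feedback, old_rank, new_rank):
    """saved_feedback: {query: set of URLs a trusted user marked relevant}.
    old_rank / new_rank: functions mapping a query to an ordered URL list."""
    agreements, total = 0, 0
    for query, relevant in saved_feedback.items():
        old_top = old_rank(query)[:10]
        new_top = new_rank(query)[:10]
        # Count relevant URLs that stay in the top ten after the change.
        agreements += sum(1 for url in relevant
                          if url in old_top and url in new_top)
        total += sum(1 for url in relevant if url in old_top)
    return agreements / total if total else 1.0
```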
5. Results and Performance

The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "bill clinton". These results demonstrate some of Google's features. The results are clustered by server. This helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain, which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine this was a good answer to the query.

5.1 Storage Requirements

Aside from search quality, Google is designed to scale cost effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and storage requirements of Google. Due to compression, the total size of the repository is about 53 GB, just over one third of the total data it stores. At current disk prices this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB.
Furthermore, most queries can be answered using just the short inverted index. With better encoding and compression of the Document Index, a high quality web search engine may fit onto a 7 GB drive of a new PC.

Table 1. Storage Statistics

    Total Size of Fetched Pages                  147.8 GB
    Compressed Repository                         53.5 GB
    Short Inverted Index                           4.1 GB
    Full Inverted Index                           37.2 GB
    Lexicon                                        293 MB
    Temporary Anchor Data (not in total)           6.6 GB
    Document Index (incl. variable width data)     9.7 GB
    Links Database                                 3.9 GB
    Total Without Repository                      55.2 GB
    Total With Repository                        108.7 GB

5.2 System Performance

It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, or any number of other problems stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day or 48.5 pages per second.
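As a quick check on the quoted crawl rates, the arithmetic works out as follows; the figures are taken directly from the paragraph above.

```python
# Verify the crawl-rate figures: 11 million pages downloaded in 63 hours.
pages, hours = 11_000_000, 63
per_second = pages / (hours * 3600)       # ~48.5 pages per second
per_day = pages / hours * 24              # ~4.19 million pages per day
print(round(per_second, 1), round(per_day / 1e6, 2))  # 48.5  4.19
```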
5.3 Search Performance

Improving the performance of search was not the major focus of our research up to this point. The current version of Google answers most queries in between 1 and 10 seconds. This time is mostly dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not have any optimizations such as query caching, subindices on common terms, and other common optimizations.
Web Page Statistics

    Number of Web Pages Fetched      24 million
    Number of URLs Seen            76.5 million
    Number of Email Addresses       1.7 million
    Number of 404's                 1.6 million

6. Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality including PageRank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

6.1 Future Work

A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates. We must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled. Work toward this goal has been done in [Cho 98]. One promising area of research is using proxy caches to build search databases, since they are demand driven. We are planning to add simple features supported by commercial search engines like boolean operators, negation, and stemming. However, other features are just starting to be explored, such as relevance feedback and clustering (Google currently supports a simple hostname based clustering). We also plan to support user context (like the user's location) and result summarization.
7. References

- Best of the Web 1994 -- Navigators. http://botw.org/1994/awards/navigators.html
- Bill Clinton Joke of the Day: April 14, 1997. http://www.io.com/~cjburke/clinton/970414.html
- Bzip2 Homepage. http://www.muraroa.demon.co.uk/
- Google Search Engine. http://google.stanford.edu/
- Harvest. http://harvest.transarc.com/
- Google. http://www.google.com/