
Table of Contents

1. Abstract
2. GoogleBot
3. How Google works
   (i) Googlebot, Google's web Crawler
4. GoogleBot techniques
5. What Appears on the Results Page
   (i) File Formats read by GoogleBot
   (ii) File Formats avoided by GoogleBot
6. Guiding GoogleBot
7. Advanced Operators Supported by Google
8. The Google Algorithm
   (i) The PageRank Algorithm
   (ii) How Page Rank Works
   (iii) The Relevance algorithm
9. Google Services

Conclusion

ABSTRACT

Currently, Google is the largest and most used search engine on the World Wide
Web. Sergey Brin and Larry Page, fellow graduate students who met at Stanford University,
created Google as a research project in 1997. Google is a search engine that
employs a robot to crawl the web finding web pages, which are then indexed and stored.
When a user submits a query, its key terms are matched against those indexed terms and a
ranked list of relevant results is returned to the user. Google has risen from an unheard-of
search engine to the leader of the search engine industry. In addition to growing a large
search engine and producing high profits, Google employs over 3,000 workers.
These workers are provided a unique and healthy atmosphere at the Googleplex, where
Google's headquarters are located.

GoogleBot
• 15 years old
• Staff Crawler, Google
• Interested in automated discovery, curation, and citation analysis of large
document repositories
• PhD, Stanford
• Unmarried, no children
• Expert user
• Uses HTTP/HTTPS in his quest to find all knowledge known to man

Overview

Googlebot is responsible for adding web sites to the organic search index of Google.com. He
is but one piece of the Google organic search puzzle: Googlebot is the crawler, the index is
the master database of all pages known to Google, a PageRank is calculated for each page
in the index, and a search algorithm ranks pages for specific user queries. Googlebot
is the Michael Jordan of web robots: he routinely visits billions of web pages in search of
new content, giving his bosses, Larry and Sergey, a significant competitive advantage in
creating the world's most popular search engine. Google.com delivers approximately 70% of
the unpaid, organic traffic to websites.

In general, Googlebot tries to help ensure that the pages he submits comply with Larry &
Sergey’s Webmaster Guidelines. In practice, many different parties make decisions about
which pages go into The Index or become visible to consumers, so for the purposes of this
persona, we’ll attribute them all to Googlebot.

Google strives to make it easy to find whatever you're seeking, whether it's a web page, a
news article, a definition, something to buy, or text in a book. By understanding what appears
on a results page, you'll be better able to determine if a page includes the information you're
seeking or links to it. After you enter a query, Google returns a results list ordered by what it
considers the items' relevance to your query, listing the best match first. Sponsored links
sometimes appear above and to the right of the search results.

How Google works

Google runs on a distributed network of thousands of low-cost computers and can therefore
carry out fast parallel processing. Parallel processing is a method of computation in which
many calculations can be performed simultaneously, significantly speeding up data
processing. Google has three distinct parts:

● Googlebot, a web crawler that finds and fetches web pages.

● The indexer that sorts every word on every page and stores the resulting index of words in
a huge database.

● The query processor, which compares your search query to the index and recommends the
documents that it considers most relevant.

In a nutshell, search engines use spiders to crawl the World Wide Web and collect
information. Then the search engine builds the index, encodes the index data, and stores
the data.

Step 1: Crawl the Web


Before a search engine can tell you where to find a page, it has to find the page itself.
Search engines use special software robots called spiders that travel the Web to find
pages. A spider will start crawling a popular site and then follow all the links on that site
to record other sites. And so on, and so on.
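As a rough illustration of that loop, here is a minimal spider sketch in Python; the breadth-first queue, the page limit, and the function names are simplifications of our own, not Google's actual code:

    import urllib.request
    from urllib.parse import urljoin
    from html.parser import HTMLParser

    class LinkParser(HTMLParser):
        """Collects the href value of every <a> tag found on a page."""
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(start_url, max_pages=100):
        """Start at one (ideally popular) site, then follow all the links
        it records to reach other sites, and so on and so on."""
        to_visit = [start_url]
        visited = set()
        while to_visit and len(visited) < max_pages:
            url = to_visit.pop(0)
            if url in visited:
                continue          # already recorded; skip duplicates
            visited.add(url)
            try:
                html = urllib.request.urlopen(url).read().decode("utf-8", "ignore")
            except Exception:
                continue          # dead or unreachable link; move on
            parser = LinkParser()
            parser.feed(html)
            to_visit.extend(urljoin(url, link) for link in parser.links)
        return visited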

Step 2: Collect Information


The spider records the words found on each page and where those words were found on
the page. Some search engines use spiders that record every word on a page; other
engines record only the important words on a page, ignoring common words such as "a,"
"and," and "the." The spider may pay more attention to words stored in certain locations
on a page, such as the page's title.

Step 3: Build Index


Once the robots find information, the search engine must store the information in such a
way that it can be found easily. A database index is similar to an index at the back of a
book: A book index contains information taken from pages and pointers (page numbers)
to the original sources of the information. If an index has been built well, users will be
able to find pages quickly.
Different search engines build their indexes differently. The different indexing methods
are one of the reasons why the same search may yield different results using different
engines. Some possible considerations for building an index include:

• The number of times an important word appears on a page.


• Where on a page a word appears.
• Whether a word is capitalized or not.
• The number of times a page is linked to from other pages.
• The importance of the other pages that link to the page.

Step 4: Encode Data


Before the index information is stored in a database, the search engine encodes the data to
reduce the size of the database and to speed up the search engine's response time.
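The report does not say which encoding Google uses; one common technique, assumed here purely for illustration, is to store each word's list of document IDs as gaps between sorted IDs and pack every gap into as few bytes as possible:

    def encode_postings(doc_ids):
        """Delta-encode a sorted posting list, then pack each gap as a
        variable-length integer: 7 bits per byte, high bit set on every
        byte except the last one of each number."""
        out = bytearray()
        prev = 0
        for doc_id in sorted(doc_ids):
            gap = doc_id - prev
            prev = doc_id
            while gap >= 0x80:
                out.append((gap & 0x7F) | 0x80)
                gap >>= 7
            out.append(gap)
        return bytes(out)

    def decode_postings(data):
        """Reverse the encoding back into absolute document IDs."""
        doc_ids, value, shift, prev = [], 0, 0, 0
        for byte in data:
            value |= (byte & 0x7F) << shift
            if byte & 0x80:
                shift += 7          # continuation byte; keep accumulating
            else:
                prev += value       # last byte of this gap
                doc_ids.append(prev)
                value, shift = 0, 0
        return doc_ids

Because the gaps between consecutive IDs are small for common words, most entries fit in a single byte, which both shrinks the database and speeds up reading it back.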

Step 5: Store Data


The final step in the process is to store the search index in a database.

Let's take a closer look at each part.

Googlebot, Google's web Crawler


Crawling
The first thing any search engine has to do is index, or collect information about,
websites. This is a complex process, since many thousands of sites are added, hundreds of
thousands are modified, links are created, and dead links are left scattered all over the
internet. The googlebot, which is also called a bot or spider, crawls all over the internet
looking for additions and modifications, finally indexing them for use by the Google search
engine. Googlebot uses sitemaps extensively to index websites, and this means you must
create a sitemap if you want to get indexed (get all your web pages indexed). I feel an
understanding of the process is important, since we come to realize the importance of
certain procedures which we may not appreciate otherwise.

Googlebot is Google's web crawling robot, which finds and retrieves pages on the web and
hands them off to the Google indexer. It's easy to imagine Googlebot as a little spider
scurrying across the strands of cyberspace, but in reality Googlebot doesn't traverse the web
at all. It functions much like your web browser, by sending a request to a web server for a
web page, downloading the entire page, then handing it off to Google's indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly
than you can with your web browser. In fact, Googlebot can request thousands of different
pages simultaneously. To avoid overwhelming web servers, or crowding out requests from
human users, Googlebot deliberately makes requests of each individual web server more
slowly than it's capable of doing. Googlebot finds pages in two ways: through an add URL
form, www.google.com/addurl.html, and through finding links by crawling the web.
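A sketch of those two ideas together, many simultaneous requests but a deliberate delay per individual server, might look like the following in Python; the two-second delay and the worker count are invented numbers, not Google's settings:

    import threading
    import time
    import urllib.request
    from urllib.parse import urlparse
    from concurrent.futures import ThreadPoolExecutor

    HOST_DELAY = 2.0   # assumed politeness gap (seconds) between hits to one server
    _last_hit = {}     # host -> time before which it must not be hit again
    _lock = threading.Lock()

    def polite_fetch(url):
        """Fetch one URL, but hit each individual web server more slowly
        than the worker pool is capable of doing."""
        host = urlparse(url).netloc
        with _lock:
            wait = max(0.0, _last_hit.get(host, 0.0) - time.time())
            _last_hit[host] = time.time() + wait + HOST_DELAY
        if wait > 0:
            time.sleep(wait)
        return urllib.request.urlopen(url).read()

    # Many different pages are requested simultaneously, across many hosts.
    urls = ["http://example.com/", "http://example.org/", "http://example.net/"]
    with ThreadPoolExecutor(max_workers=50) as pool:
        pages = list(pool.map(polite_fetch, urls))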

Unfortunately, spammers figured out how to create automated bots that bombarded the add
URL form with millions of URLs pointing to commercial propaganda. Google rejects those
URLs submitted through its Add URL form that it suspects are trying to deceive users by
employing tactics such as including hidden text or links on a page, stuffing a page with
irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways,
domains, or sub-domains with substantially similar content, sending automated queries to
Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays
some squiggly letters designed to fool automated "letter-guessers"; it asks you to enter the
letters you see — something like an eye-chart test to stop spambots.

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to
a queue for subsequent crawling. Googlebot tends to encounter little spam because most web
authors link only to what they believe are high-quality pages. By harvesting links from every
page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of
the web. This technique, known as deep crawling, also allows Googlebot to probe deep
within individual sites. Because of their massive scale, deep crawls can reach almost every
page in the web. Because the web is vast, this can take some time, so some pages may be
crawled only once a month. Although its function is simple, Googlebot must be programmed
to handle several challenges. First, since Googlebot sends out simultaneous requests for
thousands of pages, the queue of "visit soon" URLs must be constantly examined and
compared with URLs already in Google's index. Duplicates in the queue must be eliminated
to prevent Googlebot from fetching the same page again. Googlebot must determine how
often to revisit a page. On the one hand, it's a waste of resources to re-index an unchanged
page. On the other hand, Google wants to re-index changed pages to deliver up-to-date
results.

To keep the index current, Google continuously recrawls popular frequently changing web
pages at a rate roughly proportional to how often the pages change. Such crawls keep an
index current and are known as fresh crawls. Newspaper pages are downloaded daily;
pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return
fewer pages than the deep crawl. The combination of the two types of crawls allows Google
to both make efficient use of its resources and keep its index reasonably current.
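Google's actual revisit policy is unpublished; a simple stand-in that behaves as described, recrawling at a rate roughly proportional to how often a page changes, is to halve a page's crawl interval whenever its content has changed and to double it whenever it has not:

    import hashlib
    import time

    def schedule_recrawl(page, min_hours=1, max_hours=24 * 30):
        """Adapt a page's recrawl interval: pages that keep changing are
        visited more often (fresh crawl), stable pages less often."""
        digest = hashlib.md5(page["content"].encode("utf-8")).hexdigest()
        if digest != page.get("last_digest"):
            # The page changed since the last visit: crawl it more often.
            page["interval"] = max(min_hours, page.get("interval", 24) / 2)
        else:
            # Unchanged: re-indexing it again soon would waste resources.
            page["interval"] = min(max_hours, page.get("interval", 24) * 2)
        page["last_digest"] = digest
        page["next_crawl"] = time.time() + page["interval"] * 3600
        return page["next_crawl"]

A stock-quote page that changes on every visit converges to the one-hour minimum, while a page that never changes drifts toward the monthly deep-crawl interval.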

DIFFERENT GOOGLE CRAWLERS


FEEDFETCHER-GOOGLE – RSS reader fetcher

GOOGLEBOT – Collects web pages

• FRESHBOT – Collects updated data more frequently

• DEEPBOT – Follows every link that it finds

MEDIABOT – Used to analyze AdSense pages

IMAGEBOT – Crawling for the Image search

ADSBOT – Crawling AdWords landing pages for quality

GSA-CRAWLER – Used by Google Search Appliance

GOOGLEBOT-MOBILE – Crawling for Mobile pages

MEDIABOT:
USER-AGENT: MEDIAPARTNERS-GOOGLE

This crawler from Google is the lynchpin for serving contextually relevant ads on
publisher sites. Its purpose is to analyze the content of the pages so that the AdSense
program can serve meaningful ads on these sites. This crawler should not be blocked on
websites which are using AdSense.

IMAGEBOT:
USER-AGENT: GOOGLEBOT-IMAGE

Some of the known findings about IMAGEBOT are:

• The Imagebot scavenges the web for images to place in Google's image index.
• The ranking of images for a particular keyword depends on certain factors:
filename, surrounding text, alt text, and page title.
• If your website is not focused on image inventory and downloads, then it makes
sense to block Imagebot from crawling your site, using your robots.txt file.
• Blocking Imagebot also saves some bandwidth.

ADSBOT:
USER-AGENT: ADSBOT-GOOGLE
Some of the facts that have surfaced about this new member of the Google crawler family are:
• AdsBot serves a very specific purpose as far as crawling is concerned.
• It is geared to provide wisdom to Google's AdWords program by analyzing the
content of the landing pages related to an ad.
• This content analysis helps in determining the Quality Score for a particular ad.
• This Quality Score, in association with the bid amount and CTR (click-through rate),
is used by Google to determine the ranking score of an ad for a particular keyword.

Thus, it makes sense not to block AdsBot if you are an advertiser using AdWords.

GOOGLEBOT-MOBILE:
USER-AGENT: GOOGLEBOT-MOBILE

Some known facts about the Mobile Content crawler from Google are:
• Google does use a specific crawler to gather mobile content
• Google indexes public mobile web content
• If your content appears to be available only to a subset of all mobile users (for
example, only to subscribers of a certain mobile service provider), it may not
be indexed.
• Users can search the mobile web on their mobile devices using Google Mobile
Web Search.

[a] Getting Your Mobile content Indexed


The steps for this are roughly the same as for non-mobile content:
• Submit Mobile Sitemaps to the Google Mobile Index in the same way as
non-mobile Sitemaps are submitted.
• You create and add Mobile Sitemaps to your Google Webmaster Tools account
in a similar way to Sitemaps for non-mobile content.
• If your Mobile site has changed, then you can resubmit your Sitemap.

GSA-CRAWLER:

The gsa-crawler is the search appliance robot that performs the crawling on a web site. The
crawler identifies itself with every page it downloads from any web server by specifying a
user agent that can be stored in a web server log file by webmasters.
• The Google Search Appliance uses standard Google rules, such as searching for
all words and treating upper and lowercase letters the same. It recognizes the
minus sign (-) to exclude unwanted words, but does not allow the word NOT.
• Google Search Appliance default search results look like the public search
engine.
• The search results header includes the search field, search terms, and number
of matches, and a suggested alternate spelling, based on the site dictionary, if
appropriate.
• Each search result item has the title and URL, with the size and date if available,
and a "snippet" from the document showing the matched term in context
whenever possible.
• The Google Search Appliance is an excellent search engine for HTTP-accessible
content.

The identifier used by the crawler consists of:


• The user agent name, which, by default, is set to gsa-crawler.
• A unique identifier that is assigned for each search appliance.
• The problem email address you entered in Administration > System Settings.

If you keep the user agent name gsa-crawler, the accessed web servers might see an identifier
such as
gsa-crawler (Enterprise; GID01065; yourname@yourcompany.com)

The email is a required part of the identification to allow webmasters to contact you if the
search appliance affects them negatively by crawling their sites too rapidly.
There may be pages or sites in your organization that you do not want the search appliance to
crawl, such as password-protected directories with information that you want to keep private.
To prevent the gsa-crawler from accessing the information on these servers, you can either:
• Enter their URL patterns in Do Not Crawl URLs with the Following Patterns
• Create and put a robots.txt file in the root of the server. A robots.txt file
consists of the user-agent name and one or more lines of instruction for the
robot.
For example:

# /robots.txt file for gsa-crawler (This is a comment line.)
User-agent: gsa-crawler (This names the user-agent that the file targets.)
Disallow: /*.cgi (The gsa-crawler will not be allowed to crawl any CGI files.)
Disallow: /*.pl (The gsa-crawler will not be allowed to crawl any Perl scripts.)
Allow: /$ (The gsa-crawler is allowed to crawl everything else.)
Disallow: / (This prevents the gsa-crawler from crawling anything on the site.)

FEEDFETCHER-GOOGLE:

This is the RSS and Atom feed crawler of Google. The content that will be crawled by
Feedfetcher includes:
• All blogs published through WordPress, TypePad, Blogger, etc., from all sources.
• Blogs written in languages apart from English: French, Italian, German, Spanish,
Korean, Brazilian Portuguese, and other languages as well.
• Usually, the average crawl frequency for Feedfetcher is about an hour or more,
based on the frequency of your site updates.
• So, if your blog publishes a site feed in any format and pings an update service,
then the contents of this feed will be indexed in Blog Search.

So, if you are on Google Blog Search, rest assured that Feedfetcher is doing all the
collection work which makes the search easier for you.

Differing Functions of Googlebot Crawlers

• Googlebot News
• Googlebot Titles
• Googlebot Content
• Googlebot Links
• Googlebot Images
• Googlebot Blogs

Google's Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in
Google's index database. This index is sorted alphabetically by search term, with each index
entry storing a list of documents in which the term appears and the location within the text
where it occurs. This data structure allows rapid access to documents that contain user query
terms. To improve search performance, Google ignores (doesn't index) common words called
stop words (such as the, is, on, or, of, how, why, as well as certain single digits and single
letters). Stop words are so common that they do little to narrow a search, and therefore they
can safely be discarded. The indexer also ignores some punctuation and multiple spaces,
and converts all letters to lowercase, to improve Google's performance.
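A toy version of such an index in Python, with the lowercasing, punctuation stripping, and stop-word removal described above (the stop-word list here is abbreviated), might look like this:

    STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a", "and"}

    def build_index(documents):
        """Build an inverted index mapping each term to a list of
        (doc_id, position) pairs, so queries can be answered quickly."""
        index = {}
        for doc_id, text in documents.items():
            for position, word in enumerate(text.lower().split()):
                word = word.strip(".,;:!?\"'()")   # ignore some punctuation
                if not word or word in STOP_WORDS:
                    continue                        # discard stop words
                index.setdefault(word, []).append((doc_id, position))
        return index

    docs = {1: "Googlebot crawls the web", 2: "The web is vast"}
    index = build_index(docs)
    # index["web"] -> [(1, 3), (2, 1)]; sorted(index) gives the
    # alphabetical term ordering described above.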

Google's Query Processor

The query processor has several parts, including the user interface (search box), the "engine"
that evaluates queries and matches them to relevant documents, and the results formatter.
Google considers over a hundred factors in determining which documents are most relevant
to a query, including the popularity of the page, the position and size of the search terms
within the page, and the proximity of the search terms to one another on the page. PageRank
is Google's system for ranking web pages. (View a website's PageRank from Google's
Toolbar and from the Google Directory.)

Google also applies machine-learning techniques to improve its performance automatically


by learning relationships and associations within the stored data. For example, the spelling-
correcting system uses such techniques to figure out likely alternative spellings. Google
closely guards the formulas it uses to calculate relevance; they're tweaked to improve quality
and performance, and to outwit the latest devious
techniques used by spammers. Indexing the full text of the web allows Google to go beyond
simply matching single search terms. Google gives more priority to pages that have search
terms near each other and in the same order as the query. Google can also match multi-word
phrases and sentences. Since Google indexes HTML code in addition to the text on the page,
users can restrict searches on the basis of where query words appear, e.g., in the title, in the
URL, in the body, and in links to the page, options offered by the Advanced-Search page and
search operators.
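Building on the toy index above, here is a sketch of the matching step: intersect the posting lists of the query terms, then prefer documents whose terms sit near each other. Real ranking combines this kind of signal with PageRank and many other factors:

    def search(index, query):
        """Return documents containing every query term, ordered so that
        pages with the terms closest together come first."""
        terms = [t for t in query.lower().split() if t in index]
        if not terms:
            return []
        # Documents that contain every query term.
        matching = set.intersection(
            *({doc for doc, _ in index[t]} for t in terms))

        def best_gap(doc):
            """Smallest distance between occurrences of two query terms."""
            if len(terms) < 2:
                return 0
            pos = {t: [p for d, p in index[t] if d == doc] for t in terms}
            return min(abs(p1 - p2)
                       for i, t1 in enumerate(terms) for t2 in terms[i + 1:]
                       for p1 in pos[t1] for p2 in pos[t2])

        return sorted(matching, key=best_gap)   # closest terms first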

Let's see how Google processes a query:

1. The web server sends the query to the index servers. The content inside the index
servers is similar to the index in the back of a book: it tells which pages contain the
words that match any particular query term.

2. The query travels to the doc servers, which actually retrieve the stored documents.
Snippets are generated to describe each search result.

3. The search results are returned to the user in a fraction of a second.

GoogleBot techniques

Deep crawling technique:

When Googlebot fetches a page, it culls all the links appearing on the page and adds them to
a queue for subsequent crawling. Because of their massive scale, deep crawls can reach
almost every page in the web. Because the web is vast, this can take some time, so some
pages may be crawled only once a month.

Fresh crawls:
To keep the index current, Google continuously recrawls popular frequently changing web
pages at a rate roughly proportional to how often the pages change.
The combination of the two types of crawls allows Google to both make efficient use of its
resources and keep its index reasonably current.

Deceiving tactics:
Google rejects those URLs submitted through its Add URL form that it suspects are trying to
deceive users by employing tactics such as:
• including hidden text or links on a page
• stuffing a page with irrelevant words (Keyword stuffing)
• Meta tag stuffing
• cloaking
• using sneaky redirects
• creating doorways, domains, or sub-domains with substantially similar content
• sending automated queries to Google
• and linking to bad neighbors

Cloaking:
refers to any of several means to serve a page to the search-engine spider that is
different from that seen by human users.

Code swapping:
optimizing a page for top ranking and then swapping another page in its place once a
top ranking is achieved.

Gateway or Doorway pages

Doorway pages are Web pages designed and built specifically to draw search engine visitors
to your website. They are standalone pages designed only to act as doorways to your site.

What Appears on the Results Page

The results page is filled with information and links, most of which relate to your query.

● Google Logo: Click on the Google logo to go to Google's home page.

● Statistics Bar: Describes your search, includes the number of results on the current results
page and an estimate of the total number of results, as well as the time your search took. For
the sake of efficiency, Google estimates the number of results; it would take considerably
longer to compute the exact number. This estimate is unreliable. Every underlined term in the
statistics bar is linked to its dictionary definition. Queries that are linked to just one definition
are followed by a definition link.

● Tips: Sometimes Google displays a tip in a box just below the statistics bar.

● Search Results: Ordered by relevance to your query, with the result that Google
considers the most relevant listed first. Consequently, you are likely to find what you're
seeking quickly by looking at the results in the order in which they appear. Google assesses
relevance by considering over a hundred factors, including how many other pages link to the
page, the positions of the search terms within the page, and the proximity of the search terms
to one another.
Below are descriptions of some search-result components. These components appear in fonts
of different colors on the result page to make it easier to distinguish them from one another.

❍ Page Title: (blue) The web page's title, if the page has one, or its URL if the page
has no title or if Google has not indexed all of the page's content. Click on the page
title (e.g., The History of the Brassiere - Mary Phelps Jacob) to display the
corresponding page.

❍ Snippets: (black) Each search result usually includes one or more short excerpts of
the text that matches your query with your search terms in boldface type. Each
distinct excerpt or snippet is separated by an ellipsis (...). These snippets, which
appear in a black font, may provide you with

■ The information you are seeking


■ What you might find on the linked page
■ Ideas of terms to use in your subsequent searches

When Google hasn't crawled a page, it doesn't include a snippet. A page might not
be crawled because its publisher requested no crawling, or because the page was
written in such a way that it was too difficult to crawl.

❍ URL of Result: (green) Web address of the search result. In the screen shot, the
URL of the first result is inventors.about.com/library/weekly/aa042597.htm.

❍ Size: (green) The size of the text portion of the web page. It is omitted for sites not
yet indexed. In the screen shot, "5k" means that the text portion of the web page is 5
kilobytes. One kilobyte is 1,024 (2^10) bytes. One byte typically holds one character.
In general, the average size of a word is six characters. So each 1k of text is about
170 words. A page containing 5K characters thus is about 850 words long.

Large web pages are far less likely to be relevant to your query than smaller pages.
For the sake of efficiency, Google searches only the first 101 kilobytes
(approximately 17,000 words) of a web page and the first 120 kilobytes of a pdf file.
Assuming 15 words per line and 50 lines per page, Google searches the first 22
pages of a web page and the first 26 pages of a pdf file. If a page is larger, Google
will list the page as being 101 kilobytes or 120 kilobytes for a pdf file. This means
that Google's results won't reference any part of a web page beyond its first 101
kilobytes or any part of a pdf file beyond the first 120 kilobytes.

❍ Date: (green) Sometimes the date Google crawled a page appears just after the size
of the page. The date tells you the freshness of Google's copy of the page. Dates
are included for pages that have recently had a fresh crawl.
❍ Indented Result: When Google finds multiple results from the same website, it lists
the most relevant result first with the second most relevant page from that same site
indented below it. In the screen shot, the indented result and the one above it are
both from the site inventors.about.com.

Limiting the number of results from a given site to two ensures that pages from one
site will not dominate your search results and that Google provides pages from a
variety of sites.

❍ More Results: When there are more than two results from the same site, access the
remaining results from the "More results from..." link.
When Google returns more than one page of results, you can view subsequent pages by
clicking either a page number or one of the "o"s in the whimsical "Gooooogle" that appears
below the last search result on the page.

If you find yourself scrolling through pages of results, consider increasing the number
of results Google displays on each results page by changing your global preferences
(see the section Changing Your Global Preferences).

In practice, however, if pages of interest to you aren't within the first 10 results,
consider refining your query instead of sifting through pages of irrelevant results. To
simplify such refinements, Google includes a search box at the bottom of the page
you can use to enter your refined query.

● Sponsored Links: Your results may include some clearly identified sponsored links
(advertisements) relevant to your search. If any of your search terms appear in the ads,
Google displays them in boldface type.

● Spelling Corrections, Dictionary Definition, Cached, Similar Pages, News, Product


Information, Translation, Book results: Your results may include these links, which are
described on the next few pages.

Cached Pages
Google takes a snapshot of each page it examines and caches (stores) that version as a
back-up.
The cached version is what Google uses to judge if a page is a good match for your query.
This is useful if the original page is unavailable because of:

• Internet congestion
• A down, overloaded, or just slow website
• The owner recently removing the page from the Web

Note: Since Google’s servers are typically faster than many web servers, you can often
access a page’s cached version faster than the page itself.

Note: Google indexes a page (adds it to its index and caches it) frequently if the page is
popular (has a high PageRank) and if the page is updated regularly.

The new cached version replaces any previous cached versions of the page.

File Formats read by GoogleBot:

• Adobe Portable Document Format (pdf)
• Adobe PostScript (ps)
• Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
• Lotus WordPro (lwp)
• MacWrite (mw)
• Microsoft Excel (xls)
• Microsoft PowerPoint (ppt)
• Microsoft Word (doc)
• Microsoft Works (wks, wps, wdb)
• Microsoft Write (wri)
• Rich Text Format (rtf)
• Shockwave Flash (swf)
• Text (ans, txt)

File Formats avoided by GoogleBot:

Some file extensions have a very large file size and are considered untouchable by
the Bot. Some of these are exe, dll, zip, dmg, etc.

Guiding GoogleBot:

GoogleBot can be directed to crawl (or not crawl) a certain page through the use of a
robots.txt file (which addresses the crawler by its user-agent name) or through the use
of a robots meta tag. This tag resides inside the website's code and has the following format:

<meta name="googlebot" content="robots-terms">


The supported robots-terms are:

noindex – The document will not be indexed by Googlebot.

nofollow – Internal and external links in the document will not be followed by
Googlebot.

noarchive – Google will not archive a copy of the document (Google's Cached Page).

nosnippet – Google will not display snippets and will not archive a copy of the document
(Google's Cached Page). A snippet is a text excerpt from the returned result page that has all
query terms bolded.

If this robots meta tag is missing, or if it has no content, or the robots-terms are not
specified, then the robots-terms are assumed to be "index, follow" (i.e. "all"), which is
the default indexing behavior for most major search engine spiders.
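For example, a page that should be neither indexed nor have its links followed could combine two of the terms above:

<meta name="googlebot" content="noindex, nofollow">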

GOOGLEBOT has got two assistants to take away some of its load:
i. FRESHBOT: It is a relatively new Bot which is used to crawl updated pages on the
web. It crawls pages which are already in the index and looks for their updated versions.
Thus, it makes sense to update your site as frequently as possible.

ii. DEEPBOT: This guy follows every link it can find and downloads as many pages as it
can. It's a crawl maniac, to say the least. But this brings a better and more holistic picture
of the page to Google. In the process, Google also gets a more complete picture of the
composition of a site. This Bot usually arrives at the end of Google's monthly ritual of
backlink and content inspection (called the Google dance). At that point, it takes as much
content as possible for a deeper level of indexing.

Advanced Operators Supported by Google

Query modifiers

• filetype:
• intitle:
• inurl:
• site:
• synonyms
Alternative query types

• cache:
• link:
• related:
• info:

Other information needs


• phonebook
• stocks:
• define:
• Google Calculator
• weather
• movies:

Filetype:
• restricts your results to files ending in ".doc" (or .xls, .ppt, etc.), and shows you only
files created with the corresponding program.
• There can be no space between filetype: and the file extension

• The “dot” in the file extension – .doc – is optional.

Intitle:
• restricts the results to documents containing a particular word in their titles.

• There can be no space between intitle: and the following word.


• You can also search for phrases. Just put your phrase in quotes.

Inurl:

• restricts the results to documents containing a particular word in their URLs.

• There can be no space between inurl: and the following word.


site:
• restricts the results to pages within the given site or domain.
• There can be no space between site: and the domain.

cache:
• shows the version of a web page that Google has in its cache.
• There can be no space between cache: and the URL.
• You can use cache: in conjunction with a keyword or phrase, but few do.

link:
• restricts the results to those web pages that have links to the specified URL.

• There can be no space between link: and the URL.

related:
• lists web pages that are "similar" to a specified web page.
• There can be no space between related: and the URL.

info:
• presents some information that Google has about a particular web page.
• There can be no space between info: and the URL.

phonebook:
• There are two ways to use Google’s phonebook:
– Just do a regular search.
– Use one of Google’s phonebook commands.
• Phonebook commands [in lowercase]:
– phonebook: searches the entire Google phonebook.
– rphonebook: searches residential listings only.
– bphonebook: searches business listings only.

stocks:
• If you begin a query with stocks: Google will treat the rest of the query terms as stock
ticker symbols, and will link to a Yahoo finance page showing stock information for
those symbols.
• Go crazy with the spaces – Google ignores them!

define:
• If you begin a query with define: Google will display definitions for the word or
phrase that follows, if definitions are available.
• You don’t need quotes around your phrases.
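To illustrate how these operators are written in practice, here are a few sample queries (the sites and search terms are arbitrary examples):

salary survey filetype:xls
intitle:"annual report" site:stanford.edu
census data inurl:gov
related:www.google.com
define:algorithm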

The Google Algorithm

• The PageRank algorithm

• The Relevance algorithm

The PageRank Algorithm

• It's an off-page factor.

• The purpose of PageRank is to assign a numerical value to web pages according to
the number of times that other pages recommend them and according to the PageRank
that those pages have. That is to say, it establishes the importance of a web page.

• The PageRank algorithm is complex and is formed by many variables and many
other minor algorithms.

• PageRank is a numerical value that goes from 0 to 10 on a logarithmic scale. This
means that it is much more difficult to rise from 5 to 6 than from 2 to 3.

• PageRank is not recalculated whenever we make a search. Google recalculates it
from time to time (perhaps every day), while the toolbar value is updated only every
3 or 4 months.

How Page Rank Works

PageRank relies on the uniquely democratic nature of the web by using its vast link
structure as an indicator of an individual page's value. In essence, Google interprets a link
from page A to page B as a vote, by page A, for page B. But, Google looks at more than
the sheer volume of votes, or links a page receives; it also analyzes the page that casts the
vote. Votes cast by pages that are themselves "important" weigh more heavily and help to
make other pages "important." Important, high-quality sites receive a higher PageRank,
which Google remembers each time it conducts a search. Of course, important pages
mean nothing to you if they don't match your query. So, Google combines PageRank
with sophisticated text-matching techniques to find pages that are both important and
relevant to your search. Google goes far beyond the number of times a term appears on a
page and examines all aspects of the page's content (and the content of the pages linking
to it) to determine if it's a good match for your query.

Google PageRank (PR) is a numerical ranking from 0 to 10. PR is calculated on a


periodic basis by Google for each and every page in their index. PageRank is page
specific, not site specific. The PR of the pages on your site can (and probably will) vary
from page to page.

Google uses their proprietary (and secret) PageRank Algorithm to calculate your web
page's PR based upon the quantity and quality of the links pointing to your page from
other web pages, including your own pages as well as those belonging to other
webmasters.

It's important to understand that it isn't just the quantity of links pointing to your page that
helps establish its PageRank, but also the quality of the pages that the links are on. In
general, the higher the PR of the page linking to your page, the larger the PR boost that
your page will receive from the link.

The quantity of other outbound links on the page linking to your page has an effect on the
PR calculation as well. In general, the more outbound links there are on the linking page,
the smaller the PR boost that your page will receive from the link.
For example, a link from a page with a PR of 5 that has 100 outbound links on it will
boost your page's PR less than a PR5 page containing only 10 outbound links.
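The formula published in Brin and Page's original paper is PR(A) = (1 - d) + d * (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)), where T1...Tn are the pages linking to A, C(T) is the number of outbound links on page T, and d is a damping factor usually set to 0.85. A small Python sketch of the iterative calculation follows; the example link graph is made up:

    def pagerank(links, damping=0.85, iterations=50):
        """Iteratively compute PageRank: each page shares its current rank
        equally among the pages it links to, so a vote from a page with
        few outbound links is worth more than one from a page with many."""
        pages = set(links) | {p for targets in links.values() for p in targets}
        rank = {page: 1.0 for page in pages}
        for _ in range(iterations):
            new_rank = {page: 1.0 - damping for page in pages}
            for page, targets in links.items():
                if targets:
                    share = damping * rank[page] / len(targets)
                    for target in targets:
                        new_rank[target] += share
            rank = new_rank
        return rank

    web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
    print(pagerank(web))   # C collects votes from both A and B

Note that this iteration produces the raw score; the 0-to-10 value shown on the toolbar is a logarithmic rescaling of it.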

The Relevance algorithm

• The logic of this algorithm is the following: Google wishes to know whether your page
really deals with the subject that the user is looking for.

• To be sure, Google verifies that the words searched for by users appear in your page,
notes where they appear, and also analyzes the words that third parties use to
recommend you.

What counts for the Relevance algorithm?

• The relevance algorithm considers the following factors:

– Word relevance in the general context of indexed pages: in how many pages the word
appears.

– The relevance of the word in each one of the pages: how many times the word is
repeated.

– The words that other web pages have used to link to you (anchor text in links from
third parties).

The relevance inside a page

– URL
– Page title (<title>)
– Description
– Headings (H1,H2, etc...)
– Links
– Bold text
– Alternative text (ALT)
– ...
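As a toy combination of those signals, the sketch below scores one document for a query using how rare each word is across the whole index, how often it repeats in the page, and whether third parties use it in links pointing to the page; the weights and the anchor_texts input are arbitrary illustrations, not Google's closely guarded formula:

    import math

    def relevance(index, anchor_texts, query, doc_id, total_docs):
        """Score doc_id for a query from three factors: word rarity across
        the index, repetition within the page, and third-party anchor text."""
        score = 0.0
        for term in query.lower().split():
            postings = index.get(term, [])
            docs_with_term = len({d for d, _ in postings})
            if docs_with_term == 0:
                continue
            tf = sum(1 for d, _ in postings if d == doc_id)       # repetitions in the page
            idf = math.log(total_docs / docs_with_term)           # rarity across all pages
            anchors = anchor_texts.get(doc_id, "").lower().count(term)  # links from others
            score += (tf + 2 * anchors) * idf                     # weights are arbitrary
        return score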

Services, Tools, Programs and Downloads offered by Google

Google Services

Google Alerts
Google Alerts are emails automatically sent to you when there are new Google
results for your search terms. Google currently offers three types of alerts: "News",
"Web", and "News & Web". Google Alerts website (http://www.google.com/alerts)

Google Answers
In April 2002, Google launched a new service called Google Answers. It is an
extension to the conventional search — rather than doing the search themselves, users
pay someone else to do the search. Customers ask questions, offer a price for an answer,
and researchers answer them. Researchers are screened through an application process
that tests their research and communications abilities.

Prices for questions range from $2 to $200; Google keeps 25% of the payment, sends the rest
to the researchers, and charges an additional $0.50 listing fee. Once a question is answered, it
remains available for anyone to browse for free. This service came out of beta in May 2003
and presently receives more than one hundred question postings per day. Google states
that asking questions about Google is not allowed on Google Answers. Google Answers
website (http://answers.google.com)

Google Catalogs
As of late August 2004, Google Catalogs is in the beta stage. Numerous (over
6,600 at the time of this writing) print catalogs are archived on Google as scanned image
files. Through the use of character recognition, users can search for a text string in these
catalogs in a fashion similar to how they would for materials on the general web.

Matching results are displayed through thumbnails of the pages on which the text was
found, with the specific area of the page where the search result is found shaded in a
yellow box. Another image file next to the thumbnail, a shrunk version of the highlighted
area on the thumbnail, highlights the exact location of the search result. Users can then
access the page of the catalog (as a larger graphic file) and change pages by using a
navigation bar positioned above the page image. It might be worth noting that one can
access the catalogs without a search as well.
Google Catalogs website (http://catalogs.google.com)

Google Directory
The directory is a subset of the links in Google's database arranged into hierarchical
subcategories, like an advanced Yellow Pages of the web. The original source of the
directory, and the categorization is the Open Directory Project (ODP), which publishes an
easily parsed version of its database in Resource Description Framework format for other
sites, like Google, to use for derivative directories.
Google Directory website (http://directory.google.com)

Froogle
Froogle is a price engine that searches online stores for particular products. It is also offered
in Wireless Markup Language (WML) form and can be accessed from cellphones or other
wireless devices that have support for WML.

Google Groups
Google maintains a Usenet archive, called Google Groups (formerly an independent site
known as Deja News). Google is currently testing a new version of its Groups service, which
archives mailing lists hosted by Google in addition to Usenet posts, using the same interface
as Gmail (see below). Formally known as "Google Groups Beta," the new version of Google
Groups is much more advanced than the last, letting you more easily join a group, make a
group, and track your favorite topics.

The original Google Groups interface, which was preferred by a great number of regular
Usenet posters to the current Beta version, due to its closer adherence to established Usenet
Netiquette (and note that where the previous paragraph says "advanced," many Web users
would read "cluttered"), was available until May 4, 2005, on the overseas domains
http://www.google.ca and http://www.google.co.uk. As of May 4, 2005, the so-called
"Google Groups Classic" was taken offline and is only available on foreign-language
overseas mirrors such as http://www.google.es and http://www.google.fr. Even that minor
functionality is
expected to be removed in the near future.
Google Groups Beta website (http://groups.google.com)

Google Images
In 2003, Google announced Google Images, which allows users to search the web for image
content. The keywords for the image search are based on the filename of the image, the link
text pointing to the image, and text adjacent to the image. When searching for an image, a
thumbnail of each matching image is displayed. Then when clicking on a thumbnail, the
image is displayed in a frame at the top of the page and the website on which that image was
found is displayed in a frame below it, making it easier to see from where the image is
coming. Google Images
website (http://images.google.com)

Google Labs
Google Labs consists of all of Google's experimental technologies. Google Labs is akin to a
directory page that links to all Google technologies under development or in beta that have
not yet been made widely available. From the Google Labs home page, a user can access
Google Suggest, Google Desktop Search, and other web technologies.

Google Local
Google Local helps you focus your search on a specific geographic location. Sometimes you
want to search the whole worldwide web, and sometimes you just want to find an auto parts
store within walking distance. The service lets you search for a "What" such as pizza and a
"Where" such as Poughkeepsie, New York. The purpose of Google Local is to help people
find local businesses. Not only does Google Local display the website of the businesses, but
often times it will also display the phone number and address. Google Local was introduced
to the Google home page a few weeks ago and is now the basis of Google Maps.
Google Local website (http://local.google.com)
Google Maps
On February 8, 2005, Google introduced a beta release of an online map
service called Google Maps, which currently only covers the USA, Canada, the UK and
Ireland. It can interact with Google Local to restrict results to a certain area. The service
features draggable maps, a location search, and turn-by-turn directions. It has received early
praise for the speed of its operation, produced by the pre-rendering of the maps it uses. It
currently only works with Internet Explorer and Mozilla-based browsers such as Mozilla
Firefox. Google also recently added support for the Opera and Safari web browsers. On
April 4, 2005, Google added satellite imagery to Google Maps.

Google Mobile
Allows users to search using Google from wireless devices such as mobile
phone and PDAs. Google Mobile website (http://mobile.google.com)

Google Movies
Allows users to search for info about movies using the main Google search interface.
You can search in various ways:

• Entering "movie: 10001" in the Google "search text" entry field will search for all
movies being shown in and around zip code 10001, sorted by movie theater.
Within the listing you can see showtimes, the average rating for each movie, as
well as links to all reviews, and a link to the IMDB page for that movie.

• Entering "movie: movies 10001" provides a listing sorted by movie, showing all
locations and showtimes where each movie is shown in the area.

• Entering "movie: Julia Roberts" provides a listing sorted by movie, of many of the
movies starring this actor/actress. It is unclear what rules/algorithm is used for
including/excluding certain movies.

My Search History
Keeps a record of all searches and clicked results while a user is logged into a Google
Account and allows this to be accessed and searched. My Search History
website (http://www.google.com/searchhistory)

Google News
Google introduced a beta release of an automated news compilation service,
Google News, in April 2002. There are different versions of the aggregator for more than
20 languages, with more added all the time. While the selection of news stories is fully
automated, the sites included are selected by human editors, and the choices have
occasionally led to some controversy.

Google Personalized
This service allows users to create a profile based on their interests. Future search results
are prioritized based on this information. Google Personalized
website (http://labs.google.com/personalized)

Google PhoneBook
This search feature is built into Google's standard search bar; if the search terms match
certain criteria (http://www.google.com/help/features.html#wp), an option to view search
results of Google's telephone directory archive is provided. One can search both
residential and business listings. There is also an
option (http://www.google.com/help/pbremoval.html) available to remove one's phone
book entry from Google.
Google PhoneBook results for Google,Inc. (http://www.google.com/search?
hl=en&lr=&pb=f&q=Google%2C+CA)

Google Print
In August 2004, Google announced its new Google Print service. This tool
searches the contents of books submitted by publishers and displays matches above web
matches on the search result page. It offers links to purchase the book, as well as
content-related
advertisements. Google will limit the number of viewable pages from any book
through user-tracking. As of early January 2005, this service remains in the beta stage.

This feature is similar to a service offered by A9.com. In December 2004, Google announced
an extension to its Google Print program. (http://www.google.com/googleblog/2004/12/all-
bookedup.html) It is a non-exclusive deal with several high-profile university and public
libraries, including the University of Michigan, Harvard (Widener Library), Stanford
(Green Library), Oxford (Bodleian Library), and the New York Public Library.
According to press releases and university librarians, Google plans to have approximately
15 million public domain volumes online within a decade.
Google Print website (http://print.google.com)

Google Scholar
In November 2004, Google released Google Scholar, which indexes and
searches academic literature across an array of sources and disciplines. Results are ranked
by "relevance", which is based largely on the number of citations and in this sense is
similar to PageRank. Google Scholar website (http://scholar.google.com)

Google Special
Allows users to perform special searches such as U.S. Government Search,
Linux Search, BSD Search, Apple Macintosh Search, and a Microsoft Windows Search.
Google Special website (http://www.google.com/options/specialsearches.html)

Google Suggest
A new feature called Google Suggest Beta was introduced on December 10, 2004. It
provides an autocomplete functionality that gives the user suggestions as they type.
JavaScript is used to rapidly query the server and update the page for each keystroke that the
user types. The feature quickly drew widespread praise as an impressive innovation, and so
far competitors have not offered anything similarly real-time.

It was also quickly noticed that Google attempts to avoid suggesting potentially offensive
searches. For instance, there are no suggestions for searches containing the word porn, but
there are many for pr0n and other variations that aren't on the blacklist.

Although pr0n (with a zero) is allowed, pron is on the blacklist, which has the side-effect
of not suggesting searches containing any words that include pron, such as apron,
mispronunciation,
pronunciation or prone. Unlike pron and sex, the word ass is only blacklisted when it appears
with a space after it, so words containing ass such as associated are suggested. The blacklist
also includes the word lesbian, but not faggot, nigger, shit, or several other words that are
often included on profanity blacklists.
Google Suggest website (http://www.google.com/webhp?complete=1&hl=en)

Google University
Allows users to search within a large number of educational institution
domains. Google University website (http://www.google.com/options/universities.html)

Google Video
On January 25, 2005, Google introduced a beta of Google Video, allowing
users to search through television content based on title, network or a closed caption
transcript. Google Video website (http://video.google.com/)

Google Web Search


Google's most famous creation is the Google search engine. Google.com has
indexed over 8 billion web pages, handles 200 million requests a day, and is the largest search
engine on the Internet. The search engine allows you to search through images, products
(Froogle), news, and the usenet archive. It uses a proprietary system (including
PageRank) to return the search results. A culture has grown around the very popular
search engine, and to google has come to mean, "to search for something on Google."

Google X
Google X was a project released by Google Labs on
March 15, 2005 and rescinded a day later. It consisted of the traditional Google search
bar, but it was made to look like the Dock user interface feature of Apple's Mac OS X
operating system.

Google Tools

Blogger
In 2003, Google acquired Pyra Labs and its Blogger service. Formerly premium
features that needed to be paid for were made available for free by Google. The tool,
Blogger, is a service to make weblog publishing easier. The user does not have to write
any code or worry about installing server software or scripts. Nevertheless, the user can
influence the design of his blog freely.

Google Browser Buttons


This tool allows users to put links to Google services in their web browsers.
Google Browser Buttons website (http://www.google.com/options/buttons.html)

Gmail
On April 1, 2004, Google announced its own free webmail service, Gmail,
which would provide users with 1000 MB (actually 1 GB, or 1024 MB) of storage for
their mailboxes and would generate revenue by displaying advertisements from the
AdWords service based on words in users email messages. Owing to April Fool's Day,
however, the company's press release was greeted with much skepticism in the
technology world. Jonathan Rosenberg, Google's vice-president of products, re-assured
BBC News by saying "We are very serious about Gmail." When Gmail was announced,
the storage space available was vastly more than that of most other free webmail
providers—for example, Microsoft's Hotmail only offered 2 MB, and Yahoo!'s Mail
service offered 4 MB. (In response to Gmail, Yahoo's limits have been upgraded to 250
MB and then again, to 1 GB for their free accounts, and 2 GB for their premium account;
Hotmail's limits have also been upgraded.)

There has been a great deal of criticism regarding Gmail's privacy policy. Most of the
criticism was over Google's plans to add context-sensitive advertisements to emails by
automatically scanning them. On April 1, 2005, Google announced that they would begin
constantly increasing mailbox size by approximately 1 MB every 75 seconds, with no plan to
stop. This actually was an April Fool's joke, but the company did simultaneously announce
that it was increasing mailbox size to 2 GB, with a promise to add more space in the future.
They are continuously adding more space, though much more slowly than on April 1. On their
webpage, they show how much space they are currently providing. By April 11, Google was
adding storage at approximately 3.5 MB each day.

Google Language Tool


This tool allows users to use Google in many different languages. Google
Language Tools website (http://www.google.com/language_tools)

Google Web API


The Google Web API (or Google Web Services) is Google's public interface
for registered developers. Using Simple Object Access Protocol (SOAP), a programmer
can write services for search and data mining that rely on Google's results. Also,
websurfers can view cached pages and make suggestions for better spelling. By default a
developer has a limit of 1,000 requests per day. This program is still in a beta phase.
Google is one of the few search engines to make its results available via a public API;
Technorati is another good example.

Some popular implementations of the Google Web API include the alerting service Google
Alerts, or FindForward (http://www.findforward.com), as well as the Google Dance Tool,
which monitors when Google is spidering the Internet.
Google Web API website (http://www.google.com/apis/)

Google Programs

AdSense
AdSense enables text or image advertisements to be displayed on Web sites
that want ads to help raise money. The ads are administered by Google and generate
revenue on a per-click basis. Google utilizes its search technology to serve ads based on
Web site content, the user's geographical location, and other factors. Those wanting to
advertise with Google's targeted ad system may sign up through AdWords.

AdWords
AdWords is a self-service system that allows advertisers' ads to appear on any Google
search page, Gmail message, or AdSense page when certain keywords are entered. The
AdWords service is Google's largest source of income. The advertiser pays
Google per click and there is a bidding system to determine ad ordering.

Google Downloads

Google Browser
After Google registered "gbrowser.com" speculation began that
it plans to release an Internet browser to compete with Internet Explorer. Executives have
been secretive about whether they intend to develop a browser. A spokesman hinted that,
"[Google believes] in reinventing the wheel with respect to browser technologies."
Google has recently hired Adam Bosworth, a former Microsoft employee who helped
write Internet Explorer, and Joe Beda, the man who has been working on Microsoft's next
generation graphics engine.

Google has also recently hired Ben Goodger the lead developer of Mozilla Firefox. Mozilla is
most well known for their Firefox web browser. With the 1.0 version of the Mozilla Firefox
browser, the default home page is set to a web page hosted by Google. Further speculation
involves Google modifying either Netscape, Mozilla or Firefox browsers.

Google Deskbar
In December 2003, Google launched the beta version of the Google Deskbar, a
search tool which runs from the Microsoft Windows taskbar, without a browser having to
be open. It can return film reviews, stock quotes, dictionary and thesaurus definitions,
plus any pre-configured search of a third-party site (e.g. eBay or Amazon). In November
2004, Google launched an API for Google Deskbar. Google Deskbar
website (http://deskbar.google.com/)

Google Desktop Search


Known internally under the codename Puffin, Google Desktop Search enables
desktop search. It runs locally on a PC and will index all Microsoft Outlook, Outlook
Express, Netscape Mail, and Thunderbird emails, text documents, Microsoft Office
documents, AOL Instant Messenger conversations, Internet Explorer, Mozilla, Mozilla
Firefox, and Netscape history on that PC, as well as PDF, music, image, and video files,
and allow the user
to search them from a browser. A plug-in feature has been released which allows
developers to code their own applications into the catalog. Google Desktop Search is an
extension of Google Search. After indexing a user's files, his or her local results will turn
up on normal Google search on his or her local computer.
Google Desktop Search does not store users' files on the web, and users' personal information
is not sent to Google. Google Desktop Search was likely developed in response to file and
Web search capabilities that will be offered in the next major release of Microsoft Windows,
codenamed Longhorn (slated for release in 2006) — features that directly compete with
Google's core Internet search business. However, some claim that Google Desktop
Search, as well as Longhorn's Desktop Search, was inspired by Apple Spotlight, a
competing technology that is currently being shipped with Mac OS X v10.4. Currently,
Google Desktop Search does not support Google's "Did You Mean" spelling-suggestion
feature. For example, if a user searches his or her computer for "chicke", it will not
ask whether he or she meant "chicken".

Desktop Search received much attention because it may allow reverse engineering of
Google's proprietary search algorithm. Google Desktop Search website
(http://desktop.google.com/)

Keyhole
On October 27, 2004, Google acquired Keyhole, a company creating online satellite
maps with the ability to view geographical information in 3D. Keyhole covers the entire
globe with satellite imagery, but not all of it in high resolution; the focus is on larger
metropolitan areas, initially in the U.S. Keyhole has the largest commercial
imagery database online in 3D today. It covers over 80 major metropolitan areas and
thousands of cities. The satellite imagery, aerial photography, elevation data, street
vectors, business listings, together are worth millions of dollars. The data are updated
every two to three years on average. Keyhole website (http://www.keyhole.com)

orkut
Though not mentioned on the Google homepage, orkut is a service hosted,
created and maintained by Google engineers. Orkut is a social networking service, where
users can list their personal and professional information, create relationships amongst
friends and join communities of mutual interest. Affinity Engines, a company based in
Palo Alto, has filed a lawsuit alleging that Orkut Büyükkökten, a co-founder of the
company, illegally took code he wrote for the company for use at Google.
(http://www.wired.com/news/business/0,1367,64046,00.html) There is some speculation
saying that orkut and Gmail are part of a Google effort to gather information about their
users, with the intention of offering a better personalized search service in future. Google
already has a personalized search in Google Labs.

Picasa
On July 13, 2004 Google acquired Picasa, software for management and
sharing of digital photographs. Since then, Google has released the latest edition of the
software with Picasa2. The aim of the software was to make photo editing simple and
easy to use. Picasa has also been integrated with Google's Blogger and Gmail services. It
is free to download.

Hello
This add-on to Google's Picasa software gives the user the ability to instant-message
pictures and to surf the web in a shared form: for example, two users instant messaging
can surf the web together. It also allows a user to directly add pictures from Picasa to
his/her blog on Blogger. This is the first instant messaging download offered by Google.

Google Toolbar
This addition to Microsoft Internet Explorer 5 or later adds Google's searching
capabilities in a toolbar in the web browser. The latest version includes pop-up ad
blocking, automatic filling of forms, the ability to show the Google PageRank value for
the current page being viewed, and SpellCheck, AutoLink and the WordTranslator. It has
been criticized for being a security risk because it updates itself without user intervention.
A separately downloadable add-on for the toolbar allows participation in Google
Compute, a distributed computing project to help scientific research.

Other browsers, such as Mozilla, Mozilla Firefox, Opera, and Safari, have built-in search
tools that offer the same functionality. Mozilla Firefox also has its own version of the Google
Toolbar,
the Googlebar, which is developed independently of and is not supported by Google or the
Mozilla Firefox developers. It expands upon the official Google toolbar to the point that the
only feature not replicated is the Google PageRank functionality. There are other tools that
bring the PageRank functionality to Mozilla and Firefox, including a modification of
Googlebar. Googlebar has also been built into Safari for Apple Computer's Mac OS X
operating system. Google Toolbar website (http://toolbar.google.com)

Google Web Accelerator


On May 3, 2005, Google launched a downloadable web accelerator known as Google Web
Accelerator. Currently, it supports Mozilla Firefox and Microsoft Internet Explorer. It
speeds up web browsing through the use of a local proxy server. This server sends
requests to Google's Web Accelerator servers to help get a faster response. The data
between the local proxy and the accelerator servers is compressed to reduce transfer
time. The Google Web Accelerator also uses caching and prefetching. Currently, the
Google Web Accelerator is only for North America and Europe.

Conclusion

We have studied the working of GoogleBot. Most people do not know about the
mechanism behind Google searching; it should now be clear how Google works to answer
the search queries of the user. We studied the algorithms used by Google for searching
content. In the given context, we also studied the Google services, i.e., the services
provided by Google. All the sections are well categorized and well illustrated.

By the given report, it is now well understood how Google works and which parts play
an important role in searching the Google database. The overall concern of this report is
to clarify the mechanism behind Google searching.
