ABSTRACT

The good news about the Internet and its most visible component, the World Wide Web, is that there are hundreds of millions of pages available, waiting to present information on an amazing variety of topics. The bad news about the Internet is that there are hundreds of millions of pages available, most of them titled according to the whim of their author, almost all of them sitting on servers with cryptic names. When you need to know about a particular subject, how do you know which pages to read? If you're like most people, you visit an Internet search engine. Internet search engines are special sites on the Web that are designed to help people find information stored on other sites. There are differences in the ways various search engines work, but they all perform three basic tasks:

- They search the Internet -- or select pieces of the Internet -- based on important words.
- They keep an index of the words they find, and where they find them.
- They allow users to look for words or combinations of words found in that index.

Early search engines held an index of a few hundred thousand pages and documents, and received maybe one or two thousand inquiries each day. Today, a top search engine will index hundreds of millions of pages, and respond to tens of millions of queries per day. In this article, we'll tell you how these major tasks are performed, and how Internet search engines put the pieces together in order to let you find the information you need on the Web.

Finding key information on the gigantic World Wide Web is like trying to find a needle lost in a haystack. For that task we would use a special magnet that automatically, quickly and effortlessly attracts the needle for us. In this scenario, the magnet is the search engine.

Search engine: a software program that searches a database and gathers and reports information that contains or is related to specified terms; or, a website whose primary function is to provide such a search over information available on the Internet or a portion of the Internet.

1990 - The first search engine, Archie, was released. There was no World Wide Web at the time. Data resided on defense contractor, university, and government computers, and techies were the only people accessing the data. The computers were interconnected by Telenet, and the File Transfer Protocol (FTP) was used for transferring files from computer to computer. There was no such thing as a browser; files were transferred in their native format and viewed using the associated file-type software. Archie searched FTP servers and indexed their files into a searchable directory.

1991 - Gopherspace came into existence with the advent of Gopher. Gopher cataloged FTP sites, and the resulting catalog became known as Gopherspace.

1994 - WebCrawler, a new type of search engine that indexed the entire content of a web page, was introduced. Telenet/FTP passed information among the new web browsers, which accessed not FTP sites but WWW sites.

1995 - Webmasters and web site owners began submitting sites for inclusion in the growing number of web directories. Meta tags in the web page were first utilized by some search engines to determine relevancy.

1997 - To elevate sites in the search engine rankings, web sites started adding useful and relevant content while optimizing their web pages for each specific search engine.

1998 - Marketers determined that pay-per-click campaigns were an easy yet expensive approach to gaining top search rankings. Search engine rank-checking software was introduced; it provides an automated tool to determine a web site's position and ranking within the major search engines.

2000 - Another ranking approach was to determine the number of clicks (visitors) to a web site based upon keyword and phrase relevancy, and to include the number of links to a web site to determine its link popularity. Search engine algorithms began incorporating esoteric information in their ranking algorithms.

Stages in information retrieval

- Finding documents: potentially interesting documents must be found on a Web that consists of millions of documents distributed over tens of thousands of servers.
- Formulating queries: the user needs to express exactly what kind of information is to be retrieved.
- Determining relevance: the system must determine whether a document contains the required information or not.

Types of Search Engine

On the basis of how they work, search engines are categorized into the following groups:

- Crawler-Based Search Engines
- Directories
- Hybrid Search Engines
- Meta Search Engines

Crawler-Based Search Engines

These use automated software programs, known as spiders, robots, bots or crawlers, to survey and categorize web pages. A spider finds a web page, downloads it and analyzes the information presented on the page. The page is then added to the search engine's database.
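To make the crawler-based approach concrete, here is a minimal Python sketch of the per-page work just described: fetch one page, pull out its title and the words it contains, and record them in a simple in-memory keyword database. The URL, the regex-based keyword extraction and the dictionary standing in for the engine's database are all illustrative simplifications, not how a production engine actually works.

    import re
    import urllib.request

    keyword_db = {}   # keyword -> set of URLs; a toy stand-in for the engine's database

    def index_page(url):
        """Fetch one page, extract its title and words, and record them in keyword_db."""
        with urllib.request.urlopen(url) as response:
            html = response.read().decode("utf-8", errors="replace")

        # Crude extraction: grab the <title>, strip the remaining tags, split into words.
        title = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
        text = re.sub(r"<[^>]+>", " ", html)
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        if title:
            words.update(re.findall(r"[a-z0-9]+", title.group(1).lower()))

        for word in words:
            keyword_db.setdefault(word, set()).add(url)

    index_page("https://example.com/")   # example.com is just a placeholder URL
    print(sorted(keyword_db)[:10])       # a few of the keywords now pointing at that page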

The crawler works with a simple goal: indexing all of the keywords in web page titles. The crawler program treats the World Wide Web as a big graph, with the pages as nodes and the hyperlinks as arcs. When a user performs a search, the search engine checks its database of web pages for the keywords the user searched, and the results (the list of suggested links to go to) are listed on pages in order of closeness to the query, as defined by the bots. Examples of crawler-based search engines are:

- Google (www.google.com)
- Ask Jeeves (www.ask.com)

Robot Algorithm

All robots use the following algorithm for retrieving documents from the Web:

1. The algorithm uses a list of known URLs. This list contains at least one URL to start with.
2. A URL is taken from the list, and the corresponding document is retrieved from the Web.
3. The document is parsed to retrieve information for the index database and to extract the embedded links to other documents.
4. The URLs of the links found in the document are added to the list of known URLs.
5. If the list is empty or some limit is exceeded (number of documents retrieved, size of the index database, time elapsed since startup, etc.), the algorithm stops; otherwise it continues at step 2.

Three data structures are needed for the crawler (robot) algorithm:

1. url_table (a large linear array)
2. Heap
3. Hash table
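The numbered algorithm above maps almost directly onto code. Below is a minimal, single-threaded sketch in Python: an ordinary list plays the role of url_table, a set stands in for the hash table of URLs already seen, and the heap is omitted. The seed URL is only a placeholder; real robots add politeness delays, robots.txt handling and far more robust parsing.

    import re
    import urllib.request
    from urllib.parse import urljoin

    def crawl(start_urls, max_pages=50):
        url_table = list(start_urls)   # step 1: list of known URLs, at least one to start with
        seen = set(start_urls)         # hash table: URLs we already know about
        pages_fetched = 0

        while url_table and pages_fetched < max_pages:    # step 5: stop if empty or limit exceeded
            url = url_table.pop(0)                        # step 2: take a URL from the list
            try:
                with urllib.request.urlopen(url, timeout=10) as response:
                    html = response.read().decode("utf-8", errors="replace")
            except OSError:
                continue                                  # unreachable document: skip it
            pages_fetched += 1

            # step 3: parse the document; here we only extract the embedded links
            for href in re.findall(r'href="([^"#]+)"', html, re.IGNORECASE):
                link = urljoin(url, href)                 # resolve relative links
                if link.startswith("http") and link not in seen:
                    seen.add(link)
                    url_table.append(link)                # step 4: add newly found URLs to the list
        return seen

    print(len(crawl(["https://example.com/"])), "URLs discovered")

A production crawler distributes this loop across many machines and keeps the url_table and hash table in persistent storage rather than in memory.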
Directories

A directory uses human editors who decide which category a site belongs to; they place websites within specific categories or subcategories in the directory's database. The human editors comprehensively check the website and rank it, based on the information they find and using a pre-defined set of rules. By focusing on particular categories and subcategories, the user can narrow the search to those records that are most likely to be relevant to his or her interests. There are two major directories:

- Yahoo Directory (www.yahoo.com)
- Open Directory (www.dmoz.org)

Hybrid Search Engines

Hybrid search engines use a combination of both crawler-based results and directory results. Examples of hybrid search engines are:

- Yahoo (www.yahoo.com)
- Google (www.google.com)

Meta Search Engines

Also known as Multiple Search Engines or Metacrawlers, meta search engines query several other Web search engine databases in parallel and then combine the results into one list. Examples of meta search engines include:

- Metacrawler (www.metacrawler.com)
- Dogpile (www.dogpile.com)

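The "query several engines in parallel, then combine the results" idea can be sketched in a few lines of Python. The two engine functions below are placeholders returning canned URLs (a real metasearch engine would call the underlying engines' result pages or APIs); the point is the parallel fan-out and the merging of the ranked lists into one, with duplicates removed.

    from concurrent.futures import ThreadPoolExecutor

    # Placeholder back-ends: a real metasearch engine would query live search engines here.
    def search_engine_a(query):
        return ["https://example.com/a1", "https://example.com/shared"]

    def search_engine_b(query):
        return ["https://example.com/shared", "https://example.com/b1"]

    def metasearch(query, engines):
        # Fan the query out to every engine at the same time.
        with ThreadPoolExecutor(max_workers=len(engines)) as pool:
            result_lists = list(pool.map(lambda engine: engine(query), engines))

        # Combine into one list: interleave the engines' rankings and drop duplicates.
        merged, seen = [], set()
        for rank in range(max(len(results) for results in result_lists)):
            for results in result_lists:
                if rank < len(results) and results[rank] not in seen:
                    seen.add(results[rank])
                    merged.append(results[rank])
        return merged

    print(metasearch("search engines", [search_engine_a, search_engine_b]))

Interleaving by rank is only one possible merging policy; real metasearch engines weight the engines differently and re-rank the combined list.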
Pros and Cons of Meta Search Engines

Pros:

- Easy to use.
- Able to search more web pages in less time.
- High probability of finding the desired page(s).
- The user will get at least some results even when no result would have been obtained with traditional search engines.

Cons:

- Metasearch results are less relevant, since the metasearch engine does not know the internal alchemy of the search engines it uses.
- Advanced search features (such as the use of quotation marks, +/-, default AND between words, searches with boolean operators, and field limiting) are not usually available.
- Since only the top 10-50 hits are retrieved from each search engine, the total number of hits retrieved may be considerably less than found by doing a direct search.

Meta Search Engines (MSEs) come in four flavors:

1. "Real" MSEs, which aggregate and rank the results in one page.
2. "Pseudo" MSEs type I, which exclusively group the results by search engine.
3. "Pseudo" MSEs type II, which open a separate browser window for each search engine used.
4. Search Utilities: software search tools that fetch the pages requested by the user.

Google

Google uses spiders and maintains a large index of keywords. Google's PageRank is based on:

1. the frequency and location of keywords within the Web page,
2. the Web page's history (how long the page has existed), and
3. the number of other Web pages that link to the page in question.

CONCLUSION:

Search engines play an important role in accessing content over the internet; they have made the internet, and the information on it, just a click away. The search engine sites are among the most popular websites, and the need for better search engines only increases.

How Google Works

If you aren't interested in learning how Google creates the index and the database of documents that it accesses when processing a query, skip this description. I adapted the following overview from Chris Sherman and Gary Price's wonderful description of How Search Engines Work in Chapter 2 of The Invisible Web (CyberAge Books, 2001).

Google runs on a distributed network of thousands of low-cost computers and can therefore carry out fast parallel processing. Parallel processing is a method of computation in which many calculations can be performed simultaneously, significantly speeding up data processing. Google has three distinct parts:

- Googlebot, a web crawler that finds and fetches web pages.
- The indexer, which sorts every word on every page and stores the resulting index of words in a huge database.
- The query processor, which compares your search query to the index and recommends the documents that it considers most relevant.

Let's take a closer look at each part.

1. Googlebot, Google's Web Crawler

Googlebot is Google's web crawling robot, which finds and retrieves pages on the web and hands them off to the Google indexer. It's easy to imagine Googlebot as a little spider scurrying across the strands of cyberspace, but in reality Googlebot doesn't traverse the web at all.

It functions much like your web browser: by sending a request to a web server for a web page, downloading the entire page, and then handing it off to Google's indexer.

Googlebot consists of many computers requesting and fetching pages much more quickly than you can with your web browser. In fact, Googlebot can request thousands of different pages simultaneously. To avoid overwhelming web servers, or crowding out requests from human users, Googlebot deliberately makes requests of each individual web server more slowly than it's capable of doing.

Googlebot finds pages in two ways: through an add URL form, www.google.com/addurl.html, and through finding links by crawling the web. Unfortunately, spammers figured out how to create automated bots that bombarded the add URL form with millions of URLs pointing to commercial propaganda. Google rejects those URLs submitted through its Add URL form that it suspects are trying to deceive users by employing tactics such as including hidden text or links on a page, stuffing a page with irrelevant words, cloaking (aka bait and switch), using sneaky redirects, creating doorways, domains, or sub-domains with substantially similar content, sending automated queries to Google, and linking to bad neighbors. So now the Add URL form also has a test: it displays some squiggly letters designed to fool automated "letter-guessers" and asks you to enter the letters you see, something like an eye-chart test to stop spambots.
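The description above notes that Googlebot fetches thousands of pages at once while deliberately throttling its requests to any single server. A minimal sketch of that per-host politeness, assuming a simple single-process fetcher with an arbitrary two-second courtesy delay rather than Google's actual distributed crawler, might look like this:

    import time
    import urllib.request
    from urllib.parse import urlparse

    MIN_DELAY = 2.0            # assumed courtesy delay per host, in seconds
    last_request_time = {}     # host -> time of our most recent request to it

    def polite_fetch(url):
        """Fetch a URL, but never hit the same host more often than once per MIN_DELAY."""
        host = urlparse(url).netloc
        wait = MIN_DELAY - (time.monotonic() - last_request_time.get(host, 0.0))
        if wait > 0:
            time.sleep(wait)   # the previous request to this host was too recent: slow down
        last_request_time[host] = time.monotonic()
        with urllib.request.urlopen(url, timeout=10) as response:
            return response.read()

    for page in ["https://example.com/page1", "https://example.com/page2"]:
        polite_fetch(page)     # the second call waits ~2 s because both URLs share a host

A concurrent crawler would protect last_request_time with a lock and run many such fetchers at once; the sketch only shows the throttling idea.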
When Googlebot fetches a page, it culls all the links appearing on the page and adds them to a queue for subsequent crawling. Googlebot tends to encounter little spam because most web authors link only to what they believe are high-quality pages. By harvesting links from every page it encounters, Googlebot can quickly build a list of links that can cover broad reaches of the web. This technique, known as deep crawling, also allows Googlebot to probe deep within individual sites. Because of their massive scale, deep crawls can reach almost every page in the web. Because the web is vast, this can take some time, so some pages may be crawled only once a month.

Although its function is simple, Googlebot must be programmed to handle several challenges. First, since Googlebot sends out simultaneous requests for thousands of pages, the queue of "visit soon" URLs must be constantly examined and compared with URLs already in Google's index. Duplicates in the queue must be eliminated to prevent Googlebot from fetching the same page again. Googlebot must also determine how often to revisit a page. On the one hand, it's a waste of resources to re-index an unchanged page. On the other hand, Google wants to re-index changed pages to deliver up-to-date results. To keep the index current, Google continuously recrawls popular, frequently changing web pages at a rate roughly proportional to how often the pages change. Such crawls keep an index current and are known as fresh crawls. Newspaper pages are downloaded daily, and pages with stock quotes are downloaded much more frequently. Of course, fresh crawls return fewer pages than the deep crawl. The combination of the two types of crawls allows Google to both make efficient use of its resources and keep its index reasonably current.

2. Google's Indexer

Googlebot gives the indexer the full text of the pages it finds. These pages are stored in Google's index database. This index is sorted alphabetically by search term, with each index entry storing a list of documents in which the term appears and the location within the text where it occurs. This data structure allows rapid access to documents that contain user query terms.

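The index structure just described, one entry per term listing the documents the term appears in and the positions within each document, is commonly called an inverted index. A toy Python version, using two made-up documents purely for illustration:

    import re
    from collections import defaultdict

    # term -> {doc_id: [positions where the term occurs in that document]}
    index = defaultdict(dict)

    def add_document(doc_id, text):
        for position, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word].setdefault(doc_id, []).append(position)

    add_document("doc1", "Search engines index the full text of the web")
    add_document("doc2", "A web crawler feeds pages to the indexer")

    # Each entry gives direct access to the documents containing a term and where it occurs.
    print(sorted(index["web"].items()))    # [('doc1', [8]), ('doc2', [1])]

The alphabetical sorting mentioned above matters when the index is laid out on disk; an in-memory dictionary like this one does not need it.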
To improve search performance, Google ignores (doesn't index) common words called stop words (such as the, is, on, or, of, how, and why, as well as certain single digits and single letters). Stop words are so common that they do little to narrow a search, and therefore they can safely be discarded. The indexer also ignores some punctuation and multiple spaces, and converts all letters to lowercase, to improve Google's performance.

3. Google's Query Processor

The query processor has several parts, including the user interface (search box), the "engine" that evaluates queries and matches them to relevant documents, and the results formatter.

PageRank is Google's system for ranking web pages. A page with a higher PageRank is deemed more important and is more likely to be listed above a page with a lower PageRank. Google considers over a hundred factors in computing a PageRank and determining which documents are most relevant to a query, including the popularity of the page, the position and size of the search terms within the page, and the proximity of the search terms to one another on the page. A patent application discusses other factors that Google considers when ranking a page; visit SEOmoz.org's report for an interpretation of the concepts and the practical applications contained in Google's patent application.

Google also applies machine-learning techniques to improve its performance automatically by learning relationships and associations within the stored data. For example, the spelling-correcting system uses such techniques to figure out likely alternative spellings. Google closely guards the formulas it uses to calculate relevance; they're tweaked to improve quality and performance, and to outwit the latest devious techniques used by spammers.

Indexing the full text of the web allows Google to go beyond simply matching single search terms. Google can also match multi-word phrases and sentences. Since Google indexes HTML code in addition to the text on the page, users can restrict searches on the basis of where query words appear, e.g. in the title, in the URL, in the body, and in links to the page, options offered by Google's Advanced Search Form and Search Operators (Advanced Operators). Google gives more priority to pages that have search terms near each other and in the same order as the query.

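As a rough illustration of the query-processing ideas above, the sketch below reuses the toy inverted index from the earlier indexer example: it lowercases the query, drops an assumed small stop-word list, requires every remaining term to appear in a document, and prefers documents whose terms occur close together in query order. It is only a didactic sketch; Google's real ranking combines such signals with PageRank and many others.

    import re

    STOP_WORDS = {"the", "is", "on", "or", "of", "how", "why", "a", "to"}   # assumed tiny stop list

    def search(query, index):
        """Rank documents in the toy inverted index (from the indexer sketch) for a query."""
        terms = [w for w in re.findall(r"[a-z0-9]+", query.lower()) if w not in STOP_WORDS]
        if not terms:
            return []

        # Candidate documents must contain every remaining query term.
        candidates = set(index.get(terms[0], {}))
        for term in terms[1:]:
            candidates &= set(index.get(term, {}))

        # Prefer documents where consecutive query terms occur close to each other.
        ranked = []
        for doc in candidates:
            gaps = [min(abs(p2 - p1) for p1 in index[first][doc] for p2 in index[second][doc])
                    for first, second in zip(terms, terms[1:])]
            ranked.append((1.0 / (1 + sum(gaps)), doc))   # single-term queries score 1.0
        return [doc for score, doc in sorted(ranked, reverse=True)]

    print(search("the web index", index))   # -> ['doc1'], using the index built in the indexer sketch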
Let's see how Google processes a query. (Diagram not reproduced; Copyright © 2003 Google Inc. Used with permission.)