SEARCH ENGINE


A research project on SEARCH ENGINE

SUBMITTED BY

SATHISH KOTHA

108-00-0746

University Of Northern Virginia

CSCI 587 SEC 1220, SPECIAL TOPICS IN INFORMATION TECHNOLOGY-1 6/20/2010

Abstract of the Project


A web search engine is designed to search for information on the World Wide Web. The search results are usually presented as a list and are commonly called hits. The information may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or with a mixture of algorithmic and human input. In this project we discuss the types of search engines and how a search engine finds information for the user, including the processes going on behind the screen. We look at how many search engines exist today to provide information and facts to computer users, the history of search engines, the different stages a search engine goes through when searching for information, and the features of web searching. We also cover related topics such as the Advanced Research Projects Agency Network (ARPANet), what a bot really is, the types of search queries used when seeking information, web directories, famous search engines like Google and Yahoo! and how they operate as search engines, challenges in language processing, and the general characteristics of search engines. I chose this topic because I find the working of search engines interesting, and I want everyone to come across this topic and learn from it: many people use search engines every day without knowing what really goes on behind the screen. At the end of the project I also give the references I selected for the topics discussed. I hope you like this project and accept it as my topic for this course.


ACKNOWLEDGEMENT

The project entitled "SEARCH ENGINE" is entirely my own effort. It is my duty to acknowledge everyone who was directly or indirectly involved with the project, without whom it would not have gained its structure. Accordingly, sincere thanks to PROF. SOUROSHI for the support, valuable suggestions, and timely advice without which the project would not have been completed in time. I also thank the many others who helped me throughout the project and made it successful.

PROJECT ASSOCIATES

CONTENTS

PRELIMINARIES
   Acknowledgement

1. History of search engine
      Types of search queries
      World Wide Web Wanderer
      ALIWEB
      Primitive web search
2. Working of a search engine
      Web crawling
      Indexing
      Searching
3. New features for web searching
4. Conclusion
5. References

1. History of Search Engines

History of Search Engines: From 1945 to Google 2007. This history covers early technology, directories, vertical search, and search engine marketing.

As We May Think (1945):
The concept of hypertext and a memory extension really came to life in July of 1945, when, after enjoying the scientific camaraderie that was a side effect of WWII, Vannevar Bush's "As We May Think" was published in The Atlantic Monthly. He urged scientists to work together to help build a body of knowledge for all mankind. Here are a few selected sentences and paragraphs that drive his point home.

"Specialization becomes increasingly necessary for progress, and the effort to bridge between disciplines is correspondingly superficial."

"A record, if it is to be useful to science, must be continuously extended, it must be stored, and above all it must be consulted."

He not only was a firm believer in storing data, but he also believed that if the data source was to be useful to the human mind we should have it represent how the mind works to the best of our abilities.

"Our ineptitude in getting at the record is largely caused by the artificiality of the systems of indexing. Having found one item, moreover, one has to emerge from the system and re-enter on a new path. The human mind does not work this way. It operates by association. Man cannot hope fully to duplicate this mental process artificially, but he certainly ought to be able to learn from it. In minor ways he may even improve, for his records have relative permanency."

He then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory storage and retrieval system. He named this device a memex.

Gerard Salton (1960s - 1990s):
Gerard Salton, who died on August 28th of 1995, was the father of modern search technology. His teams at Harvard and Cornell developed the SMART informational retrieval system. Salton's Magic Automatic Retriever of Text included important concepts like the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values, and relevancy feedback mechanisms.

Ted Nelson:
Ted Nelson created Project Xanadu in 1960 and coined the term hypertext in 1963. His goal with Project Xanadu was to create a computer network with a simple user interface that solved many social problems like attribution. There is still conflict surrounding the exact reasons why Project Xanadu failed to take off. While Ted was against complex markup code, broken links, and many other problems associated with traditional HTML on the WWW, much of the inspiration to create the WWW was drawn from Ted's work.

Advanced Research Projects Agency Network:
ARPANet is the network which eventually led to the internet. The Wikipedia has a great background article on ARPANet, and Google Video has a free, interesting video about ARPANet from 1972.

Archie (1990):
The first few hundred web sites began in 1993 and most of them were at colleges, but long before most of them existed came Archie, the first search engine, created in 1990 by Alan Emtage, a student at McGill University in Montreal. The original intent of the name was "archives," but it was shortened to Archie.

File Transfer Protocol:
Tim Berners-Lee existed at this point, however there was no World Wide Web. The main way people shared data back then was via File Transfer Protocol (FTP). If you had a file you wanted to share you would set up an FTP server, and if someone was interested in retrieving the data they could do so using an FTP client. This process worked effectively in small groups, but the data became as much fragmented as it was collected.

Archie helped solve this data scatter problem by combining a script-based data gatherer with a regular expression matcher for retrieving file names matching a user query. Essentially Archie became a database of web filenames which it would match with the users' queries. Bill Slawski has more background on Archie here.

Veronica & Jughead:
As word of mouth about Archie spread, it started to become word of computer, and Archie had such popularity that the University of Nevada System Computing Services group developed Veronica. Veronica served the same purpose as Archie, but it worked on plain text files. Soon another user interface named Jughead appeared with the same purpose as Veronica; both of these were used for files sent via Gopher, which was created as an Archie alternative by Mark McCahill at the University of Minnesota in 1991.

Tim Berners-Lee & the WWW (1991):
From the Wikipedia: While an independent contractor at CERN from June to December 1980, Berners-Lee proposed a project based on the concept of hypertext, to facilitate sharing and updating information among researchers. With help from Robert Cailliau he built a prototype system named Enquire. After leaving CERN in 1980 to work at John Poole's Image Computer Systems Ltd, he returned in 1984 as a fellow. In 1989, CERN was the largest Internet node in Europe, and Berners-Lee saw an opportunity to join hypertext with the Internet. In his words, "I just had to take the hypertext idea and connect it to the TCP and DNS ideas and — ta-da! — the World Wide Web".

He used similar ideas to those underlying the Enquire system to create the World Wide Web, for which he designed and built the first web browser and editor (called WorldWideWeb and developed on NeXTSTEP) and the first Web server, called httpd (short for HyperText Transfer Protocol daemon). The first Web site built was at http://info.cern.ch/ and was first put online on August 6, 1991. It provided an explanation about what the World Wide Web was, how one could own a browser and how to set up a Web server. It was also the world's first Web directory, since Berners-Lee maintained a list of other Web sites apart from his own. Tim also created the Virtual Library, which is the oldest catalogue of the web. In 1994, Berners-Lee founded the World Wide Web Consortium (W3C) at the Massachusetts Institute of Technology. Tim also wrote a book about creating the web, titled Weaving the Web.

What is a Bot?
Computer robots are simply programs that automate repetitive tasks at speeds impossible for humans to reproduce. The term bot on the internet is usually used to describe anything that interfaces with the user or that collects data. Another bot example could be Chatterbots, which are resource heavy on a specific topic. These bots attempt to act like a human and communicate with humans on said topic.

Types of Search Queries:
Andrei Broder authored A Taxonomy of Web Search [PDF], which notes that most searches fall into the following 3 categories:
• Informational - seeking static information about a topic
• Navigational - send me to a specific URL
• Transactional - shopping at, downloading from, or otherwise interacting with the result

Nancy Blachman's Google Guide offers searchers free Google search tips, and Greg R. Notess's Search Engine Showdown offers a search engine features chart. There are also many popular smaller vertical search services. For example, Del.icio.us allows you to search URLs that users have bookmarked, and Technorati allows you to search blogs.

World Wide Web Wanderer:
Soon the web's first robot came. In June 1993 Matthew Gray introduced the World Wide Web Wanderer. He initially wanted to measure the growth of the web and created this bot to count active web servers. He soon upgraded the bot to capture actual URLs, and his database became known as the Wandex. The Wanderer was as much of a problem as it was a solution because it caused system lag by accessing the same page hundreds of times a day. It did not take long for him to fix this software, but people started to question the value of bots.

ALIWEB:
In October of 1993 Martijn Koster created Archie-Like Indexing of the Web, or ALIWEB, in response to the Wanderer. ALIWEB crawled meta information and allowed users to submit the pages they wanted indexed, with their own page description. This meant it needed no bot to collect data and was not using excessive bandwidth. The downside of ALIWEB is that many people did not know how to submit their site.

Robots Exclusion Standard:
Martijn Koster also hosts the web robots page, which created standards for how search engines should index or not index content. This allows webmasters to block bots from their site on a whole-site level or on a page by page basis. By default, if information is on a public web server and people link to it, search engines generally will index it. In 2005 Google led a crusade against blog comment spam, creating a nofollow attribute that can be applied at the individual link level. After this was pushed through, Google quickly changed the scope of the purpose of the link nofollow to claim it was for any link that was sold or not under editorial control. Both mechanisms are illustrated just below.
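To make these two mechanisms concrete, here are two short, illustrative snippets; the paths, the wildcard user-agent, and the URL are invented for the example and do not come from the original report. A robots.txt file placed at the root of a site tells compliant crawlers what to skip:

    User-agent: *
    Disallow: /private/
    Disallow: /cgi-bin/

and a link-level nofollow hint looks like this in HTML:

    <a href="http://www.example.com/" rel="nofollow">a link whose vote should not count</a>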

Primitive Web Search:
By December of 1993, three full fledged bot-fed search engines had surfaced on the web: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider. JumpStation gathered info about the title and header from Web pages and retrieved these using a simple linear search; as the web grew, JumpStation slowed to a stop. The WWW Worm indexed titles and URLs. The problem with JumpStation and the World Wide Web Worm is that they listed results in the order that they found them, and provided no discrimination. The RBSE spider did implement a ranking system. Since early search algorithms did not do adequate link analysis or cache full page content, if you did not know the exact name of what you were looking for it was extremely hard to find it.

Web Directories:

VLib:
When Tim Berners-Lee set up the web he created the Virtual Library, which became a loose confederation of topical experts maintaining relevant topical link lists.

EINet Galaxy:
The EINet Galaxy web directory was born in January of 1994. It was organized similar to how web directories are today. The web size in early 1994 did not really require a web directory; however, other directories soon did follow. The biggest reason the EINet Galaxy became a success was that it also contained Gopher and Telnet search features in addition to its web search feature.

Excite:
Excite came from the project Architext, which was started in February, 1993 by six Stanford undergrad students. They had the idea of using statistical analysis of word relationships to make searching more efficient. They were soon funded, and in mid 1993 they released copies of their search software for use on web sites. Excite was bought by a broadband provider named @Home in January, 1999 for $6.5 billion, and was named Excite@Home. In October, 2001 Excite@Home filed for bankruptcy, and InfoSpace bought Excite from bankruptcy court for $10 million.

Yahoo! Directory:

In April 1994 David Filo and Jerry Yang created the Yahoo! Directory as a collection of their favorite web pages. As their number of links grew they had to reorganize and become a searchable directory. What set the directories above The Wanderer is that they provided a human compiled description with each URL. As time passed and the Yahoo! Directory grew, Yahoo! began charging commercial sites for inclusion, and the inclusion rates for listing a commercial site increased over time. The current cost is $299 per year. Many informational sites are still added to the Yahoo! Directory for free.

Open Directory Project:
In 1998 Rich Skrenta and a small group of friends created the Open Directory Project, which is a directory which anybody can download and use in whole or part. The Open Directory Project grew out of the frustration webmasters faced waiting to be included in the Yahoo! Directory. The ODP (also known as DMOZ) is the largest internet directory, almost entirely run by a group of volunteer editors. Netscape bought the Open Directory Project in November, 1998. Later that same month AOL announced the intention of buying Netscape in a $4.5 billion all stock deal.

LII:
Google offers a librarian newsletter to help librarians and other web editors make information more accessible and categorize the web. The second Google librarian newsletter came from Karen G. Schneider, who is the director of the Librarians' Internet Index. LII is a high quality directory aimed at librarians. Her article explains what she and her staff look for when looking for quality, credible resources to add to the LII. The Internet Public Library is another well kept directory of websites.

Business.com:
Due to the time intensive nature of running a directory, and the general lack of scalability of the business model, the quality and size of directories sharply drops off after you get past the first half dozen or so general directories. There are also numerous smaller industry, vertically, or locally oriented directories. Business.com, for example, is a directory of business websites. Most other directories, especially those which have a paid inclusion option, hold lower standards than selected limited catalogs created by librarians.

Looksmart:
Looksmart was founded in 1995. They competed with the Yahoo! Directory by frequently increasing their inclusion rates back and forth. In 1998 Looksmart tried to expand their directory by buying the non commercial Zeal directory for $20 million, but on March 28, 2006 Looksmart shut down the Zeal directory. In 2002 Looksmart transitioned into a pay per click provider, which charged listed sites a flat fee per click. The problem was that Looksmart became too dependent on MSN, and in 2003, when Microsoft announced they were dumping Looksmart, that basically killed their business model, although it allowed them to profit by syndicating those paid listings to some major portals like MSN. That caused the demise of any good faith or loyalty they had built up. In March of 2002, Looksmart bought a search engine by the name of WiseNut, but it never gained traction. Looksmart also owns a catalog of content articles organized in vertical sites, and hopes to drive traffic using Furl, a social bookmarking program, but due to limited relevancy Looksmart has lost most (if not all) of their momentum.

WebCrawler:
Brian Pinkerton of the University of Washington released WebCrawler on April 20, 1994. It was the first crawler which indexed entire pages. Soon it became so popular that during daytime hours it could not be used. AOL eventually purchased WebCrawler and ran it on their network. Then in 1997, Excite bought out WebCrawler, and AOL began using Excite to power its NetFind. WebCrawler opened the door for many other services to follow suit. Within 1 year of its debut came Lycos, Infoseek, and OpenText.

Lycos:
Lycos was the next major search development, having been designed at Carnegie Mellon University around July of 1994. Michael Mauldin was responsible for this search engine and remains the chief scientist at Lycos Inc. On July 20, 1994, Lycos went public with a catalog of 54,000 documents. In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. But Lycos' main difference was the sheer size of its catalog: by August 1994, Lycos had identified 394,000 documents.

By January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents -- more than any other Web search engine. In October 1994, Lycos ranked first on Netscape's list of search engines by finding the most hits on the word 'surf'.

Infoseek:
Infoseek also started out in 1994, claiming to have been founded in January. They really did not bring a whole lot of innovation to the table, but they offered a few add-ons, and in December 1995 they convinced Netscape to use them as their default search, which gave them major exposure. One popular feature of Infoseek was allowing webmasters to submit a page to the search index in real time, which was a search spammer's paradise.

AltaVista:
AltaVista's debut online came during this same month. AltaVista brought many important features to the web scene. They had nearly unlimited bandwidth (for that time), they were the first to allow natural language queries, and they offered advanced searching techniques and allowed users to add or delete their own URL within 24 hours. They even allowed inbound link checking. AltaVista also provided numerous search tips and advanced search features. Due to poor mismanagement, a fear of result manipulation, and portal related clutter, AltaVista was largely driven into irrelevancy around the time Inktomi and Google started becoming popular. On February 18, 2003, Overture signed a letter of intent to buy AltaVista for $80 million in stock and $60 million cash. After Yahoo! bought out Overture they rolled some of the AltaVista technology into Yahoo! Search, and occasionally use AltaVista as a testing platform.

Inktomi:
The Inktomi Corporation came about on May 20, 1996 with its search engine Hotbot. Two Cal Berkeley cohorts created Inktomi from the improved technology gained from their research. Hotwire listed this site and it became hugely popular quickly. In October of 2001 Danny Sullivan wrote an article titled Inktomi Spam Database Left Open To Public, which highlights how Inktomi accidentally allowed the public to access their database of spam sites, which listed over 1 million URLs at that time. Although Inktomi pioneered the paid inclusion model, it was nowhere near as efficient as the pay per click auction model developed by Overture. Licensing their search results also was not profitable enough to pay for their scaling costs.

They failed to develop a profitable business model, and sold out to Yahoo! for approximately $235 million, or $1.65 a share, in December of 2003.

AllTheWeb:
AllTheWeb was a search technology platform launched in May of 1999 to showcase Fast's search technologies. They had a sleek user interface with rich advanced search features, but on February 23, 2003, AllTheWeb was bought by Overture for $70 million. After Yahoo! bought out Overture they rolled some of the AllTheWeb technology into Yahoo! Search, and occasionally use AllTheWeb as a testing platform.

Ask.com (Formerly Ask Jeeves):
In April of 1997 Ask Jeeves was launched as a natural language search engine. Ask Jeeves used human editors to try to match search queries. Ask was powered by DirectHit for a while, which aimed to rank results based on their popularity, but that technology proved too easy to spam as the core algorithm component. In 2000 the Teoma search engine was released, which uses clustering to organize sites by Subject Specific Popularity, which is another way of saying they tried to find local web communities. Jon Kleinberg's Authoritative Sources in a Hyperlinked Environment [PDF] was a source of inspiration that led to the eventual creation of Teoma, and Mike Grehan's Topic Distillation [PDF] also explains how subject specific popularity works. In 2001 Ask Jeeves bought Teoma to replace the DirectHit search technology.

Vertical Search:
On November 15, 2005 Google launched a product called Google Base, which is a database of just about anything imaginable. Users can upload items and title, describe, and tag them as they see fit. Based on usage statistics this tool can help Google understand which vertical search products they should create or place more emphasis on. They believe that owning other verticals will allow them to drive more traffic back to their core search service. Google also has a Scholar search program which aims to make scholarly research easier to do. Google bought dMarc, a radio ad placement firm; they also believe that targeted, measured advertising associated with search can be carried over to other mediums. Yahoo! has also tried to extend their

reach by buying other high traffic properties, like the photo sharing site Flickr and the social bookmarking site del.icio.us.

Yahoo! Search Marketing:
Yahoo! Search Marketing is the rebranded name for Overture after Yahoo! bought them out. As of September 2006 their platform is generally the exact same as the old Overture platform, with the same flaws: ad CTR is not factored into click cost, it is hard to run local ads, and it is just generally clunky.

Microsoft AdCenter:
Microsoft AdCenter was launched on May 3, 2006. Microsoft's ad algorithm includes both cost per click and ad clickthrough rate. On the features front, Microsoft added demographic targeting and dayparting features to the pay per click mix. While Microsoft has limited marketshare, they intend to increase their marketshare by baking search into Internet Explorer 7.

Google AdSense:
On March 4, 2003 Google announced their content targeted ad network. In April 2003, Google bought Applied Semantics, which had CIRCA technology that allowed them to drastically improve the targeting of those ads. Google adopted the name AdSense for the new ad program. AdSense allows web publishers large and small to automate the placement of relevant ads on their content. Google initially started off by allowing textual ads in numerous formats, but eventually added image ads and video ads. Advertisers could choose which keywords they wanted to target and which ad formats they wanted to market. Google also allows some publishers to place AdSense ads in their feeds, and some select publishers can place ads in emails. To help grow the network and make the market more efficient, Google added a link which allows advertisers to sign up for an AdWords account from content websites, and Google allowed advertisers to buy ads targeted to specific websites, pages, or demographic categories. Ads targeted on websites are sold on a cost per thousand impression (CPM) basis in an ad auction against other keyword targeted and site targeted ads. To prevent the erosion of the value of search ads, Google allows advertisers to opt out of placing their ads on content sites. Google also introduced what they called smart pricing. Smart pricing automatically adjusts the click cost of an ad based on what Google perceives a click from that page to be worth: an ad on a digital camera review page would typically be worth more than a click from a page with pictures on it.

Microsoft also created the XBox game console, and on May 4, 2006 announced they bought a video game ad targeting firm named Massive Inc. Eventually video game ads will be sold from within Microsoft AdCenter.

Google

Early Years:
Google's corporate history page has a pretty strong background on Google, starting from when Larry met Sergey at Stanford right up to the present day. In 1995 Larry Page met Sergey Brin at Stanford. By January of 1996, Larry and Sergey had begun collaboration on a search engine called BackRub, named for its unique ability to analyze the "back links" pointing to a given website. Larry, who had always enjoyed tinkering with machinery and had gained some notoriety for building a working printer out of Lego™ bricks, took on the task of creating a new kind of server environment that used low-end PCs instead of big expensive machines. Afflicted by the perennial shortage of cash common to graduate students everywhere, the pair took to haunting the department's loading docks in hopes of tracking down newly arrived computers that they could borrow for their network. A year later, their unique approach to link analysis was earning BackRub a growing reputation among those who had seen it, and buzz about the new search technology began to build as word spread around campus.

BackRub ranked pages using citation notation, a concept which is popular in academic circles: if someone cites a source they usually think it is important. On the web, links act as citations. In the PageRank algorithm links count as votes, but some votes count more than others. Your ability to rank, and the strength of your ability to vote for others, depends upon your authority: how many people link to you and how trustworthy those links are. Sergey tried to shop their PageRank technology, but nobody was interested in buying or licensing their search technology at that time, and in 1998 Google was launched.
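The voting idea can be made concrete with a short sketch. This is a simplified illustration of the published PageRank recurrence, not Google's production system; the tiny four-page link graph, the damping factor of 0.85, and the iteration count are chosen purely for the example.

    # Simplified PageRank sketch: links are votes, and a vote from an
    # authoritative page is worth more. The link graph below is invented.
    damping = 0.85          # standard damping factor from the PageRank paper
    links = {               # page -> pages it links to
        "A": ["B", "C"],
        "B": ["C"],
        "C": ["A"],
        "D": ["C"],
    }
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}   # start with equal authority

    for _ in range(50):                           # power iteration
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            share = damping * rank[page] / len(outlinks)
            for target in outlinks:               # each outgoing link passes on a share
                new_rank[target] += share
        rank = new_rank

    print(sorted(rank.items(), key=lambda kv: -kv[1]))

Page C ends up with the highest score in this toy graph because every other page votes for it, directly or indirectly.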

Winning the Search War:
Later that year Andy Bechtolsheim gave them $100,000 in seed funding, and Google received $25 million from Sequoia Capital and Kleiner Perkins Caufield & Byers the following year. In 1999 AOL selected Google as a search partner, and Yahoo! followed suit a year later. In 2000 Google also launched their popular Google Toolbar. In 2000 Google relaunched their AdWords program to sell ads on a CPM basis; in 2002 they retooled the service, selling ads in an auction which would factor in bid price and ad clickthrough rate. On May 1, 2002, AOL announced they would use Google to deliver their search related ads, which was a strong turning point in Google's battle against Overture. In 2003 Google also launched their AdSense program, which allowed them to expand their ad network by selling targeted ads on other websites. Google has gained search market share year over year ever since.

Going Public:
Google used a two class stock structure, decided not to give earnings guidance, and offered shares of their stock in a Dutch auction. They received virtually limitless negative press for the perceived hubris they expressed in their "AN OWNER'S MANUAL" FOR GOOGLE'S SHAREHOLDERS. After some controversy surrounding an interview in Playboy, Google dropped their IPO offer range to $85 to $95 per share, from $108 to $135. Google went public at $85 a share on August 19, 2004, and its first trade was at 11:56 am ET at $100.01.

Verticals Galore!
In addition to running the world's most popular search service, Google also runs a large number of vertical search services, including:
• Google News: Google News launched in beta in September 2002. On September 6, 2006, Google announced an expanded Google News Archive Search that goes back over 200 years.
• Google Scholar: On November 18, 2004, Google launched Google Scholar, an academic search program.
• Google Blog Search: On September 14, 2005, Google announced Google Blog Search.
• Google Base: On November 15, 2005, Google announced the launch of Google Base, a database of uploaded information describing online or offline content, products, or services.
• Google Video: On January 6, 2006, Google announced Google Video.
• Google Book Search: On October 6, 2004, Google launched Google Book Search.
• Google Universal Search: On May 16, 2007, Google began mixing many of their vertical results into their organic search results.

Microsoft:
In 1998 MSN Search was launched, but Microsoft did not get serious about search until after Google proved the business model. Until Microsoft saw the light they primarily relied on partners like Overture, Looksmart, and Inktomi to power their search service. They launched a technology preview of their own search engine around July 1st of 2004, and formally switched from Yahoo! organic search results to their own in-house technology on January 31st, 2005. MSN announced they dumped Yahoo!'s search ad program on May 4th, 2006. On September 11, 2006, Microsoft announced they were launching their Live Search product.

2. Working of a Search Engine

A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider) — an automated Web browser which follows every link it sees. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. The cached page always holds the actual search text, since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it.

When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords.

Web Crawling:
A web crawler (also known as a web spider, web robot, or — especially in the FOAF community — web scutter) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for web crawlers are ants, automatic indexers, bots, and worms. A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam), or for automating maintenance tasks on a website, such as checking links or validating HTML code.
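To make the seed list and crawl frontier concrete, here is a minimal, single-threaded crawler sketch in Python using only the standard library. It is an illustration of the idea only: the seed URL is a placeholder, and it deliberately ignores robots.txt, politeness delays, and re-visit policies, all of which are discussed below.

    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class LinkParser(HTMLParser):
        """Collects the href of every <a> tag seen in a page."""
        def __init__(self):
            super().__init__()
            self.links = []
        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    def crawl(seeds, max_pages=50):
        frontier = deque(seeds)          # the crawl frontier, seeded with start URLs
        visited = set()
        pages = {}                       # url -> raw HTML, the crawler's local copy
        while frontier and len(pages) < max_pages:
            url = frontier.popleft()
            if url in visited:
                continue
            visited.add(url)
            try:
                html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
            except Exception:
                continue                 # skip pages that fail to download
            pages[url] = html
            parser = LinkParser()
            parser.feed(html)
            for link in parser.links:    # add newly discovered URLs to the frontier
                frontier.append(urljoin(url, link))
        return pages

    # Example with a placeholder seed: crawl(["http://example.com/"])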

Crawling Policies:
There are three important characteristics of the Web that make crawling it very difficult:
• its large volume,
• its fast rate of change, and
• dynamic page generation,
which combine to produce a wide variety of possible crawlable URLs.

The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

The recent increase in the number of pages being generated by server-side scripting languages has also created difficulty, in that an endless combination of HTTP GET parameters exists, only a small selection of which will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then that same set of content can be accessed with forty-eight different URLs, all of which will be present on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next.

The behavior of a web crawler is the outcome of a combination of policies:
• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading websites.
• A parallelization policy that states how to coordinate distributed web crawlers.

Selection Policy:
Given the current size of the Web, even large search engines cover only a portion of the publicly available internet. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web. This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site).

Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Abiteboul (Abiteboul et al., 2003) designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash" which is distributed equally among the pages it points to. Boldi et al. (Boldi et al., 2004) used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against depth-first, random ordering and an omniscient strategy. Baeza-Yates et al. used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one. Daneshpajouh et al. designed a community based algorithm for discovering good seeds: their method crawls web pages with high PageRank from different communities in fewer iterations than a crawl starting from random seeds. One can extract good seeds from a previously crawled web graph using this method, and using these seeds a new crawl can be very effective.

Some crawlers intend to download as many resources as possible from a particular Web site. Cothey (Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in each URL that it intends to crawl. For example, when given a seed URL of http://llama.org/hamster/monkey/page.html, it will attempt to crawl /hamster/monkey/, /hamster/, and /. Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling. Many path-ascending crawlers are also known as Harvester software, because they are used to "harvest" or collect all the content (perhaps the collection of photos in a gallery) from a specific page or host.

A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may alternatively examine the URL and only request the resource if the URL ends with .html, .htm, .asp, .aspx, .php, or a slash; a similar strategy compares the extension of the web resource to a list of known HTML-page types. This strategy may cause numerous HTML Web resources to be unintentionally skipped.
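A sketch of that filtering step, using only the Python standard library; the extension list mirrors the one above, and treating an empty path or a trailing slash as HTML is an assumption made for the example.

    from urllib.parse import urlparse
    from urllib.request import Request, urlopen

    HTML_EXTENSIONS = (".html", ".htm", ".asp", ".aspx", ".php")

    def looks_like_html(url: str) -> bool:
        """Cheap check: accept URLs ending with a known HTML extension or a slash."""
        path = urlparse(url).path.lower()
        return path.endswith(HTML_EXTENSIONS) or path.endswith("/") or path == ""

    def is_html_resource(url: str) -> bool:
        """Fall back to an HTTP HEAD request and inspect the Content-Type header."""
        if looks_like_html(url):
            return True
        try:
            response = urlopen(Request(url, method="HEAD"), timeout=10)
        except Exception:
            return False
        return response.headers.get_content_type() == "text/html"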

Focused Crawling:
The importance of a page for a crawler can also be expressed as a function of the similarity of the page to a given query; this is usually called focused crawling. The main problem in focused crawling is that, in the context of a web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. Diligenti et al. propose using the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of a focused crawling depends mostly on the richness of links in the specific topic being searched, and a focused crawling usually relies on a general Web search engine for providing starting points. Looking ahead, Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized in the Semantic Web and Website Parse Template concepts; Web 3.0 crawling and indexing technologies will be based on human-machine clever associations.

Re-visit Policy:
The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long time, usually measured in weeks or months. By the time a web crawler has finished its crawl, many events could have happened, including creations, updates and deletions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most used cost functions are freshness and age, introduced in (Cho and Garcia-Molina, 2000).

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:

    F_p(t) = 1, if p is up-to-date (the local copy equals the live page) at time t
    F_p(t) = 0, otherwise

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:

    A_p(t) = 0, if p has not been modified since it was last crawled
    A_p(t) = t - (time of the last modification of p), otherwise

(Figure: evolution of freshness and age in Web crawling.)

The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are outdated, while in the second case, the crawler is concerned with how old the local copies of pages are.
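A small sketch of these two measures; the function and variable names are invented for illustration, and times are plain numbers rather than real timestamps.

    # Minimal sketch of the freshness and age measures defined above.
    # `crawled_at` is when we last downloaded the page; `modified_at` is when
    # the live page last changed. Both names are illustrative assumptions.

    def freshness(crawled_at: float, modified_at: float) -> int:
        """1 if the local copy is still up to date, else 0."""
        return 1 if crawled_at >= modified_at else 0

    def age(now: float, crawled_at: float, modified_at: float) -> float:
        """0 while the local copy is current; otherwise how long it has been stale."""
        if crawled_at >= modified_at:
            return 0.0
        return now - modified_at

    # Example: page changed at t=100, we crawled it at t=90, and it is now t=130.
    print(freshness(90, 100))   # 0 -> our copy is stale
    print(age(130, 90, 100))    # 30.0 -> the copy has been outdated for 30 time units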

To improve freshness, the crawler should penalize the elements that change too often (Cho and Garcia-Molina, 2003a). Two simple re-visiting policies were studied by Cho and Garcia-Molina:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.

Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.

(In both cases, the repeated crawling order of pages can be done either at random or with a fixed order.)

The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page.

Politeness Policy:
Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers. The costs of using web crawlers include:

• Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.
• Server overload, especially if the frequency of accesses to a given server is too high.
• Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.
• Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol, which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Recently, commercial search engines like Ask Jeeves, MSN and Yahoo are able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests.

The first proposal for the interval between connections was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download only that entire website, and only a fraction of the resources from that Web server would be used. This does not seem acceptable. Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3-4 minutes. It is worth noticing that even when being very polite, and taking all the safeguards to avoid overloading web servers, some complaints from Web server administrators are received. Brin and Page note that: "... running a crawler which connects to more than half a million servers (...) generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen." For those using web crawlers for research purposes, a more detailed cost-benefit analysis is needed, and ethical considerations should be taken into account when deciding where to crawl and how fast to crawl.
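As an illustration of honoring these rules, here is a hedged sketch using Python's standard robotparser module. The user agent string, the default delay, and the one-request-at-a-time structure are assumptions for the example, not a production fetcher; a real crawler would also cache the parsed robots.txt per host instead of re-reading it on every request.

    import time
    from urllib.parse import urlparse
    from urllib.robotparser import RobotFileParser
    from urllib.request import urlopen

    USER_AGENT = "ExampleBot"            # placeholder crawler name
    last_fetch = {}                      # host -> time of the previous request

    def polite_fetch(url, default_delay=5.0):
        host = urlparse(url).netloc
        robots = RobotFileParser("http://%s/robots.txt" % host)
        robots.read()
        if not robots.can_fetch(USER_AGENT, url):
            return None                  # the site asked crawlers to stay away
        delay = robots.crawl_delay(USER_AGENT) or default_delay
        wait = last_fetch.get(host, 0) + delay - time.time()
        if wait > 0:                     # respect the per-host delay between requests
            time.sleep(wait)
        last_fetch[host] = time.time()
        return urlopen(url, timeout=10).read()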

Parallelization Policy:
A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization, and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.

Web Crawler Architectures:
A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture. Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms. Some examples of early crawler architectures:

• RBSE was the first published web crawler. It was based on two programs: the first program, "spider", maintains a queue in a relational database, and the second program, "mite", is a modified www ASCII browser that downloads the pages from the Web.
• WebCrawler was used to build the first publicly-available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.

Crawler Identification:
Web crawlers typically identify themselves to a web server by using the User-agent field of an HTTP request. Web site administrators typically examine their web servers' logs and use the user agent field to determine which crawlers have visited the web server and how often. The user agent field may include a URL where the Web site administrator may find out more information about the crawler. Spambots and other malicious Web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or other well-known crawler. It is important for web crawlers to identify themselves so Web site administrators can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap, or they may be overloading a web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators that are interested in knowing when they may expect their Web pages to be indexed by a particular search engine.

URL Normalization:
Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. There are several types of normalization that may be performed, including conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component (Pant et al., 2004).
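A few of those normalization steps can be sketched as follows; the default-port handling and the exact set of rules are assumptions made for the example rather than a complete canonicalization algorithm.

    from urllib.parse import urlsplit, urlunsplit
    import posixpath

    DEFAULT_PORTS = {"http": 80, "https": 443}

    def normalize(url: str) -> str:
        parts = urlsplit(url)
        scheme = parts.scheme.lower()
        host = parts.hostname.lower() if parts.hostname else ""
        # keep the port only when it is not the default one for the scheme
        if parts.port and parts.port != DEFAULT_PORTS.get(scheme):
            host = "%s:%d" % (host, parts.port)
        # resolve "." and ".." segments and make sure there is at least "/"
        path = posixpath.normpath(parts.path) if parts.path else "/"
        if parts.path.endswith("/") and not path.endswith("/"):
            path += "/"
        return urlunsplit((scheme, host, path, parts.query, parts.fragment))

    print(normalize("HTTP://Example.COM:80/a/b/../c/"))  # -> http://example.com/a/c/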

Indexing:
Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is Web indexing.

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

Popular engines focus on the full-text indexing of online, natural language documents. Media types such as video, audio and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.

Index Design Factors:
Major factors in designing a search engine's architecture include:

Merge factors: How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.

Storage techniques: How to store the index data; that is, whether information should be data compressed or filtered.

Index size: How much computer storage is required to support the index.

Lookup speed: How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.

Maintenance: How the index is maintained over time.

Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning (including schemes such as hash-based or composite partitioning), and replication.

Index Data Structures:
Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include the inverted index and the forward index, described below.

Challenges in Parallelism:
A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information, and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.
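A toy sketch of that producer-consumer hand-off using a thread-safe queue; the two sample documents and the single-thread-per-role setup are invented for illustration and sidestep the distributed-systems issues described above.

    import queue
    import threading

    doc_queue = queue.Queue()            # crawler (producer) -> indexer (consumer)
    SAMPLE_DOCS = {"doc1": "the cow says moo", "doc2": "the cat and the hat"}

    def crawler():
        # producer: in a real engine this would download pages from the Web
        for url, text in SAMPLE_DOCS.items():
            doc_queue.put((url, text))
        doc_queue.put(None)              # sentinel: nothing more to index

    def indexer(index):
        # consumer: pulls documents off the queue and updates the inverted index
        while True:
            item = doc_queue.get()
            if item is None:
                break
            url, text = item
            for word in text.split():
                index.setdefault(word, set()).add(url)

    index = {}
    t1 = threading.Thread(target=crawler)
    t2 = threading.Thread(target=indexer, args=(index,))
    t1.start(); t2.start()
    t1.join(); t2.join()
    print(index["the"])                  # {'doc1', 'doc2'}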

Inverted Indices:
Many search engines incorporate an inverted index when evaluating a search query, to quickly locate documents containing the words in the query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query and retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

Inverted Index
Word    Documents
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7

This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search algorithm to identify word proximity to support searching for phrases, and frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.

The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two dimensional array. The index is similar to the term-document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts: the development of a forward index, and a process which sorts the contents of the forward index into the inverted index. Inverted indices can be programmed in several computer programming languages.

Index Merging:
The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives. After parsing, the indexer adds the referenced document to the document list for the appropriate words.
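To make the structure concrete, here is a minimal sketch that builds a small inverted index and answers a boolean AND query. The three sample documents reuse the forward-index illustration that follows; the code is illustrative only and stores no frequency or position information.

    # Build a tiny inverted index and answer a boolean AND query.
    documents = {
        "Document 1": "the cow says moo",
        "Document 2": "the cat and the hat",
        "Document 3": "the dish ran away with the spoon",
    }

    inverted_index = {}                       # word -> set of documents containing it
    for name, text in documents.items():
        for word in text.lower().split():
            inverted_index.setdefault(word, set()).add(name)

    def search(query):
        """Boolean AND query: return only documents containing every query word."""
        postings = [inverted_index.get(w, set()) for w in query.lower().split()]
        return set.intersection(*postings) if postings else set()

    print(search("the cow"))   # {'Document 1'}
    print(search("the"))       # all three documents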

The Forward Index:
The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document      Words
Document 1    the, cow, says, moo
Document 2    the, cat, and, the, hat
Document 3    the, dish, ran, away, with, the, spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words; the forward index is sorted to transform it into an inverted index. In this regard, the inverted index is a word-sorted forward index, and the inverted index is so named because it is an inversion of the forward index.

Compression:
Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full text, Internet search engine:

• An estimated 2,000,000,000 different web pages existed as of the year 2000.
• Suppose there are 250 words on each webpage (based on the assumption they are similar to the pages of a novel).
• It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character.
• The average number of characters in any given word on a page may be estimated at 5 (Wikipedia:Size comparisons).
• The average personal computer comes with 100 to 250 gigabytes of usable space.

Given this scenario, an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone, more than the average free disk space of 25 personal computers. This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression. Notably, large scale search engine designs incorporate the cost of storage as well as the costs of the electricity to power the storage; thus compression is a measure of cost.
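The estimate above can be checked with a few lines of arithmetic; the snippet simply restates the assumptions listed in the scenario.

    # Reproducing the storage estimate stated above.
    pages = 2_000_000_000        # estimated web pages (year 2000 figure)
    words_per_page = 250         # assumption: pages roughly like a novel's pages
    bytes_per_word = 5           # 5 characters per word at 1 byte per character

    word_entries = pages * words_per_page            # 500,000,000,000 entries
    index_bytes = word_entries * bytes_per_word      # 2,500,000,000,000 bytes
    index_gigabytes = index_bytes / 1_000_000_000    # 2500 GB

    pcs_needed = index_gigabytes / 100               # at ~100 GB of free space per PC
    print(index_gigabytes, pcs_needed)               # 2500.0 GB, about 25 machines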

Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression, so compression is ultimately a measure of cost; notably, large-scale search engine designs incorporate the cost of storage as well as the cost of the electricity to power that storage.

Document Parsing
Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis; the terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang. Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching, and it involves multiple technologies, the implementations of which are commonly kept as corporate secrets.

Challenges in Natural Language Processing

Word Boundary Ambiguity
Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify the words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each supported language (or for groups of languages with similar boundary markers and syntax).

Language Ambiguity
To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify their language or represent it accurately, so in tokenizing the document some search engines attempt to automatically identify the language.

Diverse File Formats
In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access each document and to tokenize its characters.

Faulty Storage
The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol, and binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

Tokenization
Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes; computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer, parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex. During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.

Language Recognition
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, since many of the subsequent steps are language dependent (such as stemming and part-of-speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document; other names for it include language classification, language analysis, language identification, and language tagging. Finding which language the words belong to may involve the use of a language recognition chart. Automated language recognition is the subject of ongoing research in natural language processing.
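A minimal sketch of the tokenization step described above, assuming whitespace-delimited English text; real tokenizers handle punctuation, encodings and language-specific rules far more carefully. The attributes recorded here (position, offset, case, length) are just a few of those mentioned in the text.

```python
import re

def tokenize(text):
    """Split text into tokens and record a few characteristics an indexer might keep."""
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9']+", text)):
        word = match.group()
        tokens.append({
            "token": word.lower(),          # normalized form used for indexing
            "position": position,           # ordinal position within the document
            "offset": match.start(),        # character offset, useful for highlighting
            "case": "upper" if word.isupper() else
                    "proper" if word[0].isupper() else "lower",
            "length": len(word),
        })
    return tokens

print(tokenize("The cow says Moo."))
```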

Format Analysis
If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents, which controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, and text preparation.

The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary, with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:

• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• IBM Lotus Notes
• HTML
• ASCII text files (a text document without any formatting)
• Adobe's Portable Document Format (PDF)
• PostScript (PS)
• LaTeX
• The UseNet archive (NNTP) and other deprecated bulletin board formats
• XML and derivatives like RSS
• SGML (this is more of a general protocol)
• Multimedia meta data formats like ID3

Options for dealing with various formats include using a publicly available commercial parsing tool that is offered by the organization which developed, maintains, or owns the format, or writing a custom parser. Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:

• ZIP – Zip file
• RAR – Archive file
• CAB – Microsoft Windows Cabinet file
• Gzip – Gzip file
• BZIP – Bzip file
• TAR.GZ and TAR – Unix Gzip'ped archives

Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:

• Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g. a hidden "div" tag in HTML, which may incorporate the use of CSS or JavaScript to do so).
• Setting the foreground font color of words to the same as the background color, making the words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.
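As a toy illustration of the tag stripping mentioned above, the sketch below uses Python's standard html.parser module to drop markup and keep only the text an indexer would want; the sample page is invented, and production systems use far more robust, format-specific parsers.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect character data, ignoring tags and the contents of <script>/<style>."""
    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            self.parts.append(data)

    def text(self):
        return " ".join(" ".join(self.parts).split())

html = "<html><body><h1>Search</h1><p>Indexing <b>matters</b>.</p><script>var x=1;</script></body></html>"
extractor = TextExtractor()
extractor.feed(html)
print(extractor.text())   # Search Indexing matters .
```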

Section Recognition
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side-sections which do not contain the primary material (that which the document is about). For example, an article may display a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially, and words that appear sequentially in the raw source content are indexed sequentially, even though the corresponding sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and of the search results may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:

• Content in different sections is treated as related in the index, when in reality it is not.
• Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, so the index is filled with a poor representation of its documents.

Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via JavaScript. If the search engine does not render the page and evaluate the JavaScript within the page, it would not 'see' this content in the same way as a viewer and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via JavaScript or use the Noscript tag to ensure that the web page is indexed properly. At the same time, this fact can also be exploited to cause the search engine indexer to 'see' different content than the viewer.

Meta Tag Indexing
Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. Earlier Internet search engine technology would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was the hardware able to support such technology. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization. As the Internet grew through the 1990s, many brick-and-mortar corporations went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of result placement, which in turn furthered research into full-text indexing technologies.
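As a small illustration of meta tag indexing, the sketch below (again using Python's standard html.parser, with an invented example page) pulls out the keywords and description that an early engine might have indexed instead of the full text.

```python
from html.parser import HTMLParser

class MetaTagReader(HTMLParser):
    """Collect name/content pairs from <meta> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = attrs.get("name")
            if name and "content" in attrs:
                self.meta[name.lower()] = attrs["content"]

page = """<html><head>
<meta name="keywords" content="search engine, indexing, crawler">
<meta name="description" content="Notes on how search engines index the Web.">
</head><body>...</body></html>"""

reader = MetaTagReader()
reader.feed(page)
print(reader.meta["keywords"])      # search engine, indexing, crawler
print(reader.meta["description"])   # Notes on how search engines index the Web.
```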

In desktop search, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files, where that content is not evident from the file content itself. Desktop search is more under the control of the user, while Internet search engines must focus more on the full-text index.

Types of Search Queries
A web search query is a query that a user enters into a web search engine to satisfy his or her information needs. Web search queries are distinctive in that they are unstructured and often ambiguous; they vary greatly from standard query languages, which are governed by strict syntax rules. There are three broad categories that cover most web search queries:

• Informational queries – queries that cover a broad topic (e.g., colorado or trucks) for which there may be thousands of relevant results.
• Navigational queries – queries that seek a single website or web page of a single entity (e.g., youtube or delta airlines).
• Transactional queries – queries that reflect the intent of the user to perform a particular action, like purchasing a car or downloading a screen saver.

Search engines often support a fourth type of query that is used far less frequently:

• Connectivity queries – queries that report on the connectivity of the indexed web graph (e.g., Which links point to this URL?, and How many pages are indexed from this domain name?).
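The taxonomy above is conceptual rather than algorithmic, but a crude heuristic makes the distinctions concrete; the word lists and rules below are invented for illustration and bear no resemblance to the classifiers production engines actually use.

```python
KNOWN_SITES = {"youtube", "delta airlines", "facebook"}      # assumed lookup table of entities
ACTION_WORDS = {"buy", "download", "order", "purchase"}      # assumed transactional cues

def classify_query(query):
    """Very rough guess at the query type described in the text."""
    q = query.lower().strip()
    if q in KNOWN_SITES or q.endswith(".com"):
        return "navigational"
    if any(word in q.split() for word in ACTION_WORDS):
        return "transactional"
    return "informational"

for q in ["youtube", "download screen saver", "colorado"]:
    print(q, "->", classify_query(q))
```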

Characteristics
Most commercial web search engines do not disclose their search logs, so information about what users are searching for on the Web is difficult to come by. Nevertheless, a study in 2001 that analyzed queries from the Excite search engine revealed some interesting characteristics of web search:

• The average length of a search query was 2.4 terms.
• About half of the users entered a single query, while a little less than a third of users entered three or more unique queries.
• Close to half of the users examined only the first one or two pages of results (10 results per page).
• Less than 5% of users used advanced search features (e.g., Boolean operators like AND, OR, and NOT).
• The top three most frequently used terms were and, of, and sex.

A study of the same Excite query logs revealed that 19% of the queries contained a geographic term (e.g., place names, zip codes, geographic features, etc.). A 2005 study of Yahoo's query logs revealed that 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result; this suggests that many users use repeat queries to revisit or re-find information.

In addition, much research has shown that query term frequency distributions conform to the power law, or long tail, distribution curve. That is, a small portion of the terms observed in a large query log (e.g. more than 100 million queries) are used most often, while the remaining terms are used less often individually. This example of the Pareto principle (or 80-20 rule) allows search engines to employ optimization techniques such as index or database partitioning, caching and pre-fetching.
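A minimal sketch of the kind of query-log analysis behind such statistics, computing the average query length and the skewed term-frequency distribution over a tiny made-up log; real studies work over millions of queries.

```python
from collections import Counter

query_log = [            # invented sample queries
    "colorado weather", "youtube", "cheap flights", "colorado ski resorts",
    "youtube", "weather", "colorado",
]

# Average query length in terms (the Excite study cited above reported about 2.4).
lengths = [len(q.split()) for q in query_log]
print("average terms per query:", sum(lengths) / len(lengths))

# Term frequencies: a few head terms dominate while the rest form a long tail.
terms = Counter(term for q in query_log for term in q.split())
for term, count in terms.most_common():
    print(term, count)
```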

New Features for Web Searching

The incredible development of Web resources and services has become a motivation for many studies, and for companies to invest in developing new search engines or in adding new features and abilities to their existing ones. Ma (2004), from the Asian Microsoft Research Centre, reported features of the next generation of search engines at WISE04. By looking at the papers published in such conferences and in other journals and seminars, we can track several expected shifts in the future of Web search, including page structure analysis, the deep Web and mobile search.

4.1 Page Structure Analysis: the first search engines concentrated on Web page contents. AltaVista and other early search engines were built on indexing the content of Web pages; they built huge centralized indices, and this is still a part of every popular search engine. However, it became clear that the content of a Web page alone could not be sufficient for capturing the huge amount of information on the Web. In 1996-1997 Google was designed based on the novel idea that the link structure of the Web is an important resource for improving the results of search engines. Backlinks were used, based on the Hyperlink-Induced Topic Search (HITS) algorithm, to crawl billions of Web pages. Google not only used this approach to capture the biggest collection of Web pages but also established PageRank, the ranking system that improved its search results (Brin & Page, 1998).

After content-based indexing and link analysis, the new area of study is page and layout structure. Researchers have recently focused on Web page structure to increase the quality of search, and it is thought that Web page layout is a good resource for improving search results; HTML and XML are important in this approach. For example, the value of information presented in <heading> tags can be greater than that of information in <paragraph> tags, and we can imagine that a link in the middle of a Web page is more important than a link in a footnote. Microsoft has started a big competition on Web searching through its work on Web page blocks, and MSN's new ranking model will be based on object-level rather than document-level ranking. Web graph algorithms such as HITS might also be applied to a sub-section of Web pages to improve search result ranking models. The automatic thesaurus construction method, which extracts term relationships from the link structure of Websites, is another page structure method; it is able to identify new terms and to reflect the latest relationships between terms as the Web evolves. Experimental results have shown that the constructed thesaurus, when applied to query expansion, outperforms a traditional association thesaurus (Chen et al., 2003).
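To make the layout idea concrete, here is a minimal sketch of weighting term occurrences by the kind of page element they appear in; the weights, element labels and the tiny page are invented for illustration and are not taken from any published ranking model.

```python
# Hypothetical weights: text in headings counts more than body text, and links in the
# body count more than links in a footer.
TAG_WEIGHTS = {"heading": 3.0, "paragraph": 1.0, "body_link": 2.0, "footer_link": 0.5}

page = [                      # (element, text) pairs produced by some layout-analysis step
    ("heading", "search engine basics"),
    ("paragraph", "notes on crawling and indexing the web"),
    ("footer_link", "indexing"),
]

def weighted_term_score(page, term):
    """Sum element weights over every occurrence of the term in the page."""
    score = 0.0
    for element, text in page:
        score += TAG_WEIGHTS.get(element, 1.0) * text.lower().split().count(term.lower())
    return score

print(weighted_term_score(page, "indexing"))   # 1.0 (paragraph) + 0.5 (footer link) = 1.5
print(weighted_term_score(page, "search"))     # 3.0 (heading only)
```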
4.2 Deep Search: current search engines can only crawl and capture a small part of the Web, which is called the "visible" or "indexable" Web. A huge amount of scientific and other valuable information is behind closed doors, and it is believed that the size of the invisible or deep Web is several times bigger than that of the surface Web. Different databases, library catalogues, patents, digital books and journals, research reports and governmental archives are examples of resources that usually cannot be crawled and indexed by current search engines. The deep Web, with its structured information, is a potential resource that search companies are trying to capture, and new search engines are trying to find suitable methods for penetrating the database barriers. Web content providers, for their part, are moving toward the Semantic Web by applying technologies such as XML and RDF (Resource Description Framework) in order to create more structured Web resources. BrightPlanet's "differencing" algorithm, for example, is designed to transfer queries across multiple deep Web resources at once, aggregating the results and letting users compare changes to those results over time. Google, MSN and many other popular search engines are competing to find solutions for the invisible Web.

The amazing size and valuable resources of the deep Web have affected the search engine industry, and the next generation of search engines is expected to be able to investigate deep Web information. Yahoo, for instance, has developed a paid service for searching the deep Web called the Content Aggregation Program (CAP). The method is secret, but the company does acknowledge that the program will give paying customers a more direct pipeline into its search database (Wright, 2004).

4.3 Structured Data: the World Wide Web is considered a huge collection of unstructured data presented in billions of Web pages, and most documents available on the Web are unstructured resources. In many cases, however, data is stored in tables and separate files. As a part of both the surface and the deep Web, structured data resources are very important and valuable. Most search engines just save a copy of Web pages in their repository and then build several indexes from the content of these pages, so the concept of structured searching is different from the way search engines currently operate. Traditional information retrieval and database management techniques have been used to extract data from different tables and resources and combine them to respond to users' queries. Current search engines cannot resolve this problem efficiently, but in the future an intelligent search engine will be able to distinguish different structured resources and combine their data to find a high-quality response to a complicated query. As Rein (1997) says, a search engine supporting XML-based queries can be programmed to search structured resources; such an engine would rank words based on their location in a document, rather than just the number of times they appear.

4.4 Recommending Group Ranking: while many search engines are able to crawl and index billions of Web pages, sorting the results of each query is still an issue. Basic ranking algorithms are based on the occurrence rate of index terms in each page. The idea is simple: more relevant pages must take a higher rank, so if the search term is mathematics, then a page that contains the word mathematics 20 times must be ranked before a page which contains it 10 times.
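A minimal sketch of that occurrence-based ranking, using made-up page texts; it illustrates the principle stated above, not any engine's actual scoring function.

```python
def rank_by_occurrence(pages, term):
    """Order pages by how many times the query term appears in them (highest first)."""
    term = term.lower()
    counts = {name: text.lower().split().count(term) for name, text in pages.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)

pages = {
    "page A": "mathematics " * 20,     # mentions the term 20 times
    "page B": "mathematics " * 10,     # mentions the term 10 times
    "page C": "history of art",
}
print(rank_by_occurrence(pages, "mathematics"))
# [('page A', 20), ('page B', 10), ('page C', 0)]
```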
However, this alone is not a sufficient approach: on that basis search engines can judge pages only by keyword occurrence. Page ranking algorithms have therefore been utilized to present better-ranked results, and recently link information and page structure information have been used to improve rank quality. These methods are automatic and are done by machines. However, it is believed that the best judgement about the importance and quality of Web pages is acquired when they are reviewed and recommended by human experts. In the future, search results will be ranked not only by automatic ranking algorithms but also by using the ideas of scholars and scientific recommending groups. Discussion thread recommendation or peer reviews are expected to be used by search engines to improve their results.

4.5 Federated Search: also known as parallel search, metasearch or broadcast search, federated search aggregates multiple channels of information into a single searchable point. Federated search engines are different from metasearch engines: federated search mostly covers subscription-based databases that are usually a part of the invisible Web and are ignored by Web-oriented metasearch engines, and usually there is no overlap between the databases covered by federated search engines. Metasearch engine services are free for users, while federated search engines are sold to libraries and other interested information service providers. Federated searching has several advantages for users: it reduces the time needed for searching several databases, and users do not need to know how to search through different interfaces (Fryer, 2004).

One of the most important reasons for the growing interest in federated searching is the complexity of the online materials environment, such as the increasing number of electronic journals and online full-text databases. Limitations in funds have also forced libraries and other major information user organizations to share their online resources. One of the disadvantages of federated search engines is that they cannot be used for sophisticated search commands and queries, being limited to basic Boolean search. Webster (2004) maintains that although federated searching tools offer some real immediate advantages today, they cannot overcome the underlying problem of growing complexity and lack of uniformity; we need an open, interoperable and uniform e-content environment to provide fully the interconnected, accessible environment that librarians are seeking from metasearching.

4.6 Mobile Search: the number of people who have a cell phone seems to be greater than the number of people who have a PC, and many other mobile technologies such as GPS devices are widely used. Search engine companies have therefore focused on the big market of mobile phones and wireless telecommunication devices. Recently, Yahoo developed its mobile Web search system, and mobile phone users can have access to Yahoo Local, Image and Web search, as well as quick links to stocks, sports scores and weather, for a fee. The platform also includes a modified Yahoo Instant Messaging client and Yahoo Mobile Games (Singer, 2004). In the future, everyone will have access to Web information and services through his or her wireless phone, without necessarily having a computer.

With the Beta version of Google Scholar (http://scholar.google.com) released in November 2004, other major players in the search engine industry are expected to invest in rivals to this new service. There will thus be a shift towards providing specialised search facilities for the scholarly part of the Web, which encompasses a considerable part of the deep Web.

Conclusion
The World Wide Web, with its short history, has experienced significant changes. The gigantic size of the Web and the vast variety of users' needs and interests, as well as the Web's big potential as a commercial market, have brought about many changes and a great demand for better search engines. In this article we reviewed the history of Web search tools and techniques and mentioned some big shifts in this field. While the first search engines were established based on traditional database and information retrieval methods, many other algorithms and methods have since been added to improve their results. Google utilized the Web graph, or link structure of the Web, to build one of the most comprehensive and reliable search engines. The structure of Web pages also seems to be a good resource with which search engines can improve their results.

Meanwhile, many issues remain unsolved or incomplete. By looking at papers published in popular conferences on Web and information management, we see not only a considerable increase in the quantity of Web search research papers since 2001, but also that Web search and information retrieval topics such as ranking, filtering and query formulation are still hot topics. This reveals that search engines have many unsolved and research-interesting areas. Information extraction, ambiguity in addresses and names, personalization and multimedia searching, among others, will be major issues in the next few years. The Web's security and privacy are two further important issues for the coming years. Local services and the personalization of search tools are two major ideas that have been studied for several years.

We mentioned several important issues for the future of search engines. The next generations of search tools are expected to be able to extract structured data to offer high-quality responses to users' questions. Search engines are trying to incorporate the recommendations of special-interest groups into their search techniques, and federated search is a sample of future cooperative search and information retrieval facilities. Finally, we addressed the efforts of search engine companies to break their borders by making search possible on mobile phones and other wireless information and communication devices. The Web search industry is opening new horizons for the global village, and the World Wide Web will be more usable in the future.

References

Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th International WWW Conference, Brisbane, Australia, 107-117.

Chen, Z., Liu, S., Wenyin, L., Pu, P., & Ma, W. (2003). Building a web thesaurus from web link structure. Proceedings of the 26th annual international ACM SIGIR conference on Research and development in information retrieval, 48-55.

Fryer, D. (2004). Federated search engines. Online, 28(2), 16-19.

Gromov, G. History of Internet and WWW: the roads and crossroads of Internet history. Retrieved 2004, from http://www.netvalley.com/intvalstat.html

Holzschlag, M. E. (2001). How specialization limited the Web. Retrieved 2004, from http://www.webtechniques.com/archives/2001/09/desi/

Jansen, B. J., Spink, A., & Pedersen, J. (2003). An analysis of multimedia searching on AltaVista. Proceedings of the 5th ACM SIGMM international workshop on Multimedia information retrieval, 186-192.

Kherfi, M. L., Ziou, D., & Bernardi, A. (2004). Image retrieval from the World Wide Web: issues, techniques and systems. ACM Computing Surveys, 36(1), 35-67.

Liu, F., Yu, C., & Meng, W. (2002). Personalized web search by mapping user queries to categories. Proceedings of the eleventh international conference on Information and knowledge management (CIKM '02), 558-565.

Ma, W. (2004). Towards next generation Web information retrieval. Web Information Systems – WISE04: Proceedings of the fifth international conference on Web Information Systems Engineering, Brisbane, Australia.

Perez, J. C. (2004). Google offers new local search service. Retrieved 2004, from http://www.infoworld.com/article/04/03/17/HNgooglelocal_1.html

Poulter, A. (1997). The design of World Wide Web search engines: a critical review. Program, 31(2), 131-145.

Rein, L. (1997, October 27). XML ushers in structured Web searches. Retrieved 2004, from http://www.wired.com/news/technology/0,1282,7751,00.html

Schwartz, C. (1998). Web search engines. Journal of the American Society for Information Science, 49(11), 973-982.

Singer, M. (2004). Yahoo sends search aloft. Retrieved 2004, from http://www.internetnews.com/bus-news/article.php/3427831

Sullivan, D. (2000, June 2). The Search Engine Report. Retrieved 2004, from http://www.searchenginewatch.com/sereport/00/06realnames.html

Survey reveals search habits. (2002).

Wall, A. (2004). History of search engines & web history. Retrieved 2004, from http://www.search-marketing.info/search-engine-history/

Watters, C., & Amoudi, G. (2003). GeoSearcher: location-based ranking of search engine results. Journal of the American Society for Information Science and Technology, 54(2), 140-151.

Webster, P. (2004). Metasearching in an academic environment. Online, 28(2), 20-23.

