SEARCH ENGINE


A research project on SEARCH ENGINE

SUBMITTED BY

SATHISH KOTHA

108-00-0746

University Of Northern Virginia

CSCI 587 SEC 1220, SPECIAL TOPICS IN INFORMATION TECHNOLOGY-1 6/20/2010

Abstract of the Project


A web search engine is designed to search for information on the World Wide Web. The results are usually presented as a list, commonly called hits, and may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or through a mixture of algorithmic and human input. This project discusses the types of search engines and how a search engine finds information for the user: the processes that go on behind the screen, how many search engines exist today to provide information and facts to computer users, the history of search engines, and the stages a search engine goes through when searching for information. It also covers the features of web searching and related topics such as the Advanced Research Projects Agency Network, what a bot really is, the types of search queries used when seeking information, web directories, and the roles of well-known search engines such as Google and Yahoo!. Challenges in language processing and the general characteristics of search engines are discussed as well. I chose this topic because I find the working of search engines interesting, and I want everyone who comes across this report to learn from it; many people use search engines without knowing what really happens behind the screen. At the end of the project I list the references from which the topics of this discussion were selected. I hope you find this project acceptable as my topic for this course.



ACKNOWLEDGEMENT

The project entitled "SEARCH ENGINE" is entirely my own effort. It is my duty to acknowledge everyone who was directly or indirectly involved with the project, without whom it would not have taken shape. Sincere thanks go to PROF. SOUROSHI for his support, valuable suggestions, and timely advice, without which the project would not have been completed on time. I also thank the many others who helped me throughout the project and made it successful.

PROJECT ASSOCIATES

CONTENTS

PRELIMINARIES
Acknowledgement

1. History of Search Engines
   Types of search queries
   World Wide Web Wanderer
   ALIWEB
   Primitive web search
2. Working of a Search Engine
   Web crawling
   Indexing
   Searching
3. New features for web searching
4. Conclusion
5. References

1. History of Search Engines

This section traces the history of search from its earliest technology through directories and vertical search to search engine marketing, from 1945 to Google in 2007.

As We May Think (1945): The concept of hypertext and a memory extension really came to life in July of 1945, when, after enjoying the scientific camaraderie that was a side effect of WWII, Vannevar Bush's "As We May Think" was published in The Atlantic Monthly. He urged scientists to work together to help build a body of knowledge for all mankind. Here are a few selected sentences and paragraphs that drive his point home:

"Specialization becomes increasingly necessary for progress, and the effort to bridge between disciplines is correspondingly superficial."

"A record, if it is to be useful to science, must be continuously extended, it must be stored, and above all it must be consulted."

"Our ineptitude in getting at the record is largely caused by the artificiality of the systems of indexing. ... Having found one item, moreover, one has to emerge from the system and re-enter on a new path. The human mind does not work this way. It operates by association. ... Man cannot hope fully to duplicate this mental process artificially, but he certainly ought to be able to learn from it. In minor ways he may even improve, for his records have relative permanency."

Bush was not only a firm believer in storing data; he also believed that if the data source was to be useful to the human mind, we should have it represent how the mind works to the best of our abilities. He then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory storage and retrieval system. He named this device a memex.

Gerard Salton (1960s - 1990s): Gerard Salton, who died on August 28, 1995, was the father of modern search technology. His teams at Harvard and Cornell developed the SMART informational retrieval system. Salton's Magic Automatic Retriever of Text included important concepts like the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values, and relevancy feedback mechanisms.

Ted Nelson: Ted Nelson created Project Xanadu in 1960 and coined the term hypertext in 1963. His goal with Project Xanadu was to create a computer network with a simple user interface that solved many social problems like attribution. While Ted was against complex markup code, broken links, and many other problems associated with traditional HTML on the WWW, much of the inspiration to create the WWW was drawn from Ted's work. There is still conflict surrounding the exact reasons why Project Xanadu failed to take off.

Advanced Research Projects Agency Network: ARPANet is the network which eventually led to the internet. Wikipedia has a great background article on ARPANet, and Google Video has a free, interesting video about ARPANet from 1972.

Archie (1990): The first few hundred web sites began in 1993, and most of them were at colleges, but long before most of them existed came Archie, the first search engine, created in 1990 by Alan Emtage, a student at McGill University in Montreal. The original intent of the name was "archives," but it was shortened to Archie.

File Transfer Protocol: Tim Berners-Lee existed at this point; however, there was no World Wide Web. The main way people shared data back then was via File Transfer Protocol (FTP). If you had a file you wanted to share you would set up an FTP server, and if someone was interested in retrieving the data they could do so using an FTP client. This process worked effectively in small groups, but the data became as fragmented as it was collected.

Archie helped solve this data scatter problem by combining a script-based data gatherer with a regular expression matcher for retrieving file names matching a user query. Essentially, Archie became a database of web filenames which it would match against users' queries. Bill Slawski has more background on Archie here.

Veronica & Jughead: As word of mouth about Archie spread, it started to become word of computer, and Archie had such popularity that the University of Nevada System Computing Services group developed Veronica. Veronica served the same purpose as Archie, but it worked on plain text files. Soon another user interface named Jughead appeared with the same purpose as Veronica; both of these were used for files sent via Gopher, which was created as an Archie alternative by Mark McCahill at the University of Minnesota in 1991.

Tim Berners-Lee & the WWW (1991): From Wikipedia: While an independent contractor at CERN from June to December 1980, Berners-Lee proposed a project based on the concept of hypertext, to facilitate sharing and updating information among researchers. With help from Robert Cailliau he built a prototype system named Enquire. After leaving CERN in 1980 to work at John Poole's Image Computer Systems Ltd., he returned in 1984 as a fellow. In 1989, CERN was the largest Internet node in Europe, and Berners-Lee saw an opportunity to join hypertext with the Internet. In his words, "I just had to take the hypertext idea and connect it to the TCP and DNS ideas and — ta-da! — the World Wide Web".

He used similar ideas to those underlying the Enquire system to create the World Wide Web, for which he designed and built the first web browser and editor (called WorldWideWeb and developed on NeXTSTEP) and the first Web server, called httpd (short for HyperText Transfer Protocol daemon). The first Web site built was at http://info.cern.ch/ and was first put online on August 6, 1991. It provided an explanation of what the World Wide Web was, how one could own a browser, and how to set up a Web server. It was also the world's first Web directory, since Berners-Lee maintained a list of other Web sites apart from his own. Tim also created the Virtual Library, which is the oldest catalogue of the web, and wrote a book about creating the web, titled Weaving the Web. In 1994, Berners-Lee founded the World Wide Web Consortium (W3C) at the Massachusetts Institute of Technology.

What is a Bot? Computer robots are simply programs that automate repetitive tasks at speeds impossible for humans to reproduce. The term bot on the internet is usually used to describe anything that interfaces with the user or that collects data. Another bot example is the chatterbot, which is resource-heavy on a specific topic; these bots attempt to act like a human and communicate with humans on said topic.

Types of Search Queries: Andrei Broder authored A Taxonomy of Web Search [PDF], which notes that most searches fall into the following three categories:
• Informational - seeking static information about a topic
• Transactional - shopping at, downloading from, or otherwise interacting with the result
• Navigational - send me to a specific URL

Nancy Blachman's Google Guide offers searchers free Google search tips, and Greg R. Notess's Search Engine Showdown offers a search engine features chart. There are also many popular smaller vertical search services; for example, Del.icio.us allows you to search URLs that users have bookmarked, and Technorati allows you to search blogs.

World Wide Web Wanderer: Soon the web's first robot came. In June 1993 Matthew Gray introduced the World Wide Web Wanderer. He initially wanted to measure the growth of the web and created this bot to count active web servers. He soon upgraded the bot to capture actual URLs, and his database became known as the Wandex. The Wanderer was as much of a problem as it was a solution, because it caused system lag by accessing the same page hundreds of times a day. It did not take long for him to fix this software, but people started to question the value of bots.

ALIWEB: In October of 1993 Martijn Koster created Archie-Like Indexing of the Web, or ALIWEB, in response to the Wanderer. ALIWEB crawled meta information and allowed users to submit the pages they wanted indexed with their own page description. This meant it needed no bot to collect data and was not using excessive bandwidth. The downside of ALIWEB is that many people did not know how to submit their site.

Robots Exclusion Standard: Martijn Koster also hosts the web robots page, which created standards for how search engines should index or not index content. This allows webmasters to block bots from their site on a whole-site level or page by page. By default, if information is on a public web server and people link to it, search engines generally will index it. In 2005 Google led a crusade against blog comment spam, creating a nofollow attribute that can be applied at the individual link level. After this was pushed through, Google quickly changed the scope of the purpose of the link nofollow to claim it was for any link that was sold or not under editorial control.

Primitive Web Search: By December of 1993, three full fledged bot-fed search engines had surfaced on the web: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider. JumpStation gathered info about the title and header from Web pages and retrieved these using a simple linear search; as the web grew, JumpStation slowed to a stop. The WWW Worm indexed titles and URLs. The problem with JumpStation and the World Wide Web Worm is that they listed results in the order that they found them and provided no discrimination. The RBSE spider did implement a ranking system.

Excite: Excite came from the project Architext, which was started in February 1993 by six Stanford undergrad students. They had the idea of using statistical analysis of word relationships to make searching more efficient. They were soon funded, and in mid-1993 they released copies of their search software for use on web sites. Excite was bought by a broadband provider named @Home in January 1999 for $6.5 billion, and was named Excite@Home. In October 2001 Excite@Home filed for bankruptcy, and InfoSpace bought Excite from bankruptcy court for $10 million.

Web Directories: Since early search algorithms did not do adequate link analysis or cache full page content, if you did not know the exact name of what you were looking for it was extremely hard to find it.

VLib: When Tim Berners-Lee set up the web he created the Virtual Library, which became a loose confederation of topical experts maintaining relevant topical link lists.

EINet Galaxy: The EINet Galaxy web directory was born in January of 1994. It was organized similarly to how web directories are today. The biggest reason the EINet Galaxy became a success was that it also contained Gopher and Telnet search features in addition to its web search feature. The web size in early 1994 did not really require a web directory; however, other directories soon did follow.

Yahoo! Directory

In April 1994 David Filo and Jerry Yang created the Yahoo! Directory as a collection of their favorite web pages. As their number of links grew they had to reorganize and become a searchable directory. What set the directories above The Wanderer is that they provided a human-compiled description with each URL. As the Yahoo! Directory grew, Yahoo! began charging commercial sites for inclusion, and as time passed the inclusion rates for listing a commercial site increased; the current cost is $299 per year. Many informational sites are still added to the Yahoo! Directory for free.

Open Directory Project: In 1998 Rich Skrenta and a small group of friends created the Open Directory Project, a directory which anybody can download and use in whole or part. The ODP (also known as DMOZ) is the largest internet directory, almost entirely run by a group of volunteer editors. The Open Directory Project grew out of the frustration webmasters faced waiting to be included in the Yahoo! Directory. Netscape bought the Open Directory Project in November 1998, and later that same month AOL announced the intention of buying Netscape in a $4.5 billion all-stock deal.

LII: Google offers a librarian newsletter to help librarians and other web editors make information more accessible and categorize the web. The second Google librarian newsletter came from Karen G. Schneider, who is the director of the Librarians' Internet Index. LII is a high quality directory aimed at librarians, and her article explains what she and her staff look for when seeking quality, credible resources to add to the LII. The Internet Public Library is another well-kept directory of websites.

Business.com: Business.com, for example, is a directory of business websites.

Due to the time-intensive nature of running a directory, and the general lack of scalability of the business model, the quality and size of directories sharply drop off after you get past the first half dozen or so general directories. Most other directories, especially those which have a paid inclusion option, hold lower standards than selected, limited catalogs created by librarians. There are also numerous smaller industry-, vertical-, or locally-oriented directories.

Looksmart: Looksmart was founded in 1995. They competed with the Yahoo! Directory by frequently increasing their inclusion rates back and forth. In 1998 Looksmart tried to expand their directory by buying the non-commercial Zeal directory for $20 million, but on March 28, 2006 Looksmart shut down the Zeal directory. In 2002 Looksmart transitioned into a pay-per-click provider, which charged listed sites a flat fee per click. The problem was that Looksmart became too dependent on MSN, and in 2003, when Microsoft announced they were dumping Looksmart, it basically killed their business model, although it allowed them to profit by syndicating those paid listings to some major portals like MSN. That caused the demise of any good faith or loyalty they had built up. In March of 2002, Looksmart bought a search engine by the name of WiseNut, but it never gained traction. Looksmart also owns a catalog of content articles organized in vertical sites, and hopes to drive traffic using Furl, a social bookmarking program, but due to limited relevancy Looksmart has lost most (if not all) of their momentum.

WebCrawler: Brian Pinkerton of the University of Washington released WebCrawler on April 20, 1994. It was the first crawler which indexed entire pages. Soon it became so popular that during daytime hours it could not be used. AOL eventually purchased WebCrawler and ran it on their network. Then in 1997, Excite bought out WebCrawler, and AOL began using Excite to power its NetFind. WebCrawler opened the door for many other services to follow suit; within one year of its debut came Lycos, Infoseek, and OpenText.

Lycos: Lycos was the next major search development, having been designed at Carnegie Mellon University around July of 1994. Michael Mauldin was responsible for this search engine and remains the chief scientist at Lycos Inc. On July 20, 1994, Lycos went public with a catalog of 54,000 documents.

In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. But Lycos' main difference was the sheer size of its catalog: by August 1994, Lycos had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents, more than any other Web search engine. In October 1994, Lycos ranked first on Netscape's list of search engines by finding the most hits on the word 'surf.'

Infoseek: Infoseek also started out in 1994, claiming to have been founded in January. They really did not bring a whole lot of innovation to the table, but they offered a few add-ons, and in December 1995 they convinced Netscape to use them as their default search, which gave them major exposure. One popular feature of Infoseek was allowing webmasters to submit a page to the search index in real time, which was a search spammer's paradise.

AltaVista: AltaVista's debut online came during this same month. AltaVista brought many important features to the web scene. They had nearly unlimited bandwidth (for that time), they were the first to allow natural language queries, and they offered advanced searching techniques. They allowed users to add or delete their own URL within 24 hours, and they even allowed inbound link checking. AltaVista also provided numerous search tips and advanced search features. Due to poor mismanagement, a fear of result manipulation, and portal-related clutter, AltaVista was largely driven into irrelevancy around the time Inktomi and Google started becoming popular. On February 18, 2003, Overture signed a letter of intent to buy AltaVista for $80 million in stock and $60 million in cash. After Yahoo! bought out Overture they rolled some of the AltaVista technology into Yahoo! Search, and occasionally use AltaVista as a testing platform.

Inktomi: The Inktomi Corporation came about on May 20, 1996 with its search engine Hotbot. Two Cal Berkeley cohorts created Inktomi from the improved technology gained from their research. Hotwire listed this site and it became hugely popular quickly. In October of 2001 Danny Sullivan wrote an article titled Inktomi Spam Database Left Open To Public, which highlights how Inktomi accidentally allowed the public to access their database of spam sites, which listed over 1 million URLs at that time. Although Inktomi pioneered the paid inclusion model, it was nowhere near as efficient as the pay-per-click auction model developed by Overture. Licensing their search results also was not profitable enough to pay for their scaling costs. They failed to develop a profitable business model and sold out to Yahoo! for approximately $235 million, or $1.65 a share, in December of 2003.

AllTheWeb: AllTheWeb was a search technology platform launched in May of 1999 to showcase Fast's search technologies. They had a sleek user interface with rich advanced search features, but on February 23, 2003, AllTheWeb was bought by Overture for $70 million. After Yahoo! bought out Overture they rolled some of the AllTheWeb technology into Yahoo! Search, and occasionally use AllTheWeb as a testing platform.

Ask.com (Formerly Ask Jeeves): In April of 1997 Ask Jeeves was launched as a natural language search engine. Ask Jeeves used human editors to try to match search queries. Ask was powered by DirectHit for a while, which aimed to rank results based on their popularity, but that technology proved too easy to spam as the core algorithm component. In 2000 the Teoma search engine was released, which uses clustering to organize sites by Subject Specific Popularity, which is another way of saying they tried to find local web communities. Jon Kleinberg's Authoritative Sources in a Hyperlinked Environment [PDF] was a source of inspiration that led to the eventual creation of Teoma, and Mike Grehan's Topic Distillation [PDF] also explains how subject-specific popularity works. In 2001 Ask Jeeves bought Teoma to replace the DirectHit search technology.

On November 15, 2005 Google launched a product called Google Base, which is a database of just about anything imaginable. Users can upload items and title, describe, and tag them as they see fit. Based on usage statistics, this tool can help Google understand which vertical search products they should create or place more emphasis on. Google also has a Scholar search program which aims to make scholarly research easier to do. Google believes that owning other verticals will allow them to drive more traffic back to their core search service, and that targeted, measured advertising associated with search can be carried over to other mediums; for example, Google bought dMarc, a radio ad placement firm. Yahoo! has also tried to extend their reach by buying other high-traffic properties, like the photo sharing site Flickr and the social bookmarking site del.icio.us.

Yahoo! Search Marketing: Yahoo! Search Marketing is the rebranded name for Overture after Yahoo! bought them out. As of September 2006 their platform is generally the exact same as the old Overture platform, with the same flaws: ad clickthrough rate is not factored into click cost, it is hard to run local ads, and it is just generally clunky.

Google AdWords: Google initially started off by allowing textual ads in numerous formats, but eventually added image ads and video ads. Advertisers could choose which keywords they wanted to target and which ad formats they wanted to market, and Google allowed advertisers to buy ads targeted to specific websites, pages, or demographic categories. Ads targeted on websites are sold on a cost per thousand impression (CPM) basis in an ad auction against other keyword-targeted and site-targeted ads. Google also introduced what they called smart pricing, which automatically adjusts the click cost of an ad based on what Google perceives a click from that page to be worth; an ad on a digital camera review page would typically be worth more than a click from a page with pictures on it. To prevent the erosion of the value of search ads, Google allows advertisers to opt out of placing their ads on content sites.

Google AdSense: On March 4, 2003 Google announced their content-targeted ad network. In April 2003, Google bought Applied Semantics, which had CIRCA technology that allowed them to drastically improve the targeting of those ads. Google adopted the name AdSense for the new ad program. AdSense allows web publishers large and small to automate the placement of relevant ads on their content. Google also allows some publishers to place AdSense ads in their feeds, and some select publishers can place ads in emails. To help grow the network and make the market more efficient, Google added a link which allows advertisers to sign up for an AdWords account from content websites.

Microsoft AdCenter: Microsoft AdCenter was launched on May 3, 2006. Microsoft's ad algorithm includes both cost per click and ad clickthrough rate. On the features front, Microsoft added demographic targeting and dayparting features to the pay-per-click mix. While Microsoft has limited marketshare, they intend to increase their marketshare by baking search into Internet Explorer 7.

Microsoft also created the XBox game console, and on May 4, 2006 they announced they had bought a video game ad targeting firm named Massive Inc. Eventually video game ads will be sold from within Microsoft AdCenter.

Google

Early Years: Google's corporate history page has a pretty strong background on Google, starting from when Larry met Sergey at Stanford right up to the present day. In 1995 Larry Page met Sergey Brin at Stanford. By January of 1996, Larry and Sergey had begun collaboration on a search engine called BackRub, named for its unique ability to analyze the "back links" pointing to a given website. BackRub ranked pages using citation notation, a concept which is popular in academic circles: if someone cites a source, they usually think it is important. On the web, links act as citations. In the PageRank algorithm links count as votes, but some votes count more than others. Your ability to rank, and the strength of your ability to vote for others, depends upon your authority: how many people link to you and how trustworthy those links are.

Larry, who had always enjoyed tinkering with machinery and had gained some notoriety for building a working printer out of Lego™ bricks, took on the task of creating a new kind of server environment that used low-end PCs instead of big expensive machines. Afflicted by the perennial shortage of cash common to graduate students everywhere, the pair took to haunting the department's loading docks in hopes of tracking down newly arrived computers that they could borrow for their network. A year later, their unique approach to link analysis was earning BackRub a growing reputation among those who had seen it, and buzz about the new search technology began to build as word spread around campus. In 1998 Sergey tried to shop their PageRank technology, but nobody was interested in buying or licensing their search technology at that time, so Google was launched.
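To make the "links count as votes" idea concrete, here is a minimal sketch of a PageRank-style calculation. It is an illustration only: the damping factor of 0.85, the tiny three-page link graph, and the fixed iteration count are assumptions for the example, not Google's actual parameters or data.

```python
# Minimal PageRank-style "links as votes" sketch (illustrative assumptions:
# damping factor 0.85, a tiny made-up link graph, fixed iteration count).
def pagerank(links, damping=0.85, iterations=50):
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / len(pages) for p in pages}
        for page, outlinks in links.items():
            if not outlinks:                      # dangling page: share rank evenly
                for p in pages:
                    new_rank[p] += damping * rank[page] / len(pages)
            else:
                for target in outlinks:           # each link passes on a share of rank
                    new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Toy graph: B is linked to by both A and C, so it ends up with the highest rank.
print(pagerank({"A": ["B"], "B": ["C"], "C": ["A", "B"]}))
```

A page's score therefore depends not just on how many links point at it, but on the scores of the pages doing the linking, which is the "some votes count more than others" behaviour described above.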

Winning the Search War: Later that year Andy Bechtolsheim gave them $100,000 in seed funding, and Google received $25 million from Sequoia Capital and Kleiner Perkins Caufield & Byers the following year. In 1999 AOL selected Google as a search partner, and Yahoo! followed suit a year later. In 2000 Google also launched their popular Google Toolbar. In 2000 Google relaunched their AdWords program to sell ads on a CPM basis, and in 2002 they retooled the service, selling ads in an auction which would factor in bid price and ad clickthrough rate. On May 1, 2002, AOL announced they would use Google to deliver their search-related ads, which was a strong turning point in Google's battle against Overture. In 2003 Google also launched their AdSense program, which allowed them to expand their ad network by selling targeted ads on other websites.

Going Public: Google used a two-class stock structure, decided not to give earnings guidance, and offered shares of their stock in a Dutch auction. They received virtually limitless negative press for the perceived hubris they expressed in their "AN OWNER'S MANUAL" FOR GOOGLE'S SHAREHOLDERS. After some controversy surrounding an interview in Playboy, Google dropped their IPO offer range from $108 to $135 per share down to $85 to $95 per share. Google went public at $85 a share on August 19, 2004, and its first trade was at 11:56 am ET at $100.01. Google has gained search market share year over year ever since.

Verticals Galore! In addition to running the world's most popular search service, Google also runs a large number of vertical search services, including:
• Google News: Google News launched in beta in September 2002. On September 6, 2006, Google announced an expanded Google News Archive Search that goes back over 200 years.
• Google Scholar: On November 18, 2004, Google launched Google Scholar, an academic search program.
• Google Blog Search: On September 14, 2005, Google announced Google Blog Search.
• Google Base: On November 15, 2005, Google announced the launch of Google Base, a database of uploaded information describing online or offline content, products, or services.
• Google Video: On January 6, 2006, Google announced Google Video.
• Google Book Search: On October 6, 2004, Google launched Google Book Search.
• Google Universal Search: On May 16, 2007, Google began mixing many of their vertical results into their organic search results.

Microsoft: In 1998 MSN Search was launched, but Microsoft did not get serious about search until after Google proved the business model. Until Microsoft saw the light they primarily relied on partners like Overture, Looksmart, and Inktomi to power their search service. They launched a technology preview of their search engine around July 1st of 2004, and formally switched from Yahoo!'s organic search results to their own in-house technology on January 31st, 2005. MSN announced they dumped Yahoo!'s search ad program on May 4th, 2006. On September 11, 2006, Microsoft announced they were launching their Live Search product.

2. Working of a Search Engine

A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider), an automated Web browser which follows every link it sees; exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text, since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it.

When a user enters a query into a search engine (typically by using keywords), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords.

Web crawling: A web crawler (also known as a web spider, web robot, or, especially in the FOAF community, a web scutter) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for web crawlers are ants, automatic indexers, bots, and worms. This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code, and to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam). A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies.
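As a rough sketch of the crawl loop just described (seed URLs, a frontier of links still to visit, and link extraction from every fetched page), the snippet below uses only Python's standard library. It is a simplified illustration under those assumptions: it ignores robots.txt, politeness delays, and most of the error handling a real crawler needs.

```python
# Minimal crawl loop: fetch a page, cache it, extract links, grow the frontier.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collects the href value of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=10):
    frontier, seen, pages = list(seeds), set(seeds), {}
    while frontier and len(pages) < max_pages:
        url = frontier.pop(0)                     # visit the frontier in FIFO order
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue                              # skip unreachable pages
        pages[url] = html                         # cache the page for the indexer
        extractor = LinkExtractor()
        extractor.feed(html)
        for link in extractor.links:
            absolute = urljoin(url, link)         # resolve relative links
            if absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages
```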

Crawling policies: There are three important characteristics of the Web that make crawling it very difficult:
• its large volume,
• its fast rate of change, and
• dynamic page generation.
These characteristics combine to produce a wide variety of possible crawlable URLs.

The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

The recent increase in the number of pages being generated by server-side scripting languages has also created difficulty, in that endless combinations of HTTP GET parameters exist, only a small selection of which will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then that same set of content can be accessed with forty-eight different URLs, all of which will be present on the site. This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next.

The behavior of a web crawler is the outcome of a combination of policies:
• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading websites.
• A parallelization policy that states how to coordinate distributed web crawlers.

Selection policy: Given the current size of the Web, even large search engines cover only a portion of the publicly available internet. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web. This requires a metric of importance for prioritizing Web pages. The importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site).
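Returning to the photo-gallery example above, the forty-eight-URL figure can be reproduced directly. The parameter names below are hypothetical; the point is only that four sort orders, three thumbnail sizes, two file formats, and a user-content toggle multiply into 48 distinct URLs for the same content.

```python
# 4 sort orders x 3 thumbnail sizes x 2 formats x 2 user-content settings = 48 URLs.
from itertools import product

sort_orders = ["name", "date", "size", "rating"]
thumb_sizes = ["small", "medium", "large"]
formats = ["jpg", "png"]
user_content = ["on", "off"]

urls = [f"/gallery?sort={s}&thumb={t}&fmt={f}&usercontent={u}"
        for s, t, f, u in product(sort_orders, thumb_sizes, formats, user_content)]
print(len(urls))  # 48
```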

Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Abiteboul (Abiteboul et al., 2003) designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash" which is distributed equally among the pages it points to. Boldi et al. (Boldi et al., 2004) used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against depth-first, random ordering, and an omniscient strategy. Baeza-Yates et al. used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one. Daneshpajouh et al. designed a community-based algorithm for discovering good seeds; their method crawls web pages with high PageRank from different communities in fewer iterations than a crawl starting from random seeds. One can extract good seeds from a previously crawled web graph using this new method, and using these seeds a new crawl can be very effective.

Restricting followed links: A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may alternatively examine the URL and only request the resource if the URL ends with .html, .htm, or a slash; a similar strategy compares the extension of the web resource to a list of known HTML-page types: .html, .htm, .asp, .aspx, .php, and a slash. This strategy may cause numerous HTML Web resources to be unintentionally skipped.

Path-ascending crawling: Some crawlers intend to download as many resources as possible from a particular Web site. Cothey (Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in each URL that it intends to crawl. For example, when given a seed URL of http://llama.org/hamster/monkey/page.html, it will attempt to crawl /hamster/monkey/, /hamster/, and /. Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling. Many path-ascending crawlers are also known as Harvester software, because they are used to "harvest" or collect all the content (perhaps the collection of photos in a gallery) from a specific page or host.
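A small sketch of the path-ascending idea, using the llama.org example URL from the text. The rule used here to drop the trailing file name is a simplifying heuristic for illustration.

```python
# Given one seed URL, also generate every ancestor path to crawl.
from urllib.parse import urlparse

def ascending_paths(url):
    parsed = urlparse(url)
    base = f"{parsed.scheme}://{parsed.netloc}"
    segments = [s for s in parsed.path.split("/") if s]
    if segments and "." in segments[-1]:          # heuristic: drop the file name
        segments = segments[:-1]
    paths = [url]
    for i in range(len(segments), -1, -1):        # /a/b/, then /a/, then /
        paths.append(base + "/" + "/".join(segments[:i]) + ("/" if i else ""))
    return paths

print(ascending_paths("http://llama.org/hamster/monkey/page.html"))
# ['http://llama.org/hamster/monkey/page.html',
#  'http://llama.org/hamster/monkey/', 'http://llama.org/hamster/', 'http://llama.org/']
```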

Focused crawling: The main problem in focused crawling is that, in the context of a web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. Diligenti et al. propose to use the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of focused crawling depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general Web search engine for providing starting points. Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized for the present in the Semantic Web and Website Parse Template concepts; Web 3.0 crawling and indexing technologies will be based on human-machine clever associations.

Re-visit policy: The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long time, usually measured in weeks or months. By the time a web crawler has finished its crawl, many events could have happened, including creations, updates and deletions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most-used cost functions, introduced in (Cho and Garcia-Molina, 2000), are freshness and age.

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:

F_p(t) = 1 if the local copy of p is equal to the live copy at time t, and 0 otherwise.

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:

A_p(t) = 0 if p has not been modified since it was last crawled, and t - (time of last modification of p) otherwise.

[Figure: Evolution of freshness and age in Web crawling]

The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are outdated, while in the second case, the crawler is concerned with how old the local copies of pages are.

Two simple re-visiting policies were studied by Cho and Garcia-Molina:

Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.

Proportional policy: This involves re-visiting more often the pages that change more frequently; the visiting frequency is directly proportional to the (estimated) change frequency.

(In both cases, the repeated crawling order of pages can be done either at random or with a fixed order.)

The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal method for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. In other words, to improve freshness we should penalize the elements that change too often (Cho and Garcia-Molina, 2003a).
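A brief worked example of the freshness and age definitions above. The timestamps are invented purely for illustration (measured in days), and the two helper functions simply restate the definitions in code.

```python
# Freshness is binary: 1 if the cached copy still matches the live page.
def freshness(local_copy_matches_live):
    return 1 if local_copy_matches_live else 0

# Age stays 0 until the live page changes; afterwards it grows with the time
# elapsed since that modification.
def age(t, last_modified, last_crawled):
    return 0 if last_modified <= last_crawled else t - last_modified

print(freshness(False))                             # 0: local copy is out of date
print(age(t=10, last_modified=7, last_crawled=5))   # 3: page changed 3 days ago
```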

Politeness policy: Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers. The costs of using web crawlers include:
• Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.
• Server overload, especially if the frequency of accesses to a given server is too high.
• Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.
• Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol, which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Recently, commercial search engines like Ask Jeeves, MSN and Yahoo are able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests.

The first proposal for the interval between connections was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download only that entire website; also, only a fraction of the resources from that Web server would be used. This does not seem acceptable. Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3-4 minutes.

It is worth noticing that even when being very polite, and taking all the safeguards to avoid overloading web servers, some complaints from Web server administrators are received. Brin and Page note that "running a crawler which connects to more than half a million servers (...) generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen." For those using web crawlers for research purposes, a more detailed cost-benefit analysis is needed, and ethical considerations should be taken into account when deciding where to crawl and how fast to crawl.
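As a sketch of how a polite crawler might honor these rules, the snippet below feeds a hypothetical robots.txt (including the non-standard Crawl-delay extension mentioned above) to Python's standard-library parser. The disallowed path and the delay value are made-up examples.

```python
# Parse example robots.txt rules and query them before crawling.
from urllib import robotparser

rules = """
User-agent: *
Disallow: /private/
Crawl-delay: 10
""".splitlines()

rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("MyCrawler", "http://example.com/private/report.html"))  # False
print(rp.can_fetch("MyCrawler", "http://example.com/index.html"))           # True
print(rp.crawl_delay("MyCrawler"))                                          # 10
```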

Parallelization policy (main article: Distributed web crawling): A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization, and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.

Web crawler architectures:

[Figure: High-level architecture of a standard Web crawler]

A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture. Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.

Crawler identification: Web crawlers typically identify themselves to a web server by using the User-agent field of an HTTP request. Web site administrators typically examine their web servers' logs and use the user agent field to determine which crawlers have visited the web server and how often. The user agent field may include a URL where the Web site administrator may find out more information about the crawler. Spambots and other malicious Web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or other well-known crawler. It is important for web crawlers to identify themselves so that Web site administrators can contact the owner if needed; in some cases, crawlers may be accidentally trapped in a crawler trap, or they may be overloading a web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators who are interested in knowing when they may expect their Web pages to be indexed by a particular search engine.

URL normalization: Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. There are several types of normalization that may be performed, including conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component (Pant et al., 2004).
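A minimal sketch of the normalization steps just listed: lowercasing the scheme and host, resolving "." and ".." segments, and adding a trailing slash to a directory-like path. The trailing-slash rule shown is a simplifying heuristic rather than a full canonicalization.

```python
# Normalize a URL so that equivalent spellings map to one canonical form.
import posixpath
from urllib.parse import urlparse, urlunparse

def normalize(url):
    p = urlparse(url)
    path = posixpath.normpath(p.path) if p.path else "/"   # resolve "." and ".."
    if not path.endswith("/") and "." not in posixpath.basename(path):
        path += "/"                                         # heuristic trailing slash
    return urlunparse((p.scheme.lower(), p.netloc.lower(), path,
                       p.params, p.query, p.fragment))

print(normalize("HTTP://Example.COM/a/b/../C"))  # http://example.com/a/C/
```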

Examples of web crawlers include:
• RBSE was the first published web crawler. It was based on two programs: the first program, "spider", maintains a queue in a relational database, and the second program, "mite", is a modified www ASCII browser that downloads the pages from the Web.
• WebCrawler was used to build the first publicly available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.

Indexing

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is Web indexing. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power: for example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

Popular engines focus on the full-text indexing of online, natural language documents, although media types such as video, audio, and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.

Index Design Factors: Major factors in designing a search engine's architecture include:

Merge factors: How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.

Storage techniques: How to store the index data, that is, whether information should be data compressed or filtered.

Index size: How much computer storage is required to support the index.

Lookup speed: How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.

Maintenance: How the index is maintained over time.

Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning, as well as replication.

Index Data Structures: Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include the inverted index and the forward index, discussed below.

Challenges in Parallelism: A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries; this is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information, and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully synchronized, distributed, parallel architecture.

Inverted indices

Many search engines incorporate an inverted index when evaluating a search query, to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

Inverted Index
Word    Documents
the     Document 1, Document 3, Document 4, Document 5
cow     Document 2, Document 3, Document 4
says    Document 5
moo     Document 7

This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search algorithm to identify word proximity to support searching for phrases, and frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.

The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two-dimensional array. The index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table. Inverted indices can be programmed in several computer programming languages.

Index Merging: The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives. After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts: the development of a forward index, and a process which sorts the contents of the forward index into the inverted index.
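A minimal sketch of such a boolean inverted index, built from the same toy documents used in the forward-index illustration in the next subsection. It records only which documents contain each word, not frequency or position, and answers simple AND queries.

```python
# Build a boolean inverted index and answer an AND query over it.
docs = {
    "Document 1": "the cow says moo",
    "Document 2": "the cat and the hat",
    "Document 3": "the dish ran away with the spoon",
}

inverted = {}
for doc_id, text in docs.items():
    for word in text.split():
        inverted.setdefault(word, set()).add(doc_id)

def search(query):
    """Return the documents containing every term in the query."""
    result = None
    for word in query.split():
        postings = inverted.get(word, set())
        result = postings if result is None else result & postings
    return result or set()

print(sorted(search("the cow")))  # ['Document 1']
```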

The Forward Index: The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document      Words
Document 1    the, cow, says, moo
Document 2    the, cat, and, the, hat
Document 3    the, dish, ran, away, with, the, spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document, and it is sorted to transform it into an inverted index. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words; in this regard, the inverted index is a word-sorted forward index. The inverted index is so named because it is an inversion of the forward index.

Compression: Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge, so many search engines utilize a form of compression to reduce the size of the indices on disk. Consider the following scenario for a full-text Internet search engine:
• An estimated 2,000,000,000 different web pages existed as of the year 2000.
• Suppose there are 250 words on each webpage (based on the assumption that they are similar to the pages of a novel).
• It takes 8 bits (or 1 byte) to store a single character; some encodings use 2 bytes per character.
• The average number of characters in any given word on a page may be estimated at 5 (Wikipedia: Size comparisons).
• The average personal computer comes with 100 to 250 gigabytes of usable space.

Given this scenario, an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone, more than the average free disk space of 25 personal computers. This space requirement may be even larger for a fault-tolerant distributed storage architecture. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression. Notably, large-scale search engine designs incorporate the cost of storage as well as the costs of the electricity to power the storage; thus compression is a measure of cost.
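The back-of-the-envelope estimate above can be reproduced directly; the figures are the document's own assumptions, not measurements.

```python
# Reproduce the uncompressed-index storage estimate.
pages = 2_000_000_000        # estimated web pages as of 2000
words_per_page = 250         # assumption: pages resemble the pages of a novel
bytes_per_word = 5           # 5 characters per word at 1 byte per character

word_entries = pages * words_per_page            # 500 billion word entries
total_bytes = word_entries * bytes_per_word
print(word_entries)                              # 500000000000
print(total_bytes / 1_000_000_000, "gigabytes")  # 2500.0 gigabytes
```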

Document Parsing
Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementation of which is commonly kept as a corporate secret.

Challenges in Natural Language Processing

Word Boundary Ambiguity
Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).

Language Ambiguity
To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify their language or represent it accurately. In tokenizing the document, some search engines attempt to automatically identify the language of the document.

Diverse File Formats
In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.

Faulty Storage
The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol. Binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.

Tokenization
Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer, parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.
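A very small tokenizer in the spirit just described might record, for each token, a few of the characteristics mentioned above (position, case, and whether it looks like an email address or URL). This is a hypothetical sketch using only Python's standard regular-expression module, not the parser of any real engine:

```python
import re

# A few special entities are tried before the generic word rule.
TOKEN_PATTERN = re.compile(
    r"(?P<email>[\w.+-]+@[\w-]+\.[\w.-]+)"
    r"|(?P<url>https?://\S+)"
    r"|(?P<word>\w+)"
)

def tokenize(text):
    """Yield (token, kind, position, case) tuples for a piece of text."""
    for match in TOKEN_PATTERN.finditer(text):
        kind = match.lastgroup          # 'email', 'url' or 'word'
        token = match.group()
        case = ("upper" if token.isupper()
                else "lower" if token.islower()
                else "mixed")
        yield token, kind, match.start(), case

for tok in tokenize("Contact us at info@example.com or visit https://example.com NOW"):
    print(tok)
```

A production tokenizer would go much further (language-specific word boundary rules, sentence numbering, part-of-speech tagging), but the shape of the output, a token plus its stored attributes, is the same idea.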

Language Recognition
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, since many of the subsequent steps are language-dependent (such as stemming and part-of-speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Finding which language the words belong to may involve the use of a language recognition chart. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing.

Format Analysis
If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results. Format analysis is the identification and handling of the formatting content embedded within documents; this formatting controls the way the document is rendered on a computer screen or interpreted by a software program. Format analysis is also referred to as structure analysis, format parsing, tag stripping, format stripping, text normalization, text cleaning, and text preparation.

The challenge of format analysis is further complicated by the intricacies of various file formats. Certain file formats are proprietary, with very little information disclosed, while others are well documented. Common, well-documented file formats that many search engines support include:

• Microsoft Word
• Microsoft Excel
• Microsoft PowerPoint
• IBM Lotus Notes
• HTML
• ASCII text files (a text document without any formatting)
• Adobe's Portable Document Format (PDF)
• PostScript (PS)
• LaTeX
• The UseNet archive (NNTP) and other deprecated bulletin board formats
• XML and derivatives like RSS
• SGML (this is more of a general protocol)
• Multimedia meta data formats like ID3

Options for dealing with various formats include using a publicly available commercial parsing tool offered by the organization which developed, maintains, or owns the format, and writing a custom parser.

Some search engines support inspection of files that are stored in a compressed or encrypted file format. When working with a compressed format, the indexer first decompresses the document; this step may result in one or more files, each of which must be indexed separately. Commonly supported compressed file formats include:

• ZIP - Zip file
• RAR - Archive file
• CAB - Microsoft Windows Cabinet file
• Gzip - Gzip file
• BZIP - Bzip file
• TAR.GZ and TAR - Unix Gzip'ped archives
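The decompression step might be handled roughly as in the sketch below; the file paths and the index_document function are hypothetical stand-ins for the rest of the indexing pipeline, and only two of the archive formats listed above are shown:

```python
import gzip
import zipfile
from pathlib import Path

def index_document(name, text):
    # Placeholder for the real parsing/tokenization pipeline.
    print(f"indexing {name}: {len(text)} characters")

def index_compressed(path):
    """Decompress a .zip or .gz file and index each contained document separately."""
    path = Path(path)
    if path.suffix == ".zip":
        with zipfile.ZipFile(path) as archive:
            for member in archive.namelist():
                text = archive.read(member).decode("utf-8", errors="replace")
                index_document(member, text)
    elif path.suffix == ".gz":
        with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
            index_document(path.stem, handle.read())

# index_compressed("crawl/batch-001.zip")   # hypothetical path
```

Note the errors="replace" choice: as the Faulty Storage discussion above points out, crawled files frequently contain bytes that do not decode cleanly, and the indexer has to degrade gracefully rather than fail.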

Format analysis can involve quality improvement methods to avoid including 'bad information' in the index. Content can manipulate the formatting information to include additional content. Examples of abusing document formatting for spamdexing:

• Including hundreds or thousands of words in a section which is hidden from view on the computer screen, but visible to the indexer, by use of formatting (e.g., a hidden "div" tag in HTML, which may incorporate the use of CSS or Javascript to do so).
• Setting the foreground font color of words to the same as the background color, making words hidden on the computer screen to a person viewing the document, but not hidden to the indexer.

Section Recognition
Some search engines incorporate section recognition, the identification of major parts of a document, prior to tokenization. Not all the documents in a corpus read like a well-written book, divided into organized chapters and pages. Many documents on the web, such as newsletters and corporate reports, contain erroneous content and side-sections which do not contain primary material (that which the document is about). For example, an article may display a side menu with links to other web pages. Some file formats, like HTML or PDF, allow for content to be displayed in columns. Even though the content is displayed, or rendered, in different areas of the view, the raw markup content may store this information sequentially. Words that appear sequentially in the raw source content are indexed sequentially, even though these sentences and paragraphs are rendered in different parts of the computer screen. If search engines index this content as if it were normal content, the quality of the index and search quality may be degraded due to the mixed content and improper word proximity. Two primary problems are noted:

• Content in different sections is treated as related in the index, when in reality it is not.
• Organizational 'side bar' content is included in the index, but the side bar content does not contribute to the meaning of the document, and the index is filled with a poor representation of its documents.

Section analysis may require the search engine to implement the rendering logic of each document, essentially an abstract representation of the actual document, and then index the representation instead. For example, some content on the Internet is rendered via Javascript. If the search engine does not render the page and evaluate the Javascript within the page, it would not 'see' this content in the same way and would index the document incorrectly. Given that some search engines do not bother with rendering issues, many web page designers avoid displaying content via Javascript or use the Noscript tag to ensure that the web page is indexed properly. At the same time, this fact can also be exploited to cause the search engine indexer to 'see' different content than the viewer.

Meta Tag Indexing
Specific documents often contain embedded meta information such as author, keywords, description, and language. For HTML pages, the meta tag contains keywords which are also included in the index. The design of the HTML markup language initially included support for meta tags for the very purpose of being properly and easily indexed, without requiring tokenization. Earlier Internet search engine technology would only index the keywords in the meta tags for the forward index; the full document would not be parsed. At that time full-text indexing was not as well established, nor was the hardware able to support such technology.
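A minimal sketch of pulling such meta information out of an HTML page, using only Python's standard library (the HTML snippet is invented for the example), might look like this:

```python
from html.parser import HTMLParser

class MetaTagExtractor(HTMLParser):
    """Collect <meta name="..." content="..."> pairs such as keywords and description."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            attrs = dict(attrs)
            name = (attrs.get("name") or "").lower()
            if name in ("keywords", "description", "author", "language"):
                self.meta[name] = attrs.get("content", "")

page = """<html><head>
<meta name="keywords" content="search engine, indexing, crawler">
<meta name="description" content="A short page about search engines.">
</head><body>...</body></html>"""

extractor = MetaTagExtractor()
extractor.feed(page)
print(extractor.meta)
# {'keywords': 'search engine, indexing, crawler', 'description': 'A short page about search engines.'}
```

As the next paragraphs explain, relying on author-supplied tags like these is exactly what made early engines easy to spam, which is why they were eventually demoted in favour of full-text indexing.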

As the Internet grew through the 1990s, many brick-and-mortar corporations went 'online' and established corporate websites. The keywords used to describe webpages (many of which were corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.

In desktop search, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content. Desktop search is more under the control of the user, while Internet search engines must focus more on the full-text index.

Web Search Queries
A web search query is a query that a user enters into a web search engine to satisfy his or her information needs. Web search queries are distinctive in that they are unstructured and often ambiguous; they vary greatly from standard query languages, which are governed by strict syntax rules.

Types
There are three broad categories that cover most web search queries:

• Informational queries – Queries that cover a broad topic (e.g., colorado or trucks) for which there may be thousands of relevant results.
• Navigational queries – Queries that seek a single website or web page of a single entity (e.g., youtube or delta airlines).
• Transactional queries – Queries that reflect the intent of the user to perform a particular action, like purchasing a car or downloading a screen saver.

Search engines often support a fourth type of query that is used far less frequently:

• Connectivity queries – Queries that report on the connectivity of the indexed web graph (e.g., Which links point to this URL?, and How many pages are indexed from this domain name?).

Characteristics
Most commercial web search engines do not disclose their search logs, so information about what users are searching for on the Web is difficult to come by. Nevertheless, a 2001 study that analyzed the queries from the Excite search engine showed some interesting characteristics of web search:

• The average length of a search query was 2.4 terms.
• About half of the users entered a single query, while a little less than a third of users entered three or more unique queries.
• Close to half of the users examined only the first one or two pages of results (10 results per page).
• Less than 5% of users used advanced search features (e.g., Boolean operators like AND, OR, and NOT).
• The top three most frequently used terms were and, of, and sex.

A study of the same Excite query logs revealed that 19% of the queries contained a geographic term (e.g., place names, zip codes, geographic features, etc.). A 2005 study of Yahoo's query logs revealed that 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information.

In addition, much research has shown that query term frequency distributions conform to the power law, or long-tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g., more than 100 million queries) are used most often, while the remaining terms are used less often individually. This example of the Pareto principle (or 80-20 rule) allows search engines to employ optimization techniques such as index or database partitioning, caching and pre-fetching.
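The skew described above is what makes result caching pay off: counting queries in a log quickly shows that a small "head" accounts for a large share of traffic, so caching results for only those queries already saves a lot of work. A toy illustration with an invented log:

```python
from collections import Counter

query_log = [
    "weather", "facebook", "weather", "python tutorial", "weather",
    "facebook", "cheap flights", "weather", "facebook", "lasagna recipe",
]

counts = Counter(query_log)
total = len(query_log)

# Cache results only for the most frequent queries (the head of the distribution).
head = [query for query, n in counts.most_common(2)]
head_share = sum(counts[q] for q in head) / total

print(counts.most_common(3))
print(f"top 2 queries cover {head_share:.0%} of all queries")   # 70% in this toy log
```

Real logs contain hundreds of millions of distinct queries, but the same head/tail shape holds, which is why partitioning, caching and pre-fetching are effective.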

New Features for Web Searching
The incredible development of Web resources and services has become a motivation for many studies, and for companies to invest in developing new search engines or adding new features and abilities to their existing ones. By looking at the papers published in conferences, journals and seminars in this area, we can track several shifts expected in the future.

4.1 Page Structure Analysis: the first search engines concentrated on Web page contents. AltaVista and other old search engines were built by indexing the content of Web pages. They built huge centralized indices, and this is still a part of every popular search engine. However, it was clear that the contents of a Web page alone could not be sufficient for capturing the huge amount of information available. In 1996-1997 Google was designed based on a novel idea: that the link structure of the Web is an important resource for improving the results of search engines. Backlinks were used, based on the Hyperlink-Induced Topic Search (HITS) algorithm, to crawl billions of Web pages. Google not only used this approach to capture the biggest number of Web pages but also established PageRank, the ranking system that improved its search results (Brin & Page, 1998).

Meanwhile, researchers have focused on Web page structure to increase the quality of search. After content-based indexing and link analysis, the new area of study is page and layout structure. It is thought that Web page layout is a good resource for improving search results. For example, the value of information presented in <heading> tags can be more than that of information in <paragraph> tags, and we can imagine that a link in the middle of a Web page is more important than a link in a footnote. HTML and XML are important in this approach. Ma (2004) from the Asian Microsoft Research Centre reported features of the next generation of search engines in WISE04: MSN's new ranking model will be based on object-level ranking rather than document-level ranking, and Microsoft has started a big competition on Web searching through working on Web page blocks. Web graph algorithms such as HITS might be applied to a sub-section of Web pages to improve search result ranking models. The automatic thesaurus construction method is a page structure method which extracts term relationships from the link structure of Websites. It is able to identify new terms and reflect the latest relationships between terms as the Web evolves. Experimental results have shown that the constructed thesaurus, when applied to query expansion, outperforms a traditional association thesaurus (Chen et al., 2003).
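As a hedged illustration of the link-analysis idea behind PageRank mentioned above, the sketch below runs a basic power iteration over a tiny invented link graph; it shows the principle only and is not Google's actual implementation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Basic power-iteration PageRank over a dict: page -> list of outgoing links."""
    pages = list(links)
    rank = {page: 1.0 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1.0 - damping) / len(pages) for page in pages}
        for page, outgoing in links.items():
            if not outgoing:                      # dangling page: spread its rank evenly
                for other in pages:
                    new_rank[other] += damping * rank[page] / len(pages)
            else:
                for target in outgoing:
                    new_rank[target] += damping * rank[page] / len(outgoing)
        rank = new_rank
    return rank

links = {
    "A": ["B", "C"],
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
for page, score in sorted(pagerank(links).items(), key=lambda kv: -kv[1]):
    print(page, round(score, 3))
```

Page C ends up ranked highest because the most "rank" flows into it, which is exactly the intuition of treating a link as a vote weighted by the importance of the page that casts it.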
4.2 Deep Search: current search engines can only crawl and capture a small part of the Web, which is called the "visible" or "indexable" Web. It is believed that the size of the invisible or deep Web is several times bigger than the size of the surface Web. A huge amount of scientific and other valuable information is behind closed doors. Different databases, digital books and journals, library catalogues, patents, research reports and governmental archives are examples of resources that usually cannot be crawled and indexed by current search engines. New search engines are trying to find suitable methods for penetrating these database barriers. The deep Web, with its structured information, is a potential resource that search companies are trying to capture. Web content providers are moving toward the Semantic Web by applying technologies such as XML and RDF (Resource Description Framework) in order to create more structured Web resources. BrightPlanet's "differencing" algorithm is designed to transfer queries across multiple deep Web resources at once, aggregating the results and letting users compare changes to those results over time. Google, MSN and many other popular search engines are competing to find solutions for the invisible Web.

Recently, Yahoo has developed a paid service for searching the deep Web, called the Content Aggregation Program (CAP). The method is secret, but the company does acknowledge that the program will give paying customers a more direct pipeline into its search database (Wright, 2004). Thus the amazing size and valuable resources of the deep Web have affected the industry of search engines, and the next generation of search engines is expected to be able to investigate deep Web information.

4.3 Structured Data: the World Wide Web is considered a huge collection of unstructured data presented in billions of Web pages, and most documents available on the Web are unstructured resources. Most search engines just save a copy of Web pages in their repository and then make several indexes from the content of these pages. However, as a part of both the surface and the deep Web, structured data resources are very important and valuable. In many cases, data is stored in tables and separated files, and search engines can judge them only on keyword occurrence; current search engines cannot resolve this problem efficiently. Traditional information retrieval and database management techniques have been used to extract data from different tables and resources and combine them to respond to users' queries, but this alone is not sufficient. In the future, an intelligent search engine will be able to distinguish different structured resources and combine their data to find a high quality response to a complicated query. The concept of structured searching is different from the way search engines currently operate. As Rein (1997) says, a search engine supporting XML-based queries can be programmed to search structured resources. Such an engine would rank words based on their location in a document, and their relation to each other, rather than just the number of times they appear.

4.4 Recommending Group Ranking: while many search engines are able to crawl and index billions of Web pages, sorting the results of each query is still an issue. Basic ranking algorithms are based on the occurrence rate of index terms in each page. The idea is simple: more relevant pages must take a higher rank. Simply put, if the search term is mathematics, then a page that contains the word mathematics 20 times must be ranked before a page which contains it 10 times. As we already mentioned, link information and page structure information have recently been used to improve rank quality, and page ranking algorithms have been utilized to present better ranked results. These methods are automatic and are done by machines. However, it is believed that the best judgement about the importance and quality of Web pages is acquired when they are reviewed and recommended by human experts. In the future, search results will be ranked not only by automatic ranking algorithms but also by using the ideas of scholars and scientific recommending groups. Discussion thread recommendations or peer reviews are expected to be used by search engines to improve their results.
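As a toy illustration of the occurrence-rate ranking idea described at the start of this subsection (and only that idea; as noted above, real engines combine it with link and structure signals), consider the following sketch with invented pages:

```python
def rank_by_occurrence(docs, term):
    """Rank documents by how many times the term occurs in each."""
    term = term.lower()
    scores = {doc_id: text.lower().split().count(term) for doc_id, text in docs.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

docs = {
    "page-a": "mathematics " * 20 + "lecture notes",
    "page-b": "mathematics " * 10 + "homework help",
    "page-c": "history of art",
}
print(rank_by_occurrence(docs, "mathematics"))
# [('page-a', 20), ('page-b', 10), ('page-c', 0)]
```

The example also makes the weakness obvious: a page that merely repeats a word twenty times wins, which is precisely the kind of manipulation that link analysis and human recommendation are meant to counterbalance.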
4.5 Federated Search: also known as parallel search, metasearch or broadcast search. Simply put, it aggregates multiple channels of information into a single searchable point. Federated searching has several advantages for users: it reduces the time needed to search several databases, and users do not need to know how to search through different interfaces (Fryer, 2004). Federated search engines are different from metasearch engines. Federated search mostly covers subscription-based databases that are usually a part of the invisible Web and are ignored by Web-oriented metasearch engines, and usually there is no overlap between the databases covered by federated search engines. Metasearch engine services are free for users, while federated search engines are sold to libraries and other interested information service providers. One of the important reasons for the growing interest in federated searching is the complexity of the online materials environment, such as the increasing number of electronic journals and online full-text databases.
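A federated search front end essentially fans the same query out to several sources in parallel and merges whatever comes back. A minimal sketch with invented source functions (a real system would call each database's own API and normalize the records) might look like this:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical connectors standing in for real database interfaces.
def search_catalogue(query):
    return [f"catalogue hit for '{query}'"]

def search_journals(query):
    return [f"journal article about '{query}'"]

def search_patents(query):
    return [f"patent mentioning '{query}'"]

SOURCES = [search_catalogue, search_journals, search_patents]

def federated_search(query):
    """Send the query to every source concurrently and merge the result lists."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        result_lists = list(pool.map(lambda source: source(query), SOURCES))
    merged = []
    for results in result_lists:
        merged.extend(results)
    return merged

print(federated_search("deep web indexing"))
```

The hard problems, as the next paragraph notes, are not the fan-out itself but merging, de-duplicating and ranking results that come back in very different forms from very different sources.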

One of the disadvantages of federated search engines is that they cannot be used for sophisticated search commands and queries, and are limited to basic Boolean searching. Webster (2004) maintains that although federated searching tools offer some real immediate advantages today, they cannot overcome the underlying problem of growing complexity and lack of uniformity; we need an open, interoperable and uniform e-content environment to provide fully the interconnected, accessible environment that librarians are seeking from metasearching.

4.6 Mobile Search: the number of people who have a cell phone seems to be larger than the number of people who have a PC, and many other mobile technologies, such as GPS devices, are widely used. Search engine companies have focused on this big market of mobile phones and wireless telecommunication devices. Recently, Yahoo developed its mobile Web search system, and mobile phone users can have access to Yahoo Local, Image and Web search, as well as quick links to stocks, sports scores and weather for a fee. The platform also includes a modified Yahoo Instant Messaging client and Yahoo Mobile Games (Singer, 2004). In the future, everyone will have access to Web information and services through his or her wireless phone without necessarily having a computer.

Conclusion
The World Wide Web, with its short history, has experienced significant changes. The gigantic size of the Web and the vast variety of users' needs and interests, as well as the big potential of the Web as a commercial market, have brought about many changes and a great demand for better search engines. In this article, we reviewed the history of Web search tools and techniques and mentioned some big shifts in this field. While the first search engines were established based on traditional database and information retrieval methods, many other algorithms and methods have since been added to them to improve their results. Google utilized the Web graph, or link structure of the Web, to make one of the most comprehensive and reliable search engines.

We also mentioned several important issues for the future of search engines. The structure of Web pages seems to be a good resource with which search engines can improve their results. The next generations of search tools are expected to be able to extract structured data to offer high quality responses to users' questions. Search engines are trying to take the recommendations of special-interest groups into account in their search techniques. Federated search is a sample of future cooperative search and information retrieval facilities; limitations in funds have forced libraries and other major information user organizations to share their online resources. Finally, we addressed the efforts of search engine companies in breaking their borders by making search possible for mobile phones and other wireless information and communication devices.

With the Beta version of Google Scholar (http://scholar.google.com) released in November 2004, other major players in the search engine industry are expected to invest in rivals to this new service, and there will be a shift towards providing specialised search facilities for the scholarly part of the Web, which encompasses a considerable part of the deep Web. Local services and the personalization of search tools are two major ideas that have been studied for several years. The Web's security and privacy are two important issues for the coming years. Meanwhile, many issues still remain unsolved or incomplete. By looking at papers published in popular conferences on Web and information management, we see not only a considerable increase in the quantity of Web search research papers since 2001, but also that Web search and information retrieval topics such as ranking, filtering and query formulation are still hot topics. Information extraction, ambiguity in addresses and names, personalization and multimedia searching, among others, are major issues for the next few years. This reveals that search engines have many unsolved and research-interesting areas. The Web search industry is opening new horizons for the global village, and the World Wide Web will be more usable in the future.

References

• Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th International WWW Conference, Brisbane, Australia.
• Chen, Z., Liu, S., Wenyin, L., Pu, G., & Ma, W. (2003). Building a web thesaurus from web link structure. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, Canada, 48-55.
• Fryer, D. (2004). Federated search engines. Online, 28(2), 16-19.
• Gromov, G. History of Internet and WWW: the roads and crossroads of Internet history. Retrieved 2004, from http://www.netvalley.com/intvalstat.html
• Holzschlag, M. E. (2001). How specialization limited the Web. Retrieved 2004, from http://www.webtechniques.com/archives/2001/09/desi/
• Jansen, B. J., Spink, A., & Pedersen, J. (2003). An analysis of multimedia searching on AltaVista. Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 186-192.
• Kherfi, M. L., Ziou, D., & Bernardi, A. (2004). Image retrieval from the World Wide Web: issues, techniques and systems. ACM Computing Surveys, 36(1), 35-67.
• Liu, F., Yu, C., & Meng, W. (2002). Personalized web search by mapping user queries to categories. Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM'02), McLean, Virginia, USA, 558-565.
• Ma, W., Zhang, H., & Hon, H. (2004). Towards next generation Web information retrieval. Web Information Systems - WISE04: Proceedings of the Fifth International Conference on Web Information Systems Engineering, Brisbane, Australia.
• Perez, J. C. (2004). Google offers new local search service. Retrieved 2004, from http://www.infoworld.com/article/04/03/17/HNgooglelocal_1.html
• Poulter, A. (1997). The design of World Wide Web search engines: a critical review. Program, 31(2), 131-145.
• Rein, L. (1997). XML ushers in structured Web searches. Retrieved 2004, from http://www.wired.com/news/technology/0,1282,7751,00.html
• Schwartz, C. (1998). Web search engines. Journal of the American Society for Information Science, 49(11), 973-982.
• Singer, M. (2004). Yahoo sends search aloft. Retrieved 2004, from http://www.internetnews.com/bus-news/article.php/3427831
• Sullivan, D. (2000). Survey reveals search habits. The Search Engine Report. Retrieved 2004, from http://www.searchenginewatch.com/sereport/00/06realnames.html
• Wall, A. (2004). History of search engines & web history. Retrieved 2004, from http://www.search-marketing.info/search-engine-history/
• Watters, C., & Amoudi, G. (2003). GeoSearcher: location-based ranking of search engine results. Journal of the American Society for Information Science and Technology, 54(2), 140-151.
• Webster, P. (2004). Metasearching in an academic environment. Online, 28(2), 20-23.

