A research project on SEARCH ENGINE
University of Northern Virginia
CSCI 587 SEC 1220, SPECIAL TOPICS IN INFORMATION TECHNOLOGY-1 6/20/2010
Abstract of the Project
A web search engine is designed to search for information on the World Wide Web. The search results are usually presented as a list and are commonly called hits. The information may consist of web pages, images, and other types of files. Some search engines also mine data available in databases or open directories. Unlike web directories, which are maintained by human editors, search engines operate algorithmically or with a mixture of algorithmic and human input. In this project we discuss the types of search engines; how a search engine works and finds information for the user; the processes going on behind the screen; how many search engines exist today to provide information and facts to computer users; the history of search engines; the different stages a search engine goes through when searching for information; and the features of web searching. We also cover related topics such as the Advanced Research Projects Agency Network, what a bot really is, the types of search queries we use when seeking information with a search engine, web directories, famous search engines like Google and Yahoo! and their roles as search engines, challenges in language processing, and the characteristics of search engines. I chose this topic for my project because I find the working of search engines interesting, and I want everyone to come across this topic and learn from it: many people use search engines, but they do not know what really goes on behind the screen. At the end of the project I have also given the references from which the topics of this project were selected. I hope you like this project and accept it as my topic for this course.
The project entitled "SEARCH ENGINE" is entirely my own effort. It is my duty to acknowledge everyone who was directly or indirectly involved with the project, and without whom it would not have gained its structure. Accordingly, sincere thanks to PROF. SOUROSHI for his support, valuable suggestions, and timely advice, without which the project would not have been completed in time. I also thank the many others who helped me throughout the project and made it successful.
CONTENTS

PRELIMINARIES
Acknowledgement

1. History of search engines
   Types of search queries
   World Wide Web Wanderer
   ALIWEB
   Primitive web search
2. Working of a search engine
   Web crawling
   Indexing
   Searching
3. New features for web searching
1. History of Search Engines: From 1945 to Google 2007

This history spans four eras: 1. Early Technology, 2. Directories, 3. Vertical Search, and 4. Search Engine Marketing.

As We May Think (1945): The concept of hypertext and a memory extension really came to life in July of 1945, when, after enjoying the scientific camaraderie that was a side effect of WWII, Vannevar Bush's As We May Think was published in The Atlantic Monthly. He urged scientists to work together to help build a body of knowledge for all mankind. Here are a few selected sentences and paragraphs that drive his point home:

Specialization becomes increasingly necessary for progress, and the effort to bridge between disciplines is correspondingly superficial.

A record, if it is to be useful to science, must be continuously extended, it must be stored, and above all it must be consulted.

Our ineptitude in getting at the record is largely caused by the artificiality of the systems of indexing. ... Having found one item, moreover, one has to emerge from the system and re-enter on a new path. The human mind does not work this way. It operates by association. ... Man cannot hope fully to duplicate this mental process artificially, but he certainly ought to be able to learn from it. In minor ways he may even improve, for his records have relative permanency.

Bush then proposed the idea of a virtually limitless, fast, reliable, extensible, associative memory storage and retrieval system. He named this device a memex. He not only was a firm believer in storing data, but he also believed that if the data source was to be useful to the human mind, we should have it represent how the mind works to the best of our abilities.
Gerard Salton (1960s-1990s): Gerard Salton, who died on August 28th of 1995, was the father of modern search technology. His teams at Harvard and Cornell developed the SMART informational retrieval system. Salton's Magic Automatic Retriever of Text included important concepts like the vector space model, Inverse Document Frequency (IDF), Term Frequency (TF), term discrimination values, and relevancy feedback mechanisms.

Ted Nelson: Ted Nelson created Project Xanadu in 1960 and coined the term hypertext in 1963. His goal with Project Xanadu was to create a computer network with a simple user interface that solved many social problems, like attribution. There is still conflict surrounding the exact reasons why Project Xanadu failed to take off. While Ted was against complex markup code, broken links, and many other problems associated with traditional HTML on the WWW, much of the inspiration to create the WWW was drawn from Ted's work.

Advanced Research Projects Agency Network: ARPANet is the network which eventually led to the internet. The Wikipedia has a great background article on ARPANet, and Google Video has a free, interesting video about ARPANet from 1972.

Archie (1990): The first few hundred web sites began in 1993, and most of them were at colleges, but long before most of them existed came Archie. The first search engine created was Archie, created in 1990 by Alan Emtage, a student at McGill University in Montreal. The original intent of the name was "archives," but it was shortened to Archie.
File Transfer Protocol: Tim Berners-Lee existed at this point; however, there was no World Wide Web. The main way people shared data back then was via File Transfer Protocol (FTP). If you had a file you wanted to share, you would set up an FTP server; if someone was interested in retrieving the data, they could use an FTP client. This process worked effectively in small groups, but the data became as much fragmented as it was collected. Archie helped solve this data scatter problem by combining a script-based data gatherer with a regular expression matcher for retrieving file names matching a user query. Essentially, Archie became a database of web filenames which it would match with users' queries. Bill Slawski has more background on Archie here.

Veronica & Jughead: As word of mouth about Archie spread, it gained such popularity that the University of Nevada System Computing Services group developed Veronica. Veronica served the same purpose as Archie, but it worked on plain text files. Soon another user interface named Jughead appeared with the same purpose as Veronica. Both of these were used for files sent via Gopher, which was created as an Archie alternative by Mark McCahill at the University of Minnesota in 1991.

Tim Berners-Lee & the WWW (1991): From the Wikipedia: While an independent contractor at CERN from June to December 1980, Berners-Lee proposed a project based on the concept of hypertext, to facilitate sharing and updating information among researchers. With help from Robert Cailliau he built a prototype system named Enquire. After leaving CERN in 1980 to work at John Poole's Image Computer Systems Ltd., he returned in 1984 as a fellow. In 1989, CERN was the largest Internet node in Europe, and Berners-Lee saw an opportunity to join hypertext with the Internet. In his words, "I just had to take the hypertext idea and connect it to the TCP and DNS ideas and — ta-da! — the World Wide Web". He used similar ideas to those underlying the Enquire system to create the World Wide Web, for which he designed and built the first web browser and editor (called WorldWideWeb and developed on NeXTSTEP) and the first Web server, called httpd (short for HyperText Transfer Protocol daemon).

The first Web site built was at http://info.cern.ch/ and was first put online on August 6, 1991. It provided an explanation of what the World Wide Web was, how one could own a browser, and how to set up a Web server. It was also the world's first Web directory, since Berners-Lee maintained a list of other Web sites apart from his own. Tim also created the Virtual Library, which is the oldest catalogue of the web, and wrote a book about creating the web, titled Weaving the Web. In 1994, Berners-Lee founded the World Wide Web Consortium (W3C) at the Massachusetts Institute of Technology.

What is a Bot? Computer robots are simply programs that automate repetitive tasks at speeds impossible for humans to reproduce. The term bot on the internet is usually used to describe anything that interfaces with the user or that collects data. Another bot example could be Chatterbots, which are resource heavy on a specific topic. These bots attempt to act like a human and communicate with humans on said topic.

Types of Search Queries: Andrei Broder authored A Taxonomy of Web Search [PDF], which notes that most searches fall into the following 3 categories:
• Informational - seeking static information about a topic
• Transactional - shopping at, downloading from, or otherwise interacting with the result
• Navigational - send me to a specific URL
Nancy Blachman's Google Guide offers searchers free Google search tips, and Greg R. Notess's Search Engine Showdown offers a search engine features chart. There are also many popular smaller vertical search services. For example, Del.icio.us allows you to search URLs that users have bookmarked, and Technorati allows you to search blogs.

World Wide Web Wanderer: Soon the web's first robot came. In June 1993 Matthew Gray introduced the World Wide Web Wanderer. He initially wanted to measure the growth of the web and created this bot to count active web servers. He soon upgraded the bot to capture actual URLs. His database became known as the Wandex. The Wanderer was as much of a problem as it was a solution because it caused system lag by accessing the same page hundreds of times a day. It did not take long for him to fix this software, but people started to question the value of bots.

ALIWEB: In October of 1993 Martijn Koster created Archie-Like Indexing of the Web, or ALIWEB, in response to the Wanderer. ALIWEB crawled meta information and allowed users to submit the pages they wanted indexed, with their own page description. This meant it needed no bot to collect data and was not using excessive bandwidth. The downside of ALIWEB is that many people did not know how to submit their site.

Robots Exclusion Standard: Martijn Koster also hosts the web robots page, which created standards for how search engines should index or not index content. This allows webmasters to block bots from their site on a whole-site level or page-by-page basis. By default, if information is on a public web server and people link to it, search engines generally will index it. In 2005 Google led a crusade against blog comment spam, creating a nofollow attribute that can be applied at the individual link level. After this was pushed through, Google quickly changed the scope of the purpose of the link nofollow to claim it was for any link that was sold or not under editorial control.

Primitive Web Search: By December of 1993, three full-fledged bot-fed search engines had surfaced on the web: JumpStation, the World Wide Web Worm, and the Repository-Based Software Engineering (RBSE) spider. JumpStation gathered info about the title and header from Web pages and retrieved these using a simple linear search. As the web grew, JumpStation slowed to a stop. The WWW Worm indexed titles and URLs. The problem with JumpStation and the World Wide Web Worm is that they listed results in the order that they found them, and provided no discrimination. The RBSE spider did implement a ranking system.
Since early search algorithms did not do adequate link analysis or cache full page content, if you did not know the exact name of what you were looking for, it was extremely hard to find it.

Web Directories:

VLib: When Tim Berners-Lee set up the web he created the Virtual Library, which became a loose confederation of topical experts maintaining relevant topical link lists.

EINet Galaxy: The EINet Galaxy web directory was born in January of 1994. It was organized similar to how web directories are today. The web size in early 1994 did not really require a web directory; however, other directories soon did follow. The biggest reason the EINet Galaxy became a success was that it also contained Gopher and Telnet search features in addition to its web search feature.

Excite: Excite came from the project Architext, which was started in February 1993 by six Stanford undergrad students. They had the idea of using statistical analysis of word relationships to make searching more efficient, and in mid-1993 they released copies of their search software for use on web sites. They were soon funded. Excite was bought by a broadband provider named @Home in January 1999 for $6.5 billion, and was named Excite@Home. In October 2001 Excite@Home filed for bankruptcy, and InfoSpace bought Excite from bankruptcy court for $10 million.

Yahoo! Directory: In April 1994 David Filo and Jerry Yang created the Yahoo! Directory as a collection of their favorite web pages. As their number of links grew, they had to reorganize and become a searchable directory. What set the directories above The Wanderer is that they provided a human-compiled description with each URL. As time passed and the Yahoo! Directory grew, Yahoo! began charging commercial sites for inclusion, and the inclusion rates for listing a commercial site increased; the current cost is $299 per year. Many informational sites are still added to the Yahoo! Directory for free.

Open Directory Project: In 1998 Rich Skrenta and a small group of friends created the Open Directory Project, a directory which anybody can download and use in whole or part. The ODP (also known as DMOZ) is the largest internet directory, almost entirely run by a group of volunteer editors. The Open Directory Project grew out of the frustration webmasters faced waiting to be included in the Yahoo! Directory. Netscape bought the Open Directory Project in November 1998, and later that same month AOL announced the intention of buying Netscape in a $4.5 billion all-stock deal.

LII: Google offers a librarian newsletter to help librarians and other web editors make information more accessible and categorize the web. The second Google librarian newsletter came from Karen G. Schneider, the director of the Librarians' Internet Index. LII is a high quality directory aimed at librarians, and her article explains what she and her staff look for when looking for quality, credible resources to add to the LII. The Internet Public Library is another well-kept directory of websites.

Business.com: Business.com, for example, is a directory of business websites. There are also numerous smaller industry, vertically, or locally oriented directories. Due to the time-intensive nature of running a directory, and the general lack of scalability of the business model, the quality and size of directories sharply drops off after you get past the first half dozen or so general directories. Most other directories, especially those which have a paid inclusion option, hold lower standards than selected limited catalogs created by librarians.
Looksmart: Looksmart was founded in 1995. They competed with the Yahoo! Directory by frequently increasing their inclusion rates back and forth. In 2002 Looksmart transitioned into a pay-per-click provider, which charged listed sites a flat fee per click. That caused the demise of any good faith or loyalty they had built up, although it allowed them to profit by syndicating those paid listings to some major portals like MSN. The problem was that Looksmart became too dependent on MSN, and in 2003, when Microsoft announced they were dumping Looksmart, that basically killed their business model. In 1998 Looksmart tried to expand their directory by buying the non-commercial Zeal directory for $20 million, but on March 28, 2006 Looksmart shut down the Zeal directory, hoping to drive traffic using Furl, a social bookmarking program. Looksmart also owns a catalog of content articles organized in vertical sites. In March of 2002 Looksmart bought a search engine by the name of WiseNut, but it never gained traction; due to limited relevancy, Looksmart has lost most (if not all) of their momentum.

WebCrawler: Brian Pinkerton of the University of Washington released WebCrawler on April 20, 1994. It was the first crawler which indexed entire pages. Soon it became so popular that during daytime hours it could not be used. AOL eventually purchased WebCrawler and ran it on their network. Then in 1997, Excite bought out WebCrawler, and AOL began using Excite to power its NetFind. WebCrawler opened the door for many other services to follow suit. Within a year of its debut came Lycos, Infoseek, and OpenText.

Lycos: Lycos was the next major search development, having been designed at Carnegie Mellon University around July of 1994. Michael Mauldin was responsible for this search engine and remains the chief scientist at Lycos Inc. On July 20, 1994, Lycos went public with a catalog of 54,000 documents. In addition to providing ranked relevance retrieval, Lycos provided prefix matching and word proximity bonuses. But Lycos' main difference was the sheer size of its catalog: by August 1994, Lycos had identified 394,000 documents; by January 1995, the catalog had reached 1.5 million documents; and by November 1996, Lycos had indexed over 60 million documents, more than any other Web search engine. In October 1994, Lycos ranked first on Netscape's list of search engines by finding the most hits on the word 'surf'.

Infoseek: Infoseek also started out in 1994, claiming to have been founded in January. They really did not bring a whole lot of innovation to the table, but they offered a few add-ons, and in December 1995 they convinced Netscape to use them as their default search, which gave them major exposure. One popular feature of Infoseek was allowing webmasters to submit a page to the search index in real time, which was a search spammer's paradise.

AltaVista: AltaVista's debut online came during this same month. AltaVista brought many important features to the web scene. They had nearly unlimited bandwidth (for that time), they were the first to allow natural language queries and advanced searching techniques, and they allowed users to add or delete their own URL within 24 hours. They even allowed inbound link checking. AltaVista also provided numerous search tips and advanced search features. Due to poor mismanagement, a fear of result manipulation, and portal-related clutter, AltaVista was largely driven into irrelevancy around the time Inktomi and Google started becoming popular. On February 18, 2003, Overture signed a letter of intent to buy AltaVista for $80 million in stock and $60 million cash. After Yahoo! bought out Overture, they rolled some of the AltaVista technology into Yahoo! Search, and occasionally use AltaVista as a testing platform.

Inktomi: The Inktomi Corporation came about on May 20, 1996 with its search engine Hotbot. Two Cal Berkeley cohorts created Inktomi from the improved technology gained from their research. Hotwire listed this site and it became hugely popular quickly. In October of 2001 Danny Sullivan wrote an article titled Inktomi Spam Database Left Open To Public, which highlights how Inktomi accidentally allowed the public to access their database of spam sites. Although Inktomi pioneered the paid inclusion model, it was nowhere near as efficient as the pay-per-click auction model developed by Overture. Licensing their search results also was not profitable enough to pay for their scaling costs. They failed to develop a profitable business model, and sold out to Yahoo! for approximately $235 million, or $1.65 a share, in December of 2003.

AllTheWeb: AllTheWeb was a search technology platform launched in May of 1999 to showcase Fast's search technologies. They had a sleek user interface with rich advanced search features, but on February 23, 2003, AllTheWeb was bought by Overture for $70 million. After Yahoo! bought out Overture, they rolled some of the AllTheWeb technology into Yahoo! Search, and occasionally use AllTheWeb as a testing platform.

Ask.com (Formerly Ask Jeeves): In April of 1997 Ask Jeeves was launched as a natural language search engine. Ask Jeeves used human editors to try to match search queries. Ask was powered by DirectHit for a while, which aimed to rank results based on their popularity, but that technology proved too easy to spam as the core algorithm component. In 2000 the Teoma search engine was released, which uses clustering to organize sites by Subject Specific Popularity, which is another way of saying they tried to find local web communities. Jon Kleinberg's Authoritative sources in a hyperlinked environment [PDF] was a source of inspiration that led to the eventual creation of Teoma. Mike Grehan's Topic Distillation [PDF] also explains how subject-specific popularity works. In 2001 Ask Jeeves bought Teoma to replace the DirectHit search technology.

Vertical Search: On November 15, 2005, Google launched a product called Google Base, which is a database of just about anything imaginable. Users can upload items and title, describe, and tag them as they see fit. Based on usage statistics, this tool can help Google understand which vertical search products they should create or place more emphasis on. Google also has a Scholar search program which aims to make scholarly research easier to do. They believe that owning other verticals will allow them to drive more traffic back to their core search service. They also believe that targeted, measured advertising associated with search can be carried over to other mediums; for example, Google bought dMarc, a radio ad placement firm. Yahoo! has also tried to extend their reach by buying other high traffic properties, like the photo sharing site Flickr and the social bookmarking site del.icio.us.

Google initially started off by allowing textual ads in numerous formats, but eventually added image ads and video ads. Advertisers could choose which keywords they wanted to target and which ad formats they wanted to market. In April 2003, Google bought Applied Semantics, which had CIRCA technology that allowed them to drastically improve the targeting of those ads. Google adopted the name AdSense for the new ad program.

Google AdSense: On March 4, 2003, Google announced their content targeted ad network. AdSense allows web publishers large and small to automate the placement of relevant ads on their content. Google also allows some publishers to place AdSense ads in their feeds, and some select publishers can place ads in emails. To help grow the network and make the market more efficient, Google added a link which allows advertisers to sign up for an AdWords account from content websites, and Google allowed advertisers to buy ads targeted to specific websites, pages, or demographic categories. Ads targeted on websites are sold on a cost per thousand impression (CPM) basis in an ad auction against other keyword-targeted and site-targeted ads. Google also introduced what they called smart pricing: smart pricing automatically adjusts the click cost of an ad based on what Google perceives a click from that page to be worth. An ad on a digital camera review page would typically be worth more than a click from a page with pictures on it. To prevent the erosion of the value of search ads, Google allows advertisers to opt out of placing their ads on content sites.

Yahoo! Search Marketing: Yahoo! Search Marketing is the rebranded name for Overture after Yahoo! bought them out. As of September 2006, their platform is generally the exact same as the old Overture platform, with the same flaws: ad CTR is not factored into click cost, it's hard to run local ads, and it is just generally clunky.

Microsoft AdCenter: Microsoft AdCenter was launched on May 3, 2006. Microsoft's ad algorithm includes both cost per click and ad clickthrough rate. On the features front, Microsoft added demographic targeting and dayparting features to the pay-per-click mix. While Microsoft has limited marketshare, they intend to increase their marketshare by baking search into Internet Explorer 7.
Microsoft also created the XBox game console, and on May 4, 2006 announced they bought a video game ad targeting firm named Massive Inc. Eventually, video game ads will be sold from within Microsoft AdCenter.

Google

Early Years: Google's corporate history page has a pretty strong background on Google, starting from when Larry met Sergey at Stanford right up to the present day. In 1995 Larry Page met Sergey Brin at Stanford. By January of 1996, Larry and Sergey had begun collaboration on a search engine called BackRub, named for its unique ability to analyze the "back links" pointing to a given website. BackRub ranked pages using citation notation, a concept which is popular in academic circles. If someone cites a source, they usually think it is important. On the web, links act as citations. In the PageRank algorithm, links count as votes, but some votes count more than others. Your ability to rank and the strength of your ability to vote for others depends upon your authority: how many people link to you and how trustworthy those links are.

Larry, who had always enjoyed tinkering with machinery and had gained some notoriety for building a working printer out of Lego™ bricks, took on the task of creating a new kind of server environment that used low-end PCs instead of big expensive machines. Afflicted by the perennial shortage of cash common to graduate students everywhere, the pair took to haunting the department's loading docks in hopes of tracking down newly arrived computers that they could borrow for their network. Buzz about the new search technology began to build as word spread around campus, and a year later their unique approach to link analysis was earning BackRub a growing reputation among those who had seen it. Sergey tried to shop their PageRank technology, but nobody was interested in buying or licensing their search technology at that time. In 1998, Google was launched.

Winning the Search War: Later that year Andy Bechtolsheim gave them $100,000 in seed funding, and Google received $25 million from Sequoia Capital and Kleiner Perkins Caufield & Byers the following year. In 1999 AOL selected Google as a search partner, and Yahoo! followed suit a year later. In 2000 Google also launched their popular Google Toolbar. In 2000 Google relaunched their AdWords program to sell ads on a CPM basis; in 2002 they retooled the service, selling ads in an auction which would factor in bid price and ad clickthrough rate. On May 1, 2002, AOL announced they would use Google to deliver their search-related ads, which was a strong turning point in Google's battle against Overture. In 2003 Google also launched their AdSense program, which allowed them to expand their ad network by selling targeted ads on other websites. Google has gained search market share year over year ever since.

Going Public: Google used a two-class stock structure, decided not to give earnings guidance, and offered shares of their stock in a Dutch auction. They received virtually limitless negative press for the perceived hubris they expressed in their "AN OWNER'S MANUAL" FOR GOOGLE'S SHAREHOLDERS. After some controversy surrounding an interview in Playboy, Google dropped their IPO offer range from $108 to $135 per share down to $85 to $95. Google went public at $85 a share on August 19, 2004, and its first trade was at 11:56 am ET at $100.01.

Just Search. We Promise!

Verticals Galore! In addition to running the world's most popular search service, Google also runs a large number of vertical search services, including:
• Google News: Google News launched in beta in September 2002. On September 6, 2006, Google announced an expanded Google News Archive Search that goes back over 200 years.
• Google Scholar: On November 18, 2004, Google launched Google Scholar, an academic search program.
• Google Book Search: On October 6, 2004, Google launched Google Book Search.
• Google Blog Search: On September 14, 2005, Google announced Google Blog Search.
• Google Base: On November 15, 2005, Google announced the launch of Google Base, a database of uploaded information describing online or offline content, products, or services.
• Google Video: On January 6, 2006, Google announced their Google Video product.
• Google Universal Search: On May 16, 2007, Google began mixing many of their vertical results into their organic search results.
Microsoft: In 1998 MSN Search was launched, but Microsoft did not get serious about search until after Google proved the business model. Until Microsoft saw the light, they primarily relied on partners like Overture, Looksmart, and Inktomi to power their search service. They launched a technology preview of their search engine around July 1st of 2004, and formally switched from Yahoo! organic search results to their own in-house technology on January 31st, 2005. MSN announced they dumped Yahoo!'s search ad program on May 4th, 2006. On September 11, 2006, Microsoft announced they were launching their Live Search product.
2. Working of a search engine

A search engine operates in the following order:
1. Web crawling
2. Indexing
3. Searching

Web search engines work by storing information about many web pages, which they retrieve from the WWW itself. These pages are retrieved by a Web crawler (sometimes also known as a spider) — an automated Web browser which follows every link it sees. Exclusions can be made by the use of robots.txt. The contents of each page are then analyzed to determine how it should be indexed (for example, words are extracted from the titles, headings, or special fields called meta tags). Data about web pages are stored in an index database for use in later queries. Some search engines, such as Google, store all or part of the source page (referred to as a cache) as well as information about the web pages, whereas others, such as AltaVista, store every word of every page they find. This cached page always holds the actual search text, since it is the one that was actually indexed, so it can be very useful when the content of the current page has been updated and the search terms are no longer in it.

When a user enters a query into a search engine (typically by using key words), the engine examines its index and provides a listing of best-matching web pages according to its criteria, usually with a short summary containing the document's title and sometimes parts of the text. Most search engines support the use of the Boolean operators AND, OR and NOT to further specify the search query. Some search engines provide an advanced feature called proximity search, which allows users to define the distance between keywords.

A web crawler (also known as a web spider, web robot, or—especially in the FOAF community—web scutter) is a program or automated script which browses the World Wide Web in a methodical, automated manner. Other less frequently used names for web crawlers are ants, automatic indexers, bots, and worms. A web crawler is one type of bot, or software agent. In general, it starts with a list of URLs to visit, called the seeds. As the crawler visits these URLs, it identifies all the hyperlinks in the page and adds them to the list of URLs to visit, called the crawl frontier. URLs from the frontier are recursively visited according to a set of policies. This process is called web crawling or spidering. Many sites, in particular search engines, use spidering as a means of providing up-to-date data. Web crawlers are mainly used to create a copy of all the visited pages for later processing by a search engine that will index the downloaded pages to provide fast searches. Crawlers can also be used for automating maintenance tasks on a website, such as checking links or validating HTML code. Also, crawlers can be used to gather specific types of information from Web pages, such as harvesting e-mail addresses (usually for spam).
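The seed-and-frontier loop just described can be sketched in a few lines of Python. This is only an illustrative toy, not any engine's real implementation: it ignores the politeness and re-visit policies discussed below, and the regular expression stands in for a real HTML parser.

```python
# Minimal sketch of a crawl loop: seed URLs, a crawl frontier, recursive visits.
from collections import deque
from urllib.parse import urljoin
from urllib.request import urlopen
import re

def crawl(seeds, max_pages=100):
    frontier = deque(seeds)   # URLs waiting to be visited (the crawl frontier)
    visited = set()           # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except OSError:
            continue          # unreachable pages are simply skipped
        visited.add(url)
        # Extract hyperlinks; a crude regex stands in for a real HTML parser.
        for href in re.findall(r'href="([^"]+)"', html):
            frontier.append(urljoin(url, href))
    return visited
```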
Crawling policies

There are three important characteristics of the Web that make crawling it very difficult:
• its large volume,
• its fast rate of change, and
• dynamic page generation,
which combine to produce a wide variety of possible crawlable URLs.

The large volume implies that the crawler can only download a fraction of the web pages within a given time, so it needs to prioritize its downloads. The high rate of change implies that by the time the crawler is downloading the last pages from a site, it is very likely that new pages have been added to the site, or that pages have already been updated or even deleted.

The recent increase in the number of pages being generated by server-side scripting languages has also created difficulty, in that endless combinations of HTTP GET parameters exist, only a small selection of which will actually return unique content. For example, a simple online photo gallery may offer three options to users, as specified through HTTP GET parameters. If there exist four ways to sort images, three choices of thumbnail size, two file formats, and an option to disable user-provided content, then that same set of content can be accessed with forty-eight different URLs, all of which will be present on the site (the sketch after the policy list below enumerates this example). This mathematical combination creates a problem for crawlers, as they must sort through endless combinations of relatively minor scripted changes in order to retrieve unique content.

The behavior of a web crawler is the outcome of a combination of policies:
• A selection policy that states which pages to download.
• A re-visit policy that states when to check for changes to the pages.
• A politeness policy that states how to avoid overloading websites.
• A parallelization policy that states how to coordinate distributed web crawlers.
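The photo gallery example works out as 4 × 3 × 2 × 2 = 48 URLs for one set of content. The parameter names below are made up purely for the illustration:

```python
# Enumerating the gallery example: every GET-parameter combination is a
# distinct crawlable URL, yet all 48 serve the same underlying content.
from itertools import product

sorts = ["date", "name", "size", "rating"]   # four ways to sort images
thumbs = ["small", "medium", "large"]        # three thumbnail sizes
formats = ["jpg", "png"]                     # two file formats
user_content = ["on", "off"]                 # user-provided content toggle

urls = [
    f"/gallery?sort={s}&thumb={t}&fmt={f}&user={u}"
    for s, t, f, u in product(sorts, thumbs, formats, user_content)
]
print(len(urls))  # 48
```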
Selection policy

Given the current size of the Web, even large search engines cover only a portion of the publicly available internet. As a crawler always downloads just a fraction of the Web pages, it is highly desirable that the downloaded fraction contains the most relevant pages, and not just a random sample of the Web. As Edwards et al. noted, "Given that the bandwidth for conducting crawls is neither infinite nor free, it is becoming essential to crawl the Web in not only a scalable, but efficient way, if some reasonable measure of quality or freshness is to be maintained." A crawler must carefully choose at each step which pages to visit next. This requires a metric of importance for prioritizing Web pages: the importance of a page is a function of its intrinsic quality, its popularity in terms of links or visits, and even of its URL (the latter is the case of vertical search engines restricted to a single top-level domain, or search engines restricted to a fixed Web site). Designing a good selection policy has an added difficulty: it must work with partial information, as the complete set of Web pages is not known during crawling.

Abiteboul (Abiteboul et al., 2003) designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation). In OPIC, each page is given an initial sum of "cash" which is distributed equally among the pages it points to. Boldi et al. (Boldi et al., 2004) used simulation on subsets of the Web of 40 million pages from the .it domain and 100 million pages from the WebBase crawl, testing breadth-first against depth-first, random ordering and an omniscient strategy. Baeza-Yates et al. used simulation on two subsets of the Web of 3 million pages from the .gr and .cl domains, testing several crawling strategies. They showed that both the OPIC strategy and a strategy that uses the length of the per-site queues are better than breadth-first crawling, and that it is also very effective to use a previous crawl, when it is available, to guide the current one. Daneshpajouh et al. designed a community-based algorithm for discovering good seeds: their method crawls web pages with high PageRank from different communities in fewer iterations than a crawl starting from random seeds. One can extract good seeds from a previously crawled web graph using this method, and using these seeds a new crawl can be very effective.

Restricting followed links: A crawler may only want to seek out HTML pages and avoid all other MIME types. In order to request only HTML resources, a crawler may make an HTTP HEAD request to determine a Web resource's MIME type before requesting the entire resource with a GET request. To avoid making numerous HEAD requests, a crawler may alternatively examine the URL and only request the resource if the URL ends with .htm, .html, or a slash; this strategy may cause numerous HTML Web resources to be unintentionally skipped. A similar strategy compares the extension of the web resource to a list of known HTML-page types: .html, .htm, .asp, .aspx, .php, and a slash.

Focused crawling: The main problem in focused crawling is that, in the context of a web crawler, we would like to be able to predict the similarity of the text of a given page to the query before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. Diligenti et al. propose to use the complete content of the pages already visited to infer the similarity between the driving query and the pages that have not been visited yet. The performance of focused crawling depends mostly on the richness of links in the specific topic being searched, and focused crawling usually relies on a general Web search engine for providing starting points. Looking ahead, Web 3.0 defines advanced technologies and new principles for the next generation of search technologies, summarized for the present in the Semantic Web and Website Parse Template concepts; Web 3.0 crawling and indexing technologies will be based on human-machine clever associations.

Path-ascending crawling: Some crawlers intend to download as many resources as possible from a particular Web site. Cothey (Cothey, 2004) introduced a path-ascending crawler that would ascend to every path in each URL that it intends to crawl. For example, when given a seed URL of http://llama.org/hamster/monkey/page.html, it will attempt to crawl /hamster/monkey/, /hamster/, and /. Cothey found that a path-ascending crawler was very effective in finding isolated resources, or resources for which no inbound link would have been found in regular crawling. Many path-ascending crawlers are also known as Harvester software, because they're used to "harvest" or collect all the content (perhaps the collection of photos in a gallery) from a specific page or host. A sketch of the idea follows.
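The following is a minimal sketch of the path-ascending idea, using the llama.org seed from the text. It only derives the ancestor paths; fetching them is left out:

```python
# Derive every ancestor path of a seed URL, as a path-ascending crawler would.
from urllib.parse import urlparse, urlunparse

def ascending_paths(seed):
    parts = urlparse(seed)
    segments = [s for s in parts.path.split("/") if s]
    urls = []
    # Drop one trailing segment at a time: /hamster/monkey/, /hamster/, /
    for i in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:i]) + ("/" if i else "")
        urls.append(urlunparse((parts.scheme, parts.netloc, path, "", "", "")))
    return urls

print(ascending_paths("http://llama.org/hamster/monkey/page.html"))
# ['http://llama.org/hamster/monkey/', 'http://llama.org/hamster/', 'http://llama.org/']
```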
Re-visit policy

The Web has a very dynamic nature, and crawling a fraction of the Web can take a really long time, usually measured in weeks or months. By the time a web crawler has finished its crawl, many events could have happened, including creations, updates and deletions. From the search engine's point of view, there is a cost associated with not detecting an event, and thus having an outdated copy of a resource. The most-used cost functions, introduced in (Cho and Garcia-Molina, 2000), are freshness and age.

Freshness: This is a binary measure that indicates whether the local copy is accurate or not. The freshness of a page p in the repository at time t is defined as:

F_p(t) = \begin{cases} 1 & \text{if } p \text{ is equal to the local copy at time } t \\ 0 & \text{otherwise} \end{cases}

Age: This is a measure that indicates how outdated the local copy is. The age of a page p in the repository at time t is defined as:

A_p(t) = \begin{cases} 0 & \text{if } p \text{ is not modified at time } t \\ t - \text{modification time of } p & \text{otherwise} \end{cases}

[Figure: evolution of freshness and age in Web crawling]

The objective of the crawler is to keep the average freshness of pages in its collection as high as possible, or to keep the average age of pages as low as possible. These objectives are not equivalent: in the first case, the crawler is just concerned with how many pages are outdated, while in the second case, the crawler is concerned with how old the local copies of pages are.
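As a toy illustration of the two definitions (assuming, unrealistically, that the true last-modification time of each live page is known):

```python
# Freshness and age of a local copy fetched at local_copy_time, observed at t.
def freshness(local_copy_time, last_modified, t):
    """1 if the live page has not changed since our fetch, else 0."""
    return 1 if last_modified <= local_copy_time else 0

def age(local_copy_time, last_modified, t):
    """0 while our copy is current; otherwise time elapsed since the change."""
    return 0 if last_modified <= local_copy_time else t - last_modified

# A page fetched at t=10 that changed at t=12, observed at t=15:
print(freshness(10, 12, 15), age(10, 12, 15))  # 0 3
```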
Two simple re-visiting policies were studied by Cho and Garcia-Molina:
• Uniform policy: This involves re-visiting all pages in the collection with the same frequency, regardless of their rates of change.
• Proportional policy: This involves re-visiting more often the pages that change more frequently. The visiting frequency is directly proportional to the (estimated) change frequency.
(In both cases, the repeated crawling order of pages can be done either at random or with a fixed order.)

The optimal re-visiting policy is neither the uniform policy nor the proportional policy. The optimal method for keeping average freshness high includes ignoring the pages that change too often, and the optimal for keeping average age low is to use access frequencies that monotonically (and sub-linearly) increase with the rate of change of each page. To improve freshness, we should penalize the elements that change too often (Cho and Garcia-Molina, 2003a).

Politeness policy

Crawlers can retrieve data much quicker and in greater depth than human searchers, so they can have a crippling impact on the performance of a site. Needless to say, if a single crawler is performing multiple requests per second and/or downloading large files, a server would have a hard time keeping up with requests from multiple crawlers. The costs of using web crawlers include:
• Network resources, as crawlers require considerable bandwidth and operate with a high degree of parallelism during a long period of time.
• Server overload, especially if the frequency of accesses to a given server is too high.
• Poorly written crawlers, which can crash servers or routers, or which download pages they cannot handle.
• Personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.

A partial solution to these problems is the robots exclusion protocol, also known as the robots.txt protocol, which is a standard for administrators to indicate which parts of their Web servers should not be accessed by crawlers. This standard does not include a suggestion for the interval of visits to the same server, even though this interval is the most effective way of avoiding server overload. Recently, commercial search engines like Ask Jeeves, MSN and Yahoo are able to use an extra "Crawl-delay:" parameter in the robots.txt file to indicate the number of seconds to delay between requests; a small sketch of honoring these rules follows below.

The first proposed interval between connections was 60 seconds. However, if pages were downloaded at this rate from a website with more than 100,000 pages over a perfect connection with zero latency and infinite bandwidth, it would take more than 2 months to download only that entire website; also, only a fraction of the resources from that Web server would be used. This does not seem acceptable. Anecdotal evidence from access logs shows that access intervals from known crawlers vary between 20 seconds and 3–4 minutes.

It is worth noticing that even when being very polite, and taking all the safeguards to avoid overloading web servers, some complaints from Web server administrators are received. Brin and Page note that: "... running a crawler which connects to more than half a million servers (...) generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen." For those using web crawlers for research purposes, a more detailed cost-benefit analysis is needed, and ethical considerations should be taken into account when deciding where to crawl and how fast to crawl.
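Python's standard library can read both the exclusion rules and the nonstandard Crawl-delay parameter; the sketch below shows one polite fetch cycle (the example.com URLs are placeholders):

```python
# Honor robots.txt rules and the Crawl-delay parameter before fetching.
import time
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("http://example.com/robots.txt")
rp.read()

url = "http://example.com/some/page.html"
if rp.can_fetch("MyCrawler", url):
    delay = rp.crawl_delay("MyCrawler") or 1  # fall back to 1 second
    time.sleep(delay)                         # wait between requests
    # ... fetch the page here ...
```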
Parallelization policy

Main article: Distributed web crawling

A parallel crawler is a crawler that runs multiple processes in parallel. The goal is to maximize the download rate while minimizing the overhead from parallelization, and to avoid repeated downloads of the same page. To avoid downloading the same page more than once, the crawling system requires a policy for assigning the new URLs discovered during the crawling process, as the same URL can be found by two different crawling processes.

Web crawler architectures

[Figure: high-level architecture of a standard Web crawler]

A crawler must not only have a good crawling strategy, as noted in the previous sections, but it should also have a highly optimized architecture. Web crawlers are a central part of search engines, and details on their algorithms and architecture are kept as business secrets. When crawler designs are published, there is often an important lack of detail that prevents others from reproducing the work. There are also emerging concerns about "search engine spamming", which prevent major search engines from publishing their ranking algorithms.

Crawler identification

Web crawlers typically identify themselves to a web server by using the User-agent field of an HTTP request. Web site administrators typically examine their web servers' log and use the user agent field to determine which crawlers have visited the web server and how often. The user agent field may include a URL where the Web site administrator may find out more information about the crawler. It is important for web crawlers to identify themselves so Web site administrators can contact the owner if needed. In some cases, crawlers may be accidentally trapped in a crawler trap, or they may be overloading a web server with requests, and the owner needs to stop the crawler. Identification is also useful for administrators that are interested in knowing when they may expect their Web pages to be indexed by a particular search engine. Spambots and other malicious Web crawlers are unlikely to place identifying information in the user agent field, or they may mask their identity as a browser or other well-known crawler.

URL normalization

Crawlers usually perform some type of URL normalization in order to avoid crawling the same resource more than once. The term URL normalization, also called URL canonicalization, refers to the process of modifying and standardizing a URL in a consistent manner. There are several types of normalization that may be performed, including conversion of URLs to lowercase, removal of "." and ".." segments, and adding trailing slashes to the non-empty path component (Pant et al., 2004).
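The three normalization steps named above can be sketched as follows. The trailing-slash heuristic (treating a dot-free final segment as a directory) is an assumption for the illustration, not part of any standard:

```python
# Sketch of URL normalization: lowercase scheme/host, resolve "." and "..",
# and add a trailing slash to directory-like paths.
from urllib.parse import urlparse, urlunparse
import posixpath

def normalize(url):
    p = urlparse(url)
    scheme = p.scheme.lower()
    netloc = p.netloc.lower()
    path = posixpath.normpath(p.path) if p.path else "/"
    # normpath drops trailing slashes; restore one for directory-like paths
    if "." not in posixpath.basename(path) and not path.endswith("/"):
        path += "/"
    return urlunparse((scheme, netloc, path, p.params, p.query, p.fragment))

print(normalize("HTTP://Example.COM/a/b/../c"))  # http://example.com/a/c/
```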
Early examples of crawlers include:
• RBSE was the first published web crawler. It was based on two programs: the first program, "spider", maintains a queue in a relational database, and the second program, "mite", is a modified www ASCII browser that downloads the pages from the Web.
• WebCrawler was used to build the first publicly-available full-text index of a subset of the Web. It was based on lib-WWW to download pages, and another program to parse and order URLs for breadth-first exploration of the Web graph. It also included a real-time crawler that followed links based on the similarity of the anchor text with the provided query.

Indexing

The purpose of storing an index is to optimize speed and performance in finding relevant documents for a search query. Without an index, the search engine would scan every document in the corpus, which would require considerable time and computing power. For example, while an index of 10,000 documents can be queried within milliseconds, a sequential scan of every word in 10,000 large documents could take hours. The additional computer storage required to store the index, as well as the considerable increase in the time required for an update to take place, are traded off for the time saved during information retrieval.

Search engine indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, physics and computer science. An alternate name for the process, in the context of search engines designed to find web pages on the Internet, is Web indexing. Popular engines focus on the full-text indexing of online, natural language documents; media types such as video, audio, and graphics are also searchable. Meta search engines reuse the indices of other services and do not store a local index, whereas cache-based search engines permanently store the index along with the corpus. Unlike full-text indices, partial-text services restrict the depth indexed to reduce index size. Larger services typically perform indexing at a predetermined time interval due to the required time and processing costs, while agent-based search engines index in real time.

Index Design Factors

Major factors in designing a search engine's architecture include:
• Merge factors: How data enters the index, or how words or subject features are added to the index during text corpus traversal, and whether multiple indexers can work asynchronously. The indexer must first check whether it is updating old content or adding new content. Traversal typically correlates to the data collection policy. Search engine index merging is similar in concept to the SQL Merge command and other merge algorithms.
• Storage techniques: How to store the index data, that is, whether information should be data compressed or filtered.
• Index size: How much computer storage is required to support the index.
• Lookup speed: How quickly a word can be found in the inverted index. The speed of finding an entry in a data structure, compared with how quickly it can be updated or removed, is a central focus of computer science.
• Maintenance: How the index is maintained over time.
• Fault tolerance: How important it is for the service to be reliable. Issues include dealing with index corruption, determining whether bad data can be treated in isolation, dealing with bad hardware, partitioning, and schemes such as hash-based or composite partitioning, as well as replication.

Index Data Structures

Search engine architectures vary in the way indexing is performed and in methods of index storage to meet the various design factors. Types of indices include the forward index and the inverted index, both described below.

Challenges in Parallelism

A major challenge in the design of search engines is the management of parallel computing processes. There are many opportunities for race conditions and coherent faults. For example, a new document is added to the corpus and the index must be updated, but the index simultaneously needs to continue responding to search queries. This is a collision between two competing tasks. Consider that authors are producers of information, and a web crawler is the consumer of this information, grabbing the text and storing it in a cache (or corpus). The forward index is the consumer of the information produced by the corpus, and the inverted index is the consumer of information produced by the forward index. This is commonly referred to as a producer-consumer model. The indexer is the producer of searchable information, and users are the consumers that need to search. The challenge is magnified when working with distributed storage and distributed processing. In an effort to scale with larger amounts of indexed information, the search engine's architecture may involve distributed computing, where the search engine consists of several machines operating in unison. This increases the possibilities for incoherency and makes it more difficult to maintain a fully-synchronized, distributed, parallel architecture.
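The scan-versus-index trade-off described above can be made concrete with three tiny documents (the same toy corpus used in the forward index illustration later in this section):

```python
# Contrast of the two lookup strategies: full scan versus prebuilt index.
docs = {
    1: "the cow says moo",
    2: "the cat and the hat",
    3: "the dish ran away with the spoon",
}

# Without an index: scan every word of every document on each query.
def scan(word):
    return [d for d, text in docs.items() if word in text.split()]

# With an index: build once, then answer each query with one dict lookup.
index = {}
for d, text in docs.items():
    for word in text.split():
        index.setdefault(word, set()).add(d)

print(scan("cow"), index.get("cow"))  # [1] {1}
```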
Inverted indices

Many search engines incorporate an inverted index when evaluating a search query, to quickly locate documents containing the words in a query and then rank these documents by relevance. Because the inverted index stores a list of the documents containing each word, the search engine can use direct access to find the documents associated with each word in the query in order to retrieve the matching documents quickly. The following is a simplified illustration of an inverted index:

Inverted Index
Word  | Documents
the   | Document 1, Document 3, Document 4, Document 5
cow   | Document 2, Document 3, Document 4
says  | Document 5
moo   | Document 7

This index can only determine whether a word exists within a particular document, since it stores no information regarding the frequency and position of the word; it is therefore considered to be a boolean index. Such an index determines which documents match a query but does not rank matched documents. In some designs the index includes additional information, such as the frequency of each word in each document or the positions of a word in each document. Position information enables the search algorithm to identify word proximity to support searching for phrases; frequency can be used to help in ranking the relevance of documents to the query. Such topics are the central research focus of information retrieval.

The inverted index is a sparse matrix, since not all words are present in each document. To reduce computer storage memory requirements, it is stored differently from a two-dimensional array. The index is similar to the term document matrices employed by latent semantic analysis. The inverted index can be considered a form of a hash table. In some cases the index is a form of a binary tree, which requires additional storage but may reduce the lookup time. In larger indices the architecture is typically a distributed hash table. Inverted indices can be programmed in several computer programming languages.

Index Merging

The inverted index is filled via a merge or rebuild. A rebuild is similar to a merge, but first deletes the contents of the inverted index. The architecture may be designed to support incremental indexing, where a merge identifies the document or documents to be added or updated and then parses each document into words. For technical accuracy, a merge conflates newly indexed documents, typically residing in virtual memory, with the index cache residing on one or more computer hard drives. After parsing, the indexer adds the referenced document to the document list for the appropriate words. In a larger search engine, the process of finding each word in the inverted index (in order to report that it occurred within a document) may be too time consuming, and so this process is commonly split up into two parts: the development of a forward index, and a process which sorts the contents of the forward index into the inverted index.
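A minimal sketch of the incremental merge step just described, assuming a dictionary-of-sets posting list (real engines use far more elaborate on-disk structures):

```python
# Merge a newly crawled document into an inverted index: parse it into words
# and add the document to each word's posting list.
inverted_index = {
    "the": {1, 3, 4, 5},
    "cow": {2, 3, 4},
    "says": {5},
    "moo": {7},
}

def merge_document(doc_id, text):
    for word in text.lower().split():
        inverted_index.setdefault(word, set()).add(doc_id)

merge_document(8, "the cow says moo")
print(sorted(inverted_index["moo"]))  # [7, 8]
```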
The Forward Index

The forward index stores a list of words for each document. The following is a simplified form of the forward index:

Forward Index
Document   | Words
Document 1 | the, cow, says, moo
Document 2 | the, cat, and, the, hat
Document 3 | the, dish, ran, away, with, the, spoon

The rationale behind developing a forward index is that as documents are parsed, it is better to immediately store the words per document. The delineation enables asynchronous system processing, which partially circumvents the inverted index update bottleneck. The forward index is sorted to transform it to an inverted index. The forward index is essentially a list of pairs consisting of a document and a word, collated by the document. Converting the forward index to an inverted index is only a matter of sorting the pairs by the words. In this regard, the inverted index is a word-sorted forward index; the inverted index is so named because it is an inversion of the forward index.

Compression

Generating or maintaining a large-scale search engine index represents a significant storage and processing challenge. Consider the following scenario for a full-text Internet search engine:
• An estimated 2,000,000,000 different web pages existed as of the year 2000.
• Suppose there are 250 words on each webpage (based on the assumption that they are similar to the pages of a novel).
• It takes 8 bits (or 1 byte) to store a single character. Some encodings use 2 bytes per character.
• The average number of characters in any given word on a page may be estimated at 5 (Wikipedia:Size comparisons).
• The average personal computer comes with 100 to 250 gigabytes of usable space.

Given this scenario, an uncompressed index (assuming a non-conflated, simple index) for 2 billion web pages would need to store 500 billion word entries. At 1 byte per character, or 5 bytes per word, this would require 2500 gigabytes of storage space alone, more than the average free disk space of 25 personal computers. This space requirement may be even larger for a fault-tolerant distributed storage architecture. (A worked version of this arithmetic appears below.) Many search engines utilize a form of compression to reduce the size of the indices on disk. Depending on the compression technique chosen, the index can be reduced to a fraction of this size. The tradeoff is the time and processing power required to perform compression and decompression. Thus compression is a measure of cost; notably, large-scale search engine designs incorporate the cost of storage as well as the costs of electricity to power the storage.
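The storage arithmetic from the scenario, together with the sort-based forward-to-inverted conversion described above, in a few lines:

```python
# Worked version of the back-of-the-envelope index size estimate.
pages = 2_000_000_000
words_per_page = 250
bytes_per_word = 5  # 5 characters at 1 byte each

entries = pages * words_per_page           # 500 billion word entries
print(entries)                             # 500000000000
print(entries * bytes_per_word / 10**9)    # 2500.0 gigabytes

# Converting a forward index to an inverted index by sorting pairs by word:
forward = [(1, "the"), (1, "cow"), (2, "the"), (2, "hat")]
pairs = sorted((w, d) for d, w in forward)  # the word-sorted forward index
print(pairs)  # [('cow', 1), ('hat', 2), ('the', 1), ('the', 2)]
```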
Document Parsing
Document parsing breaks apart the components (words) of a document or other form of media for insertion into the forward and inverted indices. The words found are called tokens, and so, in the context of search engine indexing and natural language processing, parsing is more commonly referred to as tokenization. It is also sometimes called word boundary disambiguation, tagging, text segmentation, content analysis, text analysis, text mining, concordance generation, speech segmentation, lexing, or lexical analysis. The terms 'indexing', 'parsing', and 'tokenization' are used interchangeably in corporate slang.

Natural language processing, as of 2006, is the subject of continuous research and technological improvement. Tokenization presents many challenges in extracting the necessary information from documents for indexing to support quality searching. Tokenization for indexing involves multiple technologies, the implementations of which are commonly kept as corporate secrets.

Challenges in Natural Language Processing

Word Boundary Ambiguity
Native English speakers may at first consider tokenization to be a straightforward task, but this is not the case with designing a multilingual indexer. In digital form, the texts of other languages such as Chinese, Japanese or Arabic represent a greater challenge, as words are not clearly delineated by whitespace. The goal during tokenization is to identify words for which users will search. Language-specific logic is employed to properly identify the boundaries of words, which is often the rationale for designing a parser for each language supported (or for groups of languages with similar boundary markers and syntax).

Language Ambiguity
To assist with properly ranking matching documents, many search engines collect additional information about each word, such as its language or lexical category (part of speech). These techniques are language-dependent, as the syntax varies among languages. Documents do not always clearly identify their language or represent it accurately, so in tokenizing the document some search engines attempt to automatically identify the language of the document.

Diverse File Formats
In order to correctly identify which bytes of a document represent characters, the file format must be correctly handled. Search engines which support multiple file formats must be able to correctly open and access the document and be able to tokenize the characters of the document.

Faulty Storage
The quality of the natural language data may not always be perfect. An unspecified number of documents, particularly on the Internet, do not closely obey proper file protocol, and binary characters may be mistakenly encoded into various parts of a document. Without recognition of these characters and appropriate handling, the index quality or indexer performance could degrade.
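To illustrate the word-boundary challenge above: a regular-expression tokenizer is adequate for whitespace-delimited languages such as English but finds nothing in unsegmented scripts, which is exactly why language-specific logic is needed. A rough sketch, assuming ASCII-only word characters:

```python
import re

# A first-approximation tokenizer for whitespace/punctuation-delimited
# languages. The ASCII-only character class is deliberate: it makes the
# failure mode on unsegmented scripts obvious.
WORD = re.compile(r"[A-Za-z0-9]+")

def tokenize_english(text):
    """Lowercase the text and return its ASCII word tokens."""
    return [match.group().lower() for match in WORD.finditer(text)]

print(tokenize_english("The cow says moo."))
# -> ['the', 'cow', 'says', 'moo']

# The same approach fails for languages without whitespace between
# words (e.g. Japanese): a dictionary- or model-based segmenter would
# have to be substituted for the regular expression.
print(tokenize_english("検索エンジン"))  # -> []
```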
Tokenization
Unlike literate humans, computers do not understand the structure of a natural language document and cannot automatically recognize words and sentences. To a computer, a document is only a sequence of bytes. Computers do not 'know' that a space character separates words in a document. Instead, humans must program the computer to identify what constitutes an individual or distinct word, referred to as a token. Such a program is commonly called a tokenizer, parser or lexer. Many search engines, as well as other natural language processing software, incorporate specialized programs for parsing, such as YACC or Lex.

During tokenization, the parser identifies sequences of characters which represent words and other elements, such as punctuation, which are represented by numeric codes, some of which are non-printing control characters. The parser can also identify entities such as email addresses, phone numbers, and URLs. When identifying each token, several characteristics may be stored, such as the token's case (upper, lower, mixed, proper), language or encoding, lexical category (part of speech, like 'noun' or 'verb'), position, sentence number, sentence position, length, and line number.

Language Recognition
If the search engine supports multiple languages, a common initial step during tokenization is to identify each document's language, since many of the subsequent steps are language-dependent (such as stemming and part-of-speech tagging). Language recognition is the process by which a computer program attempts to automatically identify, or categorize, the language of a document. Finding which language the words belong to may involve the use of a language recognition chart. Other names for language recognition include language classification, language analysis, language identification, and language tagging. Automated language recognition is the subject of ongoing research in natural language processing.

Format Analysis
If the search engine supports multiple document formats, documents must be prepared for tokenization. The challenge is that many document formats contain formatting information in addition to textual content. For example, HTML documents contain HTML tags, which specify formatting information such as new line starts, bold emphasis, and font size or style. Format analysis is the identification and handling of this formatting content embedded within documents, which controls the way the document is rendered on a computer screen or interpreted by a software program. If the search engine were to ignore the difference between content and 'markup', extraneous information would be included in the index, leading to poor search results.
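As a small illustration of format analysis, the sketch below uses Python's standard-library html.parser to keep text content while discarding markup (and skipping script/style blocks entirely). A production indexer would do far more, e.g. weighting text by the tag it appears in, so treat this as a minimal sketch.

```python
from html.parser import HTMLParser

class ContentExtractor(HTMLParser):
    """Collect text content; ignore tags and <script>/<style> bodies."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

extractor = ContentExtractor()
extractor.feed("<html><body><h1>Search</h1><p>Engines index <b>text</b>.</p>"
               "<script>var x = 1;</script></body></html>")
print(" ".join(extractor.parts))  # -> Search Engines index text .
```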
The keywords used to describe webpages (specifically, corporate-oriented webpages similar to product brochures) changed from descriptive to marketing-oriented keywords designed to drive sales by placing the webpage high in the search results for specific search queries. The fact that these keywords were subjectively specified was leading to spamdexing, which drove many search engines to adopt full-text indexing technologies in the 1990s. Search engine designers and companies could only place so many 'marketing keywords' into the content of a webpage before draining it of all interesting and useful information. Given that conflict of interest with the business goal of designing user-oriented websites which were 'sticky', the customer lifetime value equation was changed to incorporate more useful content into the website in hopes of retaining the visitor. In this sense, full-text indexing was more objective and increased the quality of search engine results, as it was one more step away from subjective control of search engine result placement, which in turn furthered research of full-text indexing technologies.

Desktop search is more under the control of the user, while Internet search engines must focus more on the full-text index. In desktop search, many solutions incorporate meta tags to provide a way for authors to further customize how the search engine will index content from various files that is not evident from the file content.

A web search query is a query that a user enters into a web search engine to satisfy his or her information needs. Web search queries are distinctive in that they are unstructured and often ambiguous; they vary greatly from standard query languages, which are governed by strict syntax rules.

Types
There are three broad categories that cover most web search queries:

• Informational queries – queries that cover a broad topic (e.g., colorado or trucks) for which there may be thousands of relevant results.
• Navigational queries – queries that seek a single website or web page of a single entity (e.g., youtube or delta airlines).
• Transactional queries – queries that reflect the intent of the user to perform a particular action, like purchasing a car or downloading a screen saver.

Search engines often support a fourth type of query that is used far less frequently:

• Connectivity queries – queries that report on the connectivity of the indexed web graph (e.g., Which links point to this URL? and How many pages are indexed from this domain name?).
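The taxonomy above lends itself to simple heuristics. The toy Python classifier below is my illustration rather than a published method; real engines infer query intent from trained models over click logs, and the keyword lists here are assumptions.

```python
# Toy intent classifier for the query taxonomy above.
ACTION_WORDS = {"buy", "download", "purchase", "order", "install"}
SITE_HINTS = (".com", ".org", ".net", "www.")

def classify_query(query):
    words = query.lower().split()
    if any(hint in query.lower() for hint in SITE_HINTS):
        return "navigational"   # looks like the name of a single site
    if any(word in ACTION_WORDS for word in words):
        return "transactional"  # the user wants to perform an action
    return "informational"      # default: a broad topic

for q in ["youtube.com", "buy used trucks", "colorado history"]:
    print(q, "->", classify_query(q))
# youtube.com -> navigational
# buy used trucks -> transactional
# colorado history -> informational
```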
Characteristics
Most commercial web search engines do not disclose their search logs, so information about what users are searching for on the Web is difficult to come by. Nevertheless, a 2001 study that analyzed the queries from the Excite search engine showed some interesting characteristics of web search:

• The average length of a search query was 2.4 terms.
• About half of the users entered a single query, while a little less than a third of users entered three or more unique queries.
• Close to half of the users examined only the first one or two pages of results (10 results per page).
• Less than 5% of users used advanced search features (e.g., Boolean operators like AND, OR, and NOT).
• The top three most frequently used terms were and, of, and sex.

A study of the same Excite query logs revealed that 19% of the queries contained a geographic term (e.g., place names, zip codes, geographic features, etc.). A 2005 study of Yahoo's query logs revealed that 33% of the queries from the same user were repeat queries and that 87% of the time the user would click on the same result. This suggests that many users use repeat queries to revisit or re-find information.

In addition, much research has shown that query term frequency distributions conform to the power law, or long-tail distribution curves. That is, a small portion of the terms observed in a large query log (e.g., more than 100 million queries) are used most often, while the remaining terms are used less often individually. This example of the Pareto principle (or 80-20 rule) allows search engines to employ optimization techniques such as index or database partitioning, caching and pre-fetching.
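A hedged sketch of how the power-law observation becomes an optimization: count query frequencies in a log and cache only the small "head" that answers most of the traffic. The log and the 80% target below are invented for illustration.

```python
from collections import Counter

# A made-up query log; real logs contain millions of entries.
query_log = ["weather", "weather", "weather", "weather",
             "maps", "maps", "jobs", "rare query about cows"]

counts = Counter(query_log)
total = sum(counts.values())

# Cache the most frequent queries until ~80% of traffic is covered;
# the long tail is served from the index as usual.
cache, covered = set(), 0
for query, freq in counts.most_common():
    if covered / total >= 0.8:
        break
    cache.add(query)
    covered += freq

print(cache)            # -> {'weather', 'maps', 'jobs'} for this toy log
print(covered / total)  # -> 0.875 (fraction answerable from cache)
```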
New Features for Web Searching
The incredible development of Web resources and services has become a motivation for many studies, and for companies to invest in developing new search engines or adding new features and abilities to their existing ones. AltaVista and other early search engines were built by indexing the content of Web pages. They built huge centralized indices, and this is still a part of every popular search engine. However, it became clear that the contents of a Web page alone could not be sufficient for capturing the huge amount of information on the Web. In 1996-1997, Google was designed based on a novel idea: that the link structure of the Web is an important resource for improving the results of search engines. Backlinks were used, based on the Hyperlink-Induced Topic Search (HITS) algorithm, to crawl billions of Web pages. Google not only used this approach to capture the biggest number of Web pages but also established PageRank, the ranking system that improved its search results (Brin & Page, 1998). Ma (2004) from Microsoft Research Asia reported features of the next generation of search engines at WISE04, including page structure analysis, the Deep Web and mobile search. By looking at the papers published in such conferences and other journals and seminars, we can track several specifications and shifts expected in the future.

4.1 Page Structure Analysis: The first search engines concentrated on Web page contents. After content-based indexing and link analysis, the new area of study is page and layout structure. Researchers have focused on Web page structure to increase the quality of search, and it is thought that Web page layout is a good resource for improving search results. For example, the value of information presented in <heading> tags can be greater than that of information in <paragraph> tags, and we can imagine also that a link in the middle of a Web page is more important than a link in a footnote. HTML and XML are important in this approach. MSN's new ranking model will be based on object-level rather than document-level ranking, and Web graph algorithms such as HITS might be applied to a sub-section of Web pages to improve search result ranking models. The automatic thesaurus construction method is a page structure method which extracts term relationships from the link structure of Websites. It is able to identify new terms and reflect the latest relationships between terms as the Web evolves. Experimental results have shown that the constructed thesaurus, when applied to query expansion, outperforms a traditional association thesaurus (Chen et al., 2003).

4.2 Deep Search: Current search engines can only crawl and capture a small part of the Web, which is called the "visible" or "indexable" Web. It is believed that the size of the invisible or deep Web is several times bigger than the size of the surface Web. A huge amount of scientific and other valuable information is behind closed doors: different databases, digital books and journals, library catalogues, patents, research reports and governmental archives are examples of resources that usually cannot be crawled and indexed by current search engines. The deep Web, with its structured information, is a potential resource that search companies are trying to capture. New search engines are trying to find suitable methods for penetrating the database barriers. Meanwhile, Google, MSN and many other popular search engines are competing to find a solution for the invisible Web.
BrightPlanet's "differencing" algorithm is designed to transfer queries across multiple deep Web resources at once, aggregating the results and letting users compare changes to those results over time. Yahoo has developed a paid service for searching the deep Web called the Content Aggregation Program (CAP); the method is secret, but the company does acknowledge that the program will give paying customers a more direct pipeline into its search database (Wright, 2004). As we already mentioned, the amazing size and valuable resources of the deep Web have affected the search engine industry, and the next generation of search engines is expected to be able to investigate deep Web information. Meanwhile, Web content providers are moving toward the Semantic Web by applying technologies such as XML and RDF (Resource Description Framework) in order to create more structured Web resources, and Google and Microsoft have started a big competition on Web searching through working on Web page blocks.
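Since this section credits link analysis, and PageRank in particular (Brin & Page, 1998), for Google's leap in result quality, here is a minimal power-iteration sketch over a toy four-page link graph. The graph and the damping factor 0.85 are customary illustrative choices, not anything specific to this article.

```python
# Minimal PageRank by power iteration over a toy link graph.
links = {
    "A": ["B", "C"],  # page A links to pages B and C
    "B": ["C"],
    "C": ["A"],
    "D": ["C"],
}
pages = list(links)
damping = 0.85
rank = {page: 1.0 / len(pages) for page in pages}

for _ in range(50):  # iterate until the ranks stabilize
    rank = {
        page: (1 - damping) / len(pages)
        + damping * sum(
            rank[src] / len(outs)      # each page shares its rank
            for src, outs in links.items()
            if page in outs            # ...among the pages it links to
        )
        for page in pages
    }

# C collects the highest rank: every other page links to it.
print(sorted(rank.items(), key=lambda item: -item[1]))
```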
4.3 Structured Data: The World Wide Web is considered a huge collection of unstructured data presented in billions of Web pages. Most documents available on the Web are unstructured resources, and most search engines just save a copy of Web pages in their repository and then build several indexes from the content of those pages, so they can judge documents only by keyword occurrence. Yet as a part of both the surface and deep Web, structured data resources are very important and valuable: in many cases, data is stored in tables and separate files. Traditional information retrieval and database management techniques have been used to extract data from different tables and resources and combine them to respond to users' queries. The concept of structured searching is different from the way search engines currently operate. As Rein (1997) says, a search engine supporting XML-based queries can be programmed to search structured resources; such an engine would rank words based on their location in a document, rather than just the number of times they appear. Current search engines cannot resolve this problem efficiently, but in the future an intelligent search engine will be able to distinguish different structured resources and combine their data to find a high-quality response to a complicated query.

4.4 Recommending Group Ranking: While many search engines are able to crawl and index billions of Web pages, sorting the results of each query is still an issue. Basic ranking algorithms are based on the occurrence rate of index terms in each page. The idea is simple: more relevant pages must take a higher rank. If the search term is mathematics, then a page that has the word mathematics 20 times must be ranked before a page which has it 10 times. However, this alone is not sufficient, and recently link information and page structure information have been used to improve rank quality; page ranking algorithms have been utilized to present better-ranked results.
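The basic occurrence-rate ranking just described is easy to write down, and its weakness is equally easy to demonstrate. The sketch below (toy documents, illustrative names) ranks by raw term count, which is exactly the "mathematics 20 times beats 10 times" rule, and exactly what keyword stuffing exploits.

```python
def rank_by_occurrence(term, documents):
    """Rank documents by how often the term occurs (the basic method)."""
    scores = {
        name: text.lower().split().count(term.lower())
        for name, text in documents.items()
    }
    return sorted(scores.items(), key=lambda item: -item[1])

docs = {
    "lecture notes": "mathematics " * 20 + "with worked examples",
    "introduction":  "mathematics " * 10 + "a well-regarded overview",
    "spam page":     "mathematics " * 500,  # keyword stuffing wins
}
print(rank_by_occurrence("mathematics", docs))
# The spam page ranks first, which is why link and structure
# information (and, per 4.4, human recommendation) get layered on top.
```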
These methods are automatic and are done by machines. However, it is believed that the best judgement about the importance and quality of Web pages is acquired when they are reviewed and recommended by human experts. In the future, search results will be ranked not only on the basis of automatic ranking algorithms but also by using the ideas of scholars and scientific recommending groups. Discussion thread recommendation or peer reviews are expected to be used by search engines to improve their results.

4.5 Federated Search: Also known as parallel search, metasearch or broadcast search, federated search aggregates multiple channels of information into a single searchable point. Federated searching has several advantages for users: it reduces the time needed to search several databases, and users do not need to know how to search through different interfaces (Fryer, 2004). Federated search engines are different from metasearch engines. Federated search mostly covers subscription-based databases that are usually part of the invisible Web and are ignored by Web-oriented metasearch engines, and usually there is no overlap between the databases covered by federated search engines. Metasearch engine services are free for users, while federated search engines are sold to libraries and other interested information service providers. One of the important reasons for the growing interest in federated searching is the complexity of the online materials environment, such as the increasing number of electronic journals and online full-text databases.
Webster (2004) maintains that although federated searching tools offer some real immediate advantages today, they cannot overcome the underlying problem of growing complexity and lack of uniformity: we need an open, interoperable and uniform e-content environment to fully provide the interconnected, accessible environment that librarians are seeking from metasearching. Another disadvantage of federated search engines is that they cannot be used for sophisticated search commands and queries, and are limited to basic Boolean search.

4.6 Mobile Search: The number of people who have a cell phone seems to be greater than the number of people who have a PC, and many other mobile technologies, such as GPS devices, are also widely used. Search engine companies have therefore focused on the big market of mobile phones and wireless telecommunication devices. Recently, Yahoo developed its mobile Web search system: mobile phone users can access Yahoo Local, Image and Web search, as well as quick links to stocks, sports scores and weather, for a fee. The platform also includes a modified Yahoo Instant Messaging client and Yahoo Mobile Games (Singer, 2004). In the future, everyone will have access to Web information and services through his/her wireless phone, without necessarily having a computer.
With the Beta version of Google Scholar (http://scholar.google.com) released in November 2004, other major players in the search engine industry are expected to invest in rivals for this new service. There will be a shift towards providing specialised search facilities for the scholarly part of the Web, which encompasses a considerable part of the deep Web.

Conclusion
The World Wide Web, with its short history, has experienced significant changes. The gigantic size of the Web and the vast variety of users' needs and interests, as well as the big potential of the Web as a commercial market, have brought about many changes and a great demand for better search engines. In this article, we reviewed the history of Web search tools and techniques and mentioned some big shifts in this field. While the first search engines were established based on traditional database and information retrieval methods, many other algorithms and methods have since been added to them to improve their results. Google utilized the Web graph, or link structure of the Web, to make one of the most comprehensive and reliable search engines. Meanwhile, many issues remain unsolved or incomplete. By looking at papers published in popular conferences on Web and information management, we see not only a considerable increase in the quantity of Web search research papers since 2001, but also that Web search and information retrieval topics such as ranking, filtering and query formulation are still hot topics. This reveals that search engines have many unsolved and research-interesting areas.

We mentioned several important issues for the future of search engines. The structure of Web pages seems to be a good resource with which search engines can improve their results. The next generations of search tools are expected to be able to extract structured data to offer high-quality responses to users' questions. Search engines are trying to bring the recommendations of special-interest groups into their search techniques. Limitations in funds have forced libraries and other major information-user organizations to share their online resources, and federated search is a sample of future cooperative search and information retrieval facilities. Local services and the personalization of search tools are two major ideas that have been studied for several years, though problems such as ambiguity in addresses and names remain. Finally, we addressed the efforts of search engine companies in breaking their borders by making search possible for mobile phones and other wireless information and communication devices. Information extraction, personalization and multimedia searching, among others, are major issues for the next few years, and the Web's security and privacy are two important issues for the coming years. The World Wide Web will be more usable in the future, and the Web search industry is opening new horizons for the global village.
References

• Brin, S., & Page, L. (1998). The anatomy of a large-scale hypertextual web search engine. Proceedings of the 7th International WWW Conference, Brisbane, Australia, 107-117.
• Chen, Z., Liu, S., Wenyin, L., Pu, G., & Ma, W. (2003). Building a web thesaurus from web link structure. Proceedings of the 26th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, Toronto, 48-55.
• Fryer, D. (2004). Federated search engines. Online, 28(2), 16-19.
• Gromov, G. (n.d.). History of Internet and WWW: the roads and crossroads of Internet history. Retrieved from http://www.netvalley.com/intvalstat.html
• Holzschlag, M. E. (2001). How specialization limited the Web. Retrieved from http://www.webtechniques.com/archives/2001/09/desi/
• Jansen, B. J., Spink, A., & Pedersen, J. (2003). An analysis of multimedia searching on AltaVista. Proceedings of the 5th ACM SIGMM International Workshop on Multimedia Information Retrieval, 186-192.
• Kherfi, M. L., Ziou, D., & Bernardi, A. (2004). Image retrieval from the World Wide Web: issues, techniques and systems. ACM Computing Surveys, 36(1), 35-67.
• Liu, F., Yu, C., & Meng, W. (2002). Personalized web search by mapping user queries to categories. Proceedings of the Eleventh International Conference on Information and Knowledge Management (CIKM '02), McLean, Virginia, 558-565.
• Ma, W.-Y., & Hon, H.-W. (2004). Towards next generation Web information retrieval. Web Information Systems – WISE04: Proceedings of the Fifth International Conference on Web Information Systems Engineering, Brisbane, Australia.
• Perez, J. C. (2004). Google offers new local search service. Retrieved from http://www.infoworld.com/article/04/03/17/HNgooglelocal_1.html
• Poulter, A. (1997). The design of World Wide Web search engines: a critical review. Program, 31(2), 131-145.
• Rein, L. (1997). XML ushers in structured Web searches. Retrieved from http://www.wired.com/news/technology/0,1282,7751,00.html
• Schwartz, C. (1998). Web search engines. Journal of the American Society for Information Science, 49(11), 973-982.
• Singer, M. (2004, October 27). Yahoo sends search aloft. Retrieved from http://www.internetnews.com/bus-news/article.php/3427831
• Sullivan, D. (2000, June 2). Survey reveals search habits. The Search Engine Report. Retrieved from http://www.searchenginewatch.com/sereport/00/06realnames.html
• Wall, A. (2004). History of search engines & web history. Retrieved from http://www.search-marketing.info/search-engine-history/
• Watters, C., & Amoudi, G. (2003). GeoSearcher: location-based ranking of search engine results. Journal of the American Society for Information Science and Technology, 54(2), 140-151.
• Webster, P. (2004). Metasearching in an academic environment. Online, 28(2), 20-23.