The Anatomy of A Search Engine

The Anatomy of a Large-Scale Hypertextual Web Search Engine
Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305
Abstract
In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/

To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advances in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine, the first such detailed public description we know of to date.

Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.

Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google
1. Introduction
(Note: There are two versions of this paper: a longer full version and a shorter printed version. The full version is available on the web and the conference CD-ROM.)

The web creates new challenges for information retrieval. The amount of information on the web is growing rapidly, as well as the number of new users inexperienced in the art of web research. People are likely to surf the web using its link graph, often starting with high-quality human-maintained indices such as Yahoo! or with search engines. Human-maintained lists cover popular topics effectively but are subjective, expensive to build and maintain, slow to improve, and cannot cover all esoteric topics. Automated search engines that rely on keyword matching usually return too many low-quality matches. To make matters worse, some advertisers attempt to gain people's attention by taking measures meant to mislead automated search engines. We have built a large-scale search engine which addresses many of the problems of existing systems. It makes especially heavy use of the additional structure present in hypertext to provide much higher quality search results. We chose our system name, Google, because it is a common spelling of googol, or 10^100, and fits well with our goal of building very large-scale search engines.
1.1 Web Search Engines, Scaling Up: 1994-2000
Search engine technology has had to scale dramatically to keep up with the growth of the web. In 1994, one of the first web search engines, the World Wide Web Worm (WWWW) [McBryan 94], had an index of 110,000 web pages and web accessible documents. As of November, 1997, the top search engines claim to index from 2 million (WebCrawler) to 100 million web documents (from Search Engine Watch). It is foreseeable that by the year 2000, a comprehensive index of the Web will contain over a billion documents. At the same time, the number of queries search engines handle has grown incredibly too. In March and April 1994, the World Wide Web Worm received an average of about 1500 queries per day. In November 1997, Altavista claimed it handled roughly 20 million queries per day. With the increasing number of users on the web, and automated systems which query search engines, it is likely that top search engines will handle hundreds of millions of queries per day by the year 2000. The goal of our system is to address many of the problems, both in quality and scalability, introduced by scaling search engine technology to such extraordinary numbers.
1.2 Google: Scaling with the Web
Creating a search engine which scales even to today's web presents many challenges. Fast crawling technology is needed to gather the web documents and keep them up to date. Storage space must be used efficiently to store indices and, optionally, the documents themselves. The indexing system must process hundreds of gigabytes of data efficiently. Queries must be handled quickly, at a rate of hundreds to thousands per second.

These tasks are becoming increasingly difficult as the Web grows. However, hardware performance and cost have improved dramatically to partially offset the difficulty. There are, however, several notable exceptions to this progress such as disk seek time and operating system robustness. In designing Google, we have considered both the rate of growth of the Web and technological changes. Google is designed to scale well to extremely large data sets. It makes efficient use of storage space to store the index. Its data structures are optimized for fast and efficient access (see section 4.2). Further, we expect that the cost to index and store text or HTML will eventually decline relative to the amount that will be available (see Appendix B). This will result in favorable scaling properties for centralized systems like Google.
1.3 Design Goals
1.3.1 Improved Search Quality
Our main goal is to improve the quality of web search engines. In 1994, some people believed that a complete search index would make it possible to find anything easily. According to Best of the Web 1994 - Navigators, "The best navigation service should make it easy to find almost anything on the Web (once all the data is entered)." However, the Web of 1997 is quite different. Anyone who has used a search engine recently can readily testify that the completeness of the index is not the only factor in the quality of search results. "Junk results" often wash out any results that a user is interested in. In fact, as of November 1997, only one of the top four commercial search engines finds itself (returns its own search page in response to its name in the top ten results). One of the main causes of this problem is that the number of documents in the indices has been increasing by many orders of magnitude, but the user's ability to look at documents has not. People are still only willing to look at the first few tens of results. Because of this, as the collection size grows, we need tools that have very high precision (number of relevant documents returned, say in the top tens of results). Indeed, we want our notion of "relevant" to only include the very best documents since there may be tens of thousands of slightly relevant documents. This very high precision is important even at the expense of recall (the total number of relevant documents the system is able to return). There is quite a bit of recent optimism that the use of more hypertextual information can help improve search and other applications [Marchiori 97] [Spertus 97] [Weiss 96] [Kleinberg 98]. In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).

1.3.2 Academic Search Engine Research
Aside from tremendous growth, the Web has also become increasingly commercial over time. In 1993, 1.5% of web servers were on .com domains. This number grew to over 60% in 1997. At the same time, search engines have migrated from the academic domain to the commercial. Up until now most search engine development has gone on at companies with little publication of technical details. This causes search engine technology to remain largely a black art and to be advertising oriented (see Appendix A). With Google, we have a strong goal to push more development and understanding into the academic realm.

Another important design goal was to build systems that reasonable numbers of people can actually use. Usage was important to us because we think some of the most interesting research will involve leveraging the vast amount of usage data that is available from modern web systems. For example, there are many tens of millions of searches performed every day. However, it is very difficult to get this data, mainly because it is considered commercially valuable.

Our final design goal was to build an architecture that can support novel research activities on large-scale web data. To support novel research uses, Google stores all of the actual documents it crawls in compressed form. One of our main goals in designing Google was to set up an environment where other researchers can come in quickly, process large chunks of the web, and produce interesting results that would have been very difficult to produce otherwise. In the short time the system has been up, there have already been several papers using databases generated by Google, and many others are underway. Another goal we have is to set up a Spacelab-like environment where researchers or even students can propose and do interesting experiments on our large-scale web data.
2. System Features
The Google search engine has two important features that help it produce high precision results. First, it makes use of the link structure of the Web to calculate a quality ranking for each web page. This ranking is called PageRank and is described in detail in [Page 98]. Second, Google utilizes link (anchor) text to improve search results.
2.1 PageRank: Bringing Order to the Web
The citation (link) graph of the web is an important resource that has largely gone unused in existing web search engines. We have created maps containing as many as 518 million of these hyperlinks, a significant sample of the total. These maps allow rapid calculation of a web page's "PageRank", an objective measure of its citation importance that corresponds well with people's subjective idea of importance. Because of this correspondence, PageRank is an excellent way to prioritize the results of web keyword searches. For most popular subjects, a simple text matching search that is restricted to web page titles performs admirably when PageRank prioritizes the results (demo available at google.stanford.edu). For the type of full text searches in the main Google system, PageRank also helps a great deal.

2.1.1 Description of PageRank Calculation
Academic citation literature has been applied to the web, largely by counting citations or backlinks to a given page. This gives some approximation of a page's importance or quality. PageRank extends this idea by not counting links from all pages equally, and by normalizing by the number of links on a page. PageRank is defined as follows:

We assume page A has pages T1...Tn which point to it (i.e., are citations). The parameter d is a damping factor which can be set between 0 and 1. We usually set d to 0.85. There are more details about d in the next section. Also C(A) is defined as the number of links going out of page A. The PageRank of a page A is given as follows:

PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn))

Note that the PageRanks form a probability distribution over web pages, so the sum of all web pages' PageRanks will be one. (Strictly, this holds when the (1-d) term is divided by the total number of pages; with the formula exactly as written, the PageRanks sum to the number of pages, provided every page has at least one outgoing link.)

PageRank or PR(A) can be calculated using a simple iterative algorithm, and corresponds to the principal eigenvector of the normalized link matrix of the web. Also, a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation. There are many other details which are beyond the scope of this paper.

2.1.2 Intuitive Justification
PageRank can be thought of as a model of user behavior. We assume there is a "random surfer" who is given a web page at random and keeps clicking on links, never hitting "back" but eventually gets bored and starts on another random page. The probability that the random surfer visits a page is its PageRank. The damping factor d is the probability at each page that the "random surfer" keeps following links; with probability 1-d the surfer gets bored and requests another random page. One important variation is to only add the damping factor d to a single page, or a group of pages. This allows for personalization and can make it nearly impossible to deliberately mislead the system in order to get a higher ranking. We have several other extensions to PageRank, again see [Page 98].

Another intuitive justification is that a page can have a high PageRank if there are many pages that point to it, or if there are some pages that point to it and have a high PageRank. Intuitively, pages that are well cited from many places around the web are worth looking at. Also, pages that have perhaps only one citation from something like the Yahoo! homepage are also generally worth looking at. If a page was not high quality, or was a broken link, it is quite likely that Yahoo's homepage would not link to it. PageRank handles both these cases and everything in between by recursively propagating weights through the link structure of the web.
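The iterative calculation mentioned in Section 2.1.1 can be illustrated with a minimal sketch. It applies the formula PR(A) = (1-d) + d (PR(T1)/C(T1) + ... + PR(Tn)/C(Tn)) repeatedly to a toy three-page graph. The function name `pagerank`, the dictionary representation of the link graph, the fixed iteration count, and the toy graph itself are assumptions made for illustration; this is not the authors' implementation, which handled hundreds of millions of links.

```python
def pagerank(links, d=0.85, iterations=50):
    """Iterate PR(A) = (1 - d) + d * sum(PR(T) / C(T)) over pages T linking to A.

    `links` maps each page to the list of pages it links to (an assumed,
    illustrative representation of the link graph).
    """
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    # C(T): number of links going out of page T.
    out_count = {page: len(links.get(page, [])) for page in pages}
    # Start every page at 1.0 and refine the estimate on each pass.
    pr = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_pr = {page: 1.0 - d for page in pages}
        for source, targets in links.items():
            if not targets:
                continue  # a page without outgoing links passes on nothing
            share = d * pr[source] / out_count[source]
            for target in targets:
                new_pr[target] += share
        pr = new_pr
    return pr


# Hypothetical toy web: A links to B and C, B links to C, C links to A.
toy_web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(toy_web)
```

In this toy graph page C, which is cited by both other pages, ends up with the highest value. Because the formula is used exactly as written, the values sum to the number of pages (three here) rather than to one; dividing the (1-d) term by the number of pages would give a probability distribution.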
2.2 Anchor Text
The text of links is treated in a special way in our search engine. Most search engines associate the text of a link with the page that the link is on. In addition, we associate it with the page the link points to. This has several advantages. First, anchors often provide more accurate descriptions of web pages than the pages themselves. Second, anchors may exist for documents which cannot be indexed by a text-based search engine, such as images, programs, and databases. This makes it possible to return web pages which have not actually been crawled. Note that pages that have not been crawled can cause problems, since they are never checked for validity before being returned to the user. In this case, the search engine can even return a page that never actually existed, but had hyperlinks pointing to it. However, it is possible to sort the results, so that this particular problem rarely happens.

This idea of propagating anchor text to the page it refers to was implemented in the World Wide Web Worm [McBryan 94] especially because it helps search non-text information, and expands the search coverage with fewer downloaded documents. We use anchor propagation mostly because anchor text can help provide better quality results. Using anchor text efficiently is technically difficult because of the large amounts of data which must be processed. In our current crawl of 24 million pages, we had over 259 million anchors which we indexed.
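To make the idea of anchor propagation concrete, here is a minimal sketch that associates the words of each link's text with the page the link points to, so that a target can be found even though it was never crawled. The function name `index_anchor_text`, the input format, and the example URLs are hypothetical; the real system stores anchor hits in its forward and inverted indices rather than in a dictionary.

```python
from collections import defaultdict


def index_anchor_text(pages):
    """Map each link target to the anchor words of the links pointing at it.

    `pages` maps a source URL to a list of (target_url, anchor_text) pairs;
    this input format is an assumption for illustration.
    """
    anchor_index = defaultdict(list)
    for source, links in pages.items():
        for target, text in links:
            for word in text.lower().split():
                # Credit the word to the target page, remembering its source.
                anchor_index[target].append((word, source))
    return anchor_index


# Hypothetical crawl: two pages link to an image that cannot itself be indexed.
crawled = {
    "http://a.example/": [("http://img.example/logo.gif", "Example logo")],
    "http://b.example/": [("http://img.example/logo.gif", "company logo")],
}
anchors = index_anchor_text(crawled)
```

Here the image URL becomes searchable by the words "example", "company", and "logo" although no text was ever extracted from it.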
2.3 Other Features
Aside from PageRank and the use of anchor text, Google has several other features. First, it has location information for all hits and so it makes extensive use of proximity in search. Second, Google keeps track of some visual presentation details such as font size of words. Words in a larger or bolder font are weighted higher than other words. Third, full raw HTML of pages is available in a repository.
3. Related Work
Search research on the web has a short and concise history. The World Wide Web Worm (WWWW) [McBryan 94] was one of the first web search engines. It was subsequently followed by several other academic search engines, many of which are now public companies. Compared to the growth of the Web and the importance of search engines there are precious few documents about recent search engines [Pinkerton 94]. According to Michael Mauldin (chief scientist, Lycos Inc) [Mauldin], "the various services (including Lycos) closely guard the details of these databases". However, there has been a fair amount of work on specific features of search engines. Especially well represented is work which can get results by post-processing the results of existing commercial search engines, or produce small scale "individualized" search engines. Finally, there has been a lot of research on information retrieval systems, especially on well controlled collections. In the next two sections, we discuss some areas where this research needs to be extended to work better on the web.
3.1 Information Retrieval
Work in information retrieval systems goes back many years and is well developed [Witten 94]. However, most of the research on information retrieval systems is on small well controlled homogeneous collections such as collections of scientific papers or news stories on a related topic. Indeed, the primary benchmark for information retrieval, the Text Retrieval Conference [TREC 96], uses a fairly small, well controlled collection for their benchmarks. The "Very Large Corpus" benchmark is only 20 GB compared to the 147 GB from our crawl of 24 million web pages. Things that work well on TREC often do not produce good results on the web. For example, the standard vector space model tries to return the document that most closely approximates the query, given that both query and document are vectors defined by their word occurrence. On the web, this strategy often returns very short documents that are the query plus a few words. For example, we have seen a major search engine return a page containing only "Bill Clinton Sucks" and a picture from a "Bill Clinton" query. Some argue that on the web, users should specify more accurately what they want and add more words to their query. We disagree vehemently with this position. If a user issues a query like "Bill Clinton" they should get reasonable results since there is an enormous amount of high quality information available on this topic. Given examples like these, we believe that the standard information retrieval work needs to be extended to deal effectively with the web.
3.2 Differences Between the Web and Well Controlled Collections
The web is a vast collection of completely uncontrolled heterogeneous documents. Documents on the web have extreme variation internal to the documents, and also in the external meta information that might be available. For example, documents differ internally in their language (both human and programming), vocabulary (email addresses, links, zip codes, phone numbers, product numbers), type or format (text, HTML, PDF, images, sounds), and may even be machine generated (log files or output from a database). On the other hand, we define external meta information as information that can be inferred about a document, but is not contained within it. Examples of external meta information include things like reputation of the source, update frequency, quality, popularity or usage, and citations. Not only are the possible sources of external meta information varied, but the things that are being measured vary many orders of magnitude as well. For example, compare the usage information from a major homepage, like Yahoo's, which currently receives millions of page views every day, with an obscure historical article which might receive one view every ten years. Clearly, these two items must be treated very differently by a search engine.

Another big difference between the web and traditional well controlled collections is that there is virtually no control over what people can put on the web. Couple this flexibility to publish anything with the enormous influence of search engines to route traffic, and companies which deliberately manipulate search engines for profit become a serious problem. This problem has not been addressed in traditional closed information retrieval systems. Also, it is interesting to note that metadata efforts have largely failed with web search engines, because any text on the page which is not directly represented to the user is abused to manipulate search engines. There are even numerous companies which specialize in manipulating search engines for profit.
4. System Anatomy
First, we will provide a high-level discussion of the architecture. Then, there are some in-depth descriptions of important data structures. Finally, the major applications (crawling, indexing, and searching) will be examined in depth.
4.1 Google Architecture Overview
In this section, we will give a high-level overview of how the whole system works as pictured in Figure 1. Further sections will discuss the applications and data structures not mentioned in this section. Most of Google is implemented in C or C++ for efficiency and can run in either Solaris or Linux.

Figure 1. High Level Google Architecture

In Google, the web crawling (downloading of web pages) is done by several distributed crawlers. There is a URLserver that sends lists of URLs to be fetched to the crawlers. The web pages that are fetched are then sent to the storeserver. The storeserver then compresses and stores the web pages into a repository. Every web page has an associated ID number called a docID which is assigned whenever a new URL is parsed out of a web page. The indexing function is performed by the indexer and the sorter. The indexer performs a number of functions. It reads the repository, uncompresses the documents, and parses them. Each document is converted into a set of word occurrences called hits. The hits record the word, position in document, an approximation of font size, and capitalization. The indexer distributes these hits into a set of "barrels", creating a partially sorted forward index. The indexer performs another important function. It parses out all the links in every web page and stores important information about them in an anchors file. This file contains enough information to determine where each link points from and to, and the text of the link.

The URLresolver reads the anchors file and converts relative URLs into absolute URLs and in turn into docIDs. It puts the anchor text into the forward index, associated with the docID that the anchor points to. It also generates a database of links which are pairs of docIDs. The links database is used to compute PageRanks for all the documents.

The sorter takes the barrels, which are sorted by docID (this is a simplification, see Section 4.2.5), and resorts them by wordID to generate the inverted index. This is done in place so that little temporary space is needed for this operation. The sorter also produces a list of wordIDs and offsets into the inverted index. A program called DumpLexicon takes this list together with the lexicon produced by the indexer and generates a new lexicon to be used by the searcher. The searcher is run by a web server and uses the lexicon built by DumpLexicon together with the inverted index and the PageRanks to answer queries.
4.2 Major Data Structures
Google's data structures are optimized so that a large document collection can be crawled, indexed, and searched with little cost. Although CPUs and bulk input/output rates have improved dramatically over the years, a disk seek still requires about 10 ms to complete. Google is designed to avoid disk seeks whenever possible, and this has had a considerable influence on the design of the data structures.

4.2.1 BigFiles
BigFiles are virtual files spanning multiple file systems and are addressable by 64 bit integers. The allocation among multiple file systems is handled automatically. The BigFiles package also handles allocation and deallocation of file descriptors, since the operating systems do not provide enough for our needs. BigFiles also support rudimentary compression options.

4.2.2 Repository
The repository contains the full HTML of every web page. Each page is compressed using zlib (see RFC 1950). The choice of compression technique is a tradeoff between speed and compression ratio. We chose zlib's speed over a significant improvement in compression offered by bzip. The compression rate of bzip was approximately 4 to 1 on the repository as compared to zlib's 3 to 1 compression. In the repository, the documents are stored one after the other and are prefixed by docID, length, and URL as can be seen in Figure 2. The repository requires no other data structures to be used in order to access it. This helps with data consistency and makes development much easier; we can rebuild all the other data structures from only the repository and a file which lists crawler errors.

Figure 2. Repository Data Structure

4.2.3 Document Index
The document index keeps information about each document. It is a fixed width ISAM (index sequential access mode) index, ordered by docID. The information stored in each entry includes the current document status, a pointer into the repository, a document checksum, and various statistics. If the document has been crawled, it also contains a pointer into a variable width file called docinfo which contains its URL and title. Otherwise the pointer points into the URLlist which contains just the URL.
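A fixed-width index ordered by docID means that the position of an entry can be computed directly from the docID, so a lookup needs a single seek. The following minimal sketch shows that property using an in-memory file. The field widths, the record layout in `RECORD`, and the function names are all assumptions for illustration; the paper does not specify the on-disk format, and the "various statistics" fields are omitted.

```python
import io
import struct

# Hypothetical layout (not from the paper): status (1 byte), repository
# offset (8 bytes), checksum (4 bytes), docinfo/URLlist offset (8 bytes).
RECORD = struct.Struct("<BQIQ")


def write_entry(index_file, doc_id, status, repo_offset, checksum, info_offset):
    """Store the entry for doc_id at its computed position."""
    index_file.seek(doc_id * RECORD.size)
    index_file.write(RECORD.pack(status, repo_offset, checksum, info_offset))


def read_entry(index_file, doc_id):
    """Fetch the entry for doc_id with one seek and one fixed-size read."""
    index_file.seek(doc_id * RECORD.size)
    return RECORD.unpack(index_file.read(RECORD.size))


index_file = io.BytesIO()  # stands in for a file on disk
write_entry(index_file, 0, 1, 0, 0xDEADBEEF, 0)
write_entry(index_file, 2, 1, 4096, 0x12345678, 128)
entry = read_entry(index_file, 2)
```

Because every record has the same size, no auxiliary structure is needed to locate an entry, which is what keeps the lookup to one disk seek in a real file.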
This design decision was driven by the desire to have a reasonably compact data structure, and the ability to fetch a record in one disk seek during a search.

Additionally, there is a file which is used to convert URLs into docIDs. It is a list of URL checksums with their corresponding docIDs and is sorted by checksum. In order to find the docID of a particular URL, the URL's checksum is computed and a binary search is performed on the checksums file to find its docID. URLs may be converted into docIDs in batch by doing a merge with this file. This is the technique the URLresolver uses to turn URLs into docIDs. This batch mode of update is crucial because otherwise we must perform one seek for every link, which assuming one disk would take more than a month for our 322 million link dataset.

4.2.4 Lexicon
The lexicon has several different forms. One important change from earlier systems is that the lexicon can fit in memory for a reasonable price. In the current implementation we can keep the lexicon in memory on a machine with 256 MB of main memory. The current lexicon contains 14 million words (though some rare words were not added to the lexicon). It is implemented in two parts: a list of the words (concatenated together but separated by nulls) and a hash table of pointers. For various functions, the list of words has some auxiliary information which is beyond the scope of this paper to explain fully.

4.2.5 Hit Lists
A hit list corresponds to a list of occurrences of a particular word in a particular document including position, font, and capitalization information. Hit lists account for most of the space used in both the forward and the inverted indices. Because of this, it is important to represent them as efficiently as possible. We considered several alternatives for encoding position, font, and capitalization: simple encoding (a triple of integers), a compact encoding (a hand optimized allocation of bits), and Huffman coding. In the end we chose a hand optimized compact encoding since it required far less space than the simple encoding and far less bit manipulation than Huffman coding. The details of the hits are shown in Figure 3.
Our compact encoding uses two bytes for every hit. There are two types of hits: fancy hits and plain hits. Fancy hits include hits occurring in a URL, title, anchor text, or meta tag. Plain hits include everything else. A plain hit consists of a capitalization bit, font size, and 12 bits of word position in a document (all positions higher than 4095 are labeled 4096). Font size is represented relative to the rest of the document using three bits (only 7 values are actually used because 111 is the flag that signals a fancy hit). A fancy hit consists of a capitalization bit, the font size set to 7 to indicate it is a fancy hit, 4 bits to encode the type of fancy hit, and 8 bits of position. For anchor hits, the 8 bits of position are split into 4 bits for position in anchor and 4 bits for a hash of the docID the anchor occurs in. This gives us some limited phrase searching as long as there are not that many anchors for a particular word. We expect to update the way that anchor hits are stored to allow for greater resolution in the position and docID hash fields. We use font size relative to the rest of the document because when searching, you do not want to rank otherwise identical documents differently just because one of the documents is in a larger font.

The length of a hit list is stored before the hits themselves. To save space, the length of the hit list is combined with the wordID in the forward index and the docID in the inverted index. This limits it to 8 and 5 bits respectively (there are some tricks which allow 8 bits to be borrowed from the wordID). If the length is longer than would fit in that many bits, an escape code is used in those bits, and the next two bytes contain the actual length.

Figure 3. Forward and Reverse Indexes and the Lexicon

4.2.6 Forward Index
The forward index is actually already partially sorted. It is stored in a number of barrels (we used 64). Each barrel holds a range of wordIDs. If a document contains words that fall into a particular barrel, the docID is recorded into the barrel, followed by a list of wordIDs with hit lists which correspond to those words. This scheme requires slightly more storage because of duplicated docIDs but the difference is very small for a reasonable number of buckets and saves considerable time and coding complexity in the final indexing phase done by the sorter. Furthermore, instead of storing actual wordIDs, we store each wordID as a relative difference from the minimum wordID that falls into the barrel the wordID is in. This way, we can use just 24 bits for the wordIDs in the unsorted barrels, leaving 8 bits for the hit list length.

4.2.7 Inverted Index
The inverted index consists of the same barrels as the forward index, except that they have been processed by the sorter. For every valid wordID, the lexicon contains a pointer into the barrel that wordID falls into. It points to a doclist of docIDs together with their corresponding hit lists. This doclist represents all the occurrences of that word in all documents.

An important issue is in what order the docIDs should appear in the doclist. One simple solution is to store them sorted by docID. This allows for quick merging of different doclists for multiple word queries. Another option is to store them sorted by a ranking of the occurrence of the word in each document. This makes answering one word queries trivial and makes it likely that the answers to multiple word queries are near the start. However, merging is much more difficult. Also, this makes development much more difficult in that a change to the ranking function requires a rebuild of the index. We chose a compromise between these options, keeping two sets of inverted barrels: one set for hit lists which include title or anchor hits and another set for all hit lists. This way, we check the first set of barrels first and if there are not enough matches within those barrels we check the larger ones.
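Returning to the two-byte hit encoding of Section 4.2.5, the following sketch packs and unpacks plain and fancy hits as 16-bit integers. The field widths come from the text (1 capitalization bit, 3 font bits, then either 12 position bits, or 4 type bits plus 8 position bits), but the order of the fields within the 16 bits, the function names, and the clamping behavior are assumptions. In particular, this sketch clamps oversized positions to 4095 and 255, the largest values that fit in 12 and 8 bits, and it does not model the anchor-hit split of the position field.

```python
FANCY_FONT = 0b111      # font-size value reserved as the fancy-hit flag
MAX_PLAIN_POS = 0xFFF   # largest position that fits in 12 bits
MAX_FANCY_POS = 0xFF    # largest position that fits in 8 bits


def encode_plain_hit(capitalized, font_size, position):
    """Pack a plain hit: 1 cap bit, 3 font bits (0..6), 12 position bits."""
    if not 0 <= font_size < FANCY_FONT:
        raise ValueError("plain hits use relative font sizes 0..6")
    position = min(position, MAX_PLAIN_POS)
    return (int(capitalized) << 15) | (font_size << 12) | position


def encode_fancy_hit(capitalized, hit_type, position):
    """Pack a fancy hit: 1 cap bit, font bits 111, 4 type bits, 8 position bits."""
    if not 0 <= hit_type < 16:
        raise ValueError("fancy hit type must fit in 4 bits")
    position = min(position, MAX_FANCY_POS)
    return (int(capitalized) << 15) | (FANCY_FONT << 12) | (hit_type << 8) | position


def decode_hit(hit):
    """Unpack a 16-bit hit into a dictionary of its fields."""
    capitalized = bool(hit >> 15)
    font = (hit >> 12) & 0b111
    if font == FANCY_FONT:
        return {"fancy": True, "capitalized": capitalized,
                "type": (hit >> 8) & 0xF, "position": hit & 0xFF}
    return {"fancy": False, "capitalized": capitalized,
            "font_size": font, "position": hit & 0xFFF}


plain = encode_plain_hit(True, 3, 100)
fancy = encode_fancy_hit(False, 2, 17)
```

Both values fit in two bytes, and `decode_hit` distinguishes the two kinds of hit purely from the three font bits, which is why only seven font sizes remain available to plain hits.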
4.3 Crawling the Web
Running a web crawler is a challenging task. There are tricky performance and reliability issues and even more importantly, there are social issues. Crawling is the most fragile application since it involves interacting with hundreds of thousands of web servers and various name servers which are all beyond the control of the system.

In order to scale to hundreds of millions of web pages, Google has a fast distributed crawling system. A single URLserver serves lists of URLs to a number of crawlers (we typically ran about 3). Both the URLserver and the crawlers are implemented in Python. Each crawler keeps roughly 300 connections open at once. This is necessary to retrieve web pages at a fast enough pace. At peak speeds, the system can crawl over 100 web pages per second using four crawlers. This amounts to roughly 600K per second of data. A major performance stress is DNS lookup. Each crawler maintains its own DNS cache so it does not need to do a DNS lookup before crawling each document. Each of the hundreds of connections can be in a number of different states: looking up DNS, connecting to host, sending request, and receiving response. These factors make the crawler a complex component of the system. It uses asynchronous IO to manage events, and a number of queues to move page fetches from state to state.
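The per-crawler DNS cache described above can be sketched in a few lines. The class name `DNSCache`, the injectable `resolver` argument, and the stand-in resolver used in the example are assumptions for illustration; the sketch also ignores record expiry and the asynchronous IO and state queues that the actual crawler relied on.

```python
import socket


class DNSCache:
    """Remember host -> address so each host is resolved at most once."""

    def __init__(self, resolver=socket.gethostbyname):
        self._resolver = resolver
        self._cache = {}
        self.lookups = 0  # number of real resolutions performed

    def resolve(self, host):
        if host not in self._cache:
            self.lookups += 1
            self._cache[host] = self._resolver(host)
        return self._cache[host]


def fake_resolver(host):
    """Stand-in resolver so the example runs without network access."""
    return "192.0.2.1"  # address from the range reserved for documentation


cache = DNSCache(resolver=fake_resolver)
for host in ["www.example.com", "www.example.com", "example.org"]:
    cache.resolve(host)
```

After the loop, three pages have been requested but only two lookups were performed, since the second request for www.example.com is answered from the cache.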
It turns out that running a crawler which connects to more than half a million servers, and generates tens of millions of log entries, generates a fair amount of email and phone calls. Because of the vast number of people coming on line, there are always those who do not know what a crawler is, because this is the first one they have seen. Almost daily, we receive an email something like, "Wow, you looked at a lot of pages from my web site. How did you like it?" There are also some people who do not know about the robots exclusion protocol, and think their page should be protected from indexing by a statement like, "This page is copyrighted and should not be indexed", which needless to say is difficult for web crawlers to understand. Also, because of the huge amount of data involved, unexpected things will happen. For example, our system tried to crawl an online game. This resulted in lots of garbage messages in the middle of their game! It turns out this was an easy problem to fix. But this problem had not come up until we had downloaded tens of millions of pages. Because of the immense variation in web pages and servers, it is virtually impossible to test a crawler without running it on a large part of the Internet. Invariably, there are hundreds of obscure problems which may only occur on one page out of the whole web and cause the crawler to crash, or worse, cause unpredictable or incorrect behavior. Systems which access large parts of the Internet need to be designed to be very robust and carefully tested. Since large complex systems such as crawlers will invariably cause problems, there needs to be significant resources devoted to reading the email and solving these problems as they come up.
4.4 Indexing the Web
Parsing: Any parser which is designed to run on the entire Web must handle a huge array of possible errors. These range from typos in HTML tags to kilobytes of zeros in the middle of a tag, non-ASCII characters, HTML tags nested hundreds deep, and a great variety of other errors that challenge anyone's imagination to come up with equally creative ones. For maximum speed, instead of using YACC to generate a CFG parser, we use flex to generate a lexical analyzer which we outfit with its own stack. Developing this parser which runs at a reasonable speed and is very robust involved a fair amount of work.

Indexing Documents into Barrels: After each document is parsed, it is encoded into a number of barrels. Every word is converted into a wordID by using an in-memory hash table, the lexicon. New additions to the lexicon hash table are logged to a file. Once the words are converted into wordIDs, their occurrences in the current document are translated into hit lists and are written into the forward barrels. The main difficulty with parallelization of the indexing phase is that the lexicon needs to be shared. Instead of sharing the lexicon, we took the approach of writing a log of all the extra words that were not in a base lexicon, which we fixed at 14 million words. That way multiple indexers can run in parallel and then the small log file of extra words can be processed by one final indexer.

Sorting: In order to generate the inverted index, the sorter takes each of the forward barrels and sorts it by wordID to produce an inverted barrel for title and anchor hits and a full text inverted barrel. This process happens one barrel at a time, thus requiring little temporary storage. Also, we parallelize the sorting phase to use as many machines as we have simply by running multiple sorters, which can process different buckets at the same time. Since the barrels don't fit into main memory, the sorter further subdivides them into baskets which do fit into memory based on wordID and docID. Then the sorter loads each basket into memory, sorts it, and writes its contents into the short inverted barrel and the full inverted barrel.
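The sorting step can be illustrated with a small in-memory sketch that turns one forward barrel (ordered by docID) into an inverted barrel (ordered by wordID, with each doclist ordered by docID). The function name `invert_barrel`, the list-and-tuple representation, and the sample data are assumptions for illustration; the actual sorter works on packed binary barrels, splits them into baskets that fit in memory, and writes separate short and full inverted barrels.

```python
from collections import defaultdict


def invert_barrel(forward_barrel):
    """Re-sort a forward barrel by wordID.

    `forward_barrel` is a list of (doc_id, [(word_id, hit_list), ...])
    entries; the result maps each word_id to a doclist of
    (doc_id, hit_list) pairs sorted by doc_id.
    """
    postings = []
    for doc_id, words in forward_barrel:
        for word_id, hits in words:
            postings.append((word_id, doc_id, hits))
    # Sort by wordID first, then docID, as the inverted index requires.
    postings.sort(key=lambda posting: (posting[0], posting[1]))
    inverted = defaultdict(list)
    for word_id, doc_id, hits in postings:
        inverted[word_id].append((doc_id, hits))
    return dict(inverted)


# Hypothetical forward barrel: two documents, hit lists shown as positions.
forward = [
    (7, [(42, [3, 9]), (40, [1])]),
    (2, [(42, [5])]),
]
inverted = invert_barrel(forward)
```

For this input, wordID 42 maps to a doclist that lists document 2 before document 7, each with its hit list, which is the docID ordering that makes merging doclists for multi-word queries straightforward.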
4.5 Searching
The goal of searching is to provide quality search results efficiently. Many of the large commercial search engines seemed to have made great progress in terms of efficiency. Therefore, we have focused more on quality of search in our research, although we believe our solutions are scalable to commercial volumes with a bit more effort. The Google query evaluation process is shown in Figure 4.

1. Parse the query.
2. Convert words into wordIDs.
3. Seek to the start of the doclist in the short barrel for every word.
4. Scan through the doclists until there is a document that matches all the search terms.
5. Compute the rank of that document for the query.
6. If we are in the short barrels and at the end of any doclist, seek to the start of the doclist in the full barrel for every word and go to step 4.
7. If we are not at the end of any doclist go to step 4.
8. Sort the documents that have matched by rank and return the top k.

Figure 4. Google Query Evaluation

To put a limit on response time, once a certain number (currently 40,000) of matching documents are found, the searcher automatically goes to step 8 in Figure 4. This means that it is possible that sub-optimal results would be returned. We are currently investigating other ways to solve this problem. In the past, we sorted the hits according to PageRank, which seemed to improve the situation.

4.5.1 The Ranking System
Google maintains much more information about web documents than typical search engines. Every hit list includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case: a single word query. In order to rank a document with a single word query, Google looks at that document's hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, ...), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.

For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor) but is classified into 10 different value "bins" ranging from a phrase match to "not even close". Counts are computed not only for every type of hit but for every type and proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can be displayed with the search results using a special debug mode. These displays have been very helpful in developing the ranking system.
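The single-word case of the ranking function can be sketched as a dot product of count-weights and type-weights. Every number below is invented for illustration: the paper gives neither the type-weights nor the shape of the count-weight curve, so `TYPE_WEIGHTS`, `COUNT_CAP`, and the simple "linear, then flat" `count_weight` function are assumptions. The final combination of the IR score with PageRank is not specified in the text and is therefore left out.

```python
from collections import Counter

# Hypothetical type-weights; the real values are not published in the paper.
TYPE_WEIGHTS = {
    "title": 10.0,
    "anchor": 8.0,
    "url": 6.0,
    "plain_large": 3.0,
    "plain_small": 1.0,
}
COUNT_CAP = 8  # hypothetical count beyond which more hits stop helping


def count_weight(count):
    """Grow linearly for small counts, then stay flat (assumed shape)."""
    return float(min(count, COUNT_CAP))


def ir_score(hit_types):
    """Dot product of count-weights and type-weights for one document.

    `hit_types` is the list of hit types found in the document's hit list
    for the query word; types without a weight are ignored.
    """
    counts = Counter(hit_types)
    return sum(count_weight(counts[hit_type]) * weight
               for hit_type, weight in TYPE_WEIGHTS.items())


score = ir_score(["title", "plain_small", "plain_small"])
stuffed = ir_score(["plain_small"] * 100)
```

With these assumed weights, one title hit plus two small plain-text hits outscores a page that repeats the word a hundred times in small plain text, which reflects the stated goal that no single factor can have too much influence.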
4.5.2 Feedback

The ranking function has many parameters like the type-weights and the type-prox-weights. Figuring out the right values for these parameters is something of a black art. In order to do this, we have a user feedback mechanism in the search engine. A trusted user may optionally evaluate all of the results that are returned. This feedback is saved. Then when we modify the ranking function, we can see the impact of this change on all previous searches which were ranked. Although far from perfect, this gives us some idea of how a change in the ranking function affects the search results.
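One way to picture replaying saved feedback against a modified ranking function is the sketch below. The storage layout (`saved_feedback` as a mapping from query to per-document ratings), the function name, and the "mean rating of the top k" metric are all assumptions made for illustration; the paper does not describe how the feedback is stored or scored.

```python
def mean_top_k_rating(rank_fn, saved_feedback, k=3):
    """Average trusted-user rating of the top-k results over all saved queries.

    saved_feedback: {query: {doc_id: rating}}   (hypothetical layout)
    rank_fn(query, doc_ids) -> doc_ids ordered best first
    """
    total, seen = 0.0, 0
    for query, ratings in saved_feedback.items():
        ordered = rank_fn(query, list(ratings))
        for doc_id in ordered[:k]:
            total += ratings[doc_id]
            seen += 1
    return total / seen if seen else 0.0


# Toy data: one saved query with three rated documents.
feedback = {"bill clinton": {"a": 1.0, "b": 0.0, "c": 0.5}}


def old_rank(query, docs):
    return sorted(docs)  # alphabetical, stands in for the old function


def new_rank(query, docs):
    return sorted(docs, key=lambda d: -feedback[query][d])  # an idealized change


print(mean_top_k_rating(old_rank, feedback, k=2))  # (1.0 + 0.0) / 2 = 0.5
print(mean_top_k_rating(new_rank, feedback, k=2))  # (1.0 + 0.5) / 2 = 0.75
```

Comparing the two numbers gives a rough, far from perfect signal of whether a change helped on previously judged searches.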
5 Results and Performance
The most important measure of a search engine is the quality of its search results. While a complete user evaluation is beyond the scope of this paper, our own experience with Google has shown it to produce better results than the major commercial search engines for most searches. As an example which illustrates the use of PageRank, anchor text, and proximity, Figure 4 shows Google's results for a search on "bill clinton". These results demonstrate some of Google's features. The results are clustered by server. This helps considerably when sifting through result sets. A number of results are from the whitehouse.gov domain, which is what one may reasonably expect from such a search. Currently, most major commercial search engines do not return any results from whitehouse.gov, much less the right ones. Notice that there is no title for the first result. This is because it was not crawled. Instead, Google relied on anchor text to determine this was a good answer to the query. Similarly, the fifth result is an email address which, of course, is not crawlable. It is also a result of anchor text.

Query: bill clinton

http://www.whitehouse.gov/
    100.00% (no date) (0K)
    http://www.whitehouse.gov/
Office of the President
    99.67% (Dec 23 1996) (2K)
    http://www.whitehouse.gov/WH/EOP/OP/html/OP_Home.html
Welcome To The White House
    99.98% (Nov 09 1997) (5K)
    http://www.whitehouse.gov/WH/Welcome.html
Send Electronic Mail to the President
    99.86% (Jul 14 1997) (5K)
    http://www.whitehouse.gov/WH/Mail/html/Mail_President.html
mailto:president@whitehouse.gov
    99.98%
mailto:President@whitehouse.gov
    99.27%
The "Unofficial" Bill Clinton
    94.06% (Nov 11 1997) (14K)
    http://zpub.com/un/unbc.html
Bill Clinton Meets The Shrinks
    86.27% (Jun 29 1997) (63K)
    http://zpub.com/un/unbc9.html
President Bill Clinton The Dark Side
    97.27% (Nov 10 1997) (15K)
    http://www.realchange.org/clinton.htm
$3 Bill Clinton
    94.73% (no date) (4K)
    http://www.gatewy.net/~tjohnson/clinton1.html

Figure 4. Sample Results from Google

All of the results are reasonably high quality pages and, at last check, none were broken links. This is largely because they all have high PageRank. The PageRanks are the percentages in red along with bar graphs. Finally, there are no results about a Bill other than Clinton or about a Clinton other than Bill. This is because we place heavy importance on the proximity of word occurrences. Of course a true test of the quality of a search engine would involve an extensive user study or results analysis which we do not have room for here. Instead, we invite the reader to try Google for themselves at http://google.stanford.edu.
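The clustering of results by server mentioned above can be sketched as a grouping of result URLs by hostname. The function name and the fallback for URLs without a hostname are assumptions made for illustration; the paper does not describe how the grouping is implemented.

```python
from urllib.parse import urlsplit


def cluster_by_host(urls):
    """Group result URLs by hostname, keeping the original result order."""
    clusters = {}
    for url in urls:
        # URLs without a network location (e.g. mailto:) have no hostname;
        # as a simplifying assumption they form their own cluster.
        host = urlsplit(url).hostname or url
        clusters.setdefault(host, []).append(url)
    return clusters


results = [
    "http://www.whitehouse.gov/",
    "http://zpub.com/un/unbc.html",
    "http://www.whitehouse.gov/WH/Welcome.html",
]
for host, members in cluster_by_host(results).items():
    print(host, len(members))
```

Because the dictionary keeps insertion order, each cluster appears at the position of its best-ranked member, so the overall ranking of servers is preserved.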
5.1 Storage Requirements

Aside from search quality, Google is designed to scale cost effectively to the size of the Web as it grows. One aspect of this is to use storage efficiently. Table 1 has a breakdown of some statistics and storage requirements of Google. Due to compression the total size of the repository is about 53 GB, just over one third of the total data it stores. At current disk prices this makes the repository a relatively cheap source of useful data. More importantly, the total of all the data used by the search engine requires a comparable amount of storage, about 55 GB. Furthermore, most queries can be answered using just the short inverted index. With better encoding and compression of the Document Index, a high quality web search engine may fit onto a 7 GB drive of a new PC.

Storage Statistics
Total Size of Fetched Pages                    147.8 GB
Compressed Repository                           53.5 GB
Short Inverted Index                             4.1 GB
Full Inverted Index                             37.2 GB
Lexicon                                          293 MB
Temporary Anchor Data (not in total)             6.6 GB
Document Index Incl. Variable Width Data         9.7 GB
Links Database                                   3.9 GB
Total Without Repository                        55.2 GB
Total With Repository                          108.7 GB

Web Page Statistics
Number of Web Pages Fetched                   24 million
Number of Urls Seen                         76.5 million
Number of Email Addresses                    1.7 million
Number of 404's                              1.6 million

Table 1. Statistics

5.2 System Performance

It is important for a search engine to crawl and index efficiently. This way information can be kept up to date and major changes to the system can be tested relatively quickly. For Google, the major operations are Crawling, Indexing, and Sorting. It is difficult to measure how long crawling took overall because disks filled up, name servers crashed, or any number of other problems stopped the system. In total it took roughly 9 days to download the 26 million pages (including errors). However, once the system was running smoothly, it ran much faster, downloading the last 11 million pages in just 63 hours, averaging just over 4 million pages per day or 48.5 pages per second. We ran the indexer and the crawler simultaneously. The indexer ran just faster than the crawlers. This is largely because we spent just enough time optimizing the indexer so that it would not be a bottleneck. These optimizations included bulk updates to the document index and placement of critical data structures on the local disk. The indexer runs at roughly 54 pages per second. The sorters can be run completely in parallel; using four machines, the whole process of sorting takes about 24 hours.
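The crawl rate and storage totals quoted in Sections 5.1 and 5.2 can be checked with a few lines of arithmetic. The sketch below uses only figures from the text and Table 1; the variable names are illustrative, and the Lexicon is converted as 293 MB = 0.293 GB, which is an assumption about the units.

```python
# Crawl rate: the last 11 million pages were downloaded in 63 hours.
pages = 11_000_000
hours = 63
pages_per_second = pages / (hours * 3600)
pages_per_day = pages_per_second * 86_400
print(round(pages_per_second, 1))     # 48.5
print(round(pages_per_day / 1e6, 2))  # 4.19 (million pages per day)

# Storage (GB), from Table 1. Temporary Anchor Data is not in the total.
index_parts_gb = {
    "short inverted index": 4.1,
    "full inverted index": 37.2,
    "lexicon": 0.293,
    "document index": 9.7,
    "links database": 3.9,
}
total_without_repository = sum(index_parts_gb.values())
repository_gb = 53.5
fetched_gb = 147.8
print(round(total_without_repository, 1))                  # 55.2
print(round(total_without_repository + repository_gb, 1))  # 108.7
print(round(repository_gb / fetched_gb, 3))                # 0.362, just over one third
```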
5.3 Search Performance

Improving the performance of search was not the major focus of our research up to this point. The current version of Google answers most queries in between 1 and 10 seconds. This time is mostly dominated by disk IO over NFS (since disks are spread over a number of machines). Furthermore, Google does not have any optimizations such as query caching, subindices on common terms, and other common optimizations. We intend to speed up Google considerably through distribution and hardware, software, and algorithmic improvements. Our target is to be able to handle several hundred queries per second. Table 2 has some sample query times from the current version of Google. They are repeated to show the speedups resulting from cached IO.

                   Initial Query                Same Query Repeated (IO mostly cached)
Query              CPU Time(s)   Total Time(s)  CPU Time(s)   Total Time(s)
al gore            0.09          2.13           0.06          0.06
vice president     1.77          3.84           1.66          1.80
hard disks         0.25          4.86           0.20          0.24
search engines     1.31          9.63           1.16          1.16

Table 2. Search Times

6 Conclusions

Google is designed to be a scalable search engine. The primary goal is to provide high quality search results over a rapidly growing World Wide Web. Google employs a number of techniques to improve search quality including page rank, anchor text, and proximity information. Furthermore, Google is a complete architecture for gathering web pages, indexing them, and performing search queries over them.

6.1 Future Work
A large-scale web search engine is a complex system and much remains to be done. Our immediate goals are to improve search efficiency and to scale to approximately 100 million web pages. Some simple improvements to efficiency include query caching, smart disk allocation, and subindices. Another area which requires much research is updates. We must have smart algorithms to decide what old web pages should be recrawled and what new ones should be crawled. Work toward this goal has been done in [Cho98]. One promising area of research is using proxy caches to build search databases, since they are demand driven. We are planning to add simple features supported by commercial search engines like boolean operators, negation, and stemming. However, other features are just starting to be explored such as relevance feedback and clustering (Google currently supports a simple hostname based clustering). We also plan to support user context (like the user's location), and result summarization. We are also working to extend the use of link structure and link text. Simple experiments indicate PageRank can be personalized by increasing the weight of a user's home page or bookmarks. As for link text, we are experimenting with using text surrounding links in addition to the link text itself. A Web search engine is a very rich environment for research ideas. We have far too many to list here so we do not expect this Future Work section to become much shorter in the near future.
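The idea of personalizing PageRank by increasing the weight of a user's home page can be illustrated with a small power-iteration sketch. This is not the authors' implementation: it uses the common normalized formulation in which a "teleport" distribution replaces the uniform random jump, and the function name, graph, and parameter values are hypothetical. It also assumes every link target appears as a key in `links`.

```python
def personalized_pagerank(links, teleport, d=0.85, iterations=50):
    """Power iteration with a non-uniform teleport distribution.

    links:    {page: [pages it links to]}
    teleport: {page: probability}, summing to 1 (the personalization)
    """
    pages = list(links)
    rank = {p: 1.0 / len(pages) for p in pages}
    for _ in range(iterations):
        # Rank held by pages with no out-links is redistributed via teleport.
        dangling = sum(rank[p] for p in pages if not links[p])
        new = {p: (1.0 - d + d * dangling) * teleport.get(p, 0.0) for p in pages}
        for p in pages:
            for q in links[p]:
                new[q] += d * rank[p] / len(links[p])
        rank = new
    return rank


graph = {"home": ["a", "b"], "a": ["b"], "b": ["a"]}
uniform = personalized_pagerank(graph, {p: 1 / 3 for p in graph})
biased = personalized_pagerank(graph, {"home": 1.0})
print(uniform["home"], biased["home"])
```

In this toy graph nothing links to "home", so its rank comes entirely from the teleport term, and concentrating the teleport weight on it raises its rank and that of the pages it links to.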
6.2 High Quality Search

The biggest problem facing users of web search engines today is the quality of the results they get back. While the results are often amusing and expand users' horizons, they are often frustrating and consume precious time. For example, the top result for a search for "Bill Clinton" on one of the most popular commercial search engines was the Bill Clinton Joke of the Day: April 14, 1997. Google is designed to provide higher quality search so as the Web continues to grow rapidly, information can be found easily. In order to accomplish this Google makes heavy use of hypertextual information consisting of link structure and link (anchor) text. Google also uses proximity and font information. While evaluation of a search engine is difficult, we have subjectively found that Google returns higher quality search results than current commercial search engines. The analysis of link structure via PageRank allows Google to evaluate the quality of web pages. The use of link text as a description of what the link points to helps the search engine return relevant (and to some degree high quality) results. Finally, the use of proximity information helps increase relevance a great deal for many queries.
6.3 Scalable Architecture

Aside from the quality of search, Google is designed to scale. It must be efficient in both space and time, and constant factors are very important when dealing with the entire Web. In implementing Google, we have seen bottlenecks in CPU, memory access, memory capacity, disk seeks, disk throughput, disk capacity, and network IO. Google has evolved to overcome a number of these bottlenecks during various operations. Google's major data structures make efficient use of available storage space. Furthermore, the crawling, indexing, and sorting operations are efficient enough to be able to build an index of a substantial portion of the web (24 million pages) in less than one week. We expect to be able to build an index of 100 million pages in less than a month.
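The two build-time claims above are consistent with the steady-state crawl rate of 48.5 pages per second reported in Section 5.2, as the rough projection below shows. It assumes the crawl is the limiting step, that the rate is sustained without outages, and that the work scales linearly with the number of pages; the helper name is illustrative.

```python
CRAWL_RATE = 48.5            # pages per second, steady state (Section 5.2)
SECONDS_PER_DAY = 86_400


def crawl_days(page_count, rate=CRAWL_RATE):
    """Days needed to fetch `page_count` pages at a constant rate."""
    return page_count / rate / SECONDS_PER_DAY


print(round(crawl_days(24_000_000), 1))   # 5.7 days, under one week
print(round(crawl_days(100_000_000), 1))  # 23.9 days, under one month
```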
6.4 A Research Tool

In addition to being a high quality search engine, Google is a research tool. The data Google has collected has already resulted in many other papers submitted to conferences and many more on the way. Recent research such as [Abiteboul97] has shown a number of limitations to queries about the Web that may be answered without having the Web available locally. This means that Google (or a similar system) is not only a valuable research tool but a necessary one for a wide range of applications. We hope Google will be a resource for searchers and researchers all around the world and will spark the next generation of search engine technology.
7 Acknowledgments

Scott Hassan and Alan Steremberg have been critical to the development of Google. Their talented contributions are irreplaceable, and the authors owe them much gratitude. We would also like to thank Hector Garcia-Molina, Rajeev Motwani, Jeff Ullman, and Terry Winograd and the whole WebBase group for their support and insightful discussions. Finally we would like to recognize the generous support of our equipment donors IBM, Intel, and Sun and our funders. The research described here was conducted as part of the Stanford Integrated Digital Library Project, supported by the National Science Foundation under Cooperative Agreement IRI-9411306. Funding for this cooperative agreement is also provided by DARPA and NASA, and by Interval Research, and the industrial partners of the Stanford Digital Libraries Project.
References

Best of the Web 1994 Navigators http://botw.org/1994/awards/navigators.html
Bill Clinton Joke of the Day: April 14, 1997. http://www.io.com/~cjburke/clinton/970414.html.
Bzip2 Homepage http://www.muraroa.demon.co.uk/
Google Search Engine http://google.stanford.edu/
Harvest http://harvest.transarc.com/
Mauldin, Michael L. Lycos Design Choices in an Internet Search Service, IEEE Expert Interview http://www.computer.org/pubs/expert/1997/trends/x1008/mauldin.htm
The Effect of Cellular Phone Use Upon Driver Attention http://www.webfirst.com/aaa/text/cell/cell0toc.htm
Search Engine Watch http://www.searchenginewatch.com/
RFC 1950 (zlib) ftp://ftp.uu.net/graphics/png/documents/zlib/zdocindex.html
Robots Exclusion Protocol: http://info.webcrawler.com/mak/projects/robots/exclusion.htm
Web Growth Summary: http://www.mit.edu/people/mkgray/net/webgrowthsummary.html
Yahoo! http://www.yahoo.com/
[Abiteboul97] Serge Abiteboul and Victor Vianu, Queries and Computation on the Web. Proceedings of the International Conference on Database Theory. Delphi, Greece, 1997.
[Bagdikian97] Ben H. Bagdikian. The Media Monopoly. 5th Edition. Publisher: Beacon, ISBN: 0807061557
[Chakrabarti98] S. Chakrabarti, B. Dom, D. Gibson, J. Kleinberg, P. Raghavan and S. Rajagopalan. Automatic Resource Compilation by Analyzing Hyperlink Structure and Associated Text. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
[Cho98] Junghoo Cho, Hector Garcia-Molina, Lawrence Page. Efficient Crawling Through URL Ordering. Seventh International Web Conference (WWW 98). Brisbane, Australia, April 14-18, 1998.
[Gravano94] Luis Gravano, Hector Garcia-Molina, and A. Tomasic. The Effectiveness of GlOSS for the Text-Database Discovery Problem. Proc. of the 1994 ACM SIGMOD International Conference On Management Of Data, 1994.
[Kleinberg98] Jon Kleinberg, Authoritative Sources in a Hyperlinked Environment, Proc. ACM-SIAM Symposium on Discrete Algorithms, 1998.
[Marchiori97] Massimo Marchiori. The Quest for Correct Information on the Web: Hyper Search Engines. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
[McBryan94] Oliver A. McBryan. GENVL and WWWW: Tools for Taming the Web. First International Conference on the World Wide Web. CERN, Geneva (Switzerland), May 25-26-27, 1994. http://www.cs.colorado.edu/home/mcbryan/mypapers/www94.ps
[Page98] Lawrence Page, Sergey Brin, Rajeev Motwani, Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Manuscript in progress. http://google.stanford.edu/~backrub/pageranksub.ps
[Pinkerton94] Brian Pinkerton, Finding What People Want: Experiences with the WebCrawler. The Second International WWW Conference. Chicago, USA, October 17-20, 1994. http://info.webcrawler.com/bp/WWW94.html
[Spertus97] Ellen Spertus. ParaSite: Mining Structural Information on the Web. The Sixth International WWW Conference (WWW 97). Santa Clara, USA, April 7-11, 1997.
[TREC96] Proceedings of the fifth Text REtrieval Conference (TREC-5). Gaithersburg, Maryland, November 20-22, 1996. Publisher: Department of Commerce, National Institute of Standards and Technology. Editors: D. K. Harman and E. M. Voorhees. Full text at: http://trec.nist.gov/
[Witten94] Ian H Witten, Alistair Moffat, and Timothy C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. New York: Van Nostrand Reinhold, 1994.
[Weiss96] Ron Weiss, Bienvenido Velez, Mark A. Sheldon, Chanathip Manprempre, Peter Szilagyi, Andrzej Duda, and David K. Gifford. HyPursuit: A Hierarchical Network Search Engine that Exploits Content-Link Hypertext Clustering. Proceedings of the 7th ACM Conference on Hypertext. New York, 1996.
Vitae

Sergey Brin received his B.S. degree in mathematics and computer science from the University of Maryland at College Park in 1993. Currently, he is a Ph.D. candidate in computer science at Stanford University where he received his M.S. in 1995. He is a recipient of a National Science Foundation Graduate Fellowship. His research interests include search engines, information extraction from unstructured sources, and data mining of large text collections and scientific data.

Lawrence Page was born in East Lansing, Michigan, and received a B.S.E. in Computer Engineering at the University of Michigan Ann Arbor in 1995. He is currently a Ph.D. candidate in Computer Science at Stanford University. Some of his research interests include the link structure of the web, human computer interaction, search engines, scalability of information access interfaces, and personal data mining.
8 Appendix A: Advertising and Mixed Motives

Currently, the predominant business model for commercial search engines is advertising. The goals of the advertising business model do not always correspond to providing quality search to users. For example, in our prototype search engine one of the top results for cellular phone is "The Effect of Cellular Phone Use Upon Driver Attention", a study which explains in great detail the distractions and risk associated with conversing on a cell phone while driving. This search result came up first because of its high importance as judged by the PageRank algorithm, an approximation of citation importance on the web [Page98]. It is clear that a search engine which was taking money for showing cellular phone ads would have difficulty justifying the page that our system returned to its paying advertisers. For this type of reason and historical experience with other media [Bagdikian97], we expect that advertising funded search engines will be inherently biased towards the advertisers and away from the needs of the consumers.

Since it is very difficult even for experts to evaluate search engines, search engine bias is particularly insidious. A good example was OpenText, which was reported to be selling companies the right to be listed at the top of the search results for particular queries [Marchiori97]. This type of bias is much more insidious than advertising, because it is not clear who "deserves" to be there, and who is willing to pay money to be listed. This business model resulted in an uproar, and OpenText has ceased to be a viable search engine. But less blatant bias is likely to be tolerated by the market. For example, a search engine could add a small factor to search results from "friendly" companies, and subtract a factor from results from competitors. This type of bias is very difficult to detect but could still have a significant effect on the market. Furthermore, advertising income often provides an incentive to provide poor quality search results. For example, we noticed a major search engine would not return a large airline's home page when the airline's name was given as a query. It so happened that the airline had placed an expensive ad, linked to the query that was its name. A better search engine would not have required this ad, and possibly resulted in the loss of the revenue from the airline to the search engine. In general, it could be argued from the consumer point of view that the better the search engine is, the fewer advertisements will be needed for the consumer to find what they want. This of course erodes the advertising supported business model of the existing search engines. However, there will always be money from advertisers who want a customer to switch products, or have something that is genuinely new. But we believe the issue of advertising causes enough mixed incentives that it is crucial to have a competitive search engine that is transparent and in the academic realm.
9 Appendix B: Scalability

9.1 Scalability of Google

We have designed Google to be scalable in the near term to a goal of 100 million web pages. We have just received disk and machines to handle roughly that amount. All of the time consuming parts of the system are parallelizable and roughly linear time. These include things like the crawlers, indexers, and sorters. We also think that most of the data structures will deal gracefully with the expansion. However, at 100 million web pages we will be very close up against all sorts of operating system limits in the common operating systems (currently we run on both Solaris and Linux). These include things like addressable memory, number of open file descriptors, network sockets and bandwidth, and many others. We believe expanding to a lot more than 100 million pages would greatly increase the complexity of our system.
9.2 Scalability of Centralized Indexing Architectures

As the capabilities of computers increase, it becomes possible to index a very large amount of text for a reasonable cost. Of course, other more bandwidth intensive media such as video are likely to become more pervasive. But, because the cost of production of text is low compared to media like video, text is likely to remain very pervasive. Also, it is likely that soon we will have speech recognition that does a reasonable job converting speech into text, expanding the amount of text available. All of this provides amazing possibilities for centralized indexing. Here is an illustrative example. We assume we want to index everything everyone in the US has written for a year. We assume that there are 250 million people in the US and they write an average of 10k per day. That works out to be about 850 terabytes. Also assume that indexing a terabyte can be done now for a reasonable cost. We also assume that the indexing methods used over the text are linear, or nearly linear in their complexity. Given all these assumptions we can compute how long it would take before we could index our 850 terabytes for a reasonable cost assuming certain growth factors. Moore's Law, formulated in 1965, is commonly stated as a doubling every 18 months in processor power. It has held remarkably true, not just for processors, but for other important system parameters such as disk as well. If we assume that Moore's law holds for the future, we need only 10 more doublings, or 15 years, to reach our goal of indexing everything everyone in the US has written for a year for a price that a small company could afford. Of course, hardware experts are somewhat concerned Moore's Law may not continue to hold for the next 15 years, but there are certainly a lot of interesting centralized applications even if we only get part of the way to our hypothetical example.

Of course, distributed systems like GlOSS [Gravano94] or Harvest will often be the most efficient and elegant technical solution for indexing, but it seems difficult to convince the world to use these systems because of the high administration costs of setting up large numbers of installations. Of course, it is quite likely that reducing the administration cost drastically is possible. If that happens, and everyone starts running a distributed indexing system, searching would certainly improve drastically.

Because humans can only type or speak a finite amount, and as computers continue improving, text indexing will scale even better than it does now. Of course there could be an infinite amount of machine generated content, but just indexing huge amounts of human generated content seems tremendously useful. So we are optimistic that our centralized web search engine architecture will improve in its ability to cover the pertinent text information over time and that there is a bright future for search.
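The arithmetic behind the illustrative example in this section (850 terabytes, 10 doublings, 15 years) can be reproduced as follows. The sketch reads "10k per day" as 10 KiB and a terabyte as 2^40 bytes, which is an assumption, although it is the reading that yields the quoted figure of about 850.

```python
import math

people = 250_000_000
bytes_per_person_per_day = 10 * 1024    # "10k", read as 10 KiB (assumption)
days = 365

total_bytes = people * bytes_per_person_per_day * days
total_terabytes = total_bytes / 2**40   # binary terabytes (assumption)
print(round(total_terabytes))           # 850

# Starting from "one terabyte is affordable now", count the doublings
# needed before 850 terabytes becomes equally affordable.
doublings = math.ceil(math.log2(total_terabytes))
years = doublings * 18 / 12             # one doubling every 18 months
print(doublings, years)                 # 10 15.0
```

Ten doublings give a factor of 1024, which is just enough to cover the 850-fold gap between one terabyte and the target.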
