
Search Engines

Information Retrieval in Practice

All slides © Addison Wesley, 2008

Search Engine Architecture
A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
  describes a system at a particular level of abstraction

The architecture of a search engine is determined by 2 requirements
  effectiveness (quality of results) and efficiency (response time and throughput)

Indexing Process
Text acquisition
  identifies and stores documents for indexing

Text transformation
  transforms documents into index terms or features

Index creation
  takes index terms and creates data structures (indexes) to support fast searching

Query Process
User interaction
  supports creation and refinement of query, display of results

Ranking
  uses query and indexes to generate ranked list of documents

Evaluation
  monitors and measures effectiveness and efficiency (primarily offline)

Details: Text Acquisition
Crawler
  Identifies and acquires documents for search engine
  Many types: web, enterprise, desktop
  Web crawlers follow links to find documents
    Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
    Single site crawlers for site search
    Topical or focused crawlers for vertical search
  Document crawlers for enterprise and desktop search
    Follow links and scan directories
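
The link-following behavior of a web crawler can be sketched as a breadth-first traversal over a frontier of URLs. This is a minimal illustration, not a production crawler: the `fetch` callback, the `PAGES` dictionary, and the `example.test` URLs are assumptions standing in for real network access, and politeness, freshness, and robots.txt handling are left out.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collects the href targets of <a> tags in one page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed, fetch, max_pages=100):
    """Breadth-first crawl from `seed`; `fetch(url)` returns HTML or None."""
    frontier = deque([seed])
    seen = {seed}
    store = {}  # url -> raw HTML, a stand-in for the document data store
    while frontier and len(store) < max_pages:
        url = frontier.popleft()
        html = fetch(url)
        if html is None:
            continue
        store[url] = html
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            target = urljoin(url, href)  # resolve relative links
            if target not in seen:
                seen.add(target)
                frontier.append(target)
    return store


# Hypothetical three-page site held in memory instead of on the network.
PAGES = {
    "http://example.test/": '<a href="/a.html">A</a> <a href="b.html">B</a>',
    "http://example.test/a.html": '<a href="/">home</a>',
    "http://example.test/b.html": '<a href="/a.html">A</a>',
}
crawled = crawl("http://example.test/", PAGES.get)
```

The `seen` set keeps the crawler from revisiting pages, and `max_pages` bounds the crawl; a focused crawler would additionally filter the frontier by topic.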

Text Acquisition
Feeds
  Real-time streams of documents
    e.g., web feeds for news, blogs, video, radio, tv
  RSS is common standard
    RSS "reader" can provide new XML documents to search engine
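
As a rough sketch of how an RSS reader could hand new documents to a search engine, the snippet below parses an RSS 2.0 document with the standard library. The feed text, the `example.test` URLs, and the record field names (`title`, `url`, `text`) are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed content; a real reader would download this from a feed URL.
RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example News</title>
    <item>
      <title>First story</title>
      <link>http://example.test/1</link>
      <description>Text of the first story.</description>
    </item>
    <item>
      <title>Second story</title>
      <link>http://example.test/2</link>
      <description>Text of the second story.</description>
    </item>
  </channel>
</rss>"""


def read_feed(xml_text):
    """Turn each <item> of an RSS 2.0 feed into a small document record."""
    root = ET.fromstring(xml_text)
    docs = []
    for item in root.iter("item"):
        docs.append({
            "title": item.findtext("title", default=""),
            "url": item.findtext("link", default=""),
            "text": item.findtext("description", default=""),
        })
    return docs


feed_docs = read_feed(RSS)
```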

Conversion
  Convert variety of documents into a consistent text plus metadata format
    e.g., HTML, XML, Word, PDF, etc. to XML
  Convert text encoding for different languages
    Using a Unicode standard like UTF-8
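
The encoding step can be illustrated as decoding bytes from a known source encoding and re-encoding them as UTF-8. This is a sketch under the assumption that the source encoding is already known; detecting it is a separate problem. The record layout (`text`, `metadata`) is hypothetical.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a known source encoding and re-encode as UTF-8."""
    text = raw.decode(source_encoding)
    return text.encode("utf-8")


def convert(raw: bytes, source_encoding: str, doc_type: str) -> dict:
    """Wrap converted text with metadata in one consistent record format."""
    return {
        "text": to_utf8(raw, source_encoding).decode("utf-8"),
        "metadata": {"type": doc_type, "original_encoding": source_encoding},
    }


latin1_bytes = "café".encode("latin-1")  # one byte for the accented letter
record = convert(latin1_bytes, "latin-1", "text/plain")
```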

Text Acquisition
Document data store
  Stores text, metadata, and other related content for documents
    Metadata is information about document such as type and creation date
    Other content includes links, anchor text
  Provides fast access to document contents for search engine components
    e.g., result list generation
  Could use relational database system
    More typically, a simpler, more efficient storage system is used due to huge numbers of documents

Text Transformation
Parser
  Processing the sequence of text tokens in the document to recognize structural elements
    e.g., titles, links, headings, etc.
  Tokenizer recognizes "words" in the text
    must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
  Markup languages such as HTML, XML often used to specify structure
    Tags used to specify document elements
      E.g., <h2>Overview</h2>
    Document parser uses syntax of markup language (or other formatting) to identify structure
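
A tokenizer and a structure-aware parser might look roughly like the following. The token pattern (lowercase, keep internal hyphens and straight apostrophes) and the set of tracked tags are assumptions chosen for the example; real systems make these decisions per application.

```python
import re
from html.parser import HTMLParser

# Assumed rule: letters/digits, optionally joined by internal ' or -
TOKEN_RE = re.compile(r"[a-z0-9]+(?:['-][a-z0-9]+)*")


def tokenize(text):
    """Lowercase the text and keep internal hyphens and apostrophes."""
    return TOKEN_RE.findall(text.lower())


class StructureParser(HTMLParser):
    """Records the tokens found inside each tag of interest."""

    FIELDS = {"title", "h1", "h2", "a"}

    def __init__(self):
        super().__init__()
        self.open_fields = []
        self.fields = {}  # tag name -> list of tokens

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELDS:
            self.open_fields.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_fields:
            self.open_fields.remove(tag)

    def handle_data(self, data):
        for tag in self.open_fields:
            self.fields.setdefault(tag, []).extend(tokenize(data))


parser = StructureParser()
parser.feed("<title>Search Engines</title><h2>Overview</h2><p>It's state-of-the-art.</p>")
```

Here `parser.fields` ends up holding the title and heading tokens separately, which lets later components weight them differently from body text.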

Text Transformation
Stopping
  Remove common words
    e.g., "and", "or", "the", "in"
  Some impact on efficiency and effectiveness
  Can be a problem for some queries

Stemming
  Group words derived from a common stem
    e.g., "computer", "computers", "computing", "compute"
  Usually effective, but not for all queries
  Benefits vary for different languages
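
Stopping and stemming can be sketched in a few lines. The stopword list below is just the four examples given above, and the suffix-stripping rule is a deliberately crude assumption for illustration; it is not the Porter stemmer or any other published algorithm.

```python
STOPWORDS = {"and", "or", "the", "in"}           # the four examples above
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")  # assumed, deliberately crude


def remove_stopwords(tokens):
    """Drop tokens that appear in the stopword list."""
    return [t for t in tokens if t not in STOPWORDS]


def stem(word):
    """Strip the first matching suffix, keeping at least three letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


words = ["computer", "computers", "computing", "compute"]
stems = {stem(w) for w in words}  # all four words collapse to one stem
```

A query made up only of stopwords (for example the two tokens "the", "the") becomes empty after stopping, which is one way stopping "can be a problem for some queries".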

Text Transformation
Link Analysis
  Makes use of links and anchor text in web pages
  Link analysis identifies popularity and community information
    e.g., PageRank
  Anchor text can significantly enhance the representation of pages pointed to by links
  Significant impact on web search
    Less importance in other applications
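
PageRank, the example named above, can be approximated with power iteration. The sketch below assumes a damping factor of 0.85, a fixed number of iterations, and uniform redistribution of rank from pages without out-links; the four-page graph is hypothetical.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over `links`, a dict of page -> list of linked pages."""
    pages = set(links)
    for targets in links.values():
        pages.update(targets)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page starts each round with the "random jump" share.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # A page with no out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for t in pages:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: A, B and D all link to C; C links back to A.
graph = {"A": ["C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(graph)
```

Page C collects the most rank because three pages link to it, and A outranks B and D because it is the only page C links to.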

Text Transformation
Information Extraction
  Identify classes of index terms that are important for some applications
    e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.

Classifier
  Identifies class-related metadata for documents
    i.e., assigns labels to documents
    e.g., topics, reading levels, sentiment, genre
  Use depends on application

Index Creation
Document Statistics
  Gathers counts and positions of words and other features
  Used in ranking algorithm

Weighting
  Computes weights for index terms
  Used in ranking algorithm
    e.g., tf.idf weight
    Combination of term frequency in document and inverse document frequency in the collection
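
One common form of the tf.idf weight multiplies the raw term count in a document by log(N / df), where N is the number of documents and df is the number of documents containing the term. Many variants exist; the formula and the three-document collection below are assumptions chosen for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return {doc_id: {term: weight}} with weight = tf * log(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {
            term: count * math.log(n_docs / df[term])
            for term, count in tf.items()
        }
    return weights


docs = {
    "d1": ["search", "engine", "search"],
    "d2": ["search", "index"],
    "d3": ["web", "crawler"],
}
w = tf_idf(docs)
```

In this collection "search" occurs in two of three documents, so it gets a lower idf than "engine", which occurs in only one.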

Index Creation
Inversion
  Core of indexing process
  Converts document-term information to term-document for indexing
    Difficult for very large numbers of documents
  Format of inverted file is designed for fast query processing
    Must also handle updates
    Compression used for efficiency
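
Inversion can be sketched as turning a document-to-terms mapping into a term-to-postings mapping. This in-memory version assumes the whole collection fits in RAM and ignores updates and compression; the posting layout `(doc_id, [positions])` is one possible choice.

```python
from collections import defaultdict


def invert(docs):
    """Turn {doc_id: [tokens]} into {term: [(doc_id, [positions])]}."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # keep postings ordered by document id
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id]):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)


docs = {
    1: ["search", "engine", "search"],
    2: ["engine", "index"],
}
inverted = invert(docs)
```

Looking up a term now returns every document containing it without scanning the documents themselves, which is what makes query processing fast.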

Index Creation
Index Distribution
  Distributes indexes across multiple computers and/or multiple sites
  Essential for fast query processing with large numbers of documents
  Many variations
    Document distribution, term distribution, replication
  P2P and distributed IR involve search across multiple sites
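
Document distribution with replication could be sketched as assigning each document to a shard by a stable hash and copying it to the following shards. The use of CRC32, the shard count, and the "next shard" replica placement are all assumptions for this example, and `replicas` is assumed not to exceed `num_shards`.

```python
import zlib


def shard_for(doc_id: str, num_shards: int) -> int:
    """Stable assignment of a document to a shard (CRC32 is an assumed choice)."""
    return zlib.crc32(doc_id.encode("utf-8")) % num_shards


def distribute(doc_ids, num_shards, replicas=1):
    """Return {shard: [doc_ids]}; each document is placed on `replicas` shards."""
    shards = {s: [] for s in range(num_shards)}
    for doc_id in doc_ids:
        first = shard_for(doc_id, num_shards)
        for r in range(replicas):
            shards[(first + r) % num_shards].append(doc_id)
    return shards


doc_ids = [f"doc{i}" for i in range(10)]
placement = distribute(doc_ids, num_shards=3, replicas=2)
```

Term distribution would instead partition by index term, so that each machine holds the complete postings for a subset of the vocabulary.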

User Interaction
Query input
  Provides interface and parser for query language
  Most web queries are very simple, other applications may use forms
  Query language used to describe more complex queries and results of query transformation
    e.g., Boolean queries, Indri and Galago query languages
    similar to SQL language used in database applications
    IR query languages also allow content and structure specifications, but focus on content

User Interaction
Query transformation
  Improves initial query, both before and after initial search
  Includes text transformation techniques used for documents
  Spell checking and query suggestion provide alternatives to original query
  Query expansion and relevance feedback modify the original query with additional terms
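
Spell checking and query expansion can be sketched with the standard library. The vocabulary, the table of related terms, and the similarity cutoff of 0.8 are assumptions for illustration; real systems typically derive such data from query logs or the collection.

```python
import difflib

VOCABULARY = ["search", "engine", "index", "ranking", "crawler"]  # assumed
RELATED = {"search": ["retrieval"], "engine": ["system"]}         # assumed


def suggest(term, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the term itself if none is close."""
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.8)
    return matches[0] if matches else term


def expand(terms, related=RELATED):
    """Append related terms to the original query terms."""
    expanded = list(terms)
    for t in terms:
        for extra in related.get(t, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


corrected = [suggest(t) for t in ["serch", "engine"]]
query = expand(corrected)
```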

User Interaction
Results output
  Constructs the display of ranked documents for a query
  Generates snippets to show how queries match documents
  Highlights important words and passages
  Retrieves appropriate advertising in many applications
  May provide clustering and other visualization tools
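
Snippet generation with highlighting might be sketched as follows. The window size, the asterisk markers, and the "first match" rule are assumptions; production systems usually score several candidate passages and pick the best.

```python
def make_snippet(tokens, query_terms, window=3):
    """Show `window` words either side of the first query-term match."""
    query = {t.lower() for t in query_terms}
    hit = next((i for i, t in enumerate(tokens) if t.lower() in query), None)
    if hit is None:
        # No match: fall back to the start of the document.
        return " ".join(tokens[: 2 * window + 1])
    start = max(0, hit - window)
    end = min(len(tokens), hit + window + 1)
    words = [f"*{t}*" if t.lower() in query else t for t in tokens[start:end]]
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(tokens) else ""
    return prefix + " ".join(words) + suffix


text = "A crawler finds pages and the index supports fast ranking of pages".split()
snippet = make_snippet(text, ["index"])
```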

Ranking
Scoring
  Calculates scores for documents using a ranking algorithm
  Core component of search engine
  Basic form of score is $\sum_i q_i d_i$
    $q_i$ and $d_i$ are query and document term weights for term $i$
  Many variations of ranking algorithms and retrieval models
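
The basic score, a sum over terms of query weight times document weight, can be written directly. The term weights in the example collection are made up; in practice they would come from the weighting step of index creation.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the terms the query and document share."""
    return sum(q * doc_weights.get(term, 0.0) for term, q in query_weights.items())


def rank(query_weights, collection):
    """Return (doc_id, score) pairs sorted by decreasing score."""
    scored = [(doc_id, score(query_weights, d)) for doc_id, d in collection.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical term weights; in practice these would come from the index.
collection = {
    "d1": {"search": 0.8, "engine": 1.1},
    "d2": {"search": 0.4, "index": 1.1},
    "d3": {"web": 1.1, "crawler": 1.1},
}
ranking = rank({"search": 1.0, "engine": 1.0}, collection)
```

Terms absent from a document contribute zero, so only terms shared by the query and the document affect the score.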

Ranking
Performance optimization
  Designing ranking algorithms for efficient processing
    Term-at-a-time vs. document-at-a-time processing
    Safe vs. unsafe optimizations

Distribution
  Processing queries in a distributed environment
  Query broker distributes queries and assembles results
  Caching is a form of distributed searching
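
A query broker that fans a query out to index shards, merges the partial rankings, and caches the result might be sketched like this. The in-memory `SHARDS` list, the scores, and the cache size are assumptions; a real broker would call remote index servers.

```python
import heapq
from functools import lru_cache

# Hypothetical shards: each maps a query string to locally ranked (score, doc_id) pairs.
SHARDS = [
    {"search engine": [(2.4, "d1"), (0.9, "d7")]},
    {"search engine": [(1.8, "d4")]},
    {"search engine": [(2.1, "d9"), (0.2, "d3")]},
]


@lru_cache(maxsize=1024)
def broker(query, k=3):
    """Send the query to every shard and merge the partial results into a top-k list."""
    partial = []
    for shard in SHARDS:
        partial.extend(shard.get(query, []))
    top = heapq.nlargest(k, partial)  # highest scores first
    return tuple(doc_id for _, doc_id in top)


first = broker("search engine")
second = broker("search engine")  # answered from the cache, no shard is contacted
```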

Evaluation
Logging
  Logging user queries and interaction is crucial for improving search effectiveness and efficiency
  Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components

Ranking analysis
  Measuring and tuning ranking effectiveness

Performance analysis
  Measuring and tuning system efficiency
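
To illustrate how a query log could feed other components, the sketch below counts queries and clicks and uses the counts for prefix-based query suggestion. The log entries and the `(query, clicked document)` layout are invented for the example.

```python
from collections import Counter

# Hypothetical log entries: (query, clicked document or None).
LOG = [
    ("search engine", "d1"),
    ("search engines", "d1"),
    ("search engine", "d4"),
    ("seattle weather", None),
    ("search engine", "d1"),
]

query_counts = Counter(q for q, _ in LOG)
click_counts = Counter((q, d) for q, d in LOG if d is not None)


def suggest_queries(prefix, k=2):
    """Most frequent logged queries that start with `prefix`."""
    matches = [(n, q) for q, n in query_counts.items() if q.startswith(prefix)]
    # Sort by decreasing frequency, then alphabetically to break ties.
    return [q for n, q in sorted(matches, key=lambda m: (-m[0], m[1]))[:k]]
```

The same click counts could serve as a ranking signal, since documents clicked often for a query are likely to be relevant to it.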

How Does It Really Work?
This course explains these components of a search engine in more detail
Often many possible approaches and techniques for a given component
  Focus is on the most important alternatives
    i.e., explain a small number of approaches in detail rather than many approaches
  "Importance" based on research results and use in actual search engines
  Alternatives described in references
