Search Engines: Information Retrieval in Practice
All slides © Addison Wesley, 2008
Search Engine Architecture
A software architecture consists of software components, the interfaces provided by those components, and the relationships between them
describes a system at a particular level of abstraction
Architecture of a search engine determined by 2 requirements
effectiveness (quality of results) and efficiency (response time and throughput)
Indexing Process
Text acquisition
identifies and stores documents for indexing
Text transformation
transforms documents into index terms or features
Index creation
takes index terms and creates data structures (indexes) to support fast searching
Query Process
User interaction
supports creation and refinement of query, display of results
Ranking
uses query and indexes to generate ranked list of documents
Evaluation
monitors and measures effectiveness and efficiency (primarily offline)
Details: Text Acquisition
Crawler
Identifies and acquires documents for search engine
Many types: web, enterprise, desktop
Web crawlers follow links to find documents
Must efficiently find huge numbers of web pages (coverage) and keep them up-to-date (freshness)
Single site crawlers for site search
Topical or focused crawlers for vertical search
Document crawlers for enterprise and desktop search
Follow links and scan directories
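The link-following behaviour of a web crawler can be sketched as a breadth-first traversal. This is only a minimal illustration: the `TOY_WEB` dictionary, the `crawl` function and the example URLs are hypothetical stand-ins for real page fetching, and freshness, politeness and coverage policies are left out.

```python
from collections import deque

# Hypothetical in-memory "web": URL -> outgoing links (stands in for real fetching).
TOY_WEB = {
    "http://a.example/": ["http://a.example/1", "http://b.example/"],
    "http://a.example/1": ["http://a.example/"],
    "http://b.example/": ["http://b.example/x", "http://a.example/1"],
    "http://b.example/x": [],
}


def crawl(seed, fetch_links, limit=100):
    """Breadth-first crawl: follow links from the seed, visiting each URL once."""
    seen = {seed}
    frontier = deque([seed])
    visited = []
    while frontier and len(visited) < limit:
        url = frontier.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:  # avoid re-queueing pages already discovered
                seen.add(link)
                frontier.append(link)
    return visited


pages = crawl("http://a.example/", lambda url: TOY_WEB.get(url, []))
```

Restricting `fetch_links` to one host would give the single site crawler described above; a focused crawler would additionally filter pages by topic.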
Text Acquisition
Feeds
Real-time streams of documents
e.g., web feeds for news, blogs, video, radio, tv
RSS is common standard
RSS reader can provide new XML documents to search engine
Conversion
Convert variety of documents into a consistent text plus metadata format
e.g., HTML, XML, Word, PDF, etc. to XML
Convert text encoding for different languages
Using a Unicode standard like UTF-8
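As a rough sketch of both steps, the snippet below reads a small RSS feed that is encoded in ISO-8859-1, turns each item into a text-plus-metadata record, and re-encodes the text as UTF-8. The feed content, the `feed_to_documents` name and the record layout are assumptions made for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed, stored as ISO-8859-1 bytes (\u00e9 is "e" with an acute accent).
RSS_BYTES = """<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0"><channel><title>Example news</title>
<item><title>Caf\u00e9 opens</title><link>http://news.example/1</link>
<pubDate>Mon, 06 Oct 2008 10:00:00 GMT</pubDate>
<description>A new caf\u00e9 opened today.</description></item>
</channel></rss>""".encode("iso-8859-1")


def feed_to_documents(raw):
    """Convert each RSS item into a consistent text + metadata record."""
    root = ET.fromstring(raw)  # the parser honours the declared encoding
    docs = []
    for item in root.iter("item"):
        docs.append({
            "text": item.findtext("description", default=""),
            "metadata": {
                "title": item.findtext("title", default=""),
                "url": item.findtext("link", default=""),
                "date": item.findtext("pubDate", default=""),
            },
        })
    return docs


docs = feed_to_documents(RSS_BYTES)
utf8_text = docs[0]["text"].encode("utf-8")  # stored form uses UTF-8
```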
Text Acquisition
Document data store
Stores text, metadata, and other related content for documents
Metadata is information about document such as type and creation date
Other content includes links, anchor text
Provides fast access to document contents for search engine components
e.g., result list generation
Could use relational database system
More typically, a simpler, more efficient storage system is used due to huge numbers of documents
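One way to picture a "simpler" store is a key-value map from document identifier to a compressed record. The class below is a sketch under that assumption; the `DocumentStore` name, the JSON record layout and the use of zlib are illustrative choices, not something the slides specify.

```python
import json
import zlib


class DocumentStore:
    """Toy key-value document store: doc_id -> compressed JSON record."""

    def __init__(self):
        self._records = {}

    def put(self, doc_id, text, metadata, links=()):
        record = {"text": text, "metadata": metadata, "links": list(links)}
        self._records[doc_id] = zlib.compress(json.dumps(record).encode("utf-8"))

    def get(self, doc_id):
        # Fast lookup by key, e.g. when generating a result list.
        return json.loads(zlib.decompress(self._records[doc_id]).decode("utf-8"))


store = DocumentStore()
store.put(
    "d1",
    "Tropical fish are popular aquarium fish.",
    {"type": "html", "created": "2008-10-06"},
    links=["http://a.example/"],
)
record = store.get("d1")
```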
Text Transformation
Parser
Processing the sequence of text tokens in the document to recognize structural elements
e.g., titles, links, headings, etc.
Tokenizer recognizes words in the text
must consider issues like capitalization, hyphens, apostrophes, non-alpha characters, separators
Markup languages such as HTML, XML often used to specify structure
Tags used to specify document elements
E.g., <h2> Overview </h2>
Document parser uses syntax of markup language (or other formatting) to identify structure
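The following sketch shows one possible tokenizer and a small structure-aware parser built on Python's standard `html.parser` module. The tokenization rule (lowercase, keep internal hyphens and apostrophes) and the set of tracked tags are assumptions chosen for the example; real systems make these decisions per application.

```python
import re
from html.parser import HTMLParser


def tokenize(text):
    """Lowercase the text and keep runs of letters/digits, allowing internal ' and -."""
    return re.findall(r"[a-z0-9]+(?:['-][a-z0-9]+)*", text.lower())


class StructureParser(HTMLParser):
    """Collects the text found inside selected structural elements."""

    TRACKED = {"title", "h1", "h2", "a"}  # hypothetical choice of elements

    def __init__(self):
        super().__init__()
        self._open = []      # stack of currently open tracked tags
        self.elements = []   # (tag, text) pairs in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.TRACKED:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        if self._open and data.strip():
            self.elements.append((self._open[-1], data.strip()))


parser = StructureParser()
parser.feed(
    "<html><head><title>IR</title></head><body><h2>Overview</h2>"
    '<p>Text with a <a href="x.html">link</a>.</p></body></html>'
)
tokens = tokenize("O'Neil's state-of-the-art Web-Crawler, v2.0!")
```

Note how the version string "v2.0" is split at the period: this is the kind of non-alpha character issue the tokenizer has to take a position on.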
Text Transformation
Stopping
Remove common words
e.g., and, or, the, in
Some impact on efficiency and effectiveness
Can be a problem for some queries
Stemming
Group words derived from a common stem
e.g., computer, computers, computing, compute
Usually effective, but not for all queries
Benefits vary for different languages
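A minimal sketch of both steps follows. The stopword list and the suffix list are hypothetical, and `crude_stem` is a deliberately naive suffix stripper, not the Porter stemmer or any other published algorithm.

```python
# Hypothetical stopword list; real lists are longer and application-specific.
STOPWORDS = {"and", "or", "the", "in", "a", "of", "to", "be", "not"}

# Suffixes tried in order; longer ones first so "computers" loses "ers", not "s".
SUFFIXES = ("ing", "ers", "er", "es", "s", "e")


def remove_stopwords(tokens):
    return [t for t in tokens if t not in STOPWORDS]


def crude_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {crude_stem(w) for w in ["computer", "computers", "computing", "compute"]}
problem_query = remove_stopwords("to be or not to be".split())
```

With this stopword list the query "to be or not to be" is reduced to nothing, which illustrates why stopping can be a problem for some queries.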
Text Transformation
Link Analysis
Makes use of links and anchor text in web pages
Link analysis identifies popularity and community information
e.g., PageRank
Anchor text can significantly enhance the representation of pages pointed to by links
Significant impact on web search
Less importance in other applications
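To make the PageRank example concrete, here is a small power-iteration sketch. The damping factor of 0.85, the fixed iteration count, the handling of pages without outlinks and the four-page graph are all assumptions for illustration; every link target must also appear as a key in `links`.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping page -> list of outgoing links."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page: spread its rank evenly over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical graph: C is linked to by A, B and D; nothing links to D.
GRAPH = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
ranks = pagerank(GRAPH)
```

In this graph C receives the highest score and D the lowest, matching the intuition that pages with many incoming links are more popular.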
Text Transformation
Information Extraction
Identify classes of index terms that are important for some applications
e.g., named entity recognizers identify classes such as people, locations, companies, dates, etc.
Classifier
Identifies class-related metadata for documents
i.e., assigns labels to documents
e.g., topics, reading levels, sentiment, genre
Use depends on application
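As a toy illustration only, the sketch below recognizes one class of index term (ISO-style dates) with a regular expression and assigns a topic label by keyword overlap. Real extractors and classifiers are usually learned from data; the pattern, the `TOPIC_KEYWORDS` table and both function names are hypothetical.

```python
import re

DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b")  # matches e.g. 2008-10-06

# Hypothetical keyword lists per topic label.
TOPIC_KEYWORDS = {
    "sports": {"match", "team", "score"},
    "finance": {"stock", "market", "bank"},
}


def extract_dates(text):
    """Return the date-like strings found in the text."""
    return DATE_PATTERN.findall(text)


def classify(text):
    """Return the topic with the most keyword hits, or None if there are none."""
    words = set(re.findall(r"[a-z]+", text.lower()))
    scores = {topic: len(words & kws) for topic, kws in TOPIC_KEYWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


dates = extract_dates("Meeting on 2008-10-06 and again on 2008-11-01.")
label = classify("The team won the match")
```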
Index Creation
Document Statistics
Gathers counts and positions of words and other features
Used in ranking algorithm
Weighting
Computes weights for index terms
Used in ranking algorithm
e.g., tf.idf weight
Combination of term frequency in document and inverse document frequency in the collection
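The sketch below computes one common tf.idf variant, raw term frequency multiplied by log(N / df), where N is the number of documents and df the number of documents containing the term. Many other variants exist; this particular formula, the `tf_idf` name and the three-document collection are assumptions for the example.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: doc_id -> list of tokens. Returns doc_id -> {term: weight}."""
    n = len(docs)
    df = Counter()
    for tokens in docs.values():
        df.update(set(tokens))  # count each term once per document
    weights = {}
    for doc_id, tokens in docs.items():
        tf = Counter(tokens)
        weights[doc_id] = {t: tf[t] * math.log(n / df[t]) for t in tf}
    return weights


COLLECTION = {
    "d1": ["search", "engine", "search"],
    "d2": ["search", "index"],
    "d3": ["engine", "ranking"],
}
weights = tf_idf(COLLECTION)
```

Under this variant a term that occurs in every document gets weight zero, since log(N / N) = 0.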
Index Creation
Inversion
Core of indexing process
Converts document-term information to term-document for indexing
Difficult for very large numbers of documents
Format of inverted file is designed for fast query processing
Must also handle updates
Compression used for efficiency
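A minimal in-memory sketch of inversion is shown below, together with gap (delta) encoding of document numbers, which is one common first step in compressing postings. The function names and the tiny collection are illustrative assumptions; large-scale inversion needs disk-based merging and update handling that this sketch leaves out.

```python
from collections import defaultdict


def invert(docs):
    """docs: doc_id -> list of tokens. Returns term -> [(doc_id, [positions])]."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # postings end up ordered by document number
        positions = defaultdict(list)
        for pos, term in enumerate(docs[doc_id]):
            positions[term].append(pos)
        for term, pos_list in positions.items():
            index[term].append((doc_id, pos_list))
    return dict(index)


def delta_encode(doc_ids):
    """Replace sorted document numbers by the gaps between them."""
    gaps, previous = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


DOCS = {1: ["search", "engine", "search"], 2: ["search", "index"], 4: ["engine"]}
index = invert(DOCS)
gaps = delta_encode([1, 2, 4])
```

Gaps are small numbers, so they can be stored in fewer bits than the original document numbers.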
Index Creation
Index Distribution
Distributes indexes across multiple computers and/or multiple sites
Essential for fast query processing with large numbers of documents
Many variations
Document distribution, term distribution, replication
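The difference between document distribution and term distribution can be sketched with a simple hash partitioner. The use of CRC32 as the hash, the shard count and the function names are assumptions for illustration; replication would amount to storing each shard on more than one machine.

```python
import zlib


def shard_of(key, num_shards):
    """Stable shard assignment (crc32 gives the same value on every run)."""
    return zlib.crc32(key.encode("utf-8")) % num_shards


def distribute_by_document(docs, num_shards):
    """Each shard holds complete documents and would build its own index."""
    shards = [dict() for _ in range(num_shards)]
    for doc_id, tokens in docs.items():
        shards[shard_of(doc_id, num_shards)][doc_id] = tokens
    return shards


def distribute_by_term(index, num_shards):
    """Each shard holds the full postings lists for a subset of the terms."""
    shards = [dict() for _ in range(num_shards)]
    for term, postings in index.items():
        shards[shard_of(term, num_shards)][term] = postings
    return shards


DOCS = {"d1": ["search", "engine"], "d2": ["search"], "d3": ["index"], "d4": ["engine"]}
INDEX = {"search": ["d1", "d2"], "engine": ["d1", "d4"], "index": ["d3"]}
doc_shards = distribute_by_document(DOCS, 2)
term_shards = distribute_by_term(INDEX, 2)
```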
User Interaction
Query input
Provides interface and parser for query language
Most web queries are very simple, other applications may use forms
Query language used to describe more complex queries and results of query transformation
e.g., Boolean queries, Indri and Galago query languages
similar to SQL language used in database applications
IR query languages also allow content and structure specifications, but focus on content
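As an illustration of a query-language parser, the sketch below evaluates a flat Boolean query strictly left to right against sets of document numbers. This mini-language (with NOT meaning "and not") is a hypothetical construct for the example and is not the Indri or Galago syntax.

```python
def evaluate(query, index):
    """Evaluate e.g. 'search AND engine NOT index' left to right.

    index maps a term to the set of documents that contain it.
    """
    tokens = query.split()
    result = set(index.get(tokens[0].lower(), set()))
    # Remaining tokens come in (operator, term) pairs.
    for op, term in zip(tokens[1::2], tokens[2::2]):
        postings = index.get(term.lower(), set())
        if op == "AND":
            result &= postings
        elif op == "OR":
            result |= postings
        elif op == "NOT":
            result -= postings
        else:
            raise ValueError("unknown operator: " + op)
    return result


INDEX = {"search": {1, 2}, "engine": {1, 3}, "index": {2}}
both = evaluate("search AND engine", INDEX)
either = evaluate("search OR engine", INDEX)
without = evaluate("search NOT index", INDEX)
```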
User Interaction
Query transformation
Improves initial query, both before and after initial search
Includes text transformation techniques used for documents
Spell checking and query suggestion provide alternatives to original query
Query expansion and relevance feedback modify the original query with additional terms
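The sketch below illustrates two of these transformations: spelling correction by closest vocabulary match, using the standard library's `difflib.get_close_matches`, and query expansion from a table of related terms. The vocabulary, the `RELATED` table and the 0.7 similarity cutoff are hypothetical; production systems typically derive such data from query logs and the collection.

```python
import difflib

VOCABULARY = ["search", "engine", "retrieval", "ranking", "index"]
RELATED = {"search": ["retrieval"]}  # hypothetical expansion table


def suggest_spelling(term, vocabulary=VOCABULARY):
    """Return the closest vocabulary word, or the term itself if none is close."""
    if term in vocabulary:
        return term
    matches = difflib.get_close_matches(term, vocabulary, n=1, cutoff=0.7)
    return matches[0] if matches else term


def expand(terms, related=RELATED):
    """Append related terms to the original query terms, without duplicates."""
    expanded = list(terms)
    for term in terms:
        for extra in related.get(term, []):
            if extra not in expanded:
                expanded.append(extra)
    return expanded


corrected = [suggest_spelling(t) for t in ["serach", "engine"]]
expanded = expand(corrected)
```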
User Interaction
Results output
Constructs the display of ranked documents for a query
Generates snippets to show how queries match documents
Highlights important words and passages
Retrieves appropriate advertising in many applications
May provide clustering and other visualization tools
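Snippet generation with highlighting can be sketched as taking a window of words around the first query-term match. The window size, the `**` highlight marker and the `make_snippet` name are assumptions for this example; real snippet algorithms score several candidate passages.

```python
import re


def make_snippet(text, query_terms, width=5):
    """Return a window of words around the first match, with matches highlighted."""
    words = text.split()
    terms = {t.lower() for t in query_terms}

    def norm(word):
        return re.sub(r"\W+", "", word).lower()  # drop punctuation for matching

    hits = [i for i, w in enumerate(words) if norm(w) in terms]
    if not hits:
        return " ".join(words[: 2 * width])  # fall back to the document start
    start = max(0, hits[0] - width)
    end = min(len(words), hits[0] + width + 1)
    window = ["**" + w + "**" if norm(w) in terms else w for w in words[start:end]]
    prefix = "... " if start > 0 else ""
    suffix = " ..." if end < len(words) else ""
    return prefix + " ".join(window) + suffix


TEXT = "Tropical fish are popular aquarium fish, due to their often bright coloration."
snippet = make_snippet(TEXT, ["aquarium"], width=2)
```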
Ranking
Scoring
Calculates scores for documents using a ranking algorithm
Core component of search engine
Basic form of score is $\sum_i q_i d_i$
$q_i$ and $d_i$ are query and document term weights for term $i$
Many variations of ranking algorithms and retrieval models
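The basic score, a sum over terms of query weight times document weight, can be sketched as a sparse dot product. The weights and document identifiers below are made up for illustration, and the tie-breaking rule (by document identifier) is an assumption.

```python
def score(query_weights, doc_weights):
    """Sum of q_i * d_i over the query terms (terms missing from the doc add 0)."""
    return sum(q * doc_weights.get(term, 0.0) for term, q in query_weights.items())


def rank(query_weights, collection):
    """Return (score, doc_id) pairs for matching documents, best first."""
    scored = [(score(query_weights, w), doc_id) for doc_id, w in collection.items()]
    scored = [(s, d) for s, d in scored if s > 0]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))


# Hypothetical term weights, e.g. produced by a tf.idf weighting step.
COLLECTION = {
    "d1": {"search": 2.0, "engine": 1.0},
    "d2": {"search": 1.0, "index": 3.0},
    "d3": {"ranking": 1.0},
}
ranking = rank({"search": 1.0, "engine": 1.0}, COLLECTION)
```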
Ranking
Performance optimization
Designing ranking algorithms for efficient processing
Term-at-a-time vs. document-at-a-time processing
Safe vs. unsafe optimizations
Distribution
Processing queries in a distributed environment
Query broker distributes queries and assembles results
Caching is a form of distributed searching
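The sketch below combines three of these ideas in a simplified form: each shard is scored term-at-a-time using accumulators, a broker sends the query to every shard and merges the results, and a cache returns the stored ranking for a repeated query. The class and function names, the postings layout and the unbounded cache are assumptions; shards are assumed to hold disjoint sets of documents.

```python
import heapq
from collections import defaultdict


def term_at_a_time(query_terms, postings):
    """postings: term -> [(doc_id, weight)]. Finish one term's list before the next."""
    accumulators = defaultdict(float)
    for term in query_terms:
        for doc_id, weight in postings.get(term, []):
            accumulators[doc_id] += weight
    return dict(accumulators)


class QueryBroker:
    """Sends a query to every shard, merges the scores and caches the top k."""

    def __init__(self, shards):
        self.shards = shards
        self.cache = {}

    def search(self, query_terms, k=10):
        key = (tuple(sorted(query_terms)), k)
        if key in self.cache:  # repeated query: skip the shards entirely
            return self.cache[key]
        candidates = []
        for shard in self.shards:
            candidates.extend(term_at_a_time(query_terms, shard).items())
        top = heapq.nlargest(k, candidates, key=lambda item: item[1])
        self.cache[key] = top
        return top


SHARDS = [
    {"search": [("d1", 2.0), ("d2", 1.0)], "engine": [("d1", 1.0)]},
    {"search": [("d3", 0.5)], "engine": [("d3", 3.0)]},
]
broker = QueryBroker(SHARDS)
top_two = broker.search(["search", "engine"], k=2)
```

Document-at-a-time processing would instead walk all the query terms' postings lists in parallel and finish scoring one document before moving to the next.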
Evaluation
Logging
Logging user queries and interaction is crucial for improving search effectiveness and efficiency
Query logs and clickthrough data used for query suggestion, spell checking, query caching, ranking, advertising search, and other components
Ranking analysis
Measuring and tuning ranking effectiveness
Performance analysis
Measuring and tuning system efficiency
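As a small illustration of how log data might be aggregated, the sketch below counts how often each query was issued and which results were clicked for it. The log format, one (query, clicked URL or None) pair per query, and all the entries are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical log: one entry per issued query, with the clicked URL (or None).
LOG = [
    ("tropical fish", "http://a.example/fish"),
    ("tropical fish", "http://b.example/tank"),
    ("tropical fish", "http://a.example/fish"),
    ("search engines", None),
]


def summarize(log):
    """Return (query frequencies, per-query click counts)."""
    query_counts = Counter()
    clicks = defaultdict(Counter)
    for query, clicked_url in log:
        query_counts[query] += 1
        if clicked_url is not None:
            clicks[query][clicked_url] += 1
    return query_counts, clicks


query_counts, clicks = summarize(LOG)
popular = query_counts.most_common(1)          # candidates for caching or suggestion
best_click = clicks["tropical fish"].most_common(1)  # a possible ranking signal
```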
How Does It Really Work?
This course explains these components of a search engine in more detail
Often many possible approaches and techniques for a given component
Focus is on the most important alternatives
i.e., explain a small number of approaches in detail rather than many approaches
Importance based on research results and use in actual search engines
Alternatives described in references