Apache Solr Beyond The Box

Chris Hostetter
2008-11-05
http://people.apache.org/~hossman/apachecon2008us/ http://lucene.apache.org/solr/

Why Are We Here?

Plugins!
● What, How, Where, When, Why?
● Solr Internals In A Nutshell
● Real World Examples
● Testing
● Questions

What, How, Where, Who, When, Why?

What Is Solr (To Users)
● Information Retrieval Application
● Index/Query Via HTTP
● Comprehensive HTML Administration Interfaces
● Scalability - Efficient Replication To Other Solr Search Servers
● Highly Configurable Caching
● Flexible And Adaptable With XML Configuration
    Customizable Request Handlers And Response Writers
    Data Schema With Dynamic Fields And Unique Keys
    Analyzers Created At Runtime From Tokenizers And TokenFilters

What Is Solr (To Developers)
● Information Retrieval Application
● Java5 WebApp (WAR) With A Web Services-ish API
● Extensible Plugin Architecture
● MVC-ish Framework Around The Java Lucene Search Library
● Allows Custom Business Logic and Text Analysis Rules To Live Close To The Data
● Abstracts Away The Tricky Stuff:
    Index Consistency
    Data Replication
    Cache Management

How It Started

When/Why To Write A Plugin
“X can be done more efficiently closer to the data.”
OR
“To force X for all clients.”

Solr Internals In A Nutshell

50,000' View
[Diagram. Labels: HTTP, Java, SolrDispatchFilter, EmbeddedSolrServer, CoreContainer, SolrCore (several), SolrRequestHandler, SolrQuery(Request/Response), QueryResponseWriter]

MVC-ish
● SolrRequestHandler ... A Controller
    handleRequest( SolrQueryRequest, SolrQueryResponse )
● SolrQueryRequest ... An Event (++)
    Input Parameters
    List of ContentStreams
    Maintains SolrCore & SolrIndexSearcher References
● SolrQueryResponse ... Model
    Tree of "Simple" Objects and DocLists
● ResponseWriter ... View
    write(Writer, SolrQueryRequest, SolrQueryResponse)

Hello World
public class HelloWorld extends RequestHandlerBase {
  public void handleRequestBody(SolrQueryRequest req,
                                SolrQueryResponse rsp) {
    String name = req.getParams().get("name");
    Integer age = req.getParams().getInt("age");
    rsp.add("greeting", "Hello " + name);
    rsp.add("yourage", age);
  }
  public String getVersion() { return "$Revision:$"; }
  public String getSourceId() { return "$URL:$"; }
  public String getSource() { return "$Id:$"; }
  public String getDescription() { return "Says Hello"; }
}

Hello World Output
http://localhost:8983/solr/hello?name=Hoss&age=32&wt=xml
    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">1</int>
      </lst>
      <str name="greeting">Hello Hoss</str>
      <int name="yourage">32</int>
    </response>
http://localhost:8983/solr/hello?name=Hoss&age=32&wt=json
    { "responseHeader":{ "status":0, "QTime":1},
      "greeting":"Hello Hoss",
      "yourage":32
    }

Types Of Plugins
● SolrRequestHandler
    SearchComponent
    QParserPlugin
    ValueSourceParser
● QueryResponseWriter
● Similarity(Factory)
● Analyzer
    TokenizerFactory
    TokenFilterFactory
● FieldType
● SolrHighlighter
    SolrFragmenter
    SolrFormatter
● SolrCache
    CacheRegenerator
● UpdateRequestProcessorFactory
● SolrEventListener
● UpdateHandler
Italics: Only One Per SolrCore
Color: Likelihood Of Needing To Write Your Own

Real World Examples

Tibetan And Himalayan Digital Library Tools

Tsheg Analysis Factories
   public class TshegBarTokenizerFactory
                 extends BaseTokenizerFactory {
     public TokenStream create(Reader input) {
       return new TshegBarTokenizer(input);
     }
   }
   public class EdgeTshegTrimmerFactory
                 extends BaseTokenFilterFactory {
     public TokenStream create(TokenStream input) {
       return new EdgeTshegTrimmer(input);
     }
   }

DFLL

DFLL: Faceted Browsing

DFLL Category Metadata
● Category ID and Label: 3126 == “Tablet PCs”
● Category Query: tablet_form:[* TO *]
● Ordered List of Facets
    Facet ID and Label: 500016 == “OS Provided”
    Facet Display Info: Count vs. Alphabetical, etc...
    Ordered List of Constraints
    ● Constraint ID and Label: 111536 == “Apple OS X”
    ● Constraint Query: os:(“OSX10.1” “OSX10.2” ...)

DfllHandler Pseudo-Code
Document catMetaDoc = searcher.getFirstMatch(catDocId)
Metadata m = parseAndCacheMetadata(catMetaDoc, searcher)
m = m.clone()
DocListAndSet results =
              searcher.getDocListAndSet(m.catQuery, ...)
response.add(“products”, results.docList)
foreach (Facet f : m) {
  foreach (Constraint c : f) {
    c.setCount(searcher.numDocs(c.query,
                                results.docSet))
  }
}
response.add(“metadata”, m.asSimpleObjects())

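Conceptual Picture
[Diagram. getDocListAndSet(Query, Query[], Sort, offset, n) is shown with the example inputs tablet_form:[* TO *], os:(“OSX10.1” “OSX10.2” ...), memory:[1GB TO *] and price asc. It produces a DocList (section of ordered results, used for the query response) and a DocSet (unordered set of all results). numDocs() is applied to the DocSet for each constraint query to get its count, e.g. price:[500 TO 1000] = 689 and manu:Dell = 104; the other constraint queries shown are proc_manu:Intel, proc_manu:AMD, price:[0 TO 500], manu:HP and manu:Lenovo.]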
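
The counting step in the picture boils down to intersecting the full result set with the set of documents matching each constraint query. Below is a minimal sketch of that idea in plain Java, not Solr code: `FacetCountSketch`, `intersectionSize` and `countConstraints` are hypothetical names, and `Set<Integer>` of document ids stands in for Solr's DocSet.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Plain-Java illustration only; Set<Integer> is a stand-in for a DocSet of document ids. */
final class FacetCountSketch {

    /** Counts ids present in both sets, iterating over the smaller one. */
    static int intersectionSize(Set<Integer> a, Set<Integer> b) {
        Set<Integer> smaller = a.size() <= b.size() ? a : b;
        Set<Integer> larger = (smaller == a) ? b : a;
        int count = 0;
        for (Integer id : smaller) {
            if (larger.contains(id)) {
                count++;
            }
        }
        return count;
    }

    /**
     * For each constraint (keyed by its query string), counts how many of the
     * matching documents are also in the overall result set. Insertion order
     * of the constraints is preserved, mirroring the "ordered list" on the slide.
     */
    static Map<String, Integer> countConstraints(Set<Integer> results,
                                                 Map<String, Set<Integer>> constraintMatches) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Set<Integer>> e : constraintMatches.entrySet()) {
            counts.put(e.getKey(), intersectionSize(results, e.getValue()));
        }
        return counts;
    }
}
```

In Solr itself the per-constraint sets would typically come from a cache keyed by query, which is why caching the parsed metadata and the constraint queries pays off.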

DFLL Response
<result name="products" numFound="394" start="0">...</result>
<lst name="metadata">
 ...
 <lst name="500016">
   <int name="rankDir">0</int><int name="datatype">1</int>
   <int name="rating">88</int><str name="name">OS provided</str>
   <lst name="values">
     <lst name="111536">
       <int name="valueId">111536</int>
       <str name="label">Apple Mac OS X</str>
       <str name="rating">50</str>
       <int name="count">1</int>
     </lst>
     ...
   </lst>
 ...

DfllCacheRegenerator
SolrCore “Auto-warms” all SolrCaches when new versions of the index are opened for searching (after a commit).
 public interface CacheRegenerator {
   public boolean regenerateItem(SolrIndexSearcher newSearcher,
                                 SolrCache newCache,
                                 SolrCache oldCache,
                                 Object oldKey,
                                 Object oldVal)
           throws IOException;
 }
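
To make the auto-warming idea concrete, here is a rough sketch in plain Java of a warming loop that walks the old cache and asks a regenerator to repopulate the new one. It does not use Solr classes: `AutowarmSketch`, its nested `Regenerator` interface and `warm` are hypothetical, `Map` stands in for SolrCache, and treating a `false` return as "stop warming" is my reading of the interface rather than something stated on the slide.

```java
import java.util.Map;

/** Plain-Java illustration only; Map is a stand-in for a SolrCache. */
final class AutowarmSketch {

    /** Simplified analogue of a cache regenerator (no searcher argument here). */
    interface Regenerator<K, V> {
        /** Recomputes one entry for the new cache; returns false to stop warming early. */
        boolean regenerateItem(Map<K, V> newCache, K oldKey, V oldVal);
    }

    /**
     * Visits the entries of the old cache in iteration order and lets the
     * regenerator rebuild each one against the new cache.
     * Returns the number of old entries visited.
     */
    static <K, V> int warm(Map<K, V> oldCache, Map<K, V> newCache, Regenerator<K, V> regen) {
        int visited = 0;
        for (Map.Entry<K, V> e : oldCache.entrySet()) {
            visited++;
            if (!regen.regenerateItem(newCache, e.getKey(), e.getValue())) {
                break;
            }
        }
        return visited;
    }
}
```

A regenerator for category metadata would usually ignore the old value and recompute it from the new index, since the old value may be stale after the commit.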

DataImportHandler

DataImportHandler
Builds and incrementally updates indexes based on configured SQL or XPath queries.
<entity name="item" pk="ID" query="select * from ITEM"
   deltaQuery="select ID ... where
                ITEMDATE > '${dataimporter.last_index_time}'">
 <field column="NAME" name="name" />
 ...
 <entity name="f" pk="ITEMID"
     query="select DESC from FEATURE where ITEMID='${item.ID}'"
     deltaQuery="select ITEMID from FEATURE where
                 UPDATEDATE > '${dataimporter.last_index_time}'"
     parentDeltaQuery="select ID from ITEM where ID=${f.ITEMID}">
  <field name="features" column="DESC" />
  ...

DataImportHandler Plugins
● DataSource
    FileDataSource
    HttpDataSource
    JdbcDataSource
● Transformer
    DateFormatTransformer
    NumberFormatTransformer
    RegexTransformer
    ScriptTransformer
    TemplateTransformer
● EntityProcessor
    FileListEntityProcessor
    SqlEntityProcessor
    ● CachedSqlEntityProcessor
    XPathEntityProcessor

LocalSolr

LocalSolr

LocalUpdateProcessorFactory
● Uses lat/lon fields to compute Cartesian Tier info
● Adds grid boxes of various sizes as new fields
 <updateRequestProcessorChain name="standard" default="true">
   <processor class="...LocalUpdateProcessorFactory">
      <str name="latField">lat</str>
      <str name="lngField">lng</str>
      <int name="startTier">9</int>
      <int name="endTier">17</int>
   </processor>
   <processor class="solr.LogUpdateProcessorFactory" />
   <processor class="solr.RunUpdateProcessorFactory" />
 </updateRequestProcessorChain>

LocalSolr Cartesian Tiers

LocalSolrQueryComponent
● Use in place of default QueryComponent
● Augments regular query with DistanceQuery and DistanceSortSource
● Can use a custom SolrCache for distances for commonly used points
  <searchComponent name="geoquery"
                   class="...LocalSolrQueryComponent" />
  <requestHandler name="geo" class="solr.SearchHandler">
     <arr name="components">
       <str>geoquery</str>
       ...
     </arr>
  </requestHandler>

GuardianComponent

GuardianComponent Goal
● When Searching Really Short Docs, Rule Out Matches That Are “Significantly” Longer Than Query
● Increase Precision At The Expense Of Recall
       q = Dance Party
  Dance Party (1995)
  Dance Party (2005) (V)
  Dance Party, USA (2006)
  Workout Party, Let's Dance! (2004) (V)
  Shrek in the Swamp Karaoke Dance Party (2001) (V)

Implementation
● SearchComponent
● Configured To Run After QueryComponent
● Post-Processes DocList
    Pick MAX_LEN Based On Number Of Query Clauses
    Re-analyze Stored “title” Field
    Eliminate Any Results With More Than MAX_LEN Tokens In “title”
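
The post-processing steps above can be sketched in plain Java without any Solr classes. This is only an illustration of the filtering logic: `TitleLengthGuard` and its methods are hypothetical names, whitespace splitting stands in for re-analyzing the stored title, and the "twice the number of query clauses" rule for MAX_LEN is an assumption, since the slides do not say how MAX_LEN is chosen.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java stand-in for the length filter; a real version would be a SearchComponent. */
final class TitleLengthGuard {

    /** Hypothetical rule: allow titles of up to twice as many tokens as query clauses. */
    static int maxLen(int queryClauses) {
        return queryClauses * 2;
    }

    /** Whitespace split stands in for re-analyzing the stored "title" field. */
    static int tokenCount(String title) {
        String trimmed = title.trim();
        return trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
    }

    /** Keeps only titles with at most maxLen(queryClauses) tokens, preserving rank order. */
    static List<String> filter(List<String> rankedTitles, int queryClauses) {
        int limit = maxLen(queryClauses);
        List<String> kept = new ArrayList<>();
        for (String title : rankedTitles) {
            if (tokenCount(title) <= limit) {
                kept.add(title);
            }
        }
        return kept;
    }
}
```

With the two-clause query "Dance Party" this rule allows four tokens, so "Dance Party (2005) (V)" survives while "Shrek in the Swamp Karaoke Dance Party (2001) (V)" is dropped.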

Alternate Approach
● <copyField source="title" dest="titleLen"/>
● Write TokenCountingTokenFilter For titleLen
● Write MaxLenQParserPlugin
    Subclass Your Favorite QParser
    Pick MAX_LEN Based On Number Of Query Clauses From Super
    Add +titleLen:[* TO MAX_LEN] Clause To Query

Testing Your Plugins

AbstractSolrTestCase
public class YourTest extends AbstractSolrTestCase {
  ...
  public void testSomeStuff() throws Exception {
    assertU(adoc("id", "7",
                 "title", "Travel Guide",
                 "description", "Paris in 10 Days"));
    assertU(adoc("id", "42",
                 "title", "Cool Book",
                 "description", "Hitch Hiker's Guide to the Galaxy"));
    assertU(commit());
    assertQ("multi qf",
            req("q", "guide",
                "qt", "dismax",
                "qf", "title^2 description^1")
            ,"//*[@numFound='2']"
            ,"//result/doc[1]/int[@name='id'][.='7']"
            ,"//result/doc[2]/int[@name='id'][.='42']"
            );
  }
}

Questions?
http://lucene.apache.org/solr/
