Introduction to Apache Solr & Lucid Imagination

Grant Ingersoll
Thursday, 29 July 2010

Sponsored by www.ixxus.com (Steve Odart)

We consult and design. We architect and build. We support. And we realise the true value of your content...

We deliver information solutions.
© 2010 Lucid Imagination, Inc.

Agenda
Introductions
About Lucid Imagination & Open Source Search
LucidWorks for Solr
Searching your domain with Solr
Putting Solr into production
Questions
Slides are posted for download at the end of this presentation; full replay available within ~48 hours of live webcast

About me
Grant Ingersoll
Lucene/Solr committer
Co-founder Apache Mahout project
Co-author of upcoming “Taming Text”
Chair, Apache Lucene PMC

About Lucid Imagination

Build on, complement the open source technology & install base of Apache Lucene and Solr
Deliver subscription-based value-add software, support and training to enhance & extend Lucene/Solr
Center of excellence for Lucene/Solr app developers

Company Background
Lucene Project Launched: 1997
Solr Project Launched: 2006
Company Launched: Aug. 2007
Financing: Shasta Ventures, Granite Ventures, Walden International, In-Q-Tel
Paying Customers: 100+ (and counting…)
HQ: San Mateo, California, USA
Partners: US, Europe, Japan, Latin America

Lucid Imagination Offerings

Consulting

Subscriptions

Training

Search Customers
Best Practices

Certified Distributions

Building Better, Faster, Less Costly Search Applications

Health Checks

Lucene/Solr Success Stories with Lucid Imagination

Data Happens

Data constantly growing faster, more diverse
Mix of content, composition, and repositories: new terms, fields, range of data types grow in tandem with volume

Diversity and location of data are an application development problem
Search and discovery tools are the solution
Scalability, performance and relevancy key to user success
Transparency, breadth and flexibility are key to development success

Lucene/Solr
Lucene: powerful flexible search library
Speed, accuracy, scalability, efficiency
Cross-platform portability of indexes
Java ported to 7 other environments (PHP, C++, Python, etc.)
Liberal Apache License
One of Top 5 Apache Projects
Top 10 Open Source Project

Solr: The Lucene Search Server
REST-like interface
Faceting
Rich Document Handling
Easy configuration
Hit highlighting
RDBMS integration
Distributed scalability

Lucene, Solr and their logos are trademarks of the Apache Software Foundation

Lucene/Solr Open Source Quality @ the tipping point

Scalability
823 billion documents searched by Lucene at MySpace.com

Performance
Real time: LinkedIn search covers 48 million members, adding one new member (with new content) per second

Relevancy
Open source APIs deliver better customization and the ability to fine tune results

Economics
5-8x reduction in server footprint over commercial search
No vendor lock-in lowers lifecycle costs

Creating Lasting Business Value
Three key trends…

…result in:
Reduced risk
From being locked into single-vendor relationships
Shorter time to market
Resulting from direct communication between innovators and users
Better fit
Access to code results in increased adaptability of process to systems

CREATING COMPETITIVE ADVANTAGE: Focus on core process innovations unique to your business instead of operating and maintaining 3rd party software packages

Search 101
Search tools are designed for dealing with fuzzy data
Works well with structured and unstructured data
Performs well when dealing with large volumes of data

Many apps don’t need the limits that databases place on content
Search fits well alongside a DB too

Given a user’s information need (query), find and, optionally, score content relevant to that need
Many different ways to solve this problem, each with tradeoffs

What does “relevant” mean?

Two Foundation Concepts

Relevance
Vector Space Model (VSM) for relevance
Common across many search engines
Apache Lucene is a highly optimized implementation of the VSM
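
To make the Vector Space Model concrete, here is a minimal sketch in Python. It is a toy TF-IDF weighting with cosine similarity, not Lucene's actual scoring formula. The sample documents, the function names and the smoothed IDF variant are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Turn each tokenized document into a sparse {term: weight} vector."""
    n = len(docs)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed inverse document frequency (an assumed, common variant).
    idf = {term: math.log(1 + n / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(doc).items()} for doc in docs]
    return vectors, idf

def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, docs):
    """Return (score, doc_index) pairs, best match first."""
    vectors, idf = tf_idf_vectors(docs)
    # Query terms that never occur in the collection get weight 0.
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query).items()}
    scored = [(cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return sorted(scored, key=lambda pair: -pair[0])

# Hypothetical, already-tokenized documents.
DOCS = [
    ["apple", "ipod", "nano"],
    ["asus", "video", "card"],
    ["ipod", "video", "cable"],
]
```

Calling `rank(["ipod"], DOCS)` orders the two documents that mention the term ahead of the one that does not. Which of the two matching documents comes first depends on how rare their other terms are, which is the kind of tradeoff the model encodes.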

Indexing

Finds and maps terms and documents
Conceptually similar to a book index
At the heart of fast search/retrieve
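
The book-index analogy can be sketched as an inverted index: a map from each term to the documents that contain it. The snippet below is a simplified illustration, not how Lucene stores its index. The sample documents, the whitespace tokenizer and the AND-only search are assumptions.

```python
from collections import defaultdict

def build_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    postings = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}

def search(index, query):
    """Return ids of documents containing every query term (AND semantics)."""
    terms = query.lower().split()
    if not terms:
        return []
    result = set(index.get(terms[0], []))
    for term in terms[1:]:
        result &= set(index.get(term, []))
    return sorted(result)

# Hypothetical documents keyed by id.
DOCS = {
    1: "Apple iPod nano",
    2: "ASUS video card",
    3: "iPod video cable",
}
INDEX = build_index(DOCS)
```

Lookups touch only the postings for the query terms rather than scanning every document, which is why this structure sits at the heart of fast retrieval.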

Solr Basics
Content is modeled via Documents and Fields
Content can be text, integers, floats, dates, custom
Analysis can be employed to alter content before indexing
Controlled via schema.xml

Searches are supported through a wide range of Query options
Keyword
Terms
Phrases
Wildcards, other
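
As a rough illustration of these query options, the sketch below builds term, phrase and prefix-wildcard query strings in the `field:value` style of the Lucene query syntax. The helper names are hypothetical, and the set of escaped characters is an assumption based on the commonly documented special characters. Check it against the query parser version you run.

```python
import re

# Characters with special meaning in the Lucene query syntax (assumed set).
_SPECIAL = re.compile(r'([+\-!(){}\[\]^"~*?:\\|&])')

def escape(text):
    """Backslash-escape query syntax characters in user-supplied text."""
    return _SPECIAL.sub(r"\\\1", text)

def term_query(field, word):
    """Single-term query; expects one token without whitespace."""
    return f"{field}:{escape(word)}"

def phrase_query(field, phrase):
    """Exact phrase query: the words must appear next to each other."""
    return f'{field}:"{escape(phrase)}"'

def prefix_query(field, prefix):
    """Prefix wildcard query."""
    # The trailing * is the wildcard operator, so it is added after escaping.
    return f"{field}:{escape(prefix)}*"
```

For example, `term_query("name", "iPod")` yields `name:iPod`, and `prefix_query("name", "ipo")` yields `name:ipo*`.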

Solr Basics
Schema
Define Fields, field metadata and Analysis
<field name="name" type="text" indexed="true" stored="true"/>

Solr Config
Define low-level Lucene controls
Specify how clients interact with Solr via Request Handlers (“mini servlets”)
Configure highlighting, spell checking, admin, etc.

Getting Started
1. Install LucidWorks Certified Distribution
2. Model your domain
3. Index your content
4. Test
5. Deploy

LucidWorks Certified Distribution
Free certified distribution
Installer
Simple
Plugins and enhancements
Updateable
Complete Reference Guide
Support for Linux, Windows, Mac
UI and headless both available
Get started at http://lucene.li/R

Master Your Domain with Solr

Get to know your content
Get to know your users

Modeling your Content
Collection/Aggregate
Examine collection level stats, like:
MIME Types
Number of Docs
Update rates
Languages present
Much, much more

Look for patterns and relationships
Identify helpful resources

Modeling your Content
Randomly sample a set of your documents
Look for:
Common structures like titles, tables, columns, etc.
Important metadata
Tokenization issues
Try out in http://localhost:8983/solr/admin/analysis.jsp
Importance Indicators
May also look at paragraph, sentence, word and character issues

Understanding your Users
Sophisticated vs. Simple
Speed and Relevance
Search and Discovery
Search
Faceting
Did you mean?
Similar Pages (More Like This)
Highlighting
UI expectations

Build your Application
Map your content into Documents and Fields via the Solr schema
Set up your Solr access patterns in solrconfig.xml
Index your content
Search/Browse/Discover

Indexing
Many Clients
Java, PHP, Ruby, etc.
See example/exampledocs

Example: Upload CSV, Solr XML
<add><doc>
  <field name="id">EN7800GTX/2DHTV/256M</field>
  <field name="manu">ASUS Computer Inc.</field>
  <field name="cat">electronics</field>
</doc></add>
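
A client typically generates the XML add message rather than writing it by hand. Below is a minimal sketch using Python's standard library. The function name is hypothetical, and the field names are taken from the example document above.

```python
import xml.etree.ElementTree as ET

def build_add_message(docs):
    """Serialize a list of {field: value} dicts as a Solr XML <add> message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            field = ET.SubElement(doc_el, "field", name=name)
            field.text = str(value)  # ElementTree escapes &, < and > on output
    return ET.tostring(add, encoding="unicode")

message = build_add_message([
    {"id": "EN7800GTX/2DHTV/256M", "manu": "ASUS Computer Inc.", "cat": "electronics"},
])
```

The resulting string would then be sent to Solr's update handler over HTTP, followed by a commit. The exact URL depends on your installation.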

Search
Clients also support search through API calls
HTTP support by definition:
http://localhost:8983/solr/select/?q=*:*&fl=score,id
http://localhost:8983/solr/select/?q=name:iPod&fl=score,id
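
When queries come from user input, it is safer to build the URL programmatically so that parameters are percent-encoded. This is a small sketch assuming the local example server shown above. The function name and the optional `rows` argument are illustrative additions.

```python
from urllib.parse import urlencode

def select_url(query, fields=("score", "id"), rows=None,
               base="http://localhost:8983/solr/select/"):
    """Build a Solr select URL with percent-encoded parameters."""
    params = [("q", query), ("fl", ",".join(fields))]
    if rows is not None:
        params.append(("rows", str(rows)))
    return base + "?" + urlencode(params)
```

`select_url("name:iPod")` produces the same request as the second URL above, with `:` and `,` encoded as `%3A` and `%2C`.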

Getting to Production

Some Issues to think about:
Scaling
Improving Findability

Scaling Solr

Get the most out of each machine
Typical Hardware (your mileage may vary):
Modern multicore CPU, Fast disk (SSD?), 4-16 GB RAM

http://lucene.li/V

High Query Volume
Large Index
Both

Improving Findability
Common Techniques
Analysis:
Lowercase, stemming, synonyms, stopwords, compound analysis (e.g. STRAV220 -> STR AV 220)

Faceting
Spell Checking
Editorial
See http://lucene.li/U
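
The analysis techniques listed above can be pictured as a chain of small text transformations applied before indexing. The sketch below is a toy stand-in for Solr's configurable analyzers. The stopword and synonym tables are hypothetical, and the stemmer only strips simple plurals. The compound step splits only where letters meet digits, so it yields STRAV and 220. Splitting STR from AV, as in the example above, would need a dictionary or product-specific rules.

```python
import re

# Hypothetical configuration; a real deployment would load these from files.
STOPWORDS = {"a", "an", "the", "of", "for"}
SYNONYMS = {"tv": "television"}

def tokenize(text):
    """Lowercase the text and split it on non-alphanumeric characters."""
    return re.findall(r"[a-z0-9]+", text.lower())

def split_alnum(token):
    """Split a token wherever letters meet digits (strav220 -> strav, 220)."""
    return re.findall(r"[a-z]+|[0-9]+", token)

def stem(token):
    """Crude plural stripping; a stand-in for a real stemmer such as Porter."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    """Run the full chain and return the terms that would be indexed."""
    terms = []
    for token in tokenize(text):
        for part in split_alnum(token):
            if part in STOPWORDS:
                continue
            part = SYNONYMS.get(part, part)
            terms.append(stem(part))
    return terms
```

Running the same chain over queries as over documents is what lets a search for "TV cables" match content that says "television cable".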

Improving Findability
Phrase Queries and other Position-based Queries (SpanQuery)
Disjunction Max Query (aka “DisMax”)
Intent Analysis
Invisible Queries
Fake Queries
Relevance Feedback and “More Like This”
See http://lucene.li/S
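
The idea behind a disjunction max query can be shown with a few lines of arithmetic: a document's score is its best per-field score, plus a tie-breaker fraction of the remaining field scores. The sketch below illustrates only that combination rule. The per-field scores are made-up numbers, not output from Solr.

```python
def dismax_score(field_scores, tie=0.0):
    """Combine per-field scores: best field wins, the rest add a tie-breaker."""
    scores = list(field_scores.values())
    if not scores:
        return 0.0
    best = max(scores)
    return best + tie * (sum(scores) - best)

def sum_score(field_scores):
    """Plain additive combination, shown for comparison."""
    return sum(field_scores.values())

# Hypothetical per-field scores for two documents matching the same query.
doc_a = {"title": 0.9, "body": 0.1}
doc_b = {"title": 0.5, "body": 0.5}
```

With plain summation the two documents tie, while the max-based combination prefers `doc_a`, whose title is a strong match. Raising `tie` towards 1.0 moves the result back towards the sum.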

Resources
Websites
http://www.lucidimagination.com
http://search.lucidimagination.com
http://lucene.apache.org/solr

Solr Support
http://www.lucidimagination.com/How-We-Can-Help
solr-user@lucene.apache.org

Q&A
Slides are posted for download at http://lucene.li/a ; full replay available within ~48 hours of live webcast