
Introduction to Apache Solr & Lucid Imagination


Grant Ingersoll
Thursday, 29 July 2010

Co-sponsored by
Steve Odart
www.ixxus.com
We consult and design.
We architect and build.
We support.
And we realise the true value of your content...
We deliver information solutions.


© 2010 Lucid Imagination, Inc. 2
Agenda
Introductions
About Lucid Imagination & Open Source Search
LucidWorks for Solr
Searching your domain with Solr
Putting Solr into production
Questions
Slides are posted for download at the end of this presentation; full replay available within ~48 hours of live webcast



About me
Grant Ingersoll
Lucene/Solr committer
Co-founder Apache Mahout project
Co-author of upcoming “Taming Text”
Chair, Apache Lucene PMC



About Lucid Imagination

Build on, complement the open source technology & install base of Apache Lucene and Solr
Deliver subscription-based value-add software, support and training to enhance & extend Lucene/Solr
Center of excellence for Lucene/Solr app developers



Company Background

Lucene Project Launched: 1997
Solr Project Launched: 2006
Company Launched: Aug. 2007
Financing: Shasta Ventures, Granite Ventures, Walden International, In-Q-Tel
Paying Customers: 100+ (and counting…)
HQ: San Mateo, California, USA
Partners: US, Europe, Japan, Latin America



Lucid Imagination Offerings

Customers Building Better, Faster, Less Costly Search Applications
Consulting
Subscriptions
Training
Certified Search Distributions
Health Checks
Best Practices

Lucene/Solr Success Stories with Lucid Imagination



Data Happens

Data constantly growing faster, more diverse
Mix of content, composition, and repositories: new terms, fields, range of data types grow in tandem with volume
Diversity and location of data are an application development problem
Search and discovery tools are the solution
Scalability, performance and relevancy key to user success
Transparency, breadth and flexibility are key to development success

Lucene/Solr

Lucene: powerful flexible search library
Speed, accuracy, scalability, efficiency
Cross-platform portability of indexes
Java, ported to 7 other environments (PHP, C++, Python, etc.)
Liberal Apache License
One of Top 5 Apache Projects
Top 10 Open Source Project

Solr: The Lucene Search Server
REST-like interface
Faceting
Rich Document Handling
Distributed scalability
Easy configuration
Hit highlighting
RDBMS integration

Lucene, Solr and their logos are trademarks of the Apache Software Foundation



Lucene/Solr Open Source
Quality @ the tipping point

Scalability
823 billion documents searched by Lucene at MySpace.com
Performance
Real time: LinkedIn search covers 48 million members, adding one new member (with new content) per second
Relevancy
Open source APIs deliver better customization and the ability to fine tune results
Economics
5-8x reduction in server footprint over commercial search
No vendor lock-in lowers lifecycle costs



Creating Lasting Business Value

Three key trends…
Reduced risk: From being locked into single-vendor relationships
Shorter time to market: Resulting from direct communication between innovators and users
Better fit: Access to code results in increased adaptability of process to systems

…result in:
CREATING COMPETITIVE ADVANTAGE: Focus on core process innovations unique to your business instead of operating and maintaining 3rd party software packages


Search 101
Search tools are designed for dealing with fuzzy data
Works well with structured and unstructured data
Performs well when dealing with large volumes of data
Many apps don’t need the limits that databases place on content
Search fits well alongside a DB too

Given a user’s information need (query), find and, optionally, score content relevant to that need
Many different ways to solve this problem, each with tradeoffs
What does “relevant” mean?



Two Foundation Concepts

Relevance
Vector Space Model (VSM) for relevance
Common across many search engines
Apache Lucene is a highly optimized implementation of the VSM

Indexing
Finds and maps terms and documents
Conceptually similar to a book index
At the heart of fast search/retrieve
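
To make the two concepts concrete, here is a minimal textbook sketch of an inverted index plus TF-IDF cosine scoring. It is not Lucene's actual scoring formula (which adds length norms, boosts and other factors); the toy corpus, document ids and function names are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy corpus; ids and text are invented for this sketch.
DOCS = {
    "d1": "solr is a search server",
    "d2": "lucene is a search library",
    "d3": "a book index maps terms to pages",
}


def tokenize(text):
    """Lowercase and split on whitespace (a stand-in for real analysis)."""
    return text.lower().split()


def build_inverted_index(docs):
    """Indexing: map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def tfidf_vector(tokens, index, n_docs):
    """Weight each known term by term frequency times inverse document frequency."""
    vec = {}
    for term, tf in Counter(tokens).items():
        df = len(index.get(term, ()))
        if df:
            vec[term] = tf * (1.0 + math.log(n_docs / df))
    return vec


def cosine(a, b):
    """Relevance: cosine of the angle between two sparse term vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if not norm_a or not norm_b:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, docs):
    """Return (score, doc_id) pairs, best first, for docs sharing a query term."""
    index = build_inverted_index(docs)
    q_vec = tfidf_vector(tokenize(query), index, len(docs))
    candidates = set()
    for term in q_vec:
        candidates |= index.get(term, set())
    scored = [
        (cosine(q_vec, tfidf_vector(tokenize(docs[d]), index, len(docs))), d)
        for d in candidates
    ]
    return sorted(scored, reverse=True)
```

The index lookup narrows the candidates to documents that share a term with the query, which is the "book index" idea; only those candidates are then scored.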



Solr Basics
Content is modeled via Documents and Fields
Content can be text, integers, floats, dates, custom
Analysis can be employed to alter content before indexing
Controlled via schema.xml
Searches are supported through a wide range of Query options
Keyword
Terms
Phrases
Wildcards, other



Solr Basics
Schema
Define Fields, field metadata and Analysis
<field name="name" type="text" indexed="true" stored="true"/>

Solr Config
Define low-level Lucene controls
Specify how clients interact with Solr via Request Handlers (“mini servlets”)
Configure highlighting, spell checking, admin, etc.



Getting Started
1. Install LucidWorks Certified Distribution
2. Model your domain
3. Index your content
4. Test
5. Deploy



LucidWorks Certified Distribution
Free certified distribution
Installer
Simple
Plugins and enhancements
Updateable
Complete Reference Guide
Support for Linux, Windows, Mac
UI and headless both available
Get started at http://lucene.li/R



Master Your Domain with Solr

Get to know
your content

Get to know
your users



Modeling your Content

Collection/Aggregate
Examine collection level stats, like:
MIME Types
Number of Docs
Update rates
Languages present
Much, much more
Look for patterns and relationships
Identify helpful resources
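
Collection-level stats such as document count and MIME types can be gathered with a few lines of scripting before any schema work begins. The sketch below guesses MIME types from file names only; the file names are hypothetical stand-ins for a real collection, and a real survey would likely inspect file contents as well.

```python
import mimetypes
from collections import Counter


def collection_stats(paths):
    """Return the document count and a tally of MIME types guessed from names."""
    types = Counter()
    for path in paths:
        mime, _encoding = mimetypes.guess_type(path)
        types[mime or "unknown"] += 1
    return {"num_docs": len(paths), "mime_types": dict(types)}


# Hypothetical file names, invented for illustration.
sample = ["report.pdf", "notes.txt", "index.html", "readme.txt", "blob"]
stats = collection_stats(sample)
print(stats)
```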
Modeling your Content
Randomly sample a set of your documents

Look for:
Common structures like titles, tables, columns, etc.
Important metadata
Tokenization issues
Try out in http://localhost:8983/solr/admin/analysis.jsp
Importance Indicators
May also look at paragraph, sentence, word and character issues



Understanding your Users
Sophisticated vs. Simple
Speed and Relevance
Search and Discovery
Search
Faceting
Did you mean?
Similar Pages (More Like This)
Highlighting
UI expectations



Build your Application

Map your content into Documents and Fields via the Solr schema
Set up your Solr access patterns in the solrconfig.xml
Index your content
Search/Browse/Discover



Indexing
Many Clients
Java, PHP, Ruby, etc.
See example/exampledocs
Example:
Upload CSV, Solr XML
<add><doc>
  <field name="id">EN7800GTX/2DHTV/256M</field>
  <field name="manu">ASUS Computer Inc.</field>
  <field name="cat">electronics</field>
</doc></add>
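
As a rough sketch of what a client does under the hood, the Python below builds the same kind of `<add><doc>` message from a dictionary and shows how it might be posted. The update URL `http://localhost:8983/solr/update` is an assumption based on the default example setup, and `build_add_xml` and `post_to_solr` are hypothetical helper names, not part of any Solr client library.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_add_xml(doc):
    """Serialize a dict of field name -> value as a Solr <add><doc> message."""
    add = ET.Element("add")
    doc_el = ET.SubElement(add, "doc")
    for name, value in doc.items():
        field = ET.SubElement(doc_el, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")


def post_to_solr(xml_body, url="http://localhost:8983/solr/update"):
    """POST an update message; assumes a Solr instance is listening at `url`."""
    request = urllib.request.Request(
        url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )
    with urllib.request.urlopen(request) as response:
        return response.status


doc = {
    "id": "EN7800GTX/2DHTV/256M",
    "manu": "ASUS Computer Inc.",
    "cat": "electronics",
}
xml_body = build_add_xml(doc)
print(xml_body)

# With a Solr instance running, the document would be sent and committed with:
#   post_to_solr(xml_body)
#   post_to_solr("<commit/>")
```

Building the XML with an XML library rather than string concatenation means special characters in field values are escaped automatically.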
Search

Clients also support search through API calls
HTTP support by definition:
http://localhost:8983/solr/select/?q=*:*&fl=score,id
http://localhost:8983/solr/select/?q=name:iPod&fl=score,id
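
Because search is plain HTTP, any language with an HTTP client can query Solr. The sketch below rebuilds the two URLs above and, assuming a Solr instance is running locally and that the JSON response writer (`wt=json`) is enabled, fetches the matching documents. `build_select_url` and `search` are hypothetical helper names.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Assumed default location of the example Solr instance.
SOLR_SELECT = "http://localhost:8983/solr/select/"


def build_select_url(q, fl="score,id", base=SOLR_SELECT, **extra):
    """Build a select URL; '*', ':' and ',' are left unescaped for readability."""
    params = {"q": q, "fl": fl}
    params.update(extra)
    return base + "?" + urlencode(params, safe="*:,")


def search(q, fl="score,id"):
    """Run a query and return the list of documents from the JSON response."""
    url = build_select_url(q, fl, wt="json")
    with urllib.request.urlopen(url) as response:
        return json.load(response)["response"]["docs"]


print(build_select_url("*:*"))
print(build_select_url("name:iPod"))

# With Solr running and the example documents indexed:
#   for doc in search("name:iPod"):
#       print(doc["id"], doc["score"])
```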



Getting to Production

Some Issues to think about:


Scaling
Improving Findability



Scaling Solr

Get the most out of each machine


http://lucene.li/V
Typical Hardware (your mileage may vary):
Modern multicore CPU, Fast disk (SSD?), 4-16 GB RAM

High Query Volume


Large Index
Both
Improving Findability

Common Techniques
Analysis:
Lowercase, stemming, synonyms, stopwords, compound analysis (e.g. STR-AV220 -> STR AV 220)
Faceting
Spell Checking
Editorial

See http://lucene.li/U
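
The analysis techniques listed above can be pictured as a small pipeline applied to each token. The sketch below is a toy illustration in Python, not Solr's analyzer chain: the stopword list, synonym map and suffix-stripping "stemmer" are deliberately naive assumptions, and in Solr these steps would be configured as tokenizers and filters in schema.xml.

```python
import re

# Hypothetical, deliberately tiny resources for illustration only.
STOPWORDS = {"the", "a", "an", "of", "for"}
SYNONYMS = {"tv": "television"}


def split_compound(token):
    """Split on hyphens and letter/digit boundaries: STR-AV220 -> STR, AV, 220."""
    return re.findall(r"[A-Za-z]+|\d+", token)


def stem(token):
    """Naive suffix stripping; a stand-in for a real stemmer such as Porter."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Compound split, lowercase, drop stopwords, map synonyms, then stem."""
    tokens = []
    for raw in text.split():
        for part in split_compound(raw):
            term = part.lower()
            if term in STOPWORDS:
                continue
            term = SYNONYMS.get(term, term)
            tokens.append(stem(term))
    return tokens


print(analyze("The receivers for STR-AV220"))
```

Applying the same chain to both indexed content and queries is what lets a search for "str av 220" find a document that only mentions "STR-AV220".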
Improving Findability
Phrase Queries and other Position-based Queries (SpanQuery)
Disjunction Max Query (aka “DisMax”)
Intent Analysis
Invisible Queries
Fake Queries
Relevance Feedback and “More Like This”

See http://lucene.li/S
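
The idea behind a Disjunction Max Query can be shown with a little arithmetic: when the same query is run against several fields, the document's score is the best single field score, plus a tie-breaker fraction of the remaining field scores. The sketch below illustrates only that combination step; the per-field scores are made-up numbers, and `dismax_score` is a hypothetical helper rather than a Lucene or Solr API.

```python
def dismax_score(field_scores, tie_breaker=0.0):
    """Best field score plus tie_breaker times the sum of the other scores."""
    if not field_scores:
        return 0.0
    best = max(field_scores)
    return best + tie_breaker * (sum(field_scores) - best)


# Hypothetical per-field scores for one document, e.g. title and body matches.
scores = [0.8, 0.3]
print(dismax_score(scores))                   # only the best field counts
print(dismax_score(scores, tie_breaker=0.1))  # other fields add a small bonus
print(sum(scores))                            # plain sum, for comparison
```

With a tie-breaker of 0 a strong match in one field is not outranked by weak matches spread across many fields; raising the tie-breaker toward 1 moves the result toward a plain sum.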



Resources
Websites
http://www.lucidimagination.com
http://search.lucidimagination.com
http://lucene.apache.org/solr

Solr Support
http://www.lucidimagination.com/How-We-Can-Help
solr-user@lucene.apache.org



Q&A
Slides are posted for download at http://lucene.li/a; full replay available within ~48 hours of live webcast
