Apache Solr and Lucene are trademarks of The Apache Software Foundation.
lucidimagination.com
#120
Get More Refcardz! Visit refcardz.com

Apache Solr: Getting Optimal Search Results
By Chris Hostetter

CONTENTS INCLUDE:
• About Solr
• Running Solr
• schema.xml
• Field Types
• Analyzers
• Hot Tips and more...
ABOUT SOLR

Solr makes it easy for programmers to develop sophisticated, high performance search applications with advanced features such as faceting, dynamic clustering, database integration and rich document handling.

Solr (http://lucene.apache.org/solr/) is the HTTP based server product of the Apache Lucene Project. It uses the Lucene Java library at its core for indexing and search technology, as well as spell checking, hit highlighting, and advanced analysis/tokenization capabilities.

The fundamental premise of Solr is simple. You feed it a lot of information, then later you can ask it questions and find the piece of information you want. Feeding in information is called indexing or updating. Asking a question is called querying.

When LucidWorks is installed at ~/LucidWorks, the Solr Home directory is ~/LucidWorks/lucidworks/solr/.

Single Core and Multicore Setup
By default, Solr is set up to manage a single "Solr Core" which contains one index. It is also possible to segment Solr into multiple virtual instances, or cores, each with its own configuration and indices. Cores can be dedicated to a single application, or to different ones, but all are administered through a common administration interface.

Multiple Solr Cores can be configured by placing a file named solr.xml in your Solr Home directory, identifying each Solr Core and the corresponding instance directory for each. When using a single Solr Core, the Solr Home directory is automatically the instance directory for your Solr Core.

Configuration of each Solr Core is done through two main configuration files: solrconfig.xml and schema.xml.
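As a sketch, a multicore solr.xml might look like this (the core names and instance directories are illustrative):

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- each core gets a name and its own instance directory -->
    <core name="core0" instanceDir="core0"/>
    <core name="core1" instanceDir="core1"/>
  </cores>
</solr>
```

Each instance directory then holds its own conf/ directory with that core's solrconfig.xml and schema.xml.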
www.dzone.com
Solr Administration
Administration for Solr can be done through http://[hostname]:8983/solr/admin which provides a section with menu items for monitoring indexing and performance statistics, information about index distribution and replication, and information on all threads running in the JVM at the time. There is also a section where you can run queries, and an assistance area.

Figure 1: A typical Solr setup
Core Solr Concepts
Solr's basic unit of information is a document: a set of information that describes something, like a class in Java. Documents themselves are composed of fields. These are more specific pieces of information, like attributes in a class.
RUNNING SOLR

Solr Installation
The LucidWorks for Solr installer (http://www.lucidimagination.com/Downloads/LucidWorks-for-Solr) makes it easy to set up your initial Solr instance. The installer brings you through configuration and deployment of the Web service on either Jetty or Tomcat.

Solr Home Directory
Solr Home is the main directory where Solr will look for configuration files, data and plug-ins.
Field Type | Description
RandomSortField | Does not contain a value. Queries that sort on this field type will return results in random order. Use a dynamic field to use this feature.
StrField | String
TextField | Text, usually multiple words or tokens
UUIDField | Universally Unique Identifier (UUID). Pass in a value of "NEW" and Solr will create a new UUID.

SCHEMA.XML

To build a searchable index, Solr takes in documents composed of data fields of specific field types. The schema.xml configuration file defines the field types and specific fields that your documents can contain, as well as how Solr should handle those fields when adding documents to the index or when querying those fields.

schema.xml is structured as follows:

<schema>
  <types>
  <fields>
  <uniqueKey>
  <defaultSearchField>
  <solrQueryParser>
  <copyField>
</schema>

Hot Tip: Date Field
Dates are of the format YYYY-MM-DDThh:mm:ssZ. The Z is the timezone indicator (for UTC) in the canonical representation. Solr requires dates and times to be in the canonical form, so clients are required to format and parse dates in UTC when dealing with Solr. Date fields also support date math, such as expressing a time two months from now using NOW+2MONTHS.
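Following that outline, a minimal schema.xml might look like this sketch (the field and type names are illustrative, reusing the employeeId/office fields from the indexing examples):

```xml
<schema name="example" version="1.1">
  <types>
    <!-- a plain string type and an analyzed text type -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
    <fieldType name="text" class="solr.TextField">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
      </analyzer>
    </fieldType>
  </types>
  <fields>
    <field name="employeeId" type="string" indexed="true" stored="true"/>
    <field name="office" type="text" indexed="true" stored="true"/>
  </fields>
  <uniqueKey>employeeId</uniqueKey>
  <defaultSearchField>office</defaultSearchField>
</schema>
```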
FIELD TYPES

A field type includes three important pieces of information:
• The name of the field type
• Implementation class name
• Field attributes

Field types are defined in the types element of schema.xml.

<fieldType name="textTight" class="solr.TextField">
  …
</fieldType>

The type name is specified in the name attribute of the fieldType element. The name of the implementing class, which makes sure the field is handled correctly, is referenced using the class attribute.

Hot Tip: Shorthand for Class References
When referencing classes in Solr, the string solr is used as shorthand in place of full Solr package names, such as org.apache.solr.schema or org.apache.solr.analysis.

Field Type Properties
The field class determines most of the behavior of a field type, but optional properties can also be defined in schema.xml. Some important Boolean properties are:

Property | Description
indexed | If true, the value of the field can be used in queries to retrieve matching documents. This is also required for fields where sorting is needed.
stored | If true, the actual value of the field can be retrieved in query results.
sortMissingFirst / sortMissingLast | Control the placement of documents when a sort field is not present, in supporting field types.
multiValued | If true, indicates that a single document might contain multiple values for this field type.

ANALYZERS

Field analyzers are used both during ingestion, when a document is indexed, and at query time. Analyzers are only valid for <fieldType> declarations that specify the TextField class. Analyzers may be a single class, or they may be composed of a series of zero or more CharFilter, one Tokenizer, and zero or more TokenFilter classes.
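As a sketch, a composed analyzer chain might be declared like this (the specific factories chosen here are illustrative):

```xml
<fieldType name="text_en" class="solr.TextField">
  <analyzer>
    <!-- CharFilters run first, over the raw character stream -->
    <charFilter class="solr.HTMLStripCharFilterFactory"/>
    <!-- exactly one Tokenizer splits the stream into tokens -->
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- TokenFilters then transform the token stream, in order -->
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt"/>
  </analyzer>
</fieldType>
```

Separate index-time and query-time chains can be declared with `<analyzer type="index">` and `<analyzer type="query">` elements.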
Element | Description
<httpCaching> | Specifies how Solr should generate its HTTP caching-related headers

Internal Caching
The <query> section contains settings that affect how Solr will process and respond to queries. There are three predefined types of caches that you can configure whose settings affect performance:

Element | Description
<filterCache> | Used by SolrIndexSearcher for filters: unordered sets of documents that match a query. Solr uses the filterCache to cache results of queries that use the fq search parameter.
<queryResultCache> | Holds the sorted and paginated results of previous searches
<documentCache> | Holds Lucene Document objects (the stored fields for each document)

INDEXING

Indexing is the process of adding content to a Solr index and, as necessary, modifying that content or deleting it. By adding content to an index, it becomes searchable by Solr.

Client Libraries
There are a number of client libraries available to access Solr. SolrJ is a Java client included with the Solr 1.4 release which allows clients to add, update and query the Solr index. http://wiki.apache.org/solr/IntegratingSolr provides a list of such libraries.

Indexing Using XML
Solr accepts POSTed XML messages that add/update, commit, delete and delete by query, using the http://[hostname]:8983/solr/update URL. Multiple documents can be specified in a single <add> command.
<add>
  <doc>
    <field name="employeeId">05991</field>
    <field name="office">Bridgewater</field>
  </doc>
  [<doc> ... </doc>[<doc> ... </doc>]]
</add>

Command | Description
commit | Writes all documents loaded since the last commit
optimize | Requests Solr to merge the entire index into a single segment to improve search performance

Delete by id deletes the document with the specified ID (i.e. uniqueKey), while delete by query deletes documents that match the specified query:

<delete><id>05991</id></delete>
<delete><query>office:Bridgewater</query></delete>

Request Handlers
A Request Handler defines the logic executed for any request. Multiple instances of various request handlers, each with different names and configuration options, can be declared. The qt url parameter or the path of the url can be used to select the request handler by name.

Request handlers can accept lists of parameters as part of their declaration:
• default, which is used when a request does not include a parameter.
• append, which is added to the parameter values specified in the request.
• invariant, which overrides values specified in the query.

LucidWorks for Solr includes the following indexing handlers:
• XMLUpdateRequestHandler: processes XML messages containing data and other index modification instructions.
• BinaryUpdateRequestHandler: processes messages from the Solr Java client.

Indexing Using CSV
CSV records can be uploaded to Solr by sending the data to the http://[hostname]:8983/solr/update/csv URL.
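A small CSV payload for that URL might look like this (the header row names must match fields in your schema; these are illustrative):

```
employeeId,office
05991,Bridgewater
06000,Cary
```

By default the first line is treated as the list of field names for the columns that follow.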
The CSV handler accepts various parameters, some of which can be overridden on a per field basis using the form:

f.fieldname.parameter=value

These parameters can be used to specify how data should be parsed, such as specifying the delimiter, quote character and escape characters. You can also handle whitespace, define which lines or field names to skip, map columns to fields, or specify if columns should be split into multiple values.

Query Parsing
Input to a query parser can include:
• Search strings, that is, terms to search for in the index.
• Parameters for fine-tuning the query by increasing the importance of particular strings or fields, by applying Boolean logic among the search terms, or by excluding content from the search results.
• Parameters for controlling the presentation of the query response, such as specifying the order in which results are to be presented or limiting the response to particular fields of the search application's schema.

Search parameters may also specify a filter query. As part of a search response, a filter query runs a query against the entire index and caches the results. Because Solr allocates a separate cache for filter queries, the strategic use of filter queries can improve search performance.

Indexing Using SolrCell
Using the Solr Cell framework, Solr uses Tika to automatically determine the type of a document and extract fields from it. These fields are then indexed directly, or mapped to other fields in your schema. The URL for this handler is http://[hostname]:8983/solr/update/extract.
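For example, a request combining a main query with a cacheable filter query might look like this sketch (the field names are illustrative):

```
http://[hostname]:8983/solr/select?q=plasma+tv&fq=office:Bridgewater
```

Repeating the same fq on later requests can then be answered from the filter cache rather than re-evaluated against the index.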
Common Query Parameters
The table below summarizes Solr's common query parameters:

Parameter | Description
defType | The query parser to be used to process the query
sort | Sort results in ascending or descending order based on the document's score or another characteristic
start | An offset (0 by default) to the results that Solr should begin displaying
rows | Indicates how many rows of results are displayed at a time (10 by default)
fq | Applies a filter query to the search results
fl | Limits the query's results to a listed set of fields
debugQuery | Causes Solr to include additional debugging information in the response, including score explain information for each document returned
explainOther | Allows clients to specify a Lucene query to identify a set of documents not already included in the response, returning explain information for each of those documents
wt | Specifies the Response Writer to be used to format the query response

The Extraction Request Handler accepts various parameters that can be used to specify how data should be mapped to fields in the schema, including specific XPaths of content to be extracted, how content should be mapped to fields, whether attributes should be extracted, and in which format to extract content. You can also specify a dynamic field prefix to use when extracting content that has no corresponding field.

Indexing Using Data Import Handler
The Data Import Handler (DIH) can pull data from relational databases (through JDBC), RSS feeds, email repositories, and structured XML, using XPath to generate fields.

The Data Import Handler is registered in solrconfig.xml, with a pointer to its data-config.xml file, which has the following structure:
<dataConfig>
  <dataSource/>
  <document>
    <entity>
      <field column="" name=""/>
      <field column="" name=""/>
    </entity>
  </document>
</dataConfig>

The Data Import Handler is accessed using the http://[hostname]:8983/solr/dataimport URL, but it also includes a browser-based console which allows you to experiment with data-config.xml changes and demonstrates all of the commands and options to help with development. You can access the console at this address: http://[hostname]:port/solr/admin/dataimport.jsp

Lucene Query Parser
The standard query parser syntax allows users to specify queries containing complex expressions, such as:

http://[hostname]:8983/solr/select?q=id:SP2514N+price:[*+TO+10]

The standard query parser supports the parameters described in the following table:

Parameter | Description
q | Query string using the Lucene Query syntax
q.op | Specifies the default operator for the query expression, overriding that in schema.xml. May be AND or OR
df | Default field, overriding what is defined in schema.xml
SEARCHING

Data can be queried using either the http://[hostname]:8983/solr/select?qt=name URL, or by using the http://[hostname]:8983/solr/name syntax for SearchHandler instances with names that begin with a "/".

Distributed Search
When an index becomes too large to fit on a single system, or when a query takes too long to execute, the index can be split into multiple shards on different Solr servers, for Distributed Search. Solr can query and merge results across shards. It's up to you to get all your documents indexed on each shard of your server farm. Solr does not include out-of-the-box support for distributed indexing, but your method can be as simple as a round robin technique. Just index each document to the next server in the circle.

DisMax Query Parser
The DisMax query parser is designed to provide an experience similar to that of popular search engines such as Google, which rarely display syntax errors to users. Instead of allowing complex expressions in the query string, additional parameters can be used to specify how the query string should be used to find matching documents.

Parameter | Description
pf | Phrase Fields: fields that give a score boost when all terms of the q parameter appear in close proximity
ps | Phrase Slop: the number of positions all terms can be apart in order to match the pf boost
tie | Tie Breaker: a float value (less than 1) used as a multiplier with more than one of the qf fields containing a term from the query string. The smaller the value, the less influence multiple matching fields have
bq | Boost Query: a raw Lucene query that will be added to the user's query to influence the score
bf | Boost Function: like bq, but directly supports the Solr function query syntax
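A DisMax request combining several of these parameters might look like this sketch (the field names and boost values are illustrative):

```
http://[hostname]:8983/solr/select?defType=dismax&q=plasma+tv&qf=title^2+body&pf=title&ps=2
```

Here the plain user query is matched against title and body (with title weighted double), and documents where the terms appear as a near-phrase in title get an extra boost.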
DZone, Inc. | 140 Preston Executive Dr., Suite 100 | Cary, NC 27513
ISBN-13: 978-1-936502-02-8 | ISBN-10: 1-936502-02-X