Full text search in Rails with Sunspot and Solr – Maurício Linhares
@mauriciojr – http://codeshooter.wordpress.com/
FULL TEXT SEARCH IN RAILS WITH SUNSPOT AND SOLR

STARTING THE ENGINES
LISTING 1 – SUNSPOT.YML
LISTING 2 – CREATE_BASE_TABLES.RB
LISTING 3 – CATEGORY.RB
LISTING 4 – PRODUCT.RB
SEARCHING
LISTING 4 – PRODUCTS_CONTROLLER.RB
LISTING 5 – SUNSPOT_HACK.RB
INDEXING
IMAGE 1 – SOLR SCHEMA BROWSER
IMAGE 2 – VIEWING THE ANALYSIS AND SEARCH FILTERS
IMAGE 3 – SOLR ANALYZER PAGE
CUSTOMIZING FIELDS
LISTING 6 – SOLR/CONF/SCHEMA.XML EXCERPT
LISTING 7 – SOLR/CONF/SCHEMA.XML EXCERPT
IMAGE 4 – SOLR ANALYZER PAGE
PARTIAL MATCHING
LISTING 8 – SOLR/CONF/SCHEMA.XML EXCERPT
IMAGE 5 – ANALYZER OUTPUT WITH PARTIAL MATCHING ENABLED
FACETING
LISTING 9 – PRODUCTS_CONTROLLER.RB EXCERPT
LISTING 10 – PRODUCT.RB EXCERPT
LISTING 11 – PRODUCTS/INDEX.HTML.HAML EXCERPT
IMAGE 6 – FACETING INFORMATION
CONCLUSION
This material is provided under a Creative Commons Licence -

http://creativecommons.org/licenses/by-nc-sa/3.0/
Full text search in Rails with Sunspot and Solr

Everyone wants their database to run everything as fast as possible. The usual
advice is to query less, add more caching, and index the columns being
searched. Another option is to not use the database at all for searching and
look for a tool better suited to your querying needs.
When querying for text in our databases, we’re often doing “LIKE” searches. LIKE
searches only perform well if the field has an index and the query is written in
a way that lets the database use it. Imagine that you have a “name” field that
contains the text “Battlestar Galactica”. This query would be able to run and
use the index:
SELECT p.* FROM products p WHERE p.name LIKE 'Battlestar%'
The database would be able to optimize this query and use the index to find the
expected row. But, what if the query was like this one:
SELECT p.* FROM products p WHERE p.name LIKE '%Galactica'
Database indexes usually match from left to right, so, unless you have a nasty
trick up your sleeve, this query will just look at ALL the rows in the products
table and perform a match on every “name” column before returning a result.
And that’s Really Bad News for you, as the DBA will probably come for you
holding a Morning Star to beat you badly. So, querying with “LIKE” when
what you need is full text search isn’t nice.
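To make the difference concrete, here is a minimal Ruby sketch of why a prefix match can use a sorted index while a suffix match cannot. It is a toy model and not how any particular database implements its indexes; the sample names and the helper methods are made up for illustration.

```ruby
# Toy model of a database index: the indexed column values kept in sorted order.
# The sample names are hypothetical.
NAMES = ["Alien", "Battlestar Galactica", "Battlestar Pegasus", "Caprica", "Firefly"].sort.freeze

# LIKE 'prefix%': a binary search jumps to the first candidate, then we walk
# forward only while the prefix still matches.
def prefix_search(sorted, prefix)
  start = sorted.bsearch_index { |name| name >= prefix }
  return [] if start.nil?
  sorted[start..-1].take_while { |name| name.start_with?(prefix) }
end

# LIKE '%suffix': the sort order says nothing about how a value ends, so
# every single row has to be inspected.
def suffix_search(rows, suffix)
  rows.select { |name| name.end_with?(suffix) }
end

p prefix_search(NAMES, "Battlestar")
p suffix_search(NAMES, "Galactica")
```

The prefix search touches only a handful of entries no matter how large the list grows, while the suffix search always visits all of them.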
That’s where full text search based solutions come in for help. Tools like Solr
allow you to perform optimized text searches, filter input, categorization and
even features like Google’s “Did you mean?”.
In this tutorial you’ll learn how to add full text searching capabilities to your
Rails application using Sunspot and Solr. We will also delve a little bit into
Solr’s configuration and learn how to use specific tokenizers and filters to
clean up input, perform partial matching of words and facet results.
This project uses Rails 3 and Ruby 1.9.2. You’ll find a Gemfile and an “.rvmrc”
with all dependencies declared, so it should be pretty easy to follow along or
set up your environment based on them (if you’re not using RVM, this is a GREAT
time to learn it).
You can probably follow this tutorial with an earlier Rails version and without
Bundler or RVM, given that all models and most of the code will look exactly the
same in Rails 2, and Sunspot is compatible with Rails 2 too.
The source code for this application is available at GitHub here -

https://github.com/mauricio/sunspot_tutorial

Starting the engines

Download the Sunspot source code from GitHub -
https://github.com/outoftime/sunspot
Enter the project folder and go to “sunspot/solr-1.3”. Inside that folder you
should see a “solr” folder; copy it into your project’s folder. This is where
the general Solr configuration is going to live. Don’t worry about these files
just yet, we’ll get to them later in this tutorial.
Now create a “sunspot.yml” file under your project’s “config” folder. Here’s a
sample:
Listing 1 – sunspot.yml
development:
  solr:
    hostname: localhost
    port: 8980
    log_level: INFO
  auto_commit_after_delete_request: true
test:
  solr:
    hostname: localhost
    port: 8981
    log_level: OFF
production:
  solr:
    hostname: localhost
    port: 8982
    log_level: WARNING
  auto_commit_after_request: true
You can have different configurations for every environment you’re running. To
see all configuration options, go to the Sunspot source code and head to the
“sunspot_rails/lib/sunspot/rails/configuration.rb” file.
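As a rough illustration of what a per-environment file like this buys you, the sketch below reads a trimmed-down copy of the YAML and builds the Solr URL for one environment. It is not Sunspot's actual loading code; the `solr_url` helper and the inlined sample are assumptions made for the example.

```ruby
require 'yaml'

# Hypothetical trimmed-down copy of the file above, inlined so the sketch
# runs on its own.
SAMPLE_CONFIG = <<~CONFIG
  development:
    solr:
      hostname: localhost
      port: 8980
  production:
    solr:
      hostname: localhost
      port: 8982
CONFIG

# Pick the "solr" section of one environment and build the server URL from it.
def solr_url(config_text, environment)
  solr = YAML.safe_load(config_text).fetch(environment).fetch("solr")
  "http://#{solr.fetch('hostname')}:#{solr.fetch('port')}/solr"
end

puts solr_url(SAMPLE_CONFIG, "development")
```

Because every environment has its own port, development, test and production instances can run side by side on the same machine without sharing an index.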
Now we’ll create two models, Product and Category, so let’s start by creating the
migration that will set them up:
rails g migration create_base_tables

Listing 2 – create_base_tables.rb
class CreateBaseTables < ActiveRecord::Migration
  def self.up
    create_table :categories do |t|
      t.string :name, :null => false
    end
    create_table :products do |t|
      t.string :name, :null => false
      t.decimal :price, :scale => 2, :precision => 16, :null => false
      t.text :description
      t.integer :category_id, :null => false
    end
    add_index :products, :category_id
  end

  def self.down
    drop_table :categories
    drop_table :products
  end
end
Now we move on to the basic models, starting with the Category model:
Listing 3 – category.rb
class Category < ActiveRecord::Base
  has_many :products
  validates_presence_of :name
  validates_uniqueness_of :name, :allow_blank => true

  searchable :auto_index => true, :auto_remove => true do
    text :name
  end

  def to_s
    self.name
  end
end
Here in the Category class we see our first reference to Sunspot: the
“searchable” method, where we configure the fields that should be indexed by
Solr. In the Category class there’s only one field that’s useful at the moment,
“name”, so we tell Sunspot to index it as “text” (you usually don’t want your
text indexed as “string”, as a string field is only a hit on a full match).
The :auto_index and :auto_remove options let Sunspot automatically send your
model to Solr when it is created or updated and remove it from the index when
it is destroyed. If either option is set to “false” you have to send that data
to Solr manually, so unless you really want to do that, you should keep both of
these values as “true” in your models.
Now let’s look at the Product class:
Listing 4 – product.rb
class Product < ActiveRecord::Base
  belongs_to :category
  validates_presence_of :name, :description, :category_id, :price
  validates_uniqueness_of :name, :allow_blank => true

  searchable :auto_index => true, :auto_remove => true do
    text :name, :boost => 2.0
    text :description
    float :price
    integer :category_id
  end

  def to_s
    self.name
  end
end
In our Product class things are a little bit different: we have more fields (and
more kinds of fields) being indexed. “float” and “integer” are pretty self
explanatory, but the “name” field has some black magic floating around, with the
“boost” parameter. Boosting a field when indexing means that if the match is in
that specific field, it has more “relevance” than if found somewhere else.
Imagine that you’re looking for Iron Maiden’s “Powerslave” album. You go to Iron
Maiden’s Online Store and search for “powerslave”, hoping that the album will be
the first hit, but then you see “Live After Death” before “Powerslave”. Why did
it happen? The “Live After Death” album contains the “Powerslave” song in its
track listing, so it’s a match as much as the real “Powerslave” album. What we
need here is to tell the search tool that if a match is on an album name, it has
higher relevance than if the hit is in the track listing.
Boosting allows you to reduce these issues. Some fields are inherently more
important than others and you can tell that to Solr by configuring a “:boost”
value for them. When something matches on them, the relevance of that match is
increased and it should come up before the other results in the search.
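The effect of a boost can be sketched with a toy scoring function. This is only an illustration of the idea: Solr's real relevance formula is considerably more involved, and the `Album` struct, the `BOOSTS` table and the helper methods below are hypothetical.

```ruby
# Toy relevance scoring, NOT Solr's real formula: every field that contains
# the term contributes its boost to the score.
Album = Struct.new(:name, :tracks)

ALBUMS = [
  Album.new("Live After Death", "Aces High, 2 Minutes to Midnight, Powerslave"),
  Album.new("Powerslave", "Aces High, 2 Minutes to Midnight, Powerslave")
].freeze

# Hypothetical boost table: a hit in the name counts twice as much as a hit
# in the track listing.
BOOSTS = { :name => 2.0, :tracks => 1.0 }.freeze

def score(album, term, boosts)
  boosts.sum(0.0) do |field, boost|
    album[field].downcase.include?(term.downcase) ? boost : 0.0
  end
end

# Highest score first; the name breaks ties so the order is deterministic.
def ranked(albums, term, boosts)
  albums.sort_by { |album| [-score(album, term, boosts), album.name] }.map(&:name)
end

p ranked(ALBUMS, "powerslave", BOOSTS)
```

In this sketch the live album only matches through its track listing and scores 1.0, while the studio album also matches on the boosted name and scores 3.0, so it is listed first.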
Searching
Now let’s take a look at the ProductsController to see how we perform the
search:

Listing 4 – products_controller.rb
class ProductsController < ApplicationController
  def index
    @products = if params[:q].blank?
      Product.all :order => 'name ASC'
    else
      Product.solr_search do |s|
        s.keywords params[:q]
      end
    end
  end
end
As you can see, searching is quite simple: you just call the solr_search method
and send in the text to be searched for. One thing that I don’t like about
Sunspot is that searches do not return an Array-like object. You get a
Sunspot::Search::StandardSearch object whose “results” property holds the array
of records returned by the search.
Here’s a simple way to fix this issue (I usually place the contents of this file
inside an initializer in “config/initializers”):
Listing 5 – sunspot_hack.rb
::Sunspot::Search::StandardSearch.class_eval do
  include Enumerable

  delegate(
    :current_page,
    :per_page,
    :total_entries,
    :total_pages,
    :offset,
    :previous_page,
    :next_page,
    :out_of_bounds?,
    :each,
    :in_groups_of,
    :blank?,
    :[],
    :to => :results)
end
This simple monkeypatch makes the search object itself behave like an
Enumerable/Array and you can use it to navigate directly in the results, without
having to call the “results” method. The methods usually used by will_paginate
helpers are also included so you can pass this object to a will_paginate call in
your view and it’s just going to work.
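The same delegation idea can be tried outside Rails with Ruby's standard Forwardable module. The `FakeSearch` class below is a hypothetical stand-in with the same shape as the search object (records kept in a `results` array); it is not Sunspot's class.

```ruby
require 'forwardable'

# Stand-in for a search object: NOT Sunspot's class, just something that
# keeps its records in a `results` array.
class FakeSearch
  extend Forwardable
  include Enumerable

  attr_reader :results

  def initialize(results)
    @results = results
  end

  # Forward collection-style calls to the wrapped array. Enumerable builds
  # map, select, include? and friends on top of the delegated `each`.
  def_delegators :results, :each, :[], :size, :empty?
end

search = FakeSearch.new(["Battlestar Galactica", "Caprica"])
p search.map(&:upcase)
p search[0]
p search.include?("Caprica")
```

Delegating `each` and including Enumerable is what gives the object its collection behaviour; the other forwarded methods are conveniences.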
Indexing

Now that all the models are in place, we can start fine tuning the Solr indexing
process. The first thing to understand is what happens when you send text to be
indexed by Solr. Let’s get into the tool, starting the server:
rake sunspot:solr:run
This rake task starts Solr in the foreground (if you wanted to start it in the
background, you’d use “sunspot:solr:start”). With Solr running, you should add
some data to the database. This tutorial’s project on GitHub contains a “seed.rb”
file with some basic data for testing; just copy it over to your project.
Also copy “lib/tasks/db.rake” from the tutorial project to your project. It
contains a “db:prepare” task that truncates the database, seeds it and then
indexes all items in Solr, and we’re going to be reindexing data a lot.
With everything copied, run the “db:prepare” task:
rake db:prepare
This will add the categories and products to your database and also index them
in Solr. If this task ran successfully, head to the Solr administration
interface at this URL:
http://localhost:8980/solr/admin/schema.jsp
Once you’re there, click on “FIELDS”, then on “NAME_TEXT”. You should see a
screen just like the one in Image 1:

Image 1 – Solr schema browser
If you don’t see all the fields that are available in this image, your “rake
db:prepare” command has probably failed or Solr wasn’t running when you
called it.
What we see here is the information about the fields we’re indexing. This specific
field contains all data from the name properties from both Category and Product
classes, as you can notice from the top 10 terms.
The name field is not indexed by its full content, as a relational database
would usually do. The text is broken into tokens by the
solr.StandardTokenizerFactory class in Solr. This class receives our text, like
“Battlestar Galactica: The Boardgame”, and turns it into:
[“Battlestar”, “Galactica”, “The”, “Boardgame”]
This is what gets indexed and, ultimately, searched by Solr. If you open the web
application now and try to search for “battle”, you won’t have any matches. If you
search for “Battlestar”, you get the two products that match the name.
Everything when indexing information in Solr revolves around building the best
“tokens” available for your input. You have to teach Solr to crunch your data in a
way that makes sense and makes it easy to search for, and adding filters to the
indexing process does this. While in the same page as Image 1 above, click on the
“DETAILS” links as shown in Image 2:

Image 2 – Viewing the analysis and search filters
Each field in Solr has two analyzers, one is the “index” analyzer, that prepares the
input to be indexed and the other is the “query” analyzer that prepares the
search input to finally perform a search. Unless you have some special need, both
of them are usually the same.
In our current configuration, we have the same two filters for both of the
analyzers. By the time they run, the tokenizer has already dropped punctuation
(the “:” in “Battlestar Galactica: The Boardgame” is not in our tokens). The
StandardFilterFactory then tidies up the tokens it receives (removing the dots
from acronyms, for example) and the LowerCaseFilterFactory makes all input
lowercase, so we can search with “baTTlestar”, “BATTLESTAR” or “BaTtLeStAr” and
they’re all going to work.
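A rough Ruby approximation of this analysis chain helps to see why the casing of the query does not matter while a partial word still fails. The regex tokenizer is a simplification (Solr's StandardTokenizerFactory follows more elaborate rules), and the sample documents and helper names are made up.

```ruby
# Simplified analysis: split on anything that is not a letter or digit,
# then lowercase every token.
def analyze(text)
  text.scan(/[[:alnum:]]+/).map(&:downcase)
end

# Inverted index: token => ids of the documents that contain it.
def build_index(documents)
  documents.each_with_object(Hash.new { |hash, key| hash[key] = [] }) do |(id, text), index|
    analyze(text).uniq.each { |token| index[token] << id }
  end
end

# Hypothetical sample data.
DOCUMENTS = {
  1 => "Battlestar Galactica: The Boardgame",
  2 => "Battlestar Galactica Season 1"
}.freeze

INDEX = build_index(DOCUMENTS)

# The query goes through the same analysis; every query token must match a
# whole token in the index.
def search(index, query)
  analyze(query).map { |token| index.fetch(token, []) }.reduce(:&) || []
end

p analyze("Battlestar Galactica: The Boardgame")
p search(INDEX, "BaTTleStar")
p search(INDEX, "battle")
```

The mixed-case query finds both documents because index and query tokens are lowercased the same way, and “battle” finds nothing because no token in the index is exactly “battle”.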
Before we move on to add more filters to our analyzers, let’s take a look at the
analyzer screen in Solr Admin at -
http://localhost:8980/solr/admin/analysis.jsp?highlight=on
In this screen we see how our input is going to be transformed into tokens by the
configured analyzers.

Image 3 – Solr analyzer page
In this screen we have selected the “name_text” field in Solr. In the “Field value
(Index)” you enter the values you’re sending to be indexed, just like you would
send from your model property, in the “Field value (Query)” you enter the values
you’d use to search.
Once you type and hit “Analyze” you should see the output just below the form as
we see in Image 3. This output shows how your input is transformed into tokens
by the tokenizer and filters, this way you can easily experiment by adding more
filters and seeing if the output really matches the way you’d expect it to. This
analysis view is your best friend when debugging search/indexing related issues
or trying out ways to improve the way Solr indexes and matches your data.
Customizing fields
Now that you have an idea about how the indexing and searching process works,
let’s start to customize the fields in Solr. Open up the “solr/conf/schema.xml”
file and look for this reference:
Listing 6 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StandardFilterFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldtype>
If you look at Image 1, where we saw the “name_text” configuration, you’ll see
that the field type is “text”. The excerpt above is the configuration for all
fields of type “text”, which means that any filter we add here affects all
fields of this type. This greatly simplifies the way we configure the tool, as
we don’t have to define explicit configurations for every single field that our
models have; we can just reuse this same “text” config for all fields that are
supposed to be indexed as text.
But that’s a lot of talking, let’s get into action!
Let’s start the job by looking at our indexed data from before:
[“battlestar”, “galactica”, “the”, “boardgame”]
The “the” is mostly useless, as it’s going to be available in almost all properties
and no one is ever going to search for “the” (oh yeah, there might be that ONE
guy that does it). In Information Retrieval lingo, “the” is a stop word, it usually
doesn’t have meaning by itself and doesn’t represent valuable information for
our indexer, removing all stop words from your input improves performance and
the relevance of your results.
Given that this is a common operation, Solr already contains a filter that’s
capable of removing all stop words from your data, the solr.StopFilterFactory,
let’s see how we can add it to our config:
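Listing 7 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer>
    <!-- keep the tokenizer and the two filters from Listing 6 -->
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.ISOLatin1AccentFilterFactory"/>
    <filter class="solr.TrimFilterFactory" />
  </analyzer>
</fieldtype>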
If you look at the “solr/conf” folder you’ll see a “stopwords.txt” file that
already contains most of the common stop words in English. You can add or remove
words as needed, and if you’re not indexing English text you can just replace
the English words with your language’s stop words. Now make this change in your
“solr/conf/schema.xml” file, stop and start Solr again and open the analyzer:

Image 4 – Solr analyzer page
As you can see, in the last step, “the” was removed from both the index input
and the query input. We’re keeping only the pieces of information that are
really useful, which makes our index smaller and also speeds up searching.
While you were not looking, we also added two other filters. The
solr.ISOLatin1AccentFilterFactory removes accents from words in Latin based
languages, like Portuguese: if the input is “não”, it becomes “nao”. After that
there’s the solr.TrimFilterFactory, which removes unnecessary spaces from our
tokens.
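The three filters can be imitated in a few lines of Ruby to see what they do to a list of tokens. This is a sketch under simplifying assumptions: the stop word list is a tiny hypothetical one, accents are stripped with Unicode normalization rather than with Solr's own mapping table, and the filters are applied in a slightly different order than in the schema.

```ruby
# Hypothetical, tiny stop word list; the real stopwords.txt is much longer.
STOP_WORDS = %w[a an and of the].freeze

# Strip accents by decomposing characters and dropping the combining marks.
def strip_accents(token)
  token.unicode_normalize(:nfd).gsub(/\p{Mn}/, "")
end

# Lowercase, remove accents, trim spaces, then drop stop words and any
# token that ended up empty.
def filter_tokens(tokens)
  tokens
    .map { |token| strip_accents(token.downcase).strip }
    .reject { |token| token.empty? || STOP_WORDS.include?(token) }
end

p filter_tokens(["Battlestar", "Galactica", "The", "Boardgame"])
p filter_tokens(["n\u00e3o "])
```

Since the same chain runs on the query side, a user who types a word without its accent still matches documents where it was written with one.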
Partial matching
Another pretty common need is to be able to match only a part of a word, usually
a prefix. Earlier in the tutorial we saw that searching for “battle” doesn’t
yield any results, while “battlestar” does. This happens because Solr, by
default, only sees a match if it’s a full match. The word you entered must be
exactly the same as a token that’s available in the index; if there is no exact
match, Solr will tell you that there are no results.
If you look at Lucene’s Query Parser Syntax -
http://lucene.apache.org/java/2_9_1/queryparsersyntax.html (Solr is
somewhat a web interface to Lucene) you’ll see that you can use the “*” operator
to perform a partial match. We could then search for “battle*” and this would
yield the results we expect, but this kind of partial matching is slow and
could become a bottleneck for your application, so we have to figure out
another way to do this.

When all you need is prefix matching, the solr.EdgeNGramFilterFactory is your
best friend. It breaks words into pieces that are then added to the index, so it
looks like you have partial matching, but in fact the partials are tokens by
themselves in the index. Let’s see how our config would look in this case:
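Listing 8 – solr/conf/schema.xml excerpt

<fieldtype class="solr.TextField" positionIncrementGap="100" name="text">
  <analyzer type="index">
    <!-- tokenizer and filters from Listing 7 -->
    <filter class="solr.EdgeNGramFilterFactory"
            minGramSize="3"
            maxGramSize="30"/>
  </analyzer>
  <analyzer type="query">
    <!-- tokenizer and filters from Listing 7, without the EdgeNGram filter -->
  </analyzer>
</fieldtype>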
As you can see, we now have two <analyzer> sections in our <fieldtype>: one for
“index” and the other for “query”. This is needed because we don’t want our
search parameters to be broken into pieces as well. If the user is searching for
“battle”, it doesn’t make sense to show results for “bat”, so the generation of
pieces of each word should be done only when indexing information.
Now restart your Solr instance and run the form we had in the analyzer view
again. You should see something like Image 5:

Image 5 – Analyzer output with partial matching enabled
Looking at the output, “battlestar” became:
[“bat”, “batt”, “battl”, “battle”, “battles”, “battlest”, “battlesta”, “battlestar”]
Now, if you search for “battle”, you should find all products that have “battle” as a
prefix in any of their words and the search input is not affected by this change.
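The edge n-gram idea is small enough to reproduce in plain Ruby. The sketch below uses the same minimum and maximum sizes as the configuration above; the method names and the regex tokenizer are assumptions made for the example, not Solr's implementation.

```ruby
# Prefixes of a token between min and max characters long, shortest first.
def edge_ngrams(token, min = 3, max = 30)
  upper = [token.length, max].min
  (min..upper).map { |length| token[0, length] }
end

# Index side: every token is expanded into its prefixes.
def index_tokens(text)
  text.downcase.scan(/[[:alnum:]]+/).flat_map { |token| edge_ngrams(token) }
end

# Query side: tokens are left whole, so "battle" never turns into "bat".
def query_tokens(text)
  text.downcase.scan(/[[:alnum:]]+/)
end

INDEXED = index_tokens("Battlestar Galactica")

p edge_ngrams("battlestar")
p query_tokens("battle").all? { |token| INDEXED.include?(token) }
p query_tokens("star").all? { |token| INDEXED.include?(token) }
```

Only prefixes are generated, so “battle” matches but “star” does not. The price for fast prefix lookups is a larger index, since each word is stored once per prefix length.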
Faceting
Faceting of results is YACF (Yet Another Cool Feature) that you get when using
Solr and Sunspot. “What does that mean?”, you might ask. It means that Solr is
able to group your results by the values of one of their properties and tell you
how many results matched each value.
“I still don’t get it”, you might be thinking now. In our Product model we’re
indexing the “category_id” property. We’ll tell Sunspot to facet our search on
the “category_id” field and Sunspot will tell us how many matches each category
had, even if we’re paginating the results. Let’s see how our searching code
changes:

Listing 9 – products_controller.rb excerpt

def index
  @page = (params[:page] || 1).to_i
  @products = if params[:q].blank?
    Product.paginate :order => 'name ASC', :per_page => 3, :page => @page
  else
    result = Product.solr_search do |s|
      s.keywords params[:q]
      if params[:category_id].blank?
        # no category chosen yet: ask Solr for per-category counts
        s.facet :category_id
      else
        # narrow the search to the chosen category
        s.with( :category_id ).equal_to( params[:category_id].to_i )
      end
      s.paginate :per_page => 3, :page => @page
    end
    if result.facet( :category_id )
      @facet_rows = result.facet(:category_id).rows
    end
    result
  end
end
The search code really changed a lot, now if there’s a “category_id” parameter we
will use that to filter our search, if there isn’t we’re going to perform faceting
with the “s.facet :category_id” call. There’s also a slight change to the “product.rb”
class, let’s see it:
Listing 10 – product.rb excerpt

searchable :auto_index => true, :auto_remove => true do
  text :name, :boost => 2.0
  text :description
  float :price
  integer :category_id, :references => ::Category
end
We’ve added “:references => ::Category” to the “:category_id” field
configuration so Sunspot knows that this field is, in fact, a foreign key to
another object. This allows Sunspot to load the categories in the facets
automatically for you.
The “result.facet(:category_id)” call asks the search object for the facet
returned for the :category_id field in this search, and its “rows” method gives
us the list of facet rows. Each row in this list contains an “instance” (which,
in our case, is a Category object) and a “count”, the number of hits in that
specific facet. Once you get your hands on the rows, you can use them in the
view. Let’s see how we used them:
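What a facet boils down to can be shown with a plain Ruby count over a list of hits. This is an in-memory sketch with made-up data; in the real application Solr computes the counts and Sunspot wraps them in row objects.

```ruby
Hit = Struct.new(:name, :category_id)

# Hypothetical search hits, as if they had all matched the same keywords.
HITS = [
  Hit.new("Battlestar Galactica: The Boardgame", 1),
  Hit.new("Battlestar Galactica Season 1", 2),
  Hit.new("Battlestar Galactica Season 2", 2),
  Hit.new("Battlestar Galactica: Razor", 2)
].freeze

# Facet rows: one [category_id, count] pair per value, most frequent first.
def facet_counts(hits)
  counts = hits.each_with_object(Hash.new(0)) { |hit, acc| acc[hit.category_id] += 1 }
  counts.sort_by { |category_id, count| [-count, category_id] }
end

# Pagination only slices the hits; the facet counts are computed over the
# whole match set before slicing.
def page(hits, number, per_page)
  hits.each_slice(per_page).to_a[number - 1] || []
end

p facet_counts(HITS)
p page(HITS, 1, 3).map(&:name)
```

The counts cover all four hits even though the first page shows only three of them, which is the behaviour described for paginated searches.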

Listing 11 – products/index.html.haml excerpt

- if !@facet_rows.blank? && @facet_rows.size > 1
  %ul
    - for row in @facet_rows
      %li= link_to( "#{row.instance} (#{row.count})", products_path( :q => params[:q], :category_id => row.instance ) )
If there are facets available, we use them to add links that let the user filter
by each specific facet. Each row object has an instance and a count, and we use
both in the interface to tell the user which category it is and how many hits it
had. Here’s how our user interface looks:
Image 6 – Faceting information
And now you finally have search functionality added to a Rails project, with
partial matching, faceting, pagination and input cleanup. Just forget that you
have ever performed a “SELECT p.* FROM products p WHERE p.name LIKE '%battle%'”
and be happy to be using a great full text search solution.
Conclusion
Hopefully this tutorial is enough to get you up and running with Solr. For
more advanced features I’d recommend searching the Solr wiki
(http://wiki.apache.org/solr/FrontPage) and buying “Solr 1.4 – Enterprise Search
Server” by David Smiley and Eric Pugh
(http://www.amazon.com/gp/product/1847195881?ie=UTF8&tag=ultimaspalavr-20&linkCode=as2&camp=1789&creative=390957&creativeASIN=1847195881).

