
John Smart provides a quick intro to Lucene, a powerful and elegant library for full-text
indexing and searching in Java, with which you can add rich full-text search functionality
to your Java web application.

Lucene is a powerful and elegant library for full-text indexing and searching in Java. In
this article, we go through some Lucene basics, by adding simple yet powerful full-text
index and search functions to a typical J2EE web application.

NOTE

For your convenience, all of the code for this article’s Lucene demo is included in a
source.zip file.

Full-Text Searching
Nowadays, any modern web site worth its salt is expected to offer a "Google-like"
search function. Multi-criteria search screens are often perceived by users as
too complicated, and are in fact rarely used. Users want to be able to just type the
word(s) they’re seeking and have the computer do the rest. This explains the growing
popularity of search engines such as those of Yahoo! and Google and, more recently,
tools such as Google Desktop.

If you need to add this sort of rich full-text search functionality to your Java web
application, look no further! Lucene is an extremely rich and powerful full-text search
API written in Java. You can use Lucene to provide consistent full-text indexing across
both database objects and documents in various formats (Microsoft Office documents,
PDF, HTML, text, and so on).

In this article, we’ll go through the basics of using Lucene to add full-text search
functionality to a fairly typical J2EE application—an online accommodation database.
The main business object is the Hotel class. In this tutorial, a Hotel has a unique
identifier, a name, a city, and a description.
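
As a point of reference, the Hotel class might look something like the minimal sketch below. The getter names match the calls used in the indexing code later in this article; the String type for the identifier and the constructor are assumptions made for illustration only.

```java
/**
 * Hypothetical Hotel business object. Only the four properties named in the
 * article are modeled; persistence and validation are deliberately left out.
 */
final class Hotel {
    private final String id;          // unique identifier (assumed to be a String)
    private final String name;
    private final String city;
    private final String description;

    Hotel(String id, String name, String city, String description) {
        this.id = id;
        this.name = name;
        this.city = city;
        this.description = description;
    }

    String getId() { return id; }
    String getName() { return name; }
    String getCity() { return city; }
    String getDescription() { return description; }
}
```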

NOTE

We won’t worry about the underlying storage mechanism (JDBC, Hibernate, EJB 3, or
whatever) or the display layer technology (JSP/Struts, JSF, Tapestry, or whatever). We’ll
just focus on the business layer and the indexing and search functionalities, which are
largely independent of the other architectural layers.

Creating an Index
The first step in implementing full-text searching with Lucene is to build an index. This is
easy—you just specify a directory and an analyzer class. The analyzer breaks text fields
into indexable tokens; this is a core part of Lucene.

Several types of analyzers are provided out of the box. Table 1 shows some of the more
interesting ones.

Table 1 Lucene analyzers.

Analyzer            Description
StandardAnalyzer    A sophisticated general-purpose analyzer.
WhitespaceAnalyzer  A very simple analyzer that just separates tokens using white
                    space.
StopAnalyzer        Removes common English words that are not usually useful for
                    indexing.
SnowballAnalyzer    An interesting experimental analyzer that works on word roots (a
                    search on rain should also return entries with raining, rained,
                    and so on).

There are even a number of language-specific analyzers, including analyzers for German,
Russian, French, Dutch, and others.

It isn’t difficult to implement your own analyzer, though the standard ones often do the
job well enough. For the sake of simplicity, we’ll use the StandardAnalyzer in this
tutorial.
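
To get a feel for what an analyzer does to a piece of text, here is a plain-Java sketch that roughly imitates two of the behaviors listed in Table 1. It does not use Lucene at all: the ToyAnalyzers class, its method names, and the tiny stop-word list are made up for illustration, and the real analyzers handle many more cases.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy tokenizers, for illustration only; these are not Lucene classes. */
final class ToyAnalyzers {
    // A tiny illustrative stop-word list; Lucene ships its own, longer list.
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "an", "and", "in", "of", "the", "to"));

    /** Splits on runs of white space only; case and punctuation are left untouched. */
    static List<String> whitespace(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new ArrayList<>();
        }
        return new ArrayList<>(Arrays.asList(trimmed.split("\\s+")));
    }

    /** Lower-cases the text, splits on anything that is not a letter, and drops stop words. */
    static List<String> stop(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "A room in the heart of Paris";
        System.out.println(whitespace(text)); // every word is kept as typed
        System.out.println(stop(text));       // only the lower-cased content words remain
    }
}
```

The point to take away is that the choice of analyzer decides which tokens end up in the index, and therefore which query terms can match.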

Next, we need to create an IndexWriter object. The IndexWriter object is used to
create the index and to add new index entries to this index. You can create an
IndexWriter with the StandardAnalyzer analyzer as follows:

// The final argument (true) creates a new index, overwriting any existing one
IndexWriter indexWriter = new IndexWriter("index", new StandardAnalyzer(), true);

Indexing an Object
Now you need to index your business objects. To index an object, you use the Lucene
Document class, to which you add the fields that you want indexed. A Lucene Document
is basically a container for a set of indexed fields. This is best illustrated by an example:

Document doc = new Document();
doc.add(new Field("description", hotel.getDescription(),
                  Field.Store.YES, Field.Index.TOKENIZED));

To add a field to a document, you create a new instance of the Field class. A field is
made up of a name and a value (the first two parameters in the class constructor). The
value may take the form of a String, or a Reader if the object to be indexed is a file.

The two other parameters are used to determine how the field will be stored and indexed
in the Lucene index:

• Storing the value. Does the value need to be stored in the index, or just indexed
and discarded? Storing the value is useful if the value should be displayed in the
search result list, for example. If the value must be stored, use Field.Store.YES.
You can also use Field.Store.COMPRESS for large documents or binary value
fields. If you don’t need to store the value, use Field.Store.NO.
• Indexing the value. Does the value need to be indexed? A database identifier, for
example, may just be stored and used later for object retrieval, but not indexed. In
this case, you use Field.Index.NO. In most other cases, you’ll index the value
using the token analyzer associated with the index writer. To do this, you use
Field.Index.TOKENIZED. The value Field.Index.UN_TOKENIZED can be used if
you need to index a value without parsing it with the analyzer; in this case, the
value will be used "as is."
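
The difference between storing and indexing can be easier to see in a toy model. The following plain-Java sketch is not Lucene code: ToyIndex, IndexMode, and the method names are hypothetical, and the tokenizer is a crude stand-in for a real analyzer. It only illustrates that a stored value can be read back, an indexed value can be searched, and the two choices are independent.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

/** Toy in-memory model of the store/index choices; not Lucene's API. */
final class ToyIndex {
    enum IndexMode { NO, TOKENIZED, UN_TOKENIZED }

    // "field:term" -> ids of the documents that contain that term in that field
    private final Map<String, Set<Integer>> postings = new HashMap<>();
    // document id -> (field name -> stored value)
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void addField(int docId, String name, String value, boolean store, IndexMode mode) {
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(name, value);
        }
        if (mode == IndexMode.TOKENIZED) {
            // Crude stand-in for an analyzer: lower-case, split on non-alphanumerics
            for (String term : value.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
                if (!term.isEmpty()) {
                    post(name, term, docId);
                }
            }
        } else if (mode == IndexMode.UN_TOKENIZED) {
            post(name, value, docId); // the whole value is one term, used "as is"
        }
    }

    private void post(String field, String term, int docId) {
        postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
    }

    /** Ids of the documents whose indexed field contains exactly this term. */
    Set<Integer> search(String field, String term) {
        return postings.getOrDefault(field + ":" + term, Collections.emptySet());
    }

    /** The stored value, or null if the field was not stored. */
    String get(int docId, String field) {
        return stored.getOrDefault(docId, Collections.emptyMap()).get(field);
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addField(1, "id", "H-1", true, IndexMode.NO);
        index.addField(1, "city", "New York", true, IndexMode.UN_TOKENIZED);
        index.addField(1, "content", "Quiet hotel in New York", false, IndexMode.TOKENIZED);

        System.out.println(index.search("id", "H-1"));        // stored only, so not searchable
        System.out.println(index.search("city", "New York")); // matches the whole value
        System.out.println(index.search("content", "york"));  // matches a single token
        System.out.println(index.get(1, "content"));          // indexed only, nothing to read back
    }
}
```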

For our example, we just want some fairly simple full-text searching. So we add the
following fields:

• The hotel identifier, so we can retrieve the object later on from the query result
list.
• The hotel name, which we need to display in the query result lists.
• The hotel description, if we need to display this information in the query result
lists.
• Composite text containing key fields of the Hotel object:
o Hotel name
o Hotel city
o Hotel description

We want full-text indexing on this field. We don’t need to display the indexed
text in the query results, so we use Field.Store.NO to save index space.

Here’s the method that indexes a given hotel:

public static void indexHotel(Hotel hotel) throws IOException {
    IndexWriter writer = (IndexWriter) getIndexWriter(false);
    Document doc = new Document();
    // Identifier: stored for later object retrieval, but not searchable
    doc.add(new Field("id", hotel.getId(), Field.Store.YES,
                      Field.Index.NO));
    // Name and description: stored for display and analyzed for searching
    doc.add(new Field("name", hotel.getName(), Field.Store.YES,
                      Field.Index.TOKENIZED));
    // City: stored and indexed "as is", without going through the analyzer
    doc.add(new Field("city", hotel.getCity(), Field.Store.YES,
                      Field.Index.UN_TOKENIZED));
    doc.add(new Field("description", hotel.getDescription(),
                      Field.Store.YES,
                      Field.Index.TOKENIZED));
    // Composite field used for full-text queries: indexed but not stored
    String fullSearchableText
        = hotel.getName()
        + " " + hotel.getCity() + " " + hotel.getDescription();
    doc.add(new Field("content", fullSearchableText,
                      Field.Store.NO,
                      Field.Index.TOKENIZED));
    writer.addDocument(doc);
}

Once the indexing is finished, you have to close the index writer, which updates and
closes the associated files on the disk. Opening and closing the index writer is time-
consuming, so it’s not a good idea to do it systematically for each operation in the case of
batch updates. For example, here’s a function that rebuilds the whole index:

public void rebuildIndexes() throws IOException {
    //
    // Erase existing index
    //
    getIndexWriter(true);
    //
    // Index all hotel entries
    //
    Hotel[] hotels = HotelDatabase.getHotels();
    for (Hotel hotel : hotels) {
        indexHotel(hotel);
    }
    //
    // Don’t forget to close the index writer when done
    //
    closeIndexWriter();
}

Full-Text Searching
Now that we’ve indexed our database, we can do some searching. Full-text searching is
done using the IndexSearcher and QueryParser classes. You provide an analyzer
object to the QueryParser; note that this must be the same one used during the indexing.
You also specify the field that you want to search, and the (user-provided) full-text query.
Here’s the class that handles the search function:

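import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;

public class SearchEngine {

    /** Creates a new instance of SearchEngine */
    public SearchEngine() {
    }

    public Hits performSearch(String queryString)
            throws IOException, ParseException {
        // Must be the same type of analyzer that was used for indexing
        Analyzer analyzer = new StandardAnalyzer();
        IndexSearcher is = new IndexSearcher("index");
        // Search the composite "content" field
        QueryParser parser = new QueryParser("content", analyzer);
        Query query = parser.parse(queryString);
        Hits hits = is.search(query);
        return hits;
    }
}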

The search() function returns a Lucene Hits object. This object contains a list of
Lucene Hit objects, in order of relevance. The resulting Document objects can be
obtained directly, as shown here:

Hits hits = instance.performSearch("Notre Dame");
for (int i = 0; i < hits.length(); i++) {
    Document doc = hits.doc(i);
    String hotelName = doc.get("name");
    ...
}

As in this example, once you obtain the Document object, you can use the get() method
to fetch field values that have been stored during indexing.

Another possible approach is to use an Iterator, as in the following example:

public void testPerformSearch() throws Exception {
    System.out.println("performSearch");
    SearchEngine instance = new SearchEngine();
    Hits hits = instance.performSearch("Notre Dame museum");

    System.out.println("Results found: " + hits.length());
    Iterator<Hit> iter = hits.iterator();
    while (iter.hasNext()) {
        Hit hit = iter.next();
        Document doc = hit.getDocument();
        System.out.println(doc.get("name")
                           + " " + doc.get("city")
                           + " (" + hit.getScore() + ")");
    }
    System.out.println("performSearch done");
}

In this example, you can see how the Hit object can be used not only to fetch the
corresponding document, but also to fetch the relative "score"—getScore()—obtained
by this document in the search. The score gives an idea of the relative pertinence of each
document in the result set. For example, the unit test above produces the following
output:

performSearch
Results found: 9
Hôtel Notre Dame Paris (0.5789772)
Hôtel Odeon Paris (0.40939873)
Hôtel Tonic Paris (0.34116563)
Hôtel Bellevue Paris (0.34116563)
Hôtel Marais Paris (0.34116563)
Hôtel Edouard VII Paris (0.16353565)
Hôtel Rivoli Paris (0.11563717)
Hôtel Trinité Paris (0.11563717)
Clarion Cloitre Saint Louis Hotel Avignon (0.11563717)
performSearch done
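
To illustrate the idea of a relative score, here is a deliberately naive plain-Java sketch that ranks texts by the fraction of query terms they contain. This is not how Lucene computes its scores (its formula takes more factors into account), and the ToyRanking class and its methods are invented for this example. It only shows why a text matching more of the query terms comes out ahead of one that matches fewer.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Naive relevance ranking, for illustration only; not Lucene's scoring formula. */
final class ToyRanking {

    /** Lower-cased distinct terms of a text, split on anything that is not a letter. */
    static Set<String> terms(String text) {
        Set<String> result = new LinkedHashSet<>();
        for (String term : text.toLowerCase(Locale.ROOT).split("[^\\p{L}]+")) {
            if (!term.isEmpty()) {
                result.add(term);
            }
        }
        return result;
    }

    /** Fraction of the distinct query terms that appear in the text (0.0 to 1.0). */
    static double score(String query, String text) {
        Set<String> queryTerms = terms(query);
        if (queryTerms.isEmpty()) {
            return 0.0;
        }
        Set<String> textTerms = terms(text);
        int matched = 0;
        for (String term : queryTerms) {
            if (textTerms.contains(term)) {
                matched++;
            }
        }
        return (double) matched / queryTerms.size();
    }

    /** Texts that match at least one query term, highest score first. */
    static List<String> rank(String query, List<String> texts) {
        List<String> hits = new ArrayList<>();
        for (String text : texts) {
            if (score(query, text) > 0) {
                hits.add(text);
            }
        }
        hits.sort(Comparator.comparingDouble((String t) -> score(query, t)).reversed());
        return hits;
    }
}
```

With the query "Notre Dame museum", a text containing both "Notre" and "Dame" is ranked above one that mentions only "museum", and a text containing none of the three terms is left out of the results altogether.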

Summary and References


There is much more to Lucene than is described here. In fact, we barely scratched the
surface. However, this example does show how easy it is to implement full-text search
functions in a Java database application. Try it out, and add some powerful full-text
search functions to your web site today!

• Lucene web site
• Lucene in Action (Manning, 2004), by Erik Hatcher and Otis Gospodnetic
