What's New in Apache Lucene 2.9
Since its introduction nearly 10 years ago, Apache Lucene has become a competitive player
for developing extensible, high-performance full-text search solutions. The experience
accumulated over time by the community of Lucene committers and contributors and the
innovations they have engineered have delivered significant ongoing advances in Lucene’s
capabilities.
This white paper describes the new features and improvements in the latest version,
Apache Lucene 2.9. It is intended mainly for programmers familiar with the broad base of
Lucene’s capabilities, though those new to Lucene should also find it a useful exploration of
the newest features.
In the simplest terms, Lucene is now faster and more flexible than before. Historic weak
points have been improved to open the way for innovative new features like near-real-time
search, flexible indexing, and high-performance numerical range queries. Many new
features have been added, new APIs introduced, and critical bugs have been fixed—all with
the same goal: improving Lucene’s state-of-the-art search capabilities.
This white paper aims to address key issues for you if you have an Apache Lucene-based
application and need to upgrade existing code to work well with this latest version, so that
you may take advantage of the various improvements and prepare for the next major
release. If you do not have a Lucene application, the paper should also give you a good
overview of the innovations in this release.
Unlike the previous 2.4.1 release (March 2009), Lucene 2.9 is more than just a bug-fix
release. It introduces multiple performance improvements, new features, better runtime
behavior, API changes, and bug-fixes at a variety of levels. The 2.9 release improves Lucene
in several key aspects, which make it an even more compelling alternative to other
solutions. Most notably:
• Improvements to near-real-time search capabilities make documents searchable
almost instantaneously.
• A new, straightforward API for handling numeric ranges both simplifies
development and virtually wipes out performance overhead.
• A reworked Analysis API provides more streamlined, flexible text handling.
The generated terms are indexed just like any other string values passed to Lucene. Under
the hood, Lucene associates distinct terms with all documents containing the term, so that
all documents containing a numeric value with the same prefix are “grouped” together,
which tremendously reduces the number of terms that must be visited. This stands in
contrast to the less efficient encoding scheme of previous releases, where each unique
numeric value was indexed as a distinct term, so the cost of a range query grew with the
number of distinct values in the index.
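The grouping idea can be illustrated with a small stand-alone sketch. This is a hypothetical re-implementation of the concept, not Lucene's actual NumericUtils code: each int value is indexed at full precision and again at successively coarser precisions, so nearby values share their coarser terms.

```java
import java.util.ArrayList;
import java.util.List;

public class TriePrefixSketch {

    // Returns the prefix terms indexed for one int value: the full-precision
    // term plus one coarser term per precision step.
    static List<String> prefixTerms(int value, int precisionStep) {
        List<String> terms = new ArrayList<String>();
        // Flip the sign bit so negative values sort before positive ones.
        long bits = (value ^ 0x80000000L) & 0xFFFFFFFFL;
        for (int shift = 0; shift < 32; shift += precisionStep) {
            // Each term records its precision level plus the surviving high bits.
            terms.add(shift + ":" + Long.toHexString(bits >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // A precision step of 4 bits yields 32 / 4 = 8 terms per int value.
        List<String> a = prefixTerms(1234, 4);
        List<String> b = prefixTerms(1235, 4);
        System.out.println(a.size());                  // 8
        System.out.println(a.get(0).equals(b.get(0))); // false: full precision differs
        System.out.println(a.get(1).equals(b.get(1))); // true: coarser prefix is shared
    }
}
```

A range query only needs the few coarse terms that cover the interior of the range plus fine terms at its edges, which is why the number of visited terms stays small regardless of how many distinct values the index contains.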
You can also use the native encoding of numeric values beyond range searches. Numeric
fields can be loaded in the internal FieldCache, where they are used for sorting. Zero-
padding of numeric primitives (see code example above) is no longer needed as the trie-
encoding guarantees the correct ordering without requiring execution overhead or extra
coding.
The code listing below instead uses the new NumericField to index a numeric Java
primitive with a precision step of 4 bits. Like the straightforward NumericField,
querying numeric ranges is also served by a type-safe API: NumericRangeQuery
instances are created using one of the provided static factory methods for the
corresponding Java primitive. The example below shows a numeric range query using an
int primitive with the same precision step used in the indexing example. If different
precision-step values are used at index and search time, numeric queries can yield
unexpected results.
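A sketch of such a listing against the Lucene 2.9 API follows. The field name price, the example values, and the already-open IndexWriter named writer are invented for illustration:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.NumericField;
import org.apache.lucene.search.NumericRangeQuery;

// Indexing: a NumericField with a precision step of 4 bits; the value is
// trie-encoded under the hood.
Document doc = new Document();
doc.add(new NumericField("price", 4, Field.Store.YES, true).setIntValue(1299));
writer.addDocument(doc);

// Searching: the factory method matching the int primitive, with the same
// precision step (4) used at index time; both bounds are inclusive here.
NumericRangeQuery query =
    NumericRangeQuery.newIntRange("price", 4, 1000, 1500, true, true);
```

Note that the precision step is passed in both places; keeping the two in sync is the programmer's responsibility.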
The improvements delivered by Lucene's new numeric capabilities are equally significant
in versatility and performance. Lucene can now cover almost every use-case related to
numeric values. Everything from range searches and sorting on float or double values to
fast date searches (dates converted to time stamps) executes in less than 100 milliseconds
in most cases. By comparison, the old approach of padded full-precision values could take
30 seconds or more, depending on the underlying index.
What the above example does not demonstrate is the full power of the new token API.
There, we replaced one or more characters in the token and discarded the original one. Yet,
in many use-cases, the original token should be preserved in addition to the modified one.
Using the old API required a fair bit of work and logic to handle such a common use-case.
In contrast, the new attribute-based approach allows capture and restoration of the state of
attributes, which makes such use-cases almost trivial. The example below shows a version
of the previous example improved for Lucene 2.9, in which the original term attribute is
restored once the stream is advanced.
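Such a filter might be sketched as follows against the Lucene 2.9 API. This is a hedged reconstruction: the lowercasing transformation merely stands in for whatever modification the original example performed, and the class name is invented.

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TermAttribute;

// Emits a modified token first, then restores and emits the preserved original.
public final class KeepOriginalFilter extends TokenFilter {
  private final TermAttribute termAtt =
      (TermAttribute) addAttribute(TermAttribute.class);
  private final PositionIncrementAttribute posIncrAtt =
      (PositionIncrementAttribute) addAttribute(PositionIncrementAttribute.class);
  private State savedOriginal;

  public KeepOriginalFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (savedOriginal != null) {
      // The stream has advanced once; now restore the original token
      // at the same position (position increment 0).
      restoreState(savedOriginal);
      posIncrAtt.setPositionIncrement(0);
      savedOriginal = null;
      return true;
    }
    if (!input.incrementToken()) {
      return false;
    }
    String original = termAtt.term();
    String modified = original.toLowerCase(); // stand-in transformation
    if (!modified.equals(original)) {
      savedOriginal = captureState();         // preserve before modifying
      termAtt.setTermBuffer(modified);
    }
    return true;
  }
}
```

The key calls are captureState(), taken while the attributes still hold the original token, and restoreState(), invoked on the next advance of the stream.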
The separation of attributes makes it possible to add arbitrary properties to the analysis
chain without using a customized Token class. Attributes are then made type-safely
accessible by all subsequent TokenStream instances, and can eventually be used by the
consumer. This way, you get a generic way to add various kind of custom information, such
as part-of-speech tags, payloads, or average document length to the token stream.
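As a sketch, a hypothetical part-of-speech attribute could be declared like this. The names are invented; what is real is the Lucene 2.9 convention of pairing an Attribute interface with an implementation class named with an "Impl" suffix, which Lucene instantiates automatically:

```java
import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// The custom attribute as seen by filters and consumers.
public interface PartOfSpeechAttribute extends Attribute {
  void setPartOfSpeech(String pos);
  String getPartOfSpeech();
}

// The implementation (in its own source file), located by naming convention.
class PartOfSpeechAttributeImpl extends AttributeImpl
    implements PartOfSpeechAttribute {
  private String pos;

  public void setPartOfSpeech(String pos) { this.pos = pos; }
  public String getPartOfSpeech() { return pos; }

  // Called when the stream moves to the next token.
  public void clear() { pos = null; }

  public void copyTo(AttributeImpl target) {
    ((PartOfSpeechAttribute) target).setPartOfSpeech(pos);
  }

  public boolean equals(Object other) {
    if (other == this) return true;
    if (!(other instanceof PartOfSpeechAttributeImpl)) return false;
    String o = ((PartOfSpeechAttributeImpl) other).pos;
    return pos == null ? o == null : pos.equals(o);
  }

  public int hashCode() { return pos == null ? 0 : pos.hashCode(); }
}
```

Any TokenStream in the chain can then call addAttribute(PartOfSpeechAttribute.class) to read or write the tag for the current token.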
Unfortunately, Lucene 2.9 doesn't yet provide functionality to persist custom Attribute
implementation to the underlying index. This improvement, part of what is often referred
to as "flexible indexing," is under active development and is proposed for one of the
upcoming Lucene releases.
Beyond the generalizability of this API, one of its most significant improvements is its
effective reuse of Attribute instances across multiple iterations of analysis.
Per-Segment Search
Since the early days of Apache Lucene, documents have been stored at the lowest level in a
segment—a small but entirely independent index. On the highest abstraction level, Lucene
combines segments into one large index and executes searches across all visible segments.
As more and more documents are added to an index, Lucene buffers your documents in
RAM and flushes them to disk periodically. Depending on a variety of factors, Lucene either
incrementally adds documents to an existing segment, or creates entirely new segments. To
reduce the negative impact of an increasing number of segments on search performance,
Lucene tries to combine/merge multiple segments into larger ones. For optimal search
performance, Lucene can optimize an index, an operation that essentially merges all
existing segments into a single segment.
Prior to Lucene 2.9, search logic resided at the highest abstraction level, accessing a single
IndexReader no matter how many segments the index was composed of. Similarly, the
FieldCache was associated with the top-level IndexReader, and then had to be
invalidated each time an index was reopened. With Lucene 2.9, the search logic and the
FieldCache have moved to a per-segment level. While this has introduced a little more
internal complexity, the benefit of the tradeoff is a new per-segment index behavior that
yields a rich variety of performance improvements for unoptimized indexes.
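The practical effect can be illustrated with a toy model in plain Java (no Lucene types; segment names and the load counter are invented): because cache entries are keyed by segment, reopening an index only pays the loading cost for segments it has not seen before, instead of invalidating and rebuilding the whole cache.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of per-segment field caching: reopening reuses cached arrays
// for unchanged segments and only loads the newly flushed ones.
public class PerSegmentCacheSketch {
    private final Map<String, int[]> cache = new HashMap<String, int[]>();
    private int loadCount = 0;

    // Returns the cached values for a segment, loading them on first access.
    public int[] getInts(String segment) {
        int[] values = cache.get(segment);
        if (values == null) {
            values = loadFromDisk(segment);
            cache.put(segment, values);
            loadCount++;
        }
        return values;
    }

    private int[] loadFromDisk(String segment) {
        return new int[0]; // stand-in for the real (expensive) load
    }

    public int getLoadCount() {
        return loadCount;
    }

    public static void main(String[] args) {
        PerSegmentCacheSketch cache = new PerSegmentCacheSketch();
        // First open: segments _0 and _1 are loaded.
        cache.getInts("_0");
        cache.getInts("_1");
        // Reopen after a flush added segment _2: only _2 is loaded.
        cache.getInts("_0");
        cache.getInts("_1");
        cache.getInts("_2");
        System.out.println(cache.getLoadCount()); // 3, not 5
    }
}
```

A top-level cache, by contrast, would be keyed by the whole index and discarded on every reopen, reloading all five requests above.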
MultiTermQuery-Related Improvements
In Lucene 2.4, many standard queries, such as FuzzyQuery, WildcardQuery, and
PrefixQuery, were refactored as subclasses of MultiTermQuery. Lucene 2.9 adds
some improvements under the hood, resulting in much better performance for those
queries. [BACK-COMPATIBILITY]2
In Lucene 2.9, multi-term queries now use a constant score internally, based on the
assumption that most programmers don't care about the interim score of the queries
resulting from the term expansion that takes place during query rewriting.
2 This could be a back-compatibility issue if one of those classes has been subclassed.
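For applications that do depend on those interim scores, the behavior can be switched back on a per-query basis. The field name and prefix below are invented for illustration:

```java
import org.apache.lucene.index.Term;
import org.apache.lucene.search.MultiTermQuery;
import org.apache.lucene.search.PrefixQuery;

// In 2.9 the default rewrite produces constant scores for expanded terms.
PrefixQuery query = new PrefixQuery(new Term("title", "luc"));

// Opt back into the old scoring behavior if the expanded terms'
// individual scores matter to your application:
query.setRewriteMethod(MultiTermQuery.SCORING_BOOLEAN_QUERY_REWRITE);
```

Be aware that the scoring rewrite expands the query into a BooleanQuery and is therefore subject to the maximum-clause-count limit, which is one reason the constant-score default was chosen.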
Payloads
The Payloads feature, though originally added in a previous version of Lucene, remains
pretty new to most programmers. A payload is essentially a byte array that is associated
with a particular term in the index. Payloads can be associated with a single term during
text analysis and subsequently committed directly to the index. On the search side, these
byte arrays are accessible to influence the scoring for a particular term, or even to filter
entire documents.
3 See http://www.lucidimagination.com/blog/2009/08/05/getting-started-with-payloads/ for more information.
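As a sketch, a filter that attaches a payload during analysis might look like this against the Lucene 2.9 API. The one-byte "weight" and the class name are invented for illustration:

```java
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

// Attaches a one-byte payload (here an arbitrary "weight") to every token.
public final class WeightPayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt =
      (PayloadAttribute) addAttribute(PayloadAttribute.class);

  public WeightPayloadFilter(TokenStream input) {
    super(input);
  }

  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // The byte array is stored with the term's posting and travels
    // with it into the index.
    payloadAtt.setPayload(new Payload(new byte[] { (byte) 7 }));
    return true;
  }
}
```

On the search side, queries such as PayloadTermQuery can feed these bytes back into ranking through the Similarity.scorePayload hook.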
To provide a smooth transition from the existing core parser to the new API, this contrib
package also contains an implementation fully compliant with the standard query syntax.
This not only eases the switch to the new query parser, but it also serves as an example of
how to use and extend the API. That said, because the standard implementation is based on
the new query parser API, it can't simply replace the core parser as-is. If you have been
replacing Lucene's current query parser, you can use QueryParserWrapper instead,
which preserves the old query parser interface but calls the new parser framework. One
final caveat: the QueryParserWrapper is marked as deprecated, as the new query parser
will be moved to the core in the upcoming release and eventually replace the old API.