
SURVEY OF OPEN SOURCE FULL TEXT SEARCH SOLUTIONS
SEARCH WITHOUT FULL TEXT
• SQL LIKE '%product%'
– Easy to set up, but…
– SQL statements get too complex
– Indexes on many columns become unwieldy and slow down inserts
• Outsource to Google
– Hosted Solution
– Can only reach data that you actually render to HTML
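The LIKE approach above can be sketched as follows. This is a minimal illustration using Python's built-in sqlite3 module; the `products` table, its three text columns, and the helper name `like_search` are assumptions made up for the example, not part of the original slides.

```python
import sqlite3

def like_search(conn, term):
    """Match `term` as a substring in any of three text columns (hypothetical schema)."""
    pattern = f"%{term}%"
    # One LIKE clause per searchable column: the statement grows with the schema.
    sql = (
        "SELECT id FROM products "
        "WHERE name LIKE ? OR description LIKE ? OR notes LIKE ? "
        "ORDER BY id"
    )
    return [row[0] for row in conn.execute(sql, (pattern, pattern, pattern))]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE products ("
    "id INTEGER PRIMARY KEY, name TEXT, description TEXT, notes TEXT)"
)
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?, ?)",
    [
        (1, "Widget", "A small product for testing", ""),
        (2, "Gadget", "Unrelated item", "byproduct of widget assembly"),
        (3, "Gizmo", "Nothing relevant", ""),
    ],
)

# Rows come back in id order only; there is no notion of relevance.
results = like_search(conn, "product")
```

Note that row 2 matches only because "byproduct" contains the substring "product": every match is a plain substring hit, and nothing ranks one row above another.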
FULL TEXT GOAL
• Return matches by relevance rather than by exact value match
• Precision vs. Recall
– Precision: Are the results accurate?
– Recall: Did we get all the results we expected?
• Natural Language Search
– Queries such as “What is the fastest animal?”
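Precision and recall can be made concrete with the standard set-based definitions: precision is the fraction of returned documents that are relevant, and recall is the fraction of relevant documents that were returned. The sketch below uses made-up document IDs; the function name `precision_recall` is hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for a single query, using set overlap."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    # Guard against division by zero for empty result or relevance sets.
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# The engine returned documents 1-4; only 2, 4 and 5 were actually relevant.
p, r = precision_recall(retrieved=[1, 2, 3, 4], relevant=[2, 4, 5])
```

Here two of the four returned documents are relevant (precision 0.5), and two of the three relevant documents were found (recall about 0.67).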
FULL TEXT IMPLEMENTATION
• Inverted Index Data Structure
– Maps each word to the on-disk locations of the documents that contain it
• Tokenization, Stopwords
– Internationalization Challenges
• Basic Query Languages
– Boolean match, relevance, proximity, etc.
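The pieces above fit together roughly as in the following sketch: tokenize, drop stopwords, build an inverted index, and answer a Boolean AND query by intersecting posting sets. It is a simplified in-memory illustration; a real engine stores postings on disk and adds relevance scoring and proximity. The stopword list, the ASCII-only tokenizer, and all function names are assumptions for the example.

```python
import re
from collections import defaultdict

# Tiny illustrative stopword list; real engines ship per-language lists.
STOPWORDS = {"a", "an", "and", "is", "of", "the", "what"}

def tokenize(text):
    """Lowercase, split on non-alphanumerics (ASCII only), drop stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]

def build_index(docs):
    """Map each token to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search_and(index, query):
    """Boolean AND: documents that contain every non-stopword query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)

docs = {
    1: "The cheetah is the fastest land animal",
    2: "The peregrine falcon is the fastest animal in a dive",
    3: "A survey of open source search engines",
}
index = build_index(docs)
matches = search_and(index, "What is the fastest animal?")
```

After stopword removal the query reduces to the terms "fastest" and "animal", so `matches` contains documents 1 and 2. The ASCII-only regular expression also hints at the internationalization problem: it would discard CJK or accented text entirely.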
LANGUAGE STEMMING
• Reduce inflected words to their root
– Increase recall
– Decrease inverted index size
• Internationalization Challenges
– Language detection of the dataset to determine which stemming algorithm to use
– Stemmer complexity grows with the morphological richness of the language
• Porter Stemming Algorithm
– Examples: names -> name, departed -> depart
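To show the idea of suffix stripping, here is a deliberately naive stemmer. It is not the Porter algorithm, which applies several ordered rule phases with conditions on the remaining stem; the suffix table and the name `toy_stem` are assumptions chosen only so that the two examples above work.

```python
def toy_stem(word):
    """Strip one common English suffix. A toy, NOT the Porter algorithm."""
    word = word.lower()
    for suffix, replacement in (("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")):
        if suffix == "s" and word.endswith("ss"):
            continue  # leave words such as "class" alone
        # Require at least three characters to remain after stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + replacement
    return word

# Inflected and base forms collapse to the same index key.
stems = {toy_stem(w) for w in ["name", "names", "depart", "departed"]}
```

Four surface forms collapse into two index keys, which is how stemming both shrinks the inverted index and raises recall: a query for "names" now also finds documents that only contain "name".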
MYSQL FULL TEXT
• Pluses
– Integrated into MySQL
– Easy to use without learning a new library
• Minuses
– Indexes bigger than memory tend to be slow
– Scalability options are limited
– Can slow down insertions, deletions
– CJK (Chinese, Japanese, Korean) support is lacking
SPHINX
• Pluses
– Very Fast
– Supports many data sources
– Retrieval can be integrated into MySQL
– Distributed Searching is a scaling option

• Minuses
– Configuration can be tricky
– Live index updates accomplished by delta indexing
– Internationalization (besides Russian) is left as an exercise for the reader
LUCENE/SOLR
• Pluses
– Java, so easy to integrate into client software as well as web applications
– Stable
– Distributed Searching
– Powerful Query Language
– Extensible API
– Good Internationalization Support
• Minuses
– Java
– Simple configuration is a pain
WHEN TO USE WHAT
Questions?
THANK YOU!
