www.ocwsearch.com

MongoDB Full Text Search with Sphinx
Pierre Far, PhD
Twitter: @ocwsearch
Web: www.ocwsearch.com
Email: pierre@ocwsearch.com

About

www.ocwsearch.com

• A search engine of the full text of OpenCourseWare course materials.
– 2,600+ courses, 10 universities, 11 OCW collections
– Courses in English, Japanese, Spanish, Dutch

Why MongoDB?
• Very helpful community
• Document DB
• Schemaless

Technology Stack

[Architecture diagram. Labels: Website (HTML), API (JSON), Query, Index, Amazon S3, mongos3, xmlpipe2, Adaptor Scripts]

xmlpipe2
• An XML document input format for Sphinx
– Any XML source so...

• Read courses from MongoDB and stream as XML
• sphinxsearch.com/wiki/doku.php?id=sphinx_xmlpipe2_tutorial
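A minimal sketch of the "stream as XML" step, written in Python rather than the PHP used elsewhere in these slides. It assumes each course is a dict that already carries a numeric ID; the key names sphinx_id, title, and body are hypothetical and do not come from the talk. The element names (sphinx:docset, sphinx:schema, sphinx:field, sphinx:document) are the ones the xmlpipe2 format defines.

```python
from xml.sax.saxutils import escape, quoteattr

# Hypothetical full-text fields; the real schema is not given in the slides.
FIELDS = ("title", "body")


def xmlpipe2_lines(courses, fields=FIELDS):
    """Yield the lines of an xmlpipe2 document stream for the given courses.

    `courses` is any iterable of dicts, for example a MongoDB cursor.
    """
    yield '<?xml version="1.0" encoding="utf-8"?>'
    yield "<sphinx:docset>"

    # Declare the full-text fields once, before any document.
    yield "<sphinx:schema>"
    for name in fields:
        yield f"<sphinx:field name={quoteattr(name)}/>"
    yield "</sphinx:schema>"

    for course in courses:
        doc_id = int(course["sphinx_id"])  # hypothetical key name
        if doc_id <= 0:
            raise ValueError("Sphinx document IDs must be non-zero and unsigned")
        yield f'<sphinx:document id="{doc_id}">'
        for name in fields:
            # Escape &, < and > so course text cannot break the XML.
            yield f"<{name}>{escape(str(course.get(name, '')))}</{name}>"
        yield "</sphinx:document>"

    yield "</sphinx:docset>"
```

An adaptor script would print each yielded line to standard output, and Sphinx would read that output through the command configured for the xmlpipe2 source.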

Pitfall 1: Document ID

“ALL DOCUMENT IDS MUST BE UNIQUE UNSIGNED NON-ZERO INTEGER NUMBERS”

• Generate a unique 10-digit numeric ID for each course.
– Must be deterministic
– Unique index on the field
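The slides do not say how the ID is generated. One possible approach, sketched here in Python, is to hash a stable key for each course into a fixed numeric range. The function name course_doc_id and the choice of key (for example the course URL) are assumptions. The range is chosen so that the result always has 10 digits and still fits in a 32-bit unsigned integer, which matters if Sphinx was built without 64-bit document IDs.

```python
import hashlib

# Smallest 10-digit number, and the largest value of a 32-bit unsigned integer.
LOW = 1_000_000_000
HIGH = 4_294_967_295


def course_doc_id(stable_key: str) -> int:
    """Map a stable course key to a deterministic 10-digit integer.

    The same key always gives the same ID, so re-indexing does not
    renumber documents. Two different keys can still collide.
    """
    digest = hashlib.sha1(stable_key.encode("utf-8")).digest()
    n = int.from_bytes(digest[:8], "big")
    return LOW + n % (HIGH - LOW + 1)
```

Hashing alone does not guarantee uniqueness. The unique index on the field is what would surface a collision, as a failed insert, so that it can be resolved.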

Pitfall 2: UTF-8

“Fatal error: Uncaught exception 'MongoException' with message 'non-utf8 string”

• Encoding: it’s a lie.
– mb_detect_encoding() is unreliable.

• 2-part solution
1. $HTML = @mb_convert_encoding($HTML, 'HTML-ENTITIES', 'utf-8');
2. $Text = FixEncoding($Text);

FixEncoding()
• A set of real encoding detection functions
http://lachy.id.au/dev/2005/11/encoding-functions-source

• FixEncoding() is a wrapper for these functions
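The body of FixEncoding() is not shown in the slides. The Python sketch below illustrates the general idea of detecting an encoding by validation instead of trusting a declared one: try strict UTF-8 first, then fall back to legacy encodings. It is not the author's function and not the linked library, and the fallback list is an assumption.

```python
def fix_encoding(raw: bytes, fallbacks=("cp1252", "latin-1")) -> str:
    """Decode bytes of unknown encoding into text.

    Strict UTF-8 decoding fails on most non-UTF-8 input, which makes it
    a more reliable test than trusting a declared charset.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # Assumed legacy encodings, tried in order.
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue

    # Last resort: keep what is valid and mark the rest as replacements.
    return raw.decode("utf-8", errors="replace")
```

Once decoded, the text can be re-encoded as UTF-8 before it is stored, which avoids the 'non-utf8 string' exception quoted above.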

UTF-8 in Sphinx
• In sphinx.conf:
– charset_type = utf-8
– ngram_chars
– charset_table

• sphinxsearch.com/wiki/doku.php?id=charset_tables
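A sketch of how these three settings might sit together in sphinx.conf. The source and index names, the adaptor command, and the index path are hypothetical. The charset_table and ngram_chars values are illustrative ones recalled from the Sphinx documentation, not the values used on the site; the wiki page above lists tables for other scripts. ngram_len = 1 is included because ngram_chars has no effect without it.

```
source courses_xml
{
    type            = xmlpipe2
    # Hypothetical adaptor script that prints the XML stream
    xmlpipe_command = php /path/to/adaptor.php
}

index courses
{
    source        = courses_xml
    path          = /path/to/index/courses

    charset_type  = utf-8
    # Illustrative table: digits, Latin and Cyrillic letters, folded to lowercase
    charset_table = 0..9, A..Z->a..z, _, a..z, U+410..U+42F->U+430..U+44F, U+430..U+44F

    # Index CJK text (for example the Japanese courses) as single-character n-grams
    ngram_len     = 1
    ngram_chars   = U+3000..U+2FA1F
}
```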

mongos3

MongoDB document = S3 object
• Backup tool for MongoDB
$Contents = gzencode(json_encode($Course), 9);
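The same serialization step can be sketched in Python: encode the document as JSON, then gzip it at the highest compression level. The function names are hypothetical, and the upload to S3 is left out because the slides do not show how mongos3 performs it. The sketch also assumes the document holds only JSON-serializable values; MongoDB-specific types such as ObjectId would need converting first.

```python
import gzip
import json


def pack_course(course: dict) -> bytes:
    """Serialize one document to gzip-compressed JSON (level 9)."""
    return gzip.compress(json.dumps(course).encode("utf-8"), compresslevel=9)


def unpack_course(blob: bytes) -> dict:
    """Reverse pack_course, for example when restoring from a backup."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))
```

Storing one compressed object per document means a single course can be restored without reading the whole backup.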

Thanks! Any questions?
Twitter: @ocwsearch
Web: www.ocwsearch.com
Email: pierre@ocwsearch.com