www.ocwsearch.com

MongoDB Full Text Search with Sphinx
Pierre Far, PhD
Twitter: @ocwsearch
Web: www.ocwsearch.com
Email: pierre@ocwsearch.com
About

A search engine for the full text of OpenCourseWare course materials.
2600+ courses, 10 universities, 11 OCW collections
Courses in English, Japanese, Spanish, Dutch
Why MongoDB?

Very helpful community

Document DB

Schemaless
Technology Stack

[Diagram. Labels: Website (HTML), API (JSON); Query; Index; Adaptor Scripts: mongos3, xmlpipe2; Amazon S3]
xmlpipe2

An XML document input format for Sphinx

Any XML source works, so...

Read courses from MongoDB and stream them as XML

sphinxsearch.com/wiki/doku.php?id=sphinx_xmlpipe2_tutorial
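To make the "stream as XML" step concrete, here is a minimal sketch of an adaptor that emits an xmlpipe2 docset. It is written in Python for illustration and is not the deck's actual script. The function name, the field names (`title`, `body`), and the `sphinx_id` key are assumptions, and an in-memory list stands in for a MongoDB cursor.

```python
from xml.sax.saxutils import escape, quoteattr


def courses_to_xmlpipe2(courses, fields=("title", "body")):
    """Yield an xmlpipe2 stream chunk by chunk for an iterable of course dicts.

    `courses` is assumed to be any iterable of dicts (for example a MongoDB
    cursor); each dict is assumed to carry a numeric `sphinx_id`.
    """
    yield '<?xml version="1.0" encoding="utf-8"?>\n'
    yield "<sphinx:docset>\n"
    # Declare the full-text fields once, before any document.
    yield "<sphinx:schema>\n"
    for name in fields:
        yield f"  <sphinx:field name={quoteattr(name)}/>\n"
    yield "</sphinx:schema>\n"
    for course in courses:
        yield f"<sphinx:document id={quoteattr(str(course['sphinx_id']))}>\n"
        for name in fields:
            # Escape &, < and > so course text cannot break the XML.
            yield f"  <{name}>{escape(str(course.get(name, '')))}</{name}>\n"
        yield "</sphinx:document>\n"
    yield "</sphinx:docset>\n"


# Hypothetical sample data standing in for documents read from MongoDB.
sample = [{"sphinx_id": 1234567890,
           "title": "Algorithms & Data",
           "body": "Lecture notes <draft>"}]
xml_stream = "".join(courses_to_xmlpipe2(sample))
```

In a real adaptor each chunk would be written to standard output as it is produced, so that the command named in `xmlpipe_command` in sphinx.conf can feed the indexer without holding every course in memory.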
Pitfall 1: Document ID

ALL DOCUMENT IDS MUST BE UNIQUE, UNSIGNED, NON-ZERO INTEGER NUMBERS

Generate a unique 10-digit numeric ID for each course.

The ID must be deterministic.
Put a unique index on the ID field.
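One possible way to get a deterministic 10-digit ID is to hash a stable key for each course. The sketch below is in Python and is an assumption, not the method used on the site: the function name and the choice of a course URL as the key are hypothetical. Hash collisions are unlikely but possible, which is one reason the unique index on the ID field matters.

```python
import hashlib


def course_doc_id(stable_key: str) -> int:
    """Map a stable course key to a deterministic 10-digit positive integer.

    `stable_key` is assumed to be something that never changes for a course,
    such as its canonical URL.
    """
    digest = hashlib.sha1(stable_key.encode("utf-8")).digest()
    n = int.from_bytes(digest[:8], "big")
    # Fold into the range 1,000,000,000 .. 9,999,999,999 (always 10 digits).
    return 1_000_000_000 + n % 9_000_000_000


# Hypothetical key; the same input always yields the same ID.
doc_id = course_doc_id("http://example.edu/courses/intro-to-algorithms")
```

Note that values above 4,294,967,295 do not fit in a 32-bit unsigned integer, so whether every 10-digit ID is usable depends on the ID width of the Sphinx build.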
Pitfall 2: UTF-8

Fatal error: Uncaught exception 'MongoException' with message 'non-utf8 string

Encoding: it's a lie.

mb_detect_encoding() is unreliable.

2-part solution
1. $HTML = @mb_convert_encoding($HTML, 'HTML-ENTITIES', 'utf-8');
2. $Text = FixEncoding($Text);
FixEncoding()

A set of real encoding detection functions


http://lachy.id.au/dev/2005/11/encoding-functions-source

FixEncoding() is a wrapper for these functions
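The general idea behind such a wrapper can be sketched as "validate as UTF-8 first, then fall back". The Python below is only an illustration of that idea; it is not the FixEncoding() source, and the function name and the fallback list (`cp1252`, then `latin-1`) are assumptions.

```python
def fix_encoding(raw: bytes, fallbacks=("cp1252", "latin-1")) -> str:
    """Return text decoded from `raw`, trusting strict UTF-8 validation
    over whatever encoding the page declares."""
    try:
        # Strict decoding fails on any byte sequence that is not valid UTF-8.
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        pass
    for encoding in fallbacks:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Only reached if every fallback rejects the bytes.
    return raw.decode("utf-8", errors="replace")


from_utf8 = fix_encoding("café".encode("utf-8"))
from_cp1252 = fix_encoding("café".encode("cp1252"))
```

Re-encoding the returned string as UTF-8 then always produces bytes that are safe to store.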


UTF-8 in Sphinx

In sphinx.conf:
charset_type = utf-8
ngram_chars
charset_table

sphinxsearch.com/wiki/doku.php?id=charset_tables
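As a rough illustration of how these three settings fit together, an index block might look like the fragment below. The index name is hypothetical, the character ranges are generic examples rather than the site's actual values, and the required `source` and `path` settings are left out. `ngram_len` is included because `ngram_chars` only takes effect alongside it; this matters for the Japanese courses, which have no spaces between words.

```
index courses
{
    # source, path and other required settings omitted

    charset_type  = utf-8

    # Characters treated as part of a word, with case folding
    charset_table = 0..9, A..Z->a..z, _, a..z

    # Index CJK text as single-character n-grams
    ngram_len     = 1
    ngram_chars   = U+3000..U+2FA1F
}
```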
mongos3

MongoDB document = S3 object

Backup tool for MongoDB

$Contents = gzencode(json_encode($Course), 9);
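The same "one document, one compressed JSON object" idea can be sketched in Python as follows. This is an illustration, not the mongos3 code: the function names and the sample course are hypothetical, the S3 upload itself is left out, and `default=str` is an assumed way to serialize values such as MongoDB ObjectIds that JSON cannot represent directly.

```python
import gzip
import json


def pack_course(course: dict) -> bytes:
    """Serialize one course document to gzip-compressed JSON bytes."""
    payload = json.dumps(course, default=str).encode("utf-8")
    # Level 9 is maximum compression, matching gzencode(..., 9).
    return gzip.compress(payload, compresslevel=9)


def unpack_course(blob: bytes) -> dict:
    """Restore a course document from bytes produced by pack_course."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))


# Hypothetical course document.
course = {"sphinx_id": 1234567890, "title": "Álgebra lineal", "language": "es"}
blob = pack_course(course)
restored = unpack_course(blob)
```

Each `blob` would become the body of one S3 object, so restoring a backup means unpacking objects and inserting them back into MongoDB.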

Thanks!
Any questions?
Twitter: @ocwsearch
Web: www.ocwsearch.com
Email: pierre@ocwsearch.com