Open Source Search Engines (Modern Information Retrieval, Addison Wesley, 2010)
Introduction
There are many reasons to use an open source search engine in a Web site or in other IR applications inside a company:
- cost considerations
- commercial engines focus on larger sites
- specific needs that imply code customization
Introduction
Open source search engines might be classified by:
- programming language of implementation
- index data structure
- search capabilities: Boolean, fuzzy, stemming
- ranking function
- files they can index: HTML, PDF, Word, plain text
- online and incremental indexing
- maintenance activity and people needed
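Two of these criteria, the index data structure and Boolean search capability, can be illustrated with a minimal sketch. This is a toy Python example written for this text, not code from any of the engines discussed: a term-to-document inverted index with Boolean AND evaluation by set intersection.

```python
from collections import defaultdict

def build_index(docs):
    """Inverted index: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def boolean_and(index, terms):
    """Boolean AND query: documents that contain every query term."""
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

# Hypothetical three-document collection
docs = {1: "open source search engines",
        2: "commercial search engines",
        3: "open data"}
idx = build_index(docs)
print(sorted(boolean_and(idx, ["open", "search"])))  # [1]
```

Real engines store postings as compressed on-disk lists rather than in-memory sets, but the lookup-and-intersect structure is the same.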
Eliminate engines that depend on other engines or have a special purpose:
9. Managing Gigabytes (MG)
10. Nutch
11. XML Query Engine
12. Zebra
15 engines remain for consideration
10 engines selected for experimental comparison:
1. HtDig
2. Indri
3. Lucene
4. MG4J
5. Omega
6. OmniFind
7. SWISH-E
8. SWISH++
9. Terrier
10. Zettair
[Feature table for the 10 selected engines (HtDig, Indri, Lucene, MG4J, Omega, OmniFind, SWISH-E, SWISH++, Terrier, Zettair), with columns following the conventions on the next slide. Only fragments survived extraction: one column reads 1 for every engine, and another lists per-engine codes such as 1,2 or 1,2,3,4,5.]
10 Engines Selected
Conventions for the table on the previous slide:
(a) License: 1 Apache, 2 BSD, 3 CMU, 4 GPL, 5 IBM, 6 LGPL, 7 MPL, 8 Commercial, 9 Free
(b) Implementation language: 1 C, 2 C++, 3 Java, 4 Perl, 5 PHP, 6 Tcl
(c) Search features: 1 phrase, 2 Boolean, 3 wild card
(d) Result ordering: 1 ranking, 2 date, 3 none
(e) Indexable formats: 1 HTML, 2 plain text, 3 XML, 4 PDF, 5 PS
(f) Index storage: 1 file, 2 database
(g) Commercial version only
Each feature is additionally marked as available or not available.
Methodology
Comparison tasks for the 10 selected engines:
1. Obtain a document collection in HTML
2. Determine a tool to use for monitoring the performance of the search engines
3. Install and configure each of the search engines
4. Index each document collection
5. Process and analyze the indexing results
6. Perform a set of preselected search tasks
7. Process and analyze the search results
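Step 2 above, monitoring the engines, can be sketched generically. The harness below is an illustration written for this text, not the monitoring tool used in the study: it records elapsed wall-clock time with `time.perf_counter` and peak Python-level allocation with `tracemalloc` (a real study would monitor the engine process externally, covering native memory and I/O as well).

```python
import time
import tracemalloc

def measure(task, *args):
    """Run one task; return its result, elapsed seconds, and peak bytes allocated."""
    tracemalloc.start()
    t0 = time.perf_counter()
    result = task(*args)
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()  # (current, peak) in bytes
    tracemalloc.stop()
    return result, elapsed, peak

# Stand-in workload for an indexing or query run
result, elapsed, peak = measure(sum, range(100_000))
print(f"elapsed={elapsed:.6f}s peak={peak} bytes")
```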
Document Collections
Collections ranging from roughly 1 GB to 10 GB

3 TREC-4 subcollections:
- a first subcollection with 1,549 documents (750 MB)
- a second subcollection with 3,193 documents (1.6 GB)
- a third subcollection with 5,572 documents (2.7 GB)

4 WT10g subcollections:
- a first subcollection occupying 2.4 GB
- a second subcollection occupying 4.8 GB
- a third subcollection occupying 7.2 GB
- a fourth subcollection occupying 10.2 GB
Evaluation Tests
4 different evaluation tests:

Test A, Indexing: index each document collection with each search engine and record elapsed time and resource consumption
Test B, Incremental Indexing: time required to build incremental indexes
Test C, Search Performance: query processing time of the engines
Test D, Search Quality: precision-recall of the answers
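The quality measure reported later for Test D is precision at 5 (P@5): the fraction of the top 5 ranked results that are relevant. A small sketch with hypothetical document ids and relevance judgments:

```python
def precision_at_k(ranked_ids, relevant_ids, k=5):
    """P@k: fraction of the top-k ranked results that are relevant."""
    top = ranked_ids[:k]
    return sum(1 for d in top if d in relevant_ids) / k

# Hypothetical ranked answer list and relevance judgments
ranked = ["d3", "d7", "d1", "d9", "d4", "d2"]
relevant = {"d1", "d3", "d4"}
print(precision_at_k(ranked, relevant))  # 0.6  (3 of the top 5 are relevant)
```

P@5 rewards engines that put relevant documents at the very top, which matches how users scan Web results; full precision-recall curves look at the whole ranking.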
Test A Indexing
Indexing of the 3 TREC-4 Subcollections
[Bar chart: indexing time in minutes (log scale, 1 to 1000) for each of the 10 engines on the 750 MB, 1.6 GB, and 2.7 GB subcollections.]
[Table: resource-consumption figures per engine for the three subcollections, given as pairs of percentages (e.g. 99.4% and 20.0%); the engine and column labels were lost in extraction.]
Best (smallest) index sizes: Lucene, MG4J, Swish-E, Swish++, and Zettair, with indexes between 25% and 35% of the collection size
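Index size is reported throughout as a percentage of the raw collection size. A trivial helper, with hypothetical sizes, makes the metric explicit:

```python
def index_size_pct(index_size, collection_size):
    """Index size as a percentage of the raw collection size (same units)."""
    return 100.0 * index_size / collection_size

# Hypothetical example: a 0.9 GB index built over the 2.7 GB subcollection
print(round(index_size_pct(0.9, 2.7), 1))  # 33.3
```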
[Chart: indexing time (min) vs. search engine.]
Indri, MG4J, Terrier, and Zettair: only engines to finish, in linear time
Global Evaluation
Ranking of the engines by indexing time, index size, query processing time (for the 2.7 GB collection), and P@5 (for the WT10g collection)
Search Engine   Indexing Time (h:m:s)   Index Size (%)   Searching Time (ms)   Answer Quality P@5
HtDig           0:28:30 (6)             104 (8)          32 (4)                --
Indri           0:15:45 (3)              63 (7)          19 (1)                0.2851
Lucene          1:01:25 (8)              26 (1)          21 (2)                --
MG4J            0:12:00 (2)              60 (6)          22 (3)                0.2480
Swish-E         0:19:45 (4)              31 (3)          45 (6)                --
Swish++         0:22:15 (5)              29 (2)          51 (8)                --
Terrier         0:40:12 (7)              52 (5)          50 (7)                0.2800
Zettair         0:04:44 (1)              33 (4)          32 (4)                0.3240

(Numbers in parentheses are per-column ranks; P@5 is reported only for the four engines that indexed the whole WT10g collection.)
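The per-column ranks in the table follow standard competition ranking: tied values share a rank and the following rank is skipped (HtDig and Zettair tie at 32 ms and both get rank 4, so no engine is ranked 5). A small sketch, written for this text, reproduces the searching-time ranks:

```python
def competition_ranks(values, lower_is_better=True):
    """Standard competition ranking ('1224'): ties share a rank,
    and the next distinct value skips the tied positions."""
    order = sorted(values, reverse=not lower_is_better)
    return [order.index(v) + 1 for v in values]

# Searching times (ms) in table order: HtDig .. Zettair
search_ms = [32, 19, 21, 22, 45, 51, 50, 32]
print(competition_ranks(search_ms))  # [4, 1, 2, 3, 6, 8, 7, 4]
```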
- Indri, MG4J, Terrier, and Zettair: indexed the whole WT10g collection
- Zettair: fastest indexer, good search time, good precision-recall
Conclusions
Zettair is one of the most complete engines:
1. fast processing of large amounts of information, in considerably less time than the other engines
2. average precision-recall figures that were the highest among the engines (for the WT10g collection)
Lucene is the most competitive regarding memory use and search-time performance