NLPTools

Published by: welly_3 on Jun 18, 2012
Copyright:Attribution Non-commercial

6/16/2011
Open Source Text Mining Tools and Libraries
Companion to the PASED 2011 tutorial on “Information Retrieval Methods for Software Engineering” created by Sonia Haiduc
Lucene
http://lucene.apache.org
High-performance, full-featured text search engine library
Written entirely in Java
Based on the Vector Space Model and the Boolean Model in IR
Comes with a set of basic applications, which can be used as-is or modified by users
Contains functionality for:
 – Document processing: stop-word removal, stemming, tokenization, etc.
 – Indexing
 – Searching
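The three stages listed above (document processing, indexing, searching) can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Lucene's API: the stop-word list, the suffix-stripping `naive_stem` helper, and the sample documents are all assumptions made for the example, and the search shown is a simple Boolean AND over an inverted index.

```python
import re
from collections import defaultdict

# Hypothetical stop-word list for illustration; real libraries ship their own.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def naive_stem(token):
    """Strip a few common English suffixes (a crude stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def process(text):
    """Document processing: tokenize, drop stop words, then stem."""
    return [naive_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


def build_index(docs):
    """Indexing: map each processed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in process(text):
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Searching: return ids of documents containing every query term (Boolean AND)."""
    terms = process(query)
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result


# Hypothetical sample collection.
docs = {
    1: "Editing the buffer",
    2: "The buffer is modified",
    3: "Searching an index of documents",
}
index = build_index(docs)
print(search_all(index, "buffer edit"))
```

The query is run through the same processing as the documents, which is why "edit" in the query matches "Editing" in document 1.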
 
Searching in Lucene
http://lucene.apache.org
Supports the indexing and searching of several fields for each document (e.g., “title”, “contents”, etc.)
Accepts several types of queries:
 – Term query (e.g., buffer edit)
 – Phrase query (e.g., “buffer edit”)
 – Boolean query (e.g., buffer AND edit OR modify)
 – Wildcard query (e.g., te?t, test*, te*t)
 – Range query (e.g., date:[20020101 TO 20030101])
 – Fuzzy query: uses the Levenshtein distance between strings (e.g., roam~ searches for terms similar to “roam”, such as “foam”)
 – Proximity query: finds terms within a specific distance of each other (e.g., “jakarta apache”~10 searches for “apache” and “jakarta” within 10 terms of each other in a document)
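To make the wildcard, fuzzy, and proximity queries more concrete, here is a minimal Python sketch of the matching logic behind each. It does not use or reproduce Lucene's implementation: the function names, the `max_edits` threshold, and the sample vocabulary are assumptions for illustration, and the proximity check is a simplified reading of "within N terms" based on absolute token positions.

```python
from fnmatch import fnmatchcase


def levenshtein(a, b):
    """Edit distance: minimum number of insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_matches(term, vocabulary, max_edits=1):
    """Fuzzy query: vocabulary terms within max_edits of the query term."""
    return [w for w in vocabulary if levenshtein(term, w) <= max_edits]


def wildcard_matches(pattern, vocabulary):
    """Wildcard query: '?' matches one character, '*' matches any run of characters."""
    return [w for w in vocabulary if fnmatchcase(w, pattern)]


def within_proximity(tokens, term_a, term_b, max_distance):
    """Proximity query: do the two terms occur within max_distance positions?"""
    pos_a = [i for i, t in enumerate(tokens) if t == term_a]
    pos_b = [i for i, t in enumerate(tokens) if t == term_b]
    return any(abs(i - j) <= max_distance for i in pos_a for j in pos_b)


# Hypothetical vocabulary and document text.
vocabulary = ["roam", "foam", "roams", "room", "dream"]
print(fuzzy_matches("roam", vocabulary))
print(wildcard_matches("te?t", ["test", "text", "tests", "tet"]))

tokens = "the apache software foundation hosts the jakarta project".split()
print(within_proximity(tokens, "jakarta", "apache", 10))
```

In an index, the fuzzy and wildcard checks would be applied to the term dictionary rather than to every document, which is what makes them affordable on large collections.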
 
Other Lucene Implementations
http://lucene.apache.org
There are also implementations of Lucene in many other programming languages:
 – CLucene: implementation in C++
 – Lucene.Net: implementation in .NET
 – Lucene4c: implementation in C
 – Zend Search: implementation in the Zend Framework for PHP 5
 – Plucene and KinoSearch: implementations in Perl
 – PyLucene: GCJ-compiled version of Java Lucene integrated with Python
 – MUTIS: implementation in Delphi
 – Ferret: implementation in Ruby
 – Montezuma: implementation in Common Lisp
 
jLSI
http://tcc.itc.it/research/textec/tools-resources/jlsi.html
An open source Java tool for Latent Semantic Indexing
Requires the following linguistic processing to be performed before its usage:
 – Tokenization
 – Sentence splitting
 – Part-of-speech tagging (optional)
 – Lemmatization (optional)
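For readers unfamiliar with Latent Semantic Indexing, the standard textbook formulation is summarized below. This is the general technique, not a description of jLSI's internals; the symbols (A, k, q, d_j) are introduced here for illustration only.

```latex
% A: term-by-document matrix (m terms, n documents), built from the processed text.
% k: number of latent dimensions retained, with k much smaller than min(m, n).
A \approx A_k = U_k \Sigma_k V_k^{\top},
\qquad U_k \in \mathbb{R}^{m \times k},\
\Sigma_k \in \mathbb{R}^{k \times k},\
V_k \in \mathbb{R}^{n \times k}

% Fold a query vector q (expressed over the m terms) into the latent space.
\hat{q} = \Sigma_k^{-1} U_k^{\top} q

% Rank document j, represented by row j of V_k (written d_j), by cosine similarity.
\operatorname{sim}(\hat{q}, d_j) = \frac{\hat{q} \cdot d_j}{\lVert \hat{q} \rVert \, \lVert d_j \rVert}
```

Because documents and queries are compared in the reduced space rather than term by term, a document can rank highly without sharing any literal keyword with the query.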
 
The Semantic Engine
http://knowledgesearch.org/
A C++ library that implements IR indexing and retrieval
Uses mathematical algorithms based on graph theory to index the latent semantic content of documents
A semantic graph of a text collection is created, which can be used to find relevant documents that may not contain any keyword matches
Document processing functionality: tokenization, POS tagging, stemming, stopword removal
