Traditional information retrieval systems use query
words to identify relevant documents. In difficult retrieval
tasks, however, one needs access to a wealth of
background knowledge. We present a method that uses
Wikipedia-based feature generation to improve retrieval
performance. Intuitively, we expect that using extensive
world knowledge is likely to improve recall but may
adversely affect precision. High-quality feature selection
is necessary to maintain high precision, but here we lack
the labeled training data for evaluating features that is
available in supervised learning.
Most of the common techniques in text mining are
based on the statistical analysis of a term, either a word
or a phrase. Statistical analysis of term frequency captures
the importance of a term within a document only.
However, two terms can have the same frequency in their
documents while one term contributes more to the
meaning of its sentences than the other. Thus, the
underlying text mining model should single out terms that
capture the semantics of the text. In this case, the mining
model can capture the terms that present the concepts of a
sentence, which leads to discovering the topic of the
document. A new concept-based mining model that
analyzes terms on the sentence, document, and corpus
levels is introduced. The concept-based mining model can
effectively discriminate between terms that are unimportant
to the sentence semantics and terms that hold the
concepts representing the sentence meaning.
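To make the limitation concrete, the following minimal sketch (class and method names are our own illustrations, not part of the cited model) shows how two terms with identical document-level frequency can differ in how often they appear inside a sentence's verb-argument structures, which is the intuition behind a conceptual term frequency:

```java
import java.util.*;

// Toy illustration: plain term frequency vs. a conceptual term frequency.
// All names here are illustrative assumptions, not the model's own API.
public class TfVsCtf {

    // Count occurrences of a term in the whole document (classic tf).
    static long termFrequency(String document, String term) {
        return Arrays.stream(document.toLowerCase().split("\\W+"))
                     .filter(w -> w.equals(term))
                     .count();
    }

    // Conceptual term frequency: count the term only inside spans that a
    // (hypothetical) semantic parser marked as verb-argument structures.
    static long conceptualTermFrequency(List<String> verbArgumentSpans, String term) {
        return verbArgumentSpans.stream()
                .flatMap(span -> Arrays.stream(span.toLowerCase().split("\\W+")))
                .filter(w -> w.equals(term))
                .count();
    }

    public static void main(String[] args) {
        String doc = "The engine parses text. The engine labels sentences. "
                   + "Meanwhile, noise words like meanwhile recur. Noise appears again.";
        // Pretend a semantic role labeler returned these argument spans.
        List<String> spans = List.of("engine parses text", "engine labels sentences");

        // Both terms occur twice in the document...
        System.out.println("tf(engine)  = " + termFrequency(doc, "engine"));
        System.out.println("tf(noise)   = " + termFrequency(doc, "noise"));
        // ...but only "engine" participates in the verb-argument structures.
        System.out.println("ctf(engine) = " + conceptualTermFrequency(spans, "engine"));
        System.out.println("ctf(noise)  = " + conceptualTermFrequency(spans, "noise"));
    }
}
```

Here both "engine" and "noise" have tf = 2, but only "engine" contributes to the verb-argument structures, so its conceptual frequency is higher.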


An efficient concept-based text mining technique

that performs mining on a document repository.

Existing methods that are used for text clustering

include decision trees, conceptual clustering, clustering
based on data summarization, statistical analysis, neural
nets, inductive logic programming, and rule-based
systems. Usually, in the text mining techniques
mentioned above, the term frequency of a term (word or
phrase) is computed to explore the importance of the
term in the document. However, two terms can have the
same frequency in their documents while one term
contributes more to the meaning of its sentences than
the other.
It is important to note that extracting the
relations between verbs and their arguments in the same
sentence makes it possible to analyze terms within a
sentence, which is where these existing solutions are limited.
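The shape of such verb-argument extraction can be sketched as follows. This is a naive stand-in only: real concept extraction relies on a semantic role labeler, which we fake here with a tiny hand-made verb list; all names are illustrative assumptions.

```java
import java.util.*;

// Naive illustration only: real concept extraction uses a semantic role
// labeler. Here a tiny hard-coded verb list fakes one, just to show the
// shape of the output (one verb-argument span per sentence).
public class VerbArgumentToy {

    static final Set<String> VERBS = Set.of("parses", "labels", "clusters");

    // Return, for each sentence, the verb plus its neighbouring words,
    // or an empty string when no known verb occurs in the sentence.
    static List<String> extractSpans(String text) {
        List<String> spans = new ArrayList<>();
        for (String sentence : text.split("\\.")) {
            String[] words = sentence.trim().toLowerCase().split("\\W+");
            String span = "";
            for (int i = 0; i < words.length; i++) {
                if (VERBS.contains(words[i])) {
                    int lo = Math.max(0, i - 1), hi = Math.min(words.length, i + 2);
                    span = String.join(" ", Arrays.copyOfRange(words, lo, hi));
                    break;
                }
            }
            spans.add(span);
        }
        return spans;
    }

    public static void main(String[] args) {
        // First sentence yields a span; the second has no known verb.
        System.out.println(extractSpans("The engine parses text. Weather was nice."));
    }
}
```

A sentence without any verb-argument structure contributes no concepts, which is exactly how the model separates semantically important terms from the rest.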


A text clustering approach has been proposed that

performs sentence-based, document-based, and
corpus-based concept analysis and applies a
concept-based similarity measure.

 A four-staged concept-based mining model should be
implemented.
 This research should aim to prove that the proposed
concept-based mining model improves text
clustering quality.
 By exploiting the semantic structure of the sentences
in documents, a better text clustering result should
be achieved.

1. Design and implement a new conceptual term
frequency measurement model.
2. Perform sentence-based concept analysis, which
helps to analyze the semantic structure of each
sentence.
3. Associate the semantic structure of the sentence
captured in step 2 with the conceptual term
frequency measure to capture the sentence concepts.
4. Perform document-based concept analysis that
analyzes each concept at the document level using
the concept-based term frequency.
5. Perform analysis of concepts at the corpus level
using the document frequency global measure.
6. Apply a concept-based similarity measure that
allows measuring the importance of each concept with
respect to the semantics of the sentence, the topic of
the document, and the discrimination among
documents in a corpus.
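The six steps above can be sketched as one pipeline. The weighting (conceptual frequency damped by corpus-wide document frequency, tf-idf style) and the cosine similarity below are plausible choices of ours, not necessarily the original model's; concept extraction itself is assumed to have already produced, per document, the list of concepts found in each sentence.

```java
import java.util.*;

// Sketch of the six steps as one pipeline. Weighting and similarity are
// our own plausible choices; the semantic parser is assumed to have run.
public class ConceptPipeline {

    // Steps 1-3: conceptual term frequency of a concept in a document:
    // in how many sentences' verb-argument structures it appears.
    static Map<String, Integer> ctf(List<List<String>> sentenceConcepts) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> sentence : sentenceConcepts)
            for (String c : new HashSet<>(sentence))
                counts.merge(c, 1, Integer::sum);
        return counts;
    }

    // Step 5: document frequency of each concept across the corpus.
    static Map<String, Integer> df(List<Map<String, Integer>> docCtfs) {
        Map<String, Integer> counts = new HashMap<>();
        for (Map<String, Integer> doc : docCtfs)
            for (String c : doc.keySet())
                counts.merge(c, 1, Integer::sum);
        return counts;
    }

    // Steps 4+6: weight each concept by ctf, damp by corpus-wide df
    // (tf-idf style), then compare two documents by cosine similarity.
    static double similarity(Map<String, Integer> a, Map<String, Integer> b,
                             Map<String, Integer> df, int corpusSize) {
        Set<String> all = new HashSet<>(a.keySet());
        all.addAll(b.keySet());
        double dot = 0, na = 0, nb = 0;
        for (String c : all) {
            double idf = Math.log((double) corpusSize / df.getOrDefault(c, 1));
            double wa = a.getOrDefault(c, 0) * idf;
            double wb = b.getOrDefault(c, 0) * idf;
            dot += wa * wb; na += wa * wa; nb += wb * wb;
        }
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    public static void main(String[] args) {
        // Three toy documents: each a list of sentences, each sentence a
        // list of concepts the (assumed) semantic parser extracted.
        List<List<String>> d1 = List.of(List.of("cluster text"),
                                        List.of("cluster text", "measure quality"));
        List<List<String>> d2 = List.of(List.of("cluster text"), List.of("rank page"));
        List<List<String>> d3 = List.of(List.of("rank page"), List.of("crawl web"));

        List<Map<String, Integer>> corpus = List.of(ctf(d1), ctf(d2), ctf(d3));
        Map<String, Integer> dfs = df(corpus);

        // d1 and d2 share a concept; d1 and d3 share none, so sim is 0.
        System.out.printf("sim(d1,d2) = %.3f%n", similarity(corpus.get(0), corpus.get(1), dfs, 3));
        System.out.printf("sim(d1,d3) = %.3f%n", similarity(corpus.get(0), corpus.get(2), dfs, 3));
    }
}
```

Documents sharing sentence-level concepts score high under this measure, which is what a downstream clustering algorithm (e.g. k-means over these similarities) exploits.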
Hardware Requirements
 CPU: Intel Core 2 Duo, 2 GB RAM, 80 GB HDD
 OS: Any OS

Software Requirements
 OS: Any OS with the JRE (Java Runtime Environment) installed
 Language: Java SE
 IDE: NetBeans 6.5
 Build Tool: Apache Ant
 Data Mining Tools: WEKA, RapidMiner