
Published by: Brijeshkumar Kakdiya on Aug 14, 2012
Copyright:Attribution Non-commercial








08IT007 08IT008

AIM:- To study research papers on advanced topics in data mining, and to prepare and present a report on them.

TOPIC:- Mining Text Databases

Text mining, also known as text data mining or knowledge discovery from textual databases, refers to the process of extracting interesting and non-trivial patterns or knowledge from text documents.

• Motivation for Text Mining:-

Approximately 90% of the world’s data is held in unstructured formats. Information-intensive business processes demand that we move beyond simple document retrieval to “knowledge” discovery.

• What is Text Mining?

Text mining is a multidisciplinary field, involving information retrieval, text analysis, information extraction, clustering, categorization, visualization, database technology, machine learning, and data mining.

• A text mining framework.

Text refining converts unstructured text documents into an intermediate form (IF). An IF can be document-based or concept-based. Knowledge distillation from a document-based IF deduces patterns or knowledge across documents. A document-based IF can be projected onto a concept-based IF by extracting object information relevant to a domain. Knowledge distillation from a concept-based IF deduces patterns or knowledge across objects or concepts.
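The two intermediate forms described above can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the function names, the use of term-frequency counts as the document-based IF, and the small hand-written lexicon that stands in for domain-specific object extraction. It is not the framework's prescribed implementation.

```python
from collections import Counter

def refine_to_document_if(docs):
    """Text refining (assumed form): map each document id to its term counts."""
    return {doc_id: Counter(text.lower().split()) for doc_id, text in docs.items()}

def project_to_concept_if(document_if, lexicon):
    """Project a document-based IF onto concepts: concept -> ids of documents mentioning it.

    `lexicon` is a hypothetical term -> concept mapping for one domain.
    """
    concept_if = {}
    for doc_id, term_counts in document_if.items():
        for term in term_counts:
            concept = lexicon.get(term)
            if concept is not None:
                concept_if.setdefault(concept, set()).add(doc_id)
    return concept_if

# Toy documents and lexicon, invented for this example only.
docs = {
    "d1": "Acme acquires Widget Corp",
    "d2": "Widget Corp reports profit",
}
lexicon = {"acme": "company:Acme", "widget": "company:Widget Corp"}

document_if = refine_to_document_if(docs)
concept_if = project_to_concept_if(document_if, lexicon)
```

Knowledge distillation would then operate on `document_if` (patterns across documents) or on `concept_if` (patterns across objects or concepts).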

7th IT-1


DWDM(CASE STUDY) MINING TEXT DATA PRACTICAL-12
CITC.CHANGA

• Text data analysis and information retrieval

Information retrieval (IR) is a field that has been developed in parallel with database systems for many years. Unlike database systems, which have focused on query and transaction processing of structured data, information retrieval has focused on the organization and retrieval of information in a large number of text-based documents. A typical information retrieval problem is to locate relevant documents based on user input, such as keywords or example documents. Typical information retrieval systems include online library catalog systems and online document management systems.

• Problems with information retrieval

Since information retrieval and database systems handle different kinds of data, some database system problems are usually not present in information retrieval systems, such as concurrency control, recovery, transaction management, and update. There are also some common information retrieval problems that are usually not encountered in traditional database systems, such as unstructured documents.

• Basic measures for text retrieval

There are two basic measures for content-based text retrieval. One is precision, which is the percentage of retrieved documents that are in fact correct (i.e., relevant to the query). The other is recall, which is the percentage of documents that should be retrieved (i.e., that are in the database and are relevant to the query) and were in fact retrieved.

• How do such keyword-based and similarity-based information retrieval systems work?

A text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed “irrelevant”. For example, a, the, of, for, with, and so on are stop words even though they may appear frequently. A group of words with minor syntactic differences may share the same word stem. A text retrieval system needs to identify groups of words that are small syntactic variants of each other and collect only their common word stem. For example, the words drug, drugged, and drugs share a common word stem, drug, and one may view them as different appearances of the same word.
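The stop list, word stemming, and the precision and recall measures described in this section can be sketched in a few lines. This is only an illustration under stated assumptions: the stop list holds just the five example words from the text, `stem` is a deliberately naive suffix stripper written for the drug / drugged / drugs example (not a real stemming algorithm), and all function names are hypothetical.

```python
# Stop list limited to the example words given in the text.
STOP_LIST = {"a", "the", "of", "for", "with"}

def stem(word):
    """Naive suffix stripping, enough for drug / drugged / drugs (assumption)."""
    if word.endswith("ed") and len(word) > 4:
        word = word[:-2]
        # Collapse the doubled consonant left behind, e.g. "drugg" -> "drug".
        if word[-1] == word[-2] and word[-1] not in "aeiou":
            word = word[:-1]
    elif word.endswith("s") and len(word) > 3:
        word = word[:-1]
    return word

def index_terms(text):
    """Drop stop words, then reduce each remaining word to its stem."""
    return [stem(w) for w in text.lower().split() if w not in STOP_LIST]

def precision_recall(retrieved, relevant):
    """Precision = hits / retrieved, recall = hits / relevant (as fractions)."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

terms = index_terms("the drugs of a drugged patient")
# terms == ['drug', 'drug', 'patient']

# Four documents retrieved, three relevant in the collection, two in common.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d5"])
# precision == 0.5, recall == 2/3
```

A production system would normally use an established stemmer and a much longer stop list; the point here is only how the two steps shrink the indexed vocabulary and how the two measures differ.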

• Latent Semantic Indexing

The latent semantic indexing method uses a singular value decomposition (SVD) technique to reduce the size of the term frequency table and retain the K most significant rows of the frequency table, where K is usually taken to be around a few hundred (e.g., 200) for large document collections. Notice that such a reduction, which takes a D*T matrix as input and represents it as a much smaller K*K matrix, leads to some information loss. We must ensure that only the least significant parts of the frequency table are lost.

• Other Text Retrieval Indexing Techniques

There are also several other popularly adopted text retrieval indexing techniques, including inverted indices and signature files.

• Inverted indices

An inverted index is an index structure widely used in industry for indexing text documents. It maintains two hash-indexed or B+-tree-indexed tables: a document table and a term table. The former (document table) consists of a set of document records, each containing two fields: doc id and posting list, where the posting list is a list of terms (or pointers to terms) that occur in the document, sorted according to some relevance measure. The latter (term table) consists of a set of term records, each containing two fields: term id and posting list, where the posting list specifies a list of document identifiers in which the term appears. With such an organization, it is easy to answer queries like "find all the documents associated with a set of terms" or "find all the terms associated with a set of documents".
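The document table and term table of an inverted index, and the two query types mentioned above, can be sketched as follows. This is a minimal in-memory illustration, assuming Python dictionaries in place of hash-indexed or B+-tree-indexed tables, alphabetical order in place of a relevance measure, and "associated with a set of terms" read as "contains all of the terms". The function names are hypothetical.

```python
from collections import defaultdict

def build_tables(docs):
    """Build the document table (doc id -> terms) and term table (term -> doc ids)."""
    document_table = {}
    term_table = defaultdict(set)
    for doc_id, text in docs.items():
        # Alphabetical order stands in for "sorted by some relevance measure".
        terms = sorted(set(text.lower().split()))
        document_table[doc_id] = terms
        for term in terms:
            term_table[term].add(doc_id)
    return document_table, term_table

def docs_with_all_terms(term_table, terms):
    """Find all documents containing every term, by intersecting posting lists."""
    postings = [term_table.get(t, set()) for t in terms]
    return set.intersection(*postings) if postings else set()

def terms_in_docs(document_table, doc_ids):
    """Find all terms that occur in any of the given documents."""
    found = set()
    for doc_id in doc_ids:
        found.update(document_table.get(doc_id, []))
    return found

# Toy collection, invented for this example only.
docs = {
    "d1": "text mining finds patterns",
    "d2": "data mining finds rules",
    "d3": "text retrieval",
}
document_table, term_table = build_tables(docs)

both = docs_with_all_terms(term_table, ["text", "mining"])   # {'d1'}
d3_terms = terms_in_docs(document_table, ["d3"])             # {'text', 'retrieval'}
```

Each query touches only the posting lists it needs, which is why this organization answers both question types without scanning the documents themselves.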

• Signature Files

A signature file is a file which stores a signature record for each document in the database. Each signature has a fixed size of b bits. A simple encoding scheme goes as follows. Every bit of a document's signature is initialized to 0. A bit is set if the corresponding term appears in the document. A signature S1 matches another signature S2 if each bit set in signature S2 is also set in S1.

• Keyword-based association analysis

Text data consists of structured, semi-structured, or unstructured text. In text mining at the word level, the association generation process detects either compounds, i.e., domain-dependent terms such as [wall, street] or [treasury, secretary, james, baker], or uninterpretable associations such as [dollars, shares, exchange, total, commission, stake, securities].

• Questions-Answers

1. How can we prevent losing data? To get useful results, one needs to create a warehouse first, before mining the converted database. This warehouse is essentially a relational database that holds the essential data from the text data.

2. What is the possibility of multilingual text refining? Whereas data mining is largely language independent, text mining involves a significant language component. It is essential to develop text refining algorithms that process multilingual text documents and produce language-independent intermediate forms. While most text mining tools focus on processing English documents, mining from documents in other languages allows access to previously untapped information and offers a host of new opportunities.

• Conclusion

1. Term-level text mining (including term extraction) attempts to benefit from the advantages of two extremes.

2. On the one hand, there is no need for human effort in tagging documents, and we do not lose most of the information present in the document, as happens in the tagged-documents approach.

3. On the other hand, the number of meaningless results is greatly reduced, and the execution time of the mining algorithms is also reduced.
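The signature file encoding and matching rule from the Signature Files section can be sketched as below. The report does not say how a term is assigned its "corresponding" bit, so hashing the term with CRC-32 modulo b is an assumption made for this example, as are the width b = 64 and the function names.

```python
import zlib

B = 64  # signature width b in bits (assumed value)

def bit_for(term):
    """Map a term to a bit position; the hash choice is an assumption."""
    return zlib.crc32(term.encode("utf-8")) % B

def signature(terms):
    """Start from all bits 0 and set the bit of every term that appears."""
    sig = 0
    for term in terms:
        sig |= 1 << bit_for(term)
    return sig

def matches(s1, s2):
    """S1 matches S2 if each bit set in S2 is also set in S1."""
    return (s1 & s2) == s2

doc_sig = signature(["text", "mining", "database"])
query_sig = signature(["text", "mining"])
found = matches(doc_sig, query_sig)  # True: every query bit is set in the document
```

Because several terms can map to the same bit, a match only says that the document may contain the query terms; under this hashing assumption a candidate document still has to be checked against its actual text.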

