CT075-3-2 DTM Topic 12 Text Data Mining

Data Management
CT075-3-2
Text Data Mining

Learning Outcomes
By the end of this lecture, you should be able to:

• Understand text mining concepts.
• Explain the concept of information retrieval and how it can be used in text mining.
• Explore one data mining technique, the vector-space model.
Key Terms you must be able to use
• If you have mastered this topic, you should be able to use the following terms correctly in your assignments and exams:
- Text mining
- Information retrieval
- Stemming
- Recall
- Linguistic Preprocessing

Introduction
• Text databases (document databases)
– Large collections of documents from various sources: news articles, research papers, books, digital libraries, e-mail messages, Web pages, library databases, etc.
– Data stored is usually semi-structured
– Traditional information retrieval techniques become inadequate for the increasingly vast amounts of text data
Information retrieval
A field developed in parallel with database systems.
There are three main types of retrieval within the knowledge discovery process framework:
• data retrieval, which concerns retrieval of structured data from DBMS and data warehouses.
• information retrieval, which concerns the organization and retrieval of information from large collections of semistructured or unstructured text-based databases and the WWW.
• knowledge retrieval, which concerns the generation of knowledge from (usually) structured data.
Information Retrieval
• Typical IR systems
– Online library catalogs
– Online document management systems
• Information retrieval vs. database systems
– Some DB problems are not present in IR, e.g., update,
transaction management, complex objects
– Some IR problems are not addressed well in DBMS,
e.g., unstructured documents, approximate search
using keywords and relevance
IR Basic definition
The core IR vocabulary includes the following four terms:
• database, which is defined as a collection of text documents
• document, which consists of a sequence of terms in a natural language that expresses ideas about some topic
• term, which is defined as a semantic unit, phrase, or word (or more precisely a root of the word)
• query, which is a request for documents that concern a particular topic that is of interest to the user of an IR system
Keyword-Based Retrieval
• A document is represented by a string, which can be identified by a set of keywords
• Queries may use expressions of keywords
– E.g., car and repair shop, tea or coffee.
– Queries and retrieval should consider synonyms, e.g., repair and maintenance
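
The keyword expressions above can be sketched in a few lines of Python. This is only a minimal illustration: the three documents and the helper names `matches_all` and `matches_any` are hypothetical, and plain substring matching stands in for a real keyword index.

```python
def matches_all(document: str, keywords: list[str]) -> bool:
    """AND query: every keyword must occur somewhere in the document."""
    text = document.lower()
    return all(keyword.lower() in text for keyword in keywords)


def matches_any(document: str, keywords: list[str]) -> bool:
    """OR query: at least one keyword must occur in the document."""
    text = document.lower()
    return any(keyword.lower() in text for keyword in keywords)


# Hypothetical mini-collection, used only for illustration.
docs = {
    "d1": "Car repair shop open on weekends",
    "d2": "Tea and coffee served all day",
    "d3": "Car maintenance tips",
}

# "car and repair shop"
and_hits = [name for name, text in docs.items()
            if matches_all(text, ["car", "repair shop"])]

# "tea or coffee"
or_hits = [name for name, text in docs.items()
           if matches_any(text, ["tea", "coffee"])]
```

Note that d3 is not returned for "car and repair shop" even though maintenance is a synonym of repair, which is why queries and retrieval should take synonyms into account.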
Difficulties in IR
• Major difficulties of the model
– Synonymy: A keyword T does not appear anywhere in the document, even though the document is closely related to T, e.g., T = “data mining” and a database article that never uses that phrase.
– Polysemy: The same keyword may mean different things in different contexts.
Example: “make” can be associated with “make a mistake,” “make of a car,” “make up excuses,” and “make-believe.”
Architecture of Information Retrieval Systems
Linguistic Preprocessing
Term extraction usually includes two main operations:
1) Removal of stop words
• Determiners: articles (a, an, the) and possessive determiners (her, his, its, my, our, their, your)
• Conjunctions: coordinating conjunctions (for, and, nor, but, or, yet, so) and correlative conjunctions (both … and, either … or, not (only) … but (… also))
• Prepositions (on, beneath, over, of, during, beside, etc.)
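
Stop-word removal can be sketched as follows. The stop list below contains only the words named on this slide, the tokenizer is a simple regular expression, and the function name `remove_stop_words` is chosen for illustration; practical systems use much longer lists.

```python
import re

# Stop list limited to the examples on the slide (an assumption, not a full list).
STOP_WORDS = {
    "a", "an", "the",                                     # articles
    "her", "his", "its", "my", "our", "their", "your",    # possessive determiners
    "for", "and", "nor", "but", "or", "yet", "so",        # conjunctions
    "on", "beneath", "over", "of", "during", "beside",    # prepositions
}


def remove_stop_words(text: str) -> list[str]:
    """Lower-case the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]


terms = remove_stop_words("The make of a car and the cost of its repair")
```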

Linguistic Preprocessing
2) Stemming
• The words that appear in documents often have many variants.
• Each word that is not a stop word is reduced to its corresponding stem (term); words are stemmed to their root form by removing common prefixes and suffixes.
• Example: the words disease, diseases, and diseased share the common stem disease and can be treated as different occurrences of this word.
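
A toy suffix-stripping stemmer illustrates the idea. It is a deliberately naive sketch, not a production stemming algorithm: the suffix list and the function name `toy_stem` are assumptions made for this example. It maps the three variants to the shared stem "diseas" rather than "disease"; a stem does not have to be a dictionary word as long as all variants map to the same string.

```python
# Suffixes are tried in this order; the first one that matches is removed.
SUFFIXES = ("ing", "ed", "es", "s", "e")


def toy_stem(word: str) -> str:
    """Strip one common suffix, keeping at least three characters of the word."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


stems = {toy_stem(w) for w in ["disease", "diseases", "diseased"]}
```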
Basic Measures for Text Retrieval
• Precision evaluates the ability of the IR system to retrieve top-ranked documents that are mostly relevant. It is defined as the percentage of the retrieved documents that are truly relevant to the user’s query.
• Recall evaluates the ability of the IR system to find all the relevant items in the database. It is defined as the percentage of the documents relevant to the user’s query that were actually retrieved.
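
Both measures reduce to set arithmetic: precision divides the number of relevant retrieved documents by the number of retrieved documents, and recall divides it by the number of relevant documents. The sketch below uses hypothetical document identifiers, returns fractions rather than percentages, and returns 0.0 for an empty denominator, which is a convention chosen here.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one query."""
    hits = retrieved & relevant  # documents that are both retrieved and relevant
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical query result: 4 documents retrieved, 5 relevant in the database.
retrieved = {"d1", "d2", "d3", "d4"}
relevant = {"d1", "d3", "d5", "d6", "d7"}
precision, recall = precision_recall(retrieved, relevant)
```

Here two of the four retrieved documents are relevant (precision 0.5), and two of the five relevant documents were found (recall 0.4).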
The IR Quality
• Very high precision and low recall: the system returns only a few documents and almost all of them are relevant, but a significant number of other relevant documents are missing.
• Very high recall and relatively low precision: the system returns a large number of documents that include almost all relevant documents but also a significant number of unwanted documents.
Vector-Space Model
• In the vector-space model, each document in the database and the user’s query are represented by a multidimensional vector.
• The vocabulary is the set of all distinct terms that remain after linguistic preprocessing of the documents in the database; it contains t index terms.
• Each term i in a document or query j is given a real-valued weight $w_{ij}$.
• Documents and queries are expressed as t-dimensional vectors $d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$. We assume that there are n documents in the database, i.e., $j = 1, 2, \ldots, n$.
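
The representation can be sketched by building the vocabulary and mapping each document, and the query, to a vector over it. Raw term counts are used as placeholder weights here, and the two preprocessed documents and the name `to_vector` are hypothetical.

```python
# Hypothetical documents, already stop-word filtered and stemmed.
docs = [
    ["data", "cube", "data", "dimension"],
    ["data", "mining", "database"],
]

# Vocabulary: all distinct terms, in a fixed (sorted) order; t = len(vocabulary).
vocabulary = sorted({term for terms in docs for term in terms})


def to_vector(terms: list[str], vocabulary: list[str]) -> list[int]:
    """Map a list of terms to a t-dimensional vector of raw counts."""
    return [terms.count(term) for term in vocabulary]


doc_vectors = [to_vector(terms, vocabulary) for terms in docs]
query_vector = to_vector(["data", "mining"], vocabulary)
```

Because documents and the query share one vocabulary, every vector has the same length t and can be compared component by component.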
Quiz : Linguistic Preprocessing
• Q1 Document: “data cube contains x data dimension, y data dimension, and z data dimension”
• Q2 Document: “data mining is a study related to data warehousing and database.”
• Q3 Document: “The Seychelles island consist of 2 million people with 1 million of them educated and the rest are uneducated.”
Data Mining: Concepts and Techniques

Vector-Space Model
• In the vector-space model, a database of all documents is represented by a term-document matrix (also referred to as a term-frequency matrix).
• The frequency of a term i in document j is defined as:
$$tf_{ij} = \frac{f_{ij}}{\max_{i} f_{ij}}$$
• where $f_{ij}$ is the number of times term i occurs in document j. The frequency is normalized by the frequency of the most common term in the document.
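
A minimal sketch of this normalization, assuming the term frequency is the raw count of a term divided by the count of the most common term in the same document. The sample term list and the name `term_frequencies` are hypothetical.

```python
from collections import Counter


def term_frequencies(terms: list[str]) -> dict[str, float]:
    """Normalized term frequency: count / count of the most common term."""
    counts = Counter(terms)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


tf = term_frequencies(["data", "cube", "data", "dimension", "data", "dimension"])
```

In this example "data" occurs three times and is the most common term, so its frequency is 1.0, while "dimension" gets 2/3 and "cube" gets 1/3.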
Vector-Space Model
• The inverse document frequency is used to indicate the discriminative power of a term i.
• The inverse document frequency is defined as:
$$idf_i = \log\frac{n}{df_i}$$
• where $df_i$ is the document frequency of term i, i.e., the number of documents that contain term i, and n is the number of documents in the database.
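
The sketch below assumes the common definition of inverse document frequency, the logarithm of n divided by the document frequency of the term. The base-10 logarithm is an arbitrary choice made for this example, and the documents and function name are hypothetical.

```python
import math


def inverse_document_frequencies(docs: list[list[str]]) -> dict[str, float]:
    """idf = log10(n / df), where df counts the documents containing the term."""
    n = len(docs)
    df: dict[str, int] = {}
    for terms in docs:
        for term in set(terms):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log10(n / count) for term, count in df.items()}


docs = [
    ["data", "cube"],
    ["data", "mining"],
    ["data", "mining", "text"],
]
idf = inverse_document_frequencies(docs)
```

A term that appears in every document, such as "data" here, gets an idf of 0 and therefore has no discriminative power, while rarer terms such as "cube" score higher.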
Vector-Space Model
• The weights $w_{ij}$ are computed using the tf-idf measure (term frequency-inverse document frequency), which is defined as:
$$w_{ij} = tf_{ij} \times idf_i$$
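
Putting the pieces together, the following sketch builds tf-idf document vectors. It assumes the common product form, where the weight of a term is its normalized term frequency multiplied by its inverse document frequency, with a base-10 logarithm. The function name `tf_idf_vectors` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tf_idf_vectors(docs: list[list[str]]) -> tuple[list[str], list[list[float]]]:
    """Return the vocabulary and one tf-idf weight vector per document."""
    n = len(docs)
    vocabulary = sorted({term for terms in docs for term in terms})
    # Document frequency: number of documents that contain each term.
    df = Counter(term for terms in docs for term in set(terms))
    vectors = []
    for terms in docs:
        counts = Counter(terms)
        max_count = max(counts.values()) if counts else 1
        vectors.append([
            (counts[term] / max_count) * math.log10(n / df[term])
            for term in vocabulary
        ])
    return vocabulary, vectors


docs = [
    ["data", "cube", "data"],
    ["data", "mining"],
    ["text", "mining"],
]
vocabulary, vectors = tf_idf_vectors(docs)
```

Each row of `vectors` corresponds to one document and each column to one vocabulary term, so the result is the transposed form of a term-document matrix.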
Example

Text Similarity Measures
• A text similarity measure is a function that is used to compute the degree of similarity between two vectors. It is used to quantify similarity between the user’s query and each of the documents from the database.
• Cosine measure: $sim(q, d_j) = \frac{q \cdot d_j}{\|q\| \, \|d_j\|}$
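
The cosine measure is the dot product of two vectors divided by the product of their lengths. The sketch below ranks documents against a query over four terms T1 to T4; the two document vectors are hypothetical values chosen for illustration, and returning 0.0 for a zero-length vector is a convention adopted here.

```python
import math


def cosine_similarity(u: list[float], v: list[float]) -> float:
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention: a zero vector is similar to nothing
    return dot / (norm_u * norm_v)


query = [0.0, 0.0, 1.0, 0.0]  # weight 1 on T3 only
documents = {                  # hypothetical tf-idf weights for T1..T4
    "d1": [0.5, 0.0, 1.0, 0.0],
    "d2": [0.0, 1.0, 0.0, 0.3],
}

scores = {name: cosine_similarity(query, vector)
          for name, vector in documents.items()}
ranking = sorted(scores, key=scores.get, reverse=True)
```

Documents are returned in decreasing order of similarity, so d1, which shares term T3 with the query, ranks ahead of d2, which shares no term with it.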
Example Text Similarity
Text Similarity for below?
Given query: 0T1, 0T2, 1T3, 0T4, i.e., q = (0, 0, 1, 0) over terms T1 to T4

Q&A
Types of Text Data Mining
• Keyword-based association analysis
• Automatic document classification
• Similarity detection
– Cluster documents by a common author
– Cluster documents containing information from a
common source
• Link analysis: unusual correlation between entities
• Sequence analysis: predicting a recurring event
• Anomaly detection: find information that violates usual
patterns
• Hypertext analysis
– Patterns in anchors/links
• Anchor text correlations with linked objects
