
Data Management

CT075-3-2

Text Data Mining


Learning Outcomes

By the end of this lecture, YOU should be able to:


• Understand text mining concepts.

• Explain the information retrieval concept and how it can be used
in text mining.

• Explore one data mining algorithm: the vector-space model.


Key Terms you must be able to use

• If you have mastered this topic, you should be able to use the
following terms correctly in your assignments and exams:
- Text mining
- Information retrieval
- Stemming
- Recall
- Linguistic Preprocessing



Introduction
• Text databases (document databases)

– Large collections of documents from various sources:
news articles, research papers, books, digital
libraries, e-mail messages, Web pages, library
databases, etc.

– Data stored is usually semi-structured

– Traditional information retrieval techniques become
inadequate for the increasingly vast amounts of text
data
Information retrieval
A field developed in parallel with database systems

There are three main types of retrieval within the knowledge
discovery process framework:

•data retrieval, which concerns retrieval of structured data
from DBMS and data warehouses.
•information retrieval, which concerns the organization and
retrieval of information from large collections of
semistructured or unstructured text-based databases and the
WWW.
•knowledge retrieval, which concerns the generation of
knowledge from (usually) structured data.
Information Retrieval

• Typical IR systems
– Online library catalogs
– Online document management systems
• Information retrieval vs. database systems
– Some DB problems are not present in IR, e.g., update,
transaction management, complex objects
– Some IR problems are not addressed well in DBMS,
e.g., unstructured documents, approximate search
using keywords and relevance
Basic IR Definitions
The core IR vocabulary includes the following four terms:

• database, which is defined as a collection of text documents

• document, which consists of a sequence of terms in a natural
language that expresses ideas about some topic

• term, which is defined as a semantic unit, phrase, or word (or
more precisely a root of the word)

• query, which is a request for documents that concern a particular
topic that is of interest to the user of an IR system
Keyword-Based Retrieval

• A document is represented by a string, which can be
identified by a set of keywords

• Queries may use expressions of keywords

– E.g., car and repair shop, tea or coffee.

– Queries and retrieval should consider synonyms,
e.g., repair and maintenance
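
The idea of keyword queries with synonym expansion can be sketched in Python as follows. This is only a minimal illustration: the document collection `DOCS`, the synonym table `SYNONYMS`, and all function names are hypothetical and not part of the lecture material.

```python
# Hypothetical toy collection: document id -> text.
DOCS = {
    "d1": "car repair shop open on sunday",
    "d2": "car maintenance shop downtown",
    "d3": "tea and coffee house",
}

# Hypothetical synonym table, used to expand each query keyword.
SYNONYMS = {"repair": {"maintenance"}, "maintenance": {"repair"}}


def keywords(text):
    """Represent a document by its set of lowercase keywords."""
    return set(text.lower().split())


def matches(keyword, doc_keywords):
    """True if the keyword or one of its synonyms occurs in the document."""
    return bool(({keyword} | SYNONYMS.get(keyword, set())) & doc_keywords)


def search_and(query_keywords):
    """Ids of documents that match every keyword (AND query)."""
    return sorted(d for d, text in DOCS.items()
                  if all(matches(k, keywords(text)) for k in query_keywords))


def search_or(query_keywords):
    """Ids of documents that match at least one keyword (OR query)."""
    return sorted(d for d, text in DOCS.items()
                  if any(matches(k, keywords(text)) for k in query_keywords))
```

With this toy data, `search_and(["car", "repair"])` returns both `d1` and `d2`, because `d2` matches "repair" through its synonym "maintenance".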
Difficulties in IR

• Major difficulties of the model

– Synonymy: A keyword T does not appear anywhere in
the document, even though the document is closely
related to T, e.g., the keyword data mining may not appear in a
database article that is closely related to it.

– Polysemy: The same keyword may mean different
things in different contexts.
Example: make can be associated with “make a mistake,” “make of
a car,” “make up excuses,” and “make-believe”
Architecture of Information Retrieval Systems
Linguistic Preprocessing

Term extraction usually includes two main operations:

1) Removal of stop words

• Determiners: articles (a, an, the) and possessive determiners
(her, his, its, my, our, their, your)

• Conjunctions (for, and, nor, but, or, yet, so) and correlative
conjunctions (both … and, either … or, not (only) … but (…
also))

• Prepositions (on, beneath, over, of, during, beside, etc.)
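
Stop word removal can be sketched as a simple filter over the tokens of a document. The stop list below contains only the single-word examples named on this slide, so it is an assumption for illustration and far from complete; the function name `remove_stop_words` is hypothetical.

```python
import re

# Illustrative stop list built from the examples on the slide (not exhaustive).
STOP_WORDS = {
    "a", "an", "the",                                    # articles
    "her", "his", "its", "my", "our", "their", "your",   # possessive determiners
    "for", "and", "nor", "but", "or", "yet", "so",       # conjunctions
    "on", "beneath", "over", "of", "during", "beside",   # prepositions
}


def remove_stop_words(text):
    """Lowercase the text, split it into word tokens, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in STOP_WORDS]
```

For example, `remove_stop_words("The make of a car and its repair")` keeps only `make`, `car`, and `repair`.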


Linguistic Preprocessing
2) Stemming:
The words that appear in documents often have many variants.

• Each word that is not a stop word is reduced to its
corresponding stem word (term): words are stemmed to
obtain their root form by removing common prefixes and
suffixes.

• Example:
the words disease, diseases, and diseased share the common
stem term disease, and can be treated as different occurrences
of this word.
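
A toy suffix-stripping stemmer illustrates the idea. The suffix list and the minimum stem length are assumptions chosen for this example only; real stemmers such as the Porter stemmer use much larger rule sets. Note that this sketch maps the three variants to the truncated stem `diseas` rather than to the dictionary word, which is acceptable because a stem only needs to be the same for all variants.

```python
# Illustrative suffixes only, checked in this order.
SUFFIXES = ("es", "ed", "s", "e")


def stem(word):
    """Strip the first matching suffix, keeping a stem of at least 3 letters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# All three variants from the slide collapse to one stem.
stems = {stem(w) for w in ["disease", "diseases", "diseased"]}
```

Here `stems` contains a single element, so the three words would be counted as occurrences of the same term.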
Basic Measures for Text Retrieval

• Precision evaluates the ability of the IR system to retrieve top-ranked documents that
are mostly relevant, and is defined as the percentage of the retrieved documents that
are truly relevant to the user’s query.

• Recall evaluates the ability of the IR system to find all the relevant items in the
database and is defined as the percentage of the documents relevant to the
user’s query that were actually retrieved.
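
Both measures can be computed from two sets: the documents the system retrieved and the documents that are truly relevant. The sketch below assumes documents are identified by simple ids; the function name and the sample ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) as fractions between 0 and 1."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = retrieved & relevant  # relevant documents that were retrieved
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(relevant) if relevant else 0.0
    return precision, recall


# 4 documents retrieved, 5 documents relevant, 2 in common.
p, r = precision_recall(["d1", "d2", "d3", "d4"],
                        ["d1", "d2", "d5", "d6", "d7"])
# p is 2/4 = 0.5 and r is 2/5 = 0.4
```

Multiply each value by 100 to express it as a percentage, as in the definitions above.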
The IR Quality

• Very high precision and low recall. In this case, the system returns a few
documents and almost all of them are relevant, but at the same time a significant
number of other relevant documents are missing.

• Very high recall and relatively low precision. In this case, the system returns a
large number of documents that include almost all relevant documents but also
include a significant number of unwanted documents.
Vector-Space Model

• In the vector-space model, each document in the database and the
user’s query are represented by a multidimensional vector.

• The vocabulary is the set of all distinct terms that remain after linguistic
preprocessing of the documents in the database, and contains t index
terms.

• Each term i, in either a document or query j, is given a real-valued
weight $w_{ij}$.

• Documents and queries are expressed as t-dimensional vectors
$d_j = (w_{1j}, w_{2j}, \ldots, w_{tj})$. We assume that there are n documents in the
database, i.e., $j = 1, 2, \ldots, n$.
Quiz : Linguistic Preprocessing

• Q1 Document: “data cube contains x data
dimension, y data dimension, and z data
dimension”
• Q2 Document: “data mining is a study related to
data warehousing and database.”
• Q3 Document: “The Seychelles island consists of
2 million people with 1 million of them educated
and the rest are uneducated.”

Data Mining: Concepts and Techniques


Vector-Space Model

• In the vector-space model, a database of all documents
is represented by a term-document matrix (also
referred to as a term-frequency matrix).

• The frequency of a term i in document j is defined as:

$$tf_{ij} = \frac{f_{ij}}{\max_{k} f_{kj}}$$

• where $f_{ij}$ is the number of times term i occurs in
document j. The frequency is normalized by the
frequency of the most common term in the document.
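
The normalized term frequency described above can be sketched as follows, assuming the document has already been preprocessed into a list of terms. The function name and the sample tokens are hypothetical.

```python
from collections import Counter


def term_frequencies(tokens):
    """Map each term to its count divided by the count of the most common term."""
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


# "data" occurs twice, the other terms once.
tf = term_frequencies(["data", "cube", "data", "dimension"])
# tf is {"data": 1.0, "cube": 0.5, "dimension": 0.5}
```

The most common term in a document always receives the value 1.0, so frequencies are comparable across documents of different lengths.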
Vector-Space Model

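• The inverse document frequency
is used to indicate the discriminative power of a term i.

• The inverse document frequency is defined as:

$$idf_i = \log \frac{n}{df_i}$$

• where $df_i$ is the document frequency of term i and equals
the number of documents that contain term i, and n is the number of
documents in the database. The base of the logarithm is a matter of
convention; base 2 is a common choice.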
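
A small sketch of the inverse document frequency, assuming the common form log(n / df) with n the number of documents. The base-2 logarithm, the function name, and the sample documents are assumptions made for this example.

```python
import math


def inverse_document_frequencies(docs):
    """Map each term to log2(n / df), where df counts documents containing it."""
    n = len(docs)
    df = {}
    for tokens in docs:
        for term in set(tokens):  # count each term once per document
            df[term] = df.get(term, 0) + 1
    return {term: math.log2(n / count) for term, count in df.items()}


docs = [
    ["data", "mining"],
    ["data", "cube"],
    ["data", "warehouse"],
    ["text", "mining"],
]
idf = inverse_document_frequencies(docs)
# "cube" (1 of 4 documents) gets 2.0, "mining" (2 of 4) gets 1.0,
# and "data" (3 of 4) gets the lowest value, log2(4/3).
```

Rare terms receive high values and terms that occur in most documents receive values near zero, which is what "discriminative power" means here.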
Vector-Space Model

• The weights $w_{ij}$ are computed using the tf-idf measure (term
frequency-inverse document frequency), which is
defined as:

$$w_{ij} = tf_{ij} \times idf_i$$
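
The tf-idf weighting can be sketched end to end as below, assuming the weight is the product of the normalized term frequency and a base-2 inverse document frequency. The log base, the function name `tfidf_matrix`, and the sample documents are assumptions for illustration. The function returns one row per document, which is the transpose of a term-document matrix.

```python
import math
from collections import Counter


def tfidf_matrix(docs):
    """Return (vocabulary, rows), where rows[j][i] is the weight of term i in document j."""
    n = len(docs)
    counts = [Counter(tokens) for tokens in docs]
    vocabulary = sorted(set().union(*docs))
    # Document frequency: number of documents that contain each term.
    df = {t: sum(1 for c in counts if t in c) for t in vocabulary}
    idf = {t: math.log2(n / df[t]) for t in vocabulary}
    rows = []
    for c in counts:
        max_count = max(c.values(), default=1)  # most common term in this document
        rows.append([c[t] / max_count * idf[t] for t in vocabulary])
    return vocabulary, rows


vocabulary, rows = tfidf_matrix([["data", "data", "cube"], ["data", "mining"]])
# vocabulary is ["cube", "data", "mining"]
# rows is [[0.5, 0.0, 0.0], [0.0, 0.0, 1.0]]
```

In this example "data" receives weight 0 in both documents because it occurs in every document and therefore does not help to distinguish them.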
Example



Text Similarity Measures

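• A text similarity measure is a function that is used to compute the degree of
similarity between two vectors. It is used to quantify
similarity between the user’s query and each of the
documents from the database.

• The cosine measure is defined as:

$$sim(q, d_j) = \frac{q \cdot d_j}{\|q\| \, \|d_j\|} = \frac{\sum_{i=1}^{t} w_{iq} \, w_{ij}}{\sqrt{\sum_{i=1}^{t} w_{iq}^2} \, \sqrt{\sum_{i=1}^{t} w_{ij}^2}}$$

where $w_{iq}$ is the weight of term i in the query q.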
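
The cosine measure is the dot product of the two vectors divided by the product of their lengths. A minimal sketch, with a hypothetical function name and made-up example vectors, might look like this:

```python
import math


def cosine_similarity(q, d):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(q, d))
    norm_q = math.sqrt(sum(a * a for a in q))
    norm_d = math.sqrt(sum(b * b for b in d))
    if norm_q == 0 or norm_d == 0:
        return 0.0  # convention used here: an all-zero vector matches nothing
    return dot / (norm_q * norm_d)


# The vectors share one of their two non-zero terms.
similarity = cosine_similarity([1, 0, 1], [1, 1, 0])
# similarity is approximately 0.5
```

To answer a query, the system would compute this value between the query vector and every document vector and rank the documents from highest to lowest similarity.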
Example Text Similarity
What is the text similarity for the case below?
Given query: 0T1, 0T2, 1T3, 0T4 (weight 1 for term T3 and weight 0 for terms T1, T2, and T4)



Q&A
Types of Text Data Mining
• Keyword-based association analysis
• Automatic document classification
• Similarity detection
– Cluster documents by a common author
– Cluster documents containing information from a
common source
• Link analysis: unusual correlation between entities
• Sequence analysis: predicting a recurring event
• Anomaly detection: find information that violates usual
patterns
• Hypertext analysis
– Patterns in anchors/links
– Anchor text correlations with linked objects
