
Text Mining and Word Clouds

Text is Everywhere

• Medical Records
• Consumer Complaint Logs
• Product Inquiries
• Social Media Posts (Twitter feed, Emails, Facebook status,
Reddit comments, etc.)
• Personal Webpages

Text Mining deals with converting this vast amount of data to a meaningful form.
• Structured data:
  • Well organized
  • Common, agreed-upon features in each data sample
  • Formats: tables, relational databases, etc.
  • Sources: government, industry, CRM, markets, etc.
• Unstructured data:
  • Not well organized; unclear what the features are
  • A lot of heterogeneity between data samples
  • Formats: text, images, video, audio, etc.
  • Sources: social media, security cameras, etc.

• Unstructured data is unstructured because adding structure is hard work
• It is not designed for analysis
Adding Structure

[Figure: unstructured text (“blah, blah, blah, …”) passes through an “Identify and Build Features” step to become a table of feature values (f1, f2, …; val11, val12, …; val21, val22, …), which can then be used to Explore, Explain, and Predict.]

Text Data is Difficult to Analyze

• Text data is “unstructured”: it does not come in a well-formatted table with each field having a specific meaning!
• Text has a linguistic structure that is easily understood by humans (not computers)
• Words vary in length, and the order of words matters
• The data tends to have poor quality: spelling mistakes, abbreviations, punctuation, etc.

Text data must undergo extensive preprocessing before being used in any analytics algorithm/application.
Bag of Words Approach
• Treat a document as a collection of individual words, i.e.,
  ignore grammar, word order, sentence structure, etc.
• Each word is equally likely to be an important keyword.
• The words that appear most often in the document are the most important keywords (the most valuable features).
• The term frequency TF(t, d) is the number of times a particular word t appears in a document d (it may also be normalized, e.g., divided by the total number of terms in d).

“all data mining involves the use of machine learning but not all
machine learning requires data mining”

Term      Freq.  Term  Freq.  Term    Freq.  Term      Freq.
all       2      data  2      mining  2      involves  1
the       1      use   1      of      1      machine   2
learning  2      but   1      not     1      requires  1
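These counts can be reproduced with a few lines of base R (a minimal sketch, not part of the original slides):

# Split the sentence on whitespace and tabulate word occurrences
sentence <- "all data mining involves the use of machine learning but not all machine learning requires data mining"
words <- unlist(strsplit(sentence, "\\s+"))
sort(table(words), decreasing = TRUE) # "all", "data", "machine", ... appear twice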
Advantages
• A very simple representation
• Inexpensive to generate
• Works in many settings
• Often works surprisingly well! (e.g., on technical reports, prescriptions, …)
• “a duck walked up to a lemonade stand”
• “a horse walked up to a lemonade stand”
• “The Duck walks near the Lemonade Stand”

According to bag of words:

[“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
is similar to
[“a”, “horse”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]

BUT

[“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
is not similar to
[“The”, “Duck”, “walks”, “near”, “the”, “Lemonade”, “Stand”]
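To make the contrast concrete, here is a minimal base-R sketch (not from the original slides; the helper name bow_similarity is hypothetical) that scores two sentences by the Jaccard overlap of their word sets:

# Jaccard overlap of two sentences' word sets (case-sensitive,
# exact word forms, mirroring a raw bag of words)
bow_similarity <- function(s1, s2) {
  w1 <- unlist(strsplit(s1, "\\s+")) # split sentence 1 into words
  w2 <- unlist(strsplit(s2, "\\s+")) # split sentence 2 into words
  length(intersect(w1, w2)) / length(union(w1, w2))
}

duck  <- "a duck walked up to a lemonade stand"
horse <- "a horse walked up to a lemonade stand"
duck2 <- "The Duck walks near the Lemonade Stand"

bow_similarity(duck, horse) # 0.75: 6 of 8 unique words shared
bow_similarity(duck, duck2) # 0: no exact matches before cleaning

Cleaning the text (next section), i.e., lowercasing and stemming, is exactly what makes the third sentence comparable to the first two.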
Cleaning the Text

• Convert the text to lower case.
• Remove common stopwords like “the”, “we”, “and”, etc.
  • “not” is not a good stopword. Why?
• Remove numbers (or replace them with words).
• Remove punctuation like “.”, “,”, etc.
• Reduce the words to their root (word stemming). Example:
“announces”, “announced”, “announcing” are reduced to
“announc”.
• Remove unnecessary white space.
Cleaning the Text in R

Load the required libraries

library("tm")        # Text Mining Library
library("SnowballC") # For reducing words to their root

Create the text document object

myDocument <- Corpus(VectorSource("All data mining involves the use of machine learning, but not all machine learning requires data mining."))

Clean the Text

myDocument <- tm_map(myDocument, content_transformer(tolower)) #Convert to lower case
myDocument <- tm_map(myDocument, removeWords, stopwords("english")) #Remove stopwords
myDocument <- tm_map(myDocument, removeNumbers) #Remove numbers
myDocument <- tm_map(myDocument, removePunctuation) #Remove punctuation
myDocument <- tm_map(myDocument, stemDocument) #Reduce the words to their root
myDocument <- tm_map(myDocument, stripWhitespace) #Remove unnecessary white space
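To verify that the transformations worked, one quick check (a suggestion, not shown in the original slides) is to print the cleaned document:

inspect(myDocument)           # summary plus the cleaned content
as.character(myDocument[[1]]) # the cleaned text as a plain string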
Getting Term Frequency Table in R

termMatrix <- as.matrix(TermDocumentMatrix(myDocument)) # Get term/frequency matrix
sortedtermMatrix <- sort(rowSums(termMatrix), decreasing = TRUE) # Sort in decreasing order of frequency
d <- data.frame("Term" = names(sortedtermMatrix), "Freq." = sortedtermMatrix,
                row.names = NULL) # Store as a data frame
print(d) # Display the data frame
Word Cloud
Word clouds are commonly used to visualize/highlight keywords
in documents
• Artistically place words with sizes proportional to their
frequency of occurrence.
• Typically, the exact position of the word does not mean
anything.
library("wordcloud") # Word Cloud Library
wordcloud(words = d$Term, freq = d$Freq., colors=brewer.pal(8, "Dark2"))
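wordcloud() also accepts optional arguments to tune the plot; the values below are illustrative choices, not ones prescribed by the slides:

wordcloud(words = d$Term, freq = d$Freq.,
          min.freq = 1,         # draw words appearing at least once
          max.words = 100,      # cap the number of words plotted
          random.order = FALSE, # place the most frequent words in the center
          rot.per = 0.35,       # fraction of words rotated 90 degrees
          colors = brewer.pal(8, "Dark2"))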
