
Text Mining and Word Clouds

Text is Everywhere

• Medical Records
• Consumer Complaint Logs
• Product Inquiries
• Social Media Posts (Twitter feed, Emails, Facebook status,
Reddit comments, etc.)
• Personal Webpages

Text Mining deals with converting this vast amount of data to a meaningful form.
• Structured data:
  • Well organized
  • Common, agreed-upon features in each data sample
  • Formats: tables, relational databases, etc.
  • Sources: government, industry, CRM, markets, etc.
• Unstructured data:
  • Not well organized; unclear what the features are
  • A lot of heterogeneity between data samples
  • Formats: text, images, video, audio, etc.
  • Sources: social media, security cameras, etc.

• Unstructured data is unstructured because adding structure is hard work
• It is not designed for analysis
Adding Structure

[Figure: unstructured text (“blah, blah, blah, …”) passes through an “Identify and Build Features” step to become a table of feature values (f1, f2, …; val11, val12, …; val21, val22, …), which can then be used to Explore, Explain, and Predict.]

Text Data is Difficult to Analyze

• Text data is “unstructured”: it does not come in a well-formatted table with each field having a specific meaning!
• Text has a linguistic structure that is easily understood by humans (not computers)
• Words vary in length, and the order of words matters
• The data tends to have poor quality: spelling mistakes, abbreviations, punctuation, etc.

Text data must undergo extensive preprocessing before being used in any analytics algorithm/application.
Bag of Words Approach
• Treat a document as a collection of individual words, i.e.,
  ignore grammar, word order, sentence structure, etc.
• Each word is equally likely to be an important keyword.
• The words that appear most often in the document are the most important keywords (the most valuable features).
• The term frequency TF(t, d) is the number of times a particular word t appears in a document d (it may also be normalized, e.g., divided by the total number of terms in d).

“all data mining involves the use of machine learning but not all
machine learning requires data mining”

Term      Freq.  Term  Freq.  Term    Freq.  Term      Freq.
all       2      data  2      mining  2      involves  1
the       1      use   1      of      1      machine   2
learning  2      but   1      not     1      requires  1
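These counts can be reproduced with a few lines of base R (a minimal sketch, not part of the original slides):

# Split the sentence on whitespace and tabulate word occurrences
sentence <- "all data mining involves the use of machine learning but not all machine learning requires data mining"
words <- unlist(strsplit(sentence, "\\s+"))
sort(table(words), decreasing = TRUE) # "all", "data", "machine", ... appear twice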
Advantages
• A very simple representation
• Inexpensive to generate
• Works in many settings
• Often works surprisingly well! (e.g., on technical reports, prescriptions, …)
• “a duck walked up to a lemonade stand”
• “a horse walked up to a lemonade stand”
• “The Duck walks near the Lemonade Stand”

According to bag of words:

[“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
is similar to
[“a”, “horse”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]

BUT

[“a”, “duck”, “walked”, “up”, “to”, “a”, “lemonade”, “stand”]
is not similar to
[“The”, “Duck”, “walks”, “near”, “the”, “Lemonade”, “Stand”]
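To make the contrast concrete, here is a minimal base-R sketch (not from the original slides; the helper name bow_similarity is hypothetical) that scores two sentences by the Jaccard overlap of their word sets:

# Jaccard overlap of two sentences' word sets (case-sensitive,
# exact word forms, mirroring a raw bag of words)
bow_similarity <- function(s1, s2) {
  w1 <- unlist(strsplit(s1, "\\s+")) # split sentence 1 into words
  w2 <- unlist(strsplit(s2, "\\s+")) # split sentence 2 into words
  length(intersect(w1, w2)) / length(union(w1, w2))
}

duck  <- "a duck walked up to a lemonade stand"
horse <- "a horse walked up to a lemonade stand"
duck2 <- "The Duck walks near the Lemonade Stand"

bow_similarity(duck, horse) # 0.75: 6 of 8 unique words shared
bow_similarity(duck, duck2) # 0: no exact matches before cleaning

Cleaning the text (next section), i.e., lowercasing and stemming, is exactly what makes the third sentence comparable to the first two.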
Cleaning the Text

• Convert the text to lower case.
• Remove common stopwords like “the”, “we”, “and”, etc.
  • “not” is not a good stopword. Why?
• Remove numbers (or replace them with words).
• Remove punctuation like “.”, “,”, etc.
• Reduce the words to their root (word stemming). Example:
“announces”, “announced”, “announcing” are reduced to
“announc”.
• Remove unnecessary white space.
Cleaning the Text in R

Load the required libraries

library("tm")        # Text Mining Library
library("SnowballC") # For reducing words to their root

Create the text document object

myDocument <- Corpus(VectorSource("All data mining involves the use of machine learning, but not all machine learning requires data mining."))

Clean the Text

myDocument <- tm_map(myDocument, content_transformer(tolower)) #Convert to lower case
myDocument <- tm_map(myDocument, removeWords, stopwords("english")) #Remove stopwords
myDocument <- tm_map(myDocument, removeNumbers) #Remove numbers
myDocument <- tm_map(myDocument, removePunctuation) #Remove punctuation
myDocument <- tm_map(myDocument, stemDocument) #Reduce the words to their root
myDocument <- tm_map(myDocument, stripWhitespace) #Remove unnecessary white space
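To verify that the transformations worked, one quick check (a suggestion, not shown in the original slides) is to print the cleaned document:

inspect(myDocument)           # summary plus the cleaned content
as.character(myDocument[[1]]) # the cleaned text as a plain string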
Getting Term Frequency Table in R

termMatrix <- as.matrix(TermDocumentMatrix(myDocument)) # Get term/frequency matrix
sortedtermMatrix <- sort(rowSums(termMatrix), decreasing = TRUE) # Sort in decreasing order of frequency
d <- data.frame("Term" = names(sortedtermMatrix), "Freq." = sortedtermMatrix,
                row.names = NULL) # Store as a data frame
print(d) # Display the data frame
Word Cloud
Word clouds are commonly used to visualize/highlight keywords
in documents
• Artistically place words with sizes proportional to their
frequency of occurrence.
• Typically, the exact position of the word does not mean
anything.
library("wordcloud") # Word Cloud Library
wordcloud(words = d$Term, freq = d$Freq., colors=brewer.pal(8, "Dark2"))
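wordcloud() also accepts optional arguments to tune the plot; the values below are illustrative choices, not ones prescribed by the slides:

wordcloud(words = d$Term, freq = d$Freq.,
          min.freq = 1,         # draw words appearing at least once
          max.words = 100,      # cap the number of words plotted
          random.order = FALSE, # place the most frequent words in the center
          rot.per = 0.35,       # fraction of words rotated 90 degrees
          colors = brewer.pal(8, "Dark2"))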
