
Text Mining and Analytics
Dr. Praveen Kumar .T

praveen@msrim.org
12/16/19 Dr. Praveen Kumar .T - Text Mining and Analytics 2
• Also known as Text Data Mining
• The process of examining large collections of unstructured textual resources in
order to generate new information



Why Do We Use Text Mining?
• Turn text into data for analysis
• Generate new information
• Populate a database with the information extracted



Motivation for Text Mining
• Approximately 90% of the world’s data is held in unstructured formats
(source: Oracle Corporation)
• Information-intensive business processes demand that we move beyond simple
document retrieval to “knowledge” discovery.

[Chart: unstructured or semi-structured information versus structured numerical or coded information (10%)]



“Search” versus “Discover”

                           Search (goal-oriented)    Discover (opportunistic)
Structured Data            Data Retrieval            Data Mining
Unstructured Data (Text)   Information Retrieval     Text Mining



Text Mining: Examples
• Text mining is an exercise to gain knowledge from stores of language text.
• Text:
• Web pages
• Medical records
• Customer surveys
• Email filtering (spam)
• DNA sequences
• Incident reports
• Drug interaction reports
• News stories (e.g. predict stock movement)



APPLICATIONS OF TEXT ANALYTICS
Search & info access 39%
Customer experience management 39%
Brand management 39%
Research 36%
Competitive intelligence 33%
Customer service 26%
E-discovery 15%
Life sciences 15%
Product design 15%
Online commerce 11%
Finance 10%
Other 9%
Content management 8%
Insurance & fraud 8%
Military intelligence 7%
Law enforcement 6%
Text Mining
• Typically falls into one of two categories
• Analysis of text: I have a bunch of text I am interested in, tell me something
about it
• E.g. sentiment analysis, “buzz” searches
• Retrieval: There is a large corpus of text documents, and I want the one
closest to a specified query
• E.g. web search, library catalogs, legal and medical precedent studies



Text Mining: Analysis
• Which words occur most frequently?
• Which words are most surprising?
• Which words help define the document?
• What are the interesting text phrases?
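
The first and third questions above can be approximated with term frequency and TF-IDF. The following is a minimal base R sketch under stated assumptions: the three toy documents and all variable names are made up for illustration and are not part of the original slides.

```r
# Toy corpus: three short "documents" (made-up text for illustration).
docs <- c(
  d1 = "text mining turns text into data",
  d2 = "data mining finds patterns in data",
  d3 = "word clouds show frequent words"
)

# Tokenize: lower-case, then split on anything that is not a letter.
tokens <- lapply(tolower(docs), function(d) {
  w <- strsplit(d, "[^a-z]+")[[1]]
  w[nzchar(w)]
})

# Term frequency: how often each word occurs within each document.
vocab <- sort(unique(unlist(tokens)))
tf <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))

# Inverse document frequency: words found in few documents score higher.
doc_freq <- colSums(tf > 0)
idf <- log(length(docs) / doc_freq)

# TF-IDF: frequent in this document, rare in the rest of the corpus.
tfidf <- sweep(tf, 2, idf, `*`)

# Most frequent words overall, and the top-weighted word per document.
overall <- sort(colSums(tf), decreasing = TRUE)
top_terms <- apply(tfidf, 1, function(row) names(which.max(row)))
head(overall, 3)
top_terms
```

Raw counts answer "which words are most present", while the TF-IDF weights are one common way to approximate "which words help define the document".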



Text Mining: Retrieval
• Find n objects in the corpus of documents which are most similar to
my query.
• Can be viewed as “interactive” data mining - query not specified a
priori.
• Main problems of text retrieval:
• What does “similar” mean?
• How do I know if I have the right documents?
• How can I incorporate user feedback?
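
One common answer to "what does similar mean?" is cosine similarity between word-count vectors. The sketch below is only an illustration in base R: the toy documents, the query, and the helper names (tokenize, to_vector, cosine) are assumptions, not part of the original slides.

```r
# Toy corpus and query (made-up text for illustration).
docs <- c(
  d1 = "text mining turns text into data",
  d2 = "information retrieval finds documents for a query",
  d3 = "word clouds show frequent words"
)
query <- "retrieval of documents"

# Split lower-cased text into words.
tokenize <- function(x) {
  w <- strsplit(tolower(x), "[^a-z]+")[[1]]
  w[nzchar(w)]
}

# Count vector over a fixed vocabulary built from the corpus.
vocab <- sort(unique(unlist(lapply(docs, tokenize))))
to_vector <- function(x) as.numeric(table(factor(tokenize(x), levels = vocab)))

# Cosine similarity: 0 means the two texts share no vocabulary words.
cosine <- function(a, b) {
  denom <- sqrt(sum(a^2)) * sqrt(sum(b^2))
  if (denom == 0) 0 else sum(a * b) / denom
}

q_vec  <- to_vector(query)
scores <- sapply(docs, function(d) cosine(to_vector(d), q_vec))

# The n documents most similar to the query.
n <- 2
top_n <- head(sort(scores, decreasing = TRUE), n)
top_n
```

Query words that never occur in the corpus (here "of") are simply ignored by this sketch; a real system would also need to handle weighting and user feedback.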



IR architecture



Information retrieval models
• An IR model governs how a document and a query are represented and
how the relevance of a document to a user query is defined.
• Main models:
• Boolean model
• Vector space model
• Statistical language model
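
As an illustration of the simplest of these, the Boolean model treats each document as a set of terms and a query as a Boolean expression over terms. The base R sketch below uses made-up documents and a hypothetical helper named has; it is not taken from the original slides.

```r
# Toy corpus (made-up text for illustration).
docs <- c(
  d1 = "text mining turns text into data",
  d2 = "data mining finds patterns in data",
  d3 = "information retrieval finds documents"
)

# Set of distinct words in each document.
terms <- lapply(strsplit(tolower(docs), "[^a-z]+"), unique)

# TRUE/FALSE per document: does it contain the term?
has <- function(term) sapply(terms, function(w) term %in% w)

# Query: mining AND data AND NOT patterns
hits <- has("mining") & has("data") & !has("patterns")
matched <- names(docs)[hits]
matched
```

A document either matches or it does not, so the Boolean model gives no ranking; the vector space model addresses that by scoring each document.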



Text Mining Process



Advantages of Text Mining
• Text mining can support predictive analytics.
• It can be used to summarize documents and to track opinions over time.
• Its techniques can be used to analyze problems in different areas of
business.
• It also helps to extract concepts from text and present them in a
simpler way.



Session 2
• Text mining using R
• Facebook data mining and analysis using R
• Twitter sentiment analysis



How Does Text Mining Work?
• Supports a deeper understanding of the text.
• Converts words from unstructured data into numerical values.
• Helps to find patterns and relationships that exist in a large chunk of
text.
• Uses machine learning algorithms to read and analyze text data.



Text Mining Using R – Steps
Step 1: Create a text file

Step 2 : Install and load the required packages

Step 3 : Text Mining

Step 4: Cleaning the text

Step 5 : Generate the Word cloud



R installation and Basic computation
• https://cran.r-project.org/
• Click on Download R for Windows. Click on base. Click on Download
R 3.3.2 for Windows (or a newer version that appears).
• Install R. Leave all default settings in the installation options.
• http://rstudio.org/download/desktop (it should be called something
like RStudio 1.0.136 — Windows Vista/7/8/10).
• Choose default installation options.



Step 2 – Install and Load Packages
• install.packages("NLP")
• install.packages("tm")
• install.packages("RColorBrewer")
• install.packages("wordcloud")
• install.packages("wordcloud2")
• library(NLP)
• library(tm)
• library(RColorBrewer)
• library(wordcloud)
• library(wordcloud2)



Step 3 – Text mining – R coding
• > filepath=("Z:R.txt")
• > text_file=readLines(filepath)
• > head(text_file)

• head() shows the first few lines of the text file.

• Next, paste() with collapse = " " joins all the lines in text_file into a
single string, with the lines separated by a space.
• The result is stored in text_file1.
• > text_file1=paste(text_file, collapse = " ")
• > head(text_file1)
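
To see what collapse does without needing the input file, here is a small self-contained sketch. The two-line character vector is a made-up stand-in for the result of readLines().

```r
# Stand-in for the result of readLines(): one element per line of the file.
text_file <- c("Text mining turns text", "into data for analysis.")

# collapse = " " joins every element into one string, separated by spaces.
text_file1 <- paste(text_file, collapse = " ")

length(text_file)   # number of elements before collapsing
length(text_file1)  # number of elements after collapsing
text_file1
```

Working with a single string means each later cleaning call processes the whole text in one pass.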



Step 4: Cleaning the Text and Generating the Word Cloud – R coding

• Base R string functions and functions from the tm package are used
• To convert the text to lower case
• To remove punctuation and digits
• To remove unnecessary white space
• To remove common stopwords such as “the” and “we”, along with other unwanted words
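
The same cleaning steps can be tried on a single sentence using base R only. This is a rough sketch under stated assumptions: the sample sentence and the short stop_words vector are made up, and the hand-picked list is far smaller than the one returned by stopwords() in the tm package.

```r
# Made-up sample sentence for illustration.
raw <- "The 3 analysts said: we LOVE text mining, and we love R!"

# Small hand-picked stopword list (an assumption; tm::stopwords() is far longer).
stop_words <- c("the", "we", "and", "a", "of", "to")

clean <- tolower(raw)                      # lower case
clean <- gsub("[^a-z ]", " ", clean)       # replace digits and punctuation with spaces
words <- strsplit(clean, " +")[[1]]        # split on runs of spaces
words <- words[nzchar(words)]              # drop empty strings
words <- words[!words %in% stop_words]     # drop stopwords
words <- words[nchar(words) > 1]           # drop single letters

words
```

Splitting into words before removing stopwords avoids accidentally deleting a stopword that appears inside a longer word.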



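> # Convert to lower case
> clean_text <- tolower(text_file1)
> head(clean_text)
> # Replace non-word characters (punctuation, symbols) with spaces
> clean_text1 <- gsub(pattern = "\\W", replacement = " ", clean_text)
> head(clean_text1)
> # Replace digits with spaces
> clean_text2 <- gsub(pattern = "\\d", replacement = " ", clean_text1)
> head(clean_text2)
> # Remove stopwords (tm package) plus two unwanted tokens
> clean_text3 <- removeWords(clean_text2, words = c(stopwords(), "ai", "â"))
> head(clean_text3)
> # Remove single-letter words
> clean_text4 <- gsub(pattern = "\\b[A-Za-z]\\b", replacement = " ", clean_text3)
> head(clean_text4)
> # Collapse repeated white space (tm package)
> clean_text5 <- stripWhitespace(clean_text4)
> head(clean_text5)
> # Split into individual words and count them
> clean_text6 <- strsplit(clean_text5, " ")
> head(clean_text6)
> word_freq <- table(clean_text6)
> head(word_freq)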
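
Word counts are often easier to inspect as a sorted two-column data frame, which is also the word/frequency layout that wordcloud2() documents for its data argument. The sketch below is illustrative only: the words vector is a made-up stand-in for the cleaned words, and the name freq_df is an assumption.

```r
# Stand-in for the cleaned word vector (made-up words for illustration).
words <- c("text", "mining", "data", "text", "data", "text", "cloud")

# Count each word and sort from most to least frequent.
counts <- sort(table(words), decreasing = TRUE)

# Two-column data frame: one row per distinct word.
freq_df <- data.frame(
  word = names(counts),
  freq = as.integer(counts),
  stringsAsFactors = FALSE
)
head(freq_df)
```

Sorting first makes it easy to check the most frequent words before plotting them.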



> # Two-column matrix of words and their counts
> word_freq1 <- cbind(names(word_freq), as.integer(word_freq))
> head(word_freq1)
> library(RColorBrewer)
> library(wordcloud)
> class(clean_text6)
> # Flatten the list returned by strsplit() into a character vector
> word_cloud <- unlist(clean_text6)
> wordcloud(word_cloud)
> wordcloud(word_cloud, min.freq = 5, random.order = FALSE, scale = c(3, 0.5))
> library(wordcloud2)
> wordcloud2(word_freq)
> wordcloud2(word_freq, color = "random-light", backgroundColor = "white")
> wordcloud2(word_freq, color = "random-dark", backgroundColor = "white", size = 0.5, shape = "triangle")

Installation & Importing of Rfacebook
• install.packages("Rfacebook")
• install.packages("RCurl")
• install.packages("httpuv")
• install.packages("RColorBrewer")
• install.packages("rjson")
• install.packages("httr")
• library(<package name>) for each package, e.g. library(Rfacebook)

Installation & Importing of twitteR
• library(RColorBrewer)
• library(wordcloud)
• library(tm)
• library(twitteR)
• library(ROAuth)
• library(plyr)
• library(stringr)
• library(base64enc)
• library(SnowballC)
• library(ggplot2)
• library(maps)
