
TEXT ANALYTICS

V.SUBBA RAO
184819
BBA BA
Contents
 Background & Objectives
 Our current view on Text Analytics
 Value
 Process
 An example application
 Conclusions
Background
 Text Analytics and Text Mining are largely synonymous
 Interest and execution of Text Analytics is growing
 Social Media sources are largely responsible for this
 And that often means “Big Data”

 This should lead to further improvements in technology and methodology which will benefit
survey practitioners
Objectives

 We’ve been involved in more Text Analytics work in the last 2 months than in all previous years
 Our objective in this presentation is to share some of our experience and thoughts around some of the technology we have used
The Value Propositions

1. Reduce cost (and time)

*http://wp.eaagle.com/

2. Generate actionable insights
 Improve public and commercial processes

Using Text Analytics to find Text Analytics software

http://www.isvworld.com/..0
Software tools

 R
• Open Source Statistical Platform
• Command driven

 Rapid Miner
• Open Source Data Mining Workbench
• GUI
• Built on R and Weka

 SPSS Text Analytics for Surveys
• Commercial Text Analytics
• GUI
The Process – Highest Level

Unstructured data → Structured data
Process – Level 2

1. Extract

2. Refine

3. Analyse
How can we tell if we are using
the right tool(s)?
Extract
• How good is the first extraction?
• How long to get to an acceptable extraction?
Refine
• How easy is it to refine?
• How easy is it to capture refinements to re-use them in future?
Analyse
• What tools exist to support the Text Analytics process?
• What tools exist to use the Structured Text in other analyses?

How well do the tools/methods deliver on the value propositions?


Algorithms and Dictionaries

1. Extract
Algorithms
• e.g. Natural Language Processing (NLP)

Dictionaries
• Variously called Lexicons, Resources, Libraries, etc.
• Are usually contextual e.g. Customer Satisfaction
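As a rough sketch of how such a contextual dictionary is applied, the lookup below maps extracted terms to the concepts they express. The terms and concepts are invented for illustration, not taken from any real lexicon.

```r
# Hypothetical mini-dictionary for a Customer Satisfaction context:
# a named vector mapping surface terms to concepts (illustrative only)
satisfaction_dict <- c(
  "helpful"   = "Service",
  "friendly"  = "Service",
  "cheap"     = "Price",
  "expensive" = "Price"
)

tokens <- c("friendly", "staff", "but", "expensive")
concepts <- satisfaction_dict[tokens]   # NA where no dictionary entry
concepts <- concepts[!is.na(concepts)]
print(unname(concepts))
# [1] "Service" "Price"
```

In practice the dictionary is far larger and the unmatched tokens (the NAs above) are exactly what the Refine step inspects for omissions.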
Example Data

 The American Physical Society (APS)


 Student Survey Comments from 2009 (Base=1304)
 Q4.2 Comments about the best features of and what could be added or improved to the special programs for Student Members*

*http://www.aps.org/about/governance/committees/commemb/upload/2009-student-comments.pdf
The first extraction with R
library("tm", lib.loc = "C:/Users/jmcconnell/Documents/R/win-library/3.0")

APS2009df <- read.csv("C:/AP/ASC/APS/APS2009Verbatims.csv", header = TRUE)

# note: VectorSource expects a character vector of documents;
# passing the text column itself is safer than the whole data frame
text_corpus <- Corpus(VectorSource(APS2009df),
                      readerControl = list(language = "en"))

summary(text_corpus)  # check what went in

text_corpus <- tm_map(text_corpus, removeNumbers)
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, stripWhitespace)
# in current versions of tm, base functions like tolower must be
# wrapped in content_transformer()
text_corpus <- tm_map(text_corpus, content_transformer(tolower))

We apply a basic set of text handling methods (simple NLP), e.g. removePunctuation
We also apply a small dictionary of known “Stopwords” (not shown)
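The same cleaning pipeline can be sketched in base R without the tm package, which makes each transformation explicit. The stopword list here is a tiny illustrative subset, not the dictionary used in the study.

```r
# Base-R sketch of the extraction cleaning steps (illustrative only)
clean_text <- function(x, stopwords = c("the", "and", "a", "of")) {
  x <- gsub("[0-9]+", "", x)          # removeNumbers
  x <- gsub("[[:punct:]]+", "", x)    # removePunctuation
  x <- tolower(x)                     # tolower
  x <- gsub("\\s+", " ", trimws(x))   # stripWhitespace
  tokens <- strsplit(x, " ")[[1]]
  tokens[!tokens %in% stopwords]      # stopword removal
}

clean_text("The Job Fair of 2009 was great, and busy!")
# [1] "job" "fair" "was" "great" "busy"
```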
R Extraction Results – Top 20 Terms
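The frequency table behind a “Top 20 Terms” chart can be produced in base R from already-cleaned tokens. The tokens below are toy data, not the APS comments.

```r
# Count and rank term frequencies from cleaned tokens (toy data)
tokens <- c("research", "fair", "job", "fair", "travel", "fair", "job")
freq <- sort(table(tokens), decreasing = TRUE)
head(freq, 2)
# fair = 3, job = 2
```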
The first extraction with Rapid Miner

We visually construct a similar set of steps


Rapid Miner Extraction Results – Top 20 Terms
Improving and creating new data
2. Refine

Improve the extraction
• Correct mistakes
• Add omissions

Map the extraction to structured data
• Group and combine meaningful terms that will become data for further analysis

In second and subsequent waves (where applicable) Refine should be a shorter step where we look for new concepts
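A minimal sketch of the mapping step: the captured refinements are just a reusable term-to-group table (the names below are invented), and applying it yields one structured column per group.

```r
# Reusable refinement table: extracted term -> concept group (illustrative)
refinements <- c(
  "jobfair" = "Job Fair", "job" = "Job Fair", "fair" = "Job Fair",
  "lodging" = "Accommodation", "housing" = "Accommodation"
)

# Cleaned tokens per respondent (toy data)
terms_per_respondent <- list(
  c("job", "fair", "travel"),
  c("housing")
)

# TRUE where a respondent mentioned any term in the "Job Fair" group
mentions_job_fair <- vapply(
  terms_per_respondent,
  function(tk) any(refinements[tk] == "Job Fair", na.rm = TRUE),
  logical(1)
)
mentions_job_fair
# [1]  TRUE FALSE
```

In later waves the same table is re-applied, and only genuinely new terms need manual attention.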
Rapid Miner – Refine

We add one process step to fix up some of the issues in the first extraction
Filter Tokens sets a lower limit for the length of an extracted term/attribute
Rapid Miner results after first refinement
The first extraction with SPSS

SPSS uses a Wizard to specify the extraction steps
SPSS Extraction Results – Top 20 Terms

SPSS is counting respondents, not occurrences
Synonyms are used from the dictionaries


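The difference between the two counting schemes can be illustrated in base R on toy responses: occurrence counts tally every match, respondent counts tally each response at most once.

```r
# Respondent counts vs occurrence counts for one term (toy data)
responses <- c("great fair, great talks", "great housing", "quiet")
hits <- gregexpr("great", responses, fixed = TRUE)  # -1 means no match

occurrences <- sum(vapply(hits, function(m) sum(m > 0), integer(1)))
respondents <- sum(vapply(hits, function(m) any(m > 0), logical(1)))

c(occurrences = occurrences, respondents = respondents)
# occurrences = 3, respondents = 2
```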
Adding WordNet to our R (/Rapid Miner) analysis

library("wordnet")
setDict("C:/Wordnet/WordNet-3.0/dict")
synonyms("excellent", "ADJECTIVE")

[1] "excellent" "fantabulous" "first-class" "splendid"
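One way the synonym list can feed back into the extraction is to collapse every synonym to a canonical term before counting. Here the WordNet output above is hard-coded rather than queried live, so the sketch runs without the wordnet package.

```r
# Collapse known synonyms of "excellent" to one canonical term
# (synonym vector copied from the WordNet query above)
syns <- c("excellent", "fantabulous", "first-class", "splendid")
canonicalise <- function(tokens) {
  tokens[tokens %in% syns] <- "excellent"
  tokens
}
canonicalise(c("splendid", "fair", "first-class"))
# [1] "excellent" "fair" "excellent"
```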


Analytics to aid refinement
Job … Fair

Students are asking for more “stuff” at the job fair
R Extraction Results – Top 20 Terms
Onward to analysis
Key Drivers of Recommendation*

Teaching quality: 40%
Support Services: 30%
Accommodation: 20%
Job Fair – Would like more: 10%

*This is an anonymised example
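A key-driver chart like this is typically produced by regressing an overall recommendation score on the structured mention flags from the Refine step. The sketch below uses purely synthetic data, not the anonymised study above.

```r
# Hypothetical key-driver analysis on synthetic data (illustration only)
set.seed(1)
n <- 200
drivers <- data.frame(
  teaching = rbinom(n, 1, 0.5),  # flag: mentioned teaching quality
  job_fair = rbinom(n, 1, 0.3)   # flag: mentioned the job fair
)
# Synthetic outcome: teaching is built to matter more than the job fair
drivers$recommend <- 5 + 2 * drivers$teaching +
  1 * drivers$job_fair + rnorm(n)

fit <- lm(recommend ~ teaching + job_fair, data = drivers)
coef(fit)  # teaching coefficient near 2, job_fair near 1
```

Relative coefficient (or importance) sizes are then rescaled into the percentage shares shown on the chart.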


3. Analyse
Onward to analysis
R
• In R we are in a statistical platform already
• Text Analytics outputs are part of the data in the current “Workspace”
• For Research style charts and tables we may need to export data

Rapid Miner
• In RM we are in a Data Mining platform already
• Text Analytics is part of the current process flow

SPSS Text Analytics for Surveys


• Data needs to be exported elsewhere for Analysis
• To SPSS .sav, Excel or Data Collection
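For R, the export mentioned above can be as simple as writing the structured results to CSV for charting elsewhere; the file name and contents here are illustrative.

```r
# Export structured Text Analytics output for use in other tools
out <- data.frame(term = c("fair", "job"), count = c(3, 2))
path <- file.path(tempdir(), "top_terms.csv")  # illustrative location
write.csv(out, path, row.names = FALSE)
read.csv(path)  # round-trip check
```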

A High Level Comparison

Attribute: R | Rapid Miner | SPSS TAfS

Help & Support: Lots of user-generated content | Lots of user-generated content; paid support option | Paid support

Usability: Low-level coding control | Visual programming | Visual UI

Scalability: R in itself isn’t too scalable but many scalable implementations exist e.g. Revolution, Hadoop | Radoop | We experienced issues with data sets around 100,000 cases*

Extensibility: Various options | Various options | None

Automation: Can be run in batch | Can run in batch | None

Overall: Great for the coder and those familiar with R | The power of R with a GUI | The most graphical and tuned for generic survey types e.g. Opinions
Our current conclusions
 Dictionaries help in the initial extraction
 But it is almost inevitable you will want to extend them to get to the specificity
of the study. If the study domain is very specific you can build your own
dictionaries in all 3 tools. A lot of social media monitoring starts with libraries of
regular expressions built from the ground up.
 Open Source tools like R and Rapid Miner will continue to improve with “packages” added by their user communities
 There is no “silver bullet”. The Refine step will typically require a lot of
manual input
 Especially in the initial “build” phase
 More is required on larger surveys
 But the ROI – in time and/or cost - should be clear
 And the results more robust and reliable

A journey into “Text Analytics”

Thank you & Questions

V.SUBBA RAO
BBA BA
184819
