V.SUBBA RAO
184819
BBA BA
Contents
Background & Objectives
Our current view on Text Analytics
Value
Process
An example application
Conclusions
Background
Text Analytics and Text Mining are largely synonymous
Interest in, and use of, Text Analytics is growing
Social Media sources are largely responsible for this
And that often means “Big Data”
This should lead to further improvements in technology and methodology which will benefit
survey practitioners
Objectives
The Value Propositions
*http://wp.eaagle.com/
http://www.isvworld.com/..0
Software tools
R
• Open Source Statistical Platform
• Command driven
Rapid Miner
• Open Source Data Mining Workbench
• GUI
• Built on R and Weka
The Process – Highest Level
Unstructured data → Structured data
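To make the unstructured-to-structured step concrete, here is a minimal sketch in plain Python (not any of the tools discussed in this deck). It turns free-text comments into a table of term counts, one row per comment. The function names and the sample comments are invented for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Lower-case a comment and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

def document_term_matrix(comments):
    """Return (vocabulary, rows): one row of term counts per comment."""
    counts = [Counter(tokenize(c)) for c in comments]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Made-up comments, for illustration only
comments = ["Great speakers, great venue", "The venue was too cold"]
vocabulary, rows = document_term_matrix(comments)
```

Each column of `rows` is now an ordinary numeric variable that could be used alongside closed-ended survey data.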
Process – Level 2
1. Extract
2. Refine
3. Analyse
How can we tell if we are using
the right tool(s)?
Extract
• How good is the first extraction?
• How long to get to an acceptable extraction?
Refine
• How easy is it to refine?
• How easy is it to capture refinements to re-use them in
future?
Analyse
• What tools exist to support the Text Analytics process?
• What tools exist to use the Structured Text in other analyses?
1. Extract
Algorithms
• e.g. Natural Language Processing (NLP)
Dictionaries
• Variously called Lexicons, Resources, Libraries, etc.
• Are usually contextual e.g. Customer Satisfaction
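A dictionary-based extraction can be sketched roughly as follows, in plain Python. The dictionary contents and the function name are hypothetical; the dictionaries shipped with real tools are far larger and tuned to a context such as Customer Satisfaction.

```python
import re

# Hypothetical mini-dictionary: concept -> terms that signal it
DICTIONARY = {
    "Accommodation": {"hotel", "room", "accommodation"},
    "Catering": {"food", "lunch", "coffee"},
}

def extract_concepts(comment, dictionary=DICTIONARY):
    """Return the sorted concepts whose terms occur in the comment."""
    tokens = set(re.findall(r"[a-z]+", comment.lower()))
    return sorted(c for c, terms in dictionary.items() if tokens & terms)
```

For example, `extract_concepts("The hotel room was noisy but the coffee was good")` tags the comment with both concepts, while a comment that uses none of the listed terms gets no tags at all. That gap is why dictionaries usually need extending for a specific study.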
Example Data
*http://www.aps.org/about/governance/committees/commemb/upload/2009-student-comments.pdf
The first extraction with R
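# Load the tm text-mining package from the user library
library("tm", lib.loc = "C:/Users/jmcconnell/Documents/R/win-library/3.0")
# Read the verbatim comments into a data frame
APS2009df <- read.csv("C:/AP/ASC/APS/APS2009Verbatims.csv", header = TRUE)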
Terms
The first extraction with Rapid Miner
– Top 20 Terms
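A "Top 20 Terms" list of this kind can be approximated with a simple frequency count. The sketch below is plain Python, not the R or Rapid Miner process used for the slides; the stop-word list is a small illustrative assumption.

```python
import re
from collections import Counter

# Small illustrative stop-word list; real tools ship much longer ones
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "to", "of", "in", "it"}

def top_terms(comments, n=20, stop_words=STOP_WORDS):
    """Count non-stop-word tokens across all comments and return the n most frequent."""
    counts = Counter()
    for comment in comments:
        tokens = re.findall(r"[a-z]+", comment.lower())
        counts.update(t for t in tokens if t not in stop_words)
    return counts.most_common(n)
```

The result is a list of `(term, count)` pairs, most frequent first, which is the raw material for the Refine step.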
Improving and creating new data
2. Refine
We add one process step to fix up some of the issues in the first extraction
Filter Tokens sets a lower limit for the length of an extracted term/attribute
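The effect of a minimum-length filter can be illustrated in a few lines of plain Python. The threshold of 3 characters is an assumption for the example; the slides do not state the limit that was used.

```python
def filter_tokens_by_length(tokens, min_chars=3):
    """Keep only tokens with at least min_chars characters."""
    return [t for t in tokens if len(t) >= min_chars]
```

Applied to `["i", "am", "happy", "ok", "venue"]`, this drops the one- and two-character tokens, which are rarely meaningful as extracted terms.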
Rapid Miner results after first refinement
The first extraction with SPSS
20 Terms
(/RapidMiner) analysis
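# Load the WordNet interface and point it at the local dictionary files
library("wordnet")
setDict("C:/Wordnet/WordNet-3.0/dict")
# Look up synonyms of "excellent" as an adjective
synonyms("excellent", "ADJECTIVE")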
Terms
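Once synonyms are known, one way to use them is to fold variant terms into a single canonical term before counting. The sketch below is plain Python with a hand-written synonym group standing in for a WordNet lookup; the group shown is an assumption, not the output of the call above.

```python
from collections import Counter

# Hypothetical synonym groups: canonical term -> variants to fold into it
SYNONYMS = {"excellent": {"great", "superb", "first-rate"}}

def build_lookup(synonyms):
    """Invert the groups into a variant -> canonical term mapping."""
    lookup = {}
    for canonical, variants in synonyms.items():
        for variant in variants:
            lookup[variant] = canonical
    return lookup

def merge_synonyms(term_counts, synonyms=SYNONYMS):
    """Add each variant's count to its canonical term; leave other terms unchanged."""
    lookup = build_lookup(synonyms)
    merged = Counter()
    for term, count in term_counts.items():
        merged[lookup.get(term, term)] += count
    return merged
```

With this mapping, counts for "great" are added to "excellent" and the term list gets shorter and less fragmented.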
Onward to analysis
Key Drivers of Recommendation*
Accommodation 20%
Rapid Miner
• In RM we are in a Data Mining platform already
• Text Analytics is part of the current process flow
A High Level Comparison
| Attribute | R | Rapid Miner | SPSS TAfS |
|---|---|---|---|
| Help & Support | Lots of User Generated Content | Lots of User Generated Content; Paid support option | Paid support |
| Usability | Low level coding control | Visual programming | Visual UI |
| Scalability | R in itself isn’t too scalable but many scalable implementations exist e.g. Revolution, Hadoop | Radoop | We experienced issues with data sets around 100,000 cases* |
| Extensibility | Various options | Various options | None |
| Overall | Great for the coder. Those familiar with R | The power of R with a GUI | The most graphical and tuned for generic survey types e.g. Opinions |
Our current conclusions
Dictionaries help in the initial extraction
But it is almost inevitable you will want to extend them to get to the specificity
of the study. If the study domain is very specific you can build your own
dictionaries in all 3 tools. A lot of social media monitoring starts with libraries of
regular expressions built from the ground up.
Open Source tools like R and Rapid Miner will continue to improve with
“packages” added by the R community
There is no “silver bullet”. The Refine step will typically require a lot of
manual input
Especially in the initial “build” phase
More is required on larger surveys
But the ROI – in time and/or cost - should be clear
And the results more robust and reliable
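The libraries of regular expressions mentioned above for social media monitoring can be as simple as a mapping from concept to pattern. A minimal Python sketch, with hypothetical concepts and patterns:

```python
import re

# Hypothetical pattern library: concept -> compiled regular expression
PATTERNS = {
    "Price": re.compile(r"\b(price[sd]?|cost(s|ly)?|expensive|cheap)\b", re.IGNORECASE),
    "Wifi": re.compile(r"\bwi-?fi\b", re.IGNORECASE),
}

def tag_post(post, patterns=PATTERNS):
    """Return the concepts whose pattern matches anywhere in the post."""
    return [concept for concept, pattern in patterns.items() if pattern.search(post)]
```

Each pattern captures spelling variants ("wifi", "wi-fi", "WiFi") that a plain term list would miss, at the cost of being written and maintained by hand.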
A journey into “Text Analytics”
Thank-you & Questions