
TEXT ANALYTICS

V.SUBBA RAO
184819
BBA BA
Contents
 Background & Objectives
 Our current view on Text Analytics
 Value
 Process
 An example application
 Conclusions
Background
 Text Analytics and Text Mining are largely synonymous
 Interest and execution of Text Analytics is growing
 Social Media sources are largely responsible for this
 And that often means “Big Data”

 This should lead to further improvements in technology and methodology which will benefit
survey practitioners
Objectives

 We’ve been involved in more Text Analytics work in the last 2 months than in all previous years
 Our objective in this presentation is to share some of our experience and thoughts around some of the technology we have used
The Value Propositions

1. Reduce cost (and time)

*http://wp.eaagle.com/

2. Generate actionable insights
 Improve public and commercial processes

Using Text Analytics to find Text Analytics software

http://www.isvworld.com/..0
Software tools

 R
• Open Source Statistical Platform
• Command driven

 Rapid Miner
• Open Source Data Mining Workbench
• GUI
• Built on R and Weka

 SPSS Text Analytics for Surveys
• Commercial Text Analytics
• GUI
The Process – Highest Level

Unstructured data → Structured data
Process – Level 2

1. Extract

2. Refine

3. Analyse
How can we tell if we are using
the right tool(s)?
Extract
• How good is the first extraction?
• How long to get to an acceptable extraction?
Refine
• How easy is it to refine?
• How easy is it to capture refinements to re-use them in future?
Analyse
• What tools exist to support the Text Analytics process?
• What tools exist to use the Structured Text in other analyses?

How well do the tools/methods deliver on the value propositions?


Algorithms and Dictionaries

1. Extract
Algorithms
• e.g. Natural Language Processing (NLP)

Dictionaries
• Variously called Lexicons, Resources, Libraries, etc.
• Are usually contextual e.g. Customer Satisfaction
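As a rough sketch of how such a contextual dictionary is applied, the lookup below maps extracted terms to the concepts they express. The terms and concepts are invented for illustration, not taken from any real lexicon.

```r
# Hypothetical mini-dictionary for a Customer Satisfaction context:
# a named vector mapping surface terms to concepts (illustrative only)
satisfaction_dict <- c(
  "helpful"   = "Service",
  "friendly"  = "Service",
  "cheap"     = "Price",
  "expensive" = "Price"
)

tokens <- c("friendly", "staff", "but", "expensive")
concepts <- satisfaction_dict[tokens]   # NA where no dictionary entry
concepts <- concepts[!is.na(concepts)]
print(unname(concepts))
# [1] "Service" "Price"
```

In practice the dictionary is far larger and the unmatched tokens (the NAs above) are exactly what the Refine step inspects for omissions.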
Example Data

 The American Physical Society (APS)


 Student Survey Comments from 2009 (Base=1304)
 Q4.2 Comments about the best features of and what could be added or improved to the special programs for Student Members*

*http://www.aps.org/about/governance/committees/commemb/upload/2009-student-comments.pdf
The first extraction with R
library("tm", lib.loc = "C:/Users/jmcconnell/Documents/R/win-library/3.0")

APS2009df <- read.csv("C:/AP/ASC/APS/APS2009Verbatims.csv", header = TRUE)

# note: VectorSource expects a character vector of documents;
# passing the text column itself is safer than the whole data frame
text_corpus <- Corpus(VectorSource(APS2009df),
                      readerControl = list(language = "en"))

summary(text_corpus)  # check what went in

text_corpus <- tm_map(text_corpus, removeNumbers)
text_corpus <- tm_map(text_corpus, removePunctuation)
text_corpus <- tm_map(text_corpus, stripWhitespace)
# in current versions of tm, base functions like tolower must be
# wrapped in content_transformer()
text_corpus <- tm_map(text_corpus, content_transformer(tolower))

We apply a basic set of text handling methods (simple NLP), e.g. removePunctuation
We also apply a small dictionary of known “Stopwords” (not shown)
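The same cleaning pipeline can be sketched in base R without the tm package, which makes each transformation explicit. The stopword list here is a tiny illustrative subset, not the dictionary used in the study.

```r
# Base-R sketch of the extraction cleaning steps (illustrative only)
clean_text <- function(x, stopwords = c("the", "and", "a", "of")) {
  x <- gsub("[0-9]+", "", x)          # removeNumbers
  x <- gsub("[[:punct:]]+", "", x)    # removePunctuation
  x <- tolower(x)                     # tolower
  x <- gsub("\\s+", " ", trimws(x))   # stripWhitespace
  tokens <- strsplit(x, " ")[[1]]
  tokens[!tokens %in% stopwords]      # stopword removal
}

clean_text("The Job Fair of 2009 was great, and busy!")
# [1] "job" "fair" "was" "great" "busy"
```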
R Extraction Results – Top 20 Terms
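The frequency table behind a “Top 20 Terms” chart can be produced in base R from already-cleaned tokens. The tokens below are toy data, not the APS comments.

```r
# Count and rank term frequencies from cleaned tokens (toy data)
tokens <- c("research", "fair", "job", "fair", "travel", "fair", "job")
freq <- sort(table(tokens), decreasing = TRUE)
head(freq, 2)
# fair = 3, job = 2
```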
The first extraction with Rapid Miner

We visually construct a similar set of steps


Rapid Miner Extraction Results – Top 20 Terms
Improving and creating new data
2. Refine

Improve the extraction
• Correct mistakes
• Add omissions

Map the extraction to structured data
• Group and combine meaningful terms that will become data for further analysis

In second and subsequent waves (where applicable) Refine should be a shorter step where we look for new concepts
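A minimal sketch of the mapping step: the captured refinements are just a reusable term-to-group table (the names below are invented), and applying it yields one structured column per group.

```r
# Reusable refinement table: extracted term -> concept group (illustrative)
refinements <- c(
  "jobfair" = "Job Fair", "job" = "Job Fair", "fair" = "Job Fair",
  "lodging" = "Accommodation", "housing" = "Accommodation"
)

# Cleaned tokens per respondent (toy data)
terms_per_respondent <- list(
  c("job", "fair", "travel"),
  c("housing")
)

# TRUE where a respondent mentioned any term in the "Job Fair" group
mentions_job_fair <- vapply(
  terms_per_respondent,
  function(tk) any(refinements[tk] == "Job Fair", na.rm = TRUE),
  logical(1)
)
mentions_job_fair
# [1]  TRUE FALSE
```

In later waves the same table is re-applied, and only genuinely new terms need manual attention.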
Rapid Miner – Refine

We add one process step to fix up some of the issues in the first extraction
Filter Tokens sets a lower limit for the length of an extracted term/attribute
Rapid Miner results after first refinement
The first extraction with SPSS

SPSS uses a Wizard to specify the extraction steps
SPSS Extraction Results – Top 20 Terms

SPSS is counting respondents, not occurrences
Synonyms are used from the dictionaries


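The difference between the two counting schemes can be illustrated in base R on toy responses: occurrence counts tally every match, respondent counts tally each response at most once.

```r
# Respondent counts vs occurrence counts for one term (toy data)
responses <- c("great fair, great talks", "great housing", "quiet")
hits <- gregexpr("great", responses, fixed = TRUE)  # -1 means no match

occurrences <- sum(vapply(hits, function(m) sum(m > 0), integer(1)))
respondents <- sum(vapply(hits, function(m) any(m > 0), logical(1)))

c(occurrences = occurrences, respondents = respondents)
# occurrences = 3, respondents = 2
```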
Adding WordNet to our R (/Rapid Miner) analysis

library("wordnet")
setDict("C:/Wordnet/WordNet-3.0/dict")
synonyms("excellent", "ADJECTIVE")

[1] "excellent" "fantabulous" "first-class" "splendid"
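One way the synonym list can feed back into the extraction is to collapse every synonym to a canonical term before counting. Here the WordNet output above is hard-coded rather than queried live, so the sketch runs without the wordnet package.

```r
# Collapse known synonyms of "excellent" to one canonical term
# (synonym vector copied from the WordNet query above)
syns <- c("excellent", "fantabulous", "first-class", "splendid")
canonicalise <- function(tokens) {
  tokens[tokens %in% syns] <- "excellent"
  tokens
}
canonicalise(c("splendid", "fair", "first-class"))
# [1] "excellent" "fair" "excellent"
```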


Analytics to aid refinement
Job … Fair

Students are asking for more “stuff” at the job fair
R Extraction Results – Top 20 Terms
Onward to analysis
Key Drivers of Recommendation*

Teaching quality: 40%
Support Services: 30%
Accommodation: 20%
Job Fair – Would like more: 10%

*This is an anonymised example
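A key-driver chart like this is typically produced by regressing an overall recommendation score on the structured mention flags from the Refine step. The sketch below uses purely synthetic data, not the anonymised study above.

```r
# Hypothetical key-driver analysis on synthetic data (illustration only)
set.seed(1)
n <- 200
drivers <- data.frame(
  teaching = rbinom(n, 1, 0.5),  # flag: mentioned teaching quality
  job_fair = rbinom(n, 1, 0.3)   # flag: mentioned the job fair
)
# Synthetic outcome: teaching is built to matter more than the job fair
drivers$recommend <- 5 + 2 * drivers$teaching +
  1 * drivers$job_fair + rnorm(n)

fit <- lm(recommend ~ teaching + job_fair, data = drivers)
coef(fit)  # teaching coefficient near 2, job_fair near 1
```

Relative coefficient (or importance) sizes are then rescaled into the percentage shares shown on the chart.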


3. Analyse
Onward to analysis
R
• In R we are in a statistical platform already
• Text Analytics outputs are part of the data in the current “Workspace”
• For Research style charts and tables we may need to export data

Rapid Miner
• In RM we are in a Data Mining platform already
• Text Analytics is part of the current process flow

SPSS Text Analytics for Surveys


• Data needs to be exported elsewhere for Analysis
• To SPSS .sav, Excel or Data Collection
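For R, the export mentioned above can be as simple as writing the structured results to CSV for charting elsewhere; the file name and contents here are illustrative.

```r
# Export structured Text Analytics output for use in other tools
out <- data.frame(term = c("fair", "job"), count = c(3, 2))
path <- file.path(tempdir(), "top_terms.csv")  # illustrative location
write.csv(out, path, row.names = FALSE)
read.csv(path)  # round-trip check
```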

A High Level Comparison

Attribute: R | Rapid Miner | SPSS TAfS

Help & Support: Lots of user-generated content | Lots of user-generated content; paid support option | Paid support

Usability: Low-level coding control | Visual programming | Visual UI

Scalability: R in itself isn’t too scalable but many scalable implementations exist e.g. Revolution, Hadoop | Radoop | We experienced issues with data sets around 100,000 cases*

Extensibility: Various options | Various options | None

Automation: Can be run in batch | Can run in batch | None

Overall: Great for the coder and those familiar with R | The power of R with a GUI | The most graphical and tuned for generic survey types e.g. Opinions
Our current conclusions
 Dictionaries help in the initial extraction
 But it is almost inevitable you will want to extend them to get to the specificity
of the study. If the study domain is very specific you can build your own
dictionaries in all 3 tools. A lot of social media monitoring starts with libraries of
regular expressions built from the ground up.
 Open Source tools like R and Rapid Miner will continue to improve with “packages” added by their user communities
 There is no “silver bullet”. The Refine step will typically require a lot of
manual input
 Especially in the initial “build” phase
 More is required on larger surveys
 But the ROI – in time and/or cost - should be clear
 And the results more robust and reliable

A journey into “Text Analytics”

Thank you & Questions

V.SUBBA RAO
BBA BA
184819
