
9/23/2014

Text Analytics Tutorial


SF Data Mining Meetup
September 22, 2014

Kilian Thiel, Rosaria Silipo, Cathy Pearl


KNIME.com AG, Zurich, Switzerland
www.knime.com
@KNIME

Rosaria.Silipo@knime.com
cpearl@gmail.com
Kilian.Thiel@knime.com

Copyright © 2014 KNIME.com AG

Tool Installation

• Download the open source KNIME Analytics Platform from:
http://www.knime.org/knime-analytics-platform-sdk-download
• Select the package for your OS and install it
• Open the KNIME application
• In the top menu, select “File” or “LOCAL” -> “Install KNIME Extensions”
• Install “KNIME & Extensions” and “KNIME Labs Extensions”


Install KNIME Extensions (incl. Text Processing)


Requirements to import and run Demo Workflows

• KNIME 2.10
• Text Processing Extension from KNIME Labs Extensions
• Distance Matrix from KNIME Extensions

Memory Tip
In the file knime.ini, set the maximum memory to as much as is available, for example:
• -Xmx3G


Resources
• The KNIME Website (www.knime.org)
• LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
• Use Cases and White Papers for example workflows, and

• FORUM for questions and answers


• DOCUMENTATION for documentation, FAQ, change-logs, ...
• LABS for new developments and experimental nodes
• COMMUNITY for development instructions and third party nodes
• Blog for news, tips and tricks (www.knime.org/blog)

• KNIME TV channel on YouTube
Text Mining Webinar: http://www.youtube.com/watch?v=tY7vpTLYlIg

• KNIME on Twitter: @KNIME

Resources
eBooks from the KNIME Press:
http://www.knime.org/knimepress

- KNIME Beginner’s Luck Guide: free with code “meetupsf14”
- The KNIME Cookbook
- The KNIME Booklet for SAS Users


Text Processing Steps


1. Import Data
2. Enrichment (Tagging)
3. Pre-processing (Filtering, Stemming, …)
4. Transformation (BoW, Frequencies, Document Vector)
5. Classification, Clustering

Cell types shown in the diagram: Document Type Cell, Term Type Cell


Import Demo Workflows

• Download the zip file with the demo workflows from the meetup site
• Open the KNIME application
• In the top menu, select File -> Import KNIME Workflow ...
• Enable the option “Select Archive File”
• Browse to the zip file
• Import all workflows and data into KNIME


Demo Workflows

0-TripAdvisorCrawling: Importing data from the web
1-Reading: Importing data from text, Word, PDF, Twitter, XML, …
2-Enrichment POS: String to Document and Word Tagging in Document
3-Preprocessing: Filtering and Stemming
4-Classification-Cuisine: BoW, Frequencies, Document to Document Vector
Other workflows for multi-words, clustering, topic extraction, and reporting.


Demo: The KNIME Workbench


Text Processing Category


Demo: TripAdvisor Restaurant Data Set (SF)


Demo: TripAdvisor Data (SF Restaurants)

Reviews about Italian and Chinese restaurants in San Francisco
• Chinese: 272
• Italian: 268


Demo: Goal of this Tutorial

Goal:
• Build a classifier to distinguish between Chinese and
Italian restaurants, based on the reviews.

Italian or Chinese
Restaurant?


Demo: Final Workflow


1.) Reading

Read/Parse textual data


Demo

Reading
• Read TripAdvisor data (.table file)
• Filter rows with missing restaurant value
• Convert strings to documents
• Filter all but the document column
• Examples of other possible formats to import


0.) Web Crawler Workflow

Palladian Extension from: KNIME Community Contributions – Other


Demo

Reading
• Web Crawler Workflow to get data from the Web
• Palladian Community Contributions Extension
• HtmlParser node
• XPath node


2.) Enrichment

Enrich documents with semantic information

This assigns a tag to each word:


- Grammar tags (POS)
- Context dependent tags
- Sentiment tags
- Named Entity tags
- Custom tags


Demo

Enrichment / Tagging
• Apply POS Tagger node
• Use Bag of Words node to inspect tagging result
• Show other possible Taggings


3.) Preprocessing

Preprocess documents and filter words


Demo

Preprocessing
• Filter
– Numbers
– Punctuation marks
– Stop Words
• Convert to lower case
• Stemming (Snowball stemmer, because it supports many languages)
• Keep only nouns (NN), verbs (VB), adjectives (JJ)
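
The filtering and lower-casing steps above can be sketched in plain Python. This is only an illustration of the idea, not the KNIME preprocessing nodes: the tokenizer regex, the tiny STOP_WORDS set, and the function name preprocess are assumptions made for this example. Stemming and the POS-based filter are left out because they need a stemmer and a tagger.

```python
import re

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "and", "was", "is", "of", "to", "in", "we"}


def preprocess(text):
    """Lower-case a text and drop numbers, punctuation marks, and stop words."""
    # Split into alphabetic words, digit runs, and single punctuation marks.
    tokens = re.findall(r"[A-Za-z]+|\d+|[^\w\s]", text)
    kept = []
    for token in tokens:
        if token.isdigit():
            continue  # filter numbers
        if not token.isalnum():
            continue  # filter punctuation marks
        token = token.lower()  # convert to lower case
        if token in STOP_WORDS:
            continue  # filter stop words
        kept.append(token)
    return kept


print(preprocess("We ordered 2 pizzas, and the pasta was GREAT!"))
```

The order matters: lower-casing before the stop word check lets "We" match the stop word "we".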


4.) Transformation

Creation of numerical representation of documents

BoW creates the list of words for each document
TF calculates word frequencies (absolute or relative) in each document


Demo

Transformation
• Transform to bag of words
• Compute TF value for terms
TFrel(word) = n(word)/N
IDF(word) = log(1 + n(docs)/n(word, docs))
TFrel(word) * IDF(word) is often used
ICF(word) = log(1 + n(cat)/n(word, cat))
• Sort output data by frequency
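
A small worked example may make the frequency formulas concrete. The sketch below reads the symbols as follows, which is an assumption since the slide does not define them: n(word) is the count of the word in one document, N is the total number of words in that document, n(docs) is the number of documents, and n(word, docs) is the number of documents containing the word. The natural logarithm is also assumed, and the toy corpus is made up.

```python
import math
from collections import Counter


def tf_rel(doc_tokens):
    """Relative term frequency: n(word) / N for each word of one document."""
    counts = Counter(doc_tokens)
    total = len(doc_tokens)
    return {word: count / total for word, count in counts.items()}


def idf(docs_tokens):
    """IDF(word) = log(1 + n(docs) / n(word, docs)), natural log assumed."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:
        doc_freq.update(set(tokens))  # count each word once per document
    return {word: math.log(1 + n_docs / df) for word, df in doc_freq.items()}


# Toy corpus (made up for illustration, not the TripAdvisor data).
docs = [
    ["pasta", "pizza", "pasta"],
    ["noodles", "rice", "pizza"],
    ["pasta", "wine"],
]

idf_values = idf(docs)
tf_first = tf_rel(docs[0])
tf_idf_first = {word: tf * idf_values[word] for word, tf in tf_first.items()}
print(tf_first)
print(tf_idf_first)
```

Under the same reading, ICF can be computed with the idf function by passing one combined token list per category instead of one per document.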


4.) Transformation

Creation of numerical representation of documents


Demo

Transformation
• Transform to document vectors
• Extract category (class) value
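
A document vector has one entry per vocabulary term and one vector per document. The sketch below assumes binary (present/absent) entries and a hypothetical "category" field that holds the class; the function name document_vectors and the toy documents are invented for illustration and are not the KNIME node output.

```python
def document_vectors(docs_tokens):
    """Return (vocabulary, vectors) with one 0/1 entry per vocabulary term."""
    vocabulary = sorted({token for tokens in docs_tokens for token in tokens})
    vectors = []
    for tokens in docs_tokens:
        present = set(tokens)
        vectors.append([1 if term in present else 0 for term in vocabulary])
    return vocabulary, vectors


# Invented toy documents; "category" stands in for the class value.
documents = [
    {"category": "Italian", "tokens": ["pasta", "pizza", "pasta"]},
    {"category": "Chinese", "tokens": ["noodles", "rice"]},
]

vocabulary, vectors = document_vectors([d["tokens"] for d in documents])
labels = [d["category"] for d in documents]  # extract the class value

print(vocabulary)
for label, vector in zip(labels, vectors):
    print(label, vector)
```

The 0/1 entries could be replaced by TF or TF*IDF values without changing the layout of the vectors.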


5.) Classification

Back to classical Data Analytics: training of a model (decision tree) and scoring


Demo

Classification
• Append color based on class
• Partition data into training and test set
• Train decision tree model on training data
• Apply decision tree model on test data
• Score model, measure accuracy
• Show cross-validation loop
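
The partition, train, apply, and score steps can be sketched without any library. To stay short, the example swaps the full decision tree for a one-level tree (a decision stump) over 0/1 document vectors, so it is a simplification and not the KNIME Decision Tree Learner. The rows, the 70/30 split, the seed, and all function names are assumptions made for illustration.

```python
import random


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle rows and split them into a training and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def majority(labels, default):
    """Most frequent label, or `default` when the list is empty."""
    return max(set(labels), key=labels.count) if labels else default


def train_stump(rows):
    """Pick the single 0/1 feature whose split misclassifies the fewest rows."""
    overall = majority([label for _, label in rows], None)
    best = None
    for index in range(len(rows[0][0])):
        with_term = [label for vector, label in rows if vector[index] == 1]
        without_term = [label for vector, label in rows if vector[index] == 0]
        guess_1 = majority(with_term, overall)
        guess_0 = majority(without_term, overall)
        errors = sum(label != guess_1 for label in with_term)
        errors += sum(label != guess_0 for label in without_term)
        if best is None or errors < best[0]:
            best = (errors, index, guess_0, guess_1)
    return best[1:]  # (feature index, label if absent, label if present)


def predict(stump, vector):
    """Apply the stump: look at one feature and return the matching label."""
    index, guess_0, guess_1 = stump
    return guess_1 if vector[index] == 1 else guess_0


def accuracy(stump, rows):
    """Fraction of rows whose predicted label equals the true label."""
    return sum(predict(stump, v) == label for v, label in rows) / len(rows)


# Invented rows: (document vector over ["pasta", "rice", "wine"], class).
rows = [
    ([1, 0, 1], "Italian"), ([1, 0, 0], "Italian"), ([1, 0, 1], "Italian"),
    ([0, 1, 0], "Chinese"), ([0, 1, 1], "Chinese"), ([0, 1, 0], "Chinese"),
    ([1, 0, 0], "Italian"), ([0, 1, 0], "Chinese"), ([1, 0, 1], "Italian"),
    ([0, 1, 1], "Chinese"),
]

train, test = partition(rows)
stump = train_stump(train)
print("accuracy on test set:", accuracy(stump, test))
```

A cross-validation loop would repeat these steps over several different partitions and average the resulting accuracies.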


Additional Workflows

• Multi Word Tagging
– Detection of frequent N-grams (Ngram Creator)
– Creation of dictionary from N-grams
– Applying Dictionary Tagger
• Classification with Multi Words
• Clustering of documents
– hierarchical clustering based on distance matrix
• Topic Extraction
– Topic Extractor (Parallel LDA)
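
The multi-word tagging idea (detect frequent N-grams, build a dictionary from them, then tag documents with it) can be sketched as follows. This is a rough illustration restricted to bigrams. The function names frequent_ngrams and tag_multi_words, the min_count threshold, and the toy documents are assumptions, and the code does not reproduce the KNIME Ngram Creator or Dictionary Tagger nodes.

```python
from collections import Counter


def frequent_ngrams(docs_tokens, n=2, min_count=2):
    """Count n-grams across all documents and keep the frequent ones."""
    counts = Counter()
    for tokens in docs_tokens:
        for start in range(len(tokens) - n + 1):
            counts[tuple(tokens[start:start + n])] += 1
    return {ngram: c for ngram, c in counts.items() if c >= min_count}


def tag_multi_words(tokens, dictionary):
    """Merge adjacent tokens that form a dictionary bigram into one term."""
    merged, i = [], 0
    while i < len(tokens):
        if tuple(tokens[i:i + 2]) in dictionary:
            merged.append(" ".join(tokens[i:i + 2]))
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


# Invented toy documents, already tokenized and lower-cased.
docs = [
    ["great", "dim", "sum", "place"],
    ["dim", "sum", "was", "fresh"],
    ["fresh", "pasta"],
]

dictionary = set(frequent_ngrams(docs, n=2, min_count=2))
print(dictionary)
print(tag_multi_words(docs[0], dictionary))
```

After tagging, a multi-word such as "dim sum" is treated as one term in the bag of words, which is what allows classification with multi-words.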


Thank You

Questions
• http://tech.knime.org/forum
• Rosaria.Silipo@knime.com
Follow us
• Twitter: @KNIME
• LinkedIn: https://www.linkedin.com/groups?gid=2212172
• KNIME Blog: http://www.knime.org/blog
