
9/23/2014

Text Analytics Tutorial


SF Data Mining Meetup
September 22, 2014

Kilian Thiel, Rosaria Silipo, Cathy Pearl


KNIME.com AG, Zurich, Switzerland
www.knime.com
@KNIME

Rosaria.Silipo@knime.com
cpearl@gmail.com
Kilian.Thiel@knime.com

Copyright 2014 KNIME.com AG

Tool Installation

Download the open source KNIME Analytics Platform from:
http://www.knime.org/knime-analytics-platform-sdk-download
Select the package for your OS and install it
Open the KNIME application
In the top menu, select File or LOCAL -> Install KNIME Extensions
Install KNIME & Extensions and KNIME Labs Extensions


Install KNIME Extensions (incl. Text Processing)


Requirements to import and run Demo Workflows

KNIME 2.10
Text Processing Extension from KNIME Labs Extensions
Distance Matrix from KNIME Extensions

Memory Tip
In the file knime.ini, set the maximum memory to what is available, e.g.
-Xmx3G
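
The memory tip comes down to a one-line change in knime.ini. The following is a minimal Python sketch of that change, assuming the file lists one JVM option per line after -vmargs. The helper name set_max_heap and the sample file contents are hypothetical; only the 3G value comes from the slide.

```python
def set_max_heap(ini_text: str, size: str = "3G") -> str:
    """Return ini_text with the JVM max-heap option set to -Xmx<size>."""
    new_line = f"-Xmx{size}"  # JVM syntax: no space between -Xmx and the size
    lines = ini_text.splitlines()
    replaced = False
    for i, line in enumerate(lines):
        if line.strip().startswith("-Xmx"):
            lines[i] = new_line
            replaced = True
    if not replaced:
        # Assumption: options at the end of the file are JVM arguments.
        lines.append(new_line)
    return "\n".join(lines) + "\n"


# Hypothetical knime.ini contents, for illustration only.
sample = "-vmargs\n-Xmx1024m\n-Dsome.option=true\n"
print(set_max_heap(sample))
```

Editing the file by hand in a text editor works just as well; the sketch only makes the expected form of the line explicit.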


Resources
The KNIME Website (www.knime.org)
LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
Use Cases and White Papers for example workflows
FORUM for questions and answers
DOCUMENTATION for documentation, FAQ, change-logs, ...
LABS for new developments and experimental nodes
COMMUNITY for development instructions and third party nodes
Blog for news, tips and tricks (www.knime.org/blog)

KNIME TV channel on YouTube
Text Mining Webinar: http://www.youtube.com/watch?v=tY7vpTLYlIg

KNIME on Twitter: @KNIME

Resources
eBooks from the KNIME Press:
http://www.knime.org/knimepress

- KNIME Beginner's Luck Guide (free, use code meetupsf14)
- The KNIME Cookbook
- The KNIME Booklet for SAS Users


Text Processing Steps


1. Import Data
2. Enrichment (Tagging)
3. Pre-processing (Filtering, Stemming, ...)
4. Transformation (BoW, Frequencies, Document Vector)
5. Classification, Clustering

Data types used along the way: Document Type Cell, Term Type Cell


Import Demo Workflows

Download the zip file with the demo workflows from the meetup site
Open the KNIME application
In the top menu, select File -> Import KNIME Workflow ...
Enable the option Select Archive File
Browse to the zip file
Import all workflows and data into KNIME


Import Demo Workflows


Demo Workflows

0-TripAdvisorCrawling: Importing data from the web
1-Reading: Importing data from text, Word, PDF, Twitter, XML, ...
2-Enrichment POS: String to Document and Word Tagging in Document
3-Preprocessing: Filtering and Stemming
4-Classification-Cuisine: BoW, Frequencies, Document to Document Vector
Other workflows for multi-words, clustering, topic extraction, and reporting.


Demo: The KNIME Workbench


Text Processing Category


Demo: TripAdvisor Restaurant Data Set (SF)


Demo: TripAdvisor Data (SF Restaurants)

Reviews about Italian and Chinese restaurants in San Francisco
Chinese: 272
Italian: 268


Demo: Goal of this Tutorial

Goal:
Build a classifier to distinguish between Chinese and
Italian restaurants, based on the reviews.

Italian or Chinese Restaurant?


Demo: Final Workflow



1.) Reading

Read/Parse textual data


Demo

Reading
Read TripAdvisor data (.table file)
Filter rows with missing restaurant value
Convert strings to documents
Filter all but the document column
Examples of other possible formats to import
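
Outside KNIME, the reading steps above can be approximated as follows. This is a rough Python sketch, not the workflow itself: the Document class is a toy stand-in for a KNIME document cell, the column names restaurant, review and cuisine are assumptions, and the sample rows are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Document:
    """Toy stand-in for a document cell: title, text and category."""
    title: str
    text: str
    category: str


def rows_to_documents(rows):
    """Drop rows without a restaurant value, then keep only the documents."""
    docs = []
    for row in rows:
        if not row.get("restaurant"):  # missing or empty restaurant value
            continue
        docs.append(Document(title=row["restaurant"],
                             text=row["review"],
                             category=row["cuisine"]))
    return docs


# Made-up rows; the real data comes from the .table file.
rows = [
    {"restaurant": "Trattoria A", "review": "Fresh pasta.", "cuisine": "Italian"},
    {"restaurant": None, "review": "No name given.", "cuisine": "Chinese"},
]
for doc in rows_to_documents(rows):
    print(doc)
```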


0.) Web Crawler Workflow

Palladian Extension from:
KNIME Community Contributions - Other


Demo

Reading
Web Crawler Workflow to get data from the Web
Palladian Community Contributions Extension
HtmlParser node
XPath node
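
The parse-then-query idea behind the HtmlParser and XPath nodes can be sketched in plain Python. The example assumes a well-formed XHTML snippet and uses the limited XPath subset of the standard library's ElementTree. The page content, the class name review and the function extract_reviews are hypothetical; real pages usually need a tolerant HTML parser first, which is the role of the HtmlParser node.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment for illustration only.
PAGE = """<html><body>
<div class="review"><p>Great dim sum.</p></div>
<div class="ad"><p>Buy now</p></div>
<div class="review"><p>Fresh pasta.</p></div>
</body></html>"""


def extract_reviews(xhtml: str):
    """Return the text of every <p> directly inside a div of class 'review'."""
    root = ET.fromstring(xhtml)
    return [p.text for p in root.findall(".//div[@class='review']/p")]


print(extract_reviews(PAGE))
```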


2.) Enrichment

Enrich documents with semantic information

This assigns a tag to each word:


- Grammar tags (POS)
- Context dependent tags
- Sentiment tags
- Named Entity tags
- Custom tags
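
To make "a tag for each word" concrete, here is a deliberately simple dictionary-lookup tagger in Python. It is closer in spirit to a custom or dictionary tag than to a real POS tagger, which is statistical and context dependent. The lexicon, the function name tag_words and the UNKNOWN default are assumptions for illustration; the tag names follow the Penn Treebank style used elsewhere in these slides (NN, VB, JJ).

```python
def tag_words(text, lexicon, default_tag="UNKNOWN"):
    """Return (word, tag) pairs using a plain dictionary lookup."""
    return [(word, lexicon.get(word.lower(), default_tag))
            for word in text.split()]


# Tiny hand-made lexicon; a real tagger learns this from annotated text.
LEXICON = {"we": "PRP", "ordered": "VBD", "delicious": "JJ", "pasta": "NN"}

print(tag_words("We ordered delicious pasta", LEXICON))
```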


Demo

Enrichment / Tagging
Apply POS Tagger node
Use Bag of Words node to inspect tagging result
Show other possible Taggings


3.) Preprocessing

Preprocess documents and filter words


Demo

Preprocessing
Filter:
- Numbers
- Punctuation marks
- Stop words
Convert to lower case
Stemming (Snowball stemmer, because it supports many languages)
Keep only nouns (NN), verbs (VB), adjectives (JJ)
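
The preprocessing chain above can be sketched in a few lines of Python. Note the assumptions: the stop word list is a tiny illustrative set, toy_stem is a crude suffix stripper and not the Snowball algorithm, and the input is assumed to be already tagged (word, POS tag) pairs. The stop word check runs after lower-casing here so that "The" and "the" are treated alike.

```python
import string

STOP_WORDS = {"the", "a", "and", "was", "were", "is", "of"}  # illustrative only
KEEP_TAGS = ("NN", "VB", "JJ")  # nouns, verbs, adjectives (tag prefixes)


def toy_stem(word):
    """Crude suffix stripper; NOT the Snowball algorithm."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(tagged_terms):
    """Turn (word, pos_tag) pairs into a list of filtered, stemmed terms."""
    result = []
    for word, tag in tagged_terms:
        word = word.strip(string.punctuation)
        if not word or any(ch.isdigit() for ch in word):
            continue  # punctuation-only or numeric term
        word = word.lower()
        if word in STOP_WORDS:
            continue
        if not tag.startswith(KEEP_TAGS):
            continue  # keep nouns, verbs and adjectives only
        result.append(toy_stem(word))
    return result


terms = [("The", "DT"), ("noodles", "NNS"), ("were", "VBD"),
         ("amazing", "JJ"), (",", ","), ("2", "CD"), ("stars", "NNS")]
print(preprocess(terms))
```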


4.) Transformation

Creation of numerical representation of documents

BoW creates the list of words for each document
TF calculates word frequencies (absolute or relative) in each document


Demo

Transformation
Transform to bag of words
Compute TF value for terms
TFrel(word) = n(word) / N
IDF(word) = log(1 + n(docs) / n(word, docs))
TFrel(word) * IDF(word) is often used
ICF(word) = log(1 + n(cat) / n(word, cat))
Sort output data by frequency
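
The frequency measures on this slide translate directly into Python. The sketch below reads n(word) as the count of a word in one document, N as the number of terms in that document, n(docs) as the number of documents, n(word, docs) as the number of documents containing the word, and likewise for categories. Those readings, the natural logarithm and all function names are assumptions, since the slide does not spell them out. The sample documents are made up.

```python
import math
from collections import Counter


def tf_rel(doc_terms):
    """TFrel(word) = n(word) / N for one document given as a list of terms."""
    counts = Counter(doc_terms)
    total = len(doc_terms)
    return {word: count / total for word, count in counts.items()}


def idf(docs):
    """IDF(word) = log(1 + n(docs) / n(word, docs)) over a list of term lists."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(1 + n_docs / df) for word, df in doc_freq.items()}


def icf(docs_by_category):
    """ICF(word) = log(1 + n(cat) / n(word, cat)); input maps category -> docs."""
    n_cat = len(docs_by_category)
    cat_freq = Counter()
    for docs in docs_by_category.values():
        cat_freq.update({word for doc in docs for word in doc})
    return {word: math.log(1 + n_cat / cf) for word, cf in cat_freq.items()}


docs = [["pasta", "pizza", "pasta"], ["noodle", "rice"], ["pasta", "rice"]]
idf_values = idf(docs)
tf_idf = {w: tf * idf_values[w] for w, tf in tf_rel(docs[0]).items()}
# Sort by weight, highest first, similar to sorting the output by frequency.
for word, weight in sorted(tf_idf.items(), key=lambda kv: kv[1], reverse=True):
    print(word, round(weight, 3))
```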


4.) Transformation

Creation of numerical representation of documents


Demo

Transformation
Transform to document vectors
Extract category (class) value
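
A document vector has one position per term in the vocabulary. The Python sketch below builds binary (present or absent) vectors and pulls out the class value as a separate list. The dictionary layout with terms and category keys and the function names are assumptions; a TF or TF*IDF weight could be used in place of the 0/1 entries.

```python
def build_vocabulary(docs):
    """Sorted list of all distinct terms across the documents."""
    return sorted({term for doc in docs for term in doc["terms"]})


def to_vector(doc, vocabulary):
    """Binary vector: 1 if the vocabulary term occurs in the document."""
    present = set(doc["terms"])
    return [1 if term in present else 0 for term in vocabulary]


# Made-up documents after preprocessing.
docs = [
    {"terms": ["pasta", "pizza"], "category": "Italian"},
    {"terms": ["noodle", "rice"], "category": "Chinese"},
]
vocab = build_vocabulary(docs)
X = [to_vector(doc, vocab) for doc in docs]   # document vectors
y = [doc["category"] for doc in docs]         # extracted class values
print(vocab, X, y)
```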


5.) Classification

Back to classical Data Analytics:
Training of a model (decision tree) and scoring


Demo

Classification
Append color based on class
Partition data into training and test set
Train decision tree model on training data
Apply decision tree model on test data
Score model, measure accuracy
Show cross-validation loop
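
The partition, train, apply and score steps can be illustrated with a dependency-free Python sketch. To stay short it trains a decision stump, that is, a decision tree of depth one on binary document vectors, which is a simplification and not the learner used in the workflow. All names, the 70/30 split, the seed and the toy data are assumptions.

```python
import random
from collections import Counter


def partition(rows, train_fraction=0.7, seed=42):
    """Shuffle reproducibly and split into training and test rows."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def majority(labels):
    return Counter(labels).most_common(1)[0][0]


def train_stump(rows):
    """Pick the single binary feature that best separates the labels.

    rows: list of (binary_vector, label).
    Returns (feature_index, label_if_0, label_if_1).
    """
    fallback = majority([label for _, label in rows])
    best = None
    for j in range(len(rows[0][0])):
        side = {0: [], 1: []}
        for x, label in rows:
            side[x[j]].append(label)
        pred = {v: majority(ls) if ls else fallback for v, ls in side.items()}
        correct = sum(pred[x[j]] == label for x, label in rows)
        if best is None or correct > best[0]:
            best = (correct, j, pred[0], pred[1])
    return best[1:]


def predict(stump, x):
    j, label_if_0, label_if_1 = stump
    return label_if_1 if x[j] else label_if_0


def accuracy(stump, rows):
    return sum(predict(stump, x) == label for x, label in rows) / len(rows)


# Toy data: feature 0 = contains "pasta", feature 1 = contains "noodle".
data = [([1, 0], "Italian")] * 10 + [([0, 1], "Chinese")] * 10
train, test = partition(data)
stump = train_stump(train)
print("accuracy on test set:", accuracy(stump, test))
```

A cross-validation loop would repeat the same train and score steps over several different partitions and average the accuracies.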


Additional Workflows

Multi Word Tagging
- Detection of frequent Ngrams (Ngram Creator)
- Creation of dictionary from Ngrams
- Applying Dictionary Tagger
- Classification with Multi Words
Clustering of documents
- Hierarchical clustering based on distance matrix
Topic Extraction
- Topic Extractor (Parallel LDA)
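
The first two multi-word steps, detecting frequent N-grams and turning them into a dictionary, can be sketched in Python as follows. The function names, the bigram default, the minimum count of 2 and the sample token lists are assumptions for illustration; the resulting set plays the role of the dictionary that a dictionary tagger would then apply.

```python
from collections import Counter


def ngrams(tokens, n=2):
    """All consecutive n-token sequences of one tokenized document."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def frequent_ngrams(docs, n=2, min_count=2):
    """Set of n-grams (joined by spaces) seen at least min_count times."""
    counts = Counter(gram for tokens in docs for gram in ngrams(tokens, n))
    return {" ".join(gram) for gram, count in counts.items()
            if count >= min_count}


# Made-up tokenized reviews.
docs = [
    ["dim", "sum", "was", "great"],
    ["best", "dim", "sum", "in", "town"],
    ["great", "pasta"],
]
dictionary = frequent_ngrams(docs)
print(dictionary)
```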


Thank You

Questions
http://tech.knime.org/forum
Rosaria.Silipo@knime.com

Follow us
Twitter: @KNIME
LinkedIn: https://www.linkedin.com/groups?gid=2212172
KNIME Blog: http://www.knime.org/blog
