Rosaria.Silipo@knime.com
cpearl@gmail.com
Kilian.Thiel@knime.com
Tool Installation
9/23/2014
• KNIME 2.10
• Text Processing Extension from KNIME Labs Extensions
• Distance Matrix from KNIME Extensions
Memory Tip
In the knime.ini file, set the maximum Java heap size to the memory available, for example:
• -Xmx3G
Resources
• The KNIME Website (www.knime.org)
• LEARNING HUB under RESOURCES (www.knime.org/learning-hub)
• Use Cases and White Papers for example workflows
• KNIME TV channel on YouTube
  Text Mining Webinar: http://www.youtube.com/watch?v=tY7vpTLYlIg
• KNIME on Twitter: @KNIME
Copyright © 2014 KNIME.com AG
Resources
eBooks from the KNIME Press:
http://www.knime.org/knimepress
- KNIME Beginner’s Luck Guide (free with code “meetupsf14”)
- The KNIME Cookbook
- The KNIME Booklet for SAS Users
[Figure: text processing pipeline. Recoverable labels: 2. Enrichment (Tagging); 4. Transformation (BoW, Frequencies, Document Vector); Document Type Cell; Term Type Cell]
Demo Workflows
Italian or Chinese Restaurant?
Goal:
• Build a classifier to distinguish between Chinese and
Italian restaurants, based on the reviews.
1.) Reading
Demo
Reading
• Read Tripadvisor data (.table file)
• Filter rows with missing restaurant value
• Convert strings to documents
• Filter all but the document column
• Examples of other possible formats to import
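Outside KNIME, the reading steps above can be sketched in Python. This is only an illustration: the column names `restaurant` and `review` and the inline CSV sample are assumptions, while the demo itself reads a KNIME .table file using KNIME nodes.

```python
import csv
import io

# Hypothetical sample data; the real demo reads a KNIME .table file instead.
SAMPLE = """restaurant,review
Italian,Great pasta and friendly staff
,Review with no restaurant value
Chinese,The dumplings were excellent
"""

def read_documents(text):
    """Return (category, document) pairs, skipping rows with a missing restaurant value."""
    rows = csv.DictReader(io.StringIO(text))
    # Keep only the category and the review text; all other columns are dropped.
    return [(r["restaurant"], r["review"]) for r in rows if r["restaurant"].strip()]

documents = read_documents(SAMPLE)
```

Each pair keeps the restaurant category next to the review text so that the class can be extracted again later.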
Demo
Reading
• Web Crawler Workflow to get data from the Web
• Palladian Community Contributions Extension
• HtmlParser node
• Xpath node
2.) Enrichment
Demo
Enrichment / Tagging
• Apply POS Tagger node
• Use Bag of Words node to inspect tagging result
• Show other possible Taggings
3.) Preprocessing
Demo
Preprocessing
• Filter
– Numbers
– Punctuation marks
– Stop Words
• Convert to lower case
• Stemming (Snowball stemmer, chosen because it supports
many languages)
• Keep only nouns (NN), verbs (VB), adjectives (JJ)
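The filtering and lower-casing steps above can be sketched in plain Python. This is a minimal sketch, not the KNIME implementation: the stop word list is a tiny made-up sample, lower-casing is applied before filtering for simplicity, and stemming and the POS-based filter are left out because they need a stemmer and a tagger.

```python
import re

# Tiny illustrative stop word list (an assumption, not KNIME's built-in list).
STOP_WORDS = {"the", "a", "an", "and", "was", "were", "is", "of", "to"}

# A token is a run of letters/digits, optionally followed by an apostrophe suffix.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:'[A-Za-z]+)?")

def preprocess(text):
    """Lower-case the text, then drop punctuation, number tokens, and stop words."""
    tokens = [t.lower() for t in TOKEN.findall(text)]   # punctuation is not matched
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]  # number filter
    return [t for t in tokens if t not in STOP_WORDS]   # stop word filter

tokens = preprocess("The 2 pizzas were GREAT, and cheap!")
```

For the sample sentence this leaves only the tokens `pizzas`, `great`, and `cheap`.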
4.) Transformation
Demo
Transformation
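• Transform to bag of words
• Compute TF value for terms
  TFrel(word) = n(word) / N
  IDF(word) = log(1 + n(docs) / n(word, docs))
  ICF(word) = log(1 + n(cat) / n(word, cat))
  Here n(word) is the number of occurrences of the word in a document and N the number of terms in that document; n(docs) is the number of documents and n(word, docs) the number of documents containing the word; n(cat) and n(word, cat) are the same counts over categories.
  The product TFrel(word) * IDF(word) is often used as the term weight.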
• Sort output data by frequency
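The frequency measures on this slide can be sketched in a few lines of Python. The sketch assumes documents are already tokenized lists of terms and uses the natural logarithm; the slide does not state the log base, so that choice is an assumption, as are the sample documents.

```python
import math
from collections import Counter

def relative_tf(tokens):
    """TFrel(word) = n(word) / N for a single tokenized document."""
    total = len(tokens)
    if total == 0:
        return {}
    return {word: n / total for word, n in Counter(tokens).items()}

def idf(documents):
    """IDF(word) = log(1 + n(docs) / n(word, docs)) over tokenized documents."""
    n_docs = len(documents)
    # Count each word once per document to get the document frequency.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(1 + n_docs / df) for word, df in doc_freq.items()}

# Made-up sample documents for illustration.
docs = [["pasta", "pizza", "pasta"], ["rice", "noodles"], ["pizza", "wine"]]
idf_values = idf(docs)
tf_idf = [{w: tf * idf_values[w] for w, tf in relative_tf(d).items()} for d in docs]

# Sort the terms of the first document by descending weight.
ranked = sorted(tf_idf[0].items(), key=lambda kv: kv[1], reverse=True)
```

ICF follows the same pattern as `idf`, with categories in place of documents.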
4.) Transformation
Demo
Transformation
• Transform to document vectors
• Extract category (class) value
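As an illustration of these two steps, the sketch below builds binary document vectors (1 if a term occurs in the document, 0 otherwise) and pulls the class out of labeled documents. The binary encoding, the sample data, and the `(category, tokens)` pair layout are assumptions made for this example.

```python
def document_vectors(documents):
    """Return (vocabulary, vectors) with one 0/1 entry per vocabulary term."""
    vocabulary = sorted({word for doc in documents for word in doc})
    vectors = []
    for doc in documents:
        present = set(doc)
        vectors.append([1 if word in present else 0 for word in vocabulary])
    return vocabulary, vectors

# Hypothetical labeled documents: (category, tokens).
labeled = [
    ("Italian", ["pasta", "pizza"]),
    ("Chinese", ["rice", "noodles"]),
    ("Italian", ["pizza", "wine"]),
]

classes = [category for category, _ in labeled]          # extract class value
vocabulary, vectors = document_vectors([tokens for _, tokens in labeled])
```

Each row of `vectors` lines up with the entry at the same position in `classes`, which is the table layout a classifier expects.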
5.) Classification
Demo
Classification
• Append color based on class
• Partition data into training and test set
• Train decision tree model on training data
• Apply decision tree model on test data
• Score model, measure accuracy
• Show cross-validation loop
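The partitioning, scoring, and cross-validation steps can be sketched in plain Python. To keep the example dependency-free, a majority-class baseline stands in for the decision tree; it is explicitly not a decision tree, and the sample rows, the 70/30 split, and the fold count are assumptions.

```python
import random
from collections import Counter

def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle rows and split them into a training and a test set."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def train_majority(train_rows):
    """Stand-in learner (NOT a decision tree): always predicts the most common class."""
    most_common = Counter(label for _, label in train_rows).most_common(1)[0][0]
    return lambda features: most_common

def accuracy(model, test_rows):
    """Fraction of test rows whose predicted class equals the true class."""
    hits = sum(1 for features, label in test_rows if model(features) == label)
    return hits / len(test_rows)

def cross_validate(rows, k=3):
    """k-fold loop: every row is used for testing exactly once."""
    folds = [rows[i::k] for i in range(k)]
    scores = []
    for i, test_rows in enumerate(folds):
        train_rows = [r for j, fold in enumerate(folds) if j != i for r in fold]
        scores.append(accuracy(train_majority(train_rows), test_rows))
    return scores

# Hypothetical (document vector, class) rows.
rows = [([0, 1, 1], "Italian"), ([1, 0, 0], "Chinese"), ([0, 1, 0], "Italian"),
        ([1, 0, 1], "Chinese"), ([0, 0, 1], "Italian"), ([0, 1, 1], "Italian")]
train, test = partition(rows)
model = train_majority(train)
score = accuracy(model, test)
fold_scores = cross_validate(rows)
```

Swapping `train_majority` for a real decision tree learner leaves the partitioning, scoring, and cross-validation code unchanged.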
Additional Workflows
Thank You
Questions
• http://tech.knime.org/forum
• Rosaria.Silipo@knime.com
Follow us
• Twitter: @KNIME
• LinkedIn: https://www.linkedin.com/groups?gid=2212172
• KNIME Blog: http://www.knime.org/blog