Welcome to Scribd!

What Is Text Mining?: Ted Kwartler

Uploaded by

0% found this document useful (0 votes)

5 views17 pages

This document discusses text mining using a bag-of-words approach in R. It describes the text mining workflow, including problem definition, text collection, preprocessing, feature extraction, and analysis. Preprocessing steps covered include removing numbers, punctuation, and abbreviations. Methods for generating term-document matrices and document-term matrices from preprocessed text are presented, as well as stemming words and creating a word frequency matrix.

Original Description:

Original Title

chapter1

Copyright

Available Formats

PDF, TXT or read online from Scribd

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Report this Document

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

0% found this document useful (0 votes)

5 views17 pages

What Is Text Mining?: Ted Kwartler

Uploaded by

itamar_cordeiro

Copyright:

Available Formats

Download as PDF, TXT or read online from Scribd

Flag for inappropriate content

Jump to Page

You are on page 1of 17

Search inside document

What is text mining?

T E X T M I N I N G W I T H B A G - O F -W O R D S I N R

Ted Kwartler
Instructor
What is text mining?
The process of distilling actionable insights from text

TEXT MINING WITH BAG-OF-WORDS IN R

Text mining workflow
1 - Problem de nition & speci c goals

2 - Identify text to be collected

3 - Text organization

4 - Feature extraction

5 - Analysis

6 - Reach an insight, recommendation, or

output

TEXT MINING WITH BAG-OF-WORDS IN R

Semantic parsing vs. bag of words

TEXT MINING WITH BAG-OF-WORDS IN R

Let's practice!
T E X T M I N I N G W I T H B A G - O F -W O R D S I N R
Getting started
T E X T M I N I N G W I T H B A G - O F -W O R D S I N R

Ted Kwartler
Instructor
Building our first corpus
# Load corpus
coffee_tweets <- read.csv("coffee.csv", stringsAsFactors = FALSE)
# Vector of tweets
coffee_tweets <- coffee_tweets$text
# View first 5 tweets
head(coffee_tweets, 5)

[1] "@ayyytylerb that is so true drink lots of coffee"

[2] "RT @bryzy_brib: Senior March tmw morning at 7:25 A.M. in the SENIOR lot. Get up early,
[3] "If you believe in #gunsense tomorrow would be a very good day to have your coffee any p
[4] "My cute coffee mug. http://t.co/2udvMU6XIG"
[5] "RT @slaredo21: I wish we had Starbucks here... Cause coffee dates in the morning sound

TEXT MINING WITH BAG-OF-WORDS IN R

Let's practice!
T E X T M I N I N G W I T H B A G - O F -W O R D S I N R
Cleaning and
preprocessing text
T E X T M I N I N G W I T H B A G - O F -W O R D S I N R

Ted Kwartler
Instructor
Common preprocessing functions

TEXT MINING WITH BAG-OF-WORDS IN R

Preprocessing in practice

# Make a vector source: coffee_source

coffee_source <- VectorSource(coffee_tweets)
# Make a volatile corpus: coffee_corpus
coffee_corpus <- VCorpus(coffee_source)
# Apply various preprocessing functions
tm_map(coffee_corpus, removeNumbers)
tm_map(coffee_corpus, removePunctuation)
tm_map(coffee_corpus, content_transformer(replace_abbreviation))

TEXT MINING WITH BAG-OF-WORDS IN R

Another preprocessing step: word stemming
# Stem words
stem_words <- stemDocument(c("complicatedly", "complicated","complication"))
stem_words

"complic" "complic" "complic"

# Complete words using single word dictionary

stemCompletion(stem_words, c("complicate"))

complic complic complic

"complicate" "complicate" "complicate"

# Complete words using entire corpus

stemCompletion(stem_words, my_corpus)

TEXT MINING WITH BAG-OF-WORDS IN R

Let's practice!
T E X T M I N I N G W I T H B A G - O F -W O R D S I N R
The TDM & DTM
T E X T M I N I N G W I T H B A G - O F -W O R D S I N R

Ted Kwartler
Instructor
TDM vs. DTM

# Generate TDM
coffee_tdm <- TermDocumentMatrix(clean_corp)
# Generate DTM
coffee_dtm <- DocumentTermMatrix(clean_corp)

TEXT MINING WITH BAG-OF-WORDS IN R

Word Frequency Matrix (WFM)
# Load qdap package
library(qdap)
# Generate word frequency matrix
coffee_wfm <- wfm(coffee_text$text)

TEXT MINING WITH BAG-OF-WORDS IN R

Let's practice!
T E X T M I N I N G W I T H B A G - O F -W O R D S I N R

Data 101 Complete PDF
Document603 pages
Data 101 Complete PDF
Sami Almuallim
No ratings yet
Additional Exercice S Data Science
Document3 pages
Additional Exercice S Data Science
Frank
No ratings yet
Text Mining With Bag of Words in R - 1 PDF
Document17 pages
Text Mining With Bag of Words in R - 1 PDF
donne4real
No ratings yet
Keyword Clustering
Document15 pages
Keyword Clustering
JACOB RACHUONYO
No ratings yet
Python NLP
Document15 pages
Python NLP
Pierre Tibokbe
No ratings yet
Natural Language Processing With RNNs .Ipynb - Colaboratory
Document15 pages
Natural Language Processing With RNNs .Ipynb - Colaboratory
zb lai
No ratings yet
NLP Manual (1-12)
Document55 pages
NLP Manual (1-12)
sj120cp
No ratings yet
NLP Manual (1-12)
Document54 pages
NLP Manual (1-12)
sj120cp
No ratings yet
Chapter 2
Document22 pages
Chapter 2
saifelbob2002
No ratings yet
Chapter2 PDF
Document22 pages
Chapter2 PDF
NourheneMbarek
No ratings yet
NLP - Twitter Sentiment Analysis With Tensorflow - Sebastian Correa - Medium
Document13 pages
NLP - Twitter Sentiment Analysis With Tensorflow - Sebastian Correa - Medium
Sebastian Correa
No ratings yet
NLP Manual (1-12) 1
Document56 pages
NLP Manual (1-12) 1
sj120cp
No ratings yet
Phylogenetic Comparative Methods in R Using Phylogeny As A Statistical Fix
Document22 pages
Phylogenetic Comparative Methods in R Using Phylogeny As A Statistical Fix
Sara Gamboa Jurado
No ratings yet
Sree017 NLP
Document3 pages
Sree017 NLP
Rahul Jaiswal
No ratings yet
Experiment No: 5 BE-COMP-B-26 Aim: Tools: Theory:: Implement Stop Word Removal Techniques. Python
Document2 pages
Experiment No: 5 BE-COMP-B-26 Aim: Tools: Theory:: Implement Stop Word Removal Techniques. Python
ROHIT SELVAM6
No ratings yet
Deep Learning in Practice Project Two: NLP of The Holy Quran in Python
Document11 pages
Deep Learning in Practice Project Two: NLP of The Holy Quran in Python
shoaib riaz
No ratings yet
Practical and Effective Neural NER
Document31 pages
Practical and Effective Neural NER
wcc32
No ratings yet
Language Engineering - Section
Document24 pages
Language Engineering - Section
asmaa soliman
No ratings yet
Assignment II - NLP - D
Document4 pages
Assignment II - NLP - D
Sumit Das
No ratings yet
Text Mining in R: A Tutorial
Document7 pages
Text Mining in R: A Tutorial
meenana
No ratings yet
NLP For ML - Spam Classifier
Document14 pages
NLP For ML - Spam Classifier
Thomas West
No ratings yet
AI Phash3
Document11 pages
AI Phash3
techusama4
No ratings yet
Cours 4 - Loading and Preprocessing Data With TensorFlow
Document23 pages
Cours 4 - Loading and Preprocessing Data With TensorFlow
Sarah Bouammar
No ratings yet
2-Data Collection and Preperation
Document43 pages
2-Data Collection and Preperation
Menna Saed
No ratings yet
Text Mining Package and Datacleaning: #Cleaning The Text or Text Transformation
Document6 pages
Text Mining Package and Datacleaning: #Cleaning The Text or Text Transformation
Arush sambyal
No ratings yet
Advanced NLP With Spacy Chapter2
Document28 pages
Advanced NLP With Spacy Chapter2
Fgpeqw
100% (1)
Regression Trees Chapter2
Document21 pages
Regression Trees Chapter2
Mikistli Yowaltekutli
No ratings yet
5 Paso S Text Mining
Document4 pages
5 Paso S Text Mining
Ken Matsuda
No ratings yet
CS1026: Assignment 3 Sentiment Analysis: Due: November 13, 2019 at 9:00pm. Weight: 12%
Document5 pages
CS1026: Assignment 3 Sentiment Analysis: Due: November 13, 2019 at 9:00pm. Weight: 12%
Janice Xu
No ratings yet
Introduction To R: Maximilian Kasy
Document35 pages
Introduction To R: Maximilian Kasy
Easye Troublej
No ratings yet
Python Chatbot Project
Document10 pages
Python Chatbot Project
Vanitha G
No ratings yet
09 Rohit Jujaray NLP Experiments
Document24 pages
09 Rohit Jujaray NLP Experiments
NEMAT KHAN
No ratings yet
NLTK
Document16 pages
NLTK
Honey
No ratings yet
NLP Lab Manual (R20)
Document24 pages
NLP Lab Manual (R20)
Gopi Naveen
100% (1)
NLP Aat-2 16-03-2023
Document9 pages
NLP Aat-2 16-03-2023
btms
No ratings yet
NLP Preparing The Text Data (Part I)
Document2 pages
NLP Preparing The Text Data (Part I)
learnit learnit
No ratings yet
A Tutorial of Text Mining in R Using TM Package
Document6 pages
A Tutorial of Text Mining in R Using TM Package
Angel Montilla
No ratings yet
Write Text To PDF With Itextsharp in
Document2 pages
Write Text To PDF With Itextsharp in
cafjnk
0% (1)
Comprehensive Guide Data Exploration Sas Using Python Numpy Scipy Matplotlib Pandas
Document12 pages
Comprehensive Guide Data Exploration Sas Using Python Numpy Scipy Matplotlib Pandas
Ahsan Ahmad Beg
100% (1)
Module03 Embeddings
Document102 pages
Module03 Embeddings
D
No ratings yet
Machine Learning: Suprit Shrestha (19BCE2584)
Document20 pages
Machine Learning: Suprit Shrestha (19BCE2584)
Suprit D. Shrestha
No ratings yet
Recurrent Neural Networks Tutorial, Part 2
Document16 pages
Recurrent Neural Networks Tutorial, Part 2
hoja
No ratings yet
Text Mining - Creating Tidy Text AFIT Data Science Lab R Programming Guide
Document11 pages
Text Mining - Creating Tidy Text AFIT Data Science Lab R Programming Guide
Angel Montilla
No ratings yet
Building Transformer Models With Attention Crash Course Build A Neural Machine Translator in 12 Days
Document33 pages
Building Transformer Models With Attention Crash Course Build A Neural Machine Translator in 12 Days
Nam Nguyen
No ratings yet
Moiz Compiler
Document11 pages
Moiz Compiler
Ammar Khan
No ratings yet
Problem 1 Proposal
Document24 pages
Problem 1 Proposal
uthayakumar j
No ratings yet
RDataMining Slides Twitter Analysis
Document40 pages
RDataMining Slides Twitter Analysis
Sidney GM
100% (1)
Keras For Beginners: Implementing A Recurrent Neural Network
Document13 pages
Keras For Beginners: Implementing A Recurrent Neural Network
Robert DEL POPOLO
No ratings yet
Dax 1677526120
Document8 pages
Dax 1677526120
b5gbf67b5p
No ratings yet
R Tutorial
Document11 pages
R Tutorial
udhai170819
No ratings yet
NLP Part
Document19 pages
NLP Part
아이 커Iker
No ratings yet
Neural Translation Model (Capstone Project)
Document20 pages
Neural Translation Model (Capstone Project)
Arthur Carvalho
No ratings yet
Sample Resume For ETL Testing 2+yrs
Document4 pages
Sample Resume For ETL Testing 2+yrs
Jinendraabhi
100% (3)
Ammar Compiler
Document11 pages
Ammar Compiler
Ammar Khan
No ratings yet
Introduction To Regular Expressions: Katharine Jarmul
Document31 pages
Introduction To Regular Expressions: Katharine Jarmul
NourheneMbarek
No ratings yet
Introduction To R and Rstudio, R Script, Calling Functions, Running Code
Document10 pages
Introduction To R and Rstudio, R Script, Calling Functions, Running Code
keexuepin
No ratings yet
Data Science With R Text Mining by Graham Williams
Document21 pages
Data Science With R Text Mining by Graham Williams
Anda Roxana Nenu
No ratings yet
SC Lecture01 PDF
Document7 pages
SC Lecture01 PDF
GUODI LIU
No ratings yet
Unix Assignment 02
Document3 pages
Unix Assignment 02
Prashant Upadhye
No ratings yet
SQL Server: Tips and Tricks - 2
From Everand
SQL Server: Tips and Tricks - 2
Priyanka Agarwal
Rating: 4.5 out of 5 stars
4.5/5 (3)
Computer Programming: A Simplified Entry to Python, Java, and C++ Programming for Beginners
From Everand
Computer Programming: A Simplified Entry to Python, Java, and C++ Programming for Beginners
Lena Neill
No ratings yet
Energy Research & Social Science
Document16 pages
Energy Research & Social Science
itamar_cordeiro
No ratings yet
Analyzing Tourist Data On Twitter: A Case Study in The Province of Granada at Spain
Document38 pages
Analyzing Tourist Data On Twitter: A Case Study in The Province of Granada at Spain
itamar_cordeiro
No ratings yet
Slow Tourism On Instagram: An Image Content and Geotag Analysis
Document9 pages
Slow Tourism On Instagram: An Image Content and Geotag Analysis
itamar_cordeiro
No ratings yet
MANDEL - Marxist Theory of The State (October 1969)
Document17 pages
MANDEL - Marxist Theory of The State (October 1969)
itamar_cordeiro
No ratings yet
Geographical Research On Tourism, Recreation and Leisure: Origins, Eras and Directions
Document20 pages
Geographical Research On Tourism, Recreation and Leisure: Origins, Eras and Directions
itamar_cordeiro
No ratings yet
Sms Spam Filtering Pres
Document18 pages
Sms Spam Filtering Pres
jexef21647
No ratings yet
1904 08067
Document68 pages
1904 08067
Hitesh
No ratings yet
Graph-Based Text Representations PPT
Document14 pages
Graph-Based Text Representations PPT
Saikat Mondal
No ratings yet
Chapter V - Working With Text Data
Document30 pages
Chapter V - Working With Text Data
Halbeega Waayaha
No ratings yet
Orange3 Text PDF
Document53 pages
Orange3 Text PDF
fajrina rina
No ratings yet
Bag of Words
Document32 pages
Bag of Words
ravinyse
No ratings yet
Introduction To Text Mining
Document82 pages
Introduction To Text Mining
Rajesh Siraskar
No ratings yet
Orange3 Text Mining Documentation: Biolab
Document53 pages
Orange3 Text Mining Documentation: Biolab
Karin Muñoz
No ratings yet
Basics of Bag of Words Model
Document32 pages
Basics of Bag of Words Model
Ganesh Chandrasekaran
No ratings yet
Sentiment Analysis With KNIME
Document45 pages
Sentiment Analysis With KNIME
Reshma Hemant Patel
100% (1)
Chapter2 PDF
Document22 pages
Chapter2 PDF
NourheneMbarek
No ratings yet
Artificial Intelligence Class 10 Notes
Document18 pages
Artificial Intelligence Class 10 Notes
Satyam Singla
100% (1)
Distributed Representations of Sentences and Documents: Quoc Le Tomas Mikolov
Document9 pages
Distributed Representations of Sentences and Documents: Quoc Le Tomas Mikolov
Franz Liszt
No ratings yet
Understanding Bag-Of-Words Model: A Statistical Framework
Document10 pages
Understanding Bag-Of-Words Model: A Statistical Framework
Ik Ram
No ratings yet
5.2 Natural Language Processing
Document43 pages
5.2 Natural Language Processing
punit mishra
No ratings yet
Glossary of Artificial Intelligence
Document62 pages
Glossary of Artificial Intelligence
jchondro
No ratings yet
Effective Classification of Text
Document6 pages
Effective Classification of Text
seventhsensegroup
No ratings yet