
1/18/2018 Data Science Capstone - Week 2 Milestone - Exploratory Data Analysis on Text Files

Data Science Capstone - Week 2


Milestone - Exploratory Data Analysis on
Text Files
Leandro Freitas
10/26/2017

1. Executive Summary
The goal of this project is to perform an exploratory data analysis on text files as part of the Week 2 activities of the Data Science Specialization SwiftKey Capstone. The data for the analysis can be downloaded from the link below:

https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip

2. Preparing Environment
2.1. Loading Libraries
Loading required packages:

# Set a seed so the sampling below is reproducible
set.seed(500)
library(ggplot2)
library(knitr)
library(RWeka)
library(SnowballC)
library(tm)
library(wordcloud)

Complementary information:

sessionInfo()

https://rstudio-pubs-static.s3.amazonaws.com/323145_6c395a8d69e6441d90c3abd94f67a5ce.html 1/7

## R version 3.4.1 (2017-06-30)
## Platform: x86_64-w64-mingw32/x64 (64-bit)
## Running under: Windows 10 x64 (build 15063)
##
## Matrix products: default
##
## locale:
## [1] LC_COLLATE=Portuguese_Brazil.1252 LC_CTYPE=Portuguese_Brazil.1252
## [3] LC_MONETARY=Portuguese_Brazil.1252 LC_NUMERIC=C
## [5] LC_TIME=Portuguese_Brazil.1252
##
## attached base packages:
## [1] stats graphics grDevices utils datasets methods base
##
## other attached packages:
## [1] wordcloud_2.5 RColorBrewer_1.1-2 tm_0.7-1
## [4] NLP_0.1-11 SnowballC_0.5.1 RWeka_0.4-34
## [7] knitr_1.17 ggplot2_2.2.1 RevoUtilsMath_10.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.12 magrittr_1.5 RWekajars_3.9.1-3
## [4] munsell_0.4.3 colorspace_1.3-2 rlang_0.1.2
## [7] stringr_1.2.0 plyr_1.8.4 tools_3.4.1
## [10] parallel_3.4.1 grid_3.4.1 gtable_0.2.0
## [13] htmltools_0.3.6 yaml_2.1.14 lazyeval_0.2.0
## [16] rprojroot_1.2 digest_0.6.12 tibble_1.3.4
## [19] rJava_0.9-8 slam_0.1-40 evaluate_0.10.1
## [22] rmarkdown_1.6 stringi_1.1.5 compiler_3.4.1
## [25] RevoUtils_10.0.5 scales_0.5.0 backports_1.1.0

2.2. Loading Datasets


# Read text files
Blogs <- readLines("./source/en_US.blogs.txt")
News <- readLines("./source/en_US.news.txt")
Twitter <- readLines("./source/en_US.twitter.txt")
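These files are known to contain embedded nulls and non-ASCII characters, so a plain readLines() call typically emits warnings. A more defensive variant, shown here for one file as a sketch (the connection is opened in binary mode so line endings are handled consistently):

```r
# Open in binary mode, declare UTF-8, and skip embedded nulls
con <- file("./source/en_US.twitter.txt", open = "rb")
Twitter <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
```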

2.2.1. Basic summaries of the three files


Blogs_Summary <- c(sum(nchar(Blogs)),
                   length(unlist(strsplit(Blogs, " "))),
                   format(object.size(Blogs), units = "Mb"))

News_Summary <- c(sum(nchar(News)),
                  length(unlist(strsplit(News, " "))),
                  format(object.size(News), units = "Mb"))

Twitter_Summary <- c(sum(nchar(Twitter)),
                     length(unlist(strsplit(Twitter, " "))),
                     format(object.size(Twitter), units = "Mb"))

var_names <- c("Characters", "Words", "Size")

summary_files <- data.frame(Blogs_Summary, News_Summary, Twitter_Summary, row.names = var_names)
names(summary_files) <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
kable(summary_files, align = "c")

|            | en_US.blogs.txt | en_US.news.txt | en_US.twitter.txt |
|------------|:---------------:|:--------------:|:-----------------:|
| Characters |    208361438    |    15683765    |     162384825     |
| Words      |    37334131     |    2643969     |     30373543      |
| Size       |    248.5 Mb     |    19.2 Mb     |     301.4 Mb      |

2.3. Preparing Data

2.3.1. Sampling and Corpus
Since the source files are large, a sample will be taken from each one to do the analysis:

Sample_Text <- rbind(sample(Blogs, 10000),
                     sample(News, 10000),
                     sample(Twitter, 10000))

# Remove the large source objects, which are no longer needed
rm(Blogs, News, Twitter)
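Sampling a fixed 10,000 lines per file weights the smaller news file more heavily than the others. A hypothetical alternative (the sample_lines helper below is not part of the original analysis) draws a fixed fraction instead, so the sample stays proportional to each file's size:

```r
# Sample a fraction of the lines of x rather than a fixed count
sample_lines <- function(x, frac = 0.01) {
  x[sample.int(length(x), size = ceiling(frac * length(x)))]
}
```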

Now create a corpus (collection of text documents) from the sample texts:

Corpus_ST <- Corpus(VectorSource(Sample_Text))

2.3.2. Clean and prep data for analysis

Corpus_ST <- tm_map(Corpus_ST, removeWords, stopwords("english"))
Corpus_ST <- tm_map(Corpus_ST, removePunctuation)
Corpus_ST <- tm_map(Corpus_ST, removeNumbers)
Corpus_ST <- tm_map(Corpus_ST, stripWhitespace)
Corpus_ST <- tm_map(Corpus_ST, content_transformer(tolower))
Corpus_ST <- tm_map(Corpus_ST, stemDocument)
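Note that tm_map applies the transformations in the order listed, and stopwords("english") contains only lower-case words, so capitalized stopwords such as "The" and "I" survive the removeWords step when it runs before lower-casing; this is why "i" dominates the bigram counts reported below. A sketch of an alternative ordering (which would change the reported counts) under the same tm pipeline:

```r
# Lower-case first so that capitalized stopwords ("The", "I") match too
Corpus_ST <- tm_map(Corpus_ST, content_transformer(tolower))
Corpus_ST <- tm_map(Corpus_ST, removeWords, stopwords("english"))
Corpus_ST <- tm_map(Corpus_ST, removePunctuation)
Corpus_ST <- tm_map(Corpus_ST, removeNumbers)
Corpus_ST <- tm_map(Corpus_ST, stripWhitespace)
Corpus_ST <- tm_map(Corpus_ST, stemDocument)
```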


3. Exploratory Data Analysis

3.1. Finding n-grams

# Function for tokenizing the corpus into n-grams of length i
f_tokenizer <- function(corpus, i) {
  temp <- NGramTokenizer(corpus, Weka_control(min = i, max = i))
  ngram <- data.frame(table(temp))
  return(ngram)
}

# Find n-grams
ngram_US_2 <- f_tokenizer(Corpus_ST, 2)
ngram_US_4 <- f_tokenizer(Corpus_ST, 4)

3.1.1. Most used sequences of 2 and 4 words

ngram_US_2 <- ngram_US_2[order(ngram_US_2$Freq, decreasing = TRUE), ]
ngram_US_4 <- ngram_US_4[order(ngram_US_4$Freq, decreasing = TRUE), ]

head(ngram_US_2, 10)

## temp Freq
## 168895 i think 543
## 168103 i know 394
## 168164 i love 326
## 169016 i want 314
## 167449 i can 308
## 169058 i will 273
## 168083 i just 236
## 194669 last year 231
## 402541 year ago 186
## 168141 i like 175

head(ngram_US_4, 10)


## temp Freq
## 283048 me me me me 36
## 207375 i feel like i 16
## 479751 ugli ugli ugli ugli 14
## 207482 i felt like i 7
## 209752 i know i know 7
## 214658 i think i can 7
## 451947 the new york time 7
## 206733 i don’t know i 5
## 214729 i think im go 5
## 208866 i hope i can 4

3.1.2. Plot most used sequences of 2 words

Bigrams <- ngram_US_2[order(ngram_US_2$Freq, decreasing = TRUE), ]
colnames(Bigrams) <- c("Bigram", "Frequency")
Bigrams <- Bigrams[1:10, ]

barplot(Bigrams$Frequency, las = 2,
        names.arg = Bigrams$Bigram,
        col = "lightgreen", main = "Top 10 Bigrams",
        ylab = "Frequency")

3.1.3. Plot most used sequences of 4 words


Quadgrams <- ngram_US_4[order(ngram_US_4$Freq, decreasing = TRUE), ]
colnames(Quadgrams) <- c("Quadgram", "Frequency")
Quadgrams <- Quadgrams[1:10, ]

barplot(Quadgrams$Frequency, las = 2,
        names.arg = Quadgrams$Quadgram,
        col = "lightblue", main = "Top 10 Quadgrams",
        ylab = "Frequency")

3.2. Most Common Words

3.2.1. Top 50 words used in the texts
Matrix_US <- DocumentTermMatrix(Corpus_ST)
Matrix_US <- removeSparseTerms(Matrix_US, 0.99)
frequency <- colSums(as.matrix(Matrix_US))
order_freq <- order(frequency, decreasing=TRUE)
frequency[head(order_freq,50)]


## the one will said get like just time can year
## 4988 2854 2848 2838 2282 2245 2225 2182 2093 2037
## make day new work know now good love say peopl
## 1775 1641 1547 1528 1418 1359 1352 1337 1311 1302
## want think also use but look first see thing back
## 1297 1277 1267 1244 1199 1190 1186 1186 1156 1150
## two and need come last take even way much this
## 1147 1142 1127 1126 1124 1086 1072 1057 957 956
## week state start realli well right still great play game
## 924 919 918 910 904 872 864 823 818 816
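Since ggplot2 is already loaded, the same term frequencies can also be shown as a sorted bar chart instead of the raw printout. A minimal sketch, reusing the frequency and order_freq objects computed above (the top_terms data frame is introduced here for illustration):

```r
# Collect the 20 most frequent terms into a data frame for plotting
top_idx <- head(order_freq, 20)
top_terms <- data.frame(term = names(frequency)[top_idx],
                        freq = frequency[top_idx])

ggplot(top_terms, aes(x = reorder(term, freq), y = freq)) +
  geom_col(fill = "lightgreen") +
  coord_flip() +
  labs(title = "Top 20 Terms", x = NULL, y = "Frequency")
```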

3.2.2. Word Cloud

colors <- c("blue", "red", "orange", "green")
wordcloud(names(frequency), frequency, max.words = 50, min.freq = 2, colors = colors)

4. Future Actions
My goal for the eventual app and algorithm is to create a “Shiny version” of the word prediction/completion apps available on cell phones.
