1. Shanti aspires to learn text analytics and takes a coffee.csv dataset on coffee reviews from
Twitter. After inspecting the data, she realizes that it needs to be cleaned, so she
prepares a list of preprocessing techniques. Help her implement those techniques using R.
NOTE: As this data set is based on coffee reviews, consider the words coffee and mug as
a) Extract the text data from the given dataset and convert it into a VectorSource.
f) Now using the functions from tm package convert all the text to lower case, replace
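A minimal sketch of steps a) and f) follows. It assumes the tm package is installed and uses a tiny made-up data frame in place of coffee.csv; the column name `text` and both sample tweets are hypothetical. Only the lower-casing part of step f) is shown, because the rest of that step is cut off in this sheet.

```r
library(tm)

# Hypothetical stand-in for coffee.csv; the real column name may differ.
tweets <- data.frame(
  text = c("I LOVE my morning Coffee!", "This Mug is GREAT"),
  stringsAsFactors = FALSE
)

# a) Wrap the text column in a VectorSource and build a corpus from it.
coffee_source <- VectorSource(tweets$text)
coffee_corpus <- VCorpus(coffee_source)

# f) tolower() is a base R function, so it is wrapped in
#    content_transformer() to keep the documents valid tm objects.
clean_corpus <- tm_map(coffee_corpus, content_transformer(tolower))

# Pull the cleaned text back out as a plain character vector.
cleaned <- unname(sapply(clean_corpus, as.character))
print(cleaned)
```

With the real dataset, the data frame would come from something like `read.csv("coffee.csv", stringsAsFactors = FALSE)` instead of being typed in.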
19CS3275S BIG DATA ANALYTICS
2. Consider the dataset ‘Amazon reviews’, which contains the reviews of the ‘Philips Avent 3 Pack
9oz Bottles’ used for kids along with the ratings given by users. Perform the following steps for
classifying the reviews as good/bad (1 indicating that the review is positive and 0 indicating that
b) Display the table of ratings i.e., the count of each kind of rating.
c) We can observe that the ratings are quite spread on the extreme ends. Generally, people
write reviews if they are super happy or dislike the product. Add a new column named
‘rating_new’ by putting a ‘1’ for good reviews, i.e., the records whose rating values are {4, 5},
and a ‘0’ for bad reviews, i.e., the records whose ratings are {1, 2}. Discard the records where
the rating is equal to 3.
d) Divide the cleaned dataset into 2 parts namely training set and test set wherein the
training set possesses the initial 130 records and the test set the rest.
e) Import the tm library. Now consider the ‘review’ column, create a corpus; tokenize it by
constructing a Document term matrix; remove the punctuation and numbers and store it in
f) Column-bind the ‘rating_new’ column of ‘product_review’ to this matrix and then convert
it back to a data frame. Rename this column as ‘y’ and convert it into a factor variable.
g) Import the caret library and create an SVM classification model by training the
h) Now consider the review column of the test dataset and create a corpus, create a DTM by
taking into account the terms of the DTM of the training set.
i) Predict the ratings of the test dataset using the generated classification model and store
the results in ‘model_toy_result’. Subtract 1 from the rating value, as the classes produced by
j) Finally, print the accuracy score of this model by displaying the fraction of the number of
correct and incorrect results.
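The data-handling steps above (b, c, d, e, h, i and j) can be sketched as follows. This is only an illustration under stated assumptions: the six-row data frame, the `rating` column name and the training size of 3 are made up (the exercise uses the real dataset and 130 records), and the SVM training of step g) is not shown. The predictions in step i) are placeholders rather than model output; they only illustrate that converting a factor with `as.numeric()` yields level codes starting at 1, which is one plausible reason for subtracting 1.

```r
library(tm)

# Hypothetical stand-in for the 'Amazon reviews' data; the 'rating' column
# name and all six rows are assumptions made for illustration.
product_review <- data.frame(
  review = c("Great bottles, no leaks", "Leaks everywhere", "It is okay",
             "My baby loves them", "Broke after 1 week", "Works well"),
  rating = c(5, 1, 3, 4, 2, 5),
  stringsAsFactors = FALSE
)

# b) Count of each kind of rating.
rating_counts <- table(product_review$rating)
print(rating_counts)

# c) Discard rating 3, then code {4, 5} as 1 and {1, 2} as 0.
product_review <- product_review[product_review$rating != 3, ]
product_review$rating_new <- ifelse(product_review$rating >= 4, 1, 0)

# d) Initial records go to the training set, the rest to the test set.
n_train <- 3  # assumed for this toy data; the exercise uses 130
train_set <- product_review[seq_len(n_train), ]
test_set  <- product_review[-seq_len(n_train), ]

# e) DTM of the training reviews with punctuation and numbers removed.
train_corpus <- VCorpus(VectorSource(train_set$review))
train_dtm <- DocumentTermMatrix(
  train_corpus,
  control = list(removePunctuation = TRUE, removeNumbers = TRUE)
)

# h) DTM of the test reviews, restricted to the training terms.
test_corpus <- VCorpus(VectorSource(test_set$review))
test_dtm <- DocumentTermMatrix(
  test_corpus,
  control = list(removePunctuation = TRUE, removeNumbers = TRUE,
                 dictionary = Terms(train_dtm))
)

# i) Placeholder predictions, NOT produced by a trained model.
#    as.numeric() on a factor returns level codes 1 and 2, hence the "- 1".
predicted <- factor(c(0, 0), levels = c(0, 1))
model_toy_result <- as.numeric(predicted) - 1

# j) Accuracy: fraction of predictions that match the true labels.
accuracy <- mean(model_toy_result == test_set$rating_new)
print(accuracy)
```

Passing the training terms as `dictionary` keeps the test matrix from introducing terms the model never saw during training.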
Post-lab:
1. Arjun is asked to plot and group the most frequent words from a dataset extracted from Twitter.
He decides to use clustering techniques to perform this task. He has prepared a series of steps
to be performed to accomplish his task. Help him implement these tasks using R.
a) Read the "Tweets.txt" dataset using readLines function and convert it into a corpus with
b) Convert the text into lowercase letters, remove the punctuation, replace multiple spaces
with a single space, and remove stopwords using the functions available in the tm package.
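Steps a) and b) can be sketched as below, followed by one possible way to count and group the most frequent words. The two sample tweets are made up and written to a temporary file that stands in for "Tweets.txt". The sheet does not say which clustering technique Arjun uses, so hierarchical clustering with `hclust()` is an assumption here.

```r
library(tm)

# Hypothetical stand-in for "Tweets.txt", written to a temporary file.
tweet_file <- tempfile(fileext = ".txt")
writeLines(c("Big DATA is   everywhere!!!",
             "The data, the tools, and the hype."), tweet_file)

# a) Read the lines and turn them into a corpus.
tweets <- readLines(tweet_file)
corpus <- VCorpus(VectorSource(tweets))

# b) Lower-case, drop punctuation, drop stopwords, then collapse spaces.
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
corpus <- tm_map(corpus, stripWhitespace)

# trimws() removes the space left behind when a leading word is dropped.
cleaned <- trimws(unname(sapply(corpus, as.character)))
print(cleaned)

# Word frequencies across all tweets, most frequent first.
tdm <- TermDocumentMatrix(corpus)
freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
print(freq)

# Assumed technique: hierarchical clustering of terms by their counts.
fit <- hclust(dist(as.matrix(tdm)))
# plot(fit) would draw the dendrogram of the grouped terms.
```

`stripWhitespace` is applied last here because removing stopwords leaves gaps behind; running it earlier, as the step order in b) might suggest, would leave those extra spaces in place.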