
19CS3275S BIG DATA ANALYTICS

Lab #6: Text Analytics

K. Baby Sahithi (190030900)
In-lab:

1. Shanti aspires to learn text analytics and takes a coffee.csv dataset of coffee reviews from Twitter. After reviewing the data, she realizes that it needs to be cleaned, so she prepares a list of preprocessing techniques. Help her implement those techniques using R.

NOTE: As this dataset is based on coffee reviews, consider the words "coffee" and "mug" as stopwords and remove them while preprocessing.

a) Extract the text data from the given dataset and convert it into a VectorSource.

b) Create a Volatile corpus from the vector source created.

c) Replace hashtags (e.g., "#Don'tDrinkNShoot") with a space.

d) Replace mentions (e.g., "@Dorkv76") with a space.

e) Replace hyphens, colons, and other punctuation with a space.

f) Now, using functions from the tm package, convert all the text to lower case and replace multiple whitespace characters with a single space.

g) Remove stop words from the corpus.
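Steps a)–g) can be sketched in R as follows. This is a minimal sketch, not a definitive solution: it assumes the review text sits in a column named `text` of coffee.csv (the column name is an assumption).

```r
# Minimal preprocessing sketch with the tm package.
# Assumption: coffee.csv holds the review text in a column named `text`.
library(tm)

coffee <- read.csv("coffee.csv", stringsAsFactors = FALSE)

# a) vector source, b) volatile corpus
corpus <- VCorpus(VectorSource(coffee$text))

# Helper: wrap gsub() so tm_map() can apply it to each document
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# c) hashtags, d) mentions, e) hyphens/colons/other punctuation
corpus <- tm_map(corpus, to_space, "#\\S+")
corpus <- tm_map(corpus, to_space, "@\\S+")
corpus <- tm_map(corpus, to_space, "[[:punct:]]+")

# f) lower-case everything, then collapse runs of whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)

# g) English stopwords plus the domain words "coffee" and "mug"
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "coffee", "mug"))

inspect(corpus[[1]])  # spot-check the first cleaned document
```

Stopword removal is done after lowercasing because the stopword list is lowercase.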


2. Consider the dataset 'Amazon reviews', which contains the reviews of the 'Philips Avent 3 Pack 9oz Bottles' used for kids, along with the ratings given by users. Perform the following steps to classify the reviews as good/bad (1 indicating that the review is positive and 0 indicating that the review is negative).

a) Read the given Amazon_reviews.csv file into a variable named ‘product_review’.

b) Display the table of ratings i.e., the count of each kind of rating.

c) We can observe that the ratings are quite spread toward the extreme ends. Generally, people write reviews if they are super happy with or dislike the product. Add a new column named 'rating_new' by putting a '1' for good reviews, i.e., the records whose rating values are {4, 5}, and a '0' for bad reviews, i.e., the records whose ratings are {1, 2}. Discard the records where the rating is equal to 3.

d) Divide the cleaned dataset into 2 parts, namely a training set and a test set, where the training set contains the initial 130 records and the test set the rest.

e) Import the tm library. Now consider the 'review' column: create a corpus; tokenize it by constructing a document-term matrix, removing the punctuation and numbers; then store it in matrix format and name it 'training_set_toy'.

f) Column-bind the 'rating_new' column of 'product_review' to this matrix and then convert it back to a data frame. Rename this column 'y' and convert it into a factor variable.

g) Import the caret library and create an SVM classification model by training the

‘training_set_toy’ data frame and considering ‘y’ as the response variable.

h) Now consider the review column of the test dataset and create a corpus; create a DTM by taking into account the terms of the DTM of the training set.

i) Predict the ratings of the test dataset using the generated classification model and store the results in 'model_toy_result'. Subtract 1 from each predicted value, as the classes produced by the classification model are 1 and 2.

j) Finally, print the accuracy score of this model by displaying the fraction of correct results out of the total number of correct and incorrect results.
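Steps a)–j) can be sketched as follows. This is a sketch under stated assumptions, not a definitive solution: the column names `rating` and `review` in Amazon_reviews.csv are assumptions, and caret's `method = "svmLinear"` additionally requires the kernlab package.

```r
library(tm)
library(caret)  # method = "svmLinear" also needs the kernlab package

# a) read the data (column names `rating` and `review` are assumed)
product_review <- read.csv("Amazon_reviews.csv", stringsAsFactors = FALSE)

# b) count of each kind of rating
print(table(product_review$rating))

# c) drop neutral records, then label {4, 5} as 1 and {1, 2} as 0
product_review <- subset(product_review, rating != 3)
product_review$rating_new <- ifelse(product_review$rating >= 4, 1, 0)

# d) first 130 records train, the rest test
train_df <- product_review[1:130, ]
test_df  <- product_review[-(1:130), ]

# e) corpus + document-term matrix without punctuation/numbers
train_corpus <- VCorpus(VectorSource(train_df$review))
train_dtm <- DocumentTermMatrix(train_corpus,
               control = list(removePunctuation = TRUE, removeNumbers = TRUE))
training_set_toy <- as.matrix(train_dtm)

# f) bind the label, convert back to a data frame, make `y` a factor
training_set_toy <- as.data.frame(cbind(training_set_toy, y = train_df$rating_new))
training_set_toy$y <- as.factor(training_set_toy$y)

# g) SVM classifier with caret, `y` as the response
model_toy <- train(y ~ ., data = training_set_toy, method = "svmLinear")

# h) test DTM restricted to the training vocabulary
test_corpus <- VCorpus(VectorSource(test_df$review))
test_dtm <- DocumentTermMatrix(test_corpus,
              control = list(dictionary = Terms(train_dtm),
                             removePunctuation = TRUE, removeNumbers = TRUE))

# i) predict; the factor classes come back as 1/2, so subtract 1
model_toy_result <-
  as.numeric(predict(model_toy, newdata = as.data.frame(as.matrix(test_dtm)))) - 1

# j) accuracy = fraction of correct predictions
mean(model_toy_result == test_df$rating_new)
```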


Post-lab:

1. Arjun is asked to plot and group the most frequent words from a dataset extracted from Twitter. He decides to use clustering techniques to perform this task. He has prepared a series of steps to accomplish it. Help him implement these tasks using R.

a) Read the "Tweets.txt" dataset using readLines function and convert it into a corpus with

the help of VectorSource and Corpus functions.

b) Convert the text into lowercase letters, remove the punctuation, replace multiple spaces with a single space, and remove stopwords using the functions available in the tm package.

c) Replace URLs with spaces.

d) Create a term-document matrix and remove sparse terms from it.

e) Cluster the terms using the hclust() function and plot the resulting dendrogram.

f) Group the dendrogram into 12 parts using the rect.hclust() function.
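The full pipeline can be sketched as follows. The 0.95 sparsity threshold and the Ward linkage method are assumptions (the exercise leaves them open); other reasonable values work too.

```r
library(tm)

# a) read the tweets and build a corpus
tweets <- readLines("Tweets.txt")
corpus <- Corpus(VectorSource(tweets))

# c) replace URLs with spaces (done before punctuation removal,
#    so fragments like "httptcoxyz" do not survive)
drop_urls <- content_transformer(function(x) gsub("http\\S+|www\\S+", " ", x))
corpus <- tm_map(corpus, drop_urls)

# b) lower-case, strip punctuation, collapse spaces, drop stopwords
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# d) term-document matrix, keeping only reasonably frequent terms
#    (sparse = 0.95 is an assumed threshold)
tdm <- TermDocumentMatrix(corpus)
tdm <- removeSparseTerms(tdm, sparse = 0.95)

# e) hierarchical clustering of the terms, then plot the dendrogram
m   <- as.matrix(tdm)
fit <- hclust(dist(scale(m)), method = "ward.D")
plot(fit)

# f) outline 12 groups on the dendrogram
rect.hclust(fit, k = 12)
```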


(Writing space for Post-Lab)


(For Evaluator’s use only)

Comment of the Evaluator (if any):

Evaluator's Observation:

Marks Secured: out of


Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:

