
19CS3275S BIG DATA ANALYTICS

Lab #6: Text Analytics

K. Baby Sahithi (190030900)
In-lab:

1. Shanti aspires to learn text analytics and takes a coffee.csv dataset of coffee reviews from Twitter. After reviewing the data, she realizes that it needs to be cleaned, so she prepares a list of preprocessing techniques. Help her implement those techniques using R.

NOTE: As this dataset is based on coffee reviews, consider the words "coffee" and "mug" as stopwords and remove them while preprocessing.

a) Extract the text data from the given dataset and convert it into a VectorSource.

b) Create a Volatile corpus from the vector source created.

c) Replace hashtags (e.g., "#Don'tDrinkNShoot") with a space.

d) Replace mentions (e.g., "@Dorkv76") with a space.

e) Replace hyphens, colons, and other punctuation with a space.

f) Now, using functions from the tm package, convert all the text to lower case and replace multiple whitespace characters with a single space.

g) Remove stop words from the corpus.
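Steps a)–g) can be sketched in R as follows. This is a minimal sketch, not a definitive solution: it assumes the review text sits in a column named `text` of coffee.csv (the column name is an assumption).

```r
# Minimal preprocessing sketch with the tm package.
# Assumption: coffee.csv holds the review text in a column named `text`.
library(tm)

coffee <- read.csv("coffee.csv", stringsAsFactors = FALSE)

# a) vector source, b) volatile corpus
corpus <- VCorpus(VectorSource(coffee$text))

# Helper: wrap gsub() so tm_map() can apply it to each document
to_space <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# c) hashtags, d) mentions, e) hyphens/colons/other punctuation
corpus <- tm_map(corpus, to_space, "#\\S+")
corpus <- tm_map(corpus, to_space, "@\\S+")
corpus <- tm_map(corpus, to_space, "[[:punct:]]+")

# f) lower-case everything, then collapse runs of whitespace
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, stripWhitespace)

# g) English stopwords plus the domain words "coffee" and "mug"
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "coffee", "mug"))

inspect(corpus[[1]])  # spot-check the first cleaned document
```

Stopword removal is done after lowercasing because the stopword list is lowercase.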


2. Consider the dataset 'Amazon reviews', which contains the reviews of the 'Philips Avent 3 Pack 9oz Bottles' used for kids, along with the ratings given by users. Perform the following steps to classify the reviews as good/bad (1 indicating that the review is positive and 0 indicating that the review is negative).

a) Read the given Amazon_reviews.csv file into a variable named ‘product_review’.

b) Display the table of ratings i.e., the count of each kind of rating.

c) We can observe that the ratings are quite spread toward the extreme ends. Generally, people write reviews if they are super happy with or dislike the product. Add a new column named 'rating_new' by putting a '1' for good reviews, i.e., the records whose rating values are {4, 5}, and a '0' for bad reviews, i.e., the records whose ratings are {1, 2}. Discard the records where the rating is equal to 3.

d) Divide the cleaned dataset into 2 parts, namely a training set and a test set, where the training set contains the initial 130 records and the test set the rest.

e) Import the tm library. Now consider the 'review' column: create a corpus; tokenize it by constructing a document-term matrix, removing the punctuation and numbers; then store it in matrix format and name it 'training_set_toy'.

f) Column-bind the 'rating_new' column of 'product_review' to this matrix and then convert it back to a data frame. Rename this column 'y' and convert it into a factor variable.

g) Import the caret library and create an SVM classification model by training the

‘training_set_toy’ data frame and considering ‘y’ as the response variable.

h) Now consider the review column of the test dataset and create a corpus; create a DTM by taking into account the terms of the DTM of the training set.

i) Predict the ratings of the test dataset using the generated classification model and store the results in 'model_toy_result'. Subtract 1 from each predicted value, as the classes produced by the classification model are 1 and 2.

j) Finally, print the accuracy score of this model by displaying the fraction of correct results out of the total number of correct and incorrect results.
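Steps a)–j) can be sketched as follows. This is a sketch under stated assumptions, not a definitive solution: the column names `rating` and `review` in Amazon_reviews.csv are assumptions, and caret's `method = "svmLinear"` additionally requires the kernlab package.

```r
library(tm)
library(caret)  # method = "svmLinear" also needs the kernlab package

# a) read the data (column names `rating` and `review` are assumed)
product_review <- read.csv("Amazon_reviews.csv", stringsAsFactors = FALSE)

# b) count of each kind of rating
print(table(product_review$rating))

# c) drop neutral records, then label {4, 5} as 1 and {1, 2} as 0
product_review <- subset(product_review, rating != 3)
product_review$rating_new <- ifelse(product_review$rating >= 4, 1, 0)

# d) first 130 records train, the rest test
train_df <- product_review[1:130, ]
test_df  <- product_review[-(1:130), ]

# e) corpus + document-term matrix without punctuation/numbers
train_corpus <- VCorpus(VectorSource(train_df$review))
train_dtm <- DocumentTermMatrix(train_corpus,
               control = list(removePunctuation = TRUE, removeNumbers = TRUE))
training_set_toy <- as.matrix(train_dtm)

# f) bind the label, convert back to a data frame, make `y` a factor
training_set_toy <- as.data.frame(cbind(training_set_toy, y = train_df$rating_new))
training_set_toy$y <- as.factor(training_set_toy$y)

# g) SVM classifier with caret, `y` as the response
model_toy <- train(y ~ ., data = training_set_toy, method = "svmLinear")

# h) test DTM restricted to the training vocabulary
test_corpus <- VCorpus(VectorSource(test_df$review))
test_dtm <- DocumentTermMatrix(test_corpus,
              control = list(dictionary = Terms(train_dtm),
                             removePunctuation = TRUE, removeNumbers = TRUE))

# i) predict; the factor classes come back as 1/2, so subtract 1
model_toy_result <-
  as.numeric(predict(model_toy, newdata = as.data.frame(as.matrix(test_dtm)))) - 1

# j) accuracy = fraction of correct predictions
mean(model_toy_result == test_df$rating_new)
```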


Post-lab:

1. Arjun is asked to plot and group the most frequent words from a dataset extracted from Twitter. He decides to use clustering techniques to perform this task. He has prepared a series of steps to accomplish it. Help him implement these tasks using R.

a) Read the "Tweets.txt" dataset using readLines function and convert it into a corpus with

the help of VectorSource and Corpus functions.

b) Convert the text into lowercase letters, remove the punctuation, replace multiple spaces with a single space, and remove stopwords using the functions available in the tm package.

c) Replace URLs with spaces.

d) Create a term-document matrix and remove sparse terms from it.

e) Cluster the terms using the hclust() function and plot the resulting dendrogram.

f) Group the dendrogram into 12 parts using the rect.hclust() function.
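The full pipeline can be sketched as follows. The 0.95 sparsity threshold and the Ward linkage method are assumptions (the exercise leaves them open); other reasonable values work too.

```r
library(tm)

# a) read the tweets and build a corpus
tweets <- readLines("Tweets.txt")
corpus <- Corpus(VectorSource(tweets))

# c) replace URLs with spaces (done before punctuation removal,
#    so fragments like "httptcoxyz" do not survive)
drop_urls <- content_transformer(function(x) gsub("http\\S+|www\\S+", " ", x))
corpus <- tm_map(corpus, drop_urls)

# b) lower-case, strip punctuation, collapse spaces, drop stopwords
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))

# d) term-document matrix, keeping only reasonably frequent terms
#    (sparse = 0.95 is an assumed threshold)
tdm <- TermDocumentMatrix(corpus)
tdm <- removeSparseTerms(tdm, sparse = 0.95)

# e) hierarchical clustering of the terms, then plot the dendrogram
m   <- as.matrix(tdm)
fit <- hclust(dist(scale(m)), method = "ward.D")
plot(fit)

# f) outline 12 groups on the dendrogram
rect.hclust(fit, k = 12)
```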


(Writing space for Post-Lab)


(For Evaluator’s use only)

Comment of the Evaluator (if any):

Evaluator's Observation:

Marks Secured: out of


Full Name of the Evaluator:

Signature of the Evaluator Date of Evaluation:

