
CA2 Part 1:

Problem statement:

Build a support vector classifier and a Naïve Bayes classifier to predict the ratings of medicine reviews, then
select the better model to deploy for predicting ratings on the given NoRatings.csv dataset.

Data exploration:

The given MedReview dataset is unstructured and imbalanced. It has Medicine, Condition, Review and
Rating attributes, where Rating is the target variable.
17,900 documents are high rated and 5,407 documents are low rated.

Converted the Rating column to int and then dropped the duplicate documents.
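A minimal sketch of this preparation step, assuming the input file is named MedReview.csv (the exact file name is an assumption):

import pandas as pd

# Assumed file name; the dataset is referred to as MedReview above.
dataset = pd.read_csv('MedReview.csv')

# Cast the Rating column to integer and drop duplicate documents.
dataset['Rating'] = dataset['Rating'].astype(int)
dataset = dataset.drop_duplicates()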

Normalize the Review column:

- Keep only words occurring in more than 30 documents (enforced later when building the TF-IDF matrix)
- Remove HTML encodings
- Remove &# mentions, numerics and any emoticons
- Tokenize, keeping only words longer than 2 characters
- Convert all words to lower case
- Remove stop words
- Apply WordNetLemmatizer to extract only lemmas

After normalizing, delete the null rows if there are any (our text data doesn't contain any URL, image or XML
content, so there should be no rows with null values).

Dropped the original Review, Medicine and Condition columns, and saved this cleaned data to the cleaned_data file.
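A minimal sketch of the normalization steps listed above, assuming NLTK with the stopwords and WordNet corpora downloaded; the normalize helper and the cleaned_review column name are illustrative:

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# One-time downloads: nltk.download('stopwords'), nltk.download('wordnet')
lemmatizer = WordNetLemmatizer()
stop_words = set(stopwords.words('english'))

def normalize(text):
    text = re.sub(r'&#\d+;', ' ', text)                       # drop HTML encodings such as &#039;
    text = re.sub(r'[^A-Za-z\s]', ' ', text)                  # drop numerics, emoticons, punctuation
    tokens = [t.lower() for t in text.split() if len(t) > 2]  # keep tokens longer than 2 characters
    tokens = [t for t in tokens if t not in stop_words]       # remove stop words
    return ' '.join(lemmatizer.lemmatize(t) for t in tokens)  # keep only lemmas

dataset['cleaned_review'] = dataset['Review'].apply(normalize)
dataset = dataset.dropna()  # delete null rows, if any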

Created a TF-IDF matrix for the words which occur in more than 30 documents.
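One way to build that matrix with scikit-learn; min_df enforces the 30-document threshold and ngram_range=(1, 3) produces the uni-bi-trigrams referred to in the results below:

from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=30: keep only terms occurring in at least 30 documents;
# ngram_range=(1, 3): unigrams, bigrams and trigrams.
vectorizer = TfidfVectorizer(min_df=30, ngram_range=(1, 3))
X = vectorizer.fit_transform(dataset['cleaned_review'])
y = dataset['Rating']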

Model selection:

Cross-validated the SVC and NBC models by iterating 10 times over 9:1 train/test splits.

As it is an imbalanced dataset, both SMOTE and random oversampling were tried; a sketch of the procedure follows, and the results for both are below.
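A sketch of this procedure with imbalanced-learn, resampling only the training fold of each 9:1 split; LinearSVC and MultinomialNB stand in for the SVC and NBC models (the exact estimators used are an assumption):

from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB
from imblearn.over_sampling import SMOTE, RandomOverSampler

# Ten 9:1 train/test splits.
splitter = StratifiedShuffleSplit(n_splits=10, test_size=0.1, random_state=42)

for sampler_name, sampler in [('Random oversampling', RandomOverSampler()), ('SMOTE', SMOTE())]:
    for model in (LinearSVC(), MultinomialNB()):
        scores = []
        for train_idx, test_idx in splitter.split(X, y):
            # Balance the training fold only; the test fold keeps its natural distribution.
            X_res, y_res = sampler.fit_resample(X[train_idx], y.iloc[train_idx])
            scores.append(model.fit(X_res, y_res).score(X[test_idx], y.iloc[test_idx]))
        print(sampler_name, type(model).__name__, sum(scores) / len(scores))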

Random oversampling results:

SVC using uni-bi-trigrams: all 10 cross-validation accuracies were 0.9572921907317345 (mean: 0.9572921907317345, best: 0.9572921907317345).

NBC using uni-bi-trigrams: all 10 cross-validation accuracies were 0.8419140253787244 (mean: 0.8419140253787244, best: 0.8419140253787244).

SMOTE results:

SVC using uni-bi-trigrams: all 10 cross-validation accuracies were 0.9584940466208285 (mean: 0.9584940466208284, best: 0.9584940466208285).

NBC using uni-bi-trigrams: all 10 cross-validation accuracies were 0.8424450779808821 (mean: 0.8424450779808821, best: 0.8424450779808821).

SMOTE is slightly better than random oversampling, hence SMOTE was used to balance the text data
while doing cross-validation.

The support vector classifier is the best model, with 95.84% accuracy on our test data, so the
corresponding vocabulary of 5,431 words and the SVC model are saved for deployment on
NoRating.csv to predict the ratings of unknown medicine reviews.
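The description suggests the model and vocabulary were persisted roughly as follows; svc.sav and vocabulary_SVC.csv are the file names mentioned in the text, svc_model is illustrative, and vectorizer is the one from the earlier TF-IDF sketch:

import pickle
import pandas as pd

# Save the trained classifier for the implementation part.
with open('svc.sav', 'wb') as f:
    pickle.dump(svc_model, f)

# Save the fitted TF-IDF vocabulary (term -> column index).
pd.Series(vectorizer.vocabulary_).to_csv('vocabulary_SVC.csv')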

Model Implementation:
Loading the svc.sav model saved in the creation part.

Converted vocabulary_SVC.csv to a dictionary and normalized the Review text column of
NoRating.csv.

Saved this cleaned data to the cleaned_Reviews file and then predicted the rating for these reviews.

The final file with the reviews and their predicted ratings is stored in predicted_Rating.
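A sketch of this deployment flow, assuming the normalize helper from the creation part and the file names mentioned above (the .csv extension on the output file is an assumption):

import pickle
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Reload the saved classifier and vocabulary.
with open('svc.sav', 'rb') as f:
    svc_model = pickle.load(f)
vocab = pd.read_csv('vocabulary_SVC.csv', index_col=0).squeeze().to_dict()

# Normalize the new reviews and vectorise them with the training vocabulary.
new_data = pd.read_csv('NoRating.csv')
new_data['cleaned_review'] = new_data['Review'].apply(normalize)
vectorizer = TfidfVectorizer(ngram_range=(1, 3), vocabulary=vocab)
X_new = vectorizer.fit_transform(new_data['cleaned_review'])

new_data['predicted_rating'] = svc_model.predict(X_new)
new_data.to_csv('predicted_Rating.csv', index=False)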

Interpretation
After validating 53 rows of the predicted_Rating file, the model works well on the NoRating file at
predicting the unknown ratings.

Out of 53 predictions, 6 were classified incorrectly, i.e. 47/53 ≈ 88.68% correctness.

==================================================================================

CA2 Part 2:
Problem statement
Analysing online users' intentions and predicting whether a user is going to buy the product or not,
using three ensemble models - Random Forest, AdaBoost and GradientBoost - and comparing their
results to decide which model we would apply in the real world.

Data Exploration
The E-shop.csv dataset is an e-commerce dataset with the attributes Administrative, Administrative_Duration,
Informational, Informational_Duration, ProductRelated, ProductRelated_Duration, BounceRate,
ExitRate, PageValue, SpecialDay, Month, VisitorType, Weekend and Transaction.

Transaction is the target column. There are no null values.

All columns are numeric except Month and VisitorType, which are categorical, and Weekend and Transaction,
which are Boolean.
They were converted as below:

# 'converter' maps the Boolean columns to integers
# (assumed definition; the original does not show it).
def converter(value):
    return 1 if value else 0

dataset['Weekend'] = dataset['Weekend'].apply(converter)
dataset['Transaction'] = dataset['Transaction'].apply(converter)
dataset['VisitorType'] = dataset['VisitorType'].map({'Returning_Visitor': 1, 'New_Visitor': 0})
dataset['Month'] = dataset['Month'].map({'Feb': 2, 'Mar': 3, 'May': 5, 'June': 6, 'Jul': 7,
                                         'Aug': 8, 'Sep': 9, 'Oct': 10, 'Nov': 11, 'Dec': 12})

The graphs below show how the target variable (Transaction) behaves with respect to all the other variables.
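A minimal plotting sketch (using seaborn is an assumption; the original graphs are not reproduced here):

import matplotlib.pyplot as plt
import seaborn as sns

# One box plot per feature, split by the target variable.
for col in dataset.columns.drop('Transaction'):
    sns.boxplot(x='Transaction', y=col, data=dataset)
    plt.title(f'{col} vs Transaction')
    plt.show()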

Model selection and interpretations:


Purchase intention can be used to describe customer loyalty, to balance demand and supply chains,
and for revenue prediction.

Purchase intention may change under the influence of price or of perceived quality and value.

The more competitive the market, the more important the level of customer satisfaction; hence it is
important to reduce false negatives (FN).

If we were considering demand-and-supply prediction or revenue prediction, we would need to
minimise both FP and FN, i.e. increase overall accuracy.

In this paper I consider customer loyalty/customer retention, hence the scoring
parameter is "recall", because we need to minimise FN.
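With scikit-learn this amounts to passing scoring='recall', which optimises TP / (TP + FN) for the positive class; a minimal sketch (cv=5 and the estimator settings are assumptions):

from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = dataset.drop('Transaction', axis=1)
y = dataset['Transaction']

# scoring='recall' directly penalises false negatives on the positive class.
scores = cross_val_score(RandomForestClassifier(random_state=42), X, y,
                         cv=5, scoring='recall')
print(scores.mean())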

Random forest with feature selection


Classification report:
              precision    recall  f1-score   support

           0       0.95      0.87      0.91      2067
           1       0.52      0.74      0.61       382

    accuracy                           0.85      2449
   macro avg       0.73      0.80      0.76      2449
weighted avg       0.88      0.85      0.86      2449

TP: 281
TN: 1806
FP: 261
FN: 101
Accuracy: 0.852184565128624
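A sketch of "random forest with feature selection", assuming importance-based selection; SelectFromModel is one common choice, not necessarily the one used here, and the split parameters are assumptions:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Keep only features whose importance exceeds the mean importance.
selector = SelectFromModel(RandomForestClassifier(random_state=42)).fit(X_train, y_train)
rf = RandomForestClassifier(random_state=42).fit(selector.transform(X_train), y_train)

print(classification_report(y_test, rf.predict(selector.transform(X_test))))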

AdaBoost
Classification report:
              precision    recall  f1-score   support

           0       0.95      0.88      0.92      2067
           1       0.54      0.75      0.63       382

    accuracy                           0.86      2449
   macro avg       0.75      0.82      0.77      2449
weighted avg       0.89      0.86      0.87      2449

TP: 286
TN: 1826
FP: 241
FN: 96
Accuracy: 0.8623928133932217

Gradient Boosting Classifier

Classification report:
              precision    recall  f1-score   support

           0       0.95      0.91      0.93      2067
           1       0.60      0.72      0.65       382

    accuracy                           0.88      2449
   macro avg       0.77      0.82      0.79      2449
weighted avg       0.89      0.88      0.89      2449

TP: 275
TN: 1883
FP: 184
FN: 107
Accuracy: 0.8811759902000816
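The TP/TN/FP/FN counts above can be read off scikit-learn's confusion matrix; a sketch comparing all three ensembles on the split from the earlier sketch (default hyperparameters are an assumption):

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier, GradientBoostingClassifier
from sklearn.metrics import classification_report, confusion_matrix

models = {
    'Random Forest': RandomForestClassifier(random_state=42),
    'AdaBoost': AdaBoostClassifier(random_state=42),
    'GradientBoost': GradientBoostingClassifier(random_state=42),
}
for name, model in models.items():
    y_pred = model.fit(X_train, y_train).predict(X_test)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()  # sklearn row order: [[TN, FP], [FN, TP]]
    print(name, f'TP={tp} TN={tn} FP={fp} FN={fn}')
    print(classification_report(y_test, y_pred))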

Although all three models achieve similar accuracy, I select AdaBoost because my concern is to
reduce FN: it has the lowest FN (96) and the highest positive-class recall (0.75).

Recommendations:
- Most successful transactions happened in the month of Nov, which is expected given the year-end
sales and festival season all over the world. Hence it is a crucial month for big sale discounts to
attract new customers as well as retain existing ones, and for advertising in-demand products to
increase revenue.
- Offer small perks as a base reward for joining the loyalty system, then promote repeat clients by
increasing the bonus benefit as the client moves up the loyalty ladder.
- "Customers who bought [this item] also bought [that item]" suggestions provide the consumer with
social proof and peer-generated suggestions of related items. For example, if a customer is buying a
phone, we can recommend a screen guard, back case, headphones, etc., which relate to the purchase
and increase both customer satisfaction and revenue.
- Present recommendations on product pages ("Similar to products you've visited") to help inspire
consumers to add new things to their basket.
- Notify viewers of updated items with notices such as "There is a newer edition of this product."
- Generate merchandise combos (items often bought together) and provide a special discount for
group purchases.
- Give suggested pairings of items on the shopping cart tab. This is the last chance, before consumers
step into the checkout process, to provide them with quality suggestions. If you employ this strategy,
make sure you don't discourage consumers from finishing the order; the best approach here is
cross-selling related products which complement the products already in their cart.
