
CIS242_HW6_ANSPDF

April 8, 2021

CIS 242

0.1 Spring 2021


0.2 HOMEWORK ASSIGNMENT 6
Please compile your responses using markdown in your Jupyter notebook to answer the questions.
If you prefer, you may also submit a Word or PDF document with the responses along with the PDF or
HTML version of the completed notebook.

Active notebooks (.ipynb files) or raw code (.py files) will NOT be accepted and no points will
be given. The code part of the files will not be graded, but they will be checked if necessary
to verify your findings and recommendations. Point deductions may occur if there are major
discrepancies between your written answers and the output from the code.

Please make sure that your answers are readable and don’t run off the page when the notebook
is converted to HTML or PDF. Questions are worth 2 points each for a total of 26 points.

0.3 Working with movie reviews


You have been hired by a new streaming service to try to decide which movies they should bid
for to get exclusive rights before Netflix gets them. You have access to user reviews from the test
screenings that the studios do.
There are positive and negative reviews for each movie. You want to be able to predict the rating
of a movie based on user reviews so you can help your new company know how to predict “good”
movies based on review terms. You have a data file movie_reviews.csv to work with.

1. Read in the reviews and do some EDA. What can you say about the data? Create an
unconstrained vector space. How many features do you have?
[3]: # code or markdown here
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 150000) #important for getting all the text
pd.set_option('display.max_columns', 999)
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from IPython.display import Image
import pydotplus

from sklearn.feature_extraction.text import CountVectorizer
import math

[4]: df = pd.read_csv("movie_reviews.csv")
df.shape

[4]: (25000, 3)

[5]: df.head(2)

[5]: Movie ID Rating \


0 1821 4
1 9487 1

Review
0 Alan Rickman & Emma Thompson give performances with southern/New Orleans
accents in this detective flick. It's worth seeing for their scenes- and
Rickman's scene with Hal Holbrook. These three actors mannage to entertain us no
matter what the movie, it seems. The plot for the movie shows potential, but one
gets the impression in watching the film that it was not pulled off as well as
it could have been. The fact that it is cluttered by a rather uninteresting
subplot and mostly uninteresting kidnappers really muddles things. The movie is
worth a view- if for nothing more than entertaining performances by Rickman,
Thompson, and Holbrook.
1 I have seen this movie and I did not care for this movie anyhow. I would not
think about going to Paris because I do not like this country and its national
capital. I do not like to learn french anyhow because I do not understand their
language. Why would I go to France when I rather go to Germany or the United
Kingdom? Germany and the United Kingdom are the nations I tolerate. Apparently
the Olsen Twins do not understand the French language just like me. Therefore I
will not bother the France trip no matter what. I might as well stick to the
United Kingdom and meet single women and play video games if there is a video
arcade. That is all.

[6]: df['Rating'] = df['Rating'].astype("category")

[7]: df['Rating'].unique()

[7]: [4, 1, 2, 3, 9, 10, 7, 8]


Categories (8, int64): [4, 1, 2, 3, 9, 10, 7, 8]

[8]: df['Rating'].hist(bins = 20)

[8]: <matplotlib.axes._subplots.AxesSubplot at 0xc7b2510>

[9]: bow = CountVectorizer(binary = True)
bow_dm = bow.fit_transform(df.Review) #apply the transformation
print(type(bow_dm))
print(bow_dm.shape)
# print(bow.get_feature_names())

<class 'scipy.sparse.csr.csr_matrix'>
(25000, 74899)
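There are 25,000 observations, each with a movie ID, a rating, and the review text. The ratings fall into 8 classes (1 to 4 and 7 to 10; no 5s or 6s appear). The unconstrained vector space has 74,899 features.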

2. What are the top 20 words in the feature space? Are you surprised by this? Why or why not?
[10]: # code or markdown here
sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-20:]

ftrs = bow.get_feature_names()
[ftrs[i] for i in ids]

[10]: ['be',
'have',
'one',
'not',
'movie',
'on',

'as',
'was',
'with',
'for',
'but',
'that',
'in',
'it',
'is',
'this',
'to',
'of',
'and',
'the']

This is not surprising: apart from 'movie', these are common function words that are used all the time in everyday English, so they appear in nearly every review.
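The ranking above relies on the fact that, with `binary=True`, each column sum of the document-term matrix is the number of reviews containing that term. The following is a minimal sketch of that idea on a hypothetical three-review corpus (the `docs` list is an assumption standing in for `df.Review`); it reads term names from the fitted vectorizer's `vocabulary_` attribute.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy corpus standing in for df.Review
docs = [
    "the movie was great",
    "the plot was thin",
    "the acting was great",
]

vec = CountVectorizer(binary=True)
X = vec.fit_transform(docs)

# vocabulary_ maps term -> column index; sort terms by column index
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# With binary=True, column sums are document frequencies
doc_freq = np.asarray(X.sum(axis=0)).ravel()

# Two most frequent terms, highest first
order = np.argsort(doc_freq)[::-1]
top_terms = [(terms[i], int(doc_freq[i])) for i in order[:2]]
print(X.shape, top_terms)
```

Even in this tiny corpus the function words 'the' and 'was' come out on top, which mirrors the pattern in the full data.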

3. Create a feature space that removes stopwords. What is the size of the space now? What are
the top words? Is this an improvement in size or content? Why or why not? The size is now
74,588, down from 74,899. The reduction in size is small (311 features), but the top words are more meaningful (for example 'plot', 'acting', 'story'), so this is mainly an improvement in content.

[11]: # code or markdown here

bow = CountVectorizer(stop_words='english',binary = True)


bow_dm = bow.fit_transform(df.Review) #apply the transformation
print(type(bow_dm))
print(bow_dm.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(25000, 74588)
[12]: sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-20:]

ftrs = bow.get_feature_names()
[ftrs[i] for i in ids]

[12]: ['character',
'know',
'plot',
'characters',
'think',
'acting',
'seen',
'movies',
'watch',
'way',

'people',
'make',
'don',
'story',
'really',
'time',
'just',
'like',
'film',
'movie']

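4. Create a feature space that requires a minimum frequency for a term as well as removing
stopwords. What limit will you choose? Why? What is the effect on the feature space? I
choose a minimum document frequency of 100. With a lower limit, many features appear in only a few
reviews; with a much higher limit, too few features remain. The 95th percentile of term frequency is 92,
so a limit of 100 keeps roughly the top 5% of terms and shrinks the feature space from 74,588 to 3,496 features.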

[13]: # code or markdown here


np.quantile(sum_words, 0.5)

[13]: 2.0

[14]: np.quantile(sum_words, 0.9)

[14]: 37.0

[15]: np.quantile(sum_words, 0.95)

[15]: 92.0

[16]: bow = CountVectorizer(stop_words='english', min_df = 100,binary = True)


bow_dm = bow.fit_transform(df.Review) #apply the transformation
print(type(bow_dm))
print(bow_dm.shape)

<class 'scipy.sparse.csr.csr_matrix'>
(25000, 3496)
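To make the effect of a minimum document frequency concrete, here is a small pure-Python sketch of the filtering idea behind `min_df`. The helper names and the toy reviews are hypothetical, and whitespace splitting is a simplification of the vectorizer's real tokenizer.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for doc in docs:
        # set() so a term repeated within one document is counted once
        df.update(set(doc.lower().split()))
    return df

def filter_by_min_df(docs, min_df):
    """Return the sorted terms that appear in at least min_df documents."""
    df = document_frequency(docs)
    return sorted(term for term, n in df.items() if n >= min_df)

# Hypothetical toy reviews
docs = ["great plot great acting", "thin plot", "great acting"]

kept = filter_by_min_df(docs, min_df=2)
print(kept)
```

Here 'thin' is dropped because it occurs in only one review, while 'great' is kept on the strength of two reviews even though it appears three times in total.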

5. Investigate this feature space. What tokens would you like to remove that remain? Why?
(Or Why not?) Some meaningless tokens remain, such as contraction fragments ('ll', 've') and numbers ('10'). I remove every token of length 2 or less to exclude them.

[18]: # code or markdown here

sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-100:]

ftrs = bow.get_feature_names()
[ftrs[i] for i in ids]

[18]: ['having',
'trying',
'original',
'horror',
'performance',
'fun',
'screen',
'believe',
'worth',
'tv',
'action',
'especially',
'looking',
'sure',
'hard',
'kind',
'minutes',
'comedy',
'guy',
'll',
'away',
'script',
'probably',
'feel',
'role',
'making',
'bit',
'music',
'point',
'far',
'gets',
'young',
'interesting',
'isn',
'times',
'saw',
'right',
'world',
'come',
'big',
'fact',
'pretty',
'got',
'quite',
'long',
'new',
'thought',

'things',
'cast',
'want',
'funny',
'old',
'lot',
'10',
'work',
'going',
'look',
'actually',
'years',
'makes',
'director',
'doesn',
'didn',
'actors',
'real',
'thing',
'watching',
've',
'scene',
'scenes',
'man',
'end',
'say',
'does',
'life',
'love',
'films',
'little',
'better',
'did',
'character',
'know',
'plot',
'characters',
'think',
'acting',
'seen',
'movies',
'watch',
'way',
'people',
'make',
'don',
'story',

'really',
'time',
'just',
'like',
'film',
'movie']

6. Create custom stopword list to remove the tokens. How does this affect your feature space?
If choose not to remove any more tokens, explain why your feature space is appropriate as is -
be specific. This reduces the number of features again, from 3,496 to 3,446.

[19]: # code or markdown here

from sklearn.feature_extraction import text

# Tokens of length 2 or less from the min_df=100 feature space (e.g. 'll', 've', '10')
stop_wds = [i for i in ftrs if len(i) <= 2]

# Extend scikit-learn's built-in English stopword list with the short tokens
skl_stopwords = list(text.ENGLISH_STOP_WORDS)
mystp = skl_stopwords + stop_wds
stopmin = CountVectorizer(binary=True, min_df = 100, stop_words = mystp)
bow_dm = stopmin.fit_transform(df.Review)
sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-20:]
print(bow_dm.shape)
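# Fixed: names must come from the vectorizer that produced bow_dm. The original cell
# read them from bow, so the list printed below pairs stopmin's indices with bow's vocabulary.
ftrs = stopmin.get_feature_names()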
[ftrs[i] for i in ids]

(25000, 3446)
[19]: ['cause',
'just',
'philosophy',
'causes',
'technical',
'absurd',
'scared',
'miserably',
'videos',
'views',
'owner',
'losing',
'discovers',
'starred',
'quickly',
'test',
'jazz',
'leading',

'fears',
'miscast']
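The custom stopword step can be sketched end to end on a toy corpus. This is an illustration under assumptions, not the notebook's exact cell: the three reviews are hypothetical, and term names are always read from the same fitted vectorizer that produced the matrix, so indices and names stay aligned.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical toy reviews containing contractions and a number
docs = [
    "I don't think it's 10 out of 10",
    "we've seen better films",
    "they'll love the film",
]

# Step 1: built-in English stopwords only
base = CountVectorizer(stop_words="english", binary=True).fit(docs)
base_terms = sorted(base.vocabulary_, key=base.vocabulary_.get)

# Step 2: collect the short leftovers (contraction fragments, numbers)
short_tokens = [t for t in base_terms if len(t) <= 2]

# Step 3: refit with the extended stopword list
custom_stop = list(ENGLISH_STOP_WORDS) + short_tokens
custom = CountVectorizer(stop_words=custom_stop, binary=True).fit(docs)
custom_terms = sorted(custom.vocabulary_, key=custom.vocabulary_.get)

print(short_tokens)
print(custom_terms)
```

Note that a length cutoff of 2 removes 'll', 've' and '10' but leaves three-letter fragments such as 'don' in place; those would need to be added to the list by hand.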

7. Investigate the feature space further. What other changes you would like to make? Why?
Choose custom dictionary replacement or stemming and implement your choice. How does
that affect your feature space now? I use stemming (the Porter stemmer), applied as an additional step after
the choices above. I also raise the minimum frequency to 500, because otherwise the notebook runs out of memory. The feature space shrinks to 786 features.

[20]: # code or markdown here


from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
df['pstem'] = df["Review"].apply(lambda x: [ps.stem(y) for y in x.split()])
df['pstem']= [" ".join(token) for token in df['pstem']]

stem = CountVectorizer(binary=False, min_df = 500, stop_words = mystp)


bow_dm = stem.fit_transform(df.pstem)
sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-20:]
print(bow_dm.shape)
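# Fixed: names must come from stem, the vectorizer fitted here. The original cell read them
# from bow, so the list printed below pairs stem's indices with bow's vocabulary.
ftrs = stem.get_feature_names()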
[ftrs[i] for i in ids]

(25000, 786)
[20]: ['appreciated',
'broken',
'clue',
'broad',
'african',
'christian',
'cartoon',
'cameos',
'curious',
'crisis',
'brown',
'cousin',
'believes',
'boredom',
'bright',
'billy',
'cameo',
'band',
'crying',
'count']
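One side effect of stemming before stopword removal is worth illustrating: the Porter stemmer rewrites some stopwords into forms that no longer match the stopword list. The sketch below assumes NLTK is installed and uses a small hypothetical stopword set rather than the full `mystp` list.

```python
from nltk.stem.porter import PorterStemmer

ps = PorterStemmer()

# Words whose stems show up later as features in the stemmed space
words = ["this", "very", "pretty", "nothing"]
stems = {w: ps.stem(w) for w in words}

# Hypothetical, deliberately tiny stopword set for illustration
stop = {"this", "very"}

# Stopwords whose stemmed form is no longer in the stopword set
escaped = [stems[w] for w in words if w in stop and stems[w] not in stop]

print(stems)
print(escaped)
```

Stems such as 'thi' and 'veri' therefore survive as features unless the stopword list is stemmed as well, or stopwords are removed before stemming.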

8. Choose the feature space you think most appropriate to use in a model. Now we want to
use this to predict the audience feeling for a movie. How will you do that? What target variable
will help decide which movies to buy? How will you implement this? The Rating variable is the
target: it helps decide which movies to buy, because higher-rated products are usually more popular.
I will implement this with a classification model that predicts the rating from the review terms.

9. Train at least two classification models using your feature space. Which had the highest
predictive accuracy? What did it do well? Not so well? MultinomialNB has the higher accuracy
(0.35, against 0.28 for the decision tree). It does relatively well on the extreme ratings, classes 1 and 10
(F1 of 0.54 and 0.52), and worse on the other classes (F1 between 0.14 and 0.24).

[22]: # code or markdown here


from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
X = bow_dm.toarray()
y = df['Rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = model.predict(X_test)

[24]: accuracy_score(clf2_expected, clf2_predicted)

[24]: 0.3518

[25]: print(classification_report(clf2_expected, clf2_predicted))

              precision    recall  f1-score   support

           1       0.45      0.66      0.54      1033
           2       0.20      0.12      0.15       459
           3       0.22      0.16      0.18       512
           4       0.26      0.22      0.24       511
           7       0.23      0.25      0.24       459
           8       0.20      0.18      0.19       535
           9       0.18      0.12      0.14       459
          10       0.50      0.55      0.52      1032

    accuracy                           0.35      5000
   macro avg       0.28      0.28      0.27      5000
weighted avg       0.32      0.35      0.33      5000

[26]: from sklearn.tree import DecisionTreeClassifier


import warnings
warnings.filterwarnings("ignore")
dt_pru = DecisionTreeClassifier(criterion='gini', min_impurity_decrease = 0.0003,
min_samples_split= 0.0005,

random_state = 12345)
dt_pru.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = dt_pru.predict(X_test)

[27]: accuracy_score(clf2_expected, clf2_predicted)

[27]: 0.2814

[29]: print(classification_report(clf2_expected, clf2_predicted))

              precision    recall  f1-score   support

           1       0.32      0.58      0.41      1033
           2       0.17      0.02      0.04       459
           3       0.00      0.00      0.00       512
           4       0.18      0.14      0.16       511
           7       0.15      0.05      0.07       459
           8       0.21      0.10      0.13       535
           9       0.00      0.00      0.00       459
          10       0.28      0.63      0.39      1032

    accuracy                           0.28      5000
   macro avg       0.17      0.19      0.15      5000
weighted avg       0.20      0.28      0.21      5000

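10. What is the appropriate metric for evaluating these models? Why? How did your previous
models do when using that measure? Macro F1, which weights all classes equally and combines
precision and recall, is a better metric than accuracy here, because the classes are imbalanced (ratings 1 and 10
have about twice the support of the other classes). On the previous models, macro F1 leads to the same
decision as accuracy: MultinomialNB (0.27) beats the decision tree (0.15).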
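As a check on what macro F1 measures, the averages can be recomputed from the per-class figures in the MultinomialNB report above. This is a sketch using the rounded values printed in the report, so the results may differ from the report's own averages in the last digit.

```python
# Per-class F1 and support, copied from the MultinomialNB classification report
f1_by_class = {1: 0.54, 2: 0.15, 3: 0.18, 4: 0.24, 7: 0.24, 8: 0.19, 9: 0.14, 10: 0.52}
support = {1: 1033, 2: 459, 3: 512, 4: 511, 7: 459, 8: 535, 9: 459, 10: 1032}

# Macro average: every class counts equally, regardless of size
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: each class counts in proportion to its support
total = sum(support.values())
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / total

print(f"macro F1    = {macro_f1:.3f}")
print(f"weighted F1 = {weighted_f1:.3f}")
```

The weighted figure is higher because the two largest classes, 1 and 10, are also the ones the model predicts best; the macro figure does not give them that extra weight.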

11. Create an alternative feature space. You can change any aspect. Run your two models on
this feature space. Does it improve your models’ performance for your chosen metric? No. I
use TfidfVectorizer. The better classifier is still MultinomialNB; its accuracy increased (0.35 to 0.37), but its macro
F1 is lower (0.27 to 0.17).

[30]: # code or markdown here

from sklearn.feature_extraction.text import TfidfVectorizer


tfidf1 = TfidfVectorizer(min_df = 500, stop_words = mystp)
bow_dm = tfidf1.fit_transform(df.pstem)

X = bow_dm.toarray()
y = df['Rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultinomialNB()
model.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = model.predict(X_test)
print(classification_report(clf2_expected, clf2_predicted))

              precision    recall  f1-score   support

           1       0.39      0.86      0.53      1033
           2       0.00      0.00      0.00       459
           3       0.18      0.01      0.01       512
           4       0.32      0.12      0.18       511
           7       0.32      0.03      0.06       459
           8       0.24      0.07      0.11       535
           9       0.00      0.00      0.00       459
          10       0.36      0.80      0.50      1032

    accuracy                           0.37      5000
   macro avg       0.23      0.24      0.17      5000
weighted avg       0.26      0.37      0.25      5000

[31]: dt_pru = DecisionTreeClassifier(criterion='gini', min_impurity_decrease = 0.0003,


min_samples_split= 0.0005,
random_state = 12345)
dt_pru.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = dt_pru.predict(X_test)
print(classification_report(clf2_expected, clf2_predicted))

              precision    recall  f1-score   support

           1       0.29      0.62      0.39      1033
           2       0.09      0.00      0.00       459
           3       0.17      0.04      0.07       512
           4       0.20      0.17      0.18       511
           7       0.10      0.02      0.04       459
           8       0.13      0.02      0.03       535
           9       0.00      0.00      0.00       459
          10       0.30      0.58      0.40      1032

    accuracy                           0.27      5000
   macro avg       0.16      0.18      0.14      5000
weighted avg       0.19      0.27      0.20      5000
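For context on what the TF-IDF feature space changes, the sketch below computes the inverse document frequency weight using the smoothed formula that scikit-learn documents as the `TfidfVectorizer` default (`smooth_idf=True`). The function name is hypothetical, and the document frequencies are chosen for illustration: 500 matches the `min_df` used here, 25,000 is the corpus size.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 25000

# A term that just meets min_df=500 versus one present in every review
idf_rare = smooth_idf(n_docs, 500)
idf_everywhere = smooth_idf(n_docs, n_docs)

print(f"idf at df=500:   {idf_rare:.3f}")
print(f"idf at df=25000: {idf_everywhere:.3f}")
```

Terms near the `min_df` floor receive almost five times the weight of a term that appears everywhere, which changes the inputs MultinomialNB sees compared with raw counts.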

12. Using the “best” feature space, try to improve on your model’s performance. What
model/parameter settings do you think will help? Why? What is the result? I changed the smoothing
parameter alpha from its default of 1 to 0.8; the macro F1 increased slightly (0.27 to 0.28), while accuracy stayed at 0.35.

[37]: # code or markdown here

stem = CountVectorizer(binary=False, min_df = 500, stop_words = mystp)


bow_dm = stem.fit_transform(df.pstem)

X = bow_dm.toarray()
y = df['Rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = MultinomialNB(alpha = 0.8)


model.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = model.predict(X_test)
print(classification_report(clf2_expected, clf2_predicted))

# from sklearn.ensemble import AdaBoostClassifier


# NBboost = AdaBoostClassifier(MultinomialNB(),n_estimators=10)
# NBboost.fit(X_train, y_train)
# clf3_predicted = NBboost.predict(X_test)
# print(classification_report(y_test, clf3_predicted))

              precision    recall  f1-score   support

           1       0.45      0.66      0.54      1033
           2       0.20      0.12      0.15       459
           3       0.22      0.16      0.19       512
           4       0.26      0.22      0.24       511
           7       0.23      0.25      0.24       459
           8       0.20      0.18      0.19       535
           9       0.17      0.12      0.14       459
          10       0.50      0.55      0.52      1032

    accuracy                           0.35      5000
   macro avg       0.28      0.28      0.28      5000
weighted avg       0.32      0.35      0.33      5000
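The role of `alpha` can be seen in the smoothed likelihood estimate that multinomial naive Bayes uses for each term within a class, (count + alpha) / (total + alpha * n_features). The sketch below applies that formula to hypothetical counts; only the 786-feature vocabulary size comes from this notebook.

```python
def smoothed_prob(term_count, class_total, n_features, alpha):
    """Additive (Lidstone) smoothing of a per-class term probability."""
    return (term_count + alpha) / (class_total + alpha * n_features)

n_features = 786    # size of the stemmed feature space
class_total = 100   # hypothetical total term count in one class

# A term never seen in the class still gets a non-zero probability
p_unseen_a1 = smoothed_prob(0, class_total, n_features, alpha=1.0)
p_unseen_a08 = smoothed_prob(0, class_total, n_features, alpha=0.8)

# A term seen 10 times in the class
p_seen_a1 = smoothed_prob(10, class_total, n_features, alpha=1.0)
p_seen_a08 = smoothed_prob(10, class_total, n_features, alpha=0.8)

print(p_unseen_a1, p_unseen_a08)
print(p_seen_a1, p_seen_a08)
```

Lowering alpha shifts probability mass away from unseen terms and toward observed ones, which is a small change at 0.8 and is consistent with the small change in macro F1.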

13. Are you able to build a model that will help your streaming service get some market share?
Why or why not? Yes, to a degree. A random guess over the 8 rating classes is correct with probability 1/8 (12.5%).
Our model's accuracy of about 35% is better than this, so it helps.
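The comparison with chance can be made slightly stricter by also checking a majority-class baseline, which always predicts the most common rating in the test set. The sketch below uses the support counts and accuracy from the MultinomialNB report; the majority baseline is an additional check, not something the original analysis computed.

```python
# Test-set support per rating class, from the classification report
support = {1: 1033, 2: 459, 3: 512, 4: 511, 7: 459, 8: 535, 9: 459, 10: 1032}
model_accuracy = 0.3518  # MultinomialNB accuracy on the test set

n_test = sum(support.values())

# Baseline 1: guess uniformly at random among the 8 classes
uniform_baseline = 1 / len(support)

# Baseline 2: always predict the most frequent class
majority_baseline = max(support.values()) / n_test

print(f"uniform guess : {uniform_baseline:.4f}")
print(f"majority class: {majority_baseline:.4f}")
print(f"model         : {model_accuracy:.4f}")
```

The model clears both baselines, though the margin over the majority-class baseline is smaller than the margin over a uniform guess.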

Extra credit (3 points): What terms from the reviews are most strongly related to a movie’s
classification? How do you know this? 'crap' is the most important term according to the decision tree
classifier: it is the first (root) split, and reviews containing it are classified as rating 1.

[38]: # code or markdown here
dt_pru = DecisionTreeClassifier(criterion='gini', min_impurity_decrease = 0.0003,
min_samples_split= 0.0005,
random_state = 12345)
dt_pru.fit(X_train, y_train)

[38]: DecisionTreeClassifier(min_impurity_decrease=0.0003, min_samples_split=0.0005,


random_state=12345)

[41]: from sklearn.tree import export_text


text_tree = export_text(dt_pru, feature_names = stem.get_feature_names())
print(text_tree)

|--- crap <= 0.50


| |--- love <= 0.50
| | |--- poor <= 0.50
| | | |--- hi <= 0.50
| | | | |--- money <= 0.50
| | | | | |--- avoid <= 0.50
| | | | | | |--- plot <= 0.50
| | | | | | | |--- noth <= 0.50
| | | | | | | | |--- pretti <= 0.50
| | | | | | | | | |--- thi <= 2.50
| | | | | | | | | | |--- perfect <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 7
| | | | | | | | | | |--- perfect > 0.50
| | | | | | | | | | | |--- class: 10
| | | | | | | | | |--- thi > 2.50
| | | | | | | | | | |--- veri <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 4
| | | | | | | | | | |--- veri > 0.50
| | | | | | | | | | | |--- class: 10
| | | | | | | | |--- pretti > 0.50
| | | | | | | | | |--- class: 4
| | | | | | | |--- noth > 0.50
| | | | | | | | |--- class: 1
| | | | | | |--- plot > 0.50
| | | | | | | |--- enjoy <= 0.50
| | | | | | | | |--- class: 1
| | | | | | | |--- enjoy > 0.50
| | | | | | | | |--- class: 7
| | | | | |--- avoid > 0.50
| | | | | | |--- class: 1
| | | | |--- money > 0.50
| | | | | |--- class: 1
| | | |--- hi > 0.50
| | | | |--- perfect <= 0.50
| | | | | |--- script <= 0.50

| | | | | | |--- suppos <= 0.50
| | | | | | | |--- plot <= 0.50
| | | | | | | | |--- hi <= 2.50
| | | | | | | | | |--- avoid <= 0.50
| | | | | | | | | | |--- better <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- better > 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | | |--- avoid > 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- hi > 2.50
| | | | | | | | | |--- class: 8
| | | | | | | |--- plot > 0.50
| | | | | | | | |--- class: 4
| | | | | | |--- suppos > 0.50
| | | | | | | |--- class: 2
| | | | | |--- script > 0.50
| | | | | | |--- class: 1
| | | | |--- perfect > 0.50
| | | | | |--- class: 10
| | |--- poor > 0.50
| | | |--- thi <= 2.50
| | | | |--- class: 1
| | | |--- thi > 2.50
| | | | |--- class: 1
| |--- love > 0.50
| | |--- suppos <= 0.50
| | | |--- hi <= 0.50
| | | | |--- noth <= 0.50
| | | | | |--- love <= 1.50
| | | | | | |--- money <= 0.50
| | | | | | | |--- nice <= 0.50
| | | | | | | | |--- script <= 0.50
| | | | | | | | | |--- class: 10
| | | | | | | | |--- script > 0.50
| | | | | | | | | |--- class: 1
| | | | | | | |--- nice > 0.50
| | | | | | | | |--- class: 7
| | | | | | |--- money > 0.50
| | | | | | | |--- class: 1
| | | | | |--- love > 1.50
| | | | | | |--- scene <= 0.50
| | | | | | | |--- class: 10
| | | | | | |--- scene > 0.50
| | | | | | | |--- class: 10
| | | | |--- noth > 0.50
| | | | | |--- class: 1
| | | |--- hi > 0.50

| | | | |--- favorit <= 0.50
| | | | | |--- poorli <= 0.50
| | | | | | |--- problem <= 0.50
| | | | | | | |--- guy <= 0.50
| | | | | | | | |--- class: 10
| | | | | | | |--- guy > 0.50
| | | | | | | | |--- class: 1
| | | | | | |--- problem > 0.50
| | | | | | | |--- class: 4
| | | | | |--- poorli > 0.50
| | | | | | |--- class: 1
| | | | |--- favorit > 0.50
| | | | | |--- class: 10
| | |--- suppos > 0.50
| | | |--- class: 1
|--- crap > 0.50
| |--- veri <= 0.50
| | |--- class: 1
| |--- veri > 0.50
| | |--- class: 1
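Besides reading the printed tree, term importance can be ranked numerically through the fitted tree's `feature_importances_` attribute. The sketch below uses a hypothetical four-review dataset with two made-up term columns, so the numbers illustrate the mechanism only and say nothing about the real model.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical term-count matrix: columns are the terms 'crap' and 'plot'
feature_names = ["crap", "plot"]
X = [
    [1, 0],
    [1, 1],
    [0, 0],
    [0, 1],
]
# Hypothetical ratings: reviews containing 'crap' are rated 1, the rest 10
y = [1, 1, 10, 10]

tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Pair each term with its importance and sort, most important first
ranked = sorted(
    zip(feature_names, tree.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, score in ranked:
    print(f"{name}: {score:.2f}")
```

Applied to the fitted `dt_pru` with the stemmed vocabulary as feature names, the same ranking would list the terms the tree relied on most, with the root split expected near the top.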

0.4 Additional code below


[ ]: # code to set up environment and bring in data

[ ]: # other calculations
