CIS242_HW6_ANSPDF
April 8, 2021
CIS 242
0.1 Spring 2021

0.2 HOMEWORK ASSIGNMENT 6
Please compile your responses using markdown in your Jupyter notebook to answer the questions.
If you prefer, you may also submit a Word or PDF document with the responses along with the PDF or
HTML version of the completed notebook.
Active notebooks (.ipynb files) or raw code (.py files) will NOT be accepted and no points will
be given. The code part of the files will not be graded, but they will be checked if necessary
to verify your findings and recommendations. Point deductions may occur if there are major
discrepancies between your written answers and the output from the code.
Please make sure that your answers are readable and don’t run off the page when the notebook
is converted to HTML or PDF. Questions are worth 2 points each for a total of 26 points.
0.3 Working with movie reviews

You have been hired by a new streaming service to try to decide which movies they should bid
for to get exclusive rights before Netflix gets them. You have access to user reviews from the test
screenings that the studios do.
There are positive and negative reviews for each movie. You want to be able to predict the rating
of a movie based on user reviews so you can help your new company know how to predict “good”
movies based on review terms. You have a data file movie_reviews.csv to work with.
1. Read in the reviews and do some EDA. What can you say about the data? Create an uncon-
strained vector space. How many features do you have?
[3]: # code or markdown here
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 150000) #important for getting all the text
pd.set_option('display.max_columns', 999)
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from IPython.display import Image
import pydotplus
from sklearn.feature_extraction.text import CountVectorizer
import math
[4]: df = pd.read_csv("movie_reviews.csv")
df.shape
[4]: (25000, 3)
[5]: df.head(2)
[5]: Movie ID Rating \

0 1821 4
1 9487 1
Review
0 Alan Rickman & Emma Thompson give performances with southern/New Orleans
accents in this detective flick. It's worth seeing for their scenes- and
Rickman's scene with Hal Holbrook. These three actors mannage to entertain us no
matter what the movie, it seems. The plot for the movie shows potential, but one
gets the impression in watching the film that it was not pulled off as well as
it could have been. The fact that it is cluttered by a rather uninteresting
subplot and mostly uninteresting kidnappers really muddles things. The movie is
worth a view- if for nothing more than entertaining performances by Rickman,
Thompson, and Holbrook.
1 I have seen this movie and I did not care for this movie anyhow. I would not
think about going to Paris because I do not like this country and its national
capital. I do not like to learn french anyhow because I do not understand their
language. Why would I go to France when I rather go to Germany or the United
Kingdom? Germany and the United Kingdom are the nations I tolerate. Apparently
the Olsen Twins do not understand the French language just like me. Therefore I
will not bother the France trip no matter what. I might as well stick to the
United Kingdom and meet single women and play video games if there is a video
arcade. That is all.
[6]: df['Rating'] = df['Rating'].astype("category")
[7]: df['Rating'].unique()
[7]: [4, 1, 2, 3, 9, 10, 7, 8]

Categories (8, int64): [4, 1, 2, 3, 9, 10, 7, 8]
[8]: df['Rating'].hist(bins = 20)
[8]: <matplotlib.axes._subplots.AxesSubplot at 0xc7b2510>
[9]: bow = CountVectorizer(binary = True)
bow_dm = bow.fit_transform(df.Review) #apply the transformation
print(type(bow_dm))
print(bow_dm.shape)
# print(bow.get_feature_names())
<class 'scipy.sparse.csr.csr_matrix'>
(25000, 74899)
There are 25,000 observations (reviews), each with a movie ID, a rating, and the review text. The ratings fall into 8 classes (1-4 and 7-10). The unconstrained vector space has 74,899 features.
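As a minimal sketch of what the unconstrained space looks like, the toy corpus below (invented for illustration, not taken from movie_reviews.csv) is vectorized with the same binary CountVectorizer settings; the number of columns of the resulting matrix is the number of features.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical three-review corpus, used only for illustration.
toy_reviews = [
    "The plot was great and the acting was great",
    "The plot was poor",
    "Great acting, poor script",
]

# binary=True records presence/absence of each token, as in the notebook.
toy_bow = CountVectorizer(binary=True)
toy_dm = toy_bow.fit_transform(toy_reviews)

# Rows are reviews, columns are distinct lowercase tokens.
n_docs, n_features = toy_dm.shape
vocabulary = sorted(toy_bow.vocabulary_)
print(n_docs, n_features)
print(vocabulary)
```

With default settings nothing is filtered out, so every distinct token of two or more characters becomes a feature, which is why the real corpus produces tens of thousands of columns.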
2. What are the top 20 words in the feature space? Are you surprised by this? Why or why not?
sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-20:]
ftrs = bow.get_feature_names()
[ftrs[i] for i in ids]
[10]: ['be',
'have',
'one',
'not',
'movie',
'on',
'as',
'was',
'with',
'for',
'but',
'that',
'in',
'it',
'is',
'this',
'to',
'of',
'and',
'the']
This is not surprising: apart from "movie", these are common function words that appear in almost any English text, so they occur in most reviews regardless of rating.
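The ranking step can be sketched on a small example. Because the matrix is binary, each column sum is the number of reviews containing that term. The corpus below is invented, and the stable descending sort is a choice made for this sketch so that ties stay in alphabetical order.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration.
toy_reviews = [
    "the movie was good",
    "the movie was bad",
    "the plot was thin",
    "the acting",
]
toy_bow = CountVectorizer(binary=True)
toy_dm = toy_bow.fit_transform(toy_reviews)

# Column sums of a binary matrix are document frequencies.
doc_freq = np.asarray(toy_dm.sum(axis=0)).ravel()

# Term names ordered by column index.
terms = sorted(toy_bow.vocabulary_, key=toy_bow.vocabulary_.get)

# Sort by descending frequency; the stable sort keeps ties alphabetical.
order = np.argsort(-doc_freq, kind="stable")
top_terms = [(terms[i], int(doc_freq[i])) for i in order[:3]]
print(top_terms)
```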
3. Create a feature space that removes stopwords. What is the size of the space now? What are
the top words? Is this an improvement in size or content? Why or why not?
The size is now 74,588 features, down from 74,899. The reduction in size is small, but the top words are more meaningful, so this is an improvement mainly in content.
bow = CountVectorizer(stop_words='english',binary = True)
bow_dm = bow.fit_transform(df.Review) # refit with English stopwords removed
print(type(bow_dm))
print(bow_dm.shape)
(25000, 74588)
[12]: sum_words = bow_dm.sum(axis=0)
[12]: ['character',
'know',
'plot',
'characters',
'think',
'acting',
'seen',
'movies',
'watch',
'way',
'people',
'make',
'don',
'story',
'really',
'time',
'just',
'like',
'film',
'movie']
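4. Create a feature space that requires a minimum frequency for a term as well as removing
stopwords. What limit will you choose? Why? What is the effect on the feature space?
I choose a minimum document frequency of 100. With a lower limit, many features appear in only a few reviews; with a much higher limit, too few features remain. The quantiles below show that 95% of terms appear in at most 92 reviews, and the limit reduces the feature space from 74,588 to 3,496 features.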

np.quantile(sum_words, 0.5)
[13]: 2.0
[14]: np.quantile(sum_words, 0.9)
[14]: 37.0
[15]: np.quantile(sum_words, 0.95)
[15]: 92.0
[16]: bow = CountVectorizer(stop_words='english', min_df = 100,binary = True)
bow_dm = bow.fit_transform(df.Review) # refit with the minimum document frequency
print(type(bow_dm))
print(bow_dm.shape)
(25000, 3496)
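The effect of min_df can be seen on a small scale. The sketch below uses an invented four-review corpus and counts how many terms survive as the threshold rises; an integer min_df is an absolute number of documents.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical corpus, used only for illustration.
toy_reviews = [
    "great plot great cast",
    "great plot weak ending",
    "weak plot dull cast",
    "dull script",
]

# Vocabulary size for each minimum document frequency.
sizes = {}
for min_df in (1, 2, 3):
    vec = CountVectorizer(binary=True, min_df=min_df)
    vec.fit(toy_reviews)
    sizes[min_df] = len(vec.vocabulary_)
print(sizes)
```

The same trade-off applies to the real data: a low threshold keeps rare terms that carry little signal, while a high one can discard most of the vocabulary.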
5. Investigate this feature space. What tokens would you like to remove that remain? Why?
(Or Why not?)
Some meaningless tokens remain, such as contraction fragments ('ll', 've') and numbers ('10'). We exclude tokens of length 2 or less to remove them.
[18]: ['having',
'trying',
'original',
'horror',
'performance',
'fun',
'screen',
'believe',
'worth',
'tv',
'action',
'especially',
'looking',
'sure',
'hard',
'kind',
'minutes',
'comedy',
'guy',
'll',
'away',
'script',
'probably',
'feel',
'role',
'making',
'bit',
'music',
'point',
'far',
'gets',
'young',
'interesting',
'isn',
'times',
'saw',
'right',
'world',
'come',
'big',
'fact',
'pretty',
'got',
'quite',
'long',
'new',
'thought',
'things',
'cast',
'want',
'funny',
'old',
'lot',
'10',
'work',
'going',
'look',
'actually',
'years',
'makes',
'director',
'doesn',
'didn',
'actors',
'real',
'thing',
'watching',
've',
'scene',
'scenes',
'man',
'end',
'say',
'does',
'life',
'love',
'films',
'little',
'better',
'did',
'character',
'know',
'plot',
'characters',
'think',
'acting',
'seen',
'movies',
'watch',
'way',
'people',
'make',
'don',
'story',
'really',
'time',
'just',
'like',
'film',
'movie']
6. Create custom stopword list to remove the tokens. How does this affect your feature space?
If you choose not to remove any more tokens, explain why your feature space is appropriate as is -
be specific.
This reduces the number of features again, from 3,496 to 3,446.
[19]: # code or markdown here

from sklearn.feature_extraction import text
# custom stopwords: every token of length 2 or less
stop_wds = [i for i in ftrs if len(i) <= 2]

skl_stopwords = list(text.ENGLISH_STOP_WORDS)
mystp = skl_stopwords + stop_wds
stopmin = CountVectorizer(binary=True, min_df = 100, stop_words = mystp)
bow_dm = stopmin.fit_transform(df.Review)
print(bow_dm.shape)
(25000, 3446)
[19]: ['cause',
'just',
'philosophy',
'causes',
'technical',
'absurd',
'scared',
'miserably',
'videos',
'views',
'owner',
'losing',
'discovers',
'starred',
'quickly',
'test',
'jazz',
'leading',
'fears',
'miscast']
7. Investigate the feature space further. What other changes would you like to make? Why?
Choose custom dictionary replacement or stemming and implement your choice. How does
that affect your feature space now?
I use stemming. This step was actually done afterwards, and I set the minimum frequency to 500 because otherwise the notebook runs out of memory. The feature space now has 786 features.

from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
df['pstem'] = df["Review"].apply(lambda x: [ps.stem(y) for y in x.split()])
df['pstem']= [" ".join(token) for token in df['pstem']]
stem = CountVectorizer(binary=False, min_df = 500, stop_words = mystp)

bow_dm = stem.fit_transform(df.pstem)
print(bow_dm.shape)
(25000, 786)
[20]: ['appreciated',
'broken',
'clue',
'broad',
'african',
'christian',
'cartoon',
'cameos',
'curious',
'crisis',
'brown',
'cousin',
'believes',
'boredom',
'bright',
'billy',
'cameo',
'band',
'crying',
'count']
8. Choose the feature space you think most appropriate to use in a model. Now we want to
use this to predict the audience feeling for a movie. How will you do that? What target variable
will help decide which movies to buy? How will you implement this?
The Rating variable is the target: it reflects audience feeling, and higher-rated movies are usually more popular, so it helps decide which movies to buy. I will implement this with a classification model that predicts the rating from the review terms.
9. Train at least two classification models using your feature space. Which had the highest
predictive accuracy? What did it do well? Not so well?
MultinomialNB has the higher accuracy (0.35 versus 0.28 for the decision tree). It does relatively well on ratings 1 and 10 and worse on the other classes.

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
X = bow_dm.toarray()
y = df['Rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = model.predict(X_test)
[24]: accuracy_score(clf2_expected, clf2_predicted)
[24]: 0.3518
[25]: print(classification_report(clf2_expected, clf2_predicted))
              precision    recall  f1-score   support

           1       0.45      0.66      0.54      1033
           2       0.20      0.12      0.15       459
           3       0.22      0.16      0.18       512
           4       0.26      0.22      0.24       511
           7       0.23      0.25      0.24       459
           8       0.20      0.18      0.19       535
           9       0.18      0.12      0.14       459
          10       0.50      0.55      0.52      1032

    accuracy                           0.35      5000
   macro avg       0.28      0.28      0.27      5000
weighted avg       0.32      0.35      0.33      5000
[26]: from sklearn.tree import DecisionTreeClassifier

import warnings
warnings.filterwarnings("ignore")
dt_pru = DecisionTreeClassifier(criterion='gini', min_impurity_decrease = 0.0003,
                                min_samples_split= 0.0005,
                                random_state = 12345)
dt_pru.fit(X_train, y_train)
clf2_predicted = dt_pru.predict(X_test)
[27]: accuracy_score(clf2_expected, clf2_predicted)
[27]: 0.2814
[29]: print(classification_report(clf2_expected, clf2_predicted))
              precision    recall  f1-score   support

           1       0.32      0.58      0.41      1033
           2       0.17      0.02      0.04       459
           3       0.00      0.00      0.00       512
           4       0.18      0.14      0.16       511
           7       0.15      0.05      0.07       459
           8       0.21      0.10      0.13       535
           9       0.00      0.00      0.00       459
          10       0.28      0.63      0.39      1032

    accuracy                           0.28      5000
   macro avg       0.17      0.19      0.15      5000
weighted avg       0.20      0.28      0.21      5000
10. What is the appropriate metric for evaluating these models? Why? How did your previous
models do when using that measure?
The macro F1 score is a better metric because it weights all classes equally and combines precision and recall. On the previous models, macro F1 leads to the same ranking as accuracy: MultinomialNB (0.27) beats the decision tree (0.15).
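Macro F1 is the unweighted mean of the per-class F1 scores, while the weighted average uses class support as weights. A small sketch of that arithmetic, using the rounded values printed in the MultinomialNB report earlier in this notebook (because the inputs are rounded, the result can differ from the report in the last digit):

```python
# Per-class F1 and support copied from the MultinomialNB classification report.
f1_by_class = {1: 0.54, 2: 0.15, 3: 0.18, 4: 0.24, 7: 0.24, 8: 0.19, 9: 0.14, 10: 0.52}
support = {1: 1033, 2: 459, 3: 512, 4: 511, 7: 459, 8: 535, 9: 459, 10: 1032}

# Macro average: every class counts equally, regardless of its size.
macro_f1 = sum(f1_by_class.values()) / len(f1_by_class)

# Weighted average: each class counts in proportion to its support.
total = sum(support.values())
weighted_f1 = sum(f1_by_class[c] * support[c] for c in f1_by_class) / total

print(round(macro_f1, 3), round(weighted_f1, 3))
```

Because ratings 1 and 10 have about twice the support of the other classes, the weighted average and accuracy are pulled up by those two classes, whereas the macro average exposes the weak middle ratings.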
11. Create an alternative feature space. You can change any aspect. Run your two models on
this feature space. Does it improve your models’ performance for your chosen metric?
No. I use TfidfVectorizer; the better classifier is still MultinomialNB, and its accuracy increases (0.35 to 0.37), but its macro F1 is lower (0.27 to 0.17).
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf1 = TfidfVectorizer(min_df = 500, stop_words = mystp)
bow_dm = tfidf1.fit_transform(df.pstem)
model = MultinomialNB()
print(classification_report(clf2_expected, clf2_predicted))
              precision    recall  f1-score   support

           1       0.39      0.86      0.53      1033
           2       0.00      0.00      0.00       459
           3       0.18      0.01      0.01       512
           4       0.32      0.12      0.18       511
           7       0.32      0.03      0.06       459
           8       0.24      0.07      0.11       535
           9       0.00      0.00      0.00       459
          10       0.36      0.80      0.50      1032

    accuracy                           0.37      5000
   macro avg       0.23      0.24      0.17      5000
weighted avg       0.26      0.37      0.25      5000
[31]: dt_pru = DecisionTreeClassifier(criterion='gini', min_impurity_decrease = 0.0003,

clf2_predicted = dt_pru.predict(X_test)
              precision    recall  f1-score   support

           1       0.29      0.62      0.39      1033
           2       0.09      0.00      0.00       459
           3       0.17      0.04      0.07       512
           4       0.20      0.17      0.18       511
           7       0.10      0.02      0.04       459
           8       0.13      0.02      0.03       535
           9       0.00      0.00      0.00       459
          10       0.30      0.58      0.40      1032

    accuracy                           0.27      5000
   macro avg       0.16      0.18      0.14      5000
weighted avg       0.19      0.27      0.20      5000
12. Using the “best” feature space, try to improve on your model’s performance. What
model/parameter settings do you think will help? Why? What is the result?
I lower the MultinomialNB smoothing parameter alpha from the default of 1 to 0.8; the macro F1 increases slightly (0.27 to 0.28).
stem = CountVectorizer(binary=False, min_df = 500, stop_words = mystp)

bow_dm = stem.fit_transform(df.pstem)
model = MultinomialNB(alpha = 0.8)

# from sklearn.ensemble import AdaBoostClassifier

# NBboost = AdaBoostClassifier(MultinomialNB(),n_estimators=10)
# NBboost.fit(X_train, y_train)
# clf3_predicted = NBboost.predict(X_test)
# print(classification_report(y_test, clf3_predicted))
              precision    recall  f1-score   support

           1       0.45      0.66      0.54      1033
           2       0.20      0.12      0.15       459
           3       0.22      0.16      0.19       512
           4       0.26      0.22      0.24       511
           7       0.23      0.25      0.24       459
           8       0.20      0.18      0.19       535
           9       0.17      0.12      0.14       459
          10       0.50      0.55      0.52      1032

    accuracy                           0.35      5000
   macro avg       0.28      0.28      0.28      5000
weighted avg       0.32      0.35      0.33      5000
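Instead of trying one alpha value by hand, a cross-validated grid search can compare several values on the chosen metric. The sketch below is an illustration under assumptions: the toy reviews and ratings are invented and the alpha grid is arbitrary. It is not the procedure used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical two-class corpus, used only for illustration.
toy_reviews = [
    "great film loved it", "wonderful acting great story", "loved the cast",
    "great fun loved every scene", "wonderful film", "great story",
    "awful film hated it", "terrible acting poor story", "hated the cast",
    "poor script awful scenes", "terrible film", "awful story",
]
toy_ratings = [10] * 6 + [1] * 6

# Vectorizer and classifier in one pipeline, so each fold refits the vocabulary.
pipe = Pipeline([("bow", CountVectorizer()), ("nb", MultinomialNB())])

# Score every candidate alpha by macro F1 with 3-fold stratified cross-validation.
alpha_grid = [0.1, 0.5, 0.8, 1.0]
search = GridSearchCV(pipe, {"nb__alpha": alpha_grid}, scoring="f1_macro", cv=3)
search.fit(toy_reviews, toy_ratings)

best_alpha = search.best_params_["nb__alpha"]
print(best_alpha, round(search.best_score_, 3))
```

Scoring on cross-validation folds rather than on the held-out test set keeps the test set available for a final, unbiased comparison.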
13. Are you able to build a model that will help your streaming service get some market share?
Why or why not?
Yes, to a degree. Random guessing among the eight ratings is correct with probability 1/8 (12.5%), while our best model reaches about 35% accuracy, so it is better than chance and provides some help.
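The comparison against chance can be made explicit. The sketch below uses the test-set supports from the classification reports. Besides uniform guessing, it also computes the accuracy of always predicting the most common rating, which is an additional baseline not discussed in the original answer.

```python
# Test-set class supports copied from the classification reports.
support = {1: 1033, 2: 459, 3: 512, 4: 511, 7: 459, 8: 535, 9: 459, 10: 1032}
total = sum(support.values())

# Baseline 1: guess uniformly among the eight ratings.
uniform_guess = 1 / len(support)

# Baseline 2: always predict the most frequent rating.
majority_class = max(support, key=support.get)
majority_baseline = support[majority_class] / total

# Accuracy reported for MultinomialNB on the count features.
model_accuracy = 0.3518

print(uniform_guess, majority_class, round(majority_baseline, 4), model_accuracy)
```

The model beats both baselines, but the margin over the majority baseline is smaller than the margin over uniform guessing, which is worth keeping in mind when judging how much it helps.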
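Extra credit (3 points): What terms from the reviews are most strongly related to a movie’s
classification? How do you know this?
"crap" is the most important term according to the decision tree classifier: it is the first split at the root of the tree, and every review containing it is assigned to class 1.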
dt_pru = DecisionTreeClassifier(criterion='gini', min_impurity_decrease = 0.0003,
[38]: DecisionTreeClassifier(min_impurity_decrease=0.0003, min_samples_split=0.0005,

random_state=12345)
[41]: from sklearn.tree import export_text

text_tree = export_text(dt_pru, feature_names = stem.get_feature_names())
print(text_tree)
|--- crap <= 0.50

| |--- love <= 0.50
| | |--- poor <= 0.50
| | | |--- hi <= 0.50
| | | | |--- money <= 0.50
| | | | | |--- avoid <= 0.50
| | | | | | |--- plot <= 0.50
| | | | | | | |--- noth <= 0.50
| | | | | | | | |--- pretti <= 0.50
| | | | | | | | | |--- thi <= 2.50
| | | | | | | | | | |--- perfect <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 7
| | | | | | | | | | |--- perfect > 0.50
| | | | | | | | | | | |--- class: 10
| | | | | | | | | |--- thi > 2.50
| | | | | | | | | | |--- veri <= 0.50
| | | | | | | | | | |--- veri > 0.50
| | | | | | | | | | | |--- class: 10
| | | | | | | | |--- pretti > 0.50
| | | | | | | | | |--- class: 4
| | | | | | | |--- noth > 0.50
| | | | | | | | |--- class: 1
| | | | | | |--- plot > 0.50
| | | | | | | |--- enjoy <= 0.50
| | | | | | | | |--- class: 1
| | | | | | | |--- enjoy > 0.50
| | | | | | | | |--- class: 7
| | | | | |--- avoid > 0.50
| | | | | | |--- class: 1
| | | | |--- money > 0.50
| | | | | |--- class: 1
| | | |--- hi > 0.50
| | | | |--- perfect <= 0.50
| | | | | |--- script <= 0.50
| | | | | | |--- suppos <= 0.50
| | | | | | | |--- plot <= 0.50
| | | | | | | | |--- hi <= 2.50
| | | | | | | | | |--- avoid <= 0.50
| | | | | | | | | | |--- better <= 0.50
| | | | | | | | | | |--- better > 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | | |--- avoid > 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- hi > 2.50
| | | | | | | | | |--- class: 8
| | | | | | | |--- plot > 0.50
| | | | | | | | |--- class: 4
| | | | | | |--- suppos > 0.50
| | | | | | | |--- class: 2
| | | | | |--- script > 0.50
| | | | | | |--- class: 1
| | | | |--- perfect > 0.50
| | | | | |--- class: 10
| | |--- poor > 0.50
| | | |--- thi <= 2.50
| | | | |--- class: 1
| | | |--- thi > 2.50
| | | | |--- class: 1
| |--- love > 0.50
| | |--- suppos <= 0.50
| | | |--- hi <= 0.50
| | | | |--- noth <= 0.50
| | | | | |--- love <= 1.50
| | | | | | |--- money <= 0.50
| | | | | | | |--- nice <= 0.50
| | | | | | | | |--- script <= 0.50
| | | | | | | | | |--- class: 10
| | | | | | | | |--- script > 0.50
| | | | | | | | | |--- class: 1
| | | | | | | |--- nice > 0.50
| | | | | | | | |--- class: 7
| | | | | | |--- money > 0.50
| | | | | | | |--- class: 1
| | | | | |--- love > 1.50
| | | | | | |--- scene <= 0.50
| | | | | | | |--- class: 10
| | | | | | |--- scene > 0.50
| | | | | | | |--- class: 10
| | | | |--- noth > 0.50
| | | | | |--- class: 1
| | | |--- hi > 0.50
| | | | |--- favorit <= 0.50
| | | | | |--- poorli <= 0.50
| | | | | | |--- problem <= 0.50
| | | | | | | |--- guy <= 0.50
| | | | | | | | |--- class: 10
| | | | | | | |--- guy > 0.50
| | | | | | | | |--- class: 1
| | | | | | |--- problem > 0.50
| | | | | | | |--- class: 4
| | | | | |--- poorli > 0.50
| | | | | | |--- class: 1
| | | | |--- favorit > 0.50
| | | | | |--- class: 10
| | |--- suppos > 0.50
| | | |--- class: 1
|--- crap > 0.50
| |--- veri <= 0.50
| | |--- class: 1
| |--- veri > 0.50
| | |--- class: 1
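Besides reading the printout, one way to check which terms a tree relies on is to rank the fitted tree's feature_importances_. A minimal sketch on an invented six-review corpus follows; the data and variable names are hypothetical. The get_feature_names() method used in this notebook was deprecated and later removed in newer scikit-learn releases, so the sketch reads term names from vocabulary_ instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Hypothetical corpus: only "crap" separates the two ratings perfectly.
toy_reviews = [
    "crap movie plot", "crap acting movie", "crap story plot",
    "love movie plot", "great acting movie", "love story plot",
]
toy_ratings = [1, 1, 1, 10, 10, 10]

vec = CountVectorizer()
X_toy = vec.fit_transform(toy_reviews)

tree = DecisionTreeClassifier(criterion="gini", random_state=12345)
tree.fit(X_toy, toy_ratings)

# Term names ordered by column index, paired with impurity-based importances.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)
ranked = sorted(zip(terms, tree.feature_importances_), key=lambda pair: -pair[1])
top_term, top_importance = ranked[0]
print(top_term, round(float(top_importance), 3))
```

Impurity-based importances summarize the whole tree rather than only the root split, so they can complement a reading of the printed tree.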
0.4 Additional code below

[ ]: # code to set up environment and bring in data
[ ]: # other calculations