April 8, 2021
CIS 242
Active notebooks (.ipynb files) or raw code (.py files) will NOT be accepted and no points will
be given. The code part of the files will not be graded, but they will be checked if necessary
to verify your findings and recommendations. Point deductions may occur if there are major
discrepancies between your written answers and the output from the code.
Please make sure that your answers are readable and don’t run off the page when the notebook
is converted to HTML or PDF. Questions are worth 2 points each for a total of 26 points.
1. Read in the reviews and do some EDA. What can you say about the data? Create an unconstrained
vector space. How many features do you have?
[3]: # code or markdown here
import pandas as pd
import numpy as np
pd.set_option('display.max_colwidth', 150000) #important for getting all the text
pd.set_option('display.max_columns', 999)
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from IPython.display import Image
import pydotplus
from sklearn.feature_extraction.text import CountVectorizer
import math
[4]: df = pd.read_csv("movie_reviews.csv")
df.shape
[4]: (25000, 3)
[5]: df.head(2)
Review
0 Alan Rickman & Emma Thompson give performances with southern/New Orleans
accents in this detective flick. It's worth seeing for their scenes- and
Rickman's scene with Hal Holbrook. These three actors mannage to entertain us no
matter what the movie, it seems. The plot for the movie shows potential, but one
gets the impression in watching the film that it was not pulled off as well as
it could have been. The fact that it is cluttered by a rather uninteresting
subplot and mostly uninteresting kidnappers really muddles things. The movie is
worth a view- if for nothing more than entertaining performances by Rickman,
Thompson, and Holbrook.
1 I have seen this movie and I did not care for this movie anyhow. I would not
think about going to Paris because I do not like this country and its national
capital. I do not like to learn french anyhow because I do not understand their
language. Why would I go to France when I rather go to Germany or the United
Kingdom? Germany and the United Kingdom are the nations I tolerate. Apparently
the Olsen Twins do not understand the French language just like me. Therefore I
will not bother the France trip no matter what. I might as well stick to the
United Kingdom and meet single women and play video games if there is a video
arcade. That is all.
[7]: df['Rating'].unique()
[9]: bow = CountVectorizer(binary = True)
bow_dm = bow.fit_transform(df.Review) #apply the transformation
print(type(bow_dm))
print(bow_dm.shape)
# print(bow.get_feature_names())
<class 'scipy.sparse.csr.csr_matrix'>
(25000, 74899)
There are 25,000 observations, each with a review and a rating; the ratings fall into 8 classes. The unconstrained vector space has 74,899 features.
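A minimal EDA sketch along these lines is shown below. It runs on a small toy frame rather than on movie_reviews.csv; the 'Review' and 'Rating' column names follow the cells above, and the toy rows are invented for illustration only.

```python
import pandas as pd

# Toy stand-in for movie_reviews.csv; the real file is assumed to have
# 'Review' and 'Rating' columns as used elsewhere in this notebook.
toy = pd.DataFrame({
    "Review": ["Great film, loved it", "Terrible plot",
               "Loved the acting", "Boring and far too long"],
    "Rating": [10, 1, 9, 2],
})

# Class balance: how many reviews fall in each rating.
rating_counts = toy["Rating"].value_counts().sort_index()

# Review length in whitespace-separated words.
word_counts = toy["Review"].str.split().str.len()

# Missing values and exact duplicate reviews.
n_missing = int(toy["Review"].isna().sum())
n_duplicates = int(toy["Review"].duplicated().sum())

print(rating_counts)
print(word_counts.describe())
print(n_missing, n_duplicates)
```

On the real data, the rating counts would show whether the 8 classes are balanced, which matters later when choosing an evaluation metric.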
2. What are the top 20 words in the feature space? Are you surprised by this? Why or why not?
[10]: # code or markdown here
sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-20:]
ftrs = bow.get_feature_names()
[ftrs[i] for i in ids]
[10]: ['be',
'have',
'one',
'not',
'movie',
'on',
'as',
'was',
'with',
'for',
'but',
'that',
'in',
'it',
'is',
'this',
'to',
'of',
'and',
'the']
It is not surprising, because these are mostly common function words used every day and everywhere, and 'movie' is to be expected in movie reviews.
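One detail worth keeping in mind when reading the ranking: because the vectorizer was built with binary = True, the column sums count the number of reviews containing a word, not the total number of occurrences. The sketch below illustrates the difference on three invented sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the movie the movie the movie", "the plot", "a plot twist"]

# binary=True: each cell is 0/1, so column sums are document frequencies.
binary_vec = CountVectorizer(binary=True)
doc_freq = np.asarray(binary_vec.fit_transform(docs).sum(axis=0)).ravel()

# binary=False (the default): column sums are total occurrences.
count_vec = CountVectorizer()
total_count = np.asarray(count_vec.fit_transform(docs).sum(axis=0)).ravel()

# Terms ordered by their column index in the matrix.
vocab = sorted(binary_vec.vocabulary_, key=binary_vec.vocabulary_.get)
for term, n_docs, n_total in zip(vocab, doc_freq, total_count):
    print(term, n_docs, n_total)
```

Here 'movie' appears three times but in only one document, so it ranks lower under the binary representation than under raw counts.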
3. Create a feature space that removes stopwords. What is the size of the space now? What are
the top words? Is this an improvement in size or content? Why or why not?
The size is now 74,588, down from 74,899, and the top words are more meaningful, so this is an
improvement in both size and content.
<class 'scipy.sparse.csr.csr_matrix'>
(25000, 74588)
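The cell that built this space is not shown. A minimal sketch of stopword removal, assuming scikit-learn's built-in English list (stop_words="english") was used, could look like this on a few invented sentences:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["this is the best movie", "the plot was not good", "a movie with a good plot"]

# Same vectorizer with and without the built-in English stopword list.
all_terms = CountVectorizer(binary=True).fit(docs)
no_stop = CountVectorizer(binary=True, stop_words="english").fit(docs)

# Terms that disappear once stopwords are filtered out.
removed = sorted(set(all_terms.vocabulary_) - set(no_stop.vocabulary_))
print(len(all_terms.vocabulary_), len(no_stop.vocabulary_))
print(removed)
```

The stopword list is short compared with a vocabulary of tens of thousands of terms, so the size barely changes; the gain is mostly in which terms rise to the top.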
[12]: sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-20:]
ftrs = bow.get_feature_names()
[ftrs[i] for i in ids]
[12]: ['character',
'know',
'plot',
'characters',
'think',
'acting',
'seen',
'movies',
'watch',
'way',
'people',
'make',
'don',
'story',
'really',
'time',
'just',
'like',
'film',
'movie']
4. Create a feature space that requires a minimum frequency for a term as well as removing
stopwords. What limit will you choose? Why? What is the effect on the feature space?
I chose a minimum frequency of 100. With a lower limit, the space keeps features that appear in
only a few reviews; with a much higher limit, too few features remain. The feature space shrinks
to 3,496 features.
[13]: 2.0
[14]: 37.0
[15]: 92.0
<class 'scipy.sparse.csr.csr_matrix'>
(25000, 3496)
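The code for this step is not shown. A minimal sketch, assuming the limit was applied through CountVectorizer's min_df parameter, is given below; the toy corpus uses min_df=2 because it has only four documents, whereas the answer above uses 100.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "great acting and great story",
    "weak story, weak acting",
    "great soundtrack",
    "story drags",
]

# min_df as an integer is an absolute document count;
# a float in [0.0, 1.0] would be read as a proportion of documents.
vec = CountVectorizer(binary=True, stop_words="english", min_df=2)
dm = vec.fit_transform(docs)

kept = sorted(vec.vocabulary_)
print(dm.shape, kept)
```

Only terms found in at least two documents survive, which is the same mechanism that cuts the real space from tens of thousands of features to a few thousand.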
5. Investigate this feature space. What tokens would you like to remove that remain? Why?
(Or Why not?)
Some meaningless tokens and numbers remain (for example 'll', 've' and '10'). We use a token
length of 2 as the cutoff to exclude them.
sum_words = bow_dm.sum(axis=0)
sum_words = sum_words.tolist()[0]
ids = np.argsort(sum_words)[-100:]
ftrs = bow.get_feature_names()
[ftrs[i] for i in ids]
[18]: ['having',
'trying',
'original',
'horror',
'performance',
'fun',
'screen',
'believe',
'worth',
'tv',
'action',
'especially',
'looking',
'sure',
'hard',
'kind',
'minutes',
'comedy',
'guy',
'll',
'away',
'script',
'probably',
'feel',
'role',
'making',
'bit',
'music',
'point',
'far',
'gets',
'young',
'interesting',
'isn',
'times',
'saw',
'right',
'world',
'come',
'big',
'fact',
'pretty',
'got',
'quite',
'long',
'new',
'thought',
'things',
'cast',
'want',
'funny',
'old',
'lot',
'10',
'work',
'going',
'look',
'actually',
'years',
'makes',
'director',
'doesn',
'didn',
'actors',
'real',
'thing',
'watching',
've',
'scene',
'scenes',
'man',
'end',
'say',
'does',
'life',
'love',
'films',
'little',
'better',
'did',
'character',
'know',
'plot',
'characters',
'think',
'acting',
'seen',
'movies',
'watch',
'way',
'people',
'make',
'don',
'story',
'really',
'time',
'just',
'like',
'film',
'movie']
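How the length cutoff was implemented is not shown. One possible sketch uses CountVectorizer's token_pattern; the exact rule here (purely alphabetic tokens of three or more letters) is an assumption, not the setting used in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I'll give it 10 out of 10", "They've seen it, we didn't"]

default_vec = CountVectorizer().fit(docs)
# Assumed rule: keep only purely alphabetic tokens of three or more letters.
strict_vec = CountVectorizer(token_pattern=r"(?u)\b[a-zA-Z]{3,}\b").fit(docs)

print(sorted(default_vec.vocabulary_))
print(sorted(strict_vec.vocabulary_))
```

The stricter pattern drops '10', 'll' and 've', but longer contraction fragments such as 'didn' still pass the length test and need a custom stopword list.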
6. Create custom stopword list to remove the tokens. How does this affect your feature space?
If choose not to remove any more tokens, explain why your feature space is appropriate as is -
be specific.
This reduces the number of features again, from 3,496 to 3,446.
(25000, 3446)
[19]: ['cause',
'just',
'philosophy',
'causes',
'technical',
'absurd',
'scared',
'miserably',
'videos',
'views',
'owner',
'losing',
'discovers',
'starred',
'quickly',
'test',
'jazz',
'leading',
'fears',
'miscast']
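The custom list itself is not shown. A minimal sketch of the idea, assuming the extra tokens are added on top of scikit-learn's English list, is below; the six extra tokens are hypothetical picks taken from the top-100 output in question 5.

```python
from sklearn.feature_extraction.text import CountVectorizer, ENGLISH_STOP_WORDS

# Hypothetical additions: contraction fragments seen in the top-100 list.
extra_stop_words = {"don", "didn", "doesn", "isn", "ll", "ve"}
custom_stop_words = list(ENGLISH_STOP_WORDS.union(extra_stop_words))

docs = ["I don't think the plot works", "It didn't work, but I've seen worse"]
vec = CountVectorizer(binary=True, stop_words=custom_stop_words).fit(docs)
print(sorted(vec.vocabulary_))
```

Each token added to the list removes exactly one column from the feature space, so the effect on size is small and the benefit is mainly cleaner features.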
7. Investigate the feature space further. What other changes you would like to make? Why?
Choose custom dictionary replacement or stemming and implement your choice. How does
that affect your feature space now?
I use stemming. This step is actually applied afterwards, and I set the minimum frequency to
500; otherwise the notebook runs out of memory.
(25000, 786)
[20]: ['appreciated',
'broken',
'clue',
'broad',
'african',
'christian',
'cartoon',
'cameos',
'curious',
'crisis',
'brown',
'cousin',
'believes',
'boredom',
'bright',
'billy',
'cameo',
'band',
'crying',
'count']
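The stemming code is not shown. The sketch below illustrates how a stemmer can be plugged into CountVectorizer through a custom analyzer. The toy_stem function is a deliberately crude, hypothetical stand-in for a real stemmer (for example a Porter stemmer) and is used only to keep the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer

def toy_stem(token):
    """Crude suffix stripper; a stand-in for a real stemmer, not Porter."""
    for suffix in ("ing", "ed", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

# Reuse the standard preprocessing, tokenizing and stopword filtering.
base_analyzer = CountVectorizer(stop_words="english").build_analyzer()

def stemmed_analyzer(doc):
    # Tokenize and drop stopwords first, then stem each surviving token.
    return [toy_stem(tok) for tok in base_analyzer(doc)]

docs = ["The character acted", "Characters acting", "Acts and scenes"]
plain = CountVectorizer(stop_words="english").fit(docs)
stemmed = CountVectorizer(analyzer=stemmed_analyzer).fit(docs)

print(len(plain.vocabulary_), len(stemmed.vocabulary_))
print(sorted(stemmed.vocabulary_))
```

Stemming merges variants such as 'character' and 'characters' into one feature. Filtering stopwords before stemming, as in this sketch, also keeps stems of stopwords out of the space; stems such as 'thi' and 'veri' in the tree printout at the end of this notebook may have slipped through because of the order of those two steps.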
8. Choose the feature space you think most appropriate to use in a model. Now we want to
use this predict the audience feeling for a movie. How will you do that? What target variable
will help decide which movies to buy? How will you impliment this?
The Rating variable helps decide which movies to buy, since movies with higher ratings are
usually more popular. I will implement this with a classification model that predicts the rating
from the review text.
9. Train at least two classification models using your feature space. Which had the highest
predictive accuracy? What did it do well? Not so well?
MultinomialNB has the higher accuracy (0.3518 versus 0.2814 for the decision tree). It does
relatively well on classes 1 and 2 and worse on the other classes.
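from sklearn.naive_bayes import MultinomialNB

# Fit multinomial Naive Bayes on the training split and predict the test split.
model = MultinomialNB()
model.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = model.predict(X_test)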
[24]: 0.3518
random_state = 12345)
dt_pru.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = dt_pru.predict(X_test)
[27]: 0.2814
10. What is the appropriate metric for evaluating these models? Why? How did your previous
models do when using that measure?
Macro F1 is a better metric because it weights all classes equally and accounts for both
precision and recall. For the previous models, macro F1 led to the same decision as accuracy.
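A small worked example of why the two metrics can disagree, using invented labels in which one class dominates and a classifier that always predicts that class:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy imbalanced labels: class 1 dominates, class 10 is rare.
y_true = [1, 1, 1, 1, 1, 1, 1, 1, 10, 10]
# A lazy classifier that always predicts the majority class.
y_pred = [1] * 10

acc = accuracy_score(y_true, y_pred)
# zero_division=0 scores the never-predicted class as 0 without a warning.
macro_f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(acc, macro_f1)
```

Accuracy is 0.8 here, while macro F1 is the mean of the per-class F1 scores (8/9 for class 1 and 0 for class 10), which is 4/9. The ignored class pulls the macro average down even though most predictions are correct.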
11. Create an alternative feature space. You can change any aspect. Run your two models on
this feature space. Does it improve your models’ performance for your chosen metric?
No. I used TfidfVectorizer. The better classifier is still MultinomialNB; its accuracy increased,
but its macro F1 is lower.
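from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Dense copy of the document-term matrix and the rating labels.
X = bow_dm.toarray()
y = df['Rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = MultinomialNB()
model.fit(X_train, y_train)
clf2_expected = y_test
clf2_predicted = model.predict(X_test)
print(classification_report(clf2_expected, clf2_predicted))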
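The cell that builds the TF-IDF matrix is not shown. As a sketch of one way to set it up, the example below puts TfidfVectorizer and MultinomialNB in a single pipeline. The reviews and ratings are invented stand-ins for df.Review and df['Rating'], and the pipeline arrangement is an assumption rather than what this notebook ran.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in for df.Review / df['Rating'].
reviews = ["loved it", "perfect film", "great story", "loved the cast",
           "total crap", "poor script", "avoid this", "poor acting"] * 5
ratings = [10, 10, 10, 10, 1, 1, 1, 1] * 5

X_train, X_test, y_train, y_test = train_test_split(
    reviews, ratings, test_size=0.2, random_state=42, stratify=ratings)

# The pipeline keeps the TF-IDF matrix sparse; no .toarray() call is needed.
tfidf_nb = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())
tfidf_nb.fit(X_train, y_train)

pred = tfidf_nb.predict(X_test)
print(f1_score(y_test, pred, average="macro"))
```

Two design points: scikit-learn estimators such as MultinomialNB accept sparse input directly, which avoids the memory cost of a dense 25,000-row array, and fitting the vectorizer inside the pipeline means the vocabulary and IDF weights are learned from the training split only.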
12. Using the “best” feature space, try to improve on your model’s performance. What
model/parameter settings do you think will help? Why? What is the result?
I changed the smoothing parameter (alpha) from its default of 1 to 0.8, and the macro F1
increased slightly.
X = bow_dm.toarray()
y = df['Rating'].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
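Rather than trying a single value by hand, the smoothing parameter can be searched over a small grid with cross-validation and macro F1 as the score. The sketch below runs on invented toy data; the candidate values are illustrative, with 1.0 being the default and 0.8 the value tried above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy stand-in for df.Review / df['Rating'].
reviews = ["loved it", "perfect film", "great story", "loved the cast",
           "total crap", "poor script", "avoid this", "poor acting"] * 5
ratings = [10, 10, 10, 10, 1, 1, 1, 1] * 5

pipe = Pipeline([
    ("bow", CountVectorizer(binary=True, stop_words="english")),
    ("nb", MultinomialNB()),
])

# Cross-validated search over the smoothing parameter, scored by macro F1.
grid = GridSearchCV(pipe, {"nb__alpha": [0.1, 0.5, 0.8, 1.0]},
                    scoring="f1_macro", cv=4)
grid.fit(reviews, ratings)
print(grid.best_params_, round(grid.best_score_, 3))
```

Selecting alpha by cross-validation on the training data, instead of by test-set score, keeps the test set usable as an unbiased final check.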
13. Are you able to build a model that will help your streaming service get some market share?
Why or why not?
Yes. With 8 rating classes, a random guess is correct with probability 1/8. Our model predicts
better than that, so it helps.
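The 1/8 figure is the baseline for uniform guessing. If the classes are imbalanced, always predicting the most common class is a stronger baseline, and it is worth comparing against both. The sketch below uses scikit-learn's DummyClassifier on a hypothetical label distribution (not the real one).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Hypothetical imbalanced labels over 8 rating classes (not the real distribution).
rng = np.random.default_rng(0)
classes = [1, 2, 3, 4, 7, 8, 9, 10]
y = rng.choice(classes, size=2000, p=[0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.2])
X = np.zeros((len(y), 1))  # features are ignored by DummyClassifier

uniform = DummyClassifier(strategy="uniform", random_state=0).fit(X, y)
majority = DummyClassifier(strategy="most_frequent").fit(X, y)

uniform_acc = accuracy_score(y, uniform.predict(X))    # about 1/8 in expectation
majority_acc = accuracy_score(y, majority.predict(X))  # share of the most common class
print(uniform_acc, majority_acc)
```

A model is only useful if it beats the stronger of these two baselines on the chosen metric.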
Extra credit (3 points): What terms from the reviews are most strongly related to a movie’s
classification? How do you know this?
'crap' is the most important term according to the decision tree classifier: it is the split at
the root of the tree.
[38]: # code or markdown here
dt_pru = DecisionTreeClassifier(criterion='gini', min_impurity_decrease = 0.0003,
min_samples_split= 0.0005,
random_state = 12345)
dt_pru.fit(X_train, y_train)
| | | | | | |--- suppos <= 0.50
| | | | | | | |--- plot <= 0.50
| | | | | | | | |--- hi <= 2.50
| | | | | | | | | |--- avoid <= 0.50
| | | | | | | | | | |--- better <= 0.50
| | | | | | | | | | | |--- truncated branch of depth 2
| | | | | | | | | | |--- better > 0.50
| | | | | | | | | | | |--- class: 1
| | | | | | | | | |--- avoid > 0.50
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- hi > 2.50
| | | | | | | | | |--- class: 8
| | | | | | | |--- plot > 0.50
| | | | | | | | |--- class: 4
| | | | | | |--- suppos > 0.50
| | | | | | | |--- class: 2
| | | | | |--- script > 0.50
| | | | | | |--- class: 1
| | | | |--- perfect > 0.50
| | | | | |--- class: 10
| | |--- poor > 0.50
| | | |--- thi <= 2.50
| | | | |--- class: 1
| | | |--- thi > 2.50
| | | | |--- class: 1
| |--- love > 0.50
| | |--- suppos <= 0.50
| | | |--- hi <= 0.50
| | | | |--- noth <= 0.50
| | | | | |--- love <= 1.50
| | | | | | |--- money <= 0.50
| | | | | | | |--- nice <= 0.50
| | | | | | | | |--- script <= 0.50
| | | | | | | | | |--- class: 10
| | | | | | | | |--- script > 0.50
| | | | | | | | | |--- class: 1
| | | | | | | |--- nice > 0.50
| | | | | | | | |--- class: 7
| | | | | | |--- money > 0.50
| | | | | | | |--- class: 1
| | | | | |--- love > 1.50
| | | | | | |--- scene <= 0.50
| | | | | | | |--- class: 10
| | | | | | |--- scene > 0.50
| | | | | | | |--- class: 10
| | | | |--- noth > 0.50
| | | | | |--- class: 1
| | | |--- hi > 0.50
| | | | |--- favorit <= 0.50
| | | | | |--- poorli <= 0.50
| | | | | | |--- problem <= 0.50
| | | | | | | |--- guy <= 0.50
| | | | | | | | |--- class: 10
| | | | | | | |--- guy > 0.50
| | | | | | | | |--- class: 1
| | | | | | |--- problem > 0.50
| | | | | | | |--- class: 4
| | | | | |--- poorli > 0.50
| | | | | | |--- class: 1
| | | | |--- favorit > 0.50
| | | | | |--- class: 10
| | |--- suppos > 0.50
| | | |--- class: 1
|--- crap > 0.50
| |--- veri <= 0.50
| | |--- class: 1
| |--- veri > 0.50
| | |--- class: 1
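Besides reading the printed tree, terms can be ranked by the fitted tree's feature_importances_ attribute. The sketch below does this on a toy corpus in which 'crap' separates the two classes perfectly; the reviews and ratings are invented, and only the ranking technique is meant to carry over.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier, export_text

reviews = ["total crap", "crap script", "crap acting", "utter crap",
           "perfect film", "loved it", "great story", "loved the cast"]
ratings = [1, 1, 1, 1, 10, 10, 10, 10]

vec = CountVectorizer(stop_words="english")
X = vec.fit_transform(reviews)
# Terms ordered by their column index in X.
terms = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

tree = DecisionTreeClassifier(criterion="gini", random_state=12345).fit(X, ratings)

# Rank terms by impurity-based importance (largest first).
order = np.argsort(tree.feature_importances_)[::-1]
top_terms = [(terms[i], float(tree.feature_importances_[i])) for i in order[:3]]
print(top_terms)
print(export_text(tree, feature_names=terms))
```

Impurity-based importances reflect how much each term reduces Gini impurity across all the splits that use it, so they summarize the whole tree rather than only the root split.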
[ ]: # other calculations