
CMSC422 Machine Learning Final Project Presentation

Anis Abboud

The Task
Throughout this project, I developed and used multiple machine learning algorithms to predict whether or not a user is going to give a movie a good rating (>= 4):
1. Decision Trees (DT) - implemented.
2. Association Rules (AR) - implemented.
3. Support Vector Machines (SVM) - used libsvm.
4. K-Nearest Neighbors (KNN) - used scikit-learn.
5. Gaussian Naïve Bayes (GNB) - used scikit-learn.

Attributes used
The following attributes and modifications were used from the provided data:

Movie:
- Decade - extracted from the movie title, e.g., 1990s for The Shawshank Redemption (1994). Due to different technologies and themes in different decades, this information could be useful.
- Genre - e.g., Action, Drama, etc.

User:
- Gender (M/F)
- Age group
- Occupation
- Zip code prefix - extracted from the given zip code. The full zip code has a lot of variation and little meaning; however, the first digit tells approximately which region/states the user is from. Since the mindset might vary, e.g., between the east coast and the west coast, this information could be useful. On the next slide, you can see a US map split into the 10 zip code regions.

http://en.wikipedia.org/wiki/File:ZIP_Code_zones.svg
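The two derived attributes above can be sketched as follows (a minimal illustration; extract_decade and extract_zip_prefix are hypothetical helper names, not the project's actual code):

```python
import re

def extract_decade(title):
    """Extract the release decade from a title like 'The Shawshank Redemption (1994)'."""
    match = re.search(r'\((\d{4})\)', title)
    if match is None:
        return None
    year = int(match.group(1))
    return (year // 10) * 10  # e.g., 1994 -> 1990

def extract_zip_prefix(zip_code):
    """Keep only the first digit, which identifies one of the 10 US zip code regions."""
    return zip_code.strip()[0]
```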

Technical note
Throughout this presentation, I focus on the data subset that was provided for the third part of the project: movies from year 2000 or later, classifying whether a review gives 4 or 5 stars. This is because:
1. The requirements for the first 3 parts of the project varied, and this is the only subset that was consistent throughout them all - thus the only one on which I can run clear comparisons between all the algorithms.
2. It saves on runtime.

Since the data subset contains only 28K instances out of 700K total, the accuracies I received are expected to go up when using the entire data, leading to better results. In addition, since all the movies in this subset are from the 2000s, the decade attribute had no added value here, while it could be valuable if used with the entire data.

1. Decision Trees
Technical notes:
- Nodes are split based on the attribute that minimizes the entropy.
- Leaf nodes with fewer than 10 examples were pruned to avoid over-fitting, and classification inside a node is based on the average.
- For numerical attributes, range splits were checked (e.g., age <= 25) in addition to standard equality splits (e.g., age == 25 or gender == M).
- Since many movies have multiple genres, splits were based on whether the movie has a given genre or not (e.g., genres include Action).
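As an illustration of the split criterion, here is a minimal sketch of computing the weighted entropy of a candidate split (helper names are hypothetical; the project's actual implementation may differ):

```python
import math

def entropy(labels):
    """Entropy (in bits) of a list of binary labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def split_entropy(examples, labels, test):
    """Weighted entropy after splitting on a boolean test, e.g. lambda x: x['age'] <= 25."""
    left = [l for x, l in zip(examples, labels) if test(x)]
    right = [l for x, l in zip(examples, labels) if not test(x)]
    n = len(labels)
    return len(left) / n * entropy(left) + len(right) / n * entropy(right)
```

The split chosen at each node is then the test with the lowest weighted entropy.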

Here are the results:


      accuracy  precision  recall  f-score  mcc
DT    0.59      0.583      0.645   0.61     0.172

Throughout this presentation, we will compare the different approaches based on their MCC score (Matthews correlation coefficient) - the higher the better.
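For reference, MCC is computed from the four confusion-matrix counts; a minimal sketch (not the project's evaluation code):

```python
import math

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts, in [-1, 1]."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # conventionally 0 when any marginal count is empty
    return (tp * tn - fp * fn) / denom
```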

Using averages
We have a limited set of attributes for the movies and users, but still, can we do better? Idea:
If a movie is rated high (or low) by most users, it's more likely to get rated high (or low) by a new user. If a user rates most movies high (or low), they are most likely to rate a new movie high (or low) as well.

So, let's add a new attribute for each movie and for each user - average rating - and calculate its value based on past reviews for that movie/user in the training data.
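Computing these averages over the training reviews can be sketched as follows; the (user_id, movie_id, rating) tuple format is an assumption for illustration:

```python
from collections import defaultdict

def average_ratings(train_reviews):
    """Map each movie id and each user id to its average rating in the training data."""
    movie_totals = defaultdict(lambda: [0.0, 0])  # movie_id -> [sum, count]
    user_totals = defaultdict(lambda: [0.0, 0])   # user_id -> [sum, count]
    for user_id, movie_id, rating in train_reviews:
        movie_totals[movie_id][0] += rating
        movie_totals[movie_id][1] += 1
        user_totals[user_id][0] += rating
        user_totals[user_id][1] += 1
    movie_avg = {m: s / n for m, (s, n) in movie_totals.items()}
    user_avg = {u: s / n for u, (s, n) in user_totals.items()}
    return movie_avg, user_avg
```

These two lookup tables then supply the new attribute values for each training and test instance.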

Here are the results after doing this:


DT            accuracy  precision  recall  f-score  mcc
without avg.  0.59      0.583      0.645   0.61     0.172
with avg.     0.64      0.645      0.653   0.65     0.284

We notice a huge bump in the MCC score after applying this: from 17% to 28%.

2. Association Rules
In the second part of the project, we had to implement association rules. We learned two algorithms for mining association rules:
1. Apriori: simple but not efficient.
2. FP-Growth: efficient but very complex to code.

Observing that each of these two algorithms has a major downside, I came up with my own algorithm - I call it HybridAlgorithm - which balances efficiency and simplicity:
1. It's not complicated - it can be explained in the next two slides.
2. It's efficient - it ran on the 700K examples in ~10 minutes.

Here are the results, both without and with using the average rating attributes explained in the previous slide:
AR            accuracy  precision  recall  f-score  mcc
without avg.  0.56      0.647      0.304   0.41     0.157
with avg.     0.65      0.675      0.594   0.63     0.303

Association Rules: HybridAlgorithm


To mine the association rules:
1. First, it collects all the item "candidates" (there are 68 of those), of the format (attr, value). For example: ('genre', 'Action').
2. Next, it recursively builds subsets of those 68 candidates, starting with an empty subset. It effectively builds a tree on the fly, since the recursion branches like a tree. For every candidate, it considers 2 options:
   1. Not taking it into the subset (the subset is basically the X in the X->Y rule).
   2. Taking it into the subset. In this case, it checks if the current subset meets the required support. If it doesn't, we stop this path and move on. If it does meet the threshold, we take it and continue with it recursively. Whenever we take a new item, we keep only the movie reviews that contain this item. For example, if we decided that our X->Y rule will have ('genre', 'Action') in X, then we move on only with the reviews of Action movies; the reasoning is that the others won't contribute to the support. This makes a huge difference in performance, because the data size shrinks significantly as the recursion goes deeper, making the code run orders of magnitude faster.

For classification, it collects all the positive association rules that the given review applies to and calculates the average of their confidence. Similarly for the negative rules. Then it classifies based on the higher average.

def RecursiveItemsetBuild(rules, current_itemset, candidates, next_candidate_index,
                          class_data_subset, class_data_count, all_data_subset):
    if next_candidate_index == len(candidates):  # Stop condition.
        return
    # First, try without the current candidate.
    RecursiveItemsetBuild(rules, current_itemset, candidates, next_candidate_index + 1,
                          class_data_subset, class_data_count, all_data_subset)
    # Second, try with the current candidate.
    item = candidates[next_candidate_index]
    new_itemset = current_itemset + [item]
    new_class_data_subset = [review for review in class_data_subset if item in review.ar_items]
    new_all_data_subset = [review for review in all_data_subset if item in review.ar_items]
    support = len(new_class_data_subset) / float(class_data_count)
    if support >= MIN_SUPPORT:
        confidence = len(new_class_data_subset) / float(len(new_all_data_subset))
        if confidence >= MIN_CONFIDENCE:
            rules[frozenset(new_itemset)] = confidence
        # Continue down this path only while the support threshold is met.
        RecursiveItemsetBuild(rules, new_itemset, candidates, next_candidate_index + 1,
                              new_class_data_subset, class_data_count, new_all_data_subset)

This function is called twice, once for mining the positive rules and once for mining the negative rules as follows:
RecursiveItemsetBuild(rules (a dict to be filled by the recursion), [], the 68 candidates, 0, positive/negative data, number of positive/negative examples, data)
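Classification from the mined rules (averaging the confidences of matching positive vs. negative rules, as described earlier) can be sketched as follows; review_items is assumed to be the review's set of (attr, value) items:

```python
def classify(review_items, positive_rules, negative_rules):
    """Classify by comparing the mean confidence of the matching positive vs. negative rules.

    Each rules dict maps a frozenset itemset (the X of an X->Y rule) to its confidence.
    A rule matches when all of its items appear in the review.
    """
    pos = [conf for itemset, conf in positive_rules.items() if itemset <= review_items]
    neg = [conf for itemset, conf in negative_rules.items() if itemset <= review_items]
    pos_avg = sum(pos) / len(pos) if pos else 0.0
    neg_avg = sum(neg) / len(neg) if neg else 0.0
    return pos_avg >= neg_avg  # True = predicted rating >= 4
```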

3/4/5. SVM, KNN, Naïve Bayes


In the third part of the project, I used libsvm to classify with Support Vector Machines (SVM). In addition, to get a broader picture, I used the K-Nearest Neighbors (KNN) and Gaussian Naïve Bayes (GNB) algorithms from the scikit-learn library. Here are all the results, for perspective:

                            accuracy  precision  recall  f-score  mcc
DT             without avg.  0.59      0.583      0.645   0.61     0.172
               with avg.     0.64      0.645      0.653   0.65     0.284
AR             without avg.  0.56      0.647      0.304   0.41     0.157
               with avg.     0.65      0.675      0.594   0.63     0.303
SVM            without avg.  0.59      0.586      0.64    0.61     0.177
               with avg.     0.67      0.674      0.67    0.67     0.337
KNN            without avg.  0.51      0.544      0.157   0.24     0.032
(scikit-learn) with avg.     0.64      0.644      0.637   0.64     0.275
Naïve Bayes    without avg.  0.56      0.54       0.823   0.65     0.124
(scikit-learn) with avg.     0.62      0.6        0.777   0.68     0.258

From this, we conclude that SVM provides the best predictions, with an MCC score of 34%.

What differentiates highly rated movies?


From the previous slide, we can observe that using the average rating attribute significantly bumps the MCC score of every algorithm - it almost doubles it. Therefore, we can argue that the average of the past ratings can be used to estimate/predict ratings in new reviews.

However, if you are directing a new movie, you have no access to the average rating (as the movie has not been released yet). What can you still learn from the data?
The association rules algorithm (without using the average rating attribute) mined 63 positive rules and 39 negative rules of the form:
genre=Drama,occupation=17,genre=Action->yes: 0.796791
genre=Thriller,genre=Sci-Fi,genre=Horror->no: 0.779783

All the rules can be found in the notes section of this slide (below).

Observing the association rules


We notice the following statistics about some genres:
Genre        Positive rules including it  Negative rules including it
Drama        49                           2
Action       27                           0
Comedy       11                           0
Animation    10                           0
War          6                            0
Children's   5                            0
Adventure    0                            2
Sci-Fi       0                            7
Mystery      0                            8
Thriller     0                            18
Horror       0                            24

The good and the bad genres


From the previous slide, we observe that some genres are highly associated with good reviews, while others are generally associated with bad reviews. Is that really the case? Let's check it out. We make a super-simple classifier that predicts whether a movie review is going to rate it >= 4 based only on the movie genre!
if ('Drama' in genres or 'Action' in genres or 'Comedy' in genres or
        'Animation' in genres or 'War' in genres or "Children's" in genres):
    classification = True
elif ('Thriller' in genres or 'Horror' in genres or 'Mystery' in genres or
        'Sci-Fi' in genres or 'Adventure' in genres):
    classification = False

Here are the results:


                                accuracy  precision  recall  f-score  mcc
Simple Classifier (genre only)  0.54      0.528      0.931   0.67     0.126

Conclusions
If you are looking to predict whether a user is going to like a movie (give a good review), e.g., in order to decide which movies to recommend to that user, then I would recommend using the SVM classifier, which achieved the highest MCC score of 34%.

If you are looking to make a new movie, I would recommend going for the following genres: Drama, Action, Comedy, Animation, War, and Children's, and avoiding: Thriller, Horror, Mystery, Sci-Fi, and Adventure. This is based on the previous slide, where a simple classifier that classifies based only on genre achieves an MCC score of 13% - very close to what the real, complex algorithms achieved without using the average rating attribute, and actually higher than the MCC score of 10% that was required for the second homework (association rules).

Although we could learn a lot from the given data, the data is very limited and lacking, making it hard to achieve higher accuracies and scores, even when using complex algorithms like SVM. To better predict a movie's performance, one needs to consider numerous additional attributes, including: cast, director, budget, sequel or not, marketing, popularity, etc.

Future work
1. Get more data and attributes. This will facilitate achieving better results.
2. Try more algorithms, e.g., in WEKA.
3. Analyze the value of each attribute. E.g., is gender actually useful? Remove each attribute from the data and check if that drives the scores down - if yes, it's probably a useful attribute.
4. Parameter tuning: many of the algorithms used have parameters that can be tuned in order to try to get better results, e.g., the C (cost) parameter for SVM.
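The attribute-ablation idea in item 3 can be sketched as follows; train_and_score and the attribute-dict example format are hypothetical placeholders:

```python
def ablation_scores(examples, labels, attributes, train_and_score):
    """Score the model with each attribute removed in turn.

    A large drop relative to the baseline suggests the attribute is useful.
    train_and_score(examples, labels) is any function returning a score (e.g., MCC).
    """
    baseline = train_and_score(examples, labels)
    drops = {}
    for attr in attributes:
        reduced = [{k: v for k, v in x.items() if k != attr} for x in examples]
        drops[attr] = baseline - train_and_score(reduced, labels)
    return baseline, drops
```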

Anis Abboud