Assignment 6

Learning Objectives:
1. Get practice with a somewhat more realistically scaled machine learning problem than those in prior assignments. This will give you the opportunity to apply many of the techniques we have explored so far.
2. Apply features of TagHelper and Weka for processing raw text data.
3. Investigate issues related to feature space design.

Description:
Before we watch a movie, we often read reviews of it. Movie reviews typically indicate whether the movie is good or bad (with thumbs up/down symbols) or provide a rating on a scale of 0-5 or 0-10. Suppose we want a machine learning algorithm to decide whether a review of a movie is positive or negative. Such an algorithm takes the review text as input and predicts whether the review is positive or negative about the movie. In this assignment, you are given a set of reviews of several movies, each classified as positive or negative. The goal is to build a classifier that correctly assigns a positive or negative tag to each movie review text. Use MovieReviews.xls.

Step-by-Step Guide:
1. Complete the week 6 and 7 assigned readings, and review the lecture slides from week 7, especially where instructions for using TagHelper tools were given.
2. Manually examine some examples from the movie review data and note which features are likely to predict whether a review is negative or positive.
3. Read the file into TagHelper tools and configure the customization panel so that you use only unigram features, use attribute selection to keep the top 200 of these attributes, and do the classification using SMO. After you have run TagHelper tools, you will find a performance report and output file in the OUTPUT directory as well as a .arff file in the ARFF directory. (Rename the .arff file as your baseline .arff file and set it aside with the performance report.)
   a. Make a note of the baseline performance as indicated in the performance report.
4. Do an error analysis and determine where the machine learning algorithm is making mistakes.
5. Load the .xls file into TagHelper tools again. Based on your error analysis, configure TagHelper to try to compensate for the confusions you observed. You may wish to create some new features using the Advanced Feature Editor. Now run TagHelper to obtain a new .arff file and performance report. Label the .arff file Final.arff.
6. Compare the performance obtained in Step 5 with that obtained in Step 3 using the Experimenter to determine whether any observed difference in performance is statistically significant.
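To make the baseline configuration in Step 3 more concrete, here is a minimal Python sketch of what "unigram features plus attribute selection" amounts to. It is not how TagHelper or Weka implement it. The chi-squared scoring, the tokenizer, the function names, and the four toy reviews are all assumptions made for illustration. With the real data you would keep the top 200 attributes rather than 2.

```python
import re
from collections import Counter


def unigrams(text):
    """Return the set of lowercased word tokens in a review (binary unigram features)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def chi_squared(n11, n10, n01, n00):
    """Chi-squared statistic for a 2x2 table.

    n11: positive reviews containing the word, n10: negative reviews containing it,
    n01: positive reviews without it,          n00: negative reviews without it.
    """
    n = n11 + n10 + n01 + n00
    denom = (n11 + n10) * (n01 + n00) * (n11 + n01) * (n10 + n00)
    if denom == 0:
        return 0.0
    return n * (n11 * n00 - n10 * n01) ** 2 / denom


def top_k_unigrams(reviews, labels, k):
    """Rank unigrams by chi-squared association with the label and keep the top k."""
    pos_docs = sum(1 for y in labels if y == "pos")
    neg_docs = len(labels) - pos_docs
    pos_counts, neg_counts = Counter(), Counter()
    for text, y in zip(reviews, labels):
        (pos_counts if y == "pos" else neg_counts).update(unigrams(text))
    scores = {}
    for word in set(pos_counts) | set(neg_counts):
        n11, n10 = pos_counts[word], neg_counts[word]
        scores[word] = chi_squared(n11, n10, pos_docs - n11, neg_docs - n10)
    # Sort by descending score; break ties alphabetically so the result is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))[:k]


# Toy data invented for this sketch; the assignment uses MovieReviews.xls instead.
reviews = [
    "A great film with a great cast",
    "Great fun and a moving story",
    "A boring plot and a boring cast",
    "Boring and far too long",
]
labels = ["pos", "pos", "neg", "neg"]

print(top_k_unigrams(reviews, labels, k=2))
```

Words that appear equally often in both classes (such as "cast" in the toy data) score zero and drop out. Looking at which words survive selection is also a useful starting point for the error analysis in Step 4.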

Deliverables:
1. Your baseline .arff file and your final .arff file after experimentation and modification.
2. A write-up of your experimentation process that includes:
   a. Your observations from your initial exploratory analysis of the data.
   b. A description of your baseline performance and error analysis.
   c. A description of what you tried in order to improve results over your baseline and why you thought it would work.
   d. A comparison of the results of your final approach with your baseline approach.
   e. Whether the final approach was significantly better.

NOTE: Your report should demonstrate your understanding of the concepts and the reasoning behind the process you chose to uncover the hidden features in the text. Your report should explain why a particular technique seemed to be working or not working. The final performance (high or low accuracy) plays a secondary role in the review of your report.
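For the question of whether the final approach was significantly better, the following sketch shows the idea behind a paired comparison across cross-validation folds. It is a plain paired t-statistic written for illustration. The Experimenter's own tester may apply a correction for overlapping training sets, so its numbers can differ. The function name and the per-fold counts are made up for this example and are not results from the assignment data.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(baseline, final):
    """Paired t-statistic for per-fold scores of two models evaluated on the same folds."""
    if len(baseline) != len(final) or len(baseline) < 2:
        raise ValueError("need two equally long score lists with at least 2 folds")
    diffs = [f - b for b, f in zip(baseline, final)]
    sd = stdev(diffs)  # sample standard deviation of the per-fold differences
    if sd == 0:
        raise ValueError("all differences are identical; t-statistic is undefined")
    return mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical numbers of correctly classified reviews per fold (out of 100 each).
baseline_correct = [70, 72, 68, 71, 69]
final_correct = [74, 75, 70, 76, 72]

t = paired_t_statistic(baseline_correct, final_correct)
print(f"t = {t:.3f} with {len(baseline_correct) - 1} degrees of freedom")
```

With 5 folds there are 4 degrees of freedom, for which the two-sided critical value at the 0.05 level is about 2.776. A t-statistic larger in magnitude than that would count as significant under this simplified test.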