
CS 1026 : Computer Science Fundamentals

ASSIGNMENT 03

SENTIMENT ANALYSIS
Due Date:
Friday, November 17, 2023 23:55:00 ET


ACADEMIC DISHONESTY
Assignments will be run through similarity-checking software to check for code that looks very similar to that of other students. Sharing or copying code in any way is considered plagiarism (academic dishonesty) and may result in a mark of 0 on the assignment and/or a report to the Dean's Office. Plagiarism is a serious offence. Work is to be done individually.

If you want to store a PDF version of this assignment, press Ctrl+P on Windows or Command+P on Mac; the print window will appear. Then select Save As PDF from the Destination dropdown and click Save.

This file was last modified on 2023-11-02 11:20 AM

UPDATES and CHANGES


The following are the updates/changes made to this assignment AFTER it was posted:

Thursday Nov. 2, 11:20 AM

Example 2 in Section 3.3: This example was missing the keyword "perfectly" in the result. This has now been corrected.

The following are the instructions for assignment 3:

1 LEARNING OUTCOMES

2 BACKGROUND

3 TASKS

4 FUNCTIONAL SPECIFICATION

5 NON-FUNCTIONAL SPECIFICATION

6 STARTER CODE

7 MARKING AND SUBMISSION

8 ATTACHMENTS AND EXAMPLES

9 HELPFUL FUNCTIONS AND METHODS

1 Learning Outcomes

By completing this assignment, you will gain skills relating to:

1 Using functions

2 Complex data structures

3 Dictionaries and lists

4 Text processing

5 Working with TSV and CSV files

6 File input and output

7 Exceptions in Python

8 Using Python simple modules

9 Testing programs and developing test cases; adhering to specifications

10 Writing code that is used by other programs

11 Working with real world data

12 Reviewing past concepts such as Ifs, Loops, and I/O

2 Background

With the emergence of social media sites such as Facebook, Reddit, Twitter (also known as X), LinkedIn, and WhatsApp, more and more data is being produced and made accessible online in a textual format. This textual data, such as Tweets or Facebook posts, can be hard to process but is incredibly important for organizations, as it offers a snapshot of the public’s feelings (or sentiment) about a topic at a given point in time. Having a live view of your customers’ current sentiment about your products, or the public’s view of your political campaign, can be critical for success.

Twitter is a social media site that allows users to post “tweets”, short (typically under 280 characters) messages. It is commonly used

by people to “tweet” aspects of their daily lives and current opinions about a variety of topics. This “flow of tweets” has become a way

to study or at least guess at how people feel about various aspects of the world, their own lives, or a specific topic. For example,

analysis of tweets has been used to try to determine how certain geographical regions may be voting or their opinion on a recently

announced product.

This is accomplished by analyzing the content (the words and phrases) of tweets. For example, analysis of keywords or phrases in tweets can be used to determine how popular or unpopular a movie might be. This is often referred to as sentiment analysis.

In this assignment, you will be performing a sentiment analysis on a dataset of Tweets collected in February 2023 relating to a

business, product, or security. The end goal is to produce a report that summarizes the sentiment of the tweets contained in the

dataset.

3 Tasks
In this assignment, you will write a Python module, called sentiment_analysis.py (this is the name of the file that you should use) and a

main program, main.py , that uses the module to analyze Twitter information. In the module sentiment_analysis.py , you will create a

number of functions (as specified in the Functional Specifications) that will perform simple sentiment analysis on Twitter data.

sentiment_analysis.py should only contain your function definitions and have no code outside of these functions.

The Twitter data contains comments (“tweets”) from individuals related to a given keyword. The objective is to determine the average

sentiment for the dataset, the number of positive/negative/neutral tweets, and the top 5 countries that are most positive about this

keyword (as well as a few other statistics).

To accomplish this, you will need to do the following:

1 Read in and process a set of keywords and tweets from a given file.

2 Clean the tweets to remove any punctuation and convert them to all lowercase letters.

3 Process each individual tweet to determine a score, a “sentiment score”, for the individual tweet.

4 Analyze these scores to determine an overall average sentiment, an average sentiment for favorited/linked tweets, an average

sentiment for retweeted tweets, the number of positive/negative/neutral tweets, and find the top 5 countries by average

sentiment.

5 Report this information back to the user by outputting a new file containing the report.

3.1 - Read

STARTER CODE
Before you start coding, please note that starter code is available in Section 6. It is highly recommended that you use this code as

it will ensure your function names and parameters are correct.

Your program will have to read in two files, keywords.tsv (a tab-separated file) and tweets.csv (a comma-separated file). The exact

names of these files will be specified by the user, but the content will always be in the same format.

3.1.1 - keywords.tsv
This Tab-Separated Values (TSV) file will contain a list of one or more keywords, as well as the score each keyword contributes to the overall sentiment of a tweet. Each line of this file will start with the keyword in all lower case, followed by a single tab character, and then an integer value between -5 and 5.

An example of this file, named keywords.tsv, can be found here. This file contains the AFINN-111 wordlist of common keywords and

scores used for sentiment analysis. An example is shown below of the first 13 lines:

abandon -2

abandoned -2

abandons -2

abducted -2

abduction -2

abductions -2

abhor -3

abhorred -3

abhorrent -3

abhors -3

abilities 2

ability 2

aboard 1

Each keyword is separated from its corresponding score by a single tab (\t) character. A score of 5 would mean that this is a very positive/happy word. A score of -5 would mean that this is a very negative/unhappy word.

IMPORTANT!
The AFINN-111 wordlist is just one real life example of a wordlist for sentiment analysis. Your program should work for any

wordlist of the same TSV format provided by the user!

3.1.2 - tweets.csv
This Comma-Separated Values (CSV) file will contain a list of tweets as well as associated metadata about the tweet and the tweeter,

such as their location (only if known), the number of times the tweet has been favorited/liked or retweeted, the date the tweet was

made, the user who tweeted it, etc.

Each line in the file contains information about only one tweet. Each field on a line is separated (delimited) by a comma. The fields are

always in the following order:

Created At, Tweet Text, Username, Retweet Count, Favorite Count, Language, Country, State/Province, City, Latitude, Longitude

An example line from the file adidas.csv that can be found here:

Feb 10 21:00:45 2023,Adidas says Kanye West split could cost company $1.3B as Yeezy shoes go unsold https://t.co/eviVODm3ig,D

The following table describes each field in more depth:

Field Name | Key In Dictionary | Data Type | Description
Created At | date | String | The date this tweet was posted to Twitter in the format MMM DD HH:MM:SS YYYY. This can be read in as a string.
Tweet Text | text | String | The text that was tweeted by the user. Note that this text may be unclean and contain odd characters, punctuation, and hyperlinks.
Username | user | String | The username of the user who made the tweet. Always one word with no spaces.
Retweet Count | retweet | Integer | The number of times this tweet has been retweeted. Always a non-negative integer value.
Favorite Count | favorite | Integer | The number of times this tweet has been favorited/liked. Always a non-negative integer value.
Language | lang | String | The language code representing the language this user has set in their profile. In most cases, this will be "en" as only English tweets were selected for inclusion in the dataset, but can be any combination of two letters.
Country | country | String | If known, the country that the user resides in will be listed here. If it is not known, the string value "NULL" will be given.
State/Province | state | String | If known, the state or province that the user resides in will be listed here. If it is not known, the string value "NULL" will be given.
City | city | String | If known, the city that the user resides in will be listed here. If it is not known, the string value "NULL" will be given.
Latitude | lat | Float/String | If possible, an estimate of the user’s current latitude on the earth will be given here as a floating-point value. If the latitude could not be estimated, this will be the string value "NULL".
Longitude | lon | Float/String | If possible, an estimate of the user’s current longitude on the earth will be given here as a floating-point value. If the longitude could not be estimated, this will be the string value "NULL".

Note that not all of these fields will be used in our analysis, but they must be read in by your program as described in the Functional

Specification.

3.2 - Clean

The text of the tweets in each dataset is not “clean”; that is, it contains characters that must be removed before we can perform our analysis. Cleaning involves two steps:

1) all characters except for English letters and spaces should be removed,

2) all English letters should be converted to lower case.

For example, if the tweet’s text was:

Java, Python, C++; endless possibilities await in the world of coding! http://t.co/ASD32S4S

After cleaning it should be:

java python c endless possibilities await in the world of coding httptcoasdss
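One possible way to implement the two cleaning steps is sketched below. This is only an illustration, not the required implementation; the local names allowed and cleaned are assumptions chosen for this example.

```python
def clean_tweet_text(tweet_text):
    """Return a copy of tweet_text with only English letters and spaces, in lowercase."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    allowed = letters + letters.upper() + " "
    cleaned = ""
    for char in tweet_text:
        # Step 1: keep the character only if it is an English letter or a space.
        if char in allowed:
            cleaned += char
    # Step 2: convert every remaining letter to lowercase.
    return cleaned.lower()


example = "Java, Python, C++; endless possibilities await in the world of coding! http://t.co/ASD32S4S"
print(clean_tweet_text(example))
# java python c endless possibilities await in the world of coding httptcoasdss
```

Filtering before lowercasing means that only the 52 English letters and the space character can ever reach the result.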

3.3 - Process

The sentiment score for an individual tweet is calculated by comparing each word in the provided wordlist (e.g. keywords.tsv) to the words contained in the cleaned tweet text. Each time a keyword is encountered, that keyword’s score is added to the sentiment score. The keywords must be an exact match to count. For example, the keyword “friend” should not match “friendly” and vice versa.

Examples:

If given the following already cleaned tweet and the provided AFINN-111 wordlist the sentiment score would be 12:

beautiful sunrise friendly smilesjoy setbacks frustrated call from best friend lifted spirits surprise gift added excitement m

This score is calculated by adding the scores for the keywords found in the tweet from the keywords.tsv file:

Any tweet with a positive score (>0) would be classified as a positive tweet. Any tweet with a negative score (<0) would be classified as

a negative tweet. Any tweet with a score of zero (0) would be classified as a neutral tweet.

If a keyword is encountered multiple times in a tweet, it should be counted multiple times such as in this example with a sentiment

score of 13 (based on the AFINN-111 wordlist):

in her best dreams the day unfolded perfectly her best friend surprised her with the best present imaginable

Keep in mind that the keyword list can be different depending on the keyword file the user provides.
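The scoring rule described in this section (exact whole-word matches, with repeated keywords counted every time) could be sketched as follows. The scores in sample_keywords are made up for illustration and are not the AFINN-111 values.

```python
def calc_sentiment(tweet_text, keyword_dict):
    """Add up the score of every word in tweet_text that is a keyword."""
    score = 0
    # Splitting on whitespace compares whole words only, so "friendly"
    # can never match the keyword "friend".
    for word in tweet_text.split():
        if word in keyword_dict:
            # A keyword that appears several times is counted each time.
            score += keyword_dict[word]
    return score


# Hypothetical keyword scores (an assumption for this example only).
sample_keywords = {"best": 3, "friend": 1, "tired": -2}
print(calc_sentiment("her best friend is the best", sample_keywords))  # 7
print(calc_sentiment("a friendly but tired crowd", sample_keywords))   # -2
```

Looping over the words of the tweet and looking each one up in the dictionary gives the same total as checking every keyword against the tweet, but needs only one pass over the text.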

3.4 - Analyze

After the sentiment score has been calculated for each tweet individually, statistics need to be calculated for the dataset as a whole.

The following metrics must be calculated:

1 The number of tweets in the dataset.

2 The average sentiment score of all tweets in the dataset.

3 The total number of positive, negative, and neutral tweets based on the tweet’s sentiment score (tweets with a positive score

are positive, negative score are negative, and neutral if they have a score of zero).

4 The number of tweets with at least one favorite/like.

5 The average sentiment score of only the tweets with at least one favorite/like.

6 The number of tweets with at least one retweet.

7 The average sentiment score of only the tweets with at least one retweet.

8 The average sentiment score for each country listed in the dataset (used to calculate the top 5 countries).

9 The top 5 countries in the dataset based on their average sentiment score.

All floating-point values should be rounded to two decimal places. These statistics will be returned in a dictionary as described in the

Functional Specification.

If there are no tweets with retweets in the dataset, then a string value of "NAN" should be returned for the average sentiment score of

tweets with at least one retweet. Similarly, if there are no tweets with any favorites in the dataset, then a string value of "NAN" should

be returned for the average sentiment score of tweets with at least one favorite/like.
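The rounding rule and the "NAN" rule can be combined in one small helper, sketched below. The helper name average_or_nan is an assumption made for this example; it is not one of the functions required by the Functional Specification.

```python
def average_or_nan(scores):
    """Return the average of scores rounded to two decimal places.

    Returns the string "NAN" when there are no scores to average.
    """
    if len(scores) == 0:
        return "NAN"
    return round(sum(scores) / len(scores), 2)


print(average_or_nan([3, -1, 0, 2]))  # 1.0
print(average_or_nan([1, 2, 2]))      # 1.67
print(average_or_nan([]))             # NAN
```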

3.5 - Report

After the analysis has been performed, the calculated statistics must be returned to the user in the form of a plain text file (.txt file) with the following format:

Average sentiment of all tweets: [float value]

Total number of tweets: [int value]

Number of positive tweets: [int value]

Number of negative tweets: [int value]

Number of neutral tweets: [int value]

Number of favorited tweets: [int value]

Average sentiment of favorited tweets: [float value]

Number of retweeted tweets: [int value]

Average sentiment of retweeted tweets: [float value]

Top five countries by average sentiment: [list of string values]

All values shown in [ ] should be replaced with a real value, as shown in the example below:

Average sentiment of all tweets: -0.08

Total number of tweets: 534

Number of positive tweets: 134

Number of negative tweets: 150

Number of neutral tweets: 250

Number of favorited tweets: 258

Average sentiment of favorited tweets: 0.1

Number of retweeted tweets: 74

Average sentiment of retweeted tweets: 0.16

Top five countries by average sentiment: United States, United Kingdom, United Arab Emirates, Taiwan, Sweden

The other text contained in the file, such as “Average sentiment of all tweets: ” or “Number of favorited tweets: ”, must be exactly as shown, including the colon and the space that follows it. The items in the report must be in exactly this order.

The list of the top 5 countries should be on one line and each country should be separated by a comma and space as shown above.

They should be ordered by average sentiment (highest to lowest). There must not be an extra comma at the end of the list (commas

should only appear between country names). The "NULL" value should not be included in the list. If there are less than 5 countries in

the dataset, there will be less than 5 countries in this list.

The filename of the report will be specified by the user.

4 Functional Specification
4.1 - sentiment_analysis.py

IMPORTANT!
All of your function names and the order of the parameters they take must be exactly as specified in this part. Naming your functions differently will result in the autograder being unable to grade your assignment (this will result in a grade penalty).

Your sentiment_analysis.py file must only contain function definitions. You must not call a function, ask for input, or give output

outside of these function definitions. Running the sentiment_analysis.py file should result in no output of any kind as your program

should be driven by the main.py file.

The module sentiment_analysis must contain the functions described in this section, and they must be used in some way in your program to read, clean, process, analyze, or report on the tweets in the given dataset. Each function and its parameters must have the same names as specified below:

read_keywords(keyword_file_name)
This function should read the Tab-Separated Values (TSV) keywords file previously described (in Section 3.1.1). keyword_file_name is a string containing the name of the file. You can safely assume that if the file exists, it will be in the current working directory (the directory that main.py and sentiment_analysis.py are located in).

The function should return a dictionary with a key for each keyword in the file and a corresponding value equal to the score listed for

that keyword in keyword_file_name.

Example:

wonderful 4
unfair -2
trusted 2
tired -2

the dictionary produced should have the following values and keys:

{
'wonderful': 4,
'unfair': -2,
'trusted': 2,
'tired': -2
}

The keys should be strings and the values integers.

Exceptions:

If an IOError occurs, such as the file not existing, this function should print the text:

"Could not open file [keyword_file_name]!"

where [keyword_file_name] should be replaced with the value of keyword_file_name and the function should return an empty
dictionary.
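A minimal sketch of this behaviour is shown below. It assumes every non-blank line has exactly one tab between the keyword and the score, as described in Section 3.1.1; skipping blank lines is an extra assumption not stated in the specification.

```python
def read_keywords(keyword_file_name):
    """Read a TSV keyword file into a dictionary of keyword -> integer score."""
    keyword_dict = {}
    try:
        with open(keyword_file_name, "r") as keyword_file:
            for line in keyword_file:
                line = line.strip()
                if line == "":
                    continue  # assumption: ignore blank lines
                # Each line is: keyword, one tab character, integer score.
                keyword, score = line.split("\t")
                keyword_dict[keyword] = int(score)
    except IOError:
        print("Could not open file " + keyword_file_name + "!")
        return {}
    return keyword_dict
```

Calling the function with the name of a file that does not exist prints the message above and returns an empty dictionary, which main.py can then check for.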

clean_tweet_text(tweet_text)
This function should take a string, tweet_text, which contains a single tweet from the dataset and return a copy of the string that only

contains English letters and spaces. All letters should also be made lowercase.

More details and an example are given previously in Section 3.2. Clean.

read_tweets(tweet_file_name)
This function should read the Comma-Separated Values (CSV) tweet file previously described (in Section 3.1.2). tweet_file_name is a string containing the name of the file. You can safely assume that if the file exists, it will be in the current working directory (the directory that main.py and sentiment_analysis.py are located in).

The function should return a list of dictionaries. There should be one dictionary for each line contained in the tweet_file_name file. The

keys of the dictionary should be the key names given in the table in Section 3.1.2 and the values the corresponding values for that field

in the file.

The function clean_tweet_text should be used to clean the text of the tweets before they are copied into the dictionary.

Example:

If tweet_file_name contains the following two lines (note that word wrapping is used in this document to show each line on multiple

lines but in the file there is only a line break at the end of each line):

2023-02-10 17:20,Did an Air Canada flight spot the Chinese spy balloon over B.C. on Jan. 31? https://t.co/KOzRJFoORh https://t

2023-02-10 17:16,@AdamJPfeffer @AirCanada Your lucky Air Canada got you there. Lol,tekmacrogersco1,0,0,en,Canada,Ontario,NULL,

The list of dictionaries returned should be:

[
{
'city': 'NULL',
'country': 'NULL',
'date': '2023-02-10 17:20',
'favorite': 12,
'lang': 'en',
'lat': 'NULL',
'lon': 'NULL',
'retweet': 2,
'state': 'NULL',
'text': 'did an air canada flight spot the chinese spy balloon over bc on jan httpstcokozrjfoorh h
'user': 'CTVNews'
},
{
'city': 'NULL',
'country': 'Canada',
'date': '2023-02-10 17:16',
'favorite': 0,
'lang': 'en',
'lat': 50.000678,
'lon': -86.000977,
'retweet': 0,
'state': 'Ontario',
'text': 'adamjpfeffer aircanada your lucky air canada got you there lol',
'user': 'tekmacrogersco1'
}
]

Note that favorite and retweet should have integer values and not strings, and lat and lon should be floating point values unless they

are given as “NULL” in the file. Any field with a “NULL” value given in the file should simply have a string value of 'NULL' in the dictionary.

Exceptions:

If an IOError occurs, such as the file not existing, this function should print the text:

"Could not open file [tweet_file_name]"

where [tweet_file_name] should be replaced with the value of tweet_file_name and the function should return an empty list.
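The type conversions described above can be illustrated on a single line of the file. The helper parse_tweet_line below is hypothetical (it is not one of the required functions), the sample line is made up, and the sketch assumes the tweet text itself contains no commas. It also leaves the text uncleaned; read_tweets would pass it through clean_tweet_text.

```python
def parse_tweet_line(line):
    """Hypothetical helper: convert one CSV line into a tweet dictionary."""
    fields = line.strip().split(",")
    tweet = {
        "date": fields[0],
        "text": fields[1],  # not cleaned in this sketch
        "user": fields[2],
        "retweet": int(fields[3]),   # counts are stored as integers
        "favorite": int(fields[4]),
        "lang": fields[5],
        "country": fields[6],
        "state": fields[7],
        "city": fields[8],
    }
    # Latitude and longitude become floats unless the file gives "NULL".
    for key, value in (("lat", fields[9]), ("lon", fields[10])):
        if value == "NULL":
            tweet[key] = "NULL"
        else:
            tweet[key] = float(value)
    return tweet


# Made-up sample line in the field order listed in Section 3.1.2.
sample = "2023-02-10 17:16,hello world,some_user,0,3,en,Canada,Ontario,NULL,50.5,-86.25"
print(parse_tweet_line(sample))
```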

calc_sentiment(tweet_text, keyword_dict)
This function should calculate the sentiment score for an individual tweet based on the text contained in that tweet as described in

Section 3.3. Process. tweet_text is a string value containing the already cleaned text of an individual tweet. keyword_dict is a keyword

dictionary created by the read_keywords function to be used for calculating the sentiment score.

The function should return an integer value equal to the sentiment score for the given tweet.

Example using the AFINN-111 wordlist:

calc_sentiment("in her best dreams the day unfolded perfectlyher best friend surprised her with the best pr

Output:

10

classify(score)
This function takes a sentiment score, score, and classifies it as positive, negative, or neutral. If the score is greater than zero, the

function should return the string "positive", if the score is less than zero it should return the string "negative", if it is equal to zero

exactly, it should return the string "neutral".
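The three cases translate directly into a chain of comparisons. The following is one possible sketch:

```python
def classify(score):
    """Classify a sentiment score as "positive", "negative", or "neutral"."""
    if score > 0:
        return "positive"
    elif score < 0:
        return "negative"
    # Neither above nor below zero, so the score is exactly zero.
    return "neutral"


print(classify(12))  # positive
print(classify(-3))  # negative
print(classify(0))   # neutral
```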

make_report(tweet_list, keyword_dict)
This function takes a list of tweets, tweet_list, created by the read_tweets function and a keyword dictionary, keyword_dict, created

by the read_keywords function and performs the analysis described in Section 3.4.

The function should return a dictionary that contains the following keys and values:

Key | Type | Value
avg_favorite | Float/String | The average sentiment value of all tweets that have been favorited/liked at least once. The string value "NAN" should be output if num_favorite is zero.
avg_retweet | Float/String | The average sentiment value of all tweets that have been retweeted at least once. The string value "NAN" should be output if num_retweet is zero.
avg_sentiment | Float/String | The average sentiment value of all tweets in the tweet list. The string value "NAN" should be output if num_tweets is zero.
num_favorite | Integer | The number of tweets in the tweet list that have been favorited/liked at least once.
num_negative | Integer | The number of tweets in the tweet list that would be classified as negative by the classify function.
num_neutral | Integer | The number of tweets in the tweet list that would be classified as neutral by the classify function.
num_positive | Integer | The number of tweets in the tweet list that would be classified as positive by the classify function.
num_retweet | Integer | The number of tweets in the tweet list that have been retweeted at least once.
num_tweets | Integer | The total number of tweets in the given tweet list.
top_five | String | A string containing the top 5 countries found in the tweet list based on the average sentiment of tweets for that country. They should be ordered by average sentiment (highest to lowest). Each country listed in the string should be separated by a comma followed by a space. Make sure you don't have an extra comma at the end of the country list. Note that the value "NULL" should not appear in this list. If there are less than 5 countries in the dataset, there will be less than 5 countries in this list.

All floating-point values (e.g. the average sentiment scores) should be rounded to two decimal places using Python’s round function.

The order of the items in the dictionary does not matter, but the keys must be named exactly as listed above.

If an average value cannot be calculated, for example due to there being no tweets in the dataset that are favorited, the average value should be the string "NAN". In all other cases it should be the correct floating-point value rounded to two decimal places.

Example Output:

The following is an example of a report dictionary that could be produced by this function:

{
'avg_favorite': 0.1,
'avg_retweet': 0.16,
'avg_sentiment': -0.08,
'num_favorite': 258,
'num_negative': 150,
'num_neutral': 250,
'num_positive': 134,
'num_retweet': 74,
'num_tweets': 534,
'top_five': 'United States, United Kingdom, United Arab Emirates, Taiwan, Sweden'
}

Note that the last value for top_five is a string and not a list. This string should list the top five countries in order of average

sentiment.

Hints:

1 There are several ways to sort your countries by average sentiment depending on how you have them stored. If they are stored

in a dictionary, with the keys being the country names and the values the average sentiment for that country, you can take

advantage of the sorted function. The following are some resources that may help with sorting a dictionary by values:

Sort Dictionary by Value in Python (freeCodeCamp.org)

Sorting a Python Dictionary: Values, Keys, and More (RealPython.com)

How to sort dictionary by value in Python? (flexiple.com)

Sorting HOW TO (Python Documentation)

2 If you have a list of countries (or any string values) and wish to join the values into a string separated by a comma (or other value), you can use the string .join() method.
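Both hints can be combined as in the sketch below. The helper name top_five_countries and the sample averages are assumptions made for this example; how ties between countries are ordered is not specified in this document, and here tied countries simply keep their dictionary order.

```python
def top_five_countries(country_averages):
    """Hypothetical helper: build the top_five string.

    country_averages maps a country name to its average sentiment.
    """
    # Leave out the "NULL" placeholder, then sort from highest to lowest average.
    ranked = sorted(
        (name for name in country_averages if name != "NULL"),
        key=lambda name: country_averages[name],
        reverse=True,
    )
    # Joining with ", " never leaves a trailing comma.
    return ", ".join(ranked[:5])


sample = {"Canada": 1.5, "NULL": 9.0, "Japan": 2.25, "Brazil": -0.5}
print(top_five_countries(sample))  # Japan, Canada, Brazil
```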

write_report(report, output_file)
This function creates the report file described in Section 3.5. As input, it takes report, the dictionary created by the make_report

function, and output_file, the name of the file to write the report to. The report should be formatted exactly as described in Section 3.5

including being output in the same order.

If writing to the file was successful, this function should print the text:

"Wrote report to [output_file]"

where [output_file] is replaced with the value of output_file.

This text should not be printed if an exception occurred when opening or writing to the file.

Exceptions:

Should an IOError occur when opening or writing to the output_file, this function should print the text:

"Could not open file [output_file]"

where [output_file] is replaced with the value of output_file.
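A possible shape for this function is sketched below. It assumes the report has one item per line and ends with a newline; the exact line spacing should be checked against Section 3.5 and the provided examples.

```python
def write_report(report, output_file):
    """Write the report dictionary to output_file in the Section 3.5 order."""
    lines = [
        "Average sentiment of all tweets: " + str(report["avg_sentiment"]),
        "Total number of tweets: " + str(report["num_tweets"]),
        "Number of positive tweets: " + str(report["num_positive"]),
        "Number of negative tweets: " + str(report["num_negative"]),
        "Number of neutral tweets: " + str(report["num_neutral"]),
        "Number of favorited tweets: " + str(report["num_favorite"]),
        "Average sentiment of favorited tweets: " + str(report["avg_favorite"]),
        "Number of retweeted tweets: " + str(report["num_retweet"]),
        "Average sentiment of retweeted tweets: " + str(report["avg_retweet"]),
        "Top five countries by average sentiment: " + report["top_five"],
    ]
    try:
        with open(output_file, "w") as out:
            out.write("\n".join(lines) + "\n")
    except IOError:
        print("Could not open file " + output_file)
        return
    # Reached only when the file was opened and written without an error.
    print("Wrote report to " + output_file)
```

Using str() on the averages means a value of "NAN" is written unchanged, so the same code handles both the float and string cases.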

4.2 - main.py

IMPORTANT!
All specified functions should be defined in sentiment_analysis.py and not main.py .

Any specified functions defined in main.py will not be graded.

The program in main.py should ask the user for the file names of the keyword file and tweet file that data will be read from, as well as

the name of the report file that will be created. It must use the functions defined in the sentiment_analysis.py module to perform the

tasks described in Section 3 and write the final report.

Additionally, main.py should check that the input from the user is valid and raise an exception in the following cases:

1 If the keywords filename does not end in the .tsv extension an Exception with the text "Must have tsv file

extension!" should be raised.

2 If the tweet filename does not end in the .csv extension an Exception with the text "Must have csv file extension!"

should be raised.

3 If the report filename does not end in the .txt extension an Exception with the text "Must have txt file extension!"

should be raised.

4 If either read_keywords or read_tweets returns an empty dictionary or empty list an Exception with the text "Tweet list

or keyword dictionary is empty!" should be raised.
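The three extension checks follow the same pattern, so they could share one helper, as sketched below. The name check_extension is an assumption made for this example and is not part of the specification; the checks can equally be written as three separate if statements in main().

```python
def check_extension(file_name, extension):
    """Hypothetical helper: raise an Exception if file_name lacks the extension."""
    if not file_name.endswith("." + extension):
        raise Exception("Must have " + extension + " file extension!")


check_extension("keywords.tsv", "tsv")  # valid, so nothing happens

try:
    check_extension("bad_file_ext.docx", "tsv")
except Exception as error:
    print(error)  # Must have tsv file extension!
```

In main.py itself the exception would not be caught, so the program stops with a traceback.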

Example Input / Output (successful):

Input keyword filename (.tsv file): keywords.tsv

Input tweet filename (.csv file): tweets.csv

Input filename to output report in (.txt file): report.txt

Wrote report to report.txt

User input is shown in red. The report should be written to the file name given by the user (in this case report.txt) and not shown.

Your prompts to the user should contain the same text as shown above. Note that the last line, "Wrote report to report.txt" is

printed by the write_report function.

Example Input / Output (Exception):

Input keyword filename (.tsv file): bad_file_ext.docx

Traceback (most recent call last):

File "main.py", line 7, in <module>

raise Exception("Must have tsv file extension!")

Exception: Must have tsv file extension!

5 Non-Functional Specification


In addition to the other tasks and specifications given in this document, your program must also fulfill the following requirements:

1 Your code must be written for Python 3 and work in Python 3.9.

2 You may not use any modules or third-party libraries not described in this document. Standard built-in functions such as the

String, file, and math functions are fine. You should not have to import anything other than your sentiment_analysis module.

3 You must document your code with brief comments. Each file should contain a comment at the top of the file with your name,

student number, and a brief description of what is contained in that file. At least one comment should also be given for each

function that describes its purpose, parameters, and values returned. You should also include any additional comments to

document any lines that may be unclear to the reader.

4 Your program must be efficient and terminate within a reasonable time limit. All Gradescope test cases must terminate within the autograder’s 10-minute time limit.

5 Assignments are to be done individually and must be your own original work. You may not show or otherwise share your code

for this assignment with others. Software will be used to detect academic dishonesty (cheating). If you have any questions

about what is or is not academic dishonesty, please consult the document on academic dishonesty and ask any questions to

your course instructor before submitting this assignment.

6 You must follow Python style and coding conventions and good programming techniques, for example:

Meaningful variable and function names.

Follow conventions (either camelCase or snake_case) for naming variables and constants. This must be done consistently

throughout your program.

Readability: indentation, white space, consistency.

Try to follow the PEP 8 style guide for Python code where possible.

Do not use global variables unless they are constant (never change) and do not have functions access variables outside of

their scope.

Do not define functions inside of other functions.

Do not use recursion inappropriately or in a way that would eventually cause your program to crash. Your main()

method should only be called once and not from another function (should only be called from the bottom of main.py).

7 All of your code should be contained in the files main.py and sentiment_analysis.py . Only submit these files and no others and

ensure the filenames match exactly. It is your responsibility to ensure you have submitted the correct files.

8 sentiment_analysis.py must only contain function definitions. No code should be outside of a function in this file. Running sentiment_analysis.py directly should result in no output and should not wait for any input.

9 main.py should not contain any specified functions; only functions in sentiment_analysis.py will be graded by the autograder.

10 All function names, key names, and outputs should follow the specifications given in this document exactly. Not following the
specifications may lead to test cases failing. It is your responsibility to ensure you have followed them correctly.

11 Frequently back up your work remotely (e.g. using OneDrive) in a way that is secure and private. No extension will be given for lost or corrupted files.

Violating any of these rules will result in a mark penalty!

6 Starter Code
The following starter code has been provided for you. You are free to use this code in your solution. You should keep the function

headers (the names and the parameters the same). Keep the functions in the files shown below and do not use global variables.

6.1 - sentiment_analysis.py

"""
Starter code for sentiment_analysis.py
Your function headers must match this file exactly.
You should only have function definitions in this file.
No code should be outside of a function in this file.

Replace this comment with one containing your full name,
student number, UWO username, the date, and a short
description of what this file does/contains.

Each function should have at least one comment documenting
what it does and the arguments it takes.
"""

def read_keywords(keyword_file_name):
    # Add your code here
    # Should return a dict of keywords.
    pass  # placeholder so the file runs; replace with your code

def clean_tweet_text(tweet_text):
    # Add your code here
    # Should return a string with the clean tweet text.
    pass  # placeholder so the file runs; replace with your code

def calc_sentiment(tweet_text, keyword_dict):
    # Add your code here
    # Should return an integer value.
    pass  # placeholder so the file runs; replace with your code

def classify(score):
    # Add your code here
    # Should return a string.
    pass  # placeholder so the file runs; replace with your code

def read_tweets(tweet_file_name):
    # Add your code here
    # Should return a list with a dictionary for each tweet.
    pass  # placeholder so the file runs; replace with your code

def make_report(tweet_list, keyword_dict):
    # Add your code here
    # Should return a dictionary containing the report values.
    pass  # placeholder so the file runs; replace with your code

def write_report(report, output_file):
    # Add your code here
    # Should write the report to the output_file.
    pass  # placeholder so the file runs; replace with your code

6.2 - main.py

"""
Starter code for main.py

This file should take input from the user and call the
functions in sentiment_analysis.py

Replace this comment with one containing your full name,
student number, UWO username, the date, and a short
description of what this file does/contains.

You can NOT move the functions from sentiment_analysis.py
into this file. They must be defined in sentiment_analysis.py
"""

# Import the sentiment_analysis module
from sentiment_analysis import *

def main():
    # Add code for main() here.
    # This should get input from the user and call the
    # required functions from sentiment_analysis.py
    pass  # placeholder so the file runs; replace with your code

main()

7 Marking and Submission


7.1 - Submission

1 You must submit the 2 files:

main.py

sentiment_analysis.py

2 This must be submitted to the Assignment 3 submission page on Gradescope.

3 Several tests will automatically run when you upload your files. It is important to review the results of these test cases, as
this will give you an idea of how well your program is working. You may resubmit any number of times up until the due date.

4 It is recommended that you create your own test cases to check that the code is working properly for a multitude of different

scenarios (some example datasets have been provided for you at the bottom of this document).

5 Assignments will not be accepted by email or in any form other than a Gradescope submission.

7.2 - Marking

The assignment will be marked as a combination of your auto-graded tests and manual grading of your code logic, comments,

formatting, style, etc.

Marks will be deducted for failing to follow any of the specifications in this document (both functional and nonfunctional), not

documenting your code with comments, using poor formatting or style, or naming your files incorrectly.

TEST CASE REQUIREMENT


Starting with this assignment, you MUST pass the test cases to obtain the autograder points on the assignment.

For this assignment Gradescope will show you the result of all testcases (there are no hidden test cases). As such it is your

responsibility to ensure they all pass. TAs will not manually grade your code.

Submit to Gradescope often and ensure all test cases pass before the due date.

Assume the autograder is correct and has marked your test cases accurately until told otherwise.

Marking Scheme

[119 marks] Auto-graded test cases which check your code for correctness and adherence to the specifications given in this
document. For this assignment these MUST pass; TAs will not manually grade your code.

[6 marks] Comments. One comment at the top of each file with your details and a description of the file, one comment describing

each function, and any other needed comments.

[15 marks] Style and Variable Names. Using consistent and clear variable names, avoiding global variables, defining functions
correctly (not inside of each other), and other programming style.

Total: 140 marks

Penalties

Additional marks will be removed for the following:

Filename Issues: If one or more files are not named correctly. They must be exactly as specified, including capitalization.

Code outside a function in sentiment_analysis.py: If the program has input or output in sentiment_analysis.py outside of a

function or other code outside of a function in sentiment_analysis.py.

Functions in main.py: If the functions specified in the assignment such as read_tweets are defined in main.py and not
sentiment_analysis.py. It is fine if functions not specified in the assignment are in main.py. This penalty also does not apply to the main()
function in main.py.

Function or Key Name Incorrect: If the name of a function or a dictionary key does not match the specification exactly, including
capitalization.

Hardcoding: Hardcoding is writing code that is not easily modified or reused. Hardcoded code cannot properly adapt to user input

and only works for set values or cases. Hardcoding your program to only pass the Gradescope test cases, and not work properly

for other cases, will result in a significant penalty on this assignment.
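To illustrate the hardcoding penalty above, here is a minimal sketch contrasting a hardcoded function with a general one. Both functions and their names are hypothetical and are not part of the assignment. The first only "works" for one known input, while the second computes its result from whatever it is given.

```python
def average_hardcoded(scores):
    # Hardcoded: returns the right answer for one known test input only.
    if scores == [3, -1, 4]:
        return 2.0
    return 0

def average(scores):
    # General: computes the result from whatever input it receives.
    if len(scores) == 0:
        return 0
    return sum(scores) / len(scores)

print(average_hardcoded([10, 20]))  # 0 (wrong for any input it was not written for)
print(average([10, 20]))            # 15.0
```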

7.3 - Late Submission

Late assignments will only be accepted up to 4 days late and only if you have enough late coupons remaining (at least one for each day

late). If you submit one day late, you will need to use 1 late coupon. 2 days late, 2 late coupons. 3 days late, 3 late coupons, and so on. If

you have insufficient late coupons remaining or submit more than 4 days late, you will receive a zero grade on this assignment.

It is your responsibility to track your late coupon use. Any values shown on OWL should be considered an estimate and may not be

accurate or up to date.

REMEMBER!
You have 4 coupons (for the entire semester) that will be automatically applied when you submit late.

It is the student's responsibility to ensure the work was submitted and posted in Gradescope.

Any assignment not submitted correctly will not be graded.

Submissions by email will not be accepted under any circumstances.

Please check back on this page whenever an announcement is posted regarding this assignment.
ONLY the 2 mentioned files are to be submitted. Marks will be deducted if you submit anything other than these 2 files.

Do NOT submit any of the other files or folders.

Do NOT zip or archive your files.

Do NOT submit a PDF or screenshot of your code (this will result in a zero grade)!

8 Attachments and Examples

Example 1:
In this example, there is only one long tweet. Note that words like "adventurous" and "frustration" do not match the keywords
"adventure" and "frustrated" and do not alter the score. This is as intended. Your program should only count exact matches (after
capitalization and punctuation are removed).

Keyword File: keys1.tsv

Tweet File: onetweet.csv

Output File: keys1_onetweet.txt
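The exact-match rule in Example 1 can be pictured with a small sketch. It assumes that cleaning means lowercasing the text and keeping only letters and spaces; follow the functional specification for the real rule. The helper name simple_clean and the keyword scores are made up for illustration and are not the contents of keys1.tsv.

```python
def simple_clean(text):
    # Hypothetical helper: lowercase the text, keep only letters and spaces.
    kept = [ch for ch in text.lower() if ch.isalpha() or ch == " "]
    return "".join(kept)

# Hypothetical keyword scores for illustration only.
keywords = {"adventure": 2, "frustrated": -2}

tweet = "An ADVENTUROUS day, then frustration... what an adventure!"
words = simple_clean(tweet).split()

# Whole-word lookup: "adventurous" and "frustration" are not keys, so they do not match.
matched = [w for w in words if w in keywords]
print(matched)  # ['adventure']
```

Comparing whole words against the dictionary keys, rather than searching for keywords inside the text, is what keeps "adventurous" from counting as "adventure".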

Example 2:
In this example, there is only one tweet with a lot of punctuation. Once it is removed, the word smile is left, and this is the only word
that impacts the score.

Keyword File: keys1.tsv

Tweet File: onetweet2.csv

Output File: keys1_onetweet2.txt

Example 3:
In this example, the word happy is repeated many times. Each time a word is repeated it should count towards the overall score.

Keyword File: keys2.tsv

Tweet File: onetweet3.csv

Output File: keys2_onetweet3.txt
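The repetition rule in Example 3 amounts to adding a keyword's score once per occurrence. A minimal sketch, assuming the tweet text is already cleaned; the function name total_score and the score of 3 for "happy" are hypothetical, not taken from keys2.tsv.

```python
def total_score(clean_text, keyword_scores):
    # Add the score of every word occurrence, so repeats count each time.
    total = 0
    for word in clean_text.split():
        total += keyword_scores.get(word, 0)  # non-keywords add 0
    return total

scores = {"happy": 3}  # hypothetical score for illustration
print(total_score("happy happy so happy", scores))  # 9
```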

Example 4:
In this example, there are 20 tweets in the CSV file. Note that as there are only 4 different countries, the report only lists 4 countries in
the top 5.

Keyword File: keys1.tsv

Tweet File: tweet1.csv

Output File: keys1_tweet1.txt

Example 5:
In this example, there are 40 tweets in the CSV file. In this case, some tweets have NULL values for country, state, or city as well as the
lat/long. NULL should not be considered a country for the top 5 list.

Keyword File: keys2.tsv

Tweet File: tweet2.csv

Output File: keys2_tweet2.txt
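Examples 4 and 5 describe two edge cases in the country ranking: fewer countries than slots in the list, and NULL country values. The sketch below shows one way to handle both. It assumes countries are ranked by average sentiment, highest first (check the functional specification for the actual rule). The function name, the limit parameter, and the data are all hypothetical.

```python
def top_countries(country_averages, limit=5):
    # country_averages: hypothetical dict mapping country name -> average score.
    # Drop the NULL placeholder so it is never ranked as a country.
    real = {c: avg for c, avg in country_averages.items() if c != "NULL"}
    # Sort country names by their average, highest first.
    ranked = sorted(real, key=lambda c: real[c], reverse=True)
    # Slicing past the end is safe: with 4 countries, only 4 are returned.
    return ranked[:limit]

averages = {"Canada": 2.5, "NULL": 9.0, "Mexico": 1.0, "Brazil": 3.0, "Japan": -1.0}
print(top_countries(averages))  # ['Brazil', 'Canada', 'Mexico', 'Japan']
```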

Example 6:
In this example, the full AFINN-111 wordlist is used on the tweets from example 5.

Keyword File: keywords.tsv

Tweet File: tweet2.csv

Output File: keywords_tweet2.txt

Real World Datasets (your code will not be tested on these tweets)
The following are real-life examples of tweets taken from X (Twitter) and word lists used in real-life sentiment analysis. As such they
may be more complex, longer, and could contain inappropriate language. The autograder will test your code with smaller and more
basic tweet datasets, but these datasets below are included if you would like to try your code on real world data.

keywords.tsv - The AFINN-111 wordlist.

air_canada.csv - Dataset of tweets about Air Canada.

adidas.csv - Dataset of tweets about Adidas.

chatgpt.csv - Dataset of tweets about ChatGPT.

disney.csv - Dataset of tweets about Disney.

microsoft.csv - Dataset of tweets about Microsoft.

netflix.csv - Dataset of tweets about Netflix.

shopify.csv - Dataset of tweets about Shopify.

sony.csv - Dataset of tweets about Sony.

9 Helpful Functions and Methods

You may find the following built-in Python functions and methods useful.

String Functions/Methods

split, strip, lower, endswith, join, count

File Functions/Methods

open, close, readline, write

Type Conversion

int, float, str

Math and Other Functions

round, sorted, len
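As a quick refresher, the short sketch below exercises each of the functions and methods listed above on made-up values. The sample strings, the numbers, and the temporary file name demo.txt are illustrative assumptions and are not taken from the assignment files.

```python
import os
import tempfile

# String methods: strip, split, lower, join, count, endswith
line = "  Great,Day,EVER\n"
fields = line.strip().split(",")        # ['Great', 'Day', 'EVER']
lowered = [f.lower() for f in fields]   # ['great', 'day', 'ever']
joined = " ".join(lowered)              # 'great day ever'
a_count = joined.count("a")             # 2
is_tsv = "keys1.tsv".endswith(".tsv")   # True

# Type conversion: int, float, str
number = int("42") + 1                  # 43
ratio = float("2.5") * 2                # 5.0
label = "score: " + str(number)         # 'score: 43'

# Math and other functions: round, sorted, len
rounded = round(2 / 3, 2)               # 0.67
ordered = sorted([3, 1, 2])             # [1, 2, 3]
size = len(ordered)                     # 3

# File functions: open, write, close, readline (using a throwaway temp file)
path = os.path.join(tempfile.mkdtemp(), "demo.txt")
out_file = open(path, "w")
out_file.write("first line\nsecond line\n")
out_file.close()

in_file = open(path, "r")
first = in_file.readline()              # 'first line\n'
in_file.close()
```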
