
{

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Import Packages"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import nltk\n",
"from nltk.corpus import twitter_samples\n",
"import numpy as np"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Download Twitter dataset"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"nltk.download('twitter_samples')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can load the text fields of the positive and negative tweets by using the
module's strings() method like this:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# select the set of positive and negative tweets\n",
"all_positive_tweets = twitter_samples.strings('positive_tweets.json')\n",
"all_negative_tweets = twitter_samples.strings('negative_tweets.json')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Next, we'll print a report with the number of positive and negative tweets."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Number of positive tweets: ', len(all_positive_tweets))\n",
"print('Number of negative tweets: ', len(all_negative_tweets))\n",
"\n",
"print('\\nThe type of all_positive_tweets is: ', type(all_positive_tweets))\
n",
"print('The type of a tweet entry is: ', type(all_negative_tweets[0]))"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print(\"Positive Tweet Example:\")\n",
"print(all_positive_tweets[0])\n",
"\n",
"print(\"\\nNegative Tweet Example:\")\n",
"print(all_negative_tweets[0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Preprocess Tweets"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import re # library for regular expression
operations\n",
"import string # for string operations\n",
"\n",
"from nltk.corpus import stopwords # module for stop words that come
with NLTK\n",
"from nltk.stem import PorterStemmer # module for stemming\n",
"from nltk.tokenize import TweetTokenizer # module for tokenizing strings"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove hyperlinks, Twitter marks and styles"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We do not want to use every word in a tweet because many tweets have hashtags,
retweet marks, and hyperlinks. We will use regular expressions to remove them from
a tweet."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def remove_hyperlinks_marks_styles(tweet):\n",
" \n",
" # remove old style retweet text \"RT\"\n",
" new_tweet = re.sub(r'^RT[\\s]+', '', tweet)\n",
"\n",
" # remove hyperlinks\n",
" new_tweet = re.sub(r'https?:\\/\\/.*[\\r\\n]*', '', new_tweet)\n",
"\n",
" # remove hashtags\n",
" # only removing the hash # sign from the word\n",
" new_tweet = re.sub(r'#', '', new_tweet)\n",
" \n",
" return new_tweet"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Tokenize the string"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To tokenize means to split a string into individual words."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# instantiate tokenizer class\n",
"tokenizer = TweetTokenizer(preserve_case=False, strip_handles=True,\n",
" reduce_len=True)\n",
"\n",
"def tokenize_tweet(tweet):\n",
" \n",
" tweet_tokens = tokenizer.tokenize(tweet)\n",
" \n",
" return tweet_tokens"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Remove stop works and punctuations"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Remove stop words and punctuations. Stop words are words that don't add
significant meaning to the text. For example, 'i' and 'me'."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"nltk.download('stopwords')\n",
"\n",
"#Import the english stop words list from NLTK\n",
"stopwords_english = stopwords.words('english')\n",
"\n",
"punctuations = string.punctuation\n",
"\n",
"def remove_stopwords_punctuations(tweet_tokens):\n",
" \n",
" tweets_clean = []\n",
" \n",
" for word in tweet_tokens:\n",
" if (word not in stopwords_english and word not in punctuations):\n",
" tweets_clean.append(word)\n",
" \n",
" return tweets_clean"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Stemming"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The process of converting a word to its most general form, or stem.\n",
"\n",
"learning -> learn\n",
"\n",
"learned -> learn\n",
"\n",
"learnt -> learn"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"stemmer = PorterStemmer()\n",
"\n",
"def get_stem(tweets_clean):\n",
" \n",
" tweets_stem = []\n",
" \n",
" for word in tweets_clean:\n",
" stem_word = stemmer.stem(word)\n",
" tweets_stem.append(stem_word)\n",
" \n",
" return tweets_stem"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tweet_example = all_positive_tweets[2277]\n",
"print(tweet_example)\n",
"\n",
"processed_tweet = remove_hyperlinks_marks_styles(tweet_example)\n",
"print(\"\\nRemoved hyperlinks, Twitter marks and styles:\")\n",
"print(processed_tweet)\n",
"\n",
"tweet_tokens = tokenize_tweet(processed_tweet)\n",
"print(\"\\nTokenize the string:\")\n",
"print(tweet_tokens)\n",
"\n",
"tweets_clean = remove_stopwords_punctuations(tweet_tokens)\n",
"print(\"\\nRemove stop words and punctuations:\")\n",
"print(tweets_clean)\n",
"\n",
"tweets_stem = get_stem(tweets_clean)\n",
"print(\"\\nGet stem of each word:\")\n",
"print(tweets_stem)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Combine all preprocess techniques"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def process_tweet(tweet):\n",
" \n",
" processed_tweet = remove_hyperlinks_marks_styles(tweet)\n",
" tweet_tokens = tokenize_tweet(processed_tweet)\n",
" tweets_clean = remove_stopwords_punctuations(tweet_tokens)\n",
" tweets_stem = get_stem(tweets_clean)\n",
" \n",
" return tweets_stem"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"tweet_example = all_negative_tweets[1000]\n",
"print(tweet_example)\n",
"\n",
"processed_tweet = process_tweet(tweet_example)\n",
"print(processed_tweet)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Split data into two pieces, one for training and one for testing"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"test_pos = all_positive_tweets[4000:]\n",
"train_pos = all_positive_tweets[:4000]\n",
"test_neg = all_negative_tweets[4000:]\n",
"train_neg = all_negative_tweets[:4000]\n",
"\n",
"train_x = train_pos + train_neg\n",
"test_x = test_pos + test_neg\n",
"\n",
"train_y = np.append(np.ones(len(train_pos)), np.zeros(len(train_neg)))\n",
"test_y = np.append(np.ones(len(test_pos)), np.zeros(len(test_neg)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Create frequency dictionary"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def create_frequency(tweets, ys):\n",
" \n",
" freq_d = {}\n",
"\n",
" # TODO: Create frequency dictionary\n",
" ...\n",
" \n",
" return freq_d"
]
},
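{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cell above leaves the frequency dictionary as a TODO. Below is a minimal sketch of one possible implementation, assuming the dictionary maps `(word, label)` pairs to counts (the structure implied by `pair[0]` in the training function later) and that each raw tweet is first run through `process_tweet`. The name `create_frequency_sketch` is only an illustrative placeholder, not the official solution."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hedged reference sketch, not the official solution.\n",
"# Assumes freq_d maps (stemmed_word, label) -> count.\n",
"def create_frequency_sketch(tweets, ys):\n",
"    freq_d = {}\n",
"    # pair each raw tweet with its label, preprocess it, and count (word, label) pairs\n",
"    for tweet, y in zip(tweets, ys):\n",
"        for word in process_tweet(tweet):\n",
"            pair = (word, y)\n",
"            freq_d[pair] = freq_d.get(pair, 0) + 1\n",
"    return freq_d"
]
},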
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# testing function\n",
"\n",
"tweets = ['i am happy', 'i am tricked', 'i am sad', 'i am tired', 'i am
tired']\n",
"ys = [1, 0, 0, 0, 0]\n",
"\n",
"freq_d = create_frequency(tweets, ys)\n",
"print(freq_d)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Train model using Naive Bayes"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# build the freqs dictionary\n",
"\n",
"freqs = create_frequency(train_x, train_y)"
]
},
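{
"cell_type": "markdown",
"metadata": {},
"source": [
"For reference, one common formulation of the quantities computed below (Naive Bayes with Laplacian smoothing, matching the variable names `V`, `N_pos`, `N_neg`, `D_pos`, and `D_neg` used in the function) is:\n",
"\n",
"$$\\text{logprior} = \\log\\frac{D_{pos}}{D_{neg}}$$\n",
"\n",
"$$P(w \\mid pos) = \\frac{freq_{pos} + 1}{N_{pos} + V}, \\qquad P(w \\mid neg) = \\frac{freq_{neg} + 1}{N_{neg} + V}$$\n",
"\n",
"$$\\text{loglikelihood}[w] = \\log\\frac{P(w \\mid pos)}{P(w \\mid neg)}$$"
]
},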
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"def train_naive_bayes(freqs, train_x, train_y):\n",
" '''\n",
" Input:\n",
" freqs: dictionary from (word, label) to how often the word appears\n",
" train_x: a list of tweets\n",
" train_y: a list of labels correponding to the tweets (0,1)\n",
" Output:\n",
" logprior: the log prior. (equation 3 above)\n",
" loglikelihood: the log likelihood of you Naive bayes equation.
(equation 6 above)\n",
" '''\n",
" \n",
" loglikelihood = {}\n",
" logprior = 0\n",
" \n",
" # calculate the number of unique words in vocab\n",
" unique_words = set([pair[0] for pair in freqs.keys()])\n",
" V = len(unique_words)\n",
" \n",
" # calculate N_pos and N_neg\n",
" N_pos = N_neg = 0\n",
" for pair in freqs.keys():\n",
" \n",
" # TODO: get N_pos and N_get\n",
" ...\n",
" \n",
" # TODO: calculate the number of documents (tweets)\n",
" D = ...\n",
" \n",
" # TODO: calculate D_pos, the number of positive documents (tweets)\n",
" D_pos = ...\n",
" \n",
" # TODO: calculate D_neg, the number of negative documents (tweets)\n",
" D_neg = ...\n",
" \n",
" # TODO: calculate logprior\n",
" logprior = ...\n",
" \n",
" # for each unqiue word\n",
" for word in unique_words:\n",
" \n",
" # get the positive and negative frequency of the word\n",
" freq_pos = ...\n",
" freq_neg = ...\n",
" \n",
" # calculate the probability that word is positive, and negative\n",
" p_w_pos = ...\n",
" p_w_neg = ...\n",
" \n",
" # calculate the log likelihood of the word\n",
" loglikelihood[word] = ...\n",
" \n",
" return logprior, loglikelihood"
]
},
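{
"cell_type": "markdown",
"metadata": {},
"source": [
"The TODOs above are left for the exercise. The sketch below shows one way they could be filled in, following the smoothed formulas given earlier. It assumes `freqs` maps `(word, label)` pairs to counts, with positive tweets labeled `1` (or `1.0`); the name `train_naive_bayes_sketch` is only an illustrative placeholder."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hedged reference sketch, not the official solution.\n",
"def train_naive_bayes_sketch(freqs, train_x, train_y):\n",
"    loglikelihood = {}\n",
"\n",
"    # vocabulary size\n",
"    unique_words = set(pair[0] for pair in freqs.keys())\n",
"    V = len(unique_words)\n",
"\n",
"    # total word counts in positive and negative tweets\n",
"    N_pos = N_neg = 0\n",
"    for (word, label), count in freqs.items():\n",
"        if label > 0:\n",
"            N_pos += count\n",
"        else:\n",
"            N_neg += count\n",
"\n",
"    # document (tweet) counts\n",
"    D = len(train_y)\n",
"    D_pos = sum(1 for y in train_y if y > 0)\n",
"    D_neg = D - D_pos\n",
"\n",
"    # log prior: log(D_pos / D_neg)\n",
"    logprior = np.log(D_pos) - np.log(D_neg)\n",
"\n",
"    # smoothed log likelihood for each word in the vocabulary\n",
"    for word in unique_words:\n",
"        freq_pos = freqs.get((word, 1), 0)\n",
"        freq_neg = freqs.get((word, 0), 0)\n",
"\n",
"        p_w_pos = (freq_pos + 1) / (N_pos + V)\n",
"        p_w_neg = (freq_neg + 1) / (N_neg + V)\n",
"\n",
"        loglikelihood[word] = np.log(p_w_pos / p_w_neg)\n",
"\n",
"    return logprior, loglikelihood"
]
},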
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"logprior, loglikelihood = train_naive_bayes(freqs, train_x, train_y)\n",
"print(logprior)\n",
"print(len(loglikelihood))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Predict Tweets!"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# UNQ_C4 (UNIQUE CELL IDENTIFIER, DO NOT EDIT)\n",
"def naive_bayes_predict(tweet, logprior, loglikelihood):\n",
" '''\n",
" Input:\n",
" tweet: a string\n",
" logprior: a number\n",
" loglikelihood: a dictionary of words mapping to numbers\n",
" Output:\n",
" p: the sum of all the logliklihoods of each word in the tweet (if
found in the dictionary) + logprior (a number)\n",
"\n",
" '''\n",
"\n",
" # TODO: process the tweet to get a list of words\n",
" word_l = ...\n",
"\n",
" # TODO: initialize probability to zero\n",
" p = ..\n",
"\n",
" # TODO: add the logprior\n",
" p += ...\n",
"\n",
" for word in word_l:\n",
"\n",
" # TODO: get log likelihood of each keyword\n",
" ...\n",
"\n",
" return p"
]
},
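{
"cell_type": "markdown",
"metadata": {},
"source": [
"Again, a minimal sketch of one possible way to complete the prediction function, assuming `process_tweet` and a `loglikelihood` dictionary keyed by word; `naive_bayes_predict_sketch` is only an illustrative placeholder name."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# A hedged reference sketch, not the official solution.\n",
"def naive_bayes_predict_sketch(tweet, logprior, loglikelihood):\n",
"    # preprocess the tweet into a list of stemmed words\n",
"    word_l = process_tweet(tweet)\n",
"\n",
"    # start from the log prior\n",
"    p = logprior\n",
"\n",
"    # add the log likelihood of each word found in the dictionary\n",
"    for word in word_l:\n",
"        p += loglikelihood.get(word, 0)\n",
"\n",
"    # p > 0 suggests a positive tweet, p < 0 a negative one\n",
"    return p"
]
},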
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Run this cell to test your function\n",
"for tweet in ['I am happy', 'I am bad', 'this movie should have been great.',
'great', 'great great', 'great great great', 'great great great great', 'bad bad
bad bad']:\n",
" # print( '%s -> %f' % (tweet, naive_bayes_predict(tweet, logprior,
loglikelihood)))\n",
" p = naive_bayes_predict(tweet, logprior, loglikelihood)\n",
"# print(f'{tweet} -> {p:.2f} ({p_category})')\n",
" print(f'{tweet} -> {p:.2f}')"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
