{

"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# U.S.A. Presidential Vocabulary"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"My Codecademy portfolio project from the <a href='https://www.codecademy.com/learn/paths/data-science'>Data Scientist Path</a> Natural Language Processing (NLP) Course, Word Embeddings Section."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Overview"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Whenever a United States of America president is elected or re-elected, an
inauguration ceremony takes place to mark the beginning of the president’s term.
During the ceremony, the president gives an inaugural address to the nation,
dictating the tone and focus of the next four years of leadership.\n",
"\n",
"In this project you will have the chance to analyze the inaugural addresses of
the presidents of the United States of America, as collected by the <a
href=\"https://www.nltk.org/book/ch02.html\">Natural Language Toolkit</a>, using
word embeddings.\n",
"\n",
"By training sets of word embeddings on subsets of the inaugural addresses versus the collection of addresses as a whole, we can learn about the different ways in which the presidents use language to convey their agendas."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Project Goal:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analyze USA presidential inaugural speeches using NLP word embeddings models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Project Requirements"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Be familiar with:\n",
"- Python3\n",
"- NLP (Natural Language Processing)\n",
"<br><br>\n",
"- The Python libraries:\n",
" - re\n",
" - pandas\n",
" - json\n",
" - collections\n",
" - NLTK\n",
" - gensim\n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Links:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href='https://www.alex-ricciardi.com/post/u-s-a-presidential-vocabulary'>My Project Blog Presentation</a><br>\n",
"<br>\n",
"<a href='https://github.com/ARiccGitHub/us_presidential_vocabulary'>Project GitHub</a><br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style='color : MediumBlue'>Preprocessing the Data</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The project corpus data can be freely downloaded from <a
href='http://www.nltk.org/nltk_data/'>NLTK Corpora</a> under the designation \"68.
C-Span Inaugural Address Corpus\"."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Libraries:"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"# Regex\n",
"import re\n",
"# Operating system dependent functionality\n",
"import os\n",
"# JSON encoder and decoder\n",
"import json\n",
"# Data manipulation tool\n",
"import pandas as pd\n",
"# Natural language processing\n",
"import nltk\n",
"# Tokenization into sentences\n",
"from nltk.tokenize import PunktSentenceTokenizer\n",
"# Stop words and lexical database of English \n",
"from nltk.corpus import stopwords, wordnet\n",
"# lemmatization class\n",
"from nltk.stem import WordNetLemmatizer\n",
"# Counter Dictionary class -
https://docs.python.org/3/library/collections.html#collections.Counter -\n",
"from collections import Counter"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Save list function:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this project, I want to save lists of lists. Since a list of lists is a list of objects, I use <a href='https://docs.python.org/3/library/json.html'>json</a> to serialize the nested lists as JSON objects."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The save_list() function:\n",
"\n",
"- Takes the arguments:\n",
" - file_name, string data type\n",
" - list_to_save, list data type\n",
"<br><br>\n",
"- Saves list_to_save into file_name.txt as JSON objects"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"def save_list(file_name, list_to_save): \n",
" with open(f'data/{file_name}.txt', 'w') as file:\n",
" file.write(json.dumps(list_to_save))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To load the saved list files, you can use the load_list() function.\n",
"\n",
"The load_list() function:\n",
"\n",
"- Takes the argument:\n",
" - list_name, string data type\n",
"- Loads list_name.txt\n",
"<br><br>\n",
"- Returns the contents of list_name.txt as a list "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"def load_list(list_name):\n",
" with open(f'data/{list_name}.txt', 'r') as file:\n",
" return json.loads(file.read())"
]
},
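The save/load pair above round-trips nested lists through JSON. A minimal sketch of that round trip using `json` directly on an in-memory value (no data/ folder needed), just to show that a list of lists survives serialization intact:

```python
import json

# A list of lists (the shape saved by save_list) survives a JSON round trip
nested = [['fellow', 'citizen'], ['oath', 'office']]
restored = json.loads(json.dumps(nested))
print(restored == nested)  # True
```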
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 style='color : DarkMagenta'>Converting files into a corpus</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"For this project, I decided to combine the file data into a corpus that I named ```speeches```."
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Project directory path\n",
"path = os.getcwd()\n",
"# Sorts and saves the file names from the corpus_data folder\n",
"file_names = sorted([file for file in os.listdir(f\"{path}/corpus_data\")])\
n",
"# Creates a speeches list from files \n",
"speeches = []\n",
"for name in file_names: \n",
" with open(f'corpus_data/{name}', 'r+') as file: \n",
" speeches.append(file.read())"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sample from the ```speeches``` corpus:"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"'Fellow citizens, I am again called upon by the voice of my country to
execute the functions of its Chief Magistrate. When the occasion proper for it
shall arrive, I shall endeavor to express the high sense I entertain of this
distinguished honor, and of the confidence which has been reposed in me by the
people of united America.\\n\\nPrevious to the execution of any official act of the
President the Constitution requires an oath of office. This oath I am now about to
take, and in your presence: That if it shall be found during my administration of
the Government I have in any instance violated willingly or knowingly the
injunctions thereof, I may (besides incurring constitutional punishment) be subject
to the upbraidings of all who are now witnesses of the present solemn ceremony.\\
n\\n \\n'"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 1793-Washington's speech\n",
"speeches[1]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can see from the ```speeches``` corpus sample that the speech texts are not clean and cannot be processed properly by an NLP model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 style='color : DarkMagenta'>Preprocessing corpus</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before a text can be processed by an NLP model, the text data needs to be
preprocessed.<br>\n",
"Text data preprocessing is the process of cleaning and prepping the text data
to be processed by NLP models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Cleaning and prepping tasks:\n",
"\n",
"- Noise removal is a text pre-processing step concerned with removing
unnecessary formatting from our text.\n",
"\n",
"- Tokenization is a text pre-processing step devoted to breaking up text into
smaller units (usually words or discrete terms).\n",
"<br>\n",
"\n",
"- Normalization is the name we give most other text preprocessing tasks,
including stemming, lemmatization, upper and lowercasing, and stopword removal.\n",
"\n",
" - Stemming is the normalization preprocessing task focused on removing
word affixes. \n",
"\n",
" - Lemmatization is the normalization preprocessing task that more carefully brings words down to their root forms."
]
},
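The cleaning and tokenization steps above can be sketched with only the standard-library `re` module; the helper name and sample text below are illustrative, not the project's exact pipeline:

```python
import re

def clean_and_tokenize(text):
    # Noise removal: drop commas and colons, split hyphenated words
    text = text.replace(",", "").replace("-", " ").replace(":", "")
    # Tokenization: lowercase and keep only alphanumeric characters
    return [re.sub(r'[^a-zA-Z0-9]+', '', word.lower()) for word in text.split()]

print(clean_and_tokenize("Fellow-citizens, I am again called upon:"))
# ['fellow', 'citizens', 'i', 'am', 'again', 'called', 'upon']
```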
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Tokenization"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In this project, I break down the presidents' speeches into words on a sentence-by-sentence basis using the sentence tokenizer <a href='https://www.nltk.org/_modules/nltk/tokenize/punkt.html'>nltk.tokenize.PunktSentenceTokenizer()</a> class."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Part-of-Speech Tagging"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To improve the performance of <a href=\"https://nlp.stanford.edu/IR-book/html/htmledition/stemming-and-lemmatization-1.html\">lemmatization</a> (bringing a word down to its root form), each word in the processed text is assigned a part-of-speech tag. \n",
"<a href=\"https://nlp.stanford.edu/software/tagger.shtml#:~:text=A%20Part%2DOf%2DSpeech%20Tagger,like%20'noun%2Dplural'.\">Part-of-Speech Tagging</a> is the process of reading text in some language and assigning a part of speech, such as noun, verb, or adjective, to each word (and other tokens)."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Part-of-Speech tagging function:\n",
"\n",
"The ```get_part_of_speech()``` function:\n",
"- Takes the arguments:\n",
" - ```word```, string data type.<br>\n",
"<br>\n",
"- Matches ```word``` with synonyms\n",
"- Tags ```word``` and counts the tags.<br> \n",
"<br>\n",
"- Returns the most common tag, the tag with the highest count, e.g. n for noun, string data type."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"def get_part_of_speech(word):\n",
" # Synonyms matching\n",
" probable_part_of_speech = wordnet.synsets(word)\n",
" # Initializing Counter class object\n",
" pos_counts = Counter()\n",
" # Tagging and counting tags\n",
" pos_counts[\"n\"] = len( [ item for item in probable_part_of_speech if
item.pos()==\"n\"] ) # Noun\n",
" pos_counts[\"v\"] = len( [ item for item in probable_part_of_speech if
item.pos()==\"v\"] ) # Verb\n",
" pos_counts[\"a\"] = len( [ item for item in probable_part_of_speech if item.pos()==\"a\"] ) # Adjective\n",
" pos_counts[\"r\"] = len( [ item for item in probable_part_of_speech if
item.pos()==\"r\"] ) # Adverb\n",
" # The most common tag, the tag with the highest count, ex: n for Noun \n",
" most_likely_part_of_speech = pos_counts.most_common(1)[0][0]\n",
" \n",
" return most_likely_part_of_speech"
]
},
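The tag-selection step in `get_part_of_speech()` relies on `collections.Counter.most_common()`; a tiny sketch with toy counts standing in for the WordNet synset tallies:

```python
from collections import Counter

# Toy counts standing in for synset tallies of a noun-heavy word
pos_counts = Counter()
pos_counts["n"] = 3  # noun
pos_counts["v"] = 1  # verb
pos_counts["a"] = 0  # adjective
pos_counts["r"] = 0  # adverb

# most_common(1) returns the highest-count (tag, count) pair
print(pos_counts.most_common(1)[0][0])  # n
```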
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### The word 'us':"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The word 'us' is a commonly used word in presidential inauguration addresses.<br>\n",
"The result of preprocessing the word 'us' through lemmatizing with the part-of-speech tagging method <a href='https://www.nltk.org/api/nltk.corpus.reader.html#nltk.corpus.reader.wordnet.Lemma.synset'>nltk.corpus.reader.wordnet.synsets()</a> function, in conjunction with `stopwords` removal and the <a href='https://www.nltk.org/_modules/nltk/stem/wordnet.html'>nltk.stem.WordNetLemmatizer().lemmatize()</a> method, is that the word 'us' becomes 'u'.<br>\n",
"<br>\n",
"This happens because the lemmatize(word, get_part_of_speech(word)) method removes the character 's' at the end of words tagged as nouns. The word 'us', which is not part of the stopwords list, is tagged as a noun, causing the lemmatization result of 'us' to be 'u'."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The word `'us'` is not a `stopword`:"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"False"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"'us' in set(stopwords.words('english'))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `get_part_of_speech()` function tags the word `'us'` as a noun:"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'n'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"get_part_of_speech('us')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The `lemmatize(word, get_part_of_speech(word))` method removes the character `'s'` at the end of words tagged as nouns, and the word `'us'` is tagged as a noun, causing the lemmatization result of `'us'` to be `'u'`. "
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'u'"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"normalizer_us = WordNetLemmatizer()\n",
"normalizer_us.lemmatize('us', get_part_of_speech('us'))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"u\n",
"\n"
]
}
],
"source": [
"print(f'{normalizer_us.lemmatize(\"us\", get_part_of_speech(\"us\"))}\\n')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Preprocessing:"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# Stop words\n",
"stop_words = set(stopwords.words('english'))\n",
"# Initializes the lemmatizer\n",
"normalizer = WordNetLemmatizer()\n",
"# Creates an empty list of processed speeches\n",
"preprocessed_speeches = []\n",
"# ---------------------------------------------------- Preprocessing loop\n",
"for speech in speeches:\n",
" # ------------------ Tokenizing\n",
" # Initializes sentence tokenizer\n",
" sentence_tokenizer = PunktSentenceTokenizer()\n",
" # Tokenizes speech into sentences\n",
" sentence_tokenized_speech = sentence_tokenizer.tokenize(speech)\n",
" # ------------------ Normalizing loop\n",
" # Creates an empty sentences list \n",
" word_sentences = [] \n",
" for sentence in sentence_tokenized_speech:\n",
" # ----------- Removes noise from sentence and tokenizes the sentence
into words \n",
" word_tokenized_sentence = [re.sub('[^a-zA-Z0-9]+', '',
word.lower()) \\\n",
" for word in
sentence.replace(\",\",\"\").replace(\"-\",\" \").replace(\":\",\"\").split()] \n",
" # ---------------- Removes stopwords from sentences\n",
" sentence_no_stopwords = [word for word in word_tokenized_sentence if
word not in stop_words]\n",
" # ---------------- Before lemmatizing, adds an 's' to the word 'us'\n",
" word_sentence_us = ['uss' if word == 'us' else word for word in
sentence_no_stopwords]\n",
" # ---------------- Lemmatizes\n",
" word_sentence = [normalizer.lemmatize(word,
get_part_of_speech(word)) \\\n",
" for word in
word_sentence_us if not re.match(r'\\d+', word)]\n",
" # Stores preprocessed word \n",
" word_sentences.append(word_sentence) \n",
" # Stores sentence tokenized into words \n",
" preprocessed_speeches.append(word_sentences) \n",
"# Saves preprocessed corpus\n",
"save_list('preprocessed_speeches', preprocessed_speeches)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A personal preference <a
href='https://docs.python.org/3/library/pprint.html'>pprint-Data pretty
printer</a>"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Pretty printing has been turned OFF\n"
]
}
],
"source": [
"%pprint"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sample from the ```preprocessed_speeches``` list, preprocessed corpus:"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['fellow', 'citizen', 'call', 'upon', 'voice', 'country', 'execute',
'function', 'chief', 'magistrate']"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Displays the words from the first sentence of the second speech\n",
"preprocessed_speeches[1][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Dictionary of the presidents' speeches tokenized into words by sentence:"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Creates a list of the speech's names relative to the presidents' names and
year of the speech \n",
"year_president_speech_names = [name.lower().replace('.txt', '').replace('1989-
bush', '1989-bush senior') for name in file_names]\n",
"# Creates a dictionary of the presidents preprocessed speeches\n",
"presidents_speeches = dict(zip(year_president_speech_names,
preprocessed_speeches))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Presidents' preprocessed speeches DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 15,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Preprocessed Speech</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1789-washington</th>\n",
" <td>[[fellow, citizen, senate, house, representati...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1793-washington</th>\n",
" <td>[[fellow, citizen, call, upon, voice, country,...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1797-adams</th>\n",
" <td>[[first, perceive, early, time, middle, course...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1801-jefferson</th>\n",
" <td>[[friend, fellow, citizen, call, upon, underta...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1805-jefferson</th>\n",
" <td>[[proceed, fellow, citizen, qualification, con...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Preprocessed Speech\n",
"1789-washington [[fellow, citizen, senate, house, representati...\n",
"1793-washington [[fellow, citizen, call, upon, voice, country,...\n",
"1797-adams [[first, perceive, early, time, middle, course...\n",
"1801-jefferson [[friend, fellow, citizen, call, upon, underta...\n",
"1805-jefferson [[proceed, fellow, citizen, qualification, con..."
]
},
"execution_count": 15,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_presidents_speeches = pd.DataFrame({'Preprocessed Speech' :
preprocessed_speeches}, index = year_president_speech_names)\n",
"df_presidents_speeches.to_csv('data/processed_presidents_speeches.csv')\n",
"df_presidents_speeches.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sample from the ```df_presidents_speeches``` DataFrame:"
]
},
{
"cell_type": "code",
"execution_count": 16,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[['fellow', 'citizen', 'call', 'upon', 'voice', 'country', 'execute',
'function', 'chief', 'magistrate'], ['occasion', 'proper', 'shall', 'arrive',
'shall', 'endeavor', 'express', 'high', 'sense', 'entertain', 'distinguish',
'honor', 'confidence', 'repose', 'people', 'unite', 'america'], ['previous',
'execution', 'official', 'act', 'president', 'constitution', 'require', 'oath',
'office'], ['oath', 'take', 'presence', 'shall', 'find', 'administration',
'government', 'instance', 'violate', 'willingly', 'knowingly', 'injunction',
'thereof', 'may', 'besides', 'incur', 'constitutional', 'punishment', 'subject',
'upbraiding', 'witness', 'present', 'solemn', 'ceremony']]"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Displays the words in each sentence of the 1793-Washington speech \n",
"df_presidents_speeches.loc['1793-washington'][0]"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['fellow', 'citizen', 'call', 'upon', 'voice', 'country', 'execute',
'function', 'chief', 'magistrate']"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Displays the words in the first sentence of the 1793-Washington speech \n",
"df_presidents_speeches.loc['1793-washington'][0][0]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Combined list of all the sentences from all the presidents' speeches:"
]
},
{
"cell_type": "code",
"execution_count": 18,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Creates a list of all the sentences in preprocessed_speeches\n",
"all_sentences = [sentence for speech in preprocessed_speeches for sentence in
speech]\n",
"# Saves all_sentences\n",
"save_list('all_sentences', all_sentences)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Sample from the ```all_sentences``` list:"
]
},
{
"cell_type": "code",
"execution_count": 19,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['fellow', 'citizen', 'call', 'upon', 'voice', 'country', 'execute',
'function', 'chief', 'magistrate']"
]
},
"execution_count": 19,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"all_sentences[23]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"List of the words in all sentences:"
]
},
{
"cell_type": "code",
"execution_count": 20,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"all_words = [word for sentence in all_sentences for word in sentence]\n",
"# Saves all_words\n",
"save_list('all_words', all_words)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2 style='color : MediumBlue'>Word Embeddings</h2>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<a href='https://www.codecademy.com/learn/natural-language-processing/modules/nlp-word-embeddings'>Word embeddings</a> are a type of word representation that allows words with similar meanings to have a similar representation. In NLP, words are often represented as numeric vectors; the algorithms used to vectorize words are referred to as \"word to vector\" (<a href='https://en.wikipedia.org/wiki/Word2vec'>word2vec</a>) models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Libraries:\n",
"In addition to the libraries imported for preprocessing the data, I use the following library for word embeddings:"
]
},
{
"cell_type": "code",
"execution_count": 21,
"metadata": {},
"outputs": [],
"source": [
"# word2vec model library \n",
"import gensim"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: \n",
"<a href='https://blog.usejournal.com/how-does-the-model-produce-different-results-on-same-dataset-54486f951dbf'>Machine learning models can produce different results on the same dataset</a>: the models generate a sequence of random numbers, seeded by a <a href='https://towardsdatascience.com/how-to-use-random-seeds-effectively-54a4cd855a79'>random seed</a>, within the process of generating test, validation, and training datasets from a given dataset. Configuring a model's seed to a set value will ensure that the results are reproducible.<br>\n",
"The Python library `gensim` relies on multiple processes to initialize and train its word embeddings model class; if you need to generate more consistent results (not recommended), see the following link:<br>\n",
"<a href='https://stackoverflow.com/questions/34831551/ensure-the-gensim-generate-the-same-word2vec-model-for-different-runs-on-the-sam'>Ensure the gensim generate the same Word2Vec model for different runs on the same data</a>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Input Function:"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The input function is optional, a personal preference; I created it to easily input different variable values without having to change the code.<br>\n",
"The option to use the function is turned off by default. "
]
},
{
"cell_type": "code",
"execution_count": 22,
"metadata": {},
"outputs": [],
"source": [
"input_option = False"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The input_word() function:\n",
"\n",
"- Takes the arguments:\n",
" - input_subject, string data type\n",
" - word_list, list data type\n",
"<br><br>\n",
"- Outputs the input_subject on screen\n",
"- Takes a user input, inputted_word\n",
"- Compares inputted_word with the items in word_list\n",
"<br><br>\n",
"- Returns inputted_word.lower()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"metadata": {},
"outputs": [],
"source": [
"def input_word(input_subject, word_list):\n",
"    # User inputs a word\n",
"    inputted_word = input(f'\\nEnter a {input_subject}: ')\n",
"\n",
"    while inputted_word.lower() not in word_list:\n",
"        print(f'\\n{inputted_word} is not in the {input_subject} list')\n",
"        inputted_word = input(f'\\nPlease reenter a {input_subject}: ')\n",
"        print()\n",
"\n",
"    return inputted_word.lower()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 style='color : DarkMagenta'>All Presidents</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analysis of the presidential vocabulary by looking at all the inaugural
addresses."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most frequently used terms:<br>\n",
"The following list shows the ten most frequently used presidential inauguration speech terms. \n",
"The numeric values represent the number of times the corresponding words appear in the combined presidential inauguration speeches. "
]
},
{
"cell_type": "code",
"execution_count": 24,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"[('government', 651), ('people', 623), ('nation', 515), ('us', 480),
('state', 448), ('great', 394), ('upon', 371), ('must', 366), ('make', 357),
('country', 355)]"
]
},
"execution_count": 24,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"most_freq_words = Counter(all_words).most_common()\n",
"# Saves most_freq_words\n",
"save_list('most_freq_words', most_freq_words)\n",
"\n",
"# 10 most frequently used words\n",
"most_freq_words[:10]"
]
},
{
"cell_type": "code",
"execution_count": 25,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[('government', 651), ('people', 623), ('nation', 515)]"
]
},
"execution_count": 25,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 3 most frequently used words with count\n",
"most_freq_words[:3]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The following list shows the three most frequently used presidential inauguration speech terms. "
]
},
{
"cell_type": "code",
"execution_count": 26,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['government', 'people', 'nation']"
]
},
"execution_count": 26,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 3 most frequently used words\n",
"[word[0] for word in most_freq_words[:3]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the three most frequently used terms, we can see that the main topics within the combined inaugural addresses seem to be centered around the terms `government`, `people` and `nation`.<br>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"#### Word2Vec, word embeddings model"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The idea behind word embeddings is a theory known as the distributional
hypothesis. This hypothesis states that words that co-occur in the same contexts
tend to have similar meanings.\n",
"Word2Vec is a shallow neural network model that can build word embeddings
using either continuous bag-of-words or continuous skip-grams.<br>\n",
"<br>\n",
"The word2vec method that I use to create word embeddings is based on
continuous skip-grams. Skip-grams function similarly to n-grams, except instead of
looking at groupings of n-consecutive words in a text, we can look at sequences of
words that are separated by some specified distance between them.<br>\n",
"<br>\n",
"For this project, we want to create a word embeddings model using the skip-gram word2vec method, within the context of the USA presidential inaugural speeches. "
]
},
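To make the skip-gram idea concrete, here is a minimal sketch of how (center, context) training pairs are formed from a tokenized sentence; the function and window size are illustrative, not gensim's internals:

```python
def skip_gram_pairs(sentence, window=2):
    # Pair each center word with every word within `window` positions of it
    pairs = []
    for i, center in enumerate(sentence):
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

print(skip_gram_pairs(['fellow', 'citizen', 'call'], window=1))
# [('fellow', 'citizen'), ('citizen', 'fellow'), ('citizen', 'call'), ('call', 'citizen')]
```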
{
"cell_type": "code",
"execution_count": 27,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"word_embeddings = gensim.models.Word2Vec(all_sentences, size=96, window=5,
min_count=1, workers=2, sg=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Note: ```gensim.models.Word2Vec()``` takes a text as an argument to give context to the words; for this project, the sentences in the ```all_sentences``` list are the context."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Vocabulary of terms:<br>\n",
"For this project, the vocabulary of terms is the list of unique words from the all-sentences word list, ```all_words```.<br>\n",
"In other words, the vocabulary of terms is the list of words, excluding stop words, used within the inaugural speeches. "
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [],
"source": [
"# Removes duplicate words from all_words\n",
"vocabulary_of_terms = list(set(all_words))\n",
"# Saves vocabulary_of_terms\n",
"save_list('vocabulary_of_terms', vocabulary_of_terms)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A sample of a word vector representation generated by the ```word_embeddings``` model: "
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"us\n",
"[-0.03566258 -0.057491 -0.22429784 -0.19385831 -0.05224766 -0.21745294\n",
" -0.03083961 0.01476462 -0.35729468 0.07782798 0.15073359 -0.01993579\n",
" -0.16864842 0.12328792 -0.22607584 -0.02692644 -0.17266034 -0.2122943\n",
" 0.0932364 -0.20162335 -0.08800333 0.13730125 -0.27899835 -0.14394923\n",
" 0.07692517 0.13672738 0.03948302 0.27413663 -0.3095462 -0.05113837\n",
" -0.05605842 0.09483039 -0.15032959 -0.05592294 0.00599146 -0.2568221\n",
" 0.1071072 0.11913454 0.11639356 -0.13132544 -0.04766015 -0.11872021\n",
" 0.28670758 -0.26508254 -0.01220568 0.16427317 0.2563698 -0.17130415\n",
" -0.11297461 -0.12608878 -0.01350829 -0.24418342 -0.04711331 0.20453948\n",
" 0.10789154 -0.28325412 0.00551981 0.05497228 0.20139584 -0.06281348\n",
" 0.10151973 0.2364551 -0.33533013 0.08800003 -0.02218356 -0.12237483\n",
" -0.38471484 0.03775231 0.12288336 0.20087232 0.26013163 -0.03415838\n",
" 0.16984472 -0.16185957 -0.14474404 0.10821487 -0.07793511 0.09060979\n",
" 0.35984805 -0.22210045 -0.23348917 -0.07200254 0.19353855 0.09751591\n",
" -0.14434262 0.18588798 0.05520443 -0.06917608 -0.19102585 -0.15925734\n",
" 0.33743894 0.17864917 -0.11023522 -0.2014192 0.04389788 0.05955582]\n"
]
}
],
"source": [
"vec_word = 'us'\n",
"\n",
"# Word vector representation\n",
"print(vec_word)\n",
"print(word_embeddings.wv[vec_word])"
]
},
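Gensim's `most_similar()` scores come from cosine similarity between word vectors. A minimal sketch of the computation with plain Python lists (the vectors below are made-up toy values, not real embeddings):

```python
import math

def cosine_similarity(u, v):
    # cos(u, v) = dot(u, v) / (||u|| * ||v||)
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Parallel vectors score ~1.0; orthogonal vectors score 0.0
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # ~1.0
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # 0.0
```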
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar terms sample:<br>\n",
"Using the word vectors created by the word embeddings model, we can calculate the cosine distances between the vectors to find out how similar terms are within the USA presidential inaugural speeches context."
]
},
{
"cell_type": "code",
"execution_count": 30,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"government\n",
"[('power', 0.9972726702690125), ('federal', 0.9970261454582214),
('authority', 0.9962713122367859), ('support', 0.9958561062812805), ('executive',
0.9952426552772522), ('grant', 0.9951847791671753), ('within', 0.9951294660568237),
('exercise', 0.9944376349449158), ('law', 0.9943814277648926), ('territory',
0.9943108558654785), ('protect', 0.9942190051078796), ('general',
0.9940023422241211), ('defend', 0.9939836859703064), ('respect',
0.9939656853675842), ('reserve', 0.9939048290252686), ('union',
0.9937343001365662), ('preserve', 0.993715763092041), ('principle',
0.9934515357017517), ('limit', 0.9931932091712952), ('local', 0.9931504726409912)]\n"
]
}
],
"source": [
"# Optional input function\n",
"if input_option:\n",
" similar_to_word = input_word('similar word', vocabulary_of_terms)\n",
"else:\n",
" similar_to_word = 'government'\n",
" \n",
"# Similar to \n",
"print(similar_to_word)\n",
"# Calculate the cosine distance between word vectors outputting the 20 most
similar words to the inputted word\n",
"similar_word_dist_vec = word_embeddings.wv.most_similar(similar_to_word,
topn=20)\n",
"# Saves vocabulary_of_terms\n",
"save_list('vocabulary_of_terms', vocabulary_of_terms)\n",
"# List of similar words and their vectors cosine distance relative to the
inputted word\n",
"print(similar_word_dist_vec) "
]
},
{
"cell_type": "code",
"execution_count": 31,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"government\n",
"['power', 'federal', 'authority', 'support', 'executive', 'grant', 'within',
'exercise', 'law', 'territory', 'protect', 'general', 'defend', 'respect',
'reserve', 'union', 'preserve', 'principle', 'limit', 'local']\n"
]
}
],
"source": [
"# List of the similar words no cosine distance\n",
"print(similar_to_word)\n",
"print([word[0] for word in similar_word_dist_vec])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"From the presidential inaugural addresses data and by training a word
embeddings model with it, I was able to create a some what accurate U.S.A.
presidential vocabulary, the small size of corpus limits how efficiently the model
can be trained, nonetheless the model gives us good insight into how terms are
connected to each other within the presidential inauguration addresses context. "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 style='color : DarkMagenta'>One President</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Analysis of a president vocabulary by looking at his inaugural addresses."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preprocessing president names:<br>\n",
"From the list `year_president_speech_names`, I can extract the president
names, but the list as duplicated president names, ex: `['1789-washington', '1792-
washington']`<br>\n",
"After removing the the years from the `year_president_speech_names` values, I
could used `set()` to remove duplicated values, but `set(`) does not preserve the
list values insertion order and I want to keep the values insertion order as
`['washington', ..., ..., ... , 'trump']`, the best method to remove a list
duplicated values and preserve the values insertion order is to use a `dictionary`
as follow:"
]
},
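A minimal demonstration of the difference, with a toy list of names: `dict.fromkeys` keeps the first occurrence of each key in insertion order (guaranteed since Python 3.7), while `set` makes no ordering promise.

```python
names = ['washington', 'adams', 'washington', 'jefferson', 'adams']

# Order-preserving deduplication: dict keys remember insertion order
deduped = list(dict.fromkeys(names))
print(deduped)  # → ['washington', 'adams', 'jefferson']

# set() also deduplicates, but its iteration order is arbitrary,
# so we would lose the chronological ordering of the presidents
print(sorted(set(names)))  # sorted here only to get a stable display
```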
{
"cell_type": "code",
"execution_count": 32,
"metadata": {
"scrolled": true
},
"outputs": [
{
"data": {
"text/plain": [
"['washington', 'adams', 'jefferson', 'madison', 'monroe', 'jackson',
'vanburen', 'harrison', 'polk', 'taylor', 'pierce', 'buchanan', 'lincoln', 'grant',
'hayes', 'garfield', 'cleveland', 'mckinley', 'roosevelt', 'taft', 'wilson',
'harding', 'coolidge', 'hoover', 'truman', 'eisenhower', 'kennedy', 'johnson',
'nixon', 'carter', 'reagan', 'bush senior', 'clinton', 'bush', 'obama', 'trump']"
]
},
"execution_count": 32,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"president_names = list(dict.fromkeys([re.sub(r'^....-', '', name) for name in
year_president_speech_names]))\n",
"# Saves president_names\n",
"save_list('president_names', president_names)\n",
"\n",
"president_names"
]
},
{
"cell_type": "code",
"execution_count": 33,
"metadata": {
"scrolled": false
},
"outputs": [],
"source": [
"# Optional input function\n",
"if input_option:\n",
" president_name = input_word('president name', president_names)\n",
"else:\n",
" president_name = 'madison'\n",
"\n",
"\n",
"# Speeches list\n",
"one_president_speeches = [presidents_speeches[name] for name in
year_president_speech_names if president_name in name]\n",
"# Sentences list \n",
"one_president_sentences = [sentence for speech in one_president_speeches for
sentence in speech]\n",
"# Words list\n",
"one_president_all_words = [word for sentence in one_president_sentences for
word in sentence]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The president most frequently used terms:"
]
},
{
"cell_type": "code",
"execution_count": 34,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"madison\n",
"[('war', 17), ('nation', 13), ('country', 11), ('state', 10), ('public', 8),
('unite', 8), ('right', 7), ('every', 7), ('without', 6), ('long', 6)]\n"
]
}
],
"source": [
"one_president_most_freq_words =
Counter(one_president_all_words).most_common()\n",
"# 10 most frequently used words\n",
"print(president_name)\n",
"print(one_president_most_freq_words[:10])"
]
},
{
"cell_type": "code",
"execution_count": 35,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"madison\n",
"['war', 'nation', 'country']\n"
]
}
],
"source": [
"# 3 most frequently used words\n",
"print(president_name)\n",
"print([word[0] for word in one_president_most_freq_words[0:3]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The president word embeddings model"
]
},
{
"cell_type": "code",
"execution_count": 36,
"metadata": {},
"outputs": [],
"source": [
"one_president_word_embeddings =
gensim.models.Word2Vec(one_president_sentences, size=96, window=5, min_count=1,
workers=2, sg=1)"
]
},
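Since `sg=1` selects the skip-gram architecture, it may help to see what training examples that produces: each word is paired with its neighbors within the window. The sketch below is a hypothetical illustration (toy sentence, window of 2 instead of 5) of the (center, context) pairs a skip-gram model is trained on:

```python
def skipgram_pairs(sentence, window):
    # For each center word, pair it with every word within `window` positions
    pairs = []
    for i, center in enumerate(sentence):
        lo = max(0, i - window)
        hi = min(len(sentence), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, sentence[j]))
    return pairs

sentence = ['war', 'nation', 'country', 'state']
print(skipgram_pairs(sentence, window=2))
```

Words that repeatedly share context words end up with nearby vectors, which is why `most_similar` works.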
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The president vocabulary of terms:"
]
},
{
"cell_type": "code",
"execution_count": 37,
"metadata": {},
"outputs": [],
"source": [
"# Removing duplicated words in one_president_all_words\n",
"one_president_vocabulary_of_terms = list(set(one_president_all_words))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"How similar terms are within the president's presidential inaugural speeches
context."
]
},
{
"cell_type": "code",
"execution_count": 38,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"madison's government\n",
"[('discontinue', 0.3531048893928528), ('weight', 0.31632769107818604),
('strengthen', 0.3040248155593872), ('perpetuate', 0.292847603559494),
('encourage', 0.28303566575050354), ('emigrate', 0.26809194684028625), ('knife',
0.2605974078178406), ('noble', 0.2598712742328644), ('emanate',
0.2519160807132721), ('mariner', 0.24878022074699402), ('determine',
0.24856646358966827), ('native', 0.23511092364788055), ('protection',
0.23499499261379242), ('american', 0.23039644956588745), ('stamp',
0.23036402463912964), ('form', 0.22654259204864502), ('intrigue',
0.21648332476615906), ('author', 0.21170350909233093), ('revenue',
0.21013860404491425), ('support', 0.2087252289056778)]\n"
]
}
],
"source": [
"# Optional input function\n",
"if input_option:\n",
" one_president_similar_to_word = input_word('word',
one_president_vocabulary_of_terms)\n",
"else:\n",
" one_president_similar_to_word = 'government'\n",
" \n",
"# Similar to \n",
"print(f'{president_name}\\'s {one_president_similar_to_word}')\n",
"# Calculate the cosine distance between word vectors outputting the 20 most
similar words to the inputted word\n",
"one_president_similar_word_dist =
one_president_word_embeddings.wv.most_similar(one_president_similar_to_word,
topn=20)\n",
"# List of similar words and their vectors cosine distance relative to the
inputted word\n",
"print(one_president_similar_word_dist)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cosine distance values of the most similar words are under 0.5, the result
are less than satisfying due to the small size of the corpus,
```one_president_sentences```, used to train the word embeddings model."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The list of the similar terms with no cosine distance:"
]
},
{
"cell_type": "code",
"execution_count": 39,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"madison's government\n",
"['discontinue', 'weight', 'strengthen', 'perpetuate', 'encourage',
'emigrate', 'knife', 'noble', 'emanate', 'mariner', 'determine', 'native',
'protection', 'american', 'stamp', 'form', 'intrigue', 'author', 'revenue',
'support']\n"
]
}
],
"source": [
"print(f'{president_name}\\'s {one_president_similar_to_word}')\n",
"print([word[0] for word in one_president_similar_word_dist])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"I thought that it will good idea to create a presidents' vocabularies
DataFrame :"
]
},
{
"cell_type": "code",
"execution_count": 40,
"metadata": {
"scrolled": false
},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Three Most Recurrent Terms</th>\n",
" <th>Ten Most Recurrent Terms</th>\n",
" <th>Terms List</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>washington</th>\n",
" <td>[government, every, may]</td>\n",
" <td>[government, every, may, citizen, present, cou...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>adams</th>\n",
" <td>[government, nation, people]</td>\n",
" <td>[government, nation, people, union, upon, coun...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>jefferson</th>\n",
" <td>[may, public, citizen]</td>\n",
" <td>[may, public, citizen, us, government, fellow,...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>madison</th>\n",
" <td>[war, nation, country]</td>\n",
" <td>[war, nation, country, state, public, unite, r...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>monroe</th>\n",
" <td>[state, great, government]</td>\n",
" <td>[state, great, government, war, citizen, unite...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>jackson</th>\n",
" <td>[government, people, state]</td>\n",
" <td>[government, people, state, power, public, uni...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>vanburen</th>\n",
" <td>[people, every, country]</td>\n",
" <td>[people, every, country, institution, governme...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>harrison</th>\n",
" <td>[power, people, state]</td>\n",
" <td>[power, people, state, government, upon, const...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>polk</th>\n",
" <td>[government, state, union]</td>\n",
" <td>[government, state, union, power, would, one, ...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>taylor</th>\n",
" <td>[shall, government, duty]</td>\n",
" <td>[shall, government, duty, interest, country, h...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>pierce</th>\n",
" <td>[upon, right, power]</td>\n",
" <td>[upon, right, power, nation, state, government...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>buchanan</th>\n",
" <td>[state, shall, constitution]</td>\n",
" <td>[state, shall, constitution, may, government, ...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>lincoln</th>\n",
" <td>[state, constitution, union]</td>\n",
" <td>[state, constitution, union, law, shall, gover...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>grant</th>\n",
" <td>[country, best, nation]</td>\n",
" <td>[country, best, nation, office, people, questi...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hayes</th>\n",
" <td>[country, government, upon]</td>\n",
" <td>[country, government, upon, party, state, publ...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>garfield</th>\n",
" <td>[government, people, constitution]</td>\n",
" <td>[government, people, constitution, make, state...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>cleveland</th>\n",
" <td>[people, government, public]</td>\n",
" <td>[people, government, public, citizen, us, shal...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>mckinley</th>\n",
" <td>[upon, government, people]</td>\n",
" <td>[upon, government, people, must, congress, gre...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>roosevelt</th>\n",
" <td>[nation, people, us]</td>\n",
" <td>[nation, people, us, government, life, must, m...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>taft</th>\n",
" <td>[government, make, business]</td>\n",
" <td>[government, make, business, law, must, may, s...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>wilson</th>\n",
" <td>[upon, life, great]</td>\n",
" <td>[upon, life, great, nation, shall, thing, men,...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>harding</th>\n",
" <td>[world, must, make]</td>\n",
" <td>[world, must, make, america, war, government, ...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>coolidge</th>\n",
" <td>[country, great, must]</td>\n",
" <td>[country, great, must, nation, government, peo...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>hoover</th>\n",
" <td>[government, law, people]</td>\n",
" <td>[government, law, people, nation, upon, progre...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>truman</th>\n",
" <td>[nation, world, people]</td>\n",
" <td>[nation, world, people, peace, freedom, free, ...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>eisenhower</th>\n",
" <td>[people, world, nation]</td>\n",
" <td>[people, world, nation, free, peace, freedom, ...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>kennedy</th>\n",
" <td>[let, us, world]</td>\n",
" <td>[let, us, world, side, power, new, nation, ple...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>johnson</th>\n",
" <td>[nation, us, change]</td>\n",
" <td>[nation, us, change, must, man, people, union,...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>nixon</th>\n",
" <td>[us, world, let]</td>\n",
" <td>[us, world, let, peace, america, new, nation, ...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>carter</th>\n",
" <td>[nation, new, must]</td>\n",
" <td>[nation, new, must, us, strength, people, toge...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>reagan</th>\n",
" <td>[us, government, people]</td>\n",
" <td>[us, government, people, world, american, one,...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bush senior</th>\n",
" <td>[new, us, make]</td>\n",
" <td>[new, us, make, nation, great, thing, work, wo...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>clinton</th>\n",
" <td>[us, new, world]</td>\n",
" <td>[us, new, world, america, american, must, cent...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>bush</th>\n",
" <td>[america, freedom, nation]</td>\n",
" <td>[america, freedom, nation, us, time, american,...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>obama</th>\n",
" <td>[us, must, nation]</td>\n",
" <td>[us, must, nation, america, people, new, time,...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>trump</th>\n",
" <td>[america, american, country]</td>\n",
" <td>[america, american, country, nation, people, o...</td>\n",
" <td>[need, trust, inadequacy, suffer, gratefully, ...</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Three Most Recurrent Terms \\\n",
"washington [government, every, may] \n",
"adams [government, nation, people] \n",
"jefferson [may, public, citizen] \n",
"madison [war, nation, country] \n",
"monroe [state, great, government] \n",
"jackson [government, people, state] \n",
"vanburen [people, every, country] \n",
"harrison [power, people, state] \n",
"polk [government, state, union] \n",
"taylor [shall, government, duty] \n",
"pierce [upon, right, power] \n",
"buchanan [state, shall, constitution] \n",
"lincoln [state, constitution, union] \n",
"grant [country, best, nation] \n",
"hayes [country, government, upon] \n",
"garfield [government, people, constitution] \n",
"cleveland [people, government, public] \n",
"mckinley [upon, government, people] \n",
"roosevelt [nation, people, us] \n",
"taft [government, make, business] \n",
"wilson [upon, life, great] \n",
"harding [world, must, make] \n",
"coolidge [country, great, must] \n",
"hoover [government, law, people] \n",
"truman [nation, world, people] \n",
"eisenhower [people, world, nation] \n",
"kennedy [let, us, world] \n",
"johnson [nation, us, change] \n",
"nixon [us, world, let] \n",
"carter [nation, new, must] \n",
"reagan [us, government, people] \n",
"bush senior [new, us, make] \n",
"clinton [us, new, world] \n",
"bush [america, freedom, nation] \n",
"obama [us, must, nation] \n",
"trump [america, american, country] \n",
"\n",
" Ten Most Recurrent Terms \\\n",
"washington [government, every, may, citizen, present, cou... \n",
"adams [government, nation, people, union, upon, coun... \n",
"jefferson [may, public, citizen, us, government, fellow,... \n",
"madison [war, nation, country, state, public, unite, r... \n",
"monroe [state, great, government, war, citizen, unite... \n",
"jackson [government, people, state, power, public, uni... \n",
"vanburen [people, every, country, institution, governme... \n",
"harrison [power, people, state, government, upon, const... \n",
"polk [government, state, union, power, would, one, ... \n",
"taylor [shall, government, duty, interest, country, h... \n",
"pierce [upon, right, power, nation, state, government... \n",
"buchanan [state, shall, constitution, may, government, ... \n",
"lincoln [state, constitution, union, law, shall, gover... \n",
"grant [country, best, nation, office, people, questi... \n",
"hayes [country, government, upon, party, state, publ... \n",
"garfield [government, people, constitution, make, state... \n",
"cleveland [people, government, public, citizen, us, shal... \n",
"mckinley [upon, government, people, must, congress, gre... \n",
"roosevelt [nation, people, us, government, life, must, m... \n",
"taft [government, make, business, law, must, may, s... \n",
"wilson [upon, life, great, nation, shall, thing, men,... \n",
"harding [world, must, make, america, war, government, ... \n",
"coolidge [country, great, must, nation, government, peo... \n",
"hoover [government, law, people, nation, upon, progre... \n",
"truman [nation, world, people, peace, freedom, free, ... \n",
"eisenhower [people, world, nation, free, peace, freedom, ... \n",
"kennedy [let, us, world, side, power, new, nation, ple... \n",
"johnson [nation, us, change, must, man, people, union,... \n",
"nixon [us, world, let, peace, america, new, nation, ... \n",
"carter [nation, new, must, us, strength, people, toge... \n",
"reagan [us, government, people, world, american, one,... \n",
"bush senior [new, us, make, nation, great, thing, work, wo... \n",
"clinton [us, new, world, america, american, must, cent... \n",
"bush [america, freedom, nation, us, time, american,... \n",
"obama [us, must, nation, america, people, new, time,... \n",
"trump [america, american, country, nation, people, o... \n",
"\n",
" Terms List \n",
"washington [need, trust, inadequacy, suffer, gratefully, ... \n",
"adams [need, trust, inadequacy, suffer, gratefully, ... \n",
"jefferson [need, trust, inadequacy, suffer, gratefully, ... \n",
"madison [need, trust, inadequacy, suffer, gratefully, ... \n",
"monroe [need, trust, inadequacy, suffer, gratefully, ... \n",
"jackson [need, trust, inadequacy, suffer, gratefully, ... \n",
"vanburen [need, trust, inadequacy, suffer, gratefully, ... \n",
"harrison [need, trust, inadequacy, suffer, gratefully, ... \n",
"polk [need, trust, inadequacy, suffer, gratefully, ... \n",
"taylor [need, trust, inadequacy, suffer, gratefully, ... \n",
"pierce [need, trust, inadequacy, suffer, gratefully, ... \n",
"buchanan [need, trust, inadequacy, suffer, gratefully, ... \n",
"lincoln [need, trust, inadequacy, suffer, gratefully, ... \n",
"grant [need, trust, inadequacy, suffer, gratefully, ... \n",
"hayes [need, trust, inadequacy, suffer, gratefully, ... \n",
"garfield [need, trust, inadequacy, suffer, gratefully, ... \n",
"cleveland [need, trust, inadequacy, suffer, gratefully, ... \n",
"mckinley [need, trust, inadequacy, suffer, gratefully, ... \n",
"roosevelt [need, trust, inadequacy, suffer, gratefully, ... \n",
"taft [need, trust, inadequacy, suffer, gratefully, ... \n",
"wilson [need, trust, inadequacy, suffer, gratefully, ... \n",
"harding [need, trust, inadequacy, suffer, gratefully, ... \n",
"coolidge [need, trust, inadequacy, suffer, gratefully, ... \n",
"hoover [need, trust, inadequacy, suffer, gratefully, ... \n",
"truman [need, trust, inadequacy, suffer, gratefully, ... \n",
"eisenhower [need, trust, inadequacy, suffer, gratefully, ... \n",
"kennedy [need, trust, inadequacy, suffer, gratefully, ... \n",
"johnson [need, trust, inadequacy, suffer, gratefully, ... \n",
"nixon [need, trust, inadequacy, suffer, gratefully, ... \n",
"carter [need, trust, inadequacy, suffer, gratefully, ... \n",
"reagan [need, trust, inadequacy, suffer, gratefully, ... \n",
"bush senior [need, trust, inadequacy, suffer, gratefully, ... \n",
"clinton [need, trust, inadequacy, suffer, gratefully, ... \n",
"bush [need, trust, inadequacy, suffer, gratefully, ... \n",
"obama [need, trust, inadequacy, suffer, gratefully, ... \n",
"trump [need, trust, inadequacy, suffer, gratefully, ... "
]
},
"execution_count": 40,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Creates a DataFrame \n",
"df_presidents_vocabularies = pd.DataFrame(index=president_names)\n",
"\n",
"# Sepeeches list\n",
"all_presidents_speeches = [[presidents_speeches[name] for name in
year_president_speech_names if president in name] \\\n",
"
for president in president_names]\n",
"# Sentences list \n",
"all_presidents_sentences = [[sentence for speech in speeches for sentence in
speech] \\\n",
" for
speeches in all_presidents_speeches]\n",
"# Words list\n",
"all_presidents_all_words = [[word for sentence in sentences for word in
sentence] \\\n",
" for
sentences in all_presidents_sentences]\n",
"\n",
"# Each president most three recurrent words \n",
"df_presidents_vocabularies['Three Most Recurrent Terms'] = [[word[0] for word
in Counter(words).most_common()[:3]] \\\n",
"
for words in all_presidents_all_words]\n",
" \n",
"# Each president most 10 recurrent words \n",
"df_presidents_vocabularies['Ten Most Recurrent Terms'] = [[word[0] for word in
Counter(words).most_common()[:15]] \\\n",
"
for words in all_presidents_all_words]\n",
"# Each president vocabulary of terms\n",
"df_presidents_vocabularies['Terms List'] = [list(set(one_president_all_words))
\\\n",
" for
presidents_all_words in all_presidents_all_words]\n",
"# Saves DataFrame\n",
"df_presidents_vocabularies.to_csv('data/presidents_vocabularies.csv')\n",
"df_presidents_vocabularies"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3 style='color : DarkMagenta'>Selection of Presidents</h3>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We can analyze further, using word embeddings, the presidential vocabulary by
combining the first five US presidents' inaugural speeches and compare the results
with the results from last five US presidents' inaugural speeches."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Preprocessing the data:"
]
},
{
"cell_type": "code",
"execution_count": 41,
"metadata": {},
"outputs": [],
"source": [
"# All words list\n",
"first_5_presidents_all_words = [word for words in all_presidents_all_words[:5]
for word in words]\n",
"last_5_presidents_all_words = [word for words in
all_presidents_all_words[len(all_presidents_all_words)-6:-1] \\\n",
"
for word in words]\n",
"\n",
"# Sentences list\n",
"first_5_presidents_sentences = [sentence for sentences in
all_presidents_sentences[:5] for sentence in sentences]\n",
"last_5_presidents_sentences = [sentence for sentences in
all_presidents_sentences[len(all_presidents_sentences)-6:-1] \\\n",
"
for sentence in sentences]\n",
"# Vocabulary of terms:\n",
"first_5_presidents_vocabulary = list(set(first_5_presidents_all_words))\n",
"last_5_presidents_vocabulary = list(set(last_5_presidents_all_words))"
]
},
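The slicing above relies on Python's negative-index semantics: `lst[:5]` takes the first five elements and `lst[-5:]` the last five. A toy sketch with a shortened, hypothetical list of president names:

```python
presidents = ['washington', 'adams', 'jefferson', 'madison', 'monroe',
              'reagan', 'bush senior', 'clinton', 'bush', 'obama', 'trump']

first_five = presidents[:5]   # elements at indices 0..4
last_five = presidents[-5:]   # the final five elements, trump included

print(first_five)
print(last_five)
```

Note that a slice such as `lst[-6:-1]` would instead take the five elements *before* the last one, dropping the final president.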
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most frequently used terms:"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First Five Presidents\n",
"[('government', 105), ('state', 103), ('nation', 81), ('great', 75), ('may',
69), ('citizen', 66), ('country', 65), ('people', 64), ('war', 61), ('public',
60)]\n",
"\n",
"Last Five Presidents\n",
"[('us', 176), ('america', 111), ('must', 105), ('nation', 104), ('world',
101), ('new', 101), ('american', 95), ('time', 91), ('people', 90), ('freedom',
81)]\n"
]
}
],
"source": [
"# First five presidents\n",
"first_5_presidents_most_freq_words =
Counter(first_5_presidents_all_words).most_common()\n",
"# 10 most frequently used words\n",
"print('First Five Presidents')\n",
"print(first_5_presidents_most_freq_words[:10])\n",
"\n",
"# Last five presidents\n",
"last_5_presidents_most_freq_words =
Counter(last_5_presidents_all_words).most_common()\n",
"# 10 most frequently used words\n",
"print('\\nLast Five Presidents')\n",
"print(last_5_presidents_most_freq_words[:10])"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First Five Presidents\n",
"[('government', 105), ('state', 103), ('nation', 81)]\n",
"\n",
"Last Five Presidents\n",
"[('us', 176), ('america', 111), ('must', 105)]\n"
]
}
],
"source": [
"# First five presidents\n",
"first_5_presidents_most_freq_words =
Counter(first_5_presidents_all_words).most_common()\n",
"# 3 most frequently used words\n",
"print('First Five Presidents')\n",
"print(first_5_presidents_most_freq_words[:3])\n",
"\n",
"# Last five presidents\n",
"last_5_presidents_most_freq_words =
Counter(last_5_presidents_all_words).most_common()\n",
"# 3 most frequently used words\n",
"print('\\nLast Five Presidents')\n",
"print(last_5_presidents_most_freq_words[:3])"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First Five Presidents\n",
"['government', 'state', 'nation']\n",
"\n",
"Last Five Presidents\n",
"['us', 'america', 'must']\n"
]
}
],
"source": [
"# 3 first most frequently used words\n",
"\n",
"print('First Five Presidents')\n",
"print([word[0] for word in first_5_presidents_most_freq_words[:3]])\n",
"\n",
"print('\\nLast Five Presidents')\n",
"print([word[0] for word in last_5_presidents_most_freq_words[:3]])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Word embeddings:"
]
},
{
"cell_type": "code",
"execution_count": 45,
"metadata": {},
"outputs": [],
"source": [
"first_5_presidents_word_embeddings =
gensim.models.Word2Vec(first_5_presidents_sentences, size=96, window=5, \n",
"
min_count=1, workers=2, sg=1)\n",
"last_5_presidents_word_embeddings =
gensim.models.Word2Vec(last_5_presidents_sentences, size=96, window=5, \n",
"
min_count=1, workers=2, sg=1)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Similar words:"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First five presidents government\n",
"[('great', 0.9993283152580261), ('war', 0.9993203282356262), ('may',
0.9992838501930237), ('make', 0.9992183446884155), ('state', 0.9992141723632812),
('every', 0.9992128610610962), ('nation', 0.9992037415504456), ('union',
0.9991971254348755), ('would', 0.99918532371521), ('power', 0.9991679191589355),
('public', 0.9991235136985779), ('us', 0.999123215675354), ('duty',
0.9991223812103271), ('year', 0.9991199970245361), ('country', 0.999106228351593),
('citizen', 0.9990841746330261), ('well', 0.9990724325180054), ('interest',
0.9990679621696472), ('principle', 0.9990620017051697), ('shall',
0.9990471005439758)]\n",
"\n",
"Last five presidents government\n",
"[('american', 0.9995585680007935), ('world', 0.9994905591011047), ('us',
0.999477207660675), ('make', 0.9994717836380005), ('citizen', 0.9994479417800903),
('long', 0.9994418025016785), ('must', 0.999437153339386), ('great',
0.9994307160377502), ('nation', 0.9994282126426697), ('work', 0.9994194507598877),
('america', 0.9994121789932251), ('life', 0.9994106888771057), ('freedom',
0.9994078874588013), ('every', 0.9993854761123657), ('see', 0.9993788003921509),
('time', 0.9993759393692017), ('word', 0.9993681907653809), ('history',
0.9993537664413452), ('power', 0.9993528127670288), ('generation',
0.9993504285812378)]\n"
]
}
],
"source": [
"# Optional input function\n",
"input_option = False\n",
"if input_option:\n",
" first_last_pre_voc = list(set(first_5_presidents_vocabulary +
last_5_presidents_vocabulary))\n",
" first_last_pre_similar_to_word = input_word('First and last four
presidents word', first_last_pre_voc)\n",
"else:\n",
" first_last_pre_similar_to_word = 'government'\n",
"\n",
"# Calculate the cosine distance between word vectors outputting the 20 most
similar words to the inputted word\n",
"first_5_pre_similar_word_dist =
first_5_presidents_word_embeddings.wv.most_similar(first_last_pre_similar_to_word,
topn=20)\n",
"last_5_pre_similar_word_dist =
last_5_presidents_word_embeddings.wv.most_similar(first_last_pre_similar_to_word,
topn=20)\n",
"\n",
"# List of similar words and their vectors cosine distance relative to the
inputted word\n",
"print(f'First five presidents {first_last_pre_similar_to_word}')\n",
"print(first_5_pre_similar_word_dist)\n",
"print(f'\\nLast five presidents {first_last_pre_similar_to_word}')\n",
"print(last_5_pre_similar_word_dist)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The cosine distance values of the most similar words are close to 1, the
results are satisfying, better than ones from one president, the results are better
due to the larger size of the corpus used to train the word embeddings models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Most similar words, no cosine distances:"
]
},
{
"cell_type": "code",
"execution_count": 47,
"metadata": {
"scrolled": false
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"First five presidents government\n",
"['great', 'war', 'may', 'make', 'state', 'every', 'nation', 'union',
'would', 'power', 'public', 'us', 'duty', 'year', 'country', 'citizen', 'well',
'interest', 'principle', 'shall']\n",
"\n",
"Last five presidents government\n",
"['american', 'world', 'us', 'make', 'citizen', 'long', 'must', 'great',
'nation', 'work', 'america', 'life', 'freedom', 'every', 'see', 'time', 'word',
'history', 'power', 'generation']\n"
]
}
],
"source": [
"print(f'First five presidents {first_last_pre_similar_to_word}')\n",
"print([word[0] for word in first_5_pre_similar_word_dist])\n",
"print(f'\\nLast five presidents {first_last_pre_similar_to_word}')\n",
"print([word[0] for word in last_5_pre_similar_word_dist])"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.8.2"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
