Assign2 Writeup
The first step was to ensure that the text was in English. I used the langdetect
library to detect the language of each text instance. If the language was not
English, the text was discarded.
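
A minimal sketch of this filtering step could look like the following. The function name keep_english is hypothetical, and the detector is passed in as an argument so the sketch does not depend on how langdetect is installed; it assumes the detector returns a language code such as "en" and may raise on inputs it cannot classify.

```python
from typing import Callable, Iterable, List


def keep_english(texts: Iterable[str], detect: Callable[[str], str]) -> List[str]:
    """Return only the texts that the detector labels as English ("en")."""
    kept = []
    for text in texts:
        try:
            if detect(text) == "en":
                kept.append(text)
        except Exception:
            # Detectors can raise on inputs with no usable features
            # (empty strings, digits only); such texts are discarded.
            continue
    return kept
```

With langdetect installed, the detector would presumably be supplied as `from langdetect import detect` followed by `keep_english(texts, detect)`.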
I implemented patterns to identify and remove code snippets and common boilerplate
text often found in web content.
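
The exact patterns are not listed here, so the following is only an illustrative sketch of the idea using regular expressions. The pattern names and the boilerplate phrases are assumptions chosen for the example, not the patterns used in the assignment.

```python
import re

# Hypothetical patterns: fenced code blocks, script/style blocks, HTML tags,
# and whole lines that start with a common boilerplate phrase.
FENCED_CODE = re.compile(r"`{3}.*?`{3}", re.DOTALL)
SCRIPT_OR_STYLE = re.compile(r"<(script|style)\b.*?</\1>", re.DOTALL | re.IGNORECASE)
HTML_TAG = re.compile(r"<[^>]+>")
BOILERPLATE_LINE = re.compile(
    r"^[ \t]*(cookie policy|privacy policy|terms of service|all rights reserved).*$",
    re.IGNORECASE | re.MULTILINE,
)


def strip_code_and_boilerplate(text: str) -> str:
    """Remove code-like spans and boilerplate lines, then collapse whitespace."""
    text = FENCED_CODE.sub(" ", text)
    text = SCRIPT_OR_STYLE.sub(" ", text)
    text = HTML_TAG.sub(" ", text)
    text = BOILERPLATE_LINE.sub("", text)
    return re.sub(r"\s+", " ", text).strip()
```

The order matters in this sketch: script and style blocks are removed before the generic tag pattern so that their contents are dropped along with the tags.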
The cleaned text was tokenized using the NLTK word_tokenize function. This step
split the text into individual words or tokens.
Stopwords, which are common words that do not carry significant meaning, were
removed using NLTK's stopwords corpus.
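I applied both stemming and lemmatization to further normalize the tokens. Stemming
strips suffixes to reduce a word to a stem that is not necessarily a real word (the
Porter stemmer, for example, maps "studies" to "studi"), while lemmatization maps a
word to its base or dictionary form ("studies" to "study").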
After preprocessing, I reconstructed the sentences from the normalized tokens for
further analysis.
I iteratively updated the code to better identify and remove code snippets,
boilerplate text, and URLs.
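
For the URL part of this cleanup, a simple sketch is shown below. The pattern is an assumption for illustration; it treats anything starting with http://, https://, or www. up to the next whitespace as a URL, so punctuation attached to the end of a URL is removed with it.

```python
import re

# Hypothetical pattern: a scheme or "www." prefix followed by non-space characters.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)


def remove_urls(text: str) -> str:
    """Replace URLs with a space and collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", text)
    return re.sub(r"\s+", " ", without_urls).strip()
```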
I tested various date and time formats and handled exceptions to ensure accurate
parsing and normalization.
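
One way to handle several formats with exception handling is sketched below. The list of candidate formats and the choice of ISO 8601 as the normalized output are assumptions for the example; the formats actually tested are not listed in this writeup.

```python
from datetime import datetime
from typing import Optional

# Hypothetical candidate formats, tried in order.
CANDIDATE_FORMATS = (
    "%Y-%m-%d %H:%M:%S",
    "%Y-%m-%d",
    "%d/%m/%Y",
    "%B %d, %Y",
)


def normalize_date(raw: str) -> Optional[str]:
    """Return an ISO 8601 string for the first matching format, else None."""
    cleaned = raw.strip()
    for fmt in CANDIDATE_FORMATS:
        try:
            return datetime.strptime(cleaned, fmt).isoformat()
        except ValueError:
            # This format did not match; try the next one.
            continue
    return None
```

Returning None instead of raising lets the caller decide whether to drop the value or leave the original string in place.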
I explored the option of customizing the list of stopwords based on the specific
domain or context of the text data.
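
A sketch of how such a customized list could be built is shown below. The function names and the example words are hypothetical; in practice the base set would presumably come from NLTK's stopwords corpus rather than a hand-written set.

```python
from typing import Iterable, List, Set


def build_stopwords(base: Iterable[str],
                    extra: Iterable[str] = (),
                    keep: Iterable[str] = ()) -> Set[str]:
    """Add domain-specific stopwords to a base list and protect words to keep."""
    combined = {w.lower() for w in base} | {w.lower() for w in extra}
    return combined - {w.lower() for w in keep}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Drop tokens whose lowercase form is in the stopword set."""
    return [t for t in tokens if t.lower() not in stopwords]


# Example: treat "click" as a stopword for web text, but keep "not"
# because negation can matter for later analysis.
custom = build_stopwords({"the", "is", "not"}, extra={"click"}, keep={"not"})
filtered = remove_stopwords(["The", "page", "is", "not", "Click", "ready"], custom)
```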