
WRITE-UP FOR ASSIGNMENT

The first step was to ensure that the text was in English. I used the langdetect
library to detect the language of each text instance; if the detected language was
not English, the instance was discarded.
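
A minimal sketch of how that filter might look with langdetect is shown below; the
keep_english helper name and the decision to silently drop undetectable strings are
assumptions rather than the exact code used.

    from langdetect import detect
    from langdetect.lang_detect_exception import LangDetectException

    def keep_english(texts):
        """Keep only the instances that langdetect identifies as English."""
        kept = []
        for text in texts:
            try:
                if detect(text) == "en":
                    kept.append(text)
            except LangDetectException:
                # Very short or empty strings cannot be detected reliably; discard them.
                pass
        return kept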

I implemented patterns to identify and remove code snippets and common boilerplate
text often found in web content.
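
The write-up does not list the exact patterns, so the sketch below is only
illustrative: the regular expressions, the boilerplate phrases, and the
strip_code_and_boilerplate name are all assumptions.

    import re

    # Illustrative patterns only; the patterns used in the assignment may differ.
    CODE_BLOCK_RE = re.compile(r"```.*?```", re.DOTALL)   # fenced code blocks
    INLINE_CODE_RE = re.compile(r"`[^`]+`")               # inline code
    URL_RE = re.compile(r"https?://\S+")                  # bare URLs
    BOILERPLATE_RE = re.compile(
        r"(all rights reserved|cookie policy|subscribe to our newsletter)",
        re.IGNORECASE,
    )

    def strip_code_and_boilerplate(text):
        """Remove code snippets, URLs, and common boilerplate phrases from web text."""
        for pattern in (CODE_BLOCK_RE, INLINE_CODE_RE, URL_RE, BOILERPLATE_RE):
            text = pattern.sub(" ", text)
        return text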

I performed various cleaning operations such as removing extra whitespace,
normalizing punctuation marks, handling contractions, and standardizing date and
time formats.
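
A rough sketch of these cleaning operations is given below; the contraction map and
the specific punctuation rules are illustrative placeholders rather than the exact
ones used, and date and time standardization is sketched separately after the
experiment list.

    import re

    # Illustrative contraction map; the full mapping used in the assignment is not shown.
    CONTRACTIONS = {"don't": "do not", "can't": "cannot", "it's": "it is", "I'm": "I am"}

    def clean_text(text):
        """Collapse extra whitespace, normalize punctuation, and expand contractions."""
        for contraction, expansion in CONTRACTIONS.items():
            text = text.replace(contraction, expansion)
        text = re.sub(r"[“”]", '"', text)         # normalize curly double quotes
        text = re.sub(r"[‘’]", "'", text)         # normalize curly single quotes
        text = re.sub(r"\.{3,}", "...", text)     # collapse long runs of dots
        text = re.sub(r"\s+", " ", text).strip()  # remove extra whitespace
        return text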

The cleaned text was tokenized using the NLTK word_tokenize function. This step
split the text into individual words or tokens.
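
Tokenization itself only needs the punkt models that word_tokenize relies on, for
example:

    import nltk
    from nltk.tokenize import word_tokenize

    nltk.download("punkt", quiet=True)  # models required by word_tokenize

    tokens = word_tokenize("The cleaned text is split into individual tokens.")
    print(tokens)
    # ['The', 'cleaned', 'text', 'is', 'split', 'into', 'individual', 'tokens', '.']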

Stopwords, which are common words that do not carry significant meaning, were
removed using NLTK's stopwords corpus.
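
For example, filtering against NLTK's English stopword list (lower-casing before the
lookup is an assumption):

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    stop_words = set(stopwords.words("english"))
    tokens = ["The", "cleaned", "text", "is", "split", "into", "individual", "tokens"]
    filtered = [t for t in tokens if t.lower() not in stop_words]
    print(filtered)  # ['cleaned', 'text', 'split', 'individual', 'tokens']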

I applied both stemming and lemmatization to further normalize the tokens. Stemming
reduces words to their root form, while lemmatization converts words to their base
or dictionary form.
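
The two behave differently, which the small example below illustrates; it uses
NLTK's PorterStemmer and WordNetLemmatizer, which is an assumption since the
write-up does not name the exact classes.

    import nltk
    from nltk.stem import PorterStemmer, WordNetLemmatizer

    nltk.download("wordnet", quiet=True)  # lexical database used by the lemmatizer

    stemmer = PorterStemmer()
    lemmatizer = WordNetLemmatizer()

    tokens = ["studies", "running", "better"]
    print([stemmer.stem(t) for t in tokens])          # ['studi', 'run', 'better']
    print([lemmatizer.lemmatize(t) for t in tokens])  # ['study', 'running', 'better']
    # The lemmatizer defaults to treating words as nouns; passing a POS tag changes the result.
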
After preprocessing, I reconstructed the sentences from the normalized tokens for
further analysis.
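
Reconstruction here is assumed to be a simple whitespace join of the remaining
tokens:

    normalized_tokens = ["clean", "text", "split", "individual", "token"]
    reconstructed = " ".join(normalized_tokens)
    print(reconstructed)  # 'clean text split individual token'

NLTK's TreebankWordDetokenizer is an alternative that restores punctuation spacing,
but a plain join is enough once punctuation tokens have been dropped.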

During the incremental development of the code, I conducted several experiments to
refine the preprocessing steps and improve the quality of the tokenized data. Some
of the experiments included:

I experimented with different parameters for language detection, cleaning
operations, and tokenization to achieve better results.

I iteratively updated the code to better identify and remove code snippets,
boilerplate text, and URLs.

I tested various date and time formats and handled exceptions to ensure accurate
parsing and normalization; a brief sketch of this step follows the experiment list.

I explored the option of customizing the list of stopwords based on the specific
domain or context of the text data; a second sketch after the list illustrates this.

I evaluated the performance of the tokenizer and normalizer by analyzing the
quality of the output tokens and their impact on downstream tasks such as text
classification or clustering.
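
The date and time experiment mentioned above might look roughly like the following;
the candidate formats and the decision to fall back to the original string are
assumptions.

    from datetime import datetime

    # Candidate formats are illustrative; the assignment tested its own set.
    DATE_FORMATS = ("%d %B %Y", "%B %d, %Y", "%d/%m/%Y", "%Y-%m-%d")

    def normalize_date_string(value):
        """Try each known format and return an ISO date, or the original string if none match."""
        for fmt in DATE_FORMATS:
            try:
                return datetime.strptime(value, fmt).strftime("%Y-%m-%d")
            except ValueError:
                continue
        return value  # leave unparseable values unchanged rather than raising

    print(normalize_date_string("March 5, 2021"))  # '2021-03-05'
    print(normalize_date_string("not a date"))     # 'not a date'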
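
And the custom stopword experiment could be as simple as extending NLTK's list with
domain terms; the added words below are placeholders.

    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)

    # Domain-specific additions are purely illustrative; the real list depends on the corpus.
    DOMAIN_STOPWORDS = {"click", "share", "comment", "page"}
    custom_stop_words = set(stopwords.words("english")) | DOMAIN_STOPWORDS

    tokens = ["click", "here", "to", "read", "the", "full", "article"]
    print([t for t in tokens if t not in custom_stop_words])  # ['read', 'full', 'article']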
