This project addresses hate speech detection: automatically determining whether a
piece of text contains hateful content. The classifier for this task was built using
PyTorch and a pre-trained BERT model.

The project starts by setting up the GPU environment, if one is available. Then the
necessary libraries, such as `transformers`, are installed. The data is loaded from the
"Hate Towards the Political Opponent: A Twitter Corpus Study of the 2020 US Elections"
dataset. Only the 'text' and 'HOF' (label) columns are used for the task.
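
As a rough illustration of the column selection described above, the sketch below reads a delimited file and keeps only 'text' and 'HOF'. The tab delimiter, the extra column names, and the sample rows are assumptions made for the example; the write-up does not specify the file layout.

```python
import csv
import io

# Hypothetical stand-in for the corpus file. The real file name, delimiter,
# and extra columns are not given in the write-up; these rows are made up.
SAMPLE_TSV = (
    "text\tTrump\tBiden\tHOF\n"
    "Great rally tonight\tFavor\tNeither\tNon-Hateful\n"
    "Placeholder for an abusive tweet\tAgainst\tNeither\tHateful\n"
)


def load_text_and_label(handle, delimiter="\t"):
    """Read a delimited file and keep only the 'text' and 'HOF' columns."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    return [{"text": row["text"], "HOF": row["HOF"]} for row in reader]


rows = load_text_and_label(io.StringIO(SAMPLE_TSV))
```

With a real file, the same function could be given an open file handle in place of the in-memory sample.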

Next, the project proceeds with data exploration and visualization. Basic statistics
about the dataset, including the number of tweets in each class and a sample of the
dataset, are displayed. The distribution of tweets by class and the histogram of tweet
lengths are also visualized using matplotlib.
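
The numbers behind those two plots can be computed along the following lines. This is only a sketch: the sample tweets are made up, and measuring length in characters with a bin width of 10 is an assumption, since the write-up does not say how tweet length was measured.

```python
from collections import Counter

# Made-up (text, label) pairs standing in for the loaded dataset.
tweets = [
    ("vote early", "Non-Hateful"),
    ("get out and vote today", "Non-Hateful"),
    ("what a great debate that was last night", "Non-Hateful"),
    ("placeholder for a hateful tweet", "Hateful"),
]

# Number of tweets in each class (the data for a class-distribution bar chart).
class_counts = Counter(label for _, label in tweets)


def length_histogram(texts, bin_width=10):
    """Count texts per character-length bin; each key is the lower bin edge."""
    bins = Counter((len(text) // bin_width) * bin_width for text in texts)
    return dict(sorted(bins.items()))


length_bins = length_histogram([text for text, _ in tweets])
```

The resulting dictionaries can be passed to a plotting library such as matplotlib to draw the bar chart and the histogram.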

The data is preprocessed by mapping the labels to binary integers and handling
class imbalance by downsampling the majority class ('Non-Hateful'). The training
data is then split into training and development sets for model evaluation.
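
A minimal sketch of these preprocessing steps is shown below. The mapping direction (Hateful as 1), the random seed, the 80/20 split ratio, and the synthetic rows are all assumptions made for illustration; the write-up does not state them.

```python
import random

# Assumed mapping; the write-up only says labels become binary integers.
LABEL_TO_ID = {"Non-Hateful": 0, "Hateful": 1}


def downsample_majority(rows, seed=0):
    """Randomly drop rows from the larger class until both classes are equal."""
    rng = random.Random(seed)
    by_class = {0: [], 1: []}
    for text, label in rows:
        label_id = LABEL_TO_ID[label]
        by_class[label_id].append((text, label_id))
    n = min(len(by_class[0]), len(by_class[1]))
    balanced = rng.sample(by_class[0], n) + rng.sample(by_class[1], n)
    rng.shuffle(balanced)
    return balanced


def train_dev_split(rows, dev_fraction=0.2, seed=0):
    """Shuffle a copy of the rows and hold out a fraction as the dev set."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_fraction)
    return shuffled[n_dev:], shuffled[:n_dev]


# Synthetic, imbalanced data: 40 majority rows and 10 minority rows.
raw = [(f"tweet {i}", "Non-Hateful") for i in range(40)]
raw += [(f"tweet {i}", "Hateful") for i in range(40, 50)]

balanced = downsample_majority(raw)
train_rows, dev_rows = train_dev_split(balanced)
```

Downsampling discards majority-class examples, so it trades some training data for a balanced label distribution.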

The data is prepared by creating a `BERTDataset` class, which performs cleaning,
tokenization, and encoding of the tweets using the BERT tokenizer. The class also
stores the labels. Additionally, a collate function is defined to handle batch creation
and padding.
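
The cleaning and padding steps might look roughly like the sketch below. The cleaning rules, the function names `clean_tweet` and `collate_batch`, the padding id of 0, and the hard-coded token ids are assumptions for illustration; the actual `BERTDataset` class and the BERT tokenizer call are not reproduced here.

```python
import re

import torch
from torch.nn.utils.rnn import pad_sequence


def clean_tweet(text):
    """Illustrative cleaning: drop URLs, mask mentions, collapse whitespace."""
    text = re.sub(r"https?://\S+", "", text)
    text = re.sub(r"@\w+", "@user", text)
    return re.sub(r"\s+", " ", text).strip()


def collate_batch(batch, pad_id=0):
    """Pad variable-length token-id lists and build the matching attention mask.

    `batch` is a list of (token_ids, label) pairs, as a dataset might return.
    """
    ids = [torch.tensor(token_ids, dtype=torch.long) for token_ids, _ in batch]
    labels = torch.tensor([label for _, label in batch], dtype=torch.long)
    input_ids = pad_sequence(ids, batch_first=True, padding_value=pad_id)
    lengths = torch.tensor([len(x) for x in ids])
    # 1 for real tokens, 0 for padding positions.
    positions = torch.arange(input_ids.size(1))[None, :]
    attention_mask = (positions < lengths[:, None]).long()
    return input_ids, attention_mask, labels


# Token ids here are placeholders for what a tokenizer would produce.
batch = [([101, 2023, 102], 0), ([101, 2023, 2003, 2919, 102], 1)]
input_ids, attention_mask, labels = collate_batch(batch)
```

A function like `collate_batch` would typically be passed to `torch.utils.data.DataLoader` through its `collate_fn` argument, so that each batch is padded only to the length of its longest tweet.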

The project defines a BERT classifier model using the `BERTClassifier` class. It
consists of a BERT model, a linear layer for classification, and a dropout layer. The
BERT layers are frozen, and the model is moved to the available device (GPU or CPU).
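
The structure described above (frozen encoder, dropout, linear classification layer) can be sketched as follows. To keep the example self-contained, a small hypothetical `TinyEncoder` stands in for the pre-trained BERT model, and the class name `FrozenEncoderClassifier`, the hidden size, and the dropout rate are likewise assumptions rather than the project's actual values.

```python
import torch
from torch import nn


class TinyEncoder(nn.Module):
    """Hypothetical stand-in for the pre-trained BERT encoder."""

    def __init__(self, vocab_size=1000, hidden_size=32):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)

    def forward(self, input_ids, attention_mask):
        hidden = self.embed(input_ids)
        mask = attention_mask.unsqueeze(-1).float()
        # Mean over non-padding positions gives one vector per sequence.
        return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


class FrozenEncoderClassifier(nn.Module):
    """Encoder with frozen weights, followed by dropout and a linear layer."""

    def __init__(self, encoder, hidden_size=32, num_classes=2, dropout=0.1):
        super().__init__()
        self.encoder = encoder
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(hidden_size, num_classes)
        # Freeze the encoder so only the classification head is trained.
        for param in self.encoder.parameters():
            param.requires_grad = False

    def forward(self, input_ids, attention_mask):
        pooled = self.encoder(input_ids, attention_mask)
        return self.classifier(self.dropout(pooled))


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = FrozenEncoderClassifier(TinyEncoder()).to(device)
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Freezing the encoder means only the linear layer's weights and bias receive gradient updates, which keeps training cheap at the cost of not adapting the encoder to the task.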

The model is trained with the Adam optimizer and cross-entropy loss for a specified
number of epochs. Training is performed in a loop that iterates over batches from the
training set. The model is put into training mode and, for each batch, the gradients
are cleared, the data is moved to the device, a forward pass is performed, the loss is
computed, the gradients are calculated by backpropagation and clipped to prevent
them from exploding, and the parameters are updated.
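
The per-batch steps listed above map onto a PyTorch loop like the one sketched here. A small feed-forward network and random tensors stand in for the BERT classifier and the tokenized tweets, and the learning rate, clipping threshold, and epoch count are assumed values, not the project's settings.

```python
import torch
from torch import nn

torch.manual_seed(0)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in model and random data; the real project uses a BERT classifier.
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(), nn.Dropout(0.1), nn.Linear(16, 2)
).to(device)
batches = [(torch.randn(4, 8), torch.randint(0, 2, (4,))) for _ in range(5)]

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()


def train_one_epoch(model, batches, optimizer, criterion, device, max_grad_norm=1.0):
    """Run one pass over the batches and return the mean training loss."""
    model.train()                                   # enable dropout
    total_loss = 0.0
    for features, labels in batches:
        optimizer.zero_grad()                       # clear old gradients
        features, labels = features.to(device), labels.to(device)
        logits = model(features)                    # forward pass
        loss = criterion(logits, labels)            # cross-entropy loss
        loss.backward()                             # compute gradients
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        optimizer.step()                            # update parameters
        total_loss += loss.item()
    return total_loss / len(batches)


epoch_losses = [
    train_one_epoch(model, batches, optimizer, criterion, device)
    for _ in range(3)
]
```

Clipping is applied after `backward()` and before `step()`, so the update uses the rescaled gradients.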

After each training epoch, the model is evaluated on the development (validation)
set. The model is put into evaluation mode, and the predictions and true labels are
collected for further evaluation.
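
Collecting predictions in evaluation mode could be done roughly as below. The write-up does not say which metrics were computed afterwards, so the accuracy and F1 calculations are assumptions, as are the helper names and the toy model with hand-set weights used to make the example deterministic.

```python
import torch
from torch import nn


@torch.no_grad()
def collect_predictions(model, batches, device):
    """Return predicted and true labels for every example in the batches."""
    model.eval()                                    # disable dropout
    preds, golds = [], []
    for features, labels in batches:
        logits = model(features.to(device))
        preds.extend(logits.argmax(dim=1).cpu().tolist())
        golds.extend(labels.tolist())
    return preds, golds


def accuracy(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)


def f1_for_class(preds, golds, positive=1):
    """F1 score for one class, computed from true/false positives and negatives."""
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Toy model: identity weights, so the larger input feature decides the class.
model = nn.Linear(2, 2, bias=False)
with torch.no_grad():
    model.weight.copy_(torch.eye(2))
model.to(device)

dev_batches = [
    (torch.tensor([[2.0, 0.0], [0.0, 3.0]]), torch.tensor([0, 1])),
    (torch.tensor([[1.0, 0.0], [0.0, 1.0]]), torch.tensor([1, 1])),
]
preds, golds = collect_predictions(model, dev_batches, device)
dev_accuracy = accuracy(preds, golds)
dev_f1 = f1_for_class(preds, golds)
```

Running evaluation under `torch.no_grad()` avoids building the autograd graph, which saves memory when no parameter update is needed.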

Overall, this project demonstrates a hate speech detector built with PyTorch and a
pre-trained BERT model, covering data loading, preprocessing, model building,
training, and evaluation.
