
Technical Answers To Real World Problems - CBS1901

Vellore Institute of Technology, Vellore


Summer Special Semester 2021-22

TEAM MEMBERS REGISTRATION NUMBER

SAIKRISHNA S 19BBS0086

CHARAN SRIDHAR BHOGARAJU 19BBS0094

DIVYABRATA DASGUPTA 19BBS0099

ISH JAIN 19BBS0113

NIKHIL AGARWAL 19BBS0120

SWARNIM TIWARI 19BBS0130

PROJECT TITLE: Review Classification using Active Learning


Professor: Dr. Lavanya K
FRAMEWORK DIAGRAM

1. Initial Data Set

We use the IMDB review dataset for our project. The dataset contains 50,000 reviews in total across its official training and test splits. We merge these splits and sample our own balanced training, validation, and test sets.
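As a minimal sketch, assuming tensorflow_datasets is used to fetch IMDB, the merge-and-resample step could look like this (the subset sizes and seed are illustrative, not the project's exact values):

    # Sketch: load IMDB, merge the official splits, and carve out our own
    # train / validation / test subsets. Sizes below are illustrative.
    import tensorflow_datasets as tfds

    # "train+test" merges the two official 25,000-review splits into one.
    dataset = tfds.load("imdb_reviews", split="train+test", as_supervised=True)

    # IMDB is balanced 50/50, so a fixed shuffle keeps subsets roughly balanced.
    dataset = dataset.shuffle(50_000, seed=42, reshuffle_each_iteration=False)
    val_ds = dataset.take(2_500)
    test_ds = dataset.skip(2_500).take(2_500)
    train_ds = dataset.skip(5_000)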

2. Preprocessing

In the preprocessing stage, the initial dataset is divided into training, testing, and validation sets.
a. Data Transformation - we convert the TensorFlow dataset into a 2-D NumPy array.
b. We then convert the text to lowercase and turn the lowercased text into vectors. For this purpose, we use tokenization and lemmatization (a sketch of this step follows the list).
c. After these preprocessing steps, stream-based sampling is used to pick data points for the training set. These points are fed to the model, and training is performed on them.
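The report does not name a tokenization library; one possible sketch of step (b), assuming NLTK:

    # Preprocessing sketch: lowercase, tokenize, lemmatize. NLTK is an
    # assumed choice; the project may use a different library.
    import nltk
    from nltk.tokenize import word_tokenize
    from nltk.stem import WordNetLemmatizer

    nltk.download("punkt", quiet=True)
    nltk.download("wordnet", quiet=True)

    lemmatizer = WordNetLemmatizer()

    def preprocess(text):
        # Lowercase first, then split into tokens and reduce each to its lemma.
        tokens = word_tokenize(text.lower())
        return [lemmatizer.lemmatize(tok) for tok in tokens]

    print(preprocess("The movies were surprisingly GOOD!"))
    # ['the', 'movie', 'were', 'surprisingly', 'good', '!']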
3. Vectorization

There are many algorithms for figuring out how a particular term in a review document affects the polarity of reviews. In our project we use vectorization. After computing term frequencies with a bag-of-words model built over the created vocabulary, we count how many times each term appears in positive reviews and in negative reviews. From these counts we compute a vector value for each term, and we apply the vector values to the test data to decide whether a particular review is positive or negative.
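A rough, self-contained sketch of this counting idea; the scoring rule at the end is an assumption for illustration, and the project's exact vector computation may differ:

    # Count per-class term occurrences over a tokenized corpus, then score
    # unseen reviews by comparing positive and negative counts.
    from collections import Counter

    def build_polarity_counts(reviews, labels):
        pos_counts, neg_counts = Counter(), Counter()
        for tokens, label in zip(reviews, labels):
            target = pos_counts if label == 1 else neg_counts
            target.update(tokens)
        return pos_counts, neg_counts

    def score(tokens, pos_counts, neg_counts):
        # Positive total minus negative total; > 0 suggests a positive review.
        pos = sum(pos_counts[t] for t in tokens)
        neg = sum(neg_counts[t] for t in tokens)
        return pos - neg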

4. Training Data set

Based on the vocabulary created, the training set starts with an initial pool of data points used to train the model. After each round of iteration, it is extended with further data points chosen with the help of the validation set, which provides the accuracy score and indicates whether the model is overfitting. This serves as the pass criterion for our model, which actively learns from the pool of data points and improves its accuracy on the test set. The loop is sketched below.
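A runnable toy sketch of that loop; scikit-learn's LogisticRegression stands in for the project's BiLSTM so the example stays short, and all sizes are illustrative:

    # Toy active-learning loop: train, rank the pool by uncertainty, move the
    # most uncertain points into the training set, repeat.
    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, 0] > 0).astype(int)          # synthetic stand-in labels

    labelled = list(range(20))             # small seed training set
    pool = list(range(20, 1000))           # the unlabelled candidate pool

    for _ in range(5):
        clf = LogisticRegression().fit(X[labelled], y[labelled])
        probs = clf.predict_proba(X[pool])[:, 1]
        # Most uncertain points sit closest to p = 0.5.
        order = np.argsort(np.abs(probs - 0.5))[:50]
        picked = [pool[i] for i in order]
        labelled += picked
        pool = [i for i in pool if i not in picked]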

5. Bidirectional LSTM

A bidirectional LSTM, often known as a BiLSTM, is a sequence-processing model that consists of two LSTMs: one receives the input forward and the other receives it backward. BiLSTMs give the network access to more information, which helps the algorithm capture context.
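A minimal Keras sketch of such a model; the layer sizes are illustrative assumptions:

    # BiLSTM binary classifier: embedding, bidirectional LSTM, sigmoid output.
    from tensorflow import keras
    from tensorflow.keras import layers

    model = keras.Sequential([
        layers.Embedding(input_dim=20_000, output_dim=128),
        # The Bidirectional wrapper runs one LSTM forward and one backward
        # over the sequence and concatenates their outputs.
        layers.Bidirectional(layers.LSTM(64)),
        layers.Dense(64, activation="relu"),
        layers.Dense(1, activation="sigmoid"),  # positive/negative probability
    ])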

6. Training

The model is initially trained for 20 epochs. The best model is saved at every epoch, and the best one is chosen for the test dataset. This scheme is used throughout model training, and the model actively learns which data points to select at every iteration.
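Assuming the Keras model above and batched, vectorized train/validation/test datasets, saving the best model at each epoch can be done with a ModelCheckpoint callback:

    # Training sketch: checkpoint the best epoch, then evaluate it on test.
    from tensorflow import keras

    checkpoint = keras.callbacks.ModelCheckpoint(
        "best_model.h5", monitor="val_loss", save_best_only=True
    )
    model.fit(train_ds, validation_data=val_ds, epochs=20,
              callbacks=[checkpoint])

    # Reload the best epoch's weights before evaluating on the test set.
    best_model = keras.models.load_model("best_model.h5")
    best_model.evaluate(test_ds)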

7. Pass Criteria

a. Check for Overfitting: done with the help of the validation dataset.
b. RMSProp: the optimizer; a variant of gradient descent with an adaptive learning rate, it updates the weights to reduce the loss in every iteration.
c. Binary Cross-Entropy: the loss function; it compares each predicted probability to the actual class output, which can be 0 or 1 as in a binary classifier (see the compile sketch after this list).
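These criteria could be wired into Keras roughly as follows; the learning rate is an assumption, and model refers to the BiLSTM sketch above:

    # Compile sketch: RMSprop optimizer with binary cross-entropy loss.
    from tensorflow import keras

    model.compile(
        optimizer=keras.optimizers.RMSprop(learning_rate=1e-3),
        loss=keras.losses.BinaryCrossentropy(),
        metrics=["accuracy"],
    )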

8. Selection of most uncertain data points

After the model assigns confidence levels to the unused data points, it selects the points with the highest confidence levels and removes them from the candidate pool. This removes the points the model can already predict with high accuracy, since keeping them would not improve the model much. The remaining data points are then shuffled and selected as per the specified range.
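As a standalone helper, this filtering step might look like the following; the function name and num_to_drop parameter are illustrative, not the project's actual code:

    # Rank pool points by confidence and keep only the least confident ones.
    import numpy as np

    def keep_uncertain_points(probs, num_to_drop):
        """Return shuffled indices of pool points to keep.

        probs holds the model's predicted probabilities for the unused pool;
        the num_to_drop highest-confidence points are discarded.
        """
        confidence = np.abs(probs - 0.5)      # distance from the 0.5 boundary
        keep = np.argsort(confidence)[: len(probs) - num_to_drop]
        np.random.shuffle(keep)               # shuffle before range selection
        return keep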

9. Sampling and Labeling New Data

We will perform sampling using the following formulas:

Negative Samples = False Negatives / All False Predictions

Positive Samples = False Positives / All False Predictions
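With illustrative numbers, the ratios work out as follows; the labeling-budget split at the end is an assumed use of these ratios:

    # Worked example: suppose a round produced 30 false negatives and
    # 10 false positives (illustrative numbers).
    false_negatives, false_positives = 30, 10
    all_false_predictions = false_negatives + false_positives

    negative_samples = false_negatives / all_false_predictions   # 0.75
    positive_samples = false_positives / all_false_predictions   # 0.25

    # With a labeling budget of 200 new reviews, split it by those ratios.
    budget = 200
    num_negative = round(budget * negative_samples)   # 150
    num_positive = round(budget * positive_samples)   # 50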

Active learning techniques use callbacks mainly to track progress. We are using model checkpointing and early stopping. The patience parameter of early stopping helps minimize overfitting and the time required for training. We have set patience=4 for this project, but since the model is robust, the patience level can be increased if desired.
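The callback setup can be sketched as follows; patience=4 comes from the project, while the monitored metric is an assumption:

    # Callback sketch: stop after 4 stagnant epochs and keep the best model.
    from tensorflow import keras

    callbacks = [
        keras.callbacks.EarlyStopping(monitor="val_loss", patience=4),
        keras.callbacks.ModelCheckpoint("best_model.h5", save_best_only=True),
    ]
    # Passed to model.fit(..., callbacks=callbacks) as in the training sketch.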
