SAIKRISHNA S 19BBS0086
We use the IMDB review dataset for our project. The dataset contains 50,000
reviews in total across its training and test splits. We merge these splits and sample our
own balanced training, validation, and test sets.
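As a rough sketch of this sampling step (the split sizes and random seed below are illustrative, not the ones used in the project), merging labelled examples and drawing balanced splits could look like:

```python
import random

def balanced_splits(examples, train_n, val_n, test_n, seed=0):
    """Split (text, label) pairs into balanced train/val/test sets.

    Each split receives an equal number of positive (label 1) and
    negative (label 0) examples, drawn without overlap.
    """
    rng = random.Random(seed)
    pos = [e for e in examples if e[1] == 1]
    neg = [e for e in examples if e[1] == 0]
    rng.shuffle(pos)
    rng.shuffle(neg)

    splits, start = [], 0
    for n in (train_n, val_n, test_n):
        half = n // 2
        split = pos[start:start + half] + neg[start:start + half]
        rng.shuffle(split)
        splits.append(split)
        start += half
    return splits  # [train, val, test]
```

Because the positive and negative pools are consumed in order, no review appears in more than one split.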
2. Preprocessing
In the preprocessing stage, the initial dataset is divided into training, validation,
and testing datasets.
a. Data Transformation - the TensorFlow dataset is converted to a 2-D NumPy
array.
b. Text normalization - each review is first converted to lowercase, and the
lowercased text is then converted into vectors. For this purpose, we use
tokenization and lemmatization.
c. Sampling - after preprocessing, stream-based sampling selects the data points
that form the training dataset. These data points are fed to the model, and
training is done on them.
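A minimal sketch of step (b), assuming a simple regex tokenizer and a toy lemma table in place of the full lemmatizer (a real pipeline would use something like NLTK's WordNetLemmatizer):

```python
import re

# Toy lemma table for illustration only; the project's lemmatizer
# would cover the full vocabulary.
LEMMAS = {"movies": "movie", "loved": "love", "was": "be", "films": "film"}

def preprocess(text):
    """Lowercase the text, tokenize it, and lemmatize each token."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [LEMMAS.get(tok, tok) for tok in tokens]
```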
3. Vectorization
There are many algorithms for determining how a particular term in a review document
affects the polarity of the review. In our project we use vectorization: after computing
term frequencies with a bag-of-words model built over the created vocabulary, we count
how many times each term appears in positive and in negative reviews. From these
counts we compute a vector value for each term, and we apply these vector values to the
test data to decide whether a particular review is positive or negative.
Based on the vocabulary created, the training dataset consists of data points used to train
the model. After each round of iteration, the model incorporates additional data points
selected using the validation dataset, which provides the accuracy score and indicates
whether the model is overfitting. This serves as the test criterion for our model, which
actively learns from the pool of data points and improves its accuracy score on the
testing dataset.
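The counting scheme described above can be sketched as follows. The log-ratio scoring and add-one smoothing here are illustrative choices, not necessarily the exact vector-value formula used in the project:

```python
import math
from collections import Counter

def term_scores(pos_reviews, neg_reviews):
    """Score each vocabulary term by how often it appears in positive
    versus negative reviews (add-one smoothing avoids log(0))."""
    pos_counts = Counter(tok for r in pos_reviews for tok in r.split())
    neg_counts = Counter(tok for r in neg_reviews for tok in r.split())
    vocab = set(pos_counts) | set(neg_counts)
    return {t: math.log((pos_counts[t] + 1) / (neg_counts[t] + 1))
            for t in vocab}

def predict(review, scores):
    """Classify a review as positive (1) or negative (0) by the sign
    of its summed term scores; unseen terms contribute nothing."""
    total = sum(scores.get(tok, 0.0) for tok in review.split())
    return 1 if total > 0 else 0
```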
5. Bidirectional LSTM
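As a hedged sketch of this component, a bidirectional LSTM sentiment classifier could be built in Keras along these lines; the layer sizes and the two-layer dense head are illustrative, not the project's exact architecture:

```python
import tensorflow as tf

def build_bilstm(vocab_size, embed_dim=64, lstm_units=32):
    """Binary sentiment classifier: embedding -> Bi-LSTM -> dense head."""
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, embed_dim),
        tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(lstm_units)),
        tf.keras.layers.Dense(16, activation="relu"),
        tf.keras.layers.Dense(1, activation="sigmoid"),  # P(positive)
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy",
                  metrics=["accuracy"])
    return model
```

The `Bidirectional` wrapper runs the LSTM over the token sequence in both directions and concatenates the two final states, so context after a sentiment-bearing word is captured as well as context before it.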
6. Training
The model is initially trained for 20 epochs. The best model is saved at every epoch,
and the best one is used on the test dataset. At every iteration of this training loop, the
model actively learns which data points to select next.
7. Pass Criteria
After the model assigns a confidence level to the unused data points, it selects the
outliers with the highest confidence levels and removes them from the training pool.
This discards the points the model can already predict with high accuracy, since they
would not improve the model much. The remaining data points are then shuffled and
selected according to the specified range.
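A sketch of this pass criterion, assuming confidences in [0, 1] and an illustrative threshold: points the model already predicts with very high confidence are dropped, and the remainder is shuffled before selection.

```python
import random

def filter_pool(points, confidences, threshold=0.95, seed=0):
    """Drop pool points whose prediction confidence exceeds `threshold`
    (the model gains little from them), then shuffle the rest."""
    kept = [p for p, c in zip(points, confidences) if c <= threshold]
    random.Random(seed).shuffle(kept)
    return kept
```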
Negative Samples = False Negatives / All False Predictions
Positive Samples = False Positives / All False Predictions
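The two ratios can be computed directly from the counts of false predictions; a minimal sketch:

```python
def sample_ratios(false_negatives, false_positives):
    """Share of each error type among all false predictions."""
    all_false = false_negatives + false_positives
    negative_samples = false_negatives / all_false
    positive_samples = false_positives / all_false
    return negative_samples, positive_samples
```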
Active learning techniques use callbacks mainly to track progress. We are using model
checkpointing and early stopping. The patience parameter of early stopping helps
minimize both overfitting and the time required for training. We have set patience=4 for
this project, but since the model is robust, the patience level can be increased if desired.
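With Keras, the two callbacks mentioned above could be wired up as follows; the checkpoint filename and the monitored metric are illustrative assumptions:

```python
import tensorflow as tf

# Stop training when validation loss has not improved for 4 epochs,
# restoring the best weights seen so far.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=4, restore_best_weights=True)

# Save the best model (by validation loss) as training proceeds.
checkpoint = tf.keras.callbacks.ModelCheckpoint(
    "best_model.keras", monitor="val_loss", save_best_only=True)

# Both would then be passed to model.fit(..., callbacks=[early_stop, checkpoint]).
```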