Harsh Vishwakarma
MTech CSA
21532
vharsh@iisc.ac.in
Abstract
This research report delves into the critical domain of Generalization in Neural Networks for Natural
Language Processing (NLP). An exploration of related works is undertaken to understand the motivation
behind various techniques aimed at achieving enhanced generalization. The report assesses the advantages
and disadvantages of these techniques, shedding light on how these aspects evolve over time. Additionally, the
paper explores innovative approaches for quantifying the generalization capabilities of neural networks.
A focal point of this research is the implementation of ProtoBERT, a language model designed to maximize
generalization in resource-constrained environments, i.e., in few-shot scenarios. We also present a technique to
improve ProtoBERT's results on few-shot examples.
Notably, the report addresses a prevailing concern in the literature: the impact of batch sizes on model
generalization. Several empirical studies provide numerical evidence supporting the argument that larger
batch sizes often lead to diminished generalization quality. The investigation highlights that large-batch
methods tend to converge towards sharp minimizers of training and testing functions. This is a well-established
phenomenon, as sharp minima are associated with poorer generalization. The report substantiates these
findings through a series of experiments and compelling arguments, offering valuable insights into the nuanced
relationship between batch sizes and model generalization.
4 BERT on Few-Shot
[Table 5: Comparative results of different models on covariate shift]

The above figures show how covariate shift affects the model's generalization. But we can also see from Table 5 that, as model complexity increases, the gap in accuracy under this shift shrinks.

One reason for this behavior is that models like T5 are pre-trained on a variety of tasks, which gives them a better understanding of language and makes them comparatively more effective at zero-shot generalization.

A common way to handle this is to pre-train the model on a combination of datasets and then fine-tune it on the objective-specific task. In the table above, I trained the model on combined SNLI+MNLI and then tested it on SNLI, which resulted in improved accuracy; a minimal data-mixing sketch follows.
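The sketch below shows one way to build such a combined SNLI+MNLI training set. It assumes the Hugging Face `datasets` library and its public `snli` and `multi_nli` configurations; the report does not name its tooling, so the loading code is an assumption.

```python
# A minimal data-mixing sketch, assuming the Hugging Face `datasets`
# library (an assumption; the report does not name its tooling).
from datasets import load_dataset, concatenate_datasets

snli = load_dataset("snli", split="train")
mnli = load_dataset("multi_nli", split="train")

# Drop SNLI pairs with no gold label (encoded as -1).
snli = snli.filter(lambda ex: ex["label"] != -1)

# Keep only the columns shared by both datasets so their schemas match.
mnli = mnli.remove_columns(
    [c for c in mnli.column_names if c not in ("premise", "hypothesis", "label")]
)

# Combined SNLI+MNLI pool, then used to fine-tune the model on NLI.
combined = concatenate_datasets([snli, mnli]).shuffle(seed=42)
print(combined)
```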
6.2.1 Dataset

• Source data - SNLI
• Domain Classifier:
  – Integrate a domain classifier into the BERT model.
  – Extract BERT embeddings for the input sentence pairs.
  – Add a domain classification layer to predict domain labels.
• Training Procedure:
  – Train the entire model with the combined NLI and domain classification losses (see the sketch after this list).
  – Pass sentence pairs from both domains through the model.
  – Compute the domain classification loss and update the model parameters.
• Evaluation:
  – Evaluate the trained BERT model on the target-domain dataset.
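A minimal sketch of the model this list describes, assuming PyTorch and Hugging Face `transformers`; the class name `DomainAdaptedBert`, the loss weight `lambda_domain`, and the use of the pooled [CLS] embedding are illustrative choices, not details taken from the report.

```python
import torch.nn as nn
from transformers import BertModel

class DomainAdaptedBert(nn.Module):
    """BERT with an NLI head plus an auxiliary domain-classification head."""

    def __init__(self, num_nli_labels=3, num_domains=2, lambda_domain=0.1):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.bert.config.hidden_size
        self.nli_head = nn.Linear(hidden, num_nli_labels)   # entailment / neutral / contradiction
        self.domain_head = nn.Linear(hidden, num_domains)   # e.g. source (SNLI) vs. target domain
        self.lambda_domain = lambda_domain                  # weight of the domain loss (assumed value)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, input_ids, attention_mask, nli_labels=None, domain_labels=None):
        # Pooled [CLS] embedding of the packed (premise, hypothesis) pair.
        cls = self.bert(input_ids=input_ids, attention_mask=attention_mask).pooler_output
        nli_logits = self.nli_head(cls)
        domain_logits = self.domain_head(cls)
        loss = None
        if nli_labels is not None and domain_labels is not None:
            # Combined objective: NLI loss + weighted domain-classification loss.
            loss = (self.loss_fn(nli_logits, nli_labels)
                    + self.lambda_domain * self.loss_fn(domain_logits, domain_labels))
        return loss, nli_logits, domain_logits
```

Training batches would mix source (SNLI) and target-domain sentence pairs so that the domain head sees both domain labels in every update.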
6.2.3 Results on NLI dataset

With the above-mentioned method I got a better result than the BERT result reported in Table 3: there, BERT trained on SNLI+MultiNLI and evaluated on SNLI gave 91.6% accuracy, while with the domain adaptation method I got 92.9%.

Using this setup with source data SQuAD and target data MS MARCO, the accuracy I had gotten earlier with the BERT model was 72% (I trained for only 5 epochs, as training was taking very long); using the self-training method, I got 76.8% after 5 epochs and 5 self-training iterations. A sketch of the self-training loop follows.
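For brevity, the model in this sketch is any PyTorch module mapping a batch of inputs to class logits (with the BERT model above, one would pass `input_ids`/`attention_mask` instead); `train_on`, the confidence `threshold`, and the full-batch loops are illustrative simplifications, not the report's exact procedure.

```python
import torch
import torch.nn as nn

def train_on(model, batches, epochs=5, lr=2e-5):
    # Ordinary supervised training over (inputs, labels) batches.
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in batches:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()

def self_train(model, labeled, unlabeled, iterations=5, threshold=0.9):
    # Alternate between training on the labeled pool and promoting
    # confidently pseudo-labeled target-domain examples into that pool.
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(iterations):
        train_on(model, labeled)
        model.eval()
        remaining = []
        with torch.no_grad():
            for x in unlabeled:
                probs = torch.softmax(model(x), dim=-1)
                conf, pred = probs.max(dim=-1)
                keep = conf >= threshold        # accept only confident predictions
                if keep.any():
                    labeled.append((x[keep], pred[keep]))
                if (~keep).any():
                    remaining.append(x[~keep])
        unlabeled = remaining
    return model
```

Each iteration retrains on the growing labeled pool, so early mistakes can propagate; the confidence threshold is the usual guard against that.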