
Generalization of Neural Networks in NLP

Harsh Vishwakarma
MTech CSA
21532
vharsh@iisc.ac.in

MTech Project Report

Abstract
This research report delves into the critical domain of Generalization in Neural Networks for Natural
Language Processing (NLP). An exploration of related works is undertaken to understand the motivation
behind various techniques aimed at achieving enhanced generalization. The report assesses the advantages
and disadvantages of these techniques, shedding light on how these aspects evolve over time. Additionally, the
paper explores innovative approaches for quantifying the generalization capabilities of neural networks.
A focal point of this research is the implementation of ProtoBERT, a language model designed to maximize
generalization in resource-constrained environments, i.e., in few-shot scenarios. We also present a technique to
improve the results of ProtoBERT on few-shot examples.
Notably, the report addresses a prevailing concern in the literature: the impact of batch sizes on model
generalization. Several empirical studies provide numerical evidence supporting the argument that larger
batch sizes often lead to diminished generalization quality. The investigation highlights that large-batch
methods tend to converge towards sharp minimizers of training and testing functions. This is a well-established
phenomenon, as sharp minima are associated with poorer generalization. The report substantiates these
findings through a series of experiments and compelling arguments, offering valuable insights into the nuanced
relationship between batch sizes and model generalization.

1 INTRODUCTION

The remarkable advancements in Neural Networks have propelled them to achieve accuracy levels comparable to human performance across diverse tasks. However, a critical challenge persists in ensuring the generalization of these models. The crux of the matter lies in distinguishing between a model that learns generalized patterns and one that merely memorizes examples, leading to overfitting. Generalization is paramount for discerning intricate patterns within data and achieving superior performance. Despite its importance, real-world scenarios present various impediments to the generalization process.

One significant obstacle is the presence of noise in the data, a factor that substantially hampers the model's ability to generalize effectively. As noise levels increase, the model's performance tends to degrade, emphasizing the need to mitigate such disturbances for optimal generalization. Another pertinent challenge arises in scenarios where the data is imbalanced. In the presence of minority examples, the model faces the dilemma of either treating them as noise or memorizing them, both of which hinder genuine generalization.

This research delves into the impact of large batch sizes on the generalization gap, a phenomenon observed in neural network training. Two primary reasons contribute to the widening of this gap:

• Firstly, large batch sizes have the potential to induce overfitting, compromising the model's ability to generalize.

• Secondly, the minima reached on the objective function are often sharp, making the model susceptible to accuracy fluctuations under even minor changes. This susceptibility to sharp minima further exacerbates the challenge of maintaining robust generalization.

In light of these complexities, this paper explores strategies to overcome these obstacles and enhance the generalization capabilities of neural networks.

2 Related Work

Hupkes et al. (2023) proposed a taxonomy of generalization. It describes the motivation behind generalization, its types, and how different shifts in the dataset can affect generalization. They also showed that little work has been done on the fairness and inclusivity aspects of generalization. Tänzer et al. (2022) proposed an architecture of BERT for low-resource settings that improves the performance of BERT in few-shot scenarios, addressing the fairness motivation of generalization.

Hoffer et al. (2018) found that hyperparameters also play a crucial role in generalization performance, and that smaller batch sizes are beneficial compared to larger ones. They therefore proposed training the model for more epochs and normalizing over smaller virtual batches rather than over the original, bigger batches. The advantage of such normalization is that even if we use larger batch sizes, the normalization statistics in each iteration are computed according to the small virtual batch size, giving similar performance for smaller and larger batch sizes.
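The idea can be sketched as follows. This is a minimal illustration, not Hoffer et al.'s reference implementation: normalization statistics are computed over small virtual ("ghost") batches even when the optimizer consumes a large batch. The module name, feature size, and virtual batch size of 64 are assumptions made for the example.

```python
import torch
import torch.nn as nn

class GhostBatchNorm1d(nn.Module):
    """Normalize a large batch as several independent virtual (ghost) batches.

    A minimal sketch of the idea from Hoffer et al. (2018); the virtual batch
    size and module name are illustrative choices.
    """

    def __init__(self, num_features: int, virtual_batch_size: int = 64):
        super().__init__()
        self.virtual_batch_size = virtual_batch_size
        self.bn = nn.BatchNorm1d(num_features)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Split the large batch into ghost batches, normalize each with its
        # own batch statistics, then concatenate the results back together.
        chunks = x.split(self.virtual_batch_size, dim=0)
        return torch.cat([self.bn(chunk) for chunk in chunks], dim=0)

# Usage: a large batch of 1024 examples is normalized as 16 ghost batches of 64.
gbn = GhostBatchNorm1d(num_features=128, virtual_batch_size=64)
out = gbn(torch.randn(1024, 128))
```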
But the catch here is that the performance depends not on the number of epochs but on the number of updates. Also, the learning rate should be larger in the starting phase, as we want to take strides as long as possible from the initial point in order to reach a local minimum of smooth curvature (which is related to better generalization). After this, Naganuma et al. (2022) found that not only the hyperparameters but also the distribution of the dataset and the choice of optimizer play a crucial role. They also discovered that for out-of-distribution data, non-adaptive optimizers (SGD, momentum) perform better than adaptive optimizers (Adam, RMSprop), while for in-distribution cases there is no significant difference.

Some researchers have also proposed work on evaluation techniques for generalization. The main problem of generalization occurs when we have a smaller number of examples for a category, so models find it hard to generalize for such non-major entities. There can be several scenarios, such as a category of examples appearing in the test sample that is not seen in the training set, or an example occurring in more than one category in the training set. To address this, Zhang et al. (2020) proposed the Entity Coverage Ratio, which quantifies whether an example occurring in a test set is seen in a training set. This paper also introduced the concept of Consistency, quantifying the relationship between the entities in the dataset. The advantage is that we can train the model considering the relationships between entities: two entities are positively related if a gradient step taken w.r.t. one entity reduces the loss on the other, and similarly for negatively related entities. The main disadvantage of such an evaluation process is that models are generally pre-trained on vast corpora, and finding such relationships between the entities in large datasets is computationally heavy and time-consuming.

This problem is addressed by Yang et al. (2022), who proposed evaluating NLP models with generalization metrics that do not need access to training or testing data. They use the concept of heavy-tail self-regularization (HTSR), which considers the Empirical Spectral Density (ESD) of the weight matrices. Suppose that for several models we have their ESD curves and their output BLEU scores on a specific dataset: plotting the BLEU score against this density function identifies the best model. This idea only considers the weight parameters and the final evaluation score to measure the model's generalization performance; nowhere does it need the dataset. The explanation behind such behavior is that the weight matrices become increasingly correlated during training and their eigenvalues become similar, making the ESD heavy-tailed. Experimentally, models that have better BLEU scores show heavier-tailed ESDs.
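As an illustration of the data-free flavour of this evaluation, the sketch below computes the empirical spectral density of one BERT weight matrix. The checkpoint name, the chosen layer, and the crude max/median ratio used as a heavy-tail indicator are assumptions made for the example; the actual HTSR metrics of Yang et al. fit a power law to the ESD and are more involved.

```python
import numpy as np
import torch
from transformers import AutoModel

# Any pre-trained checkpoint works; bert-base-uncased is just an example.
model = AutoModel.from_pretrained("bert-base-uncased")

def esd(weight: torch.Tensor) -> np.ndarray:
    """Eigenvalues of W^T W, i.e., the support of the empirical spectral density."""
    W = weight.detach().cpu().numpy()
    # Singular values of W; eigenvalues of W^T W are their squares.
    sv = np.linalg.svd(W, compute_uv=False)
    return sv ** 2

# Inspect one attention projection matrix as an example layer.
W = model.encoder.layer[0].attention.self.query.weight
eigs = esd(W)

# A crude, illustrative heavy-tail indicator: how far the largest eigenvalue
# sits above the bulk of the spectrum (not the power-law exponent used by HTSR).
print("max/median eigenvalue ratio:", eigs.max() / np.median(eigs))
```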
Various techniques are adopted as state of the art to improve generalization. The feature fusion technique dramatically affects the generalization of neural networks. Mungoli (2023) proposed Adaptive Feature Fusion (AFF) for enhancing generalization in deep learning models. The feature fusion process can be used by looking at the data and the model and then aggregating the extracted features for better performance. AFF leverages a combination of data-driven and model-based fusion strategies, adaptively fusing features based on the underlying data characteristics and model requirements. In the AFF framework, an adaptive fusion layer receives features from multiple sources and combines them using fusion functions that are learned during the training process. A meta-learning step is introduced at the end, allowing the fusion layer to learn how to optimally combine the outputs of the fusion functions for a given task.

Since pre-trained models are trained on large corpora, they are generally trained on some general domain. To use these models, we must fine-tune them to perform tasks in a specific domain. Alternatively, only adapters are fine-tuned, so there are no updates to the pre-trained model parameters. Chronopoulou et al. (2023) proposed AdapterSoup, which averages the weights of the adapters most relevant (selected using an attention mechanism) to the new domain, to improve the out-of-domain performance of pre-trained models at test time.

In contrast, one of the most critical results of Statistical Learning Theory states that the discrepancy between training and generalization error diminishes as the number of training examples increases. Ferreira et al. (2023) proposed Data Augmentation (DA) for increasing the diversity of training examples, thus expanding the generalization power of the model. This method is heavily exploited in Computer Vision, for example by providing a CNN with a rotated-MNIST dataset for domain adaptation. In NLP, data augmentation methods can be applied in both the data space and the feature space; another option is noise injection. Authors have also used cGANs to generate synthetic data while fine-tuning the model, which helps in low-resource settings. The most promising technique for DA is mixup. Its main advantage is that it requires no additional labelling, since the augmentation is done via a linear combination of examples.
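A minimal sketch of mixup on sentence embeddings is shown below. The Beta(α, α) sampling of the interpolation coefficient follows the usual mixup formulation, while the embedding size, the α value, and the use of one-hot labels are assumptions made for the example.

```python
import torch

def mixup(emb: torch.Tensor, labels: torch.Tensor, alpha: float = 0.4):
    """Create new examples as linear combinations of existing ones.

    emb:    (batch, hidden) sentence embeddings (e.g., [CLS] vectors)
    labels: (batch, num_classes) one-hot labels
    """
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(emb.size(0))              # random partner for each example
    mixed_emb = lam * emb + (1.0 - lam) * emb[perm]
    mixed_labels = lam * labels + (1.0 - lam) * labels[perm]
    return mixed_emb, mixed_labels

# Example: mix a batch of 8 sentence embeddings for a 3-class task.
emb = torch.randn(8, 768)
labels = torch.nn.functional.one_hot(torch.randint(0, 3, (8,)), 3).float()
mixed_emb, mixed_labels = mixup(emb, labels)
```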
Fisch et al. (2019) presented the results of the Machine Reading for Question Answering shared task on evaluating the generalization capabilities of reading comprehension systems. They adopted an extractive question-answering format based on two constraints:

• The answer to each question must appear as a span of tokens in the passage. The idea is that we can extract the main tokens from the question and search for similar ones in the passage to find a more accurate answer.

• The passage is truncated to the first 800 tokens. This reduces the computational complexity in the case of large datasets, as we have to search in a limited number of tokens.

The tokens of the question and the passage are concatenated with some special tokens to obtain a joint context, which is then encoded with a baseline BERT; the encoding of the input is then passed to the MultiQA model to predict the answer.
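A minimal sketch of this input construction, assuming a Hugging Face tokenizer and a bert-base-uncased checkpoint (the exact tokenizer and model used in the shared task may differ):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")  # baseline BERT assumed

question = "Who organized the shared task?"
passage = "..."  # full passage text

# Step 1 (as described above): keep only the first 800 tokens of the passage.
passage_tokens = tokenizer.tokenize(passage)[:800]
passage = tokenizer.convert_tokens_to_string(passage_tokens)

# Step 2: build the joint context "[CLS] question [SEP] passage [SEP]".
encoded = tokenizer(
    question,
    passage,
    truncation="only_second",               # if more truncation is needed, drop passage tokens
    max_length=tokenizer.model_max_length,  # 512 for vanilla BERT
    return_tensors="pt",
)
# encoded["input_ids"] is fed to the BERT encoder; a span-prediction head
# (the MultiQA model in the shared task) then picks start/end token positions.
```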
Miller et al. (2021) showed the correlation between in-distribution and out-of-distribution performance with respect to generalization. Their research is based on the idea that it is rare in real-world scenarios for the training dataset and the test dataset to come from the same distribution, so the model should handle out-of-distribution cases between training and test more robustly. They found that high in-distribution accuracy does not necessarily translate to good out-of-distribution generalization, and vice versa. They also found that if there is a linear trend between in-distribution and out-of-distribution accuracy, as for CIFAR-10, then all datasets with similar shifts follow the same linear trend.

Keskar et al. [2017], in their findings, attributed the convergence of large-batch methods to sharper minima to the large eigenvalues of ∇²f(x), whereas the eigenvalues of ∇²f(x) for small batches are small, which leads to flatter minimizers. They took several fully connected and convolutional models and performed classification tasks with each one on the CIFAR-10 and CIFAR-100 datasets, considering both large and small batch sizes. The experiments showed that with the smaller batch size, irrespective of the model used, the generalization capability of the model improves by up to 2-5%.

3 Generalizability of BERT

3.1 Learning Events of pre-trained BERT

Pretrained BERT is a robust model that is not prone to forgetting events. A forgetting event occurs when the model learns an example at step t − 1 and forgets it at step t. The loss curve for BERT in Fig 1 is monotonically decreasing. The learning process of the model is backed by pretraining on a large dataset.

Fig 1: Loss versus number of learning steps, showing the learning behavior of BERT on the MNLI dataset.

Fig 2: The behavior of BERT on the SNLI dataset during training (loss vs. epochs).

The results in Fig 2 are similar to the training behavior of BERT reported by Tänzer et al. (2022). This behavior is explained by the memorization phase of BERT training: after a certain number of training iterations, the model starts over-fitting to the noise. The model used above is ProtoBERT on the SNLI dataset, where some of the examples for certain entities are removed to make the dataset imbalanced and to add some noise.

3.2 Behaviour of BERT in the presence of Noise

The BERT model is robust to the presence of noise: it first learns the clean examples and ignores the noise; after that, a settling phase kicks in, during which the training loss stabilizes; and at the end it memorizes the noise.

Fig 3: BERT's behavior on a noisy dataset.

Fig 3 shows BERT's behavior in the presence of different levels of noise in the MNLI dataset. Similar behavior can also be seen in the Tänzer et al. (2022) paper, which also showed that the training process of BERT is robust to noise, in that the training loss decreases with the same pattern in the presence of noise.

4 BERT on Few-Shot

One of the most crucial findings by Tänzer et al. (2022) concerns BERT's behavior in a low-resource setting, i.e., a few-shot learning scenario. The modification in ProtoBERT includes a random sampler, which keeps the ratio of minority to non-minority classes in check. Also, class centroids are used to classify examples, and a softmax is applied to the distances between an example's representation and the class centroids.
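A minimal sketch of this centroid-based classification is given below. It is not the exact ProtoBERT implementation; the tensor shapes and the use of squared Euclidean distance are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def prototype_logits(support_emb, support_labels, query_emb, num_classes):
    """Classify queries by a softmax over negative distances to class centroids.

    support_emb:    (n_support, hidden) encoder outputs for labelled examples
    support_labels: (n_support,) integer class labels
    query_emb:      (n_query, hidden) encoder outputs to classify
    """
    # Class centroid = mean embedding of the support examples of that class.
    centroids = torch.stack(
        [support_emb[support_labels == c].mean(dim=0) for c in range(num_classes)]
    )                                                # (num_classes, hidden)
    # Squared Euclidean distance from every query to every centroid.
    dists = torch.cdist(query_emb, centroids) ** 2   # (n_query, num_classes)
    return -dists                                    # softmax over negative distance

# Toy usage with three classes and four support examples per class.
support_emb = torch.randn(12, 768)
support_labels = torch.tensor([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])
query_emb = torch.randn(4, 768)
probs = F.softmax(prototype_logits(support_emb, support_labels, query_emb, 3), dim=-1)
```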
The graph below compares BERT and ProtoBERT on a Natural Language Inference dataset. The model used here is visualBERT (a pre-trained model from Hugging Face).

Fig 4: BERT vs ProtoBERT on the SNLI dataset (imbalanced dataset).

Fig 4 shows that ProtoBERT outperforms BERT on the unbalanced dataset, i.e., one containing a mix of minority and majority classes. The above results are generated by taking 70% of the dataset and training the model.

4.1 Drawbacks of protoBERT

ProtoBERT is a model that performs well in few-shot learning, but base BERT exceeds it when the full dataset is available. Also, ProtoBERT does not leverage data augmentation, which can be extremely useful in the few-shot case. Luong et al. (2022) proposed a model (STraTA) that includes self-training with task augmentation to achieve better results.
4.2 Self-Training on ProtoBERT

Fig 5: Accuracy curve of STraTA.

Table 1: Comparison of models on the SNLI dataset over 25 training iterations, for the full and few-shot datasets.

Self-training, also known as self-teaching, is a machine-learning technique in which a model generates its own training labels and iteratively refines itself through a training process. This approach is particularly useful when labeled data is scarce or expensive to obtain. Table 1 shows the accuracies of ProtoBERT and STraTA on the SNLI dataset for both the few-shot and the complete setting. (The actual results in the paper were slightly higher than those shown in Table 1; I was not able to regenerate the same results because I did not run the training with the original hyperparameters.)

5 Effects of Batch Sizes on Generalization

Table 2: Experiment configuration.

In this section, we perform similar experiments with BERT, RoBERTa & ELECTRA across various batch sizes on sentiment classification and paragraph classification datasets, providing the corresponding insights. Fig 6 shows that as the batch size increases, not only the generalization but also the F1 score is reduced. This is model-dependent behavior, as both smaller and larger batches can give high accuracy at training time, but there is a significant gap at inference.

Fig 6: Performance of pre-trained BERT for different batch sizes during training.

Table 3: Training and testing weighted F1 scores for different ranges of batch sizes.
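For reference, the weighted F1 reported in Table 3 (and in Table 4 below) can be computed as in the following sketch; the label arrays are placeholders for the gold labels and predictions of one evaluation run.

```python
from sklearn.metrics import f1_score

y_true = [0, 1, 2, 2, 1, 0]   # placeholder gold labels for one run
y_pred = [0, 1, 2, 1, 1, 0]   # placeholder model predictions

# "weighted" averages the per-class F1 scores, weighted by class support,
# which is the quantity reported in the tables of this section.
print(f1_score(y_true, y_pred, average="weighted"))
```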

5.1 Small Batch vs Large Batch

Though large batches (LB) are computationally efficient and their computation is easy to parallelize and average, the main benefit of small batches (SB) over LB is that SB methods converge to flatter minima. This leads to two main benefits:

• SB methods tend to escape sharp local minima and converge at a better minimum that is less prone to perturbations, whereas LB methods lack the explorative properties of SB methods and tend to zoom in on the minimizer closest to the initial point.

• Since the convergence point is flat, it is less sensitive to changes. In contrast, for LB the large sensitivity of the training function at a sharp minimizer negatively impacts the ability of the trained model to generalize on new data.

We can see the behavior of BERT for both SB & LB in Fig 7.

Fig 7: Comparison between small batch (64) and large batch (1000) on pre-trained BERT.

5.2 Plots of models on different datasets

For the below experiments, we considered SB = 64 and LB = 1000.
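A minimal sketch of how such a comparison can be run is given below, assuming Hugging Face models and a generic text-classification dataset; the epoch count, learning rate, and other details are illustrative and do not reproduce the exact configuration of Table 2.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from transformers import AutoTokenizer, AutoModelForSequenceClassification
from sklearn.metrics import f1_score

def run(batch_size, train_texts, train_labels, test_texts, test_labels,
        model_name="bert-base-uncased", epochs=3, lr=2e-5, device=None):
    """Fine-tune the same model with a given batch size; return the test weighted F1."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2).to(device)

    enc = tok(train_texts, truncation=True, padding=True, return_tensors="pt")
    data = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(train_labels))
    loader = DataLoader(data, batch_size=batch_size, shuffle=True)
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    model.train()
    for _ in range(epochs):
        for input_ids, attention_mask, labels in loader:
            opt.zero_grad()
            out = model(input_ids=input_ids.to(device),
                        attention_mask=attention_mask.to(device),
                        labels=labels.to(device))
            out.loss.backward()
            opt.step()

    model.eval()
    enc_test = tok(test_texts, truncation=True, padding=True, return_tensors="pt").to(device)
    with torch.no_grad():
        preds = model(**enc_test).logits.argmax(dim=-1).cpu()
    return f1_score(test_labels, preds, average="weighted")

# The only difference between the two runs is the DataLoader batch size:
# f1_sb = run(64,   train_texts, train_labels, test_texts, test_labels)   # small batch
# f1_lb = run(1000, train_texts, train_labels, test_texts, test_labels)   # large batch
```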

Fig 8: small batch (64) vs large batch (1000), BERT on IMDB sentiment classification.

Fig 9: small batch (64) vs large batch (1000), RoBERTa on IMDB sentiment classification.

Fig 10: small batch (64) vs large batch (1000), ELECTRA on IMDB sentiment classification.

Fig 11: small batch (64) vs large batch (1000), BERT on AG News classification.

Fig 12: small batch (64) vs large batch (1000), RoBERTa on AG News classification.

Fig 13: small batch (64) vs large batch (1000), ELECTRA on AG News classification.
Table 4: F1 scores of all the experiments performed.

As we can see in all the above plots of the different models on the two datasets, and in Table 4, the behavior across different batch sizes remains similar. We can conclude that a small batch size gives better inference results than larger batch sizes, hence improving generalization.

5.3 Is a Smaller Batch always beneficial?

As we have seen in Fig 7, smaller batch sizes give better generalization, but the main question is: if we keep decreasing the batch size, do we keep getting better results? The answer is not always. We can see the comparison between a mini-batch of size 64 and stochastic gradient descent in Fig 15: as the step computation in SGD is noisy, the gradient update is highly stochastic; though it is faster to compute, it loses generalizability. Also, SGD is not appropriate for parallelization. So, a threshold must exist, below and above which the generalization ability keeps decreasing. For our experiment, we can see from Table 3 that such a threshold is a batch size of 64.

Fig 15: Stochastic gradient descent vs mini-batch (64) on pre-trained BERT.

6 Data shifts affecting the Generalization

6.1 Covariate Shift

Covariate shift refers to a situation in which the distribution of the input features changes between the training and test datasets, i.e., p(x_train) ≠ p(x_test) while p(y_train | x_train) = p(y_test | x_test). This can cause a machine learning model trained on the training data to perform poorly when applied to the test data, as the model's assumptions about the data distribution may no longer hold.

Fig 16: Covariate shift on BiLSTM.

Fig 17: Covariate shift on BERT.

Fig 18: Covariate shift on RoBERTa.
We can use Domain Adaptation Techniques to correct
the covariate shift.

6.2.1 Dataset
• Source data - SNLI

• Target data - MultiNLI

6.2.2 Experimental Setup


: Fig 19: Covariate Shift on T5
• Pre-processing

– Tokenize input sentences using BERT tok-


enizer.
– Convert tokenized sentences into BERT-
compatible input features.
– Prepare NLI dataset for both source and
target domains.

• BERT Model Architecture:

– Load pre-trained BERT model (in this case


I used BERT-base-uncased).
– Fine-tune BERT for NLI with added clas-
sification layer.

• Domain Classifier:
: Table 5: Comparative results of different models
on covariate shift – Integrate domain classifier into BERT
model.
The above figures show how the covariate shifts af- – Extract BERT embeddings for input sen-
fect the model’s generalization. But we can also see tence pairs.
from Table 5 that, as the model complexity increases,
the difference in the accuracies of this shift minimizes. – Add domain classification layer to predict
domain labels.
One of the reasons for such behavior is that models
• Training Procedure:
like T5 are pre-trained on different tasks, which re-
sults in a better understanding of language and makes – Train entire model with combined NLI and
it the comparatively efficient model for zero-shot gen- domain classification losses.
eralization tasks.
– Pass sentence pairs from both domains
One of the common ways to handle this is to through the model.
pre-train the model with a combination of combined – Compute domain classification loss and up-
datasets and then finetune it on an objective-specific date model parameters.
task.
• Evaluation:
In the above table, I have trained the model on
combined SNLI+MNLI and then tested it on SNLI, – Evaluate trained BERT model on target
which resulted in improved accuracy. domain dataset.
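The following is a minimal sketch of the setup summarized above, assuming a shared BERT encoder with two linear heads and a simple sum of the two cross-entropy losses; the class counts, example sentences, and labels are placeholders, and the real experiment may differ in details.

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class DomainAdaptedNLI(nn.Module):
    """Shared BERT encoder with an NLI head and a domain-classification head."""

    def __init__(self, model_name="bert-base-uncased", num_nli_labels=3, num_domains=2):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        hidden = self.encoder.config.hidden_size
        self.nli_head = nn.Linear(hidden, num_nli_labels)   # entailment / contradiction / neutral
        self.domain_head = nn.Linear(hidden, num_domains)   # source (SNLI) vs target (MultiNLI)

    def forward(self, input_ids, attention_mask):
        out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
        cls = out.last_hidden_state[:, 0]                   # [CLS] embedding of the sentence pair
        return self.nli_head(cls), self.domain_head(cls)

# One combined training step on an illustrative two-example batch.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = DomainAdaptedNLI()
batch = tok(["A man is sleeping.", "Two dogs run."],
            ["A person rests.", "The animals are still."],
            padding=True, return_tensors="pt")
nli_labels = torch.tensor([0, 1])      # placeholder NLI labels
domain_labels = torch.tensor([0, 1])   # 0 = source domain, 1 = target domain

nli_logits, dom_logits = model(batch["input_ids"], batch["attention_mask"])
loss = nn.functional.cross_entropy(nli_logits, nli_labels) + \
       nn.functional.cross_entropy(dom_logits, domain_labels)
loss.backward()   # followed by an optimizer step, as in the training procedure above
```

Adversarial variants of this setup insert a gradient-reversal layer before the domain head; the procedure described above only states that the two losses are combined, so the sketch does the same.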
6.2.3 Results on NLI dataset

With the above-mentioned method, I got a better result than the BERT number reported in Table 5: with BERT trained on SNLI+MultiNLI and evaluated on SNLI I got 91.6% accuracy, and with the domain adaptation method I got 92.9%.

6.2.4 Results on classification dataset

Using the above-mentioned method for sentence classification:

• source data: IMDB review dataset

• target data: AG News dataset

The results I got are 5.4% higher when checked using the BERT model. The results were comparable when using the T5 model.

For the paragraph classification dataset:

• source data: Amazon Review dataset

• target data: TripAdvisor dataset

The accuracy difference is around 4.9%, which is comparable to what I got on the sentence classification dataset. From the above two results, we can say that the proposed method works well on classification datasets.

6.2.5 Results on Question Answering dataset

The method proposed above failed on a QA dataset with

• source data: MS MARCO dataset

• target data: SQuAD dataset

The method works fine on NLI and classification datasets but does not generalize the model for the QA dataset. The possible reason behind such behavior is that the technique works well when the language model has to output a single token, not a sequence of words. For QA, I tried the idea I used to improve ProtoBERT in my earlier work, i.e., self-training (see the sketch after this list):

• Train the model on the source dataset.

• Generate pseudo-labels for the unlabelled target dataset.

• Retrain the model on the combined source data and pseudo-labelled target data.

• Iteratively repeat this process until we reach a favorable accuracy.
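A minimal sketch of this loop is given below; train_model and predict_proba are hypothetical callables standing in for the actual fine-tuning and inference code, and the number of rounds and the confidence threshold are illustrative choices.

```python
# Minimal sketch of the self-training loop described above. The caller supplies:
#   train_model(labelled_pairs) -> model          (hypothetical fine-tuning helper)
#   predict_proba(model, text) -> (label, conf)   (hypothetical inference helper)

def self_train(train_model, predict_proba, source_data, target_texts,
               rounds=5, threshold=0.9):
    model = train_model(source_data)                        # 1. train on the source dataset
    for _ in range(rounds):                                 # 4. repeat for a fixed number of rounds
        pseudo = []
        for text in target_texts:                           # 2. pseudo-label the target data
            label, confidence = predict_proba(model, text)
            if confidence >= threshold:                     #    keep only confident predictions
                pseudo.append((text, label))
        model = train_model(source_data + pseudo)           # 3. retrain on source + pseudo-labels
    return model
```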
Using this, the accuracy I got earlier with the BERT model was 72% (I trained the model for only 5 epochs as it was taking too much time), with source data SQuAD and target data MS MARCO; using the self-training method, I got 76.8% for 5 epochs and 5 self-training iterations.

6.3 Label Shift

Label shift refers to a change in the distribution of class labels between the training and test datasets: the phenomenon where p(x_test) = p(x_train) but p(y_test | x_test) ≠ p(y_train | x_train). The best way to correct label shift is to use the idea behind BART and T5, i.e., to augment the input data and perform unsupervised training by masking the input sentence word-wise or in chunks of words, so that the model learns the corpus in a more generalized way. After that, we can zero-shot the model on the test data by giving prompts.
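A minimal sketch of the two masking schemes (word-wise and chunk-wise) is shown below; the mask token, masking rate, and span length are assumptions made for the example, and a real setup would use the tokenizer's own mask or sentinel tokens (e.g., BERT's [MASK] or T5's sentinel tokens).

```python
import random

def mask_words(tokens, mask_token="[MASK]", rate=0.15):
    """Word-wise masking: each token is masked independently with probability `rate`."""
    return [mask_token if random.random() < rate else t for t in tokens]

def mask_chunk(tokens, mask_token="[MASK]", span=3):
    """Chunk masking: one contiguous span of `span` tokens is replaced by a single mask."""
    start = random.randrange(max(1, len(tokens) - span))
    return tokens[:start] + [mask_token] + tokens[start + span:]

sentence = "the premise and the hypothesis are combined as one sentence".split()
print(mask_words(sentence))   # e.g. word-wise masked version
print(mask_chunk(sentence))   # e.g. chunk-masked version
```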
6.3.1 Experimental Setup

Dataset
I've used the MultiNLI dataset in this experiment. The MultiNLI dataset contains sentence pairs annotated for natural language inference (NLI), where the labels indicate an entailment, contradiction, or neutral relationship between the premises and hypotheses. It includes two sets of labels: genre labels (e.g., fiction, government, telephone) and NLI labels (entailment, contradiction, neutral).

Methodology
I have performed two experiments to check whether the above-mentioned method is effective or not:

• First, I took pre-trained BERT, fine-tuned it on the MultiNLI NLI task, and then inferred the actual genre from which the hypothesis and premise were taken.

• Second, I took pre-trained BERT and fine-tuned it on multiple tasks, giving the sentences of the dataset as input:
  – fine-tuning BERT on the NLI task;
  – next-sentence prediction on the hypothesis and premise;
  – MLM-type training that combines the hypothesis and premise as one sentence.

Result
The latter improved accuracy by 2.7%, as it gained better knowledge of the whole corpus and generalized well.

7 Conclusions and Future Work

Up till now, I have been following the state-of-the-art generalization research in NLP by Hupkes et al., and I have performed experiments on improving the generalization of BERT in the few-shot scenario by giving a new architecture. I then experimented with how batch sizes affect generalization capabilities, differentiating the effects of small batches vs. large batches. Finally, I have experimented with how data shifts can affect generalization capabilities and given mitigation techniques based on state-of-the-art language models. Future work has the scope of applying all these techniques in a cross-lingual setup, as described under the types of generalization in the literature mentioned above.

References

[1] Dieuwke Hupkes, Mario Giulianelli, Verna Dankers, Mikel Artetxe, Yanai Elazar, Tiago Pimentel, Christos Christodoulopoulos, Karim Lasri, Naomi Saphra, Arabella Sinclair, Dennis Ulmer, Florian Schottmann, Khuyagbaatar Batsuren, Kaiser Sun, Koustuv Sinha, Leila Khalatbari, Maria Ryskina, Rita Frieske, Ryan Cotterell, Zhijing Jin. 2023. State-of-the-art generalization research in NLP: A taxonomy and review.

[2] Hanie Sedghi (Google Research), Preetum Nakkiran (Harvard University). A New Lens on Understanding Generalization in Deep Learning.

[3] Hiroki Naganuma, Kartik Ahuja, Ioannis Mitliagkas. 2022. Empirical Study on Optimizer Selection for Out-of-Distribution Generalization.

[4] Michael Tänzer, Sebastian Ruder, Marek Rei. 2022. Memorization versus Generalisation in Pre-trained Language Models.

[5] Jinlan Fu, Pengfei Liu, Qi Zhang, Xuanjing Huang. 2020. Rethinking Generalization of Neural Models: A Named Entity Recognition Case Study.

[6] Elad Hoffer, Itay Hubara, Daniel Soudry. 2018. Train longer, generalize better: closing the generalization gap in large batch training of neural networks.

[7] Yaoqing Yang, Ryan Theisen, Liam Hodgkinson, Joseph E. Gonzalez, Kannan Ramchandran, Charles H. Martin, Michael W. Mahoney. 2022. Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data.

[8] Neelesh Mungoli. 2023. Adaptive Feature Fusion: Enhancing Generalization in Deep Learning Models.

[9] Alexandra Chronopoulou, Matthew E. Peters, Alexander Fraser, Jesse Dodge. 2023. AdapterSoup: Weight Averaging to Improve Generalization of Pretrained Language Models.

[10] Lucas Francisco Amaral Orosco Pellicer, Taynan Maier Ferreira, Anna Helena Reali Costa. 2023. Data augmentation techniques in natural language processing.

[11] Tu Vu, Minh-Thang Luong, Quoc V. Le, Grady Simon, Mohit Iyyer. 2022. STraTA: Self-Training with Task Augmentation for Better Few-shot Learning.

[12] Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy. 2017. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima.

[13] Makoto Morishita, Yusuke Oda, Graham Neubig, Koichiro Yoshino. An Empirical Study of Mini-Batch Creation Strategies for Neural Machine Translation.

[14] Sebastian Ruder. 2017. An overview of gradient descent optimization algorithms.
