
MUD lab 4 - NN-based NERC

Authors: Iva Bokšić, Toni Ivanković

Introduction

This report presents the findings from the implementation of a neural network for the task
of NERC.
An initial F1 score of 51.3 % was recorded, and improvements were tested in order to
increase it to an acceptable value. The low overall F1 score was caused mainly by the low
F1 scores of two of the rare classes, brand and drug_n, even though the F1 scores of the
other two classes were fairly high.
The goal of the work was to improve the prediction accuracy on test data by experimenting
with the neural network architecture and input features.

Input information

The initial model had only two inputs: the index of the word form and the index of the word
suffix of length 3.

A suffix length of 4 was tried first, without any other modifications. Another experiment
added input layers for suffix indices of lengths one larger and one smaller than the chosen
suffix length (in this case, lengths 3 and 5).

In the end, the modification was made so that all suffix lengths between 2 and 5 are input
to the network. The idea behind including more suffix lengths was that suffixes of various
lengths could help identify a chemical compound, which is often present in the name of a
drug.
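
As an illustration, such multi-length suffix indices could be encoded per word roughly as
follows; the suf_index dictionaries and the 'UNK' entry are hypothetical names, not the
actual lab code:

# Hypothetical sketch (not the actual lab code): encode suffixes of
# lengths 2-5 as separate integer index features for one word.
SUF_LENGTHS = range(2, 6)

def suffix_features(word, suf_index):
    # suf_index[k] is assumed to map a length-k suffix to an integer id,
    # with a reserved 'UNK' id for suffixes not seen during training.
    feats = {}
    for k in SUF_LENGTHS:
        suffix = word.lower()[-k:]
        feats["suf%d" % k] = suf_index[k].get(suffix, suf_index[k]["UNK"])
    return feats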

Experiments were also done with the maximum length of the index dictionary. The initial
experiment used a length of 150, and double that value, 300, was also tried. A larger
dictionary limit was expected to let the inputs store more information and increase the
performance.

Different embedding output dimensions were tried. In the original model, the embedding
dimensions were 150 for word forms and 50 for suffixes. Larger values were tried in order
to give the network a bigger learning capacity.

Aside from the word form and its suffixes, the lowercase form of the word was added as an
additional input layer. The lowercase form, in combination with the original form of the
word, was expected to provide the network with additional information about the word and
help it recognize patterns which appear only in named entities.

In the end, various indicators were added to the network as inputs: indicators of presence
in external files (DrugBank, HSDB) and indicators of the presence of numbers, dashes and
the type of casing (all lowercase, all uppercase, camelCase). All of these indicators could
strongly suggest that a word is part of a named entity.
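
A sketch of how these indicators could be computed for a single token is given below; the
drugbank and hsdb arguments are assumed to be sets of lowercased entries loaded from the
external files:

import re

def indicator_features(word, drugbank, hsdb):
    # Hypothetical sketch of the binary/categorical indicators described above.
    lw = word.lower()
    return {
        "in_drugbank": int(lw in drugbank),               # presence in DrugBank
        "in_hsdb": int(lw in hsdb),                       # presence in HSDB
        "has_digit": int(any(c.isdigit() for c in word)),
        "has_dash": int("-" in word),
        "all_lower": int(word.islower()),
        "all_upper": int(word.isupper()),
        "camel_case": int(bool(re.match(r"^[a-z]+[A-Z]", word))),
    }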

Architecture

In the beginning, the network had two inputs (the word form index and the suffix index),
each followed by an embedding layer, with a dropout layer between the embeddings and a
bidirectional LSTM layer, and an output layer producing the per-token labels.
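
A minimal Keras sketch of this initial layout, assuming a time-distributed softmax output
and using the hyperparameters mentioned elsewhere in the report (embedding sizes 150 and
50, 200 LSTM units, dropout 0.1) with placeholder vocabulary sizes, could look as follows:

from tensorflow.keras.layers import (Input, Embedding, Dropout, LSTM, Bidirectional,
                                     Dense, TimeDistributed, concatenate)
from tensorflow.keras.models import Model

# Placeholder sizes for illustration; in the lab these come from the codemaps.
max_len, n_words, n_sufs, n_labels = 150, 4000, 4000, 10

inp_word = Input(shape=(max_len,))          # word form indices
inp_suf = Input(shape=(max_len,))           # suffix (length 3) indices

emb_word = Embedding(input_dim=n_words, output_dim=150)(inp_word)
emb_suf = Embedding(input_dim=n_sufs, output_dim=50)(inp_suf)

x = concatenate([emb_word, emb_suf])
x = Dropout(0.1)(x)                         # dropout between the embeddings and the biLSTM
x = Bidirectional(LSTM(200, return_sequences=True))(x)
out = TimeDistributed(Dense(n_labels, activation="softmax"))(x)

model = Model([inp_word, inp_suf], out)
model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])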

When adding new input information, new input layers had to be added, and for the lowercase
form of the word an embedding layer was added after it as well. The indicators described at
the end of the previous chapter were not followed by embedding layers, since they are
binary/categorical variables and embedding them would not be meaningful.

In the LSTM layer, the initial number of LSTM units was 200, and other values were used in
the experiments, namely 150 and 300. The idea was to give more learning power to the
network by increasing the number of units, or to decrease the number of units to limit the
complexity of the network and enable it to learn quicker.

The network was then reordered so that the dropout layer is at the end of the network. The
reasoning was that dropout should affect the outputs of all the layers in the network, not
just the embeddings. Other dropout configurations were also tried, with a dropout layer
after each layer in the network.
The dropout rate was also modified, increasing it from the initial 0.1, in order to slow
down the learning process and delay overfitting. Values up to 0.99 were tried.

A convolutional layer was added, with a kernel size of 5 and no pooling. Pooling was omitted
in order to keep the original dimensions of the data. Various kernel sizes were tried: 3, 5,
7 and 9.
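
Continuing the sketch above, such a convolution can be placed directly on the concatenated
embeddings; 'same' padding keeps the sequence length unchanged, so no pooling is needed
(the number of filters is an assumption):

from tensorflow.keras.layers import Conv1D

# x is the concatenated embedding output from the earlier sketch.
x = Conv1D(filters=100, kernel_size=5, padding="same", activation="relu")(x)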

Different learning rates were tried, in order to speed up or slow down the learning process.
The initial learning rate (for the Adam optimizer) was 0.001, and the values of 0.0005 and
0.005 were tried. No other optimizers were tried, since Adam is the most commonly used
optimizer and it showed good results in combination with the additional input information
and the architecture changes.
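
For reference, in Keras the learning rate is set by passing the optimizer explicitly at
compile time, for example:

from tensorflow.keras.optimizers import Adam

model.compile(optimizer=Adam(learning_rate=0.001),   # 0.0005 and 0.005 were also tried
              loss="categorical_crossentropy",
              metrics=["accuracy"])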

Different batch sizes were explored. The initial thought was that a bigger batch size would
make more sense with the unbalanced dataset, since it is usually helpful to ensure that at
least one instance of each class is present in each batch. However, smaller batch sizes were
also explored.

The final selected architecture will be presented in the results section.


Code
In codemaps.py, the only change of interest is the DrugBank presence indicator. The entity
types were enumerated, and if a word was found in DrugBank, the respective index was
returned (e.g. 0 for brand, 1 for drug…). If the word was not found, a separate value was used.
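
A sketch of that encoding is shown below; the dictionary names are hypothetical, but the
idea follows the description above (one index per DrugBank type and a separate value for
words that are not found):

# Hypothetical sketch of the DrugBank indicator in codemaps.py.
# drugbank_types is assumed to map a lowercased word to its DrugBank type.
DB_TYPE_INDEX = {"brand": 0, "drug": 1, "drug_n": 2, "group": 3}
NOT_FOUND = len(DB_TYPE_INDEX)          # separate index for words absent from DrugBank

def drugbank_code(word, drugbank_types):
    return DB_TYPE_INDEX.get(drugbank_types.get(word.lower()), NOT_FOUND)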

Another important change was made to the training process. In the initial version, the model
was trained for 10 epochs and the model from the last epoch was always used for testing.
However, by that point the model had usually already overfitted (with the training accuracy
rising very close to 1 and the training loss dropping very close to 0, while the validation
accuracy and loss were still fluctuating).
To address this, a model checkpoint was added. It saved the model with the highest validation
accuracy. That way, the model with the best generalization power is saved and used later on
the test set. This does not introduce any bias, since the test set has not been seen during
the training process. However, it does introduce a bias when evaluating on the validation set,
since information about the performance on it has been used to choose the model.
Saving the best model also made it possible to train for more epochs without the risk of
testing an overfitted model.
The checkpoint was implemented using a callback that monitors the validation accuracy.
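
A minimal sketch of such a checkpoint using Keras' ModelCheckpoint (the file name, number of
epochs and data variable names are illustrative):

from tensorflow.keras.callbacks import ModelCheckpoint

# Keep only the model from the epoch with the highest validation accuracy.
checkpoint = ModelCheckpoint("best_model.h5",
                             monitor="val_accuracy",
                             mode="max",
                             save_best_only=True)

# Xtrain/Ytrain and Xval/Yval stand for the encoded training and devel data.
model.fit(Xtrain, Ytrain,
          validation_data=(Xval, Yval),
          batch_size=16, epochs=20,
          callbacks=[checkpoint])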

Experiments and results:

All of the experiments produced an accuracy between 99.55 % and 99.72 % on the devel set.
However, since that set was used by the model checkpoint to select the best model (and
prevent overfitting), there was no point in testing and reporting the accuracy and F1 score
on it. Hence, only the results for the test set are reported below.

As expected, the rise in accuracy on the devel set did somewhat correlate with the rise in F1
score on the test set, but not perfectly - sometimes a lower accuracy on the devel set gave a
better F1 score on the test set. That was expected because all sets contain some noise, so
even though the network never saw any data from the devel set during training, picking a
model according to a validation set indirectly fits it to the noise in that set.

Finally, since the results needed to be presented in the form in which the program outputs
them, they could not be averaged to eliminate the impact of randomness. Instead, each
configuration was run N = 5 times and the run with the best F1 score is reported here.
Using multiple runs reduces, but does not completely eliminate, the variance introduced by
randomness. Hence, the F1 scores were not followed blindly, but they provided guidance for
the direction of parameter tuning.
The initial configuration produced the following results on the test set. The max_len was
150, the number of LSTM units 200 and the dropout rate 0.1. The architecture was as described
at the beginning of the Architecture chapter, and the batch size was 32. The initial results
were as follows:
tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 46 1 328 47 374 97.9% 12.3% 21.9%
drug 1624 173 282 1797 1906 90.4% 85.2% 87.7%
drug_n 5 4 40 9 45 55.6% 11.1% 18.5%
group 531 159 156 690 687 77.0% 77.3% 77.1%
-----------------------------------------------------------------------
M.avg - - - - - 80.2% 46.5% 51.3%

It is visible that drug_n and brand produce the worst results and drag down the macro F1
score. The modifications described below raised these F1 scores.

Adding suffix lengths one smaller and one larger than the selected suffix length resulted in
a diminished F1 score (50.2 %).

Changing the suffix length to 4 increased the F1 score.


tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 60 5 314 65 374 92.3% 16.0% 27.3%
drug 1561 147 345 1708 1906 91.4% 81.9% 86.4%
drug_n 6 14 39 20 45 30.0% 13.3% 18.5%
group 517 92 170 609 687 84.9% 75.3% 79.8%
-----------------------------------------------------------------------
M.avg - - - - - 74.6% 46.6% 53.0%
Changing the max_len hyperparameter from 150 to 300 did not significantly impact the F1
score:
tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 58 3 316 61 374 95.1% 15.5% 26.7%
drug 1502 135 404 1637 1906 91.8% 78.8% 84.8%
drug_n 7 17 38 24 45 29.2% 15.6% 20.3%
group 523 106 164 629 687 83.1% 76.1% 79.5%
-----------------------------------------------------------------------
M.avg - - - - - 74.8% 46.5% 52.8%

Here, the max_len hyperparameter was changed back to 150 and the model checkpoint was added,
resulting in better generalization:
tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 55 0 319 55 374 100.0% 14.7% 25.6%
drug 1617 163 289 1780 1906 90.8% 84.8% 87.7%
drug_n 7 11 38 18 45 38.9% 15.6% 22.2%
group 538 101 149 639 687 84.2% 78.3% 81.1%
-----------------------------------------------------------------------
M.avg - - - - - 78.5% 48.4% 54.2%
A larger number of LSTM units was tested - setting it to 300 did not significantly change the
F1 score (54.1 %). Reducing the number of units to 150 diminished the F1 score to 53.3 %.

Different embedding output dimensions were tried. Increasing the word embedding
dimension from 150 to 300 and suffix embedding dimension from 50 to 100 did not
significantly change the F1 score (53.8 %).

A big improvement was achieved by reordering the layers, putting the dropout layer just
before the last layer, instead of between the embedding and biLSTM layer.
tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 73 6 301 79 374 92.4% 19.5% 32.2%
drug 1658 169 248 1827 1906 90.7% 87.0% 88.8%
drug_n 5 5 40 10 45 50.0% 11.1% 18.2%
group 556 139 131 695 687 80.0% 80.9% 80.5%
-----------------------------------------------------------------------
M.avg - - - - - 78.3% 49.6% 54.9%

Continuing with the architecture changes, a convolutional layer was introduced, with a kernel
size of 5. This was the change that most significantly increased the prediction capacity:
tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 197 98 177 295 374 66.8% 52.7% 58.9%
drug 1652 144 254 1796 1906 92.0% 86.7% 89.2%
drug_n 8 43 37 51 45 15.7% 17.8% 16.7%
group 566 116 121 682 687 83.0% 82.4% 82.7%
-----------------------------------------------------------------------
M.avg - - - - - 64.4% 59.9% 61.9%
This change increased the F1 score for the brand class most significantly of all the changes.

Adding another input, the lowercase form of the word, did not significantly change the F1
score (61.5 %). The same result was achieved by adding the range of suffix lengths (2-5).
However, both changes were kept, since adjusting the parameters of the layers could allow
this additional information to be utilized and increase the prediction capacity.

As expected, adding an indicator of the presence of the word in DrugBank significantly
improved the results:
tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 253 68 121 321 374 78.8% 67.6% 72.8%
drug 1658 103 248 1761 1906 94.2% 87.0% 90.4%
drug_n 6 19 39 25 45 24.0% 13.3% 17.1%
group 550 128 137 678 687 81.1% 80.1% 80.6%
-----------------------------------------------------------------------
M.avg - - - - - 69.5% 62.0% 65.2%
This change further increased the F1 score for the brand class significantly.

Adding an additional indicator of presence in HSDB (Hazardous Substances Data Bank) further
improved the results:
tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 260 24 114 284 374 91.5% 69.5% 79.0%
drug 1648 94 258 1742 1906 94.6% 86.5% 90.4%
drug_n 16 154 29 170 45 9.4% 35.6% 14.9%
group 578 126 109 704 687 82.1% 84.1% 83.1%
-----------------------------------------------------------------------
M.avg - - - - - 69.4% 68.9% 66.8%

A third improvement from the indicator features came from introducing indicators of dashes,
numbers and casing in the word (as described in the Input information chapter):
tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 230 23 144 253 374 90.9% 61.5% 73.4%
drug 1723 178 183 1901 1906 90.6% 90.4% 90.5%
drug_n 9 10 36 19 45 47.4% 20.0% 28.1%
group 574 132 113 706 687 81.3% 83.6% 82.4%
-----------------------------------------------------------------------
M.avg - - - - - 77.6% 63.9% 68.6%
It is interesting to notice that this is the first change to significantly increase the drug_n class
F1 score.

Different learning rates were tried, 0.05 and 0.005, which both produced significantly worse
results than the initial value of 0.001. With 0.05, an F1 score of 65.3 %, and with 0.005 an F1
score of 63.7 % were achieved.

Different kernel sizes were tested, with kernel size of 3 reducing the F1 score to 66.2 %, and
the kernel size of 7 not changing the F1 score significantly (68.5 %).

Different dropout rates were tried: 0.2, 0.3, 0.5 and 0.9. Interestingly, none of these
changes seemed to affect the F1 score significantly, not even a rate as large as 0.9. They
did, however, postpone overfitting to later epochs of training. Only a dropout rate of 0.99
diminished the prediction capacity.

Introducing 3 dropout layers (one after embedding, one after convolution, and one after the
biLSTM layer) did not seem to affect the prediction capacity either.

Larger batch sizes reduced the training time. However, they significantly reduced the
prediction capacity (a batch size of 128 reduced the F1 score to 64 %, and a size of 256
reduced it to 62.1 %).
A smaller batch size produced the best results of all the tests - a batch size of 16 gave the
following results:

tp fp fn #pred #exp P R F1
-----------------------------------------------------------------------
brand 213 11 161 224 374 95.1% 57.0% 71.2%
drug 1692 130 214 1822 1906 92.9% 88.8% 90.8%
drug_n 13 12 32 25 45 52.0% 28.9% 37.1%
group 548 99 139 647 687 84.7% 79.8% 82.2%
-----------------------------------------------------------------------
M.avg - - - - - 81.2% 63.6% 70.3%

It is visible that even the drug_n class F1 score increased to around 37 %, with all the
other classes being above 70 %. The F1 scores of the drug and group classes also increased,
though not as significantly as those of the other two classes. The best prediction capacity
was for the drug class, which was expected because of its relative frequency in all the
datasets.

In the end, the final architecture combined all of the retained changes described above.
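
In summary, it joined the word, lowercase-word and suffix (lengths 2-5) embeddings with the
indicator inputs, followed by the convolutional layer (kernel size 5), the bidirectional LSTM
with 200 units, a dropout layer just before the output, and the output layer, trained with
Adam (learning rate 0.001) and a batch size of 16. A condensed Keras sketch, assuming a
time-distributed softmax output and with assumed vocabulary sizes, filter count, number of
indicators and dropout rate, is:

from tensorflow.keras.layers import (Input, Embedding, Dropout, Conv1D, LSTM,
                                     Bidirectional, Dense, TimeDistributed, concatenate)
from tensorflow.keras.models import Model
from tensorflow.keras.optimizers import Adam

max_len = 150                        # from the report; sizes below are placeholders
n_words, n_lc_words, n_sufs, n_labels = 4000, 4000, 2000, 10

# Index-based inputs: word form, lowercased form, suffixes of lengths 2-5.
inp_word = Input(shape=(max_len,))
inp_lc = Input(shape=(max_len,))
inp_sufs = [Input(shape=(max_len,)) for _ in range(4)]
# Indicator inputs (DrugBank type, HSDB, digits, dashes, casing), no embedding.
inp_ind = Input(shape=(max_len, 7))

emb_word = Embedding(n_words, 150)(inp_word)
emb_lc = Embedding(n_lc_words, 150)(inp_lc)
emb_sufs = [Embedding(n_sufs, 50)(i) for i in inp_sufs]

x = concatenate([emb_word, emb_lc] + emb_sufs + [inp_ind])
x = Conv1D(filters=100, kernel_size=5, padding="same", activation="relu")(x)
x = Bidirectional(LSTM(200, return_sequences=True))(x)
x = Dropout(0.1)(x)                  # dropout moved to just before the output layer
out = TimeDistributed(Dense(n_labels, activation="softmax"))(x)

model = Model([inp_word, inp_lc] + inp_sufs + [inp_ind], out)
model.compile(optimizer=Adam(learning_rate=0.001),
              loss="categorical_crossentropy", metrics=["accuracy"])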

Conclusion:

The biggest improvements in the prediction capacity came from introducing a convolutional
layer, indicators of presence in external drug databases, and indicators of special
characters. In addition, noticeable improvements were made by adding a model checkpoint and
modifying the batch size.
This shows that designing a neural network is not a simple task focused on just one aspect of
the network; all the different aspects have to be considered - architecture, input
information and the training process.

An F1 score of 70.3 % seems satisfactory, but it has to be compared with the results from the
previous exercises. The same task that the neural network performed here was previously
solved by a CRF algorithm, using the same input information, and the CRF performed better -
an F1 score of 72.8 % was achieved.
This does not imply that neural networks are inferior to traditional machine learning
approaches - on the contrary, they are a very powerful tool which can outperform the
traditional approaches in certain cases. However, the traditional approaches are still
relevant, especially in cases where the training corpus is relatively small.
