…crease model certainty, this would be undesirable. Therefore, the useful examples to query on are those with a mismatch between the level of annotator uncertainty and model uncertainty, rather than simply those where the model is most uncertain.

Every time DAAL queries for more annotations, a new fθ is trained from scratch, and the procedure is repeated until the annotation budget is exhausted. If needed, DAAL may query the same examples multiple times, but it is not required to waste the annotation budget on examples where all useful information is learned after one annotation (or zero). When the annotator entropy is zero (i.e., all annotators agree on a single label), DAAL reduces to simple uncertainty sampling. As in the case of entropy-based active learning, retraining fθ and recomputing model entropy after every sample is computationally expensive, so in practice, we retrain and re-evaluate only after every 10 rounds.
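A minimal sketch of this query loop, assuming the |H(fθ(x)) − fent(x)| mismatch score described above; train_task_model, predict_annotator_entropy, and query_annotation are hypothetical placeholders (in a simulated setup, query_annotation would return one of the stored annotations at random), not the paper's implementation:

```python
import numpy as np

def daal_query_loop(pool, budget, train_task_model, predict_annotator_entropy,
                    query_annotation, retrain_every=10):
    """Illustrative restatement of the loop above; every callable here is a placeholder."""
    labeled = []        # (example, annotation) pairs collected so far
    spent = 0
    while spent < budget:
        # A fresh task model f_theta is trained from scratch each round; retraining and
        # re-scoring after every single query is expensive, so we re-score only once per
        # block of `retrain_every` queries.
        f_theta = train_task_model(labeled)
        model_entropy = np.array([f_theta.predictive_entropy(x) for x in pool])  # H(f_theta(x))
        annot_entropy = np.array([predict_annotator_entropy(x) for x in pool])   # f_ent(x)
        # Query where model and predicted annotator uncertainty mismatch the most; if
        # annotator entropy is zero everywhere, this reduces to plain uncertainty sampling.
        scores = np.abs(model_entropy - annot_entropy)
        k = min(retrain_every, budget - spent)
        for idx in np.argsort(-scores)[:k]:
            x = pool[idx]
            labeled.append((x, query_annotation(x)))  # the same x may be re-queried in later rounds
            spent += 1
    return train_task_model(labeled)
```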
4 Experimental Setup

In this section, we introduce the datasets we use and experimental details.

4.1 Datasets

We conduct experiments in simulation by starting with datasets with multiple annotations per example and returning one of these at random when the oracle is called. We choose two datasets with multiple labels for each attribute: Measuring Hate Speech (MHS) (Sachdeva et al., 2022) and Wikipedia Talk (Wulczyn et al., 2017); basic data statistics are summarized in Table 1.

Table 1: Basic data statistics for the MHS and Wikipedia attributes used in our experiments.

                                           Measuring Hate Speech              Wikipedia
Characteristics                         Respect  Dehumanize  Genocide   Toxicity  Toxicity-5
Number of Total Examples                 17,282      17,282    17,282     20,000      20,000
Avg Number of Annotations per Example      3.35        3.35      3.35       10.0         5.0
Number of Examples Test Set               1,778       1,778     1,778      2,000       2,000
Probability Two Annotators Disagree       0.520       0.689     0.371      0.524       0.522

The MHS dataset was collected from YouTube, Twitter, and Reddit examples. It has nine scale attributes that contribute to their definition of hate speech, from which we select three for our experiments: Dehumanize (which has high levels of human disagreement), Respect (which has medium levels), and Genocide (which has low levels). Each attribute is labeled for every example on a five-point Likert scale from strongly disagree to strongly agree. There are 50k examples, each of which is annotated between 1 and 6 times in the main dataset (see Figure 17); for our simulated experiments we only consider those with 3–6 annotations, resulting in around 20k total examples.

The Wikipedia dataset was created as a result of the Wikipedia Detox Project [3]. It has three attributes of which we select one for experiments—Toxicity—which is also rated on a five-point Likert scale from very toxic to very healthy. This data consists of 100k examples with 10 annotations per example in almost all cases; we randomly downselect to 20k examples for congruity with MHS.

[3] https://meta.wikimedia.org/wiki/Research:Detox/Data_Release

4.2 Experimental Details

We measure the classifier's performance according to Jensen-Shannon divergence (JS), defined in Eq 2 [4]. We introduce an oracle trained on the full dataset for each task to calibrate model performance against the best possible.

[4] We additionally report total variational distance as well as Macro F1 and accuracy in the Appendix.
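Eq 2 is not reproduced here, but as a sketch under the assumption that it is the standard Jensen-Shannon divergence (base 2) between the model's predicted distribution and the empirical distribution of annotator votes, the evaluation could look like this; vote_distribution and js_divergence are our own helper names:

```python
import numpy as np
from scipy.spatial.distance import jensenshannon

def vote_distribution(annotations, num_classes=5):
    """Empirical distribution of annotator votes for one example (Likert classes 0..4)."""
    counts = np.bincount(annotations, minlength=num_classes).astype(float)
    return counts / counts.sum()

def js_divergence(p, q):
    # scipy returns the JS *distance* (the square root of the divergence); square it back.
    return jensenshannon(p, q, base=2) ** 2

# Example: a model prediction vs. three annotator votes on a five-point scale.
pred = np.array([0.10, 0.20, 0.40, 0.20, 0.10])
human = vote_distribution(np.array([2, 2, 3]))
print(js_divergence(pred, human))
```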
For each method, we finetune RoBERTa-base (Liu et al., 2020). We finetune the task model each round from scratch, which worked better than continuing training in preliminary experiments. We use early stopping with a tolerance of 1 based on the KL divergence between the model's predicted distribution and the distribution of annotator votes on a held-out set, training for a maximum of 50 epochs. For DAAL's entropy predictor, we also finetune a RoBERTa-base model and use early stopping with a tolerance of 5 based on the mean squared error on the held-out set.

Each experiment's result is averaged over 5 runs, and we present 95% confidence intervals based on these runs. For all algorithms, we disallow querying on examples where all available annotations are already in the training set [5].

[5] This issue only arises in simulation: in a real condition, one could always query more. In practice, we found that re-annotation queries were not frequent enough to raise concerns.

5 Results and Analysis

In this section, we present results for baseline methods (§5.1) and DAAL (§5.2). We also investigate how the budget size and the number of annotations per example affect the entropy predictor's performance (§5.3). In addition, we discuss in which situations DAAL chooses to re-annotate examples rather than query new ones (§5.4).
[Figure 2: two panels, "Dehumanize (High Disagreement)" and "Genocide (Low Disagreement)"; y-axis: Jensen-Shannon divergence, x-axis: Average Number of Annotations; legend: Passive (Batched/Single, 3), AL H(f) (Batched/Single, 3), Oracle (full data).]

Figure 2: JS divergence scores for two attributes from the MHS dataset for passive learning baselines and entropy-based active learning (AL) baselines. For these experiments, we define N ≈ 3, which means that there are approximately 3 annotations per example available in the data pool. Both baselines have two variations when querying: "Batched" receives all 3 annotations per example while "Single" receives only one.
Figure 3: Jensen-Shannon divergence vs the number of required annotations. The lines in red show DAAL's improvement in the number of annotations. They connect the first measurement where DAAL was within 5% of its best JS to the point where the baseline achieves the same performance (if available). We compare DAAL with the empirically determined best budget size (see §5.3) and best performing baseline. We show in the legend labels whether the task model receives single or batched annotations for queried examples, the number of available annotations per example, and (for DAAL) the size of the entropy predictor's budget in annotations. The x-axis includes the annotations in the entropy predictor's budget.
Overall, DAAL reaches comparable performance with fewer annotations on average than all baselines. Other approaches might achieve the same performance but require at least 26% more annotations on average. For instance, DAAL achieves 0.225 JS divergence for the Dehumanize attribute after approximately 566 annotations, while the best baseline needs 1022 annotations to achieve the same performance (80% more). The one exception is on the Toxicity dataset, which we explore in §5.3.

In some cases, as with the Genocide attribute, the baseline models never get to the same performance as DAAL. We observe no strong pattern for DAAL working better or worse for high versus low disagreement attributes, suggesting that it is a "safe" option that can also be used in more traditional learning settings where there may not be much disagreement.

[7] Beyond the two simple active and passive learning baselines discussed in §3.3 and §3.4, we also considered BADGE (Ash et al., 2020), an active learning method that samples a diverse set of uncertain examples to annotate based on the magnitude of the gradient in the final hidden layer. Using BADGE's default hyperparameters and with 200 epochs per round (vs a limit of 50 for DAAL and the other baselines), we found that with both BERT and RoBERTa, BADGE never outperformed our other baselines on datasets with annotator disagreement. For example, the final JS divergence of BADGE was 28% worse than the strongest baseline on MHS Respect, and 7% worse on MHS Dehumanize.

5.3 Size of the Entropy Budget, Bent

We explore different budgets for the annotator entropy predictor described in §3.5. We experiment with budgets of 25, 100, and 200 examples on MHS Respect. Since the entropy predictor must be trained on multiply-annotated examples, our goal is to ensure it can be trained with a very small budget. The comparison of performances is shown in Figure 4. In general, we see that the entropy predictor can, indeed, be learned with relatively few examples and that a budget of 100 examples is near optimal. We confirm that this finding extends to the Toxicity dataset in §A.4.

In §5.2, we noted a situation on the Toxicity dataset when DAAL performs slightly worse (requires about 4% to 11% more annotations) than entropy-based active learning (Table 2). This dataset has markedly more annotations per example (Table 1), which is an artifact of the simulation used for the experiment. For a direct comparison, we repeat this experiment where we fix the total number of annotations to smaller values. Results are shown in Figure 5.
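As a rough sketch of what the entropy predictor is fit to: each multiply-annotated seed example contributes one regression target, the entropy of its empirical vote distribution, and the predictor is trained against these targets with MSE (per §4.2). The helper name and toy votes below are ours:

```python
import numpy as np

def annotator_entropy_target(annotations, num_classes=5):
    """Entropy of the empirical vote distribution, used as the regression target for f_ent."""
    counts = np.bincount(annotations, minlength=num_classes).astype(float)
    p = counts / counts.sum()
    nz = p > 0
    return float(-np.sum(p[nz] * np.log(p[nz])))

# A small seed of multiply-annotated examples (e.g., ~100 examples with N = 3 votes each)
# yields one target per example; the predictor is then fit to these targets with MSE.
seed_votes = {"example A": [0, 0, 0], "example B": [1, 3, 4]}
print({k: round(annotator_entropy_target(np.array(v)), 3) for k, v in seed_votes.items()})
# {'example A': 0.0, 'example B': 1.099}
```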
[Figure 4: Jensen-Shannon Divergence vs Average Number of Annotations on MHS Respect; legend: Oracle (full data); Passive, Single 3; AL H(f), Single 3; AL |H(f) − fent|, Single 3, with budgets 75, 300, and 600. Figure 5: MSE of the entropy predictor vs Annotations per Example on Toxicity; legend: annotation budgets max, 100, 200, 300, 1000.]

Figure 4: Comparison of JS Divergence when using different budgets for annotator entropy predictors described in §3.5 on the MHS Respect attribute. We compare budgets of 25, 100, and 200 examples with pre-collected annotations. For MHS (N = 3), this translates to budget sizes of 75, 300, and 600 annotations.

Figure 5: Entropy predictor performance on Toxicity on varying the total annotation budget and the number of annotations per example. We find that decreasing the annotations per example to 5 and the budget to 200 is generally sufficient.
We see that having more annotations per example gives better performance on the entropy predictor. (We show task model results for DAAL with 3, 5, and 10 annotations per example in §A.4.) We notice that the optimal number of annotations is 5 per example, which suggests 5 might be a reasonable cap for the maximum number of times a single example could be queried in a real-world deployment.

5.4 fent vs H(fθ) and Re-annotation Strategy

DAAL chooses examples to query based on the absolute difference between model and annotator entropy (see §3.5). This means that the model can select two kinds of examples depending on which term is larger. When H(fθ) > fent, the model is unsure of the correct label but predicts that annotators will agree on the label. When fent > H(fθ), the model is overconfident in its label prediction given its prediction of annotator agreement levels.
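To make the two regimes concrete, the following sketch scores a pool of candidates and flags which term dominates for each; the array names are ours and this is an illustration, not the paper's code:

```python
import numpy as np

def split_query_regimes(model_probs, annot_entropy):
    """Score pooled examples and flag which term of |H(f_theta) - f_ent| dominates.

    model_probs:   (n, k) array of predicted class distributions from the task model
    annot_entropy: (n,)   array of predicted annotator entropies f_ent(x)
    """
    eps = 1e-12
    model_entropy = -np.sum(model_probs * np.log(model_probs + eps), axis=1)  # H(f_theta(x))
    scores = np.abs(model_entropy - annot_entropy)
    model_unsure  = model_entropy > annot_entropy  # model unsure, annotators expected to agree
    overconfident = annot_entropy > model_entropy  # model confident despite expected disagreement
    return scores, model_unsure, overconfident

probs = np.array([[0.2, 0.2, 0.2, 0.2, 0.2],        # very uncertain model prediction
                  [0.9, 0.05, 0.03, 0.01, 0.01]])    # very confident model prediction
print(split_query_regimes(probs, np.array([0.1, 1.2])))
```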
In Figure 6, we consider which of these two kinds of examples the model is querying on at different points in learning. We find that our model begins by querying overwhelmingly on cases with H(fθ) > fent but that the reverse is true later in training. This can be interpreted as beginning with "easy" examples where annotators are likely to agree and then choosing examples with higher disagreement later to correct overconfidence.

We also consider how often DAAL re-annotates an already annotated example. In Figure 7, we see that early in training, DAAL mostly chooses to query on new examples, but in the second half, about 2/3 of annotations are re-annotations.

Combining this change in re-annotation rate with the change in which term dominates the query function, we can see a clearer strategy. Early in training, when the model is focusing on examples with low fent, there is no need to query for multiple labels. Once the model starts considering more examples with high fent, re-annotations become necessary to better capture the annotator distribution. These re-annotations are largely not given to examples with low fent, as these are not likely to require more than one annotation.

[Figures 6 and 7: % re-annotations over the course of training on Toxicity; legend for Figure 7: Passive, Single 10; AL H(f), Single 10; AL |H(f) − fent|, Single 10, 1000.]

Figure 6: Re-annotation rate and fent vs H(fθ) strategy for DAAL on Toxicity. Like Figure 7, the re-annotation rate increases over time (green). Additionally, the selection strategy goes from choosing mostly examples where fent(x) ≤ H(fθ(x)) to choosing the opposite (blue). Later in training, these increased re-annotations largely go to examples where fent(x) > H(fθ(x)) (red).

Figure 7: Re-annotation rate for single annotation strategies on Toxicity. We find that our method has a consistently higher re-annotation rate than the baselines and that the rate increases over time.
6 Conclusion

In this paper, we emphasize the importance of accounting for disagreement present in data. We propose DAAL, an active learning approach which incorporates both annotator and model uncertainties, aiming to reduce the cost of annotation. This cost includes both time and money, but also an often overlooked cost related to the repeated exposure of annotators to toxic and harmful content. When the annotation is performed on crowdsourcing platforms, where workers are often from vulnerable populations who may require more flexible employment options—such as those with disabilities or who have caregiver roles (Berg, 2016)—this mental health cost compounds existing marginalization.

In our experiments on training classifiers for hate speech and toxicity detection, we show that DAAL achieves Jensen-Shannon divergence comparable to the classic baselines' performance but requires an average of 1.235× fewer annotations in the worst case. It is also equally effective when there is little annotator disagreement, making it a strong general solution candidate even when one does not know ahead of time how much annotator disagreement is likely for a given task.

7 Limitations

There are several limitations to our experiments: we work only with English data and with datasets concerning hate speech and toxicity. Frequently such data do not represent i.i.d. samples from the data that we might encounter in real life. In addition, experiments are all conducted in simulation with these existing datasets. The annotations in the simulated experiments were already checked for quality by the original dataset creators (Sachdeva et al., 2022; Wulczyn et al., 2017). In real-world deployment, further steps would need to be taken to ensure that the entropy in annotations truly comes from disagreements and not other kinds of noise.

While DAAL is designed to capture disagreement due to annotator positionalities, the datasets used may not have had a diverse enough pool of annotators to fully test this. In the portion of the MHS dataset used in our experiments, 67.9% of annotators were cisgender, straight, and white, while only 0.4% of examples targeted this same population. The Wikipedia Talk dataset does not provide demographic information about its annotators.

A classifier for toxic text or hate speech trained on a pool of annotators whose backgrounds do not reflect anywhere near the full diversity of human identities (and especially the identities of the targets of the text being classified) is inherently limited. Applying such a classifier, whether it predicts a single label or a distribution, to text from and about marginalized populations not represented in the annotator pool carries inherent risks to the well-being of these populations. Such a classifier could systematically fail to flag content that annotators from privileged groups do not find harmful or incorrectly flag innocuous speech written by members of marginalized groups.

8 Acknowledgements

The authors are grateful to all the reviewers who have provided helpful suggestions to improve this work, and thank members of the CLIP lab at the University of Maryland for their support on this project.

References

Sohail Akhtar, Valerio Basile, and Viviana Patti. 2021. Whose opinions matter? Perspective-aware models to identify opinions of hate speech victims in abusive language detection.

Dina Almanea and Massimo Poesio. 2022. ArMIS - the Arabic misogyny and sexism corpus with annotator subjective disagreements. In Proceedings of the Thirteenth Language Resources and Evaluation Conference, pages 2282–2291, Marseille, France. European Language Resources Association.
Lora Aroyo and Chris Welty. 2013. Crowd truth: Harnessing disagreement in crowdsourcing a relation extraction gold standard. WebSci2013. ACM.

Jordan T. Ash, Chicheng Zhang, Akshay Krishnamurthy, John Langford, and Alekh Agarwal. 2020. Deep batch active learning by diverse, uncertain gradient lower bounds. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020.

Joris Baan, Wilker Aziz, Barbara Plank, and Raquel Fernandez. 2022. Stop measuring calibration when humans disagree. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1892–1915, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Valerio Basile, Michael Fell, Tommaso Fornaciari, Dirk Hovy, Silviu Paun, Barbara Plank, Massimo Poesio, and Alexandra Uma. 2021. We need to consider disagreement in evaluation. In Proceedings of the 1st Workshop on Benchmarking: Past, Present and Future, pages 15–21, Online. Association for Computational Linguistics.

Janine Berg. 2016. Income security in the on-demand economy: Findings and policy lessons from a survey of crowdworkers. Comparative Labor Law & Policy Journal, 37(3).

Aron Culotta and Andrew McCallum. 2005. Reducing labeling effort for structured prediction tasks. In AAAI, volume 5, pages 746–751.

Aida Mostafazadeh Davani, Mark Díaz, and Vinodkumar Prabhakaran. 2022. Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10:92–110.

Xinyue Dong, Shilin Gu, Wenzhang Zhuge, Tingjin Luo, and Chenping Hou. 2021. Active label distribution learning. Neurocomputing, 436:12–21.

Tommaso Fornaciari, Alexandra Uma, Silviu Paun, Barbara Plank, Dirk Hovy, and Massimo Poesio. 2021. Beyond black & white: Leveraging annotator disagreement via soft-label multi-task learning. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2591–2597, Online. Association for Computational Linguistics.

Steve Hanneke. 2014. Theory of disagreement-based active learning. Foundations and Trends® in Machine Learning, 7(2-3):131–309.

Ashish Khetan, Zachary C Lipton, and Animashree Anandkumar. 2018. Learning from noisy singly-labeled data. In International Conference on Learning Representations.

Manfred Klenner, Anne Göhring, and Michael Amsler. 2020. Harmonization sometimes harms. In Proceedings of the 5th Swiss Text Analytics Conference (SwissText) & 16th Conference on Natural Language Processing (KONVENS).

Savannah Larimore, Ian Kennedy, Breon Haskett, and Alina Arseniev-Koehler. 2021. Reconsidering annotator disagreement about racist language: Noise or signal? In Proceedings of the Ninth International Workshop on Natural Language Processing for Social Media, pages 81–90, Online. Association for Computational Linguistics.

Elisa Leonardelli, Stefano Menini, Alessio Palmero Aprosio, Marco Guerini, and Sara Tonelli. 2021. Agreeing to disagree: Annotating offensive language datasets with annotators' disagreement. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 10528–10539, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.

David D. Lewis. 1995. A sequential algorithm for training text classifiers: Corrigendum and additional data. SIGIR Forum, 29(2):13–19.

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2020. RoBERTa: A robustly optimized BERT pretraining approach.

Jennimaria Palomaki, Olivia Rhinehart, and Michael Tseng. 2018. A case for a range of acceptable annotations. In Workshop on Subjectivity, Ambiguity and Disagreement in Crowdsourcing.

Ellie Pavlick and Tom Kwiatkowski. 2019. Inherent disagreements in human textual inferences. Transactions of the Association for Computational Linguistics, 7:677–694.

Barbara Plank. 2022. The "problem" of human label variation: On ground truth in data, modeling and evaluation. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 10671–10682, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.

Maria E. Ramirez-Loaiza, Manali Sharma, Geet Kumar, and Mustafa Bilgic. 2017. Active learning: An empirical study of common baselines. Data Mining and Knowledge Discovery, 31(2).

Pratik Sachdeva, Renata Barreto, Geoff Bacon, Alexander Sahn, Claudia von Vacano, and Chris Kennedy. 2022. The Measuring Hate Speech corpus: Leveraging Rasch measurement theory for data perspectivism. In Proceedings of the 1st Workshop on Perspectivist Approaches to NLP @LREC2022, pages 83–94, Marseille, France. European Language Resources Association.

Maarten Sap, Swabha Swayamdipta, Laura Vianna, Xuhui Zhou, Yejin Choi, and Noah A. Smith. 2022. Annotators with attitudes: How annotator beliefs and identities bias toxic language detection. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5884–5906, Seattle, United States. Association for Computational Linguistics.

Burr Settles. 2009. Active learning literature survey. Computer Sciences Technical Report 1648, University of Wisconsin–Madison.

Anna Sotnikova, Yang Trista Cao, Hal Daumé III, and Rachel Rudinger. 2021. Analyzing stereotypes in generative text inference tasks. In Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, pages 4052–4065, Online. Association for Computational Linguistics.
[Figure 8: Macro F1, Accuracy, JS Divergence, and TVD vs Average Number of Annotations for MHS Dehumanize (high disagreement) and MHS Genocide (low disagreement).]
Figure 8: Comparison of passive and active learner baselines on a high and low disagreement MHS attribute.

On the high disagreement attribute, Dehumanize, we see that passive learning still outperforms active learning when using accuracy-based measures, Macro F1 and Accuracy, though the effect is more subtle than with the distribution-based measures, JS Divergence and TVD.

For the low disagreement attribute, Genocide, we see that passive learning achieves the same performance as active learning in fewer annotations when considering Accuracy, JS Divergence, and TVD. For Macro F1, we see a much stronger trend, with the performance of the passive learner plateauing before the active learner. Noting how quickly all baselines achieved high accuracies, we argue that these trends are caused by the heavy class imbalance in the Genocide attribute, which is heavily skewed to non-genocidal examples (see §A.5).
[Figure 9: two panels, (a) querying all available annotations and (b) querying single annotations; x-axis: Average Number of Annotations.]
Figure 9: Standard training vs training on only examples with full annotator agreement on MHS Respect.
To more directly investigate the effect of the level of disagreement on baseline model performance, we
consider alternative train sets containing only examples with full annotator agreement. In other words, we
use a subset of the original unlabeled data where all N available annotations have the same label value y.
When querying for all available annotations (Figure 9a), the passive learner outperforms the active
learner when they have access to the full training set. When they can only access training examples with
full annotator agreement, the relationship is reversed.
When querying for single annotations at a time (Figure 9b), we still find that the passive learner
performs better on the full training set. Using the training set with full annotator agreement, the active
learner performs better earlier in training, but the final performance is not significantly different.
These results further show that model entropy alone isn't a good metric when humans disagree, which is why the passive approach, which simply picks at random, performs better than the active learner.
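A minimal sketch of how such a full-agreement subset could be selected (the helper name and toy data are ours):

```python
def full_agreement_subset(dataset):
    """Keep only examples whose available annotations all share the same label."""
    return [(text, votes) for text, votes in dataset if len(set(votes)) == 1]

data = [("a", [2, 2, 2]), ("b", [1, 3, 3]), ("c", [0, 0, 0, 0])]
print(full_agreement_subset(data))   # [('a', [2, 2, 2]), ('c', [0, 0, 0, 0])]
```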
[Figure 10: panels (a) Macro F1 and (b) Jensen-Shannon Divergence vs Average Number of Annotations.]
Figure 10: Comparison of training on hard labels via majority vote vs soft labels with N annotations on MHS Respect.

For both distribution-based and accuracy-based metrics, we see a significant disadvantage when using hard labels. Considering Macro F1 (Figure 10a), using majority votes decreases the performance of the passive and active learners by 7.43% and 10.6% respectively. Considering Jensen-Shannon Divergence (Figure 10b), using majority votes decreases the performances by 6.25% and 14.4% respectively.

For both metrics, we see that by the end of training, the use of soft vs hard labels, not the querying method, determines which methods will be most successful. We see that the active batched model (weaker than its passive counterpart) does as well as or better than the passive majority-vote model. This confirms that aggregating annotations by majority vote can hurt performance when annotators disagree.
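As a minimal illustration of the two labeling schemes compared here, a soft label keeps the full vote distribution while a majority vote collapses it to one class; the helper names are ours, not the paper's:

```python
import numpy as np

def soft_label(annotations, num_classes=5):
    counts = np.bincount(annotations, minlength=num_classes).astype(float)
    return counts / counts.sum()        # target distribution that preserves disagreement

def majority_vote(annotations):
    return int(np.bincount(annotations).argmax())   # ties break toward the lower label here

votes = np.array([1, 2, 2, 3, 4])
print(soft_label(votes))      # [0.  0.2 0.4 0.2 0.2]
print(majority_vote(votes))   # 2
```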
A.3 DAAL Improvements on Accuracy, Macro F1, TVD, JS Divergence
In this section, we show the full graphs of the JS Divergence results listed in Table 2 as well as for
accuracy, macro F1, and total variational distance.
In Figure 11, we compare to the active learning baselines. For the MHS datasets, this tended to be
the weaker baseline, with DAAL strongly outperforming both baselines on distribution-based metrics.
Results on accuracy-based metrics were weaker on average, especially for Genocide. We see similar
trends with Toxicity-5, though the JS Divergence is slightly worse on average at the optimal point.
[Figure 11: Macro F1 (↑), Accuracy (↑), Jensen-Shannon Divergence (↓), and Total Variational Distance (↓) vs Average Number of Annotations for MHS Respect, MHS Dehumanize, Wiki-Toxicity, and Wiki-Toxicity-5.]
Figure 11: Comparison of DAAL (green, purple, or pink based on annotations per example) and entropy-based active learning (orange). The lines in red show DAAL's improvement in number of annotations. They connect the first measurement where DAAL was within 5% of its best performance to the point where the batched active learning baseline achieves the same performance (if available).
[Figure 12: Macro F1, Accuracy, Jensen-Shannon Divergence, and Total Variational Distance vs Average Number of Annotations for the MHS attributes (including Genocide) and Wiki-Toxicity/Wiki-Toxicity-5.]
Figure 12: Comparison of DAAL (green, purple, or pink based on annotations per example) and passive learning (blue). The lines in red show DAAL's improvement in number of annotations. They connect the first measurement where DAAL was within 5% of its best performance to the point where the batched passive learning baseline achieves the same performance (if available).
Here, we continue §5.3's discussion of the effects of budget sizes and annotations per example. In Figure 5, we showed how the entropy predictor's performance on Toxicity does not significantly degrade until fewer than 5 annotations per example are available. In Figure 13, we can see that the passive learner with 5 annotations per example sees a performance decrease; however, the baselines' overall performance did not drop significantly. On the other hand, in Figure 13b, we can see that the effect of decreasing to 3 annotations per example is much more significant.
[Figure 13: panels (a) and (b); x-axis: Average Number of Annotations.]
Figure 13: Baseline Toxicity results varying the number of annotations per example. We find that decreasing the annotations to 5 per example causes a small decrease in performance. Decreasing to 3 (a similar amount to MHS) significantly decreases the performance of the Batch AL model.
[Figure 14: Jensen-Shannon Divergence vs Average Number of Annotations on Toxicity; legend: Oracle (full data); AL H(f), Single 10; Passive, Single 10; AL |H(f) − fent|, Single 10 with budgets 250, 1000, and 2000; AL |H(f) − fent|, Single 5, 200; AL |H(f) − fent|, Single 3, 300.]
Figure 14: Comparison of performances on Toxicity when using different budgets for annotator entropy predictors described in §3.5.
We find similar trends in DAAL when decreasing the number of annotations per example in Figure 14. When we compare DAAL and entropy-based active learning using different numbers of annotations per example (Figure 15), we find a small trend of DAAL performing better relative to the baseline when the number of annotations per example is small, especially with as few annotations as MHS.
[Figure 15: three panels of Jensen-Shannon divergence vs Average Number of Annotations, one per number of annotations per example.]
Figure 15: DAAL vs AL H(fθ ) Single (orange) on varied annotations per example. On average DAAL can
perform slightly worse than the baseline when the number of potential annotations is high.
(a) MHS Dehumanize (b) MHS Respect (c) MHS Genocide (d) Wikipedia Toxicity
Figure 16: Label distributions for MHS and Wikipedia Toxicity datasets.
Here, we have diverse settings. For instance, Genocide has the lowest level of disagreement between two random annotators (see Table 1), and we can see that the majority of labels concentrate on two label values, with most examples being non-genocidal. The Respect and Toxicity attributes have approximately the same level of disagreement, with almost a 50% chance that two random annotators disagree. However, the distributions are quite different. The Toxicity label distribution has mostly two labels in use: neutral and toxic. This is similar to Genocide, with the majority votes distributed between two labels: "strongly disagree" and "disagree" that the text relates to genocide. The Respect attribute has annotations distributed across all labels, forming a left-skewed distribution and showing a wider range of perspectives on this attribute. Dehumanize has the highest disagreement level. There is almost a 70% chance of two annotators disagreeing, and the label distribution is almost uniform. This shows that there are enough examples that are seen differently by annotators (see Table 1).
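The "Probability Two Annotators Disagree" statistic from Table 1 can be computed per example from the vote counts; the sketch below assumes two annotators drawn without replacement, and how the dataset-level numbers aggregate over examples is our assumption:

```python
import numpy as np

def prob_two_annotators_disagree(label_counts):
    """P(two annotators sampled without replacement give different labels) for one example."""
    counts = np.asarray(label_counts, dtype=float)
    n = counts.sum()
    p_same = np.sum(counts * (counts - 1)) / (n * (n - 1))
    return 1.0 - p_same

# Three annotators voting [2, 2, 3] -> counts per label value:
print(prob_two_annotators_disagree([0, 0, 2, 1, 0]))   # 0.666...
```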
[Figure 17: histogram of Number of Examples vs Number of Annotations per Example (3–6).]
Figure 17: Annotations per example in the portion of the MHS dataset we use. This excludes reference-set examples (with > 200 annotations) and examples with fewer than 3 annotations.
The original MHS dataset contains both a reference set containing examples with more than 200
annotations per example and a larger set of examples with 1-6 annotations. As we discussed in §4.1, we
use in our experiments a subset of the MHS dataset with 3-6 annotations (with an average of 3.35). The
distribution of annotations per example in the data used in our experiments is shown in Figure 17.