Ece Takmaz◁∗, Nicolò Brandizzi⋄∗, Mario Giulianelli◁, Sandro Pezzelle◁, Raquel Fernández◁
◁University of Amsterdam  ⋄Sapienza University of Rome
{ece.takmaz|m.giulianelli|s.pezzelle|raquel.fernandez}@uva.nl
brandizzi@diag.uniroma1.it
Table 5: Speaker results on the test set as measured by common natural language generation evaluation metrics.

Domain      Epoch   IND Accuracy   IND MRR        OOD Accuracy   OOD MRR
Appliances  23      84.12 ± 0.33   90.27 ± 0.10   20.28 ± 0.23   44.07 ± 0.11
Food        21      85.40 ± 0.28   91.20 ± 0.20   17.72 ± 0.18   42.42 ± 0.06
Indoor      14      82.94 ± 0.13   89.32 ± 0.09   19.14 ± 0.09   43.46 ± 0.06
Outdoor     19      83.96 ± 0.23   90.01 ± 0.14   19.64 ± 0.07   43.52 ± 0.06
Vehicles    17      78.99 ± 0.35   86.81 ± 0.14   18.46 ± 0.28   42.36 ± 0.20

Table 6: Listener performance on gold utterances. Accuracy and MRR for the in-domain (IND) and out-of-domain (OOD) samples given to listeners trained on specific domains (indicated under the ‘Domain’ column).
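The Accuracy and MRR numbers in Table 6 follow the standard definitions for a listener that ranks candidate images by score. As a minimal illustrative sketch (not the paper's code; the function and variable names are hypothetical):

```python
# Hypothetical sketch of the two listener metrics: Accuracy is the fraction
# of samples where the target image is ranked first, and MRR is the mean
# reciprocal rank (1-indexed) of the target among the candidates.
def accuracy(ranked_lists, targets):
    """Fraction of samples whose top-ranked candidate is the target."""
    return sum(r[0] == t for r, t in zip(ranked_lists, targets)) / len(targets)

def mrr(ranked_lists, targets):
    """Mean reciprocal rank of the target image."""
    return sum(1.0 / (r.index(t) + 1)
               for r, t in zip(ranked_lists, targets)) / len(targets)

# Toy example: 3 samples, candidate images ordered by listener score.
ranks = [["img2", "img1", "img3"],
         ["img1", "img2", "img3"],
         ["img3", "img1", "img2"]]
gold = ["img2", "img2", "img3"]
print(accuracy(ranks, gold))  # 2/3: the target is top-ranked in 2 of 3 samples
print(mrr(ranks, gold))       # (1 + 1/2 + 1) / 3
```

MRR is always at least as large as Accuracy, which matches the pattern in Table 6 (e.g. 84.12 accuracy vs. 90.27 MRR for Appliances).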
Table 7: Listener accuracies on speaker-generated data. Each row indicates the domain a listener was trained on and
the columns indicate the domain of the input samples. Results over 5 seeds.
Table 8: Simulator’s accuracy in predicting the behaviour of a listener knowledgeable about all domains (as the
speaker) and a listener with domain-specific knowledge for IND and OOD samples. ‘Avg’ is the overall accuracy,
‘Pos’ and ‘Neg’ are the percentages of correct predictions for the samples where the listener picked the correct (Pos)
and the incorrect image (Neg).
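The Avg/Pos/Neg breakdown described in the Table 8 caption can be sketched as follows (an illustrative reconstruction from the caption, not the authors' implementation; all names are hypothetical):

```python
# Hypothetical sketch of the Table 8 breakdown: the simulator predicts whether
# the listener will pick the correct image; its accuracy is reported overall
# (Avg) and separately on samples where the listener was actually correct (Pos)
# or incorrect (Neg).
def simulator_breakdown(sim_preds, listener_correct):
    """sim_preds[i]: simulator's predicted listener outcome (True = correct);
    listener_correct[i]: whether the listener actually picked the target."""
    hits = [p == l for p, l in zip(sim_preds, listener_correct)]
    pos = [h for h, l in zip(hits, listener_correct) if l]
    neg = [h for h, l in zip(hits, listener_correct) if not l]
    return {
        "Avg": sum(hits) / len(hits),
        "Pos": sum(pos) / len(pos) if pos else float("nan"),
        "Neg": sum(neg) / len(neg) if neg else float("nan"),
    }

# Toy example: 4 samples.
res = simulator_breakdown([True, True, False, False],
                          [True, False, True, False])
print(res)  # {'Avg': 0.5, 'Pos': 0.5, 'Neg': 0.5}
```

Splitting by Pos and Neg reveals whether the simulator merely predicts "correct" for everything (which would inflate Pos and collapse Neg) or genuinely tracks listener errors.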
             OOD                          IND
Domain       Golden   Speaker  Adapted    Golden   Speaker  Adapted
appliances   20.21    19.30    27.74      84.21    57.21    74.28
indoor       18.50    19.53    28.34      83.22    52.94    69.62
food         17.06    18.31    26.26      85.61    55.54    78.15
outdoor      18.89    18.54    26.21      84.38    52.83    73.04
vehicles     18.25    17.35    25.16      78.67    42.09    63.75

Table 9: Test results for the audience-aware adaptation pipeline, 5 seeds for each domain.
[Evaluation cards not rendered in extraction. Axes: Motivation (Practical, Cognitive, Intrinsic, Fairness); Generalisation type (Compositional, Structural, Cross Task, Cross Language, Cross Domain, Robustness); Shift type (Covariate, Label, Full, No shift); Shift source (Naturally occurring, Partitioned natural, Generated shift, Fully generated).]

Table 10: Generator’s evaluation card for the three main setups: baseline □, self-aware adaptation △, and audience-aware adaptation ⃝.

Table 11: Simulator’s evaluation card for the two setups in which it is used (i.e., baseline setup excluded): self-aware adaptation △, and audience-aware adaptation ⃝.
Figure 7: Unigram POS distribution across adaptation steps.
[Evaluation card not rendered in extraction. Axes: Motivation (Practical, Cognitive, Intrinsic, Fairness); Generalisation type (Compositional, Structural, Cross Task, Cross Language, Cross Domain, Robustness); Shift type (Covariate, Label, Full, No shift); Shift source (Naturally occurring, Partitioned natural, Generated shift, Fully generated); Shift locus (Train–test, Finetune train–test, Pretrain–train, Pretrain–test).]

Figure 8: Mean utterance Age of Acquisition over adaptation steps. Step 0 corresponds to the non-adapted utterance.