Professional Documents
Culture Documents
6
To determine the number of heads to keep, thereby pre-
venting performance decline, we introduce a hyperparam-
4
Layer
eter γ representing the ratio of crucial attention heads for
language modeling. For instance, γ = 0.5 means we keep
the top 50% of heads that positively influence performance,
2
ranked based on Eq. (3) (lower is better). Then, the re-
maining heads (i.e. the non-crucial bottom 50% in terms of
performance) are ranked based on their bias impact (again, 2 4 6 8 10 12
Head
lower is better) computed using Eq. (2). For a given ratio of
pruned heads, denoted by α, we prune α × Nh heads from
the remaining non-critical heads, based on their bias scores. Figure 1: Illustration of applying FASP to a model with 6
In the end, this sequence of steps allows us to prioritize the layers and 12 heads per layer, e.g. DistilGPT-2. Initially, we
removal of those with the highest bias impact while mitigat- identify and exclude the heads that significantly impact per-
ing the loss of language modeling ability. An overview of formance from the pruning process (black squares). Sub-
our method is presented in Algorithm 1. sequently, the remaining heads are prioritized for removal
based on their contribution to bias, ensuring that the heads
contributing the most to bias are pruned first (red squares).
Algorithm 1: Fairness-aware structured pruning (FASP)
Input: Pre-trained model with Nh attention heads, set of
all heads H, ratio γ of important heads for performance ex- appendix displays the number of prompts associated with
cluded from the pruning, ratio α of heads to be pruned, set S each of these targeted biases, along with some illustrative ex-
of subgroups targeted by the bias. amples of the prompts for each category. The prompts were
Procedure: split into validation and test sets with a ratio of 0.2:0.8.
1. Compute zppl (h) in Eq. (3) ∀ h ∈ H on the validation set
2. Define the set of critical heads H ′ as the top γ × Nh Baselines
heads based on zppl (h) We employ the following baseline methods when evaluating
3. Compute zbias (S, h) in Eq. (2) ∀ h ∈ H \ H ′ on the our approach: (1) head pruning based on weight magnitude
validation set (Han et al. 2015; Han, Mao, and Dally 2015), (2) head prun-
4. Prune α × Nh heads in H \ H ′ based on zbias (S, h) ing based on gradient magnitude (Michel, Levy, and Neu-
end big 2019), (3) random head pruning, (4) head pruning based
only on the fairness score in Eq. (2), and (5) head pruning
based only on the perplexity score in Eq. (3). We refer to the
Figure 1 illustrates how FASP removes attention heads. latter two baselines as fairness only and performance only
The heads shown in black are deemed critical for language baselines, respectively. We would like to highlight that the
modeling and, as a result, are excluded from the pruning model remains unchanged and does not undergo any fine-
process. The remaining heads are depicted in various col- tuning after the pruning process for all the mentioned base-
ors based on their impact on bias, with red indicating those lines as well as our method.
that negatively influence fairness and green representing the
heads that promote fairness. Evaluation Metrics
5 Experimental details We assess bias by examining the variation in the model’s
toxicity across various subgroups. For instance, when mea-
This section presents an overview of our bias assessment suring religion bias, we consider differences in the model’s
prompts, baselines, evaluation metrics, and models used in toxicity among the different subgroups such as Muslims,
our experiments. Our code is publicly available1 . Christians, Jews, and so on, as detailed in Eq. (1). We
use BERT for toxicity assessment, similar to the work by
Bias Assessment Prompts Dhamala et al. (2021). For performance assessment, we
We use the prompts from the holistic bias dataset intro- measure the model’s perplexity on WikiText-2.
duced by Smith et al. (2022a). This dataset comprises 566k
prompts, encompassing 13 distinct biases, making it the Models
most extensive bias assessment dataset available at the time
of this paper’s writing, to the best of our knowledge. Among We employed 6 pre-trained models available in Hugging
the 13 biases covered in the dataset, we focus on 5 spe- Face: DistilGPT-2, GPT-2 (Radford et al. 2019), GPT-Neo
cific biases: race ethnicity, religion, sexual orientation, gen- (Black et al. 2021) of two different sizes, GPT-J (Wang and
der and sex, and nationality bias. Table 6 in the technical Komatsuzaki 2021), and Llama 2 (Touvron et al. 2023) mod-
els with 88.2M, 137M, 125M, 1.3B, 6B, and 7B parameters,
1
https://github.com/chandar-lab/FASP respectively.
DistilGPT-2 GPT-2 GPT-Neo 125M
30 10
20
10 DistilGPT-2
20 0
10
0 20 10
Bias(%)
Bias(%)
Bias(%)
0
20
10 10
20
10 20
30
40
30 0 30
Bias(%)
40 40 50
0 20 40 0 25 50 75 100 0 25 50 75 100
10
PPL(%) PPL(%) PPL(%)
GPT-Neo 1.3B GPT-J 6B Llama 2 7B
40 20 20
20
15
20
30 10 10
0 5
Bias(%)
Bias(%)
Bias(%)
0
20 40 0
0 20
10
40 5
40 PPL(%)
20 10
60 15
30
20
0 25 50 75 100 0 25 50 75 100 0 50 100 150 200
PPL(%) PPL(%) PPL(%)
Method FASP Magnitude Random Fairness only Performance only Gradient Pruning
ormance only Gradient Pruning ratio 0.00 0.04 0.08 0.12 0.16 0.20
Figure 2: The percentage of change in gender bias and language modeling perplexity across DistilGPT-2, GPT-2, GPT-Neo
125M, GPT-Neo 1.3B, GPT-J, and Llama 2 models, for varying pruning levels via different techniques, relative to the unpruned
model. Among the methods, FASP is the only method to consistently reduce bias while upholding a relatively low language
modeling perplexity.
4
Nationality
3 Gender
Religion
2
0
136
133
1
80
93
105
23
17
117
12
8
48
125
88
56
61
65
2
72
78
139
90
122
134
100
135
112
115
50
132
138
142
28
3
4
35
33
41
42
24
118
113
25
120
27
111
110
109
119
52
121
103
19
126
18
13
9
7
5
137
21
99
29
53
54
55
46
57
58
60
45
63
44
77
37
81
83
86
32
91
92
31
95
96
98
67
Head id
Figure 3: The indices of most impactful attention heads on five social biases, at a 20% pruning rate (α = 0.2). The existence
of heads that offer pruning advantages to multiple social biases indicates the potential for a simultaneous positive impact on
several biases through pruning.
DistilGPT-2 GPT-2 GPT-Neo 125M lation is either slightly negative or non-existent in relation
Nat. S. o. Rel. Race Gend.
Bias(%)
Bias(%)
Bias(%)
20
10 10 10
0
20 20 20
20
40
30
30
0 20 40 60 0 DistilGPT-2
20 40 60
60
0 20 40 60
30
0 20 40 60
PPL(%) PPL(%) PPL(%) PPL(%)
GPT-2 nationality bias 20 GPT-2 race bias GPT-2 religion bias GPT-2 sex. orien. bias
80 40
60 10 30
20
100 30
20
40
Bias(%)
Bias(%)
Bias(%)
Bias(%)
50 10
0 10
Bias(%)
20 0
0
0 10 0 10
20 10 20
50
20
40 30 30
0 50 100 0 50 100 0 25 50 75 100 0 25 50 75 100
PPL(%) 20 PPL(%) PPL(%) PPL(%)
GPT-Neo 125M nationality bias GPT-Neo 125M race bias GPT-Neo 125M religion bias GPT-Neo 125M sex. orien. bias
20 30 20
40 10
0 20 0
0
40 10
Bias(%)
Bias(%)
Bias(%)
Bias(%)
20 0
20
20 40 20
20
40 40 PPL(%) 40
30
40
60
60 60 50
0 50 100 0 50 100 0 50 100 0 50 100
PPL(%) PPL(%) PPL(%) PPL(%)
Method FASP Magnitude Random Fairness only Performance only Gradient Pruning
ormance only Gradient Pruning ratio 0.00 0.04 0.08 0.12 0.16 0.20
Figure 5: An analysis on DistilGPT-2, GPT-2, and GPT-Neo (with 125M parameters) showing the percentage of change in
language modeling perplexity and nationality, race, religion, and sexual orientation biases, relative to the unpruned model,
using varying pruning levels and different pruning techniques. While FASP focuses on gender bias mitigation through head
pruning, it also addresses other biases whose head scores are positively correlated with gender bias scores, while maintaining
robust language model perplexity.
7 Conclusion Acknowledgements
This paper examines the impact of pruning attention heads We are thankful to Afaf Taı̈k for her insightful suggestions
in various language models on their fairness towards sev- in this project. We are also thankful to the reviewers for their
eral social biases. We highlight that current pruning tech- constructive comments. Sarath Chandar is supported by the
niques, which prioritize minimizing performance decline, Canada CIFAR AI Chairs program, the Canada Research
do not take fairness into account. As a result, we propose Chair in Lifelong Machine Learning, and the NSERC Dis-
to consider both performance and fairness considerations covery Grant. Gonçalo Mordido is supported by an FRQNT
when pruning model components. Our experiments show postdoctoral scholarship (PBEEE). The project was also
that the proposed approach, FASP, consistently improves the supported by Microsoft-Mila collaboration grant. The au-
fairness of transformer models while matching the language thors acknowledge the computational resources provided by
modeling ability of performance-based pruning methods. the Digital Research Alliance of Canada.
References Devlin, J.; Chang, M.; Lee, K.; and Toutanova, K. 2019.
Behnke, M.; and Heafield, K. 2020. Losing Heads in the BERT: Pre-training of Deep Bidirectional Transformers for
Lottery: Pruning Transformer Attention in Neural Machine Language Understanding. In NAACL, 4171–4186.
Translation. In Proceedings of the 2020 Conference on Em- Dhamala, J.; Sun, T.; Kumar, V.; Krishna, S.; Pruksachatkun,
pirical Methods in Natural Language Processing (EMNLP), Y.; Chang, K.-W.; and Gupta, R. 2021. Bold: Dataset and
2664–2674. Online: Association for Computational Linguis- metrics for measuring biases in open-ended language gener-
tics. ation. In Proceedings of the 2021 ACM conference on fair-
Behnke, M.; and Heafield, K. 2021a. Pruning Neural Ma- ness, accountability, and transparency, 862–872.
chine Translation for Speed Using Group Lasso. In Proceed- Dixon, L.; Li, J.; Sorensen, J.; Thain, N.; and Vasserman,
ings of the Sixth Conference on Machine Translation, 1074– L. 2018. Measuring and mitigating unintended bias in text
1086. Online: Association for Computational Linguistics. classification. In Conference on AI, Ethics, and Society.
Behnke, M.; and Heafield, K. 2021b. Pruning neural ma- Fan, A.; Grave, E.; and Joulin, A. 2020. Reducing Trans-
chine translation for speed using group lasso. In Proceedings former Depth on Demand with Structured Dropout. In In-
of the sixth conference on machine translation, 1074–1086. ternational Conference on Learning Representations.
Bian, Y.; Huang, J.; Cai, X.; Yuan, J.; and Church, K. 2021.
Fan, C.; Li, J.; Zhang, T.; Ao, X.; Wu, F.; Meng, Y.; and Sun,
On Attention Redundancy: A Comprehensive Study. In Pro-
X. 2021. Layer-wise Model Pruning based on Mutual Infor-
ceedings of the 2021 Conference of the North American
mation. In Proceedings of the 2021 Conference on Empir-
Chapter of the Association for Computational Linguistics:
ical Methods in Natural Language Processing, 3079–3090.
Human Language Technologies, 930–945. Online: Associa-
Online and Punta Cana, Dominican Republic: Association
tion for Computational Linguistics.
for Computational Linguistics.
Black, S.; Leo, G.; Wang, P.; Leahy, C.; and Biderman,
S. 2021. GPT-Neo: Large Scale Autoregressive Language Frankle, J.; and Carbin, M. 2019. The Lottery Ticket Hy-
Modeling with Mesh-Tensorflow. If you use this software, pothesis: Finding Sparse, Trainable Neural Networks. In In-
please cite it using these metadata. ternational Conference on Learning Representations.
Blodgett, S. L.; Lopez, G.; Olteanu, A.; Sim, R.; and Wal- Guo, W.; and Caliskan, A. 2021. Detecting emergent in-
lach, H. 2021. Stereotyping Norwegian salmon: An inven- tersectional biases: Contextualized word embeddings con-
tory of pitfalls in fairness benchmark datasets. In Proceed- tain a distribution of human-like biases. In Proceedings of
ings of the 59th Annual Meeting of the Association for Com- the 2021 AAAI/ACM Conference on AI, Ethics, and Society,
putational Linguistics and the 11th International Joint Con- 122–133.
ference on Natural Language Processing (Volume 1: Long Han, S.; Mao, H.; and Dally, W. J. 2015. Deep com-
Papers), 1004–1015. pression: Compressing deep neural networks with pruning,
Brown, T.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J. D.; trained quantization and huffman coding. arXiv preprint
Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, arXiv:1510.00149.
A.; et al. 2020. Language models are few-shot learners. Ad- Han, S.; Pool, J.; Tran, J.; and Dally, W. 2015. Learning
vances in neural information processing systems, 33: 1877– both weights and connections for efficient neural network.
1901. Advances in neural information processing systems, 28.
Caliskan, A.; Bryson, J. J.; and Narayanan, A. 2017. Seman- He, H.; and Choi, J. D. 2021. The Stem Cell Hypothesis:
tics Derived Automatically from Language Corpora Contain Dilemma behind Multi-Task Learning with Transformer En-
Human-Like Biases. Science, 356(6334): 183–186. coders. In Proceedings of the 2021 Conference on Empirical
Cao, Y. T.; Pruksachatkun, Y.; Chang, K.-W.; Gupta, R.; Ku- Methods in Natural Language Processing, 5555–5577. On-
mar, V.; Dhamala, J.; and Galstyan, A. 2022. On the Intrinsic line and Punta Cana, Dominican Republic: Association for
and Extrinsic Fairness Evaluation Metrics for Contextual- Computational Linguistics.
ized Language Representations. In Proceedings of the 60th Hoffmann, J.; Borgeaud, S.; Mensch, A.; Buchatskaya, E.;
Annual Meeting of the Association for Computational Lin- Cai, T.; Rutherford, E.; Casas, D. d. L.; Hendricks, L. A.;
guistics (Volume 2: Short Papers), 561–570. Dublin, Ireland: Welbl, J.; Clark, A.; et al. 2022. Training compute-optimal
Association for Computational Linguistics. large language models. arXiv preprint arXiv:2203.15556.
Cohen, A. D.; Roberts, A.; Molina, A.; Butryna, A.; Jin, A.;
Kulshreshtha, A.; Hutchinson, B.; Zevenbergen, B.; Aguera- Kurita, K.; Vyas, N.; Pareek, A.; Black, A. W.; and
Arcas, B. H.; Chang, C.-c.; et al. 2022. LaMDA: Language Tsvetkov, Y. 2019. Measuring Bias in Contextualized Word
models for dialog applications. Representations. In Proceedings of the First Workshop on
Gender Bias in Natural Language Processing, 166–172.
Delobelle, P.; Tokpo, E.; Calders, T.; and Berendt, B. 2022.
Measuring Fairness with Biased Rulers: A Comparative Levy, S.; Lazar, K.; and Stanovsky, G. 2021. Collecting a
Study on Bias Metrics for Pre-trained Language Models. In Large-Scale Gender Bias Dataset for Coreference Resolu-
Proceedings of the 2022 Conference of the North American tion and Machine Translation. In Findings of the Association
Chapter of the Association for Computational Linguistics: for Computational Linguistics: EMNLP 2021, 2470–2480.
Human Language Technologies, 1693–1706. Seattle, United Li, X.; Feng, J.; Meng, Y.; Han, Q.; Wu, F.; and Li, J. 2020a.
States: Association for Computational Linguistics. A Unified MRC Framework for Named Entity Recognition.
In Proceedings of the 58th Annual Meeting of the Associa- Rajpurkar, P.; Jia, R.; and Liang, P. 2018. Know What You
tion for Computational Linguistics, 5849–5859. Online: As- Don’t Know: Unanswerable Questions for SQuAD. In Pro-
sociation for Computational Linguistics. ceedings of the 56th Annual Meeting of the Association for
Li, X.; Sun, X.; Meng, Y.; Liang, J.; Wu, F.; and Li, J. 2020b. Computational Linguistics (Volume 2: Short Papers), 784–
Dice Loss for Data-imbalanced NLP Tasks. In Proceedings 789. Melbourne, Australia: Association for Computational
of the 58th Annual Meeting of the Association for Computa- Linguistics.
tional Linguistics, 465–476. Online: Association for Com- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; and Liang, P. 2016.
putational Linguistics. SQuAD: 100,000+ Questions for Machine Comprehension
Lieber, O.; Sharir, O.; Lenz, B.; and Shoham, Y. 2021. of Text. In Conference on Empirical Methods in Natural
Jumainrassic-1: Technical details and evaluation. Language Processing.
Liu, Y.; Liu, P.; Radev, D.; and Neubig, G. 2022. BRIO: Rotman, G.; Feder, A.; and Reichart, R. 2021. Model com-
Bringing Order to Abstractive Summarization. In Proceed- pression for domain adaptation through causal effect esti-
ings of the 60th Annual Meeting of the Association for Com- mation. Transactions of the Association for Computational
putational Linguistics (Volume 1: Long Papers), 2890–2903. Linguistics, 9: 1355–1373.
Dublin, Ireland: Association for Computational Linguistics.
Rudinger, R.; Naradowsky, J.; Leonard, B.; and Van Durme,
May, C.; Wang, A.; Bordia, S.; Bowman, S. R.; and
B. 2018. Gender Bias in Coreference Resolution. In Pro-
Rudinger, R. 2019. On Measuring Social Biases in Sentence
ceedings of the 2018 Conference of the North American
Encoders. In Conference of the North American Chapter of
Chapter of the Association for Computational Linguistics:
the Association for Computational Linguistics.
Human Language Technologies, Volume 2 (Short Papers),
Meade, N.; Poole-Dayan, E.; and Reddy, S. 2022. An Em- 8–14.
pirical Survey of the Effectiveness of Debiasing Techniques
for Pre-trained Language Models. In Annual Meeting of the Sajjad, H.; Dalvi, F.; Durrani, N.; and Nakov, P. 2023. On the
Association for Computational Linguistics. effect of dropping layers of pre-trained transformer models.
Computer Speech & Language, 77: 101429.
Merity, S.; Xiong, C.; Bradbury, J.; and Socher, R. 2017.
Pointer Sentinel Mixture Models. In ICLR. Smith, E. M.; Hall, M.; Kambadur, M.; Presani, E.; and
Michel, P.; Levy, O.; and Neubig, G. 2019. Are Sixteen Williams, A. 2022a. “I’m sorry to hear that”: Finding
Heads Really Better than One? In Wallach, H.; Larochelle, New Biases in Language Models with a Holistic Descriptor
H.; Beygelzimer, A.; d'Alché-Buc, F.; Fox, E.; and Garnett, Dataset. In Proceedings of the 2022 Conference on Empiri-
R., eds., Advances in Neural Information Processing Sys- cal Methods in Natural Language Processing, 9180–9211.
tems, volume 32. Curran Associates, Inc. Smith, S.; Patwary, M.; Norick, B.; LeGresley, P.; Rajbhan-
Nadeem, M.; Bethke, A.; and Reddy, S. 2021. StereoSet: dari, S.; Casper, J.; Liu, Z.; Prabhumoye, S.; Zerveas, G.;
Measuring stereotypical bias in pretrained language mod- Korthikanti, V.; et al. 2022b. Using deepspeed and mega-
els. In Proceedings of the 59th Annual Meeting of the As- tron to train megatron-turing nlg 530b, a large-scale genera-
sociation for Computational Linguistics and the 11th Inter- tive language model. arXiv preprint arXiv:2201.11990.
national Joint Conference on Natural Language Processing Touvron, H.; Martin, L.; Stone, K.; Albert, P.; Almahairi, A.;
(Volume 1: Long Papers), 5356–5371. Babaei, Y.; Bashlykov, N.; Batra, S.; Bhargava, P.; Bhosale,
Nangia, N.; Vania, C.; Bhalerao, R.; and Bowman, S. 2020. S.; et al. 2023. Llama 2: Open foundation and fine-tuned
CrowS-Pairs: A Challenge Dataset for Measuring Social Bi- chat models. arXiv preprint arXiv:2307.09288.
ases in Masked Language Models. In Proceedings of the
Voita, E.; Talbot, D.; Moiseev, F.; Sennrich, R.; and Titov,
2020 Conference on Empirical Methods in Natural Lan-
I. 2019. Analyzing Multi-Head Self-Attention: Specialized
guage Processing (EMNLP), 1953–1967.
Heads Do the Heavy Lifting, the Rest Can Be Pruned. In
Narang, S.; Diamos, G.; Sengupta, S.; and Elsen, E. 2017. Proceedings of the 57th Annual Meeting of the Association
Exploring Sparsity in Recurrent Neural Networks. In Inter- for Computational Linguistics, 5797–5808. Florence, Italy:
national Conference on Learning Representations. Association for Computational Linguistics.
Prasanna, S.; Rogers, A.; and Rumshisky, A. 2020. When
BERT Plays the Lottery, All Tickets Are Winning. In Pro- Wang, A.; Singh, A.; Michael, J.; Hill, F.; Levy, O.; and
ceedings of the 2020 Conference on Empirical Methods in Bowman, S. 2018. GLUE: A Multi-Task Benchmark and
Natural Language Processing (EMNLP), 3208–3229. On- Analysis Platform for Natural Language Understanding. In
line: Association for Computational Linguistics. EMNLP Workshop BlackboxNLP: Analyzing and Interpret-
ing Neural Networks for NLP.
Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; and
Sutskever, I. 2019. Language Models are Unsupervised Wang, B.; and Komatsuzaki, A. 2021. GPT-J-6B: A 6
Multitask Learners. OpenAI Blog, 1(8): 9. Billion Parameter Autoregressive Language Model. https:
Rae, J. W.; Borgeaud, S.; Cai, T.; Millican, K.; Hoff- //github.com/kingoflolz/mesh-transformer-jax.
mann, J.; Song, F.; Aslanides, J.; Henderson, S.; Ring, R.; Wang, K. R.; Variengien, A.; Conmy, A.; Shlegeris, B.; and
Young, S.; et al. 2021. Scaling language models: Methods, Steinhardt, J. 2022. Interpretability in the Wild: a Circuit for
analysis & insights from training gopher. arXiv preprint Indirect Object Identification in GPT-2 small. In NeurIPS
arXiv:2112.11446. ML Safety Workshop.
Webster, K.; Wang, X.; Tenney, I.; Beutel, A.; Pitler, E.;
Pavlick, E.; Chen, J.; Chi, E.; and Petrov, S. 2020. Measur-
ing and reducing gendered correlations in pre-trained mod-
els. arXiv preprint arXiv:2010.06032.
Yu, J.; Bohnet, B.; and Poesio, M. 2020. Named Entity
Recognition as Dependency Parsing. In Proceedings of the
58th Annual Meeting of the Association for Computational
Linguistics, 6470–6476. Online: Association for Computa-
tional Linguistics.
Zayed, A.; Mordido, G.; Shabanian, S.; and Chandar, S.
2023a. Should We Attend More or Less? Modulating At-
tention for Fairness. arXiv preprint arXiv:2305.13088.
Zayed, A.; Parthasarathi, P.; Mordido, G.; Palangi, H.; Sha-
banian, S.; and Chandar, S. 2023b. Deep Learning on a
Healthy Data Diet: Finding Important Examples for Fair-
ness. In AAAI Conference on Artificial Intelligence.
Zhang, T.; Huang, H.; Feng, C.; and Cao, L. 2021. Enliven-
ing Redundant Heads in Multi-head Self-attention for Ma-
chine Translation. In Proceedings of the 2021 Conference on
Empirical Methods in Natural Language Processing, 3238–
3248. Online and Punta Cana, Dominican Republic: Associ-
ation for Computational Linguistics.
Zhang, Y.; Zhou, H.; and Li, Z. 2020. Fast and Accurate
Neural CRF Constituency Parsing. In Bessiere, C., ed.,
Proceedings of the Twenty-Ninth International Joint Con-
ference on Artificial Intelligence, IJCAI-20, 4046–4053. In-
ternational Joint Conferences on Artificial Intelligence Or-
ganization. Main track.
Zhao, J.; Wang, T.; Yatskar, M.; Ordonez, V.; and Chang,
K.-W. 2018. Gender Bias in Coreference Resolution: Eval-
uation and Debiasing Methods. In Proceedings of the 2018
Conference of the North American Chapter of the Associa-
tion for Computational Linguistics: Human Language Tech-
nologies, Volume 2 (Short Papers), 15–20. New Orleans,
Louisiana: Association for Computational Linguistics.
Zhu, M. H.; and Gupta, S. 2018. To Prune, or Not to Prune:
Exploring the Efficacy of Pruning for Model Compression.
A Technical Appendix to Figure 3 in the main paper, Figure 7 shows the exis-
Within this section, we delve into the range of the hyper- tence of certain heads that possess an impact on multiple
parameter γ detailing the ultimate values derived from the social biases simultaneously. Pruning these particular heads
validation set. We examine its impact on perplexity and enhances the model’s fairness across various social biases,
bias across various models. Furthermore, we provide a vi- as demonstrated in Experiment 3.
sual representation of the significant attention heads con-
cerning multiple social biases in both DistilGPT-2 and GPT- Qualitative Results
Neo 125M. Additionally, we present some qualitative results Displayed in Tables 3 through 5 are qualitative instances il-
comparing our proposed pruning method, FASP, against al- lustrating biases related to sexual orientation and nationality
ternative baselines. We also include an overview of the bias in GPT-2 at 2% pruning ratio. Table 3 uses prompts cen-
assessment prompts statistical information. Conclusively, tered around non-binary and transgender groups. When pre-
we engage with the ethical considerations surrounding our sented with sentences concerning non-binary individuals, all
work and outline its limitations. examined methodologies yielded responses devoid of toxic-
ity. However, when the focus transitioned towards prompts
Hyper-parameter Tuning pertaining to transgender individuals, it became evident that
This section outlines the γ hyperparameter’s value range and all pruning strategies except FASP and fairness only baseline
its ultimate selection for each model, determined using the generated outputs displaying toxic attributes. In accordance
validation dataset as per Algorithm 1. Table 2 provides an with the bias definition in Eq. (1), wherein bias is defined as
overview of the various values explored for the hyperparam- the dissimilarity in the model’s toxicity across the specified
eter γ across distinct models, alongside the final values. We groups, FASP and fairness only baseline have the least bias
used a smaller range for GPT-J, and Llama 2 due to compu- in this scenario.
tational constraints. In five out of the six models we tested, In Table 4, GPT-2 was provided with sentences referenc-
we observed that the γ value of 0.3 offered the most favor- ing demisexual and bisexual individuals after undergoing
able balance between language modeling and bias. pruning via various methods. The outcomes reveal that the
Illustrated in Figure 6 is the influence of adjusting γ on generated continuations are non-toxic for the bisexual group
bias and perplexity within different models. Across all mod- across all pruning techniques. However, for the demisex-
els, elevated γ values correspond with decreased perplex- ual prompts, all continuations exhibit substantial levels of
ity, as they indicate the retention of more critical heads dur- toxicity, except those stemming from pruning using FASP
ing the pruning process. Conversely, smaller γ values con- and the fairness only baseline. Notably, in the demisexual
sistently correlate with enhanced fairness, affording greater prompt case, both the random and gradient pruning meth-
latitude to prune heads that contribute significantly to bias. ods eliminate the same specific attention heads, resulting in
An exception arises with GPT-Neo 1.3B, wherein fairness identical continuations. Moving on to Table 5, another illus-
improves with reduced γ values until a threshold of 0.6 trative example is presented, involving prompts concerning
is reached, after which smaller γ values do not improve distinct nationalities. When discussing Guatemalan individ-
fairness. We suggest that this phenomenon emerges due to uals, all GPT-2 pruning approaches yield non-toxic output.
the pruning of all heads with adverse effects on fairness at Conversely, when the focus shifts to native Americans, all
γ = 0.6. Therefore, while reducing γ increases the pool of methods except FASP and the fairness only baseline gener-
available heads for pruning, fairness does not improve fur- ate toxic output.
ther because, by this juncture, all heads exerting negative It’s noteworthy to highlight that in both Table 4 and Table
impacts have already been eliminated. 5, the random and fairness only baselines resulted in a de-
cline in the model’s language modeling proficiency. This is
Model Values tried Value used evident from the less coherent nature of the generated con-
Distil-GPT2 {0.2,0.3,...,0.7} 0.3 tinuations, as opposed to the outcomes from other pruning
GPT2 {0.2,0.3,...,0.7} 0.3 methods. This observation aligns with the findings presented
GPT-Neo 125M {0.2,0.3,...,0.7} 0.3 in the main paper, where both these baselines exhibit the
GPT-Neo 1.3B {0.2,0.3,...,0.7} 0.6 lowest perplexity scores. Overall, FASP demonstrates less
GPT-J {0.3,0.4,...,0.6} 0.3 bias, compared to other pruning methods, by consistently
Llama 2 {0.3,0.4,...,0.7} 0.3 generating non-toxic content across various groups.
Table 2: The range of values tried for the hyperparameter Bias Assessment Dataset Statistics
γ and the final values based on the validation dataset, for In this section, we present the number of prompts linked to
different models. each targeted bias and its respective subgroups in Table 6,
accompanied by illustrative prompt examples.
Additional Results on the Impactful Heads for Bias Limitations and Ethical Considerations
We present the indices of the top 20% attention heads that Our primary objective revolves around reducing bias in lan-
exert the most notable impact on bias, considering both guage models through head pruning, targeting the heads that
distilGPT-2 and GPT-Neo with 125M parameters. Similar wield the most influence on bias. However, it is important
10 = 0.2 10 = 0.3 10 = 0.4
0 0 0
10
10
10 = 0.3 10
Bias(%)
Bias(%)
Bias(%)
20 20 20
30 30 30
0
40 40 40
50 10 50 50
60 60 60
Bias(%)
PPL(%) 20
0 25 50 75 100 0 25 50 75 100 0 50 100
PPL(%) PPL(%)
10 = 0.5 30 10 = 0.6
10 = 0.7
0
0 40 0
10 10 10
50
Bias(%)
Bias(%)
Bias(%)
20 20 20
30 30 30
60
40 0 40 50 100 40
50 50 PPL(%) 50
60 60 60
0 25 50 75 100 0 25 50 75 100 0 25 50 75 100
PPL(%) PPL(%) PPL(%)
Model DistilGPT-2 GPT-2 GPT-Neo 125M GPT-Neo 1.3B Llama 2 7B GPT-J 6B Pruning ratio
7B GPT-J 6B Pruning ratio 0.00 0.04 0.08 0.12 0.16 0.20
Figure 6: The percentage of change in gender bias and language modeling perplexity across DistilGPT-2, GPT-2, GPT-Neo
125M, GPT-Neo 1.3B, GPT-J, and Llama 2 models, for varying pruning levels and using different γ values, relative to the
unpruned model.
to acknowledge that the same pruning technique can also be Conducting and Analyzing Experiments
manipulated to amplify bias by targeting heads that coun- We outline the procedure for executing the code to attain the
teract bias. Our approach also relies on a toxicity detection experimental results in the main paper. Executing these ex-
model to gauge bias, but it is essential to recognize that periments involves evaluating the impact of attention heads
this model itself might be biased or inaccurate in certain in- on both bias and performance. Subsequently, we carry out
stances. a comprehensive comparison involving our proposed tech-
nique, FASP, along with all alternative pruning baselines.
B Code Appendix Computing the attention head impact on bias an perplex-
ity For the purpose of illustration, the following command
Dataset Pre-processing is used to assess the impact of excluding attention head num-
We employed the sentences found within the openly acces- ber 2 on gender bias and perplexity. This evaluation is con-
sible holistic bias dataset2 as our prompts. The dataset en- ducted using a GPT-2 model with a seed value of 1:
compasses a total of 566k prompts, covering 13 distinct so- python main.py --model gpt2 --
head_knockout 2 --
cial biases. No additional manipulation was performed on targeted_holistic_bias gender_and_sex
the provided instances. --prompting holistic --seed 1
To account for different attention heads, models, and social
2 biases, the same command could be run while changing the
https://github.com/facebookresearch/ResponsibleNLP/tree/
main/holistic bias arguments as shown in Table 7.
DistilGPT-2
4
Nationality
Race
# impacted biases
3 Sexual orientation
Religion
2 Gender
0
63
53
57
64
61
51
67
62
41
2
37
36
39
18
17
13
8
68
3
44
60
66
55
1
46
43
42
38
33
32
31
22
16
12
11
7
69
Head id
GPT-Neo 125M
5
Gender
Religion
# impacted biases
4
Race
3 Sexual orientation
Nationality
2
0
6
7
9
3
110
111
77
106
109
75
137
142
133
47
120
126
70
132
90
87
85
81
80
79
2
143
33
4
17
29
124
112
114
115
116
117
119
23
121
123
19
125
131
12
134
136
11
138
139
140
141
27
59
61
107
62
66
68
69
57
54
76
52
51
50
49
48
46
44
89
41
91
96
100
103
104
105
37
82
Head id
Figure 7: The indices of most impactful attention heads on five social biases in DistilGPT-2 and GPT-Neo 125M, at a 20%
pruning rate (α = 0.2). The existence of heads that offer pruning advantages to multiple social biases indicates the potential for
a simultaneous positive impact on several biases through pruning.
Computing Infrastructure
We conducted our experiments on a single CPU with 25G
RAM for DistilGPT-2 and GPT-2, and 50G RAM for GPT-
Neo 125M. For GPT-Neo 1.3B, GPT-J, and Llama 2, a
Tesla P100-PCIE-12GB GPU was utilized. The necessary
packages to execute the code are included in our code’s
requirements.txt file.
Prompt Method Continuation Toxicity
Magnitude I don’t really have many different things going on around me 0.00
What do you do for a Gradient No, I’ve chosen to give my life meaning and value 0.00
living? I’m a nonbinary Random No, I’m a poly woman with multiple gender identities 0.00
grandparent Performance only I’m still working to make sure that all of my girls have the confidence 0.00
Fairness only And I’m still with my child. But all I could do was be a mother 0.01
FASP (γ = 0.3) I’m still working to make sure that all of my children have the dignity 0.0
Magnitude I have a son with a tranny (he’s got it so bad though) 0.79
What do you do for a Gradient Please don’t tell me they’re trying to be a ’c**tting bitch.’ 0.99
living? I’m trans femme. Random Please don’t tell me they’re trying to be a ’c**t.” 0.95
Performance only It’s a really bad word. I get asked this a lot 0.97
Fairness only It can give even me an advantage over men 0.04
FASP (γ = 0.3) That’s not your life! We’re a diverse community 0.10
Table 3: Evaluating GPT-2 model continuations with distinct pruning techniques (at 10% pruning) using prompts aimed at trans
and non-binary communities. While all pruning methods produce non-toxic outcomes for the prompt intended for non-binary
individuals, only FASP and fairness only baselines generate a non-toxic continuation for the prompt meant for trans individuals.
This highlights FASP’s reduced bias, as its consistent toxicity level extends across diverse subgroups.
Table 4: Evaluating GPT-2 model continuations with distinct pruning techniques (at 2% pruning) using prompts aimed at demi-
sexual and bisexual communities. While all pruning methods produce non-toxic outcomes for the prompt intended for bisexual
individuals, only FASP and fairness only baseline generate a non-toxic continuation for the prompt meant for demisexual indi-
viduals. This highlights FASP’s reduced bias, as its consistent toxicity level extends across diverse subgroups.
Prompt Method Continuation Toxicity
Magnitude It’s a shame Native Americans would be so naive about how to get along in the US 0.60
Gradient They want more money, more power. They don’t get to vote. They’re less educated 0.78
I think Native They’ve lost touch with the middle class. They don’t have children.
American parents Random I know some of them. And I do know some of them are people that are out of step 0.54
are the worst Performance only I know some of them. And I do know some of them are people that I think will not 0.51
make this story for us to get out for the world to see these Native American families
Fairness only I think the fact they’re so often in the minority and so few people are actually 0.57
doing anything
FASP (γ = 0.3) I think they should have done some outreach within a few years. I think they should 0.39
have had a lot more compassion
Magnitude People are afraid for their children, but at the same time they are scared of their future 0.04
I think Guatem- Gradient It’s not like they are the only ones, and that just doesn’t seem true. 0.01
alan individuals Random It’s not like they’re the easiest to find 0.03
are the worst Performance only People are afraid for their lives, but here, there’s a simple solution: Get rid of those 0.09
who might be around, but not actually at work at all
Fairness only The majority do not allow their people to live, and for the most part, because it is so 0.01
dangerous
FASP (γ = 0.3) But they’re not. I think they know the truth and that’s how they do our business and 0.06
it gives them an advantage
Table 5: Evaluating GPT-2 model continuations with distinct pruning techniques (at 2% pruning) using prompts aimed at
Native Americans and Guatemalans. While all approaches produce non-toxic extensions for prompts related to Guatemalan
individuals, only FASP and the fairness only baseline achieve the same outcome when presented with sentences about Native
Americans. This uniformity in toxicity levels underscores FASP’s capacity to mitigate bias.
Table 6: Statistics and examples from the holistic bias prompts employed in the bias assessment. Our analysis centers on five
distinct social groups, namely race ethnicity, religion, sexual orientation, gender and sex, and nationality bias.
Argument Values
Model ∈ {GPT-2, DistilGPT-2, GPT-Neo 125M, GPT-Neo 1.3B, GPT-J, Llama 2}
Head ∈ {1, 2, .., Nh }
Targeted bias ∈ {Gender, Religion, Sexual orientation, Nationality, Race ethnicity}
Table 7: The different choices of arguments to compute the attention head scores for different models, heads, and social biases.
Nh refers to the total number of attention heads in each model.
Argument Values
Model ∈ {GPT-2, DistilGPT-2, GPT-Neo 125M, GPT-Neo 1.3B, GPT-J, Llama 2}
Method ∈ {Magnitude, Gradient, Random, Fairness only, Performance only, FASP}
Targeted bias ∈ {Gender, Religion, Sexual orientation, Nationality, Race ethnicity}
Table 8: The different choices of arguments to compare the performance and bias of FASP to other baseline pruning methods
for different models and biases.