Professional Documents
Culture Documents
Subject Section
Abstract
Motivation: We present a multi-sequence generalization of Variational Information Bottleneck (VIB) (Alemi
et al., 2016) and call the resulting model Attentive Variational Information Bottleneck (AVIB). Our AVIB
model leverages multi-head self-attention (Vaswani et al., 2017) to implicitly approximate a posterior
distribution over latent encodings conditioned on multiple input sequences. We apply AVIB to a fundamental
immuno-oncology problem: predicting the interactions between T cell receptors (TCRs) and peptides.
Results: Experimental results on various datasets show that AVIB significantly outperforms state-of-the-art
methods for TCR-peptide interaction prediction. Additionally, we show that the latent posterior distribution
learned by AVIB is particularly effective for the unsupervised detection of out-of-distribution (OOD) amino
acid sequences.
Availability: The code and the data used for this study are publicly available at: https://github.com/nec-
research/vibtcr.
Contact: filippo.grazioli@neclab.eu, renqiang@nec-labs.com
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction if TCR recognition takes place can cytokines be released, which leads to
Predicting whether T cells recognize peptides presented on cells is a the death of a target cell.
fundamental step towards the development of personalized treatments to TCRs consist of an α- and a β-chain whose structures determine
the interaction with the pMHC complex. Each chain consists of three
enhance the immune response, like therapeutic cancer vaccines (Slansky
et al., 2000; Meng and Butterfield, 2002; McMahan et al., 2006; Corse loops, referred to as complementarity determining regions (CDR1-3). It is
et al., 2011; Buhrman and Slansky, 2013; Hundal et al., 2020). In the believed that the CDR3 loops primarily interact with the peptide of a given
human immune system, T cells monitor the health status of cells by pMHC complex (Feng et al., 2007; Rossjohn et al., 2015; La Gruta et al.,
identifying foreign peptides on their surface (Davis and Bjorkman, 1988; 2018). Supplementary Materials 1 depict the 3D structure of a TCR-pMHC
Krogsgaard and Davis, 2005). The T cell receptors (TCRs) are able to complex.
bind to these peptides, especially if they originate from an infected or Recent discoveries (Dash et al., 2017; Lanzarotti et al., 2019) have
cancerous cell. The binding of TCRs - also known as TCR recognition demonstrated that both the CDR3α- and β-chains carry information on
- with peptides, presented by major histocompatibility complex (MHC) the specificity of the TCR toward its cognate pMHC target. Obtaining
molecules in peptide-MHC (pMHC) complexes, constitutes a necessary information about paired TCR α- and β-chains requires specific and
expensive experiments, like single cell (SC) sequencing, which limits
step for immune response (Rowen et al., 1996; Glanville et al., 2017). Only
its availability. Conversely, the bulk sequencing of a population of cells
reactive to a peptide is cheaper, but it only allows to gain information about
either the α- or the β-chain.
1
© The Author(s) 2022. Published by Oxford University Press.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License
(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium,
provided the original work is properly cited.
i i
i i
i i
2 Grazioli et al.
In this work, we propose Attentive Variational Information Bottleneck Variational Autoencoder (MMVAE) (Shi et al., 2019) factorizes the joint
(AVIB) to predict TCR-peptide interactions. AVIB is a multi-sequence variational posterior as a combination of unimodal posteriors, using a
generalization of Variational Information Bottleneck (VIB) (Alemi et al., Mixture of Experts (MoE). MoE-based models have been used in the
2016). Notably, we introduce Attention of Experts (AoE), a novel method biomedical field to tackle challenges such as protein-protein interactions
for combining single-sequence latent distributions into a joint multi- (Qi et al., 2007), biomolecular sequence annotation (Caragea et al., 2009),
sequence latent encoding distribution using self-attention. Owing to its and clustering cell phenotypes from single-cell data (Kopf et al., 2021).
design, AoE can naturally leverage the abundant available data where Their main advantage is that they can infer global patterns in the genetic
either the CDR3α or the CDR3β sequence is missing when estimating the or peptide sequences in supervised and unsupervised settings (Kopf et al.,
multi-sequence variational posterior. The model learns to predict whether 2021).
the binding between the peptide and the TCR takes place or not. Supervised learning. The Variational Information Bottleneck (VIB)
Extensive experiments demonstrate that AVIB significantly outperforms (Alemi et al., 2016) is to supervised learning what the β-VAE (Higgins
state-of-the-art methods. In addition, the probabilistic nature of the VIB et al., 2017) is to unsupervised learning. VIB leverages variational
i i
i i
i i
i i
i i
i i
4 Grazioli et al.
which is the equivalent of the number of word tokens in a natural language 3 Results and Discussion
processing (NLP) setting (see Section 3.5). First, we provide a description of the datasets used in this work. We then
In this work, we only benchmark AoE against PoE and do not apply AVIB to the TCR-peptide interaction prediction problem. Last, we
compare against MoE. We believe MoE’s "OR"-nature is not suitable for demonstrate AVIB’s effectiveness in the context of OOD detection. All
modeling the chemical specificity of multiple molecules. If taken alone, the experiments are implemented using PyTorch (Paszke et al., 2019). Code
single-sequence variational posteriors are not informative of the chemical
and data are publicly available at: https://github.com/nec-research/vibtcr.
reaction. Analogously, sampling from a MoE - which has similarities to
the OR operator - is not suitable for capturing how molecules chemically
3.1 Datasets
interact.
Recent studies (Jurtz et al., 2018; De Neuter et al., 2018; Jokinen et al.,
2.3 Information Bottleneck Mahalanobis Distance 2019; Wong et al., 2019; Moris et al., 2019; Gielis et al., 2019; Tong et al.,
2020; Springer et al., 2020; Fischer et al., 2020; Springer et al., 2021;
i i
i i
i i
Table 1. TCR-peptide interaction prediction results. The reported confidence intervals are standard errors over 5 repeated experiments with different independent
training/test random splits. Reported scores are computed on the test sets. Baselines: NetTCR-2.0 (Montemurro et al., 2021), ERGO II (Springer et al., 2021),
LUPI-SVM (Abbasi et al., 2018). Best results are in bold.
MVIB AVIB
Inputs Metric ↑/↓ AvgPOOLoE
(PoE) (AoE)
Figure 7 in the Supplementary Materials depicts the distributions longest sequence. Information on the length distribution of the amino acid
of the human TCR data for both the α+β set and the β set. The two sequences can be found in Supplementary Materials 5.1.
datasets have similar peptide distributions, but contain different CDR3β
sequences. Supplementary Materials 5.1 provides information regarding 3.3 TCR-peptide Interaction Prediction
the distribution of the length of the amino acid sequences in the various
In order to evaluate AVIB’s performance on the TCR-peptide interaction
datasets. Supplementary Materials 5.2 provides information on the class
prediction task, we perform experiments on three datasets: the α+β set, the
distribution of the α+β set and β set. Supplementary Materials 5.3
β set and their union β set ∪ α+β set. For the β set and the union set, input
describes the binding affinity distributions of the NetMHCIIpan-4.0 set.
samples are (xP eptide , xCDR3β ) pairs. For the α+β set, inputs can be
either (xP eptide , xCDR3β ) pairs or (xP eptide , xCDR3α , xCDR3β )
3.2 Pre-processing triples. For all tri-sequence experiments, we adopt a full multi-sequence
In this work, peptides, CDR3α and CDR3β are represented as sequences extension of the JAV IB objective (see Supplementary Materials 6,
of amino acids. The 20 amino acids translated by the genetic code Equation 12).
are in general represented as English alphabet letters. Analogously to Baselines. We benchmark AVIB against two state-of-the-art deep
Montemurro et al. (2021), we pre-process the amino acid sequences learning methods for TCR-peptide interaction predictions: ERGO II
using BLOSUM50 encodings (Henikoff and Henikoff, 1992), i.e. the (Springer et al., 2021) and NetTCR-2.0 (Montemurro et al., 2021).
substitution value of each amino acid represented by the BLOSUM50 Additionally, we benchmark AVIB against the LUPI-SVM (Abbasi et al.,
matrix’ diagonal. This allows us to represent a sequence of N amino acids 2018), by leveraging the α-chain at training time as privileged information.
as a 20 × N matrix, analogously to the approach proposed by Nielsen For all benchmark methods, we adopt the original publicly available
et al. (2003). After performing BLOSUM50 encoding, we standardize the implementations5 .
features by subtracting the mean and scaling to unit variance. As the length
of the amino acid sequences is not constant, we operate 0-padding after 5 https://github.com/IdoSpringer/ERGO-II;
the BLOSUM50 encoding (Mösch and Frishman, 2021). This ensures that https://github.com/mnielLab/NetTCR-2.0;
all matrices have shape 20 × Nmax , where Nmax is the length of the https://github.com/wajidarshad/LUPI-SVM
i i
i i
i i
6 Grazioli et al.
i i
i i
i i
Table 3. OOD detection results. Distinguishing in- and out-of-distribution test samples. DID is the in-distribution set; DOOD is the out-of-distribution set, not
ID ID
available at training time. The reported confidence intervals are standard errors over 5 repeated experiments with different independent Dtrain /Dtest random
splits. ↑ indicates larger value is better; ↓ indicates lower value is better. Best results are in bold. Baselines: MSP, ODIN, AVIB-R. For ODIN, the hyperparameters
are ε = 0.001 and T = 1000, tuned on a validation set.
FPR Detection
AUROC AUPR
D ID / D OOD Method @ 95% TPR Error
↑ ↑
↓ ↓
Human TCR MSP 0.962 ± 0.001 0.505 ± 0.001 0.540 ± 0.003 0.627 ± 0.003
/ ODIN 0.962 ± 0.002 0.506 ± 0.001 0.425 ± 0.008 0.559 ± 0.014
Non-human AVIB-R 0.719 ± 0.018 0.384 ± 0.009 0.768 ± 0.010 0.714 ± 0.011
TCR AVIB-Maha
0.699 ± 0.011 0.374 ± 0.006 0.850 ± 0.002 0.871 ± 0.001
i i
i i
i i
8 Grazioli et al.
References Grazioli, F., Mösch, A., Machart, P., Li, K., Alqassem, I., O’Donnell,
Abbasi, W. A., Asif, A., Ben-Hur, A., et al. (2018). Learning protein T. J., and Min, M. R. (2022b). On TCR binding predictors failing to
binding affinity using privileged information. BMC bioinformatics, generalize to unseen peptides. Frontiers in Immunology, 13.
19(1), 1–12. Hendrycks, D. and Gimpel, K. (2016). A baseline for detecting
misclassified and out-of-distribution examples in neural networks. arXiv
Alemi, A. A., Fischer, I., Dillon, J. V., and Murphy, K. (2016). Deep
preprint arXiv:1610.02136.
variational information bottleneck. arXiv preprint arXiv:1612.00410.
Alemi, A. A., Fischer, I., and Dillon, J. V. (2018). Uncertainty in the Henikoff, S. and Henikoff, J. G. (1992). Amino acid substitution matrices
variational information bottleneck. arXiv preprint arXiv:1807.00906. from protein blocks. Proceedings of the National Academy of Sciences,
Bagaev, D. V., Vroomans, R. M., Samir, J., Stervbo, U., Rius, C., 89(22), 10915–10919.
Dolton, G., Greenshields-Watson, A., Attaf, M., Egorov, E. S., Zvyagin, Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M.,
I. V., et al. (2020). Vdjdb in 2019: database extension, new analysis Mohamed, S., and Lerchner, A. (2017). beta-vae: Learning basic visual
concepts with a constrained variational framework. In International
i i
i i
i i
McMahan, R. H., McWilliams, J. A., Jordan, K. R., Dow, S. W., Wilson, Springer, I., Besser, H., Tickotsky-Moskovitz, N., Dvorkin, S., and
D. B., Slansky, J. E., et al. (2006). Relating tcr-peptide-mhc affinity Louzoun, Y. (2020). Prediction of specific tcr-peptide binding from large
to immunogenicity for the design of tumor vaccines. The Journal of dictionaries of tcr-peptide pairs. Frontiers in immunology, 11, 1803.
clinical investigation, 116(9), 2543–2551. Springer, I., Tickotsky, N., and Louzoun, Y. (2021). Contribution of t
Meng, W. S. and Butterfield, L. H. (2002). Rational design of peptide- cell receptor alpha and beta cdr3, mhc typing, v and j genes to peptide
based tumor vaccines. Pharmaceutical research, 19(7), 926–932. binding prediction. Frontiers in immunology, 12.
Montemurro, A., Schuster, V., Povlsen, H. R., Bentzen, A. K., Jurtz, V., Szeto, C., Lobos, C. A., Nguyen, A. T., and Gras, S. (2020). TCR
Chronister, W. D., Crinklaw, A., Hadrup, S. R., Winther, O., Peters, recognition of peptide–MHC-i: Rule makers and breakers. International
B., et al. (2021). Nettcr-2.0 enables accurate prediction of tcr-peptide Journal of Molecular Sciences, 22(1), 68.
binding by using paired tcrα and β sequence data. Communications Tickotsky, N., Sagiv, T., Prilusky, J., Shifrut, E., and Friedman, N. (2017).
biology, 4(1), 1–13. Mcpas-tcr: a manually curated catalogue of pathology-associated t cell
Moris, P., De Pauw, J., Postovskaya, A., Ogunjimi, B., Laukens, K., receptor sequences. Bioinformatics, 33(18), 2924–2929.
i i
i i