Professional Documents
Culture Documents
Abalign - A Comprehensive Multiple Sequence Alignment Platform For B-Cell Receptor Immune Repertoires
Abalign - A Comprehensive Multiple Sequence Alignment Platform For B-Cell Receptor Immune Repertoires
Received February 26, 2023; Revised April 23, 2023; Editorial Decision May 01, 2023; Accepted May 08, 2023
C The Author(s) 2023. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License
(http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work
is properly cited. For commercial re-use, please contact journals.permissions@oup.com
W18 Nucleic Acids Research, 2023, Vol. 51, Web Server issue
the widespread adoption of BCR sequencing in the study of about BCR sequence patterns. Unfortunately, to the best of
infectious diseases, allergies, vaccination, autoimmune dis- our knowledge, there are no such MSA methods currently
orders, tumor immunity, and microbiota(5–9). Sequencing available.
BCR involves the sequencing of the entire BCR gene, or BCR/antibody sequences exhibit region-specific features,
a portion of it, from a large number of individual B cells. with complementarity-determining regions (CDRs) dis-
This generates a comprehensive dataset of BCR sequences, playing a high degree of diversity and framework re-
known as the ‘BCR repertoire’, demonstrating the picture gions (FRs) being more conserved. Certain residue sites
of BCR landscape. The analysis of the BCR repertoires en- tend to remain invariant, while others are prone to in-
ables us to understand the diversity and evolution of BCRs sertion and deletion. These features can be summarized
over time, and provides insights into the specificity of the by antibody numbering, which is an immunoinformatic
B cell population. Furthermore, researchers can identify technique that enables standardization of the residue in-
respectively) for all sequences, and records the largest length alignment that agree to the tested alignment. TC score mea-
mLk . The positions where gaps are inserted at the pre- sures the fractions of columns that are totally aligned.
annotated insertion positions defined by the antibody num- STRIKE (53) score is also employed to assess MSA re-
bering schemes. The number of gaps to be inserted is given sults. Unlike SP and TC scores which need reference align-
by: ment and annotation information, STRIKE score only re-
quires one protein structure for a whole MSA. Hence,
numi,k = mLk − CLi,k (2)
STRIKE score can be applied to assess MSA with a huge
It is worth noting that the results of the MSA may exhibit number of BCR sequences.
minor discrepancies depending on the selected numbering
scheme, owing to differences in the definition of CDRs and Software implementation
insertion positions. In the following work, we have used the
assess the performance as the data size increased, we divided Efficiency is a crucial factor when it comes to processing
the dataset into 10 subsets, each containing 1000, 2000, large-scale BCR sequencing data. Using a Ryzen 2990WX
4000, . . . , 512 000 sequences, respectively. Our results show CPU with 16 threads, Abalign is able to complete the align-
that Abalign significantly outperforms the other three state- ment of the above 512000 sequences in just 6 minutes.
of-the-art methods in terms of STRIKE score, and main- This is notably faster than MUSCLE, Clustal Omega, and
tains its high stability in all the subsets (Figure 1C). This MAFFT, which takes 53, 13 and 7 times longer, respectively
result can be attributed to the fact that Abalign utilizes an- (Figure 2A). In terms of memory consumption, Abalign
tibody numbering to assign the positions of each residue. As also outperforms these other methods. It uses only 2.3 GB
a result, the positions are not influenced by other sequences of memory, which is 20%, 34% and 67% size of the oth-
in the MSA, and therefore, the accuracy of the results is ers, respectively (Figure 2B). These differences become even
maintained even when the data size increases. Conversely, more pronounced as the size of the input data increases.
ClustalO and MAFFT demonstrate a significant decrease Overall, the performance of Abalign in terms of accuracy,
in performance, primarily because of their tendency to blur speed and memory usage makes it an ideal choice for the
the boundaries between FRs and CDRs as the data size MSA of large-scale BCR sequencing data.
increases (Supplementary Figure S3A). We also observed
the performance of MUSCLE, exhibiting relative stabil- Software features
ity in heavy chains, which may attribute to its divide-and-
conquer algorithm (38). However, similar to Clustal Omega Leveraging the advantages of its unique MSA method,
and MAFFT, its performance decreases in light chains due Abalign has integrated a range of useful functions and pre-
to its inability to accurately align FR4 (Supplementary Fig- sented them in a user-friendly manner through visual in-
ure S3B). terfaces (Figure 3A and Supplementary Figure S4). Users
Nucleic Acids Research, 2023, Vol. 51, Web Server issue W21
Figure 3. Software features of Abalign. (A) The graphic interface displaying the aligned sequences after MSA. The FR1/2/3/4 and CDR1/2/3 are high-
lighted in colors. (B) B-cell lineage tree, annotated with the aligned sequences as well as their proportions. The most abundant sequence is highlighted in
red. (C) Residue preferences obtained from BCR repertoire analysis. The column chart on the upper panel displays the frequencies of residue mutation at
each position. The heatmap in the lower panel shows the frequencies of the 20 residues at each position. The dark colors indicate positive selection, while
the light colors indicate negative selection. (D) Diversity statistics of BCR repertoire, including Shannon–Weiner index, Gini-Simpson index, Shannon
evenness and Simpson evenness. (E) CDR3 length distribution of multiple BCR repertoires. (F) V gene usage distribution. (G-H) Dynamic VJ gene usage
of multiple BCR repertoires. The annotation will be displayed when users hover the mouse over the target. (I) The variation of clonotypes between two
BCR repertoires. The top 20 most variable clonotypes are displayed in colors. (J) The numbers of public clonotypes in multiple BCR repertoires. (K)
Heatmap of observed residues in human BCRs obtained from the OAS immune repertoire database. The query sequence is displayed at the bottom of the
heatmap for comparison. Residue frequencies for the human antibody are indicated by colors. Unusual residues are marked in grey.
W22 Nucleic Acids Research, 2023, Vol. 51, Web Server issue
can interact with Abalign in real-time with just a few clicks the MSA results. Abalign demonstrates the mutation pro-
of buttons. The results can be easily presented and exported files through colored blocks, indicating the increased or de-
as image or text files, with various customization options creased frequency of replacement mutations, as evidence of
available to users. antigen-driven positive or negative selection, respectively.
Users can gain insights into the contribution of residue mu-
Input format tations to affinity maturation. Figure 3C provides an exam-
ple of the mutation profile of COVID-19 patients.
Abalign accepts nucleotide or protein sequence files in the
FASTA format. To ensure accuracy, we recommend per-
forming an error-correction process for BCR repertoire se- Diversity statistics
quencing data with unique molecular identifiers (UMIs) us- The quantification of BCR repertoire diversity is informa-
ultrafast MSA. The platform enables intensive computa- 4. Melchers,F. (2015) Checkpoints that control B cell development. J.
tion on personal computers, eliminating the need for high- Clin. Invest., 125, 2203–2210.
5. Mikocziova,I., Greiff,V. and Sollid,L.M. (2021) Immunoglobulin
performance computing clusters. Abalign’s unique feature germline gene variation and its impact on human disease. Genes
of ultrafast MSA produces results consistent with well- Immun., 22, 205–217.
established antibody numbering schemes, enabling it to 6. Hou,X. L., Wang,L., Ding,Y. L., Xie,Q. and Diao,H. Y. (2016)
seamlessly connect with a broad range of BCR data anal- Current status and recent advances of next generation sequencing
ysis. Users can perform various analyses, including clonal techniques in immunological repertoire. Genes and immunity, 17,
153–164.
grouping, lineage tree construction, mutation profiling, di- 7. Lindau,P. and Robins,H.S. (2017) Advances and applications of
versity statistics, VJ gene assignment, repertoire compari- immune receptor sequencing in systems immunology. Current
son and more, without the need for programming scripts. Opinion in Systems Biology, 1, 62–68.
Overall, Abalign represents a powerful and accessible tool 8. Song,L., Ouyang,Z., Cohen,D., Cao,Y., Altreuter,J., Bai,G., Hu,X.,
post-analysis of T cell receptor repertoires. PLoS Comput. Biol., 11, immunoglobulin and T cell receptor genes. Nucleic Acids Res., 33,
e1004503. D256–D261.
26. Zhang,W., Du,Y., Su,Z., Wang,C., Zeng,X., Zhang,R., Hong,X., 46. Edgar,R.C. and Batzoglou,S. (2006) Multiple sequence alignment.
Nie,C., Wu,J., Cao,H. et al. (2015) IMonitor: a robust pipeline for Curr. Opin. Struct. Biol., 16, 368–373.
TCR and BCR repertoire analysis. Genetics, 201, 459–472. 47. Pei,J., Kim,B.-H. and Grishin,N.V. (2008) PROMALS3D: a tool for
27. Margreitter,C., Lu,H.-C., Townsend,C., Stewart,A., multiple protein sequence and structure alignments. Nucleic Acids
Dunn-Walters,D.K. and Fraternali,F. (2018) BRepertoire: a Res., 36, 2295–2300.
user-friendly web server for analysing antibody repertoire data. 48. Zhou,C.L.E. (2015) CombAlign: a code for generating a one-to-many
Nucleic Acids Res., 46, W264–W270. sequence alignment from a set of pairwise structure-based sequence
28. Cortina-Ceballos,B., Godoy-Lozano,E.E., Sámano-Sánchez,H., alignments. Source Code Biol. Med., 10, 9.
Aguilar-Salgado,A., Velasco-Herrera,M.D.C., Vargas-Chávez,C., 49. Dunbar,J., Krawczyk,K., Leem,J., Baker,T., Fuchs,A., Georges,G.,
Velázquez-Ramı́rez,D., Romero,G., Moreno,J., Téllez-Sosa,J. et al. Shi,J. and Deane,C.M. (2014) SAbDab: the structural antibody
(2015) Reconstructing and mining the B cell repertoire with database. Nucleic Acids Res., 42, D1140–D1146.
C The Author(s) 2023. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution-NonCommercial License
(http://creativecommons.org/licenses/by-nc/4.0/), which permits non-commercial re-use, distribution, and reproduction in any medium, provided the original work
is properly cited. For commercial re-use, please contact journals.permissions@oup.com