Genome analysis
Abstract
Motivation: A growing number of noncoding genetic variants, including single-nucleotide polymorphisms
(SNPs), are found to be associated with complex human traits and diseases. Their mechanistic
interpretation is relatively limited and can benefit from computational prediction of their effects on
epigenetic profiles. However, current models often focus on local, 1D genome sequence determinants and
disregard global, 3D chromatin structure that critically affects epigenetic events.
Results: We find that the unexpectedly high similarity in epigenetic profiles among some noncoding
variants, relative to their low similarity in local sequences, can be largely attributed to their proximity
in chromatin structure. Accordingly, we have developed a multimodal deep learning scheme that incorporates
both 1D genome sequence and 3D chromatin structure data for predicting noncoding variant effects. Specifically,
we have integrated convolutional and recurrent neural networks for sequence embedding and graph
neural networks for structure embedding despite the resolution gap between the two types of data, while
utilizing recent DNA language models. Numerical results show that our models outperform competing
sequence-only models in predicting epigenetic profiles and that their use of long-range interactions complements
sequence-only models in extracting regulatory motifs. They prove to be excellent predictors for noncoding
variant effects in gene expression and pathogenicity, whether in unsupervised “zero-shot” learning or
supervised “few-shot” learning.
Availability: Code and data access can be found at https://github.com/Shen-Lab/ncVarPred-1D3D
Contact: yshen@tamu.edu
Supplementary information: Supplementary data are available at Bioinformatics online.
in 3D chromosome structure whereas such 3D global perspective of DNA sequences is important to
regulatory mechanisms and functional annotation of noncoding variants (Klemm et al., 2019; D’haene and
Vergult, 2021). Technologies such as Hi-C (Lieberman-Aiden et al., 2009) and ChIA-PET (Fullwood et al.,
2009) reveal chromatin interactions such as long-range intra- and inter-chromosome enhancer–promoter
interactions (Zhang et al., 2013; Dai et al., 2018; Chen et al., 2020) controlling gene expression.
In this study we fill the gap by incorporating both 1D local sequences and 3D global structures into
predicting epigenetic impacts of noncoding variants. The approach of multimodal data integration is
currently lacking for the topic, although some recent methods predicted how noncoding variant sequences
can change chromatin conformations (Trieu et al., 2020; Zhou, 2022). Specifically, we first statistically
test the hypothesis underlying the approach and verify that the disparity between the levels of sequence
(feature) similarity and epigenetic profile (label) similarity can be attributed to chromatin
interactions. With the rationale verified, we then introduce a series of models combining the embedding
of 1D local sequences (through CNN, RNN, and transformer) and the embedding of 3D global structures /
interaction intensity graphs (through various graph neural networks, or GNN), while overcoming the
challenges from modality difference (texts versus graphs) and resolution difference (1-bp versus
100K-bp). We show that our models outperform DeepSEA and DanQ in predicting epigenetic profiles of genome
sequences on held-out chromosomes, and we analyze the (in)sensitivity of the improvements with regard to
the cell line and the resolution associated with the input Hi-C data. We further examine our models’
interpretability in extracting regulatory motifs and find that long-range interactions in chromatin
structure data enhance such interpretability. Lastly, we demonstrate the utility of our models for
predicting noncoding variant effects in gene expression (eQTL) and pathogenicity, using model-predicted
epigenetic profile changes as zero-shot or few-shot predictors of variant effects.
similarity profiles, respectively; and performed one-sided Kolmogorov–Smirnov (K-S) tests to compare the
interaction frequencies of the “under” (or “over”) set and those of the “consistent” set (the null
hypothesis being that the former are below (or above) the latter). The process was repeated 100 times and
the frequencies of p-values falling below 0.05 were collected. More details can be found in Supplementary
Sec. S2.
2.3 Machine Learning for Predicting Epigenetic Profiles
Architecture. As shown in the schematic illustration in Figure 1, our models take two inputs, a 1D local
sequence of one kilobase and a 2D global interaction-frequency matrix of the whole genome. The two types
of inputs are separately encoded with multi-layer neural networks, except in our most advanced versions
(to be detailed next), to address a resolution gap: whereas local DNA sequences are available at every
base, global chromosome interactions are typically among regions of 100 kilobases. The resulting
embeddings are concatenated and fed to a fully-connected layer with 919 neurons and sigmoid activation
functions that outputs the probability of each of the 919 epigenetic events.
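The fusion step described under Architecture in Section 2.3 (two separately computed embeddings, concatenated and passed to a 919-unit sigmoid layer) can be sketched in plain Python as follows. This is a minimal illustration, not the authors' implementation: the embedding sizes, the random initialization, and the names `init_head` and `predict_events` are assumptions made for the example; only the concatenation, the 919 outputs, and the sigmoid come from the text.

```python
import math
import random

N_EVENTS = 919  # number of epigenetic events predicted (from the text)


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def init_head(seq_dim, struct_dim, n_out=N_EVENTS, seed=0):
    """Randomly initialize a fully-connected layer (hypothetical helper)."""
    rng = random.Random(seed)
    in_dim = seq_dim + struct_dim
    scale = 1.0 / math.sqrt(in_dim)
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(n_out)]
    bias = [0.0] * n_out
    return weights, bias


def predict_events(seq_emb, struct_emb, weights, bias):
    """Concatenate the two embeddings and apply the FC layer with sigmoids."""
    fused = list(seq_emb) + list(struct_emb)  # concatenation of modalities
    return [sigmoid(sum(w * x for w, x in zip(row, fused)) + b)
            for row, b in zip(weights, bias)]


# Toy usage with made-up embedding sizes (8 for sequence, 4 for structure).
weights, bias = init_head(seq_dim=8, struct_dim=4)
probabilities = predict_events([0.1] * 8, [0.2] * 4, weights, bias)
```

Each of the 919 outputs is an independent probability, which matches a multi-label setting where several epigenetic events can co-occur for the same sequence.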
followed by ReLU activation. (2) GCN (topology only): Each kilobase sample was treated as a node in a
graph whose node features were simply all-one vectors of 768 dimensions and whose edge weights were the
corresponding normalized interaction frequencies. The graph was encoded through a graph convolutional
network, or GCN (Kipf and Welling, 2016), of 3 layers (latent dimensions being 1000, 400, and 128,
respectively). Whereas MLPs can capture global relationships across the chromatin, GNNs including GCNs
better capture graph structures and their sparsity. We note that most of the Hi-C data used in the study
have less than 10% nonzero frequencies. (3) GCN (sequence+topology): We additionally introduced DNA
sequences to the GCN (topology only) encoder by replacing the all-one node features with mean-pooled
regional sequence embeddings from a pretrained DNA language model, DNABERT (Ji et al., 2021).
Specifically, we used DNABERT to embed each 100K-bp region with a 500-bp sliding window and a 250-bp
stride by averaging the 768-dimensional embeddings of all windows. For each window, we used 497 tokens
(495 6-mers plus special tokens CLS for sentence class and SEP as sentence separator), obtained a 768 by
497 dimensional embedding, and used the row average to mitigate the impact of the starting position.
More details about model architectures are in Supplementary Sec. S3.1.
Training. We used binary cross entropy with L1 and L2 regularization as the training loss and optimized
hyperparameters including learning rate, the two regularization coefficients, and the choice of replicate
per cell line (that provides chromatin structure to encode) over combinations of options. For each
hyperparameter combination, part of the models were warm-started with DNA sequence encoders (CNN or
CNN/RNN) initialized at the values from pretrained sequence-only models (DeepSEA or DanQ) and trained for
up to 40 epochs with early stopping (patience being 4 epochs). The optimal hyperparameters were chosen
based on validation loss. In particular, replicates ENCFF014VMM, ENCFF928NJV, and ENCFF013TGD were chosen
for cell lines GM12878, IMR90, and K562, respectively, in CNN+MLP and fixed in our other models. Once
hyperparameters were tuned, each model was trained 5 times with the cold-started portion randomly
initialized, and the means and standard deviations of its performances were based on the 5 repeats. More
details about model training, including values of tuned hyperparameters, can be found in the
Supplementary Sec. S3.2.
2.4 Interpreting the Learned Epigenetic Predictors
Following DanQ (Quang and Xie, 2016), we examined the regulatory significance of the 320 kernels in the
first convolutional layer for sequence embedding (kernel size being 8 and 26 bases for CNN and CNN/RNN,
respectively). Specifically, for each kernel, the contiguous 8 or 26 bases with the largest activation
value were selected for each test sequence and a position-specific frequency matrix (profile) was
accordingly built over all test sequences. We queried each of the 320 model-learned profiles against
HOCOMOCO (v11 core) (Kulakovskiy et al., 2018), a dataset of human transcription-factor binding motifs,
using the tool TOMTOM (Bailey et al., 2015) with a cutoff E < 0.01.
2.5 Noncoding Variant Effect Prediction: eQTL
We used the trained epigenetic predictors as featurizers to predict noncoding variant effects on
expression levels, that is, to predict eQTL (expression quantitative trait loci). We collected cell line
specific eQTL data from the Genotype-Tissue Expression project (GTEx Analysis v7): 3845 entries for
GM12878 and 11091 for IMR90 using the filter of q-value no greater than 0.05, labeled with whether gene
expression levels increase or decrease. These two sets of data were relatively label balanced: positive
(increase) rates were 49.6% for GM12878 and 49.8% for IMR90. For each entry, trained epigenetic
predictors (sequence-based DeepSEA and DanQ as well as our sequence and cell line specific
structure-based models, all described above) calculated the differences in predicted probabilities
(between the wild type and variants) and the differences in predicted log odds for cell-line epigenetic
events (91 for GM12878 and 5 for IMR90 among 919). Using the predicted differences as features, we held
out 10% of the data for hyperparameter tuning and used the rest for 10-fold cross validation when
building logistic regression models with L2 regularization (scikit-learn). Through 10-fold cross
validation AUROC over the 10% held-out data, we chose the differences in log odds as features and tuned
the values for the inverse of regularization strength (Supplementary Sec. S7). As each epigenetic
predictor (such as CNN+MLP) was trained five times and used as a fixed encoder, the corresponding
supervised eQTL classifier was accordingly trained five times to calculate the average AUROC and AUPRC
over 10 folds of the 90%. In the end, we report the mean and the standard deviation of the average
performances for each model.
2.6 Noncoding Variant Effect Prediction: Pathogenicity
We used ncVarDB (Biggs et al., 2020), a manually curated noncoding variant data set, to assess our
models’ performance in identifying pathogenic variants. Out of 721 pathogenic variants and 7226 benign
variants in ncVarDB, 713 pathogenic and 7134 benign variants were retained as the test set (positive
rate: 9.08%), 20 variants as the validation set, and 80 variants as the source of the training set that
grows from 0 to 80 in increments of 10, simulating zero-shot (unsupervised) to few-shot (supervised)
learning. In the case of zero-shot unsupervised learning where no ncVarDB data is used, we directly used
the absolute difference in the predicted epigenetic probabilities (finally chosen for the better
performance) or the difference in log odds, between the wild-type/reference and the alternate/variant
sequences, as a pathogenicity indicator. In the case of few-shot supervised learning, we used the
pre-trained epigenetic predictors to build a Siamese neural network: two identical genome
sequence-structure encoders, one for wild-type/reference and one for variant/alternative sequences,
whose squared differences of the 919 epigenetic probabilities, magnitude-only and differentiable, are fed
to a 919D-to-1D FC layer followed by a sigmoid activation function, to predict whether the given variant
is pathogenic or benign. The resulting pathogenicity classifier was trained end to end (unlike the eQTL
classifier where a single encoder is used and fixed), using 10, 20, ..., 80 training samples, with all
parameters warm-started except those of the last FC layer. The loss function was balanced cross entropy
with L2 regularization whose weights were tuned from 1E-12 to 1E-8 log-uniformly and optimized at 1E-11
with validation loss. As each type of epigenetic predictor (such as CNN+MLP) was trained five times, the
corresponding pathogenicity classifier was accordingly initialized and trained five times before an
average performance is reported.
3 Results
In this section, we first statistically assess the hypothesis that the inconsistency between genome
sequence similarity and epigenetic profile similarity can be partly attributed to the chromatin
structure. With the hypothesis verified, we adopt it as the rationale of our approach and start with the
simplest version of models where chromatin structures are encoded with fully connected layers. We show
that this straightforward way to introduce chromatin structures already outperforms the competing
sequence-only models, and confirm that the model improves the performance by correcting the bias from
genome sequence alone using chromatin structure. We also examine the (in)sensitivity of the model to the
resolution and the cell line of the input chromatin structure data. We further evaluate the performances
of the more advanced versions where chromatin interactions are modeled as graphs and region sequences
bioRxiv preprint doi: https://doi.org/10.1101/2022.12.20.521331; this version posted December 21, 2022. The copyright holder for this preprint
(which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made
available under a CC-BY-NC 4.0 International license.
4 Tan and Shen
as nodes are embedded by DNA language models. These models are more accurate in epigenetic profile
prediction than competing sequence-only models, and they are explainable in extracting regulatory motifs.
Importantly, as demonstrated in eQTL and pathogenicity prediction, these models are well-performing
unsupervised zero-shot predictors of noncoding variant effects, and they can be flexibly fine-tuned into
broadly applicable and accurate few-shot predictors, using few labeled data and no feature engineering.
similarity did not see more significance as the disparity increased. The observations agree with
biological knowledge: spatial proximity helps explain high profile similarity unexpected from low
sequence similarity, as it implies similar chromosome access for epigenetic factors in transcription or
modification; but spatial remoteness does not necessarily explain the opposite disparity, as it does not
necessarily determine dissimilar chromosome access.
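The one-sided two-sample K-S comparison mentioned in Section 2.2 can be illustrated with the short sketch below. It is an assumption-laden illustration rather than the authors' procedure: the function names are hypothetical, the p-value uses the standard large-sample approximation exp(-2 D² nm/(n+m)) instead of an exact distribution, and which direction corresponds to the “under” or “over” sets is left to the caller.

```python
import math


def ecdf(sample, t):
    """Fraction of values in `sample` that are <= t."""
    return sum(1 for v in sample if v <= t) / len(sample)


def ks_one_sided(x, y):
    """Return (D_plus, approx_p) for the alternative that x tends to be smaller than y.

    D_plus = max_t [F_x(t) - F_y(t)] is large when the empirical CDF of x lies
    above that of y, i.e. when x is shifted toward smaller values.
    The p-value is a large-sample approximation and is crude for small samples.
    """
    pooled = sorted(set(x) | set(y))
    d_plus = max(0.0, max(ecdf(x, t) - ecdf(y, t) for t in pooled))
    n, m = len(x), len(y)
    p_value = math.exp(-2.0 * d_plus ** 2 * n * m / (n + m))
    return d_plus, p_value


def rejection_rate(p_values, alpha=0.05):
    """Fraction of repeated tests whose p-value falls below alpha."""
    return sum(1 for p in p_values if p < alpha) / len(p_values)


# Toy usage: interaction frequencies in `low` are all below those in `high`.
low = [1, 2, 3, 4]
high = [5, 6, 7, 8]
d_stat, p_approx = ks_one_sided(low, high)
```

Collecting the p-values from repeated runs and passing them to `rejection_rate` mirrors the reported summary, namely how often the p-value falls below 0.05.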
Table 1. Epigenetic profile prediction assessed in AUPRC (Area Under the Precision-Recall Curve), whose baseline value for random classifiers is 0.02. Our
models intentionally used the same neural network architectures for DNA local 1D sequence embedding as in DeepSEA or DanQ. Thus their additional introduction
of chromatin global 3D structure embedding led to much improved and robust performances.
(1) Performances using chromatin structure data from the cell line GM12878 / IMR90 / K562, respectively.
(2) All-one node features for 100K-bp regions.
(3) DNABERT-encoded node features for 100K-bp regions.
Model   Sequence embedding   Structure embedding           AUPRC
DanQ    CNN+RNN (biLSTM)     N/A                           0.371 (Quang and Xie, 2016) or 0.358 (reproduced)
Ours    CNN+RNN (biLSTM)     MLP (topology only)           0.379 ± 0.002 / 0.384 ± 0.003 / 0.381 ± 0.001 (1)
Ours    CNN+RNN (biLSTM)     GCN (topology only) (2)       0.389 ± 0.004 / 0.389 ± 0.005 / 0.390 ± 0.002 (1)
Ours    CNN+RNN (biLSTM)     GCN (topology+sequence) (3)   0.395 ± 0.003 / 0.398 ± 0.001 / 0.393 ± 0.001 (1)
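The two GCN structure encoders listed in Table 1 and described in Section 2.3 can be illustrated with the plain-Python sketch below. It shows the window arithmetic for the DNABERT-based node features (500-bp windows, 250-bp stride, 6-mers plus two special tokens) and a single graph-convolution layer in the style of Kipf and Welling (2016). The function names, the toy graph, the handling of region boundaries, and the symmetric normalization with self-loops are assumptions for illustration; the exact normalization of interaction frequencies used by the authors is not specified here.

```python
import math


def window_starts(region_len=100_000, window=500, stride=250):
    """Start offsets of sliding windows that fit entirely inside the region."""
    return list(range(0, region_len - window + 1, stride))


def tokens_per_window(window=500, k=6, n_special=2):
    """Overlapping k-mers in one window plus the special tokens (CLS, SEP)."""
    return (window - k + 1) + n_special


def mean_pool(vectors):
    """Element-wise average of equally sized embedding vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def normalize_adjacency(adj):
    """Symmetric normalization D^-1/2 (A + I) D^-1/2 of a weighted adjacency."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(norm_adj, features, weight):
    """One graph-convolution layer: ReLU(norm_adj @ features @ weight)."""
    hidden = matmul(matmul(norm_adj, features), weight)
    return [[max(0.0, v) for v in row] for row in hidden]


# Toy graph of 3 regions with all-one node features (2-D stand-in for 768-D).
adj = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
features = [[1.0, 1.0] for _ in adj]
weight = [[1.0, 0.0], [0.0, 1.0]]
out = gcn_layer(normalize_adjacency(adj), features, weight)
```

Under these assumptions a 100K-bp region yields 399 windows of 497 tokens each, consistent with the token count stated in the text; replacing the all-one `features` with `mean_pool` outputs gives the sequence+topology variant.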
the Supplementary Sec. S5.4 shows that a mismatched cell line’s chromatin structure input did not
negatively impact the performance for an average epigenetic event, but it did for those events related to
the cell line whose chromatin structure was used during training. Nevertheless, either way our basic
CNN+MLP performed better than sequence-only DeepSEA when tested with a mismatched cell line’s chromatin
structure input.
models to predict noncoding variant effect on gene expression (increase or decrease).
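The variant featurization described in Section 2.5 (differences in predicted probabilities and in predicted log odds between wild-type and variant sequences) can be sketched as follows. The sign convention (variant minus wild type), the clipping constant `eps`, and the function names are assumptions for illustration; the predicted probabilities would come from a trained epigenetic predictor, which is not reproduced here.

```python
import math


def log_odds(p, eps=1e-7):
    """Log odds of a probability, clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def variant_features(ref_probs, alt_probs):
    """Per-event differences between variant and wild-type predictions.

    Returns two lists of equal length: differences in probabilities and
    differences in log odds (variant minus wild type, by assumption).
    """
    prob_diff = [a - r for r, a in zip(ref_probs, alt_probs)]
    logit_diff = [log_odds(a) - log_odds(r)
                  for r, a in zip(ref_probs, alt_probs)]
    return prob_diff, logit_diff


# Toy usage with two epigenetic events instead of the cell-line subset of 919.
prob_diff, logit_diff = variant_features([0.5, 0.2], [0.5, 0.8])
```

Either feature vector can then be passed to an L2-regularized logistic regression, as the text describes; the log-odds difference is more sensitive to changes near probabilities of 0 or 1.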
improvements were consistent across model versions or Hi-C data used. The more advanced CNN/RNN+GCN
(with DNABERT) showed even more consistency across Hi-C data choices. Numerical values on the
unsupervised performances are in the Supplementary Sec. S8.1.
Fig. 7: Test performances (AUROC and AUPRC) of zero-shot or few-shot pathogenicity predictors, using (a)
CNN+MLP and (b) CNN/RNN+GCN (with DNABERT), improve as the size of the training set increases.
Our noncoding variant effect predictors are broadly applicable to substitutions, insertions, and
deletions across various genome positions. In fact, state-of-the-art predictors CADD (Rentzsch et al.,
2021), DANN (Quang et al., 2015), and FATHMM-XF (Rogers et al., 2018) failed to make predictions for 535
of the 7847 test variants, including those on chromosomes X, Y, and M as well as some substitutions
across various genome positions, which makes it impossible to compare performances fairly. This was in
contrast to our models, as detailed in the Supplementary Sec. S8.2.
4 Conclusion
Toward noncoding variant effect prediction, we have built multi-modal machine learning models that use
both genome sequences and chromatin structures to predict epigenetic profiles, whereas previous methods
only used genome sequences. To rationalize the approach, we have verified that there is often a disparity
between sequence similarity and epigenetic-profile similarity and that such disparity can be explained
by, thus mitigated by, spatial proximity. Accordingly, we have used the combination of various neural
networks to embed local 1D kilobase sequences (CNN and CNN/RNN) and global 3D chromosome structures
represented in 2D graphs (MLP and GCN), which are available in different resolutions, and concatenated
the embeddings to feed epigenetic prediction layers. We have additionally introduced a version where the
two embedding processes are coupled, by using a pretrained DNA language model to embed 100K-long bases as
node features before embedding the 2D graphs of chromatin structures. Compared to sequence-only methods,
our models robustly improve the accuracy of epigenetic prediction and complement the explainability in
extracting regulatory motifs.
For noncoding variant effect prediction, these pre-trained models are used directly as unsupervised
zero-shot predictors or incorporated into a Siamese neural network as supervised few-shot predictors, as
demonstrated in eQTL and pathogenicity prediction. Without manual feature engineering, our models are
shown to be fine-tunable for various effects and broadly applicable to variants of various types at
various genome positions.
Acknowledgement
This project was in part supported by the National Institute of General Medical Sciences (R35GM124952 to
Y.S.). Portions of this research were conducted with the advanced computing resources provided by Texas
A&M High Performance Research Computing.
References
Bailey, T. L. et al. (2015). The MEME suite. Nucleic Acids Research, 43(W1), W39–W49.
Biggs, H. et al. (2020). ncVarDB: a manually curated database for pathogenic non-coding variants and benign controls. Database, 2020.
Boeva, V. (2016). Analysis of genomic sequence motifs for deciphering transcription factor binding and transcriptional regulation in eukaryotic cells. Frontiers in Genetics, 7, 24.
Chen, C.-H. et al. (2020). Determinants of transcription factor regulatory range. Nature Communications, 11(1), 1–15.
Dai, Y. et al. (2018). Multiple transcription factors contribute to inter-chromosomal interaction in yeast. BMC Systems Biology, 12(8), 67–76.
D’haene, E. and Vergult, S. (2021). Interpreting the impact of noncoding structural variation in neurodevelopmental disorders. Genetics in Medicine, 23(1), 34–46.
ENCODE Project Consortium, T. (2012). An integrated encyclopedia of DNA elements in the human genome. Nature, 489(7414), 57.
Frydas, A. et al. (2022). Uncovering the impact of noncoding variants in neurodegenerative brain diseases. Trends in Genetics, 38(3), 258–272.
Fu, Y. et al. (2014). FunSeq2: a framework for prioritizing noncoding regulatory variants in cancer. Genome Biology, 15(10), 1–15.
Fullwood, M. J. et al. (2009). An oestrogen-receptor-α-bound human chromatin interactome. Nature, 462(7269), 58–64.
Ji, Y. et al. (2021). DNABERT: pre-trained bidirectional encoder representations from transformers model for DNA-language in genome. Bioinformatics, 37(15), 2112–2120.
Kelley, D. R. et al. (2016). Basset: learning the regulatory code of the accessible genome with deep convolutional neural networks. Genome Research, 26(7), 990–999.
Kipf, T. N. and Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.
Klemm, S. L. et al. (2019). Chromatin accessibility and the regulatory epigenome. Nature Reviews Genetics, 20(4), 207–220.
Koohy, H. et al. (2013). Chromatin accessibility data sets show bias due to sequence specificity of the DNase I enzyme. PLoS ONE, 8(7), e69853.
Kulakovskiy, I. V. et al. (2018). HOCOMOCO: towards a complete collection of transcription factor binding models for human and mouse via large-scale ChIP-seq analysis. Nucleic Acids Research, 46(D1), D252–D259.
Lieberman-Aiden, E. et al. (2009). Comprehensive mapping of long-range interactions reveals folding principles of the human genome. Science, 326(5950), 289–293.
Lim, J. and Kim, K. (2019). Genetic variants differentially associated with rheumatoid arthritis and systemic lupus erythematosus reveal the disease-specific biology. Scientific Reports, 9(1), 1–7.
Luo, Y. et al. (2020). New developments on the Encyclopedia of DNA Elements (ENCODE) data portal. Nucleic Acids Research, 48(D1), D882–D889.
Ngo, V. et al. (2019). Epigenomic analysis reveals DNA motifs regulating histone modifications in human and mouse. Proceedings of the National Academy of Sciences, 116(9), 3668–3677.
Ohnmacht, J. et al. (2020). Missing heritability in Parkinson’s disease: the emerging role of non-coding genetic variation. Journal of Neural Transmission, 127(5), 729–748.
Park, S. et al. (2020). Enhancing the interpretability of transcription factor binding site prediction using attention mechanism. Scientific Reports, 10(1), 1–10.
Quang, D. and Xie, X. (2016). DanQ: a hybrid convolutional and recurrent deep neural network for quantifying the function of DNA sequences. Nucleic Acids Research, 44(11), e107–e107.
Quang, D. et al. (2015). DANN: a deep learning approach for annotating the pathogenicity of genetic variants. Bioinformatics, 31(5), 761–763.
Rentzsch, P. et al. (2021). CADD-Splice—improving genome-wide variant effect prediction using deep learning-derived splice scores. Genome Medicine, 13(1), 1–12.
Ritchie, G. R. et al. (2014). Functional annotation of noncoding sequence variants. Nature Methods, 11(3), 294–296.
Roadmap Epigenomics Consortium, T. et al. (2015). Integrative analysis of 111 reference human epigenomes. Nature, 518(7539), 317–330.
Rogers, M. F. et al. (2018). FATHMM-XF: accurate prediction of pathogenic point mutations via extended features. Bioinformatics, 34(3), 511–513.
Trieu, T. et al. (2020). DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biology, 21(1), 1–11.
van Ouwerkerk, A. F. et al. (2019). Identification of atrial fibrillation associated genes and functional non-coding variants. Nature Communications, 10(1), 1–14.
Zaheer, M. et al. (2020). Big Bird: Transformers for longer sequences. Advances in Neural Information Processing Systems, 33, 17283–17297.
Zhang, F. and Lupski, J. R. (2015). Non-coding genetic variants in human disease. Human Molecular Genetics, 24(R1), R102–R110.
Zhang, Y. et al. (2013). Chromatin connectivity maps reveal dynamic promoter–enhancer long-range associations. Nature, 504(7479), 306–310.
Zhou, J. (2022). Sequence-based modeling of three-dimensional genome architecture from kilobase to chromosome scale. Nature Genetics, 54(5), 725–734.
Zhou, J. and Troyanskaya, O. G. (2015). Predicting effects of noncoding variants with deep learning–based sequence model. Nature Methods, 12(10), 931–934.