Genome analysis
DeepEBV: A deep learning model to predict Epstein-Barr virus (EBV) integration sites
Jiuxing Liang1,#, Zifeng Cui2,#, Canbiao Wu1,#, Yao Yu3,4,#, Rui Tian5, Hongxian Xie6, Zhuang Jin2, Weiwen Fan2, Weiling Xie2, Zhaoyue Huang2, Wei Xu2, Jingjing Zhu2, Zeshan You2, Xiaofang Guo7, Xiaofan Qiu1, Jiahao Ye1,8, Bin Lang9, Mengyuan Li2,*, Songwei Tan10,* and Zheng Hu2,11,*
1Key Laboratory of Brain, Cognition and Education Sciences, Ministry of Education, China; Institute for Brain Research and Rehabilitation, South China Normal University, Guangzhou 510631, China, 2Department of Gynaecological Oncology, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China, 3Department of Urology, The First Medical Center of Chinese PLA General Hospital, Beijing 100853, China, 4School of Medicine, Nankai University, Tianjin 300071, China, 5Center for Translational Medicine, The First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510080, Guangdong, China, 6STech Company Bio-X Lab, Zhuhai 519000, Guangdong, China, 7Department of Medical Oncology of the Eastern Hospital, the First Affiliated Hospital, Sun Yat-sen University, Guangzhou 510700, China, 8School of Computer Science, South China Normal University, Guangzhou 510631, China, 9School of Health Sciences and Sports, Macao Polytechnic Institute, China, 10School of Pharmacy, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, China, 11Department of Obstetrics and Gynaecology, Tongji Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430030, Hubei, China
*To whom correspondence should be addressed.
#These authors contributed equally to this work and should be regarded as joint first authors.
Associate Editor: Pier Luigi Martelli
Received on XXXXX; revised on XXXXX; accepted on XXXXX
Abstract
Motivation: Epstein-Barr virus (EBV) is one of the most prevalent oncogenic DNA viruses. The integration of EBV into the host genome has been reported to play an important role in cancer development. The preference of EBV integration shows a strong dependence on the local genomic environment, which enables the prediction of EBV integration sites.
Results: An attention-based deep learning model, DeepEBV, was developed to predict EBV integration sites by learning local genomic features automatically. DeepEBV was first trained and tested on data from the dsVIS database; the model trained on EBV integration sequences plus Repeat peaks with 2-fold data augmentation performed best on the training dataset, and its performance was confirmed in an independent validation dataset. The results also showed that the motifs of DNA-binding proteins can influence the selection preference of viral insertional mutagenesis and that DeepEBV can predict EBV integration hotspot genes accurately. In summary, DeepEBV is a robust, accurate and explainable deep learning model, providing novel insights into EBV integration preferences and mechanisms.
Availability: DeepEBV is available as open-source software and can be downloaded from https://github.com/JiuxingLiang/DeepEBV.git
Contact: huzheng1998@163.com (Z.H.), tansw@hust.edu.cn (S.T.), mengyuan.li96@outlook.com (M.L.)
Supplementary information: Supplementary data are available at Bioinformatics online.
1 Introduction
Oncogenic DNA viruses establish long-term persistent intracellular infections and can induce malignancy by providing a selective growth advantage to host cells (Moore et al., 2010). Epstein-Barr virus (EBV) is one of the first described human cancer viruses and is associated with up to 10 types of cancer, including Burkitt lymphoma, nasopharyngeal carcinoma (NPC) (Arvey et al., 2012, Xu et al., 2019), Hodgkin lymphoma, natural killer/T (NK/T) cell lymphoma (Lu et al., 2011, Peng et al., 2019) and a subset of gastric carcinomas (Iizasa et al., 2012, Nishikawa et al., 2018). The integration of EBV DNA into the human genome has been reported to make important contributions to cancer development (Chakravorty et al., 2019, Peng et al., 2019). EBV can integrate into specific tumor suppressor and inflammation-related genes such as PARK2, CDK15 and TNFAIP3, which are involved in TNF-alpha-induced apoptosis/NF-κB pathway regulation (Peng et al., 2019). EBV integration into these genes can destroy gene function, dysregulate TNF-alpha-induced apoptosis/NF-κB pathways and promote cancer development (Chakravorty et al., 2019). In addition, by integrating into the DNA repair-related gene NHEJ1, EBV can impair the function of this gene and the related DNA repair pathway, leading to genomic instability of the host cell and malignant transformation. To date, studies on EBV integration are increasing but remain limited (Xiao et al., 2016, Xu et al., 2019), and the detailed mechanisms of EBV integration and its preference remain to be elucidated.
Deep learning is a branch of artificial intelligence and a state-of-the-art prediction method in computational biology (Cruz et al., 2006, Hu et al., 2019, Koohi-Moghadam et al., 2019). The convolutional neural network (CNN) enables deep learning models to learn abstract features with translation invariance automatically (Aghdam et al., 2017, Deeplearning.net, 2020, Zhang et al., 1990). However, a CNN is a double-edged sword: it can produce an accurate prediction through an opaque intermediate process, which is often regarded as a black-box effect (Zhang et al., 2018). This phenomenon makes it difficult to explain inside-model behaviours such as the detailed processing of abstract features. Fortunately, the attention mechanism can partially open the black box by using an extra neural network to calculate a weight for each position of the input sequence (Guidotti et al., 2018).
In this study, we developed DeepEBV, an attention-based deep learning model, to predict EBV integration sites accurately. DeepEBV can automatically extract patterns with translation invariance. The model uses only DNA sequences (denoted as EBV integration sequences) as input to predict EBV integration sites, and the attention mechanism highlights positions with potentially important biological meaning to support the prediction.

2 Methods

2.1 Data preparation
The DeepEBV model was trained and tested with our database of EBV integration sites (http://dsvis.wuhansoftware.com). To improve the confidence of our datasets, integration sites located in repetitive regions of the EBV genome were filtered out, as previous studies suggested (Xiao et al., 2016, Xu et al., 2019). A total of 1288 unique integration sites (neutral: 76; tumor: 760; unknown: 452) were involved (Supplementary Table 1). Detailed step-by-step instructions for the DeepEBV model are provided in Supplementary Notes 1 and 2, including the model structure (Supplementary Figure 1) and its mathematical description. The deep learning network uses EBV integration sites as positive samples to learn features from, while it also uses negative samples that do not contain EBV integration sites as backgrounds. After evaluating samples of a series of lengths (Supplementary Note 4), EBV integration sequences of length 2 kb were used as input samples because they had the largest AUROC (0.67) and lowest loss value (0.92) among all tested lengths (Supplementary Table 2). For each EBV integration site, the 1 kb upstream and 1 kb downstream sequences were taken and denoted as a positive sample. In addition, 2 kb sequences located at least 50 kb away from positive samples were randomly selected from the hg38 reference genome as negative samples.
Each sample was denoted as S = (n1, n2, …, n2000), where ni represents the nucleotide at position i. Extracted DNA sequences were encoded as a one-hot code, considering model performance and robustness: each nucleotide was coded as a binary vector of length 4, with each dimension representing one of the four nucleotides. A 2 kb DNA sequence is therefore converted into a 2000 × 4 binary matrix.
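As an illustration of this encoding step, the sketch below is our own minimal example rather than the released DeepEBV code; the helper name one_hot_encode and the A/C/G/T column order are assumptions made for the illustration.

```python
import numpy as np

# Column order A, C, G, T is an illustrative assumption; any fixed order
# works as long as it is applied consistently to all samples.
NT_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seq):
    """Convert a DNA string (e.g. 2000 bp) into a (len(seq), 4) binary matrix.
    Ambiguous bases such as 'N' are left as all-zero rows."""
    matrix = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq.upper()):
        column = NT_INDEX.get(base)
        if column is not None:
            matrix[i, column] = 1.0
    return matrix

toy_seq = "ACGT" * 500                 # a toy 2 kb sequence
print(one_hot_encode(toy_seq).shape)   # (2000, 4)
```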
2.2 Feature extraction
The DeepEBV model employs convolution and pooling modules to learn eigensequences surrounding EBV integration sites (Supplementary Figure 1). The features of the input binary matrix were extracted by 3 consecutive convolution layers, which contained 64, 64 and 128 convolution kernels, respectively.
Multiple convolution kernels were activated to obtain different eigenvalues in each convolution layer. The convolution calculation can be described as:

X_{k,i} = \sum_{j=1}^{m} \sum_{l=1}^{4} W_{k,j,l} E_{i+j-1,l}   (1)

In this formula, k = 1, …, d, where d refers to the number of kernels; i = 1, …, n - m + 1, where i refers to the position index; m refers to the kernel size; n refers to the input sequence length; W refers to the kernel weights; and E denotes the one-hot binary matrix of a specific DNA sequence S.
After extracting the relative eigenvectors, a rectified linear unit (ReLU) was adopted in the convolution layers to mimic the activation of real neurons, which enables better data fitting in the model. ReLU is an activation function in artificial neural networks that can be described as ReLU(x) = max(0, x). After the convolutional layers, each activated element was mapped onto a sparse matrix. Then, a max-pooling strategy was applied to reduce dimensionality while retaining as much predictive information as possible, which reduces the amount of computation and improves efficiency at the same time. At this point, the input binary matrix has been abstracted by convolution and pooling into the eigenmatrix X.
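As a concrete reference point, here is a minimal Keras sketch of a convolution-pooling feature extractor of the kind described above (three Conv1D layers with 64, 64 and 128 kernels, ReLU activations and max pooling). It is our illustration, not the published DeepEBV configuration: kernel sizes, pool sizes and layer names are assumptions, and the exact hyperparameters are given in Supplementary Note 1.

```python
from tensorflow.keras import layers, models

def build_feature_extractor(seq_len=2000, n_channels=4):
    """Convolution-pooling module in the spirit of Section 2.2: three Conv1D
    layers with 64, 64 and 128 kernels and ReLU activations, each followed by
    max pooling. Kernel and pool sizes here are illustrative assumptions."""
    inputs = layers.Input(shape=(seq_len, n_channels))
    x = layers.Conv1D(64, kernel_size=8, activation="relu")(inputs)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(64, kernel_size=8, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(128, kernel_size=8, activation="relu")(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    return models.Model(inputs, x, name="conv_pool_module")

feature_extractor = build_feature_extractor()
feature_extractor.summary()   # final output: (batch, reduced_length, 128) eigenmatrix X
```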
2.3 Attention mechanism in the DeepEBV model
To capture and understand the contribution of each position in the abstracted eigenmatrix X, an attention mechanism layer was added to DeepEBV. The attention layer calculates a contribution score e_j for each position j of X; these scores represent the importance that the neural network assigns to each position, and a larger score means a larger contribution of that position to EBV integration site prediction. All contribution scores were normalized, and the weighted eigenvectors were summed into the dense context vector A:

a_j = \exp(e_j) / \sum_{k} \exp(e_k)   (2)

A = \sum_{j} a_j X_j   (3)

where a_j represents the normalized score and X_j represents the eigenvector at position j of the input eigenmatrix; each position corresponds to an eigenvector extracted across the convolution kernels.
The model prediction has to integrate the convolution-pooling module and the attention mechanism module, which means that the eigenvectors and their relative importance scores work together in the prediction. Thus, we employed dense layers that link the values in the eigenmatrix together and linearly map them to a value v:

v = dense(flatten(X))   (4)

In this step, the flatten operation reduces dimensionality and concatenates the data, and the dense layer maps the dimension-reduced data to a single value. Then, the concatenation of v and A enters a linear classifier to calculate the probability that an EBV integration occurs within the current sequence:

\hat{y} = \sigma(v ⊕ A)   (5)

where \hat{y} is the predicted score, \sigma represents the activation function that acts as the classifier in the final output, and ⊕ represents the concatenation operation.
Meanwhile, if we take the output eigenmatrix from the convolution-and-pooling module as input and execute the attention mechanism, the weight vector e can be obtained:

e = Att(X)   (6)

where Att refers to the attention mechanism, X_j denotes the eigenvector at the j-th position of the eigenmatrix, and e contains the contribution scores of each position in the eigenmatrix extracted by the convolution-and-pooling module.
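To make the flow from Eqs. (2)-(6) concrete, here is a minimal Keras sketch of an attention-plus-classification head of this kind. It is our own illustration, not the published DeepEBV architecture: the dense sizes, the stand-in convolution block and all layer names are assumptions, and the exact configuration is described in Supplementary Note 1.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def attention_head(eigenmatrix):
    """Attention and classification head sketched after Eqs. (2)-(6).
    `eigenmatrix` is the (positions, channels) output X of the conv-pool module."""
    scores = layers.Dense(1)(eigenmatrix)               # e_j: one score per position (Eq. 6)
    weights = layers.Softmax(axis=1)(scores)             # a_j: softmax normalization (Eq. 2)
    context = layers.Lambda(
        lambda t: tf.reduce_sum(t[0] * t[1], axis=1),
        name="context_vector")([eigenmatrix, weights])   # A = sum_j a_j * X_j (Eq. 3)
    v = layers.Dense(1)(layers.Flatten()(eigenmatrix))   # v = dense(flatten(X)) (Eq. 4)
    merged = layers.Concatenate()([v, context])
    y_hat = layers.Dense(1, activation="sigmoid")(merged)  # sigmoid classifier (Eq. 5)
    return y_hat, weights

inputs = layers.Input(shape=(2000, 4))
x = layers.Conv1D(64, kernel_size=8, activation="relu")(inputs)   # stand-in conv-pool module
x = layers.MaxPooling1D(pool_size=4)(x)
y_hat, attention_weights = attention_head(x)
model = models.Model(inputs, y_hat, name="deepebv_like_sketch")
```

Keeping attention_weights accessible (for example as a second model output) is what later allows the per-position weights to be mapped back to genome coordinates, as in Section 3.3.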
2.4 DeepEBV model training
DeepEBV parameters were set according to the instructions in Supplementary Note 1 and Supplementary Table 3. A binary cross-entropy loss function was then applied to train the DeepEBV deep neural network; the loss function of this model is defined as:

loss = -[y \log(\hat{y}) + (1 - y) \log(1 - \hat{y})]   (7)

where \hat{y} represents the prediction score and y represents the binary label of the sequence (in this dataset, positive samples were labeled 1 and negative samples were labeled 0). The back-propagation algorithm was used during training, and the Nesterov-accelerated adaptive moment estimation (Nadam) gradient descent algorithm was applied to optimize the parameters.
The deep learning neural network model was implemented in Python 3.7 with the Keras library 2.2.4 (Chollet, 2015), using three NVIDIA Tesla V100-PCIE-32G GPUs (NVIDIA Corporation, California, USA) for training and testing. Under this software and hardware setting, DeepEBV takes approximately 90 min for model training and 30 s for testing.
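Continuing the sketch above, compiling and fitting such a model with binary cross-entropy and the Nadam optimizer might look as follows. The learning rate, batch size and epoch count are illustrative assumptions, and the arrays are toy stand-ins rather than the dsVIS data.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras.optimizers import Nadam

# Binary cross-entropy loss (Eq. 7) with the Nadam optimizer described above.
model.compile(optimizer=Nadam(learning_rate=0.001),
              loss="binary_crossentropy",
              metrics=[tf.keras.metrics.AUC(name="auroc")])

# Toy stand-in data with the expected shapes: (samples, 2000, 4) and (samples, 1).
X_train = np.random.rand(64, 2000, 4).astype("float32")
y_train = np.random.randint(0, 2, size=(64, 1)).astype("float32")
model.fit(X_train, y_train, epochs=1, batch_size=16, verbose=0)
```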
3 Results

3.1 DeepEBV predicts EBV integration sites effectively with data augmentation
The DeepEBV model is described in Figure 1, with the scheme for converting a 2 kb input sequence into a binary matrix (Figure 1a) and a brief model structure recording the matrix dimensions in each layer (Figure 1b).

Fig. 1. The deep learning framework implemented in DeepEBV. (a) Scheme of encoding 2000 bp DNA sequences into a 2000 × 4 binary matrix; (b) a brief flow chart of the DeepEBV framework; a detailed version is presented in Supplementary Figure 1.

The DeepEBV model was trained and tested with our database of EBV integration sites (http://dsvis.wuhansoftware.com). A total of 1288 EBV integration site samples were divided into 913 and 375 samples as the positive training dataset and internal test dataset, respectively. Considering the balance between positive and negative samples and the natural imbalance between integration and non-integration sites, the number of negative samples was set to 10 times the number of positive samples (3750 in the test dataset), as suggested by a previous study to avoid false positive predictions (Hu et al., 2019) (Supplementary Table 4). Meanwhile, translation was adopted among the possible augmentation approaches to optimize the model: augmented samples were selected 500 bp upstream and downstream of the middle 1 kb of the 2 kb EBV integration sequences to obtain the 2-fold augmented training dataset, and the augmented samples were then mixed with the original samples as the positive dataset used to train and test the model. Of note, in the training dataset we also added 2-fold simulated integration data based on 512 transcription start sites (TSSs) that were reported to be integrated by EBV (Cao et al., 2015, Chakravorty et al., 2019, Luo et al., 2004, Takakuwa et al., 2004, Takakuwa et al., 2005, Xiao et al., 2016, Xu et al., 2019); the sites located 2 kb upstream and downstream of these TSSs were included as positive samples. The results showed that the augmentation (AUROC = 0.67) and TSS simulation (AUROC = 0.79) strategies both improved the performance of the model relative to the original EBV integration sequences (AUROC = 0.61) (Supplementary Table 5).
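The translation augmentation can be pictured as re-extracting the 2 kb window at shifted offsets so that the integration site stays within the central 1 kb. The sketch below is one plausible reading of that procedure; it is our illustration, with a hypothetical function name and 0-based coordinates, not the released pipeline.

```python
import random

def extract_windows(chrom_seq, site, window=2000, shift=500):
    """Return the original 2 kb window centred on an integration site plus
    windows shifted 500 bp up- and downstream, so the site stays inside the
    middle of every sample (one reading of the translation strategy above)."""
    half = window // 2
    samples = []
    for offset in (0, -shift, +shift):
        start = site - half + offset
        end = start + window
        if 0 <= start and end <= len(chrom_seq):
            samples.append(chrom_seq[start:end])
    return samples

# Toy example on a random sequence
genome = "".join(random.choice("ACGT") for _ in range(10000))
windows = extract_windows(genome, site=5000)
print([len(w) for w in windows])   # [2000, 2000, 2000]
```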
Furthermore, the performance of DeepEBV using the 2-fold TSS simulation strategy was compared with that of DeepHINT (Hu et al., 2019), a deep learning approach for predicting HIV integration sites, on both the un-augmented and augmented data of the internal dsVIS test dataset. Briefly, the DeepEBV model gave an AUROC of 0.79 and an AUPR of 0.54 on the un-augmented training dataset and an AUROC of 0.79 and an AUPR of 0.57 on the 2-fold augmented training dataset. With DeepHINT, an AUROC of 0.71 and an AUPR of 0.45 were achieved on the un-augmented training dataset, while an AUROC of 0.73 and an AUPR of 0.46 were achieved on the 2-fold augmented training dataset (Figure 2a).
Furthermore, the DeepEBV model was validated on the external virus integration site database VISDB (Tang et al., 2019) as an independent test dataset and further compared with DeepHINT using both the un-augmented and augmented strategies. Generally, DeepEBV with the un-augmented strategy showed an AUROC of 0.75 and an AUPR of 0.46 on the validation dataset, while DeepEBV with the 2-fold augmented strategy achieved an AUROC of 0.76 and an AUPR of 0.48. With DeepHINT, an AUROC of 0.74 and an AUPR of 0.45 were obtained on the validation dataset with the un-augmented strategy, while an AUROC of 0.74 and an AUPR of 0.46 were achieved using the 2-fold augmented strategy (Figure 2b). Therefore, DeepEBV with the 2-fold augmented strategy (DeepEBV with EBV integration sequences + aug) demonstrated comparable or slightly better performance than DeepHINT on both the training and independent validation datasets.
Fig. 2. ROC and PR curves of: (a) DeepEBV with EBV integration sequences, with/without the augmented strategy, tested on the dsVIS test dataset; (b) DeepEBV with EBV integration sequences, with/without the augmented strategy, validated on the VISDB test dataset; (c) DeepEBV with EBV integration sequences + genomic features + augmented strategy, tested on the dsVIS test dataset; (d) DeepEBV with EBV integration sequences + genomic features + augmented strategy, validated on the VISDB test dataset.
3.2 The performance of DeepEBV could be improved by adding genomic features
Because the insertion of DNA viruses into the human genome may be influenced by the local surrounding genomic features (Xu et al., 2019), adding these features may significantly improve the performance of the DeepEBV model. Therefore, 9 genomic features, organized as three subgroups, were tested by adding them to the EBV integration sequences in the training dataset for DeepEBV (Supplementary Figure 2): (1) genomic content features: deoxyribonuclease (DNase) clusters, RepeatMasker and fragile sites; (2) epigenetic features: CpG islands, GeneHancer and ChIP-seq (H3K4Me3 and H3K27ac); and (3) mutation-related features: Cons 20 Mammals and TCGA Pan-Cancer (sources are recorded in Supplementary Table 6). The genomic feature sample extraction principles are described in detail in Supplementary Note 3, and the tuning strategies were the same for both positive and negative samples to avoid potential overfitting (Supplementary Table 7). First, the sequences with the positions of genomic features on the hg38 reference genome (sources in Supplementary Table 6) were downloaded and cut or extended to 2 kb. Second, the 2 kb samples that overlapped with the original positive (negative) EBV integration sequences were labelled as positive (negative), as sketched below. Next, the labelled samples were mixed with the original EBV integration sequences as a training dataset (details are described in the "model training" section of the Methods). Once a subgroup performed well, its genomic features were split and retested to determine which one(s) influenced the model performance.
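A minimal sketch of that overlap labelling, using plain interval tuples instead of the actual BED-style inputs; this is our illustration and not part of the DeepEBV release.

```python
def overlaps(a_start, a_end, b_start, b_end):
    """True if two half-open genomic intervals on the same chromosome overlap."""
    return a_start < b_end and b_start < a_end

def label_feature_windows(feature_windows, positive_windows, negative_windows):
    """Assign labels to 2 kb genomic-feature windows by overlap with the
    original positive/negative EBV integration sequences.
    Windows are (chrom, start, end) tuples; illustrative sketch only."""
    labelled = []
    for chrom, start, end in feature_windows:
        if any(c == chrom and overlaps(start, end, s, e) for c, s, e in positive_windows):
            labelled.append((chrom, start, end, 1))
        elif any(c == chrom and overlaps(start, end, s, e) for c, s, e in negative_windows):
            labelled.append((chrom, start, end, 0))
    return labelled

print(label_feature_windows(
    [("chr6", 90294000, 90296000)],
    [("chr6", 90294174, 90296174)],
    [("chr3", 5000000, 5002000)]))   # [('chr6', 90294000, 90296000, 1)]
```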
The performance of the 9 genomic features in a pretest on the internal dsVIS test dataset (without augmentation or TSS simulation) is recorded in Supplementary Table 8. From the pretest results, the two most important features ranked by AUROC were Repeat peaks (AUROC = 0.67) and TCGA Pan Cancer (AUROC = 0.7).
These two genomic features were then integrated into the model of DeepEBV with EBV integration sequences + aug. In the internal test dataset of the dsVIS training data, we found that adding Repeat peaks increased the AUROC (from 0.79 to 0.84) and the AUPR (from 0.57 to 0.66) (Figure 2d). In the external test dataset VISDB, both features improved the performance: (1) Repeat peaks (AUROC: from 0.76 to 0.79; AUPR: from 0.48 to 0.54) and (2) TCGA Pan Cancer (AUROC: from 0.76 to 0.80; AUPR: from 0.48 to 0.55). These results suggested that the model with 2-fold TSS simulation and 2-fold augmentation plus Repeat peaks greatly improved EBV integration site prediction compared with the original EBV integration sequences.
To better understand the relationship between Repeat elements and EBV integration sites, a Chi-square test (Supplementary Table 9) was performed and showed a significant difference in the distances to Repeat elements between the positive and negative samples in both the training dataset (p < 0.001) and the VISDB test dataset (p < 0.001). Furthermore, two positive EBV integration sites correctly predicted by the final DeepEBV model (chr6:90,295,174 and chr3:171,080,920, both located within 2 kb of the nearest Repeat element) were successfully validated using traditional PCR and Sanger sequencing (Supplementary Figure 3).
We then compared our model with DeepHINT plus genomic features and found that DeepHINT had weaker performance when tuned with genomic features (Figure 2c-d and Supplementary Table 10). DeepHINT showed no improvement when adding Repeat peaks (internal dsVIS test dataset: AUROC 0.68, AUPR 0.40; VISDB independent test dataset: AUROC 0.68, AUPR 0.38) or TCGA Pan Cancer peaks (internal dsVIS test dataset: AUROC 0.62, AUPR 0.34; VISDB independent test dataset: AUROC 0.67, AUPR 0.36). Therefore, DeepHINT performed best when using EBV integration sites plus 2-fold augmentation without any added genomic features (internal dsVIS test dataset: AUROC 0.73, AUPR 0.46; VISDB independent test dataset: AUROC 0.74, AUPR 0.46). In contrast, DeepEBV with EBV integration sites, 2-fold augmentation and Repeat peaks achieved the highest AUROC (0.84) and AUPR (0.66) on the dsVIS test dataset and an AUROC of 0.79 and AUPR of 0.55 on the VISDB independent test dataset. These results showed that DeepEBV with genomic features surpassed DeepHINT in predicting EBV integration sites.

3.3 Essential sequence elements were identified for EBV integration site selection preference
Deep learning models have a black-box effect owing to their complex structure, which makes it difficult to use intermediate data for further analysis (Brouillette, 2020, Rudin, 2019). The development of the attention mechanism enables the extracted features to be highlighted inside deep learning models.
The DeepEBV model reshapes the input data matrix from 2000 × 4 to 218 × 128 via multiple convolution and pooling operations. Thus, depooling and deconvolution were applied to reshape the attention weights into a 662 × 1 matrix, in which each weight value represents the attention weight of 3 nucleotides. The performance of the model with and without the attention layer was compared, and the model with the attention layer was found to perform comprehensively better than the model without it (Supplementary Table 11).
The sites with the top 5% of attention weight scores were defined as attention-intensive sites, and the 10 bp regions around them were defined as attention-intensive regions (Tian et al., 2020) (see the sketch below). Attention-intensive sites inside the CNN were mapped to the hg38 reference genome together with an illustration of known genomic features that may be related to EBV integration (Figure 3a-b).
The motifs of mammalian DNA-binding proteins near attention-intensive sites were then analyzed using HOMER (Heinz et al., 2010). The input was the 10 bp DNA sequences near attention-intensive sites derived from the DeepEBV model with EBV integration sequences plus Repeat features. The top 10 known binding motifs and de novo motifs, with P values, backgrounds and targets, are displayed in Figure 3c, and all HOMER results are listed in Supplementary Figure 4. Specific enriched motifs of DNA-binding proteins might have a close relationship with EBV integration, including signal transducer and activator of transcription 5 (STAT5), zinc finger protein 416 (ZNF416), and brain and muscle ARNT-like 1 (BMAL1). Details of these binding motifs are described in the Discussion.
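A small sketch of the top-5% selection step; the array size and the ±5 bp flank mirror the definitions above, but the function itself is our illustration, not part of the released code.

```python
import numpy as np

def attention_intensive_sites(weights, top_fraction=0.05, flank=5):
    """Pick positions whose attention weight is in the top 5% and return
    10 bp regions around them, mirroring the definition used above.
    `weights` is a 1-D array of per-position attention weights after
    mapping back to sequence coordinates."""
    weights = np.asarray(weights)
    cutoff = np.quantile(weights, 1.0 - top_fraction)
    sites = np.where(weights >= cutoff)[0]
    regions = [(max(0, s - flank), min(len(weights), s + flank)) for s in sites]
    return sites, regions

toy_weights = np.random.rand(662)      # e.g. the 662 x 1 attention vector
sites, regions = attention_intensive_sites(toy_weights)
print(len(sites), regions[:3])
```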
Fig. 3. Attention-intensive regions highlighted essential local genomic features for predicting EBV integration sites in the model of DeepEBV with EBV integration sequences + Repeat peaks. Representative examples show the positional relationship between the integration sites and several genomic features in (a) chr6:90295174-90297174 and (b) chr3:171080920-171082920, visualized in IGV (Robinson et al., 2011). "Attention Intensive Sites" denotes the sites with the top 5% attention weights. "Repeat", "TCGA Pan Cancer", "DNase Clusters", "Con 20 mammals", "fragile site", "CpG island", "GeneHancer", "H3K27ac ChIP-seq" and "H3K36me3 ChIP-seq" are genomic features. (d) The top 10 known and de novo motifs of DNA-binding proteins detected by HOMER using the attention-intensive regions from the output of DeepEBV with EBV integration sites plus Repeat peaks. Other motifs calculated by HOMER are recorded in Supplementary Figure 4.
3.4 DeepEBV can predict EBV integration hot-spot genes
Known EBV integration sites were annotated using ANNOVAR (Wang et al., 2010). After observing the DeepEBV prediction results, we hypothesized that a gene is likely to be a hot region of integration if it contains more than one predicted integration site. The known genes related to EBV integration were therefore separated into an experimental group and a control group to test the model performance in predicting recurrent genes of EBV integration. The experimental group contained all samples in the hot-spot gene list, while the control group contained only integration sites with no other integration events within 14 kb (14 kb is the rough average length of the hot-spot genes). In the dsVIS dataset, 98 hot-spot genes and 249 non-hot-spot integration sites were obtained, giving roughly a 1:2.5 ratio of positive to negative groups. Table 1 shows the evaluation of the DeepEBV models using different strategies for predicting EBV integration hot-spot genes on the dsVIS dataset. Four strategies were applied in DeepEBV: (1) EBV integration sites without augmentation, (2) EBV integration sites with 2-fold augmentation, (3) EBV integration sites with 2-fold augmentation + Repeats, and (4) EBV integration sites with 2-fold augmentation + TCGA Pan Cancer, all of which demonstrated comparable results (Figure 4). Although EBV integration sites with 2-fold augmentation + Repeats achieved the highest sensitivity of 85.71%, EBV integration sites without augmentation achieved the highest AUC of 0.82 with more balanced performance (sensitivity: 82.65% and specificity: 80.72%). In practical applications, the choice of model should depend on the situation. Together, these results indicate that DeepEBV with the above strategies can predict integration hot-spot genes well.

Table 1. Comparison of the models' ability to predict EBV integration hot-spot genes according to different statistical measures.

Source model | Sen. | Spe. | PPV | NPV | AUC
EBV integration sites without augmentation | 82.65% | 80.72% | 62.79% | 92.20% | 0.8169
EBV integration sites with 2-fold augmentation | 76.53% | 69.08% | 49.34% | 88.21% | 0.7280
EBV integration sites with 2-fold augmentation + Repeats | 85.71% | 49.40% | 40.00% | 89.78% | 0.6756
EBV integration sites with 2-fold augmentation + TCGA Pan Cancer | 82.65% | 67.47% | 50.00% | 90.81% | 0.7506

Definition of positivity: ≥2 predicted integration sites (score > 0.5) in a gene. The test was performed on 98 EBV integration hot-spot genes and 249 EBV integration sites with no other known EBV integration site within 14 kb. Sen, sensitivity; Spe, specificity; PPV, positive predictive value; NPV, negative predictive value; AUC, area under the curve.

Fig. 4. Distribution of integration hot-spot genes in the human genome. The outer circle shows the model-predicted scores of each EBV integration site involved in hot-spot genes across the 24 human chromosomes. Colored points represent the predicted scores calculated by the DeepEBV model with different strategies; a position with a higher score is more likely to be an EBV integration site (prediction cutoff score: 0.5). The inner circle shows the linkage of genome positions to related genes.
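The hot-spot definition used above (at least two predicted sites in a gene with a score above 0.5) is straightforward to express in code. The sketch below is our illustration of that rule, not part of the DeepEBV release, and the gene names in the toy example are placeholders.

```python
from collections import defaultdict

def call_hotspot_genes(predictions, score_cutoff=0.5, min_sites=2):
    """Hot-spot rule from Section 3.4 / Table 1: a gene is called a hot spot
    when it contains >= 2 predicted integration sites with a score above 0.5.
    `predictions` is an iterable of (gene, score) pairs for candidate sites."""
    per_gene = defaultdict(int)
    for gene, score in predictions:
        if score > score_cutoff:
            per_gene[gene] += 1
    return {gene for gene, n in per_gene.items() if n >= min_sites}

toy = [("PARK2", 0.91), ("PARK2", 0.62), ("NHEJ1", 0.55), ("TNFAIP3", 0.40)]
print(call_hotspot_genes(toy))   # {'PARK2'}
```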

4 Discussion
In this study, we developed an explainable attention-based deep learning model, DeepEBV, to predict EBV integration sites. We demonstrated that the performance of DeepEBV could be significantly improved by adding TSS simulation data and translation-based data augmentation. The performance could be further improved by adding the local genomic features UCSC genome Repeat and TCGA Pan Cancer; among these, adding Repeat gave the best performance, which was validated on the independent dataset VISDB. Repeat elements are considered to increase the possibility of integration for different viruses, including EBV (Peng et al., 2019, Xu et al., 2019). The underlying mechanism may be genome instability caused by Repeat elements (McIvor et al., 2010), leading to the generation of double-stranded breaks (DSBs); EBV may then integrate into the host genome through DNA repair pathways such as fork stalling and template switching (FoSTeS) and microhomology-mediated break-induced replication (MMBIR) (Xu et al., 2019, Zhang et al., 2009). Furthermore, DeepEBV identified specific binding motifs closely related to EBV integration preference and showed the ability to predict EBV integration hot-spot genes. These results provide new insights into EBV integration preferences and mechanisms.
The testing results of the DeepEBV model were also compared with 3 traditional machine learning methods (Supplementary Table 12): support vector machine (SVM), logistic regression (LR) and random forest (RF), on the internal dsVIS test dataset and the VISDB independent test dataset. Compared with these methods, DeepEBV exhibited better AUROC performance on both the internal dsVIS test dataset (DeepEBV = 0.84, SVM = 0.52, LR = 0.50 and RF = 0.57) and the VISDB independent test dataset (DeepEBV = 0.8, SVM = 0.52, LR = 0.47 and RF = 0.55).
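For reference, AUROC and AUPR values of the kind compared here can be computed from predicted scores and binary labels with scikit-learn; the sketch below uses toy values, not the reported results.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

# Toy labels and prediction scores for illustration only.
y_true = np.array([1, 0, 0, 1, 0, 1, 0, 0])
y_score = np.array([0.82, 0.10, 0.35, 0.67, 0.48, 0.91, 0.22, 0.05])

print("AUROC:", roc_auc_score(y_true, y_score))
print("AUPR:", average_precision_score(y_true, y_score))
```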
Data augmentation is a common solution in deep learning to compensate for a lack of samples and improve model generalization at the same time. Its strategies include flipping, rotation, scaling, cropping, translation and Gaussian noise (Shorten et al., 2019). Among these strategies, translation is suitable for DNA sequence data because the nucleotides in the genome are continuous: a slight movement of the sample-extracting window still captures the surroundings of the EBV integration site, so using translation to augment the training dataset is reasonable. Our results showed that data augmentation could improve the AUPR and sensitivity of DeepEBV, while adding genomic features could greatly improve the AUROC, AUPR and sensitivity of DeepEBV.
DeepEBV may reveal the biological background beneath the deep learning framework thanks to its attention mechanism. Each unit is given a specific score by the attention mechanism, with higher scores being more important in the prediction (Chollet, 2015). Owing to the translation invariance of CNNs, attention-intensive regions are more likely to be conserved. The prediction output of DeepEBV with EBV integration sites plus Repeat was used to examine this hypothesis because this model showed the best performance. Binding motifs of ZNF416, STAT5 and BMAL1 are the most noteworthy candidates enriched near the EBV integration sites. ZNF416 (found by the de novo HOMER strategy with a P value of 1e-32) is a C2H2 zinc finger factor whose expression is sustained in EBV-positive but not in EBV-negative B cell lines (Tune et al., 2002). The activation of STAT5 (found by the de novo HOMER strategy with a P value of 1e-225) may be both a necessary and a predisposing event for EBV-driven tumorigenesis in immunocompetent individuals (Chen et al., 2001). BMAL1 (found by the known-motif HOMER strategy with a P value of 1e-24) is a transcriptional activator that forms a core component of the circadian clock, and circadian rhythm disruption has been indicated as a risk factor for cancer development (He et al., 2017, Lahti et al., 2012). These motifs might provide important hints about EBV integration preference and warrant future experimental confirmation.
For further research, it is possible to use DeepEBV to screen additional whole-genome sequencing and virus capture sequencing data to predict potential EBV integration sites and to combine the predicted sites with known EBV integration sites to build a more comprehensive map of EBV insertional mutagenesis. In addition, EBV integration at different breakpoints might lead to different disease stages owing to the different genomic surroundings; thus, more attention should be paid to disease types when collecting EBV integration sites in the future.
In summary, DeepEBV is the first deep learning model to predict EBV integration sites; it can identify EBV integration hot-spot genes and local genomic elements, providing a new tool for research on the mechanism of EBV integration.

Funding
This work was supported by the National Science and Technology Major Project of the Ministry of Science and Technology of China [2018ZX10301402, 2018YFC2001600]; the National Natural Science Foundation of China [81761148025, 82001919, 81871473]; the Guangzhou Science and Technology Programme [201704020093]; the National Ten Thousands Plan for Young Top Talents; the Key-Area Research and Development Program of Guangdong Province [2019B03035001]; the General Program of the Natural Science Foundation of Guangdong Province of China [2021A1515012438]; the China Postdoctoral Science Foundation [2020M672995]; and the National Postdoctoral Program for Innovative Talent [BX20200398].
Conflict of Interest: none declared.

References
Aghdam, H. H., et al. (2017) Guide to Convolutional Neural Networks: A Practical Application to Traffic-Sign Detection and Classification, Springer International Publishing.
Arvey, A., et al. (2012) An atlas of the Epstein-Barr virus transcriptome and epigenome reveals host-virus regulatory interactions, Cell Host Microbe 12, 233-245.
Brouillette, M. (2020) Deep Learning Is a Black Box, but Health Care Won't Mind, MIT Technology Review.
Cao, S., et al. (2015) High-throughput RNA sequencing-based virome analysis of 50 lymphoma cell lines from the Cancer Cell Line Encyclopedia project, J Virol 89, 713-729.
Chakravorty, S., et al. (2019) Integrated pan-cancer map of EBV-associated neoplasms reveals functional host-virus interactions, Cancer Res 79, 6010-6023.
Chen, H., et al. (2001) Linkage between STAT regulation and Epstein-Barr virus gene expression in tumors, J Virol 75, 2929-2937.
Chollet, F., et al. (2015) Keras.
Cruz, J. A., et al. (2006) Applications of Machine Learning in Cancer Prediction and Prognosis, Cancer Informatics 2, 117693510600200030.
Deeplearning.net (2020) Convolutional Neural Networks (LeNet).
Guidotti, R., et al. (2018) A Survey of Methods for Explaining Black Box Models, arXiv:1802.01933 [cs.CY].
He, Q., et al. (2017) The Circadian Clock Gene BMAL1 and Ki-67 Protein Affect the Prognosis in Nasopharyngeal Carcinoma, International Journal of Radiation Oncology, Biology, Physics 99, E340.
Heinz, S., et al. (2010) Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities, Mol Cell 38, 576-589.
Hu, H., et al. (2019) DeepHINT: understanding HIV-1 integration via deep learning with attention, Bioinformatics 35, 1660-1667.
Iizasa, H., et al. (2012) Epstein-Barr Virus (EBV)-associated gastric carcinoma, Viruses 4, 3420-3439.
Koohi-Moghadam, M., et al. (2019) Predicting disease-associated mutation of metal-binding sites in proteins using a deep learning approach, Nature Machine Intelligence 1, 561-567.
Lahti, T., et al. (2012) Circadian clock disruptions and the risk of cancer, Annals of Medicine 44, 847-853.
Lu, J., et al. (2011) Epstein-Barr virus nuclear antigen 1 (EBNA1) confers resistance to apoptosis in EBV-positive B-lymphoma cells through up-regulation of survivin, Virology 410, 64-75.
Luo, W. J., et al. (2004) Epstein-Barr virus is integrated between REL and BCL-11A in American Burkitt lymphoma cell line (NAB-2), Lab Invest 84, 1193-1199.
McIvor, E. I., et al. (2010) New insights into repeat instability: role of RNA-DNA hybrids, RNA Biol 7, 551-558.
Moore, P. S., et al. (2010) Why do viruses cause cancer? Highlights of the first century of human tumour virology, Nat Rev Cancer 10, 878-889.
Nishikawa, J., et al. (2018) Clinical Importance of Epstein-Barr Virus-Associated Gastric Cancer, Cancers (Basel) 10, 167.
Peng, R. J., et al. (2019) Genomic and transcriptomic landscapes of Epstein-Barr virus in extranodal natural killer T-cell lymphoma, Leukemia 33, 1451-1462.
Rudin, C. (2019) Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead, Nature Machine Intelligence 1, 206-215.
Shorten, C., et al. (2019) A survey on Image Data Augmentation for Deep Learning, Journal of Big Data 6, 60.
Takakuwa, T., et al. (2004) Integration of Epstein-Barr virus into chromosome 6q15 of Burkitt lymphoma cell line (Raji) induces loss of BACH2 expression, Am J Pathol 164, 967-974.
Takakuwa, T., et al. (2005) Identification of Epstein-Barr virus integrated sites in lymphoblastoid cell line (IB4), Virus Res 108, 133-138.
Tang, D., et al. (2019) VISDB: a manually curated database of viral integration sites in the human genome, Nucleic Acids Res.
Tian, R., et al. (2020) DeepHPV: a deep learning model to predict human papillomavirus integration sites, Brief Bioinform.
Tune, C. E., et al. (2002) Sustained expression of the novel EBV-induced zinc finger gene, ZNFEB, is critical for the transition of B lymphocyte activation to oncogenic growth transformation, J Immunol 168, 680-688.
Wang, K., et al. (2010) ANNOVAR: functional annotation of genetic variants from high-throughput sequencing data, Nucleic Acids Res 38, e164.
Xiao, K., et al. (2016) Genome-wide Analysis of Epstein-Barr Virus (EBV) Integration and Strain in C666-1 and Raji Cells, J Cancer 7, 214-224.
Xu, M., et al. (2019) Genome-wide profiling of Epstein-Barr virus integration by targeted sequencing in Epstein-Barr virus associated malignancies, Theranostics 9, 1115-1124.
Zhang, F., et al. (2009) The DNA replication FoSTeS/MMBIR mechanism can generate genomic, genic and exonic complex rearrangements in humans, Nat Genet 41, 849-853.
Zhang, Q.-s., et al. (2018) Visual interpretability for deep learning: a survey, Frontiers of Information Technology & Electronic Engineering 19, 27-39.
Zhang, W., et al. (1990) Parallel distributed processing model with local space-invariant interconnections and its optical architecture, Appl Opt 29, 4790-4797.