1 Introduction
Oncogenic DNA viruses establish long-term persistent intracellular

description. The deep learning network uses EBV integration sites as positive samples to learn features from, while it also uses negative samples that do not contain EBV integration sites as backgrounds. After evaluating samples of a series of lengths (Supplementary Note 4), EBV
dataset.
3.2 The performance of DeepEBV could be improved by adding genomic features
Because the insertion of DNA viruses into the human genome may be influenced by local surrounding genomic features (Xu et al., 2019), adding these features may significantly improve the performance of the DeepEBV model. Therefore, 9 genomic features in three subgroups were tested by adding them to the EBV integration sequences as a training dataset for DeepEBV (Supplementary Figure 2): (1) genomic content features: deoxyribonuclease (DNase) Clusters, RepeatMasker and Fragile sites; (2) epigenetic features: CpG islands, GeneHancer and ChIP-seq (H3K4Me3 and H3K27ac); and (3) mutation-related features: Cons 20 Mammals and TCGA Pan-Cancer (sources are recorded in Supplementary Table 6). The genomic feature sample extraction principles are described in detail in Supplementary Note 3, and the tuning strategies were the same for both positive and negative samples to avoid potential overfitting issues (Supplementary Table 7). First, the sequences with positions of genomic features on the hg38 reference genome (sources are listed in Supplementary Table 6) were downloaded and cut or extended to 2k bp. Second, the 2k bp samples that overlapped with original positive (negative) EBV integration sequences were labelled as positive (negative). Next, the labelled samples were mixed with the original EBV integration sequences to form a training dataset (details are described in the “model training” section of the Methods). Once a subgroup performed well, its genomic features were split and retested to determine which one(s) influenced the model performance. The performance of the 9 genomic features in the pretest on the internal test dataset of dsVIS (without augmentation or TSS simulation) is recorded in Supplementary Table 8. From the pretest results, we observed that two important features ranked by AUROC were Repeat peaks (AUROC = 0.67) and TCGA Pan-Cancer (AUROC = 0.7).

Furthermore, these two genomic features were integrated into the DeepEBV model trained with EBV integration sequences + aug. On the internal test dataset drawn from the dsVIS training dataset, we found that adding Repeat peaks increased the AUROC (from 0.79 to 0.84) and the AUPR (from 0.57 to 0.66) (Figure 2d). On the external test dataset VISDB, both features improved the performance: (1) Repeat peaks (AUROC: from 0.76 to 0.79; AUPR: from 0.48 to 0.54) and (2) TCGA Pan-Cancer (AUROC: from 0.76 to 0.80; AUPR: from 0.48 to 0.55). The results suggested that the model with 2-fold TSS simulation and 2-fold augmentation plus Repeat peaks greatly improved EBV integration site prediction compared to the original EBV integration sequences.

To better understand the relationship between Repeat and EBV integration sites, a Chi-square test (Supplementary Table 9) was performed and showed a significant difference in the distances to Repeat elements between the positive and the negative samples in both the training datasets (p < 0.001) and the VISDB testing datasets (p < 0.001). Furthermore, two positive EBV integration sites with correct predictions of the final DeepEBV model (chr6: 90,295,174 and chr3: 171,080,920, both located within a 2k bp distance of the nearest Repeat elements) were successfully
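The feature-sample construction described in this section (cut or extend each genomic feature to 2k bp, then label it by overlap with the original positive or negative EBV integration sequences) can be illustrated with a minimal Python sketch. It is not the authors' implementation; the actual extraction principles are given in Supplementary Note 3. The names `Interval`, `resize_to_window` and `label_feature_windows` are hypothetical. Centring the 2k bp window on the feature midpoint, giving positives precedence when a window overlaps both classes, and discarding windows that overlap neither class are all assumptions made here for illustration.

```python
from dataclasses import dataclass

WINDOW = 2000  # 2k bp sample length, as stated in the text


@dataclass(frozen=True)
class Interval:
    chrom: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


def resize_to_window(iv, window=WINDOW):
    """Cut or extend an interval to exactly `window` bp.

    Assumption: the window is centred on the interval midpoint and
    shifted right if it would start before position 0. Chromosome
    ends are not checked in this sketch.
    """
    mid = (iv.start + iv.end) // 2
    start = max(0, mid - window // 2)
    return Interval(iv.chrom, start, start + window)


def overlaps(a, b):
    """True if two half-open intervals share at least one base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def label_feature_windows(features, positives, negatives):
    """Label resized feature windows by overlap with EBV samples.

    Returns (window, label) pairs with label 1 for positive and 0 for
    negative. Assumptions: positives take precedence, and windows that
    overlap neither class are dropped.
    """
    labelled = []
    for feat in features:
        win = resize_to_window(feat)
        if any(overlaps(win, p) for p in positives):
            labelled.append((win, 1))
        elif any(overlaps(win, n) for n in negatives):
            labelled.append((win, 0))
    return labelled
```

Under these assumptions, a 300 bp feature is extended and a 6,000 bp feature is cut, so every labelled sample has the same 2k bp length as the original integration sequences before the two sets are mixed for training.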
Bioinformatics Page 6 of 8