
B. Tech.

Project Report Phase I

Predicting Splicing from Primary Sequences using Deep learning

Submitted in partial fulfillment of the requirements
for the award of the degree of Bachelor of Technology from IIT Guwahati

Under the supervision of

Dr. Kusum Kumari Singh

Submitted by

Yash Sharma
160106055

November 4, 2019
Department of Biotechnology
Indian Institute of Technology Guwahati
Guwahati 781039, Assam, INDIA

Certificate

This is to certify that the work presented in the report entitled “Predicting Splicing from
Primary Sequences using Deep learning” by Yash Sharma (160106055) represents original
work carried out under the guidance of Dr. Kusum Kumari Singh. This study has not been
submitted elsewhere for a degree.

Signature of student:

Date: 4/11/19
Place: IIT Guwahati Yash Sharma
160106055

Signature of supervisor I

Date: 4/11/19
Place: IIT Guwahati Dr Kusum K. Singh
Department of Biosciences and Bioengineering

Signature of HOD

Date:
Place: IIT Guwahati Head
Department of Biotechnology
Indian Institute of Technology Guwahati
Guwahati, India

Table of Contents

1. Abstract
2. Introduction
2.1 Splicing Mechanism
2.2 Alternative Splicing
2.3 Deep learning and Splicing
2.4 UPF3A and UPF3B
3. Literature Review
4. Objective
5. Methods and Materials
6. Results
7. Conclusion
8. Future Work
9. References

Abstract
Splicing is central to gene expression in humans, and understanding its mechanism is key to
interpreting the information encoded in the human genome. Many human diseases arise from the
absence or inappropriate expression of particular proteins, and misregulation of splicing
causes or modifies a range of these diseases. Because splicing defects are so prevalent in
human disease, a deep understanding and accurate prediction of splicing are necessary.
Developing a predictive splicing code is difficult owing to the vast number of elements and
features involved in the regulation of splicing and the high context specificity of the
mechanism. With the advent of deep learning, models can now predict with high accuracy how
often a given exon is included in or excluded from an RNA, and thus how frequently the
corresponding proteins are synthesized. As the accuracy of these algorithms continues to
improve, further marginal gains are expected to come from better formulation of the data
topology and a deeper understanding of the underlying biological phenomena. We aim to use
deep learning models to predict the splice sites of genes that play a prominent role in
mental diseases.

Introduction
Splicing is a stochastic, complex, context-specific, and highly regulated process. It is a
part of gene expression, the fundamental process by which a genotype produces an observable
trait, i.e., a phenotype. In eukaryotic cells, gene expression starts in the nucleus, where
transcription gives rise to precursor messenger RNAs (pre-mRNAs). Before they can be
translated, the pre-mRNAs undergo post-transcriptional processing, which includes 5′ end
capping, splicing, 3′ end polyadenylation, and RNA export.
During splicing, the intervening introns (non-coding sequences) are removed from the pre-mRNA,
and the exons (protein-coding sequences) are ligated to produce a mature mRNA that can be
translated into protein. Splicing is an essential part of gene expression and is responsible
for the high genome complexity of vertebrates. Around 90% of human genes contain multiple
exons, and tissue-wide RNA-Seq studies have revealed that 95% of these multi-exon genes undergo
alternative splicing, in which variation in splice junction usage within the same gene under
different conditions gives rise to many more proteins.
The splicing process is inherently stochastic yet far from random. Its stochastic behaviour
is apparent from the use of nonsense-mediated decay (NMD) to degrade defective transcripts,
while its non-random behaviour is evident from the fact that thousands of splicing changes
are conserved across tissues and developmental stages.
Splicing is carried out by the spliceosome, a massive complex containing five small nuclear
ribonucleoprotein particles (snRNPs) and many auxiliary proteins. Through a chain of
biochemical reactions, the major splicing signals are recognized, namely the 5′ and 3′ splice
sites, the polypyrimidine tract, and the branch point sequence (BPS), along with many other
regulatory sequence motifs bound by proteins that contain RNA recognition motifs (RRMs).

Splicing mechanism
Splicing occurs in several steps and is catalyzed by the snRNPs. First, U1 snRNP binds its
complementary sequence at the 5′ end of the intron, and U2 snRNP binds the conserved branch
point region. After U4, U5, and U6 join the complex, the 5′ end of the intron is cleaved and
attached to the branch point, forming a loop structure called the lariat (Figure 1). A second
transesterification reaction then cleaves the 3′ end of the intron and forms a covalent bond
between the adjoining exons, and the lariat is released along with the snRNPs U2, U5, and U6
bound to it.

Figure 1: Pre-mRNA splicing (Nature Publishing Group)

The detailed mechanism of spliceosome assembly and disassembly is complex, because the
auxiliary proteins that help catalyze splicing are highly dynamic in composition and
conformation, which gives the spliceosome both flexibility and accuracy.

Alternative Splicing
Alternative splicing is a process that enables a single gene to produce multiple mRNAs
(messenger RNAs) that direct the synthesis of diverse protein variants (isoforms), which can
have different properties or cellular functions. It occurs by rearranging the pattern of exon
and intron elements that are joined by splicing, thereby altering the coding sequence of the
mRNA. Consequently, alternatively spliced mRNAs are translated into proteins that have
different amino acid sequences and may differ in biological function. Most human genes are
alternatively spliced (around 95% of multi-exon genes).
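
As a toy illustration of how alternative splicing multiplies the number of possible
transcripts, the sketch below enumerates the isoforms of a hypothetical four-exon gene in
which the first and last exons are always kept and each internal exon may be skipped. The
exon names and sequences are invented for this example, and the helper `enumerate_isoforms`
is hypothetical. Exon skipping is only one mode of alternative splicing, and real genes do
not necessarily produce every combination.

```python
from itertools import combinations

def enumerate_isoforms(exons):
    """Return every exon-skipping isoform that keeps the first and last exon.

    `exons` is an ordered list of (name, sequence) pairs. Internal exons may
    be included or skipped, but exon order is always preserved.
    """
    first, internal, last = exons[0], exons[1:-1], exons[-1]
    isoforms = []
    # Choose k internal exons to retain, for every k from all down to none.
    for k in range(len(internal), -1, -1):
        for kept in combinations(internal, k):
            chosen = [first, *kept, last]
            names = "-".join(name for name, _ in chosen)
            mrna = "".join(seq for _, seq in chosen)
            isoforms.append((names, mrna))
    return isoforms

# Hypothetical exons; the sequences are placeholders, not real gene data.
toy_exons = [("E1", "ATGGCC"), ("E2", "GTTAAC"), ("E3", "CCGGAT"), ("E4", "TGCTAA")]

for names, mrna in enumerate_isoforms(toy_exons):
    print(names, mrna)
```

In this simplified model, a gene with n internal exons that can each be skipped yields 2^n
candidate isoforms, which gives a sense of why one gene can encode many protein variants.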

Deep learning and Splicing


Recent developments in deep learning research have made the prediction of splice sites far
easier than before. A deep neural network can be constructed to predict splicing from
arbitrary pre-mRNA transcripts accurately and efficiently. Deep learning models learn from
the data the relationships and patterns that determine how splicing works, so the features
that govern splicing do not have to be specified manually. With state-of-the-art
computational power and steadily improving algorithms, it is possible to build a model with
very high predictive accuracy. In some sense, these models quantify the process of splicing.
These deep learning models can also be used to predict cryptic splice sites, which is crucial
for revealing pathogenic variants in neuronal development. Cryptic splice sites give rise to
alternative splicing. In this project, the aim is to predict the splice sites of the UPF3A
and UPF3B genes, which play an essential role in neuronal development and whose misregulation
can lead to X-linked mental retardation.

UPF3A and UPF3B


UPF3A and UPF3B are antagonistic gene paralogs, and both are regulators of nonsense-mediated
RNA decay (NMD). UPF3A inhibits NMD, whereas UPF3B activates it. UPF3A suppresses NMD by
sequestering UPF2 away from the NMD machinery. NMD must be downregulated for certain crucial
developmental processes. UPF3B is located on the X chromosome, and mutations in it have been
implicated in various neurodevelopmental disorders such as autism, X-linked mental
retardation, and schizophrenia. Patients are found to have reduced UPF3B function because
missense mutations reduce the activity of the UPF3B protein in NMD. These mutations disturb
neuronal differentiation and reduce the complexity of neurite branching.

Literature Review
In earlier studies, a splicing ‘code’ was used for splice site prediction. The code is a set
of rules that enables the splicing pattern of a primary transcript to be predicted from its
sequence. The code predicts new splicing patterns, identifies regulatory programs specific to
different tissues, and identifies mutation-verified regulatory sequences. It facilitates the
discovery and detailed characterization of regulated alternative splicing on a genome-wide
scale.
A major contribution towards the splicing code was made by Yoseph Barash et al. in their
paper “Deciphering the splicing code.” Their research was based on the evidence that
tissue-dependent splicing is regulated by cis-acting RNA sequence motifs, trans-acting
factors, and other RNA features. They gathered various RNA features that had been identified
by other researchers [1,2,3,4,5] and constructed a compendium of 1014 features of four types:
new motifs, short motifs, known motifs, and features related to transcript structure. The
main challenge is that the regulatory activity of a particular motif may be known in
isolation, but its activity in proximity to other motifs is not. Their code recursively
selects features from the compendium to predict splicing. It takes a collection of exons with
their surrounding intronic sequences, together with profiling data, as input, and outputs the
probability of increased exon inclusion, increased exon exclusion, or no change. Their code
is able to predict the direction of splicing changes accurately, but its accuracy in
predicting splice sites was subpar.

Figure 2: The code extracts features from the sequence (only a small part of the intronic
sequence is used). (Source: Deciphering the splicing code, Barash et al.)

There are many other challenges to this approach. The method requires many RNA features and
splicing regulatory factors to be assembled by hand. These RNA features are very dynamic, and
many of them, especially those related to structure, are still unknown. A given set of
features produces different splicing patterns in different tissues, and a small change in the
surrounding sequence can change the whole splicing pattern of a gene. Another problem with
the previous research is its underappreciation of splicing regulatory factors deep inside
non-coding regions. Earlier studies extracted features from only a small part of the intron
and predicted splicing patterns from them, which makes the predictions less accurate: recent
studies have revealed that non-coding regions are key factors in gene regulation and account
for approximately 90% of the causal disease loci discovered in unbiased genome-wide
association studies of human complex diseases [6,7]. Cryptic splice variants, which play a
prominent role in rare genetic diseases [8], were also underappreciated. It is therefore of
biomedical interest to take the cryptic splice patterns of a gene into account.
Recent studies employ deep learning to build models that can predict splicing from primary
sequences. Deep learning is still new in the field of biology, where it has a wide range of
applications, such as the classification of cellular images, splice site prediction,
structure prediction, and genome analysis. The most recent of these studies, by Jaganathan
et al., introduced SpliceAI, a model consisting of 32 convolutional layers that predicts
splicing from pre-mRNA sequences.
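
To give a feel for how 32 convolutional layers can cover a long stretch of sequence, the
following sketch works through the receptive-field arithmetic for a stack of dilated
convolutions. The block configuration (kernel widths, dilation rates, four residual blocks per
group, two convolutions per block) is an assumption based on the published description of the
10,000-nucleotide SpliceAI variant, not something stated in this report, and the helper names
are hypothetical.

```python
# Assumed SpliceAI-10k configuration: (kernel width W, dilation rate D,
# number of residual blocks). Taken from the published architecture
# description, not from this report.
BLOCK_GROUPS = [(11, 1, 4), (11, 4, 4), (21, 10, 4), (41, 25, 4)]
CONVS_PER_BLOCK = 2  # each residual block stacks two dilated convolutions

def count_conv_layers(groups, convs_per_block=CONVS_PER_BLOCK):
    """Total number of dilated convolutional layers in the stack."""
    return sum(n_blocks * convs_per_block for _, _, n_blocks in groups)

def flanking_context(groups, convs_per_block=CONVS_PER_BLOCK):
    """Total flanking nucleotides (both sides) seen by one output position.

    A convolution of width W and dilation D widens the receptive field by
    D * (W - 1) positions, split evenly between the two sides.
    """
    return sum(n_blocks * convs_per_block * d * (w - 1) for w, d, n_blocks in groups)

layers = count_conv_layers(BLOCK_GROUPS)
context = flanking_context(BLOCK_GROUPS)
print(layers, "convolutional layers")
print(context, "nt of context, i.e.", context // 2, "nt on each side")
```

Under these assumptions the stack contains 32 convolutional layers and each output position
sees 5,000 nucleotides on either side, which is consistent with padding an input sequence
with 5000 N's per side before prediction.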

Figure 3: Training SpliceAI on pre-mRNA sequences and using it to identify cryptic splice
variants. (Source: Predicting splicing from primary sequence using deep learning,
Jaganathan et al.)

SpliceAI has achieved an accuracy of 84% in splice site prediction, as validated by RNA-Seq
and experimental data. Deep learning models for splicing prediction provide the tools to
overcome the challenges faced in previous studies. The model learns the parameters that
capture how sequence affects splicing patterns, and after training on a sufficiently large
and representative dataset it can achieve good accuracy. With this method, RNA features and
other regulatory factors do not have to be supplied manually; the model itself finds the
patterns in the data on which it is trained. SpliceAI also overcomes the limitation of
considering only a short window of the intronic region. It takes long stretches of the
surrounding non-coding sequence into account, which improves accuracy and allows the
mechanisms underlying many complex human diseases to be explored through the model’s
predictions. Its predictions have also revealed cryptic splice variants that were previously
overlooked. There are also some downsides to deep learning-based splicing prediction. Because
the model extracts features from the sequence automatically, it can use sequence determinants
that are not well described by human experts, but it may also incorporate features that do
not reflect the true behaviour of the spliceosome. This project uses SpliceAI to predict
splice sites in the UPF3A and UPF3B genes.

Objective
The broader aim of this project is to build a tissue-specific splice site prediction model.
As a step towards this, my immediate aim is to understand the working of the model and its
parameters. My objective is to predict splice sites for the UPF3A and UPF3B genes, which can
subsequently be verified experimentally, and to create a Jupyter notebook that makes the
prediction process easier for non-programmers in the field.

Methods and Materials
The model used for the prediction of splice sites is SpliceAI, which has been released by
Illumina on GitHub (https://github.com/Illumina/SpliceAI). The following method was used to
predict the splice sites from pre-mRNA sequences.
• The sequence was padded with 5000 N’s on either side.
• The padded sequence was then one-hot encoded.
• Predictions were made using the five models provided, and the delta score was calculated
by averaging their outputs using the formula given below.

Delta Score = np.mean([ann.models[m].predict(x_ref) for m in range(5)], axis=0)
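
The steps above can be sketched as follows. This is a minimal, self-contained illustration
rather than the exact notebook code: `pad_sequence`, `one_hot_encode`, `ensemble_predict`, and
`make_dummy_model` are hypothetical helpers written for this example, the A, C, G, T channel
order with N mapped to an all-zero row is an assumption, and the two stand-in "models" are
plain functions used in place of the five trained Keras networks so that the sketch runs
without the SpliceAI package.

```python
import numpy as np

CONTEXT = 10000  # total flanking context; 5000 N's are added on each side

def pad_sequence(seq, context=CONTEXT):
    """Pad the sequence with context // 2 N's on either side."""
    flank = "N" * (context // 2)
    return flank + seq.upper() + flank

def one_hot_encode(seq):
    """Encode A, C, G, T as one-hot rows; N (or any other symbol) as zeros.

    The channel order A, C, G, T is an assumption made for this sketch.
    """
    lookup = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seq), 4), dtype=np.float32)
    for i, base in enumerate(seq):
        if base in lookup:
            encoded[i, lookup[base]] = 1.0
    return encoded

def ensemble_predict(models, x):
    """Average the per-position outputs of several models (axis 0 = model)."""
    return np.mean([model(x) for model in models], axis=0)

def make_dummy_model(p_acceptor, p_donor, context=CONTEXT):
    """Stand-in for a trained network: returns constant (1, L, 3) scores."""
    def model(x):
        # Mimic a network whose output drops the padded flanks.
        length = x.shape[1] - context
        row = [1.0 - p_acceptor - p_donor, p_acceptor, p_donor]
        return np.tile(np.array(row, dtype=np.float32), (1, length, 1))
    return model

sequence = "CGATCTGACGTGGGTGTCATCGCATTATCGATATTGCAT"  # placeholder input
x_ref = one_hot_encode(pad_sequence(sequence))[None, :]  # add batch axis
models = [make_dummy_model(0.2, 0.1), make_dummy_model(0.4, 0.3)]
y_ref = ensemble_predict(models, x_ref)
print(x_ref.shape, y_ref.shape)
```

With the real networks, each `model(x)` call would correspond to the `predict` call in the
formula above, and `y_ref` would be the averaged output that this report refers to as the
delta score.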

The materials used in the project are Jupyter, the Python programming language, Keras,
TensorFlow, pandas, and seaborn.

Results
The aim of the study is to find the splice sites in the minigene constructs of the UPF3A and
UPF3B genes, which were prepared in our lab. To achieve this, I used the SpliceAI models to
predict the splice sites of each gene and calculated the delta score, obtaining an output of
L×3 scores, where L is the length of the gene and the 3 channels correspond to the
probabilities of no splicing, splice acceptor, and splice donor, respectively. The splice
acceptor and splice donor probabilities are plotted against position along the gene,
downsampled (pixelated) to a smaller scale. The following graphs were obtained from the
predictions. Exact probabilities, code, and other information can be found in the Jupyter
notebook available on my GitHub profile,
https://github.com/YashSharma666.
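
To make the output format concrete, the sketch below takes an array of shape L×3 (no
splicing, splice acceptor, splice donor), lists the positions whose acceptor or donor
probability exceeds a threshold, and then reduces a track to fewer points by keeping the
maximum within fixed-size bins, which is one possible way to downsample a long gene for
plotting. The 0.5 threshold, the bin size, the helper names `call_sites` and `bin_max`, and
the toy scores are all assumptions made for illustration, not values taken from the notebook.

```python
import numpy as np

def call_sites(scores, threshold=0.5):
    """Return 0-based positions whose acceptor/donor probability > threshold.

    `scores` has shape (L, 3): columns are no-splice, acceptor, donor.
    """
    acceptors = np.flatnonzero(scores[:, 1] > threshold)
    donors = np.flatnonzero(scores[:, 2] > threshold)
    return acceptors.tolist(), donors.tolist()

def bin_max(track, bin_size):
    """Downsample a 1-D track by keeping the maximum of each bin."""
    n_bins = -(-len(track) // bin_size)  # ceiling division
    # Zero-pad the tail so the track divides evenly into bins; zeros never
    # exceed a probability, so they do not change any bin maximum.
    padded = np.zeros(n_bins * bin_size, dtype=track.dtype)
    padded[: len(track)] = track
    return padded.reshape(n_bins, bin_size).max(axis=1)

# Toy scores for a 10-nt sequence: mostly "no splice", with one confident
# acceptor at position 2 and one confident donor at position 7.
scores = np.tile(np.array([0.98, 0.01, 0.01]), (10, 1))
scores[2] = [0.05, 0.90, 0.05]
scores[7] = [0.10, 0.05, 0.85]

acceptors, donors = call_sites(scores)
print("acceptors:", acceptors, "donors:", donors)
print("binned donor track:", bin_max(scores[:, 2], bin_size=4))
```

Taking the maximum within each bin, rather than the mean, keeps isolated high-probability
peaks visible after downsampling, which matters because predicted splice sites are single
positions in a long sequence.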

[Figure: UPF3A, splice acceptor and splice donor predictions]

[Figure: UPF3A minigene]

From the graphs for the UPF3A gene and the UPF3A minigene construct, we can infer that the
model predicted 14 splice sites for the gene, 6 splice acceptors and 8 splice donors, whereas
for the minigene construct it predicted only 4 splice sites, 2 splice donors and 2 splice
acceptors. The exact positions of these splice sites in the minigene construct are: splice
acceptor positions (3466, 3683) and splice donor positions (107, 3563). These positions are
depicted below, along with the minigene construct of the UPF3A gene.

[Figure: UPF3A minigene construct]

[Figure: UPF3B]

[Figure: UPF3B minigene]

The results for the UPF3B gene and the UPF3B minigene construct are similar to those for
UPF3A and the UPF3A minigene. In this case too, the model predicted 14 splice sites for the
UPF3B gene, but for the UPF3B minigene construct it predicted only 2 splice sites, one splice
donor and one splice acceptor. The exact positions of these splice sites are: splice acceptor
position 2732 and splice donor position 183.

[Figure: UPF3B minigene construct]

Conclusion
The main objective of this study was to find the splice sites in the minigene constructs,
which can subsequently be verified experimentally. Predicting splice sites with this model is
beneficial because it offers higher accuracy and allows us to reconstruct the specificity
determinants that enable the spliceosome to achieve great precision in vivo. The use of deep
learning for this task is a major advance, as other techniques show a lot of noise in their
predictions, while SpliceAI gives clear-cut predictions of the splice sites. Another point
that emerged while working on the project is that the model appears to rely on features and
factors affecting splicing that are not well described by human experts and for which no
information is found in the literature. The main conclusions from the results are:
• The UPF3A minigene has 2 splice acceptor sites, at the beginning of exon 4 and exon 5, and
2 splice donor sites, at the end of exon 2 and exon 4.
• The UPF3B minigene has 1 splice acceptor site, at the beginning of exon 9, and 1 splice
donor site, at the end of exon 7.

Future Work
Deep learning is still a relatively new technique in the field of biology, and there is a lot
more to explore. The most immediate objective related to this project is the identification
of the motifs that determine the splicing pattern in these genes. The model can also be used
to identify cis- and trans-regulatory factors that are significant in determining the
splicing pattern of a gene. Our understanding of mutations in the non-coding genome is very
limited, and how mutations in non-coding regions give rise to human diseases is still far
from completely understood. Much research also remains to be done to understand
tissue-specific splicing regulation and splicing regulation under different cellular
contexts. As advances in biology generate a plethora of data to work with, these goals are
getting closer. For example, recent advances in oligonucleotide therapy, which has the
potential to target splicing defects in a sequence-specific manner, have brought us closer to
understanding the regulatory mechanisms of splicing, which can clear the path for many novel
therapeutic interventions.

References
1. Wang, Z. & Burge, C. B. Splicing regulation: from a parts list of regulatory elements
to an integrated splicing code. RNA 14, 802–813 (2008).

2. Hartmann, B. & Valcarcel, J. Decrypting the genome’s alternative messages. Curr.
Opin. Cell Biol. 21, 377–386 (2009).

3. Hallegger, M., Llorian, M. & Smith, C. W. Alternative splicing: global insights. FEBS
J. 277, 856–866 (2010).

4. Nilsen, T. W. & Graveley, B. R. Expansion of the eukaryotic proteome by alternative
splicing. Nature 463, 457–463 (2010).

5. Blencowe, B. J. Alternative splicing: new insights from global analyses. Cell 126,
37–47 (2006).

6. Farh, K. K. H., Marson, A., Zhu, J., Kleinewietfeld, M., Housley, W. J., Beik, S.,
Shoresh, N., Whitton, H., Ryan, R. J. H., Shishkin, A. A., et al. Genetic and epigenetic
fine mapping of causal autoimmune disease variants. Nature 518, 337–343 (2015).

7. Cooper, T. A., Wan, L. & Dreyfuss, G. RNA and disease. Cell 136, 777–793 (2009).

8. Maurano, M. T., Humbert, R., Rynes, E., Thurman, R. E., Haugen, E., Wang, H.,
Reynolds, A. P., Sandstrom, R., Qu, H., Brody, J., et al. Systematic localization of
common disease-associated variation in regulatory DNA. Science 337, 1190–1195 (2012).
