Alu Repeats, Cause or Consequence of Colon Cancer by Muhammad Hamid
By
Muhammad Hamid
SP11-BSB-024
BS Thesis (2011-2014)
COMSATS Institute of Information Technology
A Thesis Presented to
BS
(Bioinformatics)
By
Muhammad Hamid
CIIT/SP11-BSB-024/ISB
December, 2014
Alu Repeats, Cause or Consequence of Colon Cancer
Supervisor
Dr. Abdullah Ahmed
Assistant Professor, Department of Biosciences
Co-Supervisor
Dr. Aamira Tariq
Assistant Professor, Department of Biosciences
Final Approval
Muhammad Hamid
Has been approved for the COMSATS Institute of Information Technology Islamabad
Supervisor: ________________________________________________________
Dr. Abdullah Ahmed
Assistant Professor, Department of Biosciences, CIIT Islamabad
Co-Supervisor: _____________________________________________________
Dr. Aamira Tariq
Assistant Professor, Department of Biosciences, CIIT Islamabad
Chairman: _________________________________________________________
Prof. Dr. Syed Habib Bokhari
Department of Biosciences, CIIT Islamabad
Declaration
I, Muhammad Hamid, hereby declare that I have produced the work presented in this thesis during the
scheduled period of study. I also declare that I have not taken any material from any source except where
due reference is made. If a violation of HEC rules on research has occurred in this thesis, I shall be liable to
punishable action under the plagiarism rules of the HEC.
Date:
____________________
(Muhammad Hamid)
(CIIT/SP11-BSB-024/ISB)
Certificate
It is certified that Muhammad Hamid has carried out all the work related to this thesis under my
supervision at the Department of Biosciences, COMSATS Institute of Information Technology, Islamabad
campus.
Supervisor:
Submitted through:
DEDICATION
&
ACKNOWLEDGEMENTS
With profound gratitude and a deep sense of devotion, I wish to thank my worthy
supervisor Dr. Abdullah Ahmed, Assistant Professor, Department of Biosciences, COMSATS
Institute of Information Technology, Islamabad, for his most cooperative attitude, help, keen
interest and valuable comments throughout the course of these studies, and for his guidance in the
preparation of this thesis.
I highly appreciate the guidance and help rendered by Dr. Aamira Tariq, Assistant
Professor, Department of Biosciences, COMSATS Institute of Information Technology,
Islamabad. Without this help it would not have been possible to complete this work.
(Muhammad Hamid)
ABSTRACT
Alu elements are short sequences of about 280 nucleotides found in primate genomes. Pairing of
inverted Alu repeats forms duplex structures, which contribute to hyperediting. Alu elements are
more abundant in the 3'UTR than in the 5'UTR. They play a vital role in genome evolution and
influence gene expression and the translation of mRNA. Alu insertions take part in a variety of
regulatory mechanisms that can lead to various forms of cancer. Microarray data were collected from
IntOGen, and the group of genes that contain Alu elements and are involved in the development of
colon cancer was extracted programmatically. The 3UTR RNA Database, BioMart,
REPBase Censor, xPADExpression and the PolyA Database were used to enhance the genomic
annotation of the genes of interest, e.g. sequences, orientation, and long and short transcripts in both
normal and cancer cells. The genes that are up regulated and down regulated in colon cancer were
classified and plotted with the help of the PANTHER tool. The TargetScan tool was
used to categorize and analyze transcripts on the basis of the miRNA sites located in each 3'UTR isoform.
TABLE OF CONTENTS
1. Introduction:
2.1.4 REPBase Censor:
3. Results:
3.1 CombinedTranscriptData:
3.4.4 Protein Class:
4. Discussion:
5. Reference:
LIST OF FIGURES
Figure 1: Age range in selected countries is 15 and older excluding Asia and Africa (American
Figure 2: Transcription of Alu elements by RNA polymerase II and RNA polymerase III.
Figure 3: Different ways in which Alu elements might influence gene expression by A-to-I editing.
Figure 4: UTR were extracted from the EMBL database and with the aid of Repeatmasker tool
these regions were analyzed to get information regarding the existence of Alu elements. Each
dataset was manually analyzed to know its annotation and finally non annotated genes were
removed from entries. Rest of the entries were arranged as the orientation and architecture of Alu
RNAs
Figure 6: Graphical representation of programmatically accessed data from several classes into
LIST OF TABLES
Table 3.4.1 (B): Molecular function
LIST OF ABBREVIATION
Ago Argonaute
RNP Ribonucleoprotein
Chapter 1
Introduction
Introduction Chapter 1
1. Introduction:
Cancer of the lower part of the digestive system, i.e. the large intestine (colon), is referred to as
colon cancer, whereas cancer of the last few inches of the colon is known as rectal cancer. Together
they are referred to as colorectal cancer (CRC). Most colon cancers begin as adenomatous polyps,
which are noncancerous (benign) clumps of cells that are initially small in size. Polyps may or may
not produce symptoms and mostly remain small to medium in size, but with the passage of time some
of them may turn into cancers (Mayo Clinic, 2013).
Polyps form mainly in older people; however, most polyps do not turn into
cancer (Jessica, 2010). The American Cancer Society has listed a number of symptoms
of colorectal cancer, including rectal bleeding, bloody stools, changes in bowel
habits, cramps in the colorectal region, fatigue and weakness, followed by weight loss
(Cancer Facts and Figures, 2013).
Almost 75% of patients with CRC have sporadic disease, with no obvious evidence
that the disorder is inherited. The remaining 25% of patients have a family history of CRC. In
some families prone to colon cancer, genetic mutations have been observed to be the cause of
inherited cancer. Such mutations are predicted to cause only 5% to 6% of all CRC
cases. It is possible that background genetic factors and undiscovered genes, in combination with
non-genetic risk factors, are the key contributors to the development of familial CRC
(Leggett, 2010).
Men are 30% to 40% more susceptible to colorectal cancer than women
(Colorectal Facts and Figures, 2014). Each year 600,000 new CRC cases are diagnosed globally,
with incidence varying from 48.3 to 72.5 per 100,000 in men and from 32.3 to 56 per 100,000 in women.
In Pakistan, young patients, mainly those under 40 years of age, are more vulnerable to CRC, and their
survival rate is lower than that of older patients, probably due to late identification
(Abdul Qaiyoume Amini, 2013). Diagnosis of colorectal cancer in its early stages and removal of
low-risk adenomas reduce the risk of death, whereas removal of high-risk adenomas was associated
with a 16% increase in the death rate from colorectal cancer (Magnus et al, 2014).
[Figure 1 chart: Five year relative survival rates (%) of CRC patients in select countries; vertical axis from 10 to 70.]
Figure 1: Age range in selected countries is 15 and older excluding Asia and Africa (American Cancer
Society, 2008)
The 3'UTR contains binding sites for both regulatory proteins and microRNAs, which
contribute to post-transcriptional control of gene expression.
1.1.2 Polyadenylation:
Polyadenylation is the addition of a poly(A) tail, a stretch of adenosine monophosphates, to
an mRNA. A protein complex cleaves the pre-mRNA (the initial
product of transcription), and the poly(A) tail is then added at the 3' end. Because this can happen
at several possible sites of the mRNA, multiple transcripts result (Nick J. Proudfoot, 2002).
Alternative polyadenylation (APA) can produce mRNA isoforms with 3'UTRs of variable length owing to
the presence of multiple polyadenylation signals, and can therefore produce multiple proteins from a
single gene. Alternative polyadenylation enhances the diversity of transcripts. Translational
efficiency varies with the length of the 3'UTR: previous studies have shown increased protein
production from transcripts bearing a shortened 3'UTR compared with transcripts bearing the long 3'UTR.
Tissue-specific polyadenylation sites have been observed across the major
cancers and their respective normal tissues. Multiple polyadenylation sites are found in the 3'UTR
of 30% of genes, although most of these genes have two polyadenylation sites. Polyadenylation
signals are position specific: the AT-rich motif TATATW is highly preferred by short isoforms, whereas
AATAAA is preferred by long isoforms. A higher frequency of short than of long isoforms
has been observed in genes that are up regulated in cancer (Yuefeng Lin, 2012).
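The two signal motifs named above can be located in a 3'UTR sequence with a simple pattern scan. The following is only an illustrative sketch, not part of the thesis workflow: the class name PolyASignalScan, the method findSignals and the example sequence are hypothetical, and W is expanded as the IUPAC code for A or T.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PolyASignalScan {

    // Motifs named in the text. W (IUPAC: A or T) is written as a character class.
    static final Pattern LONG_ISOFORM_SIGNAL = Pattern.compile("AATAAA");
    static final Pattern SHORT_ISOFORM_SIGNAL = Pattern.compile("TATAT[AT]");

    // Returns the 0-based start position of every match, including overlapping ones.
    static List<Integer> findSignals(String utr, Pattern motif) {
        List<Integer> positions = new ArrayList<>();
        Matcher m = motif.matcher(utr.toUpperCase());
        int from = 0;
        while (m.find(from)) {
            positions.add(m.start());
            from = m.start() + 1; // advance one base so overlapping hits are kept
        }
        return positions;
    }

    public static void main(String[] args) {
        String utr = "GGCTATATAGGCCAATAAAGG"; // made-up sequence, for illustration only
        System.out.println("TATATW at " + findSignals(utr, SHORT_ISOFORM_SIGNAL));
        System.out.println("AATAAA at " + findSignals(utr, LONG_ISOFORM_SIGNAL));
    }
}
```

A scan of this kind only reports where a motif occurs; deciding which site is actually used in a given tissue requires expression data such as that provided by xPAD.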
miRNAs form complexes with Argonaute (Ago) proteins, giving rise to the RNA-induced
silencing complex (RISC), which represses mRNA expression. mRNA recognition and regulation are
achieved by the miRNA, which functions as an adaptor for the miRISC complex. In animals, the miRNA
binding sites lie in the 3'UTR of the target, and most miRNAs pair with imperfect complementarity to the
mRNA, whereas in plants miRNAs pair with perfect complementarity to the coding sequence of their
targets. miRNA-mRNA binding is therefore considered essential to this regulatory mechanism.
Free Alu RNAs are transcribed from their own RNA polymerase III promoter, which drives
transcription initiation but lacks a terminator.
Embedded Alu RNAs are transcribed by RNA polymerase II as part of protein-coding and
non-protein-coding RNAs.
Alu elements consist of two monomers, the left and right arms, connected by an A-rich linker
and followed by a short poly(A) tail. Recent research has shown that only a limited
number of Alu elements are capable of retrotransposition. Because they do not code for protein, they
are amplified by the transposition machinery of other elements, which is thought to be that of LINE-1.
Figure 2: Transcription of Alu elements by RNA polymerase II and RNA polymerase III.
Free Alu elements are transcribed by RNA polymerase III. They play an important role in
genome evolution via insertion and recombination; however, the majority are genetically inert. The
internal promoter alone cannot drive transcription, and Alu elements therefore
rely for their expression on the sequences flanking their site of insertion. Alu transcript levels
have been observed to increase under certain stress conditions such as heat shock,
adenovirus infection and cycloheximide exposure.
Free SRP9/14 binds Alu RNA to form the Alu RNP (ribonucleoprotein) complex, which
inhibits protein translation, whereas Alu RNA on its own enhances the translation of mRNA. Alu
RNA inhibits RNA-dependent protein kinase (PKR) and consequently stimulates protein translation.
Embedded Alu elements influence gene expression via splicing, ADAR editing and
polyadenylation (Prescott Deininger, 2001). Research has revealed that more
Alu RNAs are embedded in the 3'UTR than in the 5'UTR: there is a single Alu element
per 24,000 bases in the 5'UTR and a single Alu per 14,000 bases in the 3'UTR. It has been documented that
Alu RNAs embedded in the 5'UTR of particular mRNAs inhibit protein translation. In a transcript
isoform of the DNA repair protein BRCA1 (expressed in breast cancer tissue), in ZNF177 (a zinc
finger protein) and in contactin, they decrease the translation efficiency of the mRNA.
Antisense Alu elements inserted in the 3'UTR can produce adenine/uracil-rich elements (AREs).
AREs can influence mRNA expression because they are involved in the destabilization of mRNA.
The SRP9/14 protein can bind some Alu RNAs to alter the stability of certain transcripts.
Alu RNA in the 3'UTR helps to regulate mRNA stability, whereas Alu RNA in the 5'UTR represses
translation (J. Hasler, 2007).
IRAlus (inverted repeated Alu elements) form dsRNA of about 300 base pairs because of the high
homology found among all Alu subfamilies. IRAlus can affect gene expression. Editing of IRAlus can
serve a quality-control function: this mechanism can be used to regulate the amount of mRNA and to
prevent randomly edited mRNA from reaching the cytoplasm when it is exported from the nucleus. Mouse
CTN-RNA remains in the nucleus until cell stress occurs. Cleavage of the hyperedited 3'UTR
nuclear retention signal enables export of the RNA to the cytoplasm, where translation occurs.
Figure 3: Different ways in which Alu elements might influence gene expression by A-to-I editing.
Another related mechanism of regulation involves the expression of alternative 3'UTRs via
alternative pre-mRNA splicing. Two genes, caspase 8 and caspase 10, which lie on chromosome 2,
can each express two different 3'UTRs: an upstream one that contains IRAlus and a downstream one
that does not. Splicing determines which 3'UTR is included; these alternative 3'UTRs
can therefore affect the expression level of the encoded proteins, and the choice can be regulated by
cellular stress. It has been observed that proliferating cells express mRNAs with shortened 3'UTRs,
which also contain miRNA target sites (Ling-Ling Chen, 2008).
1.3 Objectives:
The main objective of this research is to elaborate the key functions of Alu elements and
their role in cancer. The research aims to predict the regulation of genes mediated by
Alu elements and how Alu elements act on regulatory mechanisms. We aim to distinguish
normal from cancer cells by comparing certain parameters, e.g. alternative polyadenylation,
position in the 3'UTR, and the number and strand of Alu RNAs. We will observe the impact of Alu
elements on mRNA stability and their interaction with miRNA and various other regulatory factors.
This study will help us to understand the role of Alu elements in oncogenes and in gene
expression at the cellular level. A genome-wide analysis of the association between mutated cancer
genes and Alu elements is urgently needed.
Chapter 2
Materials and Methods
Materials & Methods Chapter 2
2.1 Materials:
This section details the procedures used to complete this research. It also covers the data
sources from which we extracted data and the analyses performed.
2.1.1 IntOGen:
IntOGen is an integrative oncogenomics tool that is helpful for studying and understanding
cancers. It provides a platform for analyzing oncogenomics data to predict genes and groups of genes
involved in the development of cancer (Gonzalez-Perez A, 2013). IntOGen holds a
collection of experimental genomic microarray data, which are vital for studying the alterations
that lead to various cancer types. IntOGen data were obtained from the International Cancer Genome
Consortium (www.icgc.org) and The Cancer Genome Atlas databases.
Figure 4: UTRs were extracted from the EMBL database, and with the aid of the RepeatMasker tool these
regions were analyzed for the presence of Alu elements. Each dataset was manually
analyzed to determine its annotation, and non-annotated genes were then removed from the entries. The
remaining entries were arranged according to the orientation and architecture of the Alu RNAs.
2.1.3 BioMart:
BioMart gives users rapid access to Ensembl data, mostly recent genomic annotations
(Smedley D, 2009). It generates results according to the user's interest and produces several
output formats (.html, .csv, .xls etc.). BioMart helps us to retrieve relevant information from
large genomic datasets using a variety of programmatic methods.
Figure 5: (www.ensembl.org/biomart)
2. The censored query sequences, with an "N" ("X") replacing each base of the removed
repeats.
4. The fragments that were censored out, i.e. fragments homologous to one of the repeats
from the reference collection.
250 0.77 0.7656 c LTR 943 883 ALFARE1_I 21715 21651 N48
1523 0.69 0.6854 c LTR/Gypsy 3290 2619 DIASPORA_I 22622 21966 N48
224 0.82 0.8235 c LTR/Copia 2659 2607 ATCOPIA35_I 23200 23152 N48
1355 0.67 0.6672 c LTR/Gypsy 1737 1130 DIASPORA_I 24003 23391 N48
2.1.7 TargetScanHuman:
TargetScan is an online web server used to predict miRNA target sites that lie in the
3'UTR of an mRNA. The tool searches for conserved 8mer and 7mer sites that match
the seed region of each miRNA. Nonconserved sites can optionally be predicted as well
(TargetScan, 2015).
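The seed-matching idea can be illustrated with a short sketch. This is not TargetScan's implementation: it ignores conservation and context scoring, and the class name SeedSiteSketch, the method names and the example sequences are hypothetical. It assumes the commonly used site definitions, in which a 7mer-m8 site pairs with miRNA nucleotides 2-8, a 7mer-A1 site pairs with nucleotides 2-7 and is followed by an A, and an 8mer site has both features.

```java
import java.util.ArrayList;
import java.util.List;

public class SeedSiteSketch {

    // Reverse complement of an RNA or DNA string, returned in DNA letters.
    static String reverseComplement(String seq) {
        StringBuilder sb = new StringBuilder();
        for (int i = seq.length() - 1; i >= 0; i--) {
            switch (Character.toUpperCase(seq.charAt(i))) {
                case 'A': sb.append('T'); break;
                case 'C': sb.append('G'); break;
                case 'G': sb.append('C'); break;
                case 'T': case 'U': sb.append('A'); break;
                default: sb.append('N');
            }
        }
        return sb.toString();
    }

    // Scans a 3'UTR (5'->3') and reports each seed site as "type@start" (0-based).
    // The miRNA must be at least 8 nucleotides long.
    static List<String> findSites(String utr, String mirna) {
        String u = utr.toUpperCase().replace('U', 'T');
        String m8 = reverseComplement(mirna.substring(1, 8)); // pairs with miRNA nt 2-8
        String m7 = m8.substring(1);                          // pairs with miRNA nt 2-7
        List<String> sites = new ArrayList<>();
        // i marks the base opposite miRNA nt 8; starting at -1 lets a 7mer-A1 site
        // at the very start of the UTR be found (startsWith is false for offset -1).
        for (int i = -1; i + 7 <= u.length(); i++) {
            boolean hasM8 = u.startsWith(m8, i);
            boolean hasM7 = u.startsWith(m7, i + 1);
            boolean hasA1 = i + 7 < u.length() && u.charAt(i + 7) == 'A';
            if (hasM8 && hasA1) sites.add("8mer@" + i);
            else if (hasM8) sites.add("7mer-m8@" + i);
            else if (hasM7 && hasA1) sites.add("7mer-A1@" + (i + 1));
        }
        return sites;
    }

    public static void main(String[] args) {
        String mirna = "UAGCUUAUCAGACUGAUGUUGA"; // illustrative miRNA sequence
        System.out.println(findSites("CCATAAGCTACC", mirna));
    }
}
```

Reporting the site type together with its position makes it straightforward to compare the sites present in the long and short 3'UTR isoforms of the same gene.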
2.1.8 Workbench:
Various scripts were written to extract oncogenes and their related data in an
organized form.
Extract_Genes:
i. Extract the genes that are up regulated and down regulated in colon cancer.
iii. Extract other relevant data of the genes from two files.
This class is written in the Java programming language to extract the Alu-containing genes that are
up regulated and down regulated in colon cancer. First, it reads two files, genes_site_colon.tsv
and 3'UTR.txt, which were downloaded from IntOGen and the 3UTR RNA Database respectively,
into a buffer, from which the data can easily be stored in arrays.
intogenRawData and alu_containing_utr are two arrays that contain the row-wise
data from the two files. The intogenRawData array is processed to extract the genes that are up
regulated and down regulated in cancer. The genes containing Alu elements are then filtered using the
alu_containing_utr array. Other relevant data for the genes, namely EMBL number, Alu type,
Alu orientation, strand, and up regulation and down regulation values, are also extracted from both
arrays and written to two Excel-readable files, Genes_Extract_Upreg_output.txt and
Genes_Extract_Upreg_output.txt.
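The filtering step described above can be sketched as a join between the two row-wise arrays. This is a minimal illustration and not the original Extract_Genes source: the column layouts (gene and value for IntOGen rows; gene, EMBL number, Alu type and orientation for 3'UTR rows), the gene names and the class name ExtractGenesSketch are all assumptions. In practice the rows would come from the two input files, for example via Files.readAllLines.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ExtractGenesSketch {

    // Keeps genes whose IntOGen value is below the cutoff AND that have an
    // Alu-containing 3'UTR row. Assumed (hypothetical) tab-separated layouts:
    //   IntOGen row: gene, value
    //   3'UTR row:   gene, EMBL number, Alu type, Alu orientation
    static List<String> extract(List<String> intogenRawData,
                                List<String> aluContainingUtr,
                                double cutoff) {
        // Index the 3'UTR rows by gene. Simplification: if a gene has several
        // rows, only the last one is kept.
        Map<String, String> utrByGene = new HashMap<>();
        for (String row : aluContainingUtr) {
            utrByGene.put(row.split("\t")[0], row);
        }
        List<String> output = new ArrayList<>();
        for (String row : intogenRawData) {
            String[] fields = row.split("\t");
            double value = Double.parseDouble(fields[1]);
            String utrRow = utrByGene.get(fields[0]);
            if (value < cutoff && utrRow != null) {
                output.add(utrRow + "\t" + fields[1]); // annotation plus regulation value
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<String> intogen = List.of("GENE_A\t0.01", "GENE_B\t0.20", "GENE_C\t0.03");
        List<String> utr = List.of("GENE_A\tX00001\tAluSx\t+", "GENE_B\tX00002\tAluY\t-");
        extract(intogen, utr, 0.05).forEach(System.out::println);
    }
}
```

Indexing one array by gene before scanning the other avoids a nested loop over both files, which matters little at this scale but keeps the join logic in one place.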
[Figure 6 diagram: in both the Upreg and Downreg projects, the classes Intogen_AluContainingUTR_Data,
MartSeqAndLength, CensorResultData, AluRepeatsCount and xPAD_Data feed CombinedTranscriptData.
Output fields: Gene ID, Transcript ID, Strand, 3'UTR Length, 3'UTR Sequence; Repeat Data (Alu No,
Alu Family, Alu Orientation, Alu From, Alu To); XPad Data (Long Normal, Long Tumor, Short Normal,
Short Tumor, Repeats in Long Normal, Repeats in Long Tumor, Repeats in Short Normal, Repeats in
Short Tumor, Short Normal Sequence, Short Tumor Sequence); Regulation Data (Upregulation Data,
Downregulation Data).]
Figure 6: Graphical representation of programmatically accessed data from several classes into main
class CombinedTranscriptData that is finally generating output.
Two major projects were designed, one for genes showing up regulation and one for genes showing
down regulation in colon cancer. Each project contains the following classes in order to obtain the
desired output in a single file.
CensorResultData: This class does not contain any functions. It reads the file exported from Censor
and populates global arrays for several attributes, searching the file for genes and
transcripts and their respective attributes. The global arrays are censor_genes,
censor_transcript, censor_orientation, censor_from, censor_to and censor_repeats.
xPAD_Data: This class contains various functions that get the xPAD data shown in the diagram
from the xpad_upreg.txt file. This file was created manually in Excel from the output of the xPAD tool
and contains details of the genes for both long and short transcripts in normal and tumor cells.
The details comprise gene id, transcript id, positions, repeats and the sequences of the short
transcripts. Each output is generated by a separate function, e.g. getLongNormal(), getLongTumor(),
getShortNormal() and getShortTumor(); there are ten functions in total.
MartSeqAndLength: This class reads the FASTA file exported from BioMart and processes it
to isolate the sequence of each transcript and its length. The class contains a single
function, getMart(), which takes a gene and a transcript as arguments because the sequence in the
BioMart file is different for each transcript. The function searches the file for the given gene and
transcript, stores the corresponding sequence in a StringBuilder and finally returns it to
CombinedTranscriptData.
AluRepeatsCount: This class contains a function getAluNo() that takes a transcript from
CombinedTranscriptData and searches for it in the global array censor_transcript (which can be
accessed by all classes). It then looks for Alu entries in the censor_repeats array; if Alu elements
are found for the transcript, the function counts the repeats and returns the count to
CombinedTranscriptData.
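The two lookup functions described above, getMart() and getAluNo(), can be sketched as follows. This is a simplified reconstruction under stated assumptions rather than the thesis source: the FASTA header is assumed to contain both the gene and transcript identifiers, the arrays are passed as parameters instead of being global, repeat names for Alu elements are assumed to begin with "Alu", and the identifiers used in the example are placeholders.

```java
import java.util.Arrays;
import java.util.List;

public class TranscriptLookupSketch {

    // Returns the sequence of the first FASTA record whose header mentions both
    // IDs, or an empty string if none does. Substring matching is a simplification.
    static String getMart(List<String> fastaLines, String gene, String transcript) {
        StringBuilder sequence = new StringBuilder();
        boolean inRecord = false;
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                if (inRecord) break; // the next record begins, so the wanted one is complete
                inRecord = line.contains(gene) && line.contains(transcript);
            } else if (inRecord) {
                sequence.append(line.trim());
            }
        }
        return sequence.toString();
    }

    // Counts the Censor rows that belong to the transcript and name an Alu repeat.
    // The two arrays are parallel: row i of each describes the same Censor hit.
    static int getAluNo(String[] censorTranscript, String[] censorRepeats, String transcript) {
        int count = 0;
        for (int i = 0; i < censorTranscript.length; i++) {
            if (censorTranscript[i].equals(transcript) && censorRepeats[i].startsWith("Alu")) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        List<String> fasta = Arrays.asList(">G1|T1", "ACGT", "AC", ">G1|T2", "TTTT");
        String seq = getMart(fasta, "G1", "T1");
        System.out.println(seq + " length=" + seq.length());
        System.out.println(getAluNo(new String[] {"T1", "T1", "T2", "T1"},
                                    new String[] {"AluSx", "L1PA2", "AluY", "AluJb"}, "T1"));
    }
}
```

Passing the arrays as arguments instead of reading global state makes each function testable on its own, at the cost of departing from the global-array design the thesis describes.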
2.2 Methods:
The following steps were performed in sequence to obtain the data and analyze the
results.
1. Two separate files were obtained from the IntOGen microarray data for colon
cancer. One file contains genes that are frequently up regulated in colon cancer, while the
other file contains genes that are down regulated in colon cancer.
3. A script was written to isolate up regulated and down regulated genes whose up regulation
value was less than 0.05. These genes were matched against the 3UTR file; only genes that
contain Alu elements were extracted as output.
4. EMBL numbers which were derived from 3UTR in the output of the Intogen script were
entered into BioMart to extract the sequences of the 3'UTRs.
5. The 3'UTRs from the previous step were analysed using Censor.
6. Transcripts isolated with the Censor tool were manually entered into the xPAD
tool one at a time to analyze the expression of the genes in short, long, normal and tumor
transcripts, and also the presence of Alu elements in a particular transcript. All this
information is stored in two Excel files, one for up regulated and one for down regulated genes.
7. Different scripts were then written in the Java programming language to extract, organize
and manipulate the data collected from the several resources. The Eclipse platform was used as the
Integrated Development Environment (IDE) to accomplish these tasks.
8. Several classes were written to collect the desired data in one place; the main class in
which the data are collected is CombinedTranscriptData. Two projects were written for that
purpose: one for genes that appear up regulated in colon cancer and a second
for genes that appear down regulated in colon cancer.
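The final collection step can be pictured as writing one tab-separated row per transcript. The sketch below is only an assumption about how such a row might be assembled: the class name CombinedRowSketch, the method toRow, the column order and the example values are hypothetical, and only a subset of the output fields is shown.

```java
import java.util.StringJoiner;

public class CombinedRowSketch {

    // Builds one tab-separated output row. The 3'UTR length is derived from the
    // sequence so that the two columns cannot disagree.
    static String toRow(String geneId, String transcriptId, String strand,
                        String utrSequence, int aluCount, double regulationValue) {
        StringJoiner row = new StringJoiner("\t");
        row.add(geneId)
           .add(transcriptId)
           .add(strand)
           .add(String.valueOf(utrSequence.length()))
           .add(utrSequence)
           .add(String.valueOf(aluCount))
           .add(String.valueOf(regulationValue));
        return row.toString();
    }

    public static void main(String[] args) {
        // Placeholder values, not taken from the thesis data.
        System.out.println(toRow("GENE_A", "TX_1", "+", "ACGTAC", 2, 0.01));
    }
}
```

A tab-separated text file of this kind opens directly in Excel, which matches the way the output file is described as being viewed.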
Chapter 3
Results
Results Chapter 3
3. Results:
We extracted data on the genes expressed in cancer from the sources
mentioned in the Materials and Methods section. The main goal of this research is to build a combined
transcript data file that allows the data to be accessed programmatically. The data are a set of
information on genes that are differentially expressed in colon cancer.
3.1 CombinedTranscriptData:
The data collected from the different sources were filtered, and the
meaningful information was extracted into the combined transcript data. The purpose of constructing
the combined transcript data is to build a data warehouse that will be helpful for studying genes,
their associated Alu elements in normal and colon cancer cells, transcript variants and other
genomic annotations. This compact information can be explored further and will be useful in
oncogenomics research. The final file, which contains all the information gathered from the
different sources, can be used in cancer analysis. CombinedTranscriptData writes its output to one
file that can be viewed in Excel. The data in the output file are organized in an easily
accessible format that can be used for further processing by data mining.
3.3 Up Regulation:
3.3.1 Molecular Function:
Protein Binding Transcription Factor Activity | Receptor Activity | Structural Molecule Activity | Transporter Activity
ENST00000577886 ENST00000263379 ENST00000233336 ENST00000380590
ENST00000578237 ENST00000230882 ENST00000215957 ENST00000260191
ENST00000336708 ENST00000357703 ENST00000353836 ENST00000530277
ENST00000276431 ENST00000442846 ENST00000392770
ENST00000353836 ENST00000441969 ENST00000299333
ENST00000442846 ENST00000268661
ENST00000441969 ENST00000300933
ENST00000260191
ENST00000243673
ENST00000373857
ENST00000539896
Table 3.1.1 (B): Molecular function
Apoptotic Process | Biological Regulation | Cellular Component Organization or Biogenesis | Cellular Process
ENST00000360132 ENST00000223095 ENST00000215957 ENST00000263379
ENST00000286186 ENST00000251269 ENST00000300933 ENST00000539749
ENST00000346817 ENST00000360132 ENST00000296387
ENST00000276431 ENST00000286186 ENST00000333703
ENST00000346817 ENST00000368729
ENST00000357066 ENST00000230882
ENST00000276431 ENST00000357703
ENST00000407965 ENST00000357066
ENST00000338483 ENST00000215957
ENST00000426621 ENST00000341754
ENST00000538320 ENST00000382038
ENST00000538999 ENST00000276431
ENST00000467448 ENST00000258774
ENST00000432329 ENST00000436444
ENST00000353267 ENST00000353836
ENST00000577886 ENST00000442846
ENST00000578237 ENST00000441969
ENST00000336708 ENST00000467448
ENST00000398174
ENST00000260191
ENST00000296161
ENST00000300933
ENST00000243673
ENST00000432329
ENST00000353267
ENST00000530277
ENST00000392770
ENST00000299333
ENST00000373857
ENST00000539896
Table 3.1.2 (A): Biological process
ENST00000341754
ENST00000382038
ENST00000258774
ENST00000436444
ENST00000407965
ENST00000338483
ENST00000426621
ENST00000538320
ENST00000538999
ENST00000467448
ENST00000398174
ENST00000254908
ENST00000512783
ENST00000268661
ENST00000296161
ENST00000300933
ENST00000430095
ENST00000358495
ENST00000272645
ENST00000307720
ENST00000308086
ENST00000432329
ENST00000353267
ENST00000250092
ENST00000505337
ENST00000439211
ENST00000380097
ENST00000278302
ENST00000577886
ENST00000578237
ENST00000336708
Table 3.1.2 (B): Biological process
ENST00000243673
ENST00000432329
ENST00000353267
ENST00000373857
ENST00000539896
Table 3.1.2 (C): Biological process
Cell Part | Macromolecular Complex | Membrane | Organelle
ENST00000233336 ENST00000233336 ENST00000380590 ENST00000233336
ENST00000380590 ENST00000539749 ENST00000380590
ENST00000539749 ENST00000296387 ENST00000215957
ENST00000296387 ENST00000300933
ENST00000215957 ENST00000577886
ENST00000300933 ENST00000578237
ENST00000336708
Table 3.1.3: Cellular component
Defense/Immunity Protein | Enzyme Modulator | Hydrolase | Isomerase
ENST00000263379 ENST00000370823 ENST00000360132 ENST00000307720
ENST00000230882 ENST00000223095 ENST00000286186
ENST00000357703 ENST00000360132 ENST00000346817
ENST00000353836 ENST00000286186 ENST00000341754
ENST00000442846 ENST00000346817 ENST00000382038
ENST00000441969 ENST00000357066 ENST00000398174
ENST00000467448
Table 3.1.4 (B): Protein class
Ligase | Lyase | Membrane Traffic Protein | Nucleic Acid Binding
ENST00000233336 ENST00000254908 ENST00000317620 ENST00000394456
ENST00000296161 ENST00000512783 ENST00000317668 ENST00000373019
ENST00000380097 ENST00000250092 ENST00000341754
ENST00000278302 ENST00000382038
ENST00000577886 ENST00000407965
ENST00000578237 ENST00000338483
ENST00000336708 ENST00000426621
ENST00000538320
ENST00000538999
ENST00000467448
ENST00000268661
ENST00000430095
ENST00000358495
ENST00000308086
ENST00000432329
ENST00000353267
Table 3.1.4 (C): Protein class
Structural Protein | Transcription Factor | Transfer/Carrier Protein | Transferase
ENST00000353836 ENST00000394456 ENST00000380590 ENST00000383789
ENST00000442846 ENST00000251269 ENST00000443029
ENST00000441969 ENST00000381222 ENST00000383790
ENST00000381223 ENST00000373115
ENST00000381218
ENST00000394807
ENST00000407965
ENST00000338483
ENST00000426621
ENST00000538320
ENST00000538999
ENST00000272645
ENST00000432329
ENST00000353267
ENST00000577886
ENST00000578237
ENST00000336708
Table 3.1.4 (E): Protein class
Transporter
ENST00000380590
ENST00000260191
ENST00000530277
ENST00000392770
ENST00000299333
Table 3.1.4 (F): Protein class
3.3.5 Pathway:
ENST00000346817
ENST00000276431
ENST00000432329
ENST00000353267
Table 3.1.5 (A): Pathway
De Novo Pyrimidine Ribonucleotides Biosynthesis | Enkephalin Release | FAS Signaling Pathway | Folate Biosynthesis
ENST00000232607 ENST00000432329 ENST00000360132 ENST00000505337
ENST00000353267 ENST00000286186 ENST00000439211
ENST00000346817
Table 3.1.5 (B): Pathway
Formyltetrahydroformate Biosynthesis | General Transcription Regulation | Gonadotropin Releasing Hormone Receptor Pathway | Heterotrimeric G-protein Signaling Pathway-Gi Alpha and Gs Alpha Mediated Pathway
ENST00000505337 ENST00000394456 ENST00000432329 ENST00000432329
ENST00000439211 ENST00000272645 ENST00000353267 ENST00000353267
Inflammation Mediated by Chemokine and Cytokine Signaling Pathway | Plasminogen Activating Cascade | Transcription Regulation by bZIP Transcription Factor | Wnt Signaling Pathway
ENST00000373857 ENST00000223095 ENST00000394456 ENST00000333703
ENST00000539896 ENST00000272645
ENST00000432329
ENST00000353267
Table 3.1.5 (D): Pathway
ENST00000332305 ENST00000278302
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
ENST00000417689
ENST00000317091
Table 3.2.1 (A): Molecular function
Protein Binding Transcription Factor Activity | Receptor Activity | Structural Molecule Activity | Transporter Activity
ENST00000441801 ENST00000373857 ENST00000233336 ENST00000380590
ENST00000375766 ENST00000539896 ENST00000268661 ENST00000235345
ENST00000375771 ENST00000300933 ENST00000347644
ENST00000332305 ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
Table 3.2.1 (B): Molecular function
Apoptotic Process | Biological Regulation | Cellular Component Organization or Biogenesis | Cellular Process
ENST00000305046 ENST00000251269 ENST00000300933 ENST00000539749
ENST00000357066 ENST00000441801 ENST00000296387
ENST00000441801 ENST00000375766 ENST00000368729
ENST00000375766 ENST00000375771 ENST00000357066
ENST00000375771 ENST00000332305 ENST00000398174
ENST00000332305 ENST00000296161
ENST00000300933
ENST00000307046
ENST00000337514
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
ENST00000373857
ENST00000539896
Table 3.2.2 (A): Biological process
ENST00000268661
ENST00000296161
ENST00000300933
ENST00000272645
ENST00000308086
ENST00000221307
ENST00000305046
ENST00000250092
ENST00000505337
ENST00000439211
ENST00000307046
ENST00000337514
ENST00000380097
ENST00000278302
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
ENST00000417689
ENST00000317091
Table 3.2.2 (B): Biological process
Cell Junction | Cell Part | Macromolecular Complex | Membrane
ENST00000441801 ENST00000233336 ENST00000233336 ENST00000380590
ENST00000375766 ENST00000380590 ENST00000539749
ENST00000375771 ENST00000539749 ENST00000296387
ENST00000332305 ENST00000296387 ENST00000441801
ENST00000300933 ENST00000375766
ENST00000441801 ENST00000375771
ENST00000375766 ENST00000332305
ENST00000375771
ENST00000332305
Table 3.2.3 (A): Cellular component
Organelle
ENST00000233336
ENST00000380590
ENST00000300933
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
Table 3.2.3 (B): Cellular component
Calcium-Binding Protein | Cell Junction Protein | Cytoskeletal Protein | Enzyme Modulator
ENST00000380590 ENST00000539749 ENST00000233336 ENST00000370823
ENST00000368729 ENST00000296387 ENST00000300933 ENST00000357066
ENST00000441801 ENST00000441801 ENST00000441801
ENST00000375766 ENST00000375766 ENST00000375766
ENST00000375771 ENST00000375771 ENST00000375771
ENST00000332305 ENST00000332305 ENST00000332305
Table 3.2.4 (A): Protein class
Hydrolase | Ligase | Lyase | Membrane Traffic Protein
ENST00000398174 ENST00000233336 ENST00000344366 ENST00000272462
ENST00000417689 ENST00000296161 ENST00000178638 ENST00000317620
ENST00000317091 ENST00000380097 ENST00000321764 ENST00000317668
ENST00000278302 ENST00000250092
Table 3.2.4 (B): Protein class
ENST00000505337
ENST00000439211
Table 3.2.4 (C): Protein class
Signaling Molecule | Transcription Factor | Transfer/Carrier Protein | Transferase
ENST00000368729 ENST00000251269 ENST00000380590 ENST00000383789
ENST00000398174 ENST00000381222 ENST00000443029
ENST00000307046 ENST00000381223 ENST00000383790
ENST00000337514 ENST00000381218
ENST00000272645
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
Table 3.2.4 (D): Protein class
Transporter
ENST00000380590
ENST00000235345
ENST00000347644
Table 3.2.4 (E): Protein class
3.4.5 Pathway:
General Transcription Regulation | Gonadotropin Releasing Hormone Receptor Pathway | Inflammation Mediated by Chemokine and Cytokine Signaling Pathway | Insulin/IGF Pathway-Mitogen Activated Protein Kinase Kinase/MAP Kinase Cascade
ENST00000272645 | ENST00000307046 | ENST00000373857 | ENST00000307046
 | ENST00000337514 | ENST00000539896 | ENST00000337514
Table 3.2.5 (B): Pathway
1. miR dependent
2. miR independent
The miR independent category consists of Alu Exclusion Associated Polyadenylation (AEP) and Alu
Directed Alternative Polyadenylation (ADP).
Transcript Name | UTR Length | Alu Position | Alu Details | S/L(N) | S/L(T) | miR
Up Regulation
TNFRSF10B ENST00000276431 | 2538 | 423 | Alu in all transcripts | 0.2 | 0.121 | miR is included in long transcript
HUS1 ENST00000258774 | 2068 | 983 | Alu in long only | 0.578 | 2.066 | miR is included in long transcript
Down Regulation
IGF1 ENST00000337514 | 6633 | 1222 | Alu in long only | 0.095 | 0.059 | miR is included in long transcript
CA12 ENST00000178638 | 4907 | 1489 | Alu in all transcripts | 0.014 | 0.022 | miR is included in long transcript
Transcript Name | UTR Length | Alu Position | Alu Details | S/L(N) | S/L(T) | miR
Up Regulation
Down Regulation
Transcript Name | UTR Length | Alu Position | Alu Details | S/L(N) | S/L(T) | miR
Up Regulation
POLR2D ENST00000272645 | 1841 | 479 | Alu in all transcripts | 0.166 | 0.166 | miR is included in all transcripts
Down Regulation
BVES ENST00000314641 | 4267 | 580 | Alu in all transcripts | 0.137 | 0.125 | miR is included in all transcripts
SLC35D1 ENST00000235345 | 5008 | 697 | Alu in all transcripts | 0.082 | 0.047 | miR is included in all transcripts
Chapter 4
Discussion
4. Discussion:
The untranslated region is important in regulating gene expression at the post-transcriptional
level, so Alu RNA embedded in this region can influence gene expression. The majority of
RNA editing occurs within Alu elements, and editing of inverted Alu elements could affect gene
expression. Alternative polyadenylation produces mRNA isoforms; the short isoforms lack
miRNA target sites, a loss that has been linked to several types of cancer.
Large data sets were created to understand how Alu repeats contribute to gene
expression. A large amount of data is available from different online sources; this data was collected and
processed to filter out the useful information. A combined transcript data file was constructed that
contains data on the genes up-regulated and down-regulated in colon cancer. Genetic
annotations of colon genes are organized in the combined transcript data file, which makes the data
programmatically accessible. This will allow machine learning to be applied to extract meaningful trends
from the data, which can then be used for further processing by data mining.
Different graphs were generated to analyze the behavior of the transcripts in the
combined transcript data. These graphs were obtained from the Panther tool, and the up-regulated
and down-regulated transcripts are listed in tables according to their roles in molecular
functions, biological processes, cellular components, protein classes and pathways. The
graphs show how the genes are distributed across different biological functions and processes. This
functional distribution of transcripts can help in understanding the
activity of a transcript in gene expression. The combined transcript data needs to be analyzed
further, and machine learning will be used for this purpose.
The miRNA sites were identified with TargetScan and then analyzed further.
Transcripts are classified into two broad categories on the basis of manual analysis of
miRNA target sites. The miR dependent category includes only those transcripts in which
miRNA sites are present in the long 3'UTR isoform but absent from the short one. The miR
independent category is further divided into AEP (Alu Exclusion Associated
Polyadenylation) and ADP (Alu Directed Alternative Polyadenylation). Transcripts in which an Alu is present only in
the long 3'UTR isoform, and not in the short one, are categorized as AEP. ADP consists of all transcripts
that contain miRNA sites and an Alu in both the short and long 3'UTR isoforms.
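The classification rules described above can be written down as a small decision function. The following Python code is only a minimal sketch: the names `UtrIsoformPair` and `classify` are hypothetical, the four presence flags are assumed to come from the manual TargetScan and Alu analysis, and the order in which the rules are tested (miR dependent first, then AEP, then ADP) is an assumption, since the thesis does not state a precedence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UtrIsoformPair:
    """Presence flags for the short and long 3'UTR isoforms of one transcript.

    Hypothetical container; the flags are assumed to be filled in by hand
    from the miRNA target-site and Alu annotations.
    """
    mir_in_short: bool
    mir_in_long: bool
    alu_in_short: bool
    alu_in_long: bool


def classify(pair: UtrIsoformPair) -> str:
    """Assign a transcript to one of the categories described in the text."""
    # miR dependent: miRNA sites in the long isoform but not in the short one.
    if pair.mir_in_long and not pair.mir_in_short:
        return "miR dependent"
    # AEP: Alu present only in the long 3'UTR isoform.
    if pair.alu_in_long and not pair.alu_in_short:
        return "miR independent (AEP)"
    # ADP: miRNA sites and Alu present in both isoforms.
    if (pair.mir_in_short and pair.mir_in_long
            and pair.alu_in_short and pair.alu_in_long):
        return "miR independent (ADP)"
    # Anything else falls outside the three definitions (assumed fallback).
    return "unclassified"


# Alu in all transcripts, miRNA sites only in the long isoform.
print(classify(UtrIsoformPair(False, True, True, True)))
# Alu and miRNA sites in both isoforms.
print(classify(UtrIsoformPair(True, True, True, True)))
```

The first call corresponds to a pattern such as "Alu in all transcripts, miR is included in long transcript" and falls in the miR dependent category; the second corresponds to "Alu in all transcripts, miR is included in all transcripts" and falls in ADP. Testing the miR dependent rule first means that a transcript with an Alu only in the long isoform and miRNA sites only in the long isoform is reported as miR dependent, not AEP.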
5. References:
Howe JR, Mitros FA, Summers RW: The risk of gastrointestinal carcinoma in familial
juvenile polyposis. Ann Surg Oncol 5 (8): 751-6, 1998
Shinya H, Wolff WI: Morphology, anatomic distribution and cancer potential of colonic
polyps. Ann Surg 190 (6): 679-83, 1979
Magnus Løberg, M.D., Mette Kalager, M.D., Ph.D., Øyvind Holme, M.D., Geir Hoff,
M.D., Ph.D., Hans-Olov Adami, M.D., Ph.D., and Michael Bretthauer, M.D., Ph.D. "Long-
Term Colorectal-Cancer Mortality after Adenoma Removal" 2014.
Gary H. Perdew, Jack P. Vanden Heuvel, Jeffrey M. Peters. (2006). Regulation of Gene
Expression. Totowa, New Jersey: Humana Press.
Hoopes, L. (2008) Introduction to the gene expression and regulation topic room. Nature
Education 1(1):160
Lucy W. Barrett, Sue Fletcher, Steve D. Wilton. (2012). Regulation of eukaryotic gene
expression by the untranslated gene regions and other non-coding elements.
Cydney Brooke Nielsen "Mammalian Gene Regulation through the 3' UTR" 2001.
Nick J. Proudfoot, Andre Furger, and Michael J. Dye Integrating mRNA Processing with
Transcription 2002
Yuefeng Lin, Zhihua Li, Fatih Ozsolak, Sang Woo Kim, Gustavo Arango-Argoty, Teresa
T. Liu, Scott A. Tenenbaum, Timothy Bailey, A. Paula Monaghan, Patrice M. Milos and
Bino John "An in-depth map of polyadenylation sites in cancer" 2012
David P. Bartel Review MicroRNAs: Genomics, Biogenesis, Mechanism, and Function
2004
Richard W. Carthew and Erik J. Sontheimer "Origins and Mechanisms of miRNAs and
siRNAs" 2009
J. Hasler, T. Samuelsson, K. Strub "Useful junk: Alu RNAs in the human transcriptome"
(2007).
Chen LL, DeCerbo JN, Carmichael GG. "Alu element-mediated gene silencing." (2008).
Ling-Ling Chen and Gordon G. Carmichael "Gene regulation by SINES and inosines"
2008.
Prescott L. Deininger and Mark A. Batzer "Alu Repeats and Human Disease" 1996.
Bass BL. RNA editing by adenosine deaminases that act on RNA. Annu Rev Biochem
2002; 71:817-46.
Landry, J.R., Medstrand, P. and Mager, D.L. (2001) Repetitive elements in the 5'
untranslated region of a human zinc-finger gene modulate transcription and translation
efficiency. Genomics, 76, 110-116.
Julien Häsler, Tore Samuelsson and Katharina Strub "Alu RNAs embedded in 5' and 3'
UTRs of human mRNAs" 2007
Kohany O, Gentles AJ, Hankus L, Jurka J. "Annotation, submission and screening of
repetitive elements in Repbase: RepbaseSubmitter and Censor." 2006
Yuefeng Lin, Zhihua Li, Fatih Ozsolak, Sang Woo Kim, Gustavo Arango-Argoty, Teresa
T. Liu, Scott A. Tenenbaum, Timothy Bailey, A. Paula Monaghan, Patrice M. Milos
and Bino John "An in-depth map of polyadenylation sites in cancer" 2012
Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-
Lazareva B, Muruganujan A, Rabkin S, Vandergriff JA, Doremieux O. "PANTHER: a
browsable database of gene products organized by biological function, using curated
protein family and subfamily classification." (2003).
NCBI. (2011, April 06). The GenBank Submissions Handbook [Internet]. Retrieved 12
20, 2014, from National Center for Biotechnology Information (US):
http://www.ncbi.nlm.nih.gov/books/NBK53702/
Oracle. (n.d.). Class PrintWriter. Retrieved 12 20, 2014, from Oracle.com:
http://docs.oracle.com/javase/7/docs/api/java/io/PrintWriter.html
W3Schools. (n.d.). JavaScript Errors - Throw and Try to Catch. Retrieved 12 20, 2014,
from W3Schools.com: http://www.w3schools.com/js/js_errors.asp
Yang Zhang's Research Group. (n.d.). What is FASTA format? Retrieved 12 20, 2014,
from Zhang Lab University of Michigan: http://zhanglab.ccmb.med.umich.edu/FASTA/
TargetScan. (n.d.). TargetScanHuman. Retrieved 01 30, 2015, from targetscan.org:
http://www.targetscan.org/