Alu Repeats, Cause or Consequence of Colon Cancer

By

Muhammad Hamid

SP11-BSB-024

BS Thesis (2011-2014)

COMSATS Institute of Information Technology, Islamabad-Pakistan


December, 2014

COMSATS Institute of Information Technology

Alu Repeats, Cause or Consequence of Colon Cancer

A Thesis Presented to

COMSATS Institute of Information Technology, Islamabad


In Partial Fulfillment

of the requirement for the Degree of

BS
(Bioinformatics)

By

Muhammad Hamid

CIIT/SP11-BSB-024/ISB

December, 2014

Alu Repeats, Cause or Consequence of Colon Cancer

An Undergraduate Thesis submitted to the Department of Biosciences as partial fulfillment of the
requirement for the award of the Degree of B.S. (Bioinformatics).

Name Registration Number

Muhammad Hamid CIIT/SP11-BSB-024/ISB

Supervisor
Dr. Abdullah Ahmed
Assistant Professor, Department of Biosciences

Co-Supervisor
Dr. Aamira Tariq
Assistant Professor, Department of Biosciences

CIIT, Islamabad Campus.


December, 2014

Final Approval

This thesis titled

Alu Repeats, Cause or Consequence of Colon Cancer


Submitted for the Degree of BS Bioinformatics by

Muhammad Hamid

Has been approved for the COMSATS Institute of Information Technology Islamabad

External Examiner: __________________________________________________


Mr. Haroon Khan
Lecturer, Muhammad Ali Jinnah University, Islamabad

Supervisor: ________________________________________________________
Dr. Abdullah Ahmed
Assistant Professor, Department of Biosciences, CIIT Islamabad

Co-Supervisor: _____________________________________________________
Dr. Aamira Tariq
Assistant Professor, Department of Biosciences, CIIT Islamabad

Head of Department: ________________________________________________


Prof. Dr. Raheel Qamar (T.I.)
Department of Biosciences, CIIT Islamabad

Chairman: _________________________________________________________
Prof. Dr. Syed Habib Bokhari
Department of Biosciences, CIIT Islamabad

Declaration

I, Muhammad Hamid, hereby declare that I have produced the work presented in this thesis during the
scheduled period of study. I also declare that I have not taken any material from any source except where
duly referenced. If a violation of HEC rules on research has occurred in this thesis, I shall be liable to
punishable action under the plagiarism rules of the HEC.

Date:

Signature of the student:

____________________

(Muhammad Hamid)

(CIIT/SP11-BSB-024/ISB)

Certificate

It is certified that Muhammad Hamid has carried out all the work related to this thesis under my
supervision at the Department of Biosciences, COMSATS Institute of Information Technology, Islamabad
campus.

Supervisor:

Dr. Abdullah Ahmed

Submitted through:

Prof. Dr. Raheel Qamar (T.I.)


Head of Department

Dr. Syed Habib Bokhari


Chairman

DEDICATION

To Almighty ALLAH and the Holy Prophet Muhammad


(S.A.W.W)

&

My Loving Family, Friends and Respected Teachers

ACKNOWLEDGEMENTS

Thanks to ALMIGHTY ALLAH, whose unlimited and unpredictable source of help made me able to win
honors of life.

I also pay all my respect to Hazrat Muhammad (S.A.W.W) and his faithful companions, who are forever a
true torch of guidance for humanity as a whole.

With profound gratitude and deep sense of devotion, I wish to thank my worthy
supervisor Dr. Abdullah Ahmed, Assistant Professor Department of Biosciences, COMSATS
Institute of Information Technology, Islamabad for his most cooperative attitude, help, keen
interest and valuable comments throughout the course of these studies and guidance in the
preparation of this thesis.

I highly appreciate the guidance and help rendered by Dr. Aamira Tariq, Assistant
Professor Department of Biosciences, COMSATS Institute of Information Technology,
Islamabad. Without this help it would not have been possible to complete this work.

My sincere thanks to my family, who supported me a lot to continue my studies. Sincere thanks to my
mother for her prayers and to my father, sister and brothers for their support; without their help I
would not have been able to achieve this. Special thanks to Mustafa Bilal, Waseem Ahmed,
Wafa Naqvi and Urooj Burhana for their support. Sincere thanks to all of my friends for their
encouragement. Also special thanks to my class fellows who guided me and supported me.
Thanks to all those who helped me in any way to make this work easier for me.

Finally, I wish to extend my hearty thanks to all of my family members whose encouragement and
continuous moral as well as material support paved the way for me to reach this destination.

(Muhammad Hamid)

ABSTRACT

Alu elements are short (~280 nucleotide) sequences found in primate genomes. Pairing of
inverted Alu repeats forms duplex structures, which contribute to hyperediting. Alu elements are
more abundant in the 3'UTR than in the 5'UTR. Alu elements play a vital role in genome evolution and
influence gene expression and the translation of mRNA. Alu insertions play a role in a variety of
regulatory mechanisms that can lead to various forms of cancer. Microarray data were collected from
IntOGen, and the group of genes that contain Alu elements and are involved in the development of colon
cancer was extracted programmatically. The 3'UTR RNA Database, BioMart, REPBase Censor and the
xPAD Expression & PolyA Database were used to enhance the genomic annotation of the genes of
interest, e.g. sequences, orientation, and long and short transcripts in both normal and cancer cells.
Graphs were generated and genes were classified with the help of the PANTHER tool to analyze the
genes that are up-regulated and down-regulated in colon cancer. The TargetScan tool was used to
categorize and analyze transcripts on the basis of the miRNA sites located in each 3'UTR isoform.

TABLE OF CONTENTS

1. Introduction:
1.1 Gene Expression and Regulation:
1.1.1 Untranslated Regions (UTR):
1.1.2 Polyadenylation:
1.1.3 Micro RNA:
1.2 Alu Elements:
1.2.1 Altered Splicing:
1.2.2 A-To-I RNA Editing:
1.2.3 Polyadenylation in Inverted Repeated Alu:
1.2.4 Alu RNA and Micro RNA:
1.3 Objectives:

2. Materials and Methods:
2.1 Materials:
2.1.1 IntOGen:
2.1.2 3'UTR RNA Database:
2.1.3 BioMart:
2.1.4 REPBase Censor:
2.1.5 xPAD Expression & PolyA Database:
2.1.6 Panther Classification System:
2.1.7 TargetScanHuman:
2.1.8 Workbench:
2.2 Methods:

3. Results:
3.1 CombinedTranscriptData:
3.2 Results of Panther Classification System:
3.3 Up Regulation:
3.3.1 Molecular Function:
3.3.2 Biological Process:
3.3.3 Cellular Component:
3.3.4 Protein Class:
3.3.5 Pathway:
3.4 Down Regulation:
3.4.1 Molecular Function:
3.4.2 Biological Process:
3.4.3 Cellular Component:
3.4.4 Protein Class:
3.4.5 Pathway:
3.5 Analysis of TargetScan:

4. Discussion:

5. Reference:
5.1 Paper and Books References:
5.2 Electronic References:
LIST OF FIGURES

Figure 1: Age range in selected countries is 15 and older excluding Asia and Africa (American
Cancer Society, 2008)

Figure 2: Transcription of Alu elements by RNA polymerase II and RNA polymerase III.

Figure 3: Different ways in which Alu elements might influence gene expression by A-to-I
editing.

Figure 4: UTR were extracted from the EMBL database and with the aid of Repeatmasker tool
these regions were analyzed to get information regarding the existence of Alu elements. Each
dataset was manually analyzed to know its annotation and finally non annotated genes were
removed from entries. Rest of the entries were arranged as the orientation and architecture of Alu
RNAs

Figure 5: (www.ensembl.org/biomart)

Figure 6: Graphical representation of programmatically accessed data from several classes into
main class CombinedTranscriptData that is finally generating output.
LIST OF TABLES

Table 2.1.4: Censor output format
Table 3.3.1 (A): Molecular function
Table 3.3.1 (B): Molecular function
Table 3.3.2 (A): Biological process
Table 3.3.2 (B): Biological process
Table 3.3.2 (C): Biological process
Table 3.3.3: Cellular component
Table 3.3.4 (A): Protein class
Table 3.3.4 (B): Protein class
Table 3.3.4 (C): Protein class
Table 3.3.4 (D): Protein class
Table 3.3.4 (E): Protein class
Table 3.3.4 (F): Protein class
Table 3.3.5 (A): Pathway
Table 3.3.5 (B): Pathway
Table 3.3.5 (C): Pathway
Table 3.3.5 (D): Pathway
Table 3.3.5 (E): Pathway
Table 3.4.1 (A): Molecular function
Table 3.4.1 (B): Molecular function
Table 3.4.2 (A): Biological process
Table 3.4.2 (B): Biological process
Table 3.4.2 (C): Biological process
Table 3.4.3 (A): Cellular component
Table 3.4.3 (B): Cellular component
Table 3.4.4 (A): Protein class
Table 3.4.4 (B): Protein class
Table 3.4.4 (C): Protein class
Table 3.4.4 (D): Protein class
Table 3.4.4 (E): Protein class
Table 3.4.5 (A): Pathway
Table 3.4.5 (B): Pathway
Table 3.4.5 (C): Pathway
Table 3.5 (A): Transcripts involved in miR dependent.
Table 3.5 (B): Transcripts involved in AEP.
Table 3.5 (C): Transcripts involved in ADP.
LIST OF ABBREVIATIONS

CRC Colorectal Cancer

UTR Untranslated Region

APA Alternative Polyadenylation

RNAi Ribonucleic Acid Interference

Ago Argonaute

RISC RNA-Induced Silencing Complex

SINE Short Interspersed Nuclear Elements

LINE Long Interspersed Elements

RNP Ribonucleoprotein

PKR Protein Kinase R

ADAR Adenosine Deaminase Acting on RNA

AREs Adenine/Uracil Rich Elements

IRAlu Inverted Repeat Alu Elements

EMBL European Molecular Biology Laboratory

xPAD Expression and PolyA Database

Panther Protein Analysis through Evolutionary Relationships

Chapter 1
Introduction

Introduction Chapter 1

1. Introduction:
Cancer of the lower part of the digestive system, i.e. the large intestine (colon), is referred to as
colon cancer, whereas cancer of the last few inches of the colon is known as rectal cancer. Together
they are referred to as colorectal cancers (CRC). Most colon cancer cases begin as adenomatous
polyps, which are initially small, noncancerous (benign) clumps of cells. Polyps may or may not
produce symptoms and mostly remain small to medium in size, but with the passage of time some of
these polyps may turn into cancers (Mayo Clinic, 2013).

Polyps form mainly in older people; however, most polyps do not turn into cancerous cells
(Jessica, 2010). The American Cancer Society has listed a number of symptoms for the detection of
colorectal cancer, including rectal bleeding, bloody stools, changes in bowel habits, cramps in the
colorectal region, and fatigue and weakness followed by weight loss
(Cancer Facts and Figures, 2013).

Almost 75% of patients with CRC have sporadic disease, with no obvious evidence that the
disorder is inherited. The remaining 25% of patients have a family history of CRC. In some
families prone to colon cancer, genetic mutations have been observed to be the cause of inherited
cancer. Such mutations are predicted to cause only 5% to 6% of all CRC cases. It is possible that
background genetic factors and undiscovered genes, together with risk factors that are non-genetic
in nature, are the key contributors to the development of familial CRC (Leggett, 2010).

Men are 30% to 40% more susceptible to colorectal cancer than women
(Colorectal Facts and Figures, 2014). Each year 600,000 new CRC cases are diagnosed globally;
incidence varies from 48.3 to 72.5 per 100,000 in men and from 32.3 to 56 per 100,000 in women. In
Pakistan, young patients, mainly those under 40 years of age, are more vulnerable to CRC, and their
survival rate is lower than that of older patients (probably due to late identification)
(Abdul Qaiyoume Amini, 2013). Diagnosis of colorectal cancer at an early stage and removal of
low-risk adenomas reduce the risk of death, whereas removal of high-risk adenomas was associated
with a 16% higher death rate from colorectal cancer (Magnus et al, 2014).


[Figure: Five year relative survival rates (%) of CRC patients in select countries]

Figure 1: Age range in selected countries is 15 and older excluding Asia and Africa (American Cancer
Society, 2008)

1.1 Gene Expression and Regulation:


Gene expression is the process by which DNA is transcribed into RNA, leading to protein
synthesis. Gene expression is regulated by a number of mechanisms acting at the transcriptional and
translational levels. Moreover, some cis-acting elements and trans-acting factors contribute
significantly to modulating gene expression (Gary et al, 2006).

1.1.1 Untranslated Regions (UTR):


3'UTRs (three prime untranslated regions) are mRNA sequences found at the 3' end that
are not translated into protein products. The 3'UTR contains regulatory regions that are responsible
for mRNA stability, polyadenylation, translation efficiency and transcript export. Previous studies
have documented a substantial role of the 3'UTR in the regulation of gene expression
(Lucy W. Barrett, 2012).

The 3'UTR contains binding sites for both regulatory proteins and microRNAs,
contributing to post-transcriptional control of gene expression.


1.1.2 Polyadenylation:
Polyadenylation is the addition of a poly(A) tail, a stretch of adenosine monophosphates, to an
mRNA sequence. The mechanism involves a protein complex which cleaves the pre-mRNA (the initial
product of transcription); the poly(A) tail is then added at the 3' end, at one of various possible
sites of the mRNA, resulting in multiple transcripts (Nick J. Proudfoot, 2002). Alternative
polyadenylation (APA) can produce mRNA isoforms with 3'UTRs of variable length due to the presence
of multiple polyadenylation signals, and therefore produce multiple proteins from a single gene.
Alternative polyadenylation enhances the diversity of transcripts. Translational efficiency varies
with the length of the 3'UTR. Previous studies have shown increased protein production from
transcripts bearing a shortened 3'UTR as compared to transcripts bearing the long 3'UTR.

It has been observed that tissue-specific polyadenylation sites occur across the major
cancers and their respective normal tissues. Multiple polyadenylation sites are found in the 3'UTR
of 30% of genes, although most of these genes have two polyadenylation sites. Polyadenylation
signals are position specific: the AT-rich motif (TATATW) is highly preferred by short isoforms and
(AATAAA) is preferred by long isoforms. A higher frequency of short than of long isoforms
has been observed in genes up-regulated in cancer (Yuefeng Lin, 2012).
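As an illustration of how candidate polyadenylation signals could be located in a 3'UTR, the
following minimal Python sketch scans a sequence for the two motifs named above. The function name,
the example sequence and the reading of W as the IUPAC code for A or T are assumptions made for
illustration only; this is not the procedure used by any of the databases described in this thesis.

```python
import re

# Motifs named in the text; W is read here as the IUPAC code for A or T (assumption).
# Lookahead patterns are used so that overlapping occurrences are also reported.
PAS_MOTIFS = {
    "AATAAA": re.compile(r"(?=AATAAA)"),
    "TATATW": re.compile(r"(?=TATAT[AT])"),
}


def find_pas_sites(utr_seq):
    """Return {motif: [0-based start positions]} for a 3'UTR sequence (DNA or RNA)."""
    seq = utr_seq.upper().replace("U", "T")
    return {
        name: [match.start() for match in pattern.finditer(seq)]
        for name, pattern in PAS_MOTIFS.items()
    }


# Hypothetical 3'UTR fragment, not taken from any database.
example_utr = "GGCTATATACCGGAATAAAGGCCAATAAATT"
sites = find_pas_sites(example_utr)
print(sites)
```

A transcript with more than one reported position is a candidate for alternative polyadenylation;
real analyses would also take the position of each signal within the 3'UTR into account.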

1.1.3 Micro RNA:


MicroRNAs (miRNAs) are small RNA molecules of ~22 nucleotides that act in the
regulation of genes in animals and plants. miRNAs play a role in translational repression and
mRNA cleavage, and influence the coding genes of many organisms (David P. Bartel, 2004).
miRNAs act as regulators of gene expression differently in plants and animals. In plants,
miRNAs cleave mRNA by forming perfectly complementary base pairs with their mRNA targets via RNA
interference (RNAi), while in animals miRNAs act as repressors, inhibiting protein synthesis by
forming imperfectly complementary base pairs with their mRNA targets (Victor Ambros, 2004). RNA
polymerase II is responsible for the production of miRNA transcripts; these transcripts are later
capped and polyadenylated.

miRNAs form complexes with Argonaute (Ago) proteins, giving rise to the RNA-induced
silencing complex (RISC), which represses mRNA expression. mRNA recognition and regulation are
achieved by the miRNA, which functions as an adaptor for the miRISC complex. In animals, miRNA
binding sites lie in the 3'UTR, and most miRNAs form imperfect complementarity with the
mRNA, whereas in plants miRNAs form perfect complementarity with the coding sequence of their
targets. miRNA-mRNA binding is therefore considered essential in this regulatory mechanism.
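The idea of a miRNA recognising a complementary site in a 3'UTR can be sketched in a few lines of
Python. The sketch assumes the common convention that pairing is driven by a "seed" at miRNA
nucleotides 2 to 8; that convention, the function names and both example sequences are illustrative
assumptions and are not taken from this thesis or from the TargetScan implementation.

```python
# Watson-Crick complement table for RNA.
COMPLEMENT = str.maketrans("ACGU", "UGCA")


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return rna.translate(COMPLEMENT)[::-1]


def seed_sites(mirna, utr, seed_start=1, seed_end=8):
    """Return (site, positions): the UTR motif complementary to the miRNA seed
    and the 0-based positions where it occurs in the UTR.

    The default slice [1:8] corresponds to miRNA nucleotides 2-8 (1-based),
    which is an assumed seed definition.
    """
    mirna = mirna.upper().replace("T", "U")
    utr = utr.upper().replace("T", "U")
    site = reverse_complement(mirna[seed_start:seed_end])
    positions = []
    start = utr.find(site)
    while start != -1:
        positions.append(start)
        start = utr.find(site, start + 1)
    return site, positions


# Example sequences used purely for illustration.
example_mirna = "UGAGGUAGUAGGUUGUAUAGUU"
example_utr = "AAGCUACCUCAGGGCUACCUCUU"
site, positions = seed_sites(example_mirna, example_utr)
print(site, positions)
```

Only perfect seed matches are reported here; the imperfect pairing outside the seed that is typical
of animal miRNAs is deliberately ignored in this sketch.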

1.2 Alu Elements:


An Alu element is a short DNA segment of about 280 nucleotides with a dimeric structure.
The name is derived from the Arthrobacter luteus restriction endonuclease, because this enzyme
is able to cleave some Alu repeats. Different kinds of Alu repeats are found in the primate
genome. Alu elements belong to the SINE family of retrotransposons and are derived from 7SL RNA.
Alu elements number about one million copies, amplified by the process of retrotransposition.
Approximately 25% of all genomic methylation is due to the fact that Alu elements have a massive
number of CpG residues. Methylation of an Alu is likely to influence its expression; it varies
between tissues and appears to decrease in certain tumors (Prescott Deininger, 2011). Alu elements
are transcribed by two polymerases:

Free Alu RNAs are transcribed from their own RNA polymerase III promoter, which drives
transcription initiation but lacks a terminator.

Embedded Alu RNAs are transcribed by RNA polymerase II as part of protein-coding and
non-protein-coding RNAs.

Alu elements consist of two monomers, the left and right arms, connected by an A-rich linker
and followed by a short poly(A) tail. Recent research has shown that only a limited number of
Alu elements are capable of retrotransposition. As they do not code for protein, they are
amplified by the transposition machinery of other elements, which is thought to be that of LINE-1.


Figure 2: Transcription of Alu elements by RNA polymerase II and RNA polymerase III.

Free Alu elements are transcribed by RNA polymerase III. They play an important role in
genome evolution via insertion and recombination; however, the majority are genetically inert. The
internal promoter alone cannot drive transcription, and Alu elements therefore rely for their
expression on the sequences flanking their site of insertion. Alu transcripts have
been observed to increase in number under certain stress conditions such as heat shock,
adenovirus infection and cycloheximide exposure.

Free SRP9/14 binds to Alu RNA to form the Alu RNP (ribonucleoprotein) complex, which
acts as an inhibitor of protein translation, whereas Alu RNA itself enhances the translation of
mRNA. Alu RNA inhibits the RNA-dependent protein kinase (PKR) and consequently stimulates protein
translation.

Embedded Alu elements influence gene expression via splicing, ADAR editing and
polyadenylation (Prescott Deininger, 2001). Research has revealed that more Alu RNAs are
embedded in the 3'UTR than in the 5'UTR: there is a single Alu element per 24,000 bases in the
5'UTR and a single Alu element per 14,000 bases in the 3'UTR. It has been documented that
Alu RNAs embedded in the 5'UTR of particular mRNAs inhibit protein translation. In a transcript
isoform of the DNA repair protein BRCA1 (expressed in breast cancer tissue), in ZNF177 (a zinc
finger protein) and in contactin, they decrease the translation efficiency of the mRNA.
Antisense Alu elements inserted in the 3'UTR can produce adenine/uracil-rich elements (AREs).
mRNA expression can be influenced by AREs, as they are involved in the destabilization of mRNA.
The SRP9/14 protein can bind some Alu RNAs to alter the stability of certain transcripts.
Alu RNA in the 3'UTR helps to regulate mRNA stability, whereas Alu RNA in the 5'UTR represses
translation (J. Hasler, 2007).
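The two densities quoted above can be compared with a short calculation. The sketch below uses
only the figures given in the text (one Alu per 24,000 bases in the 5'UTR and one per 14,000 bases
in the 3'UTR); the variable names are illustrative.

```python
# Figures quoted in the text.
BASES_PER_ALU_5UTR = 24_000
BASES_PER_ALU_3UTR = 14_000

# Density expressed as Alu elements per kilobase of UTR sequence.
density_5utr_per_kb = 1_000 / BASES_PER_ALU_5UTR
density_3utr_per_kb = 1_000 / BASES_PER_ALU_3UTR

# Relative enrichment of Alu elements in 3'UTRs over 5'UTRs.
enrichment = density_3utr_per_kb / density_5utr_per_kb

print(f"5'UTR: {density_5utr_per_kb:.4f} Alu/kb")
print(f"3'UTR: {density_3utr_per_kb:.4f} Alu/kb")
print(f"3'UTR / 5'UTR enrichment: {enrichment:.2f}")
```

On these figures, Alu elements are roughly 1.7 times as dense in 3'UTRs as in 5'UTRs.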

1.2.1 Altered Splicing:


5.2% of all known exons originate from Alu elements. Alu elements contain splice sites, most
of them on the minus strand. Alu sequences contain 9 possible 5' splice sites (position 158)
and 14 possible 3' splice sites (positions 275 and 279). Alu insertions have been observed in
pathologies such as the Alport and the Sly syndromes (J. Hasler, T. Samuelsson, K. Strub 2007).

1.2.2 A-To-I RNA Editing:


Extensive A-to-I RNA editing is primate specific and is determined by the presence of embedded
Alu RNAs in the human transcriptome. Most A-to-I editing events occur in non-coding
regions of mRNA. Base pairing between two embedded Alu RNAs occurs intramolecularly, and
adenosine editing is favored in embedded Alu RNAs (J. Hasler, T. Samuelsson, K. Strub
2007). Some Alu repeats can produce duplex structures by pairing with their reverse complement,
forming Inverted Repeat Alu elements (IRAlus), which contribute to hyperediting by ADAR
enzymes (Chen et al, 2008).

IRAlus form dsRNA of about 300 base pairs because of the high homology found among all Alu
subfamilies. Gene expression can be affected by IRAlus. Editing of IRAlus can serve a
quality-control function, and this mechanism can be used to regulate the amount of mRNA by
preventing randomly edited mRNA from reaching the cytoplasm when it is exported from the nucleus.
Mouse CTN-RNA remains in the nucleus until cell stress occurs. Cleavage of the hyperedited 3'UTR
nuclear retention signal enables its export to the cytoplasm, where translation occurs.
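Given a list of Alu annotations within a single 3'UTR, candidate IRAlu pairs can be found by
looking for two elements in opposite orientation. The following Python sketch shows one way this
could be done; the `AluHit` record, the `inverted_pairs` function, the optional distance cut-off
and the example coordinates are all hypothetical and are not the method used in this thesis.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class AluHit:
    """One annotated Alu element within a 3'UTR (hypothetical record layout)."""
    name: str
    start: int
    end: int
    strand: str  # "+" for sense, "-" for antisense


def inverted_pairs(hits, max_gap=None):
    """Return (upstream name, downstream name, gap) for every pair of Alu
    elements in opposite orientation. If max_gap is given, pairs separated by
    more than max_gap bases are skipped."""
    pairs = []
    ordered = sorted(hits, key=lambda hit: hit.start)
    for first, second in combinations(ordered, 2):
        if first.strand == second.strand:
            continue  # same orientation cannot form an inverted repeat
        gap = second.start - first.end
        if max_gap is None or gap <= max_gap:
            pairs.append((first.name, second.name, gap))
    return pairs


# Made-up coordinates for illustration only.
example_hits = [
    AluHit("AluSx", 100, 400, "+"),
    AluHit("AluY", 900, 1200, "-"),
    AluHit("AluJb", 3000, 3300, "+"),
]
all_pairs = inverted_pairs(example_hits)
close_pairs = inverted_pairs(example_hits, max_gap=1000)
print(all_pairs)
print(close_pairs)
```

The distance cut-off is included because closely spaced inverted repeats are more plausible
candidates for intramolecular pairing; the value of 1000 bases is arbitrary.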


Figure 3: Different ways in which Alu elements might influence gene expression by A-to-I editing.

1.2.3 Polyadenylation in Inverted Repeated Alu:


As mentioned above, mammalian genes contain multiple polyadenylation signals.
Almost 50% of mammalian genes undergo alternative cleavage and polyadenylation to produce
more than one mRNA isoform with different 3'UTRs. Approximately 333 human genes contain
IRAlus in their 3'UTR and also have multiple polyadenylation sites. We can therefore hypothesize
that alternative polyadenylation facilitates gene regulation by IRAlus.

Another related mechanism of regulation involves the expression of alternative 3'UTRs via
alternative pre-mRNA splicing. Two genes, caspase 8 and caspase 10, that lie on chromosome 2
can each express two different mRNA 3'UTRs: an upstream one which contains IRAlus and a
downstream one which does not. Splicing determines which 3'UTR is included; these alternative
3'UTRs can therefore affect the expression level of the encoded proteins, and this can be regulated
by cellular stress. It has been observed that proliferating cells express mRNAs with shortened
3'UTRs and fewer miRNA target sites (Ling-Ling Chen, 2008).


1.2.4 Alu RNA and Micro RNA:


As discussed earlier, miRNAs play a vital role in the regulation of gene expression
by translational repression or mRNA cleavage. Numerous miRNAs found in intronic regions of
mRNA are transcribed from Alu elements. Borchert and colleagues revealed that miRNAs can be
transcribed by RNA polymerase III and that miRNA expression is linked with Alu transcription. Sense
Alu RNAs embedded in the 3'UTR of mRNA show complementarity with specific sequences of 30
miRNAs (J. Hasler, T. Samuelsson, K. Strub 2007).

1.3 Objectives:
The main objective of this research is to elaborate the key functions of Alu elements and
their role in cancer. The research aims to predict the regulation of genes mediated by
Alu elements and how Alu elements act on regulatory mechanisms. We aim to distinguish
normal and cancer cells by comparing certain parameters, e.g. alternative polyadenylation,
position in the 3'UTR, number of Alu RNAs and strand. We will observe the impact of Alu elements
on mRNA stability and their interaction with miRNA and various other regulatory factors. This study
will help us to understand the role of Alu elements in oncogenes and in gene expression at the
cellular level. A genome-wide analysis of the association between mutated cancer genes and Alu
elements urgently needs to be performed.

Chapter 2
Materials and Methods

Materials & Methods Chapter 2

2. Materials and Methods:

2.1 Materials:
This section details the resources used in completing this research. It covers the data
sources from which we extracted data and the analyses performed.

2.1.1 IntOGen:
IntOGen is an integrative oncogenomics tool for studying and understanding
cancers. It provides a platform to analyze oncogenomics data for gene prediction and for the
involvement of groups of genes in the development of cancer (Gonzalez-Perez A, 2013). IntOGen has a
collection of genomic experimental data from microarrays. These are vital for studying the
alterations that lead to various cancer types. IntOGen data were obtained from the International
Cancer Genome Consortium (www.icgc.org) and The Cancer Genome Atlas databases.

2.1.2 3'UTR RNA Database:


Alu RNAs are embedded in the untranslated regions of mRNA transcripts. By screening a
non-redundant UTR database, 1,700 UTRs containing Alu RNAs were
identified (Landry et al., 2001). The Katharina Strub Laboratory extracted 5' and 3' UTRs from the
EMBL database and analysed them with the Repeatmasker software (http://www.repeatmasker.org).
A SwissProt-TrEMBL annotation was manually assigned to each entry, and unknown or non-annotated
entries were removed. Further analysis was then performed to sort sense and antisense
elements. Finally, further information such as SwissProt accession numbers and the position of the
Alu element was added, and the data were compiled into two files: a 3'UTR RNA assembly with embedded
Alu elements and a 5'UTR RNA assembly with embedded Alu elements.


Figure 4: UTRs were extracted from the EMBL database and, with the aid of the Repeatmasker tool, these
regions were analyzed for the presence of Alu elements. Each dataset was manually
analyzed to determine its annotation, and non-annotated genes were removed from the entries. The
remaining entries were arranged according to the orientation and architecture of the Alu RNAs.

2.1.3 BioMart:
BioMart allows users rapid access to Ensembl data, mostly recent genomic annotations
(Smedley D, 2009). It generates results according to the user's interest and produces several
output formats (.html, .csv, .xls etc). BioMart helps us to retrieve relevant information from
large genomic datasets programmatically.
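Programmatic BioMart access is usually driven by an XML query document. The following Python
sketch only builds such a document and does not contact any server. The helper name
`build_biomart_query` is hypothetical, and the dataset, filter and attribute names
(`hsapiens_gene_ensembl`, `hgnc_symbol`, `ensembl_gene_id`, `ensembl_transcript_id`) are
assumptions that should be checked against the current Ensembl BioMart configuration; the thesis
does not state which query was actually used.

```python
import xml.etree.ElementTree as ET


def build_biomart_query(dataset, filters, attributes):
    """Build a BioMart XML query string (hypothetical helper, no network access).

    filters maps a filter name to a list of values; attributes is a list of
    column names to return.
    """
    query = ET.Element("Query", {
        "virtualSchemaName": "default",
        "formatter": "TSV",
        "header": "1",
        "uniqueRows": "1",
        "datasetConfigVersion": "0.6",
    })
    dataset_node = ET.SubElement(
        query, "Dataset", {"name": dataset, "interface": "default"}
    )
    for name, values in filters.items():
        ET.SubElement(dataset_node, "Filter",
                      {"name": name, "value": ",".join(values)})
    for name in attributes:
        ET.SubElement(dataset_node, "Attribute", {"name": name})
    return ET.tostring(query, encoding="unicode")


# Assumed dataset, filter and attribute names; gene symbols chosen for illustration.
query_xml = build_biomart_query(
    dataset="hsapiens_gene_ensembl",
    filters={"hgnc_symbol": ["BRCA1", "CASP8"]},
    attributes=["ensembl_gene_id", "ensembl_transcript_id"],
)
print(query_xml)
```

The resulting string would then be submitted to the BioMart web service, which returns the
requested columns as tab-separated text.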

Figure 5: (www.ensembl.org/biomart)


2.1.4 REPBase Censor:


Repbase is a resource that contains annotations of repetitive DNA. It provides two software tools,
RepbaseSubmitter and Censor; both are available as web-based online tools and as
downloadable versions (Kohany et al, 2006). Censor allows users to query
sequences against a reference collection of repeats; homologous portions of the query sequence
are censored with masking symbols (N for nucleotides, X for amino acids). Censor finally
generates a report which classifies all known repeats found. The output page contains the
following information:
1. Map of repeats and its graphical representation.

2. The censored query sequences, with an "N" ("X") replacing each base of the removed
repeats.

3. The local alignment results.

4. The fragments that were censored out, i.e. fragments homologous to one of the repeats
from the reference collection.

5. Annotation portion of all detected repeats.

Repeats Output Format:

Score  Pos   Sim     Dir  Class      To    From  Name         To     From   Name
208    0.72  0.7154  c    LTR/Copia  170   33    ZMCOPIA1_I   21158  21018  N48
250    0.77  0.7656  c    LTR        943   883   ALFARE1_I    21715  21651  N48
1523   0.69  0.6854  c    LTR/Gypsy  3290  2619  DIASPORA_I   22622  21966  N48
224    0.82  0.8235  c    LTR/Copia  2659  2607  ATCOPIA35_I  23200  23152  N48
1355   0.67  0.6672  c    LTR/Gypsy  1737  1130  DIASPORA_I   24003  23391  N48

Table 2.1.4: Censor output format.
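
As an illustration of how a row such as those in Table 2.1.4 could be read programmatically, the following minimal sketch splits one whitespace-separated line into its eleven columns. The class name, field names and the queryLength() helper are hypothetical and are not taken from the thesis code. The sketch assumes that the first To/From/Name triple describes the repeat and the second describes the query sequence, and that the column order is the one printed in the table, which may differ from a real Censor export.

```java
/** Hypothetical holder for one row of the repeat table shown in Table 2.1.4. */
class CensorRowSketch {
    final int score;
    final double pos;
    final double sim;
    final String dir;
    final String repeatClass;
    // Assumption: the first To/From/Name triple refers to the repeat ...
    final int repeatTo;
    final int repeatFrom;
    final String repeatName;
    // ... and the second triple refers to the query sequence.
    final int queryTo;
    final int queryFrom;
    final String queryName;

    private CensorRowSketch(String[] f) {
        score = Integer.parseInt(f[0]);
        pos = Double.parseDouble(f[1]);
        sim = Double.parseDouble(f[2]);
        dir = f[3];
        repeatClass = f[4];
        repeatTo = Integer.parseInt(f[5]);
        repeatFrom = Integer.parseInt(f[6]);
        repeatName = f[7];
        queryTo = Integer.parseInt(f[8]);
        queryFrom = Integer.parseInt(f[9]);
        queryName = f[10];
    }

    /** Splits a line on runs of whitespace and checks that all eleven columns are present. */
    static CensorRowSketch parse(String line) {
        String[] f = line.trim().split("\\s+");
        if (f.length != 11) {
            throw new IllegalArgumentException("Expected 11 columns, found " + f.length);
        }
        return new CensorRowSketch(f);
    }

    /** Number of query bases covered by the hit (coordinates are inclusive). */
    int queryLength() {
        return queryTo - queryFrom + 1;
    }

    public static void main(String[] args) {
        CensorRowSketch r = parse("208 0.72 0.7154 c LTR/Copia 170 33 ZMCOPIA1_I 21158 21018 N48");
        System.out.println(r.repeatName + " covers " + r.queryLength() + " bases of " + r.queryName);
    }
}
```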

2.1.5 xPAD (Expression & PolyA Database):


The 3'UTR is important in post-transcriptional gene expression, so alterations to it can lead to many
diseases, including cancer. It has recently been reported that in most cases genes switch their expression
from long 3'UTR isoforms to short ones. This switch arises from alternative polyadenylation (APA), which
can affect transcript stability, efficiency and export. The Expression and PolyA Database (xPAD) was
constructed to list polyadenylation sites, the 3'UTR isoforms they produce, and the expression of those
isoforms in both normal and tumor cells. xPAD maintains a map of polyadenylation sites in cancer tissues
and tumor cell lines.
The xPAD database contains information on APA-mediated gene regulation, analysis of
miRNA targets, and 3'UTRs and their cellular usage in five major organs across normal and tumor cell lines
(Yuefeng et al, 2012).

2.1.6 Panther Classification System:


Panther (Protein Analysis Through Evolutionary Relationships) is a large database of
proteins and their genes. The system is designed to classify proteins according to their
families and subfamilies. It produces statistical data and graphical representations of the genes
involved in various biological activities (Thomas et al, 2003).

2.1.7 TargetScanHuman:
TargetScan is an online web server used to predict miRNA target sites that lie in the
3'UTR of an mRNA. The tool searches for conserved 8mer and 7mer sites that match
the seed region of each miRNA. Nonconserved sites can optionally be predicted as well
(TargetScan, 2015).
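
The seed-matching idea described above can be sketched in a few lines of Java. This is only an illustration and not TargetScan's implementation: it ignores conservation and context scoring entirely. It assumes that a 7mer site is an exact match to miRNA nucleotides 2-8 (often called a 7mer-m8 site), that an 8mer site is such a match followed by an A in the mRNA, and that both sequences are given in the RNA alphabet (A, C, G, U) in 5' to 3' orientation. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative seed-match search; not the TargetScan algorithm itself. */
class SeedMatchSketch {

    /** Reverse complement of an RNA string containing only A, C, G and U. */
    static String reverseComplement(String rna) {
        StringBuilder sb = new StringBuilder(rna.length());
        for (int i = rna.length() - 1; i >= 0; i--) {
            switch (rna.charAt(i)) {
                case 'A': sb.append('U'); break;
                case 'U': sb.append('A'); break;
                case 'G': sb.append('C'); break;
                case 'C': sb.append('G'); break;
                default:
                    throw new IllegalArgumentException("Unexpected base: " + rna.charAt(i));
            }
        }
        return sb.toString();
    }

    /** 0-based start positions in the UTR of sites that pair with miRNA nucleotides 2-8. */
    static List<Integer> find7merM8Sites(String mirna, String utr) {
        String seed = mirna.substring(1, 8);       // nucleotides 2-8 (1-based)
        String site = reverseComplement(seed);     // what the mRNA must contain
        List<Integer> hits = new ArrayList<>();
        for (int i = utr.indexOf(site); i >= 0; i = utr.indexOf(site, i + 1)) {
            hits.add(i);
        }
        return hits;
    }

    /** True if the 7mer site starting at 'start' is followed by an A, making it an 8mer site. */
    static boolean is8mer(String utr, int start) {
        int next = start + 7;
        return next < utr.length() && utr.charAt(next) == 'A';
    }

    public static void main(String[] args) {
        String mirna = "UGAGGUAGUAGGUUGUAUAGUU";   // mature let-7a sequence
        String utr = "AAACUACCUCAGGGCUACCUCG";     // made-up UTR fragment for the example
        for (int start : find7merM8Sites(mirna, utr)) {
            System.out.println(start + (is8mer(utr, start) ? " 8mer" : " 7mer-m8"));
        }
    }
}
```

A real analysis would also need to convert T to U when the UTR is supplied as DNA, which this sketch leaves out.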

2.1.8 Workbench:
Various scripts were written to extract oncogenes and their related data in an
organized form.

Extract_Genes:

Basic functions of this class:

i. Extract the genes that are up regulated and down regulated in colon cancer.

ii. Filter the Alu containing genes.

iii. Extract other relevant data of the genes from two files.

iv. Write the data in two TSV files.

This class is written in the Java programming language to extract the Alu-containing genes that are
up regulated and down regulated in colon cancer. First, it reads two files, genes_site_colon.tsv
and 3'UTR.txt, which were downloaded from Intogen and the 3'UTR RNA Database respectively,
and transfers them into a buffer, from which the data can easily be stored in arrays.
intogenRawData and alu_containing_utr are two arrays that hold the row-wise
data from the two files. The intogenRawData array is processed to extract the genes that are up
regulated and down regulated in cancer. The genes containing Alu elements are then filtered using the
alu_containing_utr array. Other relevant data for these genes (EMBL number, Alu type,
Alu orientation, strand, and up regulation and down regulation values) are also extracted from both
arrays and written to two output files, Genes_Extract_Upreg_output.txt and its counterpart for the
down regulated genes.
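
The filtering performed by Extract_Genes can be sketched as follows. This is a minimal illustration and not the thesis code: the column layout of both inputs is an assumption made for the example (gene, up regulation value, down regulation value for the Intogen rows; gene, EMBL number, Alu type, orientation, strand for the UTR rows), and the class name ExtractGenesSketch is hypothetical. The 0.05 cut-off is the one stated in the Methods section.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative version of the Extract_Genes filtering step (hypothetical column layout). */
class ExtractGenesSketch {
    static final double THRESHOLD = 0.05;

    /**
     * Returns one tab-separated row per gene whose value in valueColumn is below the
     * threshold and which also has an Alu-containing 3'UTR entry.
     * Assumed layouts: intogen row = gene, upValue, downValue;
     *                  UTR row     = gene, EMBL number, Alu type, orientation, strand.
     */
    static List<String> extract(List<String> intogenRawData,
                                List<String> aluContainingUtr,
                                int valueColumn) {
        // Index the Alu-containing UTR rows by gene name for quick lookup.
        // If a gene occurs more than once, the last row wins in this sketch.
        Map<String, String> utrByGene = new HashMap<>();
        for (String row : aluContainingUtr) {
            String[] fields = row.split("\t");
            utrByGene.put(fields[0], row);
        }

        List<String> out = new ArrayList<>();
        for (String row : intogenRawData) {
            String[] fields = row.split("\t");
            double value = Double.parseDouble(fields[valueColumn]);
            String utrRow = utrByGene.get(fields[0]);
            if (value < THRESHOLD && utrRow != null) {
                out.add(utrRow + "\t" + value);
            }
        }
        return out;
    }
}
```

In practice the two input lists could be obtained with Files.readAllLines, and the method would be called once with the up regulation column and once with the down regulation column to produce the two output files.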


[Figure 6 diagram. In both the Upreg and Downreg projects, the classes Intogen_AluContainingUTR_Data,
MartSeqAndLength, CensorResultData, AluRepeatsCount and xPAD_Data feed into CombinedTranscriptData.
Output fields shown: Gene ID; Transcript ID; Strand; 3'UTR Length; 3'UTR Sequence;
Repeat Data (Alu No, Alu Family, Alu Orientation, Alu From, Alu To);
xPAD Data (Long Normal, Long Tumor, Short Normal, Short Tumor, Repeats in Long Normal,
Repeats in Long Tumor, Repeats in Short Normal, Repeats in Short Tumor, Short Normal Sequence,
Short Tumor Sequence); Regulation Data (Upregulation Data, Downregulation Data).]

Figure 6: Graphical representation of the data passed programmatically from several classes into the main
class CombinedTranscriptData, which generates the final output.


Two major projects were designed, one for genes showing up regulation and one for genes showing down
regulation in colon cancer. Each project contains the following classes, which together produce the
desired output in a single file.

CensorResultData: This class does not contain any functions; it reads the file exported from Censor
and fills global arrays for several attributes. It parses the file to find genes and
transcripts and their respective attributes. The global arrays are censor_genes,
censor_transcript, censor_orientation, censor_from, censor_to and censor_repeats.

xPAD_Data: This class contains various functions that read the xPAD data shown in the diagram
from the xpad_upreg.txt file. This file was created manually in Excel from the xPAD tool and
contains details of the genes for both long and short transcripts in normal and tumor cells.
The details comprise gene id, transcript id, positions, repeats and the sequences of the short
transcripts. Each output is generated by a separate function, e.g. getLongNormal(), getLongTumor(),
getShortNormal(), getShortTumor() etc.; there are ten functions in total.

Intogen_AluContainingUTR_Data: This class generates the following outputs: Strand, Alu
Family, Up Regulation and Down Regulation Data. To accomplish this, it provides the
functions readColon(), getStrand(), getAluFamily(), getUpRegulation() and
getDownRegulation(). The readColon function reads the two files genes_site_colon.txt and
3UTR.txt, while the remaining functions take a gene from CombinedTranscriptData and search for it
in the intogenRawData and alu_containing_utr arrays to obtain its strand, family and
regulation data.

MartSeqAndLength: This class reads the FASTA file exported from BioMart and processes it
to isolate the sequence of each transcript and its length. The class contains a single
function, getMart(), which requires a gene and a transcript as arguments because the sequence in the
BioMart file is different for each transcript. The function searches the file for the given gene and
transcript, stores the corresponding sequence in a StringBuilder, and finally returns it to
CombinedTranscriptData.
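
A minimal sketch of the kind of lookup getMart() performs is given below. It is not the thesis implementation. It assumes that each FASTA header has the form ">geneId|transcriptId", which is an assumption made for the example, and it takes the file content as a list of lines so that it can be run without a file. The class name MartSeqSketch is hypothetical.

```java
import java.util.List;

/** Illustrative FASTA lookup in the spirit of getMart() (hypothetical header layout). */
class MartSeqSketch {

    /**
     * Returns the sequence of the record whose header is ">gene|transcript",
     * or an empty string if no such record exists.
     */
    static String getMart(List<String> fastaLines, String gene, String transcript) {
        StringBuilder seq = new StringBuilder();
        boolean inRecord = false;
        for (String line : fastaLines) {
            if (line.startsWith(">")) {
                if (inRecord) {
                    break;                       // the wanted record has ended
                }
                String[] ids = line.substring(1).trim().split("\\|");
                inRecord = ids.length >= 2
                        && ids[0].equals(gene)
                        && ids[1].equals(transcript);
            } else if (inRecord) {
                seq.append(line.trim());         // sequences may span several lines
            }
        }
        return seq.toString();
    }
}
```

The length of the 3'UTR is then simply the length() of the returned string.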

AluRepeatsCount: This class contains a function getAluNo() that takes a transcript from
CombinedTranscriptData and searches for it in the global array censor_transcript (global arrays can
be accessed from all classes). It then looks for Alu in the censor_repeats array; if Alu elements
are found for the transcript, the function counts the number of repeats and returns it to
CombinedTranscriptData.
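
The counting step can be illustrated with the short sketch below. It assumes that censor_transcript and censor_repeats are parallel arrays (entry i of each describes the same Censor hit) and that Alu repeat names begin with "Alu", as in AluSx or AluJo. Both points are assumptions made for the example, and the arrays are passed as parameters rather than read from global state.

```java
/** Illustrative version of getAluNo() over parallel arrays (assumed layout). */
class AluCountSketch {

    /** Counts the hits that belong to the given transcript and whose repeat name starts with "Alu". */
    static int getAluNo(String transcript, String[] censorTranscript, String[] censorRepeats) {
        int count = 0;
        for (int i = 0; i < censorTranscript.length; i++) {
            if (censorTranscript[i].equals(transcript) && censorRepeats[i].startsWith("Alu")) {
                count++;
            }
        }
        return count;
    }
}
```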

CombinedTranscriptData: This is a collector class that acquires data from the five other
classes, produces the output shown in the diagram, and exports two files, one for upreg and one for
downreg.

2.2 Methods:
Various steps were performed one after the other to obtain the data and analyze it for
results.
1. Two separate files were obtained from the Intogen microarray data for colon
cancer. One file contains genes that are frequently up regulated in colon cancer, while the
other contains genes that are down regulated in colon cancer.

2. Another file was downloaded from the Katharina Strub Laboratory website
(http://cms2.unige.ch/sciences/biologie/bicel/Strub/researchAlu.html). This file contains
information about each gene, such as EMBL number, gene name and gene id, together with Alu family,
orientation, structure etc.

3. A script was written to isolate the up regulated and down regulated genes whose regulation
value was less than 0.05. These genes were matched against the 3'UTR file, and only the genes
that contain Alu elements were written to the output.

4. The EMBL numbers taken from the 3'UTR file in the output of the Intogen script were
entered into BioMart to extract the sequences of the 3'UTRs.

5. The 3'UTRs from the previous step were analysed using Censor.

6. The transcripts isolated with the Censor tool were manually entered into the xPAD
tool one at a time to analyze the expression of genes on short, long, normal and tumor
transcripts, and also the presence of Alu elements in a particular transcript. All this
information was stored in two Excel files, one for up regulated and one for down regulated genes.

7. Different scripts were then written in the Java programming language to extract, organize,
and manipulate the data collected from the several resources. The Eclipse platform was used as the
Integrated Development Environment (IDE) for these tasks.

8. Several classes were written to collect the desired data in one place; the main class in
which the data is collected is CombinedTranscriptData. Two projects were written for this
purpose, one for the genes that appear up regulated in colon cancer and the other
for the genes that appear down regulated in colon cancer.

Chapter 3
Results

Results Chapter 3

3. Results:
We extracted data on the genes expressed in cancer from the various sources
mentioned in the Materials and Methods section. The main goal of this research was to build a combined
transcript data file that allows us to access the data programmatically. The data is a set of
information on genes that are differentially expressed in colon cancer.

3.1 CombinedTranscriptData:
The data collected from the different sources were filtered, and the
meaningful information was extracted using the combined transcript data. The purpose of constructing
the combined transcript data is to create a data warehouse that will be helpful for studying genes,
their relevant Alu elements in normal and colon cancer cells, transcript variants and other
genomic annotations. This compact information can be explored further and will be useful in
oncogenomics research. The final file, which contains all the information gathered from the
different sources, can be used in cancer analysis. The combined transcript data is written to a single
output file that can be viewed in Excel. The data in the output file is organized in an
easily accessible format that can be used for further processing by data mining.

3.2 Results of Panther Classification System:


The Panther tool was used to classify the genes that are up regulated and down
regulated in colon cancer. Lists of transcripts were extracted with respect to their molecular
functions, biological processes, cellular components, protein classes and pathways. Bar charts
were generated and further analysis was performed. These graphs show the ontology of the
different transcripts involved in colon cancer, with the genes divided into categories and
sub-categories. Each graph shows the activity and functional distribution of the genes containing
Alu elements with respect to their biological functions and processes, and the same classification
is presented below in tabular form.


3.3 Up Regulation:
3.3.1 Molecular Function:

Binding | Catalytic Activity | Enzyme Regulator Activity | Nucleic Acid Binding Transcription Factor Activity
ENST00000370823 ENST00000370823 ENST00000370823 ENST00000394456
ENST00000263379 ENST00000223095 ENST00000263379 ENST00000251269
ENST00000380590 ENST00000233336 ENST00000380590 ENST00000381222
ENST00000394456 ENST00000383789 ENST00000394456 ENST00000381223
ENST00000333703 ENST00000443029 ENST00000333703 ENST00000381218
ENST00000251269 ENST00000383790 ENST00000251269 ENST00000394807
ENST00000368729 ENST00000360132 ENST00000368729 ENST00000407965
ENST00000381222 ENST00000286186 ENST00000381222 ENST00000338483
ENST00000381223 ENST00000346817 ENST00000381223 ENST00000426621
ENST00000381218 ENST00000373115 ENST00000381218 ENST00000538320
ENST00000360132 ENST00000357066 ENST00000360132 ENST00000538999
ENST00000286186 ENST00000373019 ENST00000286186 ENST00000272645
ENST00000346817 ENST00000341754 ENST00000346817 ENST00000432329
ENST00000230882 ENST00000382038 ENST00000230882 ENST00000353267
ENST00000357703 ENST00000467448 ENST00000357703

ENST00000357066 ENST00000398174 ENST00000357066


ENST00000373019 ENST00000254908 ENST00000373019
ENST00000394807 ENST00000512783 ENST00000394807
ENST00000215957 ENST00000296161 ENST00000215957
ENST00000341754 ENST00000300933 ENST00000341754
ENST00000382038 ENST00000307720 ENST00000382038
ENST00000353836 ENST00000505337 ENST00000353836
ENST00000442846 ENST00000439211 ENST00000442846
ENST00000441969 ENST00000380097 ENST00000441969
ENST00000407965 ENST00000278302 ENST00000407965
ENST00000338483 ENST00000577886 ENST00000338483
ENST00000426621 ENST00000578237 ENST00000426621
ENST00000538320 ENST00000336708 ENST00000538320
ENST00000538999 ENST00000538999
ENST00000467448 ENST00000467448
ENST00000398174 ENST00000398174
ENST00000268661 ENST00000268661
ENST00000430095 ENST00000430095
ENST00000358495 ENST00000358495
ENST00000272645 ENST00000272645
ENST00000308086 ENST00000308086
ENST00000432329 ENST00000432329
ENST00000353267 ENST00000353267
ENST00000577886 ENST00000577886
ENST00000578237 ENST00000578237
ENST00000336708 ENST00000336708

Table 3.1.1 (A): Molecular function

Protein Binding Transcription Factor Activity | Receptor Activity | Structural Molecule Activity | Transporter Activity
ENST00000577886 ENST00000263379 ENST00000233336 ENST00000380590
ENST00000578237 ENST00000230882 ENST00000215957 ENST00000260191
ENST00000336708 ENST00000357703 ENST00000353836 ENST00000530277
ENST00000276431 ENST00000442846 ENST00000392770
ENST00000353836 ENST00000441969 ENST00000299333
ENST00000442846 ENST00000268661
ENST00000441969 ENST00000300933
ENST00000260191
ENST00000243673

ENST00000373857
ENST00000539896
Table 3.1.1 (B): Molecular function

3.3.2 Biological Process:

Apoptotic Process | Biological Regulation | Cellular Component Organization or Biogenesis | Cellular Process
ENST00000360132 ENST00000223095 ENST00000215957 ENST00000263379
ENST00000286186 ENST00000251269 ENST00000300933 ENST00000539749
ENST00000346817 ENST00000360132 ENST00000296387
ENST00000276431 ENST00000286186 ENST00000333703
ENST00000346817 ENST00000368729
ENST00000357066 ENST00000230882
ENST00000276431 ENST00000357703
ENST00000407965 ENST00000357066
ENST00000338483 ENST00000215957
ENST00000426621 ENST00000341754
ENST00000538320 ENST00000382038
ENST00000538999 ENST00000276431
ENST00000467448 ENST00000258774

ENST00000432329 ENST00000436444
ENST00000353267 ENST00000353836
ENST00000577886 ENST00000442846
ENST00000578237 ENST00000441969
ENST00000336708 ENST00000467448
ENST00000398174
ENST00000260191
ENST00000296161
ENST00000300933
ENST00000243673
ENST00000432329
ENST00000353267
ENST00000530277
ENST00000392770
ENST00000299333
ENST00000373857
ENST00000539896
Table 3.1.2 (A): Biological process

Developmental Process | Immune System Process | Localization | Metabolic Process
ENST00000263379 ENST00000263379 ENST00000380590 ENST00000370823
ENST00000333703 ENST00000368729 ENST00000467448 ENST00000223095
ENST00000360132 ENST00000230882 ENST00000260191 ENST00000233336
ENST00000286186 ENST00000357703 ENST00000317620 ENST00000380590
ENST00000346817 ENST00000276431 ENST00000317668 ENST00000394456
ENST00000230882 ENST00000353836 ENST00000307720 ENST00000251269
ENST00000357703 ENST00000442846 ENST00000250092 ENST00000368729
ENST00000215957 ENST00000441969 ENST00000530277 ENST00000383789
ENST00000276431 ENST00000243673 ENST00000392770 ENST00000443029
ENST00000353836 ENST00000432329 ENST00000299333 ENST00000383790
ENST00000442846 ENST00000353267 ENST00000381222
ENST00000441969 ENST00000373857 ENST00000381223
ENST00000260191 ENST00000539896 ENST00000381218
ENST00000296161 ENST00000360132
ENST00000300933 ENST00000286186
ENST00000432329 ENST00000346817
ENST00000353267 ENST00000373115
ENST00000357066
ENST00000373019
ENST00000394807

ENST00000341754
ENST00000382038
ENST00000258774
ENST00000436444
ENST00000407965
ENST00000338483
ENST00000426621
ENST00000538320
ENST00000538999
ENST00000467448
ENST00000398174
ENST00000254908
ENST00000512783
ENST00000268661
ENST00000296161
ENST00000300933
ENST00000430095
ENST00000358495
ENST00000272645
ENST00000307720
ENST00000308086
ENST00000432329
ENST00000353267
ENST00000250092
ENST00000505337
ENST00000439211
ENST00000380097
ENST00000278302
ENST00000577886
ENST00000578237
ENST00000336708
Table 3.1.2 (B): Biological process

Multicellular Organismal Process | Response to Stimulus
ENST00000333703 ENST00000263379
ENST00000260191 ENST00000230882
ENST00000300933 ENST00000357703
ENST00000243673 ENST00000276431
ENST00000432329 ENST00000353836
ENST00000353267 ENST00000442846
ENST00000441969

ENST00000243673
ENST00000432329
ENST00000353267
ENST00000373857
ENST00000539896
Table 3.1.2 (C): Biological process

3.3.3 Cellular Component:

Cell Part | Macromolecular Complex | Membrane | Organelle
ENST00000233336 ENST00000233336 ENST00000380590 ENST00000233336
ENST00000380590 ENST00000539749 ENST00000380590
ENST00000539749 ENST00000296387 ENST00000215957
ENST00000296387 ENST00000300933
ENST00000215957 ENST00000577886
ENST00000300933 ENST00000578237
ENST00000336708
Table 3.1.3: Cellular component


3.3.4 Protein Class:

Calcium-Binding Protein | Cell Adhesion Molecule | Cell Junction Protein | Cytoskeletal Protein
ENST00000380590 ENST00000333703 ENST00000539749 ENST00000233336
ENST00000368729 ENST00000353836 ENST00000296387 ENST00000215957
ENST00000442846 ENST00000300933
ENST00000441969
Table 3.1.4 (A): Protein class

Defense/Immunity Protein | Enzyme Modulator | Hydrolase | Isomerase
ENST00000263379 ENST00000370823 ENST00000360132 ENST00000307720
ENST00000230882 ENST00000223095 ENST00000286186
ENST00000357703 ENST00000360132 ENST00000346817
ENST00000353836 ENST00000286186 ENST00000341754
ENST00000442846 ENST00000346817 ENST00000382038
ENST00000441969 ENST00000357066 ENST00000398174
ENST00000467448
Table 3.1.4 (B): Protein class

Ligase | Lyase | Membrane Traffic Protein | Nucleic Acid Binding
ENST00000233336 ENST00000254908 ENST00000317620 ENST00000394456
ENST00000296161 ENST00000512783 ENST00000317668 ENST00000373019
ENST00000380097 ENST00000250092 ENST00000341754

ENST00000278302 ENST00000382038
ENST00000577886 ENST00000407965
ENST00000578237 ENST00000338483
ENST00000336708 ENST00000426621
ENST00000538320
ENST00000538999
ENST00000467448
ENST00000268661
ENST00000430095
ENST00000358495
ENST00000308086
ENST00000432329
ENST00000353267
Table 3.1.4 (C): Protein class

Oxidoreductase | Protease | Receptor | Signaling Molecule
ENST00000505337 ENST00000360132 ENST00000263379 ENST00000263379
ENST00000439211 ENST00000286186 ENST00000230882 ENST00000368729
ENST00000346817 ENST00000357703 ENST00000230882
ENST00000398174 ENST00000276431 ENST00000357703
ENST00000353836 ENST00000353836
ENST00000442846 ENST00000442846
ENST00000441969 ENST00000441969
ENST00000260191 ENST00000398174
ENST00000243673
ENST00000373857
ENST00000539896
Table 3.1.4 (D): Protein class

Structural Protein | Transcription Factor | Transfer/Carrier Protein | Transferase
ENST00000353836 ENST00000394456 ENST00000380590 ENST00000383789
ENST00000442846 ENST00000251269 ENST00000443029
ENST00000441969 ENST00000381222 ENST00000383790
ENST00000381223 ENST00000373115
ENST00000381218
ENST00000394807
ENST00000407965
ENST00000338483
ENST00000426621
ENST00000538320

ENST00000538999
ENST00000272645
ENST00000432329
ENST00000353267
ENST00000577886
ENST00000578237
ENST00000336708
Table 3.1.4 (E): Protein class

Transporter
ENST00000380590
ENST00000260191
ENST00000530277
ENST00000392770
ENST00000299333
Table 3.1.4 (F): Protein class

3.3.5 Pathway:

5HT3 Type Receptor Mediated Signaling Pathway | Apoptosis Signaling Pathway | Blood Coagulation | Cadherin Signaling Pathway
ENST00000260191 ENST00000360132 ENST00000223095 ENST00000333703
ENST00000286186

ENST00000346817
ENST00000276431
ENST00000432329
ENST00000353267
Table 3.1.5 (A): Pathway

De-Novo Pyrimidine Ribonucleotides Biosynthesis | Enkephalin Release | FAS Signaling Pathway | Folate Biosynthesis
ENST00000232607 ENST00000432329 ENST00000360132 ENST00000505337
ENST00000353267 ENST00000286186 ENST00000439211
ENST00000346817
Table 3.1.5 (B): Pathway

Formyltetrahydroformate Biosynthesis | General Transcription Regulation | Gonadotropin Releasing Hormone Receptor Pathway | Heterotrimeric G-protein Signaling Pathway-Gi Alpha and Gs Alpha Mediated Pathway
ENST00000505337 ENST00000394456 ENST00000432329 ENST00000432329
ENST00000439211 ENST00000272645 ENST00000353267 ENST00000353267

Table 3.1.5 (C): Pathway

Inflammation Mediated by Chemokine and Cytokine Signaling Pathway | Plasminogen Activating Cascade | Transcription Regulation by bZIP Transcription Factor | Wnt Signaling Pathway
ENST00000373857 ENST00000223095 ENST00000394456 ENST00000333703
ENST00000539896 ENST00000272645
ENST00000432329
ENST00000353267
Table 3.1.5 (D): Pathway

p38 MAPK Pathway | p53 Pathway
ENST00000432329 ENST00000223095
ENST00000353267 ENST00000276431
Table 3.1.5 (E): Pathway


3.4 Down Regulation:


3.4.1 Molecular Function:

Binding | Catalytic Activity | Enzyme Regulator Activity | Nucleic Acid Binding Transcription Factor Activity
ENST00000370823 ENST00000344366 ENST00000357066 ENST00000251269
ENST00000380590 ENST00000178638 ENST00000441801 ENST00000381222
ENST00000251269 ENST00000370823 ENST00000375766 ENST00000381223
ENST00000368729 ENST00000233336 ENST00000375771 ENST00000381218
ENST00000381222 ENST00000321764 ENST00000332305 ENST00000272645
ENST00000381223 ENST00000383789 ENST00000441801
ENST00000381218 ENST00000443029 ENST00000375766
ENST00000357066 ENST00000383790 ENST00000375771
ENST00000398174 ENST00000357066 ENST00000332305
ENST00000268661 ENST00000398174
ENST00000272645 ENST00000296161
ENST00000308086 ENST00000300933
ENST00000307046 ENST00000221307
ENST00000337514 ENST00000305046
ENST00000441801 ENST00000505337
ENST00000375766 ENST00000439211
ENST00000375771 ENST00000380097

ENST00000332305 ENST00000278302
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
ENST00000417689
ENST00000317091
Table 3.2.1 (A): Molecular function

Protein Binding Transcription Factor Activity | Receptor Activity | Structural Molecule Activity | Transporter Activity
ENST00000441801 ENST00000373857 ENST00000233336 ENST00000380590
ENST00000375766 ENST00000539896 ENST00000268661 ENST00000235345
ENST00000375771 ENST00000300933 ENST00000347644
ENST00000332305 ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
Table 3.2.1 (B): Molecular function

3.4.2 Biological Process:


Apoptotic Process | Biological Regulation | Cellular Component Organization or Biogenesis | Cellular Process
ENST00000305046 ENST00000251269 ENST00000300933 ENST00000539749
ENST00000357066 ENST00000441801 ENST00000296387
ENST00000441801 ENST00000375766 ENST00000368729
ENST00000375766 ENST00000375771 ENST00000357066
ENST00000375771 ENST00000332305 ENST00000398174
ENST00000332305 ENST00000296161
ENST00000300933
ENST00000307046
ENST00000337514
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
ENST00000373857
ENST00000539896
Table 3.2.2 (A): Biological process

Developmental Process | Immune System Process | Localization | Metabolic Process
ENST00000296161 ENST00000368729 ENST00000380590 ENST00000344366
ENST00000300933 ENST00000373857 ENST00000235345 ENST00000178638
ENST00000305046 ENST00000539896 ENST00000347644 ENST00000370823
ENST00000441801 ENST00000272462 ENST00000233336
ENST00000375766 ENST00000317620 ENST00000380590
ENST00000375771 ENST00000317668 ENST00000321764
ENST00000332305 ENST00000250092 ENST00000235345
ENST00000307046 ENST00000251269
ENST00000337514 ENST00000368729
ENST00000383789
ENST00000443029
ENST00000383790
ENST00000381222
ENST00000381223
ENST00000381218
ENST00000347644
ENST00000357066
ENST00000398174

ENST00000268661
ENST00000296161
ENST00000300933
ENST00000272645
ENST00000308086
ENST00000221307
ENST00000305046
ENST00000250092
ENST00000505337
ENST00000439211
ENST00000307046
ENST00000337514
ENST00000380097
ENST00000278302
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
ENST00000417689
ENST00000317091
Table 3.2.2 (B): Biological process

Multicellular Organismal Process | Response to Stimulus
ENST00000300933 ENST00000373857
ENST00000539896
Table 3.2.2 (C): Biological process


3.4.3 Cellular Component:

Cell Junction | Cell Part | Macromolecular Complex | Membrane
ENST00000441801 ENST00000233336 ENST00000233336 ENST00000380590
ENST00000375766 ENST00000380590 ENST00000539749
ENST00000375771 ENST00000539749 ENST00000296387
ENST00000332305 ENST00000296387 ENST00000441801
ENST00000300933 ENST00000375766
ENST00000441801 ENST00000375771
ENST00000375766 ENST00000332305
ENST00000375771
ENST00000332305
Table 3.2.3 (A): Cellular component

Organelle
ENST00000233336
ENST00000380590
ENST00000300933
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
Table 3.2.3 (B): Cellular component


3.4.4 Protein Class:

Calcium-Binding Protein | Cell Junction Protein | Cytoskeletal Protein | Enzyme Modulator
ENST00000380590 ENST00000539749 ENST00000233336 ENST00000370823
ENST00000368729 ENST00000296387 ENST00000300933 ENST00000357066
ENST00000441801 ENST00000441801 ENST00000441801
ENST00000375766 ENST00000375766 ENST00000375766
ENST00000375771 ENST00000375771 ENST00000375771
ENST00000332305 ENST00000332305 ENST00000332305
Table 3.2.4 (A): Protein class

Hydrolase | Ligase | Lyase | Membrane Traffic Protein
ENST00000398174 ENST00000233336 ENST00000344366 ENST00000272462
ENST00000417689 ENST00000296161 ENST00000178638 ENST00000317620
ENST00000317091 ENST00000380097 ENST00000321764 ENST00000317668
ENST00000278302 ENST00000250092
Table 3.2.4 (B): Protein class

Nucleic Acid Binding | Oxidoreductase | Protease | Receptor
ENST00000268661 ENST00000221307 ENST00000398174 ENST00000373857
ENST00000308086 ENST00000305046 ENST00000539896

ENST00000505337
ENST00000439211
Table 3.2.4 (C): Protein class

Signaling Molecule | Transcription Factor | Transfer/Carrier Protein | Transferase
ENST00000368729 ENST00000251269 ENST00000380590 ENST00000383789
ENST00000398174 ENST00000381222 ENST00000443029
ENST00000307046 ENST00000381223 ENST00000383790
ENST00000337514 ENST00000381218
ENST00000272645
ENST00000441801
ENST00000375766
ENST00000375771
ENST00000332305
Table 3.2.4 (D): Protein class

Transporter
ENST00000380590
ENST00000235345
ENST00000347644
Table 3.2.4 (E): Protein class

3.4.5 Pathway:


Bupropion Degradation | FGF Signaling Pathway | Folate Biosynthesis | Formyltetrahydroformate Biosynthesis
ENST00000324071 ENST00000369056 ENST00000505337 ENST00000505337
ENST00000357555 ENST00000439211 ENST00000439211
ENST00000360144
ENST00000369061
ENST00000358487
Table 3.2.5 (A): Pathway

General Transcription Regulation | Gonadotropin Releasing Hormone Receptor Pathway | Inflammation Mediated by Chemokine and Cytokine Signaling Pathway | Insulin/IGF Pathway-Mitogen Activated Protein Kinase Kinase/MAP Kinase Cascade
ENST00000272645 ENST00000307046 ENST00000373857 ENST00000307046
ENST00000337514 ENST00000539896 ENST00000337514
Table 3.2.5 (B): Pathway

Insulin/IGF Pathway-Protein Kinase B Signaling Cascade | Transcription Regulation by bZIP Transcription Factor
ENST00000307046 ENST00000272645
ENST00000337514
Table 3.2.5 (C): Pathway

3.5 Analysis of TargetScan:


Multifactorial manual analysis of miRNA sites shows two mechanisms.

1. miR dependent

2. miR independent

The miR independent mechanism consists of Alu Exclusion Associated Polyadenylation (AEP) and Alu
Directed Alternative Polyadenylation (ADP).

Transcript Name | UTR Length | Alu Position | Alu Details | S/L(N) | S/L(T) | miR

Up Regulation
TNFRSF10B ENST00000276431 | 2538 | 423 | Alu in all transcripts | 0.2 | 0.121 | miR is included in long transcript
HUS1 ENST00000258774 | 2068 | 983 | Alu in long only | 0.578 | 2.066 | miR is included in long transcript

Down Regulation
IGF1 ENST00000337514 | 6633 | 1222 | Alu in long only | 0.095 | 0.059 | miR is included in long transcript
CA12 ENST00000178638 | 4907 | 1489 | Alu in all transcripts | 0.014 | 0.022 | miR is included in long transcript

Table 3.5 (A): Transcripts involved in miR dependent.

Transcript Name | UTR Length | Alu Position | Alu Details | S/L(N) | S/L(T) | miR

Up Regulation
PPID ENST00000307720 | 602 | 205 | Alu in long only | 0.024 | 0.022 | miR is included in all transcript
KCTD10 ENST00000228495 | 2939 | 1870 | Alu in long only | 0.214 | 0.266 | miR is included in all transcript
PHLDA1 ENST00000266671 | 4672 | 2357 | Alu in long only | 0.373 | 1.126 | miR is included in all transcript

Down Regulation
KCTD10 ENST00000228495 | 2939 | 1870 | Alu in long only | 0.214 | 0.266 | miR is included in all transcript

Table 3.5 (B): Transcripts involved in AEP.

Transcript Name | UTR Length | Alu Position | Alu Details | S/L(N) | S/L(T) | miR

Up Regulation
POLR2D ENST00000272645 | 1841 | 479 | Alu in all transcripts | 0.166 | 0.166 | miR is included in all transcript

Down Regulation
BVES ENST00000314641 | 4267 | 580 | Alu in all transcripts | 0.137 | 0.125 | miR is included in all transcript
SLC35D1 ENST00000235345 | 5008 | 697 | Alu in all transcripts | 0.082 | 0.047 | miR is included in all transcript

Table 3.5 (C): Transcripts involved in ADP.

Chapter 4
Discussion

Discussion Chapter 4

4. Discussion:
The untranslated region is important for regulating gene expression at the post-transcriptional
level, therefore Alu RNA embedded in this region is important for gene expression. The majority of
RNA editing occurs within Alu elements, and editing of inverted Alu elements could affect gene
expression. mRNA isoforms are produced by alternative polyadenylation; short isoforms lack
miRNA sites, which can lead to several types of cancer.
Large data sets were created successfully to understand how Alu repeats contribute to gene
expression. A large amount of data is available in different online sources; it was collected and
processed to filter out the useful information. The combined transcript data file was constructed to
hold useful data on genes that are up regulated and down regulated in colon cancer. Genetic
annotations of colon genes are organized in the combined transcript data file, which makes the data
programmatically accessible. This will allow machine learning to be applied to extract meaningful
trends from the data, which can then be used for further processing by data mining.
Different graphs were generated to analyze the behavior of the transcripts listed in the
combined transcript data. These graphs were obtained from the Panther tool, and the up
regulated and down regulated transcripts are listed in tables according to their roles in molecular
functions, biological processes, cellular components, protein classes and pathways. Analysis of
the graphs shows how the genes are distributed across different biological functions and processes.
The graphs explain the functional distribution of the transcripts, which can help in understanding the
activity of each transcript in the expression of genes. The combined transcript data needs to be
analyzed further, and machine learning will be used for this purpose.
The miRNA sites were identified with the help of TargetScan and analyzed further.
Transcripts are classified into two broad categories on the basis of manual analysis of
miRNA target sites. The miR-dependent class includes only those transcripts in which
miRNA sites are present in the long 3'UTR isoform but absent from the short one. The
miR-independent class, on the other hand, is further divided into AEP (Alu Exclusion Associated
Polyadenylation) and ADP (Alu Directed Alternative Polyadenylation). Transcripts in which Alu is
present only in the long 3'UTR and absent from the short isoform are categorized as AEP. ADP
consists of all those transcripts that contain miRNA sites and Alu in both the short and long
3'UTR isoforms.
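
The classification rules above can be expressed as a small decision function. The following
Python sketch is illustrative only: the classification in this work was done manually, and the
record layout, the field names, the function name classify, the order in which the rules are
tested and the "unclassified" fallback are all assumptions introduced here, not part of the
combined transcript data file.

```python
from dataclasses import dataclass


@dataclass
class Transcript:
    """Hypothetical record; the field names are illustrative assumptions."""
    name: str
    mirna_in_long: bool   # miRNA target sites found in the long 3'UTR isoform
    mirna_in_short: bool  # miRNA target sites found in the short 3'UTR isoform
    alu_in_long: bool     # Alu element found in the long 3'UTR isoform
    alu_in_short: bool    # Alu element found in the short 3'UTR isoform


def classify(t: Transcript) -> str:
    """Assign a transcript to a class following the rules stated in the text.

    The rule order (miR-dependent first) is an assumption; the text does not
    say how a transcript matching more than one rule should be handled.
    """
    # miR-dependent: miRNA sites in the long isoform but not in the short one.
    if t.mirna_in_long and not t.mirna_in_short:
        return "miR-dependent"
    # AEP: Alu present only in the long 3'UTR.
    if t.alu_in_long and not t.alu_in_short:
        return "AEP"
    # ADP: miRNA sites and Alu present in both isoforms.
    if t.mirna_in_long and t.mirna_in_short and t.alu_in_long and t.alu_in_short:
        return "ADP"
    # Anything else is left for manual inspection (assumed fallback).
    return "unclassified"


if __name__ == "__main__":
    # Made-up example records, not taken from the thesis data.
    examples = [
        Transcript("example_1", True, False, True, False),
        Transcript("example_2", False, False, True, False),
        Transcript("example_3", True, True, True, True),
    ]
    for t in examples:
        print(t.name, classify(t))
```

Encoding the rules this way makes the boundary cases explicit, for example a transcript that
has lost both its miRNA sites and its Alu in the short isoform, which the manual scheme would
need to resolve by hand.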

5. References:

5.1 Paper and Books References:


Leggett B, Whitehall V: Role of the serrated pathway in colorectal cancer pathogenesis.
Gastroenterology 138 (6): 2088-100, 2010

Howe JR, Mitros FA, Summers RW: The risk of gastrointestinal carcinoma in familial
juvenile polyposis. Ann Surg Oncol 5 (8): 751-6, 1998

Shinya H, Wolff WI: Morphology, anatomic distribution and cancer potential of colonic
polyps. Ann Surg 190 (6): 679-83, 1979

Abdul Qaiyoume Amini, Khursheed Ahmed Samo, Amjad Siraj Memon "Colorectal
cancer in younger population: our experience" 2013

Magnus Løberg, M.D., Mette Kalager, M.D., Ph.D., Øyvind Holme, M.D., Geir Hoff,
M.D., Ph.D., Hans-Olov Adami, M.D., Ph.D., and Michael Bretthauer, M.D., Ph.D. "Long-
Term Colorectal-Cancer Mortality after Adenoma Removal" 2014.

Gary H. Perdew, Jack P. Vanden Heuvel, Jeffrey M. Peters. (2006). Regulation of Gene
Expression. Totowa, New Jersey: Humana Press.

Hoopes, L. (2008) Introduction to the gene expression and regulation topic room. Nature
Education 1(1):160

Lucy W. Barrett, Sue Fletcher, Steve D. Wilton. (2012). Regulation of eukaryotic gene
expression by the untranslated gene regions and other non-coding elements.

Cydney Brooke Nielsen "Mammalian Gene Regulation through the 3' UTR" 2001.

Nick J. Proudfoot, Andre Furger, and Michael J. Dye Integrating mRNA Processing with
Transcription 2002

Yuefeng Lin, Zhihua Li, Fatih Ozsolak, Sang Woo Kim, Gustavo Arango-Argoty, Teresa
T. Liu, Scott A. Tenenbaum, Timothy Bailey, A. Paula Monaghan, Patrice M. Milos and
Bino John "An in-depth map of polyadenylation sites in cancer" 2012

David P. Bartel Review MicroRNAs: Genomics, Biogenesis, Mechanism, and Function
2004

Victor Ambros "The functions of animal microRNAs" 2004

Richard W. Carthew and Erik J. Sontheimer "Origins and Mechanisms of miRNAs and
siRNAs" 2009

Prescott Deininger "Alu elements: know the SINEs" (2011).

J. Hasler, T. Samuelsson, K. Strub "Useful junk: Alu RNAs in the human transcriptome"
(2007).

Chen LL, DeCerbo JN, Carmichael GG. "Alu element-mediated gene silencing." (2008).

Ling-Ling Chen and Gordon G. Carmichael "Gene regulation by SINES and inosines"
2008.

Prescott L. Deininger and Mark A. Batzer "Alu Repeats and Human Disease" 1996.

Bass BL. RNA editing by adenosine deaminases that act on RNA. Annu Rev Biochem
2002; 71:817-46.

Gonzalez-Perez A, Perez-Llamas C, Deu-Pons J, Tamborero D, Schroeder MP, Jene-Sanz
A, Santos A & Lopez-Bigas N. IntOGen-mutations identifies cancer drivers across
tumor types. Nature Methods doi:10.1038/nmeth.2642 (2013)

Gundem G, Perez-Llamas C, Jene-Sanz A, Kedzierska A, Islam A, Deu-Pons J, Furney S
and Lopez-Bigas N. IntOGen: Integration and data-mining of multidimensional
oncogenomic data. Nature Methods, 7, 92-93 (2010)
Haider S, Ballester B, Smedley D, Zhang J, Rice P, Kasprzyk A. BioMart Central Portal--
unified access to biological data. Nucleic Acids Res. 2009 Jul 1;37(Web Server
issue):W23-7.
Smedley D, Haider S, Ballester B, Holland R, London D, Thorisson G, Kasprzyk A.
BioMart--biological queries made easy. BMC Genomics. 2009 Jan 14;10:22.

Landry, J.R., Medstrand, P. and Mager, D.L. (2001) Repetitive elements in the 5'
untranslated region of a human zinc-finger gene modulate transcription and translation
efficiency. Genomics, 76, 110-116.
Julien Häsler, Tore Samuelsson and Katharina Strub "Alu RNAs embedded in 5' and 3'
UTRs of human mRNAs" 2007
Kohany O, Gentles AJ, Hankus L, Jurka J. "Annotation, submission and screening of
repetitive elements in Repbase: RepbaseSubmitter and Censor." 2006
Thomas PD, Kejariwal A, Campbell MJ, Mi H, Diemer K, Guo N, Ladunga I, Ulitsky-
Lazareva B, Muruganujan A, Rabkin S, Vandergriff JA, Doremieux O. "PANTHER: a
browsable database of gene products organized by biological function, using curated
protein family and subfamily classification." (2003).

5.2 Electronic References:


American Cancer Society. (2013). Cancer Facts & Figures 2013. Atlanta: American
Cancer Society.
American Cancer Society. (2014). Colorectal Cancer Facts & Figures. Atlanta: American
Cancer Society.
American Cancer Society. (2008). Global Cancer Facts & Figures 2nd Edition. Atlanta,
Georgia: American Cancer Society.
Jessica Evert, MD, Benjamin McDonald. (2010, June 30). Colorectal Cancer. Retrieved 12
20, 2014, from Mental Help:
http://www.mentalhelp.net/poc/view_doc.php?type=doc&id=5206
Mayo Clinic Staff. (2013, August 22). Colon Cancer. Retrieved 12 20, 2014, from Mayo
Clinic: http://www.mayoclinic.org/diseases-conditions/colon-
cancer/basics/definition/con-20031877
National Cancer Institute. (2014, December 18). Genetics of Colorectal Cancer. Retrieved
12 20, 2014, from National Cancer Institute:
http://www.cancer.gov/cancertopics/pdq/genetics/colorectal/HealthProfessional/page1

NCBI. (2011, April 06). The GenBank Submissions Handbook [Internet]. Retrieved 12
20, 2014, from National Center for Biotechnology Information (US):
http://www.ncbi.nlm.nih.gov/books/NBK53702/
Oracle. (n.d.). Class PrintWriter. Retrieved 12 20, 2014, from Oracle.com:
http://docs.oracle.com/javase/7/docs/api/java/io/PrintWriter.html
W3Schools. (n.d.). JavaScript Errors - Throw and Try to Catch. Retrieved 12 20, 2014,
from W3Schools.com: http://www.w3schools.com/js/js_errors.asp
Yang Zhang's Research Group. (n.d.). What is FASTA format? Retrieved 12 20, 2014,
from Zhang Lab University of Michigan: http://zhanglab.ccmb.med.umich.edu/FASTA/
TargetScan. (n.d.). TargetScanHuman. Retrieved 01 30, 2015, from targetscan.org:
http://www.targetscan.org/
