
BTEC 4300

Assignment 03
Due Thursday February 13, 2014 at 2:30 pm
Gene Prediction
PART 1: Prokaryote
1.

On the NCBI site, go to the complete genome page of E. coli K12. Download the first 3000 letters
from the complete DNA sequence. There are two genes hidden in this sequence. See if you can
find them.

(a) How many start codons are there in the sequence above? Here, for the sake of simplicity,
assume that ATG is the only start codon.
There are 63 ATG start codons in the sequence.
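
The count above can be reproduced with a short script. This is a minimal sketch, assuming the 3000-letter sequence has been loaded into a plain string; the function name and the toy sequence are illustrative and are not part of the assignment data. It counts every ATG on the given strand in all three reading frames, so the total depends on whether the reverse complement is also scanned.

```python
def count_start_codons(seq: str, codon: str = "ATG") -> int:
    """Count every occurrence of `codon` in `seq`, in any reading frame.

    Overlapping matches are counted, and only the strand passed in is scanned.
    """
    seq = seq.upper()
    width = len(codon)
    return sum(1 for i in range(len(seq) - width + 1) if seq[i:i + width] == codon)


# Toy sequence for illustration only (not the E. coli K12 data).
toy = "ATGCATGATGTT"
print(count_start_codons(toy))  # ATG occurs at offsets 0, 4 and 7
```

To apply it to the real data, read the downloaded FASTA file, drop the header line, join the remaining lines, and pass the result to `count_start_codons`.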
(b) Use the ORF finder tool of NCBI (http://www.ncbi.nlm.nih.gov/gorf/gorf.html). Record the
output below and list the candidate ORFs that you predict could be potential coding regions.


The first ORF, in frame +1 at the top of the list, is the longest open reading frame.
When run in BLAST, it will provide the most information and the most similarity to other
organisms. The rest of the ORFs are much shorter. To be considered, an ORF must
be long enough, roughly 300 bp or more, and should have an amino acid composition and
codon usage typical of the given organism. The shorter ORFs can
be used, but they will not return much information when run in BLAST.
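
The length criterion described above can be sketched as a simple six-frame scan. This is an illustration under stated assumptions, not the algorithm used by the NCBI ORF finder: it takes ATG as the only start codon, uses the three standard stop codons, and keeps only ORFs of at least `min_len` bp. The function names are hypothetical.

```python
from typing import List, Tuple

STOPS = {"TAA", "TAG", "TGA"}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def find_orfs(seq: str, min_len: int = 300) -> List[Tuple[int, int, int, int]]:
    """Return (frame, start, end, length) for each ATG...stop ORF >= min_len bp.

    Frames +1..+3 are on the given strand, -1..-3 on its reverse complement.
    Coordinates are 1-based, inclusive, and relative to the strand scanned.
    """
    orfs = []
    for sign, strand in ((1, seq.upper()), (-1, reverse_complement(seq))):
        for offset in range(3):
            start = None
            for i in range(offset, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first in-frame ATG opens the ORF
                elif start is not None and codon in STOPS:
                    length = i + 3 - start  # stop codon included
                    if length >= min_len:
                        orfs.append((sign * (offset + 1), start + 1, i + 3, length))
                    start = None
    # Longest ORF first, mirroring how the answer above ranks candidates.
    return sorted(orfs, key=lambda orf: orf[3], reverse=True)


# Toy example (hypothetical sequence): one 21 bp ORF in frame +1.
print(find_orfs("ATG" + "GCT" * 5 + "TAA", min_len=9))
```

An ORF that runs off the end of the sequence without reaching a stop codon is not reported by this sketch, so a truncated gene at the 3' end of the 3000 bp window would need separate handling.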

(c) Go to the GeneMark /FgeneSB/ Glimmer software and submit your sequence and find the genes.
Compare this prediction with ORF-finder.

The GeneMark/FgeneSB/Glimmer predictions agree with those of the ORF finder. Both place
the first gene at 337-2799 and the second at 2801-2999. GeneMark and FgeneSB each
predict 2 genes.

(d) Predict the gene/s based on the results. What are the most probable proteins that are encoded
by the genes you predicted?
The most probable proteins encoded are homoserine dehydrogenase and bifunctional
aspartokinase.


PART 2: Eukaryote
The DNA originates from Caenorhabditis elegans. This is an invertebrate, more precisely a
nematode, or roundworm, which is a favored experimental organism because it only has around 1000
cells (also visible in the adult nematode) and 300 neurons. All of the cells and all of the neurons have
been mapped, as well as the complete cellular development from zygote to adult nematode. The
entire genome has been sequenced. If you want to read more about C. elegans you can visit the C.
elegans WWW server.
The DNA you will use for this exercise is available in file : C_elegansDNA.txt

Gene finding using ab initio methods
We will try to find genes in the piece of DNA using different methods.
To facilitate a comparison between the different results, and the elucidation of the correct gene
structure, store all the nucleotide positions for exon start and end sites. Use a table for this
purpose.

Remember to save all your results (you will soon need them).
From the results find the nucleotide positions of the starts and ends of the exons and write these into
your table.
Use the following methods and answer the questions:

GenScan: http://genes.mit.edu/GENSCAN.html

Sequence /tmp/02_13_14-15:09:21.fasta : 13990 bp : 35.93% C+G : Isochore 1 ( 0 - 43 C+G%)

Parameter matrix: HumanIso.smat

Predicted genes/exons:

Gn.Ex Type S .Begin ...End .Len Fr Ph I/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------

1.01  Intr +   5712   5870  159  2  0   62   98   140 0.978  11.66
1.02  Intr +   6078   6893  816  2  0    8   87   975 0.937  80.09
1.03  Intr +   6937   7161  225  0  0  -10  115   333 0.995  23.46
1.04  Intr +   7619   7782  164  1  2   63   97   230 0.971  19.25
1.05  Intr +   7829   7940  112  2  1   37   85   131 0.999   7.16
1.06  Intr +   7988   8182  195  1  0   13   77   199 0.699   9.99
1.07  Intr +   8252   8442  191  1  2   14   42   194 0.938   4.86
1.08  Intr +   8526   8688  163  0  1   67   55   223 0.999  15.96
1.09  Intr +   8739   9071  333  2  0   52  115   299 0.999  23.84
1.10  Intr +   9603   9734  132  2  0   51   81    58 0.643   1.32
1.11  Intr +   9799   9971  173  0  2   73   87    63 0.771   2.52
1.12  Intr +  10742  10893  152  2  2   -7   34   141 0.123  -1.91
1.13  Intr +  11906  12012  107  0  2   14  113    81 0.147   2.21
1.14  Intr +  12234  12335  102  2  0   74   75    81 0.964   4.85
1.15  Intr +  12381  12590  210  2  0   39   92   363 0.999  29.99
1.16  Term +  12843  12929   87  2  0   79   41   136 0.987   4.68
1.17  PlyA +  12952  12957    6                               1.05

2.00  Prom +  12973  13012   40                             -11.54
2.01  Init +  13065  13315  251  2  2   35   16   296 0.866  14.08
2.02  Term +  13366  13723  358  1  1  -23   48   252 0.384   3.10
2.03  PlyA +  13892  13897    6                              -0.45

Suboptimal exons with probability > 1.000

Exnum Type S .Begin ...End .Len Fr Ph B/Ac Do/T CodRg P.... Tscr..
----- ---- - ------ ------ ---- -- -- ---- ---- ----- ----- ------

NO EXONS FOUND AT GIVEN PROBABILITY CUTOFF

Predicted peptide sequence(s):

>/tmp/02_13_14-15:09:21.fasta|GENSCAN_predicted_peptide_1|1106_aa
NKADRMGPGGSRRPRNSQHATASTPSASSCKEQQKDVEHEFDIIAYKTTFWRTFFFYALS
FGTCGIFRLFLHWFPKRLIQFRGKRCSVENADLVLVVDNHNRYDICNVYYRNKSGTDHTV
VANTDGNLAELDELRWFKYRKLQYTWIDGEWSTPSRAYSHVTPENLASSAPTTGLKADDV
ALRRTYFGPNVMPVKLSPFYELVYKEVLSPFYIFQAISVTVWYIDDYVWYAALIIVMSLY
SVIMTLRQTRSQQRRLQSMVVEHDEVQVIRENGRVLTLDSSEIVPGDVLVIPPQGCMMYC
DAVLLNGTCIVNESMLTGESIPITKSAISDDGHEKIFSIDKHGKNIIFNGTKVLQTKYYK
GQNVKALVIRTAYSTTKGQLIRAIMYPKPADFKFFRELMKVCFDKTGTLTEDGLDFYALR
VVNDAKIGDNIVQIAANDSCQNVVRAIATCHTLSKINNELHGDPLDVIMFEQTGYSLEED
DSESHESIESIQPILIRPPKDSSLPDCQIVKQFTFSSGLQRQSVIVTEEDSMKAYCKGSP
EMIMSLCRPETVPENFHDIVEEYSQHGYRLIAVAEKELVVGSEVQKTPRQSIECDLTLIG
LVALENRLKPVTTEVIQKLNEANIRSVMVTGDNLLTALSVARECGIIVPNKSAYLIEHEN
GVVDRRGRTVLTIREKEDHHTERQPKIVDLTKMTNKDCQFAISGSTFSVVTHEYPDLLDQ
LVLVCNVFARMAPEQKQLLVEHLQDVGQTVAMCGDGANDCAALKAAHAGISLSEAEASIA
APFTSKGTAIFYVSLFHYIVLYFVFAAGPPYRASIASNKAFLISMIGVTVTCIAIVVFYV
TPIQYFLGCLQMPQEFRFIILAVATVTAVISIIYDRCVDWISERLREKSLKYAVSFLPTP
KFERLPIYNRKAFNFHSSFYSFSIMRAIVFDEKRYFVVDSSSEGLSTMKVETCVYSGYKI
HPGHGKRLVRTDGKVQIFLSGKALKGAKLRRNPRDIRWTVLYRIKNKKGTHGQEQVTRKK
TKKSVQVVNRAVAGLSLDAILAKRNQTEDFRRQQREQAAKIAKDANKAVRAAKAAANKEK
KASQPKTQQKTAKNVKTAAPRVGGKR

>/tmp/02_13_14-15:09:21.fasta|GENSCAN_predicted_peptide_2|202_aa
MRTLRIAQYSVLTVGFAIYMYRLIEEIPIDIRNLNSDSLEGIINSDELCDVTVSNRNRGL
LVRNDSLDLDILKAKFTTFFSKRYLTRFLSEQVPFLHVIDEALLVKRFVMCACFMVFCLT
VIWFLVIRRMGNLIKRLSVLNQLEDAESVEWARCIREFTQEKLAVLCFCIVPPFAQTDKL
VSDKIKLFREHKILRIRSVQH

Q1. How many exons are predicted?
The first gene has 16 exons and the second has 2 exons.


Q2. What are the begin and end positions?
The first gene begins at 5712 and ends at 12929. For the second gene, the predicted
promoter begins at 12973, and the exons span 13065 to 13723.
Q3. For the possible exons, note the probability of each
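The probability of each predicted exon is listed in the P column of the GenScan output.
The values range from 0.123 (exon 1.12) to 0.999 (exons 1.05, 1.08, 1.09 and 1.15).
No suboptimal exons were found at the probability cutoff of 1.000.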
Q4. On which strand (+ or -) is the gene located?
Both genes are on the forward strand, as shown by the + in the S column.
Q5. Write down the first 6 amino acids and the total length of the predicted protein
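Gene 1: the first 6 amino acids are NKADRM and the predicted protein is 1106 amino acids long.
Gene 2: the first 6 amino acids are MRTLRI and the predicted protein is 202 amino acids long.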
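
The answers to Q1, Q2 and Q5 can be cross-checked against the GenScan exon coordinates. The sketch below is one way to do that, assuming the exon begin and end positions are typed in by hand from the output above; the `summarize` helper is hypothetical. It counts exons per gene, finds the overall begin and end, and sums the coding length, which should be a multiple of 3 if the exon set is consistent.

```python
# (gene, exon type, begin, end) copied from the GenScan output above.
EXONS = [
    (1, "Intr", 5712, 5870), (1, "Intr", 6078, 6893), (1, "Intr", 6937, 7161),
    (1, "Intr", 7619, 7782), (1, "Intr", 7829, 7940), (1, "Intr", 7988, 8182),
    (1, "Intr", 8252, 8442), (1, "Intr", 8526, 8688), (1, "Intr", 8739, 9071),
    (1, "Intr", 9603, 9734), (1, "Intr", 9799, 9971), (1, "Intr", 10742, 10893),
    (1, "Intr", 11906, 12012), (1, "Intr", 12234, 12335), (1, "Intr", 12381, 12590),
    (1, "Term", 12843, 12929),
    (2, "Init", 13065, 13315), (2, "Term", 13366, 13723),
]


def summarize(exons):
    """Return per-gene exon count, coding length in bp, and overall begin/end."""
    summary = {}
    for gene, _exon_type, begin, end in exons:
        s = summary.setdefault(
            gene, {"exons": 0, "coding_bp": 0, "begin": begin, "end": end}
        )
        s["exons"] += 1
        s["coding_bp"] += end - begin + 1  # coordinates are inclusive
        s["begin"] = min(s["begin"], begin)
        s["end"] = max(s["end"], end)
    return summary


for gene, s in summarize(EXONS).items():
    codons, remainder = divmod(s["coding_bp"], 3)
    print(gene, s["exons"], s["begin"], s["end"], s["coding_bp"], codons, remainder)
```

Gene 1 sums to 3321 bp (1107 codons) and gene 2 to 609 bp (203 codons). Each is one codon longer than the reported peptide (1106 and 202 amino acids), which is consistent with the terminal exon including the stop codon.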


Pick any one of the other genefinding programs
a. GeneMark http://exon.gatech.edu/eukhmm.cgi
b. GeneID
http://genome.crg.es/software/geneid/geneid.html
c. FGENESH
http://linux1.softberry.com/berry.phtml?topic=fgenesh&group=programs&subgroup=gfind

Compare your results to those from GenScan
Use any one of the tools below to find the splice sites
http://spliceport.cbcb.umd.edu/
http://www.umd.be/HSF/
http://wangcomputing.com/assp/index.html
Do the predicted splice sites match the ones from your ab initio predictions?
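
As a quick sanity check before using the web tools, the introns implied by a set of predicted exons can be tested for the canonical GT donor and AG acceptor dinucleotides. This is only a rough sketch: the tools listed above score splice sites with their own models, whereas this check looks at two bases at each intron end. The function name and the toy sequence are hypothetical; for the real data the sequence would come from C_elegansDNA.txt and the exons from your table.

```python
def check_splice_sites(seq, exons):
    """Check each intron between consecutive exons for GT...AG boundaries.

    `exons` is a list of (begin, end) pairs, 1-based and inclusive, sorted along
    the forward strand. Returns one tuple per intron:
    (intron_begin, intron_end, donor, acceptor, is_canonical).
    """
    seq = seq.upper()
    results = []
    for (_, end), (begin, _) in zip(exons, exons[1:]):
        donor = seq[end:end + 2]             # first two bases of the intron
        acceptor = seq[begin - 3:begin - 1]  # last two bases of the intron
        canonical = donor == "GT" and acceptor == "AG"
        results.append((end + 1, begin - 1, donor, acceptor, canonical))
    return results


# Toy example (hypothetical): exon 1-6, intron 7-17, exon 18-23.
toy_seq = "ATGGCC" + "GTAAGTTTCAG" + "GGCTAA"
print(check_splice_sites(toy_seq, [(1, 6), (18, 23)]))
```

A non-canonical result does not prove that a predicted exon boundary is wrong, but it is a reason to look more closely at that boundary in the other predictions.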
Gene finding using HOMOLOGY methods
A. Gene finding using EST searches Use blastX
Perform a BlastN at NCBI against C. elegans (or invertebrate) ESTs.
Then select ESTs covering as much as possible of your genomic DNA and try to reconstitute the
entire gene (if this is possible). Retrieve your ESTs from the Blast results. You simply do this by
selecting the ESTs with the click boxes.
Remember to save your ESTs in fasta format (the starting DNA was in the correct format).
Run FGENESH_C using the cDNA from the ESTs

FGENESH_C:
http://linux1.softberry.com/berry.phtml?topic=fgenes_c&group=programs&subgroup=gfs


Finding the correct CDS
Go to your table. Look at the different exon start sites and exon end sites.
Are the predictions identical?
Which do you trust the most? Why?
Did any of the gene finding methods arrive at the correct sequence?
From the results choose the exon starts and exon ends that you trust the most and write them in the
last column (My Gene) of the table.
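
One way to fill in the My Gene column is to keep the exons that several methods agree on exactly. The sketch below assumes each method's exons have been entered as (begin, end) pairs. The GenScan values are the first three exons from the GenScan output; the "OtherTool" values are placeholders made up purely for illustration and are not real results from any program.

```python
from collections import Counter


def consensus_exons(predictions, min_support=2):
    """Return exons (begin, end) predicted identically by >= min_support methods."""
    votes = Counter()
    for exons in predictions.values():
        votes.update(set(exons))  # each method votes once per exon
    return sorted(exon for exon, n in votes.items() if n >= min_support)


predictions = {
    # First three GenScan exons, taken from the GenScan output.
    "GenScan": [(5712, 5870), (6078, 6893), (6937, 7161)],
    # Hypothetical placeholder values, NOT real output from any tool.
    "OtherTool": [(5712, 5870), (6078, 6890), (6937, 7161)],
}

print(consensus_exons(predictions))
```

Exact agreement is a strict criterion. Exons that differ at only one boundary, like the second pair in the placeholder data, are worth checking against the EST alignment rather than discarding.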

Analyzing the CDS
Perform a BlastP against the non-redundant protein database. You can use the GenScan translated
peptides result directly for this.
What kind of protein(s) did you find?