
Gene Prediction Tools

Presented To: Ms. Tahira


Presented by:
Maheen Fatima(718)
Saba Tasneem(710)
TOOLS

• GENSCAN
• AUGUSTUS
• GeneMark
• FGENESH
GENSCAN

• GENSCAN was developed by Christopher Burge at Stanford University.
• The GENSCAN web server is hosted at MIT.
• It is a GHMM-based (generalized hidden Markov model) program that predicts the locations of genes and their exon-intron boundaries in genomic sequences from different organisms.
• GENSCAN can analyze sequences that contain partial genes or no genes at all, whereas other programs of its time could only analyze single, complete gene sequences.
Homepage
Input Nucleotide Sequence of Insulin in FASTA format

GENSCAN output columns:

• Gn.Ex: gene number and exon number
• Type: Intr = internal exon (3' splice site to 5' splice site); PlyA = poly-A signal (consensus AATAAA)
• Begin: beginning of exon or signal (numbered on input strand)
• End: end of exon or signal (numbered on input strand)
• Fr: reading frame (a forward-strand codon ending at position x has frame x mod 3)
• Ph: net phase of exon (exon length modulo 3)
• Len: length of exon or signal
• I/Ac: initiation signal or 3' splice site score
• Do/T: 5' splice site or termination signal score
• CodRg: coding region score
• P: probability of exon
• Tscr: exon score (depends on exon length and on the I/Ac, Do/T and CodRg scores)
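
The reading frame and net phase columns come down to modular arithmetic. The following is a minimal Python sketch, assuming 1-based coordinates on the input strand and the GENSCAN convention that net phase is exon length modulo 3. The function names and the example coordinates are hypothetical and are not taken from real GENSCAN output.

```python
def frame_of_codon_end(x: int) -> int:
    """Frame of a forward-strand codon whose last base is at position x."""
    return x % 3


def net_phase(begin: int, end: int) -> int:
    """Net phase of an exon, taken here as its length modulo 3."""
    length = end - begin + 1  # coordinates are inclusive
    return length % 3


# Hypothetical exon spanning positions 1201..1348 (illustration only)
print("frame:", frame_of_codon_end(1348))
print("phase:", net_phase(1201, 1348))
```

An exon whose length is a multiple of 3 has net phase 0, so skipping it would leave the downstream reading frame unchanged.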
AUGUSTUS

• Identifies genes in eukaryotic genomic sequences using a combination of ab initio and homology-based methods.
• Available as a stand-alone program and as a web server.
• Predicts genes in a variety of eukaryotic organisms, including plants, animals, and fungi.
Evidence sources for prediction:

● The coding sequence of the gene
● The exon-intron structure of the gene
● The promoter region of the gene
● The conservation of the gene in other species
Input: sequence taken from the UCSC Genome Browser
OUTPUT:
Six-frame translation:
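
A six-frame translation reads the sequence in three frames on the forward strand and three on the reverse complement. The sketch below is a self-contained Python illustration using the standard genetic code; it is not the web server's own code, and the function names are hypothetical.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def reverse_complement(seq: str) -> str:
    """Complement each base, then reverse the string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq: str) -> str:
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translation(seq: str) -> dict:
    """Return translations for frames +1..+3 and -1..-3."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


for name, protein in six_frame_translation("ATGGCCTAA").items():
    print(name, protein)
```

Stop codons appear as `*`, so a long stretch without `*` in one frame marks a candidate open reading frame.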
GENEMARK

Introduction

• GeneMark is a generic name for a family of ab initio gene prediction software programs developed at the Georgia Institute of Technology in Atlanta.
• The original GeneMark was developed in 1993. In 1995 it was used as a primary gene prediction tool for annotating the first completely sequenced bacterial genome, that of Haemophilus influenzae.
Algorithms

Core algorithm:

• Hidden Markov Model (HMM): This forms the core of GeneMark, representing the genomic sequence as a series of hidden states corresponding to coding regions, introns, intergenic regions, and exon borders.

Additional algorithms:

• Inhomogeneous three-periodic Markov chain models: These models capture the codon-position-specific patterns of nucleotides within coding regions.
• Homogeneous Markov chain models: These models describe non-coding regions.
• Bayesian approach: It is used to predict genes on both DNA strands simultaneously.
• Ribosome Binding Site (RBS) model: This model helps predict the start of genes in prokaryotic genomes by identifying the RBS sequence upstream of the start codon.
• Heuristic models: These models estimate species-specific parameters for gene prediction in situations where large genomic context is unavailable, such as viral genomes and metagenomic sequences.
• Self-training: This technique allows the algorithm to automatically improve its parameter estimates by iteratively predicting genes and using the predictions to further refine the model.
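
To make the hidden-state idea concrete, here is a toy Viterbi decoder in Python. It is only a sketch under strong assumptions: two states (noncoding and coding) instead of the full set of states listed above, single-nucleotide emissions instead of three-periodic Markov chains, and made-up probabilities. None of the numbers or names come from GeneMark.

```python
import math

# Hypothetical toy parameters, chosen for illustration only.
STATES = ("noncoding", "coding")
START = {"noncoding": 0.5, "coding": 0.5}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq: str) -> list:
    """Return the most probable state path for seq (log-space Viterbi)."""
    if not seq:
        return []
    # scores[s] = best log-probability of any path ending in state s
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=scores.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


print(viterbi("ATATATATATGCGCGCGCGC"))
```

With these toy numbers the decoder labels the AT-rich half as noncoding and the GC-rich half as coding; real gene finders use the same dynamic programming idea with far richer state and emission models.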
GeneMark Family of Gene Prediction Programs

Bacteria and Archaea:
• GeneMark
• GeneMark.hmm
• GeneMarkS
• GeneMarkS-2

Metagenomes and Metatranscriptomes:
• MetaGeneMark
• GeneMarkS-T

Eukaryotes:
• GeneMark
• GeneMark.hmm
• GeneMark-ES
• GeneMark-ET
• GeneMark-EP+
• GeneMark-ETP

Viruses:
• Heuristic models (MetaGeneMark and GeneMarkS)
Eukaryotic genome annotation pipelines:

BRAKER1:
pipeline that combines AUGUSTUS with GeneMark
-- uses GeneMark-ET and AUGUSTUS
BRAKER2:
integrates known proteins
-- uses GeneMark-EP+ and AUGUSTUS
BRAKER3:
integrates RNA-seq reads and known proteins
-- uses GeneMark-ETP and AUGUSTUS
HOMEPAGE
LST (Locus Structure Table)

• Focus: Primarily emphasizes the coding potential of the sequence.
• Structure: Each line represents a coding region (CDS) or non-coding region (NCR), with columns specifying:
  • Strand: + or -, indicating the DNA strand coding direction.
  • Start and end positions
  • Length: Size of the region in base pairs.
  • Coding potential score: A numerical value between 0 and 1, where 0 represents non-coding and 1 indicates high coding potential.

GFF (General Feature Format)

• Focus: Provides a more comprehensive view of the gene structure, including exons, introns, coding regions, etc.
• Structure: Each line represents a feature, with columns specifying:
  • Feature type: e.g., gene, exon, intron, untranslated region.
  • Source: Program that predicted the feature.
  • Start and end positions
  • Score: Optional value indicating the confidence in the prediction.
  • Strand: + or -, indicating the DNA strand coding direction.
  • Frame: For coding regions, indicates the reading frame.
  • Attributes: Additional information about the feature, such as gene ID, protein ID, and specific gene name.
OUTPUT (LST format)
GFF Format

Columns: Seq name | Source | Feature type | Start | End | Score (confidence) | Strand | Frame | Attributes
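
The nine GFF columns map directly onto a small record type. Below is a minimal Python sketch that parses one tab-separated GFF line, assuming the common convention that a missing score or frame is written as a dot. The class name, function name, and the example line are hypothetical and are not actual GeneMark output.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GffFeature:
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]
    strand: str
    frame: Optional[int]
    attributes: str


def parse_gff_line(line: str) -> GffFeature:
    """Split one GFF line into its nine tab-separated columns."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 columns, got {len(fields)}")
    seqname, source, feature, start, end, score, strand, frame, attributes = fields
    return GffFeature(
        seqname,
        source,
        feature,
        int(start),
        int(end),
        None if score == "." else float(score),  # "." means no score
        strand,
        None if frame == "." else int(frame),  # "." means not applicable
        attributes,
    )


# Hypothetical example line, for illustration only.
example = "seq1\tGeneMark.hmm\tCDS\t401\t1600\t0.92\t+\t0\tgene_id=1"
record = parse_gff_line(example)
print(record.feature, record.start, record.end, record.end - record.start + 1)
```

Start and end are inclusive, so the feature length is `end - start + 1`.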
PDF FILE OUTPUT

• Window length and step: These parameters define the size and spacing of the segments scanned by GeneMark while searching for coding regions.
• Threshold value: This value is the minimum score required for a region to be considered a potential coding sequence.
• PostScript graph: This option indicates that the output includes a graphical representation of the predictions.
• Matrix: This refers to the mathematical model (the species-specific parameter set) that the software uses to determine the coding potential of a DNA sequence.
• Order: This specifies the order of the Markov model used in the matrix. A higher-order model can capture longer-range dependencies between nucleotides, but it needs more training sequence to estimate reliably, so a higher order does not automatically give more accurate predictions.
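
The roles of window length, step, and threshold can be sketched as a sliding-window scan. In this Python illustration, GC fraction is used as a stand-in score purely to keep the example self-contained; it is not GeneMark's coding potential, and the function names and default values are assumptions chosen for the example.

```python
def gc_fraction(window: str) -> float:
    """Stand-in score: fraction of G/C bases (NOT GeneMark's coding potential)."""
    return sum(base in "GC" for base in window) / len(window)


def scan_windows(seq, window_length=96, step=12, threshold=0.5, score=gc_fraction):
    """Slide a window along seq and keep windows scoring at or above threshold.

    Returns (start, end, score) tuples with 1-based inclusive coordinates.
    """
    hits = []
    for start in range(0, len(seq) - window_length + 1, step):
        window = seq[start:start + window_length]
        value = score(window)
        if value >= threshold:
            hits.append((start + 1, start + window_length, value))
    return hits


# Toy sequence: an AT-only half followed by a GC-only half.
toy = "AT" * 10 + "GC" * 10
print(scan_windows(toy, window_length=10, step=10, threshold=0.5))
```

A smaller step gives finer positional resolution at the cost of more windows to score, and raising the threshold trades sensitivity for fewer false positives.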
• The horizontal axis represents the nucleotide position in the sequence, while the vertical axis indicates the coding potential score.
• A score of 0.5 or higher is generally considered to be indicative of a coding region.
• There is a long ORF spanning from approximately 400 to 1600 bp with a high coding potential score. This ORF is likely to correspond to the actin protein-coding sequence.
• There are also a few shorter ORFs with lower coding potential scores.
• The region of the genome from approximately 1200 to 2000 bp has a relatively low coding potential score. This may indicate that it contains non-coding sequences, such as regulatory elements or introns.
FGENESH: Ab Initio Gene Prediction Tool

Introduction:
FGENESH (Flexible Gene Prediction System) is an ab initio gene prediction tool for prokaryotic and eukaryotic genomes.

Algorithm:
It utilizes a hidden Markov model (HMM) to scan DNA sequences and identify potential coding regions, specifically exons and introns.

The web version of FGENESH can be used with parameters for the following genomes: human, mouse, Drosophila, nematode, dicot plants, monocot plants, yeast (S. pombe) and Neurospora.
Different Tools

• FGENESH - HMM-based gene structure prediction (multiple genes, both chains)
• FGENES - Pattern-based human gene structure prediction (multiple genes, both chains)
• FGENESH-M - Prediction of multiple variants of potential genes in genomic DNA
• FEX - Finding potential 5'-, internal and 3'-coding exons
• SPL - Search for potential splice sites
• SPLM - Search for human potential splice sites using weight matrices
• FSPLICE - Find splice sites in genomic DNA

FGENES, FGENES-M, FGENESH_GC and SPLM can be used on human sequences only.
OUTPUT

Fgenesh output:

• G - predicted gene number, starting from the start of the sequence;
• Str - DNA strand (+ for direct or - for complementary);
• Feature - type of coding sequence: CDSf - first (starting with start codon), CDSi - internal (internal exon), CDSl - last coding segment (ending with stop codon), CDSo - single coding segment, representing a gene with only one exon;
• PolA - polyA signal, indicating the end of the transcript;
• TSS - transcription start site, where RNA polymerase starts transcribing the gene;
• Weight - log likelihood*10 score for the feature;
• ORF - open reading frame of the predicted gene, which is the portion of the CDS that can be translated into a protein.
