
GENOMIC ANNOTATION

An annotation points out three pieces of information: chromosome, start and stop. An annotation is
not a sequence, it is something different: a note I take about the sequence. Naples, for example, is
an annotation: if I draw Italy I can say that at one point there is Naples, which is my starting
point, and then there is Milan, which can be my ending point. I need this to make the sequence
easier to read. A typical annotation starts with some information about the location of the gene
(ENSxxxxxxx, the Ensembl gene code).
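
A minimal sketch of the idea that an annotation is a label on a sequence and not the sequence itself. The class name, the toy chromosome, the gene code and the 1-based inclusive coordinates are all assumptions made for illustration; they are not taken from Ensembl.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    """A note about a stretch of a chromosome (hypothetical helper class)."""
    gene_id: str      # placeholder code, not a real Ensembl identifier
    chromosome: str
    start: int        # 1-based, inclusive (assumed convention)
    stop: int         # 1-based, inclusive (assumed convention)

    def length(self) -> int:
        """Number of bases covered by the annotation."""
        return self.stop - self.start + 1

    def extract(self, chromosome_sequence: str) -> str:
        """Return the annotated slice of the chromosome sequence."""
        return chromosome_sequence[self.start - 1:self.stop]


# Toy data: the sequence is the "map of Italy", the annotation says where
# "Naples" (start) and "Milan" (stop) are.
toy_chromosome = "TTGACATGGCCAAATGATTT"
toy_gene = Annotation("toy_gene_1", "chrToy", start=6, stop=17)

print(toy_gene.length())
print(toy_gene.extract(toy_chromosome))
```

The annotation stores only three coordinates plus a name; the bases are looked up in the chromosome when needed.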
Studying a genome means finding protein-coding genes, pseudogenes, repetitive sequences,
regulatory elements… There are manual and computational annotations. Manual annotations are
subject to personal interpretation of the gene (?).
Ab initio means I have only the sequence and I don't know which type of gene or protein it is, so
I try to work out a possible structure for that protein. If I at least have an annotation of which
type of protein it is, e.g. that the protein is a globin, I can try to align it (BLAST, FASTA) to
proteins of that family and get better results.

GENE PREDICTION
• Evidence based
o Protein sequence
o EST
o RNA-seq
• Ab initio
1. First of all, before I can align sequences, I have to search for Open Reading Frames (ORFs).
Typically I search for a sequence between an ATG (Met) and a stop codon. Once I have found
these ORFs, can I expect each one to be an actual gene? It depends on what I start from: a
bacterial genome is full of genes, so I can expect that even a randomly picked sequence
contains a real ORF; in the human genome only about 2% is coding, so it is a rare event.
But I might start from a sequence that is not random, and then I can expect it to be an
important one.
2. Evaluate the amino acid sequence: is a protein full of tryptophan possible? Would it be
stable? Can a poly-A sequence be an ORF?
3. Evaluate the length: if proteins are in the order of hundreds of aa, an ORF is hundreds * 3
bases long. In a random sequence each codon is a stop codon with probability 3/64
(1/4*1/4*1/4 for each stop codon, and there are 3 of them), so I expect a stop codon about
every 64/3 = 21.33 codons. A random ORF would be only about 20 codons (roughly 64 bases)
long, far shorter than a real gene.
4. We have to consider that there are exons and introns, so in eukaryotes a gene is usually not
one continuous ORF in the genomic sequence.
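
The ORF search and the length argument in the steps above can be sketched as follows. This is a simplified illustration, not a real gene predictor: it scans only the forward strand, ignores introns, reports only the first ATG of each ORF, and the random-sequence estimate assumes that the four bases are equally frequent and independent. The function names are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence, min_codons=1):
    """Return (start, end, orf) tuples for the three forward frames.

    start is 0-based and end is exclusive. Each ORF runs from an ATG to the
    first in-frame stop codon, stop included. min_codons counts the coding
    codons (ATG included, stop excluded).
    """
    sequence = sequence.upper()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i                      # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                n_codons = (i - start) // 3    # coding codons before the stop
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, sequence[start:i + 3]))
                start = None                   # close it and keep scanning
    return orfs


# Chance that a random codon is a stop: 3 stop codons out of 4*4*4 = 64.
p_stop = len(STOP_CODONS) / 64
expected_codons = 1 / p_stop          # mean wait for a stop, about 21.33 codons
expected_bases = 3 * expected_codons  # about 64 bases


def chance_of_long_orf(n_codons):
    """Probability that n_codons random codons in a row contain no stop."""
    return (1 - p_stop) ** n_codons


print(find_orfs("TTGACATGGCCAAATGATTT"))
print(f"{expected_codons:.2f} codons, {expected_bases:.0f} bases")
print(chance_of_long_orf(100))
```

Raising `min_codons` is the simplest way to apply the length criterion: an ORF of hundreds of codons is very unlikely to appear by chance under these assumptions.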
N50:
• Contig (sequences put together)
• Scaffold (e.g. an entire chromosome)
N50: I sort the pieces of the assembly (contigs, pieces of chromosomes, scaffolds or whatever they
are) from longest to shortest and add up their lengths until I reach 50% of the genome. N50 is the
length of the piece at which I reach that 50%.
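
A short sketch of how N50 can be computed from a list of contig or scaffold lengths. It assumes the usual convention that the 50% threshold refers to the total length of the pieces given (the assembly), not to an independent estimate of genome size; the lengths below are toy values.

```python
def n50(lengths):
    """Length of the piece at which the running total, taken from the
    longest piece to the shortest, first reaches half of the summed length."""
    if not lengths:
        raise ValueError("need at least one contig or scaffold length")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length


contig_lengths = [80, 70, 50, 40, 30, 20, 10]  # toy values, total 300
print(n50(contig_lengths))  # 80 + 70 = 150 reaches half of 300, so N50 is 70
```

A higher N50 means that half of the assembly sits in longer pieces, so the assembly is less fragmented.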
Typically, large genomes have large genes.
Genome browser: Ensembl. It offers different levels of zoom on the genome, and I can study
different genomes. The site is built in a comfortable way: the human genome is on one page, the
chromosomes are on 23 pages, each chromosome is divided into hundreds of scaffolds, and each one
has thousands of pages of genes. It is hierarchically organized.
Ensembl genome browser; NCBI Map Viewer; UCSC Genome Browser…
In Ensembl there are many things, from the single nucleotide up to the entire chromosome. The
letters G, T, P and E in the identifiers name genes, transcripts, peptides and exons (ENSG, ENST,
ENSP, ENSE).
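
A small sketch of how the G, T, P, E letter can be read out of an identifier. It assumes human-style identifiers of the form ENS + one letter + 11 digits, optionally followed by a version such as ".2"; identifiers of other species carry extra letters after ENS and are not handled here. The function name and the example numbers are made up.

```python
import re

FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "peptide", "E": "exon"}

# Assumed pattern: ENS, one type letter, 11 digits, optional ".version".
STABLE_ID = re.compile(r"^ENS(?P<kind>[GTPE])(?P<number>\d{11})(?:\.\d+)?$")


def feature_type(stable_id):
    """Return 'gene', 'transcript', 'peptide' or 'exon', or None if the
    identifier does not follow the assumed pattern."""
    match = STABLE_ID.match(stable_id)
    if match is None:
        return None
    return FEATURE_TYPES[match.group("kind")]


# Placeholder identifiers built from the number 1 padded to 11 digits.
for letter in "GTPE":
    example_id = f"ENS{letter}{1:011d}"
    print(example_id, feature_type(example_id))
```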
We can also compare entire chromosomes or genomes. Chromosomes are evolving and changing.
Ensembl is smart: if I want all the 5' UTRs of many genes, I do not have to extract each sequence
one by one. I select the dataset and the site lets me filter by e.g. region, gene, gene ontology,
expression, protein, SNP… so I can extract from the genes exactly what I need.
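
The filter-then-extract idea can be illustrated on a small in-memory table. This is only a sketch of the logic, not the Ensembl interface: the records, field names and function names are all made up for the example.

```python
# Toy records; every identifier and coordinate here is invented.
toy_genes = [
    {"gene_id": "toy_gene_1", "chromosome": "1", "start": 1_000,
     "stop": 5_000, "biotype": "protein_coding"},
    {"gene_id": "toy_gene_2", "chromosome": "1", "start": 9_000,
     "stop": 9_800, "biotype": "pseudogene"},
    {"gene_id": "toy_gene_3", "chromosome": "2", "start": 2_000,
     "stop": 7_500, "biotype": "protein_coding"},
]


def filter_genes(records, chromosome=None, region=None, biotype=None):
    """Keep the records that pass every filter that was given.

    region is a (start, stop) pair; a gene passes if it overlaps the region.
    """
    kept = []
    for record in records:
        if chromosome is not None and record["chromosome"] != chromosome:
            continue
        if biotype is not None and record["biotype"] != biotype:
            continue
        if region is not None:
            region_start, region_stop = region
            if record["stop"] < region_start or record["start"] > region_stop:
                continue
        kept.append(record)
    return kept


def extract(records, field):
    """Pull out only the attribute that is needed from each kept record."""
    return [record[field] for record in records]


coding_on_chr1 = filter_genes(toy_genes, chromosome="1", biotype="protein_coding")
print(extract(coding_on_chr1, "gene_id"))
```

Filters are combined with a logical AND, and the extraction step is kept separate so that the same selection can be reused for different attributes.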
