
GENOME BASICS

CONCEPTS, TERMINOLOGY AND TOOLS
How Many Genes in a Genome?

• One of the main tasks of genome annotation is to try to give a precise account of the total number of genes
in a genome.
• This may be more feasible for prokaryotes as their gene structures are relatively simple.
• However, the number of genes in eukaryotic genomes, in particular the human genome, has been a subject
of debate.
• This is mainly because of the complex structures of these genomes, which obscure gene prediction.
• Before the human genome sequencing was completed, the estimated gene numbers ranged from 20,000
to 120,000.
• Since the completion of the sequencing of the human genome, and with the use of more sophisticated gene-finding
programs, the estimated total number of human genes has dropped to between 25,000 and 30,000.
• Although no exact number is agreed upon by all researchers, it is now widely believed that the total number
of human genes will be no more than 30,000.
• This compares to estimates of 50,000 in rice, 30,000 in mouse, 26,000 in Arabidopsis, 18,400 in C. elegans,
and 6,200 in yeast.
Automated Genome Annotation
• With the genome sequence data being generated at an exponential rate, there is a need to develop fast and automated
methods to annotate the genomic sequences.
• The automated approach relies on homology detection, which is essentially heuristic sequence similarity searching.
• If a newly sequenced gene or its gene product matches a database sequence with a similarity beyond a certain
threshold, the functional assignment of the database sequence is transferred to it.
• In addition to sequence matching at the full length, detection of conserved motifs often offers additional functional
clues. Because using a single database searching method is often incomplete and error prone, automated methods have
to mimic the manual process, which takes into consideration multiple lines of evidence in assigning a gene function, to
minimize errors. The following algorithm is an example that goes a step beyond examining sequence similarity and
provides functional annotations based on multiple protein characteristics.
• GeneQuiz (http://jura.ebi.ac.uk:8765/ext-genequiz/) is a web server for protein sequence annotation. The program
compares a query sequence against databases using BLAST and FASTA to identify homologs with high similarities. In
addition, it performs domain analysis using the PROSITE and Blocks databases (see Chapter 7) as well as analysis of
secondary structures and supersecondary structures that includes prediction of coiled coils and transmembrane helices.
Multiple search and analysis results are compiled to produce a summary of protein function with an assigned confidence
level (clear, tentative, marginal, and negligible).
CNV
Copy number variation (abbreviated CNV) refers to a circumstance in which the number of
copies of a specific segment of DNA varies among different individuals’ genomes. The
individual variants may be short or include thousands of bases. These structural differences
may have come about through duplications, deletions or other changes and can affect long
stretches of DNA. Such regions may or may not contain a gene(s).
SNP
A single nucleotide polymorphism (abbreviated SNP, pronounced snip) is a
genomic variant at a single base position in the DNA. Scientists study if and
how SNPs in a genome influence health, disease, drug response and other
traits.
Genome annotation
The genome annotation process provides comments for the features.
This involves two steps:
• gene prediction
• functional assignment

• Gene annotation of the human genome employs a combination of theoretical prediction and
experimental verification.
• Gene structures are first predicted by ab initio exon prediction programs such as GenScan or
FgenesH.
• The predictions are verified by BLAST searches against a sequence database.
• The predicted genes are further compared with experimentally determined cDNA and EST
sequences using the pairwise alignment programs such as GeneWise, Spidey, SIM4, and
EST2Genome.
• All predictions are manually checked by human curators.
• Once open reading frames are determined, functional assignment of the encoded proteins is
carried out by homology searching using BLAST searches against a protein database.
• Further functional descriptions are added by searching protein motif and domain databases
such as Pfam and InterPro as well as by relying on published literature.
Single nucleotide
polymorphisms
SNPs
• DNA sequence variations that occur when a
single nucleotide is altered.
• Must be present in at least 1% of the
population to be a SNP.
• Occur every 100 to 300 bases along the 3
billion-base human genome.
• Many have no effect on cell function but some
could affect disease risk and drug response.
Toy example
SNPs on the chromosome
Bi-allelic SNPs
• Most SNPs have one of two nucleotides
at a given position
• For example:
– A/G denotes the varying nucleotide as
either A or G. We call each of these an allele
– Most SNPs have two alleles (bi-allelic)
SNP genotype
• We inherit two copies of each chromosome
(one from each parent)
• For a given SNP the genotype defines the type
of alleles we carry
• Example: for the SNP A/G one’s genotype may
be
– AA if both copies of the chromosome have A
– GG if both copies of the chromosome have G
– AG or GA if one copy has A and the other has G
– The first two cases are called homozygous and
latter two are heterozygous
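The genotype rule above can be written as a tiny function. This is only an illustrative sketch; the function name `classify_genotype` and the two-letter string encoding are assumptions, not part of the slides.

```python
def classify_genotype(genotype: str) -> str:
    """Return 'homozygous' or 'heterozygous' for a two-letter genotype such as 'AG'."""
    if len(genotype) != 2:
        raise ValueError("expected exactly two alleles, e.g. 'AG'")
    first, second = genotype.upper()
    # Same allele on both chromosome copies means homozygous.
    return "homozygous" if first == second else "heterozygous"


for g in ("AA", "GG", "AG", "GA"):
    print(g, classify_genotype(g))
```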
SNP genotyping
Real SNPs
• SNP consortium: snp.cshl.org
• SNPedia: www.snpedia.com
Application of SNPs:
association with disease
• Experimental design to detect cancer
SNPs:
– Pick random humans with and without
cancer (say breast cancer)
– Perform SNP genotyping
– Look for associated SNPs
– Also called genome-wide association study
Case-control example
• Study of 100 people:
– Case: 50 subjects with cancer
– Control: 50 subjects without cancer
• Count number of dominant and recessive alleles and form a contingency table

           #Recessive alleles   #Dominant alleles
Case       10                   40
Control    2                    48
Odds ratio
• Odds of recessive in cancer = a/b = e
• Odds of recessive in no-cancer = c/d = f
• Odds ratio of recessive in cancer vs no-cancer = e/f

             #Recessive alleles   #Dominant alleles
Cancer       a                    b
No cancer    c                    d
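As a quick check of the definition, the sketch below applies it to the counts from the case-control example (10 and 40 in cases, 2 and 48 in controls). The function name `odds_ratio` is an assumption made for illustration.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio for the 2x2 table [[a, b], [c, d]] (rows: cancer, no cancer)."""
    odds_cancer = a / b      # e in the notation above
    odds_no_cancer = c / d   # f in the notation above
    return odds_cancer / odds_no_cancer


# Counts from the case-control example: the result is approximately 6.
print(odds_ratio(10, 40, 2, 48))
```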
Risk ratio (Relative risk)
• Probability of recessive in cancer = a/(a+b) = e
• Probability of recessive in no-cancer = c/(c+d) = f
• Risk ratio of recessive in cancer vs no-cancer = e/f

             #Recessive alleles   #Dominant alleles
Cancer       a                    b
No cancer    c                    d
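The same counts from the case-control example (10 and 40 in cases, 2 and 48 in controls) can be pushed through the risk-ratio definition. This is a minimal sketch; the function name `risk_ratio` is an assumption.

```python
def risk_ratio(a, b, c, d):
    """Risk ratio for the 2x2 table [[a, b], [c, d]] (rows: cancer, no cancer)."""
    p_cancer = a / (a + b)      # e in the notation above
    p_no_cancer = c / (c + d)   # f in the notation above
    return p_cancer / p_no_cancer


# Counts from the case-control example: the result is approximately 5.
print(risk_ratio(10, 40, 2, 48))
```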
Odds ratio vs Risk ratio
• Risk ratio has a natural interpretation
since it is based on probabilities
• In a case-control model we cannot
calculate the probability of cancer given
recessive allele. Subjects are chosen
based on disease status and not allele type
• Odds ratio shows up in logistic
regression models
Odds ratios in genome-wide
association studies
• Higher odds ratio means stronger
association
• Therefore SNPs with highest odds ratios
should be used as predictors or risk
estimators of disease
• Odds ratio is generally higher than risk
ratio
• Both are similar when the underlying
probabilities are small
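A small numerical comparison shows why the two measures agree for small probabilities and diverge for large ones. The probabilities used here are made-up illustrative values, and the helper names are assumptions.

```python
def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)


def compare(p_cancer, p_no_cancer):
    """Return (risk ratio, odds ratio) for two group probabilities."""
    rr = p_cancer / p_no_cancer
    odds_ratio = odds(p_cancer) / odds(p_no_cancer)
    return rr, odds_ratio


# Hypothetical probabilities: a rare setting, then a common one.
for p_cancer, p_no_cancer in [(0.02, 0.01), (0.6, 0.3)]:
    rr, odds_ratio = compare(p_cancer, p_no_cancer)
    print(f"p={p_cancer}, {p_no_cancer}: RR={rr:.2f}, OR={odds_ratio:.2f}")
```

In both settings the risk ratio is 2, but the odds ratio stays close to 2 only when the probabilities are small.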
Statistical test of association
(P-values)
• P-value = probability of the observed data (or
more extreme data) under the null hypothesis
• Example:
– Suppose we are given a series of coin tosses
– We feel that a biased coin produced the tosses
– We can ask the following question: if the coin were fair, what is
the probability of tosses at least as extreme as these?
– If this probability is very small then we can say there is a
small chance that a fair coin would produce the observed tosses.
– In this example the null hypothesis is the fair coin and the
alternative hypothesis is the biased coin
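The coin example can be made concrete with an exact binomial calculation. The sketch below is one possible two-sided version; the function name and the choice of 9 heads in 10 tosses are assumptions for illustration.

```python
from math import comb


def fair_coin_p_value(heads, tosses):
    """Two-sided exact p-value: probability, under a fair coin, of a head
    count at least as far from tosses/2 as the observed one."""
    observed_distance = abs(heads - tosses / 2)
    extreme = sum(
        comb(tosses, k)
        for k in range(tosses + 1)
        if abs(k - tosses / 2) >= observed_distance
    )
    # Every sequence of tosses has probability 1 / 2**tosses under a fair coin.
    return extreme / 2 ** tosses


# Hypothetical data: 9 heads in 10 tosses.
print(fair_coin_p_value(9, 10))
```

A small value here is evidence against the null hypothesis of a fair coin; it is not the probability that the coin is fair.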
SNPs and GWAS
Polymorphism
• Polymorphism: sites/genes with “common”
variation
• Locus (location) vs alleles (variations)
• Minor allele frequency >= 1%, otherwise called
rare variant and not polymorphic
• Single Nucleotide Polymorphism
– Arise from a DNA-replication mistake in an
individual germ line cell, then transmitted
– ~90% of human genetic variation
• Copy number variations
– May or may not be genetic
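The 1% minor allele frequency rule can be sketched as follows. The genotype encoding (two-letter strings at a bi-allelic site), the function names, and the sample counts are all assumptions made for illustration.

```python
from collections import Counter


def minor_allele_frequency(genotypes):
    """Frequency of the less common allele at a bi-allelic site."""
    counts = Counter(allele for g in genotypes for allele in g)
    if len(counts) == 1:
        return 0.0  # only one allele observed in the sample
    return min(counts.values()) / sum(counts.values())


def is_polymorphic(genotypes, threshold=0.01):
    """Apply the MAF >= 1% rule; below the threshold the site is a rare variant."""
    return minor_allele_frequency(genotypes) >= threshold


# Hypothetical sample of 100 people: 11 G alleles out of 200.
common = ["AA"] * 90 + ["AG"] * 9 + ["GG"] * 1
# Hypothetical sample of 200 people: 1 G allele out of 400.
rare = ["AA"] * 199 + ["AG"]
print(minor_allele_frequency(common), is_polymorphic(common))
print(minor_allele_frequency(rare), is_polymorphic(rare))
```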
STAT115 24
SNP Characteristics:
Linkage Disequilibrium
• Hardy-Weinberg equilibrium
– In a population with genotypes AA, aa, and Aa, if p =
freq(A), q = freq(a), the frequency of AA, aa and Aa
will be p², q², and 2pq respectively at equilibrium.

– Similarly with two loci, each two alleles Aa, Bb
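For a single locus, the Hardy-Weinberg proportions follow directly from the allele frequency. This is a minimal sketch with an assumed function name and an illustrative value of p.

```python
def hardy_weinberg(p):
    """Expected genotype frequencies at equilibrium for freq(A) = p, freq(a) = 1 - p."""
    q = 1 - p
    return {"AA": p * p, "Aa": 2 * p * q, "aa": q * q}


# Hypothetical allele frequency p = 0.6.
for genotype, freq in hardy_weinberg(0.6).items():
    print(genotype, round(freq, 4))
```

The three frequencies always sum to 1, since p² + 2pq + q² = (p + q)².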

SNP Characteristics:
Linkage Disequilibrium
• LD: if alleles at two loci occur together more often than
can be accounted for by chance, this indicates that
the two loci are physically close on the DNA
• Haplotype block: a cluster of linked SNPs
• Haplotype boundary: sequence shows strong LD
within blocks and no LD between blocks; the
boundaries reflect recombination hotspots
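"More often than can be accounted for by chance" is usually quantified with the coefficient D and its normalised form r². The sketch below uses these standard definitions; the function name and the example frequencies are assumptions.

```python
def linkage_disequilibrium(p_ab, p_a, p_b):
    """Return (D, r^2) from the A-B haplotype frequency and the two allele frequencies.

    D   = p_ab - p_a * p_b   (zero when the loci are independent)
    r^2 = D^2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
    """
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, r2


# Hypothetical frequencies: A and B always travel together (complete LD).
print(linkage_disequilibrium(0.5, 0.5, 0.5))
# Hypothetical frequencies: A and B independent (no LD).
print(linkage_disequilibrium(0.25, 0.5, 0.5))
```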

Haplotype
• Association studies using haplotypes are more
accurate than those using individual SNPs
• Haplotype size distribution

SNP Profiling
• [C/T] [A/G] T X C [A/C] [T/A]
– 2⁴ = 16 possible haplotypes, although often a few
common ones explain 90% of the variation
• Tagging (non-redundant) SNPs that capture
most variations in haplotypes
– reference SNP ID number: rs12345678
• SNP arrays covering
whole genome
• Now WES or WGS
• Genotype: 2 alleles
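The number of possible haplotypes for the profile [C/T] [A/G] T X C [A/C] [T/A] can be checked by enumeration: four bi-allelic sites give 2⁴ = 16 combinations. This is an illustrative sketch; the fixed positions, including the X, are copied as they appear in the profile.

```python
from itertools import product

# Each tuple lists the alleles possible at one position of the profile.
sites = [("C", "T"), ("A", "G"), ("T",), ("X",), ("C",), ("A", "C"), ("T", "A")]

haplotypes = ["".join(alleles) for alleles in product(*sites)]
print(len(haplotypes))
print(haplotypes[:4])
```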
Association Studies
• Association between genetic markers and
phenotype
– E.g. cystic fibrosis: ~70% of cystic fibrosis
patients have a deletion of 3 base pairs
resulting in the loss of a phenylalanine
amino acid at position 508 of the
encoded protein
• In particular, find disease genes and SNP /
haplotype markers for susceptibility prediction
and diagnosis
Warfarin and CYP2C9:
SNPs in Pharmacogenomics
• Warfarin is an anticoagulant drug; the enzyme encoded
by the CYP2C9 gene metabolizes warfarin.
• A patient requiring a low warfarin dosage
compared to the normal population has an odds
ratio of 6.21 for having ≥ 1 variant allele
• The subgroup of patients who are poor
metabolisers of warfarin is potentially at
higher risk of bleeding
Aithal et al., 1999.
Genome-Wide Association Studies

• Quality Control
– Unusual similarity between individuals
– Wrong sex
– Trio has non-Mendelian inheritance
– Genotyping quality
• Two strategies:
– Family-based association studies
– Population-based case-control association
studies
Family-based Association Studies

Look at allele transmission in unrelated families
with one affected child in each

Like a coin toss: likelihood of a fair coin

A a
A 0
a 0

TDT: Transmission Disequilibrium Test

• Only heterozygous parents matter; calculate
observed over expected transmissions

• Could also compare allele frequency between
affected vs unaffected children in the same
family
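The TDT is commonly computed as a McNemar-type statistic over heterozygous parents: if b parents transmit allele A to the affected child and c transmit the other allele, the statistic is (b − c)² / (b + c), compared against a χ² distribution with 1 degree of freedom. The sketch below uses that standard form; the function name and the counts are hypothetical.

```python
def tdt_statistic(transmitted, not_transmitted):
    """McNemar-type TDT chi-square (1 df) from heterozygous-parent transmissions.

    Under no association each allele is transmitted with probability 1/2,
    like a fair coin, so the expected count is (b + c) / 2 for both outcomes.
    """
    b, c = transmitted, not_transmitted
    return (b - c) ** 2 / (b + c)


# Hypothetical counts: allele A transmitted 60 times, not transmitted 40 times.
print(tdt_statistic(60, 40))
```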
Case Control Studies
• SNP/haplotype marker frequency in
sample of affected cases compared to
that in age / sex / population-matched
sample of unaffected controls

From Genotyping to Allele Counts

Test Significant Associations

• Expected:
– (24 + 278) * (24 + 86) / (24 + 278 + 86 + 296) = 49
– (278 + 296) * (86 + 296) / (24 + 278 + 86 + 296) = 321

• χ² = 27.5, 1 df, p < 0.001
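The expected counts above are the usual "row total × column total / grand total" values for a 2×2 table. The sketch below reproduces them and computes a Pearson chi-square. The table layout [[24, 278], [86, 296]] is an assumption inferred from the sums in the formulas, and the function names are illustrative.

```python
def expected_counts(table):
    """Expected cell counts under independence: row total * column total / N."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Assumed layout of the four counts used in the formulas above.
table = [[24, 278], [86, 296]]
expected = expected_counts(table)
print(round(expected[0][0]), round(expected[1][1]))
print(round(chi_square(table), 1))
```

With unrounded expected counts the statistic comes out near 26.5; rounding the expected counts to whole numbers first gives a slightly larger value, so small differences from a quoted figure are to be expected.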

Association of Alleles and Genotypes of
rs1333049 with Myocardial Infarction

            C, N (%)        G, N (%)        χ² (1 df)   P-value
Cases       2,132 (55.4)    1,716 (44.6)    55.1        1.2 x 10⁻¹³
Controls    2,783 (47.4)    3,089 (52.6)
Allelic Odds Ratio = 1.38

• OR = 1, no disease association
• OR > 1, allele C increases risk of disease
• OR < 1, allele C decreases risk of disease
• Adjusting for multiple hypothesis testing?
Samani N et al, 2007; 357:443-453.
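The allelic odds ratio can be recomputed from the four allele counts reported for rs1333049. This is a minimal sketch; the variable and function names are assumptions.

```python
def allelic_odds_ratio(case_c, case_g, control_c, control_g):
    """Odds of carrying allele C in cases divided by the odds in controls."""
    return (case_c * control_g) / (case_g * control_c)


# Allele counts from the rs1333049 table.
or_c = allelic_odds_ratio(2132, 1716, 2783, 3089)
print(round(or_c, 2))

# Percentage of C alleles in cases and in controls.
print(round(100 * 2132 / (2132 + 1716), 1), round(100 * 2783 / (2783 + 3089), 1))
```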
Reproducibility of Association Studies
• Most reported associations have not been
consistently reproduced
• Hirschhorn et al., 2002,
review of association studies
– 603 associations of polymorphisms and disease
– 166 studied in at least three populations
– Only 6 seen in > 75% of studies

Size Matters

Visscher, AJHG 2012

Unusual P-value distributions
• P-value QQ plot
Population Stratification
• Population stratification
– e.g. some SNPs are unique to an ethnic group
– Need to make sure sample groups match
– Hidden environmental structure
● Two populations have different disease
frequency, and different allele frequency.
● Association picks up the fact they are different
populations!
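A toy calculation shows how pooling two populations can create an association that exists in neither. All counts below are invented for illustration, and the helper name `cross_product_or` is an assumption.

```python
def cross_product_or(case_a, case_b, control_a, control_b):
    """Allelic odds ratio from a 2x2 table of allele counts."""
    return (case_a * control_b) / (case_b * control_a)


# Hypothetical allele counts (case_A, case_a, control_A, control_a).
# Population 1: allele A is common and the disease is frequent.
pop1 = (80, 20, 40, 10)
# Population 2: allele A is rare and the disease is infrequent.
pop2 = (10, 40, 20, 80)
# Pooled sample: add the two tables cell by cell.
pooled = tuple(x + y for x, y in zip(pop1, pop2))

print(cross_product_or(*pop1))    # no association within population 1
print(cross_product_or(*pop2))    # no association within population 2
print(cross_product_or(*pooled))  # spurious association after pooling
```

Within each population the odds ratio is 1, yet the pooled table gives an odds ratio above 2, purely because the two populations differ in both allele frequency and disease frequency.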

Genotyping Principal Components (PCs)
Can Model Population Stratification

• Li et al., Science 2008


IBD: Identity By Descent Test
• If two individuals share common ancestor,
they will share many SNPs / haplotype blocks
on their genome (identical by state: IBS)
• IBD are IBS by definition; IBS not necessarily
IBD

IBD: Identity By Descent Test
• Pairwise IBD probability between samples
• Probability two individuals share 0 (Z0), 1 (Z1),
and 2 (Z2) haplotypes across the genome.
• Remove IBD-related samples
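The three sharing probabilities are often summarised as a single proportion of the genome shared IBD, Z2 + 0.5 × Z1, and pairs above a cutoff are flagged as related. The sketch below illustrates this; the sample names, the probabilities, and the 0.1875 cutoff are assumptions, not values from the slides.

```python
def pi_hat(z0, z1, z2):
    """Proportion of the genome shared IBD: P(IBD=2) + 0.5 * P(IBD=1)."""
    return z2 + 0.5 * z1


def related_pairs(pairs, threshold=0.1875):
    """Return the sample pairs whose IBD proportion exceeds the (assumed) cutoff."""
    return [(a, b) for a, b, z0, z1, z2 in pairs if pi_hat(z0, z1, z2) > threshold]


# Hypothetical pairwise estimates: (sample 1, sample 2, Z0, Z1, Z2).
pairs = [
    ("s1", "s2", 1.00, 0.00, 0.00),  # unrelated
    ("s1", "s3", 0.00, 1.00, 0.00),  # parent-child pattern
    ("s2", "s4", 0.25, 0.50, 0.25),  # full-sibling pattern
]
print(related_pairs(pairs))
```

One sample from each flagged pair would then be dropped before the association test.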

Detection Power of GWAS

Manolio et al., Clin Invest 2008
