
Computational Biology 12BBI152

Human Genome Project Ultra-conservation in human genome


Bioinformatics and Functional Genomics, Jonathan Pevsner, 2009, John Wiley & Sons, Inc.

International Human Genome Sequencing Consortium [IHGSC]


- International collaboration (IHGSC) and Celera
- The sequencing and analysis of a draft version of the human genome (2001)
- Major findings of the Human Genome Project
- Web-based resources
- A bioinformatics perspective
- Euchromatic sequence (IHGSC, 2004) and characterization of each of the 22 autosomes and two sex chromosomes (as well as the mitochondrial genome)
- Variation in the human genome, including the analysis of individual genomes

Major findings
- 20,000 to 25,000 protein-coding genes (IHGSC, 2004)
- A gene count comparable to that of much simpler organisms such as Arabidopsis thaliana (26,000 genes) and puffer fish (21,000 genes)
- The human proteome is far more complex than the set of proteins encoded by invertebrate genomes
- About 40 genes that underwent horizontal transfer from bacteria (Salzberg et al., 2001)

Major findings contd..


- More than 98% of the human genome does not code for proteins
- Much of it consists of repetitive DNA elements such as long interspersed elements (LINEs) (20%), short interspersed elements (SINEs) (13%), long terminal repeat (LTR) retrotransposons (8%), and DNA transposons (3%)
- About half of the human genome is derived from transposable elements
- Unlike in the mouse genome, transposable element activity in the human lineage has declined markedly

Major findings contd..


- Segmental duplication is a frequent occurrence in the human genome, particularly in pericentromeric and subtelomeric regions
- There are several hundred thousand Alu repeats in the human genome
- The mutation rate is about twice as high in male meiosis as in female meiosis
- More than 1.4 million single nucleotide polymorphisms (SNPs) were identified

ENCyclopedia Of DNA Elements


ENCODE Consortium, 2004, 2007

Objectives
- High-throughput sequencing of chromatin regulatory elements, including transcription factor binding sites, using chromatin immunoprecipitation followed by high-throughput DNA sequencing
- Comprehensively identifying active functional elements in human chromatin (in part using DNase I hypersensitivity assays)
- Characterizing the human transcriptome
- Developing a reference gene set for protein-coding genes, non-coding genes, and pseudogenes

Web resources for Human genome


NCBI
Human genome resources

Map Viewer
- integrates human sequence and data from cytogenetic maps, genetic linkage maps, radiation hybrid maps, and YAC chromosomes
- links to Entrez Gene, Entrez Nucleotide, Entrez Protein, human and mouse UniGene, and a human-mouse homology map

Evidence viewer
- displays evidence supporting the proposed structure of a gene
- highlights possible discrepancies in the nucleotide sequence, exon-intron boundaries, or other aspects of an annotated gene

[Figure: Evidence Viewer display, with a color coding scheme for contig, GenBank, RefSeq, and EST sequences]

Ensembl
- A comprehensive resource for information about the human genome as well as many other genomes (Flicek et al., 2008)
- Effectively interconnects a wide range of genomics tools, with a focus on annotation of known and newly predicted genes

Ensembl contd..
- Contig view allows you to search across an entire chromosome
- Gene view includes the transcript DNA sequence and information on exon-intron boundaries (splice sites)
- Anchor view allows you to select two features from a chromosome as anchor points and to display the intervening region
- Disease view links to disease entries in OMIM
- Map view shows an ideogram of each chromosome, including the known genes, GC content, and SNPs
- Cyto view displays genes, BAC end clones, repetitive elements, and the tiling path across genomic DNA regions

University of California at Santa Cruz Human Genome Browser

- The "Golden Path" is the human genome sequence annotated at UCSC
- Along with the Ensembl and NCBI sites, the human genome browser at UCSC is one of the three main web-based sources of information for both the human genome and other genomes

The National Human Genome Research Institute


NHGRI has a leading role in genome sequencing, coordinating pilot-scale and large-scale sequencing efforts, technology development, and policy development

The Wellcome Trust Sanger Institute


A leading genomics institute whose resources are essential to the fields of bioinformatics and genomics

The Human Genome Project


First proposed by the National Research Council (1988)

Goals
- Human DNA sequence
- Sequencing technology
- Sequence variation
- Functional genomics
- Comparative genomics
- Ethical, legal and social issues (ELSI)
- Bioinformatics and computational biology
- Training and manpower

DNA Sequencing efforts

- Twenty institutions were involved
- Development of hierarchical sequencing
- BAC-by-BAC sequencing (IHGSC)
- Whole-genome shotgun sequencing (Celera)

Sequencing Strategy

How much of the genome has been sequenced and assembled? One measure is the average length of a clone or a contig.
- The N50 length is the largest length L such that 50% of all nucleotides are contained in contigs or scaffolds of at least size L
- In 2001, half of all nucleotides were present in a fingerprint clone contig of at least 8.4 megabases
- The N50 length rose to 38.5 megabases with the most recent freeze of the genome assembly at the time (2008)
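The N50 definition above can be sketched in a few lines of Python. This is a minimal illustration, not the assembly pipeline used by the IHGSC; the function name `n50` and the toy contig lengths are assumptions made for the example.

```python
def n50(lengths):
    """Return the N50 of a collection of contig (or scaffold) lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of all nucleotides.
    """
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("lengths must be non-empty")


# Toy contig lengths (hypothetical, not real assembly data).
contigs = [80, 70, 50, 40, 30, 20, 10]
print(n50(contigs))
```

Sorting from longest to shortest and accumulating until half the total is reached gives the N50 directly, without having to test every candidate L.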

IHGSC, 2001

Broad Genomic Landscape


- The autosomes are numbered approximately in order of size
- The largest chromosome, chromosome 1, is 223 megabases in length; the smallest, chromosome 21, is about 47 megabases
- Broad features include:
  - The distribution of GC content
  - CpG islands and recombination rates
  - The repeat content
  - The gene content

Long-Range Variation in GC Content


- Average GC content: 41%
- Some regions are relatively GC-rich and others relatively GC-poor
- Regions of relatively uniform base composition are called isochores
- Mammalian genomes are organized into a mosaic of large DNA segments (e.g., >300 kb) called isochores (Bernardi et al., 2001)
- Isochores are fairly homogeneous compositionally and can be divided into GC-poor families (L1 and L2) or GC-rich families (H1, H2, and H3)

- A histogram of the overall GC content (in 20 kb windows) shows a broad profile with skewing to the right
- Fifty-eight percent of the GC content bins are below the average, while 42% are above the average, including a long tail of highly GC-rich regions
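A windowed GC profile of the kind described above can be sketched as follows. This is a minimal illustration that assumes the sequence is available as a plain string; the function names `gc_fraction` and `windowed_gc` are hypothetical, and the 20 kb default only mirrors the window size mentioned in the text.

```python
def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T) of seq."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return None  # window contains only gaps or ambiguous bases
    return (seq.count("G") + seq.count("C")) / acgt


def windowed_gc(seq, window=20_000):
    """GC fraction of consecutive, non-overlapping windows of the given size."""
    values = []
    for start in range(0, len(seq) - window + 1, window):
        gc = gc_fraction(seq[start:start + window])
        if gc is not None:
            values.append(gc)
    return values


# Toy example with a 4-base window (real analyses would use 20 kb windows).
print(windowed_gc("GGCCAATT", window=4))
```

The list returned by `windowed_gc` is what one would bin into a histogram to compare the share of windows below and above the genome-wide average.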

CpG Islands

28,890 CpG islands

- The dinucleotide CpG is greatly underrepresented in genomic DNA, occurring at about one-fifth its expected frequency
- Most CpG dinucleotides are methylated on the cytosine and are subsequently deaminated to thymine bases
- However, the genome contains many CpG islands, which are typically associated with the promoter and exonic regions of housekeeping genes (Gardiner-Garden and Frommer, 1987)
- CpG islands have roles in processes such as gene silencing, genomic imprinting (Tycko and Morison, 2002), and X chromosome inactivation (Avner and Heard, 2001)
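The underrepresentation of CpG is usually quantified as an observed/expected ratio. The sketch below computes that ratio and applies one commonly quoted set of CpG island thresholds (length of at least 200 bp, GC content of at least 50%, observed/expected of at least 0.6). Those thresholds are not stated in the slides and are included here as an assumption; the function names are hypothetical.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio: (#CpG * N) / (#C * #G)."""
    seq = seq.upper()
    n = len(seq)
    c, g = seq.count("C"), seq.count("G")
    if c == 0 or g == 0:
        return 0.0
    return seq.count("CG") * n / (c * g)


def looks_like_cpg_island(seq, min_len=200, min_gc=0.5, min_ratio=0.6):
    """Apply assumed island thresholds to a candidate region."""
    seq = seq.upper()
    if len(seq) < min_len:
        return False
    gc = (seq.count("G") + seq.count("C")) / len(seq)
    return gc >= min_gc and cpg_obs_exp(seq) >= min_ratio


print(cpg_obs_exp("CCGG"))                 # one CpG among 2 C and 2 G
print(looks_like_cpg_island("CG" * 100))   # GC-rich, CpG-dense toy region
print(looks_like_cpg_island("AT" * 100))   # no C or G at all
```

A ratio near 1 means CpG occurs about as often as base composition predicts; bulk genomic DNA, at roughly one-fifth of expectation, would score near 0.2.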

Repetitive DNA in human genome


Five main classes (IHGSC, 2001; Jurka, 1998)
1. Interspersed repeats (transposon-derived repeats)
2. Processed pseudogenes: inactive, partially retroposed copies of protein-coding genes
3. Simple sequence repeats: microsatellites and minisatellites, including short sequences such as (A)n, (CA)n, or (CGG)n
4. Segmental duplications, consisting of blocks of 10 to 300 kb that are copied from one genomic region to another
5. Blocks of tandemly repeated sequences such as are found at centromeres, telomeres, and ribosomal gene clusters

Transposon-Derived Repeats
- Remarkably, 45% or more of the human genome consists of repeats derived from transposons
- These are also called interspersed repeats
- Transposon-derived repeats can be classified into four categories: LINEs, SINEs, LTR retrotransposons, and DNA transposons

Other Repetitive Elements


Simple sequence repeats: about 3%
- Perfect (or slightly imperfect) tandem repeats of k-mers
- k ~ 1 to 12 bases: microsatellite
- k ~ 12 to 500 bases: minisatellite
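Perfect tandem repeats of a k-mer can be located with a regular-expression backreference. The sketch below is a minimal illustration that handles only perfect repeats of one fixed unit length; the function name `find_tandem_repeats` and the `min_copies` parameter are assumptions made for the example.

```python
import re


def find_tandem_repeats(seq, unit_len, min_copies=3):
    """Find perfect tandem repeats of a unit of length unit_len.

    Returns (start, unit, copies) tuples. Note that a unit such as "AA"
    also matches a homopolymer run, so callers may want to filter units
    that are themselves repeats of a shorter motif.
    """
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_len, min_copies - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // unit_len
        hits.append((match.start(), unit, copies))
    return hits


print(find_tandem_repeats("TTCACACACAGG", unit_len=2))            # a (CA)n microsatellite
print(find_tandem_repeats("GGAAAAAT", unit_len=1, min_copies=4))  # an (A)n run
```

Slightly imperfect repeats, which the definition above also allows, would need an alignment-based or mismatch-tolerant approach rather than a plain regular expression.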

Segmental duplications: about 5.5%
- Duplicated blocks of 1 to 200 kb of genomic sequence, typically 10 to 50 kb
- Inter-chromosomal or intra-chromosomal

- The centromeres contain large amounts of interchromosomal duplicated segments, with almost 90% of a 1.5 Mb region containing these repeats
- Smaller regions of these repeats also occur near the telomeres

Gene Content of the Human Genome


The average exon is only 50 codons
Such small elements are hard to identify as exons unambiguously

Exons are interrupted by introns, some many kilobases in length.

In the extreme case, the human dystrophin gene extends over 2.4 Mb, the size of an entire genome of a typical prokaryote!

- Use of cDNAs continues to provide an essential approach to gene identification
- There are many pseudogenes that may be difficult to distinguish from functional protein-coding genes
- The nature of non-coding genes is poorly understood

Gene Expression Transcription & Translation

[Figure: DNA to RNA; one gene, one message versus one gene, many messages]

- Average coding sequence: 1344 bp
- Internal exons are about 50 to 200 bp
- The size of introns is far more variable
- Protein-coding genes are associated with a high GC content
- Gene density increases 10-fold as GC content rises from 30% to 50%

Non-coding RNAs
Classes of genes that do not encode proteins

Noncoding RNAs can be difficult to identify in genomic DNA because they:
- lack open reading frames
- may be small, and not polyadenylated
- are difficult to detect by gene-finding algorithms, and are often not present in cDNA libraries
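To make the point about open reading frames concrete, the sketch below scans the three forward reading frames of a sequence for the longest ATG-to-stop ORF. It is a deliberately simplified illustration, not a gene-finding algorithm: it ignores the reverse strand and introns, and the function name `longest_orf` is hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Length in codons (stop excluded) of the longest ATG...stop ORF
    found in the three forward reading frames of seq."""
    seq = seq.upper()
    best = 0
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                best = max(best, (i - start) // 3)
                start = None
    return best


print(longest_orf("ATGAAACCCTAG"))  # ATG AAA CCC followed by a stop codon
print(longest_orf("CCCCCC"))        # no start codon in any frame
```

A protein-coding gene gives such a scan a long ORF to find, whereas a noncoding RNA gene offers no comparable signal, which is one reason ORF-based gene finders miss them.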

These noncoding RNAs include the following:


- Transfer RNAs, required as adapters to translate mRNA into the amino acid sequence of proteins
- Ribosomal RNAs, required for mRNA translation
- Small nucleolar RNAs (snoRNAs), required for RNA processing in the nucleolus
- Small nuclear RNAs (snRNAs), required for spliceosome function

The Human Proteome


European Bioinformatics Institute Integr8 database
- 40,014 proteins in the human proteome
- 74% of the proteins were significantly related to other known proteins
- The number of protein-coding genes in humans is comparable to the number of genes in other metazoans and plants, and only five-fold greater than the number in unicellular fungi
- Even so, the human proteome may be far more complex (compared to other metazoans and plants):
  - relatively more domains and protein families
  - relatively more paralogs, potentially yielding more functional diversity
  - relatively more multi-domain proteins having multiple functions
  - domain architectures tend to be more complex
  - alternative RNA splicing may be more extensive

Complexity of Human Proteome

Taxonomic Distribution of human Proteome

Human Chromosomes


Genetic Variation
- SNPs represent a fundamental form of variation in the human population (copy number variation being the other)
- The HapMap project brought the number of characterized SNPs to 3.1 million (International HapMap Consortium, 2007)
- Most SNPs are biallelic
- SNPs are spaced on average about 875 base pairs apart across the genome
- SNPs have varying extents of linkage disequilibrium (LD)

Why study SNPs?
- SNP microarray analyses are used for genome-wide studies of disease association
- SNPs reveal patterns of variation, such as shared ancestry, in human populations
- SNP analyses can reveal regions of the genome under strong positive selection
- Another use of SNPs is to identify chromosomal deletions, duplications, inversions, and other abnormalities
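Linkage disequilibrium between two biallelic SNPs is often summarized by the r² statistic. The sketch below computes it from phased haplotypes using the standard definition D = p(AB) - p(A)p(B) and r² = D² / (p(A)(1 - p(A)) p(B)(1 - p(B))). The input format, a list of (allele at SNP 1, allele at SNP 2) tuples, and the function name are assumptions made for the example.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic SNPs from phased haplotypes.

    haplotypes: list of (allele_at_snp1, allele_at_snp2) tuples,
    one per sampled chromosome.
    """
    n = len(haplotypes)
    if n == 0:
        raise ValueError("need at least one haplotype")
    a_ref, b_ref = haplotypes[0]  # arbitrary reference allele at each SNP
    p_a = sum(1 for a, _ in haplotypes if a == a_ref) / n
    p_b = sum(1 for _, b in haplotypes if b == b_ref) / n
    p_ab = sum(1 for a, b in haplotypes if a == a_ref and b == b_ref) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both SNPs must be polymorphic in the sample")
    return d * d / denom


# Toy samples (hypothetical): alleles always co-occur vs. occur independently.
linked = [("A", "G")] * 5 + [("T", "C")] * 5
unlinked = [("A", "G"), ("A", "C"), ("T", "G"), ("T", "C")]
print(ld_r_squared(linked), ld_r_squared(unlinked))
```

Values near 1 mean one SNP predicts the other almost perfectly, which is what lets a small set of tag SNPs stand in for a whole haplotype block in association studies.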

Outstanding problems
- How can we accurately determine the number of protein-coding genes?
- How can we determine the number of noncoding genes?
- How can we determine the function of genes and proteins?
- What is the evolutionary history of our species?
- What is the degree of heterogeneity between individuals at the nucleotide level?

International HapMap Project


- Although the main sequencing phase of the HGP has been completed, studies of DNA variation continue
- The goal is to identify patterns of single-nucleotide polymorphism (SNP) groups (called haplotypes, or "haps")
- The DNA samples for the HapMap came from a total of 270 individuals:
  - Yoruba people in Ibadan, Nigeria
  - Japanese people in Tokyo
  - Han Chinese in Beijing
  - The French Centre d'Etude du Polymorphisme Humain (CEPH) resource, which consisted of residents of the United States having ancestry from Western and Northern Europe

Private Venture
Celera Genomics: DNA from five different individuals was used for sequencing.
The lead scientist of Celera Genomics at that time, Craig Venter, later acknowledged (in a public letter to the journal Science) that his DNA was one of 21 samples in the pool, five of which were selected for use.

On September 4, 2007, a team led by Craig Venter published his complete DNA sequence unveiling the six-billion-nucleotide genome of a single individual for the first time.

Benefits? ELSI?

Ultra-conservation in Human Genome


- An ultra-conserved element (UCE) is a region of DNA that is identical in at least two different species
- One of the first studies of UCEs identified human DNA sequences of 200 nucleotides or more that were entirely conserved (identical nucleic acid sequence) in both rats and mice
- Despite often being non-coding DNA, some ultra-conserved elements have been found to be transcriptionally active, giving rise to non-coding RNA molecules
- The percentage of conserved elements that overlap a known coding region rises steadily from 14% to 34.7% as the length criterion defining these elements is reduced from 200 bp to 50 bp

Ultra-conserved Elements in the Human Genome
- Bejerano et al. (2004) Science, 304 (5675): 1321-5 (481 UCEs)
- Reneker et al. (2012) PNAS

- The UCEs exhibit almost no natural variation in the human population
- Widely distributed in the genome
- On all chromosomes except chromosomes 21 and Y
- Often found in clusters

- About 1.2% of the human genome appears to code for protein
- As much as 5% is more conserved than expected from neutral evolution and hence may be under negative (purifying) selection
- Specific non-coding segments in the human genome that appear to be under selection have been identified using a threshold for conservation of 70% or 80% identity with mouse over more than 100 bp
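An identity threshold of this kind can be sketched as a sliding-window scan over a pairwise alignment. This is a minimal illustration under stated assumptions: the two sequences are already aligned to equal length with "-" marking gaps, gapped columns count as mismatches, and the function name `conserved_windows` is hypothetical.

```python
def conserved_windows(human, mouse, window=100, min_identity=0.7):
    """Start positions of alignment windows whose identity meets the threshold.

    human, mouse: aligned sequences of equal length, with '-' for gaps.
    """
    if len(human) != len(mouse):
        raise ValueError("aligned sequences must have equal length")
    hits = []
    for start in range(0, len(human) - window + 1):
        pairs = zip(human[start:start + window], mouse[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        if matches / window >= min_identity:
            hits.append(start)
    return hits


# Toy alignment with a 5-column window (real scans would use 100 bp or more).
print(conserved_windows("AAAAAAAAAA", "AAAAAAAATT", window=5, min_identity=0.8))
```

Ultra-conserved elements are the extreme case of this scan, where the identity threshold is raised to 100% and the minimum length to 200 bp.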

Evolutionary issues

Evolutionary importance
Perfect conservation of these long stretches of DNA

- UCEs appear to have experienced strong negative selection for 300-400 million years
- The probability of finding UCEs by chance (under neutral evolution) has been estimated at less than 10^-22 in 2.9 billion bases
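A back-of-the-envelope version of this kind of estimate can be sketched as follows. It assumes that each aligned site is identical across the compared species independently with some probability p, so a run of L identical sites has probability p^L, and the expected number of such runs among G starting positions is roughly G * p^L. The value p = 0.7 below is a purely illustrative assumption, and this simple calculation is not claimed to reproduce the model behind the published figure.

```python
import math


def expected_chance_elements(p_identical, element_len, genome_size):
    """Rough expected number of runs of element_len identical sites
    arising by chance, assuming independent sites (computed in log space)."""
    log10_expected = element_len * math.log10(p_identical) + math.log10(genome_size)
    return 10 ** log10_expected


# Hypothetical per-site identity probability; 200 bp elements; 2.9 billion bases.
print(expected_chance_elements(0.7, 200, 2.9e9))
```

Because the probability shrinks geometrically with element length, even a fairly high per-site identity makes a 200 bp perfect match vanishingly unlikely under neutral evolution.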

UCEs in Human Genome


Bejerano et al., 2004: 481 segments
- Longer than 200 bp and absolutely conserved (100% identity with no insertions or deletions) between orthologous regions of the human, rat, and mouse genomes
- Nearly all of these segments are also conserved in the chicken and dog genomes, with an average of 95% and 99% identity, respectively
- Many are also significantly conserved in fish
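The definition above (100% identity, no insertions or deletions, at least 200 bp) translates into a search for long runs of gap-free, identical columns in a multiple alignment. The sketch below is a minimal illustration that assumes the orthologous sequences have already been aligned to equal length with "-" for gaps; it is not the procedure used in the original study, and the function name is hypothetical.

```python
def ultraconserved_runs(aligned, min_len=200):
    """Half-open (start, end) alignment intervals where every sequence
    has the same base and none has a gap, for at least min_len columns."""
    if len({len(s) for s in aligned}) != 1:
        raise ValueError("need one or more aligned sequences of equal length")
    seqs = [s.upper() for s in aligned]
    length = len(seqs[0])
    runs, start = [], None
    for i in range(length + 1):  # one extra step closes a run at the end
        conserved = (
            i < length
            and seqs[0][i] != "-"
            and all(s[i] == seqs[0][i] for s in seqs)
        )
        if conserved and start is None:
            start = i
        elif not conserved and start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    return runs


# Toy three-species alignment (hypothetical), with a short minimum length.
human = "ACGTACGTAA"
mouse = "ACGTTCGTAA"
rat = "ACGTACGTAA"
print(ultraconserved_runs([human, mouse, rat], min_len=4))
```

A single substitution or gap in any one species ends a run, which is why so few segments of 200 bp or more survive this filter.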

Ultra-Conserved Elements
- In addition to the UCEs, there are more than 5,000 sequences of over 100 bp that are absolutely conserved among the three sequenced mammals
- These elements are most often located either overlapping exons in genes involved in RNA processing, or in introns of or near genes involved in the regulation of transcription and development

A class of genetic elements


- whose functions and evolutionary origins are yet to be determined
- that are more highly conserved between these species than proteins
- that appear to be essential for the ontogeny of mammals and other vertebrates

How are they distributed?


Of the 481 UCEs:
- 111 overlap the mRNA of a known human protein-coding gene (including the UTR regions): partly exonic or exonic
  - Type I genes: 93 known genes (RNA binding proteins and splicing)
- 256 show no evidence of transcription from any matching EST or mRNA from any species: non-exonic
  - Type II genes: 255 known genes (DNA binding and regulation of expression)
- For the remaining 114, the evidence for transcription is inconclusive: possibly exonic

Distribution
- A hundred non-exonic elements are located in introns of known genes and the rest are intergenic
- The non-exonic elements tend to congregate in clusters near transcription factors and developmental genes
- The exonic and possibly exonic elements are more randomly distributed along the chromosomes

Functions?
- 493 ultra-conserved elements have been identified in the human genome
- A small number of those which are transcribed have been connected with human carcinomas and leukemias
- For example, TUC338 is strongly upregulated in human hepatocellular carcinoma cells
- A study comparing ultra-conserved elements between humans and Takifugu rubripes proposed an importance in vertebrate development
- Several ultra-conserved elements are located near transcriptional regulators or developmental genes
- Other possible functions include enhancer activity and splicing regulation

Importance of UCEs
- Evolutionary importance: under negative selection
- Possible biological role?
- The longest elements (779 bp, 770 bp, and 731 bp) all lie in the last three introns in the 3′ portion of POLA, the DNA polymerase alpha catalytic subunit gene on chromosome X, along with other shorter UCEs
- A similar-sized conserved region of 711 bp, formed by concatenation of uc.468 and uc.469 (separated by a single base), lies in the ~7 kb intergenic region between the 3′ end of POLA and its downstream neighbor, the ARX gene
- ARX is involved in CNS development and is associated with a host of X-linked Mendelian diseases, including epilepsy, mental retardation, autism, and cerebral malformations
- These elements may instead form a cluster of enhancers of ARX

Chromosome X
