
Genome Analysis

1000 Genomes: A Deep Catalog of Human Genetic Variation


1000 Genomes Project
 The 1000 Genomes Project is an international research consortium
formed to create the most detailed and medically useful picture to date of
human genetic variation.

 The project involves sequencing the genomes of approximately 1200
people from around the world and receives major support from the
Wellcome Trust Sanger Institute in Hinxton, England, the Beijing Genomics
Institute Shenzhen in China and the National Human Genome Research
Institute (NHGRI), part of the National Institutes of Health (NIH).

 Drawing on the expertise of multidisciplinary research teams, the 1000
Genomes Project will develop a new map of the human genome that will
provide a view of biomedically relevant DNA variations at a resolution
unmatched by current resources.

 As with other major human genome reference projects, data from the
1000 Genomes Project will be made swiftly available to the worldwide
scientific community through freely accessible public databases.
1000 Genomes Project Strategy
 The goal of the 1000 Genomes Project is to find most genetic variants that have
frequencies of at least 1% in the populations studied. This goal can be attained by
sequencing many individuals lightly.

 To sequence a person's genome, many copies of the DNA are broken into short
pieces and each piece is sequenced. The many copies of DNA mean that the DNA
pieces are more-or-less randomly distributed across the genome. The pieces are
then aligned to the reference sequence and joined together.

 To find the complete genomic sequence of one person with current sequencing
platforms requires sequencing that person's DNA the equivalent of about 28 times
(called 28X). If the amount of sequence done is only an average of once across the
genome (1X), then much of the sequence will be missed, because some genomic
locations will be covered by several pieces while others will have none.

 The deeper the sequencing coverage, the more of the genome will be covered at
least once. Also, people are diploid; the deeper the sequencing coverage, the more
likely that both chromosomes at a location will be included. In addition, deeper
coverage is particularly useful for detecting structural variants, and allows
sequencing errors to be corrected.
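
The effect of depth on coverage can be sketched numerically. The example below assumes reads land uniformly at random, so the number of reads over a position follows a Poisson distribution with mean equal to the depth; this is a textbook simplification, not the Project's own model, and the function names are illustrative.

```python
import math

def fraction_covered(depth, min_reads=1):
    """Expected fraction of positions covered by at least `min_reads` reads.

    Assumes read counts per position are Poisson with mean `depth`.
    """
    p_fewer = sum(
        math.exp(-depth) * depth**k / math.factorial(k) for k in range(min_reads)
    )
    return 1.0 - p_fewer

def both_chromosomes_covered(depth):
    """Chance that both copies of a diploid position are each seen at least once.

    Assumes each copy independently receives half of the total depth.
    """
    return fraction_covered(depth / 2.0) ** 2

for depth in (1, 4, 28):
    print(
        f"{depth}X: covered at least once = {fraction_covered(depth):.4f}, "
        f"both chromosomes seen = {both_chromosomes_covered(depth):.4f}"
    )
```

Under these assumptions roughly a third of the genome is missed at 1X, while at 28X essentially every position is covered on both chromosomes.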
1000 Genomes Project Strategy

 Sequencing is still too expensive to deeply sequence the many samples being
studied for this project.

 However, any particular region of the genome generally contains a limited
number of haplotypes.

 Data can be combined across many samples to allow efficient detection of most
of the variants in a region.

 The Project currently plans to sequence each sample to about 4X coverage; at
this depth sequencing cannot provide the complete genotype of each sample, but
should allow the detection of most variants with frequencies as low as 1%.

 Combining the data from 2500 samples should allow highly accurate estimation
(imputation) of the variants and genotypes for each sample that were not seen
directly by the light sequencing.
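
A back-of-the-envelope calculation shows why light sequencing of many samples works. The sketch below assumes uniform coverage and no sequencing errors; the function names are illustrative and the model is a simplification, not the Project's variant-calling method.

```python
import math

def expected_variant_reads(n_samples, depth_per_sample, allele_freq):
    """Expected number of reads carrying a variant allele, pooled over all samples."""
    total_depth = n_samples * depth_per_sample  # combined depth at one position
    return total_depth * allele_freq

def p_het_allele_seen(depth_per_sample):
    """Chance that one heterozygous sample shows the variant allele in at least one read.

    Assumes the variant chromosome receives half of the sample's depth (Poisson).
    """
    return 1.0 - math.exp(-depth_per_sample / 2.0)

pooled = expected_variant_reads(n_samples=2500, depth_per_sample=4, allele_freq=0.01)
print(f"Expected reads carrying a 1% variant across 2500 samples: about {pooled:.0f}")
print(f"Chance a single 4X carrier shows the variant: {p_het_allele_seen(4):.2f}")
```

A 1% variant is therefore expected in about 100 reads across the pooled data, even though any single carrier sequenced at 4X may show no reads of the variant allele at all.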
1000 Genomes Project Timelines

 January 22, 2008: International Consortium Announces the 1000 Genomes
Project

 June 21, 2010: 1000 Genomes Project Releases Data from Pilot Projects on Path
to Providing Database for 2,500 Human Genomes (2500 unidentified people from
about 25 populations around the world)

 December 16, 2010: Sequencing of 629 individuals completed

 October 12, 2011: An integrated set of variant calls and phased genotypes
including SNPs, short indels and deletions based on low coverage and exome
sequencing data across 1092 individuals.

For details see: http://www.internationalgenome.org/
The 1000 Genomes Project Publications

The main publications from the 1000 Genomes Project are the final
publications from phase 3 of the project, which were published
in Nature in October 2015.
“A global reference for human genetic variation” Nature 526 68-74
2015
“An integrated map of structural variation in 2,504 human
genomes” Nature 526 75-81 2015

The Consortium also produced publications from the earlier data
phases of the project, which were the initial pilot and phase 1 of the
main project. No equivalent paper was produced for phase 2, which
focused on technical development work.
“An integrated map of genetic variation from 1,092 human
genomes” Nature 491 56-65 2012

“A map of human genome variation from population-scale
sequencing” Nature 467 1061-1073 2010
GenomeAsia100K Consortium

https://genomeasia100k.org/

Nature. 2019 Dec;576(7785):106-111. doi: 10.1038/s41586-019-1793-z.
Epub 2019 Dec 4.
The GenomeAsia 100K Project enables genetic discoveries across
Asia.

Abstract
The underrepresentation of non-Europeans in human genetic studies so
far has limited the diversity of individuals in genomic datasets and led to
reduced medical relevance for a large proportion of the world's
population. Population-specific reference genome datasets as well as
genome-wide association studies in diverse populations are needed to
address this issue. Here we describe the pilot phase of the GenomeAsia
100K Project. This includes a whole-genome sequencing reference
dataset from 1,739 individuals of 219 population groups and 64 countries
across Asia. We catalogue genetic variation, population structure,
disease associations and founder effects. We also explore the use of this
dataset in imputation, to facilitate genetic studies in populations across
Asia and worldwide.
Genome Analysis Tools
Two Broad Genomics Research Areas
Functional Genomics Study Techniques

How to measure the pattern of gene expression in a given
tissue over a period of time?

1. Northern Blot

2. In situ hybridization

3. RT-PCR

4. Gene Chip/ DNA Microarray


RT-PCR (Real-time Reverse Transcription PCR)
• Used for amplifying a defined piece of mRNA molecule.

• Traditionally RT-PCR involves two steps:

(i) RT reaction
(ii) PCR amplification.

RNA is first reverse transcribed into cDNA (complementary DNA).
So, first step of RT PCR is:
• Isolation of mRNA from the cell

• Next, make cDNA from the mRNA

• This is reversing “transcription”, so use an enzyme originally
obtained from viruses: REVERSE TRANSCRIPTASE

RT efficiency :
Random hexamer primers > poly-dT primer > gene-specific primers
2nd step of RT PCR:

• The resulting cDNA is used as the template for subsequent PCR
amplification using primers for one or more genes.

• RT-PCR can also be carried out as one-step RT-PCR, in which all
reaction components are mixed in one tube prior to starting the
reaction (would require hot-start Taq)
RT– PCR

• Application:
– a highly sensitive detection technique: low copy number or
less abundant mRNA molecules can be detected.
Used in gene expression studies.
Real Time PCR
• Real time PCR was developed because of the need to quantitate differences in
mRNA expression.

• Conventional PCR does not yield truly quantitative data because of the difficulties of
observing the reaction during the truly linear part of the amplification process.

• Particularly valuable when amounts of RNA are low (e.g. small amounts of
tissue; primary cells)

• SYBR Green is a dye which binds to double-stranded DNA but not to single-stranded
DNA and is frequently used to monitor the synthesis of DNA during real-time PCR
reactions.
Real Time PCR
• kinetic approach
• early stages
• while still linear

[Figure: optical detection system of a real-time PCR instrument (source: www.biorad.com). Labeled parts: 1. halogen tungsten lamp; 2a. excitation filters; 2b. emission filters; 3. intensifier; 4. sample plate; 5. CCD detector (350,000 pixels)]
Real Time PCR

So, how to measure differences in concentration of DNA or cDNA?
This graph shows a series of 10-fold dilutions
of a sample.

As one dilutes the sample, it takes more cycles before the amplification
is detectable.

Samples which differ by a factor of 2 would be expected to be 1 cycle apart.

Samples that differ by 10-fold would be ~3.3 cycles apart.
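
The cycle spacing follows from the product doubling each cycle, so a fold difference F shifts the threshold cycle by log2(F). A minimal sketch, assuming ideal amplification by default; the `efficiency` parameter is an illustrative addition, not something given on the slide.

```python
import math

def cycle_shift(fold_difference, efficiency=1.0):
    """Number of cycles separating two samples that differ by `fold_difference`.

    `efficiency` = 1.0 means the product exactly doubles every cycle.
    """
    return math.log(fold_difference) / math.log(1.0 + efficiency)

for fold in (2, 10, 100):
    print(f"{fold}-fold difference: {cycle_shift(fold):.2f} cycles apart")
```

With perfect doubling, a 2-fold difference gives 1 cycle and a 10-fold difference gives about 3.3 cycles; a lower efficiency spreads the dilution curves further apart.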

Note: If the plateau values are 4000 to 15000, a threshold of 300
usually works well.

Same data plotted on a logarithmic scale. It is easy to get the Ct values
from this plot.
Relative Expression= 2^(- ΔΔCt )

Condition Mouse Gene A Actin ΔCt ΔΔCt Rel Expression
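
The columns above can be filled in step by step: ΔCt is the target Ct minus the reference (actin) Ct, and ΔΔCt is the ΔCt of the test condition minus the ΔCt of the control condition. The Ct values in this sketch are invented for illustration only, and the calculation assumes the product doubles every cycle.

```python
def relative_expression(ct_target_test, ct_ref_test, ct_target_ctrl, ct_ref_ctrl):
    """Relative expression by the 2^(-ΔΔCt) method."""
    d_ct_test = ct_target_test - ct_ref_test  # ΔCt, test condition
    d_ct_ctrl = ct_target_ctrl - ct_ref_ctrl  # ΔCt, control condition
    dd_ct = d_ct_test - d_ct_ctrl             # ΔΔCt
    return 2.0 ** (-dd_ct)

# Hypothetical Ct values: Gene A and actin in a treated and a control mouse.
fold = relative_expression(
    ct_target_test=24.0, ct_ref_test=18.0,
    ct_target_ctrl=26.0, ct_ref_ctrl=18.0,
)
print(f"Gene A expression, treated relative to control: {fold:.1f}-fold")
```

Here Gene A crosses the threshold two cycles earlier in the treated sample while actin is unchanged, so ΔΔCt = -2 and the relative expression is 4-fold.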


How do you generate accurate q-PCR data?

(i) Good quality RNA

(ii) No genomic DNA contamination: DNase I treatment, primer design
strategy, no-reverse-transcriptase control

(iii) Ensuring no non-specific amplification: gel electrophoresis

Semi-quantitative RT-PCR


How do you generate accurate q-PCR data?
More than one internal control is better

Real-time PCR was carried out using the DyNAmo™ HS SYBR® Green qPCR Kit
(Finnzymes, USA) and following Hmgcr gene specific primers. For normalization of Hmgcr
expression, GAPDH and 18S rRNA abundances were measured using the following primer
pairs. The relative gene expression levels were determined by calculating the 2^(-ΔΔCt) values.

Sonawane et al. 2011. Functional Promoter Polymorphisms Govern Differential Expression of HMG-
CoA Reductase Gene in Mouse Models of Essential Hypertension PLoS ONE. 6(1): e16661.
doi:10.1371/journal.pone.0016661.
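
One common way to use more than one internal control is to subtract the mean Ct of the reference genes from the target Ct before computing ΔΔCt. The sketch below illustrates that idea only; it is not taken from the cited paper, the Ct values are invented, and averaging Ct values assumes each reference amplifies with perfect doubling.

```python
from statistics import mean

def delta_ct(ct_target, ct_references):
    """ΔCt of a target gene against the mean Ct of several reference genes."""
    return ct_target - mean(ct_references)

def relative_expression_multi(test, control):
    """2^(-ΔΔCt) where each condition is a dict with 'target' and 'refs' Ct values."""
    dd_ct = delta_ct(test["target"], test["refs"]) - delta_ct(
        control["target"], control["refs"]
    )
    return 2.0 ** (-dd_ct)

# Hypothetical Ct values; refs are listed in the order [GAPDH, 18S rRNA].
test = {"target": 25.0, "refs": [18.0, 10.0]}
control = {"target": 27.0, "refs": [18.5, 9.5]}
print(f"Relative expression: {relative_expression_multi(test, control):.1f}-fold")
```

Averaging the controls keeps a small shift in any single reference gene from dominating the normalization.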
GeneChip Technology

Miniaturized, high density arrays

Expression arrays: 1,300,000 DNA oligos, 1 cm by 1 cm

DNA mapping array: 7,000,000 DNA oligos, 1.3 cm by 1.3 cm

Manufacturing Process

Solid-phase chemical synthesis and photolithographic fabrication
techniques employed in semiconductor industry
DNA Microarrays
Photolithographic Synthesis

Manufacturing Process
Probe arrays are manufactured by a light-directed chemical
synthesis process, which enables the synthesis of hundreds of
thousands of discrete compounds in precise locations.

[Figure labels: lamp, mask, chip]

Computer algorithms are used to design photolithographic masks for use in
manufacturing.
Affymetrix Wafer and Chip Format

[Figure: wafer and chip format. Labels: 20 - 50 µm by 20 - 50 µm “pixel”, one oligonucleotide sequence per “pixel”; 1.0 cm; up to ~1.3 million features/chip; 49 - 400 chips/wafer]
Selection of Expression Probes

[Figure labels: 3’ end, sequence, probes]

• Set of oligos to be synthesized is defined from sense sequence of known
genes and ESTs

• Each gene is represented on the probe array by multiple probe pairs

• Each probe pair consists of a perfect match and a mismatch oligonucleotide


Overview: Creating Targets
mRNA → (reverse transcriptase) → cDNA → (in vitro transcription) → cRNA → fragmentation of cRNA

GeneChip Hybridization
mRNA → cRNA → fragmentation of biotinylated cRNA
Fragmentation: metal-mediated, alkali-induced hydrolysis
Hybridization and Staining

Array + fragmented cRNA target → RNA:DNA hybridized array → staining with streptavidin phycoerythrin [fluorescent dye]
Instrumentation
Affymetrix GeneChip System
3000-7G Scanner
450 Fluidic Station
640 Hybridization Oven
Currently Available GeneChips

Expression Arrays
B. subtilis                    Plasmodium Genome Array
Barley Genome Array            Porcine Genome Array
Bovine Genome Array            Rat Genome Arrays
C. elegans Genome Array        Rice Genome Array
Canine Genome Array            Soybean Genome Array
Chicken Genome Array           Sugar Cane Genome Array
Drosophila Genome Arrays       Vitis vinifera (Grape) Array
E. coli Genome Arrays          Wheat Genome Array
Human Genome Arrays            Xenopus laevis Genome Array
Maize Genome Array             Yeast Genome Arrays
Mouse Genome Arrays            Zebrafish Genome Array
P. aeruginosa Genome Array     Arabidopsis Genome Arrays
Hybridization of fluorescently labeled cDNA preparations to DNA microarrays

This technique is useful for analyzing gene expression patterns on a genomic scale
Data Analysis

Absolute Analysis: whether transcripts are present or not (uses data
from one probe array experiment).

Comparison Analysis: determine the relative change in transcripts (uses
data from two probe array experiments).

Intensities for each experiment are compared to a baseline/control.
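
The two kinds of analysis can be illustrated with the perfect match (PM) and mismatch (MM) intensities of one probe set. The sketch below is a deliberately simplified stand-in, not the Affymetrix detection or comparison algorithm: signal is taken as the mean PM minus MM difference, "present" as the fraction of pairs with PM above MM, and all intensities are invented.

```python
import math

def average_difference(pm, mm):
    """Mean PM - MM intensity over the probe pairs of one probe set."""
    return sum(p - m for p, m in zip(pm, mm)) / len(pm)

def present_fraction(pm, mm):
    """Fraction of probe pairs where the perfect match is brighter than the mismatch."""
    return sum(p > m for p, m in zip(pm, mm)) / len(pm)

def log2_fold_change(signal_experiment, signal_baseline):
    """Relative change of the experiment against the baseline, on a log2 scale."""
    return math.log2(signal_experiment / signal_baseline)

# Hypothetical intensities for one gene on a baseline array and an experiment array.
pm_base, mm_base = [200, 220, 180, 210], [100, 120, 80, 110]
pm_exp, mm_exp = [500, 520, 480, 510], [100, 120, 80, 110]

base = average_difference(pm_base, mm_base)
exp = average_difference(pm_exp, mm_exp)
print(f"Absolute analysis: {present_fraction(pm_exp, mm_exp):.0%} of pairs have PM > MM")
print(f"Comparison analysis: log2 fold change = {log2_fold_change(exp, base):.1f}")
```

The absolute step uses a single array, while the comparison step needs both the experiment and the baseline array, matching the distinction drawn above.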
Validation of Gene Chip data
Genomics Tools
Phage Display

This is a very powerful genomics tool to discover interactions of a protein
with an immobilized target (e.g. a likely disease susceptibility molecule:
an enzyme, a receptor)
cDNA library

• Accurate and complete representation of all mRNA sequences expressed
in a cell, tissue, or organism.

• Facilitates analysis of sequences when only interested in the mRNA and
the protein it encodes.

• Since many eukaryotic genes have introns, analysis of the mRNA
simplifies deciphering the coding regions of a gene.

• The encoded protein and its function can be predicted with some
accuracy from sequence analysis of the cDNA, even without knowing in
advance what has been cloned.
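
Because a cDNA has no introns, its coding region can be read directly. The sketch below translates from the first ATG to the first in-frame stop codon using the standard genetic code; taking the first ATG as the start is a simplifying assumption, and the example sequence is invented.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate_first_orf(cdna):
    """Translate from the first ATG up to (not including) the first in-frame stop."""
    seq = cdna.upper()
    start = seq.find("ATG")
    if start == -1:
        return ""  # no start codon found
    protein = []
    for i in range(start, len(seq) - 2, 3):
        amino_acid = CODON_TABLE.get(seq[i:i + 3], "X")  # 'X' for ambiguous codons
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate_first_orf("GGATGGCCAAATGACC"))  # hypothetical cDNA fragment
```

The predicted peptide could then be compared against protein databases to suggest a function.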
How to clone cDNA:
• cDNA has blunt ends, so restriction site linkers must be added to make
the ends “sticky”.

• Use T4 DNA ligase and blunt-end ligation to add restriction site
linkers to each end of the cDNA.

• Next, digest the linkers with the same restriction enzyme used to
cleave the vector.

• Mix cDNA with cut vector DNA in the presence of DNA ligase.

• If the cDNA contains an internal site for the same restriction enzyme
as the linkers, the cDNA will be cut and cloned in pieces. Solution: use
adapters with single-stranded overhangs that match the restriction site
on the vector.
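
Whether linkers are safe to use can be checked by scanning the cDNA for internal copies of the restriction site. This sketch uses the BamHI site GGATCC, which the enzyme cuts after the first G; the example sequence and function names are hypothetical.

```python
BAMHI_SITE = "GGATCC"  # BamHI recognition sequence, cut as G^GATCC

def internal_sites(cdna, site=BAMHI_SITE):
    """Return the 0-based start positions of every copy of `site` in the cDNA."""
    seq = cdna.upper()
    positions = []
    i = seq.find(site)
    while i != -1:
        positions.append(i)
        i = seq.find(site, i + 1)
    return positions

def fragments_after_digest(cdna, site=BAMHI_SITE, cut_offset=1):
    """Top-strand pieces the cDNA would be cut into by the linker enzyme."""
    seq = cdna.upper()
    cuts = [p + cut_offset for p in internal_sites(seq, site)]
    bounds = [0] + cuts + [len(seq)]
    return [seq[a:b] for a, b in zip(bounds, bounds[1:])]

cdna = "ATGAAAGGATCCTTTTAA"  # hypothetical cDNA with one internal BamHI site
if internal_sites(cdna):
    print("Internal site found; linkers would give pieces:", fragments_after_digest(cdna))
else:
    print("No internal site; linkers are safe")
```

If any internal site is reported, adapters with a pre-formed overhang avoid the digestion step altogether.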
Linker and adapter

Cloning of cDNA using BamHI linkers

Alternative: use adapter

5’-GATCCAGAC-3’
    3’-GTCTG-5’
Cloning cDNA in Bacteriophages
Plaque formation on a lawn of bacterial cells
Amplification of Phage Libraries

• The primary library consists of individual recombinant phage
particles.

• If the sequence of interest has not been found, more recombinant DNA
will have to be produced and packaged: Amplification.

• Amplification of the library is achieved by plating the packaged phage
on a suitable E. coli strain (e.g. BLT 5616), and then resuspending the
plaques by gently washing the plates with a buffer solution. The
resulting phage suspension can be stored almost indefinitely and will
provide enough material for many screening and isolation procedures.
Amplification of Libraries

Disadvantages
1. Some recombinant phage may be lost, perhaps due to the presence of
repetitive sequences in the insert giving rise to recombinational
instability. This can be minimized by plating on a
recombination-deficient host.

2. Some phage may exhibit differential growth characteristics, which may
cause a particular phage to be over-represented or under-represented in
the amplified library. This may mean a greater number of plaques have to
be screened to isolate the desired gene.
Phage Display

Bound phage is eluted and amplified. Single plaques are isolated, followed
by PCR amplification of phage DNA.

Identification
Biopanning
Phage Display

8: Identification

Sequencing of the phage DNA identifies the candidate protein/peptide
