by N. AdithiSridhar, Bioinformatics.
Before the HGP
The Human Genome
• Genome: the total genetic information in an organism
• 23 pairs of chromosomes (haploid number = 23, diploid number = 46)
• ~30,000 genes
• ~3 billion bases
• The genome sequence alone would require about 3 gigabytes of computer storage space
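The storage figure can be checked with a little arithmetic. The sketch below assumes one byte per base for plain text and, as an alternative, two bits per base for a packed encoding of A, C, G and T; the function name `storage_bytes` is only illustrative.

```python
# Rough storage estimate for one copy of the human genome sequence.
BASES = 3_000_000_000  # ~3 billion bases, as stated on the slide


def storage_bytes(n_bases: int, bits_per_base: int) -> float:
    """Bytes needed to store n_bases at a fixed number of bits per base."""
    return n_bases * bits_per_base / 8


# Plain text: one 8-bit character per base.
print(storage_bytes(BASES, 8) / 1e9, "GB as plain text")
# Packed: four possible bases fit into 2 bits each.
print(storage_bytes(BASES, 2) / 1e9, "GB when packed at 2 bits per base")
```

At one byte per base the estimate matches the roughly 3 gigabytes quoted above; a packed encoding would need about a quarter of that.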
Single nucleotide polymorphisms (SNPs)
History of the Human Genome Project (HGP)
The US Department of Energy (DOE) and the Human Genome
1983 - National Laboratories of the DOE begin producing libraries of human chromosomes
1988 - DOE and US National Institutes of Health (NIH) sign a memorandum of understanding outlining their cooperative effort in genome research
History of the HGP
1988 - HUGO (Human Genome Organization) founded by genome scientists
1989 - DOE and NIH establish a working group to study the Ethical, Legal and Social Implications (ELSI) of the HGP
1990 - DOE and NIH present a 5-year HGP plan to the US Congress. This marks the beginning of the 15-year project
• Management of the public consortium of the Human Genome Project was entrusted to James Watson, who led it until 1992; Francis Collins took over in 1993
• 20 sequencing centres in the United States, United Kingdom, France, Germany, Japan and China
• In 1998 the private company Celera Genomics, part of the PE Corporation, publicly announced that it would attempt to sequence the whole human genome. Its president, the molecular biologist Craig Venter, had previously worked in the public consortium of the Human Genome Project
• The most important of these was The Institute for Genomic Research (TIGR), founded by Craig Venter in 1992 after he resigned as a researcher at the NIH
• 2000 - Craig Venter of Celera and Francis Collins of NIH (representing the HGP) jointly announce the completion of a “working draft” of the human genome
• First draft published in Science and Nature in February 2001
Human Genome Project
Project goals are to
• Identify all of the approximately 30,000 genes in human DNA,
• Determine the sequences of the 3 billion chemical base pairs that make up human DNA,
• Store this information in databases,
• Improve tools for data analysis,
• Transfer related technologies to the private sector, and
• Address the ethical, legal, and social issues (ELSI) that may arise from the project
During the HGP
Comparing the Human Genome with other Genomes
Gene numbers of different species
➲ Humans: 31,000
➲ Thale cress: 26,000
➲ Nematode worm: 18,000
➲ Fruit fly: 13,000
➲ Yeast: 6,000
➲ Tuberculosis microbe: 4,000
Comparing the Human Genome with that of Mus musculus (mouse)
• The human genome has about 400 million more nucleotides than the mouse genome.
• Humans and mice diverged genetically about 75 million years ago.
• The human and mouse genomes both have approximately 30,000 genes, and about 99% of them have a counterpart in the other species.
• Only about three hundred genes are unique to either organism.
Comparing the Human Genome with that of Pan troglodytes (chimpanzee)
Humans and chimps diverged from a common ancestor only about 5 million years ago. Preliminary sequence comparisons indicate that chimp DNA is 98.7% identical with human DNA. If just the gene sequences encoding proteins are considered, the similarity increases to 99.2%.
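A percentage such as 98.7% is, at its simplest, the share of matching positions between two aligned sequences. The sketch below assumes the sequences are already aligned and of equal length; the two short sequences are invented for illustration and are not real human or chimp DNA.

```python
def percent_identity(seq_a: str, seq_b: str) -> float:
    """Percentage of positions at which two aligned sequences carry the same base."""
    if len(seq_a) != len(seq_b) or not seq_a:
        raise ValueError("sequences must be aligned, non-empty and of equal length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)


# Hypothetical toy sequences that differ at one position out of ten.
species_1 = "ATGGCCCTGA"
species_2 = "ATGGCACTGA"
print(percent_identity(species_1, species_2))  # 90.0
```

Real genome comparisons also have to handle insertions, deletions and rearrangements, which this position-by-position count ignores.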
How could two species differ so much in body and behavior, and yet have almost equivalent sets of genes?
• Observations reveal that chimp and human genomes show very different patterns of gene transcription activity, at least in brain cells.
• Humans have one fewer pair of chromosomes than chimpanzees, gorillas, and orangutans.
• At some point in time, two mid-sized ape chromosomes fused to make what is now human chromosome 2, the second largest chromosome in our genome.
How the human genome was sequenced
The genome is digested: the DNA is cut into segments with a range of different restriction enzymes. Since each restriction enzyme cuts the DNA at different points, the genome is broken up in such a way that there is a degree of overlap between adjacent DNA segments, a fundamental requirement for determining the complete sequence. The DNA fragments are then incorporated into living cells, such as bacteria or yeast, which store the fragments and make more copies as the cells reproduce.
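The digestion step can be sketched in a few lines. The example assumes two well-known enzymes, EcoRI (recognition site GAATTC) and HindIII (AAGCTT), each cutting after the first base of its site on the strand shown; the input sequence is invented, and only one strand is modelled.

```python
# Recognition site and cut offset (bases from the start of the site) per enzyme.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def digest(sequence: str, enzyme: str) -> list:
    """Cut a single-stranded toy sequence at every recognition site of one enzyme."""
    site, offset = ENZYMES[enzyme]
    fragments, start = [], 0
    position = sequence.find(site)
    while position != -1:
        cut = position + offset
        fragments.append(sequence[start:cut])
        start = cut
        position = sequence.find(site, position + 1)
    fragments.append(sequence[start:])  # remainder after the last cut
    return fragments


dna = "CCGAATTCTTAAGCTTGG"  # hypothetical sequence with one site for each enzyme
print(digest(dna, "EcoRI"))    # ['CCG', 'AATTCTTAAGCTTGG']
print(digest(dna, "HindIII"))  # ['CCGAATTCTTA', 'AGCTTGG']
```

Because the two enzymes cut at different positions, the fragments from one digest span the cut points of the other, which is the overlap the text describes.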
Organizing mapped large clone contigs
Contig: a joined collection of overlapping clones or sequences
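How overlapping pieces join into a contig can be illustrated with a greedy merge. This is a minimal sketch, not the procedure used by the sequencing centres: it assumes error-free fragments from one strand, and the names `overlap` and `merge_contig` and the minimum overlap of 3 bases are choices made for the example.

```python
def overlap(left: str, right: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for n in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:n]):
            return n
    return 0


def merge_contig(fragments, min_len: int = 3) -> list:
    """Repeatedly merge the pair of fragments with the largest overlap."""
    pieces = list(fragments)
    while len(pieces) > 1:
        best_n, best_i, best_j = 0, None, None
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_n:
                        best_n, best_i, best_j = n, i, j
        if best_n == 0:
            break  # nothing overlaps any more; several contigs remain
        merged = pieces[best_i] + pieces[best_j][best_n:]
        pieces = [p for k, p in enumerate(pieces) if k not in (best_i, best_j)]
        pieces.append(merged)
    return pieces


print(merge_contig(["ATGCGT", "CGTACC", "ACCTGA"]))  # ['ATGCGTACCTGA']
```

Greedy merging can be misled by repeated sequence, which is one reason real assemblies rely on mapped clones as a scaffold.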
• Law of independent assortment: genes are transmitted from parents to offspring independently of one another.
• Genes that are located on the same chromosome are an exception and are described as linked genes.
• Linked genes are not always transmitted as a block, because of the phenomenon of recombination.
• Crossing over causes reshuffling of genes.
• An example of genetic linkage
The three loci are closely linked, i.e. they are situated very close to one another on the same chromosome. At each locus there are three possible genotypes:
C locus: CC, Cc, cc
D locus: DD, Dd, dd
E locus: EE, Ee, ee
• The particular combination of alleles carried together on one chromosome is called a haplotype
• Let us assume that a particular parent has the genotype Cc, Dd, Ee, carried as the two haplotypes C d e and c D E
Inheritance of linked genes: unaltered haplotypes
Offspring receive C d e or c D E
Inheritance of linked genes: recombinant haplotypes
Offspring receive C D E or c d e
• The frequency of recombination is directly proportional to the distance between two genes.
• One centimorgan (cM) is the distance between two genes at which recombination occurs with a frequency of 1%.
• Measures of genetic distance based on recombination frequencies are accurate only if the genes are closely linked, i.e. if the distance between them is small.
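The definition of the centimorgan translates directly into a calculation. The sketch below assumes that recombinant and total offspring have already been counted in a cross; the counts in the example are invented.

```python
def map_distance_cm(recombinant_offspring: int, total_offspring: int) -> float:
    """Recombination frequency in percent, read as centimorgans for closely linked genes."""
    if total_offspring <= 0:
        raise ValueError("total_offspring must be positive")
    return 100.0 * recombinant_offspring / total_offspring


# Hypothetical cross: 10 recombinant offspring out of 1000 scored.
print(map_distance_cm(10, 1000))  # 1.0
```

For genes that are far apart, multiple crossovers make this simple percentage underestimate the true distance, which is the caveat in the last bullet above.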
CONSTRUCTING A GENE MAP
• Studies of genetic linkage and recombination frequencies have been used to create gene maps
• Drosophila is very suitable for such study because it has very prominent, easily studied characteristics, has a short life cycle and produces hundreds of offspring
• In an experiment investigating two characteristics A and B it was found that their recombinant frequency was 1.0%. In further experiments it was found that the recombinant frequency for characteristics A and C was 0.6% and the recombinant frequency between B and C was 0.4%.
• A genetic map of the three genes responsible for these characteristics may be constructed as follows. The values are in centimorgans.
A --- 0.6 --- C --- 0.4 --- B (A to B: 1.0)
Further recombinant studies can be performed to estimate the gene distances between gene C and other genes D and E, and then between B and other genes, E,F, G and so on. In this way a larger and more detailed map is gradually constructed. This approach has been used to construct an extensive and detailed gene map of Drosophila
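The reasoning behind the three-gene map can be written out as a small routine: the pair with the largest recombination frequency lies at the two ends, and the remaining gene sits between them. This is a sketch for exactly three genes using the frequencies quoted for A, B and C; the function name is hypothetical.

```python
def order_three_genes(frequencies: dict) -> list:
    """Order three linked genes from their pairwise recombination frequencies (%).

    `frequencies` maps a frozenset of two gene names to a frequency.
    """
    genes = sorted({gene for pair in frequencies for gene in pair})
    if len(genes) != 3 or len(frequencies) != 3:
        raise ValueError("expected exactly three genes and three pairwise values")
    outer_pair = max(frequencies, key=frequencies.get)  # the two genes furthest apart
    middle = next(gene for gene in genes if gene not in outer_pair)
    left, right = sorted(outer_pair)
    return [left, middle, right]


observed = {
    frozenset("AB"): 1.0,
    frozenset("AC"): 0.6,
    frozenset("BC"): 0.4,
}
print(order_three_genes(observed))  # ['A', 'C', 'B']
```

The two shorter distances add up to the longest one (0.6 + 0.4 = 1.0), which is what one expects when the genes are closely linked.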
Techniques:
• Restriction Fragment Length Polymorphisms
• Molecular hybridization
Molecular hybridization
1. Single-stranded DNA is generated.
2. A probe is a known sequence of part of the gene to be identified, tagged with a radioactive or fluorescent label. Specific probes are synthesised in the laboratory.
3. The probe hybridizes only to the fragment with the complementary sequence. This is detected by the signal from the label.
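The hybridization steps above can be mimicked in code: a probe pairs with the stretch of single-stranded DNA that is its reverse complement. The sketch assumes perfect base pairing with no mismatches, and the target and probe sequences are invented.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence: str) -> str:
    """Sequence of the strand that base-pairs with `sequence`, read 5' to 3'."""
    return sequence.translate(COMPLEMENT)[::-1]


def probe_binding_sites(target_strand: str, probe: str) -> list:
    """Start positions on the target strand where the probe can hybridize exactly."""
    site = reverse_complement(probe)
    positions = []
    position = target_strand.find(site)
    while position != -1:
        positions.append(position)
        position = target_strand.find(site, position + 1)
    return positions


target = "TTGACCGTAAGG"  # hypothetical single-stranded fragment
probe = "TACGGT"         # hypothetical labelled probe
print(reverse_complement(probe))           # ACCGTA
print(probe_binding_sites(target, probe))  # [3]
```

In the laboratory, the label on the bound probe marks the fragment that contains this site; fragments without the site give no signal.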
RESTRICTION FRAGMENT LENGTH POLYMORPHISMS
• A disease-causing gene could be mapped by linkage and recombination studies with other known genes. However, informative families for such studies are rare.
• Numerous markers have been identified throughout the genome using restriction endonucleases, so it is possible to construct maps of disease genes in relation to closely linked markers.
• A particular restriction endonuclease recognises a specific nucleotide sequence in DNA and cleaves it.
[Figure: Hind III restriction fragment patterns for two individuals; fragment lengths legible in the original include 16, 5 and 4 units]
An example of Restriction Fragment Length Polymorphism generated by Hind III. A, B, C, D, E and F indicate the sites where DNA is cleaved. The second individual lacks the restriction site at C and gives a different pattern of fragments from individual 1.
• If the sequence were missing at site C, there would be 4 fragments of lengths 16, 5, 2 and 8 units. This variation is referred to as a restriction fragment length polymorphism (RFLP).
• Using a large number of restriction endonucleases, it is likely that one finds one or more RFLPs close to the gene of interest. • Such RFLPs are then used as markers for linkage studies with known genes • Linkage studies have been one of the most important tools for gene mapping • If more than one marker is used the accuracy of the procedure is further increased.
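The effect of a missing restriction site on the fragment pattern can be sketched numerically. The cut-site positions below are hypothetical and are not the values from the Hind III figure; they only show how losing one site fuses two neighbouring fragments.

```python
def fragment_lengths(cut_sites) -> list:
    """Lengths of the fragments between consecutive cut sites."""
    sites = sorted(cut_sites)
    return [right - left for left, right in zip(sites, sites[1:])]


# Hypothetical positions of six cleavage sites along a stretch of DNA.
individual_1 = [0, 10, 14, 20, 23, 30]
# Individual 2 lacks the third site (position 14).
individual_2 = [0, 10, 20, 23, 30]

print(fragment_lengths(individual_1))  # [10, 4, 6, 3, 7]
print(fragment_lengths(individual_2))  # [10, 10, 3, 7]
```

The two fragments that flanked the lost site (4 and 6 units here) appear as a single 10-unit fragment, and this difference in pattern is what serves as a marker.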
Finding genes by UCSC Genome Browser
From early maps ...
... to a multi-resolution view ...
... at the gene cluster level ...
... the single gene level ...
Location and display of the human gene implicated in Fragile X syndrome.
• Dideoxyribose: ribose in which the hydroxyl group is missing from both the 2’ and the 3’ carbons
• Whenever a dideoxynucleotide is incorporated into a polynucleotide, the chain irreversibly stops, or terminates
• Four separate reactions, each incorporating a different dideoxynucleotide (ddA, ddC, ddG or ddT) along with the four deoxynucleotides, produce populations of fragments that all end in the same dideoxynucleotide
• A primer and a polymerase are needed
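The logic of chain termination can be sketched as follows. The example is idealised: it assumes that each of the four reactions yields a fragment ending at every occurrence of its dideoxynucleotide, ignores the primer, and works on the newly synthesised strand directly; the function names and the example sequence are invented.

```python
def terminated_lengths(synthesised_strand: str, dd_base: str) -> list:
    """Lengths of fragments that end wherever `dd_base` was incorporated."""
    return [i + 1 for i, base in enumerate(synthesised_strand) if base == dd_base]


def read_from_fragments(synthesised_strand: str) -> str:
    """Run four idealised reactions and read the sequence from fragment sizes."""
    reactions = {base: terminated_lengths(synthesised_strand, base) for base in "ACGT"}
    # Sorting by length mimics size separation on a gel; the reaction a
    # fragment came from tells us the base at that position.
    sized = sorted(
        (length, base) for base, lengths in reactions.items() for length in lengths
    )
    return "".join(base for _, base in sized)


strand = "GATTACA"  # hypothetical newly synthesised strand
print(terminated_lengths(strand, "A"))  # [2, 5, 7]
print(read_from_fragments(strand))      # GATTACA
```

Reading the fragments from shortest to longest recovers the order of bases, which is the principle behind the method.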
Collins vs. Venter
IHGSC and Celera
Hidden Markov model
• A Hidden Markov Model (HMM) system segments uncharacterized human genomic DNA into exons, introns, and intergenic regions.
• Three separate models were designed, one for each of the three types of human DNA (exons, introns, and intergenic regions).
• These models are tied together using biological knowledge about splice junctions.
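The segmentation idea can be illustrated with a toy HMM decoded by the Viterbi algorithm. This is a much simpler model than the system described above: each region type is a single state rather than a separate sub-model, splice junctions are not modelled, and every probability below is invented for the example.

```python
import math

STATES = ("intergenic", "exon", "intron")

# All probabilities are made-up illustrative values, not trained parameters.
START = {"intergenic": 0.8, "exon": 0.1, "intron": 0.1}
TRANSITION = {
    "intergenic": {"intergenic": 0.90, "exon": 0.10, "intron": 0.00},
    "exon": {"intergenic": 0.05, "exon": 0.85, "intron": 0.10},
    "intron": {"intergenic": 0.00, "exon": 0.10, "intron": 0.90},
}
EMISSION = {
    "intergenic": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "exon": {"A": 0.20, "C": 0.30, "G": 0.30, "T": 0.20},
    "intron": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
}


def safe_log(p: float) -> float:
    """Natural log that maps probability 0 to minus infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(sequence: str) -> list:
    """Most probable state path for a non-empty DNA sequence under the toy model."""
    score = {s: safe_log(START[s]) + safe_log(EMISSION[s][sequence[0]]) for s in STATES}
    backpointers = []
    for base in sequence[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            prev = max(STATES, key=lambda p: score[p] + safe_log(TRANSITION[p][s]))
            new_score[s] = (
                score[prev] + safe_log(TRANSITION[prev][s]) + safe_log(EMISSION[s][base])
            )
            pointers[s] = prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=score.get)
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


for base, label in zip("ATATGCGCGCGCATAT", viterbi("ATATGCGCGCGCATAT")):
    print(base, label)
```

Zero entries in the transition table forbid direct moves between intergenic and intron states, a crude stand-in for the biological constraints that tie the real models together.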
Expressed sequence tags
• ESTs are DNA sequences read from both ends of expressed gene fragments • The Merck-WashU EST Project and several other public EST projects are being performed to rapidly discover the complement of human genes, and make them easily accessible. • These ESTs are widely used to discover novel members of gene families
Genome Assembly and Annotation Process
• The primary data produced by genome sequencing projects are often highly fragmented and sparsely annotated
• NCBI assimilates data of various types, from numerous sources, to provide an integrated view of a genome, making it easier for researchers to use
• NCBI constantly strives to improve the accuracy of its human genome assembly and annotation
• Feedback from outside groups and individual users is used to improve the process
• Data Freeze
The data are “frozen” at the start of the build process by making a copy of all of the data available for use at that time. Freezing the data provides a stable set of inputs for the remainder of the build process.
• The Build Cycle
A build begins with a freeze of the input data and ends with the public release of an annotated assembly of genomic sequences. There are a few months between builds so that the latest build can be evaluated and improvements can be made.
Processing of Biological Sequence Data
• The sequence database GenBank is made up of nucleotide sequences submitted by individual scientists and sequencing centers from around the world.
• These sequences have been submitted directly to GenBank or are replicated from one of the collaborating databases.
• The data are handled by an information management system that consists of two major components, the ID database and the IQ database.
• ID handles incoming sequences and feeds other databases with subsets to suit different needs.
• IQ holds links between sequences from ID and links from these sequences to other resources.
Abstract Syntax Notation 1 (ASN.1) Is the Data Format Used by the ID System. ASN.1 is the data description language in which all sequence data at NCBI are structured
Sources of Sequence Data
• GenBank sequences
• Reference sequences
• Sequences from other databases, such as SWISS-PROT, PIR, PRF, and PDB
Submission
• Large-volume submitters, such as HTGS, use FTP, often after using tools such as fa2htgs to convert their data to ASN.1
• Small-volume submitters typically use either BankIt or Sequin to prepare the ASN.1 for submission.
Output of Data from the ID System
After data conversion, the data are replicated to several different servers and also transformed into several different formats.
• Replication is necessary for three reasons:
It separates the “incoming” data system from the “outgoing” data.
Replicating the data to different servers helps balance the load of queries.
It protects against data loss.
ID Database
• Holds both ASN.1 objects and sequence identifier-related information.
• Accession numbers are assigned to the sequences.
• When the understanding of a sequence changes, the sequence can be given a new version.
• A GI number is assigned to each sequence that has a version.
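Accession and version are commonly written together as `accession.version`. The sketch below only splits such an identifier into its two parts; the identifier `AB000001.2` is a made-up example, and the function is an illustration rather than part of any NCBI toolkit.

```python
def parse_accession_version(identifier: str) -> tuple:
    """Split an 'accession.version' identifier into (accession, version number)."""
    accession, _, version = identifier.partition(".")
    if not accession or not version.isdigit():
        raise ValueError(f"not an accession.version identifier: {identifier!r}")
    return accession, int(version)


# Hypothetical identifier: version 2 of the record with accession AB000001.
print(parse_accession_version("AB000001.2"))  # ('AB000001', 2)
```

The accession stays the same when a record is revised, while the version number goes up, so the pair identifies one specific state of the sequence.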
DB name | Major content | Initial tech. | Current tech. | Primary data types
Genbank | DNA/RNA sequence, protein | Text files | Flat-file/ASN.1 | Text, numeric, some complex types
[name missing] | Disease phenotypes and genotypes, etc. | Index cards/text files | Flat-file/ASN.1 | Text
[name missing] | Genetic map linkage data | Flat file | [missing] | [missing]
[name missing] | Genetic map linkage data, sequence data (non-human) | [missing] | [missing] | Text, numeric
[name missing] | Sequence and sequence variants | Flat file, application specific | Flat file, application specific | Text
[name missing] | Biochemical reactions and pathways | [missing] | [missing] | Complex types, text, numeric
Types of Data (Databases)
➲ Taxonomy databases
➲ Genomic databases
  - Genomic databases (non vertebrate)
  - Human and other vertebrate genomes
➲ Sequence databases
  - Nucleotide sequence databases
  - Protein sequence databases
  - RNA sequence databases
➲ Structure databases
➲ Proteomic databases
➲ Microarray databases
➲ Chemical databases
➲ Expression databases
➲ Enzyme databases
➲ Pathway databases (Metabolic and Signalling pathways)
➲ Disease databases (Human genes and databases)
➲ Literature databases
➲ Other molecular biology databases
Genes and Disease
Down syndrome: trisomy of chromosome 21
• Rare form of cancer affecting young children in Africa
• Associated with Epstein-Barr virus
• A chromosomal translocation causes the cancer
• The translocation involves the Myc gene
• It changes the pattern of Myc expression and disrupts the control of cell growth and proliferation
• We are still not sure what causes chromosomal translocation
• Model organisms give clues to how translocation occurs
• Lesch-Nyhan syndrome (LNS) is a rare inherited disease that disrupts the metabolism of purines, the raw material of genes
• The body can either make purines (de novo synthesis) or recycle them (the salvage pathway)
• When one of the enzymes is missing, a wide range of problems can occur
• Mutation in the HPRT1 gene affects the production of the enzyme hypoxanthine-guanine phosphoribosyltransferase
• This enzyme speeds up the recycling of purines from broken-down DNA and RNA; at very low enzyme levels the recycling fails
• The mutation is inherited in an X-linked fashion
• Three main problems: accumulation of uric acid; self-mutilation; mental retardation and severe muscle weakness
• In 2000, in vitro techniques for treating LNS were identified
• A virus was used to insert a normal copy of the HPRT1 gene into deficient human cells
• Medications are used to decrease the levels of uric acid
• Obesity has more than one cause: genetic, environmental, psychological and other factors
• The hormone leptin, produced by adipocytes (fat cells), was discovered in 1994
• Subsequently the human Ob gene was mapped to chromosome 7
• Leptin is thought to act as a lipostat
• As the amount of fat stored in adipocytes rises, leptin is released into the blood and signals to the brain that the body has had enough to eat
• Overweight people have high levels of leptin in their bloodstream, indicating that other molecules also affect feelings of satiety and contribute to the regulation of body weight
• Refsum disease is a rare disorder of lipid metabolism
• It causes peripheral neuropathy, failure of muscle coordination and vision disorders
• In 1997 the gene for Refsum disease was identified and mapped to chromosome 10
• The protein product of the gene, PAHX, is an enzyme that is required for the metabolism of phytanic acid
• Refsum disease is characterized by an accumulation of phytanic acid in the plasma and tissues. Phytanic acid is a derivative of phytol, a component of chlorophyll
• Our bodies cannot synthesize phytanic acid; we obtain all of it from our food
• Prolonged treatment with a diet deficient in phytanic acid can be beneficial
• Adrenoleukodystrophy
• Diabetes, type 1
• Gaucher disease
• Glucose galactose malabsorption
• Hereditary hemochromatosis
• Menkes syndrome
• Pancreatic cancer
• Phenylketonuria
• Prader-Willi syndrome
• Porphyria
• Tangier disease
• Tay-Sachs disease
• Wilson's disease
• Zellweger syndrome
• The Human Genome Project has already fueled the discovery of more than 1,800 disease genes. • There are now more than 1,000 genetic tests for human conditions. • These tests enable patients to learn their genetic risks for disease and also help healthcare professionals diagnose disease • At least 350 biotechnology-based products resulting from the Human Genome Project are currently in clinical trials.
• Biodiversity: provides a genetic measure of biodiversity
• The National Geographic magazine started its “Genographic” project, an ambitious attempt to use data from the human genome to trace the pathways of human migration. Related data have shown that early humans migrated out of Africa along the coastline and finally into Australia around 40,000-60,000 years ago.
• Comparative genomics
• Genomics
• Proteomics
• Gene Therapy
• Risk assessment
• Agriculture, Livestock breeding and Bioprocessing
•identify potential suspects at crime scenes •identify crime and catastrophe victims •establish paternity and other family relations •match organ donors with recipients in transplant programs
Evolution and Human Migration
•Comparison of sequences of genetically, racially and culturally diverse peoples •Comparison of sequences of peoples geographically apart but apparently related •Study of evolution of humanoid species and modern humans
SOME ETHICAL ISSUES
• This means that the person undergoing the test should only do so on a voluntary basis and with a full understanding of all the implications.
• Limitations of the test need to be discussed prior to testing. The tests cannot always identify the mutation, even if it is present.
• Detection of the change in the gene is not necessarily predictive of future symptoms.
• The technology can determine the sex of a baby by checking the chromosomes. There are sometimes requests to use it to ensure that a couple have a baby of a certain sex, for reasons not necessarily related to the health or well-being of the child.
• Analysis of an individual’s genetic make-up could also be used in the future by employers or insurers wishing to know the likelihood of a potential employee or insurance applicant developing a condition for which they carry a predisposition; for example, alcoholism, heart disease or cancer • Such knowledge could lead to discrimination
• With the new advances in genetics, as in any powerful new scientific tool, there is a potential for abuse. The boundaries need to be considered
• The policy of the U.S. Patent and Trademark Office (PTO) was that “life forms, as products of nature, were unpatentable. Only products and processes invented by humans could be patented”
• What about genetically modified life forms: are they invented or discovered, the product of nature or of humans?
• In 1980, the U.S. Supreme Court issued its 5–4 decision in Diamond v. Chakrabarty that a bacterial strain that had been genetically modified to clean up oil spills could be patented, since it was “man-made” and not naturally occurring
• Since then the PTO has awarded thousands of patents on biological products, including patents on genes, SNPs, ESTs, cell lines, mice, plants, rhesus monkeys, and human stem cells
Genomes to Life:
A DOE Systems Biology Program
Exploring Microbial Genomes for Energy and the Environment
• identify the protein machines that carry out critical life functions • characterize the gene regulatory networks that control these machines • characterize the functional repertoire of complex microbial communities in their natural environments • develop the computational capabilities to integrate and understand these data and begin to model complex biological systems
GTL Applications in Energy Security and Global Climate Change
The International HapMap Project
• Although the DNA sequence of any two people is 99.9% identical, the variations crucially affect an individual’s disease risk
• The points where the sequence differs at a single DNA base are called single nucleotide polymorphisms (SNPs)
• Sets of SNPs on the same chromosome are inherited in blocks called haplotypes
➲ The purpose of the HapMap is to enable the study of genetic associations with disease
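The relationship between SNPs and haplotypes can be shown on a toy alignment. The sketch assumes short, already aligned sequences from the same chromosomal region, one per chromosome sampled; the sequences and function names are invented for illustration.

```python
from collections import Counter


def snp_positions(sequences) -> list:
    """Columns of the alignment where not all sequences carry the same base."""
    return [i for i, column in enumerate(zip(*sequences)) if len(set(column)) > 1]


def haplotype_counts(sequences) -> Counter:
    """Count the combinations of alleles seen together at the SNP positions."""
    sites = snp_positions(sequences)
    return Counter("".join(seq[i] for i in sites) for seq in sequences)


# Hypothetical aligned sequences from four chromosomes.
chromosomes = ["ACGTAC", "ACGTAC", "ATGTGC", "ATGTGC"]
print(snp_positions(chromosomes))     # [1, 4]
print(haplotype_counts(chromosomes))  # Counter({'CA': 2, 'TG': 2})
```

Only two of the four possible allele combinations occur here, so typing one SNP is enough to predict the other; this kind of redundancy is what a haplotype map records.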
• The project was launched in 2002 with $100 million in funding
• Nine research groups and more than 200 researchers in six countries: Canada, China, Japan, Nigeria, the UK and the US
• Samples came from people in Nigeria, Japan and China, and from people with northern and western European ancestry living in the US
• They mapped the entire genome of 269 people to identify tiny differences in key areas of DNA
• The HapMap is publicly accessible
ENCODE
• The National Human Genome Research Institute (NHGRI) launched ENCODE, the Encyclopedia Of DNA Elements, in September 2003, to carry out a project to identify all functional elements in the human genome sequence. It began with a pilot phase and a technology development phase.
• The pilot phase tested and compared existing methods to analyze a defined portion of the human genome sequence
• Conclusions from this pilot project were published in June 2007 in Nature
All data generated by ENCODE participants released into public databases
Role of China
The Chinese Human Genome Project started in 1993. Major aspects:
➲ Genome resource conservation and genetic polymorphism studies of multiple Chinese nationalities
➲ The development of an advanced technological system for genome research
➲ The cloning of some disease-related genes and a large number of expressed sequence tags (ESTs)
➲ Initiation of functional genomics studies
➲ Ethical, legal and social issues related to human genome sciences
• In 1988, the USSR Council of Ministers adopted a resolution on the creation of a Human Genome Project
• The project is under the scientific control of the Council on the Human Genome
• The Engelhardt Institute of Molecular Biology is concerned with the organism
• Goals: sequencing, mapping, investigation of model organisms, functional studies
• The Council funds 57 million rubles
• No international agreement on sequencing
• The Council funds two online databases
• GE (Gene Express) was founded in 1988 at the National Institute of Scientific Information
• HGG (Human Genome Guide) is affiliated with the Institute of Brain Research
• These two databases contain information on human, viral, bacterial and mammalian DNA sequences
• India plays a very significant role because of its special social structure
• It offers a rich resource for studying functional genomics, or the functional aspects of the genetic map.
• With its caste-based communities intermarrying among themselves, India provides rare genetic material in family pedigrees
• Gene variants linked to traits (the example given is mathematical ability) and to diseases such as breast cancer are said to be concentrated in families and communities because of selective intermarriage.
• These inbred communities can reveal the inheritance pattern of genes, and functional genomics can reveal what these genes actually do.
• This is a powerful combination, allowing scientists to understand how genetic disease is transmitted and how, by understanding gene function, it can be treated
“At the moment there are 1,100 companies devoted to the manufacture of medicines through recombinant techniques, to which we have to add over 700 corporations interested in the sector. On the whole, these companies employ more than 100,000 people and represent a stock market value near 50,000 million dollars”.
SmithKline Beecham has collaborated with Human Genome Sciences, Eli Lilly with Millennium Pharmaceuticals and Pfizer with Incyte Genomics.
In March 2000, President Clinton announced that the genome sequence could not be patented and should be made freely available to all researchers. The statement sent Celera's stock plummeting and dragged down the biotechnology sector, which lost about $50 billion in market capitalization in two days.
What We Still Don’t Know
• Gene number, exact locations, and functions
• Gene regulation
• DNA sequence organization
• Chromosomal structure and organization
• Noncoding DNA types, amount, distribution, information content, and functions
• Coordination of gene expression, protein synthesis, and post-translational events
• Evolutionary conservation among organisms
• Disease-susceptibility prediction based on gene sequence variation
• Protein conservation (structure and function)