Dr Joshua Boateng 21/11/2011

Dr J Boateng BIOT 1011 Bioinformatics

Biotech and pharmaceutical companies spent $10 billion on hardware, software, and services in 2002. (Source: Gartner)
The biotechnology/IT market will increase at a compound annual growth rate (CAGR) of 24% to nearly $38 billion by 2006. (Source: IDC Research)
Reference: Prof. A.S. Kolaskar Vice Chancellor, University of Pune

Genetics: the science of genes, heredity, and the variation of organisms. In modern research, genetics provides tools for investigating the function of a particular gene, e.g. analysis of genetic interactions.
Genomics: the study of large-scale genetic patterns across the genome for a given species. It deals with the systematic use of genome information to provide answers in biology, medicine, and industry.

Genomics: the study of sequences, gene organization & mutations at the DNA level, i.e. the study of information flow within a cell.
Major tools and methods related to genomics are bioinformatics, genetic analysis, measurement of gene expression, and determination of gene function.
Genomics has the potential of offering new therapeutic methods for the treatment of some diseases, as well as new diagnostic methods.

GENOME COMPARISONS
Species: Humans, Mouse, Puffer fish, Malaria Mosquito, Fruit Fly, Roundworm, E. coli
Chrom.: 46, 40, 44, 6, 8, 12, 1
Genes: 28,000-35,000; 22,500-30,000; 31,000; 14,000; 14,000; 19,000; 5,000
Base pairs: 3.1 billion, 3.1 billion, 2.7 million, 365 million, 137 million, 97 million, 4.1 million

GENOMIC ANALYSIS
• Many diverse studies require the determination of the abundance of large numbers of specific DNA or RNA molecules in complex mixtures, including, for example, the determination of the changes in mRNA levels of many genes.
– Genome analysis entails the prediction of genes in uncharacterized genomic sequences.
– The 21st century has seen the announcement of the draft version of the human genome sequence. Model organisms have been sequenced in both the plant and animal kingdoms.

GENOMIC ANALYSIS
• However, the pace of genome annotation is not matching the pace of genome sequencing.
• Experimental genome annotation is slow and time consuming. The demand is to be able to develop computational tools for gene prediction.
• Computational gene prediction is relatively simple for the prokaryotes, where all the genes are converted into the corresponding mRNA and then into proteins.
• The process is more complex for eukaryotic cells, where the coding DNA sequence is interrupted by non-coding intervening sequences called introns.
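The prokaryote case above can be illustrated with a minimal sketch of an open reading frame (ORF) scan: look for a start codon and read codon by codon to the first in-frame stop. The function name `find_orfs`, the `min_codons` threshold and the toy sequence are assumptions made for this illustration. It scans the forward strand only and knows nothing about introns, which is one reason such a naive scan does not carry over to eukaryotic genes.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}

def find_orfs(dna, min_codons=3):
    """Return (start, end, sequence) for forward-strand ORFs.

    An ORF here runs from ATG to the first in-frame stop codon (inclusive).
    min_codons counts all codons, including the start and the stop.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(dna):
            if dna[i:i + 3] == "ATG":
                # Walk codon by codon until an in-frame stop is found.
                for j in range(i + 3, len(dna) - 2, 3):
                    if dna[j:j + 3] in STOP_CODONS:
                        if (j + 3 - i) // 3 >= min_codons:
                            orfs.append((i, j + 3, dna[i:j + 3]))
                        i = j  # continue scanning after this ORF
                        break
            i += 3
    return orfs

# Toy sequence (hypothetical): ATG GCC AAA TAA starts at index 2.
print(find_orfs("CCATGGCCAAATAAGG"))
```

Real gene finders add statistical models of coding sequence, promoters and splice sites on top of this kind of scan.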

BIOLOGICAL QUESTIONS
Some of the questions biologists want to answer today are:
• What part of a DNA sequence codes for a protein and what part of it is junk DNA?
• Classify the junk DNA as intron, untranslated region, transposons, dead genes, regulatory elements.
• Divide a newly sequenced genome into the genes (coding) and the non-coding regions.

Biological Research in 21st Century
“The new paradigm, now emerging, is that all the 'genes' will be known (in the sense of being resident in databases available electronically), and that the starting point of a biological investigation will be theoretical.”
- Walter Gilbert

IMPORTANCE OF GENOME ANALYSIS
• The importance of genome analysis can be understood by comparing the human and chimpanzee genomes.
• The chimp and human genomes vary by an average of just 2%, i.e. just about 160 enzymes. A complete genome analysis of the two genomes would give a strong insight into the various mechanisms responsible for the differences.

COMPLEXITY IS AN UNDERSTATEMENT?

GENOMIC ANALYSIS: basics
• Techniques used to estimate the relative abundance of two or more sets of mRNA
– differential screening of cDNA libraries,
– subtractive hybridization,
– differential display.
• However, more advanced methods have been recently developed.

GENOMIC ANALYSIS: advances
• Advanced methods are particularly amenable to organisms whose entire genome sequences are known, such as S. cerevisiae.
• It is now practicable to investigate changes of mRNA levels of all yeast open reading frames (ORFs) in one experiment.

Advanced genomic analysis techniques
• DNA sequencing
• DNA microarray technology
– analysis of gene expression profiles at the mRNA level
• Bioinformatic tools to organize and analyze such data
• Chip-based analysis of samples
• Models of gene networks
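As a rough illustration of what analysing expression profiles at the mRNA level can mean in practice, the sketch below computes a log2 fold change per gene between two conditions. The intensity values are hypothetical, and `log2_fold_changes` and its `pseudocount` parameter are names chosen for this example. Real microarray pipelines also apply background correction and normalisation, which are omitted here.

```python
import math

def log2_fold_changes(control, treated, pseudocount=1.0):
    """Return {gene: log2((treated + p) / (control + p))} for shared genes.

    The pseudocount avoids division by zero for undetected transcripts.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
        if gene in treated
    }

# Hypothetical spot intensities for three yeast ORFs under two conditions.
control = {"YAL001C": 99.0, "YAL002W": 399.0, "YAL003W": 49.0}
treated = {"YAL001C": 399.0, "YAL002W": 99.0, "YAL003W": 49.0}

for gene, lfc in log2_fold_changes(control, treated).items():
    print(f"{gene}: {lfc:+.2f}")
```

A value of +2 means roughly four times more signal in the treated sample, -2 four times less, and 0 no change.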

Microarray Technology

Post-genomic Era
• Series of “omics”
– Comparative genomics
– Structural and functional genomics
– Transcriptomics
– Proteomics
– Metabolomics

Bioinformatics tools needed for analysis of data from these “omics”…

Data Mining
Development of new tools for data mining
– Sequence alignment
– Genome sequencing
– Genome comparison
– Microarray data analysis
– Proteomics data analysis
– Small molecular array analysis
To derive “information” and gain “knowledge” from the data

COMPARATIVE GENOMICS
• Analyzing & comparing genetic material from different species to study evolution, gene function, and inherited disease
• Understand the uniqueness between different species
• Comparative genomics involves the use of computer programs that can line up multiple genomes and look for regions of similarity among them.
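One very simple way to look for regions of similarity between two sequences is to index every substring of length k (a k-mer) and report the k-mers found in both. The sketch below is an illustration under that assumption and not how any particular genome aligner works. The function name `shared_kmers` and both toy sequences are made up for the example.

```python
def shared_kmers(seq_a, seq_b, k=4):
    """Return {kmer: ([positions in seq_a], [positions in seq_b])}."""
    def index(seq):
        # Map each k-mer to the list of 0-based positions where it starts.
        table = {}
        for i in range(len(seq) - k + 1):
            table.setdefault(seq[i:i + k], []).append(i)
        return table

    idx_a, idx_b = index(seq_a.upper()), index(seq_b.upper())
    return {kmer: (idx_a[kmer], idx_b[kmer]) for kmer in idx_a if kmer in idx_b}

# Toy sequences (hypothetical) that share the 5-mer ATGCA.
hits = shared_kmers("GGATCCATGCA", "TTATGCAGG", k=5)
print(hits)
```

Exact k-mer matches like these are commonly used as seeds that a more careful alignment step then extends, allowing for mismatches and gaps.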

When we BLAST a sequence, is that comparative genomics?
Difference is in Scale and Direction
Other “omics”
• One or several genes compared against all other known genes.
• Use genome to inform us about the entire organism.
Comparative
• Entire genome compared to other entire genomes.
• Use information from many genomes to learn more about the individual genes.

Background on Comparative Genomic Analysis
• Sequencing the genomes of the human, the mouse and a wide variety of other organisms, from yeast to chimpanzees, is the driving force for the development of a new field of biological research called comparative genomics.

BACKGROUND
• Comparing the human genome with the genomes of different organisms helps to better understand gene structure and function and thereby develop new strategies in the battle against human disease.
• Comparative genomics also provides a powerful new tool for studying evolutionary changes among organisms.

• This helps to identify the genes that are conserved among species along with the genes that give each organism its own unique characteristics.
• Using computer-based analysis to zero in on the genomic features that have been preserved in multiple organisms over millions of years, researchers will be able to pinpoint the signals that control gene function.
• This should in turn translate into innovative approaches for treating human disease and improving human health.

BACKGROUND
• The evolutionary perspective may prove extremely helpful in understanding disease susceptibility. For example, chimpanzees do not suffer from some of the diseases that strike humans, such as malaria and AIDS.
• A comparison of the sequence of genes involved in disease susceptibility may reveal the reasons for this species barrier, thereby suggesting new pathways for prevention of human disease.

BACKGROUND
• Although living creatures look and behave in many different ways, all of their genomes consist of DNA, the chemical chain that makes up the genes that code for thousands of different kinds of proteins.
• Precisely which protein is produced by a given gene is determined by the sequence in which four chemical building blocks - adenine (A), thymine (T), cytosine (C) and guanine (G) - are laid out along DNA's double-helix structure.
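How the order of the four bases determines the protein can be sketched with the standard genetic code, in which each group of three bases (a codon) specifies one amino acid or a stop signal. The code below builds the standard codon table from its conventional compact T, C, A, G ordering and translates a coding sequence. The function name `translate` and the toy sequence are assumptions for illustration. It reads a single frame, stops at the first stop codon and raises `KeyError` on characters other than A, C, G and T.

```python
BASES = "TCAG"
# Amino acids of the standard genetic code, listed for codons in TCAG order.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

def translate(dna):
    """Translate a coding DNA sequence, stopping at the first stop codon."""
    dna = dna.upper()
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)

# Toy coding sequence (hypothetical): ATG GCC AAA TAA
print(translate("ATGGCCAAATAA"))
```

Changing a single base can change the encoded amino acid: for example, AAA codes for lysine (K) while AAT codes for asparagine (N).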

BACKGROUND
• In order for researchers to most efficiently use an organism's genome in comparative studies, data about its DNA must be in large, contiguous segments, anchored to chromosomes and, ideally, fully sequenced.
• Furthermore, the data needs to be organized for easy access and high-speed analysis by sophisticated computer software.
• Organisms that have been completely sequenced include: human (Homo sapiens), mouse (Mus musculus), fruit fly (Drosophila melanogaster), ... and ...

BACKGROUND
• The fledgling field of comparative genomics has already yielded some dramatic results.
• For example, a March 2000 study comparing the fruit fly genome with the human genome discovered that about 60 percent of genes are conserved between fly and human.
• Simply put, the two organisms appear to share a core set of genes. Researchers have found that two-thirds of human cancer genes have counterparts in the fruit fly.

BACKGROUND
• More surprisingly, when scientists inserted a human gene associated with early-onset Parkinson's disease into fruit flies, the flies displayed symptoms similar to those seen in humans with the disorder.
• This raises the possibility that the tiny insects could serve as a new model for testing therapies aimed at Parkinson's.

Comparative Genomics: What should one look for?
(Venn diagram: Human, P. falciparum, Mosquito)
Proteins that are shared by –
• All genomes
• Exclusively by Human & P.f.
• Exclusively by Human & Mosquito
• Exclusively by P.f. & Mosquito
Unique proteins in –
• Human
• P.f. (targets for anti-malarial drugs)
• Mosquito
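The shared and unique groups listed above correspond to the regions of a three-set Venn diagram, which can be sketched with ordinary set operations. The protein family identifiers (`F1` to `F7`) and the function name `partition_proteomes` are placeholders invented for this example. In practice the groups would come from a homology search between the three proteomes.

```python
def partition_proteomes(human, parasite, mosquito):
    """Split protein families into the seven regions of a three-set Venn diagram."""
    return {
        "all_three": human & parasite & mosquito,
        "human_parasite_only": (human & parasite) - mosquito,
        "human_mosquito_only": (human & mosquito) - parasite,
        "parasite_mosquito_only": (parasite & mosquito) - human,
        "human_unique": human - parasite - mosquito,
        "parasite_unique": parasite - human - mosquito,
        "mosquito_unique": mosquito - human - parasite,
    }

# Placeholder family identifiers, not real data.
human = {"F1", "F2", "F3", "F4"}
parasite = {"F1", "F2", "F5", "F6"}   # P. falciparum
mosquito = {"F1", "F3", "F5", "F7"}

regions = partition_proteomes(human, parasite, mosquito)
for name, families in regions.items():
    print(name, sorted(families))
```

Under the reasoning on the slide, the `parasite_unique` group is where one would start looking for anti-malarial drug targets, since those proteins have no counterpart in the human host.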

Comparative Gene Prediction
• GenScan: ab initio gene prediction.
• GenomeScan: ab initio modified by BLAST homologies.
• GeneWise, Procrustes: homology guided.
• Rosetta, SGP1 (Syntenic Gene Prediction), CEM (Conserved Exon Method): gene prediction and sequence alignment are clearly separated.
• SGP-2, TwinScan, SLAM, DoubleScan: modification of GenScan scoring schema to incorporate similarity to known proteins.

Proteome – by the dictionary
• The term proteome, coined in 1994, is a linguistic equivalent to the concept of genome.
Proteome: the complete set of proteins that is expressed, and modified, by the entire genome in the lifetime of a cell.
Practical: the complement of proteins expressed by a cell at any one time.

Proteomics – by the dictionary
Proteomics (Practical): the study of the proteome using technologies of large-scale protein separation and identification.
Large scale separation: 2DE, Liquid Chromatography
Identification: MALDI MS, Tandem MS/MS, FT-MS, ….

• .com/archive/031704/horizons_horizons_comm.html

Proteomics according to Medline
Development of Proteomics
From 220 publications in the previous millennium (‘94-’99)
To 21,350 (!!!) publications in this millennium (‘00-’05)
[Bar chart: papers and reviews per year, 1997-2004]

Proteomics – by Google
THE REALISTIC TRUTH
[Figure: Google hit counts for Proteomics and Genomics, 2004 and 2005]

Comparing Proteomics & Genomics
Genomics analysis: genome (DNA, coding DNA, nc-RNA, mRNA)
– Linear
– Dynamic (up/down)
– Archived (EST, cDNA, GEO)
– Completion
Proteome analysis: proteome (proteins, peptides, glyco and other modifications)
– 3D
– Dynamic (up/down, variants)
– Poorly archived
– No notion of completion

Proteomics – Genomics
More differences…

Gene/RNA dynamic
Handle:
– Stable molecules
– Handling cheap/easy
– Minimal modification
– Works in isolation
Tech:
– Sequencing (established)
– DNA array / genotyping / expression / CGH

Protein dynamic
Handle:
– Fragile molecules
– Handling dependent
– Labile modification
– Protein-interaction
– Localization dependent
Tech:
– MS related (not yet)
– Protein Chip (not yet)
– Antibodies array (not yet)



– Original definition: study of the proteins encoded by the genome of a biological sample
– Current definition: study of the whole protein complement of a biological sample (cell, tissue, animal, biological fluid [urine, serum])
– Usually involves high resolution separation of polypeptides at front-end, followed by mass spectrometry identification and analysis

Challenges facing Proteomic Technologies
• Limited/variable sample material
• Sample degradation (occurs rapidly, even during sample preparation)
• Vast dynamic range required
• Post-translational modifications (often skew results)
• Specificity among tissue, developmental and temporal stages
• Perturbations by environmental (disease/drugs) conditions
• Researchers have deemed sequencing the genome “easy,” as PCR was able to assist in overcoming many of these issues in genomics.

The Proteomics Tool Kit
• technologies for separating and visualizing proteins and peptides
• technologies for assessing protein-protein interactions
• technologies for identifying proteins*
• technologies for quantifying protein expression*
• bioinformatic tools for assessment and communication

Proteomic Technologies
• Amino Acid Composition
• Array-based Proteomics
• 2D PAGE
• Mass Spectrometry
• Structural Proteomics
• Informatics (and the challenges facing the Human Proteome Project)

Amino Acid Composition (Edman)
• Pioneering method of obtaining information from proteins.
• Not “high-throughput” by today’s standards.
• Requires the use of terrible smelling β-mercaptoethanol.
• Cumbersome and tedious by today’s standards; hence, amino acid composition is no longer the most widely used technique.

Protein Sequencing
Step 1: fragmenting into peptides

Protein Sequencing
Step 2: sequencing the peptides by Edman degradation. Separation by HPLC and detection by absorbance at 269 nm.

Array-based Proteomics
• Employ two-hybrid assays
• Use GFP, FRET, and GST
– GFP = green fluorescent protein
– FRET = fluorescence resonance energy transfer
– GST = glutathione S-transferase, a well characterized protein used as a marker protein.

Array-based Proteomics

Array-based Proteomics
• Offer a high-throughput technique for proteome analysis.
• These small plates are able to hold many different samples at a time.
• Current research is ongoing in an attempt to interface array methodologies with Mass Spectrometry at ORNL.

2D PAGE
• The core technology of proteomics is 2-DE.
• At present, there is no other technique that is capable of simultaneously resolving thousands of proteins in one separation procedure. (cited in 2000)
• 2-D gel electrophoresis is a multi-step procedure that can be used to separate hundreds to thousands of proteins with extremely high resolution.
• It works by separation of proteins by their pI's in one dimension using an immobilized pH gradient (first dimension: isoelectric focusing) and then by their MW's in the second dimension.
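The two coordinates of a spot on a 2-D gel, isoelectric point (pI) and molecular weight (MW), can both be estimated from an amino acid sequence. The sketch below is an approximation under stated assumptions: the pKa values are one commonly tabulated set chosen for illustration, the masses are average residue masses, and post-translational modifications are ignored. The function names are invented for this example. The pI is found by bisection on the net charge, which decreases steadily as pH rises.

```python
# Average residue masses in Da (amino acid minus one water).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER = 18.01528

# Assumed pKa values (illustrative; published sets differ slightly).
PKA_POSITIVE = {"K": 10.8, "R": 12.5, "H": 6.5}
PKA_NEGATIVE = {"D": 3.9, "E": 4.1, "C": 8.5, "Y": 10.1}
PKA_N_TERM, PKA_C_TERM = 8.6, 3.6

def molecular_weight(seq):
    """Approximate average molecular weight in Da."""
    return sum(AVERAGE_RESIDUE_MASS[aa] for aa in seq) + WATER

def net_charge(seq, ph):
    """Net charge at a given pH from the Henderson-Hasselbalch equation."""
    charge = 1 / (1 + 10 ** (ph - PKA_N_TERM)) - 1 / (1 + 10 ** (PKA_C_TERM - ph))
    for aa in seq:
        if aa in PKA_POSITIVE:
            charge += 1 / (1 + 10 ** (ph - PKA_POSITIVE[aa]))
        elif aa in PKA_NEGATIVE:
            charge -= 1 / (1 + 10 ** (PKA_NEGATIVE[aa] - ph))
    return charge

def isoelectric_point(seq, tol=0.001):
    """pH at which the net charge is zero, found by bisection on [0, 14]."""
    low, high = 0.0, 14.0
    while high - low > tol:
        mid = (low + high) / 2
        if net_charge(seq, mid) > 0:
            low = mid
        else:
            high = mid
    return (low + high) / 2

peptide = "MKDEHR"  # hypothetical sequence
print(round(isoelectric_point(peptide), 2), round(molecular_weight(peptide), 1))
```

A lysine-rich protein comes out with a high pI and an aspartate-rich one with a low pI, which is what spreads proteins along the pH gradient in the first dimension.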

Evolution of 2-DE methodology
Traditional IEF procedure (in the past):
• Isoelectric focusing (IEF) is run in thin polyacrylamide gel rods in glass or plastic tubes.
• Gel rods containing: 1. urea, 2. detergent, 3. reductant, and 4. carrier ampholytes (form pH gradient).
• Problem: 1. tedious, 2. not reproducible.

Evolution of 2-DE methodology
SDS-PAGE gel size:
• This “O’Farrell” technique has been used for 20 years without major modification.
• 20 x 20 cm gels have become a standard for 2-DE.
• Assumption: 100 bands can be resolved by a 20 cm long 1-DE.
• Therefore, in theory, a 20 x 20 cm gel can resolve 100 x 100 = 10,000 proteins.

Evolution of 2-DE methodology
Problems with traditional 1st dimension IEF
• Works well for native protein, not good for denaturing proteins, because:
1. Takes longer time to run.
2. Techniques are cumbersome (the soft, thin, long gel rods need excellent experimental technique).
3. Batch to batch variation of carrier ampholytes.
4. Patterns are not reproducible enough.
5. Loss of most basic proteins and some acidic proteins.
OPERATOR DEPENDENT

2D PAGE
• 2-D gel electrophoresis process consists of these steps:
• Sample preparation
– First dimension: isoelectric focusing
– Second dimension: gel electrophoresis
• Staining
• Imaging analysis via software

Challenges for 2-DE
1. Spot number:
– 10,000-150,000 gene products in a cell.
– PTM makes it difficult to predict the real number.
– It’s impossible to display all proteins in one single gel.
– Sensitivity and dynamic range of 2-DE must be adequate.

Challenges for 2-DE
2. Isoelectric point spectrum:
– pI of proteins: range from pH 3-13 (by in vitro translated ORF).
– PTM would not alter the pI outside this range.
– A pH gradient from 3-13 does not exist.
– Proteins with pI > 11.5 need to be handled separately.

Challenges for 2-DE
3. Molecular weights:
– Small proteins or peptides can be analysed by modifying the gel and buffer condition of SDS-PAGE.
– Proteins > 250 kDa do not enter the 2nd dimension SDS-PAGE properly.
– 1-DE (SDS-PAGE) can be run in a lane at the side of 2-DE.

Challenges for 2-DE
4. Hydrophobic proteins:
– Some very hydrophobic proteins do not go in solution.
– Some hydrophobic proteins are lost during sample preparation and isoelectric focusing (IEF).
– More chemical developments are required.

Challenges for 2-DE
5. Sensitivity of detection:
– Low copy number proteins are very difficult to detect, even employing the most sensitive staining methods.
– Sensitivity of staining methods: 1. Silver staining 2. Fluorescent staining 3. Dye binding staining (CBR)

Challenges for 2-DE
6. Loading capacity:
– For detection of low abundant proteins, more sample needs to be loaded.
– A wide dynamic range of the SDS-PAGE is required to prevent merging of highly abundant protein.
– Loading capacity: IEF > SDS-PAGE.

Challenges for 2-DE
7. Quantitation:
– The detection method must give reliable quantitative information.
– Silver staining does not give reliable quantitative data.

Challenges for 2-DE
8. Reproducibility:
– Highest importance in a 2-DE experiment.
– Immobilized pH gradient strips have greatly improved 1st dimension consistency.
– Variation mostly comes from sample preparation.

“A good-looking spot pattern – streak and smear free – is not a guarantee for best 2-DE protocol”

Technologies for identifying proteins
• Western blotting
• Chemical (Edman) sequencing of proteins
• Mass spectrometry
– peptide mass fingerprint
– mass spec decay
– databases and search engines
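The idea behind a peptide mass fingerprint can be sketched as follows: a candidate protein sequence is digested in silico, and the resulting peptide masses are compared with the masses measured by the mass spectrometer. The sketch assumes trypsin as the protease, using the commonly quoted simplified rule that it cleaves after K or R unless the next residue is P (the slides do not name an enzyme). The function names and the toy sequence are made up for this example. Masses are monoisotopic and unmodified.

```python
# Monoisotopic residue masses in Da (amino acid minus one water).
MONOISOTOPIC_RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER_MONO = 18.01056

def tryptic_digest(protein):
    """Cleave after K or R, except when the next residue is P (simplified rule)."""
    peptides, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        peptides.append(protein[start:])  # C-terminal peptide
    return peptides

def peptide_mass(peptide):
    """Monoisotopic mass of an unmodified peptide in Da."""
    return sum(MONOISOTOPIC_RESIDUE_MASS[aa] for aa in peptide) + WATER_MONO

# Toy protein (hypothetical sequence).
for pep in tryptic_digest("MKRPAKGG"):
    print(pep, round(peptide_mass(pep), 4))
```

A search engine repeats this for every sequence in a database and ranks the entries by how many predicted masses match the observed ones within a tolerance.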

Mass Spectrometry
• Mass Spectrometry is another tool to analyze the proteome.
• In general a Mass Spectrometer consists of:
– Ion Source
– Mass Analyzer
– Detector
• Mass Spectrometers are used to quantify the mass-to-charge (m/z) ratios of substances.
• From this quantification, a mass is determined, proteins are identified, and further analysis is performed.
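The step from a measured m/z value to a mass can be shown with a short worked example. It assumes positive-mode ionisation in which each charge comes from one added proton, so that m/z = (M + z × 1.007276) / z. The function names are chosen for illustration and the neutral mass used is arbitrary.

```python
PROTON_MASS = 1.007276  # Da

def mass_to_mz(neutral_mass, charge):
    """m/z of an ion carrying `charge` extra protons."""
    return (neutral_mass + charge * PROTON_MASS) / charge

def mz_to_mass(mz, charge):
    """Neutral mass recovered from an observed m/z and an assumed charge state."""
    return mz * charge - charge * PROTON_MASS

# An arbitrary 1000 Da peptide observed at charge states 1 to 3.
for z in (1, 2, 3):
    print(z, round(mass_to_mz(1000.0, z), 4))
```

The same peptide therefore appears at different m/z values depending on its charge state, so the charge has to be worked out before the mass can be used for identification.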


Application of bioinformatics in the fields of genomics and proteomics

What is Bioinformatics?
Conceptualizing biology in terms of molecules and then applying “informatics” techniques from math, computer science, and statistics to understand and organize the information associated with these molecules on a large scale

How do we use Bioinformatics?
• Store/retrieve biological information (databases)
• Retrieve/compare gene sequences
• Predict function of unknown genes/proteins
• Search for previously known functions of a gene
• Compare data with other researchers
• Compile/distribute data for other researchers

Sequence retrieval: National Center for Biotechnology Information, GenBank and other genome databases
Sequence comparison programs: BLAST, GCG, MacVector
Protein structure: 3D modeling programs – RasMol, Protein Explorer


Similarity Search: BLAST
A tool for searching gene or protein sequence databases for related genes of interest
Alignments between the query sequence and any given database, allowing for mismatches and gaps, indicate their degree of similarity
The structure, function, and evolution of a gene may be determined by such comparisons
http://www.ncbi.nlm.nih.

% identity
CATTATGATA
GTTTATGATT
70%
MRCKTETGAR
MRCGTETGAR
90%
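The two percentages above can be reproduced by counting matching positions in each ungapped pair and dividing by the alignment length. The function name `percent_identity` is chosen for this sketch. It assumes the sequences are already aligned and of equal length, and it does not handle gaps.

```python
def percent_identity(seq_a, seq_b):
    """Percentage of positions with identical characters in two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return 100.0 * matches / len(seq_a)

print(percent_identity("CATTATGATA", "GTTTATGATT"))  # DNA pair from the slide
print(percent_identity("MRCKTETGAR", "MRCGTETGAR"))  # protein pair from the slide
```

The DNA pair differs at three of ten positions and the protein pair at one of ten, which gives the 70% and 90% shown.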

Strengths:
• Accessibility
• Growing rapidly
• User friendly
Weaknesses:
• Sometimes not up-to-date
• Limited possibilities
• Limited comparisons and information
• Not accurate

Need for improved Bioinformatics
Genomics:
• Human Genome Project
• Gene array technology
• Comparative genomics
• Functional genomics
Proteomics:
• Global view of protein function/interactions
• Protein motifs
• Structural databases

Data Mining
• Handling enormous amounts of data
• Sort through what is important and what is not
• Manipulate and analyze data to find patterns and variations that correlate with biological function

Proteomics
• Uses information determined by biochemical/crystal structure methods
• Visualization of protein structure
• Make protein-protein comparisons
• Used to determine:
- conformation/folding
- antibody binding sites
- protein-protein interactions
- computer aided drug design

Bioinformatics: students, educators, researchers, institutions