
Proteomics Analysis

Gordon Anderson gordon@pnl.gov April 19, 2004

Outline
What is Proteomics?
Accurate Mass and Time tag (AMT) based proteomics
Instrumentation
Data analysis
Data management
Challenges

Proteomics
The study of proteins. (www.genencor.com/wt/gcor/glossary)
The study of the full set of proteins encoded by a genome. (www.ornl.gov/TechResources/Human_Genome/glossary/glossary.html)
The study of the proteome. (www.niehs.nih.gov/nct/glossary.htm)
The study of the structure and function of proteins. (www.oznet.ksu.edu/biotech/glossary.htm)
The study of the full set of proteins encoded by a genome. Source: Human Genome Project Information (www.genomecanada.ca/prairie/glossary.asp)
The study and cataloging of proteins in the human body. These proteins and how they interact with each other may hold the keys to curing diseases in humans or providing the targets for drug development. (www.changewave.com/Glossary.html)

A Simplified View of Protein Production
[Diagram relating DNA, Proteins, Cellular processes, Regulation, and Environment.]

Genomics provides DNA sequence and predictions of proteins (ORFs).

Definitions
Amino Acid: A molecule of the general formula NH2-CHR-COOH, where "R" is one of a number of different side chains. Amino acids are the building blocks of proteins.
Protein: A large molecule composed of one or more chains of amino acids in a specific order; the order is determined by the base sequence of nucleotides in the gene that codes for the protein. Proteins are required for the structure, function, and regulation of the body's cells, tissues, and organs. Further, each protein has a unique function, for example as a hormone, enzyme, or antibody.
Peptide: Two or more amino acids joined by a bond called a "peptide bond."
Polypeptide: A protein or part of a protein made of a chain of amino acids joined by peptide bonds.
DNA (deoxyribonucleic acid): The molecule that encodes genetic information. DNA is a double-stranded molecule held together by weak bonds between base pairs of nucleotides. The four nucleotides in DNA contain the bases adenine (A), guanine (G), cytosine (C), and thymine (T). In nature, base pairs form only between A and T and between G and C; thus the base sequence of each single strand can be deduced from that of its partner.
Dalton: An atomic mass unit. A unit of mass that equals 1/12 the mass of the most abundant isotope of carbon (12C), which is assigned a mass of 12.

Amino Acids
Name            Three-Letter Code   Single-Letter Code
Alanine         Ala                 A
Arginine        Arg                 R
Asparagine      Asn                 N
Aspartic acid   Asp                 D
Cysteine        Cys                 C
Glutamic acid   Glu                 E
Glutamine       Gln                 Q
Glycine         Gly                 G
Histidine       His                 H
Isoleucine      Ile                 I
Leucine         Leu                 L
Lysine          Lys                 K
Methionine      Met                 M
Phenylalanine   Phe                 F
Proline         Pro                 P
Serine          Ser                 S
Threonine       Thr                 T
Tryptophan      Trp                 W
Tyrosine        Tyr                 Y
Valine          Val                 V

Proteins

>ORF02221 hypothetical protein
MASPATPSPLATPPPSDCKSCTPTPPRPTNCGPRFRRGASVSRMKWPPRLPFWRRTRRRI
SPDRNCWWTGAGAFEPPLLSASLEQFCELGVRRNSTATRYILRSAPCFALALYELYQSAR
KKLCHLFVKCSRLGACPDAWWDSPPLPGSGRPSMAAMSRRTARRLPLRRLALLGAGAAAL
TLVLSSCSGEKVQNAVNTTNSTRGLKVVLNQSYGPDTRNKLDVYAPQNAQGAPTILFIHG
GSWQGGDKSGHAFVGESLARAGYVVGVMNYRLAPQNRYPSYVQDGAAALKWLRDHAGQFG
GNPNNLFVSGHSAGGFNAVELVDNARWLAEVNVPVSSIRGVIGIAGPYSYDFRAYQTRVA
FPENGNPDDIMPDRHVRPDAPPHLLLVAANDSVVAPQNALNMEAALQKARIPVQRVVLPR
LNHVTIIGAVARNLTWLGGTRAEIIRFVDANRLK

Organism Complexity

                 Bbur       E coli     Yeast      C elegans
Number of ORFs   849        4289       6117       19098
Number of AAs    282890     1358701    2889982    8088749
Min MW           3431.2     1723.1     2979.3     334.4
Max MW           254247.2   251394.7   559312.3   1368569
Average MW       38165.2    35097      53437.2    47879

E coli MW Distribution of Protein Mass

[Figure: histogram of number of proteins (0 to 120) versus MW, in Daltons (0 to 180000).]

Post-Translational Modifications
Acetylation
Alkylation
Amidation
S-archaeol
Beta-methylthiolation
Biotin
Bromination
Carbamylation
Cholesterol
Cis-14-hydroxy-10,13-dioxo-7-heptadecenoic acid aspartate ester
Citrullination
C-Mannosylation
Cysteine sulfenic acid (-SOH)
Cysteine sulfinic acid (-SO2H)
Deamidation
S-diacylglycerol cysteine
Dimethylation
FAD
Farnesylation
Formylation
Geranyl-geranylation
Gamma-carboxyglutamic acid
O-GlcNAc
Glucosylation (Glycation)
Glutathionylation
Hydroxylation
Lipoyl
Methylation
Myristoylation
S-Nitrosylation
n-Octanoate
Omega-hydroxyceramide glutamate ester
Palmitoylation
S-palmitoleyl cysteine
Phosphatidylethanolamine amidated glycine
Phosphorylation
Pyridoxal phosphate
Phosphopantetheine
Pyrrolidone carboxylic acid
Sulfation
Trimethylation

Sample Preparation

Cells: Cell growth, selection, isolation.
Lysis: Extraction of proteins from cells.
Clean up: Isolation of proteins from other cellular material.
Analysis: Analysis of prepared sample.

2D gel electrophoresis display


Protein Identification by Mass Spectrometry

Identification of proteins by mass alone is difficult due to many possible modifications.
  These modifications cause significant mass changes.
Enzymatic digestion of proteins enables protein identification.
  Enzymes cleave at known locations, producing many peptides per protein.
  Unique, unmodified peptides are likely for every protein.

Tryptic Digestion of a Protein
Protein Sequence
MNTLRIGLVSISDRASSGVYQDKGIPALEEWLTSALTTPFELETRLIPDEQAIIEQTLCELVDE
MSCHLVLTTGGTGPARRDVTPDATLAVADREMPGFGEQMRQISLHFVPTAILSRQVGVIRK
QALILNLPGQPKSIKETLEGVKDAEGNVVVHGIFASVPYCIQLLEGPYVETAPEVVAAFRPKS
ARRDVSE

Tryptic Peptides
MNTLR
IGLVSISDR
ASSGVYQDK
GIPALEEWLTSALTTPFELETR
LIPDEQAIIEQTLCELVDEMSCHLVLTTGGTGPAR
R
DVTPDATLAVADR
EMPGFGEQMR
QISLHFVPTAILSR
QVGVIR
K
QALILNLPGQPK
SIK
ETLEGVK
DAEGNVVVHGIFASVPYCIQLLEGPYVETAPEVVAAFRPK
SAR
R
DVSE

Trypsin is a commonly used enzyme that “cuts” after Lysine (K) and Arginine (R).
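
The cleavage rule can be sketched in a few lines of Python. This is a minimal illustration, not the tool used in the talk. It assumes the common convention that trypsin does not cut when the next residue is proline (P), which agrees with the uncleaved "FRPK" in the peptide list on this slide. The function name `tryptic_digest` is made up for the example.

```python
def tryptic_digest(sequence):
    """Split a protein sequence after K or R, unless the next residue is P."""
    peptides = []
    start = 0
    for i, residue in enumerate(sequence):
        next_residue = sequence[i + 1] if i + 1 < len(sequence) else ""
        if residue in "KR" and next_residue != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    # Keep whatever remains after the last cleavage site (the C-terminal peptide).
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


# First 23 residues of the protein sequence shown on this slide.
print(tryptic_digest("MNTLRIGLVSISDRASSGVYQDK"))
```

Real digests also contain missed cleavages and non-specific cuts, which this sketch ignores.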


Uniqueness of Peptide Masses at Various Measurement Accuracies (C elegans)

[Figure: percent unique (0 to 100) versus peptide mass, in Da (500 to 3500), with one curve for each measurement accuracy: 0.1 ppm, 1 ppm, 10 ppm, and 100 ppm.]

Peptide Detection Strategies

Measurement of peptide masses alone is not sufficient for identification.
  Requires very high measurement accuracy
  Noise
  Incomplete digestion
  Non-specific cleavage sites

Additional information is necessary to increase the identification confidence.
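
To make the accuracy requirement concrete, the sketch below expresses a mass error in parts per million (ppm) and counts how many candidate peptide masses remain within a given tolerance. The candidate masses are invented for illustration and do not come from any organism in this talk; the function names are likewise hypothetical.

```python
def ppm_error(observed_mass, theoretical_mass):
    """Signed mass error in parts per million."""
    return (observed_mass - theoretical_mass) / theoretical_mass * 1e6


def candidates_within(observed_mass, candidate_masses, tolerance_ppm):
    """Return the candidate masses that lie within tolerance_ppm of the observation."""
    return [
        mass
        for mass in candidate_masses
        if abs(ppm_error(observed_mass, mass)) <= tolerance_ppm
    ]


# Hypothetical candidate peptide masses (Da) that sit close together.
candidates = [1500.0000, 1500.0100, 1500.1000]
observed = 1500.0005

for tolerance in (100, 10, 1):
    hits = candidates_within(observed, candidates, tolerance)
    print(tolerance, "ppm:", len(hits), "candidate(s)")
```

Only the tightest tolerance leaves a single candidate, which is the point of the uniqueness curves on the previous slide.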


Peptide Identification Using Fragmentation Data
Tandem Mass Spec Analysis provides additional information allowing peptide identification.
1. Select and isolate the peptide
2. Measure the mass of the peptide
3. Fragment the peptide
4. Record the fragmentation spectrum
5. Interpret the spectra using peptide ID algorithms.
[Figure: peptide GIPALEEWLTSALTTPFELETR annotated with fragment ion labels a10, b10, c10 and x12, y12, z10.]
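
As a rough illustration of what a peptide ID algorithm compares against, the sketch below computes theoretical singly charged b and y fragment ion m/z values for a peptide. The residue masses, water mass, and proton mass are standard monoisotopic values supplied here, not taken from the slides, and the function name `b_y_ions` is made up. The a, c, x, and z series are left out.

```python
# Standard monoisotopic residue masses in Da (supplied for this sketch).
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.010565
PROTON = 1.007276


def b_y_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] is b(i+1), counted from the N-terminus.
    y_ions[i] is y(i+1), counted from the C-terminus.
    """
    masses = [RESIDUE_MASS[residue] for residue in peptide]
    b_ions = []
    y_ions = []
    for length in range(1, len(peptide)):
        b_ions.append(sum(masses[:length]) + PROTON)
        y_ions.append(sum(masses[-length:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = b_y_ions("GIPALEEWLTSALTTPFELETR")
print("b10 m/z:", round(b_ions[9], 4))
print("y12 m/z:", round(y_ions[11], 4))
```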


Approach for Proteome Analysis
Cell lysis
Protein extraction/fractionation
Optional protein stable-isotope labeling
Enzymatic digestion
Liquid Chromatography separation with Mass Spec
Data processing and display

PNNL Accurate Mass and Time Tag Approach
Given the constraint of a sequenced genome, high accuracy mass measurements by FTICR provide unique “biomarkers” for nearly all proteins:

Accurate Mass and Time Tags (AMTs)
Use of AMTs for protein identification
- Subsequent MS/MS not required
- Enables high throughput proteomics

Overall Methodology
Stage 1
Protein Extract
Tryptic Digestion
LC-MS/MS
LC-FTICR Accurate Mass Measurements
• Validate Accurate Mass Tags
• Create AMT database

Stage 2
Control and Perturbed samples
LC-FTICR
• Identify proteins using AMT database
• Determine abundance ratios

Stage 1: Practice
Fragment from Elongation factor Tu:
...APEEKARGITINTAHVEYQTETRHYSHVDCPGHADYVK...
Tryptic digestion yields:
APEEKAR
GITINTAHVEYQTETR
HYSHVDCPGHADYVK

GITINTAHVEYQTETR is analyzed by capillary LC-MS/MS (e.g., with ion trap) and by capillary LC-FTICR.
[Figure: MS/MS fragmentation spectrum with labeled b ions (b6 to b14) and y ions (y4 to y13), plotted against m/z.]
[Figure: FTICR mass spectrum, m/z 916.0 to 920.5.]

PMT identified from D. radiodurans: GITINTAHVEYQTETR
Calculated PMT mass (1830.1987) and elution time
Accurate mass observed: 1830.1987
Validated AMT (GITINTAHVEYQTETR)

Stage 2: Practice
Generation of quantitative data and comparative displays

[Diagram: a seed culture is split into Perturbed and Control cultures, with stable-isotope labeling.]

Harvest, process, and analyze by LC-FTICR

Abundance ratio (AR) = Perturbed proteome / Control proteome


Proteomics Production Pipeline


9.4 Tesla FTICR Mass Spectrometer


Ion Trap Mass Spectrometer


Proteomics Analysis Pipeline


Liquid Chromatographic (LC) Separation Coupled with Mass Spectrometry

Automated LC separation platform
  Injection of complex peptide mixture.
  Sample robot used for automated 24/7 operation.
  Peptide separation based on physical properties.
  Peptides “elute” from LC column over time, thus lowering the complexity of each spectrum.
  High performance LC separation times of 100 to 150 minutes.

ESI-Mass Spectrometer
  Collection of thousands of individual mass spectra.
  Capture of raw data and experiment metadata.

Two Classes of Mass Spec Experiments

LC-MS
- Each spectrum contains many peptide signals.
- High performance MS.
- Each injection results in 1000 to 5000 spectra.
- Processing generally involves de-isotoping.
- Simple instrument control.

LC-MS/MS
- Requires precursor ion MS scan to select ions for fragmentation.
- Instrument accumulates selected ion and then fragments this ion.
- Fragmentation spectrum is recorded along with the precursor ion m/z.
- Each fragmentation spectrum can ID only one peptide.
- More complex instrument control, dynamic exclusion.
- Lower throughput.

MS/MS acquisition

1) Acquire an MS precursor scan

2) Isolate first selected ion and acquire fragmentation spectrum.

3) Isolate next selected ion and acquire fragmentation spectrum.

4) Isolate last selected ion and acquire fragmentation spectrum.
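
The selection step behind this acquisition cycle can be sketched as follows. The slides do not say how ions are chosen. This example assumes the usual data-dependent scheme of taking the most intense precursor peaks and skipping m/z values on a dynamic exclusion list. The function name, the peak values, and the 0.5 m/z exclusion window are all hypothetical.

```python
def select_precursors(precursor_scan, top_n, excluded_mz, mz_tolerance=0.5):
    """Pick up to top_n precursor m/z values, most intense first.

    precursor_scan: list of (mz, intensity) tuples from the MS precursor scan.
    excluded_mz: m/z values fragmented recently (dynamic exclusion list).
    """
    selected = []
    by_intensity = sorted(precursor_scan, key=lambda peak: peak[1], reverse=True)
    for mz, _intensity in by_intensity:
        if len(selected) >= top_n:
            break
        # Skip ions that were already fragmented in a recent cycle.
        if any(abs(mz - excluded) <= mz_tolerance for excluded in excluded_mz):
            continue
        selected.append(mz)
    return selected


scan = [(500.3, 1e5), (784.4, 9e5), (935.5, 5e5), (674.4, 2e5)]
first_cycle = select_precursors(scan, top_n=2, excluded_mz=[])
second_cycle = select_precursors(scan, top_n=2, excluded_mz=first_cycle)
print(first_cycle, second_cycle)
```

Excluding the ions from the first cycle lets the second cycle reach lower-abundance peptides, which is the purpose of dynamic exclusion.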

 #   Rank/Sp  (M+H)+     deltCn  XCorr   Sp      Ions    Reference  Peptide
---  -------  ---------  ------  ------  ------  ------  ---------  -------
 1.   1 /  1  3558.0774  0.0000  6.1056  1734.4  40/128  DR0607     K.EMLRDIAAVTGGEVVSEDLGHKLENVGMEMLGR.A
 2.   2 / 63  3557.9677  0.5263  2.8921   259.9  21/124  DRA0230    E.AGTICFSLTPHEHFADAWHFEIRSLAASRDGL.V
 3.   3 / 13  3557.9456  0.5334  2.8488   334.9  26/136  DR1991     L.VEDAGAETPRAQRGEVSATGTLFGRKVKPLTADAG.A
 4.   4 /  2  3559.0343  0.5351  2.8385   484.0  27/128  DR0422     L.IPERPYAQVVDLGCGTGEQTAQLAQRFPQATVL.G
 5.   5 / 12  3558.2075  0.5391  2.8138   335.0  24/120  DR0353     V.DKDGRMELIPIREETARGMIEDLMLLANKVV.A
 6.   6 /  9  3557.0066  0.5458  2.7733   343.5  25/144  DR1017     L.HGLEKPGNVGAILRTADAAGAAGVLVLGRGADPYGPN.V
 7.   7 / 55  3556.2267  0.5458  2.7732   268.8  23/128  DR0799     M.PLKPEIPKPEAEQKRVGFAIVGIGKLSAEELIP.A
 8.   8 /  5  3558.3387  0.5479  2.7606   380.6  26/140  DR1483     R.LSWSGILAGLVMGVVTTLTIIALGTVITALTGLTLS.G
 9.   9 /135  3557.1870  0.5499  2.7481   225.1  22/136  DRB0122    L.LLSRTLDVLSLGDDLATSLGTRVGAARLLCLSVGV.A
10.  10 /  4  3559.3433  0.5682  2.6365   388.0  25/128  DRA0322    G.GLFVGVVMFLPQGVAGLLTRRRTPVPLPAPELR.E

 #   Rank/Sp  (M+H)+     deltCn  XCorr   Sp      Ions    Reference  Peptide
---  -------  ---------  ------  ------  ------  ------  ---------  -------
 1.   1 / 45  2715.9859  0.0000  1.5074   146.8  10/48   DRA0307    V.QSSGSDVQRVGQGLVYSPIRDRVPN.I
 2.   2 / 91  2712.9732  0.0284  1.4647   129.5   9/46   DR2279     P.AKPDAEKEQKFEQEVQEVAPDAKP.E
 3.   3 /  8  2713.1521  0.0876  1.3753   196.3  12/52   DR0400     T.PYDGIPHLVRPVVTNPADAAGVLLGAV.A
 4.   4 / 35  2714.8187  0.1029  1.3522   153.2  11/58   DR2493     L.GHESVGGGSDGNFTAATVPTLDGLGAPGDG.A
 5.   5 / 11  2714.0526  0.1148  1.3344   186.1  10/46   DR0582     P.PDDDFVVEHNGLGPLTAKGWLRYL.N
 6.   6 / 21  2716.1890  0.1260  1.3174   165.2  10/44   DR1778     A.LRMLEERGMDRVFDPQKIVAVPD.H
 7.   7 / 47  2712.1679  0.1261  1.3173   145.9  10/52   DR1078     R.VLEIGAGTGRVTAFLTRRGAAVLGVEP.S
 8.   8 /190  2714.1862  0.1424  1.2927   110.1  10/48   DR0860     Q.VPGTPLGQVLGILRQRAELASTHWL.A
 9.   9 /  1  2716.3416  0.1456  1.2879   250.4  13/50   DR0284     A.IGRALMARPSLLLLDEPSLGLAPLVV.E
10.  10 /355  2712.1210  0.1572  1.2705    95.9   8/46   DR0363     T.LEPLLATKWTASNGGKTYTFDLRK.N
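
The deltCn column in these two result listings appears consistent with the relative drop in XCorr from the top-ranked hit, (XCorr_top - XCorr_i) / XCorr_top. The sketch below recomputes it from the XCorr values shown above. Reading a large gap after the first hit as a confident identification and a small gap as an ambiguous one is the usual interpretation, not something the slides state; the function name is made up.

```python
def delt_cn(xcorr_scores):
    """Relative XCorr difference of each hit from the top-ranked hit.

    xcorr_scores must be ordered by rank, best hit first.
    """
    top = xcorr_scores[0]
    return [(top - score) / top for score in xcorr_scores]


# XCorr values of the top two hits in each listing above.
confident = delt_cn([6.1056, 2.8921])
ambiguous = delt_cn([1.5074, 1.4647])

print("first listing, hit 2 deltCn: ", round(confident[1], 4))
print("second listing, hit 2 deltCn:", round(ambiguous[1], 4))
```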

High Resolution Mass Spectrum Analysis

CS  Abundance  m/z      Fit    Average MW  Monoisotopic MW  Most abundant MW
2   6.19E+06   784.366  0.006  1567.764    1566.7175        1566.7175
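
The relation between the m/z, charge state (CS), and monoisotopic MW values reported for this spectrum can be checked with a short calculation. The sketch assumes positive-mode ions carrying one proton per charge, so the neutral mass is (m/z - proton mass) × charge. The proton mass is a standard constant supplied here, not a value from the slides.

```python
PROTON = 1.007276  # mass of a proton in Da (standard value)


def neutral_mass(mz, charge):
    """Neutral molecular mass of an [M + zH]z+ ion observed at the given m/z."""
    return (mz - PROTON) * charge


# m/z 784.366 at charge state 2, as reported for this spectrum.
print(round(neutral_mass(784.366, 2), 4))
```

The result agrees with the reported monoisotopic MW of 1566.7175 to within about 0.1 mDa.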

De-isotoping Procedure
Based on David Horn’s THRASH algorithm
Conversion of isotopic distributions to tables of detected masses:
  Select the most abundant peak in spectrum
  Use autocorrelation algorithm to determine charge state
  Measure peak resolution and signal to noise
  Calculate mass estimate using peak m/z and charge state
  Use average elemental composition of peptide to estimate molecular formula
  Model the isotopic distribution
  Fit the modeled distribution to experimental data and register
  Calculate the fit (least squares error) and report results
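
One step of the procedure, estimating a molecular formula from an average peptide composition, can be sketched as below. The composition used is the commonly cited "averagine" unit (C4.9384 H7.7583 N1.3577 O1.4773 S0.0417, average mass 111.1254 Da). The slides do not state which composition was used, so treat these constants and the function name as assumptions.

```python
# "Averagine": an average amino acid residue composition (assumed for this sketch).
AVERAGINE = {"C": 4.9384, "H": 7.7583, "N": 1.3577, "O": 1.4773, "S": 0.0417}
AVERAGINE_MASS = 111.1254  # average mass of one averagine unit, in Da


def estimate_formula(molecular_mass):
    """Estimate integer element counts for a peptide of the given mass."""
    units = molecular_mass / AVERAGINE_MASS
    return {element: round(count * units) for element, count in AVERAGINE.items()}


# Mass estimate from the high resolution spectrum example (1566.7175 Da).
print(estimate_formula(1566.7175))
```

The estimated formula is what an isotopic distribution model would then be built from before fitting it to the measured peaks.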

High Resolution MS Data


Overlapping Isotopic Distributions

CS  Abundance  m/z       Fit    Average MW  Monoisotopic MW  Most abundant MW
1   4.07E+06   674.3661  0.005  673.77      673.3588         673.3588
2   1.81E+06   674.2631  0.012  1347.4214   1346.5117        1346.5117

Overlapping Isotopic Distributions

CS  Abundance  m/z       Fit    Average MW  Monoisotopic MW  Most abundant MW
4   3.70E+05   556.4735  0.021  2223.3104   2221.8649        2222.8679
4   3.10E+05   557.2672  0.026  2225.4828   2224.0367        2225.0397
5   1.89E+05   556.8497  0.036  2779.9962   2778.2092        2779.2121

O18 Labeling, 4 Da Mass Delta

CS  Abundance  m/z       Fit    Average MW  Monoisotopic MW  Most abundant MW
2   1.35E+07   935.4978  0.032  1870.2082   1868.9809        1869.984
2   1.25E+07   937.5011  0.032  1874.2156   1872.9878        1873.9908
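
Finding such labeled pairs in a list of detected masses can be sketched as a search for mass differences near the expected label shift. The example assumes the 4 Da delta comes from two 18O atoms replacing two 16O atoms. The isotope masses are standard values supplied here, and the 0.01 Da tolerance and function name are arbitrary choices for illustration.

```python
# Mass shift for two 18O atoms replacing two 16O atoms (assumed label), about 4.0085 Da.
O18_PAIR_DELTA = 2 * (17.999160 - 15.994915)


def find_pairs(masses, delta, tolerance):
    """Return (light, heavy) mass pairs whose difference is within tolerance of delta."""
    ordered = sorted(masses)
    pairs = []
    for i, light in enumerate(ordered):
        for heavy in ordered[i + 1:]:
            if abs((heavy - light) - delta) <= tolerance:
                pairs.append((light, heavy))
    return pairs


# The two monoisotopic masses from this slide plus one unrelated, made-up mass.
detected = [1872.9878, 1500.0000, 1868.9809]
print(find_pairs(detected, O18_PAIR_DELTA, tolerance=0.01))
```

Dividing the abundance of one member of a pair by the other would then give the abundance ratio for that peptide.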

Isotopic Signature Identification, N14/N15

CS  Abundance  m/z       Fit    Average MW  Monoisotopic MW  Most abundant MW  Isotope
2   2.84E+08   578.3451  0.097  1155.3760   1154.6757        1154.6757         N14
2   2.47E+08   585.3236  0.05   1169.1556   1154.6741        1168.6326         N15
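
The mass shift between the N14 and N15 forms reflects how many nitrogen atoms the peptide contains. The sketch below estimates that count and converts the labeled mass back to its 14N equivalent. It assumes the N15 row's most abundant MW (1168.6326) is the fully labeled species. The isotope masses are standard values supplied here and the function names are made up.

```python
# Mass difference between 15N and 14N, in Da (standard isotope masses).
N15_N14_DELTA = 15.000109 - 14.003074


def nitrogen_count(light_mass, heavy_mass):
    """Estimate the number of nitrogen atoms from a 14N/15N mass shift."""
    return round((heavy_mass - light_mass) / N15_N14_DELTA)


def n14_equivalent_mass(heavy_mass, nitrogens):
    """Mass the labeled peptide would have with all nitrogens as 14N."""
    return heavy_mass - nitrogens * N15_N14_DELTA


count = nitrogen_count(1154.6757, 1168.6326)
print(count, round(n14_equivalent_mass(1168.6326, count), 4))
```

The converted mass agrees with the 1154.6741 reported as the monoisotopic MW in the N15 row, which suggests that column is expressed on the 14N scale.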

Capillary LC-FTICR 2-D Display of Peptides from a Yeast Soluble Protein Digest
>160,000 isotopic distributions corresponding to thousands of peptides


Unique Mass Classes Generation

[Figure: 12 unique mass classes (UMC) plotted as Mr/u (1,180 to 1,280) versus spectrum number (time), 243 to 273.]

14N/15N Pairs Generation

[Figure: 7 pairs plotted as Mr/u (1,180 to 1,280) versus spectrum number (time), 243 to 273. The pairs are annotated with the numbers 18, 16, 16, 19, 16, 12, and 15.]


PRISM: Proteomics Research Information Storage and Management

Users (~75 active)
  Biologists
  Chemists
  Mass Spec. Scientists
  Bioinformaticists
  Technicians

Connected instruments and systems
  FTICR Mass Spectrometers (4 instruments)
  Ion Trap Mass Spectrometers (7 instruments)
  Time-of-Flight Mass Spectrometers (3 instruments)
  External Systems
  Visualization Tools
  Data Mining Tools
  EMSL Scientific Archive

Contents
  113      Research Campaigns
  32       Organisms
  7,928    Prepared Samples
  17,427   MS Instrument Runs
  54,996   Automated Analyses
  20 TB    Data / Results Files
  85       Mass Tag Databases
  150 GB   Data in Databases

PRISM Features
Manages several kinds of data and information
  Raw spectra files
  Analysis results files
  Metadata (information about biomaterial, sample prep., etc.)
Automates processing of information
  Capture and store spectra files
  Analyze raw spectra with multiple tools
  Identify mass tags
  Archive data and results files
Provides several access paths to information
  Web interface for users
  Database and file access for external tools and systems

PRISM Architecture

DMS
  Capture Spectra
  Store Files
  Backup Files
  Perform Analyses
  Manage Metadata
  Web Interface

MTS
  Create PMTs
  Identify AMTs
  Web Interface

PRISM Hardware
Dell PowerEdge 2650 (4 ea): 512 MB, 2000 GB. DMS file handling managers
Dell PowerEdge 2650: 1024 MB, 70 GB. SQL Server, Internet Information Server (IIS)
Dell PowerEdge 2550: 1024 MB, 170 GB. SQL Server, Internet Information Server (IIS)
Dell Precision Workstation 530: 512 MB, 33 GB. DMS Analysis Manager, SEQUEST
Dell PowerEdge 6400/700: 512 MB, 70 GB. DMS Analysis Manager, SEQUEST
Racksaver RS1200 (5 ea), 2 dual CPU machines per box: 512 MB, 20 GB. DMS Analysis Manager, ICR2LS
Racksaver RS1200 (3 ea), 2 dual CPU machines per box: 512 MB, 30 GB. DMS Analysis Manager, ICR2LS, SEQUEST

DMS Information Management
File storage holds raw spectra and analysis results files
Tracking database tracks files and contains metadata

[Diagram: Tracking Database (Campaign, Cell Culture, Experiment, Dataset, Analysis Job) linked to File Storage (Dataset Folder, Analysis Results Folder).]

Tracking Entities:
  Campaign: Major line of research
  Cell Culture: Biomaterial source
  Experiment: Prepared sample
  Dataset: MS instrument run
  Analysis Job: Peptide Identification, Peak Identification, Total Ion Current (TIC)

MTS – Selection and Integration
MTS databases actively process DMS analysis results into mass tag information

Main Database
  Centralize references to DMS metadata
  Manage automatic update cycle

Peptide Database
  Select Analysis Jobs from DMS
    Based on metadata in DMS tracking DB
  Extract peptide identifications
    From DMS analysis results files
    Based on selection criteria

Mass Tag Database
  Select analysis jobs from peptide database
  Import selected peptides and make PMTs
  Extract and match monoisotopic peaks
    From DMS analysis results files
    Match to PMTs by mass and elution time
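
Matching observed monoisotopic peaks to PMTs by mass and elution time can be sketched as a two-tolerance lookup. This is only an illustration of the idea, not the MTS implementation. The `MassTag` class, the peptide names, the normalized 0 to 1 elution time scale, and the tolerance values are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MassTag:
    peptide: str
    mass: float          # monoisotopic mass in Da
    elution_time: float  # normalized elution time, 0 to 1 (assumed scale)


def match_features(features, mass_tags, mass_tol_ppm, time_tol):
    """Match (mass, elution_time) features to mass tags within both tolerances."""
    matches = []
    for mass, elution_time in features:
        for tag in mass_tags:
            mass_error_ppm = abs(mass - tag.mass) / tag.mass * 1e6
            time_error = abs(elution_time - tag.elution_time)
            if mass_error_ppm <= mass_tol_ppm and time_error <= time_tol:
                matches.append(((mass, elution_time), tag.peptide))
    return matches


# Two hypothetical tags with nearly identical mass but different elution times.
tags = [
    MassTag("PEPTIDEA", 1500.0000, 0.30),
    MassTag("PEPTIDEB", 1500.0010, 0.70),
]
print(match_features([(1500.0005, 0.31)], tags, mass_tol_ppm=5, time_tol=0.05))
```

Both tags fall within the mass tolerance, and only the elution time separates them, which is why the tags carry time as well as accurate mass.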

MTS Processing Pipelines
Processing data flow performed in a fixed sequence
  Internal DB procedures
  External programs
Each DB has master update sequence procedure
Main DB controls overall schedule
  Daily, weekly (on/off hours)
  Calls master update procedure in each individual DB

[Diagrams: Peptide DB Data Flow; Mass Tag DB Data Flow.]

PRISM Operation
PRISM has been in steady operation for several years
  Unplanned downtime on order of 2%
  We’ve learned to do better
Users tend to take it for granted, like the plumbing
Has evolved over time
  Phase 1: Basic Dataset Capture, March 2000
  Phase 2: Automated Analyses, March 2001
  Phase 3: Mass Tag System, March 2002

[Figures: three charts of Experiments (axis 0 to 9000), Datasets (axis 0 to 20000), and Analysis Jobs (axis 0 to 60000) plotted over time, with date axes running to early 2004.]


Challenges
Datasets from many different mass spec instrument types
Different analysis tools applied to each instrument type
Organization of datasets is critical
Multiple research campaigns in progress
800 datasets per month and growing
Expanding data rates managed with fixed resources
Lack of data standards
Sample standards
Large data volumes, long-term archival strategy
Improved algorithms are needed at every level

Acknowledgements
Dick Smith, nearly 12 years of support
Gary Kiebel
Lars Kangas
Lilly Pasa-Tolic
Ken Swanson
Nikola Tolic
Dave Prior
Christophe Masselon
Bev Taylor
Matt Monroe
Josh Adkins
Eric Strittmatter
Ron Moore
Weijun Qian
Ruihua Fang
Michael Buschbach
Ken Auberry
Dave Clark
Mary Lipton
Dave Camp
Many past group members