
Identification of Proteins Through Mass Spectrometry Databases: Processing and Precautions

Syed Kashif Raza


Dr. Panjwani Center for Molecular Medicine and Drug Research
International Center for Chemical and Biological Sciences
University of Karachi

Molecular Medicine

Proteomics and Mass Spectrometry


Proteome: the complete set of proteins in a cell
Current methodologies: 2D gel electrophoresis, protein microarrays, fluorescence microscopy, mass spectrometry, chromatography, nuclear magnetic resonance, microfluidics, microchips

Mass spectrometry is an important technique in molecular and cell biology.

Recent advances have automated much of the workflow: excision of protein spots, enzymatic digestion, acquisition of mass spectra, and database searching.

Techniques for analysing modified proteins and for quantification have also been developed.

Servers available for Protein Identification through Mass Spectrometry


For PMF: ASCQ_ME, BUPID, Mascot, MassSearch, MS-Fit (Protein Prospector), PepMAPPER, ProFound (PROWL), MOWSE, PeptideSearch

For Sequence Query: Mascot, MS-Seq (Protein Prospector), TagIdent

For MS/MS Ion Search: InsPecT, Mascot, MS-Seq (Protein Prospector), OMSSA, PepFrag (PROWL), PepProbe, RAId_DbS, Sonar (Knexus), X!Tandem (The GPM)

Mascot

Software search engine that identifies proteins from mass spectrometry data
Unique in integrating all three proven search methods
Widely used
Free to use on the Matrix Science public server
A license is required for in-house use

Mascot Server

Gives excellent results with peak lists from instruments manufactured by:

Agilent, Bruker, Thermo Scientific, Waters, AB Sciex, Shimadzu

Reasons for in-house use:

Data sets that exceed the 1200-spectrum limit of the public server
Confidentiality
Automation
Adding and editing modifications, enzymes, quantitation methods, etc.

The time taken by a search depends on the number of processors.

Three proven ways of using mass spectrometry data

Peptide mass fingerprint
Uses the molecular masses of the peptides resulting from digestion of a protein by a specific enzyme.

Sequence query
Mass values combined with amino acid sequence or composition data.

MS/MS ions search
Uninterpreted MS/MS data from a single peptide or from a complete LC-MS/MS run.

Peptide Mass Fingerprint


Peak picking
Use a utility to convert the raw spectrum into a peak list.

Mass values matter most
Collect as many peptide masses as possible in the range 1000 to 3500 Da.

To perform a search
Paste your peak list or upload it as a file.
Enter values for the search parameters.
After submission, you receive the results as a list of matching proteins.
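The idea behind a peptide mass fingerprint can be sketched in a few lines of Python. This is an illustration only, not Mascot's algorithm: the function names are hypothetical, trypsin is assumed to cleave after K or R except before P, the protein sequence is made up, and the "score" is a bare count of matched masses rather than a probability.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056  # added once per peptide for the free termini


def tryptic_digest(sequence):
    """Split after K or R unless the next residue is P (no missed cleavages)."""
    peptides, start = [], 0
    for i, aa in enumerate(sequence):
        next_aa = sequence[i + 1] if i + 1 < len(sequence) else ""
        if aa in "KR" and next_aa != "P":
            peptides.append(sequence[start:i + 1])
            start = i + 1
    if start < len(sequence):
        peptides.append(sequence[start:])
    return peptides


def peptide_mass(peptide):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(MONO[aa] for aa in peptide) + WATER


def count_matches(observed_masses, sequence, tol_da=0.2):
    """Count observed masses that fall within tol_da of a theoretical peptide."""
    theoretical = [peptide_mass(p) for p in tryptic_digest(sequence)]
    return sum(
        any(abs(m - t) <= tol_da for t in theoretical) for m in observed_masses
    )


protein = "GASPKVTCLRPNDQK"  # hypothetical sequence for illustration
print(tryptic_digest(protein))
print(count_matches([458.25, 999.0], protein))
```

A real search engine repeats this comparison for every entry in the database and converts the number of matches into a probability-based score.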

Peptide Mass Fingerprinting

Fast, simple analysis
High sensitivity
Needs a database of protein sequences, not ESTs
The sequence must be present in the database
Not good for mixtures
Start with Swiss-Prot
A protein hit is significant if its expect value is below 0.05

MS/MS Ions Search



Applies to a single protein or a complex mixture.
Chromatography regulates the flow of peptides into the mass spectrometer.
Peptides are selected one at a time using the first stage of mass analysis.
Each isolated peptide is then induced to fragment.
The second stage of mass analysis collects an MS/MS spectrum.
Software determines which peptide sequence in the database gives the best match.
The degree of matching is scored.


Fragment ion structures

Peptide molecular ions fragment at preferred locations along the backbone.
The major peaks are usually b and y ions.
The pattern depends on the ionization technique, the mass analyser, and the peptide structure.
If peptides fragmented cleanly, we wouldn't need a database search: there would be a ladder of peaks for each ion series.
Fragmentation is rarely perfect.
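As a rough illustration of the "ladder of peaks", the sketch below computes singly charged b and y ion m/z values for an unmodified peptide. The function name is hypothetical, only the residues needed for the example are tabulated, and no other ion series or neutral losses are considered.

```python
# Monoisotopic residue masses in Da for the residues used in the example.
MONO = {"P": 97.05276, "E": 129.04259, "T": 101.04768,
        "I": 113.08406, "D": 115.02694}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b, y) lists of singly charged fragment m/z values."""
    masses = [MONO[aa] for aa in peptide]
    n = len(masses)
    # b ions contain the N-terminus: sum of the first i residues plus a proton.
    b = [sum(masses[:i]) + PROTON for i in range(1, n)]
    # y ions contain the C-terminus: last i residues plus water plus a proton.
    y = [sum(masses[-i:]) + WATER + PROTON for i in range(1, n)]
    return b, y


b_ions, y_ions = fragment_ions("PEPTIDE")
for i, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{i} = {b:.4f}   y{i} = {y:.4f}")
```

Adjacent peaks within one series differ by a residue mass, which is what makes it possible to read a short stretch of sequence from a clean spectrum.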


MS/MS ion search


Results are complicated to report.
The report lists a series of proteins and the peptide matches assigned to each.
A pop-up window shows the alternative peptide matches for a spectrum.
The top match has a high score.


MS/MS ion search


Easily automated

Searches can be slow with:
No enzyme specificity
Several variable modifications
A large data set
A large database

An MS/MS search identifies peptides; proteins are inferred from the peptide matches.


Sequence Query


Sequence tag search

Even when the quality of the spectrum is poor, it is often possible to pick out a minimum of four clean peaks.
A few residues of amino acid sequence can then be interpreted.
Mann and Wilm realized that this very short stretch of amino acid sequence can provide sufficient specificity for an identification if it is combined with the fragment ion mass values that enclose it, the peptide mass, and the enzyme specificity.
Picking out a good tag requires both luck and experience.

Requires interpretation of the spectrum
Usually manual, hence not high throughput
The tag has to be called correctly


Peptide Sequence tag


Standard sequence tag is obsolete.

Easier to skip the interpretation step and pass the peak list to the search engine.
Rapid search times
Error tolerant


Search parameters
Name, Email and Search Title
The name and email are saved as a browser cookie.
If Mascot security is enabled, this information is taken from the user database.
The email address is used for sending results.


Databases

Swiss-Prot (~500000 entries)

Best annotated database, ideal for PMF

NCBI nr and UniRef100 (~19000000 entries)

Large, comprehensive, best choice for MS/MS

EST databases (>400000000 entries in translation)

Huge, not advisable for PMF

Single genome databases

Not suitable for PMF

cRAP and contaminants databases


Database
How to choose the right database

In Mascot 2.3 and later, you can select multiple databases for one search.
You cannot mix amino acid (AA) and nucleic acid (DNA) databases.
Comprehensive repositories such as NCBI and EBI let you download nr, GenBank, Swiss-Prot, EMBL, TrEMBL, etc.

When searching data from a single organism, always include a database of common contaminants.

If you are interested in a bacterium or plant, try a comprehensive protein database such as NCBInr or UniRef100.


Nucleic Acid Databases


Mascot always performs a six-frame translation.

It translates the entire sequence and does not look for a start codon.

When a stop codon is encountered, it leaves a gap.
It uses the correct genetic code, as long as the taxonomy is known.
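A six-frame translation can be sketched as follows. This is a generic illustration rather than Mascot's implementation: it assumes the standard genetic code only, marks stop codons with `*`, and uses hypothetical function names. Splitting at `*` mimics the "gap" left at each stop codon.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    """Translate from the first base; unknown codons become X."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    return [
        translate(strand[offset:])
        for strand in (dna, reverse_complement(dna))
        for offset in range(3)
    ]


def stretches_between_stops(protein):
    """Split a translated frame at stop codons, dropping empty pieces."""
    return [piece for piece in protein.split("*") if piece]


for frame in six_frames("ATGGCCTAAGGC"):
    print(frame, stretches_between_stops(frame))
```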


Taxonomy

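A taxonomy filter speeds up the search.
It gives a simpler report.
Keep the taxonomy indexes up to date.
Check the stats file for each database.
If the correct protein from the correct species is not in the database, don't specify a very narrow taxonomy; a homologous protein from a related species may still match.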


Enzyme

First choice
Set allowed missed cleavage sites to zero if the digest is known to be complete.
Choose a setting of 1 or 2 when you're not sure about your sample.
A higher number increases the number of calculated peptide masses.
Use no enzyme only in exceptional cases, and never for PMF.
The enzyme list is user configurable.
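To see why a higher missed-cleavage setting increases the number of calculated peptide masses, the sketch below joins adjacent fully cleaved fragments. The function name and the fragment list are hypothetical; the point is only how the candidate count grows.

```python
def with_missed_cleavages(fragments, max_missed):
    """Join up to max_missed + 1 adjacent fully cleaved fragments."""
    peptides = []
    for i in range(len(fragments)):
        last = min(i + max_missed, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


fully_cleaved = ["AK", "CR", "DK"]  # hypothetical complete digest
for allowed in (0, 1, 2):
    candidates = with_missed_cleavages(fully_cleaved, allowed)
    print(allowed, len(candidates), candidates)
```

Every extra candidate mass is another chance of a random match, which is why the setting should be no higher than the sample requires.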


Modifications

Fixed modifications
Variable (post-translational) modifications
Display all modifications

Keep the number of variable modifications small

Some modifications are worse than others

Mods that affect a terminus are less of a problem, e.g. Pyro-glu
Mods that apply to residues with a high fractional abundance, at any position, are a big problem, e.g. Phospho (ST)
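The difference between a terminal modification and one such as Phospho (ST) can be made concrete by counting the modified forms a search has to consider. The sketch below is a simplification with hypothetical names: it treats every candidate site as independently modified or not, and ignores any cap on modifications per peptide that a search engine may apply.

```python
def count_modified_forms(peptide, residues="", n_term=False):
    """Number of distinct forms when each candidate site is modified or not."""
    sites = sum(peptide.count(r) for r in set(residues))
    if n_term:
        sites += 1  # a terminal modification adds a single site
    return 2 ** sites


peptide = "QSTSTK"  # hypothetical peptide
print(count_modified_forms(peptide, n_term=True))    # one site, e.g. Pyro-glu
print(count_modified_forms(peptide, residues="ST"))  # four sites, Phospho (ST)
```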


Modifications

Post-translational: phosphorylation, acetylation
Artifacts: oxidation, acetylation
Derivatization: alkylation of cysteine
Sequence variants: errors, SNPs, other variants

Take the complete list from Unimod.

The modification to select depends on the alkylating agent: iodoacetamide (carbamidomethyl), iodoacetic acid (carboxymethyl), or MMTS (methylthio).


Phosphorylation

Site heterogeneity
Poor ionization efficiency
Three fragmentation channels:

Intact fragments
Neutral loss of HPO3 (80 Da)
Neutral loss of H3PO4 (98 Da)

Can occur at S, T, and Y, which make up about 16% of residues


Protein mass

Mass of the intact protein in kDa.
If this field is left blank, there is no restriction on protein mass.
Slows down the search a little.


Tolerance

Peptide tolerance: error window on experimental peptide mass values
MS/MS tolerance: error window on fragment ion mass values
Units: percentage, milli-mass units (mmu), parts per million (ppm), or Daltons (Da)
The protein/peptide view includes a graph of the mass errors for fragment ions.

Specifying too tight a peptide tolerance is a common reason for failing to get a match.
A more appropriate MS/MS tolerance might be +/- 0.3 Da.
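The four tolerance units can be compared by converting each to an absolute window in Da. The helper names below are hypothetical; the conversions are plain arithmetic (ppm and percentage scale with the mass, mmu and Da do not).

```python
def tolerance_in_da(mass, value, unit):
    """Convert a tolerance in Da, mmu, ppm, or % to an absolute window in Da."""
    factors = {
        "Da": 1.0,
        "mmu": 1e-3,         # milli-mass units
        "ppm": mass * 1e-6,  # parts per million of the mass
        "%": mass * 1e-2,    # percentage of the mass
    }
    return value * factors[unit]


def within_tolerance(observed, theoretical, value, unit):
    window = tolerance_in_da(theoretical, value, unit)
    return abs(observed - theoretical) <= window


print(tolerance_in_da(2000.0, 10, "ppm"))       # window for 10 ppm at 2000 Da
print(within_tolerance(1000.2, 1000.0, 0.3, "Da"))
print(within_tolerance(1000.4, 1000.0, 0.3, "Da"))
```

A window that is tighter than the real calibration error of the instrument rejects the correct peptide before it can be scored.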


Mass type

Average or monoisotopic.
Monoisotopic mass is calculated from the most abundant natural isotope of each element; it is the first peak of the isotope distribution.
Average mass is the chemical mass, the centre of gravity of the isotope distribution.
The difference is approximately 0.06%.

If you get this setting wrong, the mass errors will be very large.
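The size of the difference can be checked on a small example. The sketch below uses standard residue masses for just five amino acids and a made-up peptide; the function name is hypothetical.

```python
# (monoisotopic, average) residue masses in Da for a few amino acids.
RESIDUES = {
    "G": (57.02146, 57.0519),
    "A": (71.03711, 71.0788),
    "V": (99.06841, 99.1326),
    "L": (113.08406, 113.1594),
    "K": (128.09496, 128.1741),
}
WATER = (18.01056, 18.01528)


def peptide_masses(peptide):
    """Return (monoisotopic, average) neutral masses of a peptide."""
    mono = sum(RESIDUES[aa][0] for aa in peptide) + WATER[0]
    avg = sum(RESIDUES[aa][1] for aa in peptide) + WATER[1]
    return mono, avg


mono, avg = peptide_masses("GAVLK")  # hypothetical peptide
print(f"monoisotopic {mono:.4f} Da, average {avg:.4f} Da")
print(f"relative difference {100 * (avg - mono) / mono:.3f} %")
```

At 2000 Da a 0.06% error is more than 1 Da, far outside any reasonable peptide tolerance.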


Charge

Used on the sequence query and MS/MS forms.
"1+" always means MH+, "1-" always means M-H-, etc.

For MALDI-PSD, the precursor will generally be MH+.
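The charge setting tells the search engine how to turn an observed m/z value into a neutral peptide mass. A minimal sketch, assuming ions are formed only by gaining or losing protons (the function name is hypothetical):

```python
PROTON = 1.00728  # mass of a proton in Da


def neutral_mass(mz, charge):
    """Neutral mass M from m/z; charge > 0 for n+ ions, < 0 for n- ions."""
    if charge == 0:
        raise ValueError("charge must be non-zero")
    # n+ ions are (M + nH)/n, n- ions are (M - nH)/n.
    return mz * abs(charge) - charge * PROTON


print(neutral_mass(1001.00728, 1))   # MH+
print(neutral_mass(501.00728, 2))    # (M + 2H) 2+
print(neutral_mass(998.99272, -1))   # (M - H) -
```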


Data (PMF)

Mass
The query window is used when there is no data file.
The data format is auto-detected.

List of mass values, one per line. If a second value is present, it is assumed to be an intensity. Any further values on the same line are ignored.
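The simple list format described above is easy to read programmatically. The parser below is a hypothetical sketch of that rule only (first value is the mass, optional second value is the intensity, the rest is ignored); it does not handle the other peak list formats and is not Mascot's own parser.

```python
def parse_peak_list(text):
    """Return a list of (mass, intensity) tuples; intensity may be None."""
    peaks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        mass = float(fields[0])
        intensity = float(fields[1]) if len(fields) > 1 else None
        peaks.append((mass, intensity))
    return peaks


example = """1000.5
1200.6 340 ignored
1534.8 120
"""
print(parse_peak_list(example))
```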

Mascot also supports other peak list formats


Applied Biosystems Data Explorer (.pkm)
Bruker Analysis AutoXecute data report
Bruker XML
mzData (1.05)
mzML

Data (MS/MS)

The format cannot be auto-detected, and must be specified.

Instrument

Type of instrument used to acquire the data. This setting determines which fragment ion series will be used for scoring


Report

Choose AUTO to display only the protein hits with significant scores.

One additional hit is shown after the cutoff at the significance threshold.


Final tip

Beware of:

Narrowing the taxonomy
Reducing mass tolerances
Removing modifications
Selecting spectra or mass values

Set search parameters using standard samples.


Types of Summary Report


Scoring and statistics


The result is a list of proteins.

Some matches are not statistically significant.

In this example the score threshold is 76 and the top-scoring match is 47, so no match is significant.
The area shaded green indicates random, meaningless matches.

37

Probability based scoring


The score reflects whether a match is random or not.
P is the probability that the observed match is a random event.
A real match, which is not random, has a very low probability.
Reject anything with a probability greater than a chosen threshold.
The Mascot score is -10*log10(P).
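The relation between probability and score is a simple logarithm, sketched below with hypothetical function names. How Mascot arrives at P for a given match is not shown here.

```python
import math


def mascot_score(p):
    """Score for a probability p that the match is a random event."""
    return -10 * math.log10(p)


def probability(score):
    """Inverse relation: probability implied by a score."""
    return 10 ** (-score / 10)


for p in (0.05, 1e-3, 1e-7):
    print(f"P = {p:g}  ->  score = {mascot_score(p):.1f}")
```

Each factor of ten in probability is worth ten points of score, so small probabilities become convenient positive numbers.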

38

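Significance thresholds

The threshold is calculated from the number of trials.

For a 1 in 20 chance of a false positive with 500000 entries: P = 1/(20 x 500000), which gives a score threshold of 70.

Standard score
MudPIT score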
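The threshold calculation can be reproduced directly from the formula on this slide. The sketch assumes the number of trials equals the number of database entries tested, as in the 500000-entry example; the function name is hypothetical.

```python
import math


def score_threshold(n_trials, significance=0.05):
    """Score above which a match has less than `significance` chance of being random."""
    p = significance / n_trials  # e.g. 1/(20 x 500000) for 0.05 and 500000 trials
    return -10 * math.log10(p)


print(round(score_threshold(500000), 1))     # Swiss-Prot sized database
print(round(score_threshold(19000000), 1))   # NCBI nr sized database
```

A larger database means more trials, so a higher score is needed to reach the same significance.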

39

Expectation value

The number of times you could expect to get this score or better by chance
E = Pthreshold * 10**((Sthreshold - score)/10)

A completely random match has an expectation value of 1 or more

The better the match, the smaller the expectation value.
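The expectation value formula translates directly into code. The example reuses the numbers from the scoring slide (threshold 76, top score 47) and assumes Pthreshold = 0.05; the function name is hypothetical.

```python
def expectation_value(score, score_threshold, p_threshold=0.05):
    """E = Pthreshold * 10**((Sthreshold - score) / 10)."""
    return p_threshold * 10 ** ((score_threshold - score) / 10)


for score in (47, 76, 86):
    e = expectation_value(score, score_threshold=76)
    print(f"score {score}: E = {e:.3g}")
```

A score exactly at the threshold gives E equal to Pthreshold, and every ten points above it lowers E by a factor of ten.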

40

Error tolerant search

Second pass exhaustive search of selected protein hits


Wide range of modifications and SNPs
Relaxed enzyme specificity
All fixed and variable mods retained
Allows for one additional unsuspected modification

41

Error tolerant search

Take query 218 as an example.
The observed mass difference could correspond to either carbamidomethylation or carboxymethylation at the N-terminus.
Since the sample was alkylated with iodoacetamide, carbamidomethylation is very believable; it is a known artefact of over-alkylation.
The search finds new matches by introducing mass shifts.

42

Phosphorylation site localization


Scores for confident site localization: Ascore, PTM score, and MD-score
MD-score: the score difference between the top two matches

Localization depends on the fragmentation technique

The ability to localize a site increases with increasing distance between candidate sites
The MD-score does not require complex computation
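Because the MD-score is just a difference between the two best scores, it takes very little code. The sketch below assumes you already have one score per candidate site assignment for the same peptide; the site labels, the scores, and the function name are hypothetical.

```python
def md_score(site_scores):
    """Return (best_site, delta) where delta is top score minus runner-up."""
    if len(site_scores) < 2:
        raise ValueError("need at least two candidate site assignments")
    ranked = sorted(site_scores.items(), key=lambda item: item[1], reverse=True)
    (best_site, best), (_, second) = ranked[0], ranked[1]
    return best_site, best - second


# Hypothetical ion scores for three candidate phosphorylation sites.
candidates = {"S3": 45.0, "T5": 30.5, "Y8": 12.0}
print(md_score(candidates))
```

A large difference suggests the top assignment is well supported; a difference near zero means the spectrum cannot distinguish the sites.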

43

Validation (Decoy)

The false discovery rate is most reliably estimated with a decoy database.

The decoy entries can be in a separate database or concatenated to the target entries.

44

Decoy

Search a decoy database


Very simple
Repeat the search against the decoy database.
Matches that are found in the decoy database are false positives.
It isn't useful when there is only a small number of spectra.
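With separate target and decoy searches, the false discovery rate can be estimated as the ratio of decoy matches to target matches above the same score threshold. This is a minimal sketch of that estimate with a hypothetical function name; a concatenated target-decoy search would need a different formula.

```python
def false_discovery_rate(target_matches, decoy_matches):
    """Estimate FDR as decoy matches divided by target matches."""
    if target_matches == 0:
        return 0.0
    return decoy_matches / target_matches


print(false_discovery_rate(1000, 10))  # large data set
print(false_discovery_rate(20, 1))     # small data set
```

With only 20 target matches a single decoy match moves the estimate by five percentage points, which is why the approach is of little use for a small number of spectra.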

45

Decoy

A utility creates a decoy database.
A reversed or randomised sequence of the same length is automatically generated and tested.
The average amino acid composition of the random sequences is the same as that of the target sequences.
The matches and scores for the decoy sequences are recorded separately in the result file.
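Both kinds of decoy sequence are simple to generate. The sketch below is not the Mascot utility: the function names and the example sequence are hypothetical, and the shuffled variant preserves the exact composition of each entry, which is a stricter condition than matching the average composition.

```python
import random


def reversed_decoy(sequence):
    """Decoy with the same length and composition, read backwards."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=0):
    """Decoy with the same length and composition in random order."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)  # seeded for reproducibility
    return "".join(residues)


target = "MKTAYIAK"  # hypothetical target entry
print(reversed_decoy(target))
print(shuffled_decoy(target))
```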

46

Mascot Daemon

Automates the submission of data files

Batch mode

Real-time monitor mode


Follow-up tasks

47

Mascot Distiller

Reads all of the popular data formats
Produces high quality peak lists
Submits and reviews Mascot search results
Performs de novo sequencing and interprets sequence tags for tag searches

48

References

http://www.matrixscience.com
Mikhail M. S., Simone L., Markus B., Manja L., Toby M., Marcus B., Bernard K., The American Society for Biochemistry and Molecular Biology (2011)
Ville R. Koskinen, Patrick A. Emery, David M. Creasy, and John S. Cottrell, Molecular and Cellular Proteomics (2011)
Elias, J. E. and Gygi, S. P., Nature Methods 4, 207-214 (2007)
