Identification of Proteins Through Mass Spectrometry Databases: Processing and Precautions
Syed Kashif Raza

Dr. Panjwani Center for Molecular Medicine and Drug Research, International Center for Chemical and Biological Sciences, University of Karachi
Molecular Medicine

Proteomics and Mass Spectrometry

Proteome: the complete set of proteins in a cell.
Current methodologies: 2D gel electrophoresis, protein microarrays, fluorescence microscopy, mass spectrometry, chromatography, nuclear magnetic resonance, microfluidics, microchips.
Mass spectrometry is an important technique in molecular and cell biology.
Recent advances automate much of the workflow: excision of protein spots, enzymatic digestion, acquisition of mass spectra, and database searching.
Techniques for modified proteins and for quantification have also been developed.

Servers available for Protein Identification through Mass Spectrometry

For PMF:
ASCQ_ME, BUPID, Mascot, MassSearch, MS-Fit (Protein Prospector), PepMAPPER, ProFound (Prowl), Mowse, PeptideSearch
For Sequence Query:
Mascot, MS-Seq (Protein Prospector), TagIdent
For MS/MS Ion Search:
Inspect, Mascot, MS-Tag (Protein Prospector), OMSSA, PepFrag (Prowl), PepProbe, RAId_DbS, Sonar (Knexus), X!Tandem (The GPM)

Mascot

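Software search engine from Matrix Science.
Uses mass spectrometry data to identify proteins from sequence databases.
Mascot is unique in integrating all of the proven search methods.
Widely used.
Free to use on the Matrix Science public web server.
A licence is required for in-house use.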
Mascot Server
Gives excellent results with peak lists from instruments manufactured by:
Agilent, Bruker, Thermo Scientific, Waters, AB Sciex, Shimadzu
Reasons for in-house use:
Data sets that exceed the 1200 spectrum limit
Confidentiality
Automation
Adding and editing modifications, enzymes, quantitation methods, etc.
The time taken by a search depends on the number of processors.
Three proven ways of using mass spectrometry data
Peptide mass fingerprint: uses the molecular masses of the peptides resulting from digestion of a protein by a specific enzyme.
Sequence query: mass values combined with amino acid sequence or composition data.
MS/MS ions search: uninterpreted MS/MS data from a single peptide or from a complete LC-MS/MS run.
Peptide Mass Fingerprint
Peak picking:
Find a utility to convert the raw data into a peak list.
The mass values matter most.
Get as many peptide masses as possible in the range 1000 to 3500 Da.
To perform a search:
Paste your peak list or upload it as a file.
Enter values for the search parameters.
After submission, you receive the results: a list of matching proteins.
Peptide Mass Fingerprinting

Fast, simple analysis.
High sensitivity.
Needs a database of protein sequences, not ESTs.
The sequence must be present in the database.
Not good for mixtures.
Start with Swiss-Prot.
A protein hit is significant if its expect value is below 0.05.
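
The idea behind a peptide mass fingerprint can be sketched in a few lines of Python. This is only an illustration, not Mascot's algorithm: it digests each protein with a simple trypsin rule (cleave after K or R, not before P), computes monoisotopic MH+ values and counts how many observed peaks fall within a tolerance. The two protein sequences, the 0.2 Da tolerance and all function names are assumptions made for the example; real search engines use probability-based scoring rather than a plain count.

```python
import re

# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056   # added once per peptide for the free termini
PROTON = 1.00728   # added to give the singly protonated ion, MH+


def tryptic_peptides(protein):
    """Cleave after K or R, except before P (no missed cleavages)."""
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]


def mh_plus(peptide):
    """Monoisotopic mass of the singly protonated peptide, MH+."""
    return sum(MONO[aa] for aa in peptide) + WATER + PROTON


def count_matches(protein, peaks, tol_da=0.2):
    """Number of observed peaks within tol_da of a theoretical MH+ value."""
    theoretical = [mh_plus(p) for p in tryptic_peptides(protein)]
    return sum(any(abs(obs - t) <= tol_da for t in theoretical) for obs in peaks)


# Hypothetical two-entry "database" and a peak list simulated from prot_a
# with a constant 0.05 Da measurement error.
database = {
    "prot_a": "MKWVTFISLLLLFSSAYSRGVFRR",
    "prot_b": "MADEEKLPPGWEKRMSR",
}
peaks = [mh_plus(p) + 0.05 for p in tryptic_peptides(database["prot_a"])]
ranked = sorted(database, key=lambda name: count_matches(database[name], peaks),
                reverse=True)
print(ranked)
```

The entry whose calculated peptide masses explain the most peaks comes first, which is why the correct sequence has to be present in the database for PMF to work.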
MS/MS Ions Search

Applies to a single protein or a complex mixture.
Chromatography is used to regulate the flow of peptides into the mass spectrometer.
Peptides are selected one at a time using the first stage of mass analysis.
Each isolated peptide is then induced to fragment.
The second stage of mass analysis is used to collect an MS/MS spectrum.
Software determines which peptide sequence in the database gives the best match.
The degree of matching is scored.

Fragment ion structures
Peptide molecular ions fragment at preferred locations along the backbone.
The major peaks are usually b and y ions.
The pattern depends on the ionization technique, the mass analyser, and the peptide structure.
If peptides fragmented cleanly, we would not need a database search.
A clean spectrum would show a ladder of peaks for each ion series.
Fragmentation is rarely perfect.

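The "ladder" can be made concrete with a short Python sketch that calculates singly charged b and y ion m/z values for an unmodified peptide. The function name and the test peptide are assumptions made for the example; only the two major series are covered, with no neutral losses or higher charge states.

```python
# Monoisotopic residue masses in Da (standard values, rounded to 5 decimals).
MONO = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276, "V": 99.06841,
    "T": 101.04768, "C": 103.00919, "L": 113.08406, "I": 113.08406, "N": 114.04293,
    "D": 115.02694, "Q": 128.05858, "K": 128.09496, "E": 129.04259, "M": 131.04049,
    "H": 137.05891, "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056
PROTON = 1.00728


def fragment_ions(peptide):
    """Return (b_ions, y_ions) as lists of singly charged m/z values.

    b_ions[i] holds the first i+1 residues (N-terminal fragment);
    y_ions[i] holds the last i+1 residues (C-terminal fragment).
    """
    masses = [MONO[aa] for aa in peptide]
    b_ions, y_ions = [], []
    for i in range(1, len(peptide)):
        b_ions.append(sum(masses[:i]) + PROTON)
        y_ions.append(sum(masses[-i:]) + WATER + PROTON)
    return b_ions, y_ions


b_ions, y_ions = fragment_ions("PEPTIDE")
for n, (b, y) in enumerate(zip(b_ions, y_ions), start=1):
    print(f"b{n} = {b:9.4f}   y{n} = {y:9.4f}")
```

Within one series, the spacing between neighbouring peaks equals a residue mass, which is what makes it possible to read sequence off a clean spectrum.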
MS/MS ion search

Results are complicated to report.
The report lists a series of proteins and the peptide matches that have been assigned to each.
The report uses a pop-up window to show the alternative peptide matches.
The top match has a high score.

MS/MS ion search

Easily automated.
Searches can be slow:
Without enzyme specificity
With several variable modifications
With a large dataset
With a large database
MS/MS is peptide identification; proteins are inferred from the peptide matches.

Sequence Query
Sequence tag search
Even when the quality of the spectrum is poor, it is often possible to pick out a minimum of four clean peaks.
From these, a few residues of amino acid sequence are interpreted.
Mann and Wilm realized that this very short stretch of amino acid sequence might provide sufficient specificity for identification if it is combined with the fragment ion mass values that enclose it, the peptide mass, and the enzyme specificity.
Picking out a good tag requires both luck and experience.
Requires interpretation of the spectrum.
Usually manual, hence not high throughput.
The tag has to be called correctly.

Peptide Sequence tag

The standard sequence tag search is obsolete.
It is easier to skip the interpretation step and pass the peak list to the search engine.
Advantages of a tag search: rapid search times, error tolerant.

Search parameters
Name, Email and Search Title
The name and email are saved as a browser cookie.
If Mascot security is enabled, this information is taken from the user database.
The email address is used for sending results.

Databases
Swiss-Prot (~500000 entries)
Best annotated database, ideal for PMF
NCBI nr and UniRef100 (~19000000 entries)
Large, comprehensive, best choice for MS/MS
EST databases (>400000000 entries in translation)
Huge, not advisable for PMF
Single genome databases
Not suitable for PMF
cRAP and contaminants databases

Database
How to choose the right database:
In Mascot 2.3 and later, you can select multiple databases.
You cannot mix amino acid (AA) and nucleic acid (DNA) databases.
Comprehensive database repositories, such as NCBI and EBI, offer downloads of nr, GenBank, Swiss-Prot, EMBL, TrEMBL, etc.
When searching a database for a single organism, always include a database of common contaminants.
If you are interested in a bacterium or a plant, try comprehensive protein databases, e.g. NCBInr and UniRef100.

Nucleic Acid Databases

Mascot always performs a 6-frame translation.
It translates the entire sequence and does not look for a start codon to begin.
When a stop codon is encountered, it leaves a gap.
It uses the correct genetic code, as long as the taxonomy is known.

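A 6-frame translation can be sketched in Python as below. This is a minimal illustration using the standard genetic code only, not Mascot's implementation; the function names, the "X" placeholder for unrecognised codons and the short test sequence are assumptions made for the example. Stop codons are written as "*", and splitting on them gives the stretches that lie between the gaps.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate from the first base onwards; no search for a start codon."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def six_frame(dna):
    """Three forward frames followed by three reverse-complement frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    return [translate(strand[offset:])
            for strand in (dna, reverse) for offset in range(3)]


def open_segments(translation):
    """Split a translated frame at stop codons, dropping empty pieces."""
    return [segment for segment in translation.split("*") if segment]


for frame in six_frame("ATGGCCTAAGGG"):
    print(frame, open_segments(frame))
```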
Taxonomy

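A taxonomy filter speeds up the search.
It gives a simpler report.
Keep the taxonomy indexes up to date.
Check the stats file for each database.
If the correct protein from the correct species may not be in the database, do not specify a very narrow taxonomy, so that a homologous protein from a related species can still match.
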
Enzyme
First choice: set the allowed missed cleavage sites to zero if you are confident that the digest is complete.
Choose a setting of 1 or 2 when you are not sure about your sample.
A higher number increases the number of calculated peptide masses.
Use no enzyme only in exceptional cases, and never for PMF.
The enzyme list is user configurable.

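The effect of the missed cleavage setting on the number of calculated peptides can be seen with a small Python sketch. It assumes a simple trypsin rule (cleave after K or R, not before P); the function name and the test sequence are made up for the example.

```python
import re


def digest(protein, missed_cleavages=0):
    """Tryptic peptides with up to `missed_cleavages` uncleaved sites each."""
    pieces = [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]
    peptides = []
    for start in range(len(pieces)):
        for extra in range(missed_cleavages + 1):
            if start + extra < len(pieces):
                # Join neighbouring fully cleaved pieces to model a missed site.
                peptides.append("".join(pieces[start:start + extra + 1]))
    return peptides


protein = "MKWVTFISLLLLFSSAYSRGVFRR"  # hypothetical test sequence
for allowed in range(3):
    print(allowed, len(digest(protein, allowed)))
```

Every extra allowed missed cleavage adds more candidate peptides per protein, so the search takes longer and random matches become more likely.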
Modifications

Fixed modifications
Variable modifications, e.g. post-translational modifications
Display all modifications
Keep the number of variable modifications small.
Some modifications are worse than others:
Modifications that affect a terminus are less of a problem, e.g. Pyro-glu.
Modifications that apply to residues with a high fractional abundance, at any position, are a big problem, e.g. Phospho (ST).

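Why some variable modifications are worse than others can be shown by counting the arrangements a search engine may have to consider. The sketch below treats every modifiable residue as independently modified or unmodified, which gives an upper bound of 2^n forms for n sites; real search engines cap this number. The function names and the peptide are assumptions made for the example.

```python
def modifiable_sites(peptide, residues):
    """Count residues in the peptide that a variable modification can occupy."""
    return sum(peptide.count(r) for r in residues)


def candidate_forms(peptide, residues):
    """Upper bound on arrangements: each site is either modified or not."""
    return 2 ** modifiable_sites(peptide, residues)


peptide = "ASTSPETSLSR"  # hypothetical peptide with four S and two T
print(candidate_forms(peptide, "ST"))  # Phospho (ST): 6 sites, 64 forms
print(candidate_forms(peptide, "M"))   # Oxidation (M): no sites, 1 form
```

A modification restricted to a terminus can only double the number of forms, whereas one that targets common residues multiplies it with every additional site.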
Modifications
Post-translational: phosphorylation, acetylation
Artifacts: oxidation, acetylation
Derivatization: alkylation of cysteine
Sequence variants: errors, SNPs, other variants
Take the complete list from Unimod.
The cysteine modification depends on the alkylating agent: iodoacetamide (carbamidomethyl), iodoacetic acid (carboxymethyl), or MMTS (methylthio).

Phosphorylation

Site heterogeneity
Poor ionization efficiency
Three fragmentation channels:
Intact fragments
Neutral loss of HPO3 (80 Da)
Neutral loss of H3PO4 (98 Da)
Can occur at S, T and Y, about 16% of residues

Protein mass

Mass of the intact protein in kDa.
If this field is left blank, there is no restriction on protein mass.
This setting can slow down the search a little.

Tolerance

Peptide tolerance: error window on experimental peptide mass values.
MS/MS tolerance: error window on fragment ion mass values.
Units: percentage, milli-mass units (mmu), parts per million (ppm), or Daltons (Da).
The protein and peptide views include a graph of the mass errors for fragment ions.
Specifying too tight a peptide tolerance is a common reason for failing to get a match.
In this example, a more appropriate MS/MS tolerance would be +/- 0.3 Da.

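The tolerance units are easy to confuse, so here is a small Python sketch that converts between them and checks whether a measured mass falls inside the error window. The function names and the unit labels used as dictionary keys are assumptions made for the example, not Mascot parameter names.

```python
def ppm_error(observed, theoretical):
    """Signed mass error in parts per million."""
    return (observed - theoretical) / theoretical * 1e6


def tolerance_in_da(mass, tolerance, unit):
    """Convert a tolerance in Da, mmu, ppm or % to Daltons at a given mass."""
    factors = {"Da": 1.0, "mmu": 1e-3, "ppm": mass * 1e-6, "%": mass * 1e-2}
    return tolerance * factors[unit]


def within_tolerance(observed, theoretical, tolerance, unit="ppm"):
    """True if the observed mass lies inside the error window."""
    return abs(observed - theoretical) <= tolerance_in_da(theoretical, tolerance, unit)


print(ppm_error(1000.01, 1000.0))
print(tolerance_in_da(2000.0, 50, "ppm"))
print(within_tolerance(1500.1, 1500.0, 50, "ppm"))
print(within_tolerance(1500.1, 1500.0, 0.2, "Da"))
```

An error of 0.1 Da at 1500 Da is about 67 ppm, so a 50 ppm window rejects a match that a 0.2 Da window accepts.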
Mass type

Average or monoisotopic.
Monoisotopic mass uses the most abundant natural isotopes: the first peak of the isotope distribution.
Average mass is the chemical mass: the centre of gravity of the isotope distribution.
The difference is approximately 0.06%.
If you get this setting wrong, the mass errors will be very large.

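The size of the difference can be checked from elemental compositions. The Python sketch below sums atoms for an unmodified peptide and evaluates the total with either monoisotopic or average atomic masses. The atomic masses are standard values rounded for the example; the function name and the test peptide are assumptions.

```python
# Atomic masses in Da: most abundant isotope versus natural average.
MONOISOTOPIC = {"C": 12.0, "H": 1.007825, "N": 14.003074, "O": 15.994915, "S": 31.972071}
AVERAGE = {"C": 12.0107, "H": 1.00794, "N": 14.0067, "O": 15.9994, "S": 32.065}
ELEMENTS = "CHNOS"

# Residue compositions as (C, H, N, O, S) atom counts.
RESIDUE_ATOMS = {
    "G": (2, 3, 1, 1, 0), "A": (3, 5, 1, 1, 0), "S": (3, 5, 1, 2, 0),
    "P": (5, 7, 1, 1, 0), "V": (5, 9, 1, 1, 0), "T": (4, 7, 1, 2, 0),
    "C": (3, 5, 1, 1, 1), "L": (6, 11, 1, 1, 0), "I": (6, 11, 1, 1, 0),
    "N": (4, 6, 2, 2, 0), "D": (4, 5, 1, 3, 0), "Q": (5, 8, 2, 2, 0),
    "K": (6, 12, 2, 1, 0), "E": (5, 7, 1, 3, 0), "M": (5, 9, 1, 1, 1),
    "H": (6, 7, 3, 1, 0), "F": (9, 9, 1, 1, 0), "R": (6, 12, 4, 1, 0),
    "Y": (9, 9, 1, 2, 0), "W": (11, 10, 2, 1, 0),
}


def peptide_mass(peptide, atomic_masses):
    """Neutral mass of an unmodified peptide for the given atomic mass table."""
    counts = [0, 2, 0, 1, 0]  # start with H2O for the free termini
    for aa in peptide:
        counts = [c + n for c, n in zip(counts, RESIDUE_ATOMS[aa])]
    return sum(n * atomic_masses[el] for el, n in zip(ELEMENTS, counts))


mono = peptide_mass("PEPTIDE", MONOISOTOPIC)
avg = peptide_mass("PEPTIDE", AVERAGE)
print(f"monoisotopic {mono:.4f}  average {avg:.4f}  "
      f"difference {100 * (avg - mono) / mono:.3f}%")
```

At 1500 Da a difference of this size is close to 1 Da, far outside a typical peptide tolerance, which is why the wrong setting produces very large mass errors.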
Charge

Used on the sequence query and MS/MS forms.
"1+" always means MH+, "1-" always means M-H-, etc.
For MALDI-PSD, the precursor is generally MH+.

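The charge setting determines how an observed m/z value is converted to a neutral peptide mass. A minimal Python sketch, assuming protonation for positive ions and deprotonation for negative ions (function names are made up for the example):

```python
PROTON = 1.00728  # mass of a proton in Da


def neutral_mass(mz, charge):
    """Neutral mass M from m/z; charge is +1 for MH+, -1 for M-H-, +2 for MH2 2+."""
    return mz * abs(charge) - charge * PROTON


def mz_from_mass(mass, charge):
    """Inverse conversion: m/z expected for a neutral mass at a given charge."""
    return (mass + charge * PROTON) / abs(charge)


print(neutral_mass(500.0, 2))
print(mz_from_mass(1000.0, 1))
print(mz_from_mass(1000.0, -1))
```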
Data (PMF)
The query window is used when there is no data file.
The data format is auto-detected.
The simplest format is a list of mass values, one per line.
If a second value is present, it is assumed to be the intensity.
Any further values on the same line are ignored.
Mascot also supports other peak list formats:
Applied Biosystems Data Explorer (.pkm)
Bruker Analysis AutoXecute data report
Bruker XML
mzData (1.05)
mzML

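The simple one-mass-per-line format can be read with a few lines of Python. This sketch follows the rules stated on the slide (first value is the mass, an optional second value is the intensity, anything further is ignored) and assumes whitespace-separated values; the function name is made up, and it is not Mascot's own parser.

```python
def parse_peak_list(text):
    """Return a list of (mass, intensity) tuples; intensity is None if absent."""
    peaks = []
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        mass = float(fields[0])
        intensity = float(fields[1]) if len(fields) > 1 else None
        peaks.append((mass, intensity))  # any further fields are ignored
    return peaks


example = """1234.56 1050 extra
2001.10

1780.92 312
"""
print(parse_peak_list(example))
```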
Data (MS/MS)
The format cannot be auto-detected, and must be specified.
Instrument

Type of instrument used to acquire the data.
This setting determines which fragment ion series will be used for scoring.

Report

Choose AUTO to display only protein hits with significant scores.
One additional hit is shown after the cutoff at the significance threshold.

Final tip
Beware of:
Narrowing the taxonomy
Reducing mass tolerances
Removing modifications
Selecting spectra or mass values
Set search parameters using standard samples.

Types of Summary Report
Scoring and statistics

The result is a list of proteins.
Some matches are not statistically significant.
In this example, the score threshold for the search is 76 and the top scoring match is 47, so the match is not significant.
The area shaded green indicates random, meaningless matches.

Probability based scoring

The score measures whether the match is random or not.
P is the probability that the observed match is a random event.
A real match, which is not random, has a very low probability.
Reject anything with a probability greater than a chosen threshold.
The Mascot score is -10 log10(P).

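The relation between probability and score is a plain logarithmic transform, sketched here in Python (function names are made up for the example):

```python
import math


def score_from_probability(p):
    """Score = -10 * log10(P): a lower probability gives a higher score."""
    return -10 * math.log10(p)


def probability_from_score(score):
    """Inverse transform: P = 10 ** (-score / 10)."""
    return 10 ** (-score / 10)


for p in (0.05, 1e-3, 1e-7):
    print(p, round(score_from_probability(p), 1))
```

Each factor of 10 in probability is worth 10 score units, so a score of 70 corresponds to P = 10^-7.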
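Significance thresholds
The threshold is calculated from the number of trials.
For a 1 in 20 chance of a random match among 500000 entries: P = 1/(20 x 500000) = 10^-7, which is a score of 70.
Protein scoring: standard score or MudPIT score.
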
Expectation value
The number of times you could expect to get this score or better by chance.
E = Pthreshold * 10**((Sthreshold - score)/10)
A completely random match has an expectation value of 1 or more.
The better the match, the smaller the expectation value.

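The threshold and expectation value formulas can be checked numerically with a short Python sketch. The function names are assumptions made for the example; the 0.05 significance level and the 500000 entries are the figures used on the slides.

```python
import math


def score_threshold(n_trials, alpha=0.05):
    """Score needed so that a random match is expected with probability alpha."""
    return -10 * math.log10(alpha / n_trials)


def expectation_value(score, threshold, p_threshold=0.05):
    """E = Pthreshold * 10 ** ((Sthreshold - score) / 10)."""
    return p_threshold * 10 ** ((threshold - score) / 10)


print(score_threshold(500000))        # 1 in 20 chance over 500000 entries
print(expectation_value(70, 70))      # a score at the threshold gives E = 0.05
print(expectation_value(47, 76))      # the example with threshold 76, score 47
```

A score exactly at the threshold gives E = 0.05, and a score of 47 against a threshold of 76 gives an expectation value of about 40, far above 1, which is why that match counts as random.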
Error tolerant search
Second pass exhaustive search of selected protein hits

Tests a wide range of modifications and SNPs.
Relaxes enzyme specificity.
All fixed and variable modifications are retained.
Allows for one additional unsuspected modification.

Error tolerant search
Take query 218 as an example: the observed mass difference could correspond to either carbamidomethylation or carboxymethylation at the N-terminus.
Since the sample was alkylated with iodoacetamide, carbamidomethylation is very believable; it is a known artefact of over-alkylation.
The error tolerant search finds new matches by introducing mass shifts.

Phosphorylation site localization

Scores for confident site localization: Ascore, PTM score and MD-score.
The MD-score is the score difference between the top two matches.
Localization depends on the fragmentation technique.
The ability to distinguish candidate sites increases with increasing distance between them.
The MD-score does not require complex computation.

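Because the MD-score is only a difference of two scores, it can be sketched in a few lines of Python. The site labels and score values below are invented for illustration, and returning the single score when there is only one candidate is an assumption of this sketch, not documented behaviour.

```python
def md_score(candidate_scores):
    """Return (best_site, delta) from a mapping of site assignment to score.

    delta is the score difference between the top two candidates.
    """
    ranked = sorted(candidate_scores.items(), key=lambda item: item[1], reverse=True)
    best_site, best = ranked[0]
    # Assumption: with a single candidate, report its full score as the delta.
    runner_up = ranked[1][1] if len(ranked) > 1 else 0.0
    return best_site, best - runner_up


# Hypothetical scores for the same peptide phosphorylated at different sites.
scores = {"S3": 52.0, "T5": 31.5, "S8": 12.0}
print(md_score(scores))
```

A large difference means one site assignment clearly explains the spectrum better than the next best alternative.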
Validation (Decoy)

Validation means estimating the false discovery rate.
The most reliable method is a decoy database.
Decoy entries can be kept in a separate database or concatenated to the target entries.

Decoy
Searching a decoy database is very simple:
Repeat the search against the decoy database.
Matches that are found in the decoy database are false positives.
It is not useful when there is only a small number of spectra.

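With separate target and decoy searches, the false discovery rate at a given score threshold can be estimated as the number of decoy matches divided by the number of target matches. A minimal Python sketch under that assumption (function name and score lists are invented for the example):

```python
def estimate_fdr(target_scores, decoy_scores, threshold):
    """Estimated FDR = decoy matches / target matches at or above the threshold."""
    targets = sum(score >= threshold for score in target_scores)
    decoys = sum(score >= threshold for score in decoy_scores)
    return decoys / targets if targets else 0.0


# Hypothetical peptide match scores from the two searches.
target_scores = [80, 65, 50, 45, 30]
decoy_scores = [48, 20, 10]
print(estimate_fdr(target_scores, decoy_scores, threshold=40))
print(estimate_fdr(target_scores, decoy_scores, threshold=49))
```

With only a handful of spectra the ratio jumps in large steps, which illustrates why the estimate is not useful for small data sets.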
Decoy

A utility is available to create a decoy database.
A reversed or randomised sequence of the same length is automatically generated and tested alongside each target sequence.
The average amino acid composition of the random sequences is the same as that of the target sequences.
The matches and scores for the decoy sequences are recorded separately in the result file.

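Both kinds of decoy sequence are simple to generate. The Python sketch below is an illustration only, not the Mascot utility: it reverses a sequence, or shuffles it with a fixed seed so that the result is reproducible. Shuffling keeps the composition of each individual sequence, which is a stricter condition than matching the average composition. Function names and the test sequence are assumptions.

```python
import random


def reversed_decoy(sequence):
    """Decoy made by reversing the target sequence."""
    return sequence[::-1]


def shuffled_decoy(sequence, seed=0):
    """Decoy made by shuffling residues; same length and composition."""
    residues = list(sequence)
    random.Random(seed).shuffle(residues)
    return "".join(residues)


target = "MKWVTFISLLLLFSSAYSRGVFRR"  # hypothetical target sequence
print(reversed_decoy(target))
print(shuffled_decoy(target))
```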
Mascot Daemon

Automates the submission of data files:
Batch mode
Real-time monitor mode
Follow-up tasks

Mascot Distiller
Reads all of the popular data formats.
Produces high quality peak lists.
Submits searches and reviews Mascot search results.
Performs de novo sequencing and interprets sequence tags for tag searches.