ABSTRACT: Missing values are a major issue in quantitative data-dependent mass spectrometry-based proteomics. We
therefore present an innovative solution to this key issue by introducing a hurdle model, which is a mixture between a binomial
peptide count and a peptide intensity-based model component. It enables dramatically enhanced quantification of proteins
with many missing values without having to resort to harmful assumptions for missingness. We demonstrate the superior
performance of our method by comparing it with state-of-the-art methods in the field.
bioRxiv preprint doi: https://doi.org/10.1101/782466; this version posted September 25, 2019. The copyright holder for this preprint (which was
not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available
under a CC-BY-NC 4.0 International license.
Label-free data-dependent quantitative mass spectrometry (MS) is the preferred method for deep and high-throughput identification and quantification of thousands of proteins in a single analysis [1]. However, this approach suffers from many missing values, which strongly reduce the number of quantifiable proteins [2, 3]. There are three common causes for this missingness: (1) true absence of signal, or signal below the detection limit in the MS1 spectrum; (2) lack of fragmentation and hence missed identification of the MS1 peak; and (3) failed identification of the acquired fragmentation spectrum. As a result, missingness is more likely to occur for low-abundant proteins and/or poorly ionizing peptides. However, missingness may also extend to mid- and even high-range intensities, e.g. when co-eluting peptides suppress an MS1 signal or when poor quality of an MS2 spectrum interferes with correct identification. Missingness due to lack of fragmentation can be mitigated by “matching between runs”, where unidentified MS1 peaks in one run are aligned to identified peaks in another run in narrow retention time and mass-over-charge (m/z) windows [4]. Nevertheless, missing values remain widespread; a survey of 73 recent public proteomics data sets from the PRIDE database demonstrates an average of 44% missing values (Fig. 1 a, Supplementary Table 1).

Figure 1. The impact of missing values and the superior performance of the hurdle model. (a) Missing values are highly prevalent in recent proteomics data. The histogram shows the distribution of the percentage of missing values in 96 peptides.txt files from 73 PRIDE projects with a PRIDE publication date in 2017 that applied shotgun proteomics to full or partial proteomes. Missingness ranges from 16 to 82%, with an average of 44% missing values. (b) Perseus imputation fails when too many missing values are present: the frequency distribution of the log2-transformed LFQ intensities in the CPTAC dataset becomes bimodal after Perseus imputation due to the many imputed values, as exemplified by run A5. Observed values (978 LFQ intensities) are colored in blue, imputed values (484 LFQ intensities) are colored in red. Similar results for the other runs can be seen in Supplementary Fig. 1. (c, d) In the CPTAC dataset (38 UPS1 and 1,343 yeast proteins), the hurdle model outperforms other methods. The fraction of true positive UPS1 proteins flagged as DA in conditions C vs. A and B vs. A is plotted as a function of the fraction of false positive yeast proteins in the total number of DA proteins.

A common solution for missingness is to impute the missing values. However, this comes at the expense of additional assumptions: k-nearest neighbors (kNN) imputation, for example, assumes intensity-independent missingness, while the popular software packages Perseus [5] (“Perseus imputation”, PI) and MSstats [6] assume missingness by low intensity. In reality, however, missing values originate from a mix of intensity-dependent and -independent mechanisms, which can moreover be strongly data set-specific [3, 7]. As a result, state-of-the-art imputation profoundly changes the distribution of protein-level intensities, as it typically produces a second mode due to the many imputed values (shown for MaxQuant/Perseus in Fig. 1 b and Supplementary Fig. 1–2, and for kNN in Supplementary Fig. 3–4). This can have a deep impact on the downstream differential analysis. We demonstrate these effects on the widely studied CPTAC dataset [8], where more than 47% of peptide data are missing. Because the spike-in concentrations in this dataset cover a wide range, the perils of imputation can be clearly shown. Indeed, while our MSqRob [9] quantification tool shows excellent performance when combined with PI for comparison C (high spike-in concentration: 2.2 fmol/µL) vs. A (lowest spike-in concentration: 0.25 fmol/µL) (Fig. 1 c), MSqRob with PI performs poorly for comparisons B (intermediate amount of spike-in: 0.74 fmol/µL) vs. A (Fig. 1 d) and C vs. B, where the differences in spike-in concentration are smaller (Supplementary Fig. 5, 6), thus rapidly accumulating false positives. MSqRob therefore omits imputation by default to avoid a severe loss in performance [10]. However, without imputation, intensity-based methods cannot cope with complete missingness in one condition, as it is impossible to calculate a fold change (FC) relative to a missing value. Thus, potentially interesting cases such as strong protein synthesis/stabilization or protein degradation remain undetected.

Conversely, peptide counting approaches naturally handle missing peptides in a run by a zero count. Yet relative quantification by peptide counting generally performs poorly (pink line in Fig. 1 c, d). This is because counting disregards the inherent abundance-intensity relationship (within a given dynamic range) for each peptide [11, 12]. However, differentially abundant (DA) proteins for which no FC can be estimated due to missingness do differ in peptide counts (Supplementary Fig. 7), and proteins with a higher concentration have on average both higher peptide ion intensities and higher peptide counts, as it is more likely that even their poorly ionizing peptides are detected. Combining intensity-based methods with counting-based methods to exploit this complementary information therefore seems promising. Webb-Robertson et al. combine intensity- and count-based statistics to filter out
peptides prior to differential analysis [13]. ProPCA uses principal components to summarize intensities and spectral counts into one value [14], and IDPQuantify uses Fisher’s method to combine the p-values from a two-sample t-test with those of a quasi-Poisson regression on spectral counts [15]. However, ProPCA’s combined metric lacks an intuitive interpretation and does not include any downstream statistical analysis. Like Perseus, IDPQuantify only handles simple pairwise comparisons without any possibility to correct for confounding effects. Moreover, none of these methods is able to pinpoint whether the statistical significance is driven by the count component, the intensity component, or both, thus making it impossible to interpret the results in terms of FCs.

Here, we introduce a hurdle model that unites the advantages of MSqRob with the complementary information present in peptide counts, that avoids unrealistic imputation assumptions, and that provides interpretable results. This hurdle model takes peptide-level information as input and consists of a mixture model of two components: (1) a binary component that distinguishes between log2-transformed peptide intensities that are either missing or observed; and (2) an MSqRob-based component to model the magnitude of log2 peptide intensities passing the detection hurdle. Inference on the parameters of both model components has an intuitive interpretation: the binary component can be used to assess differential detection (DD) and returns log odds ratios (log ORs), while the MSqRob component allows us to test for differential abundance (DA) of a protein and returns log2 fold changes (log2 FCs). When peptides are completely missing in one condition, the hurdle model reduces to a binomial model. However, when peptides are present in both conditions, the hurdle approach allows us to combine the information in the OR and FC test statistics, and can be used to draw inference on DD, DA, or both in a post-hoc analysis. Note that the peptide counts can be overdispersed, which is why the variance component is estimated via quasi-likelihood and the model is termed a quasi-binomial model.

EXPERIMENTAL SECTION

In this section, we first describe how we analyzed the frequency of missing values in recent PRIDE projects. Then, we discuss the nature and the origin of the CPTAC and the HEART datasets, followed by an overview of how these datasets were preprocessed and imputed with different strategies. We then introduce Perseus, MSstats, MSqRob, the quasi-binomial model and the hurdle model as methods for statistical inference. Finally, we refer to our GitHub page, which enables the reproduction of all the analyses in this publication.

Missing values in recent PRIDE projects
For our analysis of the frequency of missing values in recent datasets in PRIDE, we downloaded 96 peptides.txt files corresponding to 73 PRIDE projects that adhere to the following conditions:
- label-free shotgun proteomics,
- full proteomes or enriched subsets of the proteome (i.e. no projects where there was an enrichment step for a chemical modification and no projects investigating protein-protein interactions),
- published in PRIDE in 2017 and
- searched with MaxQuant and peptides.txt file available from PRIDE.
An overview of the projects and peptides.txt files with their number of missing values, number of observed values and percentage of missing values is given in Supplementary Table 1.

Data availability
We made use of two example datasets.

1. In the benchmark CPTAC study 6, which can be downloaded from https://cptac-data-portal.georgetown.edu/cptac/dataPublic/list?currentPath=%2FPhase_I_Data%2FStudy6&nonav=true, the human UPS1 standard is spiked in 5 different concentrations in a yeast proteome background [8]. This allows us to make comparisons for which the ground truth is known: when comparing two different spike-in conditions, only the UPS1 proteins are truly differentially abundant (DA), while the yeast proteins are not.
The three lowest spike-in conditions (A, B and C) from LTQ-Orbitrap at site 86, LTQ-Orbitrap O at site 65 and LTQ-Orbitrap W at site 56 were searched with MaxQuant 1.6.1.0 against a database containing 6,718 reviewed Saccharomyces cerevisiae (strain ATCC 204508/S288c) proteins downloaded from UniProt on September 14, 2017, supplemented with the 48 human UPS1 protein sequences provided by Sigma Aldrich. Carbamidomethylcysteine was set as a fixed modification, and methionine oxidation, protein N-terminal acetylation and N-terminal glutamine to pyroglutamate conversion were set as variable modifications. Detailed search settings are described in Supplementary Material.
We only performed pairwise comparisons between the 3 lowest spike-in concentrations because of the huge ionization competition effects in the higher spike-in concentrations, as described earlier [9].

2. The HEART dataset originates from a large-scale proteomics study of the human heart, where 16 different regions and 3 different cell types from three healthy human adult hearts were studied [16].
For our analysis, we made use of the peptides.txt and proteinGroups.txt files made available by Doll et al. in ProteomeXchange via the PRIDE Archive repository under the identifier PXD006675. Here, we limited ourselves to the data from 6 regions: the atrial and
ventricular septa (SepA and SepV), the left and right atrium (LA and RA) and the left and right ventricle (LV and RV). We compare the atrial regions (LA, RA and SepA) to the ventricular regions (LV, RV and SepV), as well as LA vs. RA and LV vs. RV.

Preprocessing for MSqRob and the quasi-binomial model
Peptide intensities obtained from MaxQuant’s peptides.txt file were log2-transformed and quantile normalized. Next, potential contaminants, reverse sequences and proteins that were only identified by peptides carrying a modification were removed from the data. Finally, proteins identified by only a single peptide were also removed.

Imputation methods
Missing values were either not imputed or imputed with kNN or Perseus imputation.

No imputation
For MSqRob without imputation, we additionally removed peptides that are only identified in a single run. Indeed, MSqRob directly models peptide intensities, and the peptide-specific effect for peptides with one identification cannot be estimated because no replicates are available. This additional filtering step is not required for MSqRob with imputation, because missing values in the peptide intensity matrix are imputed first.

kNN imputation
k-nearest neighbors (kNN) imputation calculates a Euclidean distance metric on all peptide intensities to find the k most similar peptides and imputes the missing value with the average of the corresponding values from the k neighbors [17]. We used the default value of k = 10 neighbors.

Perseus imputation
We call “Perseus imputation (PI)” the standard imputation approach from the popular proteomics computational platform Perseus [5]. The imputation is achieved by using the “Replace missing values from normal distribution” function in Perseus 1.0.6.7. We applied PI on peptide intensities for MSqRob with PI and on LFQ intensities for Perseus with PI.
PI first constructs a rescaled normal distribution with: (1) a downshifted mean equal to the average of all the observed data minus $d$ times the standard deviation of the observed data and (2) a standard deviation equal to $w$ times the standard deviation of the observed data. Missing values are then imputed with random draws from this rescaled distribution. We adopt Perseus’

Statistical inference
In this subsection, we discuss differential protein analysis with Perseus and MSstats, followed by MSqRob, the quasi-binomial model and our hurdle model. All methods mentioned below model the data protein by protein. To improve readability, we will suppress the protein indicator in the remainder of this section.

Perseus
For the CPTAC dataset, LFQ intensities were imported into Perseus 1.6.0.7. Potential contaminants, reversed sequences and proteins only identified by peptides carrying modification sites were removed from the data. Next, empty columns were removed and, via “Categorical annotation rows”, runs were sorted according to their spike-in conditions (A, B or C). Data were either imputed with PI (“Perseus with imputation”) or not (“Perseus without imputation”). Finally, we performed two-sample t-tests for each of the three comparisons.
For the atrial vs. ventricular comparison in the HEART dataset, Doll et al. provided the results of their Perseus with imputation approach in their Supplementary Data 5 file, available from https://static-content.springer.com/esm/art%3A10.1038%2Fs41467-017-01747-2/MediaObjects/41467_2017_1747_MOESM7_ESM.xlsx.

MSqRob
MSqRob has been described in detail in Goeminne et al. (2016) [9]. Briefly, for each protein, the log2-transformed peptide intensities $y_{pr}$ for peptide $p = 1, \ldots, P$ in run $r = 1, \ldots, R$ are modeled as follows:

\[ y_{pr} = \beta^{0} + \mathbf{x}_{pr}^{T} \boldsymbol{\beta} + \beta_p^{\text{peptide}} + u_r^{\text{run}} + \varepsilon_{pr}, \qquad \text{(Eq. 1)} \]

where $\beta^{0}$ is the intercept, $\beta_p^{\text{peptide}}$ is the fixed effect of the individual peptide sequence $p$, $u_r^{\text{run}}$ is a random run effect ($u_r^{\text{run}} \sim N(0, \sigma_u^2)$) that accounts for the correlation of peptide intensities $y_{jr}$ and $y_{kr}$ from the same protein within the same run $r$, $\mathbf{x}_{pr}^{T} = (x_{1,pr}, \ldots, x_{m,pr})$ is a vector with the covariate pattern of the $m$ remaining predictors, $\boldsymbol{\beta} = [\beta_1, \ldots, \beta_m]^{T}$ is a vector of parameters modeling the effect of each predictor on the peptide intensity conditionally on the remaining covariates, and $\varepsilon_{pr}$ is the error term that is assumed to be normally distributed ($\varepsilon_{pr} \sim N(0, \sigma^2)$).

For the CPTAC and HEART datasets, the MSqRob model can be written as follows:

\[ y_{pr} = \beta^{0} + \sum_{t=1}^{T} x_{t,pr}^{\text{treat}} \beta_t^{\text{treat}} + \sum_{b=1}^{B} x_{b,pr}^{\text{block}} \beta_b^{\text{block}} + \beta_p^{\text{peptide}} + u_r^{\text{run}} + \varepsilon_{pr} \qquad \text{(Eq. 2)} \]
HEART dataset), $\beta_b^{\text{block}}$ is a blocking factor (lab $b = 1$ to 3 for CPTAC, patient $b = 1$ to 3 for HEART), and $x_{t,pr}^{\text{treat}}$ and $x_{b,pr}^{\text{block}}$ are dummy variables which are equal to 1 if run $r$ corresponds to treatment $t$ or block $b$, respectively, and 0 otherwise. $\beta_p^{\text{peptide}}$ is the effect of the individual peptide sequence $p$, and $u_r^{\text{run}}$ is a random effect that accounts for the fact that peptides within each run $r$ are correlated. Due to the parameterization of the model, the following restrictions apply: $\sum_{t=1}^{T} \hat{\beta}_t^{\text{treat}} = 0$, $\sum_{b=1}^{B} \hat{\beta}_b^{\text{block}} = 0$, $\sum_{p=1}^{P} \hat{\beta}_p^{\text{peptide}} = 0$ and $\sum_{r=1}^{R} \hat{u}_r^{\text{run}} = 0$.

The effect sizes $\beta_{(.)}^{(.)}$ (except for the intercept) are estimated using penalized regression. Distinct ridge penalties are used for the treatment, block and peptide parameters, respectively, and the ridge penalties are tuned by exploiting the link between ridge regression and mixed models (see e.g. Ruppert et al. (2003), chapter 4 [18]). Outliers are accounted for using M-estimation with Huber weights. Protein-wise degrees of freedom are now calculated in a less liberal way as $R - (1 + (T - 1) + (B - 1))$. Variances are stabilized by borrowing information over proteins using limma’s empirical Bayes approach, which results in a moderated t-test. Scripts to run MSqRob are provided on our GitHub page (see below under “Code availability”).

MSstats
MSstats has been described by Choi et al. (2014) [6]. For our analysis, we used the default settings of MSstats version 3.12.2. During preprocessing, feature intensities are log2-transformed and normalized by equalizing the run medians. Then, missing values are imputed with the default MSstats settings. Features are summarized to the protein level with Tukey’s median polish method. Treatment effects are specified as “Condition” and blocking factors as “BioReplicate” in the MSstats workflow.

Quasi-binomial model
Assuming that the number of observed peptides $n_r^{\text{peptide}}$ in each run $r$ (after preprocessing) for a protein is binomially distributed,

\[ n_r^{\text{peptide}} \mid \mathbf{x}_r \sim \text{Binomial}(P, \pi_r), \qquad \text{(Eq. 3)} \]

with $P$ the total number of unique peptides observed over all runs $r = 1, \ldots, R$ for this protein, and $\pi_r$ the probability to identify a peptide for this protein, the peptide counts can be modeled using logistic regression. However, they are often overdispersed with respect to the binomial distribution. We therefore adopt quasi-binomial regression (McCullagh and Nelder (1989), section 4.5 [19]), where we model the first two moments (mean $E[n_r^{\text{peptide}}]$ and variance $\text{Var}[n_r^{\text{peptide}}]$) as follows:

\[ E[n_r^{\text{peptide}}] = P \pi_r, \qquad \text{(Eq. 4)} \]

\[ \text{logit}(\pi_r) = \log\left(\frac{\pi_r}{1 - \pi_r}\right) = \mathbf{x}_r^{T} \boldsymbol{\beta}, \qquad \text{(Eq. 5)} \]

\[ \text{Var}[n_r^{\text{peptide}}] = \varphi P \pi_r (1 - \pi_r), \qquad \text{(Eq. 6)} \]

with $\varphi$ a dispersion parameter that accommodates a more flexible variance function than that of the binomial regression model. Similar to the original edgeR count model for gene expression, information is borrowed over the dispersion parameters of all models with the empirical Bayes procedure of the limma R package [20]. To avoid overly liberal results from inaccurate variance estimates, underdispersed $\varphi$ estimates are set to 1.
Note that the quasi-binomial approach models the log odds, i.e. the logarithm of the probability that a peptide is detected divided by the probability that a peptide is not detected (the odds of detection). Hence, the model will return log ORs when contrasting different treatments to each other, which can be used to draw inference on differential peptide detection for a specific protein.

Hurdle model
The normalized intensities for each peptide $p$ in each run $r$ are typically assumed to follow a log-normal distribution. Upon log2-transformation, missing (zero) intensities are set at $-\infty$ and cannot be modeled with intensity-based methods such as Perseus, MSstats and MSqRob. Missing values are therefore either omitted or imputed.
Here, we consider a hurdle model that consists of two parts: a binary component $z_{pr}$ that distinguishes between peptide intensities in run $r$ that are missing ($z_{pr} = 0$) or observed ($z_{pr} = 1$) with detection probability $\pi_r$; and a normal component $y_{pr}$ with mean $\mu_{pr}$ and variance $\sigma^2$ to model log2 peptide intensities passing the detection hurdle. Note that the detection probability $\pi_r$ and the mean $\mu_{pr}$ can be further parameterized using peptide-specific effects, a random run effect $u_r^{\text{run}}$, and additional covariates $\mathbf{x}_{pr}$. More formally, the hurdle model for log2-transformed intensities $y_{pr}$ for peptide $p = 1, \ldots, P$ in run $r = 1, \ldots, R$ can be specified as follows:

\[ z_{pr} \mid \mathbf{x}_{pr} \sim \text{Bernoulli}(\pi_r) \qquad \text{(Eq. 7)} \]

\[ y_{pr} \mid z_{pr} = 1, \mathbf{x}_{pr}, u_r^{\text{run}} \sim N(\mu_{pr}, \sigma^2) \qquad \text{(Eq. 8)} \]

The log-likelihood for $\boldsymbol{\pi} = [\pi_1, \ldots, \pi_R]^{T}$, $\boldsymbol{\mu} = [\mu_{11}, \ldots, \mu_{PR}]^{T}$ and $\sigma^2$ given $\mathbf{y} = [y_{11}, \ldots, y_{PR}]^{T}$, $\mathbf{z} = [z_{11}, \ldots, z_{PR}]^{T}$, $\mathbf{u}^{\text{run}} = [u_1, \ldots, u_R]^{T}$, and $\mathbf{X}$ the matrix with rows $\mathbf{x}_{pr}^{T}$ can then be written as:

\[ l(\boldsymbol{\pi}, \boldsymbol{\mu}, \sigma^2 \mid \mathbf{y}, \mathbf{z}, \mathbf{X}, \mathbf{u}) = \sum_{pr} (1 - z_{pr}) \log(1 - \pi_r) + \sum_{pr} z_{pr} \log(\pi_r) + \sum_{pr} z_{pr} \log[N(\mu_{pr}, \sigma^2)] \qquad \text{(Eq. 9)} \]

Note that the log-likelihood implies an estimation orthogonality between $\pi_r$ and $\mu_{pr}$, and that the first two terms in the equation are equivalent to the log-likelihood of a Bernoulli process. Further, we omit a peptide-
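As a rough illustration of the two likelihood ingredients described in this section, the Python sketch below evaluates the hurdle log-likelihood of Eq. 9 for a single protein and computes a Pearson-type estimate of the dispersion parameter from Eq. 6. The function names, the toy inputs and the choice of a Pearson estimator are assumptions made for illustration only. They are not taken from the authors' scripts, which in addition borrow information across proteins with limma's empirical Bayes procedure.

```python
import math


def hurdle_log_likelihood(z, y, pi, mu, sigma2):
    """Evaluate Eq. 9 for one protein (illustrative sketch).

    z[p][r]  : 1 if peptide p is observed in run r, 0 if it is missing
    y[p][r]  : log2 intensity; only read where z[p][r] == 1
    pi[r]    : detection probability of run r, assumed strictly in (0, 1)
    mu[p][r] : mean log2 intensity of peptide p in run r
    sigma2   : variance of the normal component
    Returns (bernoulli_part, normal_part, total).
    """
    bernoulli_part = 0.0
    normal_part = 0.0
    for p, z_row in enumerate(z):
        for r, z_pr in enumerate(z_row):
            if z_pr == 1:
                # Second term of Eq. 9: observed peptide contributes log(pi_r).
                bernoulli_part += math.log(pi[r])
                # Third term of Eq. 9: log normal density of the observed intensity.
                resid = y[p][r] - mu[p][r]
                normal_part += (-0.5 * math.log(2.0 * math.pi * sigma2)
                                - resid ** 2 / (2.0 * sigma2))
            else:
                # First term of Eq. 9: missing peptide contributes log(1 - pi_r).
                bernoulli_part += math.log(1.0 - pi[r])
    return bernoulli_part, normal_part, bernoulli_part + normal_part


def pearson_dispersion(counts, n_peptides, pi_hat, n_params):
    """Pearson-type estimate of phi in Var[n_r] = phi * P * pi_r * (1 - pi_r).

    Assumption: a Pearson chi-square statistic divided by the residual degrees
    of freedom; the source does not state which estimator is used.
    Underdispersed estimates (phi < 1) are set to 1, as described in the text.
    """
    chi2 = sum(
        (n - n_peptides * p) ** 2 / (n_peptides * p * (1.0 - p))
        for n, p in zip(counts, pi_hat)
    )
    phi = chi2 / (len(counts) - n_params)
    return max(phi, 1.0)


if __name__ == "__main__":
    # Toy protein with two peptides and two runs; peptide 0 is missing in run 1.
    z = [[1, 0], [1, 1]]
    y = [[20.0, None], [22.5, 21.0]]
    mu = [[20.2, 19.8], [22.0, 21.4]]
    print(hurdle_log_likelihood(z, y, pi=[0.9, 0.6], mu=mu, sigma2=0.25))
    print(pearson_dispersion([2, 8], n_peptides=10, pi_hat=[0.5, 0.5], n_params=1))
```

Because the Bernoulli and normal parts are returned separately, the sketch makes the orthogonality noted above visible: the detection probabilities only enter the first return value and the means and variance only enter the second, so the two components can be fitted independently.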
500 and 1,000 most differential proteins for each method is given in Supplementary Fig. 13.

For our novel approach we also observe a strong association between the OR and FC estimates (Fig. 2 b). Out of the 1,765 hurdle identifiers that are significant at the 5% FDR level, 397 were significantly DD, 815 significantly DA, 440 both significantly DD and DA, and 113 could not be attributed to either DA or DD in our post hoc analysis. All 440 identifiers declared both DD and DA have an FC and OR estimate in the same direction (i.e. both up or both down).

Figure 2. The hurdle model detects both DD and DA proteins. (a) The Venn diagram shows the overlaps in the 1,500 most significant proteins when contrasting atrial (average of left atrium, LA; right atrium, RA; atrial septum, SepA) to ventricular regions (average of left ventricle, LV; right ventricle, RV; ventricular septum, SepV) in the HEART dataset between hurdle, MSqRob without prior imputation, and Perseus with imputation and SAM as implemented by Doll et al. (b) Proteins that are significant with the hurdle model often display a strong positive association between FC and OR. Here, log2 FCs are plotted as a function of log2 ORs in the atrial vs. ventricular comparison for all 6,489 proteins with an FC estimate. Proteins are colored by p-value. Common FDR cut-offs are indicated on the color bar.

As in the original publication, comparing the left to the right ventricle did not render any significant proteins. However, when comparing the left to the right atrium, we found eight significant proteins compared to only three in the original publication (Supplementary Table 3). The higher abundance of the muscle contraction protein myotilin [22] in the left atrium was confirmed by both methods, but the hurdle model additionally indicates the myosin heavy chain MYH7 as DA. Moreover, the hurdle model also flags the voltage-dependent calcium channel CACNA2D2 and BMP10, an essential factor for embryonal cardiac development [23]. Finally, SERINC3 and PNMA1, which were declared DA by Doll et al., actually come with very limited evidence for DA, and their significance is likely driven by the inherent randomness of PI (Supplementary Table 4).

CONCLUSIONS
In summary, we demonstrated the negative impact of imputation in protein quantification, and the potential flaws it induces in the downstream analysis when imputation assumptions are violated. We therefore propose a hurdle approach that can handle missingness without having to resort to imputation and its associated assumptions by leveraging both an intensity-based approach (MSqRob) and a count-based quasi-binomial approach. Our hurdle approach outcompetes state-of-the-art methods with and without imputation, both when missingness is strong and when it is limited. Moreover, our hurdle method continues to provide valid inference when peptides are absent in one condition. It is therefore a highly promising approach to detect the sudden appearance of post-translationally modified peptides in protein regulation alongside more traditional differential protein expression.

ASSOCIATED CONTENT
Supporting Information
supplementary_informationHurdle.pdf: an extended methods section and all supplementary figures. (PDF)

supplementary_table1.xlsx: Overview of the projects and files used in Fig. 1a together with the number of missing values, the number of observed values and the percentage of missing values in each file. (XLSX)

supplementary_table2.xlsx: Overview of the relevant protein-level output from the hurdle model (sheet 2), the Perseus with imputation and SAM approach used by Doll et al. (sheet 3), MSqRob (sheet 4) and the quasi-binomial model (sheet 5) when comparing the atrial to the ventricular regions in the HEART dataset. (XLSX)

supplementary_table3.xlsx: Overview of the log2 FC, log2 OR, χ2 statistics, p-values and Benjamini-Hochberg-corrected p-values (q-values) when comparing the left to the right atrium with the hurdle model. (XLSX)

supplementary_table4.xlsx: Overview of the data underlying proteins SERINC3, PNMA1, ACSM2A, PDE7B and PNPLA7. (XLSX)

supplementary_table5.xlsx: Overview of the mock treatment levels and the condition-lab combinations for each MS run in the mock analysis. (XLSX)
[Figure 1, graphic. (a) “% missing values in 96 searches”: histogram of frequency versus % missing values. (b) “Perseus imputation run A5”: histogram of frequency versus log2(LFQ intensity). (c, d) True positive rate versus false discovery proportion for the hurdle model, MSqRob with Perseus imputation, MSqRob without imputation, MSstats, the quasi-binomial model, Perseus with imputation and Perseus without imputation.]
[Figure 2, graphic. (a) “Atrial vs. ventricular regions”: Venn diagram of hurdle, MSqRob without imputation, and Perseus with imputation and SAM. (b) Scatter plot with 1%, 5% and 10% FDR cut-offs marked on the color bar; labeled proteins include MYBPHL, ADH1C;ADH1A, NPPA, PAM, MYH6, MYL7, METTL7B, MYL3, MYL2, TMEM159, FHL2, RHOJ, LPL and SMYD2.]