Professional Documents
Culture Documents
* To whom correspondence should be addressed. Tel: +49 62213878473; Fax: +49 62213878306; Email: patil@embl.de
C The Author(s) 2018. Published by Oxford University Press on behalf of Nucleic Acids Research.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0/), which
permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
Nucleic Acids Research, 2018, Vol. 46, No. 15 7543
gene essentiality, or incorrect nutritional requirements. The dia (Figure 1C), the top-down approach is able to infer the
manual curation step is time-consuming and includes repet- uptake/secretion capabilities of an organism from genetic
itive tasks that must be performed for every new reconstruc- evidence alone (Figure 1D), making it especially suitable
tion. for organisms that cannot be cultivated under well-defined
In this work, we present CarveMe, a new reconstruction media. Finally, CarveMe automates the creation of micro-
tool that shifts this paradigm by implementing a top-down bial community models by merging selected sets of single-
reconstruction approach (Figure 1). We begin by recon- species models into community-scale networks.
structing a universal metabolic model, which is manually
curated for the common problems mentioned above. No-
tably, this universal model is simulation-ready: it includes MATERIALS AND METHODS
import/export reactions, a universal biomass equation, and Universal model building
contains no blocked or unbalanced reactions. Subsequently,
for every new reconstruction, the universal model is con- CarveMe provides a Python script to build an universal
verted to an organism-specific model using a process called draft model of metabolism by downloading all reactions
’carving’ (see Materials and Methods section for details). and metabolites in the BiGG database (28) into a single
In essence, this process removes reactions and metabolites SBML file (BiGG version 1.3 was used during this work).
not predicted to be present in the given organism, while All associated metadata are stored as SBML annotations.
preserving all the manual curation and relevant structural The same script assists with automated curation tasks to
properties of the original model. The lack of manual in- build a final universal model (as described next).
tervention makes this process automatable and the recon-
struction of multiple species can be easily parallelized (Fig-
Universal bacterial model
ure 1B). Unlike traditional (bottom-up) gap-filling, which
iteratively adds reactions to enable growth on different me- The draft model built from the entire BiGG database was
manually curated to generate a fully-functional universal
7544 Nucleic Acids Research, 2018, Vol. 46, No. 15
model of bacterial metabolism. First, all reactions and phosphatidylethanolamines, murein and a lipopolysaccha-
metabolites present in eukaryotic compartments were re- ride unit. We also provide a template for cyanobacteria that
moved. An universal biomass equation was then added contains a thylakoid compartment for photosynthesis, and
to the model. This equation was adapted from the Es- a template for archaea that contains ether lipids in the cell
cherichia coli biomass composition (29) in accordance to a membrane composition and lacks peptidoglycans.
recent study on universal biomass components in prokary- In all cases, the final biomass composition is normalized
otes (30). to represent 1 gram of cell dry weight. During reconstruc-
Next, we constrained reaction reversibility to eliminate tion, the user can select the universal bacterial template
thermodynamically infeasible phenotypes. Lower and up- or one of the specialized templates. Furthermore, a utility
per bounds for the Gibbs free energy change for each reac- to automatically generate customized templates with user-
tion (Gr ) were estimated using the formula: provided biomass compositions is included.
forcing network connectivity (i.e. gapless pathways): ated with the respective reactions: 1) forward direction pre-
ferred; –1) backward direction preferred; 0) reaction should
max s (y + y )
T f r
not occur.
s.t. Hard constraints are simply a list of flux bounds (vimin <
vi < vimax ). They can be used to force or block the utilization
S· v = 0 of any given reaction (please note that they can also make
the MILP problem infeasible so they should be used with
v > −Myr + εy f
care).
v < −εyr + My f
vi > 0 ∀i ∈ {forward irreversible} Gap-filling
All the models were used to simulate substrate utiliza- higher specificity (especially for S. oneidensis), but per-
tion (Biolog phenotype arrays) and gene essentiality, and form worse on other metrics. For E. coli, the CarveMe
the results were systematically compared against published model performs very closely to the iML1515 model, out-
experimental data. Two exceptions were M. genitalium (no competing the modelSEED model. In the case of B. sub-
Biolog data available) and R. solanacearum (no gene essen- tilis, both CarveMe and modelSEED perform considerably
tiality data available). The predictive ability of the models worse than the iYO844 model. This difference in perfor-
was evaluated in terms of multiple metrics: accuracy, preci- mance can be explained by the lack of annotated trans-
sion, sensitivity, specificity, and F1-score (see Methods for porters for B. subtilis, which have been manually added in
details on the experimental setup, metrics description, and the curated model.
data sources). To test the influence of transporter annotation, we re-
Notably, four out of five models generated with CarveMe peated the simulations for all models, this time allowing
were able to reproduce growth on minimal media from free diffusion of any metabolites without associated trans-
genome data alone, without specifying any growth re- porters (Supplementary Figure S1). It can be observed that
quirements during reconstruction or performing any ad- for B. subtilis the automated reconstructions now perform
ditional gap-filling after reconstruction. The other model more closely to the manually curated model, with the mod-
(R. solanacearum) required one single gap-filling reaction elSEED reconstruction outperforming CarveMe. For the
(asparagine synthetase) to reproduce growth on minimal other organisms, one can also observe a smaller perfor-
medium. All the models generated with modelSEED re- mance gap between the manually curated models and the
quired the specification of the minimal media during re- automatic reconstructions, with CarveMe generally outper-
construction. Without this a priori information, the models forming modelSEED.
were unable to reproduce growth on the given media.
Benchmark 2: Gene essentiality. Regarding the ability to
Benchmark 1: Biolog phenotype arrays. Regarding sub- predict gene essentiality, the same general pattern can be
strate utilization, it can be observed that the performance observed, with CarveMe models displaying a performance
of CarveMe models is in-between the performance of man- in-between the manually curated models and modelSEED
ually curated models and those generated with modelSEED (Figure 3). Two exceptions are M. genitalium (modelSEED
(Figure 2). In some cases the modelSEED models display outperforms CarveMe for most metrics) and S. oneidensis
7548 Nucleic Acids Research, 2018, Vol. 46, No. 15
(iSO783 and modelSEED display similar performance, be- This has been explored by Biggs and Papin (51), showing
ing both outperformed by CarveMe). that when gap-filling a model for multiple growth media,
One common pattern for most CarveMe models (and the selected order of the media influences the final network
modelSEED to some extent) is a slightly decreased sen- structure. Another source of ambiguity is the degeneracy of
sitivity with regard to gene essentiality (i.e. many essen- solutions when solving an optimization problem to gener-
tial genes reported as false negatives) when compared ate the model itself. To tackle this issue, CarveMe allows the
to the manually curated models. One possible reason is generation of model ensembles. The user can select the de-
the utilization of generic biomass templates that exclude sired ensemble size, and then use the ensemble model for
non-universal cofactors (30) and other organism-specific simulation purposes in order to explore alternative solu-
biomass precursors. To test the influence of biomass com- tions.
position on gene essentiality prediction, we built mod- We generated model ensembles (N = 100) for all organ-
els for E. coli and B. subtilis using three distinct biomass isms and repeated the phenotype array simulations and
compositions: our universal (core) biomass compositon, gene essentiality predictions using different voting thresh-
our Gram-negative and Gram-positive compositions, and olds (10%, 50%, 90%) to determine a positive prediction (see
the organism-specific compositions taken directly from the Methods). To confirm that the ensemble generation is unbi-
manually curated models (Supplementary Figure S2). Con- ased, we calculated the Jaccard distance between every pair
firming to our hypothesis, we observed a gain in sensitivity of models in each ensemble (Supplementary Figure S3A).
(and consequently in the F1-score) when using organism- We observed that all ensembles show some extent of vari-
specific biomass compositions. Nonetheless, the specialized ability with the pairwise distances following normal distri-
Gram-negative and Gram-positive compositions perform butions. The ensemble for M. genitalum shows the highest
closer to the organism-specific compositions than to the variability (average distance of 0.48), whereas the E. coli
core biomass composition. Hence, the utilization of special- ensemble shows the lowest variability (average distance of
ized templates seems to be a suitable compromise when an 0.04). This distance is a reflection of the degree of uncer-
organism-specific biomass composition is not available. tainty in the network structure.
Interestingly, we observe that the variability in network
Benchmark 3: Ensemble modeling. A general limitation in structure does not necessarily reflect into phenotype vari-
building genome-scale metabolic models, given the available ability, with the different models within an ensemble often
data, is the existence of several equally plausible models.
Nucleic Acids Research, 2018, Vol. 46, No. 15 7549
predicting the same phenotypes. Regarding the phenotype isms, ’Cofactor and Prosthetic Group Biosynthesis’, is also
array simulations (Supplementary Figure S3B–F), we can the same for both collections. Overall, this shows that the
generally observe an increase in sensitivity at the cost of de- quality of the CarveMe models (even without the Tramon-
creased specificity with respect to the single model recon- tano et al. data) is comparable to the manually-curated
structions. With regard to gene essentiality (Supplementary AGORA collection, which further highlights the advan-
Figure S3G–K), there are no observable differences between tages of our top-down curation approach.
the ensemble model simulations and the results obtained
with single models. This result is most likely a reflection of
Large-scale model reconstruction and community models
the robustness of cellular function to perturbations in the
network structure during evolution (52,53). In both scenar- To demonstrate the scalability of CarveMe, and to pro-
ios, the results seem to be independent of the voting thresh- vide the research community with a collection of recon-
community, 1000 assemblies for each community size) and modelSEED, making CarveMe especially suited for situa-
calculated their metabolic interaction potential (MIP score, tions where the growth requirements are not known a priori
as defined in (11)). In brief, the MIP score of a commu- (e.g. microbes that cannot be cultivated under well-defined
nity is a measure of the number of compounds that can be media). In case a generated model is not able to reproduce
exchanged between the community members, allowing the growth on a given medium, CarveMe can additionally per-
community to reduce its dependence on the environmental form gap-filling to enable the expected phenotype. The im-
supply of nutrients. plemented gap-filling approach differs from methods that
The results obtained with this collection (Figure 4E) simply minimize the number of added reactions (61), by pri-
closely match those previously obtained with a smaller oritizing reactions according to their gene homology scores,
model collection (1503 bacterial strains, generated with similarly to the likelihood-based method proposed by Bene-
modelSEED) (11). In particular, while the potential for in- dict and co-workers (62). Note that our approach differs
teraction increases with the community size, the number from the latter in the sense that the gap-filling is applied to a
of interactions normalized by the community size shows a model that was top-down generated from a fully-functional
maximum at a relatively small community size (6–7 species). universal model.
We used the BiGG database (28) to build our universal
model, which offers some advantages compared to other
DISCUSSION
commonly used reaction databases such as KEGG, Rhea,
We introduced CarveMe, an automated reconstruction tool or modelSEED (21,26,63). First, it was built by merging
for genome-scale metabolic models that implements a novel several genome-scale reconstructions (ranging from bac-
top-down reconstruction approach. Our extensive bench- terial to human models), which leverages on the cura-
mark shows that the performance of CarveMe models, in tion efforts that were applied to these models. Secondly,
terms of reproducing experimental phenotypes, is compa- BiGG reactions contain gene-protein-reaction associations,
rable to that of manually curated models and often exceeds compartment assignments, and human-readable identifiers.
the performance of models generated with modelSEED. These aspects facilitate the reconstruction process and the
Notably, many of our models correctly reproduced growth utilization of the generated models. However, BiGG is
on minimal media without any growth requirements being limited in size and scope compared to other databases,
specified during reconstruction. This was not observed with
Nucleic Acids Research, 2018, Vol. 46, No. 15 7551
which may result in lack of coverage of certain metabolic Finally, we have given particular attention to mak-
functions. A comparison between the reaction contents of ing CarveMe an easy-to-use tool. While the underly-
BiGG, modelSEED and KEGG (Supplementary Figure ing carving procedure can work with genetic evidence
S6), shows that BiGG contains a large number of reactions alone (i.e. without requiring the specification of a medium
which are not present in the other databases (most likely a composition), any collected experimental data (such as
consequence of the manual curation efforts), but lags be- growth media, known auxotrophies, metabolic exchanges,
hind in terms of total coverage of EC numbers. We ex- presence/absence of reactions) can be provided in a sim-
tracted all KEGG cross-references from the BiGG database ple tabular format as additional input for reconstruction.
and mapped them into KEGG’s global pathway map (Sup- Also, it implements a modular architecture that facilitates
plementary Figure S7). As expected, primary metabolism modification/integration of new components. For instance,
is essentially complete, whereas peripheral pathways asso- sequence alignments are performed with diamond (35) or
3. Machado,D. and Herrgård,M.J. (2015) Co-evolution of strain design 24. Dias,O., Rocha,M., Ferreira,E.C. and Rocha,I. (2015) Reconstructing
methods based on flux balance and elementary mode analysis. Metab. genome-scale metabolic models with merlin. Nucleic Acids Res., 43,
Eng. Commun., 2, 85–92. 3899–3910.
4. Brochado,A., Matos,C., Møller,B.L., Hansen,J., Mortensen,U.H. 25. Karp,P.D., Latendresse,M., Paley,S.M., Krummenacker,M.,
and Patil,K. (2010) Improved vanillin production in bakers yeast Ong,Q.D., Billington,R., Kothari,A., Weaver,D., Lee,T.,
through in silico design. Microb. Cell Factor., 9, 84. Subhraveti,P. et al. (2015) Pathway Tools version 19.0 update:
5. Kim,T.Y., Kim,H.U. and Lee,S.Y. (2010) Metabolite-centric software for pathway/genome informatics and systems biology. Brief.
approaches for the discovery of antibacterials using genome-scale Bioinformatics, 17, 877–890.
metabolic networks. Metab. Eng., 12, 105–111. 26. Kanehisa,M. and Goto,S. (2000) KEGG: kyoto encyclopedia of
6. Shlomi,T., Benyamini,T., Gottlieb,E., Sharan,R. and Ruppin,E. genes and genomes. Nucleic Acids Res., 28, 27–30.
(2011) Genome-scale metabolic modeling elucidates the role of 27. Thiele,I. and Palsson,B.Ø. (2010) A protocol for generating a
proliferative adaptation in causing the Warburg effect. PLoS Comput. high-quality genome-scale metabolic reconstruction. Nat. Protoc., 5,
Biol., 7, e1002018. 93–121.
7. Väremo,L., Nookaew,I. and Nielsen,J. (2013) Novel insights into 28. King,Z.A., Lu,J., Dräger,A., Miller,P., Federowicz,S., Lerman,J.A.,
44. Sung,J., Kim,S., Cabatbat,J., Jang,S., Jin,Y., Jung,G., Chia,N. and (2017) Generation of genome-scale metabolic reconstructions for 773
Kim,P. (2017) Global metabolic interaction network of the human members of the human gut microbiota. Nat. Biotechnol., 35, 81.
gut microbiota for context-specific community-scale analysis. Nat. 58. Nam,H., Lewis,N.E., Lerman,J.A., Lee,D.-H., Chang,R.L., Kim,D.
Commun., 8, 15393. and Palsson,B.O. (2012) Network context and selection in the
45. Wattam,A.R., Abraham,D., Dalay,O., Disz,T.L., Driscoll,T., evolution to enzyme specificity. Science, 337, 1101–1104.
Gabbard,J.L., Gillespie,J.J., Gough,R., Hix,D., Kenyon,R. et al. 59. Notebaart,R.A., Szappanos,B., Kintses,B., Pál,F., Györkei,Á.,
(2013) PATRIC, the bacterial bioinformatics database and analysis Bogos,B., Lázár,V., Spohn,R., Csörgő,B., Wagner,A. et al. (2014)
resource. Nucleic Acids Res., 42, D581–D591. Network-level architecture and the evolutionary potential of
46. Bornstein,B.J., Keating,S.M., Jouraku,A. and Hucka,M. (2008) underground metabolism. Proc. Natl. Acad. Sci. U.S.A., 111,
LibSBML: an API library for SBML. Bioinformatics, 24, 880–881. 11762–11767.
47. Pruitt,K.D., Tatusova,T. and Maglott,D.R. (2007) NCBI reference 60. Guzmán,G.I., Utrilla,J., Nurk,S., Brunk,E., Monk,J.M., Ebrahim,A.,
sequences (RefSeq): a curated non-redundant sequence database of Palsson,B.O. and Feist,A.M. (2015) Model-driven discovery of
genomes, transcripts and proteins. Nucleic Acids Res., 35, D61–D65. underground metabolic functions in Escherichia coli. Proc. Natl.
48. Monk,J.M., Lloyd,C.J., Brunk,E., Mih,N., Sastry,A., King,Z., Acad. Sci. U.S.A., 112, 929–934.