Professional Documents
Culture Documents
7
0019-9567/07/$08.00⫹0 doi:10.1128/IAI.00105-07
Copyright © 2007, American Society for Microbiology. All Rights Reserved.
MINIREVIEW
National Institute of Allergy and Infectious Diseases Bioinformatics
Resource Centers: New Assets for Pathogen Informatics䌤
John M. Greene,1 Frank Collins,2 Elliot J. Lefkowitz,3 David Roos,4 Richard H. Scheuermann,5
Bruno Sobral,6 Rick Stevens,7 Owen White,8 and Valentina Di Francesco9*
SRA International, Inc., Health Research Systems, Rockville, Maryland 208521; University of Notre Dame, Department of
Biological Sciences, Notre Dame, Indiana 465562; University of Alabama at Birmingham, Department of Microbiology, Birmingham,
Alabama 35294-21703; University of Pennsylvania, Penn Genomics Institute, Philadelphia, Pennsylvania 19104-60184; University of
The National Institute of Allergy and Infectious Diseases faces or Web services) to pathogens’ genomic and related data
(NIAID) began a new bioinformatic venture in July 2004 in- that are stored in a relational database management system,
tended to integrate the vast amount of genomic and other such as Oracle, Sybase, or MySQL. Such data are to include
biological data that are both available and being produced by the genome sequences of multiple strains of these organisms
the rapid increase in biodefense research. Eight Bioinformatics and related plasmids, protein sequences, annotations, single
Resource Centers for Biodefense and Emerging/Re-Emerging nucleotide polymorphisms (SNPs), microarray data, epitope
Infectious Disease (BRCs) were funded to provide the re- data, proteomic data, and epidemiological data, i.e., as much
search community working on a variety of pathogens access to data as possible to facilitate comparative genomics. Eventually,
integrated genomic data to aid in the discovery and develop- the BRCs will evolve to contain data relevant to host-pathogen
ment of innovative therapeutics, vaccines, and diagnostics for interactions. The variety of data types supported by the BRCs
these pathogenic organisms. The initial term of this effort is 5 is dependent on several factors, including the abundance of
years, and overall nearly $100 million dollars was awarded, data available either in public databases or in published liter-
facilitating the creation of what is likely to be the foremost ature, specific requests from the scientific community for the
bioinformatic effort to date dedicated to human pathogens. pathogens being supported, and ultimately the expertise of the
Pathogenic species of biodefense and special public health BRCs’ staff members. The eight awards are shown in Table 1.
interest have been classified by NIAID into three high- It is important that although the BRCs’ focus is mostly on
priority categories based on their relative capabilities for genomic information about organisms that could be manipu-
causing morbidity or mortality from disease in case of bio- lated and used as biological weapons, all data produced by the
warfare (categories A, B, and C [http://www3.niaid.nih.gov BRCs are freely and publicly available to the research com-
/Biodefense/bandc_priority.htm]). The phylogenic domains con- munity. In addition, the BRC program has a policy of publicly
tained in those categories not only encompass a large variety of releasing any type of new data placed in the BRCs within 6 to
microbial species and viral families but also embrace a small 12 months from data deposition, and immediately upon pub-
group of eukaryotic unicellular pathogens (Giardia and Ent- lication. Any data supplied to the BRCs are to be credited to
amoeba), protozoa such as Plasmodium and Toxoplasma, a their source, and genome annotations should have evidence
group of fungal species (Microsporidia), and a plant (Ricinus codes signifying whether the annotations were produced ex-
communis, or castor bean). Invertebrate vectors of human perimentally, computationally, or by other means. Standard
pathogens are also included in the scope of the BRCs, such as operating procedures will soon be released to describe how
Aedes aegypti, which is a vector for pathogen-caused diseases annotations were generated and to share the process with the
such as West Nile encephalitis, dengue hemorrhagic fever, and scientific community. The BRCs were required, in their pro-
filarial diseases such as elephantiasis. posals, to design robust, flexible, and scalable computational
Among the chief goals of the BRCs is to offer users easy systems and infrastructure in support of the upcoming deluge
Web access and graphical user interfaces as well as other types of genomic data for the targeted species. All software devel-
of software interfaces (such as application programming inter- oped from this program is to be made freely available as open
source software.
A major focus is outreach to the research community both to
* Corresponding author. Mailing address: NIH, NIAID, DMID, establish and then serve its needs and to make its members
6610 Rockledge Dr., Room 6004, MSC 6603, Bethesda, MD 20850-6603.
Phone: (301) 496-1884. Fax: (301) 480-4528. E-mail: vdifrancesco@niaid
aware of the significant resources provided by these new cen-
.nih.gov. ters. Examples of outreach activities performed by the BRCs
䌤
Published ahead of print on 9 April 2007. include training on the use of the bioinformatic tools and
3212
VOL. 75, 2007 MINIREVIEW 3213
ApiDB (http://www.apidb.org), David Roos Apicomplexan species, including Toxoplasma gondii (1), University of Pennsylvania,
(University of Pennsylvania) Cryptosporidium spp. (2), Theileria spp. (2), and University of Georgia
Plasmodium spp. (8)
BioHealthBase (http://www.biohealthbase.org), Giardia lamblia (*), Mycobacterium spp. (10), influenza Northrop Grumman Information
Richard Scheuermann (University of Texas virus (10,880), Francisella tularensis (5), Technology, University of Texas
Southwestern Medical Center) Encephalitozoon cuniculi (1), and Southwestern Medical Center,
Ricinus communis (*) Vecna Technologies, Amar
International
ERIC (http://www.ericbrc.org), John Greene Diarrheagenic Escherichia coli (10), Shigella spp. (9), SRA International, Inc., University
(SRA International, Inc.) Salmonella spp. (6), Yersinia enterocolitica (0), of Wisconsin—Madison
and Yersinia pestis (7)
Pathema (http://pathema.tigr.org), Owen Bacillus anthracis (10), Clostridium botulinum (1), TIGR, University of Maryland
White (TIGR) Burkholderia mallei (7), Burkholderia pseudomallei
(7), Clostridium perfringens (3), and Entamoeba
histolytica (1)
PATRIC (http://patric.vbi.vt.edu), Bruno Brucella sp. (4), Coxiella burnetii (4), Rickettsia sp. (10), VBI, Loyola University Medical
Sobral (VBI) caliciviruses (77), coronaviruses (217), hepatitis A Center, Social and Scientific
virus (47), hepatitis E virus (65), and Systems, University of Maryland,
lyssaviruses (15) Swiss Institute of Bioinformatics
VBRC (http://www.vbrc.org), Elliott Viral pathogens belonging to the following families: University of Alabama—Birmingham;
Lefkowitz (University of Arenaviridae (106), Bunyaviridae (857), Filoviridae University of Victoria, British
Alabama—Birmingham) (27), Flaviviridae (498), Paramyxoviridae (169), Columbia, Canada; Columbia
Poxviridae (106), Togaviridae (83), and University
hepatitis C virus (*)
VectorBase (Invertebrate Vectors of Human Anopheles gambiae (1), Aedes aegypti (1), Culex pipiens University of Notre Dame; European
Pathogens 关http://www.vectorbase.org兴), (1), Pediculus humanus (1), and Ixodes scapularis (1) Bioinformatics Institute; Imperial
Frank Collins (University of Notre Dame) College of London; Institute of
Molecular Biology and
Biotechnology, Crete; Harvard
University; Purdue University;
University of California—Riverside
a
*, will soon be available.
databases, booths and live demos at scientific meetings, and ensuring that the BRCs are properly responding to those
hands-on workshops or training sessions provided free of needs.
charge to groups of researchers. All of the BRCs have pre-
sented posters and talks at many conferences on infectious
diseases, specific organisms, bioinformatics, and biodefense. INTEROPERABILITY AND BRC CENTRAL
In addition to these direct interactions with the research A great deal of emphasis is being placed on interoperability
community, each BRC has a scientific working group (SWG) and establishing collaborations among the eight BRCs. A por-
of about 10 experts whose expertise ranges over the biology of tal site, BRC Central (http://www.brc-central.org/), links to all
that BRC’s pathogens, bioinformatics, evolutionary genomics, eight websites for the awarded BRCs. BRC Central contains
biodefense, and infectious disease. SWGs meet one or two information required from each BRC about common data
times a year to advise and provide feedback to each BRC on types, conferences and meetings to be attended by BRC rep-
data sources, useful analysis tools, and user interfaces. Mem- resentatives, notices of upcoming training on BRC tools, and
bers of the SWGs are also involved in outreach activities. publications by the BRCs and has a comprehensive list of
Altogether, the BRC program is advised by approximately 80 released, downloadable software tools of various types, includ-
prominent investigators who help to provide vision for the ing tools for comparative genomics, computational annotation
BRCs, develop community ties, and embody the needs of the pipelines, annotation systems, sequence alignments and align-
community of researchers working on each of these pathogens, ment editors, and genome viewers. BRC Central also provides
3214 MINIREVIEW INFECT. IMMUN.
search capabilities across all the BRCs, including searches for able with annotation; however, tens of strains of these species
keywords, gene attributes, and organism names and sequence are currently being sequenced by either NIAID-funded se-
BLAST searches against a data set of protein sequences or quencing centers (see the microbial sequencing center effort
contigs of the organisms supported by the BRCs. The website described below) or other sequencing sources and will become
also permits downloads of flat files of sequences and annota- available in the next 1 to 3 years. The amount of genome
tion data from all the BRCs in the generic feature format, sequence data that will be generated and made publicly avail-
version 3 (GFF3) file format (http://www.sequenceontology able by the largest sequencing centers is such that no large
.org/gff3.shtml). group of human curators will be able to perform a systematic
The GFF3 file format was adopted to provide consistency in annotation revision of each sequenced genome. Therefore, to
genome feature information. The full specifications for the ensure high-quality, efficient, and accurate annotation, espe-
BRC GFF3 file format are available at iowg.brcdevel.org cially across multiple strains of a single species, it is essential to
/gff3.html, including the list of feature types and attribute tags. rely on computational annotation pipelines that take full ad-
This document also describes a convention for storing genome, vantage of comparative genomic tools for ortholog identifica-
organism, and molecule descriptions in GFF3. The usage con- tion and to identify genome rearrangements or lateral gene
ventions are meant to be wholly compatible with the GFF3 transfer that may be indicative of similar mechanisms of patho-
related enteropathogens. Pathema and NMPDR support spe- Building upon previous work of the biological and bioinfor-
cific data types for numerous microbial genomes. Below, we matic community has allowed BioHealthBase to quickly create
present a brief summary of each BRC, in alphabetical order. a single cohesive resource for this wide variety of data to
The list of coinvestigators on each BRC is also provided to elucidate host-pathogen interactions.
acknowledge the valuable contributions of key individuals to One area of integration has been to combine information
the project. regarding the localization of immunological epitopes with se-
ApiDB. ApiDB (http://www.apidb.org) is an integrated BRC quence polymorphism analyses for influenza virus. Predicted
for protozoan pathogens (Table 1) in the phylum Apicomplexa, major histocompatibility complex class I cytotoxic T-cell epitopes
including the category B agents Cryptosporidium parvum and and validated immunological epitopes from the NIAID Im-
Toxoplasma gondii, and multiple (re)emerging infectious disease mune Epitope Database for influenza virus are localized to-
species within the phylum Plasmodium (the causative agents of gether with sequence conservation scores to indicate genomic
malaria). A federation of three databases (PlasmoDB, CryptoDB, regions under host selective pressure and protein motifs that
and ToxoDB) in ApiDB permits users to simultaneously query can serve as prospective vaccine candidates. Future work will
across all three. layer these data onto three-dimensional protein structure data
The Genomics Unified Schema relational database system to support an understanding of the relationships between pro-
literature, using SRA International, Inc.’s powerful NetOwl Pathema. Pathema, developed at TIGR, is a website (http:
suite of tools, which promise to be able to extract heretofore //pathema.tigr.org) composed of multiple databases and other
difficult-to-find semantic relationships from the literature for computer resources that are meant to serve as an online focal
the BRC Program. point for the biodefense community (Table 1).
ERIC is now providing training seminars on the use of the As mentioned earlier, Pathema has developed and released
annotation system, which will be expanded over time to include resource guide pages for Pathema’s category A organisms,
the use of other tools, such as the mAdb microarray applica- namely, the botulinum neurotoxin and Bacillus anthracis (an-
tion. John Greene of SRA International, Inc., is the PI for thrax) resource guides. Each guide contains 10 pages contain-
ERIC; the coinvestigators are Nicole Perna and Frederick ing curated links to resource providers and primary data orga-
Blattner, both of the University of Wisconsin—Madison. nized into topic-oriented Web pages, which include sequence,
NMPDR. NMPDR (http://www.nmpdr.org) contains the strain, structure, antibody, therapeutic, vaccine, inhibitor, pro-
complete genomes of nearly 50 strains of pathogenic bacteria tocol, reference, and related links. The C. botulinum guide has
that are the focus of its annotators, as well as more than 400 (as been reviewed by a number of botulinum researchers and will
of February 2007) other genomes that provide a broad context continue to be developed in close collaboration with the sci-
for comparative analysis. The NMPDR focus organisms are entific community. In addition to updating existing pages and
Owen White of TIGR is the PI for Pathema; the coinvesti- ing in a searchable, comprehensive minireview of gene func-
gator is Steven Salzberg of the University of Maryland, College tion relating genotypes to biological phenotypes, with special
Park. emphasis on pathogenesis. The VBRC also includes a variety
PATRIC. The PathoSystems Resource Integration Center of analytical and visualization tools on its website to aid in the
(PATRIC; http://patric.vbi.vt.edu) integrates genomic and as- understanding of the available data, including tools for genome
sociated data types for three genera of proteobacteria and five annotation, comparative analysis, whole-genome alignments,
single-stranded RNA viruses (Table 1). and phylogenetic analysis. Finally, an important aspect of the
PATRIC aims to provide a standardized curation for each ongoing work is to solicit feedback from the scientific commu-
pathosystem’s genomes. Curation is based on existing annota- nity, with the goals of enhancing and extending the VBRC,
tions and the output of PATRIC’s Genome Sequence Anno- thereby making it both used and useful in support of basic and
tation Pipeline and Protein Annotation Pipeline (5). Impor- applied research on these viral pathogens.
tantly, PATRIC works with organism experts to ensure that Elliott Lefkowitz of the University of Alabama—Birming-
curation efforts meet the expectations of the user base. These ham is the PI of VBRC; the coinvestigator for the project is
experts guide the selection of the reference genomes that re- Chris Upton at the University of Victoria, Canada.
ceive thorough curation. Gene annotations are then propa- VectorBase. VectorBase (http://www.vectorbase.org) is re-
VectorBase; coinvestigators include William Gelbart of Har- itating the identification and refinement of molecular targets to
vard University; Ewan Birney of the European Bioinformatics develop vaccines, therapeutics, diagnostics, and countermea-
Institute, United Kingdom; Kitsos Louis of the Institute of sures.
Molecular Biology and Biotechnology, Crete, Greece; and Fo-
tis Kafatos of the Imperial College, United Kingdom. ACKNOWLEDGMENTS
Conclusions. The NIAID-funded BRC program is tasked
The Bioinformatics Resource Centers for Biodefense and Emerg-
with supporting genomic and related data for a variety of ing/Re-emerging Infectious Disease are funded under R&D contracts
different pathogenic organisms. The BRCs import data from from the Division of Microbiology and Infectious Diseases, National
existing repositories, generate related data types, store the data Institute of Allergy and Infectious Diseases, National Institutes of
in databases, analyze them, and provide for investigator access Health. For financial disclosure purposes, J.M.G. is an employee and
stockholder of SRA International, Inc.
via user interfaces.
We thank our coinvestigators and staffs for their hard work and
As these BRCs mature, a significant challenge remaining express our gratitude to each other’s teams for the selfless cooperation
will be to develop user interfaces not only for experts but also between the BRCs.
for relative novices. This will also impact training on these
systems, as in addition to in-person workshops, manuals and REFERENCES
Editor: J. B. Kaper