Fallwinter03 PDF

NCBI News
National Center for Biotechnology Information

National Library of Medicine
National Institutes of Health Fall 2003
Department of Health and Human Services Winter 2004
Register Your Genome

Entrez Query Goes “Global”
Project Online at NCBI
The Entrez search and retrieval sys- Entrez Home Page which provides a
tem now offers a cross-database cross-database search box and a list- NCBI builds a reference sequence,
search that allows a single query to ing of the Entrez databases that can or RefSeq, for each complete genom-
span the traditional NCBI-sequence be searched in tandem. Question ic sequence in GenBank® as well as
databases; Nucleotide and Protein; mark icons to the right of each data- for contigs of incomplete genomes.
the literature databases, such as base name lead to descriptions of These RefSeqs are used to integrate
PubMed®, PMC, Books, OMIM™, database content. The database the genomic sequence with genomic
Journals, and MeSH; the structurally- names and icons link to homepages and other data at NCBI to provide
oriented databases, Structures, the where single-database queries can be pre-computed analyses that allow
Conserved Domain Database, 3D- constructed using lists of database- researchers to uncover relationships
Domains; the NCBI Taxonomy, specific field restrictions, or tables between sequences from different
Gene Expression Omnibus (GEO), that can be used to define search organisms, and between a sequence
Single Nucleotide Polymorphisms limits. In addition, the Entrez home- and a biological function. In addi-
(SNPs), Population Sets, Genomes, page toolbar provides links to popu- tion to the reference human genome,
Sequence Tagged Sites, UniGene, lar NCBI tools and resources such as and other major eukaryotic genomes,
Gene-centered information (Gene), the Map Viewer and BLAST®. Entrez Genomes includes sequence
and, finally, the NCBI Web site itself. for complete viral, microbial, and
Cross-Database Search Results
The cross database search option, organellar genomes representing a
labeled “Entrez” on the NCBI When a cross-database search is diverse group of model organisms,
homepage search menu, replaces completed, the number of matches pathogens, and organisms of envi-
'GenBank' as the default. for each database is displayed in ronmental importance. Since the
boxes adjacent to the database name utility of a genomic sequence to the
The New Entrez Home Page as shown in Figure 2 on the next
The Entrez toolbar link on the continued on page 2
continued on page 4
NCBI Home Page leads to a new
In this issue
1 Entrez Query Goes “Global”
1 Register Your Genome Project Online

at NCBI
3 New Genome Builds and Annotations
4 Entrez Gene Database Debuts
6 Recent Publications
6 New Microbial Genomes in GenBank
6 KOGs and COGs Now in CDD
7 Submission Corner
8 GenBank Release 139
8 UniGene Adds Four

Figure 1. Entrez homepage showing the new cross-database search engine with links to the 21 Entrez
databases covered. 8 RefSeq Version 3 Released
Register Your Genome Project for the addition of links to NCBI
NCBI News continued from page 1 resources, such as the Map Viewer
and genomic BLAST. Sequencing
centers are encouraged to
begin submitting mapping
NCBI News is distributed four times
a year. We welcome communication
and sequence data at the
from users of NCBI databases and early stages of the project in
software and invite suggestions for order to make it available to
articles in future issues. Send corre- the research community
spondence to NCBI News at the
quickly.
address below. To subscribe to NCBI
News, send your name and address to
To make the registration of a
either the street or E-mail address below.
sequencing project as simple
NCBI News as possible, NCBI has creat-
National Library of Medicine
ed an online form, shown in
Bldg. 38A, Room 3S-308
8600 Rockville Pike
Figure 1, for the submission
Bethesda, MD 20894 of project information. The
Phone: (301) 496-2475 form allows for the entry of
Fax: (301) 480-9241 general information such as
E-mail: info@ncbi.nlm.nih.gov
the name of the organism to
Editors be sequenced, the name of
Dennis Benson the sequencing center, and
David Wheeler Figure 1. Views of the first Web page of the Genome Project sub- the sequencing strategy to be
mission form. The pulldown shows the four "sequence availability"
Contributors options. used. In addition, the condi-
Medha Bhagwat tions under which the project
Monica Romiti research community increases dra- data, including sequences, are to be
Eric Sayers matically when it is placed in the submitted to NCBI can be specified.
context of other genomic sequences, A project can be listed or unlisted,
Writers
Vyvy Pham
NCBI strongly encourages the sub- and may be listed without sequence
David Wheeler mission of genomic data ranging data. Sequences may be held until
from mapping information to com- publication, the default, released
Editing and Production
Robert Yates
plete chromosomal sequences with immediately, or made available for
annotation. BLAST searches only. The form can
Graphic Design
also be used to set up an FTP site
Robert Yates NCBI is working in close collabora-
for the upload of data to NCBI, or
tion with many sequencing centers to
In 1988, Congress established the to specify a URL to be used by
National Center for Biotechnology provide new and updated informa-
NCBI for the download of project
Information as part of the National Library tion about ongoing genome sequenc-
or sequence data. The Genome
of Medicine; its charge is ing programs for organisms ranging
to create information systems for Project submission form along with
from microbes to multicellular plants
molecular biology and genetics links to detailed instructions for the
and animals. Sequencing centers can
data and perform research in submission of genomic sequence
computational molecular biology. register a sequencing project with
and annotations, can be found under
NCBI prior to the submission of any
The contents of this newsletter may "Submitting" in the sidebar of the
data registered. Sequencing projects
be reprinted without permission. Entrez Genomes page, or at:
The mention of trade names, com- may include those producing finished
mercial products, or organizations and unfinished genomic sequence, or www.ncbi.nih.gov/genomes/mpfsub-
does not imply endorsement by Whole Genome Shotgun (WGS) mission.cgi
NCBI, NIH, or the U.S. Government.
sequence. For each registered proj-
ect, a sequencing project page on the Questions about genome project reg-
NIH Publication No. 04-3272
NCBI Genomes website is created istration or genome submission can
ISSN 1060-8788
that gives a description of the proj- be addressed to the NCBI Genomes
ISSN 1098-8408 (Online Version)
ect, links out to genome-specific group at:
resources, and provides a focal point genomes@ncbi.nlm.nih.gov
—VP
Fall 2003/Winter 2004 NCBI News 2

New Genome Builds and Annotations at NCBI
One Build, Multiple Rounds of NCBI RefSeq transcript alignments. contigs. Shown alongside this assem-
Annotation For more on the Gnomon algorithm, bly in the Map Viewer is the NCBI
As reference genome assemblies, see the shaded box entitled "NT" assembly, which covers 25 mil-
such as those for human, mouse, and “Gnomon”. lion bases of sequence and uses con-
rat stabilize, a single build is expected tigs assembled by NCBI from fin-
Human Genome Build 34 Version 3
to pass through multiple cycles of ished BAC sequences in GenBank.
annotation. For this reason, a “build The NCBI Map Viewer now shows New in the Map Viewer for this
X version Y” identification system is build 34 version 3 of the human build, are tracks showing the align-
now being used for many genomes genome reference sequence, which is ment of all human, mouse, and rat
shown in the Map Viewer. An identi- based on data available as of July ESTs to the rat genome, as well as
fier such as, “build 34 version 1”, 2003, and includes the pseudoauto- the ab initio track, which replaces the
indicates the first version of annota- somal region of the Y chromosome. GenomeScan map for the display of
tion for genome build 34. Supplementing the reference gene models.
sequence are a separate assembly of
Gene Annotation chromosome 7 submitted by the The Mouse Build 32 Version 1, Anopheles
gambiae Build 2 Version 1
NCBI has switched from Center for Applied Genomics
GenomeScan as its standard method (TCAG), and the reference sequence Mouse build 32 version 1, based on
of predicting gene models, to for the DR51 haplotype in the Major data available as of September 2003,
Gnomon, a program developed by Histocompatibility Complex region. includes 24,819 mapped genes.
NCBI scientists. Gnomon differs Transcripts annotated on the TCAG Shown with the build 32 version 1
from GenomeScan by putting a assembly by the TCAG group are reference assembly in the Map
greater emphasis on coding propen- shown on a separate track in the Viewer is the Celera assembly of
sity and matches to existing proteins Map Viewer. Other new tracks for chromosome 16 with NCBI annota-
when predicting genes. Gnomon the human Map Viewer show the tions. Build 2 Version 1 of the
also checks more rigorously for alignment to the genome of all Anopheles gambiae genome, based on
shifts in reading frame within tran- human, mouse, rat, pig, and cow data available as of July 2003, is also
script models that are often indica- ESTs along with mRNAs. A new ab available for browsing in the Map
tive of pseudogenes. As a result, the initio track replaces the GenomeScan Viewer with over 12,000 annotated
number of genes appearing in the for the display of Gnomon gene pre- genes.
most recent NCBI annotations of dictions. —DW
the genomes of human, mouse, and Rat Genome Build 2 Version 1

rat has decreased significantly, while Gnomon
the number of models identified as NCBI build 2 version 1 of the rat To create a gene model, Gnomon finds
pseudogenes has increased. About genome reference sequence uses the the best self-consistent set of transcript
and protein alignments to a genomic
20% of the gene models appearing Rat Genome Sequencing Consortium region and uses these alignments as
in human genome build 34 version 1 version 3.1 assembly, which covers constraints for a Hidden Markov Model
were produced using Gnomon; the 2.8 billion bases of the genome in (HMM)-based gene prediction. Several
Whole Genome Shotgun (WGS) steps are involved.
remaining 80% were derived from Gnomon evaluates the statistical prop-
erties of the transcripts aligned to a
genome in order to determine their
Magnaporthe grisea, Bos taurus, Sus scrofa, Canis familiaris are new in
most probable coding regions. For each
Map Viewer gene model, the set of non-overlapping
NCBI has recently created Map Viewer displays for four more organisms. The display for transcript alignments with the best cod-
Magnaporthe grisea, a pathogen of rice that is a close relative of Neurospora crassa, ing propensity is chosen, after which
includes contig, gene, and transcript tracks. For cow1 and pig2, the Map Viewer displays the best matching proteins for the tran-
Meat Animal Research Center (MARC) linkage maps while for the dog the Map Viewer script sequences are aligned to the
shows the Canine 1Mb Radiation Hybrid map (RHDF5000).3 New Genome Guide pages, genomic DNA sequence. For HMM-
created by NCBI in cooperation with the genomic research communities to provide links based gene models without supporting
to an array of genome-specific resources, are available for cow, pig, and dog. These transcript evidence, the proteins that
pages can be seen at: are the best matches to the translated
genomic sequence are aligned.
www.ncbi.nlm.nih.gov/genome/guide/pig
Gnomon checks that the resulting pre-
www.ncbi.nlm.nih.gov/genome/guide/cow dicted gene has every exon in a read-
www.ncbi.nlm.nih.gov/genome/guide/dog ing frame consistent with the protein
1
Keele, J. A second-generation linkage map of the bovine genome. Genome Res. 1997 7(3): 235-49. PMID: 9074927 alignment, however, the program is free
2
Rohrer GA, et al. A comprehensive map of the porcine genome. Genome Res. 1996 6(5): 371-91. PMID: 8743988 to choose splice sites and to introduce
3
Guyon R, et al. A 1-Mb resolution radiation hybrid map of the canine genome. PNAS USA. 2003: 100(9): 5296-301. additional exons between segments of
PMID: 12700351
the protein alignment.
3 Fall 2003/Winter 2004 NCBI News

Entrez Gene Database Debuts
GEO, and UniGene are more readily apparent in the Links
menus, rather than a step away via LinkOut. Entrez Gene
Entrez Gene, the most recent addition to the growing
can be queried using the new Cross Database search
Entrez database collection, provides detailed reports on
described in the previous article beginning on Page 1. The
genes, both sequenced and unsequenced, with links to relat-
results of such a query are shown in Figure 2 on the facing
ed resources.
page with the 968 hits to the Gene database indicated in
Entrez Gene supports the queries and provides much the the lower left-hand corner.
same information as the familiar LocusLink resource it will
The Entrez Gene database will be featured in the Spring
eventually replace. However, the Gene database has a larg-
NCBI News. If you are interested in being notified of
er scope and currently covers over 350 microbial species,
changes in the Gene interface, consider subscribing to the
more than a thousand viral genomes, and more than 400
gene-announce mailing list:
eukaryotes. Being integrated into Entrez also means that
the connections between Gene and other databases in www.ncbi.nlm.nih.gov/mailman/listinfo/gene-announce
Entrez such as PubMed, OMIM, Nucleotide, Protein, —DW
Entrez Goes “Global” classifications, and sequence viewers PubMed in the cross-database results
continued from page 1 such as the Map Viewer. The last page, to reach the PubMed database
example shows a single hit to the display of the hits, and then clicking
page. The Figure shows the results Conserved Domain Database (CDD) on the “Details” link in the PubMed
of a cross-database search using the giving the sequence alignments used display to see how the cross-database
Entrez query “rat[organism] AND to define the domain indicated. A search was executed within the
kinase” indicating hits to a wide vari- link to a KOG (see “KOGs and PubMed database.
ety of databases, including those for COGs are Now Included in CDD”
The Entrez search and retrieval sys-
literature, sequence, gene expression, in this issue) appears in the upper
tem is robust and flexible. To learn
and structure. The four representa- right-hand corner.
more about searching and navigating
tive Entrez displays indicated by the Advanced Entrez Features between Entrez databases or display-
letters A, B, C, and D are reached by ing and downloading results, see:
A cross-database search produces an
first clicking on the the database
individual History page, giving a list Geer RC, Sayers EW. Entrez: making use
name, icon, or number of matches of its power. Briefings in Bioinformatics. 2003
of previously executed searches with
to reach the Entrez document sum- Jun;4(2):179-84. PMID: 12846398
links to the results, for each of the
mary of the hits, and then selecting a
Entrez databases. Other familiar www.ncbi.nlm.nih.gov/Entrez
record and a format for viewing.
Entrez features such as the Clip-
—MR, DW
Beginning in the upper left hand cor- board, Details, Limits, and Preview/
ner, the figure shows one of the Index are also retained separately for
54,988 hits to the PubMed database, each database since the databases Query Across All Entrez Databases
using the “Abstract” display from the by Script
support specialized arrays of fields
Entrez cross-database searches can be
“display” pulldown menu. A link to used to limit searches. For example, performed by script using a new E-Utility
the full text of the article in PubMed note that in Figure 2, some of the program called EGQuery. E-Utilities are
Central appears just to the right of boxes containing the number of hits programs residing on NCBI servers that
the pulldown menu. Moving in a can be accessed by posting a standard
to the databases are shaded. The URL containing the script name and a set
counter-clockwise direction, one of shading indicates that the cross-data- of parameters. For example, posting the
the 2,613 hits to the Nucleotide database query contained a field delimiter following URL from within a script will return
base is shown using the FASTA dis- not supported by the database in the number of hits to each Entrez database
for the query "stem cells":
play format with other possible for- question. The query of Figure 2, eutils.ncbi.nlm.nih.gov/entrez/eutils/egquery
mats listed in the “display” pulldown “rat[organism] AND kinase”, uses .fcgi?term=stem+cells
menu. One of the 968 records from the “organism” field restriction; the Results are returned in XML format. For
the Entrez Gene database that gives “organism” field is supported within details on the use of EGQuery or other E-
a detailed report on a gene. The Utilities, see:
the Nucleotide, CDD and Gene
eutils.ncbi.nlm.nih.gov/entrez/query/static/e
report contains a gene description, a databases, but not within the utils_help.html
graphic of the gene's structure, along PubMed literature database. In this To receive announcements of new E-Utility
with links to bibliographical refer- case, the search was performed using features, subscribe to the E-"utilities-
ences, GenBank sequences, RefSeqs, the term “rat” in “all fields”. This announce" listserve at:
computed and curated functional can be verified by clicking on www.ncbi.nlm.nih.gov/mailman/listinfo/utili-
ties-announce

A
B
D
A C
B D
Figure 2. Cross-database search results for the Entrez query “rat[organism] AND kinase”. The number of hits returned within each Entrez
database is given in boxes to the left of the database name. Shaded boxes indicate that the database in question did not support the field
restriction “organism” used to limit the term “rat” in the query. Portions of Entrez displays for a representative record returned from six of the
Entrez databases are also shown.
5
Fall 2003/Winter 2004 NCBI News
New Microbial Genomes in GenBank
Selected Recent
Publications by Organism GenBank | RefSeq For more detailed
Accession Numbers information, see the
NCBI Staff online version of the
Photorhabdus luminescens subsp. laumondii TTO1 BX470251 | NC_005126 Fall 2003/Winter
To view the citation for any article listed Gloeobacter violaceus BA000045 | NC_005125 2004 NCBI News, or
below, click on the PubMed link on the use the GenBank or
Chromobacterium violaceum ATCC 12472 AE016825 | NC_005085
navigation bar at the top of the NCBI RefSeq Accession
Porphyromonas gingivalis W83
Home Page, enter the PubMed ID num- AE015924 | NC_002950
Number to query
ber in the search query box, and click Go. Wolinella succinogenes BX571656 | NC_005090 Entrez “Genome”
Geobacter sulfurreducens PCA AE017180 | NC_002939 database using the
Schriml LM, Hill DP, Blake JA, Bono H,
Rhodopseudomonas palustris str CGA009 BX571963 | NC_005296 query box on the
Wynshaw-Boris A, Pavan WJ, Ring BZ, NCBI Home Page.
Beisel K, Setou M, Okazaki Y. Human Onion yellows phytoplasma AP006628 | NC_005303
disease genes and their cloned mouse
orthologs: exploration of the FANTOM2
cDNA sequence data set. Genome Research. KOGs and COGs Are Now Included In CDD
2003 Jun;13(6B):1496-500. PMID:
12819148
As part of the latest release (v1.63) datasets within CDD. CDD now
Szak ST, Pickeral OK, Landsman D,
Boeke JD. Identifying related L1 retro- of the Conserved Domain Database contains datasets that cluster proteins
transposons by analyzing 3' transduced (CDD), the alignment sets of the based on overall sequence similarity
sequences. Genome Biology. 2003;4(5):R30. KOG database (clusters of
1
(COGs and KOGs) along with those
Epub 2003 Apr 16. PMID: 12734010 euKaryotic Orthologous Groups) that cluster based on the presence of
Daraselia N, Dernovoy D, Tian Y, have been merged into CDD . The defined functional domains (Pfam,
Borodovsky M, Tatusov R, Tatusova T. KOG database is essentially a SMART, curated CDs). Multiple
Reannotation of Shewanella oneidensis
eukaryotic version of the COG data- domain proteins will therefore often
genome. Omics : a journal of integrative biolo-
gy. 2003 Summer;7(2):171-5. PMID: base (Clusters of Orthologous have two sets of hits in CDD: hits
14506846 Groups) that was integrated into from COGs and KOGs to large por-
Dufresne A, Salanoubat M, Partensky F, CDD in late 2002 (v1.60). KOGs tions of the sequence, and hits to
Artiguenave F, Axmann IM, Barbe V, and COGs cluster eukaryotic and Pfam, SMART, and/or CDD for
Duprat S, Galperin MY, Koonin EV, Le prokaryotic each function-
Gall F, Makarova KS, Ostrowski M, proteins al domain. In
Oztas S, Robert C, Rogozin IB, Scanlan
DJ, Tandeau de Marsac N, Weissenbach J,
respectively order to show
Wincker P, Wolf YI, Hess WR. Genome into groups Figure 1. Graphical overview of Conserved Domain Search results both sets of
sequence of the cyanobacterium containing for human SRC protein, RefSeq accession NP_005408, showing hits in a sim-
Prochlorococcus marinus SS120, a nearly min- sequences hits to KOG0197 and a PFAM-based conserved domain for tyro-
ple display,
imal oxyphototrophic genome. Proceedings sine kinases, as well as hits to SH2 and SH3 domains.
that are each CDD
of the National Academy of Sciences, USA.
2003 Aug 19;100(17):10020-5. Epub 2003 mutual best hits in sequence similari- record is now classified as either a
Aug 13. PMID: 12917486 ty searches between different species. "single" or "multiple" domain record,
Mu J, Ferdig MT, Feng X, Joy DA, Duan The KOG database includes proteins and the best hits from each set are
J, Furuya T, Subramanian G, Aravind L, from H. sapiens, D. melanogaster, C. ele- shown when the Domains link is
Cooper RA, Wootton JC, Xiong M, Su gans, A. thaliana, S. cerevisiae, S. pombe, clicked for a record in Entrez
XZ. Multiple transporters associated with and E. cuniculi. With RPS-BLAST Protein. Moreover, the Conserved
malaria parasite responses to chloroquine
and quinine. Molecular Microbiology. 2003 searches available for KOGs and Domain Architecture Retrieval Tool
Aug;49(4):977-89. PMID: 12890022 COGs in CDD, users can now classi- (CDART) only uses single domain
fy query sequences by similarity to records to group protein sequences
Shabalina SA, Ogurtsov AY, Lipman
DJ, Kondrashov AS. Patterns in inter- these pre-determined sets alongside by domain architecture. In the exam-
species similarity correlate with nucleotide the alignments from Pfam, SMART, ple shown above for NP_005408, the
composition in mammalian 3'UTRs. and the curated NCBI Conserved human SRC protein, hits are shown
Nucleic Acids Research. 2003 Sep to both the multiple domain
Domains. Because CDD data is also
15;31(18):5433-9. PMID: 12954780
incorporated into Entrez as the KOG0197 (tyrosine kinases) and to
Snijder EJ, Bredenbeek PJ, Dobbe JC, Domains database, KOGs and single domains pfam00018 (SH3),
Thiel V, Ziebuhr J, Poon LL, Guan Y,
Rozanov M, Spaan WJ, Gorbalenya AE. COGs can be found using standard pfam00017 (SH2), and cd00192
Unique and conserved features of Entrez queries by fields such as title, (TyrKc, tyrosine kinase catalytic
genome and proteome of SARS-coron- organism, or text words. domain). —ES
avirus, an early split-off from the coron-
With KOGs and COGs now includ- 1
Tatusov RL, Fedorova ND, Jackson JJ, Jacobs AR,
avirus group 2 lineage. Journal of Molecular Kiryutin B, Koonin EV, Krylov DM, Mazumder R,
Biology. 2003 Aug 29;331(5):991-1004. ed in CDD, the displays of pre-com- Mekhedov SL, Nikolskaya AN, Rao BS, Smirnov S,
PMID: 12927536 puted RPS-BLAST results have been Sverdlov AV, Vasudevan S, Wolf YI, Yin JJ, Natale
DA. The COG database: an updated version includes
updated to reflect the different clus- eukaryotes. BioMed Central Bioinformatics. 2003 Sep 11
tering schemes underlying the several [Epub ahead of print] PMID: 12969510

Submitting a Population or Phylogenetic Sequence Set
The Entrez PopSet database accommodates four varieties of sequence set, to represent versions of a gene or sequence region
derived from varying sources. The sources may be isolates of a single organism, comprising a “population set”, an ensemble of
organisms, comprising a “phylogenetic set”, individuals from a population of unclassified or unknown organisms, comprising an
“environmental set”, or various mutational forms, comprising a “mutation set”. Figure 1 shows a PopSet, consisting of GenBank
records AF474412-AF474791 for Adelie penguin mitochondrial DNA. PopSets such as this one can be submitted to GenBank in
four easy steps using NCBI’s sequence-submission tool, Sequin.
Step 1—Generate an alignment or FASTA set of ways, such as by the name of an individual, by a culture collec-
Sequin can import alignments in any of the FASTA+GAP, tion number or locality, by an arbitrary identifier, or by a label
PHYLIP, or NEXUS formats illustrated at: used within the submitting laboratory.
www.ncbi.nlm.nih.gov/Sequin/QuickGuide/sequin.htm#before A specimen voucher is the remainder from which a sequence has
The source information, such as isolates and specimen vouchers, been obtained, or, where coidentity is assured, a representative of
can be included in the definition lines of the alignment to be the sequence source specimen. Vouchers should be deposited in
imported by Sequin. The definition line begins with a “>” sign respositories accessible to the public, such as herbaria or museum
and is included at the bottom of the alignment for the PHYLIP collections. Specimen vouchers allow verification of the identity
or NEXUS format or just above the sequence for the of a taxon and serve as a source for additional molecular analyses.
FASTA+GAP format. An example of a definition line for a pop- A common format for vouchers includes the collector's name and
ulation set in NEXUS format from Escherichia coli strain ECOR10 a unique number, plus the repository or its abbreviation. For
is: example:
C.S. Shen 2459 (HMAS)
>[organism= Escherichia coli] [strain=ECOR10] [clone=1]
A.J. Smith 12.iii.2002 (AMNH)
The modifier information must be included in square brackets H. Perrier s.n. (P)
with no spaces on either side of the “=” sign. There is no limit
to the number of modifiers you can add from the list found at:
www.ncbi.nlm.nih.gov/Sequin/sequin.hlp.html Figure1. Views
from Sequin and
If the sequences are included in the form of an aligned or Entrez, respec-
unaligned multiple FASTA sequence set, include the sequence tively, of a popu-
lation set of
identifier, in this case, “seqid”, after the “>” sign followed by the Adelie penguin
modifiers, as in: mitochondrial
DNA sequences
>seqid1 [organism= Escherichia coli] [strain=ECOR10] [clone=1] taken from
Step 2—Import the Sequences ancient bone and
fresh blood for a
After entering the submission and contact information in Sequin, study of rates of
evolution.
choose “Population study”, “Phylogenetic study”, “Mutation
study” or “Environmental samples” in the “Sequence Format”
panel. Next, import the set of nucleotide sequences in the
desired alignment format by clicking on “Import Nucleotide”.
In the absence of specimen vouchers, the following source modi-
Step 3—Add and propagate features
fiers are helpful:
If you imported the sequences as an alignment, as opposed to an
- cultivar, strain, isolate, breed, ecotype, or genotype name
unaligned multiple FASTA set, you have an easy way of propagat-
- germplasm, seed, or stock center accession number
ing features from one member to all members of the set. Select
the first entry under “Target Sequence” and add the appropriate - collection locality, date, and/or collection number.
features to this entry using the Annotate panel. Then, you may Along with the specimen voucher information, you may provide
propagate all or the selected features to the remaining entries online images of the specimens that will be made available in
using the Edit—Feature propagate option. Entrez through LinkOut. For an example, retrieve the entry
AY090229 in Entrez and click on LinkOut through "Links" on
Step 4—Adding Distinguishing information
the right hand side of the page to view an image of an insect
NCBI strongly encourages distinguishing information for the specimen.
individual sequences in PopSets. This information can include
When your submission is complete, save the entries in the native
strains for cultured bacteria, algae, fungi and laboratory animals;
Sequin format and e-mail to:
clones for sequences obtained by direct PCR-amplification and
cloning of an environmental bulk DNA sample; and specimen gb-sub@ncbi.nlm.nih.gov
vouchers for sets of multicellular organisms. You will receive confirmation of your submission along with your
Strain identifiers help distinguish specific cultures from other iso- accession numbers within approximately 2 business days.
lates of the same taxon. A strain may be designated in a variety —MB
7 Fall 2003/Winter 2004 NCBI News

GenBank® Release 139 UniGene Adds Four RefSeq Version 3 Released
GenBank Release 139, made avail- Four new organisms are now repre- RefSeq Release 3, covering 2,218
able in December 2003, contains sented in the UniGene collection of organisms, and containing over
over 30 million sequence entries gene-oriented transcript sequences; 844,000 protein sequences and anno-
totaling more than 36 billion base Toxoplasma gondii with 3,106 clusters, tations, includes the RefSeq data as
pairs. Release 140 is expected in Saccharum officinarum (noble cane)with of January 13, 2004.
February. GenBank is accessible via 5,197 clusters, Schistosoma mansoni (the
Download the release at:
the Entrez search and retrieval sys- blood fluke) with 1,170 clusters,
tem. The flatfile and ASN.1 versions Strongylocentrotus purpuratus (the purple ftp.ncbi.nih.gov/refseq/release
of the release are found in the “gen- sea urchin) with 2,592 clusters, and Daily updates between releases are
bank” and “ncbi-asn1” directories Physcomitrella patens (physcomitrella made available at:
respectively at: moss) with 6,927 clusters. UniGene
ftp.ncbi.nih.gov clusters are created at NCBI for all ftp.ncbi.nih.gov/refseq/daily/new
Uncompressed, the Release 139 flat- organisms represented in GenBank To receive announcements of future
files consume about 122 gigabytes with more than 70,000 Expressed RefSeq releases and incremental large
while the ASN.1 version consumes Sequence Tag sequences. UniGene updates, subscribe to NCBI's refseq-
about 108 gigabytes. The data can now covers over 30 animals and announce mail list:
also be downloaded at two mirror plants and can be searched using the
refseq-announce@ncbi.nlm.nih.gov
sites: Entrez search system where it is
linked to nucleotide records.
genbank.sdsc.edu/pub
bio-mirror.net/biomirror/genbank
NCBI News
NATIONAL INSTITUTES OF HEALTH ■ National Library of Medicine Fall 2003
Winter 2004

Fallwinter03 PDF

Uploaded by

Document Information

Original Title

Copyright

Available Formats

Share this document

Share or Embed Document

Sharing Options

Did you find this document useful?

Is this content inappropriate?

Copyright:

Available Formats

Fallwinter03 PDF

Uploaded by

Copyright:

Available Formats

NCBI News

National Center for Biotechnology Information

Register Your Genome

1 Register Your Genome Project Online

3 New Genome Builds and Annotations

4 Entrez Gene Database Debuts

6 New Microbial Genomes in GenBank

6 KOGs and COGs Now in CDD

8 GenBank Release 139

8 UniGene Adds Four

Fall 2003/Winter 2004 NCBI News 2

the genomes of human, mouse, and Rat Genome Build 2 Version 1

3 Fall 2003/Winter 2004 NCBI News

Fall 2003/Winter 2004 NCBI News 4

Fall 2003/Winter 2004 NCBI News 6

7 Fall 2003/Winter 2004 NCBI News

You might also like