
EXPERIMENT – 01

BASIC BIOLOGICAL ANALYSIS AND MOLECULAR SEQUENCE RETRIEVAL FROM NCBI DATABASES – 1
EB3233: BIOINFORMATICS LABORATORY

Abstract
The most commonly used interface for obtaining information from biological databases is the
NCBI Entrez system. This system was developed by the National Center for Biotechnology
Information (NCBI) and incorporates the PubMed database and 39 other databases covering
scientific literature, nucleotide and protein sequences, protein domain data, population research
datasets, expression data, pathways and interactive molecule systems, full genome descriptions
and taxonomic information into a closely interconnected system. Entrez is therefore a portal that
allows text-based searches across a variety of data. A key feature of Entrez is its ability to
integrate information, which is based on pre-existing and logical relationships between individual
entries established through cross-referencing between NCBI databases. This makes the system
very convenient, because users do not need to visit multiple databases in disparate locations.
This feature is discussed in this report with some examples.

Entrez queries can be single words, short phrases, sentences, database identifiers, gene symbols
or names. Simple searches often return a very large number of results, or no results at all.
There are a number of built-in features that help create more effective queries. These include
Boolean operators, query translation, and field searches using any indexed field in the database.
Any of these can be written and edited by hand, but they are also built into various sections of
the interface so that accurate query results can be obtained without having to write complex
query statements. The Entrez searching options listed below are discussed with examples in this
report.

• Boolean operators - These provide a way to build reliable queries that return well-defined
sets of results.
• The Advanced Search page - Useful for building complex and highly effective queries.
• Limits - These filter the search according to the user's preferences.

To sum up, by using the Entrez features and searching options mentioned above, we learnt how
to use Entrez to search databases and how to view and use the various biological databases
available at NCBI.

Introduction
This report describes how to use Entrez to search databases and how to view and use the various
biological databases available at NCBI. Entrez is a data retrieval system developed by the
National Center for Biotechnology Information (NCBI). It incorporates the PubMed database and
39 other databases covering scientific literature, nucleotide and protein sequences, protein
domain data, population research datasets, expression data, pathways and interactive molecule
systems, full genome descriptions and taxonomic information into a closely interconnected
system. The Entrez global query is also known as GQuery, and it acts as the search engine for
the NCBI databases. All component databases can be accessed via a single query (Help, 2020).

The Entrez retrieval system uses an intuitive user interface to quickly search for sequence and
bibliographic data. A unique feature of the system is the use of precomputed similarity searches
that link each record to related records ("neighbors") in the same or other Entrez databases.
These connections facilitate integrated access across a number of databases (Help, 2020). An
Entrez GQuery searches a subset of the Entrez databases at one time (Searching NCBI databases
using Entrez - BITS wiki, 2020). Results can be viewed in a number of formats, including flat
file, FASTA, XML, and many others. A graphical interface offers easy visualization of entire
genes or chromosomes, as well as biological descriptions of individual sequences (Entrez
Molecular Sequence Database System, 2020). In addition, Entrez enables batch downloads of
large sets of search results.

Within Entrez, Limits allow users to restrict a search according to their preferences, and the
Advanced Search interface allows more comprehensive queries to be performed.

Objectives
1. To learn how to use Entrez to search databases
2. To view and use the various biological databases available at NCBI

Materials
1. Computer
2. Internet connection
3. NCBI website

Methods and Results


1. First, the NCBI website was accessed using the following link:
https://www.ncbi.nlm.nih.gov/

Figure 1_GQuery search tool
2. After that, the “Search” button was clicked with an empty search box, and the Entrez page
was displayed on the NCBI website as shown below.

Figure 2_Entrez page

3. As an example, HIV was searched for: the query word “HIV” was typed in the search box
and the “Search” button was clicked. The results were displayed for each database with the
corresponding number of records (hits), as shown below.

Figure 3_Search results for "HIV"

4. After that, the “Nucleotide” database was selected to see the results from GenBank’s
nucleotide section by clicking on the Nucleotide hyperlink, as shown below.

Figure 2_Search results for "HIV"

Figure 4_Search results for "HIV"
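
The same Nucleotide search can also be expressed programmatically. The sketch below is not part
of the original experiment, which used the web interface only; it assumes the NCBI E-utilities
ESearch service as a stand-in and only builds the request URL without sending it. The function
name build_esearch_url is a hypothetical helper introduced for illustration.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities ESearch service (assumed here as a
# programmatic counterpart to the Entrez search box used in the report).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Return an ESearch URL that looks up `term` in the Entrez database `db`."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    # urlencode escapes spaces and brackets so the term is safe inside a URL.
    return f"{ESEARCH}?{urlencode(params)}"


# The query from step 3, restricted to the Nucleotide database as in step 4.
url = build_esearch_url("nucleotide", "HIV")
print(url)
```

Opening the printed URL in a browser should return a JSON document whose hit count corresponds
to the number shown next to "Nucleotide" on the GQuery page at that moment.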

5. The Bacteria filter (on the left of the results page) was used to check how many of the
nucleotide records returned by the search for “HIV” are of bacterial origin. The search
returned more than 34 000 records (34522) of bacterial origin.

6. The RefSeq filter (on the left of the results page) was used to check how many RefSeq
records are returned by the search for “HIV”. The search returned more than 11 000 records
(11606) from RefSeq.

Figure 5_Nucleotide results for HIV
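
Clicking a sidebar filter is equivalent to adding another clause to the query with AND. A
minimal sketch of that idea follows. The helper apply_filters is hypothetical, and the filter
names "bacteria" and "refseq" with the [filter] tag are assumptions about how the sidebar
choices are spelled in query syntax; the Search details box on the results page shows the exact
form that Entrez uses.

```python
def apply_filters(term: str, filters: list[str]) -> str:
    """AND each sidebar-style filter onto the base search term."""
    clauses = [term] + [f"{name}[filter]" for name in filters]
    # Entrez Boolean operators must be written in upper case.
    return " AND ".join(clauses)


# Step 5: HIV records of bacterial origin (filter name is an assumption).
bacterial_query = apply_filters("HIV", ["bacteria"])
# Step 6: HIV records that belong to RefSeq (filter name is an assumption).
refseq_query = apply_filters("HIV", ["refseq"])

print(bacterial_query)
print(refseq_query)
```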

7. Limits can be used to restrict a search. As an example, all HIV mRNA sequences
submitted in the last year were retrieved. For this, the filters on the left side of the page
were used: mRNA was selected and the release date was entered as 2019/01/01 -
2019/12/31, as shown below.

Figure 6_Results page for all HIV mRNA sequences submitted in the last year
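
The molecule type and release date limits can also be written directly into the query string.
The sketch below assumes that the mRNA limit corresponds to biomol_mrna[PROP] and that the
release date range corresponds to the [PDAT] field; both spellings are assumptions to be checked
against the Search details box. The helper limit_by_date is hypothetical.

```python
from datetime import date


def limit_by_date(term: str, start: date, end: date,
                  molecule: str = "biomol_mrna") -> str:
    """Restrict `term` to one molecule type and a release date window."""
    if start > end:
        raise ValueError("start date must not be later than end date")
    # Entrez date ranges use the form YYYY/MM/DD:YYYY/MM/DD followed by a field tag.
    window = f"{start:%Y/%m/%d}:{end:%Y/%m/%d}[PDAT]"
    return f"({term}) AND {molecule}[PROP] AND {window}"


# Step 7: HIV mRNA sequences released during 2019.
query = limit_by_date("HIV", date(2019, 1, 1), date(2019, 12, 31))
print(query)
```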

8. Next, the filtering was continued from the previous search, this time keeping only the
records in which the word “HIV” appears in the title.

9. For this, Advanced search was selected as the next step.

Figure 7_ Advanced search page

10. Then the History entry for the previous search was added to the Builder, as shown in the
figure below. After that, the Search button was clicked.

Figure 8_Advanced search page

11. Advanced search results were displayed as shown below.

Figure 9_Advanced search results
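
The way the Advanced search Builder combines rows can be imitated with a few lines of code.
This is only a sketch of the idea, not the Builder's actual implementation: build_query is a
hypothetical helper, and "#1" stands for an assumed History entry number for the previous
search.

```python
VALID_OPERATORS = {"AND", "OR", "NOT"}


def build_query(rows):
    """Combine (operator, field, value) rows from left to right.

    The operator of the first row is ignored. A field of None leaves the
    value untagged, which is used here for a History reference such as "#1".
    """
    query = ""
    for index, (operator, field, value) in enumerate(rows):
        if operator not in VALID_OPERATORS:
            raise ValueError(f"unsupported operator: {operator!r}")
        clause = f"{value}[{field}]" if field else value
        # Wrap what has been built so far in parentheses to fix the grouping.
        query = clause if index == 0 else f"({query}) {operator} {clause}"
    return query


rows = [
    ("AND", None, "#1"),      # hypothetical History entry for the filtered search
    ("AND", "Title", "HIV"),  # only records with HIV in the title
]
query = build_query(rows)
print(query)  # (#1) AND HIV[Title]
```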

12. Next, all records of proteases except those from HIV viruses were retrieved. As the first
step, the filters from the previous search were cleared. After that, the query phrase
"protease NOT hiv [Organism]" was typed in the search box and the Search button was
clicked. The results page was then displayed as shown below.

Figure 10_Search results for "protease NOT hiv [Organism]"

13. Finally, all protease records with a length between 1000 and 2000 amino acids, except
those from HIV viruses, were retrieved. For this, the filter on the left side of the page was
used: Sequence length was selected and the range was entered as 1000 to 2000. The results
page was then displayed as shown below.

Figure 11_All records from proteases NOT HIV

Figure 12_All records from proteases with a length between 1000 and 2000 amino acids
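
The exclusion and the length limit can be combined in a single query string. The sketch below
assumes that the Sequence length filter corresponds to the [SLEN] field tag; the helpers exclude
and with_length are hypothetical. Because Entrez evaluates Boolean operators from left to right,
the parentheses make the intended grouping explicit.

```python
def exclude(term: str, unwanted: str) -> str:
    """Return a query that matches `term` but drops records matching `unwanted`."""
    return f"{term} NOT {unwanted}"


def with_length(term: str, shortest: int, longest: int) -> str:
    """Keep only records whose sequence length lies in the given range."""
    if shortest > longest:
        raise ValueError("shortest must not exceed longest")
    # Numeric ranges use the form low:high followed by the field tag.
    return f"({term}) AND {shortest}:{longest}[SLEN]"


# Step 12: proteases that do not come from HIV.
base = exclude("protease", "hiv[Organism]")
# Step 13: the same set, limited to sequences of 1000 to 2000 residues.
query = with_length(base, 1000, 2000)
print(query)
```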

14. In addition, Advanced search and shortcuts can also be used here, as shown below.

Figure 13_ Advanced search for proteases with a length between 1000 and 2000 amino acids

Figure 14_All records from proteases with a length between 1000 and 2000 amino acids via
advanced search

Discussion
This session covered how to use Entrez to search databases and how to view and use the various
biological databases available at NCBI. NCBI (National Center for Biotechnology Information)
is a national resource that creates public databases, conducts research in computational biology,
develops software tools for analyzing genetic data, and disseminates biological information
(Help, 2020). Entrez is described as a molecular biology database system that provides
integrated access to nucleotide and protein sequence data, gene-centric and genetic mapping
information, 3D structural data, PubMed MEDLINE, and more (Entrez Molecular Sequence
Database System, 2020).

In this experiment, the Entrez page was accessed first, and to search for HIV on the NCBI
website, the query word "HIV" was typed in the search box. All NCBI databases are listed on the
GQuery results page, together with the number of records found that contain the word "HIV".
The user can view the search results for a database by clicking on its number of records; which
database to choose depends on the type of data the user is looking for. In this experiment, the
“Nucleotide” database was selected to see the results from GenBank’s nucleotide section. The
result page is shown in Figure 4. That result page was generated on 31 Oct 2020. Because the
nucleotide database is constantly updated, the numbers seen in the statistics will no longer be
the same as they were on that date.
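
Because the counts change over time, it can help to read the hit count from a machine-readable
response and record it together with the date. The sketch below assumes the JSON layout
returned by the E-utilities ESearch service with retmode=json; the payload shown is a made-up
sample with placeholder values, not a result from this experiment, and hit_count is a
hypothetical helper.

```python
import json

# Made-up sample in the shape of an ESearch JSON response (placeholder values).
SAMPLE_RESPONSE = """
{
  "header": {"type": "esearch", "version": "0.3"},
  "esearchresult": {
    "count": "42",
    "retmax": "2",
    "retstart": "0",
    "idlist": ["111", "222"]
  }
}
"""


def hit_count(payload: str) -> int:
    """Extract the total number of matching records from an ESearch JSON payload."""
    result = json.loads(payload)["esearchresult"]
    # The count is delivered as a string, so convert it before using it as a number.
    return int(result["count"])


print(hit_count(SAMPLE_RESPONSE))  # 42
```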

After that, different types of filters were applied and the results were obtained separately. The
filters are listed below.

• Bacteria Filter - used to check how many of the nucleotide records returned by the search
for “HIV” are of bacterial origin. The results page showed that the search returned more
than 34000 records of bacterial origin.
• RefSeq Filter - used to check how many RefSeq records are returned by the search for
“HIV”. The results page showed that the search returned more than 11000 records from
RefSeq.

On the Entrez page, Limits allow users to restrict a search as they choose. Limits were therefore
used to restrict the search in this experiment, in order to retrieve all HIV mRNA sequences
submitted in the last year. The filter was set as follows:

• mRNA was selected and the release date was entered as 2019/01/01 - 2019/12/31.

These limits restricted the search for the term (HIV) to records that contain mRNA sequences
and were submitted to GenBank in the last year (2019). The query retrieved only 17733 records,
which showed that limits allow users to restrict their search.

Advanced search is another feature of the Entrez page. The advanced search interface allows
more detailed queries, so it was used to build a more specific query in this experiment. The
previous search was continued, but restricted to records in which the word “HIV” appears in the
title. Advanced search was selected to build that query. The following filters were already
active:

• Filters: mRNA, from 2019/01/01 to 2019/12/31

Then, “HIV” was added as “Title” in the Builder. After the Search button was clicked, the results
page was displayed with a narrower set of results: the query retrieved only 66 records.
Advanced search can be used first and the filters activated afterwards, or vice versa, because
there is no strict sequence of steps to follow as long as all the necessary information is
retrieved.

After that, the “NOT” Boolean operator and one search term were used to retrieve records. The
following request was used in this experiment:

• all records of proteases except those from HIV viruses

First, the previous search was cleared. Then, "protease NOT hiv [Organism]" was typed in the
search box. The parts of this query are explained below.

• "protease" - used to find all proteases
• "NOT" - used to exclude the records that match the term following it
• "protease NOT hiv [Organism]" - the complete query, which returns proteases while
excluding those whose source organism is HIV

After that, the results page was displayed as in Figure 11.

Finally, the query above was refined further by using a filter.

• Sequence length: between 1000 and 2000

This filter was used to retrieve all protease records with a length between 1000 and 2000 amino
acids, except those from HIV viruses. The results page was then displayed as in Figure 12.

In conclusion, the following points were learnt from this experiment.

• The main advantage of GQuery is its common interface for different databases: once a
user knows how to set up a query in one NCBI database, the same can be done in any
other NCBI database.
• Not only are the searches similar, but the result pages of all NCBI databases also look the
same in Entrez.
• How to use Entrez to search databases and how to view and use the various biological
databases available at NCBI.

References
2020. [online] Available at:
<https://www.researchgate.net/publication/51696022_Searching_NCBI_databases_using_Entrez>
[Accessed 1 November 2020].

Help, E., 2020. Entrez Help. [online] Ncbi.nlm.nih.gov. Available at:
<https://www.ncbi.nlm.nih.gov/books/NBK3837/> [Accessed 30 October 2020].

Laboratory Manual: Bioinformatics, Nilai University (Anon, 2014)

Ncbi.nlm.nih.gov. 2020. Entrez Molecular Sequence Database System. [online] Available at:
<https://www.ncbi.nlm.nih.gov/Web/Search/entrezfs.html> [Accessed 31 October 2020].

Wiki.bits.vib.be. 2020. Searching NCBI Databases Using Entrez - BITS Wiki. [online] Available
at: <https://wiki.bits.vib.be/index.php/Searching_NCBI_databases_using_Entrez> [Accessed 1
November 2020].

Post – Lab Questions
1. Search through Entrez for breast cancer. Provide the following information:

a) The number of nucleotide sequence records associated with breast cancer.

571,748

Figure 15_Search results for nucleotide sequence records associated with breast cancer

b) Number of nucleotide sequence records associated with breast cancer that are from the
Refseq database.

25,492

Figure 16_Search results for nucleotide sequence records associated with breast cancer that
are from the Refseq database.

c) Number of nucleotide sequences associated with breast cancer that are mRNAs

328,499

Figure 17_Search results for nucleotide sequences associated with breast cancer that are
mRNAs

d) Number of the mRNA sequence records that are in Refseq and the words breast cancer
appear in their titles.

5461

Figure 18_Search results for mRNA sequence records that are in Refseq and the words breast
cancer appear in their titles.

e) Number of Human gene BRCA1 Refseq mRNA sequence records with the words breast
cancer in their titles.

Figure 19_Search results for Human gene BRCA1 Refseq mRNA sequence records with the
words breast cancer in their titles

f) Accession number of the mRNA record for human BRCA1 variant 1 gene in Refseq

Accession: NM_007294

Figure 20_mRNA record for human BRCA1 variant one gene in Refseq
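
Once an accession number is known, the record itself can be requested in a chosen format. The
sketch below is an addition to the report and assumes the NCBI E-utilities EFetch service; it
only builds the URL for the accession found above and does not download anything. The helper
build_efetch_url is hypothetical.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service (assumed here as a
# programmatic counterpart to opening the record page in the browser).
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(accession: str, rettype: str = "fasta") -> str:
    """Return an EFetch URL for one nucleotide record in the requested format."""
    params = {
        "db": "nuccore",      # Entrez name of the Nucleotide database
        "id": accession,
        "rettype": rettype,   # "fasta" for sequence only, "gb" for the flat file
        "retmode": "text",
    }
    return f"{EFETCH}?{urlencode(params)}"


# Accession reported in answer f) for human BRCA1 transcript variant 1.
url = build_efetch_url("NM_007294")
print(url)
```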
