
Accessing Bibliographic Databases

Aim:
To access the bibliographic database.
Principle:
The bibliographic databases provide access to abstracts and, in some cases, the full text and figures of published literature. PubMed is the most important database for biomedical literature. PubMed comprises more than 23 million citations for biomedical literature from MEDLINE, life science journals, and online books. Citations may include links to full-text content in PubMed Central and on publisher websites.
Procedure:
- Go to the NCBI website www.ncbi.nlm.nih.gov .
- Click on the PubMed option.
- Type the query in the query box and click the Search button.
- A hit list of all articles matching the given query will appear.
- Select the appropriate entry.
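The same search can be issued programmatically through NCBI's E-utilities service, the scripted counterpart of the PubMed search box. A minimal sketch in Python (the query string is only an example; `retmax` caps the number of returned IDs):

```python
from urllib.parse import urlencode

# Base URL of NCBI's E-utilities esearch service (the programmatic
# counterpart of the PubMed web search box).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_search_url(query, retmax=20):
    """Build an esearch URL that returns PubMed IDs matching the query."""
    params = {"db": "pubmed", "term": query, "retmax": retmax}
    return EUTILS_ESEARCH + "?" + urlencode(params)

print(pubmed_search_url("BRCA1 breast cancer"))
```

Fetching the URL returns an XML list of PubMed IDs, which can then be passed to efetch to retrieve the records themselves.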
Result:
Literature regarding the given query was found.
Sequence Retrieval From Nucleic Acid Databases
Aim:
To retrieve sequence from nucleic acid database.
Principle:
The primary databases for nucleic acid sequences are GenBank, EMBL and DDBJ. Among these three, GenBank is the most commonly used database.

GenBank is the NIH genetic sequence database, an annotated collection of all publicly available DNA sequences. GenBank is part of the International Nucleotide Sequence Database Collaboration, which comprises the DNA DataBank of Japan (DDBJ), the European Molecular Biology Laboratory (EMBL), and GenBank at NCBI. These three organizations exchange data on a daily basis.

The complete release notes for the current version of GenBank are available on the NCBI ftp
site. A new release is made every two months.  GenBank growth statistics for both the traditional
GenBank divisions and the WGS division are available from each release.

Procedure:
- Go to the NCBI website www.ncbi.nlm.nih.gov .
- Click on the GenBank option.
- Type the query in the query box and click the Search button.
- Select the required entry from the hit list; the record will open.
- Click on the FASTA option to view the sequence in FASTA format.
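Once a record has been saved in FASTA format, it can be parsed locally. A minimal sketch of a FASTA parser in Python (the header and sequence below are made-up illustrations, not a real GenBank record):

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            # A '>' line starts a new record; flush the previous one.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif line.strip():
            chunks.append(line.strip())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

sample = """>example_id hypothetical nucleotide record
ATGGCGTACG
TTAGC"""
print(parse_fasta(sample))
```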
Result:
Sequence was retrieved in FASTA format.
Protein Sequence Retrieval From Protein Sequence Databases
Aim:
To retrieve sequence from protein sequence databases.
Principle:
UniProtKB/Swiss-Prot is the manually annotated and reviewed section of the UniProt Knowledgebase (UniProtKB). It is a high-quality, annotated, non-redundant protein sequence database, which brings together experimental results, computed features and scientific conclusions. Since 2002 it has been maintained by the UniProt consortium and is accessible via the UniProt website.
Procedure:
- Go to the UniProt website www.uniprot.org .
- Type the query in the query box and click the Search button.
- Select the required entry from the hit list; the record will open.
- Click on the FASTA option to view the sequence in FASTA format.
Result:
Sequence was retrieved in FASTA format.
Protein Structure Retrieval From PDB
Aim:
To retrieve the protein structure from PDB.
Principle:
The Protein Data Bank (PDB) is a repository for the three-dimensional structural data of large
biological molecules, such as proteins and nucleic acids. The data, typically obtained by X-ray
crystallography or NMR spectroscopy and submitted by biologists and biochemists from around
the world, are freely accessible on the Internet via the websites of its member organisations
(PDBe, PDBj, and RCSB). The PDB is overseen by an organization called the Worldwide
Protein Data Bank, wwPDB.
Procedure:
- Type www.rcsb.org to open the home page of the PDB.
- On the RCSB homepage, type the protein name in the query window and click on Search.
- From the search results, select an entry and click on the download option to download the PDB file.
- Save the file on the local computer.
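A downloaded PDB file is fixed-column text; each ATOM record carries the atom name and the x, y, z coordinates in defined columns. A minimal Python sketch that extracts them (the sample record below uses illustrative values, not coordinates from a real structure):

```python
def parse_atom_line(line):
    """Extract atom name and x, y, z coordinates from a PDB ATOM record.
    PDB files use fixed columns: atom name in columns 13-16,
    coordinates in columns 31-38, 39-46 and 47-54."""
    return (line[12:16].strip(),
            float(line[30:38]), float(line[38:46]), float(line[46:54]))

# A single ATOM record in standard PDB fixed-column layout (illustrative values).
record = "ATOM      1  CA  ALA A   1      11.104  13.207   9.002  1.00 20.00           C"
print(parse_atom_line(record))
```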
Result:
The PDB file of the given protein structure was downloaded and saved on the local computer.
Protein Structure Visualization Using RasMol
Aim:
To visualize protein structure using RasMol.
Principle:
RasMol is a molecular visualisation program used to view the three-dimensional structure of proteins. The PDB file of the protein is given as input to this software. It is used for structural analysis of proteins.

The name RasMol comes from raster display of molecules. Raster is a type of computer display
especially useful for showing solid surfaces. It may not be a coincidence that the letters Ras are
also the initials of RasMol's creator, Roger A. Sayle of Glaxo Corporation and the University of
Edinburgh, Scotland.

Procedure:
- Go to the PDB website www.rcsb.org .
- Type the query in the query box and click the Search button.
- Select the required entry from the hit list; the record will open.
- Download the data in PDB format and give it as input to RasMol.
Result:
3D imagery of the protein is displayed on the screen.
Restriction Map Analysis
Aim:
To map restriction sites in a given nucleotide sequence using an online tool.
Principle:
Restriction Mapper is a web site that finds restriction endonuclease cleavage sites in DNA
sequences. It supports linear and circular DNA and provides several ways to sort and filter
output. Also provided is a virtual digest function that simulates a simultaneous digest of your
sequence with enzymes of your choice.
Procedure:
- Get the nucleotide sequence from NCBI.
- Go to www.restrictionmapper.org .
- Paste the sequence collected from the nucleotide sequence database into the given window.
- Click on the Map Sites button on the right side of the window.
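At its core, what RestrictionMapper does is scan the sequence for each enzyme's recognition site. A simplified Python sketch (EcoRI and BamHI recognition sequences are their real sites; the DNA string is made up):

```python
def find_sites(seq, recognition):
    """Return 1-based positions where a recognition sequence occurs in seq."""
    seq, recognition = seq.upper(), recognition.upper()
    return [i + 1 for i in range(len(seq) - len(recognition) + 1)
            if seq[i:i + len(recognition)] == recognition]

# EcoRI recognizes GAATTC; BamHI recognizes GGATCC.
enzymes = {"EcoRI": "GAATTC", "BamHI": "GGATCC"}
dna = "TTGAATTCCGGATCCAAGAATTCA"
for name, site in enzymes.items():
    print(name, find_sites(dna, site))
```

A full mapper would also handle degenerate recognition codes (IUPAC ambiguity letters) and circular sequences, which are omitted here.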

Result:
Restriction mapping of the given sequence was done.
ORF finder
Aim:
To find open reading frames in the given DNA sequence.
Principle:
The ORF Finder (Open Reading Frame Finder) is a graphical analysis tool which finds all open
reading frames of a selectable minimum size in a user's sequence or in a sequence already in the
database.

This tool identifies all open reading frames using the standard or alternative genetic codes. The
deduced amino acid sequence can be saved in various formats and searched against the sequence
database using the WWW BLAST server. The ORF Finder should be helpful in preparing
complete and accurate sequence submissions. It is also packaged with the Sequin sequence
submission software.

Procedure:
- Go to NCBI.
- Get the nucleotide sequence.
- Select the ORF Finder tool.
- Paste the sequence in the box provided and click on the ORF Find button.
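The search performed by ORF Finder can be sketched as a scan of each forward reading frame for an ATG followed, in frame, by a stop codon. A simplified Python version (reverse-strand frames and alternative genetic codes are omitted):

```python
def find_orfs(seq, min_len=6):
    """Find ORFs (ATG .. stop) in the three forward reading frames.
    Returns (frame, 0-based start, orf_sequence) tuples."""
    stops = {"TAA", "TAG", "TGA"}
    seq = seq.upper()
    orfs = []
    for frame in range(3):
        i = frame
        while i + 3 <= len(seq):
            if seq[i:i + 3] == "ATG":
                # Walk codon by codon until an in-frame stop is found.
                for j in range(i + 3, len(seq) - 2, 3):
                    if seq[j:j + 3] in stops:
                        orf = seq[i:j + 3]
                        if len(orf) >= min_len:
                            orfs.append((frame, i, orf))
                        i = j  # resume scanning after this ORF
                        break
            i += 3
    return orfs

print(find_orfs("CCATGAAATTTTAAGG"))
```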

Result:
Open reading frames in the given sequence were identified and displayed.
NEB Cutter
Aim:
To find the ORF sequence of the given sequence using NEBcutter.
Principle:
This tool takes a DNA sequence and finds the large, non-overlapping open reading frames, using the E. coli genetic code, and the sites for all Type II and commercially available Type III restriction enzymes that cut the sequence just once. By default, only enzymes available from NEB are used, but other sets may be chosen. Just enter your sequence and click "Submit". Further options will appear with the output. The maximum size of the input file is 1 MByte, and the maximum sequence length is 300 KBases.
Procedure:
- Go to the NEBcutter website www.tools.neb.com .
- Insert the given sequence in FASTA format in the space provided and click Submit.
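NEBcutter's distinctive filter, enzymes that cut the sequence just once, can be sketched as follows (the two recognition sites are real Type II sites; the DNA string and the enzyme set are illustrative):

```python
def single_cutters(seq, enzymes):
    """Return the enzymes whose recognition site occurs exactly once in seq."""
    seq = seq.upper()
    result = {}
    for name, site in enzymes.items():
        positions = [i for i in range(len(seq) - len(site) + 1)
                     if seq[i:i + len(site)] == site]
        if len(positions) == 1:
            result[name] = positions[0] + 1  # 1-based site position
    return result

# EcoRI (GAATTC) and HindIII (AAGCTT) are real recognition sites.
enzymes = {"EcoRI": "GAATTC", "HindIII": "AAGCTT"}
dna = "AAGCTTGGGAATTCCCGAATTC"
print(single_cutters(dna, enzymes))
```

In this example EcoRI cuts twice and is therefore dropped, while HindIII cuts once and is reported.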
Result:
The restriction sites and open reading frames of the given sequence were displayed.
Sequence Similarity Search Using FASTA
Aim:
To find the similarity between sequences using FASTA.
Principle:
This tool provides sequence similarity searching against protein databases using the FASTA
suite of programs. FASTA provides a heuristic search with a protein query. FASTX and FASTY
translate a DNA query. Optimal searches are available with SSEARCH (local), GGSEARCH
(global) and GLSEARCH (global query, local database).
Procedure :
- Go to the EMBL-EBI website www.ebi.ac.uk and select the FASTA tool.
- Enter the query in the query box and click the Submit button.
- A hit list matching the query will appear on the screen.
- Select the required entry.
Result:
Sequences similar to the query were listed with their similarity scores.
Sequence Similarity Search Using BLAST
Aim :
To find similarities between sequences using BLAST.
Principle:
In bioinformatics, Basic Local Alignment Search Tool, or BLAST, is an algorithm for
comparing primary biological sequence information, such as the amino-acid sequences of
different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to
compare a query sequence with a library or database of sequences, and identify library sequences
that resemble the query sequence above a certain threshold. Different types of BLASTs are
available according to the query sequences. For example, following the discovery of a previously
unknown gene in the mouse, a scientist will typically perform a BLAST search of the human
genome to see if humans carry a similar gene; BLAST will identify sequences in the human
genome that resemble the mouse gene based on similarity of sequence. The BLAST program was designed by Stephen Altschul, Warren Gish, Webb Miller, Eugene Myers, and David J. Lipman at the NIH and was published in the Journal of Molecular Biology in 1990.
Procedure :
- Open the NCBI homepage www.ncbi.nlm.nih.gov and select BLAST.
- Enter the protein sequence in the space provided and click on the BLAST button.
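BLAST itself is a heuristic, but the local alignment it approximates can be illustrated with the exact Smith-Waterman dynamic-programming score. A minimal sketch (the scoring parameters are arbitrary choices, not BLAST's defaults, and only the score is returned, not the alignment itself):

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local-alignment score between sequences a and b (Smith-Waterman)."""
    rows, cols = len(a) + 1, len(b) + 1
    H = [[0] * cols for _ in range(rows)]
    best = 0
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            # Local alignment: scores are never allowed to drop below zero.
            H[i][j] = max(0, diag, H[i-1][j] + gap, H[i][j-1] + gap)
            best = max(best, H[i][j])
    return best

print(smith_waterman_score("ACGTACGT", "TTACGTAA"))
```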
Result:
Sequences similar to the query, together with alignment scores and E-values, were displayed on the screen.
Pairwise Sequence Alignment
Aim:
To align two sequences using the EMBOSS pairwise alignment tool.
Principle:
EMBOSS Needle reads two input sequences and writes their optimal global sequence alignment
to file. It uses the Needleman-Wunsch alignment algorithm to find the optimum alignment
(including gaps) of two sequences along their entire length.

This tool can be used for both protein and nucleotide alignment.

Procedure:
- Go to www.ebi.ac.uk/emboss.align in a web browser.
- Collect the nucleotide sequences from the nucleic acid sequence database.
- Paste the sequences in the spaces given.
- Click the Run button.
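The Needleman-Wunsch algorithm that EMBOSS Needle uses can be sketched in a few lines of dynamic programming (the scoring values are illustrative, not the EMBOSS defaults, and only the optimal score is returned):

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Optimal global alignment score of a and b (Needleman-Wunsch)."""
    rows, cols = len(a) + 1, len(b) + 1
    F = [[0] * cols for _ in range(rows)]
    # Aligning against an empty prefix costs one gap per residue.
    for i in range(1, rows):
        F[i][0] = i * gap
    for j in range(1, cols):
        F[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diag = F[i-1][j-1] + (match if a[i-1] == b[j-1] else mismatch)
            F[i][j] = max(diag, F[i-1][j] + gap, F[i][j-1] + gap)
    return F[len(a)][len(b)]

print(needleman_wunsch("GATTACA", "GATCA"))
```

Unlike the local (Smith-Waterman) variant, the global score may be negative, since the entire length of both sequences must be aligned.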

Result:
The given sequences are aligned pair wise by using EMBOSS.
Multiple Sequence Alignments

Aim:
To align multiple nucleotide sequences using the "ClustalW" multiple sequence alignment program.
Principle:
Multiple Sequence Alignment (MSA) is generally the alignment of three or more biological
sequences (protein or nucleic acid) of similar length. From the output, homology can be inferred
and the evolutionary relationships between the sequences studied.
By contrast, Pairwise Sequence Alignment tools are used to identify regions of similarity that
may indicate functional, structural and/or evolutionary relationships between two biological
sequences.

ClustalW is a general purpose multiple sequence alignment program for DNA or proteins.

Procedure:
- Go to www.ebi.ac.uk/clustalw/index.html in a web browser.
- Collect four different nucleotide sequences from the nucleic acid sequence database.
- Paste the sequences one by one in the space given.
- Click the Run button.

Result:
The given sequences are aligned using “ClustalW”.
Phylogenetic Analysis Using ClustalW
Aim :
To analyze the phylogenetic relationship among sequences using ClustalW.
Principle :
A phylogram is a branching diagram (tree) that is assumed to be an estimate of a phylogeny. The branch lengths are proportional to the amount of inferred evolutionary change. A cladogram is a branching diagram (tree) assumed to be an estimate of a phylogeny where the branches are of equal length. Therefore, cladograms show common ancestry, but do not indicate the amount of evolutionary "time" separating taxa. It is possible to see the tree distances by clicking on the diagram to get a menu of options. The options available allow you to do things like changing the colors of lines and fonts and showing the distances.
ClustalW phylogenetic calculations are based on the neighbor-joining method of Saitou and Nei.
Procedure:
- Go to www.ebi.ac.uk/clustalw/index.html in a web browser.
- Collect four different nucleotide sequences from the nucleic acid sequence database.
- Paste the sequences one by one in the space given.
- Click the Run button.
- Then click the Guide Tree button.
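Neighbor-joining starts from a matrix of pairwise distances; for aligned sequences, the simplest such distance is the p-distance, the proportion of differing sites. A minimal sketch (the four sequences are made up for illustration):

```python
def p_distance(a, b):
    """Proportion of differing sites between two aligned, equal-length sequences."""
    assert len(a) == len(b), "sequences must be aligned to the same length"
    diffs = sum(x != y for x, y in zip(a, b))
    return diffs / len(a)

# Four short aligned sequences (hypothetical).
seqs = {"seq1": "ACGTACGT", "seq2": "ACGTACGA",
        "seq3": "ACGAACGA", "seq4": "TCGAACGA"}
names = list(seqs)
for i, n1 in enumerate(names):
    for n2 in names[i + 1:]:
        print(n1, n2, p_distance(seqs[n1], seqs[n2]))
```

Neighbor-joining then iteratively merges the pair of taxa that minimizes the total branch length implied by this matrix; that step is omitted here.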
Result :
Cladogram representation of the sequence similarity was displayed on the screen.
Identification Of Genes From Genomes

Aim:
To find gene in the genome.
Principle:
In bioinformatics, GENSCAN is a program to identify complete gene structures in genomic DNA. It is a GHMM-based (generalized hidden Markov model) program that can be used to predict the location of genes and their exon-intron boundaries in genomic sequences from a variety of organisms. The GENSCAN web server can be found at MIT.
GENSCAN was developed by Christopher Burge in the research group of Samuel Karlin, Department of Mathematics, Stanford University.
Procedure:
- Go to www.genes.mit.edu in a web browser.
- Paste the DNA sequence in the space provided.
- Click on the Run GENSCAN button.

Result:
The predicted genes and their exon-intron boundaries were obtained.
Primer BLAST
Aim :
To analyze the primers in the given genome.
Principle:
Primer-BLAST was developed at NCBI to help users design primers that are specific to the input PCR template. It uses Primer3 to design PCR primers and then automatically analyzes them to avoid primer pairs (all combinations, including forward-reverse, forward-forward and reverse-reverse pairs) that can cause amplification of targets other than the input template.
Procedure:
- Go to the NCBI webpage www.ncbi.nlm.nih.gov .
- Click on the Primer-BLAST tool.
- Enter the sequence in FASTA format in the space provided.
- Click the BLAST button.
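One of the basic properties checked during primer design is the melting temperature. A rough estimate for short primers is the Wallace rule, Tm = 2(A+T) + 4(G+C); a minimal sketch (the primer sequence is illustrative, and more accurate nearest-neighbor models are used in practice):

```python
def wallace_tm(primer):
    """Approximate melting temperature (degrees C) by the Wallace rule:
    Tm = 2*(A+T) + 4*(G+C), a rule of thumb for short primers (< ~14 nt)."""
    p = primer.upper()
    at = p.count("A") + p.count("T")
    gc = p.count("G") + p.count("C")
    return 2 * at + 4 * gc

print(wallace_tm("ATGCATGCATGC"))
```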
Result:
The primers of the given genome are displayed on the screen.
Analysis Of Protein Structure Using Ramchandran Plot
Description
The Ramachandran plot shows the phi-psi torsion angles for all residues in the structure (except
those at the chain termini). Glycine residues are separately identified by triangles as these are not
restricted to the regions of the plot appropriate to the other side chain types.
The coloring/shading on the plot represents the different regions (see below) described in Morris et al. (1992): the darkest areas (here shown in red) correspond to the "core" regions representing the most favorable combinations of phi-psi values.
Ideally, one would hope to have over 90% of the residues in these "core" regions. The percentage of residues in the "core" regions is one of the better guides to stereochemical quality.
Ramachandran plot regions
The different regions on the Ramachandran plot are as described in Morris et al. (1992).
The regions are labeled as follows:
A - Core alpha
a - Allowed alpha
~a - Generous alpha
B - Core beta
b - Allowed beta
~b - Generous beta
L - Core left-handed alpha
l - Allowed left-handed alpha
~l - Generous left-handed alpha
p - Allowed epsilon
~p - Generous epsilon
The different regions were taken from the observed phi-psi distribution for 121,870 residues from 463 known X-ray protein structures. The two most favored regions are the "core" and "allowed" regions, which correspond to 10° × 10° pixels having more than 100 and 8 residues in them, respectively. The "generous" regions were defined by Morris et al. (1992) by extending out by 20° (two pixels) all round the "allowed" regions. In fact, the authors found very few residues in these "generous" regions, so they can probably be treated much like the "disallowed" regions, and any residues in them investigated more closely.
Protein Secondary Structure Prediction
Aim:
To predict the secondary structure of protein.
Principle:
Protein structure prediction is the prediction of the three-dimensional structure of a protein from
its amino acid sequence — that is, the prediction of its secondary, tertiary, and quaternary
structure from its primary structure. Structure prediction is fundamentally different from the
inverse problem of protein design. Protein structure prediction is one of the most important goals
pursued by bioinformatics and theoretical chemistry; it is highly important in medicine (for
example, in drug design) and biotechnology (for example, in the design of novel enzymes).
Every two years, the performance of current methods is assessed in the CASP experiment
(Critical Assessment of Techniques for Protein Structure Prediction). A continuous evaluation of
protein structure prediction web servers is performed by the community project CAMEO3D.
Procedure:
- Go to the Pole Bioinformatique Lyonnais homepage www.npsa-pbil.fr .
- Click on the secondary structure prediction tool.
- Insert the sequence in the space provided.
- Click on the Submit button.
Result :
Secondary structure prediction is displayed on the screen.
Automated Docking
Introduction:
AutoDock is an automated procedure for predicting the interaction of ligands with biomacromolecular targets. The motivation for this work arises from problems in the design of bioactive compounds, and in particular the field of computer-aided drug design. Progress in biomolecular X-ray crystallography continues to provide important protein and nucleic acid structures. These structures could be targets for bioactive agents in the control of animal and plant diseases, or simply keys to the understanding of fundamental aspects of biology. The precise interaction of such agents or candidate molecules with their targets is important in the development process. Our goal has been to provide a computational tool to assist researchers in the determination of biomolecular complexes.
In any docking scheme, two conflicting requirements must be balanced: the desire for a robust
and accurate procedure, and the desire to keep the computational demands at a reasonable level.
The ideal procedure would find the global minimum in the interaction energy between the
substrate and the target protein, exploring all available degrees of freedom (DOF) for the system.
However, it must also run on a laboratory workstation within an amount of time comparable to
other computations that a structural researcher may undertake, such as a crystallographic
refinement. In order to meet these demands a number of docking techniques simplify the docking
procedure.
AutoDock combines two methods to achieve these goals: rapid grid-based energy evaluation and efficient search of torsional freedom.
This guide includes information on the methods and files used by AutoDock, and on the use of AutoDock Tools to generate these files and to analyze results.
PROCEDURE:
AutoDock and AutoDock Tools, the graphical user interface for AutoDock, are available on the WWW at http://autodock.scripps.edu/
The WWW site also includes many resources for the use of AutoDock, including detailed tutorials that guide users through worked examples of basic AutoDock usage, docking with flexible rings, and virtual screening with AutoDock. Tutorials may be found at: http://autodock.scripps.edu/faqs-help/tutorial4
AutoDock calculations are performed in several steps: 1) preparation of coordinate files using AutoDock Tools, 2) precalculation of atomic affinities using AutoGrid, 3) docking of ligands using AutoDock, and 4) analysis of results using AutoDock Tools.
Step 1 - Coordinate file preparation. AutoDock 4.2 is parameterized to use a model of the protein and ligand that includes polar hydrogen atoms, but not hydrogen atoms bonded to carbon atoms. An extended PDB format, termed PDBQT, is used for coordinate files; it includes atomic partial charges and atom types. The current AutoDock force field uses several atom types for the most common atoms, including separate types for aliphatic and aromatic carbon atoms, and separate types for polar atoms that form hydrogen bonds and those that do not. PDBQT files also include information on the torsional degrees of freedom. In cases where specific side chains in the protein are treated as flexible, a separate PDBQT file is also created for the side-chain coordinates. In most cases, AutoDock Tools will be used for creating PDBQT files from traditional PDB files.
Step 2 - AutoGrid calculation. Rapid energy evaluation is achieved by precalculating atomic affinity potentials for each atom type in the ligand molecule being docked. In the AutoGrid procedure the protein is embedded in a three-dimensional grid and a probe atom is placed at each grid point. The energy of interaction of this single atom with the protein is assigned to the grid point. AutoGrid affinity grids are calculated for each type of atom in the ligand, typically carbon, oxygen, nitrogen and hydrogen, as well as grids of electrostatic and desolvation potentials. Then, during the AutoDock calculation, the energetics of a particular ligand configuration are evaluated using values from the grids.
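The grid lookup in this step is typically a trilinear interpolation between the eight grid points surrounding a ligand atom. A minimal sketch of the idea (the 2x2x2 grid and spacing are toy values, not a real AutoGrid map):

```python
def trilinear(grid, spacing, x, y, z):
    """Interpolate an affinity value at (x, y, z) from a 3-D grid of
    precomputed energies, the way grid-based docking scores a ligand atom.
    grid[i][j][k] holds the energy at point (i*spacing, j*spacing, k*spacing);
    (x, y, z) must lie strictly inside the grid so i+1, j+1, k+1 are valid."""
    fx, fy, fz = x / spacing, y / spacing, z / spacing
    i, j, k = int(fx), int(fy), int(fz)
    dx, dy, dz = fx - i, fy - j, fz - k
    val = 0.0
    # Weighted sum over the 8 corners of the enclosing grid cell.
    for di in (0, 1):
        for dj in (0, 1):
            for dk in (0, 1):
                w = ((dx if di else 1 - dx) *
                     (dy if dj else 1 - dy) *
                     (dz if dk else 1 - dz))
                val += w * grid[i + di][j + dj][k + dk]
    return val

# A 2x2x2 toy grid: energy increases linearly with x from 0.0 to 1.0.
grid = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]
print(trilinear(grid, 1.0, 0.25, 0.5, 0.5))
```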
Step 3 - Docking using AutoDock. Docking is carried out using one of several search methods. The most efficient method is a Lamarckian genetic algorithm (LGA), but traditional genetic algorithms and simulated annealing are also available. For typical systems AutoDock is run several times to give several docked conformations, and analysis of the predicted energies and the consistency of results is used to identify the best solutions.
Step 4 - Analysis using AutoDock Tools. AutoDock Tools includes a number of methods for analyzing the results of docking simulations, including tools for clustering results by conformational similarity, visualizing conformations, visualizing interactions between ligands and proteins, and visualizing the affinity potentials created by AutoGrid.
Introduction To SPSS
SPSS is a computer program used for survey authoring and development (IBM SPSS Data
Collection), data mining (IBM SPSS Modeller), text analysis, statistical analysis, and
collaboration and development (batch and automated scoring services).
Statistics program
SPSS (originally, Statistical Package for the Social Sciences) was released in its first version in 1968 after being developed by Norman H. Nie and C. Hadlai Hull. SPSS is among the most widely used programs for statistical analysis in social science. It is used by market researchers, health researchers, survey companies, government, education researchers, marketing organizations and others. The original SPSS manual (Nie, Bent and Hull, 1970) has been described as one of "sociology's most influential books". In addition to statistical analysis, data management (case selection, file reshaping, creating derived data) and data documentation (a metadata dictionary is stored in the data file) are features of the base software.
Statistics included in the base software:
- Descriptive statistics: Cross tabulation, Frequencies, Descriptives, Explore, Descriptive ratio statistics
- Bivariate statistics: Means, t-test, ANOVA, Correlation (bivariate, partial, distances), Nonparametric tests
- Prediction for numerical outcomes: Linear regression
- Prediction for identifying groups: Factor analysis, Cluster analysis (two-step, K-means, hierarchical), Discriminant
The many features of SPSS are accessible via pull-down menus or can be programmed with a proprietary 4GL command syntax language. Command syntax programming has the benefits of reproducibility, simplifying repetitive tasks, and handling complex data manipulations and analyses. Additionally, some complex applications can only be programmed in syntax and are not accessible through the menu structure. The pull-down menu interface also generates command syntax; this can be displayed in the output, although the default settings have to be changed to make the syntax visible to the user. Commands can also be pasted into a syntax file using the "Paste" button present in each menu. Programs can be run interactively or unattended, using the supplied production job facility. Additionally, a "macro" language can be used to write command language subroutines, and a Python programmability extension can access the information in the data dictionary and the data, and dynamically build command syntax programs. The Python programmability extension, introduced in SPSS 14, replaced the less functional SAX Basic "scripts" for most purposes, although SAX Basic remains available. In addition, the Python extension allows SPSS to run any of the statistics in the free software package R. From version 14 onwards, SPSS can be driven externally through supplied plug-ins.
SPSS places constraints on internal file structures, data types, data processing and matching files, which together considerably simplify programming. SPSS datasets have a two-dimensional table structure, where the rows typically represent cases (such as individuals or households) and the columns represent measurements (such as age, sex or household income). Only two data types are defined: numeric and text (or "string"). All data processing occurs sequentially, case by case, through the file. Files can be matched one-to-one and one-to-many, but not many-to-many.
The graphical user interface has two views, which can be toggled by clicking on one of the two tabs at the bottom left of the SPSS window. The 'Data View' shows a spreadsheet view of the cases (rows) and variables (columns). The 'Variable View' displays the metadata dictionary, where each row represents a variable and shows the variable name, variable label, value label(s), print width, measurement type and a variety of other characteristics. Cells in both views can be manually edited, defining the file structure and allowing data entry without using command syntax. This may be sufficient for small datasets. Larger datasets, such as statistical surveys, are more often created in data entry software, or entered during computer-assisted personal interviewing, by scanning and using optical character recognition and optical mark recognition software, or by direct capture from online questionnaires. These datasets are then read into SPSS.
SPSS can read and write data from ASCII text files (including hierarchical files), other statistics packages, spreadsheets and databases. SPSS can read and write to external relational database tables via ODBC and SQL.
Statistical output is to a proprietary file format (*.spv file, supporting pivot tables) for which, in
addition to the in-package viewer, a stand-alone reader can be downloaded. The proprietary
output can be exported to text or Microsoft Word, PDF, Excel, and other formats. Alternatively,
output can be captured as data (using the OMS command), as text, tab-delimited text, PDF, XLS,
HTML, XML, SPSS database or a variety of graphic image formats (JPEG, PNG, BMP, and
EMF).
SPSS Server is a version of SPSS with a client/server architecture. It had some features not available in the desktop version, such as scoring functions (scoring functions are included in the desktop version from version 19).
Calculation of Standard Error

The worked example below shows how to calculate the standard error. The standard error of the mean is given by:

SEx̄ = s / √n

where
SEx̄ = standard error of the mean
s = standard deviation of the sample
n = number of observations in the sample

Solved Example
X = 10, 20, 30, 40, 50
Total inputs (N) = 5

To find the mean:
Mean (xm) = (x1 + x2 + x3 + .... + xn)/N
Mean (xm) = 150/5
Mean (xm) = 30

To find the SD:
SD = √(1/(N-1) × ((x1 - xm)^2 + (x2 - xm)^2 + ... + (xn - xm)^2))
   = √(1/(5-1) × ((10-30)^2 + (20-30)^2 + (30-30)^2 + (40-30)^2 + (50-30)^2))
   = √(1/4 × ((-20)^2 + (-10)^2 + (0)^2 + (10)^2 + (20)^2))
   = √(1/4 × (400 + 100 + 0 + 100 + 400))
   = √250
   = 15.811

To find the standard error:
Standard Error = SD / √N
Standard Error = 15.8114 / √5
Standard Error = 15.8114 / 2.2361
Standard Error = 7.0711
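The same calculation can be checked with a short Python sketch of the formula above:

```python
import math

def standard_error(values):
    """Standard error of the mean: sample SD (n-1 denominator) over sqrt(n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return sd / math.sqrt(n)

print(round(standard_error([10, 20, 30, 40, 50]), 4))
```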
t-test/ ANOVA Using MS Excel

Data        Observation 1   Observation 2
Sample 1    5.1             4.0
Sample 2    4.9             5.2
Sample 3    5.4             4.5
Sample 4    4.6             5.6
Sample 5    5.5             5.4
Sample 6    5.7             4.5

t-test p-value = 0.397276592

Anova: Single Factor

SUMMARY
Groups   Count   Sum    Average   Variance
Row 1    2       9.1    4.55      0.605
Row 2    2       10.1   5.05      0.045
Row 3    2       9.9    4.95      0.405
Row 4    2       10.2   5.1       0.5
Row 5    2       10.9   5.45      0.005
Row 6    2       10.2   5.1       0.72

ANOVA
Source of Variation   SS         df   MS         F          P-value    F crit
Between Groups        0.846667   5    0.169333   0.445614   0.803512   4.387374
Within Groups         2.28       6    0.38
Total                 3.126667   11
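The sums of squares and F statistic in this table can be reproduced with a short Python sketch of one-way ANOVA (treating each sample row as a group of two observations, as Excel's single-factor ANOVA does here; the p-value itself requires the F distribution and is omitted):

```python
def anova_one_way(groups):
    """One-way ANOVA: returns (SS_between, SS_within, F) for a list of groups."""
    all_vals = [v for g in groups for v in g]
    grand_mean = sum(all_vals) / len(all_vals)
    # Between-groups sum of squares: group sizes times squared mean deviations.
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    # Within-groups sum of squares: deviations from each group's own mean.
    ss_within = sum((v - sum(g) / len(g)) ** 2 for g in groups for v in g)
    df_between = len(groups) - 1
    df_within = len(all_vals) - len(groups)
    f = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, f

# The six sample rows from the table above, two observations each.
rows = [[5.1, 4.0], [4.9, 5.2], [5.4, 4.5], [4.6, 5.6], [5.5, 5.4], [5.7, 4.5]]
ssb, ssw, f = anova_one_way(rows)
print(round(ssb, 6), round(ssw, 2), round(f, 6))
```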

Result :

The t-test p-value for the given data was found to be 0.397276592.


P-value for the given data was found to be 0.803512.
