“A Bio-Python Based Program to Generate Random Protein Sequences,

each sequence being 100 amino acid residues long”

Contents
1. Project Outline
2. Introduction
3. Objective
4. Introduction to Amino Acids
A. General Structure
B. Physical Properties
C. Classification
D. Peptide Bond Formation
E. Physicochemical Properties
F. Proteinogenic amino acids
5. The Genetic Code
A. RNA Codon Table
B. DNA Codon Table
6. Gene Expression
A. Transcription
B. RNA Processing
C. Translation
7. FASTA & FASTA Format
8. Python
A. Introduction, Features, Uses etc.
B. The IDLE User Interfac…
C. Data Types
9. Bio-Python
10. PROJECT CODE: “A Bio-Python Based Program to Generate Random Protein
Sequences, each sequence being 100 amino acid residues long.”
11. Explanation of the Code & Outputs
A. Bio-Python Libraries Used
B. Sample Outputs
12. Conclusion
13. Recommendations for improving this project
14. Glossary
15. Bibliography

Project submitted by Hemant Kumar Betala (Reg No: 621033475)


2 nd Semester, MSc-Bioinformatics, Sikkim Manipal University Feb- 2011

Introduction to Amino Acids


Amino acids are molecules containing an amine group, a carboxylic acid
group and a side-chain that varies between different amino acids. The key
elements of an amino acid are carbon, hydrogen, oxygen, and nitrogen.
They are the chemical units or "building blocks" of the body that make up
proteins. Protein substances make up the muscles, tendons, organs, glands,
nails, and hair. Growth, repair and maintenance of all cells are dependent
upon them.

History
The first few amino acids were discovered in the early 19th century. In
1806, the French chemists Louis-Nicolas Vauquelin and Pierre Jean Robiquet
isolated a compound in asparagus that proved to be asparagine, the first amino
acid to be discovered. Another amino acid discovered in the early 19th
century was cystine, in 1810, although its monomer, cysteine, was discovered
much later, in 1884. Glycine and leucine were also discovered around this time,
in 1820, by H. Braconnot from gelatin. The term "amino acid" has been used in
the English language since 1898.

General Structure
In the structure shown at the top of the page, R represents a side-chain
specific to each amino acid. The carbon atom next to the carboxyl group is
called the α-carbon, and amino acids with a side-chain bonded to this carbon
are referred to as alpha amino acids. These are the most common form found in
nature. In the alpha amino acids, the α-carbon is a chiral carbon atom, with
the exception of glycine. In amino acids that have a carbon chain attached to
the α-carbon (such as lysine), the carbons are labeled in order as α, β, γ, δ,
and so on. In some amino acids, the amine group is attached to the β- or
γ-carbon, and these are therefore referred to as beta or gamma amino acids.

Optical Isomers Of Amino Acids


If a carbon atom is attached to four different groups, it is asymmetric
and therefore exhibits optical isomerism. The amino acids, except glycine,
possess four distinct groups (R, H, COO⁻, NH₃⁺) held by an α-carbon. Thus all the
amino acids (except glycine, where R = H) have optical isomers. Because of these
four different groups attached to the same carbon atom, amino acids (apart
from glycine) are chiral.

The lack of a plane of symmetry means that there will be two stereoisomers of
an amino acid (apart from glycine) - one the non-superimposable mirror image
of the other.

For a general 2-amino acid, the isomers are:

All the naturally occurring amino acids have the right-hand structure in this
diagram. This is known as the "L-" configuration. The other one is known as the
"D-" configuration. We can recognise the L- configuration by imagining that we
are looking down from above on the right-hand structure in the above diagram.

We can't tell by looking at a structure whether that isomer will rotate the plane
of polarisation of plane polarised light clockwise or anticlockwise.

All the naturally occurring amino acids have the same L- configuration, but they
include examples which rotate the plane clockwise (+) and those which do the
opposite (-).

For example:

• (+) Alanine
• (-) Cysteine
• (-) Tyrosine
• (+) Valine

Zwitterions
The amine and carboxylic acid functional groups found in amino acids allow
them to have amphiprotic properties. Carboxylic acid groups (-CO2H) can be
deprotonated to become negative carboxylates (-CO2⁻), and α-amino groups
(NH2-) can be protonated to become positive α-ammonium groups (⁺NH3-). At
pH values greater than the pKa of the carboxylic acid group (mean for the 20
common amino acids is about 2.2), the negative carboxylate ion predominates. At
pH values lower than the pKa of the α-ammonium group (mean for the 20
common α-amino acids is about 9.4), the nitrogen is predominantly protonated
as a positively charged α-ammonium group. Thus, at pH between 2.2 and 9.4,
the predominant form adopted by α-amino acids contains a negative
carboxylate and a positive α-ammonium group, as shown in structure (2), so
has net zero charge. This molecular state is known as a zwitterion, from the
German Zwitter, meaning hermaphrodite or hybrid.

Figure: An amino acid in its (1) unionized and (2) zwitterionic forms

Below pH 2.2, the predominant form will have a neutral carboxylic acid group
and a positive α-ammonium ion (net charge +1), and above pH 9.4, a negative
carboxylate and neutral α-amino group (net charge -1). The fully neutral form
(structure (1)) is a very minor species in aqueous solution throughout the pH
range (less than 1 part in 10⁷). Amino acids also exist as zwitterions in the
solid phase, and crystallize with salt-like properties unlike typical organic
acids or amines.
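
The pH behaviour described above can be illustrated with a short Python sketch. It uses the mean pKa values quoted in the text (2.2 and 9.4) and applies the Henderson-Hasselbalch relation, which the text does not name and which is assumed here as the standard way to estimate the fraction of each group that is ionised. The function names are illustrative only, and the side-chain is assumed to be uncharged.

```python
def net_charge(ph, pka_carboxyl=2.2, pka_ammonium=9.4):
    """Estimate the average net charge of an amino acid with an uncharged side-chain."""
    # Fraction of alpha-ammonium groups still protonated (each contributes +1).
    positive = 1.0 / (1.0 + 10 ** (ph - pka_ammonium))
    # Fraction of carboxyl groups deprotonated (each contributes -1).
    negative = 1.0 / (1.0 + 10 ** (pka_carboxyl - ph))
    return positive - negative


def predominant_form(ph, pka_carboxyl=2.2, pka_ammonium=9.4):
    """Return the form that predominates at the given pH, as described in the text."""
    if ph < pka_carboxyl:
        return "cation (+1)"
    if ph > pka_ammonium:
        return "anion (-1)"
    return "zwitterion (0)"


for ph in (1.0, 5.8, 12.0):
    print(ph, predominant_form(ph), round(net_charge(ph), 3))
```

The average net charge stays close to zero across most of the range between the two pKa values, which is the region where the zwitterion predominates.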

Isoelectric point

At pH values between the two pKa values, the zwitterion predominates, but
coexists in dynamic equilibrium with small amounts of net negative and net
positive ions. At the exact midpoint between the two pKa values, the trace
amounts of net negative and net positive ions exactly balance, so that the
average net charge of all forms present is zero. This pH is known as the
isoelectric point, pI, so pI = ½(pKa1 + pKa2). The individual amino acids all have
slightly different pKa values, so have different isoelectric points. For amino
acids with charged side-chains, the pKa of the side-chain is involved. Thus for
aspartic acid and glutamic acid, which have negative side-chains,
pI = ½(pKa1 + pKaR), where pKaR is the side-chain pKa. Cysteine also has a
potentially negative side-chain with pKaR = 8.14, so its pI should be calculated
as for aspartic acid and glutamic acid, even though the side-chain is not
significantly charged at neutral pH. For histidine, lysine, and arginine, which
have positive side-chains, pI = ½(pKaR + pKa2). Amino acids have zero mobility
in electrophoresis at their isoelectric point, although this behaviour is more
usually exploited for peptides and proteins than for single amino acids.
Zwitterions have minimum solubility at their isoelectric point, and some
amino acids (in particular, those with non-polar side-chains) can be isolated by
precipitation from water by adjusting the pH to the required isoelectric point.
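
The three pI formulas above all average two pKa values, and differ only in which two are used. A minimal Python sketch of that rule follows. The function name and the `side_chain` labels are assumptions made for illustration, and the acidic example uses made-up pKa values rather than measured ones.

```python
def isoelectric_point(pka1, pka2, pka_r=None, side_chain="neutral"):
    """Return pI using the averaging rule given in the text.

    pka1: alpha-carboxyl pKa, pka2: alpha-ammonium pKa, pka_r: side-chain pKa.
    """
    if side_chain == "neutral":
        return 0.5 * (pka1 + pka2)
    if pka_r is None:
        raise ValueError("a side-chain pKa is required for charged side-chains")
    if side_chain == "acidic":   # negative side-chain: pI = 1/2 (pKa1 + pKaR)
        return 0.5 * (pka1 + pka_r)
    if side_chain == "basic":    # positive side-chain: pI = 1/2 (pKaR + pKa2)
        return 0.5 * (pka_r + pka2)
    raise ValueError("side_chain must be 'neutral', 'acidic' or 'basic'")


# Mean pKa values quoted in the text for the 20 common amino acids.
print(round(isoelectric_point(2.2, 9.4), 2))
# Hypothetical pKa values, for illustration only.
print(round(isoelectric_point(2.0, 9.8, pka_r=4.0, side_chain="acidic"), 2))
```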

Physical properties
Melting points

The amino acids are crystalline solids with surprisingly high melting points. It is
difficult to pin the melting points down exactly because the amino acids tend to
decompose before they melt. Decomposition and melting tend to be in the
200 - 300°C range. For the size of the molecules, this is very high. Something
unusual must be happening.
If we look at the general structure of an amino acid, we see that it has both a
basic amine group and an acidic carboxylic acid group.

There is an internal transfer of a hydrogen ion from the -COOH group to the
-NH2 group to leave an ion with both a negative charge and a positive charge.
This zwitterionic form is the form in which amino acids exist, even in the
solid state. Instead of the weaker hydrogen bonds and other intermolecular
forces that we expect, we actually have much stronger ionic attractions
between one ion and its neighbours. These ionic attractions take more energy
to break, and so the amino acids have high melting points for the size of the
molecules.

Sl. No.   Amino Acid        Melting Point (°C)
1         Alanine           297
2         Arginine          244
3         Asparagine        234-235
4         Aspartic Acid     270-271
5         Cysteine          175-178
6         Glutamic Acid     260-261
7         Glutamine         247-249
8         Glycine           185-186
9         Histidine         233
10        Isoleucine        287, 196
11        Leucine           274
12        Lysine            284
13        Methionine        293-295
14        Phenylalanine     224.5
15        Proline           280-282
16        Serine            283
17        Threonine         220-222
18        Tryptophan        228
19        Tyrosine          255-257
20        Valine            289

Solubility

Amino acids are generally soluble in water and insoluble in non-polar organic
solvents such as hydrocarbons.

This again reflects the presence of the zwitterions. In water, the ionic
attractions between the ions in the solid amino acid are replaced by strong
attractions between polar water molecules and the zwitterions. This is much
the same as any other ionic substance dissolving in water.
The extent of the solubility in water varies depending on the size and nature
of the "R" group.
The lack of solubility in non-polar organic solvents such as hydrocarbons is
because of the lack of attraction between the solvent molecules and the
zwitterions. Without strong attractions between solvent and amino acid, there
won't be enough energy released to pull the ionic lattice apart.

Sl. No.   Amino Acid        Solubility (g/100 mL H2O)
1         Alanine           16.65
2         Arginine          15
3         Asparagine        3.53
4         Aspartic Acid     0.778
5         Cysteine          very soluble
6         Glutamic Acid     0.864
7         Glutamine         2.5
8         Glycine           24.99
9         Histidine         4.19
10        Isoleucine        4.117
11        Leucine           2.426
12        Lysine            very soluble
13        Methionine        3.381
14        Phenylalanine     2.965
15        Proline           162.3
16        Serine            5.023
17        Threonine         very soluble
18        Tryptophan        1.136
19        Tyrosine          0.0453
20        Valine            8.85

Classification of Amino Acids


Twenty amino acids are used to form proteins in the human body; these are
called the proteinogenic amino acids. There are many different classification
systems, three of which are presented here.

I. Classification based on Polarity of Amino Acids


Amino acids are identified as polar or non-polar. Polar amino acids are
further subclassified as acidic-polar when the side chain contains a carboxylic
acid, and basic-polar when the side chain contains an amino group.

Classification     Amino Acid

Nonpolar           Glycine, Alanine, Valine, Leucine, Isoleucine, Proline,
                   Methionine, Phenylalanine, Tryptophan
Polar              Serine, Threonine, Asparagine, Glutamine, Cysteine, Tyrosine
Acidic (Polar)     Aspartic Acid, Glutamic Acid
Basic (Polar)      Lysine, Arginine, Histidine

Table 1: Classification based on Polarity of Amino Acids
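
For use in a program, the polarity classes of Table 1 can be held in a small lookup structure. The sketch below is one possible encoding, not the project's own code. It uses the standard one-letter amino acid codes, which Table 1 itself does not show, and the names `POLARITY_CLASSES` and `polarity_class` are illustrative.

```python
# Polarity classes from Table 1, written with standard one-letter codes.
POLARITY_CLASSES = {
    "Nonpolar": "GAVLIPMFW",
    "Polar": "STNQCY",
    "Acidic (Polar)": "DE",
    "Basic (Polar)": "KRH",
}


def polarity_class(residue):
    """Return the Table 1 class name for a single one-letter residue code."""
    residue = residue.upper()
    if len(residue) != 1:
        raise ValueError("expected a single one-letter residue code")
    for name, members in POLARITY_CLASSES.items():
        if residue in members:
            return name
    raise ValueError("unknown residue: " + residue)


def classify(sequence):
    """Return the Table 1 class of every residue in a protein sequence."""
    return [polarity_class(residue) for residue in sequence]


print(classify("MKDS"))
```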

II. Structural Classification.

Superstructure               Structure        Amino Acid

Monoamino, monocarboxylic    Unsubstituted    Glycine, L-Alanine, L-Valine,
                                              L-Leucine, L-Isoleucine
                             Heterocyclic     L-Proline
                             Aromatic         L-Phenylalanine, L-Tyrosine,
                                              L-Tryptophan
                             Thioether        L-Methionine
                             Hydroxy          L-Serine, L-Threonine
                             Mercapto         L-Cysteine
                             Carboxamide      L-Asparagine, L-Glutamine
Monoamino, dicarboxylic                       L-Aspartate, L-Glutamate
Diamino, monocarboxylic                       L-Lysine, L-Arginine, L-Histidine

Table 2: Structural Classification

III. Classification based on Structure of Side Chain

Classification                                       Amino Acid

Aliphatic (do not contain N, O, S in side chain)     Glycine, Alanine, Valine,
                                                     Leucine, Isoleucine
Sulfur-containing                                    Cysteine, Methionine
Aromatic (benzene ring in side chain)                Phenylalanine, Tyrosine,
                                                     Tryptophan
Neutral (hydroxyl or amide groups in side chain)     Serine, Threonine,
                                                     Asparagine, Glutamine
Acidic (carboxylate groups in side chain)            Aspartic acid, Glutamic acid
Basic                                                Lysine, Arginine
Imino acid (special case)                            Proline

Table 3: Classification based on Structure of Side Chain

Peptide bond formation

The condensation of two amino acids forms a peptide bond.

A peptide bond (amide bond) is a covalent chemical bond formed
between two molecules when the carboxyl group of one molecule reacts with
the amino group of the other molecule, thereby releasing a molecule of water
(H2O). This is a dehydration synthesis reaction (also known as a condensation
reaction), and usually occurs between amino acids. The resulting C(O)NH bond
is called a peptide bond, and the resulting molecule is an amide. The four-atom
functional group -C(=O)NH- is called a peptide link. Polypeptides and proteins
are chains of amino acids held together by peptide bonds.

As both the amine and carboxylic acid groups of amino acids can react to
form amide bonds, one amino acid molecule can react with another and
become joined through an amide linkage. This polymerization of amino acids is
what creates proteins. This condensation reaction yields the newly formed
peptide bond and a molecule of water. In cells, this reaction does not occur
directly; instead the amino acid is first activated by attachment to a transfer
RNA molecule through an ester bond. This aminoacyl-tRNA is produced in an
ATP-dependent reaction carried out by an aminoacyl tRNA synthetase. This
aminoacyl-tRNA is then a substrate for the ribosome, which catalyzes the attack
of the amino group of the elongating protein chain on the ester bond. As a
result of this mechanism, all proteins made by ribosomes are synthesized
starting at their N-terminus and moving towards their C-terminus.
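
Because each peptide bond releases one molecule of water, a linear chain of n residues contains n - 1 peptide bonds and weighs less than the sum of its free amino acids by n - 1 water molecules. The sketch below illustrates this bookkeeping. The molar masses are approximate average values for a small subset of amino acids, and the function names are illustrative assumptions.

```python
WATER_MASS = 18.015  # g/mol, mass lost per peptide bond formed

# Approximate molar masses of free amino acids in g/mol (illustrative subset only).
FREE_AMINO_ACID_MASS = {"G": 75.07, "A": 89.09, "S": 105.09}


def peptide_bonds(n_residues):
    """A linear chain of n residues contains n - 1 peptide bonds."""
    return max(n_residues - 1, 0)


def peptide_mass(sequence):
    """Sum the free amino acid masses and subtract one water per peptide bond."""
    total = sum(FREE_AMINO_ACID_MASS[residue] for residue in sequence)
    return total - peptide_bonds(len(sequence)) * WATER_MASS


print(peptide_bonds(100))             # bonds in a 100-residue chain
print(round(peptide_mass("GA"), 3))   # glycyl-alanine dipeptide
```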

Physicochemical properties of amino acids


The 20 amino acids encoded directly by the genetic code can be divided into
several groups based on their properties. Important factors are charge,
hydrophilicity or hydrophobicity, size, and functional groups. These properties
are important for protein structure and protein-protein interactions.
Water-soluble proteins tend to have their hydrophobic residues (leucine,
isoleucine, valine, phenylalanine, and tryptophan) buried in the middle of the
protein, whereas hydrophilic side-chains are exposed to the aqueous solvent.
Integral membrane proteins tend to have outer rings of exposed
hydrophobic amino acids that anchor them into the lipid bilayer. Part-way
between these two extremes, some peripheral membrane proteins
have a patch of hydrophobic amino acids on their surface that locks onto the
membrane. In similar fashion, proteins that have to bind to positively charged
molecules have surfaces rich in negatively charged amino acids like
glutamate and aspartate, while proteins binding to negatively charged
molecules have surfaces rich in positively charged chains like lysine and
arginine. There are different hydrophobicity scales of amino acid residues.
Some amino acids have special properties: cysteine can form
covalent disulfide bonds to other cysteine residues, proline forms a cycle
to the polypeptide backbone, and glycine is more flexible than other amino
acids.
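
The text notes that several hydrophobicity scales exist but does not name one. As an illustration, the sketch below uses the Kyte-Doolittle scale, chosen here as an assumption, to average hydropathy over a sequence (a value often called GRAVY). The function name is illustrative.

```python
# Kyte-Doolittle hydropathy values; positive values are more hydrophobic.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence):
    """Return the mean Kyte-Doolittle hydropathy of a protein sequence."""
    if not sequence:
        raise ValueError("sequence must not be empty")
    return sum(KYTE_DOOLITTLE[residue] for residue in sequence) / len(sequence)


# Residues the text lists as typically buried, versus charged residues.
print(round(gravy("LIVFW"), 2))
print(round(gravy("DEKR"), 2))
```

On this scale the buried residues named in the text average to a positive value, while the charged residues average to a strongly negative one.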

Many proteins undergo a range of post-translational modifications, when
additional chemical groups are attached to the amino acids in proteins. Some
modifications can produce hydrophobic lipoproteins, or hydrophilic
glycoproteins. These types of modification allow the reversible targeting of a
protein to a membrane. For example, the addition and removal of the fatty acid
palmitic acid to cysteine residues in some signaling proteins causes the proteins
to attach to and then detach from cell membranes.

Figure:
A Venn diagram showing the relationship of the
20 naturally occurring amino acids to a selection
of physicochemical properties thought to be
important in the determination of protein
structure

Proteinogenic amino acid


Proteinogenic amino acids are those amino acids that can be found in proteins
and require cellular machinery coded for in the genetic code of any organism for
their isolated production.
There are 22 standard amino acids, but only 21 are found in eukaryotes. Of the
22, 20 are directly encoded by the universal genetic code. Humans can
synthesize 11 of these 20 from each other or from other molecules of
intermediary metabolism. The other 9 must be consumed in the diet, and so
are called essential amino acids; those are histidine, isoleucine, leucine, lysine,
methionine, phenylalanine, threonine, tryptophan, and valine. The remaining
two, selenocysteine and pyrrolysine, are incorporated into proteins by unique
synthetic mechanisms. The word proteinogenic means "protein building".
Proteinogenic amino acids can be assembled into a polypeptide (the subunit of
a protein) through a process called translation (the second stage of protein
biosynthesis, part of the overall process of gene expression).

Figure: Proteinogenic amino acids
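
Since the stated aim of the project is to generate random protein sequences of 100 residues, the 20 directly encoded amino acids can be treated as an alphabet to sample from. The following is a minimal sketch using only the Python standard library. It is not the project's own Bio-Python code, and the function names, the record prefix and the 60-character line width are assumptions. Each residue is drawn with equal probability, which ignores natural amino acid frequencies.

```python
import random

# One-letter codes of the 20 amino acids directly encoded by the genetic code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length=100, rng=None):
    """Return a random protein sequence, each residue chosen uniformly."""
    rng = rng or random
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def to_fasta(sequences, prefix="random_protein", width=60):
    """Format sequences as FASTA text: a '>' header line, then wrapped residues."""
    lines = []
    for number, sequence in enumerate(sequences, start=1):
        lines.append(f">{prefix}_{number}")
        lines.extend(sequence[i:i + width] for i in range(0, len(sequence), width))
    return "\n".join(lines)


# Passing a seeded generator makes the output reproducible.
generator = random.Random(2011)
print(to_fasta(random_protein(100, generator) for _ in range(3)))
```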

Genetic code
The genetic code is the set of rules by which information encoded in
genetic material (DNA or mRNA sequences) is translated into proteins (amino
acid sequences) by living cells. The code defines a mapping between tri-
nucleotide sequences, called codons, and amino acids. With some exceptions, a
triplet codon in a nucleic acid sequence specifies a single amino acid. Because
the vast majority of genes are encoded with exactly the same code (see the
RNA codon table), this particular code is often referred to as the canonical or
standard genetic code, or simply the genetic code, though in fact there are
many variant codes. For example, protein synthesis in human mitochondria
relies on a genetic code that differs from the standard genetic code.

Not all genetic information is stored using the genetic code. All
organisms' DNA contains regulatory sequences, intergenic segments,
chromosomal structural areas, and other non-coding DNA that can contribute
greatly to phenotype. Those elements operate under sets of rules that are
distinct from the codon-to-amino acid paradigm underlying the genetic code.

After the structure of DNA was discovered by James Watson and Francis Crick,
who used the experimental evidence of Maurice Wilkins and Rosalind Franklin
(among others), serious efforts to understand the nature of the encoding of
proteins began. George Gamow postulated that a three-letter code must be
employed to encode the 20 standard amino acids used by living cells to build
proteins, because 3 is the smallest integer n such that 4ⁿ is at least 20.
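
Gamow's counting argument can be checked directly: with 4 bases, words of length 1 and 2 give only 4 and 16 combinations, which is fewer than 20. The function name in this short sketch is illustrative.

```python
from itertools import count


def min_codon_length(n_bases=4, n_amino_acids=20):
    """Return the smallest word length n with n_bases ** n >= n_amino_acids."""
    for n in count(1):
        if n_bases ** n >= n_amino_acids:
            return n


for n in (1, 2, 3):
    print(n, 4 ** n)
print("minimum codon length:", min_codon_length())
```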

Transfer of information via the genetic code


The genome of an organism is inscribed in DNA, or in the case of some
viruses, RNA. The portion of the genome that codes for a protein or RNA is
called a gene. Those genes that code for proteins are composed of
tri-nucleotide units called codons, each coding for a single amino acid. Each
nucleotide sub-unit consists of a phosphate, a deoxyribose sugar and one of the
four nitrogenous nucleobases. The purine bases adenine (A) and guanine (G)
are larger and consist of two aromatic rings. The pyrimidine bases cytosine (C)
and thymine (T) are smaller and consist of only one aromatic ring. In the
double-helix configuration, two strands of DNA are joined to each other by
hydrogen bonds in an arrangement known as base pairing. These bonds almost
always form between an adenine base on one strand and a thymine base on
the other strand, or between a cytosine base on one strand and a guanine base
on the other. This means that the number of A and T bases will be the same in a
given double helix, as will the number of G and C bases. In RNA, thymine (T) is
replaced by uracil (U), and the deoxyribose is substituted by ribose.
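
The base-pairing rule and its consequence for base counts can be sketched in a few lines of Python. The function names are illustrative, and the partner strand is built position by position, so strand direction is ignored for simplicity.

```python
from collections import Counter

# Watson-Crick pairing: A with T, and G with C.
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def paired_strand(strand):
    """Return the base opposite each position of the given DNA strand."""
    return "".join(DNA_PAIRS[base] for base in strand)


def duplex_base_counts(strand):
    """Count bases over both strands of the double helix."""
    return Counter(strand + paired_strand(strand))


def dna_to_rna(strand):
    """Write a DNA sequence in the RNA alphabet by replacing T with U."""
    return strand.replace("T", "U")


counts = duplex_base_counts("AAGCTTAG")
print(counts["A"] == counts["T"], counts["G"] == counts["C"])
print(dna_to_rna("ATGC"))
```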

Each protein-coding gene is transcribed into a molecule of the related
polymer RNA. In prokaryotes, this RNA functions as messenger RNA or mRNA;
in eukaryotes, the transcript needs to be processed to produce a mature mRNA.
The mRNA is, in turn, translated on the ribosome into an amino acid chain or
polypeptide. The process of translation requires transfer RNAs specific for
individual amino acids with the amino acids covalently attached to them,
guanosine triphosphate as an energy source, and a number of translation
factors. tRNAs have anticodons complementary to the codons in mRNA and can
be "charged" covalently with amino acids at their 3' terminal CCA ends.
Individual tRNAs are charged with specific amino acids by enzymes known as
aminoacyl tRNA synthetases, which have high specificity for both their
cognate amino acids and tRNAs. The high specificity of these enzymes is a
major reason why the fidelity of protein translation is maintained.

There are 4³ = 64 different codon combinations possible with a triplet codon
of three nucleotides; all 64 codons are assigned to either amino acids or stop
signals during translation. If, for example, the RNA sequence UUUAAACCC is
considered and the reading frame starts with the first U (by convention, 5' to
3'), there are three codons, namely UUU, AAA and CCC, each of which specifies
one amino acid. This RNA sequence will be translated into an amino acid
sequence three amino acids long. A given amino acid may be encoded by
between one and six different codon sequences. A comparison may be made
with computer science, where the codon is similar to a word, which is the
standard "chunk" for handling data (like one amino acid of a protein), and a
nucleotide is similar to a bit, in that it is the smallest unit.
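
The worked example above can be reproduced with a short Python sketch that builds all 64 codons of the standard genetic code and translates UUUAAACCC. The compact amino acid string, which lists the codons in U, C, A, G order with `*` marking stop codons, is a common way to write the standard table and is an assumption of this sketch rather than something taken from the text. The function names are illustrative.

```python
from collections import Counter
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ...; '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def split_codons(rna):
    """Split an RNA string into complete triplets, starting at the first base."""
    usable = len(rna) - len(rna) % 3
    return [rna[i:i + 3] for i in range(0, usable, 3)]


def translate(rna):
    """Translate every complete codon into a one-letter amino acid code."""
    return "".join(CODON_TABLE[codon] for codon in split_codons(rna))


# Number of codons per amino acid, ignoring stop codons.
degeneracy = Counter(aa for aa in CODON_TABLE.values() if aa != "*")

print(len(CODON_TABLE))
print(split_codons("UUUAAACCC"), translate("UUUAAACCC"))
print(min(degeneracy.values()), max(degeneracy.values()))
```

The three codons translate to phenylalanine (F), lysine (K) and proline (P), and the codon counts per amino acid range from one to six, as stated above.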

RNA codon table

The codon AUG both codes for methionine and serves as an initiation site: the
first AUG in an mRNA's coding region is where translation into protein begins.

RNA Codon Inverse Table

DNA codon table


The genetic code is traditionally represented as an RNA codon table due to the
biochemical nature of the protein translation process. However, with the rise
of computational biology and genomics, proteins have become increasingly
studied at a genomic level. As a result, the practice of representing the genetic
code as a DNA codon table has become more popular. The DNA codons in such
tables occur on the sense DNA strand and are written in the 5' → 3' direction.

The codon ATG both codes for methionine and serves as an initiation site: the first ATG in
the DNA's coding region is where translation into protein begins.
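
A DNA codon table can be used in the same way as an RNA one, reading the sense strand 5' to 3'. The sketch below finds the first ATG in a sense-strand sequence and translates from there until a stop codon. It is a simplification: it treats the whole input as the coding region, and the function name and example sequence are made up for illustration. The compact amino acid string is the standard genetic code with codons in T, C, A, G order and `*` marking stops.

```python
from itertools import product

DNA_BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...; '*' = stop.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
DNA_CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(DNA_BASES, repeat=3), AMINO_ACIDS)
}


def translate_from_first_atg(sense_strand):
    """Translate a sense strand from its first ATG up to the first stop codon."""
    start = sense_strand.find("ATG")
    if start == -1:
        return ""  # no initiation codon found
    protein = []
    for i in range(start, len(sense_strand) - 2, 3):
        amino_acid = DNA_CODON_TABLE[sense_strand[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


# Hypothetical sequence: two leading bases, then ATG AAA TTT TAG.
print(translate_from_first_atg("CCATGAAATTTTAGGG"))
```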

GENE EXPRESSION
Gene expression is the process by which information from a gene is used
in the synthesis of a functional gene product. These products are often
proteins, but in non-protein coding genes such as ribosomal RNA (rRNA),
transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a
functional RNA. The process of gene expression is used by all known life -
eukaryotes (including multicellular organisms), prokaryotes (bacteria and
archaea) and viruses - to generate the macromolecular machinery for life.
Several steps in the gene expression process may be modulated, including the
transcription, RNA splicing, translation, and post-translational modification of a
protein. Gene regulation gives the cell control over structure and function, and
is the basis for cellular differentiation, morphogenesis and the versatility and
adaptability of any organism. Gene regulation may also serve as a substrate for
evolutionary change, since control of the timing, location, and amount of gene
expression can have a profound effect on the functions (actions) of the gene in
a cell or in a multicellular organism.

Legend: L = Leucine, P = Proline, H = Histidine, E = Glutamic Acid,
K = Lysine, T = Threonine

This table explains the gene expression, short codes, genetic codons and
abbreviations for all the 20 amino acids.

The mechanism of gene expression involves the following stages:

1. Transcription
2. RNA Processing or Post Transcriptional Modifications
3. Translation

I. Transcription

Transcription is the first step leading to gene expression.

Transcription is the process of creating a complementary RNA copy of a
sequence of DNA. Both RNA and DNA are nucleic acids, which use base pairs of
nucleotides as a complementary language that can be converted back and forth
from DNA to RNA by the action of the correct enzymes.

During transcription, a DNA sequence is read by RNA polymerase, which
produces a complementary, antiparallel RNA strand. As opposed to DNA
replication, transcription results in an RNA complement that includes uracil (U)
in all instances where thymine (T) would have occurred in a DNA complement.

Transcription can be explained easily in 4 or 5 steps, each moving like a wave
along the DNA.

1. Helicase unwinds ("unzips") the DNA by breaking the hydrogen bonds
   between complementary nucleotides.
2. RNA nucleotides are paired with complementary DNA bases.
3. The RNA sugar-phosphate backbone forms with assistance from RNA
   polymerase.
4. Hydrogen bonds of the untwisted RNA+DNA helix break, freeing the
   newly synthesized RNA strand.
5. If the cell has a nucleus, the RNA is further processed (addition of a 3'
   poly-A tail and a 5' cap) and exits to the cytoplasm through the
   nuclear pore complex.

The stretch of DNA transcribed into an RNA molecule is called a
transcription unit and encodes at least one gene. If the gene transcribed
encodes a protein, the result of transcription is messenger RNA (mRNA), which
will then be used to create that protein via the process of translation.
Alternatively, the transcribed gene may encode either ribosomal RNA
(rRNA) or transfer RNA (tRNA), other components of the protein-assembly
process, or other ribozymes.

A DNA transcription unit encoding a protein contains not only the
sequence that will eventually be directly translated into the protein (the coding
sequence) but also regulatory sequences that direct and regulate the synthesis
of that protein. The regulatory sequence before (upstream from) the coding
sequence is called the five prime untranslated region (5'UTR), and the sequence
following (downstream from) the coding sequence is called the three prime
untranslated region (3'UTR).

As in DNA replication, DNA is read from 3' → 5' during transcription.
Meanwhile, the complementary RNA is created in the 5' → 3' direction. This
means its 5' end is created first in base pairing. Although DNA is arranged as
two antiparallel strands in a double helix, only one of the two DNA strands,
called the template strand, is used for transcription. This is because RNA is only
single-stranded, as opposed to double-stranded DNA. The other DNA strand is
called the coding strand, because its sequence is the same as the newly created
RNA transcript (except for the substitution of uracil for thymine). The use of
only the 3' → 5' strand eliminates the need for the Okazaki fragments seen in
DNA replication.
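
The relationship between the template strand, the coding strand and the RNA transcript can be checked with a short Python sketch. Here the template strand is written in the order it is read (3' → 5'), so pairing each base in turn gives the RNA in its 5' → 3' order. The function names and the example sequence are illustrative assumptions.

```python
# DNA template base -> RNA base added opposite it during transcription.
TEMPLATE_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}
DNA_PAIRS = {"A": "T", "T": "A", "G": "C", "C": "G"}


def template_strand(coding_5_to_3):
    """Return the template strand, written 3' -> 5', for a coding strand."""
    return "".join(DNA_PAIRS[base] for base in coding_5_to_3)


def transcribe(template_3_to_5):
    """Read the template 3' -> 5' and build the RNA 5' -> 3'."""
    return "".join(TEMPLATE_TO_RNA[base] for base in template_3_to_5)


def rna_from_coding(coding_5_to_3):
    """The transcript matches the coding strand with U in place of T."""
    return coding_5_to_3.replace("T", "U")


coding = "ATGGCC"                    # hypothetical coding strand, 5' -> 3'
template = template_strand(coding)   # the same region, 3' -> 5'
print(template, transcribe(template), rna_from_coding(coding))
```

Both routes give the same RNA, which is why the coding strand can be used to read off the transcript directly.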

Transcription is divided into 5 stages: pre-initiation, initiation, promoter clearance,
elongation and termination.

1. Pre-initiation

In eukaryotes, RNA polymerase, and therefore the initiation of transcription,
requires the presence of a core promoter sequence in the DNA. Promoters are
regions of DNA that promote transcription and, in eukaryotes, are found at -30,
-75, and -90 base pairs upstream from the start site of transcription. RNA
polymerase is able to bind to core promoters in the presence of various specific
transcription factors.

The most common type of core promoter in eukaryotes is a short DNA
sequence known as a TATA box, found 25-30 base pairs upstream from the
start site of transcription. The TATA box, as a core promoter, is the binding
site for a transcription factor known as TATA-binding protein (TBP). Five more
transcription factors and RNA polymerase then combine around the TATA box in
a series of stages to form a pre-initiation complex. One transcription factor
has DNA helicase activity and so is involved in separating the opposing strands
of double-stranded DNA to provide access to a single-stranded DNA template.
Thus, the pre-initiation complex contains:

A. Core Promoter Sequence
B. Transcription Factors
C. DNA Helicase
D. RNA Polymerase
E. Activators and Repressors
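
A core promoter element such as the TATA box can be located with simple
pattern matching. The sketch below is illustrative only: it assumes the
commonly quoted consensus TATA(A/T)A(A/T), a hypothetical upstream sequence,
and a hypothetical helper name (find_tata_boxes). Real promoter prediction
needs far more than a single motif search.

```python
import re

# Assumed consensus: TATA followed by A/T, A, A/T.
TATA_BOX = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(upstream, window=(-35, -20)):
    """Return (position, motif) pairs for TATA-like motifs inside a window.

    `upstream` is the coding-strand DNA immediately 5' of the start site,
    written 5'->3', so its last base is position -1.
    """
    hits = []
    for match in TATA_BOX.finditer(upstream):
        position = match.start() - len(upstream)  # negative = upstream
        if window[0] <= position <= window[1]:
            hits.append((position, match.group()))
    return hits


# Hypothetical 50-base upstream region with a TATA box starting at -30.
upstream = "GC" * 10 + "TATAAAA" + "GC" * 11 + "G"
hits = find_tata_boxes(upstream)
print(hits)
```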

2. Initiation

RNAP = RNA polymerase

In bacteria, transcription begins with the binding of RNA polymerase to
the promoter in DNA. RNA polymerase is a core enzyme consisting of five
subunits: 2 α subunits, 1 β subunit, 1 β' subunit, and 1 ω subunit. At the start of
initiation, the core enzyme is associated with a sigma factor that aids in finding
the promoter's -35 and -10 elements, located about 35 and 10 base pairs
upstream of the transcription start site.

Transcription initiation is more complex in eukaryotes. Eukaryotic RNA
polymerase does not directly recognize the core promoter sequences. Instead,
a collection of proteins called transcription factors mediates the binding of RNA
polymerase and the initiation of transcription. Only after certain transcription
factors are attached to the promoter does the RNA polymerase bind to it. The
completed assembly of transcription factors and RNA polymerase binds to the
promoter, forming a transcription initiation complex. Transcription in the
archaea domain is similar to transcription in eukaryotes.

3. Promoter clearance

After the first bond is synthesized, the RNA polymerase must clear the
promoter. During this time there is a tendency to release the RNA transcript
and produce truncated transcripts. This is called abortive initiation and is
common for both eukaryotes and prokaryotes. Abortive initiation continues to
occur until the σ factor rearranges, resulting in the transcription elongation
complex (which gives a 35 base-pair moving footprint). The σ factor is released
before 80 nucleotides of mRNA are synthesized. Once the transcript reaches
approximately 23 nucleotides, it no longer slips and elongation can occur. This,
like most of the remainder of transcription, is an energy-dependent process,
consuming adenosine triphosphate (ATP).

4. Elongation

One strand of the DNA, the template strand (or noncoding strand), is used
as a template for RNA synthesis. As transcription proceeds, RNA polymerase
traverses the template strand and uses base pairing complementarity with the
DNA template to create an RNA copy. Although RNA polymerase traverses the
template strand from 3' → 5', the coding (non-template) strand and newly
formed RNA can also be used as reference points, so transcription can be
described as occurring 5' → 3'. This produces an RNA molecule from 5' → 3', an
exact copy of the coding strand, except that thymines are replaced with uracils
and the nucleotides contain the 5-carbon sugar ribose where DNA has
deoxyribose (one less oxygen atom) in its sugar-phosphate backbone.

Elongation also involves a proofreading mechanism that can replace
incorrectly incorporated bases. In eukaryotes, this may correspond with short
pauses during transcription that allow appropriate RNA editing factors to bind.

5. Termination

Bacteria use two different strategies for transcription termination. In
Rho-independent transcription termination, RNA transcription stops when the
newly synthesized RNA molecule forms a G-C-rich hairpin loop followed by a
run of Us. When the hairpin forms, the mechanical stress breaks the weak
rU-dA bonds, now filling the DNA-RNA hybrid. This pulls the poly-U transcript
out of the active site of the RNA polymerase, in effect terminating
transcription. In the "Rho-dependent" type of termination, a protein factor
called "Rho" destabilizes the interaction between the template and the mRNA,
thus releasing the newly synthesized mRNA from the elongation complex.
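
The Rho-independent signal described above (a G-C-rich hairpin followed by a
run of Us) can be searched for with a simple heuristic. The following is a
rough sketch, not a validated terminator predictor: the stem length, loop
sizes, G-C threshold and U-run length are all assumed values, and the RNA
sequence is hypothetical.

```python
RNA_COMPLEMENT = {"A": "U", "U": "A", "G": "C", "C": "G"}


def reverse_complement(rna):
    """Return the reverse complement of an RNA string."""
    return "".join(RNA_COMPLEMENT[base] for base in reversed(rna))


def find_terminators(rna, stem=6, loop_sizes=range(3, 9), min_gc=0.6, min_us=4):
    """Return start indexes of G-C-rich stem-loops that are followed by a U run."""
    hits = []
    for i in range(len(rna) - 2 * stem):
        left = rna[i:i + stem]
        gc_fraction = (left.count("G") + left.count("C")) / stem
        if gc_fraction < min_gc:
            continue
        for loop in loop_sizes:
            right_start = i + stem + loop
            right = rna[right_start:right_start + stem]
            tail = rna[right_start + stem:right_start + stem + min_us]
            # The right arm must pair with the left arm, then Us must follow.
            if right == reverse_complement(left) and tail == "U" * min_us:
                hits.append(i)
                break
    return hits


# Hypothetical transcript: stem GCCGCC, loop GAAA, stem GGCGGC, then UUUUUU.
rna = "AAGCU" + "GCCGCC" + "GAAA" + "GGCGGC" + "UUUUUU" + "AC"
hits = find_terminators(rna)
print(hits)
```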

Transcription termination in eukaryotes is less well understood but involves
cleavage of the new transcript followed by template-independent addition of
As at its new 3' end, in a process called polyadenylation.

II. RNA Processing or Post Transcriptional Modifications

RNA processing, or post-transcriptional modification, is the process by which,
in eukaryotic cells, primary transcript RNA is converted into mature RNA. A
notable example is the conversion of precursor messenger RNA into mature
messenger RNA (mRNA), which includes splicing and occurs prior to protein
synthesis. This process is vital for the correct translation of eukaryotic
genomes, because the primary RNA transcript produced by transcription contains
both exons, which are its coding sections, and introns, which are its
non-coding sections.

1. m-RNA processing
The pre-mRNA molecule undergoes three main modifications. These
modifications are 5' capping, 3' polyadenylation, and RNA splicing, which occur
in the cell nucleus before the RNA is translated.

A. 5' Processing involves CAPPING

Capping of the pre-mRNA involves the addition of 7-methylguanosine
(m7G) to the 5' end. To achieve this, the terminal 5' phosphate requires
removal, which is done with the aid of a phosphatase enzyme. The enzyme
guanosyltransferase then catalyses the reaction, which produces the
diphosphate 5' end. The diphosphate 5' end then attacks the α
phosphorus atom of a GTP molecule in order to add the guanine residue in a
5'-5' triphosphate link. The enzyme (guanine-N7-)-methyltransferase ("cap
MTase") transfers a methyl group from S-adenosyl methionine to the guanine
ring. This type of cap, with just the (m7G) in position, is called a cap 0
structure. The ribose of the adjacent nucleotide may also be methylated to give
a cap 1. Methylation of nucleotides further downstream in the RNA molecule
produces cap 2, cap 3 structures and so on. In these cases the methyl groups
are added to the 2' OH groups of the ribose sugar. The cap protects the 5' end
of the primary RNA transcript from attack by ribonucleases that have
specificity for 3'-5' phosphodiester bonds.

B. 3' Processing involves CLEAVAGE & POLYADENYLATION

The pre-mRNA processing at the 3' end of the RNA molecule involves
cleavage of its 3' end and then the addition of about 200 adenine residues to
form a poly(A) tail. The cleavage and adenylation reactions occur if a
polyadenylation signal sequence (5'-AAUAAA-3') is located near the 3' end of
the pre-mRNA molecule, followed by another sequence, which is
usually (5'-CA-3'). The second signal is the site of cleavage. A GU-rich sequence
is also usually present further downstream on the pre-mRNA molecule. After
the synthesis of the sequence elements, two multisubunit proteins called
cleavage and polyadenylation specificity factor (CPSF) and cleavage stimulation
factor (CStF) are transferred from RNA Polymerase II to the RNA molecule. The
two factors bind to the sequence elements. A protein complex forms that
contains additional cleavage factors and the enzyme polyadenylate polymerase
(PAP). This complex cleaves the RNA between the polyadenylation sequence
and the GU-rich sequence at the cleavage site marked by the (5'-CA-3')
sequence. Poly(A) polymerase then adds about 200 adenine units to the new
3' end of the RNA molecule using ATP as a precursor. As the poly(A) tail is
synthesised, it binds multiple copies of poly(A) binding protein, which protects
the 3' end from ribonuclease digestion.
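
The cleavage and polyadenylation steps can be mimicked on a string. This is a
simplified sketch under stated assumptions: the function name
(cleave_and_polyadenylate), the 30-base search window after the signal and the
example transcript are hypothetical, and the protein factors are not modelled.

```python
def cleave_and_polyadenylate(pre_mrna, tail_length=200, max_gap=30):
    """Cut after the first CA downstream of AAUAAA, then append a poly(A) tail.

    Returns None when no signal or no cleavage site is found.
    """
    signal = pre_mrna.find("AAUAAA")
    if signal == -1:
        return None
    search_from = signal + len("AAUAAA")
    # max_gap is an assumed window in which to look for the CA cleavage site.
    cut = pre_mrna.find("CA", search_from, search_from + max_gap)
    if cut == -1:
        return None
    cleaved = pre_mrna[:cut + 2]  # cleave just after the CA dinucleotide
    return cleaved + "A" * tail_length


# Hypothetical 3' region: signal, CA site, then a GU-rich stretch.
pre_mrna = "GGCUAGC" + "AAUAAA" + "GUCUGCUGUCGGCUCA" + "GCUGUGUUGUGUGUU"
mature = cleave_and_polyadenylate(pre_mrna)
print(len(mature))
```

The GU-rich stretch downstream of the cleavage site is discarded, and the
mature transcript ends in the 200-residue tail.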

2. RNA Splicing

Splicing is a modification of an RNA after transcription, in which introns are
removed and exons are joined. This is needed for the typical eukaryotic
messenger RNA before it can be used to produce a correct protein
through translation. For many eukaryotic introns, splicing is done in a series of
reactions which are catalyzed by the spliceosome, a complex of small nuclear
ribonucleoproteins (snRNPs), but there are also self-splicing introns.
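
Once exon boundaries are known, splicing amounts to joining slices of the
primary transcript. The sketch below assumes the exon coordinates are given in
advance (0-based, end-exclusive); the sequence and coordinates are
hypothetical, and locating the boundaries is not attempted here.

```python
def splice(pre_mrna, exons):
    """Join exon slices; `exons` holds (start, end) pairs, end exclusive."""
    return "".join(pre_mrna[start:end] for start, end in sorted(exons))


# Hypothetical transcript: exon, intron, exon, intron, exon.
# Both example introns start with GU and end with AG.
pre_mrna = "AUGGCC" + "GUAAGUCCUUAG" + "AUUGUA" + "GUGAGUUUCCAG" + "UGA"
exons = [(0, 6), (18, 24), (36, 39)]

mature = splice(pre_mrna, exons)
print(mature)
```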

III. Translation

Translation is the third stage of protein biosynthesis (part of the overall
process of gene expression). In translation, messenger RNA (mRNA) produced
by transcription is decoded by the ribosome to produce a specific amino acid
chain, or polypeptide, that will later fold into an active protein. In bacteria,
translation occurs in the cell's cytoplasm, where the large and small subunits of
the ribosome are located and bind to the mRNA. In eukaryotes, translation
occurs in the cytosol or across the membrane of the endoplasmic reticulum in a
process called vectorial synthesis. The ribosome facilitates decoding by
inducing the binding of tRNAs whose anticodon sequences are complementary to
the codons of the mRNA. The tRNAs carry specific amino acids that are chained
together into a polypeptide as the mRNA passes through and is "read" by the
ribosome in a fashion reminiscent of a stock ticker and ticker tape.

In many instances, the entire ribosome/mRNA complex will bind to the outer
membrane of the rough endoplasmic reticulum and release the nascent protein
polypeptide inside for later vesicle transport and secretion outside of the cell.
Many types of transcribed RNA, such as transfer RNA, ribosomal RNA, and small
nuclear RNA, do not undergo translation into proteins.

Translation proceeds in four phases: activation, initiation, elongation and
termination (all describing the growth of the amino acid chain, or polypeptide,
that is the product of translation). Amino acids are brought to ribosomes and
assembled into proteins.

In activation, the correct amino acid is covalently bonded to the correct transfer
RNA (tRNA). The amino acid is joined by its carboxyl group to the 3' OH of the
tRNA by an ester bond. When the tRNA has an amino acid linked to it, it is
termed "charged". Initiation involves the small subunit of the ribosome binding
to the 5' end of mRNA with the help of initiation factors (IF). Termination of the
polypeptide happens when the A site of the ribosome faces a stop codon (UAA,
UAG, or UGA). No tRNA can recognize or bind to this codon. Instead, the stop
codon induces the binding of a release factor protein that prompts the
disassembly of the entire ribosome/mRNA complex.
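
Decoding an mRNA codon by codon until a stop codon is reached can be sketched
in plain Python. The codon table below is the standard genetic code written
out in U, C, A, G order. Starting at the first AUG is a simplifying assumption
and the example mRNA is hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): residue
    for codon, residue in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna):
    """Translate from the first AUG up to, but not including, the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        residue = CODON_TABLE[mrna[i:i + 3]]
        if residue == "*":  # UAA, UAG or UGA: no tRNA, translation ends
            break
        protein.append(residue)
    return "".join(protein)


mrna = "GGAUGGCCAUUGUAAUGGGCCGCUGAAAGG"  # hypothetical message
protein = translate(mrna)
print(protein)
```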

Translation Process

FASTA & FASTA Format


• FASTA is a DNA and protein sequence alignment software package.

• FASTA format is a text-based format for representing either nucleotide
sequences or peptide sequences, in which base pairs or amino acids are
represented using single-letter codes. The format also allows for sequence
names and comments to precede the sequences.

The simplicity of FASTA format makes it easy to manipulate and parse
sequences using text-processing tools and scripting languages like Python and
Perl.

The FASTA format may be used to represent either single sequences or many
sequences in a single file. A series of single sequences, concatenated, constitute
a multisequence file.

A sequence in FASTA format is represented as a series of lines, which
should be no longer than 120 characters and usually do not exceed 80
characters.

• The first line in a FASTA file starts either with a ">" (greater-than) symbol
or a ";" (semicolon) and was originally taken as a comment.
• Subsequent lines starting with a semicolon would be ignored by
software. Since the only comment used was the first, it quickly became
used to hold a summary description of the sequence, often starting with
a unique library accession number. With time it has become commonplace
to always use ">" for the first line and not to use ";" comments.
• Following the initial line (used for a unique description of the sequence)
is the actual sequence itself in standard one-letter code. Anything other
than a valid code would be ignored (including spaces, tabulators,
asterisks, etc.).

A sample sequence:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
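
Because the format is so simple, a FASTA reader needs only a few lines of
Python. The following is a minimal sketch rather than a full parser: the
function name (parse_fasta) and the two short records are hypothetical, and
lines starting with ";" are simply skipped.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and old-style comments
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence may span several lines
    if header is not None:
        yield header, "".join(chunks)


sample = """>seq1 first example
LCLYTHIGRN
IYYGSYLYSE
>seq2 second example
MKV
"""

records = list(parse_fasta(sample))
for header, sequence in records:
    print(header, len(sequence))
```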

For Python programs using amino acid query sequences, the accepted amino acid
codes are:

A  alanine                 P  proline
B  aspartate/asparagine    Q  glutamine
C  cysteine                R  arginine
D  aspartate               S  serine
E  glutamate               T  threonine
F  phenylalanine           U  selenocysteine
G  glycine                 V  valine
H  histidine               W  tryptophan
I  isoleucine              Y  tyrosine
K  lysine                  Z  glutamate/glutamine
L  leucine                 X  any
M  methionine              *  translation stop
N  asparagine              -  gap of indeterminate length
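
Using the one-letter codes above, a random protein sequence of 100 residues
can be generated and written out in FASTA format with the standard library
alone. This is a minimal sketch and not the project's own code: it assumes
that only the 20 standard residues are drawn with equal probability, and the
names random_protein and to_fasta are hypothetical.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def random_protein(length=100, rng=random):
    """Return a random sequence, each residue drawn with equal probability."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))


def to_fasta(name, sequence, width=70):
    """Format one record with the sequence wrapped to `width` characters."""
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return ">" + name + "\n" + "\n".join(lines) + "\n"


rng = random.Random(2011)  # fixed seed so the run can be repeated
sequence = random_protein(100, rng)
record = to_fasta("random_protein_1", sequence)
print(record)
```

Passing a seeded random.Random instance makes the output reproducible; using
the default rng gives a different sequence on every run.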

Sequence identifiers

The NCBI defined a standard for the unique identifier used for the sequence
(SeqID) in the header line. The formatdb man page has this to say on the
subject: "formatdb will automatically parse the SeqID and create indexes, but
the database identifiers in the FASTA definition line must follow the
conventions of the FASTA Defline Format."

However, it does not give a definitive description of the FASTA defline format.
An attempt to describe such a format is given below:

GenBank                        gi|gi-number|gb|accession|locus
EMBL Data Library              gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan    gi|gi-number|dbj|accession|locus
NBRF PIR                       pir||entry
Protein Research Foundation    prf||name
SWISS-PROT                     sp|accession|name
Patents                        pat|country|number
GenInfo Backbone Id            bbs|number
General database identifier    gnl|database|identifier
NCBI Reference Sequence        ref|accession|locus
Local Sequence identifier      lcl|identifier
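
A header line in the gi|gi-number|gb|accession|locus style can be taken apart
by splitting on the "|" character. The sketch below is illustrative only: the
function name (parse_defline) is hypothetical, and the simple tag/value
pairing it uses suits gi-style lines but not every layout in the table (for
example pir||entry would need its own handling).

```python
def parse_defline(header):
    """Split a gi-style header into (identifier dict, description)."""
    header = header.lstrip(">")
    id_part, _, description = header.partition(" ")
    fields = id_part.split("|")
    # Pair each database tag with the value that follows it.
    ids = {fields[i]: fields[i + 1] for i in range(0, len(fields) - 1, 2)}
    return ids, description


header = ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
ids, description = parse_defline(header)
print(ids)
print(description)
```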

Python is an interpreted, general-purpose, high-level object-oriented
programming (OOP) language whose design philosophy emphasizes
code readability. Python aims to combine "remarkable power with very clear
syntax", and its standard library is large and comprehensive. Its use of
indentation for block delimiters is unusual among popular programming
languages.

Python supports multiple programming paradigms, primarily but not limited to
object-oriented, imperative and, to a lesser extent, functional programming
styles. It features a fully dynamic type system and automatic memory
management, similar to that of Scheme, Ruby, Perl, and Tcl. Like other dynamic
languages, Python is often used as a scripting language, but is also used in a
wide range of non-scripting contexts.

Python interpreters are available for many operating systems, and Python
programs can be packaged into stand-alone executable code for many systems
using various tools.

HISTORY
Guido van Rossum created Python and is affectionately
bestowed with the title "Benevolent Dictator For Life" by
the Python community.

The following are the features of Python


1. Scripting language: A script is a program that controls other programs.
Scripting languages are good for quick development and prototyping
because they're good at passing messages from one component to another
and at handling fiddly stuff like memory management so that the
programmer doesn't have to. Python has grown beyond scripting
languages, which are used mostly for small applications. The Python
community prefers to call Python a dynamic programming language.

2. Indentation for statement grouping: Python specifies that several
statements are part of a single group by indenting them. The indented
group is called a code block. Other languages use different syntax or
punctuation for statement grouping. For example, the C programming
language uses { to begin a block and } to end it. Indentation is
considered good practice in other languages also, but Python was one of
the first to enforce indentation. Indentation makes code easier to read, and
code blocks set off with indentation have fewer begin/end words and
punctuation to accidentally leave out (which means fewer bugs).

3. High-level data types: Computers store everything in 1s and 0s, but
humans need to work with data in more complex forms, such as text. A
language that supports such complex data is said to have high-level data
types. A high-level data type is easy to manipulate. For example, Python
strings can be searched, sliced, joined, split, set to upper- or lowercase, or
have white space removed. High-level data types in Python, such as lists
and dicts (which can store other data types), encompass much more
functionality than in other languages.

4. Extensibility: An extensible programming language can be added to. These
languages are very powerful because additions make them suitable for
multiple applications and operating systems. Extensions can add data types
or concepts, modules, and plug-ins. Python is extensible in several ways. A
core group of programmers works on modifying and improving the
language, while hundreds of other programmers write modules for specific
purposes.

5. Interpreted: Interpreted languages run directly from source code that
humans generate (whereas programs written in compiled languages, like
C++, must be translated to machine code before they can run). Interpreted
languages run more slowly because the translation takes place on the fly,
but development and debugging are faster because we don't have to wait for
the compiler. Interpreted languages are easier to run on multiple operating
systems. In the case of Python, it's easy to write code that works on
multiple operating systems, with no need to make modifications.
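
Two of the features listed above, indentation for statement grouping and
high-level data types, can be seen together in a short example. The function
name and the sequence used here are hypothetical and serve only as an
illustration.

```python
def clean_sequence(raw):
    # Everything indented under "def" belongs to one code block.
    cleaned = raw.strip().upper()  # remove white space, set to uppercase
    if cleaned.startswith("M"):
        # A nested block is indented one level deeper.
        label = "starts with methionine"
    else:
        label = "no leading methionine"
    return cleaned, label


cleaned, label = clean_sequence("  maivmgr \n")

position = cleaned.find("V")      # search a string
first_three = cleaned[:3]         # slice it
parts = "MAI-VMG-R".split("-")    # split into a list
joined = "".join(parts)           # join the list back together
counts = {residue: cleaned.count(residue) for residue in set(cleaned)}  # a dict

print(cleaned, label, position, first_three, parts, joined, counts["M"])
```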

The following are the uses of Python


In addition to being a well-designed programming language, Python is useful for
accomplishing real-world tasks—the sorts of things developers do day in and
day out.

1. Systems Programming

Python’s built-in interfaces to operating-system services make it ideal for
writing portable, maintainable system-administration tools and utilities
(sometimes called shell tools). Python programs can search files and
directory trees, launch other programs, do parallel processing with
processes and threads, and so on. Python’s standard library comes with
POSIX bindings and support for all the usual OS tools: environment
variables, files, sockets, pipes, processes, multiple threads,
regular expression pattern matching, command-line arguments, standard
stream interfaces, shell-command launchers, filename expansion, and
more. In addition, the bulk of Python’s system interfaces are designed to
be portable; for example, a script that copies directory trees typically runs
unchanged on all major Python platforms.

2. GUIs

Python’s simplicity and rapid turnaround also make it a good match for
graphical user interface programming. Python comes with a standard
object-oriented interface to the Tk GUI API called tkinter (Tkinter in 2.6)
that allows Python programs to implement portable GUIs with a native
look and feel. Python/tkinter GUIs run unchanged on Microsoft Windows,
X Windows (on Unix and Linux), and the Mac OS (both Classic and OS X). A
free extension package, PMW, adds advanced widgets to the
tkinter toolkit. In addition, the wxPython GUI API, based on a C++ library,
offers an alternative toolkit for constructing portable GUIs in Python.

3. Internet Scripting

Python comes with standard Internet modules that allow Python programs
to perform a wide variety of networking tasks, in client and server modes.
Scripts can communicate over sockets; extract form information sent to
server-side CGI scripts; transfer files by FTP; parse, generate, and analyze
XML files; send, receive, compose, and parse email; fetch web pages by
URLs; parse the HTML and XML of fetched web pages; communicate over
XML-RPC, SOAP, and Telnet; and more. Python’s libraries make these tasks
remarkably simple. In addition, a large collection of third-party tools are
available on the Web for doing Internet programming in Python. For
instance, the HTMLGen system generates HTML files from Python
class-based descriptions, the mod_python package runs

4. Component Integration

Python’s ability to be extended by and embedded in C and C++
systems makes it useful as a flexible glue language for scripting the
behavior of other systems and components. For instance, integrating a C
library into Python enables Python to test and launch the library’s
components, and embedding Python in a product enables onsite
customizations to be coded without having to recompile the entire
product (or ship its source code at all).

5. Database Programming

For traditional database demands, there are Python interfaces to all
commonly used relational database systems: Sybase, Oracle, Informix,
ODBC, MySQL, PostgreSQL, SQLite, and more. The Python world has also
defined a portable database API for accessing SQL database systems from
Python scripts, which looks the same on a variety of underlying database
systems. For instance, because the vendor interfaces implement the
portable API, a script written to work with the free MySQL system will
work largely unchanged on other systems (such as Oracle); all we have to
do is replace the underlying vendor interface.

Python’s standard pickle module provides a simple object persistence
system: it allows programs to easily save and restore entire Python
objects to files and file-like objects. Furthermore, as of Python 2.5, the
in-process SQLite embedded SQL database engine is a standard part of
Python itself.
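
Both persistence options mentioned above are available in the standard
library. The following minimal sketch saves and restores a record with pickle
and stores the same data in an in-memory SQLite database; the record contents
and the table name (proteins) are hypothetical.

```python
import pickle
import sqlite3

record = {"name": "random_protein_1", "sequence": "MAIVMGR", "length": 7}

# Object persistence: serialise the dict to bytes and restore it.
blob = pickle.dumps(record)
restored = pickle.loads(blob)

# SQL storage through the portable database API (sqlite3 module).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE proteins (name TEXT PRIMARY KEY, sequence TEXT)")
conn.execute(
    "INSERT INTO proteins VALUES (?, ?)",
    (record["name"], record["sequence"]),
)
conn.commit()
row = conn.execute(
    "SELECT sequence FROM proteins WHERE name = ?",
    (record["name"],),
).fetchone()
conn.close()

print(restored == record, row[0])
```

To keep the data between runs, pickle.dump can write to an open file and
sqlite3.connect can be given a file name instead of ":memory:".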

6. Rapid Prototyping

To Python programs, components written in Python and C look the same.
Because of this, it’s possible to prototype systems in Python initially, and
then move selected components to a compiled language such as C or C++
for delivery. Unlike some prototyping tools, Python doesn’t require a
complete rewrite once the prototype has solidified. Parts of the system
that don’t require the efficiency of a language such as C++ can
remain coded in Python for ease of maintenance and use.

7. Numeric and Scientific Programming

The NumPy numeric programming extension for Python mentioned earlier
includes such advanced tools as an array object, interfaces to standard
mathematical libraries, and much more. By integrating Python with
numeric routines coded in a compiled language for speed, NumPy turns
Python into a sophisticated yet easy-to-use numeric programming tool that
can often replace existing code written in traditional compiled languages
such as FORTRAN or C++. Additional numeric tools for Python
support animation, 3D visualization, parallel processing, and so on. The
popular SciPy and ScientificPython extensions, for example, provide
additional libraries of scientific programming tools and use NumPy code.

The following are the advantages of Python over other languages
1. It’s Object-Oriented

Python is an object-oriented language, from the ground up. Its class model
supports advanced notions such as polymorphism, operator overloading, and
multiple inheritance; yet, in the context of Python’s simple syntax and typing,
OOP is remarkably easy to apply. In fact, if we don’t understand these terms,
we’ll find they are much easier to learn with Python than with just about any
other OOP language available. Besides serving as a powerful code structuring
and reuse device, Python’s OOP nature makes it ideal as a scripting tool for
object-oriented systems languages such as C++ and Java. For example, with the
appropriate glue code, Python programs can subclass (specialize) classes
implemented in C++, Java, and C#. Of equal significance, OOP is an option in
Python; we can go far without having to become an object guru all at once.
Much like C++, Python supports both procedural and object-oriented
programming modes. Its object-oriented tools can be applied if and when
constraints allow. This is especially useful in tactical development modes, which
preclude design phases.

2. It’s Free

Python is completely free to use and distribute. As with other open source
software, such as Tcl, Perl, and Linux, we can fetch the entire Python system’s
source code for free on the Internet. There are no restrictions on copying it,
embedding it in our systems, or shipping it with our products. In fact, we can
even sell Python’s source code, if we are so inclined. But don’t get the wrong
idea: “free” doesn’t mean “unsupported.” Python’s online community responds
to user queries with a speed that most commercial software help desks would
do well to try to emulate.

3. It’s Portable

The standard implementation of Python is written in portable ANSI C, and it
compiles and runs on virtually every major platform currently in use. For
example, Python programs run today on everything from PDAs to
supercomputers. As a partial list, Python is available on:

• Linux and Unix systems
• Microsoft Windows and DOS
• Mac OS (both OS X and Classic)
• BeOS, OS/2, VMS, and QNX
• Real-time systems such as VxWorks
• PDAs running Palm OS, PocketPC, and Linux
• Cell phones running Symbian OS and Windows Mobile
• And more

Like the language interpreter itself, the standard library modules that ship with
Python are implemented to be as portable across platform boundaries as
possible. Further, Python programs are automatically compiled to portable byte
code, which runs the same on any platform with a compatible version of Python
installed.

4. It’s Powerful

From a features perspective, Python is something of a hybrid. Its toolset places
it between traditional scripting languages (such as Tcl, Scheme, and Perl) and
systems development languages (such as C, C++, and Java). Python provides all
the simplicity and ease of use of a scripting language, along with more advanced
software-engineering tools typically found in compiled languages. Unlike some
scripting languages, this combination makes Python useful for large-scale
development projects.

• Dynamic typing: Python keeps track of the kinds of objects our program
uses when it runs; it doesn’t require complicated type and size
declarations in our code.
• Automatic memory management: Python automatically allocates
objects and reclaims (“garbage collects”) them when they are no longer
used, and most can grow and shrink on demand.
• Programming-in-the-large support: For building larger systems, Python
includes tools such as modules, classes, and exceptions. These tools
allow us to organize systems into components, use OOP to reuse and
customize code, and handle events and errors gracefully.
• Built-in object types: Python provides commonly used data structures
such as lists, dictionaries, and strings as intrinsic parts of the language;
they are both flexible and easy to use. For instance, built-in
objects can grow and shrink on demand, can be arbitrarily nested to
represent complex information, and more.
• Built-in tools: To process all those object types, Python comes with
powerful and standard operations, including concatenation (joining
collections), slicing (extracting sections), sorting, mapping, and more.
• Library utilities: For more specific tasks, Python also comes with a
large collection of precoded library tools that support everything from
regular expression matching to networking. Once we learn the
language itself, Python’s library tools are where much of the
application-level action occurs.
• Third-party utilities: Because Python is open source, developers are
encouraged to contribute precoded tools that support tasks beyond
those supported by its built-ins.

Despite the array of tools in Python, it retains a remarkably simple syntax and
design. The result is a powerful programming tool with all the usability of a
scripting language.
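
Several of the points listed above (dynamic typing, built-in object types and
built-in tools) can be shown in a few statements. The values below are
hypothetical and chosen only to illustrate the operations named in the list.

```python
# Dynamic typing: no declarations, and a name can refer to any kind of object.
value = 100
value = "one hundred"

# Built-in tools on a built-in list.
residues = ["M", "A"] + ["I", "V"]        # concatenation
first_two = residues[:2]                  # slicing
ordered = sorted(residues)                # sorting
lowered = list(map(str.lower, residues))  # mapping

# Objects grow and shrink on demand.
residues.append("G")
removed = residues.pop()

# Arbitrary nesting: a dict holding a list of dicts holding a tuple.
record = {"name": "seq1", "features": [{"type": "helix", "span": (1, 3)}]}

print(type(value).__name__, residues, first_two, ordered, lowered, removed)
print(record["features"][0]["span"])
```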

5. It’s Mixable

Python programs can easily be “glued” to components written in other
languages in a variety of ways. For example, Python’s C API lets C programs call
and be called by Python programs flexibly. That means we can add functionality
to the Python system as needed, and use Python programs within other
environments or systems. Mixing Python with libraries coded in languages such
as C or C++, for instance, makes it an easy-to-use frontend language and
customization tool. As mentioned earlier, this also makes Python good at rapid
prototyping; systems may be implemented in Python first, to leverage its speed
of development, and later moved to C for delivery, one piece at a time,
according to performance demands.

6. It’s Easy to Use

To run a Python program, we simply type it and run it. There are no
intermediate compile and link steps, like there are for languages such as C or
C++. Python executes programs immediately, which makes for an interactive
programming experience and rapid turnaround after program changes; in
many cases, we can witness the effect of a program change as fast as we can
type it. Python programs are often simpler, smaller, and more flexible than
equivalent programs in languages like C, C++, and Java.

The IDLE User Interface

IDLE provides a graphical user interface for doing Python development, and it’s
a standard and free part of the Python system. It is usually referred to as an
integrated development environment (IDE), because it binds together
various development tasks into a single view. In short, IDLE is a GUI that lets
you edit, run, browse, and debug Python programs, all from a single interface.
Moreover, because IDLE is a Python program that uses the tkinter GUI toolkit
(known as Tkinter in 2.6), it runs portably on most Python platforms, including
Microsoft Windows, X Windows (for Linux, Unix, and Unix-like platforms), and
the Mac OS (both Classic and OS X). For many, IDLE represents an easy-to-use
alternative to typing command lines, and a less problem-prone alternative to
clicking on icons.

Data Types in Python

When we write computer programs, we usually want to give the computer
information (data), have it do things with the data, and give us results.
Programming languages like Python have several data types that we do
different things with. We decide which data type to use depending on what we
want to do. For example, if we want to do mathematical calculations, we might
choose a number data type.

The following list briefly introduces some of Python's data types:

• Numbers are for data that we want to do math with.

• Strings are for text characters and for binary data.

• Sequences are for lists of related data that we might want to sort, merge,
and so on.

• Dictionaries are collections of data that associate a unique key with each
value.

• Sets are for doing set operations (finding the intersection, difference, and so
on) with multiple values.

• Files are for data that is or will be stored as a document on a computer.

1. Numeric data

Python has four built-in numeric data types, as shown in Table

Except when we're doing division with integers or using the decimal module,
we don't have to worry about what kind of number data type we're using.
Python converts numbers into compatible types automatically. For example, if
we multiply an integer and a floating point number, Python automatically gives
the answer as a floating point number:

>>> x = 5
>>> y = 1.5
>>> x * y
7.5
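
The two exceptions mentioned above, integer division and the decimal module, can be seen in a short sketch. It assumes Python 3, where `/` is always true division; in Python 2 the expression `7 / 2` evaluates to 3 instead.

```python
from decimal import Decimal

true_quotient = 7 / 2      # true division always gives a float in Python 3
floor_quotient = 7 // 2    # floor division keeps the result an integer
mixed = 5 * 1.5            # int * float is promoted to float

# Binary floats cannot represent 0.1 exactly; Decimal can.
float_sum = 0.1 + 0.2
decimal_sum = Decimal("0.1") + Decimal("0.2")

print(true_quotient, floor_quotient, mixed)
print(float_sum == 0.3, decimal_sum == Decimal("0.3"))
```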

2. Sequential data

Sequential data types contain multiple pieces of data, each of which is
numbered, or indexed. Each piece of data inside a sequence is called an
element. Three sequential data types are built into Python:

A. Lists can store multiple kinds of data (both text and numbers, for
example). We can change elements inside a list, and we can organize
the data in various ways (for example, by sorting).
B. Tuples, like lists, can include different kinds of data, but they can't be
changed. In Python terminology, they are immutable.
C. Strings store text or binary data. Strings are immutable (like tuples).

Table: Python's Built-in Sequence Data Types

To see the data type of a Python object, use the type() function, like this:
>>> type ('foo')
<type 'str'>
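
The difference between a mutable list and an immutable tuple can be checked directly. This is a small Python 3 sketch; the names `residues` and `codon` are illustrative only.

```python
residues = ["A", "G", "T"]   # list: mutable
residues[0] = "M"            # replace an element in place
residues.sort()              # reorder in place

codon = ("A", "U", "G")      # tuple: immutable
try:
    codon[0] = "C"           # item assignment is not allowed on a tuple
    tuple_changed = True
except TypeError:
    tuple_changed = False

print(residues, codon, tuple_changed)
print(type(residues).__name__, type(codon).__name__, type("foo").__name__)
```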

3. Dictionaries

Python's dictionary (its type name is dict) is a data type that stores multiple
data items (elements) of different types. In a dictionary, each element is
associated with a unique key, which is a value of any immutable type. When we
use a dict, we use the key to return the element associated with the key.

We use a dictionary when we want to store and retrieve items by using a key
that doesn't change and when we don't care in what order Python stores the
items. (In dictionaries, elements aren't numbered.)

A Python dictionary bears only a small resemblance to the kind of dictionary
that contains words and their definitions. In Python, a dictionary is more like a
list of employees and their employee numbers. Because each employee
number is unique, we can look up that employee by typing his or her number.
Dictionaries are mutable, like lists, but their keys are immutable.

Here is an example of a dictionary with two key:value pairs:

swallow_velocity = {"european": "47", "african": "69"}
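
Looking up, adding and defaulting dictionary entries might look like the sketch below (Python 3). The `three_letter` mapping is a made-up example, not part of the project code.

```python
# Keys are immutable strings; values here are three-letter residue names.
three_letter = {"A": "Ala", "G": "Gly", "C": "Cys"}

three_letter["T"] = "Thr"                # add a new key:value pair
glycine = three_letter["G"]              # look up an element by its key
unknown = three_letter.get("X", "Xxx")   # supply a default for a missing key

print(glycine, unknown, len(three_letter))
```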

4. Sets

A set stores multiple items, which can be of different types, but each item in a
set must be unique. We can use Python sets to find unions, intersections,
differences, and so on. One use for sets is when we have repetitious data and
we want to ignore the repetition.
For example, imagine that we have an address database and we want to find
out which cities are represented, but we don't need to know how many times
each city appears in the database. A set will list each city in the database only
once.

The syntax for a set is a little different from the syntax of the other data
types. We use the word set followed by a name (or a group of elements) in
parentheses. Here is a set that finds each unique element in a list. Notice
that the elements are out of order in the set. That's because Python doesn't
store set elements in alphanumeric order (the same is true for dicts):

>>> mylist = ['spam', 'lovely', 'spam', 'glorious', 'spam']
>>> set(mylist)
set(['lovely', 'glorious', 'spam'])
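
The union, intersection and difference operations mentioned earlier in this section can be sketched as follows (Python 3). The city names and the two residue pools are invented for the example.

```python
cities = ["Gangtok", "Delhi", "Gangtok", "Mumbai", "Delhi"]
unique_cities = set(cities)      # repeated entries collapse to one each

pool_a = {"C", "D", "E", "F"}
pool_b = {"E", "F", "G", "H"}

union = pool_a | pool_b          # items in either set
intersection = pool_a & pool_b   # items in both sets
difference = pool_a - pool_b     # items only in pool_a

print(sorted(unique_cities))
print(sorted(union), sorted(intersection), sorted(difference))
```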

5. Files

Python uses the file data type to work with files on our computer or on the
Internet. Note that the file type is not the same as the actual file. The file type
is Python's internal representation of a computer or Internet file.

REMEMBER Before Python can work with an existing file or a new file, we need
to open the file inside Python.
This example opens a file called myfile:
open("myfile")

File objects are Python code’s main interface to external files on your
computer. Files are a core type, but they’re something of an oddball—there is
no specific literal syntax for creating them. Rather, to create a file object, you
call the built-in open function, passing in an external filename and a processing
mode as strings. For example, to create a text output file, you would pass in its
name and the 'w' processing mode string to write data:

>>> f = open('data.txt', 'w') # Make a new file in output mode
>>> f.write('Hello\n') # Write a string of text to it
6
>>> f.write('world\n') # Returns number of characters written in Python 3.0
6
>>> f.close() # Close to flush output buffers to disk

This creates a file in the current directory and writes text to it (the filename can
be a full directory path if you need to access a file elsewhere on your
computer). To read back what you just wrote, reopen the file in 'r' processing
mode, for reading text input—this is the default if you omit the mode in the
call. Then read the file’s content into a string, and display it. A file’s contents
are always a string in your script, regardless of the type of data the file
contains:

>>> f = open('data.txt') # 'r' is the default processing mode
>>> text = f.read() # Read entire file into a string
>>> text
'Hello\nworld\n'
>>> print(text) # print interprets control characters
Hello
world
>>> text.split() # File content is always a string
['Hello', 'world']

File objects provide more ways of reading and writing (read accepts an optional
byte size, readline reads one line at a time, and so on), as well as other tools
(seek moves to a new file position). As we’ll see later, though, the best way to
read a file today is to not read it at all—files provide an iterator that
automatically reads line by line in for loops and other contexts.
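
The file iterator described above can be sketched like this (Python 3). The example writes to a temporary directory so that it does not touch existing files; the path and the variable names are assumptions made for the illustration.

```python
import os
import tempfile

# Write two lines to a scratch file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "data.txt")
with open(path, "w") as handle:
    handle.write("Hello\n")
    handle.write("world\n")

# Iterating over the file object yields one line at a time.
lines = []
with open(path) as handle:
    for line in handle:
        lines.append(line.rstrip("\n"))

print(lines)
```

The `with` statement closes the file automatically, so no explicit close() call is needed.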

BIOPYTHON
Biopython is a set of libraries to provide the ability to
deal with “things” of interest to biologists working on
the computer.

Biopython is a set of freely available tools for biological computation written in
Python by an international team of developers.

The main Biopython releases have lots of functionality, including:

• The ability to parse bioinformatics files into Python utilizable data
structures, including support for the following formats:
    o Blast output – both from standalone and WWW Blast
    o Clustalw
    o FASTA
    o GenBank
    o PubMed and Medline
    o ExPASy files, like Enzyme and Prosite
    o SCOP, including ‘dom’ and ‘lin’ files
    o UniGene
    o SwissProt
• Files in the supported formats can be iterated over record by record or
indexed and accessed via a Dictionary interface.
• Code to deal with popular on-line bioinformatics destinations such as:
    o NCBI – Blast, Entrez and PubMed services
    o ExPASy – Swiss-Prot and Prosite entries, as well as Prosite
    searches
• Interfaces to common bioinformatics programs such as:
    o Standalone Blast from NCBI
    o Clustalw alignment program
    o EMBOSS command line tools
• A standard sequence class that deals with sequences, ids on sequences,
and sequence features.
• Tools for performing common operations on sequences, such as
translation, transcription and weight calculations.
• Code to perform classification of data using k Nearest Neighbors, Naive
Bayes or Support Vector Machines.
• Code for dealing with alignments, including a standard way to create and
deal with substitution matrices.
• Code making it easy to split up parallelizable tasks into separate
processes.
• GUI-based programs to do basic sequence manipulations, translations,
BLASTing, etc.
• Extensive documentation and help with using the modules, including the
Biopython tutorial, on-line wiki documentation, the web site, and the
mailing list.
• Integration with BioSQL, a sequence database schema also supported by
the BioPerl and BioJava projects.

Installing Biopython
Biopython is available from the Biopython download page
(http://biopython.org/wiki/Download).

For Windows, pre-compiled click-and-run installers are available, while for Unix
and other operating systems you must install from source as described in the
included README file. This is usually as simple as the standard commands:

python setup.py build
python setup.py test
sudo python setup.py install

Working with sequences


The central object in bioinformatics is the sequence. Sequences usually refer to
a string of letters like ‘AGTACACTGGT’. We can create a Seq object with this
sequence as follows - the “>>>” represents the Python prompt followed by
what you would type in:

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> print my_seq
AGTACACTGGT
>>> my_seq.alphabet
Alphabet()

What we have here is a sequence object with a generic alphabet - reflecting the
fact we have not specified if this is a DNA or protein sequence (okay, a protein
with a lot of Alanines, Glycines, Cysteines and Threonines!). In addition to
having an alphabet, the Seq object differs from the Python string in the
methods it supports. You can’t do this with a plain string:

>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
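
To make clear what the two methods above compute, here is a plain-string sketch in Python 3 that does not use Biopython. The helper names `complement` and `reverse_complement` are chosen for the example, and only the four unambiguous DNA letters are handled.

```python
# Complement table for unambiguous DNA letters (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def complement(dna):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(COMPLEMENT)

def reverse_complement(dna):
    """Complement the strand, then read it in the opposite direction."""
    return complement(dna)[::-1]

print(complement("AGTACACTGGT"))
print(reverse_complement("AGTACACTGGT"))
```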

The next most important class is the SeqRecord or Sequence Record.
This holds a sequence (as a Seq object) with additional annotation including an
identifier, name and description. The Bio.SeqIO module for reading and writing
sequence file formats works with SeqRecord objects.

Parsing sequence file formats


A large part of much bioinformatics work involves dealing with the many types
of file formats designed to hold biological data. These files are loaded with
interesting biological data, and a special challenge is parsing them into a
format that we can manipulate with some kind of programming language.
However, the task of parsing these files can be frustrated by the fact
that the formats can change quite regularly, and that formats may contain
small subtleties which can break even the most well designed parsers.

As a worked example, let’s take a look by hand through the nucleotide
databases at NCBI, using an Entrez online search
(http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for
everything mentioning the text Cypripedioideae (this is the subfamily of lady
slipper orchids).

Simple FASTA parsing example

If we open the lady slipper orchids FASTA file ls_orchid.fasta in our text editor,
we’ll see that the file starts like this:

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGGAATAAACGATCGAGTG
AATCCGGAGGACCGGTGTACTCAGCTCACCGGGGGCATTGCTCCCGTGGTGACCCTGATTTGTTGTTGGG
...

It contains 94 records, each of which has a line starting with “>” (greater-than
symbol) followed by the sequence on one or more lines. Now try this in Python:

from Bio import SeqIO

for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
    print seq_record.id
    print repr(seq_record.seq)
    print len(seq_record)

We should get something like this on our screen:

gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592

Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has
defaulted to the rather generic SingleLetterAlphabet() rather than something
DNA specific.
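
The record structure that Bio.SeqIO relies on (a “>” header line followed by one or more sequence lines) can be illustrated with a small hand-written parser. This is only a sketch in Python 3 for well-formed input, not a substitute for Bio.SeqIO; the function names and the two short records are made up.

```python
import io

def _make_record(header, chunks):
    # The identifier is the first whitespace-separated word of the header.
    identifier = header.split(None, 1)[0] if header else ""
    return identifier, header, "".join(chunks)

def parse_fasta(handle):
    """Yield (identifier, description, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)   # sequence may span several lines
    if header is not None:
        yield _make_record(header, chunks)

example = io.StringIO(
    ">seq1 first made-up record\n"
    "CGTAACAAGG\n"
    "TTTCCGTAGG\n"
    ">seq2 second made-up record\n"
    "CATTGTTGAG\n"
)
records = list(parse_fasta(example))
for identifier, description, sequence in records:
    print(identifier, len(sequence))
```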

Connecting with biological databases


One of the very common things that you need to do in bioinformatics is extract
information from biological databases. It can be quite tedious to access these
databases manually, especially if you have a lot of repetitive work to do.
Biopython attempts to save you time and energy by making some on-line
databases available from Python scripts. Currently, Biopython has code to
extract information from the following databases:

• Entrez (and PubMed) from the NCBI
• ExPASy
• SCOP

The code in these modules basically makes it easy to write Python code that
interacts with the CGI scripts on these pages, so that you can get results in an
easy-to-handle format. In some cases, the results can be tightly integrated
with the Biopython parsers to make it even easier to extract information.

Sequences and Alphabets

The alphabet object is perhaps the most important thing that makes the Seq
object more than just a string. The currently available alphabets for Biopython
are defined in the Bio.Alphabet module. We’ll use the IUPAC alphabets
(http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite
objects: DNA, RNA and Proteins.

Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but
additionally provides the ability to extend and customize the basic definitions.
For instance, for proteins, there is a basic IUPACProtein class, but there is an
additional ExtendedIUPACProtein class providing for the additional elements
“U” (or “Sec” for selenocysteine) and “O” (or “Pyl” for pyrrolysine), plus the
ambiguous symbols “B” (or “Asx” for asparagine or aspartic acid), “Z” (or “Glx”
for glutamine or glutamic acid), “J” (or “Xle” for leucine or isoleucine) and “X”
(or “Xxx” for an unknown amino acid). For DNA you’ve got choices of
IUPACUnambiguousDNA, which provides for just the basic letters,
IUPACAmbiguousDNA (which provides for ambiguity letters for every possible
situation) and ExtendedIUPACDNA, which allows letters for modified bases.
Similarly, RNA can be represented by IUPACAmbiguousRNA or
IUPACUnambiguousRNA.

The advantages of having an alphabet class are twofold. First, this gives an idea
of the type of information the Seq object contains. Secondly, this provides a
means of constraining the information, as a means of type checking.

Now that we know what we are dealing with, let’s look at how to utilize this
class to do interesting work. You can create an ambiguous sequence with the
default generic alphabet like this:

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()

However, where possible you should specify the alphabet explicitly when
creating your sequence objects - in this case an unambiguous DNA alphabet
object:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("AGTACACTGGT", IUPAC.unambiguous_dna)
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> my_seq.alphabet
IUPACUnambiguousDNA()

Unless of course, this really is an amino acid sequence:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_prot = Seq("AGTACACTGGT", IUPAC.protein)
>>> my_prot
Seq('AGTACACTGGT', IUPACProtein())
>>> my_prot.alphabet
IUPACProtein()
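
The idea of an alphabet as a constraint on the letters a sequence may contain can be sketched without Biopython. This matters in practice because Biopython 1.78 and later no longer include the Bio.Alphabet module, so the listings in this section need an older release. The constants and the `invalid_letters` helper below are hypothetical names for this Python 3 illustration.

```python
# The 20 standard amino acid one-letter codes and the 4 DNA bases.
PROTEIN_LETTERS = "ACDEFGHIKLMNPQRSTVWY"
DNA_LETTERS = "GATC"

def invalid_letters(sequence, letters):
    """Return the sorted characters of sequence that are not in letters."""
    return sorted(set(sequence.upper()) - set(letters))

# "AGTACACTGGT" is valid under both alphabets, which is why the
# letters alone cannot tell us whether it is DNA or protein.
print(invalid_letters("AGTACACTGGT", DNA_LETTERS))
print(invalid_letters("AGTACACTGGT", PROTEIN_LETTERS))
print(invalid_letters("MKQHKAMIVALIV", DNA_LETTERS))
```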

Sequences act like strings


In many ways, we can deal with Seq objects as if they were normal Python
strings, for example getting the length, or iterating over the elements:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
...     print index, letter
0 G
1 A
2 T
3 C
4 G
>>> print len(my_seq)
5

We can access elements of the sequence in the same way as for strings
(remember that Python counts from zero, and that index -1 refers to the last
element):

>>> print my_seq[0] #first letter
G
>>> print my_seq[2] #third letter
T
>>> print my_seq[-1] #last letter
G

The Seq object has a .count() method, just like a string. Note that this means
that like a Python string, this gives a non-overlapping count:

>>> from Bio.Seq import Seq
>>> "AAAA".count("AA")
2
>>> Seq("AAAA").count("AA")
2

For some biological uses, you may actually want an overlapping count (i.e. 3 in
this trivial example). When searching for single letters, this makes no
difference:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875

While we could use the above snippet of code to calculate a GC%, note that the
Bio.SeqUtils module has several GC functions already built. For example:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> from Bio.SeqUtils import GC
>>> my_seq = Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPAC.unambiguous_dna)
>>> GC(my_seq)
46.875

Note that using the Bio.SeqUtils.GC() function should automatically cope with
mixed case sequences and the ambiguous nucleotide S which means G or C.
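
The behaviour described above (case-insensitive counting, with the ambiguity letter S treated as G or C) can be approximated on a plain string. This is a Python 3 sketch under those stated assumptions, not the Bio.SeqUtils implementation; `gc_percent` is a name chosen for the example.

```python
def gc_percent(sequence):
    """Return the percentage of G, C or S letters in sequence (case-insensitive)."""
    if not sequence:
        return 0.0   # avoid dividing by zero for an empty sequence
    hits = sum(1 for letter in sequence.upper() if letter in "GCS")
    return 100.0 * hits / len(sequence)

print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))
print(gc_percent("gatcS"))
```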

PROJECT CODE:
“A Bio-Python Based Program to Generate Random Protein Sequences, each
sequence being 100 amino acid residues long.”

# File Name Project-621033475.py

# standard library
import os
import random

# biopython
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO
from sys import *

# Residue pools; each generated sequence draws from one pool only
residueList1 = ["C","D","E","F","G","H","I"]
residueList2 = ["A","K","L","M","N","S"]
residueList3 = ["P","Q","R","T","V","W","Y"]
residueList4 = ["C","A","G","U"]  # RNA bases; not used below

def getProteinSeqRecord(residue, seqcount):
    # Build a 100-residue string by picking letters from the pool at random
    strSeq = ""
    for i in range(0,100,1):
        index = random.randint(0, len(residue)-1)
        strSeq += residue[index]

    # IUPAC.protein is the alphabet instance (IUPACProtein is its class)
    sequence = Seq(strSeq, IUPAC.protein)

    seqRec = SeqRecord(sequence, id = 'randSeq' + str(seqcount),
                       description = 'A random sequence using 100 Amino acid residues.')
    return seqRec

def getProteinSequence(residue):
    # Same as above, but returns the bare Seq object without annotation
    strSeq = ""
    for i in range(0,100,1):
        index = random.randint(0, len(residue)-1)
        strSeq += residue[index]

    sequence = Seq(strSeq, IUPAC.protein)

    return sequence

def randomProteinSeqRecord(index):
    # Choose the residue pool from the record number
    if(index%2)==0:
        return getProteinSeqRecord(residueList1, index)
    elif(index%3)==0:
        return getProteinSeqRecord(residueList2, index)
    else:
        return getProteinSeqRecord(residueList3, index)

#information

print 'This is a Python based program to generate random sequences '
print 'Provide number of random sequences to generate. '
print 'In order to save to a file provide file path or filename '
print 'If none or invalid filepath is provided then results will be displayed to console '
print 'The file will be created in FASTA format '
print

filepathProvided = False
#raw_input receives the user input as string
#the open() check is commented out, so IOError is never raised here
try:
    filepath = raw_input('Enter filepath to save sequences (X:/filename) ... ')
    filepath = filepath + '.fasta'
    #handle = open(filepath, "w")
    #handle.close()

    filepathProvided = True
except IOError:
    print 'Invalid or No File provided will print results to console'
print
ranSeqCount = 1
try:
    ranSeqCount = int(raw_input('Enter number of random sequences to generate ... '))
except ValueError:
    #fall back to a single sequence if the input is not a number
    ranSeqCount = 1

print 'Sequence Count : '
print ranSeqCount

records = []
for i in range(0,ranSeqCount,1):
    records.append(randomProteinSeqRecord(i+1))

if(filepathProvided):
    SeqIO.write(records, filepath, "fasta")

    print 'File created at: ' + filepath, 'NOTE: File created is in FASTA-format, It can be opened in Notepad'

else:
    print 'Writing to console is not supported. :/'

print
raw_input('Press Enter to exit ...')
print
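
For readers without Python 2 or an older Biopython release, the core idea of the program can be sketched with the standard library only. This is an alternative illustration in Python 3, not the project code: it assumes every residue is drawn uniformly from all 20 standard amino acids (the project instead uses one of three residue pools per sequence), and the names `random_protein` and `to_fasta` are hypothetical.

```python
import random
import textwrap

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter codes

def random_protein(length=100, rng=random):
    """Return a random protein string drawn uniformly from AMINO_ACIDS."""
    return "".join(rng.choice(AMINO_ACIDS) for _ in range(length))

def to_fasta(identifier, description, sequence, width=60):
    """Format one record as FASTA text with wrapped sequence lines."""
    body = "\n".join(textwrap.wrap(sequence, width))
    return ">%s %s\n%s\n" % (identifier, description, body)

rng = random.Random(0)  # fixed seed only to make the demo repeatable
records = [
    to_fasta("randSeq%d" % n,
             "A random sequence using 100 amino acid residues.",
             random_protein(100, rng))
    for n in range(1, 4)
]
fasta_text = "".join(records)
print(fasta_text)
```

Writing `fasta_text` to a file opened in 'w' mode would give a FASTA file comparable to the one produced by SeqIO.write in the project code.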

Libraries used in the project code.


1. The Seq Library
In Biopython, sequences are usually held as Seq objects, which hold the
sequence string and an associated alphabet.

The Biopython Seq object is defined in the Bio.Seq module (together with related
objects like the MutableSeq, plus some general purpose sequence functions).

The Seq object essentially combines a Python string with an (optional)
biological alphabet. For example:

>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()

In the above example, we haven't specified an alphabet so we end up with a
default generic alphabet. Biopython doesn't know if this is a nucleotide
sequence or a protein rich in alanines, glycines, cysteines and threonines. If we
know, we should supply this information:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_dna = Seq("AGTACACTGGT", generic_dna)
>>> my_dna
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_protein = Seq("AGTACACTGGT", generic_protein)
>>> my_protein
Seq('AGTACACTGGT', ProteinAlphabet())

Why is this important? Well, it can catch some errors for us - we wouldn't want
to accidentally try and combine a DNA sequence with a protein, would we?

>>> my_protein + my_dna
Traceback (most recent call last):
...
TypeError: Incompatable alphabets ProteinAlphabet() and DNAAlphabet()

Biopython will also catch things like trying to use nucleotide only methods like
translation (see below) on a protein sequence.

General methods

The Seq object has a number of methods which act just like those of a Python
string, for example the find method:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("AGTACACTGGT", generic_dna)
>>> my_dna
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_dna.find("ACT")
5
>>> my_dna.find("TAG")
-1

There is a count method too:

>>> my_dna.count("A")
3
>>> my_dna.count("ACT")
1

However, watch out because just like the Python string's count, this is a
non-overlapping count!

>>> "AAAA".count("AA")
2
>>> Seq("AAAA", generic_dna).count("AA")
2

In some biological situations, we might prefer an overlapping count which
would give three for this example.
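
An overlapping count is not provided by the string count method, but it can be sketched with repeated calls to find. The helper name `count_overlapping` is an assumption made for this Python 3 example, which works on plain strings.

```python
def count_overlapping(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Advance by one position, not by len(pattern), so overlaps are seen.
        start = sequence.find(pattern, start + 1)
    return count

print("AAAA".count("AA"))
print(count_overlapping("AAAA", "AA"))
```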

Nucleotide methods

If we have a nucleotide sequence (or a sequence with a generic alphabet) we
may want to do things like take the reverse complement, or do a translation.

Complement and reverse complement

These are very simple - the methods return a new Seq object with the
appropriate sequence and the same alphabet:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("AGTACACTGGT", generic_dna)
>>> my_dna
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_dna.complement()
Seq('TCATGTGACCA', DNAAlphabet())
>>> my_dna.reverse_complement()
Seq('ACCAGTGTACT', DNAAlphabet())

Transcription and back transcription

If we have a DNA sequence, we may want to turn it into RNA. In bioinformatics
we normally assume the DNA is the coding strand (not the template strand) so
this is a simple matter of replacing all the thymines with uracil:

>>> my_dna
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_dna.transcribe()
Seq('AGUACACUGGU', RNAAlphabet())

Naturally, given some RNA, we might want the associated DNA - and again
Biopython does a simple U/T substitution:

>>> my_rna = my_dna.transcribe()
>>> my_rna
Seq('AGUACACUGGU', RNAAlphabet())
>>> my_rna.back_transcribe()
Seq('AGTACACTGGT', DNAAlphabet())

If we actually do want the template strand, we'd have to do a reverse
complement on top:

>>> my_rna
Seq('AGUACACUGGU', RNAAlphabet())
>>> my_rna.back_transcribe().reverse_complement()
Seq('ACCAGTGTACT', DNAAlphabet())
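
Because transcription of the coding strand is only a T/U substitution, it can be mimicked on plain strings. The following Python 3 sketch uses hypothetical helper names and handles only the letters T and U in either case.

```python
def transcribe(coding_dna):
    """Replace thymine with uracil (coding strand to mRNA)."""
    return coding_dna.replace("T", "U").replace("t", "u")

def back_transcribe(rna):
    """Replace uracil with thymine (mRNA back to coding-strand DNA)."""
    return rna.replace("U", "T").replace("u", "t")

my_rna = transcribe("AGTACACTGGT")
print(my_rna)
print(back_transcribe(my_rna))
```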

Translation

We can translate RNA:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_rna
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", generic_rna)
>>> messenger_rna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

Or DNA - which is assumed to be the coding strand:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", generic_dna)
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

In either case there are several useful options - by default, as the example
above shows, translation continues through any stop codons, but this is
optional:

>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', ExtendedIUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))

We can of course combine these options:

>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())
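
What translation does with the default (standard) genetic code can be sketched with a codon dictionary built from the usual TCAG ordering. This Python 3 example is an illustration only: it supports the standard table and unambiguous letters, the `to_stop` argument mirrors the option shown above, and it will raise a KeyError on any other letter.

```python
from itertools import product

BASES = "TCAG"
# Amino acids for the 64 codons in TCAG order (standard genetic code).
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(nucleotides, to_stop=False):
    """Translate a coding-strand DNA or RNA string with the standard table."""
    dna = nucleotides.upper().replace("U", "T")
    protein = []
    # Step through complete codons only; trailing bases are ignored.
    for start in range(0, len(dna) - len(dna) % 3, 3):
        amino_acid = CODON_TABLE[dna[start:start + 3]]
        if to_stop and amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"))
print(translate("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", to_stop=True))
```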

The SeqRecord object


The SeqRecord object in Biopython is used to hold a sequence (as a Seq object)
with identifiers (ID and name), description and optionally annotation and
sub-features.

Most of the sequence file format parsers in Biopython can return SeqRecord
objects (and may offer a format specific record object too, see for example
Bio.SwissProt). The SeqIO system will only return SeqRecord objects.

Most of the time we'll create SeqRecord objects by parsing a sequence file with
Bio.SeqIO. However, it is useful to know how to create a SeqRecord directly.
For example,

from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                       IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print record

This would give the following output:

ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())

Extracting information from a SeqRecord


Let's look in closer detail at the well-annotated SeqRecord objects Biopython
creates from a GenBank file, such as ls_orchid.gbk, which we'll load using the
SeqIO module. This file contains 94 records:

from Bio import SeqIO

for index, record in enumerate(SeqIO.parse(open("ls_orchid.gbk"), "genbank")):
    print "index %i, ID = %s, length %i, with %i features" \
          % (index, record.id, len(record.seq), len(record.features))

And this is some of the output. Remember that Python counts from zero, so
the 94 records in this file have been labelled 0 to 93:

index 0, ID = Z78533.1, length 740, with 5 features
index 1, ID = Z78532.1, length 753, with 5 features
index 2, ID = Z78531.1, length 748, with 5 features
...
index 92, ID = Z78440.1, length 744, with 5 features
index 93, ID = Z78439.1, length 592, with 5 features

Let's look in a little more detail at the final record:

print record

That should give us a hint of the sort of information held in this object:

ID: Z78439.1
Name: Z78439
Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ...,
'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed
spacer', 'ITS1', 'ITS2']
/references=[<Bio.SeqFeature.Reference ...>, <Bio.SeqFeature.Reference ...>]
/data_file_division=PLN
/date=30-NOV-2006
/organism=Paphiopedilum barbatum
/gi=2765564
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())

Most of the values in the annotations dictionary are simple strings, but this
isn't always the case. Have a look at the references entry for this example:
it's a list of Reference objects:

>>> print record.annotations["references"].__class__
<type 'list'>
>>> print len(record.annotations["references"])
2
>>> for ref in record.annotations["references"] : print ref.authors
Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W.
Cox,A.V.

Next is features, which is another list property, and it contains SeqFeature
objects:

>>> print record.features.__class__
<type 'list'>
>>> print len(record.features)
5

SeqFeature objects are complicated enough to warrant separate treatment; for
now, please refer to the Biopython Tutorial.

We can convert the SeqRecord into a string using one of the output formats
supported by Bio.SeqIO, for example:

>>> print record.format("fasta")

This should give:

>Z78439.1 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC
ACCCATGGGCATTTGCTGTTGAAGTGACCTAGATTTGCCATCGAGCCTCCTTGGGAGCTT
TCTTGTTGGCGAGATCTAAACCCCTGCCCGGCGGAGTTGGGCGCCAAGTCATATGACACA
TAATTGGTGAAGGGGGTGGTAATCCTGCCCTGACCCTCCCCAAATTATTTTTTTAACAAC
TCTCAGCAACGGATATCTCGGCTCTTGCATCGATGAAGAACGCAGCGAAATGCGATAATG
GTGTGAATTGCAGAATCCCGTGAACATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCA
TCAGGCCAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTG
TCCATACATACTGTTCAGCCGGTGCGGATGTGAGTTTGGCCCCTTGTTCTTTGGTACGGG
GGGTCTAAGAGCTGCATGGGCTTTGGATGGTCCTAAATACGGAAAGAGGTGGACGAACTA
TGCTACAACAAAATTGTTGTGCAAATGCCCCGGTTGGCCGTTTAGTTGGGCC
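
For readers who want to see how such a FASTA string is laid out, here is a small
plain-Python sketch (Python 3). The function format_fasta and its parameters are
hypothetical names of our own, not Biopython's implementation. It simply
writes a ">" header line followed by the sequence wrapped at 60 characters per
line, which is the line width visible in the output above.

```python
def format_fasta(identifier, description, sequence, width=60):
    """Return one FASTA record as a string, wrapping the sequence lines."""
    # Header: '>' + identifier, then the description if there is one.
    header = ">" + " ".join(part for part in (identifier, description) if part)
    lines = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + lines) + "\n"


if __name__ == "__main__":
    print(format_fasta("demo_1", "an illustrative record", "ACGT" * 40), end="")
```

A 160-letter sequence is therefore written as two full lines of 60 letters and a
final line of 40.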

The SeqIO Library
Bio.SeqIO provides a simple uniform interface to input and output assorted
sequence file formats (including multiple sequence alignments), but will only
deal with sequences as SeqRecord objects. There is a sister interface
Bio.AlignIO for working directly with sequence alignment files as Alignment
objects.

With Bio.SeqIO we can treat sequence alignment file formats just like any other
sequence file, but the new Bio.AlignIO module is designed to work with such
alignment files directly. You can also convert a set of SeqRecord objects from
any file format into an alignment - provided they are all the same length. Note
that when using Bio.SeqIO to write sequences to an alignment file format, all
the (gapped) sequences should be the same length.

Sequence Input

The main function is Bio.SeqIO.parse() which takes a file handle and format
name, and returns a SeqRecord iterator. This lets us do things like:

from Bio import SeqIO

handle = open("example.fasta", "rU")
for record in SeqIO.parse(handle, "fasta"):
    print record.id
handle.close()

In the above example, we opened the file using the built-in Python function
open. The argument 'rU' means open for reading in universal newline mode,
so we don't have to worry whether the file uses Unix, Mac or DOS/Windows
style newline characters.

Iterators are great when we only need the records one by one, in the order
found in the file. For some tasks we may need random access to the records
in any order. In this situation, use the built-in Python list function to
turn the iterator into a list:

from Bio import SeqIO

handle = open("example.fasta", "rU")
records = list(SeqIO.parse(handle, "fasta"))
handle.close()
print records[0].id   # first record
print records[-1].id  # last record
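
The difference between iterating and building a list can be seen without
Biopython. The sketch below (Python 3) is a deliberately simplified FASTA reader
of our own; parse_fasta and the tuple layout it yields are assumptions made for
illustration and do not reproduce what Bio.SeqIO.parse() returns. It yields one
record at a time, and list() turns that stream into something we can index.

```python
import io


def _make_record(header, chunks):
    """Split a header into (identifier, description) and join sequence lines."""
    identifier, _, description = header.partition(" ")
    return identifier, description, "".join(chunks)


def parse_fasta(handle):
    """Yield (identifier, description, sequence) tuples from a FASTA handle."""
    header = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            if header is not None:
                yield _make_record(header, chunks)
            header = line[1:]
            chunks = []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield _make_record(header, chunks)


if __name__ == "__main__":
    handle = io.StringIO(">seq1 first\nACGT\nTTAA\n>seq2\nGGCC\n")
    records = list(parse_fasta(handle))
    print(records[0][0])   # first record identifier
    print(records[-1][0])  # last record identifier
```

Because parse_fasta is a generator, only the record currently being assembled
is held in memory until list() collects them all.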

Sequence Output

For writing records to a file use the function Bio.SeqIO.write(), which takes a
SeqRecord iterator (or list), output handle and format string:

from Bio import SeqIO

sequences = ... # add code here
output_handle = open("example.fasta", "w")
SeqIO.write(sequences, output_handle, "fasta")
output_handle.close()

There are more examples in the following section on converting between file
formats.

Note that if we are writing to an alignment file format, all our sequences must
be the same length.

If we supply the sequences as a SeqRecord iterator, then for sequential file
formats like FASTA or GenBank, the records can be written one by one. Because
only one record is created at a time, very little memory is required. See the
example below filtering a set of records.

On the other hand, for interlaced or non-sequential file formats like Clustal, the
Bio.SeqIO.write() function will be forced to automatically convert an iterator
into a list. This will destroy any potential memory saving from using a
generator/iterator approach.
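
The memory argument can be illustrated with two small generators in plain Python
(Python 3). This is only a sketch under our own assumptions: long_records and
fasta_chunks are hypothetical helpers working on (identifier, sequence) pairs,
not Biopython functions. Each record is filtered and formatted only at the moment
the consumer asks for it, so no full list is ever built.

```python
def long_records(records, min_length):
    """Lazily yield only the (identifier, sequence) pairs that are long enough."""
    for identifier, sequence in records:
        if len(sequence) >= min_length:
            yield identifier, sequence


def fasta_chunks(records):
    """Lazily yield one formatted FASTA entry per record."""
    for identifier, sequence in records:
        yield ">%s\n%s\n" % (identifier, sequence)


if __name__ == "__main__":
    import sys
    demo = iter([("a", "ACGT"), ("b", "AC"), ("c", "ACGTAC")])
    # writelines() pulls the entries one at a time from the generator chain.
    sys.stdout.writelines(fasta_chunks(long_records(demo, 4)))
```

An interlaced format could not be written this way, because every line of output
depends on all of the records at once.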

Random subsequences [USED IN THIS PROJECT]

This script will read a GenBank file with a whole mitochondrial genome (e.g. the
tobacco mitochondrion, Nicotiana tabacum mitochondrion, NC_006581), create
500 records containing random fragments of this genome, and save them as a
FASTA file. These subsequences are created using random starting points and a
fixed length of 200.

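from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
from random import randint

# The genome must be the only record in the input file.
mito_record = SeqIO.read(open("NC_006581.gbk"), "genbank")  # file name assumed

mito_frags = []
limit = len(mito_record.seq)
for i in range(0, 500):
    start = randint(0, limit - 200)  # random starting point
    end = start + 200                # fixed fragment length of 200
    mito_frag = mito_record.seq[start:end]
    record = SeqRecord(mito_frag, 'fragment_%i' % (i + 1), '', '')
    mito_frags.append(record)

output_handle = open("file name.fasta", "w")
SeqIO.write(mito_frags, output_handle, "fasta")
output_handle.close()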

That should give something like this as the output file,

>fragment_1
TGGGCCTCATATTTATCCTATATACCATGTTCGTATGGTGGCGCGATGTTCTACGTGAAT
CCACGTTCGAAGGACATCATACCAAAGTCGTACAATTAGGACCTCGATATGGTTTTATTC
TGTTTATCGTATCGGAGGTTATGTTCTTTTTTGCTCTTTTTCGGGCTTCTTCTCATTCTT
CTTTGGCACCTACGGTAGAG

...
>fragment_500
ACCCAGTGCCGCTACCCACTTCTACTAAGGCTGAGCTTAATAGGAGCAAGAGACTTGGAG
GCAACAACCAGAATGAAATATTATTTAATCGTGGAAATGCCATGTCAGGCGCACCTATCA
GAATCGGAACAGACCAATTACCAGATCCACCTATCATCGCCGGCATAACCATAAAAAAGA
TCATTAAAAAAGCGTGAGCC

Writing to a string

Sometimes we won't want to write our SeqRecord object(s) to a file, but to a
string. For example, we might be preparing output for display as part of a
webpage. If we want to write multiple records to a single string, use StringIO to
create a string-based handle.
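
The idea of a string-based handle can be sketched as follows in plain Python
(Python 3, where StringIO lives in the io module; in Python 2 it was imported
from the StringIO module). The function records_to_fasta_string is a
hypothetical helper of our own that writes simple (identifier, sequence) pairs;
with Biopython, the same handle would be passed to Bio.SeqIO.write() in place
of an open file.

```python
from io import StringIO


def records_to_fasta_string(records):
    """Write (identifier, sequence) pairs to an in-memory handle."""
    handle = StringIO()  # behaves like a file opened for writing
    for identifier, sequence in records:
        handle.write(">%s\n%s\n" % (identifier, sequence))
    return handle.getvalue()  # everything written so far, as one string


if __name__ == "__main__":
    print(records_to_fasta_string([("a", "ACGT"), ("b", "GG")]), end="")
```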

Sample Outputs
1. Screenshot showing the Project-621033475.py in the IDLE USER Interface

2. Screenshot showing the 1st Input Menu,
i.e. “Enter filepath to save sequences (X:/filename) ...”

• where X:/ is the drive where the file has to be stored.
• A folder or subfolder (path) can also be provided.
Example:
• D:/User/Documents or just
• D:/Sample
[which creates a file named Sample.fasta in D:/ (D-Drive)]
• If no path and file name are provided
[a file with no name, in .fasta format, containing the sequences is
created in the folder where Project-621033475.py is located]

3. Screenshot showing the example of the 1st Input

• Here the user enters the filepath as
• D:/Sample
[which creates a file named Sample.fasta in D:/ (D-Drive)]

4. Screenshot showing the 2nd Input Menu,
i.e. “Enter number of random sequences to generate ...”

• where the user enters the number of random sequences to be created.
• Only integers are supported.
Example:
• 12

5. Screenshot showing the confirmation that the .fasta file is created at the
specified location with the specified number of sequences,

i.e. “Sequence Count: 12
File created at d:/Sample.fasta
NOTE: File created is in FASTA format. It can be opened in Notepad.”

6. Screenshot showing the Sample.fasta file opened in the Notepad program of
Windows XP

The Final Output in FASTA Format


From D:/Sample.fasta

>randSeq1 A random sequence using 100 Amino acid residues.
PQWQVYYPTVQQPYRPYWRYQRQYPPWTRWPQVTYYTQWPTPWPYPPYWQQRWVPVPYWV
PRQWYTTTTQWQQTVVQRTPWYTPRYTTQQRRWRWQQTPR
>randSeq2 A random sequence using 100 Amino acid residues.
IDGECEEGHGFHFDFHGGIHDFFCDFGCGEHGIIFGGFHDGGHIIDHFFCHEGIGHGFID
EHEEHGHEIGHDCGEFCFHHHEGEFEDFIFHGCFDDEIHG
>randSeq3 A random sequence using 100 Amino acid residues.
LLLLNMMNKLSLMASSLALSSMMSLKMSKANMAMASLLAKKLLKANLSNKNLLKNLKLSS
KSLSLLMANLSAASKNMMLKNLLKAAKLLAKNMMSLASKN
>randSeq4 A random sequence using 100 Amino acid residues.
DIDGFFDGHHCICCECEHFDHCDIHGGDDCIDFIFIFHDGGGDFDCEDHCECFHICIIHG
FFEDCHCGCCDCFIDIIIFHHEDFDIEGFCCIEGHHHFGD
>randSeq5 A random sequence using 100 Amino acid residues.
VRTTQPQQVWRTYQTWWVPWWYPQYQRYVQQVTWRPPRPQWQVVWQRWTVTPTVPPYPVR
RRPPVRWVQRWVVWTQWPYYTPWVRYTRTVVTPWYQYVVQ
>randSeq6 A random sequence using 100 Amino acid residues.
CDDGGCFCFHIICCEEIIHHIGEIIHHHFIDEDFGIGDEGCHIIDGCFHIIGGGGIHHFI
FDGFGDFHFEHEGDGDHFFEFGIGIFGFIGDECDCIICED
>randSeq7 A random sequence using 100 Amino acid residues.
VVTYYYPQYRPWTYTTRQWTPRPPTYWQQVYTQRVTWTVPPQTWVTRRYQPTTWTPYTQT
YVWWQQWWRQTRTYWYVQWYVTPTWTQPQTTVQQQTTVWW
>randSeq8 A random sequence using 100 Amino acid residues.
FFCEDFDHCGCGGHFHGDGHHEFIEFIGCHHGCFCGECGEGECDFIDCCFEEHEIDCIIH
GCHHIEIDIEICFIGCHHIICGFHCGHFFHDEEFIFDICF
>randSeq9 A random sequence using 100 Amino acid residues.
LSMLNNSLMSSNKLNKKLSLSAKALAAMASLAALKKNSMSSMNSKAKAKSAAKKAASALA
KASKLNLSMMNNLMASMNLSNLMNALNSMMKLANNNNMSL
>randSeq10 A random sequence using 100 Amino acid residues.
ICFHDGFIIHECIHHCGFHCDFDEEHHGIFDDDICFGFIEGDHECGFEFCGIHGFFEFID
FCCGIECCEEGGGIGCCGGHFIHIDCDFGHGGECIDGDIH
>randSeq11 A random sequence using 100 Amino acid residues.
TRWVVQQVWWRPYPRYWPYRPVVQTYQTTWWPTRWRTRQRYYQQQPWTYTPRTQYYPRQQ
WRVVTPQQYTRQQPVRQWWVRWPWVTQQYVWVWYPQPRQQ
>randSeq12 A random sequence using 100 Amino acid residues.
DGFFFCGGIIECDDFIGIGECHGICGGEHCICGIDIIGEGECEHIGGCFEDFEFECEDCH
CGDIGIIDEHIGIEEDICHEIHDDEHGHEIEFGGGHGDGE

******************************************
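
For comparison, output of the same shape can be produced with a few lines of plain
Python (Python 3). The sketch below is not the project's code: random_protein and
random_fasta are hypothetical names, and the sketch assumes that every residue is
drawn uniformly from the 20 standard amino acids. The sample sequences above each
appear to use only a small subset of residues, which this sketch does not attempt
to reproduce.

```python
import random

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def random_protein(length=100, alphabet=AMINO_ACIDS, rng=random):
    """Return a random protein sequence of the given length."""
    return "".join(rng.choice(alphabet) for _ in range(length))


def random_fasta(count, length=100, seed=None):
    """Return `count` random sequences as FASTA text, wrapped at 60 columns."""
    rng = random.Random(seed)  # a seed makes the output reproducible
    entries = []
    for i in range(1, count + 1):
        seq = random_protein(length, rng=rng)
        wrapped = "\n".join(seq[j:j + 60] for j in range(0, length, 60))
        entries.append(
            ">randSeq%d A random sequence using %d Amino acid residues.\n%s\n"
            % (i, length, wrapped)
        )
    return "".join(entries)


if __name__ == "__main__":
    print(random_fasta(2), end="")
```

With the default length of 100, each record occupies a header line, one line of
60 residues and one line of 40 residues, as in the file shown above.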

1. Conclusion
2. Recommendations for improving this project

GLOSSARY
Accession number: An identifier supplied by the curators of the major biological
databases upon submission of a novel entry that uniquely identifies that sequence (or
other) entry. 

Adenine: A purine base found in DNA and RNA 

Amino acid: One of the 20 chemical building blocks that are joined by amide
(peptide) linkages to form a polypeptide chain of a protein 

Assembly: Compilation of overlapping sequences from one or more related genes that
have been clustered together based on their degree of sequence identity or similarity.
Sequence assembly may be used to piece together "shotgun" sequencing fragments
(see shotgun sequencing) based upon overlapping restriction enzyme digests, or may
be used to identify and index novel genes from "single-pass" cDNA sequencing efforts. 

Base pair: A pair of nitrogenous bases (a purine and a pyrimidine), held together by
hydrogen bonds, that form the core of DNA and RNA, i.e. the A:T, G:C and A:U
interactions.

Beta sheet: A three dimensional arrangement taken up by polypeptide chains that
consists of alternating strands linked by hydrogen bonds. The alternating strands
together form a sheet that is frequently twisted. One of the secondary structural
elements characteristic of proteins.

Bioinformatics:
1. The field of endeavor that relates to the collection, organization and analysis of large
amounts of biological data using networks of computers and databases (usually with
reference to the genome project and DNA sequence information).
2. Bioinformatics is sometimes used interchangeably with the term Computational
Biology. More precisely, Computational Biology is defined as the systematic development
and application of computing systems and computational solution techniques to
models of biological phenomena; Bioinformatics is defined as the systematic
development and application of computing systems and computational solution
techniques for analyzing data obtained by experiments, modeling, database search,
and instrumentation relating to biology.

Codon: A sequence of three adjacent nucleotides that designates a specific amino
acid or a start/stop signal for translation.

Cytosine: A pyrimidine base found in DNA and RNA.

Database: Any file system by which data gets stored following a logical process. (see
also relational database)

DNA (deoxyribonucleic acid): The chemical that forms the basis of the genetic
material in virtually all organisms. DNA is composed of the four nitrogenous bases
Adenine, Cytosine, Guanine, and Thymine, which are covalently bonded to a
backbone of deoxyribose-phosphate to form a DNA strand. Two complementary
strands (where all Gs pair with Cs and As with Ts) form a double helical structure
which is held together by hydrogen bonding between the cognate bases.

DNA polymerase: An enzyme that catalyzes the synthesis of DNA from a DNA
template given the deoxyribonucleotide precursors.

Expression (gene or protein): A measure of the presence, amount, and time-course of
one or more gene products in a particular cell or tissue. Expression studies are
typically performed at the RNA (mRNA) or protein level in order to determine the
number, type, and level of genes that may be up-regulated or down-regulated during
a cellular process, in response to an external stimulus, or in sickness or disease.
Gene chips and proteomics now allow the study of expression profiles of sets of genes
or even entire genomes.

FASTA format: A sequence in FASTA format begins with a single-line description,
followed by lines of sequence data. The description line is distinguished from the
sequence data by a greater-than (">") symbol in the first column. It is recommended
that all lines of text be shorter than 80 characters in length. An example sequence in
FASTA format is:
>gi|532319|pir|TVFV2E|TVFV2E envelope protein
ELRLRYCAPAGFALLKCNDADYDGFKTNCSNVSVVHCTNLMNTTVTTGLLLNGSYSENRT
QIWQKHRTSNDSALILLNKHYNLTVTCKRPGNKTVLPVTIMAGLVFHSQKYNLRLRQAWC
HFPSNWKGAWKEVKEEIVNLPKERYRGTNDPKRIFFQRQWGDPETANLWFNCHGEFFYC
KMDWFLNYLNNLTVDADHNECKNTSGTKSGNKRAPGPCVQRTYVACHIRSVIIWLETISKK
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKN
LLAAVEAQQQMLKLTIWGVK

Sequences are expected to be represented in the standard IUB/IUPAC amino acid and
nucleic acid codes with these exceptions: lower-case letters are accepted and are
mapped into upper-case; a single hyphen or dash can be used to represent a gap of
indeterminate length; and in amino acid sequences, U and * are acceptable letters.

GenBank: Data bank of genetic sequences operated by a division of the National
Institutes of Health.

Gene: Classically, a unit of inheritance. In practice, a gene is a segment of DNA on a
chromosome that encodes a protein and all the regulatory sequences (promoter)
required to control expression of that protein.

Gene expression: The conversion of information from gene to protein via transcription
and translation.

Genetic code: The mapping of all possible codons into the 20 amino acids including
the start and stop codons.

Genome: The complete genetic content of an organism.

Guanine (G): One of the nitrogenous purine bases found in DNA and RNA.

Hydrogen bond: A weak chemical interaction between an electronegative atom (e.g.
nitrogen or oxygen) and a hydrogen atom that is covalently attached to another atom.
These bonds hold the two strands of the DNA double helix together and are also the
primary interaction between water molecules.

Introns: Nucleotide sequences found in the structural genes of eukaryotes that are
non-coding and interrupt the sequences containing information that codes for
polypeptide chains. Intron sequences are spliced out of their RNA transcripts before
maturation and protein synthesis. (cf. Exons) 

Iteration: A series of steps in an algorithm whereby the processing of data is
performed repetitively until the result exceeds a particular threshold. Iteration is often
used in multiple sequence alignments whereby each set of pairwise alignments is
compared with every other, starting with the most similar pairs and progressing to the
least similar, until there are no longer any sequence-pairs remaining to be aligned.

Library: A large collection of compounds, peptides, cDNAs or genes which may be
screened in order to isolate cognate molecules.

Messenger RNA (mRNA): The complementary RNA copy of DNA formed from a
single-stranded DNA template during transcription that migrates from the nucleus to the
cytoplasm where it is processed into a sequence carrying the information to code for a
polypeptide domain.

Nuclease: Any enzyme that can cleave the phosphodiester bonds of nucleic acid
backbones.

Nucleoside: A five-carbon sugar covalently attached to a nitrogen base.

Nucleotide: A nucleic acid unit composed of a five carbon sugar joined to a phosphate
group and a nitrogen base.

Peptide: A short stretch of amino acids each covalently coupled by a peptide (amide)
bond.

Peptide bond (amide bond): A covalent bond formed between two amino acids when
the amino group of one is linked to the carboxy group of another (resulting in the
elimination of one water molecule).

Poly(A) tail: The stretch of Adenine (A) residues at the 3’ end of eukaryotic mRNA that
is added to the pre-mRNA as it is processed, before its transport from the nucleus to
the cytoplasm and subsequent translation at the ribosome.

Polyadenylation site: A site on the 3’-end of messenger RNA (mRNA) that signals the
addition of a series of Adenines during the RNA processing step and before the mRNA
migrates to the cytoplasm. These so-called poly(A) "tails" increase mRNA stability
and allow one to isolate mRNA from cells by PCR-amplification using poly(T) primers.

Polypeptide: A single chain of covalently attached amino acids joined by peptide
bonds. Polypeptide chains usually fold into a compact, stable form (a domain) that is
part (or all) of the final protein.

Post-transcriptional modification: Alterations made to pre-mRNA before it leaves the
nucleus and becomes mature mRNA.

Primary sequence (protein): The linear sequence of a polypeptide or protein.

Purine: A nitrogen-containing compound with a double-ring structure. The parent
compound of Adenine and Guanine.

Pyrimidine: A nitrogen-containing compound with a single six-membered ring
structure. The parent compound of Thymine, Cytosine, and Uracil.

Repeats (repeat sequences): Repeat sequences and approximate repeats occur
throughout the DNA of higher organisms (mammals). For example, the Alu sequences,
of length about 300 characters, appear hundreds of thousands of times in human
DNA with about 87% homology to a consensus Alu string. Some short substrings
such as TATA-boxes, poly-A and (TG)* also appear more often than by chance. Repeat
sequences may also occur within genes, as mutations or alterations to those genes.
Repetitive sequences, especially mobile elements, have many applications in genetic
research. DNA transposons and retroposons are routinely used for insertional
mutagenesis, gene mapping, gene tagging, and gene transfer in several model
systems.

Replication: The synthesis of an informationally identical macromolecule (e.g. DNA)
from a template molecule.

Ribonucleic acid (RNA): A category of nucleic acids in which the component sugar is
ribose and consisting of the four nucleotides Cytosine, Uracil, Guanine, and
Adenine. The three types of RNA are messenger RNA (mRNA), transfer RNA (tRNA) and
ribosomal RNA (rRNA).

Splice site: The sequence found at the 5’ and 3’ region of exon/intron boundaries,
usually defined by a consensus sequence:
Intron
5’ CAGGTAAGT---------TNCAGG 3’ 
A G C T 
N represents any nucleotide; the bottom line represents alternative nucleotides at the
indicated positions. 

Splicing: The joining together of separate DNA or RNA component parts. For example,
RNA splicing in eukaryotes involves the removal of introns and the stitching together
of the exons from the pre-mRNA transcript before maturation.

Thymine: A pyrimidine base found in DNA but not in RNA.

Transcript: The single-stranded mRNA chain that is assembled from a gene template.

Transcription: The assembly of complementary single-stranded RNA on a DNA
template.

Transcription factors: A group of regulatory proteins that are required for
transcription in eukaryotes. Transcription factors bind to the promoter region of a
gene and facilitate transcription by RNA polymerase.

Transfer RNA (tRNA): A small RNA molecule that recognizes a specific amino acid,
transports it to a specific codon in the mRNA, and positions it properly in the nascent
polypeptide chain.

Translation: The process of converting RNA to protein by the assembly of a
polypeptide chain from an mRNA molecule at the ribosome.

Uracil: Nitrogenous pyrimidine base found in RNA but not DNA.
