Contents
1. Project Outline
2. Introduction
3. Objective
4. Introduction to Amino Acids
A. General Structure
B. Physical Properties
C. Classification
D. Peptide Bond Formation
E. Physiochemical Properties
F. Proteinogenic amino acids
5. The Genetic Code
A. RNA Codon Table
B. DNA Codon Table
6. Gene Expression
A. Transcription
B. RNA Processing
C. Translation
7. FASTA & FASTA Format
8. Python
A. Introduction, Features, Uses etc.
B. The IDLE User Interfac…
C. Data Types
9. Bio-Python
10. PROJECT CODE: “A Bio-Python Based Program to Generate Random Protein
Sequences, each sequence being 100 amino acid residues long.”
11. Explanation of the Code & Outputs
A. Bio-Python Libraries Used
B. Sample Outputs
12. Conclusion
13. Recommendations for improving this project
14. Glossary
15. Bibliography
History
The first few amino acids were discovered in the early 19th century. In
1806, the French chemists Louis-Nicolas Vauquelin and Pierre Jean Robiquet
isolated a compound in asparagus that proved to be asparagine, the first amino
acid to be discovered. Another amino acid that was discovered in the early 19th
century was cystine, in 1810, although its monomer, cysteine, was discovered
much later, in 1884. Glycine and leucine were also discovered around this time,
in 1820, by H. Braconnot from gelatin. Usage of the term amino acid in the
English language is from 1898.
General Structure
In the structure shown at the top of the page, R represents a side-chain specific
to each amino acid. The carbon atom next to the carboxyl group is called the
α-carbon and amino acids with a side-chain bonded to this carbon are referred
to as alpha amino acids. These are the most common form found in nature. In the
alpha amino acids, the α-carbon is a chiral carbon atom, with the exception of
glycine. In amino acids that have a carbon chain attached to the α-carbon (such
as lysine) the carbons are labeled in order as α, β, γ, δ, and so on. In some
amino acids, the amine group is attached to the β or γ-carbon, and these are
therefore referred to as beta or gamma amino acids.
The lack of a plane of symmetry means that there will be two stereoisomers of
an amino acid (apart from glycine) - one the non-superimposable mirror image
of the other.
All the naturally occurring amino acids have the right-hand structure in this
diagram. This is known as the "L-" configuration. The other one is known as the
"D-" configuration. We can recognise the L- configuration by imagining that we
are looking down from above on the right-hand structure in the above diagram.
We can't tell by looking at a structure whether that isomer will rotate the plane
of polarisation of plane polarised light clockwise or anticlockwise.
All the naturally occurring amino acids have the same L- configuration, but they
include examples which rotate the plane clockwise (+) and those which do the
opposite (-).
For example:
(+) Alanine
(-) Cysteine
(-) Tyrosine
(+) Valine
Zwitterions
The amine and carboxylic acid functional groups found in amino acids allow
them to have amphiprotic properties. Carboxylic acid groups (-CO2H) can be
deprotonated to become negative carboxylates (-CO2- ), and α-amino groups
(NH2-) can be protonated to become positive α-ammonium groups (+NH3-). At
pH values greater than the pKa of the carboxylic acid group (mean for the 20
common amino acids is about 2.2), the negative carboxylate ion predominates. At
pH values lower than the pKa of the α-ammonium group (mean for the 20
common α-amino acids is about 9.4), the nitrogen is predominantly protonated
as a positively charged α-ammonium group. Thus, at pH between 2.2 and 9.4,
the predominant form adopted by α-amino acids contains a negative
carboxylate and a positive α-ammonium group, as shown in structure (2) of the
figure, so has net zero charge. This molecular state is known as a zwitterion,
from the German Zwitter, meaning "hybrid" or "hermaphrodite". Amino acids
also exist as zwitterions in the solid phase, and crystallize with salt-like
properties unlike typical organic acids or amines.
Figure: An amino acid in its (1) unionized and (2) zwitterionic form
Isoelectric point
At pH values between the two pKa values, the zwitterion predominates, but
coexists in dynamic equilibrium with small amounts of net negative and net
positive ions. At the exact midpoint between the two pKa values, the trace
amount of net negative and trace of net positive ions exactly balance, so that the
average net charge of all forms present is zero. This pH is known as the
isoelectric point, pI, so pI = ½(pKa1 + pKa2). The individual amino acids all have
slightly different pKa values, so have different isoelectric points. For amino
acids with charged side-chains, the pKa of the side-chain is involved. Thus for
Aspartic Acid and Glutamic Acid, with negative side-chains, pI = ½(pKa1 + pKaR), where
pKaR is the side-chain pKa. Cysteine also has a potentially negative side-chain
with pKaR = 8.14, so pI should be calculated as for Aspartic Acid and Glutamic Acid,
even though the side-chain is not significantly charged at neutral pH. For Histidine,
Lysine, and Arginine, with positive side-chains, pI = ½(pKaR + pKa2). Amino acids
have zero mobility in electrophoresis at their isoelectric point, although this
behaviour is more usually exploited for peptides and proteins than single amino
acids. Zwitterions have minimum solubility at their isoelectric point and some
amino acids (in particular, with non-polar side-chains) can be isolated by
precipitation from water by adjusting the pH to the required isoelectric point.
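The three pI formulas above can be sketched in a few lines of Python. This is only an illustration: the function name isoelectric_point is hypothetical, and the pKa values passed in are typical textbook figures assumed for the example, not values taken from this report.

```python
def isoelectric_point(pka1, pka2, pka_r=None, side_chain=None):
    """Estimate pI as the mean of the two pKa values that flank the zwitterion.

    pka1       -- pKa of the alpha-carboxyl group
    pka2       -- pKa of the alpha-ammonium group
    pka_r      -- pKa of an ionizable side-chain, if there is one
    side_chain -- "acidic", "basic", or None for an uncharged side-chain
    """
    if side_chain == "acidic":
        return 0.5 * (pka1 + pka_r)   # e.g. Aspartic Acid, Glutamic Acid
    if side_chain == "basic":
        return 0.5 * (pka_r + pka2)   # e.g. Histidine, Lysine, Arginine
    return 0.5 * (pka1 + pka2)        # uncharged side-chain

# Assumed, illustrative pKa values (not from this report)
glycine = isoelectric_point(2.34, 9.60)
aspartic_acid = isoelectric_point(1.88, 9.60, pka_r=3.65, side_chain="acidic")
lysine = isoelectric_point(2.18, 8.95, pka_r=10.53, side_chain="basic")

print(glycine, aspartic_acid, lysine)
```

With these inputs the acidic amino acid has a pI well below 7 and the basic one well above 7, which is the pattern the formulas predict.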
Physical properties
Melting points
The amino acids are crystalline solids with surprisingly high melting points. It is
difficult to pin the melting points down exactly because the amino acids tend to
decompose before they melt. Decomposition and melting tend to be in the
200 - 300°C range. For the size of the molecules, this is very high. Something
unusual must be happening.
If we look at the general structure of an amino acid, we see that it has both a
basic amine group and an acidic carboxylic acid group.
There is an internal transfer of a hydrogen ion from the -COOH group to the
-NH2 group to leave an ion with both a negative charge and a positive charge.
The zwitterionic form is the form that amino acids exist in, even in the solid state.
Instead of the weaker hydrogen bonds and other intermolecular forces that we
expect, we actually have much stronger ionic attractions between one ion and
its neighbours. These ionic attractions take more energy to break and so the
amino acids have high melting points for the size of the molecules.
Solubility
Amino acids are generally soluble in water and insoluble in non-polar organic
solvents such as hydrocarbons.
This again reflects the presence of the zwitterions. In water, the ionic
attractions between the ions in the solid amino acid are replaced by strong
attractions between polar water molecules and the zwitterions. This is much
the same as any other ionic substance dissolving in water.
The extent of the solubility in water varies depending on the size and nature
of the "R" group.
The lack of solubility in non-polar organic solvents such as hydrocarbons is
because of the lack of attraction between the solvent molecules and the
zwitterions. Without strong attractions between solvent and amino acid, there
won't be enough energy released to pull the ionic lattice apart.
No.  Amino acid       Solubility in water
1    Alanine          16.65
2    Arginine         15
3    Asparagine       3.53
4    Aspartic Acid    0.778
5    Cysteine         very soluble
6    Glutamic Acid    0.864
7    Glutamine        2.5
8    Glycine          24.99
9    Histidine        4.19
10   Isoleucine       4.117
11   Leucine          2.426
12   Lysine           very soluble
13   Methionine       3.381
14   Phenylalanine    2.965
15   Proline          162.3
16   Serine           5.023
17   Threonine        very soluble
18   Tryptophan       1.136
19   Tyrosine         0.0453
20   Valine           8.85
Class                      Amino acids
Mercapto                   L-Cysteine
Carboxamide                L-Asparagine, L-Glutamine
Monoamino, dicarboxylic    L-Aspartate, L-Glutamate
Diamino, monocarboxylic    L-Lysine, L-Arginine, L-Histidine
As both the amine and carboxylic acid groups of amino acids can react to
form amide bonds, one amino acid molecule can react with another and
become joined through an amide linkage. This polymerization of amino acids is
what creates proteins. This condensation reaction yields the newly formed
peptide bond and a molecule of water. In cells, this reaction does not occur
directly; instead the amino acid is first activated by attachment to a transfer
RNA molecule through an ester bond. This aminoacyl-tRNA is produced in an
ATP-dependent reaction carried out by an aminoacyl-tRNA synthetase. This
aminoacyl-tRNA is then a substrate for the ribosome, which catalyzes the attack
of the amino group of the elongating protein chain on the ester bond.
Figure: A Venn diagram showing the relationship of the
20 naturally occurring amino acids to a selection
of physio-chemical properties thought to be
important in the determination of protein
structure
Genetic code
The genetic code is the set of rules by which information encoded in
genetic material (DNA or mRNA sequences) is translated into proteins (amino
acid sequences) by living cells. The code defines a mapping between tri-
nucleotide sequences, called codons, and amino acids. With some exceptions, a
triplet codon in a nucleic acid sequence specifies a single amino acid. Because
the vast majority of genes are encoded with exactly the same code (see the
RNA codon table), this particular code is often referred to as the canonical or
standard genetic code, or simply the genetic code, though in fact there are
many variant codes. For example, protein synthesis in human mitochondria
relies on a genetic code that differs from the standard genetic code.
Not all genetic information is stored using the genetic code. All
organisms' DNA contains regulatory sequences, intergenic segments,
chromosomal structural areas, and other non-coding DNA that can contribute
greatly to phenotype. Those elements operate under sets of rules that are
distinct from the codon-to-amino acid paradigm underlying the genetic code.
After the structure of DNA was discovered by James Watson and Francis Crick,
who used the experimental evidence of Maurice Wilkins and Rosalind Franklin
(among others), serious efforts to understand the nature of the encoding of
proteins began. George Gamow postulated that a three-letter code must be
employed to encode the 20 standard amino acids used by living cells to encode
proteins, because 3 is the smallest integer n such that 4^n is at least 20.
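Gamow's counting argument can be checked directly: with 4 bases there are 4, 16 and 64 possible codons of length 1, 2 and 3. The snippet below is a minimal sketch; the function name min_codon_length is hypothetical.

```python
def min_codon_length(n_amino_acids=20, n_bases=4):
    """Return the smallest n such that n_bases ** n >= n_amino_acids."""
    n = 1
    while n_bases ** n < n_amino_acids:
        n += 1
    return n

for length in (1, 2, 3):
    print(length, 4 ** length)   # number of distinct codons of that length

print(min_codon_length())        # 3
```

Two-letter codons give only 16 combinations, which is too few for 20 amino acids, while three-letter codons give 64, which leaves room for redundancy and stop signals.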
given double helix, as will the number of G and C bases. In RNA, thymine (T) is
replaced by uracil (U), and the deoxyribose is substituted by ribose.
The codon AUG both codes for methionine and serves as an initiation site: the
first AUG in an mRNA's coding region is where translation into protein begins.
The codon ATG both codes for methionine and serves as an initiation site: the first ATG in
a DNA coding region marks where translation into protein begins.
GENE EXPRESSION
Gene expression is the process by which information from a gene is used
in the synthesis of a functional gene product. These products are often
proteins, but in non-protein coding genes such as ribosomal RNA (rRNA),
transfer RNA (tRNA) or small nuclear RNA (snRNA) genes, the product is a
functional RNA. The process of gene expression is used by all known life -
eukaryotes (including multicellular organisms), prokaryotes (bacteria and
archaea) and viruses - to generate the macromolecular machinery for life.
Several steps in the gene expression process may be modulated, including the
transcription, RNA splicing, translation, and post-translational modification of a
protein. Gene regulation gives the cell control over structure and function, and
is the basis for cellular differentiation, morphogenesis and the versatility and
adaptability of any organism. Gene regulation may also serve as a substrate for
evolutionary change, since control of the timing, location, and amount of gene
expression can have a profound effect on the functions (actions) of the gene in
a cell or in a multicellular organism.
Key: L - Leucine; P - Proline; H - Histidine; E - Glutamic Acid; K - Lysine; T - Threonine
1. Transcription
2. RNA Processing or Post Transcriptional Modifications
3. Translation
I. Transcription
A DNA transcription unit encoding for a protein contains not only the
sequence that will eventually be directly translated into the protein (the coding
sequence) but also regulatory sequences that direct and regulate the synthesis
of that protein. The regulatory sequence before (upstream from) the coding
sequence is called the five prime untranslated region (5'UTR), and the sequence
following (downstream from) the coding sequence is called the three prime
untranslated region (3'UTR).
1. Pre-initiation
D. RNA Polymerase
E. Activators and Repressors
2. Initiation
3. Promoter clearance
After the first bond is synthesized, the RNA polymerase must clear the
promoter. During this time there is a tendency to release the RNA transcript
and produce truncated transcripts. This is called abortive initiation and is
common for both eukaryotes and prokaryotes. Abortive initiation continues to
occur until the σ factor rearranges, resulting in the transcription elongation
complex (which gives a 35 base-pair moving footprint). The σ factor is released
before 80 nucleotides of mRNA are synthesized. Once the transcript reaches
approximately 23 nucleotides, it no longer slips and elongation can occur. This,
like most of the remainder of transcription, is an energy-dependent process,
consuming adenosine triphosphate (ATP).
4. Elongation
One strand of the DNA, the template strand (or noncoding strand), is used
as a template for RNA synthesis. As transcription proceeds, RNA polymerase
traverses the template strand and uses base pairing complementarity with the
DNA template to create an RNA copy. Although RNA polymerase traverses the
template strand from 3' → 5', the coding (non-template) strand and newly-
formed RNA can also be used as reference points, so transcription can be
described as occurring 5' → 3'. This produces an RNA molecule from 5' → 3', an
exact copy of the coding strand (except that thymines are replaced with uracils,
and the nucleotides are composed of a ribose (5-carbon) sugar where DNA has
deoxyribose (one fewer oxygen atom) in its sugar-phosphate backbone).
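The relationship between the two DNA strands and the RNA product can be sketched in plain Python. The sketch only models base pairing, not the enzyme; the example sequence and the function names are hypothetical.

```python
# Watson-Crick pairs for DNA
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def transcribe_coding(coding_5to3):
    """The RNA is a copy of the coding strand with T replaced by U."""
    return coding_5to3.replace("T", "U")

def transcribe_template(template_3to5):
    """Pair each template base, read 3'->5', to build the RNA 5'->3'."""
    pairing = {"A": "U", "T": "A", "G": "C", "C": "G"}
    return "".join(pairing[base] for base in template_3to5)

coding = "ATGGCCATTGTA"                                  # coding strand, 5'->3'
template = "".join(COMPLEMENT[base] for base in coding)  # template, written 3'->5'

# Both descriptions of transcription give the same RNA
assert transcribe_coding(coding) == transcribe_template(template)
print(transcribe_coding(coding))  # AUGGCCAUUGUA
```

Reading the template strand 3' → 5' and copying the coding strand 5' → 3' are two views of the same process, which is why both functions return the same string.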
5. Termination
1. m-RNA processing
The pre-mRNA molecule undergoes three main modifications. These
modifications are 5' capping, 3' polyadenylation, and RNA splicing, which occur
in the cell nucleus before the RNA is translated.
The pre-mRNA processing at the 3' end of the RNA molecule involves
cleavage of its 3' end and then the addition of about 200 adenine residues to
form a poly(A) tail. The cleavage and adenylation reactions occur if a
polyadenylation signal sequence (5'- AAUAAA-3') is located near the 3' end of
the pre-mRNA molecule, which is followed by another sequence, which is
usually (5'-CA-3'). The second signal is the site of cleavage. A GU-rich sequence
is also usually present further downstream on the pre-mRNA molecule. After
the synthesis of the sequence elements, two multisubunit proteins called
cleavage and polyadenylation specificity factor (CPSF) and cleavage stimulation
factor (CStF) are transferred from RNA Polymerase II to the RNA molecule. The
two factors bind to the sequence elements. A protein complex forms that
contains additional cleavage factors and the enzyme Polyadenylate Polymerase
(PAP). This complex cleaves the RNA between the polyadenylation sequence
and the GU-rich sequence at the cleavage site marked by the (5'-CA-3')
sequences. Poly(A) polymerase then adds about 200 adenine units to the new
3' end of the RNA molecule using ATP as a precursor. As the poly(A) tail is
synthesised, it binds multiple copies of poly(A) binding protein, which protects
the 3' end from ribonuclease digestion.
2. RNA Splicing
Splicing is a modification of an RNA after transcription, in which introns are
removed and exons are joined. This is needed for the typical eukaryotic
messenger RNA before it can be used to produce a correct protein through
translation. For many eukaryotic introns, splicing is done in a series of
reactions which are catalyzed by the spliceosome, a complex of small nuclear
ribonucleoproteins (snRNPs), but there are also self-splicing introns.
IV. Translation
chain, or polypeptide, that will later fold into an active protein. In Bacteria,
translation occurs in the cell's cytoplasm, where the large and small subunits of
the ribosome are located, and bind to the mRNA. In Eukaryotes, translation
occurs in the cytosol or across the membrane of the endoplasmic reticulum in a
process called vectorial synthesis. The ribosome facilitates decoding by inducing
the binding of tRNAs with complementary anticodon sequences to that of the mRNA. The
tRNAs carry specific amino acids that are chained together into a polypeptide as
the mRNA passes through and is "read" by the ribosome in a fashion
reminiscent of a stock ticker and ticker tape.
In many instances, the entire ribosome/mRNA complex will bind to the outer
membrane of the rough endoplasmic reticulum and release the nascent protein
polypeptide inside for later vesicle transport and secretion outside of the cell.
Many types of transcribed RNA, such as transfer RNA, ribosomal RNA, and small
nuclear RNA, do not undergo translation into proteins.
In activation, the correct amino acid is covalently bonded to the correct transfer
RNA (tRNA). The amino acid is joined by its carboxyl group to the 3' OH of the
tRNA by an ester bond. When the tRNA has an amino acid linked to it, it is
termed "charged". Initiation involves the small subunit of the ribosome binding
to the 5' end of mRNA with the help of initiation factors (IF). Termination of the
polypeptide happens when the A site of the ribosome faces a stop codon (UAA,
UAG, or UGA). No tRNA can recognize or bind to this codon. Instead, the stop
codon induces the binding of a release factor protein that prompts the
disassembly of the entire ribosome/mRNA complex.
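Decoding and termination can be illustrated with a small translator. This is a simplified sketch, not a model of the ribosome: it starts at the first AUG it finds, reads one codon at a time, and stops at the first in-frame stop codon. The codon table is the standard genetic code written as a compact 64-letter string; the function name and the example mRNA are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, codons ordered UUU, UUC, UUA, UUG, UCU, ... ("*" = stop)
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}

def translate(mrna):
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""                      # no initiation site found
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":          # UAA, UAG or UGA: release the chain
            break
        protein.append(amino_acid)
    return "".join(protein)

print(translate("GGAUGGCCAUUGUAUAAGG"))  # MAIV
```

The example reads AUG, GCC, AUU and GUA as M, A, I and V, then stops at UAA.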
Translation Process
The FASTA format may be used to represent either single sequences or many
sequences in a single file. A series of single sequences, concatenated, constitute
a multisequence file.
The first line in a FASTA file starts with either a ">" (greater-than) symbol
or a ";" (semicolon) and was originally taken as a comment.
Subsequent lines starting with a semicolon would be ignored by
software. Since the only comment used was the first, it quickly came to be
used to hold a summary description of the sequence, often starting with
a unique library accession number, and with time it has become
commonplace to always use ">" for the first line and not to use ";"
comments.
Following the initial line (used for a unique description of the sequence)
is the actual sequence itself in standard one-letter code. Anything other
than a valid code would be ignored (including spaces, tabs,
asterisks, etc.).
A sample Sequence:
>gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]
LCLYTHIGRNIYYGSYLYSETWNTGIMLLLITMATAFMGYVLPWGQMSFWGATVITNLFSAIPYIGTNLV
EWIWGGFSVDKATLNRFFAFHFILPFTMVALAGVHLTFLHETGSNNPLGLTSDSDKIPFHPYYTIKDFLG
LLILILLLLLLALLSPDMLGDPDNHMPADPLNTPLHIKPEWYFLFAYAILRSVPNKLGGVLALFLSIVIL
GLMPFLHTSKHRSMMLRPLSQALFWTLTMDLLTLTWIGSQPVEYPYTIIGQMASILYFSIILAFLPIAGX
IENY
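The layout described above (a ">" description line followed by one or more sequence lines) can be read with a few lines of plain Python. This is a minimal sketch rather than a full FASTA implementation: the function name parse_fasta and the two sample records are hypothetical, and old-style ";" comment lines are simply skipped.

```python
def parse_fasta(text):
    """Return a list of (description, sequence) pairs from FASTA-formatted text."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue                      # blank line or old-style comment
        if line.startswith(">"):
            if header is not None:        # close the previous record
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)           # sequence may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records

sample = """>seq1 hypothetical protein
MKTAYIAK
QRQISFVK
>seq2 another hypothetical protein
SHFSRQ
"""

for description, sequence in parse_fasta(sample):
    print(description, len(sequence))
```

Because a new record starts at every ">" line, the same function handles both single-sequence and multisequence files.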
For Python programs that use amino acid query sequences, the accepted amino acid
codes are:
A  alanine                P  proline
B  aspartate/asparagine   Q  glutamine
C  cysteine               R  arginine
D  aspartate              S  serine
E  glutamate              T  threonine
F  phenylalanine          U  selenocysteine
G  glycine                V  valine
H  histidine              W  tryptophan
I  isoleucine             Y  tyrosine
K  lysine                 Z  glutamate/glutamine
L  leucine                X  any
M  methionine             *  translation stop
N  asparagine             -  gap of indeterminate length
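A program can use this table to check a query sequence before processing it. The sketch below is one simple way to do that; the names ACCEPTED_CODES and invalid_codes are hypothetical.

```python
# The one-letter codes listed in the table above, including "*" and "-"
ACCEPTED_CODES = set("ABCDEFGHIKLMNPQRSTUVWYZX*-")

def invalid_codes(sequence):
    """Return the sorted list of characters that are not accepted codes."""
    return sorted(set(sequence.upper()) - ACCEPTED_CODES)

print(invalid_codes("LCLYTHIGRN*"))  # []
print(invalid_codes("MKTAYJO"))      # ['J', 'O']
```

An empty list means every character in the sequence is one of the accepted codes.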
Sequence identifiers
The NCBI defined a standard for the unique identifier used for the sequence
(SeqID) in the header line. The formatdb man page has this to say on the
subject: "formatdb will automatically parse the SeqID and create indexes, but
the database identifiers in the FASTA definition line must follow the
conventions of the FASTA Defline Format."
However, they do not give a definitive description of the FASTA defline format.
An attempt to create such a format is given below:
GenBank                        gi|gi-number|gb|accession|locus
EMBL Data Library              gi|gi-number|emb|accession|locus
DDBJ, DNA Database of Japan    gi|gi-number|dbj|accession|locus
NBRF PIR                       pir||entry
Protein Research Foundation    prf||name
SWISS-PROT                     sp|accession|name
Patents                        pat|country|number
GenInfo Backbone Id            bbs|number
General database identifier    gnl|database|identifier
NCBI Reference Sequence        ref|accession|locus
Local Sequence identifier      lcl|identifier
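As an illustration of the first three rows of this table, the sketch below splits a gi-style header into its fields. It is an assumption-laden example: the function name parse_defline is hypothetical, only the five-field gi|gi-number|database|accession|locus layout is handled, and any other layout raises an error.

```python
def parse_defline(defline):
    """Split a 'gi|gi-number|db|accession|locus description' header line."""
    identifier, _, description = defline.lstrip(">").partition(" ")
    fields = identifier.split("|")
    if len(fields) != 5 or fields[0] != "gi":
        raise ValueError("not a gi-style identifier: " + identifier)
    info = {
        "gi": fields[1],
        "database": fields[2],     # gb, emb or dbj
        "accession": fields[3],
        "locus": fields[4],        # may be empty
    }
    return info, description.strip()

info, description = parse_defline(
    ">gi|5524211|gb|AAD44166.1| cytochrome b [Elephas maximus maximus]"
)
print(info["database"], info["accession"])  # gb AAD44166.1
print(description)
```

In the sample header the locus field is empty, so the identifier ends with a trailing "|".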
Python interpreters are available for many operating systems, and Python
programs can be packaged into stand-alone executable code for many systems
using various tools.
HISTORY
Guido van Rossum created Python and is affectionately
bestowed with the title "Benevolent Dictator For Life" by
the Python community.
code blocks set off with indentation have fewer begin/end words and
punctuation to accidentally leave out (which means fewer bugs).
1. Systems Programming
2. GUIs
Python’s simplicity and rapid turnaround also make it a good match for
graphical user interface programming. Python comes with a standard
object-oriented interface to the Tk GUI API called tkinter (Tkinter in 2.6)
that allows Python programs to implement portable GUIs with a native
look and feel. Python/tkinter GUIs run unchanged on Microsoft Windows,
X Windows (on Unix and Linux), and the Mac OS (both Classic and OS X). A
free extension package, PMW, adds advanced widgets to the
tkinter toolkit. In addition, the wxPython GUI API, based on a C++ library,
offers an alternative toolkit for constructing portable GUIs in Python.
3. Internet Scripting
Python comes with standard Internet modules that allow Python programs
to perform a wide variety of networking tasks, in client and server modes.
Scripts can communicate over sockets; extract form information sent to
server-side CGI scripts; transfer files by FTP; parse, generate, and analyze
XML files; send, receive, compose, and parse email; fetch web pages by
URLs; parse the HTML and XML of fetched web pages; communicate over
XML-RPC, SOAP, and Telnet; and more. Python’s libraries make these tasks
remarkably simple. In addition, a large collection of third-party tools is
available on the Web for doing Internet programming in Python. For
instance, the HTMLGen system generates HTML files from Python
class-based descriptions, the mod_python package runs
4. Component Integration
5. Database Programming
6. Rapid Prototyping
Python is an object-oriented language, from the ground up. Its class model
supports advanced notions such as polymorphism, operator overloading, and
multiple inheritance; yet, in the context of Python’s simple syntax and typing,
OOP is remarkably easy to apply. In fact, if we don’t understand these terms,
we’ll find they are much easier to learn with Python than with just about any
other OOP language available. Besides serving as a powerful code structuring
and reuse device, Python’s OOP nature makes it ideal as a scripting tool for
object-oriented systems languages such as C++ and Java. For example, with the
appropriate glue code, Python programs can subclass (specialize) classes
implemented in C++, Java, and C#. Of equal significance, OOP is an option in
Python; we can go far without having to become an object guru all at once.
Much like C++, Python supports both procedural and object-oriented
programming modes. Its object-oriented tools can be applied if and when
constraints allow. This is especially useful in tactical development modes, which
preclude design phases.
2. It’s Free
Python is completely free to use and distribute. As with other open source
software, such as Tcl, Perl, and Linux, we can fetch the entire Python system’s
source code for free on the Internet. There are no restrictions on copying it,
embedding it in our systems, or shipping it with our products. In fact, we can
even sell Python’s source code, if we are so inclined. But don’t get the wrong
idea: “free” doesn’t mean “unsupported.” The Python online community responds
to user queries with a speed that most commercial software help desks would
do well to try to emulate.
3. It’s Portable
• And more
Like the language interpreter itself, the standard library modules that ship with
Python are implemented to be as portable across platform boundaries as
possible. Further, Python programs are automatically compiled to portable byte
code, which runs the same on any platform with a compatible version of Python
installed.
4. It’s Powerful
5. It’s Mixable
and be called by Python programs flexibly. That means we can add functionality
to the Python system as needed, and use Python programs within other
environments or systems. Mixing Python with libraries coded in languages such
as C or C++, for instance, makes it an easy-to-use frontend language and
customization tool. As mentioned earlier, this also makes Python good at rapid
prototyping; systems may be implemented in Python first, to leverage its speed
of development, and later moved to C for delivery, one piece at a time,
according to performance demands.
To run a Python program, we simply type it and run it. There are no
intermediate compile and link steps, like there are for languages such as C or
C++. Python executes programs immediately, which makes for an interactive
programming experience and rapid turnaround after program changes; in
many cases, we can witness the effect of a program change as fast as we can
type it. Python programs are often simpler, smaller, and more flexible than equivalent
programs in languages like C, C++, and Java.
IDLE provides a graphical user interface for doing Python development, and it’s
a standard and free part of the Python system. It is usually referred to as an
integrated development environment (IDE), because it binds together
various development tasks into a single view. In short, IDLE is a GUI that lets you
edit, run, browse, and debug Python programs, all from a single interface.
Moreover, because IDLE is a Python program that uses the tkinter GUI toolkit
(known as Tkinter in 2.6), it runs portably on most Python platforms, including
Microsoft Windows, X Windows (for Linux, Unix, and Unix-like platforms), and
the Mac OS (both Classic and OS X). For many, IDLE represents an easy-to-use
alternative to typing command lines, and a less problem-prone alternative to
clicking on icons.
• Sequences are for lists of related data that we might want to sort, merge,
and so on.
• Dictionaries are collections of data that associate a unique key with each
value.
• Sets are for doing set operations (finding the intersection, difference, and so
1. Numeric data
Except when we're doing division with integers or using the decimal module
we don't have to worry about what kind of number data type we're using.
Python converts numbers into compatible types automatically. For example, if
>>> x = 5
>>> y = 1.5
>>> x * y
7.5
2. Sequential data
A. Lists can store multiple kinds of data (both text and numbers, for
example). We can change elements inside a list, and we can organize
the data in various ways (for example, by sorting).
B. Tuples, like lists, can include different kinds of data, but they can't be
changed. In Python terminology, they are immutable.
C. Strings store text or binary data. Strings are immutable (like tuples).
To see the data type of a Python object, use the type() function, like this:
>>> type ('foo')
<type 'str'>
3. Dictionaries
Python's dictionary (its keyword is dict) is a data type that stores multiple data
items (elements) of different types. In a dictionary, each element is associated
with a unique key, which is a value of any immutable type. When we use a dict,
we use the key to return the element associated with the key.
We use a dictionary when we want to store and retrieve items by using a key
that doesn't change and when we don't care in what order Python stores the
items. (In dictionaries, elements aren't numbered.)
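A short sketch of these ideas, using a hypothetical dictionary that maps one-letter amino acid codes to names:

```python
# Keys are immutable strings; values are the associated elements
amino_acids = {"A": "alanine", "G": "glycine", "K": "lysine"}

print(amino_acids["G"])                  # glycine: look up an element by key

amino_acids["W"] = "tryptophan"          # add a new key/value pair
print("W" in amino_acids)                # True: test whether a key exists

print(amino_acids.get("Z", "unknown"))   # unknown: default for a missing key
```

Looking up a missing key with square brackets raises a KeyError, which is why get() with a default is used in the last line.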
4. Sets
A set stores multiple items, which can be of different types, but each item in a
set must be unique. We can use Python sets to find unions, intersections,
differences, and so on. One use for sets is when we have repetitious data and
we want to ignore the repetition.
For example, imagine that we have an address database and we want to find
out which cities are represented, but we don't need to know how many times
each city appears in the database. A set will list each city in the database only
once.
The syntax for a set is a little different from the syntax of the other data
types. We use the word set followed by a name (or a group of elements) in
parentheses. Here is a set that finds each unique element in a list. We'll notice
that the elements are out of order in the set. That's because Python doesn't
store set elements in alphanumeric order (the same is true for dicts):
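A minimal sketch of such a set, using a hypothetical list of cities like the address-database example above:

```python
cities = ["Delhi", "Mumbai", "Delhi", "Pune", "Mumbai"]

unique_cities = set(cities)                   # duplicates are dropped
print(len(unique_cities))                     # 3
print(sorted(unique_cities))                  # ['Delhi', 'Mumbai', 'Pune']

# Set operations: intersection and difference
print(unique_cities & {"Pune", "Chennai"})    # {'Pune'}
print(sorted(unique_cities - {"Pune"}))       # ['Delhi', 'Mumbai']
```

Printing the set directly may show the cities in any order, so sorted() is used wherever a predictable order is needed.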
5. Files
Python uses the file data type to work with files on our computer or on the
Internet. Note that the file type is not the same as the actual file. The file type
is Python's internal representation of a computer or Internet file.
REMEMBER Before Python can work with an existing file or a new file, we need
to open the file inside Python.
This example opens a file called myfile:
open("myfile")
File objects are Python code’s main interface to external files on your
computer. Files are a core type, but they’re something of an oddball—there is
no specific literal syntax for creating them. Rather, to create a file object, you
call the built-in open function, passing in an external filename and a processing
mode as strings. For example, to create a text output file, you would pass in its
name and the 'w' processing mode string to write data:
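A minimal sketch of writing a text file. The file name data.txt is hypothetical, and the sketch writes into a temporary directory (an assumption made so that running it leaves no stray file in the current directory).

```python
import os
import tempfile

# Hypothetical file name inside a fresh temporary directory
path = os.path.join(tempfile.mkdtemp(), "data.txt")

# 'w' opens the file for writing text and creates it if it does not exist
with open(path, "w") as f:
    f.write("Hello\n")
    f.write("world\n")

print(os.path.exists(path))  # True
```

The with statement closes the file automatically, which makes sure the text is flushed to disk.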
This creates a file in the current directory and writes text to it (the filename can
be a full directory path if you need to access a file elsewhere on your
computer). To read back what you just wrote, reopen the file in 'r' processing
mode, for reading text input—this is the default if you omit the mode in the
call. Then read the file’s content into a string, and display it. A file’s contents
are always a string in your script, regardless of the type of data the file
contains:
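A minimal sketch of reading a file back. So that it runs on its own, it first writes a small file into a temporary directory; the file name data.txt is hypothetical.

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")  # hypothetical file name
with open(path, "w") as f:
    f.write("Hello\nworld\n")

# Omitting the mode is the same as passing 'r' (read text)
with open(path) as f:
    text = f.read()

print(repr(text))            # 'Hello\nworld\n'
print(type(text).__name__)   # str
print(text.split())          # ['Hello', 'world']
```

The whole file arrives as a single string, so ordinary string methods such as split() can be applied to it.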
File objects provide more ways of reading and writing (read accepts an optional
byte size, readline reads one line at a time, and so on), as well as other tools
(seek moves to a new file position). As we’ll see later, though, the best way to
read a file today is to not read it at all—files provide an iterator that
automatically reads line by line in for loops and other contexts.
BIOPYTHON
Biopython is a set of libraries to provide the ability to
deal with “things” of interest to biologists working on
the computer.
Installing Biopython
It is available at Biopython’s download page
(http://biopython.org/wiki/Download)
For Windows pre-compiled click-and-run installers are available, while for Unix
and other operating systems you must install from source as described in the
included README file. This is usually as simple as the standard commands:
What we have here is a sequence object with a generic alphabet - reflecting the
fact we have not specified if this is a DNA or protein sequence (okay, a protein
with a lot of Alanines, Glycines, Cysteines and Threonines!). In addition to
having an alphabet, the Seq object differs from the Python string in the
methods it supports. You can’t do this with a plain string:
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.complement()
Seq('TCATGTGACCA', Alphabet())
>>> my_seq.reverse_complement()
Seq('ACCAGTGTACT', Alphabet())
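To make clear what these two methods compute, here is a plain-Python sketch that reproduces the same results on an ordinary string. It is not Biopython's implementation, and it handles only the four unambiguous DNA letters; the function names are chosen for the example.

```python
# Translation table mapping each base to its Watson-Crick partner
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def complement(dna):
    """Replace every base with its complement."""
    return dna.translate(COMPLEMENT)

def reverse_complement(dna):
    """Complement the sequence, then reverse it."""
    return complement(dna)[::-1]

my_seq = "AGTACACTGGT"
print(complement(my_seq))          # TCATGTGACCA
print(reverse_complement(my_seq))  # ACCAGTGTACT
```

The Seq object packages this behaviour as methods, so the sequence and the operations that make sense for it travel together.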
Using the NCBI website by hand. Let’s just take a look through the nucleotide
databases at NCBI, using an Entrez online search
(http://www.ncbi.nlm.nih.gov:80/entrez/query.fcgi?db=Nucleotide) for
everything mentioning the text Cypripedioideae (this is the subfamily of lady
slipper orchids).
If we open the lady slipper orchids FASTA file ls_orchid.fasta in our text editor,
we’ll see that the file starts like this:
It contains 94 records, each of which has a line starting with “>” (greater-than
symbol) followed by the sequence on one or more lines. Now try this in Python:
gi|2765658|emb|Z78533.1|CIZ78533
Seq('CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATCATTGATGAGACCGTGG...CGC', SingleLetterAlphabet())
740
...
gi|2765564|emb|Z78439.1|PBZ78439
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACT...GCC', SingleLetterAlphabet())
592
Notice that the FASTA format does not specify the alphabet, so Bio.SeqIO has
defaulted to the rather generic SingleLetterAlphabet() rather than something
DNA specific.
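To make the record layout described above concrete (a “>” header line followed by one or more sequence lines), here is a plain-Python sketch of a FASTA reader. It is not Bio.SeqIO and makes no claim to match its behaviour in every case; the function name read_fasta and the two example records are hypothetical, and Python 3 is assumed.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(">"):
            if header is not None:        # emit the record collected so far
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)           # sequence may span several lines
    if header is not None:                # emit the final record
        yield header, "".join(chunks)


# Two made-up records, used only to exercise the reader.
example = """>record_1 hypothetical example
CGTAACAAGG
TTTCCGTAGG
>record_2 hypothetical example
CATTGTTGAG
"""

records = list(read_fasta(example.splitlines()))
for header, sequence in records:
    # Print the identifier (first word of the header), the sequence and its length.
    print(header.split()[0], sequence, len(sequence))
```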
The code in these modules makes it easy to write Python code that interacts
with the CGI scripts on these pages, so that you can get results in an
easy-to-handle format. In some cases, the results can be tightly integrated
with the Biopython parsers to make it even easier to extract information.
The alphabet object is perhaps the most important thing that makes the Seq object
more than just a string. The currently available alphabets for Biopython are
defined in the Bio.Alphabet module. We’ll use the IUPAC alphabets
(http://www.chem.qmw.ac.uk/iupac/) here to deal with some of our favorite
objects: DNA, RNA and Proteins.
Bio.Alphabet.IUPAC provides basic definitions for proteins, DNA and RNA, but
additionally provides the ability to extend and customize the basic definitions.
For instance, for proteins, there is a basic IUPACProtein class, but there is an
additional ExtendedIUPACProtein class providing for the additional elements
“U” (or “Sec” for selenocysteine) and “O” (or “Pyl” for pyrrolysine), plus the
ambiguous symbols “B” (or “Asx” for asparagine or aspartic acid), “Z” (or “Glx”
for glutamine or glutamic acid), “J” (or “Xle” for leucine or isoleucine) and “X”
(or “Xxx” for an unknown amino acid). For DNA you’ve got choices of
IUPACUnambiguousDNA, which provides for just the basic letters,
IUPACAmbiguousDNA (which provides for ambiguity letters for every possible
situation) and ExtendedIUPACDNA, which allows letters for modified bases.
Similarly, RNA can be represented by IUPACAmbiguousRNA or
IUPACUnambiguousRNA.
The advantages of having an alphabet class are twofold. First, it gives an idea
of the type of information the Seq object contains. Second, it provides a
means of constraining that information, which acts as a form of type checking.
Now that we know what we are dealing with, let’s look at how to utilize this
class to do interesting work. You can create an ambiguous sequence with the
default generic alphabet like this:
However, where possible you should specify the alphabet explicitly when
creating your sequence objects - in this case an unambiguous DNA alphabet
object:
We can access elements of the sequence in the same way as for strings (remember
that Python counts from zero, and that index -1 refers to the last element).
The Seq object has a .count() method, just like a string. Note that this means
that like a Python string, this gives a non-overlapping count:
For some biological uses, you may actually want an overlapping count (i.e. 3 in
this trivial example). When searching for single letters, this makes no
difference:
While we could use the above snippet of code to calculate a GC%, note that the
Bio.SeqUtils module has several GC functions already built. For example:
Note that using the Bio.SeqUtils.GC() function should automatically cope with
mixed case sequences and the ambiguous nucleotide S which means G or C.
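For readers who want to see what such a calculation involves, here is a plain-Python sketch of a GC percentage that copes with mixed case and counts the ambiguity letter S as G or C, as described above. The function name gc_percent is hypothetical and this is not the Bio.SeqUtils implementation; Python 3 is assumed.

```python
def gc_percent(sequence):
    """Return the percentage of G, C or S (G-or-C) letters, ignoring case."""
    if not sequence:
        return 0.0                       # avoid dividing by zero
    upper = sequence.upper()
    gc = sum(upper.count(letter) for letter in "GCS")
    return 100.0 * gc / len(upper)


# The example sequence used throughout this section: 5 of its 11 letters are G or C.
print(gc_percent("AGTACACTGGT"))
```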
PROJECT CODE:
“A Bio-Python Based Program to Generate Random Protein Sequences, each
sequence being 100 amino acid residues long.”
# biopython
# Python 2 syntax (raw_input, print statement)
import random
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.SeqRecord import SeqRecord
from Bio import SeqIO

# residue pools; each record draws all of its residues from one pool
residueList1 = ["C","D","E","F","G","H","I"]
residueList2 = ["A","K","L","M","N","S"]
residueList3 = ["P","Q","R","T","V","W","Y"]
residueList4 = ["C","A","G","U"]   # RNA bases; not used below

def getProteinSequence(residue):
    # build a string of 100 residues picked at random from the given pool
    strSeq = ""
    for i in range(0,100,1):
        index = random.randint(0, len(residue)-1)
        strSeq += residue[index]
    return strSeq

def getProteinSeqRecord(residue, index):
    # NOTE: this helper is called below but its definition is missing from
    # the listing. This body is a reconstruction; the id and description
    # values are assumptions.
    return SeqRecord(Seq(getProteinSequence(residue), IUPAC.protein),
                     id="seq%d" % index,
                     description="random protein sequence")

def randomProteinSeqRecord(index):
    # choose the residue pool from the record number
    if (index%2)==0:
        return getProteinSeqRecord(residueList1, index)
    elif (index%3)==0:
        return getProteinSeqRecord(residueList2, index)
    else:
        return getProteinSeqRecord(residueList3, index)

#information
filepathProvided = False
#raw_input receives the user input as a string
try:
    filepath = raw_input('Enter filepath to save sequences (X:/filename) ... ')
    filepath = filepath + '.fasta'
    #handle = open(filepath, "w")
    #handle.close()
    filepathProvided = True
except IOError:
    print 'Invalid or No File provided will print results to console'
print

ranSeqCount = 1
try:
    ranSeqCount = int(raw_input('Enter number of random sequences to generate ... '))
except ValueError:
    # fall back to a single sequence if the input is not a number
    ranSeqCount = 1

records = []
for i in range(0,ranSeqCount,1):
    records.append(randomProteinSeqRecord(i+1))

if(filepathProvided):
    SeqIO.write(records, filepath, "fasta")
else:
    print 'Writing to console is not supported. :/'
print
raw_input('Press Enter to exit ...')
print
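The generation logic of the project code can also be sketched without Biopython or user input, which makes it easier to try out. The version below assumes Python 3; the names RESIDUE_POOLS, random_protein and pool_for are hypothetical, while the three residue pools and the rule for choosing a pool from the record number are copied from the listing above.

```python
import random

# Residue pools copied from the project code.
RESIDUE_POOLS = [
    ["C", "D", "E", "F", "G", "H", "I"],
    ["A", "K", "L", "M", "N", "S"],
    ["P", "Q", "R", "T", "V", "W", "Y"],
]


def random_protein(pool, length=100, rng=random):
    """Return a string of `length` residues drawn uniformly from `pool`."""
    return "".join(rng.choice(pool) for _ in range(length))


def pool_for(index):
    """Even record numbers use pool 1, other multiples of 3 pool 2, the rest pool 3."""
    if index % 2 == 0:
        return RESIDUE_POOLS[0]
    if index % 3 == 0:
        return RESIDUE_POOLS[1]
    return RESIDUE_POOLS[2]


# Record numbers start at 1, as in the project code.
rng = random.Random(0)          # fixed seed so that repeated runs give the same result
sequences = [random_protein(pool_for(i), rng=rng) for i in range(1, 7)]
for sequence in sequences:
    print(sequence)
```

Because each record draws from a single pool, every generated sequence contains only six or seven distinct residues, which is why the sample outputs later in this report look so repetitive.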
The Biopython Seq object is defined in the Bio.Seq module (together with related
objects like MutableSeq, plus some general-purpose sequence functions).
>>> from Bio.Seq import Seq
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_seq.alphabet
Alphabet()
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna, generic_protein
>>> my_seq = Seq("AGTACACTGGT")
>>> my_seq
Seq('AGTACACTGGT', Alphabet())
>>> my_dna = Seq("AGTACACTGGT", generic_dna)
>>> my_dna
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_protein = Seq("AGTACACTGGT", generic_protein)
>>> my_protein
Seq('AGTACACTGGT', ProteinAlphabet())
Why is this important? Well, it can catch some errors for you. You wouldn't want
to accidentally combine a DNA sequence with a protein, would you?
>>> my_protein + my_dna
Traceback (most recent call last):
...
TypeError: Incompatable alphabets ProteinAlphabet() and DNAAlphabet()
Biopython will also catch things like trying to use nucleotide only methods like
translation (see below) on a protein sequence.
General methods
The Seq object has a number of methods which act just like those of a Python
string, for example the find method:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("AGTACACTGGT", generic_dna)
>>> my_dna
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_dna.find("ACT")
5
>>> my_dna.find("TAG")
-1
>>> my_dna.count("A")
3
>>> my_dna.count("ACT")
1
However, watch out because, just like the Python string's count, this is a
non-overlapping count!
>>> "AAAA".count("AA")
2
>>> Seq("AAAA", generic_dna).count("AA")
2
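When an overlapping count is what you need, one option is to restart the search one position after each match. The sketch below uses plain Python strings and assumes Python 3; the helper name count_overlapping is hypothetical and is not part of Biopython.

```python
def count_overlapping(sequence, pattern):
    """Count occurrences of `pattern` in `sequence`, allowing overlaps."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Resume one position after the last match so overlaps are found.
        start = sequence.find(pattern, start + 1)
    return count


print("AAAA".count("AA"))                # non-overlapping count
print(count_overlapping("AAAA", "AA"))   # overlapping count
```

For single letters the two counts agree, since a one-letter pattern cannot overlap itself.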
Nucleotide methods
These are very simple - the methods return a new Seq object with the
appropriate sequence and the same alphabet:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> my_dna = Seq("AGTACACTGGT", generic_dna)
>>> my_dna
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_dna.complement()
Seq('TCATGTGACCA', DNAAlphabet())
>>> my_dna.reverse_complement()
Seq('ACCAGTGTACT', DNAAlphabet())
>>> my_dna
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_dna.transcribe()
Seq('AGUACACUGGU', RNAAlphabet())
Naturally, given some RNA, we might want the associated DNA - and again
Biopython does a simple U/T substitution:
>>> my_rna = my_dna.transcribe()
>>> my_rna
Seq('AGUACACUGGU', RNAAlphabet())
>>> my_rna.back_transcribe()
Seq('AGTACACTGGT', DNAAlphabet())
>>> my_rna
Seq('AGUACACUGGU', RNAAlphabet())
>>> my_rna.back_transcribe().reverse_complement()
Seq('ACCAGTGTACT', DNAAlphabet())
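To show what these methods compute, here is a plain-string sketch of the same four operations. It assumes Python 3 and upper-case sequences containing only the unambiguous letters A, C, G and T (or U for RNA); the function names mirror the Seq methods but are ordinary functions written for this illustration, not Biopython code.

```python
# Mapping of each unambiguous DNA base to its complement.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(dna):
    """Swap each base for its pairing partner (A<->T, C<->G)."""
    return dna.translate(DNA_COMPLEMENT)


def reverse_complement(dna):
    """Complement the sequence, then read it in the opposite direction."""
    return complement(dna)[::-1]


def transcribe(dna):
    """Coding-strand DNA to RNA: a simple T -> U substitution."""
    return dna.replace("T", "U")


def back_transcribe(rna):
    """RNA back to coding-strand DNA: a simple U -> T substitution."""
    return rna.replace("U", "T")


my_dna = "AGTACACTGGT"
print(complement(my_dna))
print(reverse_complement(my_dna))
print(transcribe(my_dna))
```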
Translation
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_rna
>>> messenger_rna = Seq("AUGGCCAUUGUAAUGGGCCGCUGAAAGGGUGCCCGAUAG", generic_rna)
>>> messenger_rna.translate()
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> coding_dna = Seq("ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG", generic_dna)
>>> coding_dna.translate()
Seq('MAIVMGR*KGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
In either case there are several useful options. By default, as the example
above shows, translation continues through any stop codons, but this is
optional:
>>> coding_dna.translate(to_stop=True)
Seq('MAIVMGR', ExtendedIUPACProtein())
>>> coding_dna.translate(table=2)
Seq('MAIVMGRWKGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> coding_dna.translate(table="Vertebrate Mitochondrial")
Seq('MAIVMGRWKGAR*', HasStopCodon(ExtendedIUPACProtein(), '*'))
>>> coding_dna.translate(table=2, to_stop=True)
Seq('MAIVMGRWKGAR', ExtendedIUPACProtein())
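The effect of the to_stop option can be illustrated with a small stand-alone translator built on the standard genetic code. This is a sketch assuming Python 3 and upper-case, unambiguous coding-strand DNA; the names CODON_TABLE and translate are defined here for illustration and the function does not reproduce Biopython's handling of alternative tables or ambiguity codes.

```python
from itertools import product

# Standard genetic code, listed in TCAG order (TTT, TTC, TTA, TTG, TCT, ...).
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna, to_stop=False):
    """Translate codon by codon; '*' marks a stop codon."""
    protein = []
    # A trailing partial codon (1 or 2 bases) is ignored.
    for i in range(0, len(coding_dna) - 2, 3):
        amino_acid = CODON_TABLE[coding_dna[i:i + 3]]
        if amino_acid == "*" and to_stop:
            break                    # stop at the first in-frame stop codon
        protein.append(amino_acid)
    return "".join(protein)


coding_dna = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
print(translate(coding_dna))
print(translate(coding_dna, to_stop=True))
```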
Most of the sequence file format parsers in Biopython can return SeqRecord
objects (and may offer a format-specific record object too; see for example
Bio.SwissProt). The SeqIO system will only return SeqRecord objects.
Most of the time we'll create SeqRecord objects by parsing a sequence file with
Bio.SeqIO. However, it is useful to know how to create a SeqRecord directly.
For example,
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC
record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF",
                       IUPAC.protein),
                   id="YP_025292.1", name="HokC",
                   description="toxic membrane protein, small")
print record
ID: YP_025292.1
Name: HokC
Description: toxic membrane protein, small
Number of features: 0
Seq('MKQHKAMIVALIVICITAVVAALVTRKDLCEVHIRTGQTEVAVF', IUPACProtein())
And this is some of the output. Remember that Python likes to count from zero,
so the 94 records in this file have been labelled 0 to 93:
print record
That should give you a hint of the sort of information held in this object:
ID: Z78439.1
Name: Z78439
Description: P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
Number of features: 5
/source=Paphiopedilum barbatum
/taxonomy=['Eukaryota', 'Viridiplantae', 'Streptophyta', 'Embryophyta', ..., 'Paphiopedilum']
/keywords=['5.8S ribosomal RNA', '5.8S rRNA gene', 'internal transcribed spacer', 'ITS1', 'ITS2']
/references=[<Bio.SeqFeature.Reference ...>, <Bio.SeqFeature.Reference ...>]
/data_file_division=PLN
/date=30-NOV-2006
/organism=Paphiopedilum barbatum
/gi=2765564
Seq('CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC ...', IUPACAmbiguousDNA())
Most of the values in the dictionary are simple strings, but this isn't always the
case - have a look at the references entry for this example - it's a list of
Reference objects:
>>> print record.annotations["references"].__class__
<type 'list'>
>>> print len(record.annotations["references"])
2
>>> for ref in record.annotations["references"] : print ref.authors
...
Cox,A.V., Pridgeon,A.M., Albert,V.A. and Chase,M.W.
Cox,A.V.
SeqFeature objects are complicated enough to warrant their own wiki page...
for now please refer to the Tutorial.
We can convert the SeqRecord into a string using one of the output formats
supported by Bio.SeqIO, for example:
>>> print record.format("fasta")
>Z78439.1 P.barbatum 5.8S rRNA gene and ITS1 and ITS2 DNA.
CATTGTTGAGATCACATAATAATTGATCGAGTTAATCTGGAGGATCTGTTTACTTTGGTC
ACCCATGGGCATTTGCTGTTGAAGTGACCTAGATTTGCCATCGAGCCTCCTTGGGAGCTT
TCTTGTTGGCGAGATCTAAACCCCTGCCCGGCGGAGTTGGGCGCCAAGTCATATGACACA
TAATTGGTGAAGGGGGTGGTAATCCTGCCCTGACCCTCCCCAAATTATTTTTTTAACAAC
TCTCAGCAACGGATATCTCGGCTCTTGCATCGATGAAGAACGCAGCGAAATGCGATAATG
GTGTGAATTGCAGAATCCCGTGAACATCGAGTCTTTGAACGCAAGTTGCGCCCGAGGCCA
TCAGGCCAAGGGCACGCCTGCCTGGGCATTGCGAGTCATATCTCTCCCTTAATGAGGCTG
TCCATACATACTGTTCAGCCGGTGCGGATGTGAGTTTGGCCCCTTGTTCTTTGGTACGGG
GGGTCTAAGAGCTGCATGGGCTTTGGATGGTCCTAAATACGGAAAGAGGTGGACGAACTA
TGCTACAACAAAATTGTTGTGCAAATGCCCCGGTTGGCCGTTTAGTTGGGCC
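The layout of the output above (a “>” header carrying the identifier and description, then the sequence wrapped into fixed-width lines) can be sketched in plain Python. The function name format_fasta is hypothetical, the 60-character line width is an assumption based on how the output above appears to be wrapped, and Python 3 is assumed.

```python
def format_fasta(identifier, description, sequence, width=60):
    """Return one FASTA record as text, wrapping the sequence at `width` letters."""
    header = ">" + identifier
    if description:
        header += " " + description
    # Cut the sequence into consecutive chunks of at most `width` letters.
    body = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([header] + body) + "\n"


# A made-up 100-residue sequence: it wraps into one line of 60 and one of 40.
text = format_fasta("example_1", "hypothetical record", "MK" * 50)
print(text)
```

A 100-residue sequence therefore occupies two lines, which matches the two-line blocks shown in the sample outputs of this project.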
The SeqIO Library
Bio.SeqIO provides a simple uniform interface to input and output assorted
sequence file formats (including multiple sequence alignments), but will only
deal with sequences as SeqRecord objects. There is a sister interface
Bio.AlignIO for working directly with sequence alignment files as Alignment
objects.
With Bio.SeqIO we can treat sequence alignment file formats just like any other
sequence file, but the new Bio.AlignIO module is designed to work with such
alignment files directly. You can also convert a set of SeqRecord objects from
any file format into an alignment - provided they are all the same length. Note
that when using Bio.SeqIO to write sequences to an alignment file format, all
the (gapped) sequences should be the same length.
Sequence Input
The main function is Bio.SeqIO.parse() which takes a file handle and format
name, and returns a SeqRecord iterator. This lets you do things like:
In the above example, we opened the file using the built-in Python function
open. The argument 'rU' means open for reading using universal newline mode
- this means we don't have to worry if the file uses Unix, Mac or DOS/Windows
style newline characters.
Iterators are great for when we only need the records one by one, in the order
found in the file. For some tasks we may need to have random access to the
records in any order. In this situation, use the built-in Python list function to
turn the iterator into a list:
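The difference between the two approaches can be seen with any Python iterator. The sketch below uses a made-up generator, numbered_records, as a stand-in for a record iterator; it does not call Bio.SeqIO, and Python 3 is assumed.

```python
def numbered_records(count):
    """Yield made-up record names one at a time (stand-in for a record iterator)."""
    for i in range(count):
        yield "record_%d" % i


iterator = numbered_records(3)
first_pass = [name for name in iterator]    # consumes the iterator
second_pass = [name for name in iterator]   # empty: nothing is left to read

records = list(numbered_records(3))   # a list holds every record in memory
last = records[-1]                    # so records can be accessed in any order
print(first_pass, second_pass, last)
```

The list costs memory in proportion to the number of records, so the iterator remains the better choice for a single pass over a large file.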
Sequence Output
For writing records to a file use the function Bio.SeqIO.write(), which takes a
SeqRecord iterator (or list), output handle and format string:
There are more examples in the following section on converting between file
formats.
Note that if we are writing to an alignment file format, all our sequences must
be the same length.
On the other hand, for interlaced or non-sequential file formats like Clustal, the
Bio.SeqIO.write() function will be forced to automatically convert an iterator
into a list. This will destroy any potential memory saving from using a
generator/iterator approach.
This script will read a GenBank file with a whole mitochondrial genome (e.g. the
tobacco mitochondrion, Nicotiana tabacum mitochondrion NC_006581), create
500 records containing random fragments of this genome, and save them as a
FASTA file. These subsequences are created using random starting points and a
fixed length of 200.
>fragment_1
TGGGCCTCATATTTATCCTATATACCATGTTCGTATGGTGGCGCGATGTTCTACGTGAAT
CCACGTTCGAAGGACATCATACCAAAGTCGTACAATTAGGACCTCGATATGGTTTTATTC
TGTTTATCGTATCGGAGGTTATGTTCTTTTTTGCTCTTTTTCGGGCTTCTTCTCATTCTT
CTTTGGCACCTACGGTAGAG
...
>fragment_500
ACCCAGTGCCGCTACCCACTTCTACTAAGGCTGAGCTTAATAGGAGCAAGAGACTTGGAG
GCAACAACCAGAATGAAATATTATTTAATCGTGGAAATGCCATGTCAGGCGCACCTATCA
GAATCGGAACAGACCAATTACCAGATCCACCTATCATCGCCGGCATAACCATAAAAAAGA
TCATTAAAAAAGCGTGAGCC
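The fragment-sampling step described above can be sketched in plain Python. This is an illustration under stated assumptions, not the script itself: it assumes Python 3, it uses a randomly generated stand-in string instead of reading the GenBank file, and the function name random_fragments is hypothetical. The count of 500, the length of 200 and the fragment_N naming follow the description and output above.

```python
import random


def random_fragments(genome, count=500, length=200, rng=random):
    """Return (name, subsequence) pairs cut from random start positions."""
    if len(genome) < length:
        raise ValueError("genome is shorter than the fragment length")
    fragments = []
    for i in range(1, count + 1):
        # randint is inclusive at both ends, so the last valid start is allowed.
        start = rng.randint(0, len(genome) - length)
        fragments.append(("fragment_%d" % i, genome[start:start + length]))
    return fragments


# Stand-in genome of 5000 random bases (the real script would read a GenBank file).
rng = random.Random(42)
genome = "".join(rng.choice("ACGT") for _ in range(5000))

fragments = random_fragments(genome, rng=rng)
print(fragments[0][0], len(fragments[0][1]))
```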
Writing to a string
Sample Outputs
1. Screenshot showing the Project-621033475.py in the IDLE USER Interface
D:/Sample
[which creates a file named Sample.fasta in D:/ (D-Drive) ]
If No Path and file name is provided
[A file with no name and the .fasta extension, containing the sequences, is
created in the folder where Project-621033475.py is located]
5. Screenshot showing the confirmation that the .fasta file is created at the
specified location with the specified number of sequences.
PQWQVYYPTVQQPYRPYWRYQRQYPPWTRWPQVTYYTQWPTPWPYPPYWQQRWVPVPYWV
PRQWYTTTTQWQQTVVQRTPWYTPRYTTQQRRWRWQQTPR
IDGECEEGHGFHFDFHGGIHDFFCDFGCGEHGIIFGGFHDGGHIIDHFFCHEGIGHGFID
EHEEHGHEIGHDCGEFCFHHHEGEFEDFIFHGCFDDEIHG
LLLLNMMNKLSLMASSLALSSMMSLKMSKANMAMASLLAKKLLKANLSNKNLLKNLKLSS
KSLSLLMANLSAASKNMMLKNLLKAAKLLAKNMMSLASKN
DIDGFFDGHHCICCECEHFDHCDIHGGDDCIDFIFIFHDGGGDFDCEDHCECFHICIIHG
FFEDCHCGCCDCFIDIIIFHHEDFDIEGFCCIEGHHHFGD
VRTTQPQQVWRTYQTWWVPWWYPQYQRYVQQVTWRPPRPQWQVVWQRWTVTPTVPPYPVR
RRPPVRWVQRWVVWTQWPYYTPWVRYTRTVVTPWYQYVVQ
CDDGGCFCFHIICCEEIIHHIGEIIHHHFIDEDFGIGDEGCHIIDGCFHIIGGGGIHHFI
FDGFGDFHFEHEGDGDHFFEFGIGIFGFIGDECDCIICED
VVTYYYPQYRPWTYTTRQWTPRPPTYWQQVYTQRVTWTVPPQTWVTRRYQPTTWTPYTQT
YVWWQQWWRQTRTYWYVQWYVTPTWTQPQTTVQQQTTVWW
FFCEDFDHCGCGGHFHGDGHHEFIEFIGCHHGCFCGECGEGECDFIDCCFEEHEIDCIIH
GCHHIEIDIEICFIGCHHIICGFHCGHFFHDEEFIFDICF
LSMLNNSLMSSNKLNKKLSLSAKALAAMASLAALKKNSMSSMNSKAKAKSAAKKAASALA
KASKLNLSMMNNLMASMNLSNLMNALNSMMKLANNNNMSL
ICFHDGFIIHECIHHCGFHCDFDEEHHGIFDDDICFGFIEGDHECGFEFCGIHGFFEFID
FCCGIECCEEGGGIGCCGGHFIHIDCDFGHGGECIDGDIH
TRWVVQQVWWRPYPRYWPYRPVVQTYQTTWWPTRWRTRQRYYQQQPWTYTPRTQYYPRQQ
WRVVTPQQYTRQQPVRQWWVRWPWVTQQYVWVWYPQPRQQ
DGFFFCGGIIECDDFIGIGECHGICGGEHCICGIDIIGEGECEHIGGCFEDFEFECEDCH
CGDIGIIDEHIGIEEDICHEIHDDEHGHEIEFGGGHGDGE
******************************************
1. Conclusion
2. Recommendations for improving this project
GLOSSARY
Accession number: An identifier supplied by the curators of the major biological
databases upon submission of a novel entry that uniquely identifies that sequence (or
other) entry.
Amino acid: One of the 20 chemical building blocks that are joined by amide
(peptide) linkages to form a polypeptide chain of a protein
Assembly: Compilation of overlapping sequences from one or more related genes that
have been clustered together based on their degree of sequence identity or similarity.
Sequence assembly may be used to piece together "shotgun" sequencing fragments
(see shotgun sequencing) based upon overlapping restriction enzyme digests, or may
be used to identify and index novel genes from "single-pass" cDNA sequencing efforts.
Base pair: A pair of nitrogenous bases (a purine and a pyrimidine), held together by
hydrogen bonds, that form the core of DNA and RNA, i.e. the A:T, G:C and A:U
interactions.
Bioinformatics:
1.The field of endeavor that relates to the collection, organization and analysis of large
amounts of biological data using networks of computers and databases (usually with
reference to the genome project and DNA sequence information).
2. Bioinformatics, sometimes, is used interchangeably with the term Computational
Biology. Precisely, Computational Biology is defined as the systematic development
and application of computing systems and computational solution techniques to
models of biological phenomena; Bioinformatics is defined as the systematic
development and application of computing systems and
computational solution techniques analyzing data obtained by experiments, modeling,
database search, and instrumentation regarding biological aspect.
Database Any file system by which data gets stored following a logical process. (see
also relational database)
DNA (deoxyribonucleic acid) The chemical that forms the basis of the genetic
material in virtually all organisms. DNA is composed of the four nitrogenous bases
Adenine, Cytosine, Guanine, and Thymine, which are covalently bonded to a
backbone of deoxyribose-phosphate to form a DNA strand. Two complementary
strands (where all Gs pair with Cs and As with Ts) form a double helical structure
which is held together by hydrogen bonding between the cognate bases.
DNA polymerase An enzyme that catalyzes the synthesis of DNA from a DNA
template given the deoxyribonucleotide precursors.
TYAPPREGHLECTSTVTGMTVELNYIPKNRTNVTLSPQIESIWAAELDRYKLVEITPIGF
APTEVRRYTGGHERQKRVPFVXXXXXXXXXXXXXXXXXXXXXXVQSQHLLAGILQQQKN
L LAAVEAQQQMLKLTIWGVK
Sequences are expected to be represented in the standard IUB/IUPAC amino acid and
nucleic acid codes with these exceptions: lower-case letters are accepted and are
mapped into upper-case; a single hyphen or dash can be used to represent a gap of
indeterminate length; and in amino acid sequences, U and * are acceptable letters
Gene expression The conversion of information from gene to protein via transcription
and translation.
Genetic code The mapping of all possible codons into the 20 amino acids including
the start and stop codons.
Guanine (G) One of the nitrogenous purine bases found in DNA and RNA
Introns Nucleotide sequences found in the structural genes of eukaryotes that are
non-coding and interrupt the sequences containing information that codes for
polypeptide chains. Intron sequences are spliced out of their RNA transcripts before
maturation and protein synthesis. (cf. Exons)
Messenger RNA (mRNA) The complementary RNA copy of DNA formed from a single-
stranded DNA template during transcription that migrates from the nucleus to the
cytoplasm where it is processed into a sequence carrying the information to code for a
polypeptide domain.
Nuclease Any enzyme that can cleave the phosphodiester bonds of nucleic acid
backbones.
Nucleotide A nucleic acid unit composed of a five carbon sugar joined to a phosphate
group and a nitrogen base.
Peptide A short stretch of amino acids each covalently coupled by a peptide (amide)
bond.
Peptide bond (amide bond) A covalent bond formed between two amino acids when
the amino group of one is linked to the carboxy group of another (resulting in the
elimination of one water molecule).
Poly(A) tail The stretch of Adenine (A) residues at the 3’ end of eukaryotic mRNA that
is added to the pre-mRNA as it is processed, before its transport from the nucleus to
the cytoplasm and subsequent translation at the ribosome.
Polyadenylation site A site on the 3’-end of messenger RNA (mRNA) that signals the
addition of a series of Adenines during the RNA processing step and before the mRNA
migrates to the cytoplasm. These so-called poly(A) "tails" increase mRNA stability
and allow one to isolate mRNA from cells by PCR-amplification using poly(T) primers.
Ribonucleic acid (RNA) A category of nucleic acids in which the component sugar is
ribose and consisting of the four nucleotides Cytosine, Uracil, Guanine, and
Adenine. The three types of RNA are messenger RNA (mRNA), transfer RNA (tRNA) and
ribosomal RNA (rRNA).
Splice site The sequence found at the 5’ and 3’ region of exon/intron boundaries,
usually defined by a consensus sequence:
Intron
5’ CAGGTAAGT---------TNCAGG 3’
A G C T
N represents any nucleotide; the bottom line represents alternative nucleotides at the
indicated positions.
Splicing The joining together of separate DNA or RNA component parts. For example,
RNA splicing in eukaryotes involves the removal of introns and the stitching together
of the exons from the pre-mRNA transcript before maturation.
Transcript The single-stranded mRNA chain that is assembled from a gene template.
Transfer RNA (tRNA) A small RNA molecule that recognizes a specific amino acid,
transports it to a specific codon in the mRNA, and positions it properly in the nascent
polypeptide chain.