Introduction to Biopython

2019/11/11
Learning Objectives

• Biopython as a toolkit
• Seq objects and their methods
• SeqRecord objects have data fields
• SeqIO to read and write sequence objects
• Direct access to GenBank with Entrez.efetch
• BLAST & multiple sequence alignment
Modules
• Python functionality comes in 3 sets
1. A small core set that is always available
2. Built-in modules, such as math, that ship with the basic install and only need to be imported (e.g. >>> import math)
3. An extremely large number of optional modules that must be downloaded and installed before you can import them
• Code that uses optional modules is said to have “dependencies”
• Biopython belongs to the third set, so code that uses it has a dependency
• The code for dependencies is hosted in different places, such as SourceForge, GitHub, and developers’ own websites (Perl and R are better organized)
• Trouble? Ask the TAs: each person’s problem is mostly unique, and there is no general solution
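The three sets can be told apart from inside the interpreter. The following is a minimal sketch that uses only the standard library. The helper name is_installed is hypothetical, and "Bio" is the import name of the Biopython package.

```python
import importlib.util
import math  # set 2: ships with Python, but must be imported before use

# Set 1: core functions such as len() need no import at all.
print(len("GATTACA"))   # 7

# Set 2: available after a plain import.
print(math.sqrt(16))    # 4.0


def is_installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


# Set 3: optional modules are importable only once they have been installed.
for name in ("math", "Bio"):
    status = "available" if is_installed(name) else "missing: install it first"
    print(name, "->", status)
```

If "Bio" is reported as missing, the dependency has not been installed for the interpreter that ran the check.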
Install Biopython
• Website for installation instructions:
• http://biopython.org/wiki/Download
• Required Software
• Python (version above 2.6)
• NumPy (Numerical Python)

• Optional Software
• ReportLab – used for PDF graphics code
• psycopg – used for BioSQL with a PostgreSQL database
• mysql-connector – used for BioSQL with a MySQL database
• MySQLdb – An alternative MySQL library used by BioSQL

Information Source: http://biopython.org/wiki/Download
Is your Biopython installed correctly?
• Type the following in your Python interpreter:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet.IUPAC import unambiguous_dna
>>> new_seq = Seq('GATCAGAAG', unambiguous_dna)
>>> new_seq[0:2]
Seq('GA', IUPACUnambiguousDNA())
>>> new_seq.translate()
Seq('DQK', IUPACProtein())
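The check above relies on the Bio.Alphabet module, which was removed in Biopython 1.78, so it fails with an ImportError on newer installations even when Biopython is installed correctly. A version-tolerant sketch follows. It assumes only that Seq can be built from a plain string, and the variable names are illustrative.

```python
try:
    import Bio
    from Bio.Seq import Seq
except ImportError:
    installed = False
    print("Biopython is not installed (or not on this interpreter's path)")
else:
    installed = True
    print("Biopython version:", Bio.__version__)
    # No alphabet argument: this form works with or without Bio.Alphabet.
    new_seq = Seq("GATCAGAAG")
    protein = new_seq.translate()
    print(protein)  # DQK
```

Seeing a version number followed by DQK means both the import and the translation machinery work.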
• Biopython is an integrated collection of modules for “biological computation”, including tools for working with DNA/protein sequences, sequence alignments, population genetics, and molecular structures
• It also provides interfaces to common biological databases (e.g. GenBank) and to some common locally installed software (e.g. BLAST)
• Loosely based on BioPerl
• Relatively few protein-specific functions in Biopython
Biopython Tutorial
• Biopython has a “Tutorial & Cookbook”:
http://biopython.org/DIST/docs/tutorial/Tutorial.html

by: Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck, Michiel de
Hoon, Peter Cock, Tiago Antao, Eric Talevich, Bartek Wilczyński

Most of the examples in this class are drawn from the above link
Python is an Object-Oriented language
• Programs are composed of data structures called objects, which are defined by classes
• Objects can contain complex and well-defined forms of data, and they can also have built-in methods
• Complex objects are built from other objects
• E.g. strings, lists, and other data types each have certain methods
• Many classes of objects support the same operation, which can be used without knowing the class of the object
• E.g. print() works on many kinds of objects
• Declaring that an object belongs to a class gives it all the properties of that class (inheritance)
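The points above can be sketched with a toy class. The names Sequence and DnaSequence are hypothetical and are not Biopython classes. They only illustrate data plus methods, objects built from other objects, inheritance, and print() working on a user-defined object.

```python
class Sequence:
    """A toy class: data (a name and a string of letters) plus methods."""

    def __init__(self, name, letters):
        # A complex object built from other objects (two strings).
        self.name = name
        self.letters = letters

    def __len__(self):
        return len(self.letters)

    def __str__(self):
        # print() calls this, so print works on any class that defines it.
        return self.letters


class DnaSequence(Sequence):
    """Inherits everything from Sequence and adds one DNA-specific method."""

    def gc_count(self):
        return self.letters.count("G") + self.letters.count("C")


gene = DnaSequence("toy_gene", "GATCAGAAG")
print(gene)             # GATCAGAAG  (inherited __str__)
print(len(gene))        # 9          (inherited __len__)
print(gene.gc_count())  # 4          (method added by the subclass)
```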
The Seq object

• The Seq object class is simple and fundamental to a lot of Biopython work. A Seq object can contain DNA, RNA, or protein. It has two parts:

1. Data: the actual sequence string.
2. Alphabet: an object describing what the individual characters making up the string “mean” and how they should be interpreted.
The Seq object: {Data, Alphabet}
• It is a complex object with a string sub-object (the sequence)
• Supports many of the methods of the Python string object
• Also defines an alphabet for that string
• This constrains the allowed properties of the string object
• The alphabets are actually Biopython objects, such as IUPACAmbiguousDNA or IUPACProtein (IUPAC: International Union of Pure and Applied Chemistry), which are defined in the Bio.Alphabet module
• A Seq object with a DNA alphabet is different from one with an amino acid alphabet
Calling Biopython’s Seq class creates a Seq object:

>>> from Bio.Seq import Seq  # import the Seq class from the library
>>> from Bio.Alphabet import IUPAC  # import the alphabets from their module
>>> my_seq = Seq('AGTACACTGGT', IUPAC.unambiguous_dna)  # call that creates the Seq object; the minimum argument is the data
>>> my_seq
Seq('AGTACACTGGT', IUPACUnambiguousDNA())
>>> print(my_seq)  # e.g. of print working on different objects, here the Seq object
AGTACACTGGT

You can create a sequence with the default generic alphabet by leaving out the alphabet argument:

>>> my_seq = Seq('MRTAVACTKGT')
>>> my_seq
Seq('MRTAVACTKGT', Alphabet())
Seq objects have string methods
• Seq objects have methods that work just like those of string objects
• You can get the len() of a Seq, slice it, and count() specific letters in it
• Get single characters and the sequence length:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCG", IUPAC.unambiguous_dna)
>>> for index, letter in enumerate(my_seq):
...     print("%i %s" % (index, letter))
...
0 G
1 A
2 T
3 C
4 G
>>> print(len(my_seq))
5
>>> print(my_seq[0])  # first letter
G
>>> print(my_seq[2])  # third letter
T
Seq objects have string methods
• A Seq object has a count() method, just like a string. As with a Python string, it gives a non-overlapping count:
>>> from Bio.Seq import Seq
>>> "ATGCATAT".count("AT")
3
>>> Seq("AAAAA").count("AA")
2
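To see what “non-overlapping” means, compare count() with a count that allows overlaps. This is a plain-Python sketch on ordinary strings; the helper name count_overlapping is hypothetical.

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing overlaps."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Move on by one position, not len(pattern), so overlaps are found.
        start = text.find(pattern, start + 1)
    return count


print("AAAAA".count("AA"))               # 2: matches at 0 and 2 only
print(count_overlapping("AAAAA", "AA"))  # 4: matches at 0, 1, 2 and 3
print("ATGCATAT".count("AT"))            # 3: no overlaps are possible here
```

After each match, count() resumes searching past the end of that match, which is why "AAAAA" contains only two non-overlapping copies of "AA".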
E.g. Determining GC content

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> len(my_seq)
32
>>> my_seq.count("G")
9
>>> 100 * float(my_seq.count("G") + my_seq.count("C")) / len(my_seq)
46.875
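The calculation above can be wrapped in a reusable function. This is a minimal plain-Python sketch, and the function name gc_percent is hypothetical. It counts only the literal letters G and C, so IUPAC ambiguity codes such as S (G or C) are ignored.

```python
def gc_percent(sequence):
    """Return the percentage of G and C letters in a sequence."""
    letters = str(sequence).upper()  # str() also accepts a Seq object
    if not letters:
        return 0.0  # avoid dividing by zero on an empty sequence
    return 100 * (letters.count("G") + letters.count("C")) / len(letters)


print(gc_percent("GATCGATGGGCCTATATAGGATCGAAAATCGC"))  # 46.875
print(gc_percent("gatc"))                              # 50.0
print(gc_percent(""))                                  # 0.0
```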
Seq objects have string methods: slice
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC", IUPAC.unambiguous_dna)
>>> my_seq[4:12]
Seq('GATGGGCC', IUPACUnambiguousDNA())
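Slices are half-open: the start index is included and the stop index is not, which is why [4:12] returns 8 letters. The sketch below shows the same convention on a plain Python string and uses it to cut a sequence into codons. Reading from index 0 is an assumption made for illustration.

```python
dna = "GATCGATGGGCCTATATAGGATCGAAAATCGC"

# Half-open slice: starts at index 4 and stops before index 12.
window = dna[4:12]
print(window, len(window))  # GATGGGCC 8

# Step through the sequence three letters at a time; a trailing partial
# codon (here the last two letters) is dropped.
usable = len(dna) - len(dna) % 3
codons = [dna[i:i + 3] for i in range(0, usable, 3)]
print(codons[:4])   # ['GAT', 'CGA', 'TGG', 'GCC']
print(len(codons))  # 10
```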

Turn a Seq object into a string
• Sometimes you will need to work with just the sequence
string in a Seq object using a tool that is not aware of the
Seq object methods
• Turn a Seq object into a string with str()
• You will lose the alphabet and just get back the string.
• You can input it into other programs and work with it

>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC', IUPACUnambiguousDNA())
>>> seq_string = str(my_seq)
>>> seq_string
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
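One tool that expects plain strings is Python’s re module. The sketch below assumes Biopython is installed. It builds the Seq without an alphabet argument so that it also runs on releases that no longer include Bio.Alphabet, and the motif GATC is chosen only for illustration.

```python
import re

from Bio.Seq import Seq

my_seq = Seq("GATCGATGGGCCTATATAGGATCGAAAATCGC")

# re expects a plain string, so convert the Seq first.
seq_string = str(my_seq)

# Start positions of every GATC in the sequence.
positions = [match.start() for match in re.finditer("GATC", seq_string)]
print(positions)  # [0, 19]
```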
Seq objects have special methods: MutableSeq
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
>>> my_seq[5] = "G"
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'Seq' object does not support item assignment
>>> my_seq.reverse()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'Seq' object has no attribute 'reverse'

• A Seq object is “read only” (immutable); to edit the sequence, convert it to a mutable one:

>>> mutable_seq = my_seq.tomutable()  # a dot method acts on the object preceding the dot
>>> mutable_seq
MutableSeq('GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())
Seq objects have special methods: MutableSeq
• Alternatively, you can create a MutableSeq object directly from a string:
>>> from Bio.Seq import MutableSeq
>>> from Bio.Alphabet import IUPAC
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)

• Either way will give you a sequence object which can be changed:
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA', IUPACUnambiguousDNA())

• Note:
• You can’t use a MutableSeq object as a dictionary key.
• You can use a Python string or a Seq object in this way.
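Dictionary keys must not change after they are stored, which is why an editable sequence cannot be one. A common workaround is to finish the edits first and then freeze the result as a string. The sketch below assumes Biopython is installed and omits the alphabet argument so that it also runs on releases without Bio.Alphabet. The dictionary name and its value are illustrative.

```python
from Bio.Seq import MutableSeq

mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")
mutable_seq[5] = "G"  # in-place edit, allowed on a MutableSeq

# Freeze the edited sequence as a plain string to use it as a dictionary key.
key = str(mutable_seq)
annotations = {key: "edited at index 5"}
print(annotations[key])  # edited at index 5
print(key[:8])           # GCCATGGT
```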
Seq objects have special methods: Changing case
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()  # a dot method acts on the object preceding the dot
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())

• Strictly speaking, the IUPAC alphabets are for upper case sequences only, so lower() falls back to the generic DNA alphabet:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())
• Note: You can also use MutableSeq to change case