2019/11/11
Learning Objectives
• Biopython as a toolkit
• Seq objects and their methods
• SeqRecord objects have data fields
• SeqIO to read and write sequence
objects
• Direct access to GenBank with
Entrez.efetch
• BLAST & Multiple sequence
alignment
Modules
• Python functionality is divided into 3 sets
1. A small core set that is always available
2. Built-in modules, such as math, that can be imported from the
basic install (e.g. >>> import math)
3. An extremely large number of optional modules that must be
downloaded and installed before you can import them
• Code that uses such optional modules is said to have “dependencies”
• Biopython belongs to the third set, so code that uses it has a dependency
• The code for dependencies is hosted in different places, such
as SourceForge, GitHub, and developers’ own websites (Perl
and R are better organized)
• Trouble? Ask the TAs: each person’s problem is usually
unique, so there is no general solution
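The three sets of modules can be illustrated with a short sketch. It assumes nothing about your setup: the third step only reports whether the optional Biopython package (imported as Bio) is present, and the variable names are chosen for this example.

```python
# 1. Core built-ins are always available; no import is needed.
total = sum([1, 2, 3])

# 2. Built-in modules ship with the basic install but must be imported.
import math
root = math.sqrt(16)

# 3. Optional modules must be downloaded and installed first.
#    If Biopython is missing, the import raises ImportError.
try:
    import Bio
    have_biopython = True
except ImportError:
    have_biopython = False

print(total, root, have_biopython)
```

Wrapping the import in try/except is only for demonstration; a script that truly depends on Biopython would normally import it at the top and let the error surface.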
Install Biopython
• Website for installation instructions:
• http://biopython.org/wiki/Download
• Required Software
• Python (version above 2.6)
• NumPy (Numerical Python)
• Optional Software
• ReportLab – used for PDF graphics code
• psycopg – used for BioSQL with a PostgreSQL database
• mysql-connector – used for BioSQL with a MySQL database
• MySQLdb – An alternative MySQL library used by BioSQL
Information Source: http://biopython.org/wiki/Download
Is your Biopython installed
correctly?
• Type the following on your terminal or interpreter:
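One common way to check an installation, offered here as a sketch rather than as the slide's original commands, is to import the package and print its version. The helper name biopython_version is made up for this example; Bio is the package name that Biopython installs.

```python
import importlib


def biopython_version():
    """Return Biopython's version string, or None if it cannot be imported."""
    try:
        bio = importlib.import_module("Bio")
    except ImportError:
        return None
    # Fall back to "unknown" if the package does not expose a version string.
    return getattr(bio, "__version__", "unknown")


version = biopython_version()
if version is None:
    print("Biopython is not installed (import Bio failed)")
else:
    print("Biopython version:", version)
```

In an interactive session the same check is simply import Bio followed by print(Bio.__version__); an ImportError means the install did not succeed for the Python you are running.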
by: Jeff Chang, Brad Chapman, Iddo Friedberg, Thomas Hamelryck, Michiel de
Hoon, Peter Cock, Tiago Antao, Eric Talevich, Bartek Wilczyński
Most of the examples in this class are drawn from the above link
Python is an Object-Oriented language
• Programs are composed of data structures (objects) whose types are known as classes
• Objects can contain complex and well-defined forms of data, and
they can also have built-in methods
• Complex objects are built from other objects
• E.g. strings, lists, and other data types each have their own methods
• Many classes of objects support the same operation, so it can be
used without checking the object's class first
• E.g. the “print” function
• Declaring that an object belongs to a class means that it
inherits all the properties of that class
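These points can be sketched in plain Python. The classes Sequence and DnaSequence below are hypothetical, invented for this illustration only; they are not part of Biopython.

```python
# Built-in types are classes, and their objects carry methods.
text = "gattaca"
bases = ["G", "A", "T"]
upper_text = text.upper()      # a str method
bases.append("C")              # a list method


# Hypothetical classes, invented for this illustration only.
class Sequence:
    def __init__(self, letters):
        self.letters = letters

    def gc_count(self):
        """Count the G and C letters in the sequence."""
        return self.letters.count("G") + self.letters.count("C")

    def __str__(self):
        # print() calls this to decide what to display.
        return self.letters


class DnaSequence(Sequence):
    """Inherits the data and methods of Sequence unchanged."""


dna = DnaSequence("GATTACA")
print(upper_text, bases, dna)   # print() accepts objects of any class
print(dna.gc_count())           # method inherited from Sequence
```

Biopython's own classes follow the same pattern: a Seq object stores the sequence data and brings its methods along with it.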
The Seq object
• The Seq object class is simple and fundamental for a lot of Biopython work.
A Seq object can contain DNA, RNA, or protein.
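As a minimal sketch, assuming Biopython is installed, one Seq class can hold DNA, RNA, or protein letters. The example sequence and variable names are chosen for illustration, and the Seq is created without an alphabet argument, which is optional.

```python
from Bio.Seq import Seq

# A short coding DNA sequence: 8 codons, ending in a stop codon.
dna = Seq("ATGGCCATTGTAATGGGCCGCTGA")

# transcribe() returns a new Seq holding the RNA (T replaced by U).
rna = dna.transcribe()

# translate() returns a new Seq holding the protein;
# to_stop=True ends translation at the first stop codon.
protein = dna.translate(to_stop=True)

print(dna)
print(rna)
print(protein)
```

Each method returns a new Seq object and leaves the original unchanged.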
Turn a Seq object into a string
• Sometimes you will need to work with just the sequence
string in a Seq object using a tool that is not aware of the
Seq object methods
• Turn a Seq object into a string with str()
• You will lose the alphabet and just get back the string.
• You can input it into other programs and work with it
>>> my_seq
Seq('GATCGATGGGCCTATATAGGATCGAAAATCGC',
IUPACUnambiguousDNA())
>>> seq_string = str(my_seq)
>>> seq_string
'GATCGATGGGCCTATATAGGATCGAAAATCGC'
Seq objects have special methods:
MutableSeq
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA",
IUPAC.unambiguous_dna)
>>> my_seq[5] = "G"
TypeError: 'Seq' object does not support item assignment
>>> my_seq.reverse()
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
AttributeError: 'Seq' object has no attribute 'reverse'
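• Convert a Seq object with tomutable(), or create a MutableSeq directly from a string:
>>> mutable_seq = my_seq.tomutable()
>>> from Bio.Seq import MutableSeq
>>> mutable_seq = MutableSeq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA", IUPAC.unambiguous_dna)
• Either way will give you a sequence object which can be changed:
>>> mutable_seq[5] = "G"
>>> mutable_seq
MutableSeq('GCCATGGTAATGGGCCGCTGAAAGGGTGCCCGA',
IUPACUnambiguousDNA())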
• Note:
• You can’t use a MutableSeq object as a dictionary key.
• You can use a Python string or a Seq object in this way.
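The difference between Seq and MutableSeq can be sketched end to end as follows, assuming Biopython is installed. Because the way to obtain a MutableSeq has differed between Biopython releases, this sketch builds it from a plain string without an alphabet argument; treat that choice as an assumption of the example rather than the recommended route.

```python
from Bio.Seq import Seq, MutableSeq

my_seq = Seq("GCCATTGTAATGGGCCGCTGAAAGGGTGCCCGA")

# Build a mutable copy from the plain string (assumed portable across releases).
mutable_seq = MutableSeq(str(my_seq))

# Item assignment works on a MutableSeq, unlike on a Seq.
mutable_seq[5] = "G"
after_edit = str(mutable_seq)

# reverse() changes the MutableSeq in place and returns nothing.
mutable_seq.reverse()
after_reverse = str(mutable_seq)

# A Seq object (or a plain string) can serve as a dictionary key.
lookup = {my_seq: "example record"}

print(after_edit)
print(after_reverse)
print(lookup[my_seq])
```

The original my_seq is untouched by the edits, which is why it remains safe to use as a dictionary key.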
Seq objects have special methods:
Changing case
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import generic_dna
>>> dna_seq = Seq("acgtACGT", generic_dna)
>>> dna_seq
Seq('acgtACGT', DNAAlphabet())
>>> dna_seq.upper()  # a dot method acts on the object preceding the dot
Seq('ACGTACGT', DNAAlphabet())
>>> dna_seq.lower()
Seq('acgtacgt', DNAAlphabet())
• Strictly speaking, the IUPAC alphabets are for upper case sequences only,
so lower() returns a Seq with the generic DNA alphabet instead:
>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> dna_seq = Seq("ACGT", IUPAC.unambiguous_dna)
>>> dna_seq
Seq('ACGT', IUPACUnambiguousDNA())
>>> dna_seq.lower()
Seq('acgt', DNAAlphabet())